Saturday, 24 July 2021

Intel Hardware P-State (HWP) / Intel Speed Shift

Intel Hardware P-State (HWP), also known as Hardware Controlled Performance or "Speed Shift", is a feature found in more modern x86 Intel CPUs (Skylake onwards). It attempts to select the CPU frequency and voltage that give the best power efficiency for the desired CPU performance.  HWP is more responsive than the older operating system controlled methods and should therefore be more effective.

To test this theory, I exercised my Lenovo T480 i5-8350U 8 thread CPU laptop with the stress-ng cpu stressor using the "double" precision math stress method, exercising 1 to 8 of the CPU threads over a 60 second test run.  The average CPU temperature and average CPU frequency were measured using powerstat and the CPU compute throughput was measured using the stress-ng bogo-ops count.

The HWP mode was set using the x86_energy_perf_policy tool (as found in the Linux source in tools/power/x86/x86_energy_perf_policy).  This allows one to select one of 5 policies:  "normal", "performance", "balance-performance", "balance-power" and "power" as well as enabling or disabling turbo frequencies.  For the tests, turbo mode was also enabled to allow the CPU to run at higher CPU turbo frequencies.

The "performance" policy is the least efficient option as the CPU is clocked at a high frequency even when the system is idle, so it is not ideal for a laptop. The "power" policy will optimize for low power; on my system it set the CPU to a maximum of 400MHz which is not ideal for typical uses.

The more useful "balance-performance" option optimizes for good throughput at the cost of power consumption whereas the "balance-power" option optimizes for good power consumption in preference to performance, so I tested these two options.

Comparison #1,  CPU throughput (bogo-ops) vs CPU frequency.

The two HWP policies are almost identical in CPU bogo-ops throughput vs CPU frequency. This is hardly surprising - the compute throughput for math intensive operations should scale with CPU clock frequency. Note that using 5 or more CPU threads sees a reduction in compute throughput because the CPU hyper-threads are being used.

Comparison #2, CPU package temperature vs CPU threads used.

Not a big surprise: the more CPU threads being exercised, the hotter the CPU package will get. The balance-power policy shows a cooler running CPU than the balance-performance policy.  The balance-performance policy is running hot even when only one or a few threads are being used.

Comparison #3, Power consumed vs CPU threads used.

Clearly the balance-performance option is consuming more power than balance-power, and this matches the CPU temperature measurements too. More power, higher temperature.

Comparison #4, Maximum CPU frequency vs CPU threads used.

With the balance-performance option, the average maximum CPU frequency drops as more CPU threads are used.  Intel turbo boost allows one to clock a few CPUs to higher frequencies, but exercising more CPUs leads to more power and hence more heat. To keep the CPU package from hitting thermal overrun, CPU frequency and voltage have to be scaled down when using more CPUs.

This is also true (but far less pronounced) for the balance-power option. As one can see, balance-performance runs the CPU at a much higher frequency, which is great for compute at the expense of power consumption and heat.

Comparison #5, Compute throughput vs power consumed.

So the balance-performance option runs the CPU at a higher speed and hence one gets more compute throughput per unit of time compared to the balance-power mode.  That's great if your laptop is plugged into the mains and you want to get some compute intensive tasks performed quickly.  However, is this more efficient?

Comparing the amount of compute performance with the power consumed shows that the balance-power option is more efficient than balance-performance.  Basically with balance-power more compute is possible with the same amount of energy compared to balance-performance, but it will take longer to complete.

CPU frequency scaling over time

The 60 second duration tests were long enough for the CPU to warm up enough to reach thermal limits, causing HWP to throttle back the voltage and CPU frequencies.  The following graphs illustrate how running with the balance-performance option allows the CPU to run for several seconds at a high turbo frequency before it hits a thermal limit and then the CPU frequency and power are adjusted to avoid thermal overrun:

 

After 8 seconds the CPU package reached 92 degrees C and then CPU frequency scaling kicked in:

..and power consumption drops too:

..it is interesting to note that we can only run for ~9 seconds before the CPU is scaled back to around the same CPU frequency that the balance-power option allows.

Conclusion

Running with the HWP balance-power option is a good default choice for maximizing compute while minimizing power consumption for a modern Intel based laptop.  If one wants to crank up the performance at the expense of battery life, then the balance-performance option is most useful.

The balance-performance option when a laptop is plugged into the mains (e.g. via a base-station) may seem like a good idea to get peak compute performance. Note that this may not be useful in the long term as the CPU frequency may drop back to avoid thermal overrun.  However, for bursty infrequent demanding CPU uses this may be a good choice.  I personally refrain from using this as it makes my CPU run rather hot and it's less efficient, so it's not ideal for reducing my carbon footprint.

Laptop manufacturers normally set the default HWP option as "balance-power", but this may be changed in the BIOS settings (look for power balance, HWP or Speed Shift options) or changed with x86_energy_perf_policy tool (found in the linux-tools-generic package in Ubuntu).

Friday, 9 July 2021

New features in stress-ng 0.12.12

The release of stress-ng 0.12.12 incorporates some useful features and a handful of new stressors.

Media devices such as HDDs and SSDs normally support Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.) to detect and report various measurements of drive reliability.  To complement the various file system and I/O stressors, stress-ng now has a --smart option that checks for any changes in the S.M.A.R.T. measurements and will report these at the end of a stress run, for example:

..as one can see, there are errors on /dev/sdc and this explains why the ZFS pool was having performance issues.

For x86 CPUs I have added a new stressor to trigger System Management Interrupts via writes to port 0xb2 to force the CPU into System Management Mode in ring -2. The --smi stressor option will also measure the time taken to service the SMI. To run this stressor, one needs the --pathological option since SMIs behave like non-maskable interrupts and this may hang the computer:

To exercise the munmap(2) system call a new munmap stressor has been added. This creates child processes that walk through their memory mappings from /proc/$pid/maps and unmap pages on libraries that are not being used. The unmapping is performed by striding across the mapping in page multiples of prime size to create many mapping holes to exercise the VM mapping structures. These unmappings can create SIGSEGV segmentation faults that silently get handled and respawn a new child stressor. Example:

 

There are some new options for the fork, vfork and vforkmany stressors: a new vm mode has been added to try and exercise virtual memory mappings. This enables detrimental performance virtual memory advice using madvise on all pages of the new child process. Where possible this will try to set every page in the new process using the madvise MADV_MERGEABLE, MADV_WILLNEED, MADV_HUGEPAGE and MADV_RANDOM flags.  The following shows how to enable the vm options for the fork and vfork stressors:

One final new feature is the --skip-silent option.  This will disable printing of messages when a stressor is skipped, for example, if the stressor is not supported by the kernel, the hardware or a support library is not available.

As usual for each release, stress-ng incorporates bug fixes and has been tested on a wide variety of Linux/*BSD/UNIX/POSIX systems and across a range of processor architectures (arm32, arm64, amd64, i386, ppc64el, RISC-V, s390x, sparc64, m68k).  It has also been statically analysed with Coverity and cppcheck and built cleanly with pedantic build flags on gcc and clang.

Friday, 21 May 2021

Adjacent C string concatenation gotcha

C has the useful feature of allowing adjacent literal strings to be automatically concatenated. This is described in K&R "The C programming language" 2nd edition, page 194, section A2.6 "String Literals":

"Adjacent string literals are concatenated into a single string."

Unfortunately over the years I've seen several occasions where this useful feature has led to bugs not being detected, for example, the following code:

 

A simple typo in the "Does Not Compute" error message ends up with the last two literal strings being "silently" concatenated, causing an array out of bounds read when error_msgs[7] is accessed.

This can also bite when a new literal string is added to the end of the array.  The previous string requires a comma to be added when a new string is appended, and sometimes this is overlooked.  An example of this is in ACPICA commit 81eb9c383e6dee0f1b6620e91e5c3dbb48234831 - fortunately static analysis detected this and it has been fixed with commit 3f8c79fc22fd64b739f51268654a6783a874520e.

The concerning issue is that this useful string concatenation feature can produce hazardous outcomes with coding mistakes that are simple to make but hard to notice.

Friday, 23 April 2021

C Ternary operator gotcha (type conversions)

The C ternary operator expr1 ? expr2 : expr3 has a subtle side note described in K&R 2nd edition, page 52, section 2.11:

"If expr2 and expr3 are of different types, the type of the result is determined by the conversion rules discussed earlier in the chapter".

This refers to page 44, section 2.7 "Type Conversions". It's worth reading a few times along with section A6.5 "Arithmetic Conversions".

Here is an example of a type conversion gotcha:

 

At a glance one would think the program would print out -1 as the output.  However, expr2 is actually type converted to unsigned int, so the result is 4294967295 if int types are 32 bits wide.

One solution is to type convert expr2 and/or expr3 to a long int and because that is wider than the unsigned int x this takes precedence. Alternatively just use:

References: soc: aspeed: fix a ternary sign expansion bug