Saturday 24 July 2021

Intel Hardware P-State (HWP) / Intel Speed Shift

Intel Hardware P-State (aka Harware Controlled Performance or "Speed Shift") (HWP) is a feature found in more modern x86 Intel CPUs (Skylake onwards). It attempts to select the best CPU frequency and voltage to match the optimal power efficiency for the desired CPU performance.  HWP is more responsive than the older operating system controlled methods and should therefore be more effective.

To test this theory, I exercised my Lenovo T480 i5-8350U 8 thread CPU laptop with the stress-ng cpu stressor using the "double" precision math stress method, exercising 1 to 8 of the CPU threads over a 60 second test run.  The average CPU temperature and average CPU frequency were measured using powerstat and the CPU compute throughput was measured using the stress-ng bogo-ops count.

The HWP mode was set using the x86_energy_perf_policy tool (as found in the Linux source in tools/power/x86/x86_energy_perf_policy).  This allows one to select one of 5 policies:  "normal", "performance", "balance-performance", "balance-power" and "power" as well as enabling or disabling turbo frequencies.  For the tests, turbo mode was also enabled to allow the CPU to run at higher CPU turbo frequencies.

The "performance" policy is the least efficient option as the CPU is clocked at a high frequency even when the system is idle and is not idea for a laptop. The "power" policy will optimize for low power; on my system it set the CPU to a maximum of 400MHz which is not ideal for typical uses.

The more useful "balance-performance" option optimizes for good throughput at the cost of power consumption where as the "balance-power" option optimizes for good power consumption in preference to performance, so I tested these two options.

Comparison #1,  CPU throughput (bogo-ops) vs CPU frequency.

The two HWP policies are almost identical in CPU bogo-ops throughput vs CPU frequency. This is hardly surprising - the compute throughput for math intensive operations should scale with CPU clock frequency. Note that 5 or more CPU threads sees a reduction in compute throughput because the CPU hyper-threads are being used.

Comparison #2, CPU package temperature vs CPU threads used.

Not a big surprise, the more CPU threads being exercised the hotter the CPU package will get. The balance-power policy shows a cooler running CPU than the balance-performance policy.  The balance-performance policy is running hot even when one or a few threads are being used.

Comparison #3, Power consumed vs CPU threads used.

Clearly the balance-performance option is consuming more power than balance-power, this matches the CPU temperature measurements too. More power, higher temperature.

Comparison #4, Maximum CPU frequency vs CPU threads used.

With the balance-performance option, the average maximum CPU frequency drops as more CPU threads are used.  Intel turbo boost allows one to clock a few CPUs to higher frequencies,  exercising more CPUs leads to more power and hence more heat. To keep the CPU package from hitting thermal overrun, CPU frequency and voltage has to be scaled down when using more CPUs. 

This also is true (but far less pronounced) for the balance-power option. As once can see, balance-performance runs the CPU at a much higher frequency, which is great for compute at the expense of power consumption and heat.

Comparison #5, Compute throughput vs power consumed.

So running with the balance-performance runs the CPU at higher speed and hence one gets more compute throughput per unit of time compared to the balance-power mode.  That's great if your laptop is plugged into the mains and you want to get some compute intensive tasks performed quickly.   However, is this more efficient? 

Comparing the amount of compute performance with the power consumed shows that the balance-power option is more efficient than balance-performance.  Basically with balance-power more compute is possible with the same amount of energy compared to balance-performance, but it will take longer to complete.

CPU frequency scaling over time

The 60 second duration tests were long enough for the CPU to warm up enough reach thermal limits causing HWP to throttle back the voltage and CPU frequencies.  The following graphs illustrate how running with the balance-performance option allows the CPU to run for several seconds at a high turbo frequency before it hits a thermal limit and then the CPU frequency and power is adjusted to avoid thermal overrun:

 

After 8 seconds the CPU package reached 92 degrees C and then CPU frequency scaling kicks in:

..and power consumption drops too:

..it is interesting to note that we can only run for ~9 seconds before the CPU is scaled back to around the same CPU frequency that the balance-power option allows.

Conclusion

Running with HWP balance-power option is a good default choice for maximizing compute while minimizing power consumption for a modern Intel based laptop.  If one wants to crank up the performance at the expense of battery life, then the balance-performance option is most useful.

The balance-performance option when a laptop is plugged into the mains (e.g. via a base-station) may seem like a good idea to get peak compute performance. Note that this may not be useful in the long term as the CPU frequency may drop back to reduce thermal overrun.  However, for bursty infrequent demanding CPU uses this may be a good choice.  I personally refrain from using this as it makes my CPU rather run hot and it's less efficient so it's not ideal for reducing my carbon footprint.

Laptop manufacturers normally set the default HWP option as "balance-power", but this may be changed in the BIOS settings (look for power balance, HWP or Speed Shift options) or changed with x86_energy_perf_policy tool (found in the linux-tools-generic package in Ubuntu).

Friday 9 July 2021

New features in stress-ng 0.12.12

The release of stress-ng 0.12.12 incorporates some useful features and a handful of new stressors.

Media devices such as HDDs and SSDs normally support Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.) to detect and report various measurements of drive reliability.  To complement the various file system and I/O stressors, stress-ng now has a --smart option that checks for any changes in the S.M.A.R.T. measurements and will report these at the end of a stress run, for example:

..as one can see, there are errors on /dev/sdc and this explains why the ZFS pool was having performance issues.

For x86 CPUs I have added a new stressor to trigger System Management Interrupts via writes to port 0xb2 to force the CPU into System Management Mode in ring -2. The --smi stressor option will also measure the time taken to service the SMI. To run this stressor, one needs the --pathological option since this may hang the computer and they behave like non-maskable interrupts:

To exercise the munmap(2) system call a new munmap stressor has been added. This creates child processes that walk through their memory mappings from /proc/$pid/maps and unmap pages on libraries that are not being used. The unapping is performed by striding across the mapping in page multiples of prime size to create many mapping holes to exercise the VM mapping structures. These unmappings can create SIGSEGV segmentation faults that silently get handled and respawn a new child stressor. Example:

 

There some new options for the fork, vfork and vforkmany stressors, a new vm mode has been added to try and exercise virtual memory mappings. This enables detrimental performance virtual memory advice using  madvise  on  all  pages of the new child process. Where possible this will try to set every page in the new process with using madvise MADV_MERGEABLE, MADV_WILLNEED, MADV_HUGEPAGE  and  MADV_RANDOM flags.  The following shows how to enable the vm options for the fork and vfork stressors:

One final new feature is the --skip-silent option.  This will disable printing of messages when a stressor is skipped, for example, if the stressor is not supported by the kernel, the hardware or a support library is not available.

As usual for each release, stress-ng incorporates bug fixes and has been tested on a wide variety of Linux/*BSD/UNIX/POSIX systems and across a range of processor architectures (arm32, arm64, amd64, i386, ppc64el, RISC-V s390x, sparc64, m68k.  It has also been statically analysed with Coverity and cppcheck and built cleanly with pedantic build flags on gcc and clang.


 




 








Friday 21 May 2021

Adjacent C string concatenation gotcha

C has the useful feature of adjacent allowing literal strings to be automatically concatenated. This is described in K&R "The C programming language" 2nd edition, page 194, section A2.6 "String Literals":

"Adjacent string literals are concatenated into a single string."

Unfortunately over the years I've seen several occasions where this useful feature has led to bugs not being detected, for example, the following code:

 

A simple typo in the "Does Not Compute" error message ends up with the last two literal strings being "silently" concatenated, causing an array out of bounds read when error_msgs[7] is accessed.

This can also bite when a new literal string is added to the end of the array.  The previous string requires a comma to be added when a new string is added to the array, and sometimes this is overlooked.  An example of this is in ACPICA commit 81eb9c383e6dee0f1b6620e91e5c3dbb48234831  - fortunately static analysis detected this and it has been fixed with commit 3f8c79fc22fd64b739f51268654a6783a874520e

The concerning issue is that this useful string concatenation feature can produce hazardous outcomes with coding mistakes that are simple to make but hard to notice.

 

                                              

Friday 23 April 2021

C Ternary operator gotcha (type conversions)

The C ternary operator expr1 ? expr2 : expr3 has a subtle side note described in K&R 2nd edition, page 52, section 2.11:

"If expr2 and expr3 are of different types, the type of the result is determined by the conversion rules discussed earlier in the chapter".

This refers to page 44, section 2.7 "Type Conversions". It's worth reading a few times along with section A6.5 "Arithmetic Conversions".

Here is an example of a type conversion gotcha:

 

At a glance one would think the program would print out -1 as the output.  Note that the expr2 is actually type converted to unsigned int so the result is 4294967295 if int types are 32 bits wide.

One solution is to type convert expr2 and/or expr3 to a long int and because that is wider than the unsigned int x this takes precedence. Alternatively just use:

References: soc: aspeed: fix a ternary sign expansion bug

Tuesday 30 March 2021

A C for-loop Gotcha

The C infinite for-loop gotcha is one of the less frequent issues I find with static analysis, but I feel it is worth documenting because it is obscure but easy to do.

Consider the following C example:

Since i is a 8 bit integer, it will wrap around to zero when it reaches the maximum 8 bit value of 255 and so we end up with an infinite loop if the upper limit of the loop n is 256 or more. 

The fix is simple, always ensure the loop counter is at least as wide as the type of the maximum limit of the loop. This example, variable i should be a uint32_t type.

I've seen this occur in the Linux kernel a few times.  Sometimes it is because the loop counter is being passed into a function call that expects a specific type such as a u8, u16.  In other occasions I've seen a u16 (or short) integer being used presumably because it was expected to produce faster code, however, most commonly 32 bit integers just as fast (or sometimes faster) than 16 bit integers for this kind of operation.

Sunday 28 March 2021

A Common C Integer Multiplication Mistake

Multiplying integers in C is easy.  It is also easy to get it wrong.  A common issue found using static analysis on the Linux kernel is the integer overflow before widening gotcha.

Consider the following code that takes the 2 unsigned 32 bit integers, multiplies them together and returns the unsigned 64 bit result:

The multiplication is performed using unsigned 32 bit arithmetic and the unsigned 32 bit results is widened to an unsigned 64 bit when assigned to ret. A way to fix this is to explicitly cast a to a uint64_t before the multiplication to ensure an unsigned 64 bit multiplication is performed:

Fortunately static analysis finds these issues.  Unfortunately it is a bug that keeps on occurring in new code.

Thursday 18 March 2021

A common C integer shifting mistake

Shifting integers in C is easy.  Alas it is also easy to get it wrong.  A common issue found using static analysis on the Linux kernel is the unintentional sign extension gotcha.

Consider the following code that takes the 4 unsigned 8 bit integers in array data and returns an unsigned 64 bit integer:

C promotes the uint8_t integers into signed ints on the right shift. If data[3] has the upper bit set, for example with the value 0x80 and data[2]..data[0] are zero, the shifted 32 bit signed result is sign extended to a 64 bit long and the final result is 0xffffffff80000000.  A way to fix this is to explicitly cast data[3] to a uint64_t before shifting it.


Fortunately static analysis finds these issues.  Unfortunately it is a bug that keeps on occurring in new code.

Monday 4 January 2021

Improving kernel test coverage with stress-ng

Over the past year there has been focused work on improving the test coverage of the Linux Kernel with stress-ng.  Increased test coverage exercises more kernel code and hence improves the breadth of testing, allowing us to be more confident that more corner cases are being handled correctly.

The test coverage has been improved in several ways:

  1. testing more system calls; most system calls are being now exercised
  2. adding more ioctl() command tests
  3. exercising system call error handling paths
  4. exercise more system call options and flags
  5. keeping track of new features added to recent kernels and adding stress test cases for these
  6. adding support for new architectures (RISC-V for example)

Each stress-ng release is run with various stressor options against the latest kernel (built with gcov enabled).  The gcov data is processed with lcov to produce human readable kernel source code containing coverage annotations to help inform where to add more test coverage for the next release cycle of stress-ng. 

Linux Foundation sponsored Piyush Goyal for 3 months to add test cases that exercise system call test failure paths and I appreciate this help in improving stress-ng. I finally completed this tedious task at the end of 2020 with the release of stress-ng 0.12.00.

Below is a chart showing how the kernel coverage generated by stress-ng has been increasing since 2015. The amber line shows lines of code exercised and the green line shows kernel functions exercised.

 


..one can see that there was a large increase of kernel test coverage in the latter half of 2020 with stress-ng.  In all, 2020 saw ~20% increase on kernel coverage, most of this was driven using the gcov analysis, however, there is more to do.

What next?  Apart from continuing to add support for new kernel system calls and features I hope to improve the kernel coverage test script to exercise more file systems; it will be interesting to see what kind of bugs get found. I'll also be keeping the stress-ng project page refreshed as this tracks bugs that stress-ng has found in the Linux kernel.

As it stands, release 0.12.00 was a major milestone for stress-ng as it marks the completion of the major work items to improve kernel test coverage.