Over the last few weeks I have been toying with the idea of adding more performance monitoring to stress-ng so one can see how much a stress test impacts on the CPU.  The obvious choice to get such low level data is via Linux perf events using perf_event_open(2).
The man page for perf_event_open() provides plenty of information to get perf working from userspace. However, I was a bit stumped when I used several hardware perf events and then ran out of hardware Performance Monitoring Unit (PMU) counters, resulting in some strange event counter readings. I discovered that when one runs out of counters, perf will multiplex event counting, and so each perf counter needs to be scaled by multiplying it by the total time enabled and dividing by the total time running (the two values returned when PERF_FORMAT_TOTAL_TIME_ENABLED and PERF_FORMAT_TOTAL_TIME_RUNNING are set in read_format).
Once I had figured this out, it was relatively plain sailing to get perf working in stress-ng. 
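To make the scaling concrete, here is a minimal sketch in Python (stress-ng itself is written in C, so this is an illustration, not its code). It assumes a single, non-grouped event opened with both time flags set and no PERF_FORMAT_ID, in which case read() returns three 64-bit values: the count, the time enabled and the time running. The function name and the sample numbers are made up for the example.

```python
import struct

def scale_perf_count(buf: bytes) -> float:
    """Decode one perf read() buffer and scale the count for multiplexing."""
    # Assumed layout (time-enabled and time-running flags set, no group, no ID):
    #   u64 value, u64 time_enabled, u64 time_running
    value, time_enabled, time_running = struct.unpack("=QQQ", buf[:24])
    if time_running == 0:
        # The event never got scheduled onto a hardware counter
        return 0.0
    return value * time_enabled / time_running

# Hypothetical reading: the event was only counted for half the enabled time,
# so the raw count is doubled to estimate the full-interval count.
sample = struct.pack("=QQQ", 1_000_000, 2_000_000, 1_000_000)
print(scale_perf_count(sample))  # 2000000.0
```

The scaled value is an estimate: it assumes the event rate was the same while the counter was multiplexed out as while it was running.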
So stress-ng V0.04.04 now supports the --perf option, which simply enables perf monitoring on each stress test being run.  For multiple instances of a stress test, stress-ng will sum the perf counters of each process running the stress test to provide an overall total.
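The summing of per-process counters can be pictured with a small sketch like the one below. This is only an illustration of the idea in Python; the event names and values are hypothetical and the helper is not taken from stress-ng.

```python
from collections import Counter

def sum_perf_counters(per_process):
    """Add up the (already scaled) counters reported by each stressor process."""
    total = Counter()
    for counters in per_process:
        total.update(counters)  # Counter.update() adds counts per key
    return dict(total)

# Hypothetical counters from two instances of the same stress test
instances = [
    {"cycles": 1000, "instructions": 1500, "cache_misses": 20},
    {"cycles": 1100, "instructions": 1400, "cache_misses": 30},
]
print(sum_perf_counters(instances))
```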
The following example will run the stress-ng cache stress test.  The first run enables cache flushing and so fetches of data will cause cache misses.  The second run has no cache flushing and hence has far lower cache miss rate.
Note how the cache-flushing not only causes a far higher cache miss rate, but also reduces the effective number of instructions per cycle being executed and hence reduces the throughput (as one would expect).  With cache-flushing enabled I was seeing only 17.53 bogo ops per second compared to the 35.97 bogo ops per second with cache-flushing disabled.
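For reference, the derived figures mentioned here are simple ratios of the raw counters. The sketch below shows how they could be computed; the helper names and the counter values passed to them are hypothetical, while the two bogo ops rates are the ones quoted above.

```python
def instructions_per_cycle(instructions, cycles):
    """Effective instructions retired per CPU cycle."""
    return instructions / cycles

def cache_miss_rate(misses, references):
    """Fraction of cache references that missed."""
    return misses / references

# Hypothetical counter values, only to show the calculation
print(instructions_per_cycle(3_000_000, 2_000_000))  # 1.5
print(cache_miss_rate(250_000, 1_000_000))           # 0.25

# Throughput with cache flushing relative to without, from the quoted rates
ratio = 17.53 / 35.97
print(f"{ratio:.2f}")  # 0.49
```

In other words, enabling cache flushing roughly halved the bogo ops rate in this run.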
The perf stats are enlightening. I still find it incredible that my laptop has so much computing power.  Some of the more compute bound stressors (such as the stress-ng bitops cpu stressor) are hitting over 20 billion instructions per second on my machine, which is rather impressive.  It seems that gcc optimization and the x86 superscalar micro-ops are working efficiently with some of these stress tests.
My hope is that the integrated perf monitoring in stress-ng will be instructive when comparing results on different processor architectures across the range of stress-ng stress tests.
Sunday, 31 May 2015
Monday, 25 May 2015
comparing gcc 4.9.1 with 5.1.1 with CPU stressors
As a simple experiment, I thought it would be interesting to investigate stress-ng compiled with GCC 4.9.1 and GCC 5.1.1 in terms of computational improvement and power consumption on various CPU stress methods.   The stress-ng CPU stress test contains various different mixes of integer, floating point, bit operations and logic operations that can be used for processor loading, so it makes a useful test to see how well the code gets optimized with GCC.
Stress-ng provides a "bogo-ops" mechanism to measure a "unit of operation". Normally this is just a count of the number of operations performed in a unit of time, which allows us to compare the relative performance of each stress method when compiled with different versions of GCC. Running each stress method for a relatively long time (a few minutes) on an idle machine allows us to get a fairly stable and accurate measurement of bogo-ops per second. Tests were run on a Lenovo x230 with an i5-3210M CPU.
The first chart below shows the relative improvement in bogo-ops per second between the two versions of GCC. A value of n indicates GCC 5.1.1 is n times faster in terms of bogo-ops per second than GCC 4.9.1, hence values less than 1.0 show that GCC 5.1.1 has regressed in performance.
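The value plotted for each stress method is just the ratio of the two bogo-ops rates. A minimal sketch of that calculation follows; the function name and the rates used are hypothetical and are not measurements from this experiment.

```python
def relative_improvement(new_rate, old_rate):
    """Ratio of bogo-ops per second: above 1.0 is faster, below 1.0 is a regression."""
    return new_rate / old_rate

# Hypothetical bogo-ops per second for two stress methods: (GCC 4.9.1, GCC 5.1.1)
rates = {
    "method_a": (1000.0, 1600.0),
    "method_b": (2000.0, 1800.0),
}
for method, (old, new) in rates.items():
    n = relative_improvement(new, old)
    verdict = "improved" if n > 1.0 else "regressed"
    print(f"{method}: {n:.2f}x ({verdict})")
```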
It appears that int64, int32, int16, int8 and rand show some remarkable improvements with GCC 5.1.1; these all perform various integer operations (add, subtract, multiply, divide, xor, and, or, shift).
In contrast, hamming, hanoi, parity and sieve show degraded performance with GCC 5.1.1. Hanoi just exercises recursion of a function with a few arguments and some memory load/stores. Hamming, parity and sieve exercise bit twiddling operations and memory load/stores.
Further to just measuring computation, I used the Intel RAPL CPU package power measurements (using powerstat) to next measure the power consumed and then compute bogo ops per Watt for stress-ng built with GCC 4.9.1 and 5.1.1. I then compared the relative improvement of 5.1.1 compared to 4.9.1:
The chart above shows the same kind of characteristics as the first chart, but in terms of computational improvement per Watt. Note that there are even better improvements in relative terms for the integer and rand CPU stress methods. For example, the rand stress method shows a 1.6 x improvement in terms of computation per second and a 2.1 x improvement in terms of computation per Watt comparing GCC 4.9.1 with 5.1.1.
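One thing the two rand figures imply, although it is not stated directly, is the change in power draw. Since the per-Watt improvement is the speed improvement divided by the power ratio, the power ratio can be recovered from the 1.6 x and 2.1 x values. The sketch below does that arithmetic; the function names are my own.

```python
def bogo_ops_per_watt(bogo_ops_per_sec, watts):
    """Computation per unit of power."""
    return bogo_ops_per_sec / watts

def implied_power_ratio(speed_ratio, per_watt_ratio):
    """per_watt_ratio = speed_ratio / power_ratio, so solve for power_ratio."""
    return speed_ratio / per_watt_ratio

power_ratio = implied_power_ratio(1.6, 2.1)
print(f"{power_ratio:.2f}")  # 0.76
```

So, taking the two quoted figures at face value, the GCC 5.1.1 build of the rand method drew roughly three quarters of the package power of the GCC 4.9.1 build while doing 1.6 times the work per second.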
It seems that benchmarking performance in terms of just compute improvements really should take power consumption into consideration too, to get a better idea of how much compiler optimizations actually improve matters. Compute-per-watt rather than compute-per-second should perhaps be the preferred benchmark in modern high-density compute farms.
Of course, these comparisons are just with one specific x86 micro-architecture, so one would expect different results for different x86 CPUs. I guess that is for another weekend to test if I get time.
Sunday, 24 May 2015
comparing cpuburn and stress-ng
The cpuburn package contains several hand crafted assembler "burn" programs to load x86 processors and to maximize heat production to stress a system.  This is also the intention of the stress-ng "cpu" stress test, which contains a variety of methods to stress CPUs with a wide range of instruction mixes.   Stress-ng is written in C and relies on the compiler to generate efficient code to hopefully load the CPU.  So how does stress-ng compare to the hand crafted cpuburn suite of programs on modern processors?
Since there is a correlation between power consumed and heat generated, I took the liberty of measuring the CPU package power consumption using the Intel RAPL interface as one way of comparing cpuburn and stress-ng. Recent versions of powerstat support RAPL, so I ran each stressor for 120 seconds and took CPU package power measurements every 4 seconds over this interval with powerstat.
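With a 120 second run sampled every 4 seconds, each stressor yields 30 power samples. A minimal sketch of how such samples could be summarised is shown below; the helper name and the sample values are hypothetical, and this is not how powerstat itself reports its statistics.

```python
import statistics

def summarize_power(samples_watts):
    """Return (mean, sample standard deviation) of the power samples in Watts."""
    return statistics.mean(samples_watts), statistics.stdev(samples_watts)

duration_secs, interval_secs = 120, 4
n_samples = duration_secs // interval_secs
print(n_samples)  # 30

# Hypothetical package power readings in Watts
mean_w, stdev_w = summarize_power([20.0, 22.0, 24.0])
print(mean_w, stdev_w)  # 22.0 2.0
```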
So, the cpuburn "burn" programs do well; however, some of the stress-ng CPU stress methods seem to do better. The best stress-ng CPU methods are: ackermann, callfunc, hanoi, decimal128, dither, int128decimal128, trig and zeta. It appears that ackermann, callfunc and hanoi do well because these are very localised deeply recursive function calls, so I expect register save/restores and some stack activity are the main power consumers. The rest exercise the integer and floating point units and memory load/stores.
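To give a feel for the call pattern of the recursive methods, here are the textbook Ackermann and Towers of Hanoi recursions. These are sketches in Python for illustration only: they are not the stress-ng implementations (which are in C), and Python will not load a CPU in the same way.

```python
def ackermann(m, n):
    """Textbook Ackermann function: tiny body, very heavy recursion."""
    if m == 0:
        return n + 1
    if n == 0:
        return ackermann(m - 1, 1)
    return ackermann(m - 1, ackermann(m, n - 1))

def hanoi(n, src="A", dst="C", via="B"):
    """Count the moves needed to shift n discs from src to dst."""
    if n == 0:
        return 0
    # Move n-1 discs out of the way, move the largest disc, move n-1 discs back on top
    return hanoi(n - 1, src, via, dst) + 1 + hanoi(n - 1, via, dst, src)

print(ackermann(2, 3))  # 9
print(hanoi(10))        # 1023
```

Both functions do almost no arithmetic per call, so nearly all of the work is argument passing, call/return and stack traffic.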
As it stands, a handful of stress-ng CPU stressors aren't as good as cpuburn. What is noticeable is that burnBX on an i3120M seems to do rather well in terms of loading the CPU.
One conclusion to draw from this is that modern C compilers such as gcc (in this case, gcc 4.9.2) with a suitably chosen mix of stores, loads and integer/floating point operations can outperform hand written assembler in terms of loading the full CPU package. When I have a little more time, I will try to repeat this experiment with clang and gcc 5.
Wednesday, 6 May 2015
stress-ng updates for Ubuntu 15.10 Wily Werewolf
An on-going background project of mine is to add various interesting system stress tests to stress-ng.  Over the past several months I've been looking at the ways to exercise various less used or obscure system calls just to add more kernel coverage to the tool.
- rlimit - generate tens of thousands of SIGXFSZ and many SIGXCPU signals
- itimer - exercise ITIMER_PROF and generate SIGPROF signals
- mlock - lock and unlock pages with mlock()/munlock()
- timerfd - exercise rapid CLOCK_REALTIME events by select() and read() on a timerfd
- memfd - exercise anonymous populated page memory mapping and unmapping using memfd
- more aggressive affinity stressor changes to force more CPU IPIs
- hdd - add readv/writev I/O option
- tee - tee data between a writer and reader process using tee()
- crypt - encrypt data with MD5, SHA-256 and SHA-512 using libcrypt
- mmapmany - perform tens of thousands of memory maps/unmaps to exhaust the per-process mapping limit.
- zombie - fill up process table with tens of thousands of zombie processes
- str - heavily exercise a range of glibc string functions
- xattr - exercise file extended attributes
- readahead - random reads with readaheads
- vm - add a rowhammer memory stressor
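As an illustration of the kind of kernel activity one of these stressors generates, the sketch below arms ITIMER_PROF and counts the resulting SIGPROF signals while burning CPU, which is the mechanism the itimer entry in the list refers to. It is a Python sketch under my own assumptions (the function name, interval and timeout are made up) and not the stress-ng code; it needs a Unix-like system and must run in the main thread.

```python
import signal
import time

def count_sigprof(target=5, interval=0.01, timeout=5.0):
    """Burn CPU until `target` SIGPROF signals arrive or `timeout` seconds pass."""
    received = 0

    def on_sigprof(signum, frame):
        nonlocal received
        received += 1

    old_handler = signal.signal(signal.SIGPROF, on_sigprof)
    # ITIMER_PROF counts CPU time used by the process (user and system)
    signal.setitimer(signal.ITIMER_PROF, interval, interval)
    deadline = time.monotonic() + timeout
    try:
        while received < target and time.monotonic() < deadline:
            pass  # busy loop so the profiling timer keeps expiring
    finally:
        signal.setitimer(signal.ITIMER_PROF, 0, 0)  # disarm the timer
        signal.signal(signal.SIGPROF,
                      old_handler if old_handler is not None else signal.SIG_DFL)
    return received

print(count_sigprof())
```

A real stressor would use a far shorter interval to push the signal rate as high as possible; the values here are deliberately gentle.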
I've recently been using stress-ng to exercise various kernels on a range of hardware and it has been useful in forcing bugs, especially with the memory specific stressors that seem to trip low memory corner cases.
stress-ng 0.04.01 will soon be available in Ubuntu 15.10 Wily Werewolf. Visit the stress-ng project page for more details.
powerstat improvements with RAPL
The Linux Running Average Power Limit (RAPL) interface was introduced about 2 years ago in the Linux kernel and allows userspace to read the power consumption from various x86 System-on-a-Chip (SoC) power domains.  The power domains include the SoC package, CPU core, DRAM controller and graphics power plane.
It appears that the Intel energy status MSRs can be read very rapidly and the resolution is exceptionally good; however, reading the MSR too frequently will consume some power when using the RAPL interface.
I've improved powerstat to now use the RAPL interface with a new -R option (to measure just the total package power consumption). A new -D option will show all the RAPL domain measurements available. RAPL measurements are very responsive and one can easily correlate power spikes with bursts of system activity.
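RAPL exposes cumulative energy counters, so power has to be derived from the difference between two readings. The sketch below shows that calculation in Python. It assumes the Linux powercap sysfs files (energy_uj and max_energy_range_uj under /sys/class/powercap/intel-rapl:0) as the data source, which may not be how powerstat reads the values; the function names are hypothetical, and reading the files can require elevated privileges on some kernels.

```python
def read_energy_uj(path="/sys/class/powercap/intel-rapl:0/energy_uj"):
    """Read the cumulative package energy in microjoules (assumed sysfs path)."""
    with open(path) as f:
        return int(f.read())

def average_watts(e0_uj, e1_uj, seconds, max_range_uj):
    """Average power between two energy readings, allowing for one counter wrap."""
    delta = e1_uj - e0_uj
    if delta < 0:
        # The counter wrapped around between the two readings
        delta += max_range_uj
    return delta / 1e6 / seconds

# Hypothetical readings taken one second apart
print(average_watts(1_000_000, 21_000_000, 1.0, 2**32))  # 20.0
```

Sampling less often lowers the measurement overhead, at the cost of averaging out short power spikes.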
Finally, I have added a basic histogram output with the new -H option. This will plot histograms of the power measurements and CPU load from the stats gathered during the powerstat run.
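The idea behind the histogram output is simply to bucket the gathered samples into equal-width ranges and count them. The sketch below shows one way to do that; it is not the powerstat implementation, its output format differs, and the sample values are hypothetical.

```python
def histogram(samples, bins=5):
    """Bucket samples into equal-width bins; returns (low, high, count) tuples."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / bins or 1.0  # avoid a zero width if all samples are equal
    counts = [0] * bins
    for s in samples:
        # Clamp so the maximum sample lands in the last bin
        idx = min(int((s - lo) / width), bins - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, c) for i, c in enumerate(counts)]

# Hypothetical power samples in Watts
for low, high, count in histogram([10, 11, 12, 13, 14, 20], bins=2):
    print(f"{low:6.2f} - {high:6.2f} {'#' * count}")
```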
Powerstat 0.01.37 is available in Ubuntu 15.10 Wily Werewolf and the source is available from the git repository.