In my first test, I wanted to see how the Intel rdrand instruction performed with 2 CPUs loaded (each with a hyper-thread):
$ perf stat stress-ng --rdrand 4 -t 60 --times
stress-ng: info: [7762] dispatching hogs: 4 rdrand
stress-ng: info: [7762] successful run completed in 60.00s
stress-ng: info: [7762] for a 60.00s run time:
stress-ng: info: [7762] 240.01s available CPU time
stress-ng: info: [7762] 231.05s user time ( 96.27%)
stress-ng: info: [7762] 0.11s system time ( 0.05%)
stress-ng: info: [7762] 231.16s total time ( 96.31%)
Performance counter stats for 'stress-ng --rdrand 4 -t 60 --times':
231161.945062 task-clock (msec) # 3.852 CPUs utilized
18,450 context-switches # 0.080 K/sec
92 cpu-migrations # 0.000 K/sec
821 page-faults # 0.004 K/sec
667,745,260,420 cycles # 2.889 GHz
646,960,295,083 stalled-cycles-frontend # 96.89% frontend cycles idle
stalled-cycles-backend
13,702,533,103 instructions # 0.02 insns per cycle
# 47.21 stalled cycles per insn
6,549,840,185 branches # 28.334 M/sec
2,352,175 branch-misses # 0.04% of all branches
60.006455711 seconds time elapsed
stress-ng's rdrand test just performs a 64 bit rdrand read and loops on this until the data is ready, and performs this 32 times in an unrolled loop. Perf stat shows that each rdrand + loop sequence on average consumes about 47 stall cycles showing that rdrand is probably just waiting for the PRNG block to produce random data.
My next experiment was to run the stress-ng ackermann stressor; this performs a lot of recursion, hence one should see a predominantly large amount of branching.
$ perf stat stress-ng --cpu 4 --cpu-method ackermann -t 60 --times
stress-ng: info: [7796] dispatching hogs: 4 cpu
stress-ng: info: [7796] successful run completed in 60.03s
stress-ng: info: [7796] for a 60.03s run time:
stress-ng: info: [7796] 240.12s available CPU time
stress-ng: info: [7796] 226.69s user time ( 94.41%)
stress-ng: info: [7796] 0.26s system time ( 0.11%)
stress-ng: info: [7796] 226.95s total time ( 94.52%)
Performance counter stats for 'stress-ng --cpu 4 --cpu-method ackermann -t 60 --times':
226928.278602 task-clock (msec) # 3.780 CPUs utilized
21,752 context-switches # 0.096 K/sec
127 cpu-migrations # 0.001 K/sec
927 page-faults # 0.004 K/sec
594,117,596,619 cycles # 2.618 GHz
298,809,437,018 stalled-cycles-frontend # 50.29% frontend cycles idle
stalled-cycles-backend
845,746,011,976 instructions # 1.42 insns per cycle
# 0.35 stalled cycles per insn
298,414,546,095 branches # 1315.017 M/sec
95,739,331 branch-misses # 0.03% of all branches
60.032115099 seconds time elapsed
..so about 35% of the time is used in branching and we're getting about 1.42 instructions per cycle and no many stall cycles, so the code is most probably executing inside the instruction cache, which isn't surprising because the test is rather small.
My final experiment was to measure the stall cycles when performing complex long double floating point math operations, again with stress-ng.
$ perf stat stress-ng --cpu 4 --cpu-method clongdouble -t 60 --times
stress-ng: info: [7854] dispatching hogs: 4 cpu
stress-ng: info: [7854] successful run completed in 60.00s
stress-ng: info: [7854] for a 60.00s run time:
stress-ng: info: [7854] 240.00s available CPU time
stress-ng: info: [7854] 225.15s user time ( 93.81%)
stress-ng: info: [7854] 0.44s system time ( 0.18%)
stress-ng: info: [7854] 225.59s total time ( 93.99%)
Performance counter stats for 'stress-ng --cpu 4 --cpu-method clongdouble -t 60 --times':
225578.329426 task-clock (msec) # 3.757 CPUs utilized
38,443 context-switches # 0.170 K/sec
96 cpu-migrations # 0.000 K/sec
845 page-faults # 0.004 K/sec
651,620,307,394 cycles # 2.889 GHz
521,346,311,902 stalled-cycles-frontend # 80.01% frontend cycles idle
stalled-cycles-backend
17,079,721,567 instructions # 0.03 insns per cycle
# 30.52 stalled cycles per insn
2,903,757,437 branches # 12.873 M/sec
52,844,177 branch-misses # 1.82% of all branches
60.048819970 seconds time elapsed
The complex math operations take some time to complete, stalling on average over 35 cycles per op. Instead of using 4 concurrent processes, I re-ran this using just the two CPUs and eliminating 2 of the hyperthreads. This resulted in 25.4 stall cycles per instruction showing that hyperthreaded processes are stalling because of contention on the floating point units.Perf stat is an incredibly useful tool for examining performance issues at a very low level. It is simple to use and yet provides excellent stats to allow one to identify issues and fine tune performance critical code. Well worth using.