Saturday, 17 October 2015

combining RAPL and perf to do power calibration

A useful feature on modern x86 CPUs is the Running Average Power Limit (RAPL) that allows one to monitor System on Chip (SoC) power consumption.  Combine this data with the ability to accurately measure CPU cycles and instructions via perf and we can get some way to get a rough estimate energy consumed to perform a single operation on the CPU.

power-calibrate is a simple tool that  hacked up to perform some synthetic loading of the processor, gather the RAPL and CPU stats and using simple linear regression to compute some power related metrics.

In the example below, I run power-calibrate on an Intel  i5-3210M (2 Cores, 4 threads) with each test run taking 10 seconds (-r 10),  using the RAPL interface to measure power and gathering 11 samples on CPU threads 1..4:

power-calibrate -r 10 -R  -s 11
  CPU load  User   Sys  Idle  Run  Ctxt/s  IRQ/s  Ops/s Cycl/s Inst/s  Watts
    0% x 1   0.1   0.1  99.8  1.0   181.6   61.1   0.0    2.5K 380.2   2.485
    0% x 2   0.0   1.0  98.9  1.2   161.8   63.8   0.0    5.7K   0.8K  2.366
    0% x 3   0.1   1.3  98.5  1.1   204.2   75.2   0.0    7.6K   1.9K  2.518
    0% x 4   0.1   0.1  99.9  1.0   124.7   44.9   0.0   11.4K   2.7K  2.167
   10% x 1   2.4   0.2  97.4  1.5   203.8  104.9  21.3M 123.1M 297.8M  2.636
   10% x 2   5.1   0.0  94.9  1.3   185.0  137.1  42.0M 243.0M   0.6B  2.754
   10% x 3   7.5   0.2  92.3  1.2   275.3  190.3  58.1M 386.9M   0.8B  3.058
   10% x 4  10.0   0.1  89.9  1.9   213.5  206.1  64.5M 486.1M   0.9B  2.826
   20% x 1   5.0   0.1  94.9  1.0   288.8  170.0  69.6M 403.0M   1.0B  3.283
   20% x 2  10.0   0.1  89.9  1.6   310.2  248.7  96.4M   0.8B   1.3B  3.248
   20% x 3  14.6   0.4  85.0  1.7   640.8  450.4 238.9M   1.7B   3.3B  5.234
   20% x 4  20.0   0.2  79.8  2.1   633.4  514.6 270.5M   2.1B   3.8B  4.736
   30% x 1   7.5   0.2  92.3  1.4   444.3  278.7 149.9M   0.9B   2.1B  4.631
   30% x 2  14.8   1.2  84.0  1.2   541.5  418.1 200.4M   1.7B   2.8B  4.617
   30% x 3  22.6   1.5  75.9  2.2   960.9  694.3 365.8M   2.6B   5.1B  7.080
   30% x 4  30.0   0.2  69.8  2.4   959.2  774.8 421.1M   3.4B   5.9B  5.940
   40% x 1   9.7   0.3  90.0  1.7   551.6  356.8 201.6M   1.2B   2.8B  5.498
   40% x 2  19.9   0.3  79.8  1.4   668.0  539.4 288.0M   2.4B   4.0B  5.604
   40% x 3  29.8   0.5  69.7  1.8  1124.5  851.8 481.4M   3.5B   6.7B  7.918
   40% x 4  40.3   0.5  59.2  2.3  1186.4 1006.7   0.6B   4.6B   7.7B  6.982
   50% x 1  12.1   0.4  87.4  1.7   536.4  378.6 193.1M   1.1B   2.7B  4.793
   50% x 2  24.4   0.4  75.2  2.2   816.2  668.2 362.6M   3.0B   5.1B  6.493
   50% x 3  35.8   0.5  63.7  3.1  1300.2 1004.6   0.6B   4.2B   8.2B  8.800
   50% x 4  49.4   0.7  49.9  3.8  1455.2 1240.0   0.7B   5.7B   9.6B  8.130
   60% x 1  14.5   0.4  85.1  1.8   735.0  502.7 295.7M   1.7B   4.1B  6.927
   60% x 2  29.4   1.3  69.4  2.0   917.5  759.4 397.2M   3.3B   5.6B  6.791
   60% x 3  44.1   1.7  54.2  3.1  1615.4 1243.6   0.7B   5.1B   9.9B 10.056
   60% x 4  58.5   0.7  40.8  4.0  1728.1 1456.6   0.8B   6.8B  11.5B  9.226
   70% x 1  16.8   0.3  82.9  1.9   841.8  579.5 349.3M   2.0B   4.9B  7.856
   70% x 2  34.1   0.8  65.0  2.8   966.0  845.2 439.4M   3.7B   6.2B  6.800
   70% x 3  49.7   0.5  49.8  3.5  1834.5 1401.2   0.8B   5.9B  11.8B 11.113
   70% x 4  68.1   0.6  31.4  4.7  1771.3 1572.3   0.8B   7.0B  11.8B  8.809
   80% x 1  18.9   0.4  80.7  1.9   871.9  613.0 357.1M   2.1B   5.0B  7.276
   80% x 2  38.6   0.3  61.0  2.8  1268.6 1029.0   0.6B   4.8B   8.2B  9.253
   80% x 3  58.8   0.3  40.8  3.5  2061.7 1623.3   1.0B   6.8B  13.6B 11.967
   80% x 4  78.6   0.5  20.9  4.0  2356.3 1983.7   1.1B   9.0B  16.0B 12.047
   90% x 1  21.8   0.3  78.0  2.0  1054.5  737.9 459.3M   2.6B   6.4B  9.613
   90% x 2  44.2   1.2  54.7  2.7  1439.5 1174.7   0.7B   5.4B   9.2B 10.001
   90% x 3  66.2   1.4  32.4  3.9  2326.2 1822.3   1.1B   7.6B  15.0B 12.579
   90% x 4  88.5   0.2  11.4  4.8  2627.8 2219.1   1.3B  10.2B  17.8B 12.832
  100% x 1  25.1   0.0  74.8  2.0   135.8  314.0   0.5B   3.1B   7.5B 10.278
  100% x 2  50.0   0.0  50.0  3.0    91.9  560.4   0.7B   6.2B  10.4B 10.470
  100% x 3  75.1   0.1  24.8  4.0   120.2  824.1   1.2B   8.7B  16.8B 13.028
  100% x 4 100.0   0.0   0.0  5.0    76.8 1054.8   1.4B  11.6B  19.5B 13.156

For 4 CPUs (of a 4 CPU system):
  Power (Watts) = (% CPU load * 1.176217e-01) + 3.461561
  1% CPU load is about 117.62 mW
  Coefficient of determination R^2 = 0.809961 (good)

  Energy (Watt-seconds) = (bogo op * 8.465141e-09) + 3.201355
  1 bogo op is about 8.47 nWs
  Coefficient of determination R^2 = 0.911274 (strong)

  Energy (Watt-seconds) = (CPU cycle * 1.026249e-09) + 3.542463
  1 CPU cycle is about 1.03 nWs
  Coefficient of determination R^2 = 0.841894 (good)

  Energy (Watt-seconds) = (CPU instruction * 6.044204e-10) + 3.201433
  1 CPU instruction is about 0.60 nWs
  Coefficient of determination R^2 = 0.911272 (strong)

The results at the end are estimates based on the gathered samples. The samples are compared to the computed linear regression coefficients using the coefficient of determination (R^2);  a value of 1 is a perfect linear fit, less than 1 a poorer fit.

For more accurate results, increase the run time (-r option) and also increase the number of samples (-s option).

Power-calibrate is available in Ubuntu Wily 15.10.  It is just an academic toy for getting some power estimates and may be useful to compare compute vs power metrics across different x86 CPUs.  I've not been able to verify how accurate it really is, so I am interested to see how this works across a range of systems.

2 comments:

  1. Hi Colin,

    Many thanks for the tool, looks very nice!

    I'm studying your code in the hope of learning a bit more about how CPU utilization is tracked by the Linux kernel and how it is calculated by your tool. Here are a couple of questions, I'm for sure missing something about the concepts behind the calculations.

    It seems that CPU utilization (user, sys, etc.) is computed as: current - past value of CPU time, where the value is got from the /proc/stat fields. I can see from the man that /proc/stat reports

    "the amount of time, measured in units of USER_HZ (1/100ths of a second on most architectures, use sysconf(_SC_CLK_TCK) to obtain the right value), that the system spent in  various states".

    Questions:

    1) In your code, I'm not able to find where you normalize the CPU times from /proc/stat with the sysconf(_SC_CLK_TCK) call. It looks like you simply subtract current and past value, then you print it.

    What am I missing here?

    2) This is a more general question about the correctness of this approach of calculating CPU utilization, which is AFAIK the standard approach also adopted by sar, vmstat, etc.

    What about variable frequency CPUs? Is this CPU utilization metric still correct?

    I can't see how it could be right if we don't take into account the actual frequency of the CPU: the amount of work would be very different at say 1.3 GHZ wrt 2 GHZ. Is the kernel doing special things to keep track of accurate CPU time in this case?


    That's it! Thanks again for your time and your interesting tools!

    ReplyDelete
    Replies
    1. Good questions. The clock tick rate is inferred from the total number of ticks counted during the sample interval from the CPU user, nice, sys, idle and iowait ticks. Since this theoretically *should* add up to the total ticks during the sample period we don't have to rely on determining the ticks from sysconf() and the total ticks allow us to compute the % share over that (possibly variable) time period too. If I was to divide by the sysconf() tick rate I need to convert that into the exact number of ticks over the sample duration, which is more work and can have jitter of +/- 1 or so ticks of we quantize the sample at the wrong time.

      Delete