Thursday 4 February 2016

Intel Platform Quality of Service and Cache Allocation Technology

One issue when running parallel processes is contention for shared resources such as the Last Level Cache (aka LLC or L3 Cache). For example, a server may be running a set of Virtual Machines with processes that are memory and cache intensive, hence producing a large amount of cache activity. This can degrade the performance of the other VMs and is known as the "Noisy Neighbour" problem.

Fortunately the next generation Intel processors allow one to monitor and also fine tune cache allocation using Intel Cache Monitoring Technology (CMT) and Cache Allocation Technology (CAT).

Intel kindly loaned me a 12 thread development machine with CMT and CAT support to experiment with this technology using the Intel pqos tool. For my experiment, I installed Ubuntu Xenial Server on the machine. I then installed KVM and a VM instance of Ubuntu Xenial Server. I then loaded the instance using stress-ng running a memory bandwidth stressor:

 stress-ng --stream 1 -v --stream-l3-size 16M  
..which allocates 16MB in 4 buffers and performs various reads, computations and writes on these, hence creating a "noisy neighbour".

Using pqos,  one can monitor and see the cache/memory activity:
sudo apt-get install intel-cmt-cat
sudo modprobe msr  
sudo pqos -r  
TIME 2016-02-04 10:25:06
    CORE   IPC  MISSES    LLC[KB]  MBL[MB/s]  MBR[MB/s]
       0  0.59 168259k     9144.0    12195.0        0.0
       1  1.33    107k        0.0        3.3        0.0
       2  0.20      2k        0.0        0.0        0.0
       3  0.70    104k        0.0        2.0        0.0
       4  0.86     23k        0.0        0.7        0.0
       5  0.38     42k       24.0        1.5        0.0
       6  0.12      2k        0.0        0.0        0.0
       7  0.24     48k        0.0        3.0        0.0
       8  0.61     26k        0.0        1.6        0.0
       9  0.37     11k      144.0        0.9        0.0
      10  0.48      1k        0.0        0.0        0.0
      11  0.45      2k        0.0        0.0        0.0
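
To pick out the noisy core from a table like the one above without reading it by eye, the rows can be parsed with a few lines of Python. This is only a minimal sketch: it assumes the six-column text layout shown here (other pqos versions may print different columns), and the function name parse_pqos_monitor and the abbreviated SAMPLE text are made up for illustration.

```python
def parse_pqos_monitor(text):
    """Parse the per-core rows of a pqos monitoring table into dicts.

    Assumes the six-column layout shown in this post:
    CORE, IPC, MISSES, LLC[KB], MBL[MB/s], MBR[MB/s].
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have six columns and start with a core number;
        # the TIME line and the header line are skipped.
        if len(fields) != 6 or not fields[0].isdigit():
            continue
        core, ipc, misses, llc_kb, mbl, mbr = fields
        rows.append({
            "core": int(core),
            "ipc": float(ipc),
            "misses_k": int(misses.rstrip("k")),  # pqos prints misses in thousands
            "llc_kb": float(llc_kb),
            "mbl_mb_s": float(mbl),
            "mbr_mb_s": float(mbr),
        })
    return rows


# A few rows copied from the first pqos run in this post.
SAMPLE = """\
TIME 2016-02-04 10:25:06
    CORE   IPC  MISSES    LLC[KB]  MBL[MB/s]  MBR[MB/s]
       0  0.59 168259k     9144.0    12195.0        0.0
       1  1.33    107k        0.0        3.3        0.0
       5  0.38     42k       24.0        1.5        0.0
       9  0.37     11k      144.0        0.9        0.0
"""

rows = parse_pqos_monitor(SAMPLE)
noisiest = max(rows, key=lambda r: r["llc_kb"])
print(f"core {noisiest['core']} occupies {noisiest['llc_kb']:.0f} KB of LLC")
```

Sorting on llc_kb finds the core with the largest cache occupancy; sorting on mbl_mb_s instead would rank cores by local memory bandwidth.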
Now to run a stress-ng stream stressor on the host and see the performance while the noisy neighbour is also running:
stress-ng --stream 4 --stream-l3-size 2M --perf --metrics-brief -t 60
stress-ng: info:  [2195] dispatching hogs: 4 stream
stress-ng: info:  [2196] stress-ng-stream: stressor loosely based on a variant of the STREAM benchmark code
stress-ng: info:  [2196] stress-ng-stream: do NOT submit any of these results to the STREAM benchmark results
stress-ng: info:  [2196] stress-ng-stream: Using L3 CPU cache size of 2048K
stress-ng: info:  [2196] stress-ng-stream: memory rate: 1842.22 MB/sec, 736.89 Mflop/sec (instance 0)
stress-ng: info:  [2198] stress-ng-stream: memory rate: 1847.88 MB/sec, 739.15 Mflop/sec (instance 2)
stress-ng: info:  [2199] stress-ng-stream: memory rate: 1833.89 MB/sec, 733.56 Mflop/sec (instance 3)
stress-ng: info:  [2197] stress-ng-stream: memory rate: 1847.16 MB/sec, 738.86 Mflop/sec (instance 1)
stress-ng: info:  [2195] successful run completed in 60.01s (1 min, 0.01 secs)
stress-ng: info:  [2195] stressor      bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [2195]                          (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [2195] stream           22101     60.01    239.93      0.04       368.31        92.10
stress-ng: info:  [2195] stream:
stress-ng: info:  [2195]            547,520,600,744 CPU Cycles                     9.12 B/sec
stress-ng: info:  [2195]             69,959,954,760 Instructions                   1.17 B/sec (0.128 instr. per cycle)
stress-ng: info:  [2195]             11,066,905,620 Cache References               0.18 B/sec
stress-ng: info:  [2195]             11,065,068,064 Cache Misses                   0.18 B/sec (99.98%)
stress-ng: info:  [2195]              8,759,154,716 Branch Instructions            0.15 B/sec
stress-ng: info:  [2195]                  2,205,904 Branch Misses                 36.76 K/sec ( 0.03%)
stress-ng: info:  [2195]             23,856,890,232 Bus Cycles                     0.40 B/sec
stress-ng: info:  [2195]            477,143,689,444 Total Cycles                   7.95 B/sec
stress-ng: info:  [2195]                         36 Page Faults Minor              0.60 sec
stress-ng: info:  [2195]                          0 Page Faults Major              0.00 sec
stress-ng: info:  [2195]                         96 Context Switches               1.60 sec
stress-ng: info:  [2195]                          0 CPU Migrations                 0.00 sec
stress-ng: info:  [2195]                          0 Alignment Faults               0.00 sec
.. so about 1842 MB/sec memory rate and 736 Mflop/sec per CPU across 4 CPUs.  And pqos shows the cache/memory activity as:
sudo pqos -r
TIME 2016-02-04 10:35:27
    CORE   IPC  MISSES    LLC[KB]  MBL[MB/s]  MBR[MB/s]
       0  0.14  43060k     1104.0     2487.9        0.0
       1  0.12 3981523k     2616.0     2893.8        0.0
       2  0.26    320k       48.0       18.0        0.0
       3  0.12 3980489k     1800.0     2572.2        0.0
       4  0.12 3979094k     1728.0     2870.3        0.0
       5  0.12 3970996k     2112.0     2734.5        0.0
       6  0.04     20k        0.0        0.3        0.0
       7  0.04     29k        0.0        1.9        0.0
       8  0.09    143k        0.0        5.9        0.0
       9  0.15      0k        0.0        0.0        0.0
      10  0.07      2k        0.0        0.0        0.0
      11  0.13      0k        0.0        0.0        0.0
Using pqos again, we can find out how much LLC cache the processor has:
sudo pqos -v
NOTE:  Mixed use of MSR and kernel interfaces to manage
       CAT or CMT & MBM may lead to unexpected behavior.
INFO: Monitoring capability detected
INFO: CPUID.0x7.0: CAT supported
INFO: CAT details: CDP support=0, CDP on=0, #COS=16, #ways=12, ways contention bit-mask 0xc00
INFO: LLC cache size 9437184 bytes, 12 ways
INFO: LLC cache way size 786432 bytes
INFO: L3CA capability detected
INFO: Detected PID API (perf) support for LLC Occupancy
INFO: Detected PID API (perf) support for Instructions/Cycle
INFO: Detected PID API (perf) support for LLC Misses
ERROR: IPC and/or LLC miss performance counters already in use!
Use -r option to start monitoring anyway.
Monitoring start error on core(s) 5, status 6
So this CPU has 12 cache "ways", each of 786432 bytes (768K). One or more "Class of Service" (COS) types can be defined, each of which can use one or more of these ways. A COS is described by a bitmap in which each bit represents one way. For example, to use all 12 ways on my example machine, the bitmap is 0xfff (111111111111). A way can be exclusively mapped to a COS, shared between several, or not used at all. Note that the ways in the bitmap must be contiguously allocated, so a mask such as 0xf3f (111100111111) is invalid and cannot be used.
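
The way arithmetic and the contiguity rule can be checked with a short Python sketch. The cache size and way count are the ones pqos reported above; the helper names (way_size, make_mask, is_contiguous) are hypothetical and not part of pqos.

```python
LLC_BYTES = 9437184  # "LLC cache size" reported by pqos -v above
NUM_WAYS = 12        # "#ways" reported by pqos -v above


def way_size(llc_bytes=LLC_BYTES, num_ways=NUM_WAYS):
    """Size of a single cache way in bytes."""
    return llc_bytes // num_ways


def make_mask(first_way, count, num_ways=NUM_WAYS):
    """Bitmask of `count` contiguous ways starting at bit `first_way`."""
    if count < 1 or first_way < 0 or first_way + count > num_ways:
        raise ValueError("requested ways fall outside the cache")
    return ((1 << count) - 1) << first_way


def is_contiguous(mask):
    """True if the set bits of `mask` form one unbroken run."""
    if mask <= 0:
        return False
    lowest = mask & -mask      # isolate the lowest set bit
    shifted = mask // lowest   # drop the trailing zero bits
    # A single run of ones, plus one, is a power of two with no common bits.
    return (shifted & (shifted + 1)) == 0


print(way_size())               # bytes per way
print(hex(make_mask(0, 12)))    # all 12 ways
print(hex(make_mask(0, 1)))     # a single way
print(hex(make_mask(1, 11)))    # the other 11 ways
print(is_contiguous(0xf3f))     # the invalid example from the text
```

Running a candidate mask through is_contiguous before passing it to pqos is a cheap way to catch a mask with a hole in it.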

In my experiment, I want to create 2 COS types. The first COS will have just 1 cache way assigned to it; CPU 0 will be bound to this COS and the VM instance will be pinned to CPU 0. The second COS will have the other 11 cache ways assigned to it, and all the other CPUs can use this COS.

So, create COS #1 with just 1 way of cache, and bind CPU 0 to this COS, and pin the VM to CPU 0:
sudo pqos -e llc:1=0x0001
sudo pqos -a llc:1=0
sudo taskset  -apc 0 $(pidof qemu-system-x86_64)
And create COS #2, with 11 ways of cache and bind CPUs 1-11 to this COS:
sudo pqos -e "llc:2=0x0ffe"
sudo pqos -a "llc:2=1-11"
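
When trying out several partitioning schemes it can be handy to generate these command lines from a small table and to check that the masks do not overlap. The following Python sketch does only that; it prints the commands rather than running them. The names PLAN, pqos_commands and overlapping_pairs are invented for this example, and the core lists are passed through as plain strings in the same form used above ("0", "1-11").

```python
from itertools import combinations

# (COS id, way bitmask, cores) for the two classes used in this post.
PLAN = [
    (1, 0x001, "0"),
    (2, 0xffe, "1-11"),
]


def pqos_commands(cos_id, mask, cores):
    """Return the pqos command lines that define a COS and bind cores to it."""
    return [
        f"sudo pqos -e llc:{cos_id}={mask:#06x}",  # define the COS way mask
        f"sudo pqos -a llc:{cos_id}={cores}",      # associate cores with the COS
    ]


def overlapping_pairs(plan):
    """List the pairs of COS ids whose masks share at least one way."""
    return [
        (a_id, b_id)
        for (a_id, a_mask, _), (b_id, b_mask, _) in combinations(plan, 2)
        if a_mask & b_mask
    ]


for cos_id, mask, cores in PLAN:
    for command in pqos_commands(cos_id, mask, cores):
        print(command)

print("overlapping COS pairs:", overlapping_pairs(PLAN))
```

Sharing ways between classes is allowed, so an overlap is not an error in itself; the check simply confirms that the noisy VM's way is not also handed to the other cores.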
And let's see the new configuration:
sudo pqos  -s
NOTE:  Mixed use of MSR and kernel interfaces to manage
       CAT or CMT & MBM may lead to unexpected behavior.
L3CA COS definitions for Socket 0:
    L3CA COS0 => MASK 0xfff
    L3CA COS1 => MASK 0x1
    L3CA COS2 => MASK 0xffe
    L3CA COS3 => MASK 0xfff
    L3CA COS4 => MASK 0xfff
    L3CA COS5 => MASK 0xfff
    L3CA COS6 => MASK 0xfff
    L3CA COS7 => MASK 0xfff
    L3CA COS8 => MASK 0xfff
    L3CA COS9 => MASK 0xfff
    L3CA COS10 => MASK 0xfff
    L3CA COS11 => MASK 0xfff
    L3CA COS12 => MASK 0xfff
    L3CA COS13 => MASK 0xfff
    L3CA COS14 => MASK 0xfff
    L3CA COS15 => MASK 0xfff
Core information for socket 0:
    Core 0 => COS1, RMID0
    Core 1 => COS2, RMID0
    Core 2 => COS2, RMID0
    Core 3 => COS2, RMID0
    Core 4 => COS2, RMID0
    Core 5 => COS2, RMID0
    Core 6 => COS2, RMID0
    Core 7 => COS2, RMID0
    Core 8 => COS2, RMID0
    Core 9 => COS2, RMID0
    Core 10 => COS2, RMID0
    Core 11 => COS2, RMID0
..showing Core 0 bound to COS1 and Cores 1-11 bound to COS2, with COS1 having 1 cache way and COS2 the remaining 11 cache ways.
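
The same listing can be turned into a per-core cache budget. The Python sketch below assumes the single-socket text layout printed above and the 786432 byte way size reported earlier by pqos -v; parse_pqos_show, allowed_kb and the shortened SAMPLE text are hypothetical names used only for this illustration.

```python
import re

WAY_BYTES = 786432  # "LLC cache way size" reported by pqos -v earlier


def parse_pqos_show(text):
    """Return ({cos: mask}, {core: cos}) parsed from `pqos -s` style output."""
    masks, cores = {}, {}
    for line in text.splitlines():
        m = re.search(r"L3CA COS(\d+) => MASK (0x[0-9a-fA-F]+)", line)
        if m:
            masks[int(m.group(1))] = int(m.group(2), 16)
            continue
        m = re.search(r"Core (\d+) => COS(\d+)", line)
        if m:
            cores[int(m.group(1))] = int(m.group(2))
    return masks, cores


def allowed_kb(core, masks, cores, way_bytes=WAY_BYTES):
    """KB of LLC that `core` may allocate into, from its COS way mask."""
    ways = bin(masks[cores[core]]).count("1")
    return ways * way_bytes // 1024


# A shortened copy of the listing above.
SAMPLE = """\
L3CA COS definitions for Socket 0:
    L3CA COS0 => MASK 0xfff
    L3CA COS1 => MASK 0x1
    L3CA COS2 => MASK 0xffe
Core information for socket 0:
    Core 0 => COS1, RMID0
    Core 1 => COS2, RMID0
    Core 11 => COS2, RMID0
"""

masks, cores = parse_pqos_show(SAMPLE)
for core in sorted(cores):
    print(f"core {core}: COS{cores[core]}, {allowed_kb(core, masks, cores)} KB")
```

Cores that share a COS share its ways, so the figure for cores 1-11 is a common pool rather than a per-core reservation.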

Now re-run the stream stressor and see if the VM has less impact on the L3 cache:
stress-ng --stream 4 --stream-l3-size 1M --perf --metrics-brief -t 60
stress-ng: info:  [2232] dispatching hogs: 4 stream
stress-ng: info:  [2233] stress-ng-stream: stressor loosely based on a variant of the STREAM benchmark code
stress-ng: info:  [2233] stress-ng-stream: do NOT submit any of these results to the STREAM benchmark results
stress-ng: info:  [2233] stress-ng-stream: Using L3 CPU cache size of 1024K
stress-ng: info:  [2235] stress-ng-stream: memory rate: 2616.90 MB/sec, 1046.76 Mflop/sec (instance 2)
stress-ng: info:  [2233] stress-ng-stream: memory rate: 2562.97 MB/sec, 1025.19 Mflop/sec (instance 0)
stress-ng: info:  [2234] stress-ng-stream: memory rate: 2541.10 MB/sec, 1016.44 Mflop/sec (instance 1)
stress-ng: info:  [2236] stress-ng-stream: memory rate: 2652.02 MB/sec, 1060.81 Mflop/sec (instance 3)
stress-ng: info:  [2232] successful run completed in 60.00s (1 min, 0.00 secs)
stress-ng: info:  [2232] stressor      bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [2232]                          (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [2232] stream           62223     60.00    239.97      0.00      1037.01       259.29
stress-ng: info:  [2232] stream:
stress-ng: info:  [2232]            547,364,185,528 CPU Cycles                     9.12 B/sec
stress-ng: info:  [2232]             97,037,047,444 Instructions                   1.62 B/sec (0.177 instr. per cycle)
stress-ng: info:  [2232]             14,396,274,512 Cache References               0.24 B/sec
stress-ng: info:  [2232]             14,390,808,440 Cache Misses                   0.24 B/sec (99.96%)
stress-ng: info:  [2232]             12,144,372,800 Branch Instructions            0.20 B/sec
stress-ng: info:  [2232]                  1,732,264 Branch Misses                 28.87 K/sec ( 0.01%)
stress-ng: info:  [2232]             23,856,388,872 Bus Cycles                     0.40 B/sec
stress-ng: info:  [2232]            477,136,188,248 Total Cycles                   7.95 B/sec
stress-ng: info:  [2232]                         44 Page Faults Minor              0.73 sec
stress-ng: info:  [2232]                          0 Page Faults Major              0.00 sec
stress-ng: info:  [2232]                         72 Context Switches               1.20 sec
stress-ng: info:  [2232]                          0 CPU Migrations                 0.00 sec
stress-ng: info:  [2232]                          0 Alignment Faults               0.00 sec
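With the noisy neighbour VM constrained to just 1 way of L3 cache, the stream stressor on the host can now achieve about 2593 MB/sec and about 1037 Mflop/sec per CPU across 4 CPUs. Note that this run used --stream-l3-size 1M whereas the earlier run used 2M, so the two sets of figures are not a strictly like-for-like comparison.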
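
The per-CPU averages can be recomputed directly from the stress-ng "memory rate" lines of the two runs. This Python sketch assumes the log line format shown in this post; mean_rates, BEFORE and AFTER are names made up for the example, and because the two runs used different --stream-l3-size values (2M and 1M) the resulting ratio is only indicative.

```python
import re
from statistics import mean

RATE_RE = re.compile(r"memory rate: ([\d.]+) MB/sec, ([\d.]+) Mflop/sec")


def mean_rates(log_text):
    """Average (MB/sec, Mflop/sec) over all stream instances in a stress-ng log."""
    pairs = [(float(mb), float(mflop)) for mb, mflop in RATE_RE.findall(log_text)]
    return mean(p[0] for p in pairs), mean(p[1] for p in pairs)


# Rate lines copied from the run without CAT partitioning.
BEFORE = """\
stress-ng-stream: memory rate: 1842.22 MB/sec, 736.89 Mflop/sec (instance 0)
stress-ng-stream: memory rate: 1847.88 MB/sec, 739.15 Mflop/sec (instance 2)
stress-ng-stream: memory rate: 1833.89 MB/sec, 733.56 Mflop/sec (instance 3)
stress-ng-stream: memory rate: 1847.16 MB/sec, 738.86 Mflop/sec (instance 1)
"""

# Rate lines copied from the run with the VM confined to one cache way.
AFTER = """\
stress-ng-stream: memory rate: 2616.90 MB/sec, 1046.76 Mflop/sec (instance 2)
stress-ng-stream: memory rate: 2562.97 MB/sec, 1025.19 Mflop/sec (instance 0)
stress-ng-stream: memory rate: 2541.10 MB/sec, 1016.44 Mflop/sec (instance 1)
stress-ng-stream: memory rate: 2652.02 MB/sec, 1060.81 Mflop/sec (instance 3)
"""

before_mb, before_mflop = mean_rates(BEFORE)
after_mb, after_mflop = mean_rates(AFTER)
print(f"before: {before_mb:.2f} MB/sec, {before_mflop:.2f} Mflop/sec")
print(f"after:  {after_mb:.2f} MB/sec, {after_mflop:.2f} Mflop/sec")
print(f"memory rate ratio: {after_mb / before_mb:.2f}x")
```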

This is a relatively simple example. With the ability to monitor cache and memory bandwidth activity, one can carefully tune a system to make best use of the limited L3 cache resource and maximise throughput where needed.

There are many applications where Intel CMT/CAT can be useful, for example fine tuning containers or VM instances, or pinning user space networking buffers to cache ways in DPDK for improved throughput.

4 comments:

  1. I tried your method on my system. When I used CAT to assign 1 way of cache to the VM and then assigned the rest of the ways to the other CPUs, performance actually got worse. How could this happen?
    I also noticed that your stream-l3-size values differ between the first run and the second run. Is this what makes performance better on the second run?
    Sorry for my English!

  2. Very informative. Thanks for this wonderful article.

  3. I followed your steps to test this, but performance was not better after using CAT to allocate cache, and sometimes it was even worse, which is pretty weird.
    Anyway, still a very good article.
