Tuesday, 29 December 2015

pagemon: an ncurses based tool to monitor process memory

While developing stress-ng I wanted to be able to see if the various memory stressors were touching memory in the way I had anticipated.  While digging around in the Linux documentation I discovered the very useful soft-dirty bit on Page Table Entries (PTEs) that gets set when a page is written to.  The mechanism to check for the soft-dirty bit is described in Documentation/vm/soft-dirty.txt; one needs to:
  1. Clear the soft-dirty bits on the PTEs on a chosen process by writing "4" to /proc/$PID/clear_refs
  2. Wait a while for some page activity to occur
  3. Read the soft-dirty bits on the PTEs to see which pages got written to.
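Putting these three steps together, below is a minimal sketch in C; the PID and the page-aligned address to check are taken from the command line purely for illustration, and error handling is kept to a bare minimum:

 #include <stdio.h>
 #include <stdint.h>
 #include <stdlib.h>
 #include <fcntl.h>
 #include <unistd.h>
 #include <sys/types.h>

 int main(int argc, char **argv)
 {
         char path[64];
         uint64_t entry;
         long pagesize = sysconf(_SC_PAGESIZE);
         pid_t pid;
         uintptr_t addr;
         int fd;

         if (argc < 3)
                 exit(1);
         pid = atoi(argv[1]);
         addr = (uintptr_t)strtoull(argv[2], NULL, 0);

         /* 1. clear the soft-dirty bits on the chosen process */
         snprintf(path, sizeof(path), "/proc/%d/clear_refs", pid);
         fd = open(path, O_WRONLY);
         if (fd < 0 || write(fd, "4", 1) != 1)
                 exit(1);
         close(fd);

         /* 2. wait a while for some page activity to occur */
         sleep(5);

         /* 3. read the 64 bit pagemap entry for the page;
               bit 55 is the soft-dirty bit */
         snprintf(path, sizeof(path), "/proc/%d/pagemap", pid);
         fd = open(path, O_RDONLY);
         if (fd < 0)
                 exit(1);
         if (pread(fd, &entry, sizeof(entry),
             (addr / pagesize) * sizeof(entry)) != sizeof(entry))
                 exit(1);
         close(fd);

         printf("page at 0x%lx %s written to\n", (unsigned long)addr,
                 (entry & (1ULL << 55)) ? "was" : "was not");
         return 0;
 }
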
Not too tricky, so how about using this neat feature? While on a rather long and dull flight over the Atlantic back in August I hacked up a very crude ncurses based tool to continually check the PTEs of a given process and display the soft-dirty activity in real time.  During this Christmas break I picked this code up and re-worked it into a more polished tool.  One can scroll up/down the memory maps and also select a page and view the contents changing in real time.  The tool identifies the type of memory mapping a page belongs to, so one can easily scan through memory looking at pages belonging to data, code, heap, stack, anonymous mappings or even swapped out pages.

Running it on X, compiz, firefox or thunderbird is quite instructive as one can see a lot of page activity on the large heap allocations.  The ability to see pages getting swapped out when memory pressure is high is also rather useful.

Page view of Xorg
Memory view of stack
The code is still early development quality (so expect some buglets!) and I need to work on optimising it in a lot of places, but for now, it works well enough to be a fairly interesting tool. I've currently got a package built for Ubuntu Xenial in ppa:colin-king/pagemon and the source can be cloned from http://kernel.ubuntu.com/git/cking/pagemon.git/

So, to install on Xenial, currently one needs to do:

sudo add-apt-repository ppa:colin-king/pagemon
sudo apt-get update
sudo apt-get install pagemon

I may be adding a few more features in the next few weeks, and then getting the tool into Ubuntu and Debian.

As an example, to run it on Xorg, it is invoked as:

sudo pagemon -p $(pidof Xorg)

Unfortunately sudo is required to allow one to dig so intrusively into a running process. For more details on how to use pagemon consult the pagemon man page, or press "h" or "?" while running pagemon.

Thursday, 17 December 2015

Incorporating and accessing binary data into a C program

The other day I needed to incorporate a large blob of binary data in a C program. One simple way is to use xxd, for example, on the binary data in file "blob", one can do:

xxd -i blob 

 unsigned char blob[] = {  
  0xc8, 0xe5, 0x54, 0xee, 0x8f, 0xd7, 0x9f, 0x18, 0x9a, 0x63, 0x87, 0xbb,  
  0x12, 0xe4, 0x04, 0x0f, 0xa7, 0xb6, 0x16, 0xd0, 0x70, 0x06, 0xbc, 0x57,  
  0x4b, 0xaf, 0xae, 0xa2, 0xf2, 0x6b, 0xf4, 0xc6, 0xb1, 0xaa, 0x93, 0xf2,  
  0x12, 0x39, 0x19, 0xee, 0x7c, 0x59, 0x03, 0x81, 0xae, 0xd3, 0x28, 0x89,  
  0x05, 0x7c, 0x4e, 0x8b, 0xe5, 0x98, 0x35, 0xe8, 0xab, 0x2c, 0x7b, 0xd7,  
  0xf9, 0x2e, 0xba, 0x01, 0xd4, 0xd9, 0x2e, 0x86, 0xb8, 0xef, 0x41, 0xf8,  
  0x8e, 0x10, 0x36, 0x46, 0x82, 0xc4, 0x38, 0x17, 0x2e, 0x1c, 0xc9, 0x1f,  
  0x3d, 0x1c, 0x51, 0x0b, 0xc9, 0x5f, 0xa7, 0xa4, 0xdc, 0x95, 0x35, 0xaa,  
  0xdb, 0x51, 0xf6, 0x75, 0x52, 0xc3, 0x4e, 0x92, 0x27, 0x01, 0x69, 0x4c,  
  0xc1, 0xf0, 0x70, 0x32, 0xf2, 0xb1, 0x87, 0x69, 0xb4, 0xf3, 0x7f, 0x3b,  
  0x53, 0xfd, 0xc9, 0xd7, 0x8b, 0xc3, 0x08, 0x8f  
 };  
 unsigned int blob_len = 128;  

..and redirecting the output from xxd into a C source file and compiling it is simple and easy to do.

However, for large binary blobs, the C source can be huge, so an alternative way is to use the linker ld as follows:

ld -s -r -b binary -o blob.o blob  

...and this generates the blob.o object code. To reference the data in a program one needs to determine the symbol names of the start, end and perhaps the length too. One can use objdump to find this as follows:

 objdump -t blob.o  
 blob.o:   file format elf64-x86-64  
 SYMBOL TABLE:  
 0000000000000000 l  d .data        0000000000000000 .data  
 0000000000000080 g    .data        0000000000000000 _binary_blob_end  
 0000000000000000 g    .data        0000000000000000 _binary_blob_start  
 0000000000000080 g    *ABS*        0000000000000000 _binary_blob_size  

To access the data in C, use something like the following:

 cat test.c  
 
 #include <stdio.h>

 int main(void)
 {
         extern char _binary_blob_start[], _binary_blob_end[];
         char *start = _binary_blob_start,
              *end = _binary_blob_end;

         printf("Data: %p..%p (%zu bytes)\n",
                 (void *)start, (void *)end, (size_t)(end - start));
         return 0;
 }

...and link and run as follows:

 gcc test.c blob.o -o test  
 ./test   
 Data: 0x601038..0x6010b8 (128 bytes)  
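
The absolute _binary_blob_size symbol seen in the objdump output can be used as well; a small sketch (note that the symbol's "address" is actually the blob length, so it must not be dereferenced):

 #include <stdio.h>

 int main(void)
 {
         /* the *ABS* symbol's "address" holds the size of the blob */
         extern char _binary_blob_size[];

         printf("blob is %zu bytes\n", (size_t)_binary_blob_size);
         return 0;
 }

..and again this is linked with blob.o as before.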

So for large blobs, I personally favour using ld to do the hard work for me since I don't need another tool (such as xxd) and it removes the need to convert a blob into C and then compile it.

Firmware Test Suite, 15.12.00

The Canonical Hardware Enablement Team and myself are continuing the work to enhance the Firmware Test Suite (fwts) on a regular monthly cadence.  The latest changes in FWTS 15.12.00 include the following new features and changes:

  • ACPI: ASPT (System Performance Tuning Table)
  • Update ACPICA to version 20151124 
  • Boot path sync with UEFI specification 2.5 adding:
    • SD device path 
    • Bluetooth device path
    • Wireless device path
    • Ramdisk device path
  • Mixed tests and test category options, e.g. fwts --uefitests klog cpufreq will run all the UEFI tests as well as klog and cpufreq tests
  • A new --log-level option that allows one to log tests that fail at a specified level or higher, e.g. fwts --log-level high will just show high and critical test failures.
  • The acpidump table dump pseudo-test is now aligned with the ACPICA table dumping (disassembly) engine.
  • Various bug fixes.
It is also worth mentioning that the UEFI Board of Directors recommends FWTS as the ACPI v5.1 Self-Certification Test (SCT). This is exciting news and we welcome this decision for FWTS to be recognised in this way.

We are also very grateful for the community contributions to FWTS; this buy-in from the community is appreciated and makes FWTS a better tool to support different architectures and systems.

As ever, with new releases, please consult the release notes.

Friday, 11 December 2015

Another seasonal obfuscated C program

During an idle moment while on vacation I was reading the paper "Reliable Two-Dimensional Graphing Methods for Mathematical Formulae with Two Free Variables" by Jeff Tupper and I stumbled upon the rather amusing inequality at the end of section 12.  In tribute to this most excellent graphing formula, I felt inspired to use the same concept in my Christmas 2015 obfuscated C offering.

tupper.c

I cheated a little by also using a Makefile, but I hope this also adds to the magic of the resulting code.  To make the program more fun I thought I'd use a lot of confusing logic operator names in the code and mix in some incorrect Roman numeral constants too.  I could have obfuscated the code more and made it smaller, but life is too short. I will leave that as an exercise to the reader.

The source is available in my Christmas Obfuscated C git repository if you want to try it out:

 git clone https://github.com/ColinIanKing/christmas-obfuscated-C.git  
 cd christmas-obfuscated-C/2015  
 make  
 ./tupper | less  

Enjoy!

Sunday, 22 November 2015

Using PR_SET_PDEATHSIG to reap child processes

The prctl() system call provides a rather useful PR_SET_PDEATHSIG option to allow a signal to be sent to child processes when the parent unexpectedly dies. A quick and dirty mechanism is to trigger the SIGHUP or SIGKILL signal to kill the child immediately, or, perhaps more elegantly, to invoke a resource tidy-up before exiting.

In the trivial example below, we use the SIGUSR1 signal to inform the child that the parent has died. I know printf() should not be used in a signal handler, but it keeps the example simple.

 #include <stdio.h>
 #include <stdlib.h>
 #include <unistd.h>
 #include <signal.h>
 #include <sys/prctl.h>
 #include <err.h>

 void sigusr1_handler(int dummy)
 {
     (void)dummy;

     printf("Parent died, child now exiting\n");
     exit(0);
 }

 int main(void)
 {
     pid_t pid;

     pid = fork();
     if (pid < 0)
         err(1, "fork failed");
     if (pid == 0) {
         /* Child */
         if (signal(SIGUSR1, sigusr1_handler) == SIG_ERR)
             err(1, "signal failed");
         if (prctl(PR_SET_PDEATHSIG, SIGUSR1) < 0)
             err(1, "prctl failed");

         for (;;)
             sleep(60);
     }
     if (pid > 0) {
         /* Parent */
         sleep(5);
         printf("Parent exiting...\n");
     }

     return 0;
 }

..the child process sits in an infinite loop, performing 60 second sleeps.  The parent sleeps for 5 seconds and then exits.  The child is then sent a SIGUSR1 signal and the handler exits.  In practice the signal handler would be used to trigger a more sophisticated clean up of resources if required.

Anyhow, this is a useful Linux feature that seems to be overlooked.

Thursday, 19 November 2015

Intel Platform Shared Resource Monitoring and Cache Allocation Technology

The Intel Platform Shared Resource Monitoring features were introduced in the Intel Xeon E5v3 processor family. These new features provide a mechanism to measure platform shared resources, such as L3 cache occupancy via Cache Monitoring Technology (CMT) and memory bandwidth utilisation via Memory Bandwidth Monitoring (MBM).

Intel have written a Platform Quality of Service Tool (pqos) to use these monitoring features and I've packaged this up for Ubuntu 16.04 Xenial Xerus.

To install, use:

sudo apt-get install intel-cmt-cat

The tool requires access to the Intel MSRs, so one also has to load the msr module if it is not already loaded:

sudo modprobe msr

To see the Last Level Cache (llc) utilisation on a system, listing the most used first, use:

sudo pqos -T

pqos running on a 48 thread Xeon based server

The -p option allows one to specify specific monitoring events for specific process IDs. Event types can be Last Level Cache (llc), Local Memory Bandwidth (mbl) and Remote Memory Bandwidth (mbr).  For example, on a Xeon E5-2680 I have just the Last Level Cache monitoring capability, so let's view the llc for stress-ng while running some VM stressor tests:

sudo pqos -T -p llc:$(pidof stress-ng | tr ' ' ',')

pqos showing equally shared cache between two stressor processes
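
Monitoring can also be performed per CPU core rather than per process; a hedged example (the -m option and its event:core-list syntax are as described in the pqos manual, so do check the installed version supports it) that monitors the llc on cores 0 to 3:

sudo pqos -m llc:0-3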

Cache and Memory Bandwidth monitoring is especially useful to examine the impact of memory/cache hogging processes (such as VM instances).  pqos allows one to identify these processes simply and effectively.

Future Intel Xeon processors will provide capabilities to configure cache resources to specific classes of service using Intel Cache Allocation Technology (CAT).  The pqos tool allows one to modify the CAT settings, however, not having access to a CPU with these capabilities I was unable to experiment with this feature.  I refer you to the pqos manual for more details on this useful feature.  The beauty of CAT is that it allows one to tweak and fine tune the cache allocation for specific demanding use cases.  Given that the cache is a shared resource that can be impacted by badly behaving processes, the ability to tune the cache behaviour is potentially a big performance win.

For more details of these features, see the Intel 64 and IA-32 Architectures Software Developer's Manual, section 17.15 "Platform Shared Resource Monitoring: Cache Monitoring Technology" and 17.16 "Platform Shared Resource Control: Cache Allocation Technology".

Wednesday, 11 November 2015

Firmware Test Suite in active development

Another month passes and another release of the Firmware Test Suite is being prepared.  The tool has been growing in functionality (and size!) over time, so I thought I would look at some statistics to see any trends.

There has been a steady growth in the number of authors sending patches to the Firmware Test Suite.  Community contributions to a project are a sign that we have buy-in from different parties, so I'm pleased to see contributions from Intel, Linaro and Red Hat.  Patches are always welcome, send them to fwts-devel@ubuntu.com for review and inclusion into the project.

The number of commits is one metric to see if the project is growing healthily. We're adding about 35 patches a month, about 3/4 of which add functionality; the rest are fixes and general code maintenance.

One more meaningless but interesting metric is code size. I used sloccount to count the lines of C in the project.  We're seeing ~2200 lines of code being added per month, mainly through added test functionality.
Kudos to the Canonical Hardware Enablement firmware folk for wrangling the patches and preparing each FWTS release.

Saturday, 17 October 2015

combining RAPL and perf to do power calibration

A useful feature on modern x86 CPUs is the Running Average Power Limit (RAPL) interface that allows one to monitor System on Chip (SoC) power consumption.  Combine this data with the ability to accurately measure CPU cycles and instructions via perf and we can go some way to getting a rough estimate of the energy consumed to perform a single operation on the CPU.
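
As an aside, the RAPL data can also be read without touching the MSRs directly via the powercap sysfs interface.  Below is a minimal sketch of deriving the average package power from two energy readings; the intel-rapl:0 path is an assumption (it is the package domain on my machine) and energy counter wrap-around is ignored:

 #include <stdio.h>
 #include <unistd.h>

 static long long read_energy_uj(void)
 {
         long long uj = -1;
         FILE *fp = fopen("/sys/class/powercap/intel-rapl:0/energy_uj", "r");

         if (fp) {
                 if (fscanf(fp, "%lld", &uj) != 1)
                         uj = -1;
                 fclose(fp);
         }
         return uj;
 }

 int main(void)
 {
         long long e1 = read_energy_uj(), e2;

         sleep(10);
         e2 = read_energy_uj();
         /* microjoules consumed over 10 seconds -> Watts */
         printf("average package power: %.3f W\n",
                 (double)(e2 - e1) / (10.0 * 1000000.0));
         return 0;
 }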

power-calibrate is a simple tool that I hacked up to perform some synthetic loading of the processor, gather the RAPL and CPU stats, and use simple linear regression to compute some power related metrics.

In the example below, I run power-calibrate on an Intel  i5-3210M (2 Cores, 4 threads) with each test run taking 10 seconds (-r 10),  using the RAPL interface to measure power and gathering 11 samples on CPU threads 1..4:

power-calibrate -r 10 -R  -s 11
  CPU load  User   Sys  Idle  Run  Ctxt/s  IRQ/s  Ops/s Cycl/s Inst/s  Watts
    0% x 1   0.1   0.1  99.8  1.0   181.6   61.1   0.0    2.5K 380.2   2.485
    0% x 2   0.0   1.0  98.9  1.2   161.8   63.8   0.0    5.7K   0.8K  2.366
    0% x 3   0.1   1.3  98.5  1.1   204.2   75.2   0.0    7.6K   1.9K  2.518
    0% x 4   0.1   0.1  99.9  1.0   124.7   44.9   0.0   11.4K   2.7K  2.167
   10% x 1   2.4   0.2  97.4  1.5   203.8  104.9  21.3M 123.1M 297.8M  2.636
   10% x 2   5.1   0.0  94.9  1.3   185.0  137.1  42.0M 243.0M   0.6B  2.754
   10% x 3   7.5   0.2  92.3  1.2   275.3  190.3  58.1M 386.9M   0.8B  3.058
   10% x 4  10.0   0.1  89.9  1.9   213.5  206.1  64.5M 486.1M   0.9B  2.826
   20% x 1   5.0   0.1  94.9  1.0   288.8  170.0  69.6M 403.0M   1.0B  3.283
   20% x 2  10.0   0.1  89.9  1.6   310.2  248.7  96.4M   0.8B   1.3B  3.248
   20% x 3  14.6   0.4  85.0  1.7   640.8  450.4 238.9M   1.7B   3.3B  5.234
   20% x 4  20.0   0.2  79.8  2.1   633.4  514.6 270.5M   2.1B   3.8B  4.736
   30% x 1   7.5   0.2  92.3  1.4   444.3  278.7 149.9M   0.9B   2.1B  4.631
   30% x 2  14.8   1.2  84.0  1.2   541.5  418.1 200.4M   1.7B   2.8B  4.617
   30% x 3  22.6   1.5  75.9  2.2   960.9  694.3 365.8M   2.6B   5.1B  7.080
   30% x 4  30.0   0.2  69.8  2.4   959.2  774.8 421.1M   3.4B   5.9B  5.940
   40% x 1   9.7   0.3  90.0  1.7   551.6  356.8 201.6M   1.2B   2.8B  5.498
   40% x 2  19.9   0.3  79.8  1.4   668.0  539.4 288.0M   2.4B   4.0B  5.604
   40% x 3  29.8   0.5  69.7  1.8  1124.5  851.8 481.4M   3.5B   6.7B  7.918
   40% x 4  40.3   0.5  59.2  2.3  1186.4 1006.7   0.6B   4.6B   7.7B  6.982
   50% x 1  12.1   0.4  87.4  1.7   536.4  378.6 193.1M   1.1B   2.7B  4.793
   50% x 2  24.4   0.4  75.2  2.2   816.2  668.2 362.6M   3.0B   5.1B  6.493
   50% x 3  35.8   0.5  63.7  3.1  1300.2 1004.6   0.6B   4.2B   8.2B  8.800
   50% x 4  49.4   0.7  49.9  3.8  1455.2 1240.0   0.7B   5.7B   9.6B  8.130
   60% x 1  14.5   0.4  85.1  1.8   735.0  502.7 295.7M   1.7B   4.1B  6.927
   60% x 2  29.4   1.3  69.4  2.0   917.5  759.4 397.2M   3.3B   5.6B  6.791
   60% x 3  44.1   1.7  54.2  3.1  1615.4 1243.6   0.7B   5.1B   9.9B 10.056
   60% x 4  58.5   0.7  40.8  4.0  1728.1 1456.6   0.8B   6.8B  11.5B  9.226
   70% x 1  16.8   0.3  82.9  1.9   841.8  579.5 349.3M   2.0B   4.9B  7.856
   70% x 2  34.1   0.8  65.0  2.8   966.0  845.2 439.4M   3.7B   6.2B  6.800
   70% x 3  49.7   0.5  49.8  3.5  1834.5 1401.2   0.8B   5.9B  11.8B 11.113
   70% x 4  68.1   0.6  31.4  4.7  1771.3 1572.3   0.8B   7.0B  11.8B  8.809
   80% x 1  18.9   0.4  80.7  1.9   871.9  613.0 357.1M   2.1B   5.0B  7.276
   80% x 2  38.6   0.3  61.0  2.8  1268.6 1029.0   0.6B   4.8B   8.2B  9.253
   80% x 3  58.8   0.3  40.8  3.5  2061.7 1623.3   1.0B   6.8B  13.6B 11.967
   80% x 4  78.6   0.5  20.9  4.0  2356.3 1983.7   1.1B   9.0B  16.0B 12.047
   90% x 1  21.8   0.3  78.0  2.0  1054.5  737.9 459.3M   2.6B   6.4B  9.613
   90% x 2  44.2   1.2  54.7  2.7  1439.5 1174.7   0.7B   5.4B   9.2B 10.001
   90% x 3  66.2   1.4  32.4  3.9  2326.2 1822.3   1.1B   7.6B  15.0B 12.579
   90% x 4  88.5   0.2  11.4  4.8  2627.8 2219.1   1.3B  10.2B  17.8B 12.832
  100% x 1  25.1   0.0  74.8  2.0   135.8  314.0   0.5B   3.1B   7.5B 10.278
  100% x 2  50.0   0.0  50.0  3.0    91.9  560.4   0.7B   6.2B  10.4B 10.470
  100% x 3  75.1   0.1  24.8  4.0   120.2  824.1   1.2B   8.7B  16.8B 13.028
  100% x 4 100.0   0.0   0.0  5.0    76.8 1054.8   1.4B  11.6B  19.5B 13.156

For 4 CPUs (of a 4 CPU system):
  Power (Watts) = (% CPU load * 1.176217e-01) + 3.461561
  1% CPU load is about 117.62 mW
  Coefficient of determination R^2 = 0.809961 (good)

  Energy (Watt-seconds) = (bogo op * 8.465141e-09) + 3.201355
  1 bogo op is about 8.47 nWs
  Coefficient of determination R^2 = 0.911274 (strong)

  Energy (Watt-seconds) = (CPU cycle * 1.026249e-09) + 3.542463
  1 CPU cycle is about 1.03 nWs
  Coefficient of determination R^2 = 0.841894 (good)

  Energy (Watt-seconds) = (CPU instruction * 6.044204e-10) + 3.201433
  1 CPU instruction is about 0.60 nWs
  Coefficient of determination R^2 = 0.911272 (strong)

The results at the end are estimates based on the gathered samples. The samples are compared to the computed linear regression coefficients using the coefficient of determination (R^2);  a value of 1 is a perfect linear fit, less than 1 a poorer fit.

For more accurate results, increase the run time (-r option) and also increase the number of samples (-s option).

Power-calibrate is available in Ubuntu Wily 15.10.  It is just an academic toy for getting some power estimates and may be useful to compare compute vs power metrics across different x86 CPUs.  I've not been able to verify how accurate it really is, so I am interested to see how this works across a range of systems.

Friday, 18 September 2015

NumaTop: A NUMA system monitoring tool

NumaTop is a useful tool developed by Intel for monitoring runtime memory locality and analysis of processes on Non-Uniform Memory Access (NUMA) systems.  NumaTop can identify potential NUMA related performance bottlenecks and hence help one to re-balance memory/CPU allocations to maximise the potential of a NUMA system.

Initial "Top" like process view

One can select specific processes and drill down into characteristics such as memory latencies or call chains to see where code is hot.

Observing a specific process..
..and observing memory latencies
Observing per Node CPU and memory statistics
The tool uses perf to collect deeper system statistics and hence needs to be run with root privileges; it will also only run on NUMA systems. I've recently packaged NumaTop and it is now available in Ubuntu Wily 15.10 and the source is available on github.
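
To try it out on Wily (the package name given here is as per the Ubuntu archive):

sudo apt-get install numatop
sudo numatop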

Monday, 14 September 2015

light-weight process stats with cpustat

A while ago I was working on identifying busy processes on small Ubuntu devices and required a tool that could look at per process stats (from /proc/$pid/stat) in a fast and efficient way with minimal overhead.   There are plenty of tools such as "top" and "atop" that can show per-process CPU utilisation stats, but most of these aren't useful on really slow low-power devices as they consume several tens of megacycles collecting and displaying the results.

I developed cpustat to be compact and efficient, as well as provide enough stats to allow me to easily identify CPU sucking processes.   To optimise the code, I used tools such as perf to identify code hotspots as well as valgrind's cachegrind to identify poorly designed cache inefficient data structures.

The majority of the savings were in the parsing of data from /proc - originally I used simple fscanf() style parsing; over several optimisation rounds I ended up with hand-crafted numeric and string scanners that saved several hundred thousand cycles per iteration.
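
To give a flavour of the idea, below is an illustrative sketch (not the actual cpustat code) of a minimal unsigned decimal scanner that avoids the generality and overhead of *scanf() style parsing:

 #include <stdio.h>
 #include <stdint.h>

 /* scan an unsigned decimal; no error checking, no overflow handling */
 static uint64_t fast_u64(const char **str)
 {
         uint64_t v = 0;
         const char *ptr = *str;

         while (*ptr >= '0' && *ptr <= '9')
                 v = (v * 10) + (uint64_t)(*ptr++ - '0');
         *str = ptr;     /* leave the pointer at the first non-digit */
         return v;
 }

 int main(void)
 {
         /* e.g. the start of a /proc/$pid/stat line */
         const char *stat = "321 (kworker/0:1H) S 2 0 0";
         uint64_t pid = fast_u64(&stat);

         printf("pid = %llu, rest = '%s'\n",
                 (unsigned long long)pid, stat);
         return 0;
 }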

I also made some optimisations by tweaking the hash table sizes to match the input data more appropriately.  Also, by careful re-use of heap allocations, I was able to reduce malloc()/free() calls and save some heap management overhead.

Some very frequent string look-ups were replaced with hash lookups, and frequently accessed data was duplicated rather than referenced indirectly to keep data local, reducing cache stalls and hence speeding up data comparison and lookup times.

The source has been statically checked by CoverityScan, cppcheck and also clang's scan-build to check for bugs introduced in the optimisation steps.

Example of cpustat
cpustat is now available in Ubuntu 15.10 Wily Werewolf.   Visit the cpustat project page for more details.

Thursday, 10 September 2015

Tweaking the thermald configuration file

The Intel Thermal daemon (aka thermald) actively monitors thermal sensors and will modify cooling controls to try to keep the hardware cool.   By default, thermald will run in a "zero-configuration" mode and attempt to use the available CPU Digital Thermal Sensor(s) (DTS) to sense the temperature and use the P-state driver, Running Average Power Limit (RAPL), PowerClamp and cpufreq to control cooling.

Some systems may not work well in the default mode; perhaps the machine just runs too hot and one would like to tweak the settings to kick in passive or active cooling at a lower temperature than the default configuration. Thermald has a configuration file, /etc/thermald/thermal-conf.xml, that allows fine tuning of thermald. Essentially one declares the thermal sensors on the machine and a set of thermal zone controls that read these thermal sensors and inform thermald of the policy to control cooling when specific temperature thresholds are crossed.

For an example, I've picked on an old Acer Aspire One (AMD C-60). Let's see the sensors for this machine:
find /sys/class/hwmon/* -exec echo -n "{}: " \; -exec cat {}/name \;
/sys/class/hwmon/hwmon0: radeon
/sys/class/hwmon/hwmon1: k10temp
One can use tools such as sensors (from the lm-sensors package) to get an idea of the high and critical trip points for these:
$ sudo apt-get install lm-sensors
$ sensors
radeon-pci-0008
Adapter: PCI adapter
temp1:        +60.0°C  (crit = +120.0°C, hyst = +90.0°C)

k10temp-pci-00c3
Adapter: PCI adapter
temp1:        +60.5°C  (high = +70.0°C)
                       (crit = +115.0°C, hyst = +107.5°C)

So, in this simple example, I will just use the CPU sensor k10temp (from /sys/class/hwmon/hwmon1) as my thermald CPU temperature sensor. Next, I need to define a policy on what to do when this sensor reaches a specific high temperature threshold. In this example, I want to trigger passive (non-fan) cooling by adjusting the CPU frequency using cpufreq and also the ACPI processor sysfs cooling controls when we reach 85 degrees C. I require thermald to run both cooling methods together in parallel, with 60% of the influence to come from cpufreq and 40% from the ACPI processor cooling controls. My thermald config file for this (note that the Temperature field is in millidegrees C) is as follows:
 <ThermalConfiguration>
  <Platform>
   <Name>Aspire One</Name>
   <ProductName>*</ProductName>
   <Preference>QUIET</Preference>
   <ThermalSensors>
    <ThermalSensor>
     <Type>CPU_TEMP</Type>
     <Path>/sys/class/hwmon/hwmon1/temp1_input</Path>
     <AsyncCapable>0</AsyncCapable>
    </ThermalSensor>
   </ThermalSensors>
   <ThermalZones>
    <ThermalZone>
     <Type>cpu package</Type>
     <TripPoints>
      <TripPoint>
       <SensorType>CPU_TEMP</SensorType>
       <Temperature>85000</Temperature>
       <type>passive</type>
       <ControlType>PARALLEL</ControlType>
       <CoolingDevice>
        <index>1</index>
        <type>cpufreq</type>
        <influence>60</influence>
        <SamplingPeriod>1</SamplingPeriod>
       </CoolingDevice>
       <CoolingDevice>
        <index>2</index>
        <type>Processor</type>
        <influence>40</influence>
        <SamplingPeriod>1</SamplingPeriod>
       </CoolingDevice>
      </TripPoint>
     </TripPoints>
    </ThermalZone>
   </ThermalZones>
  </Platform>
 </ThermalConfiguration>
One can observe this working by starting thermald in verbose debug mode:
$ sudo thermald --no-daemon --loglevel=debug
It is worth exercising the machine (I use stress-ng --cpu 0) to ramp up the load and temperature to observe how thermald is working. Once one is happy with the results, one can then start thermald using:
$ sudo systemctl start thermald
More examples can be found in the thermald manual page:
$ man thermal-conf.xml 

Tuesday, 8 September 2015

static code analysis (revisited)

A while ago I was extolling the virtues of static analysis tools such as cppcheck, smatch and CoverityScan for C and C++ projects.  I've recently added to this armoury the clang analyser scan-build, which has been most helpful in finding even more obscure bugs that the previous three did not catch.

Using scan-build is very simple indeed, install clang and then in your source tree just build your project with scan-build, e.g. for a project built by make, use:
scan-build make
..and at the end of a build one will see a summary message:
scan-build make
scan-build: 366 bugs found.
scan-build: Run 'scan-view /tmp/scan-build-2015-09-08-094505-16657-1' 
to examine bug reports.
scan-build: The analyzer encountered problems on some source files.
scan-build: Preprocessed versions of these sources were deposited in 
'/tmp/scan-build-2015-09-08-094505-16657-1/failures'.
scan-build: Please consider submitting a bug report using these files:
scan-build:   http://clang-analyzer.llvm.org/filing_bugs.html

..and running scan-view will show the issues found.  For an example of the kind of results scan-build can find, I ran it against a systemd build (head commit 4df0514d299e349ce1d0649209155b9e83a23539). 
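
For autoconf style projects, the clang analyzer documentation suggests running the configure step under scan-build too, so that the analyzer is hooked in as the compiler:

scan-build ./configure
scan-build make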

As one can see, scan-build is a powerful and easy to use open-source static analyser.  I heartily recommend using it on every C and C++ project.

Monday, 7 September 2015

Monitoring temperatures with psensor

While doing some thermal debugging this weekend I stumbled upon the rather useful temperature monitoring utility "Psensor".   I configured it to update stats every second and according to perf it was only using 0.02 of a CPU's worth of compute, so it seems relatively lightweight and shouldn't contribute to warming the machine up!

I like the min/max values being clearly shown and also the ability to change graph colours and toggle data on or off.  Quick, easy and effective.  Not sure why I haven't found this tool earlier, but I wish I had!


Saturday, 29 August 2015

Identifying Suspend/Resume delays

The Intel SuspendResume project aims to help identify delays in suspend and resume.  After seeing it demonstrated by Len Brown (Intel) at this year's Linux Plumbers conference I gave it a quick spin and was delighted to see how easy it is to use.

The project has some excellent "getting started" documentation describing how to configure a system and run the suspend resume analysis script which should be read before diving in too deep.

For the impatient, one can try it out using the following:

git clone https://github.com/01org/suspendresume.git
cd suspendresume
sudo ./analyze_suspend.py


..and manually resume once after the machine has completed a successful suspend.

This will create a directory containing dumps of the kernel log and ftrace output, as well as an html web page that one can load into one's favourite web browser to view the results.  One can zoom in/out of the web page to drill down and see where the delays are occurring; an example from the SuspendResume project page is shown below:

example webpage (from https://01.org/suspendresume)

It is a useful project, kudos to Intel for producing it.  I thoroughly recommend using it to identify the delays in suspend/resume.

Friday, 7 August 2015

More ACPI table tests in fwts 15.08.00

The Canonical Hardware Enablement Team and myself are continuing the work to add more ACPI table tests to the Firmware Test Suite (fwts).  The latest 15.08.00 release added sanity checks for several more tables, as well as a test for the ACPI _CPC revision 2 control method, and we updated the ACPICA core to version 20150717.

Our aim is to continue to add support for existing and new ACPI tables to make fwts a comprehensive firmware test tool.  For more information about fwts, please refer to the fwts jump start wiki page.

Wednesday, 15 July 2015

stress-ng adds more features

Since I last wrote about perf being added to stress-ng at the end of May, I have been busy in my spare time adding more features to stress-ng.

New stressors include:
  • ptrace - traces a child process performing many simple system calls
  • sigsuspend - sends SIGUSR1 signals to multiple children waiting on sigsuspend(2)
  • sigpending - checks if SIGUSR1 signals are pending on a process that alternately masks and unmasks this signal
  • mmapfork - rapidly spawns multiple child processes that try to allocate a chunk of free memory (and try to avoid swapping). Each process then applies madvise(2) hints before and after the memory is memset, and then the child dies.
  • quota - exercise various quotactl(2) Q_GET* commands
  • sockpair - client/server socket I/O using socket pair and random sized I/O
  • getrandom - exercise the new getrandom(2) system call
  • numa - migrates a memory mapped buffer and processes around NUMA nodes, exercising migrate_pages(2), mbind(2) and move_pages(2).
  • fcntl - exercises the fcntl(2) commands F_DUPFD, F_DUPFD_CLOEXEC, F_GETFD, F_SETFD, F_GETFL, F_SETFL, F_GETOWN, F_SETOWN, F_GETOWN_EX, F_SETOWN_EX, F_GETSIG and F_SETSIG
  • wcs - exercises libc wide character string functions (thanks to Christian Ehrhardt for this contribution).
 ..and I have added some improvements too:
  • --yaml option to dump stats from --metrics, --perf, --tz into a YAML structured log.
  • made the --aggressive option more aggressive by forcing more CPU migrations and context switches.
I have also added a thermal zone stats gathering option --tz to see how warm the machine is getting when running a test.
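For example, an invocation along these lines (the stressor choice and duration here are purely illustrative) reports the thermal zone stats at the end of the run:

stress-ng --cpu 4 --tz -t 60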



... where x86_pkg_temp is the CPU package temperature and acpitz are the two ACPI thermal zones on my desktop.

Stress-ng is being used to stress test various kernels across a range of Ubuntu devices, such as phone, desktop and server.   Thrashing a system with hundreds of processes and a lot of low memory pressure is just one method of checking that the kernel and daemons can handle a mix of demanding work loads.

stress-ng 0.04.12 is now available in Ubuntu Wily.   See the stress-ng project page for more details.

Wednesday, 8 July 2015

New ACPI table tests in fwts 15.07.00

The Canonical Hardware Enablement Team and myself have been working on some significant improvements and changes to the Firmware Test Suite 15.07.00 over the past several weeks.  This cycle has been focused on adding more ACPI table testing support:

1. Added ACPI table tests:
  • BERT (Boot Error Record Table)
  • BGRT (Boot Graphics Resource Table)
  • BOOT (Boot Table)
  • CPEP (Corrected Platform Error Polling Table)
  • CSRT (Core System Resource Table)
  • DBG2 (Debug Port Table 2)
  • DBGP (Debug Port Table)
  • ECDT (Embedded Controller Boot Resources Table)
  • ERST (Error Record Serialization Table)
  • FACS (Firmware ACPI Control Structure)
  • HEST (Hardware Error Source Table)
  • LPIT (Low Power Idle Table)
  • MSDM (Microsoft Data Management Table)
  • SLIC (Software Licensing Description Table)
  • SLIT (System Locality Distance Information)
  • SPCR (Serial Port Console Redirection Table)
  • SPMI (Service Processor Management Interface Description Table)
  • SRAT (System Resource Affinity Table)
  • TCPA (Trusted Computing Platform Alliance Capabilities Table)
  • UEFI (UEFI Data Table)
  • WAET (Windows ACPI Emulated Devices Table)
  • XENV (Xen Environment Table)
2. Moved the following tests out of the generic "acpitables" test into their own ACPI table tests:
  • FADT (Fixed ACPI Description Table)
  • HPET (IA-PC High Precision Event Timer Table)
  • GTDT (Generic Timer Description Table)
  • MADT (Multiple APIC Description Table)
  • RSDP (Root System Description Pointer)
  • RSDT (Root System Description Table)
  • SBST (Smart Battery Specification Table)
  • XSDT (Extended System Description Table)
3. Updated ACPICA to version 20150616 and also 20150619 (ACPICA is used for the assembler/disassembler and execution engine).

4. Renamed the --uefi and --acpi options to --uefitests and --acpitests respectively.

5. Improved fwts build-time regression tests.  To ensure future changes don't break fwts, we have added more regression tests to sanity check fwts ACPI table tests. Quality matters to us.

This release also incorporates some important bug fixes too, such as making the acpidump dump file loading parser more robust, updating the SMM Communication fields on the UEFI table and fixing a segfault in the regular expression kernel log scanner on 32 bit systems.

For the next release of fwts, we are planning to continue to add more table tests from ACPI 5.x and ACPI 6.0 to get full coverage.

As ever, like all releases, for more details please consult the change log and the release notes.

Thursday, 2 July 2015

Firmware Related Blogs

More often than not, I'm looking at ACPI and UEFI related issues, so I was glad to see that Vincent Zimmer has collected up various useful links to firmware related blogs.   Very useful, thanks Vincent!

Saturday, 27 June 2015

Static code analysis on kernel source

Since 2014 I have been running static code analysis using tools such as cppcheck and smatch against the Linux kernel source on a regular basis to catch bugs that creep into the kernel.   After each cppcheck run I diff the logs to get a list of deltas on the error and warning messages, and I periodically review these to filter out false positives, ending up with a list of bugs that need some attention.

Bugs such as allocations returning NULL pointers without checks, memory leaks, duplicate memory frees and uninitialized variables are easy to find with static analyzers and generally just require one or two line fixes.

So what are the overall trends like?

Warnings and error messages from cppcheck have been dropping over time, while "portable warnings" have been steadily increasing.  "Portable warnings" mainly come from arithmetic on void * pointers (which GCC treats as byte-sized, but which is not legal C).  Note that there is some variation in the results as I use the latest versions of cppcheck; occasionally a new version finds a lot of false positives which then get fixed in later versions.

Compared to the growth in kernel size, the overall drop in the warning and error message trends from cppcheck isn't so bad, considering the kernel has grown by nearly 11% over the time I have been running the static analysis.

Kernel source growth over time
Since each warning or error reported has to be carefully scrutinized to determine if it is a false positive (and this takes a lot of effort and time), I've not yet been able to determine the exact false positive rates on these stats.  Compared to the actual lines of code, cppcheck is finding ~1 error per 15K lines of source.

It would be interesting to run this analysis on commercial static analyzers such as Coverity and see how the stats compare.  As it stands, cppcheck is doing its bit in detecting errors and helping engineers to improve code quality.

Friday, 19 June 2015

Powerstat and thermal zones

Last night I was mulling over an overheating laptop issue reported by a user; it turned out to be fluff and dust clogging up the fan rather than the intel_pstate driver being broken.

While it is a relief that the kernel driver is not at fault, it still bothered me that this kind of issue should be very simple to diagnose but I overlooked the obvious.   When solving these issues it is very easy to doubt that the complex part of a system is working correctly (e.g. a kernel driver) rather than the simpler part (e.g. the fan not working efficiently).  Normally, I try to apply Occam's Razor which in the simplest form can be phrased as:

"when you have two competing theories that make exactly the same predictions, the simpler one is the better."

..e.g. in this case, the fan is clogged up.

Fortunately, laptops invariably provide Thermal Zone information that can be monitored and hence one can correlate CPU activity with the temperature of various components of a laptop.  So last night I added Thermal Zone sampling to powerstat 0.02.00 which is enabled with the new -t option.
 
powerstat -tfR 0.5
Running for 60.0 seconds (120 samples at 0.5 second intervals).
Power measurements will start in 0 seconds time.

  Time    User  Nice   Sys  Idle    IO  Run Ctxt/s  IRQ/s  Watts x86_pk acpitz  CPU Freq
11:13:15   5.1   0.0   2.1  92.8   0.0    1   7902   1152   7.97  62.00  63.00  1.93 GHz  
11:13:16   3.9   0.0   2.5  93.1   0.5    1   7168    960   7.64  63.00  63.00  2.73 GHz  
11:13:16   1.0   0.0   2.0  96.9   0.0    1   7014    950   7.20  63.00  63.00  2.61 GHz  
11:13:17   2.0   0.0   3.0  94.5   0.5    1   6950    960   6.76  64.00  63.00  2.60 GHz  
11:13:17   3.0   0.0   3.0  93.9   0.0    1   6738    994   6.21  63.00  63.00  1.68 GHz  
11:13:18   3.5   0.0   2.5  93.6   0.5    1   6976    948   7.08  64.00  63.00  2.29 GHz  
....  

..the -t option now shows x86_pk (x86 CPU package temperature) and acpitz (ACPI thermal zone) temperature readings in degrees Celsius.

Now this is where the fun begins.  I ran powerstat for 60 seconds at 2 samples per second and then imported the data into LibreOffice.  To easily show correlations between CPU load, power consumption, temperature and CPU frequency, I normalized each data series (rescaling each sample x to (x - min) / (max - min)) so that the lowest values were 0.0 and the highest were 1.0, and produced the following graph:

One can see that the CPU frequency (green) scales with the CPU load (blue) and so does the CPU power (orange).   CPU temperature (yellow) jumps up quickly when the CPU is loaded and then steadily increases.  Meanwhile, the ACPI thermal zone (purple) trails the CPU load because it takes time for the machine to warm up and then cool down (it takes time for a fan to pump out the heat from the machine).

So, next time a laptop runs hot, running powerstat will capture the activity and correlating temperature with CPU activity should allow one to see if the overheating is related to a real CPU frequency scaling issue or a clogged up fan (or broken heat pipe!).

Thursday, 18 June 2015

Snooping on I/O using iosnoop

A while ago I blogged about Brendan Gregg's excellent book for tracking down performance issues, titled "Systems Performance: Enterprise and the Cloud".   Brendan has also produced a useful I/O diagnostic bash script, iosnoop, that uses ftrace to gather block device I/O events in real time.

The following example snoops on I/O for 1 second:
$ sudo iosnoop 1
Tracing block I/O for 1 seconds (buffered)...
COMM             PID    TYPE DEV      BLOCK        BYTES     LATms
kworker/u16:2    650    W    8,0      441077032    28672      1.46
kworker/u16:2    650    W    8,0      441077024    4096       1.45
kworker/u16:2    650    W    8,0      364810624    462848     1.35
kworker/u16:2    650    W    8,0      364810240    69632      1.34

And the next example snoops and shows start and end time stamps:
$ sudo iosnoop -ts  
Tracing block I/O. Ctrl-C to end.  
STARTs        ENDs          COMM           PID  TYPE DEV   BLOCK    BYTES   LATms  
35253.062020  35253.063148  jbd2/sda1-211  211  WS   8,0   29737200   53248   1.13  
35253.063210  35253.063261  jbd2/sda1-211  211  FWS  8,0   18446744073709551615 0     0.05  
35253.063282  35253.063616  <idle>         0    WS   8,0   29737304   4096    0.33  
35253.063650  35253.063688  gawk           551  FWS  8,0   18446744073709551615 0     0.04  
35253.766711  35253.767158  kworker/u16:0  305  W    8,0   433580264  4096    0.45  
35253.766778  35253.767258  kworker/0:1H   321  FWS  8,0   18446744073709551615 0     0.48  
35253.767289  35253.767635  <idle>         0    WS   8,0   273358464  4096    0.35  
35253.767309  35253.767654  <idle>         0    W    8,0   118371312  4096    0.35  
35253.767648  35253.767741  <idle>         0    FWS  8,0   18446744073709551615 0     0.09  
^C  
Ending tracing...  
One needs to run the tool as root as it uses ftrace. There is a selection of filtering options, such as showing I/O from a specific device, showing I/O of a specific type, or selecting I/O from a specific PID or process name. iosnoop can also display the I/O completion times, start times and queue insertion start times. On Ubuntu, iosnoop can be installed using:
sudo apt-get install perf-tools-unstable
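
Once installed, one can, for example, snoop on the I/O of a specific process for five seconds using the PID filtering option (the choice of process here is illustrative):

sudo iosnoop -p $(pidof firefox) 5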
A useful I/O analysis tool indeed. For more details, install the tool and read the iosnoop man page.

Thursday, 11 June 2015

Adding CPU states and CPU frequency stats to powerstat

During some spare moments I've added a couple of minor CPU related enhancements to powerstat.    The new -c option gathers CPU C-state activity over the run and shows a summary at the end, for example:
 C-State  Resident   Count Latency
 C7-IVB   75.239%   102315      87  
 C6-IVB    0.004%       60      80  
 C3-IVB    0.138%     2892      59  
 C1E-IVB   1.150%     7599      10  
 C1-IVB    0.948%     4611       1  
 POLL      0.000%        3       0  
 C0       22.521%
The above example shows that my Ivy Bridge i5-3210M spent ~75% of the time in the deepest C7 sleep state and ~22.5% of the time in the fully operating C0 state.

A new -f option gathers CPU frequency statistics across all the on-line CPUs and displays the average.   This provides an "instantaneous" view of the current CPU frequencies rather than an average over the sample interval, so beware that just gathering statistics using powerstat can cause CPU activity which of course can change the CPU frequency.
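
The new options can be combined with the existing ones; for example, the following (with an arbitrarily chosen half second sample interval) gathers both C-state and CPU frequency stats:

powerstat -cf 0.5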

For a simple test, I ran powerstat for a short 250 second run and normalised the CPU Core Power, CPU Load and CPU Frequency stats so that the data ranges are 0..1, allowing me to plot all three stats and easily compare them:

One can easily see the correlation between CPU Frequency, CPU Load and CPU core power consumed just from the powerstat data.

Powerstat tries to be as lightweight and as small as possible to minimize the impact on system behaviour.  My hope is that adding these extra CPU instrumentation features adds more useful functionality without adding a larger system impact.  I've instrumented powerstat with perf and I believe that the overhead is sufficiently small to justify these changes.

These two new features will be landing in powerstat 0.01.40 in Ubuntu Wily.

Sunday, 31 May 2015

Adding perf stats to stress-ng

Over the last few weeks I have been toying with the idea of adding more performance monitoring to stress-ng so one can see how much a stress test impacts on the CPU. The obvious choice to get such low level data is via Linux perf events using perf_event_open(2).

The man page for perf_event_open() provides plenty of information to get perf working from userspace, however, I was a bit stumped when I used several hardware perf events and then ran out of hardware Perf Monitoring Units (PMUs) resulting in some strange event counter readings. I discovered that when one runs out of PMUs, perf will multiplex event counting and so the perf counters need to be scaled by multiplying by PERF_FORMAT_TOTAL_TIME_ENABLED and divided by PERF_FORMAT_TOTAL_TIME_RUNNING.
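
A sketch of that scaling; the struct below mirrors the read format layout described in perf_event_open(2) when PERF_FORMAT_TOTAL_TIME_ENABLED and PERF_FORMAT_TOTAL_TIME_RUNNING are set, and the function name is mine:

 #include <stdint.h>

 struct perf_read {
         uint64_t value;         /* raw (possibly multiplexed) count */
         uint64_t time_enabled;  /* time the event was enabled */
         uint64_t time_running;  /* time the event was actually counting */
 };

 static uint64_t perf_scale(const struct perf_read *pr)
 {
         if (pr->time_running == 0)
                 return 0;
         /* scale the count up to estimate it over the full enabled time */
         return (uint64_t)((double)pr->value *
                 (double)pr->time_enabled / (double)pr->time_running);
 }

 int main(void)
 {
         struct perf_read pr = { 1000000, 10000000, 5000000 };

         /* the counter ran for half the enabled time, so double the count */
         return perf_scale(&pr) == 2000000 ? 0 : 1;
 }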

Once I had figured this out, it was relatively plain sailing to get perf working in stress-ng.  So stress-ng V0.04.04 now supports the --perf option that just enables perf monitoring on each stress test being run, it is as simple as that. For multiple instances of a stress test, stress-ng will sum all the perf counters of each processes running the stress-test to provide an overall total.

The following example will run the stress-ng cache stress test.  The first run enables cache flushing and so fetches of data will cause cache misses.  The second run has no cache flushing and hence has a far lower cache miss rate.


Note how the cache-flushing not only causes a far higher cache miss rate, but also reduces the effective number of instructions per cycle being executed and hence reduces the throughput (as one would expect).  With cache-flushing enabled I was seeing only 17.53 bogo ops per second compared to the 35.97 bogo ops per second with cache-flushing disabled.

The perf stats are enlightening. I still find it incredible that my laptop has so much computing power.  Some of the more compute bound stressors (such as the stress-ng bitops cpu stressor) are hitting over 20 billion instructions per second on my machine, which is rather impressive.  It seems that gcc optimization and the x86 superscalar micro-ops are working efficiently with some of these stress tests.

My hope is that the integrated perf monitoring in stress-ng will be instructive when comparing results on different processor architectures across the range of stress-ng stress tests.

Monday, 25 May 2015

comparing gcc 4.9.1 with 5.1.1 with CPU stressors

As a simple experiment, I thought it would be interesting to investigate stress-ng compiled with GCC 4.9.1 and GCC 5.1.1 in terms of computational improvement and power consumption on various CPU stress methods.   The stress-ng CPU stress test contains various different mixes of integer, floating point, bit operations and logic operations that can be used for processor loading, so it makes a useful test to see how well the code gets optimized with GCC.

Stress-ng provides a "bogo-ops" mechanism to measure a "unit of operation", normally this is just a count of the number of operations performed in a unit of time, hence allowing us to compare the relative performance of each stress method when compiled with different versions of GCC.  Running each stress method for a relatively long time (a few minutes) on an idle machine allows us to get a fairly stable and accurate measurement of bogo-ops per second.  Tests were run on a Lenovo x230 with an i5-3210M CPU.

The first chart below shows the relative improvement in bogo-ops per second between the two versions of GCC.  A value of n indicates GCC 5.1.1 is n times faster  in terms of bogo-ops per second than GCC 4.9.1, hence values less than 1.0 show that GCC 5.1.1 has regressed in performance.

It appears that int64, int32, int16, int8 and rand show some remarkable improvements with GCC 5.1.1; these all perform various integer operations (add, subtract, multiply, divide, xor, and, or, shift).

In contrast, hamming, hanoi, parity and sieve show degraded performance with GCC 5.1.1.  Hanoi just exercises recursion of a function with a few arguments and some memory load/stores.  Hamming, parity and sieve exercise bit twiddling operations and memory load/stores.

Further to just measuring computation, I used the Intel RAPL CPU package power measurements (using powerstat) to next measure the power consumed and then compute bogo ops per Watt for stress-ng built with GCC 4.9.1 and 5.1.1.  I then compared the relative improvement of 5.1.1 compared to 4.9.1:
The chart above shows the same kind of characteristics as the first chart, but in terms of computational improvement per Watt.  Note that there are even better improvements in relative terms for the integer and rand CPU stress methods.  For example, the rand stress method shows a 1.6 x improvement in terms of computation per second and a 2.1 x improvement in terms of computation per Watt comparing GCC 4.9.1 with 5.1.1.

It seems that benchmarking performance in terms of just compute improvements really should take power consumption into consideration too, to get a better idea of the true value of compiler optimization improvements.  Compute-per-Watt, rather than compute-per-second, should perhaps be the preferred benchmark in modern high-density compute farms.

Of course, these comparisons are just with one specific x86 micro-architecture, so one would expect different results for different x86 CPUs.  I guess that is for another weekend, if I get time.

Sunday, 24 May 2015

comparing cpuburn and stress-ng

The cpuburn package contains several hand crafted assembler "burn" programs to load x86 processors and to maximize heat production to stress a system.  This also is the intention of the stress-ng "cpu" stress test, which contains a variety of methods to stress CPUs with a wide range of instruction mixes.   Stress-ng is written in C and relies on the compiler to generate efficient code to hopefully load the CPU.  So how does stress-ng compare to the hand crafted cpuburn suite of programs on modern processors?

Since there is a correlation between power consumed and heat generated, I took the liberty of measuring the CPU package power consumption using the Intel RAPL interface as one way of comparing cpuburn and stress-ng.  Recent versions of powerstat support RAPL, so I ran each stressor for 120 seconds and took CPU package power measurements every 4 seconds over this interval with powerstat.

So, the cpuburn "burn" programs do well, however, some of the stress-ng CPU stress methods seem to do better.   The best stress-ng CPU methods are: ackermann, callfunc, hanoi, decimal128, dither, int128decimal128, trig and zeta.  It appears that ackermann, callfunc and hanoi do well because these are very localised deeply recursive function calls, so I expect register save/restores and some stack activity is the main power consumer.  The rest exercise the integer and floating point units and memory load/stores.

As it stands, a handful of stress-ng CPU stressors aren't as good as cpuburn. What is noticeable is that burnBX on an i3120M seems to do rather well in terms of loading the CPU.

One conclusion to draw from this is that modern C compilers such as gcc (in this case, gcc 4.9.2) with a suitably chosen mix of stores, loads and integer/floating point operations can outperform hand written assembler in terms of loading the full CPU package.  When I have a little more time, I will try and repeat this experiment with clang and gcc 5.

Wednesday, 6 May 2015

stress-ng updates for Ubuntu 15.10 Wily Werewolf

An on-going background project of mine is to add various interesting system stress tests to stress-ng.  Over the past several months I've been looking at ways to exercise various less used or obscure system calls just to add more kernel coverage to the tool.  Recent additions include:
  • rlimit - generate tens of thousands of SIGXFSZ and many SIGXCPU signals
  • itimer - exercise ITIMER_PROF and generate SIGPROF signals
  • mlock - lock and unlock pages with mlock()/munlock()
  • timerfd - exercise rapid CLOCK_REALTIME events by select() and read() on a timerfd.
  • memfd - exercise anonymous populated page memory mapping and unmapping using memfd.
  • more aggressive affinity stressor changes to force more CPU IPIs
  • hdd - add readv/writev I/O option
  • tee - tee data between a writer and reader process using tee()
  • crypt - encrypt data with MD5, SHA-256 and SHA-512 using libcrypt
  • mmapmany - perform tens of thousands of memory maps/unmaps to exhaust the per-process mapping limit.
  • zombie - fill up process table with tens of thousands of zombie processes
  • str - heavily exercise a range of glibc string functions
  • xattr - exercise file extended attributes
  • readahead - random reads with readaheads
  • vm - add a rowhammer memory stressor
..as well as extra per-stressor configuration settings and a lot of code clean up and bug fixing.

I've recently been using stress-ng to exercise various kernels on a range of hardware and it has been useful in forcing bugs, especially with the memory specific stressors that seem to trip low memory corner cases.

stress-ng 0.04.01 will soon be available in Ubuntu 15.10 Wily Werewolf.  Visit the stress-ng project page for more details.

powerstat improvements with RAPL

The Linux Running Average Power Limit (RAPL) interface was introduced about 2 years ago in the Linux kernel and allows userspace to read the power consumption from various x86 System-on-a-Chip (SoC) power domains.  The power domains include the SoC package, CPU core, DRAM controller and graphics power plane.

It appears that the Intel energy status MSRs can be read very rapidly and the resolution is exceptionally good; however, reading the MSR too frequently will consume some power when using the RAPL interface.

I've improved powerstat to now use the RAPL interface with a new -R option (to measure just the total package power consumption).  A new -D option will show all the RAPL domain measurements available.  RAPL measurements are very responsive and one can easily correlate power spikes with bursts of system activity.

Finally, I have added a basic histogram output with the new -H option. This will plot histograms of the power measurements and CPU load from the stats gathered during the powerstat run.

Powerstat 0.01.37 is available in Ubuntu 15.10 Wily Werewolf and the source is available from the git repository.

Monday, 23 February 2015

fnotifystat - a tool to show file system activity

Over the past year or more I have been focused on identifying power consuming processes on various mobile devices.  One of many strategies to reduce power is to remove unnecessary file system activity, such as extraneous logging, repeated file writes, unnecessary file re-reads and to reduce metadata updates.

Fnotifystat is a utility I wrote to help identify such file system activity. My desire was to make the tool as small as possible for small embedded devices and to be relatively flexible without needing to use perf, just in case the target device did not have perf built into the kernel by default.

By default, fnotifystat will dump out every second any file system open/close/read/write operations across all mounted file systems, however, one can specify the delay in seconds and the number of times to dump out statistics.   fnotifystat uses the fanotify(7) interface to get file activity across the system, hence it needs to be run with CAP_SYS_ADMIN capability.
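
For example, to dump out the statistics every 10 seconds, 6 times (assuming the delay and count are given as trailing arguments, vmstat style):

sudo fnotifystat 10 6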

An open(2), read(2)/write(2) and close(2) sequence by a process can produce multiple events, so fnotifystat has a -m option to merge events and hence reduce the amount of output.  A verbose -v option will output all file events if one desires to see the full system activity.

If one desires to just monitor a specific collection of processes, one can specify a list of the process ID(s) or process names using the -p option, for example:

sudo fnotifystat -p firefox,thunderbird

fnotifystat catches events on all mounted file systems, but one can restrict that to just the path(s) one is interested in using the -i (include) option, for example:

sudo fnotifystat -i /proc

..and one can exclude paths using the -x option.

More information and examples can be found on the fnotifystat project page, and the manual page contains further details and examples too.

Fnotifystat 0.01.10 is available in Ubuntu Vivid Vervet 15.04 and can also be installed for older releases from my power management tools PPA.

Tuesday, 27 January 2015

Finding kernel bugs with cppcheck

For the past year I have been running the cppcheck static analyzer against the linux kernel sources to see if it can detect any bugs introduced by new commits. Most of the bugs being found are minor thinkos, null pointer de-referencing, uninitialized variables, memory leaks and mistakes in error handling paths.

A useful feature of cppcheck is the --force option that will check against all the configurations in the source (and the kernel does have many!).  This allows us to check for code that may not be exercised much (because it is normally not built in with most config options) or even find dead code.

The downside of using the --force option is that each source file may need to be checked multiple times, once for each configuration.  For ~20800 source files this can take a 24 processor server several hours to process.  Errors and warnings are then compared to previous runs (a delta), making it relatively easy to spot new issues on each run.
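
A run along these lines produces the log used for such a delta; the job count and log file name are illustrative, and note that cppcheck reports its findings on stderr:

cppcheck --force -j 24 linux/ 2> cppcheck.log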

We also use the latest sources from the cppcheck git repository.  The upside of this is that new static analysis features are used early and this can result in finding existing bugs that previous versions of cppcheck missed.

A typical cppcheck run against the linux kernel source finds about 600 potential errors and 1700 warnings; however a lot of these are false positives.  These need to be individually eyeballed to sort the wheat from the chaff.

Finally, the data is passed through a gnuplot script to generate a trend graph so I can see how errors (red) and warnings (green) are progressing over time:


..note that the large changes in the graph are mostly with features being enabled (or fixed) in cppcheck.

I have been running the same experiment with smatch too, however I am finding that cppcheck seems to have better code coverage because of the --force option and seems to have fewer false positives.   As it stands, I am finding that the most productive time for finding issues is around the -rc1 and -rc2 merge times (obviously when most of the major changes land in the kernel).  The outcome of this work has been a bunch of small fixes landing in the kernel to address bugs that cppcheck has found.

Anyhow, cppcheck is an excellent open source static analyzer for C and C++ that I'd heartily recommend as it does seem to catch useful bugs.