A Smackerel of Opinion

C void return gotcha

2024-12-18T17:43:00.005+00:00

Last week I was bitten by a interesting C feature. The following terminate function was expected to exit if okay was zero (false) however it exited when zero was passed to it. The reason is the missing semicolon after the return function.

The interesting part this that is compiles fine because the void function terminate is allowed to return the void return value, in this case the void return from exit().

Integer shift gotcha

2023-02-10T11:27:00.000+00:00

Left shifting values seems simple, but the following code contains a bug:

The literal value 1 is actually a signed int, so the promotion to a uint64_t type will sign extend the value before assigning to x, causing x to be 0xffffffff80000000.

The fix is simple, just make the literal value 1 an unsigned int using 1U instead of 1.

Intel Hardware P-State (HWP) / Intel Speed Shift

2021-07-24T23:57:00.003+01:00

Intel Hardware P-State (aka Harware Controlled Performance or "Speed Shift") (HWP) is a feature found in more modern x86 Intel CPUs (Skylake onwards). It attempts to select the best CPU frequency and voltage to match the optimal power efficiency for the desired CPU performance. HWP is more responsive than the older operating system controlled methods and should therefore be more effective.

To test this theory, I exercised my Lenovo T480 i5-8350U 8 thread CPU laptop with the stress-ng cpu stressor using the "double" precision math stress method, exercising 1 to 8 of the CPU threads over a 60 second test run. The average CPU temperature and average CPU frequency were measured using powerstat and the CPU compute throughput was measured using the stress-ng bogo-ops count.

The HWP mode was set using the x86_energy_perf_policy tool (as found in the Linux source in tools/power/x86/x86_energy_perf_policy). This allows one to select one of 5 policies: "normal", "performance", "balance-performance", "balance-power" and "power" as well as enabling or disabling turbo frequencies. For the tests, turbo mode was also enabled to allow the CPU to run at higher CPU turbo frequencies.

The "performance" policy is the least efficient option as the CPU is clocked at a high frequency even when the system is idle and is not idea for a laptop. The "power" policy will optimize for low power; on my system it set the CPU to a maximum of 400MHz which is not ideal for typical uses.

The more useful "balance-performance" option optimizes for good throughput at the cost of power consumption where as the "balance-power" option optimizes for good power consumption in preference to performance, so I tested these two options.

Comparison #1, CPU throughput (bogo-ops) vs CPU frequency.

The two HWP policies are almost identical in CPU bogo-ops throughput vs CPU frequency. This is hardly surprising - the compute throughput for math intensive operations should scale with CPU clock frequency. Note that 5 or more CPU threads sees a reduction in compute throughput because the CPU hyper-threads are being used.

Comparison #2, CPU package temperature vs CPU threads used.

Not a big surprise, the more CPU threads being exercised the hotter the CPU package will get. The balance-power policy shows a cooler running CPU than the balance-performance policy. The balance-performance policy is running hot even when one or a few threads are being used.

Comparison #3, Power consumed vs CPU threads used.

Clearly the balance-performance option is consuming more power than balance-power, this matches the CPU temperature measurements too. More power, higher temperature.

Comparison #4, Maximum CPU frequency vs CPU threads used.

With the balance-performance option, the average maximum CPU frequency drops as more CPU threads are used. Intel turbo boost allows one to clock a few CPUs to higher frequencies, exercising more CPUs leads to more power and hence more heat. To keep the CPU package from hitting thermal overrun, CPU frequency and voltage has to be scaled down when using more CPUs.

This also is true (but far less pronounced) for the balance-power option. As once can see, balance-performance runs the CPU at a much higher frequency, which is great for compute at the expense of power consumption and heat.

Comparison #5, Compute throughput vs power consumed.

So running with the balance-performance runs the CPU at higher speed and hence one gets more compute throughput per unit of time compared to the balance-power mode. That's great if your laptop is plugged into the mains and you want to get some compute intensive tasks performed quickly. However, is this more efficient?

Comparing the amount of compute performance with the power consumed shows that the balance-power option is more efficient than balance-performance. Basically with balance-power more compute is possible with the same amount of energy compared to balance-performance, but it will take longer to complete.

CPU frequency scaling over time

The 60 second duration tests were long enough for the CPU to warm up enough reach thermal limits causing HWP to throttle back the voltage and CPU frequencies. The following graphs illustrate how running with the balance-performance option allows the CPU to run for several seconds at a high turbo frequency before it hits a thermal limit and then the CPU frequency and power is adjusted to avoid thermal overrun:

After 8 seconds the CPU package reached 92 degrees C and then CPU frequency scaling kicks in:

..and power consumption drops too:

..it is interesting to note that we can only run for ~9 seconds before the CPU is scaled back to around the same CPU frequency that the balance-power option allows.

Conclusion

Running with HWP balance-power option is a good default choice for maximizing compute while minimizing power consumption for a modern Intel based laptop. If one wants to crank up the performance at the expense of battery life, then the balance-performance option is most useful.

The balance-performance option when a laptop is plugged into the mains (e.g. via a base-station) may seem like a good idea to get peak compute performance. Note that this may not be useful in the long term as the CPU frequency may drop back to reduce thermal overrun. However, for bursty infrequent demanding CPU uses this may be a good choice. I personally refrain from using this as it makes my CPU rather run hot and it's less efficient so it's not ideal for reducing my carbon footprint.

Laptop manufacturers normally set the default HWP option as "balance-power", but this may be changed in the BIOS settings (look for power balance, HWP or Speed Shift options) or changed with x86_energy_perf_policy tool (found in the linux-tools-generic package in Ubuntu).

New features in stress-ng 0.12.12

2021-07-09T11:15:00.003+01:00

The release of stress-ng 0.12.12 incorporates some useful features and a handful of new stressors.

Media devices such as HDDs and SSDs normally support Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.) to detect and report various measurements of drive reliability. To complement the various file system and I/O stressors, stress-ng now has a --smart option that checks for any changes in the S.M.A.R.T. measurements and will report these at the end of a stress run, for example:

..as one can see, there are errors on /dev/sdc and this explains why the ZFS pool was having performance issues.

For x86 CPUs I have added a new stressor to trigger System Management Interrupts via writes to port 0xb2 to force the CPU into System Management Mode in ring -2. The --smi stressor option will also measure the time taken to service the SMI. To run this stressor, one needs the --pathological option since this may hang the computer and they behave like non-maskable interrupts:

To exercise the munmap(2) system call a new munmap stressor has been added. This creates child processes that walk through their memory mappings from /proc/$pid/maps and unmap pages on libraries that are not being used. The unapping is performed by striding across the mapping in page multiples of prime size to create many mapping holes to exercise the VM mapping structures. These unmappings can create SIGSEGV segmentation faults that silently get handled and respawn a new child stressor. Example:

There some new options for the fork, vfork and vforkmany stressors, a new vm mode has been added to try and exercise virtual memory mappings. This enables detrimental performance virtual memory advice using madvise on all pages of the new child process. Where possible this will try to set every page in the new process with using madvise MADV_MERGEABLE, MADV_WILLNEED, MADV_HUGEPAGE and MADV_RANDOM flags. The following shows how to enable the vm options for the fork and vfork stressors:

One final new feature is the --skip-silent option. This will disable printing of messages when a stressor is skipped, for example, if the stressor is not supported by the kernel, the hardware or a support library is not available.

As usual for each release, stress-ng incorporates bug fixes and has been tested on a wide variety of Linux/*BSD/UNIX/POSIX systems and across a range of processor architectures (arm32, arm64, amd64, i386, ppc64el, RISC-V s390x, sparc64, m68k. It has also been statically analysed with Coverity and cppcheck and built cleanly with pedantic build flags on gcc and clang.

Adjacent C string concatenation gotcha

2021-05-21T10:12:00.002+01:00

C has the useful feature of adjacent allowing literal strings to be automatically concatenated. This is described in K&R "The C programming language" 2nd edition, page 194, section A2.6 "String Literals":

"Adjacent string literals are concatenated into a single string."

Unfortunately over the years I've seen several occasions where this useful feature has led to bugs not being detected, for example, the following code:

A simple typo in the "Does Not Compute" error message ends up with the last two literal strings being "silently" concatenated, causing an array out of bounds read when error_msgs[7] is accessed.

This can also bite when a new literal string is added to the end of the array. The previous string requires a comma to be added when a new string is added to the array, and sometimes this is overlooked. An example of this is in ACPICA commit 81eb9c383e6dee0f1b6620e91e5c3dbb48234831 - fortunately static analysis detected this and it has been fixed with commit 3f8c79fc22fd64b739f51268654a6783a874520e

The concerning issue is that this useful string concatenation feature can produce hazardous outcomes with coding mistakes that are simple to make but hard to notice.

C Ternary operator gotcha (type conversions)

2021-04-23T12:55:00.004+01:00

The C ternary operator expr1 ? expr2 : expr3 has a subtle side note described in K&R 2nd edition, page 52, section 2.11:

"If expr2 and expr3 are of different types, the type of the result is determined by the conversion rules discussed earlier in the chapter".

This refers to page 44, section 2.7 "Type Conversions". It's worth reading a few times along with section A6.5 "Arithmetic Conversions".

Here is an example of a type conversion gotcha:

At a glance one would think the program would print out -1 as the output. Note that the expr2 is actually type converted to unsigned int so the result is 4294967295 if int types are 32 bits wide.

One solution is to type convert expr2 and/or expr3 to a long int and because that is wider than the unsigned int x this takes precedence. Alternatively just use:

References: soc: aspeed: fix a ternary sign expansion bug

A C for-loop Gotcha

2021-03-30T12:37:00.004+01:00

The C infinite for-loop gotcha is one of the less frequent issues I find with static analysis, but I feel it is worth documenting because it is obscure but easy to do.

Consider the following C example:

Since i is a 8 bit integer, it will wrap around to zero when it reaches the maximum 8 bit value of 255 and so we end up with an infinite loop if the upper limit of the loop n is 256 or more.

The fix is simple, always ensure the loop counter is at least as wide as the type of the maximum limit of the loop. This example, variable i should be a uint32_t type.

I've seen this occur in the Linux kernel a few times. Sometimes it is because the loop counter is being passed into a function call that expects a specific type such as a u8, u16. In other occasions I've seen a u16 (or short) integer being used presumably because it was expected to produce faster code, however, most commonly 32 bit integers just as fast (or sometimes faster) than 16 bit integers for this kind of operation.

A Common C Integer Multiplication Mistake

2021-03-28T23:33:00.003+01:00

Multiplying integers in C is easy. It is also easy to get it wrong. A common issue found using static analysis on the Linux kernel is the integer overflow before widening gotcha.

Consider the following code that takes the 2 unsigned 32 bit integers, multiplies them together and returns the unsigned 64 bit result:

The multiplication is performed using unsigned 32 bit arithmetic and the unsigned 32 bit results is widened to an unsigned 64 bit when assigned to ret. A way to fix this is to explicitly cast a to a uint64_t before the multiplication to ensure an unsigned 64 bit multiplication is performed:

Fortunately static analysis finds these issues. Unfortunately it is a bug that keeps on occurring in new code.

A common C integer shifting mistake

2021-03-18T17:20:00.003+00:00

Shifting integers in C is easy. Alas it is also easy to get it wrong. A common issue found using static analysis on the Linux kernel is the unintentional sign extension gotcha.

Consider the following code that takes the 4 unsigned 8 bit integers in array data and returns an unsigned 64 bit integer:

C promotes the uint8_t integers into signed ints on the right shift. If data[3] has the upper bit set, for example with the value 0x80 and data[2]..data[0] are zero, the shifted 32 bit signed result is sign extended to a 64 bit long and the final result is 0xffffffff80000000. A way to fix this is to explicitly cast data[3] to a uint64_t before shifting it.

Fortunately static analysis finds these issues. Unfortunately it is a bug that keeps on occurring in new code.

Improving kernel test coverage with stress-ng

2021-01-04T16:44:00.000+00:00

Over the past year there has been focused work on improving the test coverage of the Linux Kernel with stress-ng. Increased test coverage exercises more kernel code and hence improves the breadth of testing, allowing us to be more confident that more corner cases are being handled correctly.

The test coverage has been improved in several ways:

testing more system calls; most system calls are being now exercised
adding more ioctl() command tests
exercising system call error handling paths
exercise more system call options and flags
keeping track of new features added to recent kernels and adding stress test cases for these
adding support for new architectures (RISC-V for example)

Each stress-ng release is run with various stressor options against the latest kernel (built with gcov enabled). The gcov data is processed with lcov to produce human readable kernel source code containing coverage annotations to help inform where to add more test coverage for the next release cycle of stress-ng.

Linux Foundation sponsored Piyush Goyal for 3 months to add test cases that exercise system call test failure paths and I appreciate this help in improving stress-ng. I finally completed this tedious task at the end of 2020 with the release of stress-ng 0.12.00.

Below is a chart showing how the kernel coverage generated by stress-ng has been increasing since 2015. The amber line shows lines of code exercised and the green line shows kernel functions exercised.

..one can see that there was a large increase of kernel test coverage in the latter half of 2020 with stress-ng. In all, 2020 saw ~20% increase on kernel coverage, most of this was driven using the gcov analysis, however, there is more to do.

What next? Apart from continuing to add support for new kernel system calls and features I hope to improve the kernel coverage test script to exercise more file systems; it will be interesting to see what kind of bugs get found. I'll also be keeping the stress-ng project page refreshed as this tracks bugs that stress-ng has found in the Linux kernel.

As it stands, release 0.12.00 was a major milestone for stress-ng as it marks the completion of the major work items to improve kernel test coverage.

Kernel janitor work: fixing spelling mistakes in kernel messages

2020-09-25T13:07:00.003+01:00

The Linux 5.9-rc6 kernel source contains over 300,000 literal strings used in kernel messages of various sorts (errors, warnings, etc) and it is no surprise that typos and spelling mistakes slip into these messages from time to time.

To catch spelling mistakes I run a daily automated job that fetches the tip from linux-next and runs a fast spelling checker tool that finds all spelling mistakes and then diff's these against the results from the previous day. The diff is emailed to me and I put my kernel janitor hat on, fix these up and send these to the upstream developers and maintainers.

The spelling checker tool is a fast-and-dirty C parser that finds literal strings and also variable names and checks these against a US English dictionary containing over 100,000 words. As fun weekend side project I hand optimized the checker to be able to parse and spell check several millions lines of kernel C code per second.

Every 3 or so months I collate all the fixes I've made and where appropriate I add new spelling mistake patterns to the kernel checkpatch spelling dictionary. Kernel developers should in practice run checkpatch.pl on their patches before submitting them upstream and hopefully the dictionary will catch a lot of the regular spelling mistakes.

Over the past couple of years I've seen less spelling mistakes creep into the kernel, either because folk are running checkpatch more nowadays and/or that the dictionary is now able to catch more spelling mistakes. As it stands, this is good as it means less work to fix these up.

Spelling mistakes may be trivial fixes, but cleaning these up helps make the kernel errors appear more professional and can also help clear up some ambiguous messages.

easy capturing of kernel stack traces with virsh

2020-04-30T13:01:00.003+01:00

Today I needed to capture a rather large kernel stack dump, this is rather trivial using virsh. Using virt-manager I created a VM named vm-focal and in the guest ran:

sudo systemctl enable serial-getty@ttyS0.service

Then on the host running the VM I ran:

virsh console vm-focal

Then all I needed to do was produce the stack dump and the console output was successfully dumped by virsh. Easy.

stress-ng Embedded Linux Conference Europe 2019 presentation

2019-11-04T09:33:00.001+00:00

Last week I attended the Embedded Linux Conference (Europe 2019) and presented a talk on stress-ng. The slide deck for this presentation is now available.

Stress testing CPU temperatures

2019-10-17T11:24:00.002+01:00

Stress testing CPU temperatures is not exactly straight forward. CPU designs vary from CPU to CPU and each have their own strengths and weaknesses in cache design, integer maths, floating point math, bit-wise logical operations and branch prediction to name but a few. I've been asked several times about the "best" CPU stressor method in stress-ng to use to make a CPU run hot.

As an experiment I ran all the CPU stressor methods in stress-ng for 60 seconds across a range of devices, from small ARM based Raspberry Pi 2 and 3 to much larger Xeon desktop servers just to see how hot CPUs get. The thermal measurements were based on the most relevant thermal zones, for example, on x86 this is the CPU package thermal zone. In between each stress run 60 seconds of idle time was added to allow the CPU to cool.

Below are the results:

As one can see, quite a mixed set of results and it is hard to recommend any specific CPU stressor method as the "best" across a range of CPUs. It does appear that the mix of 64 bit integer and floating point cpu stress methods do seem to be generally rather good for making most CPUs run hot.

With this is mind, I think we can conclude there is no such thing as a perfect way to make a CPU run hot as it is very architecture dependant. Fortunately the stress-ng CPU stressor has a large suite of methods to exercise the CPU in different ways, so there should be a good stressor somewhere in that collection to max out your CPU. Knowing which one is the tricky part(!)

Boot speed improvements for Ubuntu 19.10 Eoan Ermine

2019-09-10T10:47:00.002+01:00

The early boot requires loading and decompressing the kernel and initramfs from the boot storage device. This speed is dependent on several factors, speed of loading an image from the boot device, the CPU and memory/cache speed for decompression and the compression type.

Generally speaking, the smallest (best) compression takes longer to decompress due to the extra complexity in the compression algorithm. Thus we have a trade-off between load time vs decompression time.

For slow rotational media (such as a 5400 RPM HDD) with a slow CPU the loading time can be the dominant factor. For faster devices (such as a SSD) with a slow CPU, decompression time may be the dominate factor. For devices with fast 7200-10000 RPM HDDs with fast CPUs, the time to seek to the data starts to dominate the load time, so load times for different compressed kernel sizes is only slightly different in load time.

The Ubuntu kernel team ran several experiments benchmarking several x86 configurations using the x86 TSC (Time Stamp Counter) to measure kernel load and decompression time for 6 different compression types: BZIP2, GZIP, LZ4, LZMA, LZMO and XZ. BZIP2, LZMA and XZ are slow to decompress so they got ruled out very quickly from further tests.

In compression size, GZIP produces the smallest compressed kernel size, followed by LZO (~16% larger) and LZ4 (~25% larger). With decompression time, LZ4 is over 7 times faster than GZIP, and LZO being ~1.25 times faster then GZIP on x86.

In absolute wall-clock times, the following kernel load and decompress results were observed:

Lenovo x220 laptop, 5400 RPM HDD:
LZ4 best, 0.24s faster than the GZIP total time of 1.57s

Lenovo x220 laptop, SSD:
LZ4 best, 0.29s faster than the GZIP total time of 0.87s

Xeon 8 thread desktop with 7200 RPM HDD:
LZ4 best, 0.05s faster than the GZIP total time of 0.32s

VM on a Xeon 8 thread desktop host with SSD RAID ZFD backing store:
LZ4 best, 0.05s faster than the GZIP total time of 0.24s

Even with slow spinning media and a slow CPU, the longer load time of the LZ4 kernel is overcome by the far faster decompression time. As media gets faster, the load time difference between GZIP, LZ4 and LZO diminishes and the decompression time becomes the dominant speed factor with LZ4 the clear winner.

For Ubuntu 19.10 Eoan Ermine, LZ4 will be the default decompression for x86, ppc64el and s390 kernels and for the initramfs too.

References:
Analysis: https://kernel.ubuntu.com/~cking/boot-speed-eoan-5.3/kernel-compression-method.txt
Data: https://kernel.ubuntu.com/~cking/boot-speed-eoan-5.3/boot-speed-compression-5.3-rc4.ods

Monitoring page faults with faultstat

2019-08-13T12:14:00.001+01:00

Whenever a process accesses a virtual address where there isn't currently a physical page mapped into its process space then a page fault occurs. This causes an interrupt so that the kernel can handle the page fault.

A minor page fault occurs when the kernel can successfully map a physically resident page for the faulted user-space virtual address (for example, accessing a memory resident page that is already shared by other processes). Major page faults occur when accessing a page that has been swapped out or accessing a file backed memory mapped page that is not resident in memory.

Page faults incur latency in the running of a program, major faults especially so because of the delay of loading pages in from a storage device.

The faultstat tool allows one to easily monitor page fault activity allowing one to find the most active page faulting processes. Running faultstat with no options will dump the page fault statistics of all processes sorted in major+minor page fault order.

Faultstat also has a "top" like mode, inoking it with the -T option will display the top page faulting processes again in major+minor page fault order.

The Major and Minor columns show the respective major and minor page faults. The +Major and +Minor columns show the recent increase of page faults. The Swap column shows the swap size of the process in pages.

Pressing the 's' key will switch through the sort order. Pressing the 'a' key will add an arrow annotation showing page fault growth change. The 't' key will toggle between cumulative major/minor page total to current change in major/minor faults.

The faultstat tool has just landed in Ubuntu Eoan and can also be installed as a snap. The source can is available on github.

Working towards stress-ng 0.10.00

2019-06-08T17:26:00.000+01:00

Over the past 9+ months I've been cleaning up stress-ng in preparation for a V0.10.00 release. Stress-ng is a portable Linux/UNIX Swiss army knife of micro-benchmarking kernel stress tests.

The Ubuntu kernel team uses stress-ng for kernel regression testing in several ways:

Checking that the kernel does not crash when being stressed tested
Performance (bogo-op throughput) regression checks
Power consumption regression checks
Core CPU Thermal regression checks

The wide range of micro benchmarks in stress-ng allow us to keep track of a range of metrics so that we can catch regressions.

I've tried to focus on several aspects of stress-ng over the last last development cycle:

Improve per-stressor modularization. A lot of code has been moved from the core of stress-ng back into each stress test.
Clean up a lot of corner case bugs found when we've been testing stress-ng in production. We exercise stress-ng on a lot of hardware and in various cloud instances, so we find occasional bugs in stress-ng.
Improve usability, for example, adding bash command completion.
Improve portability (various kernels, compilers and C libraries). It really builds on runs on a *lot* of Linux/UNIX/POSIX systems.
Improve kernel test coverage. Try to exercise more kernel core functionality and reach parts other tests don't yet reach.

Over the past several days I've been running various versions of stress-ng on a gcov enabled 5.0 kernel to measure kernel test coverage with stress-ng. As shown below, the tool has been slowly gaining more core kernel coverage over time:

With the use of gcov + lcov, I can observe where stress-ng is not currently exercising the kernel and this allows me to devise stress tests to touch these un-exercised parts. The tool has a history of tripping kernel bugs, so I'm quite pleased it has helped us to find corners of the kernel that needed improving.

This week I released V0.09.59 of stress-ng. Apart from the usual sets of clean up changes and bug fixes, this new release now incorporates bash command line completion to make it easier to use. Once the 5.2 Linux kernel has been released and I'm satisfied that stress-ng covers new 5.2 features I will probably be releasing V0.10.00. This will be a major release milestone now that stress-ng has realized most of my original design goals.

Kernel commits with "Fixes" Tag (revisited)

2019-01-05T19:39:00.000+00:00

Last year I wrote about kernel commits that are tagged with the "Fixes" tag. Kernel developers use the "Fixes" tag on a bug fix commit to reference an older commit that originally introduced the bug. The adoption of the tag has been steadily increasing since v3.12 of the kernel:

The red line shows the number of commits per release of the kernel, and the blue line shows the number of commits that contain a "Fixes" tag.

In terms of % of commits that contain the "Fixes" tag, one can see it has been steadily increasing since v3.12 and almost 12.5% of kernel commits in v4.20 are tagged this way.

The fixes tag contains the commit SHA of the commit that was fixed, so one can look up the date of the fix and of the commit being fixed and determine the time taken to fix a bug.

As one can see, a lot of issues get fixed on the first few hundred days, and some bugs take years to get fixed. Zooming into the first hundred days of fixes the distribution looks like:

..the modal point is at day 4, I suspect these are issues that get found quickly when commits land in linux-next and are found in early testing, integration builds and static analysis.

Out of the thousands of "Fixes" tagged commits and the time to fix an issue one can determine how long it takes to fix a specific percentage of the bugs:

In the graph above, 50% of fixes are made within 151 days of the original commit, ~69% of fixes are made within a year of the original commit and ~82% of fixes are made within 2 years. The long tail indicates that there are some bugs that take a while to be found and fixed, the final 10% of bugs take more than 3.5 years to be found and fixed.

Comparing the time to fix issues for kernel versions v4.0, v4.10 and v4.20 for bugs that are fixed in less than 50 days we have:

... the trends are similar, however it is worth noting that more bugs are getting found and fixed a little faster in v4.10 and v4.20 than v4.0. It will be interesting to see how these trends develop over the next few years.

Linux I/O Schedulers

2018-12-12T09:55:00.002+00:00

The Linux kernel I/O schedulers attempt to balance the need to get the best possible I/O performance while also trying to ensure the I/O requests are "fairly" shared among the I/O consumers. There are several I/O schedulers in Linux, each try to solve the I/O scheduling issues using different mechanisms/heuristics and each has their own set of strengths and weaknesses.

For traditional spinning media it makes sense to try and order I/O operations so that they are close together to reduce read/write head movement and hence decrease latency. However, this reordering means that some I/O requests may get delayed, and the usual solution is to schedule these delayed requests after a specific time. Faster non-volatile memory devices can generally handle random I/O requests very easily and hence do not require reordering.

Balancing the fairness is also an interesting issue. A greedy I/O consumer should not block other I/O consumers and there are various heuristics used to determine the fair sharing of I/O. Generally, the more complex and "fairer" the solution the more compute is required, so selecting a very fair I/O scheduler with a fast I/O device and a slow CPU may not necessarily perform as well as a simpler I/O scheduler.

Finally, the types of I/O patterns on the I/O devices influence the I/O scheduler choice, for example, mixed random read/writes vs mainly sequential reads and occasional random writes.

Because of the mix of requirements, there is no such thing as a perfect all round I/O scheduler. The defaults being used are chosen to be a good best choice for the general user, however, this may not match everyone's needs. To clarify the choices, the Ubuntu Kernel Team has provided a Wiki page describing the choices and how to select and tune the various I/O schedulers. Caveat emptor applies, these are just guidelines and should be used as a starting point to finding the best I/O scheduler for your particular need.

New features in Forkstat

2018-12-01T12:43:00.000+00:00

Forkstat is a simple utility I wrote a while ago that can trace process activity using the rather useful Linux NETLINK_CONNECTOR API. Recently I have added two extra features that may be of interest:

1. Improved output using some UTF-8 glyphs. These are used to show process parent/child relationships and various process events, such as termination, core dumping and renaming. Use the new -g (glyph) option to enable this mode. For example:

In the above example, the program "wobble" was started and forks off a child process. The parent then renames itself to wibble (indicated by a turning arrow). The child then segfaults and generates a core dump (indicated by a skull and crossbones), triggering apport to investigate the crash. After this, we observe NetworkManager creating a thread that runs for a very short period of time. This kind of activity is normally impossible to spot while running conventions tools such as ps or top.

2. By default, forkstat will show the process name using the contents of /proc/$PID/cmdline. The new -c option allows one to instead use the 16 character task "comm" field, and this can be helpful for spotting process name changes on PROC_EVENT_COMM events.

These are small changes, but I think they make forkstat more useful. The updated forkstat will be available in Ubuntu 19.04 "Disco Dingo".

High-level tracing with bpftrace

2018-11-22T12:14:00.001+00:00

Bpftrace is a new high-level tracing language for Linux using the extended Berkeley packet filter (eBPF). It is a very powerful and flexible tracing front-end that enables systems to be analyzed much like DTrace.

The bpftrace tool is now installable as a snap. From the command line one can install it and enable it to use system tracing as follows:

 sudo snap install bpftrace  
 sudo snap connect bpftrace:system-trace

To illustrate the power of bpftrace, here are some simple one-liners:

 # trace openat() system calls
 sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%d %s %s\n", pid, comm, str(args->filename)); }'   
 Attaching 1 probe...  
 1080 irqbalance /proc/interrupts  
 1080 irqbalance /proc/stat  
 2255 dmesg /etc/ld.so.cache  
 2255 dmesg /lib/x86_64-linux-gnu/libtinfo.so.5  
 2255 dmesg /lib/x86_64-linux-gnu/librt.so.1  
 2255 dmesg /lib/x86_64-linux-gnu/libc.so.6  
 2255 dmesg /lib/x86_64-linux-gnu/libpthread.so.0  
 2255 dmesg /usr/lib/locale/locale-archive  
 2255 dmesg /lib/terminfo/l/linux  
 2255 dmesg /home/king/.config/terminal-colors.d  
 2255 dmesg /etc/terminal-colors.d  
 2255 dmesg /dev/kmsg  
 2255 dmesg /usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache

 # count system calls using tracepoints:  
 sudo bpftrace -e 'tracepoint:syscalls:sys_enter_* { @[probe] = count(); }'  
 @[tracepoint:syscalls:sys_enter_getsockname]: 1  
 @[tracepoint:syscalls:sys_enter_kill]: 1  
 @[tracepoint:syscalls:sys_enter_prctl]: 1  
 @[tracepoint:syscalls:sys_enter_epoll_wait]: 1  
 @[tracepoint:syscalls:sys_enter_signalfd4]: 2  
 @[tracepoint:syscalls:sys_enter_utimensat]: 2  
 @[tracepoint:syscalls:sys_enter_set_robust_list]: 2  
 @[tracepoint:syscalls:sys_enter_poll]: 2  
 @[tracepoint:syscalls:sys_enter_socket]: 3  
 @[tracepoint:syscalls:sys_enter_getrandom]: 3  
 @[tracepoint:syscalls:sys_enter_setsockopt]: 3  
 ...

Note that it is recommended to use bpftrace with Linux 4.9 or higher.

The bpftrace github project page has an excellent README guide with some worked examples and is a very good place to start. There is also a very useful reference guide and one-liner tutorial too.

If you have any useful btftrace one-liners, it would be great to share them. This is an amazingly powerful tool, and it would be interesting to see how it will be used.

Static Analysis Trends on Linux Next

2018-10-03T10:17:00.000+01:00

I've been running static analysis using CoverityScan on linux-next for 2 years with the aim to find bugs (and try to fix some) before they are merged into Linux. I have also been gathering the defect count data and tracking the defect trends:

As one can see from above, CoverityScan has found a considerable amount of defects and these are being steadily fixed by the Linux developer community. The encouraging fact is that the outstanding issues are reducing over time. Some of the spikes in the data are because of changes in the analysis that I'm running (e.g. getting more coverage), but even so, one can see a definite trend downwards in the total defects in the Kernel.

With static analysis, some of these reported defects are false positives or corner cases that are in fact impossible to occur in real life and I am slowly working through these and annotating them so they don't get reported in the defect count.

It must be also noted that over these two years the kernel has grown from around 14.6 million to 17.1 million lines of code so the defect count has dropped from 1 defect in every ~2100 lines to 1 defect in every ~3000 lines over the past 2 years. All in all, it is a remarkable improvement for such a large and complex codebase that is growing in size at such rate.

Comparing Latencies and Power consumption with various CPU schedulers

2018-07-16T12:31:00.002+01:00

The low-latency kernel offering with Ubuntu provides a kernel tuned for low-latency environments using low-latency kernel configuration options. The x86 kernels by default run with the Intel-Pstate CPU scheduler set to run with the powersave scaling governor biased towards power efficiency.

While power efficiency is fine for most use-cases, it can introduce latencies due to the fact that the CPU can be running at a low frequency to save power and also switching from a deep C state when idle to a higher C state when servicing an event can also increase on latencies.

In a somewhat contrived experiment, I rigged up an i7-3770 to collect latency timings of clock_nanosleep() wake-ups with timer event coalescing disabled (timer_slack set to zero) over 60 seconds across a range of CPU scheduler and governor settings on a 4.15 low-latency kernel. This can be achieved using stress-ng, for example:

 sudo stress-ng --cyclic 1 --cyclic-dist 100 –cyclic-sleep=10000 --cpu 1 -l 0 -v \
--cyclic-policy rr --cyclic-method clock_ns --cpu 0 -t 60 --timer-slack 0

..the above runs a cyclic measurement collecting latency counts in 100ns buckets with a clock_nanosecond wakeup interval of 10,000 nanoseconds with zero % load CPU stressor and timer slack set to 0 nanoseconds. This dumps latency distribution stats that can be plotted to see where the modal latency points occur and the latency characteristics of the CPU scheduler.

I also used powerstat to measure the power consumed by the CPU package over a 60 second interval. Measurements for the Intel-Pstate CPU scheduler [performance, powersave] and the ACPI CPU scheduler (intel_pstate=disabled) [performance, powersave, conservative and ondemand] were taken for 1,000,000 down to 10,000 nanosecond timer delays.

1,000,000 nanosecond timer delays (1 millisecond)

Strangely the powersave Intel-Pstate is using the most power (not what I expected).

The ACPI CPU scheduler in performance mode has the best latency distribution followed by the Intel-Pstate CPU scheduler also in performance mode.

100,000 nanosecond timer delays (100 microseconds)

Note that Intel-Pstate performance consumes the most power...

...and also has the most responsive low-latency distribution.

10,000 nanosecond timer delays (10 microseconds)

In this scenario, the ACPI CPU scheduler in performance mode was consuming the most power and had the best latency distribution.

It is clear that the best latency responses occur when a CPU scheduler is running in performance mode and this consumes a little more power than other CPU scheduler modes. However, it is not clear which CPU scheduler (Intel-Pstate or ACPI) is best in specific use-cases.

The conclusion is rather obvious; but needs to be stated. For best low-latency response, set the CPU governor to the performance mode at the cost of higher power consumption. Depending on the use-case, the extra power cost is probably worth the improved latency response.

As mentioned earlier, this is a somewhat contrived experiment, only one CPU was being exercised with a predictable timer wakeup. A more interesting test would be with data handling, such as incoming packet handling over ethernet at different rates; I will probably experiment with that if and when I get more time. Since this was a synthetic test using stress-ng, it does not represent real world low-latency scenarios, however, it may be worth exploring CPU scheduler settings to tune a low-latency configuration rather than relying on the default CPU scheduler setting.

Kernel Commits with "Fixes" tag

2018-03-23T00:26:00.000+00:00

Over the past 5 years there has been a steady increase in the number of kernel bug fix commits that use the "Fixes" tag. Kernel developers use this annotation on a commit to reference an older commit that originally introduced the bug, which is obviously very useful for bug tracking purposes. What is interesting is that there has been a steady take-up of developers using this annotation:

With the 4.15 release, 1859 of the 16223 commits (11.5%) were tagged as "Fixes", so that's a fair amount of work going into bug fixing. I suspect there are more commits that are bug fixes, but aren't using the "Fixes" tag, so it's hard to tell for certain how many commits are fixes without doing deeper analysis. Probably over time this tag will be widely adopted for all bug fixes and the trend line will level out and we will have a better idea of the proportion of commits per release that are just devoted to fixing issues. Let's see how this looks in another 5 years time, I'll keep you posted!

Linux Kernel Module Growth

2018-02-21T16:46:00.005+00:00

The Linux kernel grows at an amazing pace, each kernel release adds more functionality, more drivers and hence more kernel modules. I recently wondered what the trend was for kernel module growth per release, so I performed module builds on kernels v2.6.24 through to v4.16-rc2 for x86-64 to get a better idea of growth rates:

..as one can see, the rate of growth is relatively linear with about 89 modules being added to each kernel release, which is not surprising as the size of the kernel is growing at a fairly linear rate too. It is interesting to see that the number of modules has easily more than tripled in the 10 years between v2.6.24 and v4.16-rc2, with a rate of about 470 new modules per year. At this rate, Linux will see the 10,000th module land in around the year 2025.