Sunday 31 January 2016

Pagemon improvements

Over the past month I've been finding the odd moments [1] to add some small improvements and fix a few bugs to pagemon (a tool to monitor process memory).  The original code went from a sketchy proof of concept prototype to a somewhat more usable tool in a few weeks, so my main concern recently was to clean up the code and make it more efficient.

With the use of tools such as valgrind's cachegrind and perf I was able to work on some of the code hot-spots [2] and reduce it from ~50-60% CPU down to 5-9% CPU utilisation on my laptop, so it's definitely more machine friendly now.  In addition I've added the following small features:
  • Now one can specify the name of a process to monitor as well as the PID.  This also allows one to run pagemon on itself(!), which is a bit meta.
  • Perf events showing Page Faults and Kernel Page Allocates and Frees, toggled on/off with the 'p' key.
  • Improved and snappier clean up and exit when a monitored process exits.
  • Far more efficient page map reading and rendering.
  • Out of Memory (OOM) scores added to VM statistics window.
  • Process activity (busy, sleeping, etc) to VM statistics window.
  • Zoom mode min/max with '[' (min) and ']' (max) keys.
  • Close pop-up windows with key 'c'.
  • Improved handling of rapid map expansion and shrinking.
  • Jump to end of map using 'End' key.
  • Improve the man page.
I've tried to keep the tool small and focused and I don't want feature bloat to make it unwieldy and overly complexed.  "Do one job, and do it well" is the philosophy behind pagemon. At just 1500 lines of C, it is as complex as I want it to be for now.

Version 0.01.08 should be hitting the Ubuntu 16.04 Xenial Xerus archive in the next 24 hours or so.  I have also the lastest version in my PPA (ppa:colin-king/pagemon) built for Trusty, Vivid, Wily and Xenial.


Pagemon is useful for spotting unexpected memory activity and it is just interesting watching the behaviour memory hungry processes such as web-browsers and Virtual Machines.

Notes:
[1] Mainly very late at night when I can't sleep (but that's another story...).  The git log says it all.
[2] Reading in /proc/$PID/maps and efficiently reading per page data from /proc/$PID/pagemap

Thursday 28 January 2016

Forcing out bugs with stress-ng

stress-ng logo
Over the past few months I've been adding several new stress tests and a lot more stressor options to stress-ng for Ubuntu 16.04 Xenial Xerus.  I try to track new system calls and features landing in the kernel and where appropriate add a stress test to try and force out bugs.

Stress-ng has found various kernel bugs, such as CVE-2015-1333 and LP:#1526811 as well as bugs in user space (for example, daemons crashing) when memory pressure is very high.  Simple abusive tricks, such as aggressively trying to allocate every free page in memory are useful in finding drivers that don't necessary check for memory allocation failures.  For example, today I was caught out when a USB ethernet dongle driver didn't check for a null pointer due to an allocation failure and stress-ng ended up triggering a kernel oops (fortunately, this bug was fixed in a recent kernel).

The underlying philosophy for stress-ng is "use and abuse standard Linux interfaces and see how far we can push them to destruction".  I'm pretty sure there are plenty of creative folk out there who can dream up dastardly ways to make stress-ng even more stressy, so contributions are always warmly accepted!  I have a mirrorred copy of the git repository on github to make it easy for developers to get their hands on the code.

We've been using stress-ng on ARM based SoC kernels to force out bugs and this has been useful in finding areas where non-swap based systems break. You really don't want your kernel oopsing or processes segfaulting when a IoT device has run low on memory.

My original intent for stress-ng was just to make a system run hot and force thermal overruns. However, I soon discovered it is useful to force kernel bugs out by attempting to (pathologically) thrash most of the system calls.  I've also added perf stats to stress-ng to track performance of standard stress scenarios over kernel versions to get an early warning of any potential performance regressions.  So stress-ng is a bit of a mixed bag of stress tests and performance measuring goodness.

When I get some free time I hope to run stress-ng against a GCOV instrumented kernel at see how much test coverage I get on a kernel. I suspect there are a lot of core kernel functionality still not being touched by stress-ng.

I've also tried to make stress-ng portable, so it can build fine on GNU/Hurd and Debian kFreeBSD (with Linux specific tests not built-in of course). It also contains some architecture specific features, such as handling the data and instruction cache as well as the x86 rdrand instruction and cache line locking. If there are any ARM specific features than can be stressed I'd like to know and perhaps implement stressors for them.

Anyhow, I believe stress-ng is almost feature complete for Ubuntu Xenial, however, I expect it to grow in features over time since there is always new functionality landing in the Linux kernel that needs to be thrashed tested.

Friday 8 January 2016

FIXME and TODO comments in the Linux kernel source

While looking at some code in the Linux Kernel this morning I spotted a few FIXME comments and that got me wondering just how many there are in the source code.  After a quick grep I found nearly 4200 in v4.4.0-rc8 and that got me thinking about other similar comment tags such as TODO that are in the source and how this has been changing over time.


So the trends are certainly upwards, but then again, so is the size of the kernel source:

Note: Data gathered using sloccount on the lines of C in the kernel source.

Using the sloccount data I then calculated the number of FIXME and TODOs per 1000 lines of code to see what the underlying trend is:

So FIXMEs are actually dropping in relative terms to the size of the kernel where as TODOs are increasing.

Of course, these statistics are bogus because it is dependent on kernel developers adding and removing FIXMEs and TODOs in a consistent manner, however, it is interesting to see how many comments exist and hence how much work has been tagged in comments as work to be done later. I wonder how this compares to other large open source projects.