Thursday, 5 September 2013

health-check: a tool to diagnose resource usage

Recently I have been focused on ways to reduce power consumption on Ubuntu phones. To identify resource hungry issues one can use tools such as top, strace and gdb, or inspect per process properties in /proc/$pid, however this can be time consuming and not practical with many processes in a system. Instead, I decided it would be profitable to write a tool to inspect and monitor a running program and report on areas where it appeared to be sub-optimal.  And so health-check was born.

One provides health-check with a list of one or more processes and it will monitor all the associated threads (and child processes) and then report back on the resources used. Heath-check will report on:
  • CPU utilisation
  • Wakeup events
  • Context Switches
  • File I/O operations (Open/Read/Write/Close using fnotify)
  • System calls (using ptrace)
  • Analysis of polling system calls
  • Memory utilisation (including memory growth)
  • Network connections
  • Wakelock activity
CPU utilisation, wakeup events and context switches are useful just to see how much CPU related activity is occurring.   For example,  a program that has multiple threads that frequently ping-pongs between them will have a high context switch rate, which may indicate that it is busy passing data or messages between threads.  A process may be frequently polling on short timer delays may show up as generating a high level of wakeup events.

Some applications may be sub-optimally writing out data frequently, causing dirty pages and meta data that needs to be written back to the file system.  Health-check will capture file I/O activity and report on the names of the files being opened, read, written and closed.

To help identify excessive or heavy system call usage, health-check uses ptrace to trap and monitor all the system calls that the program makes.  For example, it has been observed that some applications excessively call poll() and nanosleep() with poorly chosen timeouts causing excessive CPU utilisation.  For system calls such as these where they can wait until an event or a timeout occur, health-check has some deeper monitoring.  It inspects the given timeout delay and checks to see if the call timed out, for example,  health-check can identify CPU sucking repeated polling where zero timeouts are being used or excessive nanosleeps with zero or negative delays.

The ptrace ability of heath-check also allow it to monitor per-process wake lock writes. Abuse of wakelocks can keep the a kernel from suspending into deep sleep so it is useful to keep track of wakelock activity on some processes. This is not enabled by default as it is an expensive operation to monitor this via ptrace and also some kernels may not have wakelocks, so one has to use the -W option to enable this.

Health-check also inspects /proc/$pid/smaps and will determine if memory utilisation has grown or shrunk.  Unusually high heap growth over time may indicate that an application has a memory leak.

Finally, health-check will inspect /proc/$pid/fd and from this determine any open sockets and then try and resolve the host names of the IP addresses.  For example, it is entirely possible for an application to be making spurious or unwanted connections to various machines, so it is helpful to check up on this kind of activity.

Health-check is still very early alpha quality, so beware of possible bugs.   However, it has been helpful in identifying some misbehaving applications, so it is already proving to be rather useful.

Source code can be found at: git://kernel.ubuntu.com/cking/health-check

Packages found in my White PPA in ppa:colin-king/white so to install on Ubuntu systems use:

 sudo add-apt-repository ppa:colin-king/white  
 sudo apt-get update  
 sudo apt-get install health-check  
..and go and track down some resource sucking apps..

7 comments:

  1. It would be nice to see a tool like this that was able to submit data to a server and graph apps so these data can be used upstream to optimize apps for desktop, server and phone.

    Mozilla does this with their browser they call it telemetry.

    ReplyDelete
  2. Please add this to Debian so that users of Debian and other Debian derivatives can use it.

    ReplyDelete
    Replies
    1. It is my intention to do so once I am happy with the overall quality of this tool. It still is alpha quality, so I expect quite a few more development iterations.

      Delete
  3. health-check is now in Debian unstable, cf http://packages.qa.debian.org/h/health-check.html

    ReplyDelete