There are many different classes of bugs and hence many different debugging techniques can be used. Sometimes there is a lot of complexity in a problem and it is hard to make a mental model of what exactly is going on between multiple interdependent components, especially when non-deterministic behaviour is occurring.
Unfortunately the human mind is limited; there is only so much debugging state that it can follow. The problem is compounded when a bug manifests itself after several hours or days of continuous execution time - there just seems to be too much state to track and make sense of.
Looking at thousands of lines of debug trace and trying to spot any lurking evidence that may offer some hint as to why code is failing is not trivial. However, the brain is highly developed at making sense of visual input, so it makes sense to visualise copious amounts of debug data to spot anomalies.
The kernel offers many ways to gather data, be it via different trace mechanisms or just by using plain old printk(). The steps to debug are thus:
1. Try and find a way to reliably reproduce the problem to be debugged.
2. Where possible, try to remove complexity from the problem and still end up with a reliable way of reproducing the problem.
3. Rig up a way of collecting internal state or data on system activity. Obviously the type of data to be collected is dependent on the type of problem being worked on. Be careful that any instrumentation does not affect the behaviour of the system.
4. Start collecting data and try to reproduce the bug. You may have to do this multiple times to collect enough data to easily spot trends over different runs.
5. Visualise the data.
Iterating on steps 2 to 5 allows one to keep on refining a problem down to the bare minimum required to corner a bug.
Now step 5 is the fun part. Sometimes one has to lightly parse the output to collect specific items of data. I generally use tools such as awk to extract specific fields or to re-munge data into a format that can be easily processed by graphing tools. It can be useful to also collect time stamps with the data as some bugs are timing dependent and seeing interactions between components is key to understanding why issues occur. If one gathers multiple sets of data from different sources then being able to correlate the data on a timestamp can be helpful.
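As a rough illustration of this kind of light parsing, here is a minimal Python sketch of the job awk would normally do. It assumes each line starts with a printk-style timestamp in square brackets and carries a `name=value` field; the `mydrv` messages and the `latency_us` field are made up purely for the example.

```python
import re

# Assumed line shape: "[  123.456789] some text key=value ..."
# This mimics printk timestamps; the "latency_us" field is hypothetical.
LINE_RE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<msg>.*)$")


def extract_field(lines, field):
    """Return (timestamp, value) pairs for lines that carry field=<number>."""
    field_re = re.compile(r"\b" + re.escape(field) + r"=(-?\d+(?:\.\d+)?)")
    samples = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines without a timestamp prefix
        f = field_re.search(m.group("msg"))
        if f:
            samples.append((float(m.group("ts")), float(f.group(1))))
    return samples


# Made-up trace lines, only to exercise the parser.
example_log = [
    "[   10.000100] mydrv: irq handled latency_us=42",
    "[   10.000900] unrelated message",
    "[   10.002300] mydrv: irq handled latency_us=1870",
    "garbage line without timestamp",
]

samples = extract_field(example_log, "latency_us")
for ts, value in samples:
    print(f"{ts:.6f}\t{value:g}")
```

Keeping the timestamp next to every value is what later makes it possible to line up data from different sources on a common time axis.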
If I have just a few tens or hundreds of items of data to visualise I generally just collate my data into tab or comma separated output, import it into LibreOffice Calc and then generate graphs from the raw data. However, for more demanding graphing I normally resort to using gnuplot. Gnuplot is an incredibly powerful plotting tool - however for more complex graphs one often needs to delve deep into the manual and perhaps crib from the very useful worked examples.
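Collating data into tab separated output can be sketched in a few lines of Python. This is only one possible approach; the sample values, the column names and the choice to include a header row are assumptions for the example, and the header can be dropped if the graphing tool does not expect one.

```python
import csv
import io


def to_tsv(samples, header=("time_s", "value")):
    """Render (timestamp, value) pairs as tab-separated text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    for ts, value in samples:
        # Fixed-width timestamps keep the column easy to sort and compare.
        writer.writerow([f"{ts:.6f}", f"{value:g}"])
    return buf.getvalue()


# Hypothetical samples, e.g. (seconds since boot, measured value).
samples = [(10.0001, 42.0), (10.0023, 1870.0)]
tsv_text = to_tsv(samples)
print(tsv_text, end="")
```

Redirecting the printed text into a file gives something that a spreadsheet can import directly.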
Graphing data allows one to easily spot trends or correlations between seemingly unrelated parts of a system. What was originally an overwhelmingly huge mass of debug data turns into a powerful resource. Sometimes I find it useful to run multiple tests over a range of tweaked kernel tunables to see if bugs change behaviour - often this aids understanding when there are significant amounts of inter-component complexity occurring.
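Before graphing a sweep over a tunable, it can help to reduce each run to a few summary numbers. The following Python sketch shows one way to do that; the tunable settings and the latency figures are invented purely for illustration and do not come from any real system.

```python
from statistics import mean

# Hypothetical results: tunable setting -> latency samples in microseconds.
# All numbers here are made up for the example.
runs = {
    10: [40, 42, 45, 41],
    20: [43, 44, 400, 42],
    60: [41, 950, 1200, 43],
}


def summarise(runs):
    """Return (setting, sample count, mean, worst case) rows, sorted by setting."""
    rows = []
    for setting in sorted(runs):
        values = runs[setting]
        rows.append((setting, len(values), mean(values), max(values)))
    return rows


summary = summarise(runs)
for setting, count, avg, worst in summary:
    print(f"{setting}\t{count}\t{avg:.1f}\t{worst}")
```

Plotting the mean and the worst case against the setting makes it easier to see whether a bug gets better or worse as the tunable changes.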
Perhaps it is just the way I like to think about issues, but I do recommend experimenting with collecting large data sets and creatively transforming the data into visual representations to allow one to easily spot issues. It can be surprising just how much one can glean from thousands of seemingly unrelated traces of debug data.
Excellent post!
Hi, I ran into your post while looking what things people use for visualization of program performance, and allow me the non-modesty of suggesting to take a look at two tools of my own: http://jkff.info/software/timeplotters/ - they are specifically designed for visualizing program performance from logs, and I dare say they are dramatically better at it than gnuplot. In fact, I wrote them because I realized that gnuplot isn't going to give me the smooth and productive workflow I need for performance visualization and debugging.
Please take a look at the tools and tell me what you think :)