Saturday 17 November 2012

Benford's Law with real world data.

If one has a large enough real life source of data (such as the size of files in the file system) and look at the distribution of the first digit of these values then one will find something that at first glance is rather surprising.  The leading digit 1 appears about 30% of the time and as the digits increase to 9 their frequency drops until we reach 9, which only appears about 5% of the time.   This seemingly curious frequency distribution is commonly known as Benford's law or the first digit law.

The probability P of digit d can be expresses as follows:

P(d) = log10(1 + 1 / d)

..where d is any integer value 1 to 9 inclusive. So for each leading digit in the data, the distribution works out to be about:

   Digit   
  Probability
1
0.301
2
0.176
3
0.125
4
0.097
5
0.079
6
0.067
7
0.058
8
0.051
9
0.046

But how does this hold up with some "real world" data? Can it really be true?  Well, for my first experiment, I analysed the leading digit of all the source files in the current Linux source tree and compared that to Benford's Law:


So, this is convincing enough.  How about something more exotic?  For my second experiment I counted up the number of comments in each file that start with /* in just the C source files in the Linux source tree and again looked at the distribution of the leading digits.  I was hazarding a guess that there are a reasonable amount of comments in each file (knowing the way some code is commented this may be pushing my luck).  Anyhow, the data generated also produces a distribution that obeys Benford's Law too:


Well, that certainly shows that Kernel developers are sprinkling enough comments in the Kernel source to be statistically meaningful.  If the comments themselves are meaningful is another matter...

How about one more test?  This time I gathered the length of every executable in /usr/bin and plotted the distribution of the leading digits from this data:


..this data set has far less files to analyse, so the distribution deviates a little, but the trend is still rather good.

As mentioned earlier, one has to have a large set of data for this too work well.  Interesting this may be, but what kind of practical use is it?   It can be applied to accountancy - if one has a large enough set of data in the accounts and the leading digits of the data do not fit Benford's Law then maybe one should suspect that somebody has been fiddling the books.  Humans are rather poor at making up lots of "random" values that don't skew Benford's Law.

One more interesting fact is that it applies even if one rescales the data.  For example, if you are looking at accounts in terms of £ sterling and covert it into US dollars or Albanian Lek the rescaled data still obeys Benford's Law.  Thus if re-ran my tests and didn't analyse the size of files in bytes but instead used size in 512 byte blocks it still would produce a leading digit distribution that obeyed Benford's Law.  Nice.

How can we apply this in computing? Perhaps we could use it to detect tampering with the sizes of a large set of files.  Who knows?  I am sure somebody can think of a useful way to use it.   I just find it all rather fascinating.

1 comment:

  1. Hi Colin,

    Few days ago, I came across a disk I/O scheduler called 'BFQ', which according to its creators, has the ability to speed up program loading times (quite significantly) when the disk is busy, or other disk I/O bound apps are running from the background.

    So I decided to test it out and have been using it for the couple of days in my Ubuntu 12.10.

    I don't know if you had used it before, but I'm extremely impressed by its above mentioned ability, that is to load programs really fast, when other disk I/O hungry apps running from the background.


    I do understand that taking something in like a disk I/O scheduler into an OS is a risky task, and is not something that is simple.

    But it looks so impressive at what it does, and might well worth testing it out (if you haven't done it already, of course).

    Below is a link to one of my lame articles about it, but I would love to see some extensive benchmarks from you guys.

    http://www.hecticgeek.com/2012/11/bfq-loads-programs-extremely-fast-under-heavy-disk-io-workloads-ubuntu/

    Thanks you very much.

    ReplyDelete