Wednesday, 29 July 2009

Faster bzip2 compression

Compressing files with bzip2 can take a while, especially at the highest compression setting, even on a fast processor. Most modern machines have more than one CPU core, so why not use all the available cycles for compression?

This is where pbzip2 comes to the rescue. It is a re-implementation of bzip2 that uses pthreads to parallelize compression on SMP machines. It promises "near linear speedup on SMP machines", so I thought I'd check this out.

To install pbzip2, use:

sudo apt-get install pbzip2

I started with a fairly typical 187MB tar archive of an Evolution mail directory. I wanted to see how bzip2 and pbzip2 compare at the -9 (best compression) setting.

bzip2: 53.8 seconds, 117339200 bytes.
pbzip2: 31.4 seconds, 117419551 bytes.

(The CPU was an Intel(R) Core(TM)2 Duo CPU T5800 @ 2.00GHz)
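
For anyone who wants to repeat this kind of comparison, here is a rough sketch of how it could be scripted. It is not necessarily how the timings above were taken: the bench helper and the generated sample file are my own stand-ins, and date +%s only resolves whole seconds, so the shell's time keyword is a better fit for short runs.

```shell
#!/bin/sh
# bench TOOL LEVEL FILE: compress FILE with TOOL at LEVEL, report time and size.
# "bench" is a hypothetical helper for illustration, not part of bzip2 or pbzip2.
bench() {
    tool=$1; level=$2; file=$3
    # Skip a tool that is not installed rather than fail.
    if ! command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: not installed, skipping"
        return 0
    fi
    start=$(date +%s)
    # -c writes to stdout, so the input file is left in place.
    "$tool" "$level" -c "$file" > "$file.$tool.bz2"
    end=$(date +%s)
    size=$(wc -c < "$file.$tool.bz2" | tr -d ' ')
    echo "$tool: $((end - start)) seconds, $size bytes"
    rm -f "$file.$tool.bz2"
}

# A small generated file stands in for the real tar archive.
sample=$(mktemp)
awk 'BEGIN { for (i = 0; i < 200000; i++) print "line", i, "of sample data" }' > "$sample"

for t in bzip2 pbzip2; do
    bench "$t" -9 "$sample"
done
rm -f "$sample"
```

Swap the generated sample for a real archive to get meaningful numbers; a file this small compresses too quickly to show any difference.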

So pbzip2 was not quite twice as fast (about 1.7 times), and the resulting file was just 0.068% larger. So there is a little bit of overhead going on, but all in all it's a very good result.

I repeated the test with the -1 (fastest compression) setting:

bzip2: 43.8 seconds, 120885523 bytes
pbzip2: 22.9 seconds, 120915716 bytes

So with the lower compression pbzip2 is almost twice as fast and the file is only 0.025% larger - almost no difference in size.
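
The speedup and size figures for the two laptop runs can be checked with a little awk arithmetic. This is just a sketch; the function names speedup and overhead_pct are made up for the example.

```shell
#!/bin/sh
# speedup SERIAL_SECONDS PARALLEL_SECONDS: how many times faster the second run was.
speedup() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# overhead_pct BASE_BYTES OTHER_BYTES: percentage by which the second file is larger.
overhead_pct() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.3f\n", (b - a) / a * 100 }'
}

# -9 run: bzip2 vs pbzip2
speedup 53.8 31.4                  # prints 1.71
overhead_pct 117339200 117419551   # prints 0.068

# -1 run: bzip2 vs pbzip2
speedup 43.8 22.9                  # prints 1.91
overhead_pct 120885523 120915716   # prints 0.025
```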

Next, I tested pbzip2 on a Quad Xeon server (Intel(R) Xeon(R) CPU X5350 @ 2.66GHz) on a copy of a 644MB tar'd Jaunty kernel git repository:

bzip2: 1 minute 58.4 seconds, 376813785 bytes
pbzip2: 38.2 seconds, 377014857 bytes

That is about 3.1 times faster: not quite 1/4 of the time with 4 CPUs, so there is some scheduling overhead going on.
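
To put a number on how far this is from the ideal, the speedup can be divided by the CPU count to give a parallel efficiency. A small sketch, assuming 4 CPUs as stated and using a made-up helper name parallel_efficiency:

```shell
#!/bin/sh
# parallel_efficiency SERIAL_SECONDS PARALLEL_SECONDS NCPUS
# Efficiency is speedup divided by CPU count; 100% would be perfectly linear scaling.
parallel_efficiency() {
    awk -v t1="$1" -v tn="$2" -v n="$3" 'BEGIN {
        s = t1 / tn
        printf "speedup %.2fx, efficiency %.1f%%\n", s, s / n * 100
    }'
}

# 1 minute 58.4 seconds = 118.4 seconds for bzip2, 38.2 seconds for pbzip2
parallel_efficiency 118.4 38.2 4   # prints: speedup 3.10x, efficiency 77.5%
```

A perfectly linear result on 4 CPUs would have been 118.4 / 4 = 29.6 seconds.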

More in-depth benchmarks can be found here.

There we have it; pbzip2 works very well on SMP machines. All we now need is a parallel version of bunzip2...

References: http://compression.ca/pbzip2
