Saturday, 13 October 2012

Intel rdrand instruction revisited

A few months ago I did a quick and dirty benchmark of the Intel rdrand instruction found on the new Ivy Bridge processors.  I did some further analysis a while ago and have only just got around to writing up my findings. I've improved the test by exercising the Intel Digital Random Number Generator (DRNG) with multiple threads, rewriting the rdrand wrapper in assembler and ensuring the code is inlined.  The source code for this test is available here.
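
As an illustration of the kind of inlined assembler wrapper described above, here is a minimal sketch in C. It is not the code from the test itself: the names rdrand_supported, rdrand64 and rdrand64_retry are made up for this example, and it assumes GCC or Clang on x86-64 (other platforms get stubs that report rdrand as unavailable).

```c
#include <stdint.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <cpuid.h>

/* Returns 1 if CPUID leaf 1 reports RDRAND (ECX bit 30), else 0. */
static int rdrand_supported(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ecx >> 30) & 1;
}

/* One rdrand attempt; returns 1 and stores a value if the carry flag was set. */
static inline int rdrand64(uint64_t *out)
{
    uint64_t val;
    unsigned char ok;

    __asm__ volatile("rdrand %0\n\tsetc %1"
                     : "=r"(val), "=qm"(ok)
                     :
                     : "cc");
    *out = val;
    return ok;
}
#else
/* Fallback stubs for other platforms: report that rdrand is unavailable. */
static int rdrand_supported(void) { return 0; }
static inline int rdrand64(uint64_t *out) { (void)out; return 0; }
#endif

/* Retries up to `retries` times; returns 1 on success, 0 if every attempt failed. */
static int rdrand64_retry(uint64_t *out, int retries)
{
    while (retries-- > 0)
        if (rdrand64(out))
            return 1;
    return 0;
}
```

A caller would check rdrand_supported() once at start-up and then call something like rdrand64_retry(&value, 10) in its hot loop; ten retries is the figure commonly quoted from Intel's DRNG software implementation guide, but treat it as a tunable.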

So, how does it shape up?  On an i5-3210M (2.5GHz) Ivy Bridge (2 cores, 4 threads) I get a peak of ~99.6 million 64 bit rdrands per second with 4 threads, which equates to ~6.374 billion bits per second.  Not bad at all.
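
The conversion from reads to bits is simple arithmetic. A small sketch (the function names are hypothetical) spells it out so the figures can be checked:

```c
#include <stdint.h>

/* Bits per second delivered by `reads_per_sec` reads of `bits_per_read` bits each. */
static uint64_t bits_per_second(uint64_t reads_per_sec, unsigned bits_per_read)
{
    return reads_per_sec * bits_per_read;
}

/* The same rate expressed in bytes per second. */
static uint64_t bytes_per_second(uint64_t reads_per_sec, unsigned bits_per_read)
{
    return bits_per_second(reads_per_sec, bits_per_read) / 8;
}
```

For 99,600,000 reads of 64 bits each, bits_per_second() gives 6,374,400,000 (the ~6.374 billion bits per second quoted above) and bytes_per_second() gives 796,800,000, or roughly 797 MB/s.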

With a 4 threaded i5-3210M CPU we hit maximum rdrand throughput with 4 threads.

..and with an 8 threaded i7-3770 (3.4GHz) Ivy Bridge (4 cores, 8 threads) we again hit a peak throughput of 99.6 million 64 bit rdrands a second, this time with 3 threads. One can therefore conclude that this is the peak rate of the DRNG on both CPUs tested.  A 2 threaded i3 Ivy Bridge CPU won't be able to hit the peak rate of the DRNG, and a 4 threaded i5 can only just max out the DRNG with some hand-optimized code.

Now how random is this random data?  There are several tests available; I chose to exercise the DRNG using the dieharder test suite.  The test is relatively simple; install dieharder and do 64 bit rdrand reads and output these as a raw random number stream and pipe this into dieharder:

 sudo apt-get install dieharder  
 ./rdrand-test | dieharder -g 200 -a
#=============================================================================#
#            dieharder version 3.31.1 Copyright 2003 Robert G. Brown          #
#=============================================================================#
   rng_name    |rands/second|   Seed   |
stdin_input_raw|  3.66e+07  | 639263374|
#=============================================================================#
        test_name   |ntup| tsamples |psamples|  p-value |Assessment
#=============================================================================#
   diehard_birthdays|   0|       100|     100|0.40629140|  PASSED  
      diehard_operm5|   0|   1000000|     100|0.79942347|  PASSED  
  diehard_rank_32x32|   0|     40000|     100|0.35142889|  PASSED  
    diehard_rank_6x8|   0|    100000|     100|0.75739694|  PASSED  
   diehard_bitstream|   0|   2097152|     100|0.65986567|  PASSED  
        diehard_opso|   0|   2097152|     100|0.24791918|  PASSED  
        diehard_oqso|   0|   2097152|     100|0.36850828|  PASSED  
         diehard_dna|   0|   2097152|     100|0.52727856|  PASSED  
diehard_count_1s_str|   0|    256000|     100|0.08299753|  PASSED  
diehard_count_1s_byt|   0|    256000|     100|0.31139908|  PASSED  
 diehard_parking_lot|   0|     12000|     100|0.47786440|  PASSED  
    diehard_2dsphere|   2|      8000|     100|0.93639860|  PASSED  
    diehard_3dsphere|   3|      4000|     100|0.43241488|  PASSED  
     diehard_squeeze|   0|    100000|     100|0.99088862|  PASSED  
        diehard_sums|   0|       100|     100|0.00422846|   WEAK   
        diehard_runs|   0|    100000|     100|0.48432365|  PASSED 
..
        dab_monobit2|  12|  65000000|       1|0.98439048|  PASSED 

..and leave to cook for about 45 minutes.  The -g 200 option specifies that the random numbers come from stdin and the -a option runs all the dieharder tests.  All the tests passed with the exception of the diehard_sums test, which produced "weak" results. However, this test is known to be unreliable and its use is not recommended.  Quite honestly, I would be surprised if the tests failed, but you never know until you run them.
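
The rdrand-test program is not shown here, but as described it only has to write raw 64 bit values to stdout. Below is a minimal sketch of such an output loop, under stated assumptions: dump_words, word_source_fn and counter_source are names invented for this example, the 4096-word chunk size is arbitrary, and the counter is only a stand-in so the loop can be tried on any machine (it is not random). Writing in chunks rather than one value at a time keeps the output path from becoming the bottleneck.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define CHUNK_WORDS 4096  /* 32 KiB per fwrite(); an arbitrary choice */

/* Hypothetical callback type: stores one 64 bit value, returns 1 on success. */
typedef int (*word_source_fn)(uint64_t *out);

/* Stand-in source for trying the loop without rdrand: a counter, NOT random. */
static int counter_source(uint64_t *out)
{
    static uint64_t next_value;

    *out = next_value++;
    return 1;
}

/* Writes up to total_words raw 64 bit values to `out` in chunks.
 * Returns the number of words actually written. */
static size_t dump_words(FILE *out, word_source_fn next, size_t total_words)
{
    static uint64_t buf[CHUNK_WORDS];
    size_t written = 0;

    while (written < total_words) {
        size_t want = total_words - written;
        size_t n = 0;
        size_t w;

        if (want > CHUNK_WORDS)
            want = CHUNK_WORDS;
        while (n < want && next(&buf[n]))
            n++;
        w = fwrite(buf, sizeof buf[0], n, out);
        written += w;
        if (w < want)
            break;  /* the source or the output stream failed */
    }
    return written;
}
```

A real test program would pass stdout and a function that wraps rdrand instead of counter_source, loop for as long as dieharder keeps reading, and be piped into dieharder -g 200 -a as shown above.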

The CAcert research labs have an online random number generator analysis website allowing one to submit and test at least 12 MB of random numbers. I submitted 32 MB of data, and I am currently waiting to see if I get any results back.  Watch this space.

11 comments:

  1. You can rest assured that I ran the DRNG's output through dieharder a few times during its development. The thing with dieharder is that on perfectly random data, it will randomly throw up some 'weak' results. Do it a second time and you will get a different set of weak results. If you take enough P values, some of them will land in the margins.
    6.374GBits/s is 796.75MBytes/s which is closer to the theoretical maximum of 800MBytes/s than I achieved (I got about 780). So well done.

    Replies
    1. Can confirm.

      I'm just working on improving dieharder. The problem of random WEAKs is due to the multiple testing problem (https://github.com/rurban/dieharder/issues/6), which dieharder does not mitigate by default. Therefore it is currently recommended to use -Y1 in all cases when WEAK results are returned. This retries such WEAK results with a different seed for some time and checks whether they improve. A better idea would be to subtract the expected number of bad p-values, the alpha, or to check for outliers, as we do with smhasher.

      I also got crazily slow rdrand benchmarks on my AMD Ryzen 3, with 72 ints/second, not 36000 as expected on Intel. But I'm really torturing rdrand, without any buffer.

      I also got extremely bad test results with dieharder. Intel uses a good AES-CBC-MAC; I wonder what AMD uses. https://software.intel.com/content/www/us/en/develop/articles/intel-digital-random-number-generator-drng-software-implementation-guide.html?wapkw=rdrand64_step

  2. I'm sure with some careful optimization I can get some more performance out of the test rig I hacked up.

  3. On my Core i7-3520M (on same X230) I top at ~42M rdrand/s at 4 threads.

    It seems you didn't commit rdrand-test on the same repository, I'd be interested too.

    Replies
    1. rdrand-test simply dumped the rdrand64() reads to stdout, it's quite trivial.

  4. Good point, with this (gross) patch I manage to get ~10MB/s on that X230, but doing the write every 64 bits is quite a bad idea :)

    http://paste.debian.net/204718/

  5. Looks from the graph like, until the RNG bottleneck is hit at 99.6 MHz, it's taking 9 cycles per read.

  6. On IVB, 800MiBytes/s is the limit (100MHz x 64bits), imposed by the bus local to the DRNG. From my perspective, the CPU core is way over there on the other side of the chip. I don't care how fast it is, it's not getting more than 800MiB/s :)

  7. There is a bug on some Ivy Bridge processors that causes rdrand to signal an illegal instruction exception. My new laptop has that problem and I am not happy. I would actually have a very particular use for that instruction.

    Replies
    1. This is described in note BV54 in http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/3rd-gen-core-desktop-specification-update.pdf

  8. I have a question regarding practical use of rdrand: is it better to use rdrand instead of the C rand() function? Will I get better performance? I need this for simple, non-critical apps like games and animations.
