(Micro) optimizations. - nettle-bugs

6 Feb 2004


After being prodded by Jonas Walldén, who showed me how to speed up
the C implementation of arcfour by 50%, I've been spending a few hours
on micro-optimizing arcfour and sha1.

Benchmark figures below are on my P4 laptop. First, arcfour:

  Original C version:   47 MB/s
  Improved C version:   43 MB/s [1]
  x86 assembler:       160 MB/s

[1] So this is actually a slowdown on my machine. Jonas reports a 50%
    speedup on G4, and a smallish x86 speedup too. I think the new
    code should be faster on all CPUs with a decent number of
    registers.
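
To illustrate the kind of change that matters here, below is a minimal C sketch of the arcfour loop that copies the two state indices into locals for the duration of the loop, so that a compiler with enough registers can keep them out of memory. This is not Jonas's patch and not the code in nettle; the names arcfour_sketch_ctx, arcfour_sketch_set_key and arcfour_sketch_crypt are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical context layout, for illustration only. */
struct arcfour_sketch_ctx
{
  uint8_t S[256];
  uint8_t i;
  uint8_t j;
};

/* Standard arcfour key setup. Assumes length > 0. */
static void
arcfour_sketch_set_key(struct arcfour_sketch_ctx *ctx,
                       size_t length, const uint8_t *key)
{
  unsigned i, j;

  for (i = 0; i < 256; i++)
    ctx->S[i] = (uint8_t) i;

  for (i = 0, j = 0; i < 256; i++)
    {
      uint8_t t = ctx->S[i];
      j = (j + t + key[i % length]) & 0xff;
      ctx->S[i] = ctx->S[j];
      ctx->S[j] = t;
    }
  ctx->i = ctx->j = 0;
}

/* Encrypts or decrypts length bytes from src to dst (may be in place). */
static void
arcfour_sketch_crypt(struct arcfour_sketch_ctx *ctx,
                     size_t length, uint8_t *dst, const uint8_t *src)
{
  /* Work on local copies, so the indices need not be reloaded from
     and stored back to the context on every iteration. */
  unsigned i = ctx->i;
  unsigned j = ctx->j;
  uint8_t *S = ctx->S;

  while (length--)
    {
      unsigned si, sj;

      i = (i + 1) & 0xff;
      si = S[i];
      j = (j + si) & 0xff;
      sj = S[j];

      /* Swap S[i] and S[j], using the values already loaded. */
      S[i] = (uint8_t) sj;
      S[j] = (uint8_t) si;

      *dst++ = *src++ ^ S[(si + sj) & 0xff];
    }

  /* Write the indices back once, after the loop. */
  ctx->i = (uint8_t) i;
  ctx->j = (uint8_t) j;
}
```

With the key "Key" and the input "Plaintext", this should produce the widely quoted arcfour test vector BB F3 16 E8 D9 40 AF 0A D3. Whether the local copies actually end up in registers is up to the compiler and the CPU, which is consistent with the mixed results above.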

For sha1, the difference is smaller:

  Original C version:   64 MB/s
  x86 assembler:        80 MB/s

Now, I'm by no means an x86 guru. I've tried to get a small
instruction count, and to fit as many variables as possible into
registers. I haven't attempted to do any clever instruction
scheduling. So if anybody with more x86 experience could have a look
at the code, that would be much appreciated.

Source code can be viewed at
http://cvs.lysator.liu.se/viewcvs/viewcvs.cgi/lsh/src/nettle/x86/sha1-compre...
http://cvs.lysator.liu.se/viewcvs/viewcvs.cgi/lsh/src/nettle/x86/arcfour-cry...

If you find the m4 macrology hard to read, you can check out the code
from cvs (following the instructions on the nettle or lsh homepages),
build it, and then look at the sha1-compress.s file, which is the
result of sending sha1-compress.asm through m4.

Happy hacking,
/Niels