After being prodded by Jonas Walldén, who showed me how to speed up the C implementation of arcfour by 50%, I've spent a few hours micro-optimizing arcfour and sha1.
Benchmark figures below are on my P4 laptop. First, arcfour:
  Original C version:  47 MB/s
  Improved C version:  43 MB/s [1]
  x86 assembler:      160 MB/s
[1] So this is actually a slowdown on my machine. Jonas reports a 50% speedup on a G4, and a smallish x86 speedup too. I think the new code should be faster on all CPUs with a decent number of registers.
For sha1, the difference is smaller:
  Original C version:  64 MB/s
  x86 assembler:       80 MB/s
Now, I'm by no means an x86 guru. I've tried to get a small instruction count, and to fit as many variables as possible into registers. I haven't attempted any clever instruction scheduling. So if anybody with more x86 experience could have a look at the code, that would be much appreciated.
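For anyone who hasn't looked at arcfour closely, here is a rough C sketch of the kind of inner loop being discussed. It is not the nettle code and not Jonas's patch; the struct and function names (rc4_sketch_ctx and friends) are made up for illustration. It assumes one common approach: copy the two index bytes into locals for the duration of the loop so the compiler can keep them in registers, and write them back at the end. The self-test uses the widely published "Key"/"Plaintext" test vector.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical context layout, for illustration only. */
struct rc4_sketch_ctx {
  uint8_t S[256];   /* permutation table */
  uint8_t i, j;     /* the two running indices */
};

/* Standard arcfour key schedule. key_len must be non-zero. */
static void
rc4_sketch_set_key(struct rc4_sketch_ctx *ctx,
                   size_t key_len, const uint8_t *key)
{
  unsigned i, j = 0;

  for (i = 0; i < 256; i++)
    ctx->S[i] = (uint8_t) i;

  for (i = 0; i < 256; i++) {
    uint8_t t = ctx->S[i];
    j = (j + t + key[i % key_len]) & 0xff;
    ctx->S[i] = ctx->S[j];
    ctx->S[j] = t;
  }
  ctx->i = ctx->j = 0;
}

/* XOR length bytes of src with the key stream into dst. */
static void
rc4_sketch_crypt(struct rc4_sketch_ctx *ctx, size_t length,
                 uint8_t *dst, const uint8_t *src)
{
  /* Work on local copies; the context is only touched again at the end. */
  uint8_t *S = ctx->S;
  unsigned i = ctx->i;
  unsigned j = ctx->j;

  while (length--) {
    unsigned si, sj;

    i = (i + 1) & 0xff;
    si = S[i];
    j = (j + si) & 0xff;
    sj = S[j];

    /* Swap S[i] and S[j], reusing the values already loaded. */
    S[i] = (uint8_t) sj;
    S[j] = (uint8_t) si;

    *dst++ = *src++ ^ S[(si + sj) & 0xff];
  }

  /* Write the indices back so a later call continues the stream. */
  ctx->i = (uint8_t) i;
  ctx->j = (uint8_t) j;
}

/* Returns 1 if the sketch reproduces the "Key"/"Plaintext" vector. */
static int
rc4_sketch_selftest(void)
{
  static const uint8_t expect[9] = {
    0xBB, 0xF3, 0x16, 0xE8, 0xD9, 0x40, 0xAF, 0x0A, 0xD3
  };
  struct rc4_sketch_ctx ctx;
  uint8_t out[9];

  rc4_sketch_set_key(&ctx, 3, (const uint8_t *) "Key");
  rc4_sketch_crypt(&ctx, 9, out, (const uint8_t *) "Plaintext");
  return memcmp(out, expect, sizeof(expect)) == 0;
}
```

Whether the locals actually end up in registers depends on the compiler and on how many registers the CPU has to spare, which may be part of why the C numbers differ so much between machines.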
Source code can be viewed at
  http://cvs.lysator.liu.se/viewcvs/viewcvs.cgi/lsh/src/nettle/x86/sha1-compre...
  http://cvs.lysator.liu.se/viewcvs/viewcvs.cgi/lsh/src/nettle/x86/arcfour-cry...
If you find the m4 macrology hard to read, you can check out the code from cvs (following the instructions on the nettle or lsh homepages), build it, and then look at the sha1-compress.s file, which is the result of sending sha1-compress.asm through m4.
Happy hacking,
/Niels
nettle-bugs@lists.lysator.liu.se