[ Repost. It seems that sometimes the list server sends out mail with "lists." dropped from the To:-header. I have no clue why that happens. The correct list address is nettle-bugs@lists.lysator.liu.se. /nisse ]
Now I've benchmarked new and old code (on an old 32-bit x86). The difference was much larger than I expected (numbers are Mbyte/s):
old nettle code:
  blowfish128 ECB encrypt  52.58
  blowfish128 ECB decrypt  55.05
  blowfish128 CBC encrypt  32.18
  blowfish128 CBC decrypt  52.99
libgcrypt code (from Simon):
  blowfish128 ECB encrypt  20.76
  blowfish128 ECB decrypt  19.23
  blowfish128 CBC encrypt  16.58
  blowfish128 CBC decrypt  19.02
libgcrypt code, but with F macro replaced:
  blowfish128 ECB encrypt  32.29
  blowfish128 ECB decrypt  34.05
  blowfish128 CBC encrypt  23.45
  blowfish128 CBC decrypt  33.29
I think I will have to reapply the hacks I did since the old code was copied from gnupg. I suspect another culprit is the set of extra local variables assigned at the top of the most crucial function, do_encrypt:
p = ctx->p; s0 = ctx->s[0]; s1 = ctx->s[1]; s2 = ctx->s[2]; s3 = ctx->s[3];
This creates a lot of extra pressure for the register allocator. And even *if* all the variables fit in registers, the gain is minimal, since the difference between indexing
s0[x] /* Assume s0 in a register */
and indexing
ctx->s[0][x] /* Assume ctx in a register */
is only a constant offset (s is a two-dimensional array, not an array of pointers). The offset should be essentially free for the indexed addressing instructions on machines that have them. And even on a pure load-store machine, adding the offset should be quite cheap, and if there are enough registers, the compiler's general loop-invariant machinery might put the sum in a separate register if appropriate.
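To make the constant-offset point concrete, here is a minimal sketch. The struct demo_ctx layout and all function names are made up for illustration; they are not the actual nettle or libgcrypt definitions, and the only property that matters is that s is a true two-dimensional array inside the context.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the cipher context (an assumption, not the
   real struct). s is a two-dimensional array stored inline, not an
   array of pointers. */
struct demo_ctx
{
  uint32_t p[18];
  uint32_t s[4][256];
};

/* Byte offset of s[k][x] from the start of the context, computed from
   the layout alone. For fixed k, everything except x is a compile-time
   constant. */
static size_t
sbox_offset (unsigned k, unsigned x)
{
  return offsetof (struct demo_ctx, s)
    + ((size_t) k * 256 + x) * sizeof (uint32_t);
}

/* Variant 1: S-box base cached in a local, as in the copied code. */
static uint32_t
lookup_local (const struct demo_ctx *ctx, unsigned x)
{
  const uint32_t *s1 = ctx->s[1];
  return s1[x];
}

/* Variant 2: index through ctx directly; only ctx must stay live. */
static uint32_t
lookup_ctx (const struct demo_ctx *ctx, unsigned x)
{
  return ctx->s[1][x];
}

/* Returns 1 if the address the compiler uses for ctx->s[k][x] is
   exactly the context base plus the constant-plus-index offset. */
static int
offset_is_constant (const struct demo_ctx *ctx, unsigned k, unsigned x)
{
  const char *base = (const char *) ctx;
  return (const char *) &ctx->s[k][x] == base + sbox_offset (k, x);
}

/* Sanity check: both variants read the same word. */
static void
check_lookups_agree (const struct demo_ctx *ctx, unsigned x)
{
  assert (lookup_local (ctx, x) == lookup_ctx (ctx, x));
  assert (offset_is_constant (ctx, 1, x));
}
```

Both lookups read the same word; the second one just folds the constant part of the offset into the address computation instead of occupying a register with s1. Whether that actually helps on a given machine is of course something to benchmark.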
Regards, /Niels