-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Aloha!
(Sorry for slow response to ChaCha stuff.)
Niels Möller wrote:
I've now added some basic chacha x86_64 assembly. This gives a modest speedup over the code generated by gcc-4.7.2, about 8% on this machine. Apparently, gcc is pretty good at vectorizing this (and there seems to be virtually no difference for salsa20).
By vectorizing, do you mean running quarterrounds in parallel? You should be able to do at least four in parallel (if there are enough registers available). Eight requires pipelining. I've implemented ChaCha with four parallel QRs in HW:
https://github.com/secworks/swchacha
(Which is just anecdotal to this discussion.)
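To make the "four in parallel" point concrete, here is a minimal sketch in plain C, not taken from either implementation discussed in this thread. The names quarterround, column_round, diagonal_round and quarterround_selftest are made up for illustration. The four quarterrounds in a column round (and likewise in a diagonal round) touch disjoint state words, so they have no data dependencies on each other and can run side by side. The self-check uses the quarter-round test vector from RFC 7539, section 2.1.1.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit left rotate, valid for 0 < n < 32. */
#define ROTL32(x, n) (((x) << (n)) | ((x) >> (32 - (n))))

/* One ChaCha quarterround on four words of the 16-word state. */
static void quarterround(uint32_t x[16], int a, int b, int c, int d)
{
    x[a] += x[b]; x[d] ^= x[a]; x[d] = ROTL32(x[d], 16);
    x[c] += x[d]; x[b] ^= x[c]; x[b] = ROTL32(x[b], 12);
    x[a] += x[b]; x[d] ^= x[a]; x[d] = ROTL32(x[d], 8);
    x[c] += x[d]; x[b] ^= x[c]; x[b] = ROTL32(x[b], 7);
}

/* Column round: each call uses a disjoint set of four words, so the
 * four quarterrounds are independent and can execute in parallel. */
static void column_round(uint32_t x[16])
{
    quarterround(x, 0, 4,  8, 12);
    quarterround(x, 1, 5,  9, 13);
    quarterround(x, 2, 6, 10, 14);
    quarterround(x, 3, 7, 11, 15);
}

/* Diagonal round: again four quarterrounds on disjoint words. */
static void diagonal_round(uint32_t x[16])
{
    quarterround(x, 0, 5, 10, 15);
    quarterround(x, 1, 6, 11, 12);
    quarterround(x, 2, 7,  8, 13);
    quarterround(x, 3, 4,  9, 14);
}

/* Self-check against the quarter-round test vector in RFC 7539, 2.1.1. */
static void quarterround_selftest(void)
{
    uint32_t x[16] = {0};
    x[0] = 0x11111111u; x[4] = 0x01020304u;
    x[8] = 0x9b8d6f43u; x[12] = 0x01234567u;
    quarterround(x, 0, 4, 8, 12);
    assert(x[0]  == 0xea2a92f4u);
    assert(x[4]  == 0xcb1cf8ceu);
    assert(x[8]  == 0x4581472eu);
    assert(x[12] == 0x5881c4bbu);
}
```

In a SIMD version one would typically keep each row of the 4x4 state in one vector register, so that a single vector add/xor/rotate performs the same step of all four quarterrounds at once. Going to eight means working on two blocks, which is where the extra registers or pipelining come in.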
I have one question, regarding the different rotation counts in chacha, including 16 and 8. I think I've read that this is supposed to be advantageous on x86_64, but after reviewing the various pshuf* instructions, it's not clear how. I now do these as left shift + right shift + or. Maybe the rotate by 16 bits can be done with pshufhw + pshuflw. Or am I missing some other way to do a rotate on an %xmm register?
Have you looked at the asm code by DJB? He does up to four blocks in parallel and does some tricks with the shifts. xmm-5 should be relevant.
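On the rotate-by-16 question, a small portable C model may help show why a word shuffle is enough. This is only a sketch of the idea, not DJB's code and not real SSE code: the function names are made up, and a 128-bit %xmm register is modelled as four 32-bit lanes viewed as eight 16-bit words. Rotating a 32-bit lane by 16 is the same as swapping its two 16-bit halves, which is the kind of permutation pshuflw (low four words) plus pshufhw (high four words) can express. I believe an immediate of 0xB1 selects the adjacent-word swap, but please check that against the instruction reference.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Generic rotate: left shift + right shift + or. Valid for 0 < n < 32. */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* Model of a rotate by 16 done as a pure shuffle: view the four 32-bit
 * lanes as eight 16-bit words and swap each adjacent pair. Swapping the
 * two halves of a lane is a 16-bit rotate regardless of byte order. */
static void rot16_word_swap(uint32_t v[4])
{
    uint16_t w[8];
    memcpy(w, v, sizeof w);
    for (int i = 0; i < 8; i += 2) {
        uint16_t t = w[i];
        w[i] = w[i + 1];
        w[i + 1] = t;
    }
    memcpy(v, w, sizeof w);
}

/* Check that the shuffle model agrees with shift + shift + or. */
static void rot16_check(void)
{
    uint32_t v[4] = {0x12345678u, 0xdeadbeefu, 0x00000001u, 0xffff0000u};
    uint32_t expect[4];
    for (int i = 0; i < 4; i++)
        expect[i] = rotl32(v[i], 16);
    rot16_word_swap(v);
    for (int i = 0; i < 4; i++)
        assert(v[i] == expect[i]);
}
```

The rotate by 8 is likewise a pure byte permutation within each lane, so on CPUs with SSSE3 a single pshufb should be able to do it. The 12 and 7 bit rotates are not byte aligned and still need shift + shift + or.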
Ah, and chacha seems to be about 15% faster than salsa20.
Which seems to match what DJB claims in the paper. Good.
- -- Med vänlig hälsning, Yours
Joachim Strömbergson - Alltid i harmonisk svängning.
========================================================================
 Joachim Strömbergson          Secworks AB          joachim@secworks.se
========================================================================