Joachim Strömbergson <joachim@secworks.se> writes:
By vectorizing you mean running quarterrounds in parallel?
I mean putting several uint32_t values in a simd register, and using simd instructions.
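A minimal sketch of what that means in C, assuming GCC or Clang vector extensions rather than hand-written assembly. The type name v4u32 and the helper add_xor4 are made up for illustration and are not taken from any existing code:

```c
#include <stdint.h>
#include <string.h>

/* Four uint32_t lanes packed into one 128-bit vector.
   vector_size is a GCC/Clang extension, not standard C. */
typedef uint32_t v4u32 __attribute__((vector_size(16)));

/* Hypothetical helper: computes (a + b) ^ c on four lanes at once.
   On a target with 128-bit SIMD, the add and the xor can each be
   compiled to a single vector instruction. */
static void
add_xor4(uint32_t dst[4], const uint32_t a[4],
         const uint32_t b[4], const uint32_t c[4])
{
  v4u32 va, vb, vc, r;

  /* memcpy avoids alignment and aliasing problems when loading. */
  memcpy(&va, a, sizeof(va));
  memcpy(&vb, b, sizeof(vb));
  memcpy(&vc, c, sizeof(vc));

  r = (va + vb) ^ vc;   /* lane-wise add (mod 2^32), then lane-wise xor */

  memcpy(dst, &r, sizeof(r));
}
```

Each lane behaves like an independent uint32_t, including wraparound on overflow.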
Have you looked at the asm code by DJB?
Not really, I find the generated assembly pretty hard to read, and I haven't tried to understand his qhasm tool.
He does up to four blocks in parallel and does some tricks with the shifts. xmm-5 should be relevant.
To me, it looks like all rotates are done with psrld + pslld, but I might be missing something. On the few machines where I have benchmarked the code (I haven't been very systematic), pshufhw + pshuflw seems to be slightly faster. It saves one por instruction.
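To make the comparison concrete, here is a sketch of both variants using SSE2 intrinsics. It assumes an x86 target with SSE2 and assumes the rotation in question is by 16 bits (the shuffle trick only works for rotation counts that are a multiple of 16). The function names are made up for illustration:

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _MM_SHUFFLE */
#include <stdint.h>

/* Rotate each 32-bit lane left by 16 with two shifts and an or
   (pslld + psrld + por). */
static __m128i
rotl16_shift(__m128i x)
{
  return _mm_or_si128(_mm_slli_epi32(x, 16), _mm_srli_epi32(x, 16));
}

/* The same rotation, done by swapping the two 16-bit halves of each
   32-bit lane (pshuflw for the low 64 bits, pshufhw for the high
   64 bits). No por is needed. */
static __m128i
rotl16_shuffle(__m128i x)
{
  x = _mm_shufflelo_epi16(x, _MM_SHUFFLE(2, 3, 0, 1));
  return _mm_shufflehi_epi16(x, _MM_SHUFFLE(2, 3, 0, 1));
}
```

Both functions should give identical results for any input; which one is faster depends on the machine.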
I'm pretty sure doing a couple of blocks at a time in parallel, interleaving the instructions, will give some speedup.
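A rough source-level sketch of the interleaving idea, assuming a ChaCha-style quarterround (rotation counts 16, 12, 8, 7) as the unit of work. The names qround and qround_x2 are made up for illustration. A C compiler is free to reschedule these statements anyway, so any real gain would more likely come from doing the same thing in hand-written assembly:

```c
#include <stdint.h>

#define ROTL32(x, n) (((x) << (n)) | ((x) >> (32 - (n))))

/* One ChaCha-style quarterround on a single block's four words.
   Every step depends on the previous one, so the dependency chain
   is long and leaves execution units idle. */
static void
qround(uint32_t s[4])
{
  s[0] += s[1]; s[3] ^= s[0]; s[3] = ROTL32(s[3], 16);
  s[2] += s[3]; s[1] ^= s[2]; s[1] = ROTL32(s[1], 12);
  s[0] += s[1]; s[3] ^= s[0]; s[3] = ROTL32(s[3], 8);
  s[2] += s[3]; s[1] ^= s[2]; s[1] = ROTL32(s[1], 7);
}

/* The same quarterround on two independent blocks p and q, with the
   operations interleaved. The p and q steps do not depend on each
   other, so a superscalar CPU can execute them side by side.
   p and q must not overlap. */
static void
qround_x2(uint32_t p[4], uint32_t q[4])
{
  p[0] += p[1];             q[0] += q[1];
  p[3] ^= p[0];             q[3] ^= q[0];
  p[3] = ROTL32(p[3], 16);  q[3] = ROTL32(q[3], 16);

  p[2] += p[3];             q[2] += q[3];
  p[1] ^= p[2];             q[1] ^= q[2];
  p[1] = ROTL32(p[1], 12);  q[1] = ROTL32(q[1], 12);

  p[0] += p[1];             q[0] += q[1];
  p[3] ^= p[0];             q[3] ^= q[0];
  p[3] = ROTL32(p[3], 8);   q[3] = ROTL32(q[3], 8);

  p[2] += p[3];             q[2] += q[3];
  p[1] ^= p[2];             q[1] ^= q[2];
  p[1] = ROTL32(p[1], 7);   q[1] = ROTL32(q[1], 7);
}
```

Calling qround_x2(p, q) should leave p and q in the same state as calling qround(p) followed by qround(q).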
Regards, /Niels