Re: chacha assembly

30 Jan 2014


Joachim Strömbergson <joachim@secworks.se> writes:
...
> By vectorizing you mean running quarterrounds in parallel?
I mean putting several uint32_t values in a simd register, and using
simd instructions.
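
A minimal sketch of that idea, assuming x86 SSE2 intrinsics from
<emmintrin.h>. The helper name add_xor_4lanes is hypothetical and is not
taken from any existing implementation. It performs the first quarterround
step (a += b; d ^= a) on four 32-bit lanes at once:

```c
#include <emmintrin.h>  /* SSE2 intrinsics (assumes an x86 target) */
#include <stdint.h>

/* Hypothetical helper: for i = 0..3, a[i] += b[i]; d[i] ^= a[i].
   Each array of four uint32_t values fills one 128-bit register. */
static void add_xor_4lanes(uint32_t a[4], const uint32_t b[4], uint32_t d[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vd = _mm_loadu_si128((const __m128i *)d);

    va = _mm_add_epi32(va, vb);   /* paddd: four 32-bit additions */
    vd = _mm_xor_si128(vd, va);   /* pxor: 128-bit xor */

    _mm_storeu_si128((__m128i *)a, va);
    _mm_storeu_si128((__m128i *)d, vd);
}
```

Each lane wraps around modulo 2^32 independently, just like scalar
uint32_t arithmetic.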
...
> Have you looked at the asm code by DJB?
Not really, I find the generated assembly pretty hard to read, and I
haven't tried to understand his qhasm tool.
...
> He does up to four blocks in
> parallel and does some tricks with the shifts. xmm-5 should be relevant.
To me, it looks like all rotates are done with psrld + pslld. But I
might be missing something. On the few machines I have benchmarked the
code (I haven't been very systematic), pshufhw + pshuflw seems to be
slightly faster. It saves one por instruction.
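
To illustrate the two approaches, here is a sketch using SSE2 intrinsics.
It assumes the rotation in question is the 16-bit one, since swapping the
16-bit halves of each 32-bit word is what a pshuflw + pshufhw pair can
do. The function names are hypothetical:

```c
#include <emmintrin.h>  /* SSE2 intrinsics (assumes an x86 target) */
#include <stdint.h>

/* Rotate each 32-bit lane left by 16 with shifts: pslld + psrld + por. */
static __m128i rotl16_shift(__m128i x)
{
    return _mm_or_si128(_mm_slli_epi32(x, 16), _mm_srli_epi32(x, 16));
}

/* Same rotation with shuffles: pshuflw + pshufhw, no por needed.
   _MM_SHUFFLE(2, 3, 0, 1) swaps adjacent 16-bit words. */
static __m128i rotl16_shuffle(__m128i x)
{
    x = _mm_shufflelo_epi16(x, _MM_SHUFFLE(2, 3, 0, 1));
    return _mm_shufflehi_epi16(x, _MM_SHUFFLE(2, 3, 0, 1));
}
```

Both functions return the same value for every input. Which one is faster
depends on the processor, so it is worth benchmarking on the target
machine.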
I'm pretty sure doing a couple of blocks at a time in parallel,
interleaving the instructions, will give some speedup.
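
A rough scalar sketch of what interleaving could look like, in plain C
with hypothetical function names. The steps of the standard ChaCha
quarterround are applied to two independent blocks in alternation, so
that neighbouring operations do not depend on each other:

```c
#include <stdint.h>

#define ROTL32(x, n) (((x) << (n)) | ((x) >> (32 - (n))))

/* Reference: one ChaCha quarterround on the words s[0..3] = a, b, c, d. */
static void quarterround(uint32_t s[4])
{
    uint32_t a = s[0], b = s[1], c = s[2], d = s[3];
    a += b; d ^= a; d = ROTL32(d, 16);
    c += d; b ^= c; b = ROTL32(b, 12);
    a += b; d ^= a; d = ROTL32(d, 8);
    c += d; b ^= c; b = ROTL32(b, 7);
    s[0] = a; s[1] = b; s[2] = c; s[3] = d;
}

/* The same quarterround on two independent blocks, steps interleaved. */
static void quarterround_x2(uint32_t s0[4], uint32_t s1[4])
{
    uint32_t a0 = s0[0], b0 = s0[1], c0 = s0[2], d0 = s0[3];
    uint32_t a1 = s1[0], b1 = s1[1], c1 = s1[2], d1 = s1[3];

    a0 += b0; a1 += b1; d0 ^= a0; d1 ^= a1;
    d0 = ROTL32(d0, 16); d1 = ROTL32(d1, 16);
    c0 += d0; c1 += d1; b0 ^= c0; b1 ^= c1;
    b0 = ROTL32(b0, 12); b1 = ROTL32(b1, 12);
    a0 += b0; a1 += b1; d0 ^= a0; d1 ^= a1;
    d0 = ROTL32(d0, 8); d1 = ROTL32(d1, 8);
    c0 += d0; c1 += d1; b0 ^= c0; b1 ^= c1;
    b0 = ROTL32(b0, 7); b1 = ROTL32(b1, 7);

    s0[0] = a0; s0[1] = b0; s0[2] = c0; s0[3] = d0;
    s1[0] = a1; s1[1] = b1; s1[2] = c1; s1[3] = d1;
}
```

In C the compiler is free to reorder these statements, so the source order
only expresses the intent. In hand-written assembly the interleaving would
be explicit, and whether it actually helps has to be measured.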
Regards,
/Niels
-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
