Maamoun TK maamoun.tk@googlemail.com writes:
It looks like wider multiplication achieves higher speed on a different aarch64 instance on the gfarm. Here are the numbers on the gcc185 instance:
- Radix 26: 0.83 GByte/s
- Radix 26 (2-way interleaved): 0.70 GByte/s
- Radix 64 (Latest version): 1.25 GByte/s
These numbers are a bit of a surprise too, since the 2-way interleaving is supposed to perform better than the old C version, as it does on other architectures! Anyway, the benchmark numbers for powerpc and s390x were not taken from gfarm instances, so it's ok to rely on them.
In the meantime, I pushed the latest C radix-32 version to a branch poly1305-radix32, and benchmarked it on one of the 32-bit ARM boards that are part of the GMP test systems (the odxu4 system on https://gmplib.org/devel/testsystems, labeled Cortex-A15/A7). I got:
Radix 26 (old code): 326 MByte/s
Radix 32 (new code): 260 MByte/s
So radix-32 doesn't seem to be a good option there. I had a quick look at the generated assembly: besides the expected umull and umlal instructions doing the main multiply work, there's an awful lot of loads and stores to the stack, roughly half of the instructions. The compiler installed is gcc-5.4, rather old.
Vector 32-bit multiplication applies two multiply operations to the inputs and places the concatenated results in the destination vector. So the current simd/altivec implementations interleave two blocks horizontally over vector registers and execute both the multiplication and reduction phases on the two blocks simultaneously. However, after each block iteration we would have to combine the two states by splitting the concatenated value and adding it to the origin. To avoid that overhead, one can multiply both state parts by r^2, except for the last two blocks, where the first part is multiplied by r^2 and the second one by r.

Let's consider a message of 4 blocks b0,b1,b2,b3. Multiplying the state by r for each block yields the sequence

  h = (h+b0) r^4 + b1 r^3 + b2 r^2 + b3 r

With the interleaved implementation this sequence is executed in two iterations:

  First iteration:  (h+b0) r^2            || b1 r^2
  Second iteration: ((h+b0) r^2 + b2) r^2 || (b1 r^2 + b3) r
When getting out of the loop we combine the two state parts together so we get the same correct sequence of r powers for each block.
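To make the r-power scheduling concrete, here is a minimal C sketch (not poly1305 itself: it uses a small toy prime so all arithmetic fits in 64 bits, and arbitrary toy values for h, r and the blocks) checking that the interleaved evaluation matches the serial Horner evaluation:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define P 1000003ULL   /* toy prime standing in for 2^130 - 5 */

static uint64_t mul_mod (uint64_t a, uint64_t b) { return (a * b) % P; }

int
main (void)
{
  uint64_t h = 123456, r = 654321;            /* toy state and key */
  uint64_t b0 = 111, b1 = 222, b2 = 333, b3 = 444;
  uint64_t r2 = mul_mod (r, r);

  /* Serial: h = (h+b0) r^4 + b1 r^3 + b2 r^2 + b3 r */
  uint64_t serial = h;
  serial = mul_mod (serial + b0, r);
  serial = mul_mod (serial + b1, r);
  serial = mul_mod (serial + b2, r);
  serial = mul_mod (serial + b3, r);

  /* Interleaved: lane a holds even blocks, lane b odd blocks. */
  uint64_t a = mul_mod (h + b0, r2);          /* first iteration */
  uint64_t b = mul_mod (b1, r2);
  a = mul_mod (a + b2, r2);                   /* last iteration: */
  b = mul_mod (b + b3, r);                    /* lane b uses r, not r^2 */

  assert ((a + b) % P == serial);             /* lanes combine to the same hash */
  printf ("ok: %llu\n", (unsigned long long) serial);
  return 0;
}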
Also, the two-independent-carry technique mentioned previously overlaps two carry chains with each other, including the long carry from h4 to h0, which offers the opportunity for a further boost.
Hmm, it seems that avoiding long carry chains is the main reason why radix 26 can be faster.
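For reference, a minimal C sketch of the radix-26 carry pattern (the usual five 26-bit limbs h0..h4 held in 64-bit accumulators; a generic illustration, not nettle's actual code): each carry step only moves the bits above bit 26 to the next limb, the wrap-around carry from h4 is folded into h0 multiplied by 5 since 2^130 = 5 (mod 2^130 - 5), and the two chains can be overlapped:

#include <stdint.h>

/* Generic illustration of radix-26 carry propagation, five 26-bit limbs
   in 64-bit accumulators as they come out of a block multiplication.
   Not nettle's actual code. */
void
carry_radix26 (uint64_t h[5])
{
  uint64_t c;
  /* Chain A: h0 -> h1 -> h2, chain B: h3 -> h4 -> (x5) -> h0.
     The two chains are independent, so they can be overlapped. */
  c = h[0] >> 26; h[0] &= 0x3ffffff; h[1] += c;
  c = h[3] >> 26; h[3] &= 0x3ffffff; h[4] += c;

  c = h[1] >> 26; h[1] &= 0x3ffffff; h[2] += c;
  c = h[4] >> 26; h[4] &= 0x3ffffff; h[0] += 5 * c;  /* 2^130 = 5 (mod 2^130 - 5) */

  c = h[2] >> 26; h[2] &= 0x3ffffff; h[3] += c;
  c = h[0] >> 26; h[0] &= 0x3ffffff; h[1] += c;
}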
Great! It performs better on all tested architectures. Apparently, AArch64 SIMD doesn't support 64*64->128 vector multiplication, so I've implemented this version on powerpc by utilizing the power9-specific vmsumudm instruction. I got 0.62 GByte/s for the C version and 0.65 GByte/s for the assembly version. I'll attach the hardware implementation in this email.
But you had 1.15 GByte/s for the 2-way interleaved version on this machine?
define(`FUNC_ALIGN', `5')
PROLOGUE(_nettle_poly1305_block)
	ld	H0, 32(CTX)
	ld	H1, 40(CTX)
	ld	H2, 48(CTX)
	ld	T0, 0(M)
	ld	T1, 8(M)
	addc	T0, T0, H0
	adde	T1, T1, H1
	adde	T2, M128, H2
	li	IDX, 16
	lxvd2x	VSR(R), 0, CTX
	lxvd2x	VSR(S), IDX, CTX
	li	RZ, 0
	vxor	ZERO, ZERO, ZERO
	vxor	F0, F0, F0
	vxor	F1, F1, F1
	vxor	TMP, TMP, TMP
	xxpermdi	VSR(MU0), VSR(R), VSR(S), 0b01
	xxswapd	VSR(MU1), VSR(R)
	mtvsrdd	VSR(T), T0, T1
	mtvsrdd	VSR(T10), 0, T2
	andi.	T2A, T2, 3
	mtvsrdd	VSR(T11), 0, T2A
	srdi	T2A, T2, 2
	mtvsrdd	VSR(T00), T2A, RZ
I don't get all of the setup, but perhaps it would be better to load the input (T0, T1) and state (H0, H1, H2) directly into vector registers, and avoid moves between regular registers and vectors.
For the R and S values, the key setup could store them in the right order so they don't have to be permuted after load.
	vmsumudm	F0, T, MU0, F0
	vmsumudm	F1, T, MU1, F1
	vmsumudm	TMP, T11, MU1, TMP
	vmsumudm	F0, T00, S, F0
	vmsumudm	F1, T10, MU0, F1
This part is as neat as I had hoped! Is there some variant of the instructions that writes the result register without adding, to avoid the explicit clearing of F0 and F1? It may also be doable with one instruction less; the 5 instructions do 10 multiplies, but I think we use only 7, so the rest must somehow be zeroed or ignored.
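For reference, a C model of the vmsumudm semantics as I understand them (to be checked against the Power ISA description): each instruction does two 64x64->128 multiplies plus a 128-bit addend, modulo 2^128, which is where the count of 10 multiplies for 5 instructions comes from:

#include <stdint.h>

/* C model of vmsumudm (as I understand it; check the Power ISA text):
   the two unsigned doubleword products are summed with the 128-bit
   addend, and the low 128 bits of the result are kept. */
unsigned __int128
vmsumudm_model (uint64_t a0, uint64_t a1, uint64_t b0, uint64_t b1,
                unsigned __int128 c)
{
  return (unsigned __int128) a0 * b0 + (unsigned __int128) a1 * b1 + c;
}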
	xxmrgld	VSR(TMP), VSR(TMP), VSR(ZERO)
	li	IDX, 32
	xxswapd	VSR(F0), VSR(F0)
	vadduqm	F1, F1, TMP
	stxsdx	VSR(F0), IDX, CTX
	li	IDX, 40
	xxmrgld	VSR(F0), VSR(ZERO), VSR(F0)
	vadduqm	F1, F1, F0
	xxswapd	VSR(F1), VSR(F1)
	stxvd2x	VSR(F1), IDX, CTX
This looks a bit verbose, if what we need to do is just add the high part of F0 to the low part of F1 (with carry to the high part of F1), and store the result?
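In C terms, that operation would be something like the following sketch (hypothetical names and a plain uint64_t[3] state, using the gcc __int128 extension; not the actual context layout):

#include <stdint.h>

/* Sketch only: fold the high 64 bits of the 128-bit accumulator f0 into
   f1, with the carry propagating into f1's high half, then store h0, h1, h2. */
void
store_state (uint64_t h[3], unsigned __int128 f0, unsigned __int128 f1)
{
  f1 += (uint64_t) (f0 >> 64);       /* high part of F0 added to low part of F1 */
  h[0] = (uint64_t) f0;              /* low 64 bits of F0 */
  h[1] = (uint64_t) f1;              /* low 64 bits of F1 */
  h[2] = (uint64_t) (f1 >> 64);      /* high 64 bits of F1 */
}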
Regards,
/Niels