POWER9 and Z14 have introduced ' vmsumudm' and 'vmslg' instructions respectively for full 64-bit vector multiplication. This note demonstrates implementing Poly1305 based on radix 2^44 for optimal architecture utilization.
In radix B = 2^44 we have the state and key as follows H = B^2 H_2 + B H_1 + H_0 R = B^2 R_2 + B R_1 + R_0 Where degrees of H_2 and R_2 are 42 and 36 respectively whereas other coefficients are of degree 44 We have B^3 = 20 so we can pre-compute R_1' = 20 * R_1 and R_2' = 20 * R_2 Now to multiply H by R we can write R H = B^2(R_2 H_0 + R_1 H_1 + R_0 H_2) + B(R_1 H_0 + R_0 H_1 + R_2' H_2) ____________T2__________/ ____________T1__________/ + R_0 H_0 + R_2' H_1 + R_1' H_2 ____________T0___________/
T_0 < 2^88 + 2^85 + 2^91 = 2^93 T_1 < 2^88 + 2^88 + 2^83 = 2^90 T_2 < 2^80 + 2^88 + 2^86 = 2^90
This is a good place to combine the interleaved multiplication state products of next consecutive blocks (In this case we interleave four-block multiplications by per-computing powers of key R) which rises up the degree of product parts.
T_0 < 2^93 + 3*2^93 = 2^96 T_1 < 2^90 + 3*2^91 = 2^94 T_2 < 2^90 + 3*2^90 = 2^93
Now let's reduce this product to 2^130 divided to 2^44, 2^44, 2^42 for each part. T_0 ------------> T_1 ---------------> T_2 -------------------> T_0 ---------------> T_1 (96-44) (94-44+1) (93-42+1+3) (55-44+1) This chain keeps the carry addition for the last part (2^44+2^12) less than 2^44+2^22. Note the carry from T_2 would be multiplied by 5 before adding it to T_0
While the sequential carry addition achieves decent performance speed on PowerPC arch, an interleaved carry handling would get more speed up for s390x so let's figure an interleaved variant of carry addition. *----------------------------------------------------------------------------------------------* | Phase 1 | Phase 2 | Phase 3 | |------------------------------|---------------------------------|-------------------------------|
| T_1 ------------> T_2 | T_2 -----------------> T_0 | T_0 --------------> T_1 | | (94-44) | (93-42+1+3) | (55-44+1) | |------------------------------|---------------------------------|-------------------------------| | T_0 -------------> T_1 | T_1 --------------> T_2 | | | (96-44) | (52-44+1) | | *-----------------------------------------------------------------------------------------------*
This variant implies 3 sequential phases rather than four. Moreover, the long carry path from T_2 -> T_0 would've executed in parallel with another carry path.
Patches of implementation based on radix 2^44 for both architectures are submitted to nettle with fat build support. https://git.lysator.liu.se/nettle/nettle/-/merge_requests/39 https://git.lysator.liu.se/nettle/nettle/-/merge_requests/41
Benchmark the implementations on POWER9 and Z15 *---------------------------------------------------------------------------------------------* | Arch | Nettle patch (radix 2^44) | OpenSSL (radix 2^26) | |-------------------|--------------------------------------|-----------------------------------|
| PowerPC | 0.972 cycles/byte | 1.453 cycles/byte | |-------------------|--------------------------------------|-----------------------------------| | IBMz | 0.840 cycles/byte | 0.936 cycles/byte | *---------------------------------------------------------------------------------------------*
This note can also be applied on x86_64 arch with 'AVX512VL' and 'AVX512_IFMA' extension support by utilizing 'VPMADD52HUQ' and 'VPMADD52LUQ' instructions of full 52-bit multiplication.
regards, Mamone