On Sun, Jan 23, 2022 at 9:10 PM Niels Möller nisse@lysator.liu.se wrote:
nisse@lysator.liu.se (Niels Möller) writes:
The current C implementation uses radix 26, and 25 multiplies (32x32 --> 64) per block. And quite a lot of shifts. A radix 32 variant analogous to the above would need 16 long multiplies and 4 short. I'd expect that to be faster on most machines, but I'd have to try that out.
I've tried this out, see attached file. It has an #if 0/1 to choose between radix 64 (depending on the non-standard __int128 type for accumulated products) and radix 32 (portable C).
This is the speed I get for C implementations of poly1305_update on my x86_64 laptop:
Radix 26: 1.2 GByte/s (old code)
Radix 32: 1.3 GByte/s
Radix 64: 2.2 GByte/s
It would be interesting with benchmarks on actual 32-bit hardware, 32-bit ARM likely being the most relevant arch.
For comparison, the current x86_64 asm version: 2.5 GByte/s.
I made a performance test of this patch on the available architectures I have access to.
Arm64 (gcc117 gfarm): * Radix 26: 0.65 GByte/s * Radix 26 (2-way interleaved): 0.92 GByte/s * Radix 32: 0.55 GByte/s * Radix 64: 0.58 GByte/s POWER9: * Radix 26: 0.47 GByte/s * Radix 26 (2-way interleaved): 1.15 GByte/s * Radix 32: 0.52 GByte/s * Radix 64: 0.58 GByte/s Z15: * Radix 26: 0.65 GByte/s * Radix 26 (2-way interleaved): 3.17 GByte/s * Radix 32: 0.82 GByte/s * Radix 64: 1.22 GByte/s
Apparently, the higher radix version has performance improvements on x86_64, powerpc, and s390x but this is not the case for arm64 arch where the performance has a slight hit there.
I tried to compile the new code with -m32 flag on x86_64 but I got "poly1305-internal.c:46:18: error: ‘__int128’ is not supported on this target". Unfortunately, I don't have access to arm 32-bit too.
Also, I've disassembled the update function of Radix 64 and none of the architectures has made use of SIMD support (including x86_64 that hasn't used XMM registers which is standard for this arch, I don't know if gcc supports such behavior for C compiling but I'm aware that MSVC takes advantage of that standardization for further optimization on compiled C code).
I'm trying to implement the radix 64 using SIMD to see if we can get any performance boost, I'll post the result once I get done with it.
regards, Mamone
If I understood correctly, the suggestion to use radix 26 in djb's original paper was motivated by a high-speed implementation using floating point arithmetic (possibly in combination with SIMD), where the product of two 26-bit integers can be represented exactly in an IEEE double (but it gets a bit subtle if we want to accumulate several products), I haven't really looked into implementing poly1305 with either floating point or SIMD.
To improve test coverage, I've also extended poly1305 tests with tests on random inputs, with results compared to a reference implementation based on gmp/mini-gmp. I intend to merge those testing changes soon. See
https://gitlab.com/gnutls/nettle/-/commit/b48217c8058676c8cd2fd12cdeba457755... .
Unfortunately, the http interface of the main git repo at Lysator is inaccessible at the moment due to an expired certificate; should be fixed in a day or two.
Regards, /Niels
-- Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677. Internet email is subject to wholesale government surveillance.