Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

23 Jan 2022


      On Sun, Jan 23, 2022 at 9:10 PM Niels Möller nisse@lysator.liu.se wrote:
...
nisse@lysator.liu.se (Niels Möller) writes:
...
The current C implementation uses radix 26, and 25 multiplies (32x32
--> 64) per block. And quite a lot of shifts. A radix 32 variant
analogous to the above would need 16 long multiplies and 4 short. I'd
expect that to be faster on most machines, but I'd have to try that out.
I've tried this out, see attached file. It has an #if 0/1 to choose
between radix 64 (depending on the non-standard __int128 type for
accumulated products) and radix 32 (portable C).
This is the speed I get for C implementations of poly1305_update on my
x86_64 laptop:

Radix 26: 1.2 GByte/s (old code)

Radix 32: 1.3 GByte/s

Radix 64: 2.2 GByte/s


It would be interesting with benchmarks on actual 32-bit hardware,
32-bit ARM likely being the most relevant arch.
For comparison, the current x86_64 asm version: 2.5 GByte/s.
I made a performance test of this patch on the available architectures I
have access to.
Arm64 (gcc117 gfarm):
* Radix 26: 0.65 GByte/s
* Radix 26 (2-way interleaved): 0.92 GByte/s
* Radix 32: 0.55 GByte/s
* Radix 64: 0.58 GByte/s
POWER9:
* Radix 26: 0.47 GByte/s
* Radix 26 (2-way interleaved): 1.15 GByte/s
* Radix 32: 0.52 GByte/s
* Radix 64: 0.58 GByte/s
Z15:
* Radix 26: 0.65 GByte/s
* Radix 26 (2-way interleaved): 3.17 GByte/s
* Radix 32: 0.82 GByte/s
* Radix 64: 1.22 GByte/s
Apparently, the higher radix version has performance improvements on
x86_64, powerpc, and s390x but this is not the case for arm64 arch where
the performance has a slight hit there.
I tried to compile the new code with -m32 flag on x86_64 but I got
"poly1305-internal.c:46:18: error: ‘__int128’ is not supported on this
target". Unfortunately, I don't have access to arm 32-bit too.
Also, I've disassembled the update function of Radix 64 and none of the
architectures has made use of SIMD support (including x86_64 that hasn't
used XMM registers which is standard for this arch, I don't know if gcc
supports such behavior for C compiling but I'm aware that MSVC takes
advantage of that standardization for further optimization on compiled C
code).
I'm trying to implement the radix 64 using SIMD to see if we can get any
performance boost, I'll post the result once I get done with it.
regards,
Mamone
...
If I understood correctly, the suggestion to use radix 26 in djb's
original paper was motivated by a high-speed implementation using
floating point arithmetic (possibly in combination with SIMD), where the
product of two 26-bit integers can be represented exactly in an IEEE
double (but it gets a bit subtle if we want to accumulate several
products), I haven't really looked into implementing poly1305 with
either floating point or SIMD.
To improve test coverage, I've also extended poly1305 tests with tests
on random inputs, with results compared to a reference implementation
based on gmp/mini-gmp. I intend to merge those testing changes soon.
See
https://gitlab.com/gnutls/nettle/-/commit/b48217c8058676c8cd2fd12cdeba457755...
.
Unfortunately, the http interface of the main git repo at Lysator is
inaccessible at the moment due to an expired certificate; should be
fixed in a day or two.
Regards,
/Niels
--
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

Re: [Arm64, PowerPC64, S390x] Optimize Poly1305