On Mon, Jan 24, 2022 at 12:58 AM David Edelsohn dje.gcc@gmail.com wrote:
On Sun, Jan 23, 2022 at 4:41 PM Maamoun TK maamoun.tk@googlemail.com wrote:
On Sun, Jan 23, 2022 at 9:10 PM Niels Möller nisse@lysator.liu.se wrote:
nisse@lysator.liu.se (Niels Möller) writes:
The current C implementation uses radix 26, and 25 multiplies (32x32 --> 64) per block, plus quite a lot of shifts. A radix-32 variant analogous to the above would need 16 long multiplies and 4 short ones. I'd expect that to be faster on most machines, but I'd have to try it out.
I've tried this out, see attached file. It has an #if 0/1 to choose between radix 64 (depending on the non-standard __int128 type for accumulated products) and radix 32 (portable C).
This is the speed I get for C implementations of poly1305_update on my x86_64 laptop:
Radix 26: 1.2 GByte/s (old code)
Radix 32: 1.3 GByte/s
Radix 64: 2.2 GByte/s
It would be interesting with benchmarks on actual 32-bit hardware, 32-bit ARM likely being the most relevant arch.
For comparison, the current x86_64 asm version: 2.5 GByte/s.
I ran performance tests of this patch on the architectures I have access to.
Arm64 (gcc117 gfarm):
- Radix 26: 0.65 GByte/s
- Radix 26 (2-way interleaved): 0.92 GByte/s
- Radix 32: 0.55 GByte/s
- Radix 64: 0.58 GByte/s
POWER9:
- Radix 26: 0.47 GByte/s
- Radix 26 (2-way interleaved): 1.15 GByte/s
- Radix 32: 0.52 GByte/s
- Radix 64: 0.58 GByte/s
Z15:
- Radix 26: 0.65 GByte/s
- Radix 26 (2-way interleaved): 3.17 GByte/s
- Radix 32: 0.82 GByte/s
- Radix 64: 1.22 GByte/s
Apparently, the higher-radix versions improve performance on x86_64, powerpc, and s390x, but not on arm64, where performance takes a slight hit.
That might be an artifact of the specific ARM processor microarchitecture or the memory subsystem of the ARM system, not inherent to the Arm AArch64 architecture and ISA.
That seems right; I've tested the enhanced versions of the wider multiplication on other gfarm aarch64 instances and got different results.
On Mon, Jan 24, 2022 at 10:01 AM Niels Möller nisse@lysator.liu.se wrote:
Maamoun TK maamoun.tk@googlemail.com writes:
I made a performance test of this patch on the available architectures I have access to.
Interesting. I'm a bit surprised the radix-64 version doesn't perform better, in particular on arm64. (But I'm not yet familiar with the arm64 multiply instructions.)
It looks like the wider multiplication does achieve higher speed on a different aarch64 instance on gfarm. Here are the numbers from the gcc185 instance:
- Radix 26: 0.83 GByte/s
- Radix 26 (2-way interleaved): 0.70 GByte/s
- Radix 64 (latest version): 1.25 GByte/s
These numbers are a bit of a surprise too, since the 2-way interleaving is supposed to perform better than the old C version, as it does on the other architectures! Anyway, the powerpc and s390x benchmark numbers were not taken from gfarm instances, so those can be relied on.
Numbers for 2-way interleaving are impressive, I'd like to understand how that works.
Vector 32-bit multiplication applies two multiply operations to the inputs and places the concatenated results in the destination vector. So the current SIMD/AltiVec implementations interleave two blocks horizontally over the vector registers and execute both the multiplication and reduction phases on both blocks simultaneously. However, after each block iteration we would have to combine the two states by splitting the concatenated value and adding it back to the origin. To avoid that overhead, one can instead multiply both state parts by r^2, except for the last two blocks, where the first part is multiplied by r^2 and the second by r.

Consider a message of 4 blocks b0, b1, b2, b3. The hash computation has the sequence

  h = (h+b0) r^4 + b1 r^3 + b2 r^2 + b3 r

With the interleaved implementation, this sequence is executed in two iterations.

First iteration:  (h+b0) r^2            || b1 r^2
Second iteration: ((h+b0) r^2 + b2) r^2 || (b1 r^2 + b3) r
When we get out of the loop, we combine the two state parts, so each block ends up multiplied by the correct power of r.
Also, the two-independent-carry technique mentioned previously overlaps the two carry chains with each other, including the long carry from h4 to h0, which offers an opportunity for a further boost.
It might be useful to derive the corresponding multiply throughput, i.e., the number of multiply operations (and with which multiply instruction) completed per cycle, as well as the total cycles per block.
I'm not sure I can depend on the gfarm machines for that purpose, as they seem to give me inconsistent performance results. I think I should look for more reliable alternatives.
On Mon, Jan 24, 2022 at 10:56 PM Niels Möller nisse@lysator.liu.se wrote:
I've tried reworking the folding, to reduce latency. The idea is to let the most significant state word be close to a full word, rather than limited to <= 4 as in the previous version. When multiplying by r, split one of the multiplies to take out the low 2 bits. For the radix-64 version, that term is
B^2 t_2 * r0
Split t_2 as 4*hi + lo, then this can be reduced to
B^2 lo * r0 + hi * 5*r0
(Using the same old B^2 = 5/4 (mod p) in a slightly different way).
The 5*r0 fits in one word and can be precomputed, and then this multiplication goes in parallel with the other multiplies, with no multiply left in the final per-block folding. With this trick I get, on the same machine:
Radix 32: 1.65 GByte/s
Radix 64: 2.75 GByte/s, i.e., faster than current x86_64 asm version.
I haven't yet done a strict analysis of bounds on the state and temporaries, but I would expect that it works out with no possibility of overflow.
See attached file. To fit the precomputed 5*r0 in a nice way I had to rearrange the unions in struct poly1305_ctx a bit; I also attach a patch to do this. The size of the struct should be the same, so I think it can be done without any ABI bump.
Great! It performs better on all tested architectures. Apparently AArch64 SIMD doesn't support 64x64 --> 128 vector multiplication, so I've implemented this version on powerpc by utilizing the POWER9-specific vmsumudm instruction. I got 0.62 GByte/s for the C version and 0.65 GByte/s for the assembly version; I'll attach the hardware implementation in this email.
regards, Mamone