On Tue, Jan 25, 2022 at 10:24 PM Niels Möller nisse@lysator.liu.se wrote:
Maamoun TK maamoun.tk@googlemail.com writes:
It looks like wider multiplication would achieve higher speed on a different aarch64 instance on gfarm. Here are the numbers on the gcc185 instance:
- Radix 26: 0.83 GByte/s
- Radix 26 (2-way interleaved): 0.70 GByte/s
- Radix 64 (Latest version): 1.25 GByte/s
These numbers are a bit of a surprise too, since the 2-way interleaving is supposed to perform better than the old C version, as it does on the other architectures! Anyway, the benchmark numbers for powerpc and s390x were not taken from gfarm instances, so it's ok to rely on them.
In the meantime, I pushed the latest C radix-32 version to a branch poly1305-radix32, and benchmarked on one of the 32-bit ARM boards that are part of the GMP test systems (the odxu4 system on https://gmplib.org/devel/testsystems, labeled Cortex-A15/A7). I got:
Radix 26 (old code): 326 MByte/s
Radix 32 (new code): 260 MByte/s
So radix-32 doesn't seem to be a good option there. I had a quick look at the generated assembly: besides the expected umull and umlal instructions that do the main multiply work, there's an awful lot of loads and stores to the stack, roughly half of the instructions. The installed compiler is gcc-5.4, rather old.
Vector 32-bit multiplication applies two multiply operations to the inputs and places the concatenated results in the destination vector. So the current simd/altivec implementations interleave two blocks horizontally over the vector registers and execute both the multiplication and reduction phases on the two blocks simultaneously. However, after each block iteration we would have to combine the two states by splitting the concatenated value and adding it back to the origin. To avoid that overhead, one can multiply both state parts by r^2, except for the last two blocks, where the first part is multiplied by r^2 and the second one by r.

Let's consider a message of 4 blocks b0,b1,b2,b3; the hash computation has the sequence

  h = (h+b0) r^4 + b1 r^3 + b2 r^2 + b3 r

With the interleaved implementation this sequence is executed in two iterations:

  First iteration:  (h+b0) r^2            || b1 r^2
  Second iteration: ((h+b0) r^2 + b2) r^2 || (b1 r^2 + b3) r

When getting out of the loop we combine the two state parts, so we get the same correct sequence of r powers for each block.
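For concreteness, here is a quick Python check of that schedule (my own illustration, not part of any patch; p is the poly1305 prime, while r and the block values are arbitrary numbers):

p = (1 << 130) - 5
r = 0x0806d5400e52447c036d555408805f3d  # arbitrary example "key"
h, b0, b1, b2, b3 = 1, 11, 22, 33, 44   # arbitrary state and blocks

# Reference sequence: h = (h+b0) r^4 + b1 r^3 + b2 r^2 + b3 r (mod p)
ref = ((h + b0) * pow(r, 4, p) + b1 * pow(r, 3, p)
       + b2 * pow(r, 2, p) + b3 * r) % p

r2 = pow(r, 2, p)
# First iteration:  (h+b0) r^2       || b1 r^2
lane0, lane1 = (h + b0) * r2 % p, b1 * r2 % p
# Second iteration: (lane0 + b2) r^2 || (lane1 + b3) r
lane0, lane1 = (lane0 + b2) * r2 % p, (lane1 + b3) * r % p

# Combining the lanes after the loop recovers the reference value.
assert (lane0 + lane1) % p == ref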
Also, the two-independent-carry technique mentioned previously overlaps two carry chains with each other, including the long carry from h4 to h0, which offers the opportunity for a further boost.
Hmm, it seems that avoiding long carry chains is the main reason why radix 26 can be faster.
Yes, with the sequential carry path the SIMD function would slightly beat the C function of radix 26.
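To spell out the carry idea for the list, here is a rough Python sketch of my own, assuming the usual radix-26 limbs (the real code differs in detail):

M26 = (1 << 26) - 1
p = (1 << 130) - 5

def carry_two_chains(h0, h1, h2, h3, h4):
    # Two carry chains start at h0 and h3; each pair of lines below is
    # mutually independent, so the hardware can overlap them. The long
    # carry out of h4 folds back as 5*c, since 2^130 = 5 (mod p).
    c0 = h0 >> 26; h0 &= M26
    c3 = h3 >> 26; h3 &= M26
    h1 += c0;      h4 += c3
    c1 = h1 >> 26; h1 &= M26
    c4 = h4 >> 26; h4 &= M26
    h2 += c1;      h0 += 5 * c4
    c2 = h2 >> 26; h2 &= M26
    c0 = h0 >> 26; h0 &= M26
    h3 += c2;      h1 += c0
    return h0, h1, h2, h3, h4  # partially reduced

# The limbs change, but the represented value is preserved mod p:
val = lambda hs: sum(h << (26 * i) for i, h in enumerate(hs)) % p
hs = (0x7ffffff, 0x5ffffff, 0x4ffffff, 0x3ffffff, 0x6ffffff)
assert val(carry_two_chains(*hs)) == val(hs)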
Great! It performs better on all tested architectures. Apparently, AArch64 SIMD doesn't support 64x64->128 vector multiplication, so I've implemented this version on powerpc by utilizing the power9-specific vmsumudm instruction. I got 0.62 GByte/s for the C version and 0.65 GByte/s for the assembly version; I'll attach the hardware implementation in this email.
But you had 1.15 GByte/s for the 2-way interleaved version on this machine?
Right, the 2-way interleaved version is more efficient than this one and supports POWER7+ processors.
define(`FUNC_ALIGN', `5')
PROLOGUE(_nettle_poly1305_block)
C Load the state (h0,h1,h2) and the message block
ld H0, 32(CTX)
ld H1, 40(CTX)
ld H2, 48(CTX)
ld T0, 0(M)
ld T1, 8(M)
C Add the block into the state, M128 on top of the high limb
addc T0, T0, H0
adde T1, T1, H1
adde T2, M128, H2
C Load the key vectors R and S
li IDX, 16
lxvd2x VSR(R), 0, CTX
lxvd2x VSR(S), IDX, CTX
li RZ, 0
C Clear the accumulators
vxor ZERO, ZERO, ZERO
vxor F0, F0, F0
vxor F1, F1, F1
vxor TMP, TMP, TMP
C Arrange the multiplier doublewords
xxpermdi VSR(MU0), VSR(R), VSR(S), 0b01
xxswapd VSR(MU1), VSR(R)
C Move the block into vectors, splitting the high limb T2
C into (T2 >> 2) and (T2 & 3)
mtvsrdd VSR(T), T0, T1
mtvsrdd VSR(T10), 0, T2
andi. T2A, T2, 3
mtvsrdd VSR(T11), 0, T2A
srdi T2A, T2, 2
mtvsrdd VSR(T00), T2A, RZ
I don't get all of the setup, but perhaps it would be better to load input (T0, T1) and state (H0, H1, H2) directly into vector registers, and avoid move between regular registers and vectors.
I was having difficulty using vector addition with carry, so I had to use the general registers for that purpose. Also, the general AND and shift operations are easier to use than the vector ones, since the latter require setting up a vector register for the immediate value.
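As an aside, for anyone puzzled by the srdi/andi. split of T2: I take it to rely on the fold 2^130 = 5 (mod p), which a couple of lines of Python confirm:

# h2 sits at bit 128 of the state, and since 2^130 = 5 (mod p) its top
# bits fold down to a small multiple of 5:
#   h2 * 2^128 = (h2 >> 2) * 2^130 + (h2 & 3) * 2^128
#              = 5 * (h2 >> 2) + (h2 & 3) * 2^128   (mod p)
p = (1 << 130) - 5
for h2 in range(64):  # the high limb is only a few bits wide
    assert (h2 << 128) % p == (5 * (h2 >> 2) + ((h2 & 3) << 128)) % p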
For the R and S values, the key setup could store them in the right order so they don't have to be permuted after load.
vmsumudm F0, T, MU0, F0
vmsumudm F1, T, MU1, F1
vmsumudm TMP, T11, MU1, TMP
vmsumudm F0, T00, S, F0
vmsumudm F1, T10, MU0, F1
This part is as neat as I had hoped! Is there some variant of the instructions that writes the result register without adding, to avoid the explicit clearing of F0 and F1? It may also be doable with one instruction less; the 5 instructions do 10 multiplies, but I think we use only 7, so the rest must somehow be zeroed or ignored.
POWER10 adds 'vmuloud' and 'vmuleud' for a single doubleword multiply, which would fit well here. Now I realize there is no need to clear the F0, F1, and TMP registers, since we can use the ZERO register in place of the fourth operand. However, the purpose of this implementation is to get an approximate measure of the speedup over the C version, to vouch for adopting 2-way interleaving as the high-performance implementation.
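For readers following along, my mental model of vmsumudm in Python (this reflects my reading of the ISA description, not anything taken from the patch):

# vmsumudm vt, va, vb, vc: a dot product of two unsigned doubleword
# pairs plus a 128-bit accumulator, truncated to 128 bits.
MASK128 = (1 << 128) - 1

def vmsumudm(va, vb, vc):
    # va and vb are (doubleword, doubleword) pairs; vc is 128 bits wide.
    return (va[0] * vb[0] + va[1] * vb[1] + vc) & MASK128

# Each instruction performs 2 multiplies, hence the 10 Niels counts for
# 5 instructions. A lane paired with a zero doubleword contributes
# nothing, which is how the 3 unused products get ignored; e.g. with
# T11 = (0, h2 & 3), only the second lane of MU1 is actually used.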
xxmrgld VSR(TMP), VSR(TMP), VSR(ZERO)
li IDX, 32
xxswapd VSR(F0), VSR(F0)
vadduqm F1, F1, TMP
stxsdx VSR(F0), IDX, CTX
li IDX, 40
xxmrgld VSR(F0), VSR(ZERO), VSR(F0)
vadduqm F1, F1, F0
xxswapd VSR(F1), VSR(F1)
stxvd2x VSR(F1), IDX, CTX
This looks a bit verbose, if what we need to do is just to add the high part of F0 to the low part of F1 (with carry to the high part of F1), and store the result?
I couldn't find a neat way to do that, so I stuck with the C theme, plus some vector adjustment operations.
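For the archives, here is the combine Niels describes, modeled on plain 128-bit integers in Python (F0 and F1 stand for the main accumulators; I haven't traced how TMP folds in, so it is omitted):

MASK64 = (1 << 64) - 1

def combine_and_store(F0, F1):
    # h0 is the low doubleword of F0; the high doubleword of F0 is added
    # to the low part of F1, with the carry propagating into F1's high
    # part (the 128-bit vadduqm add), and the result is stored.
    h0 = F0 & MASK64
    F1 = (F1 + (F0 >> 64)) & ((1 << 128) - 1)
    h1 = F1 & MASK64
    h2 = F1 >> 64
    return h0, h1, h2  # written back at CTX+32, +40, +48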
Regards,
/Niels

--
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.