Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

28 Jan 2022


Maamoun TK <maamoun.tk@googlemail.com> writes:
...
> Great! I believe this is the best we can get for processing one block.

One may be able to squeeze out one or two cycles more using the mulx
extension, which should make it possible to eliminate some of the move
instructions (I don't think moves cost any execution unit resources, but
they do consume decoding resources).
...
> I'm trying to implement two-way interleaving using AVX extension and
> the main instruction of interest here is 'vpmuludq' that does double
> multiply operation

My manual seems a bit confused about whether it's called pmuludq or
vpmuludq. But you're thinking of the instruction that does two
32x32 --> 64 multiplies? It will be interesting to see how that works
out! It does half the work compared to a 64 x 64 --> 128 multiply
instruction, but accumulation/folding may get more efficient by using
vector registers. (There seems to also be an AVX2 variant doing four
32x32 --> 64 multiplies, using 256-bit registers).
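
To make the "half the work" comparison concrete, here is a rough
portable C model of the lane arithmetic. It is only a sketch: the names
pmuludq_model and mul64x64_via_32 are made up for illustration, and this
is not code from the patch under discussion. Each call of the model
stands for one two-lane 32x32 --> 64 multiply, and a full
64 x 64 --> 128 product needs four such partial products, i.e. two
calls plus some folding.

```c
#include <stdint.h>

/* Model of one two-lane pmuludq/vpmuludq on 128-bit registers: only the
   low 32 bits of each 64-bit lane are multiplied, giving two
   32x32 -> 64 products. (Hypothetical helper, for illustration only.) */
static void pmuludq_model(const uint64_t a[2], const uint64_t b[2],
                          uint64_t out[2])
{
    for (int i = 0; i < 2; i++)
        out[i] = (a[i] & 0xffffffffu) * (b[i] & 0xffffffffu);
}

/* 64x64 -> 128 multiply assembled from four 32x32 -> 64 partial
   products, i.e. two calls of the two-lane model above. */
static void mul64x64_via_32(uint64_t x, uint64_t y,
                            uint64_t *hi, uint64_t *lo)
{
    uint64_t xl = x & 0xffffffffu, xh = x >> 32;
    uint64_t yl = y & 0xffffffffu, yh = y >> 32;
    const uint64_t a[2]  = { xl, xh };
    const uint64_t b0[2] = { yl, yh };   /* lanes: xl*yl, xh*yh */
    const uint64_t b1[2] = { yh, yl };   /* lanes: xl*yh, xh*yl */
    uint64_t p[2], q[2];

    pmuludq_model(a, b0, p);
    pmuludq_model(a, b1, q);

    /* Fold the two middle products (weight 2^32) into the result.
       mid is at most 3 * (2^32 - 1), so it cannot overflow 64 bits. */
    uint64_t mid = (p[0] >> 32) + (q[0] & 0xffffffffu)
                                + (q[1] & 0xffffffffu);
    *lo = (p[0] & 0xffffffffu) | (mid << 32);
    *hi = p[1] + (q[0] >> 32) + (q[1] >> 32) + (mid >> 32);
}
```

The folding at the end is the part that a vector implementation would
hope to make cheaper, for instance by accumulating several unreduced
partial products per lane before carrying.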
...
> the main concern here is there's a shortage of XMM registers as
> there are 16 of them, I'm working on addressing this issue by using memory
> operands of key values for 'vpmuludq' and hope the processor cache does its
> thing here.

Reading cached values from memory is usually cheap. So it's probably
fine as long as the values being modified are kept in registers.
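
For readers not familiar with two-way interleaving, the underlying
identity can be sketched in C as below. This is a toy under stated
assumptions, not the Poly1305 code being discussed: it uses the small
prime 2^31 - 1 as a stand-in for 2^130 - 5, leaves out the per-block
padding bit and the final addition of s, and all names (TOY_P, mulmod,
poly_serial, poly_two_way) are invented for illustration. I'd assume
the key values read from memory in a two-way scheme would include both
r and r^2, but the thread doesn't spell that out.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_P 2147483647u  /* 2^31 - 1, toy stand-in for 2^130 - 5 */

/* Modular multiply; both reduced operands are below 2^31, so the
   product fits in 64 bits. */
static uint64_t mulmod(uint64_t a, uint64_t b)
{
    return (a % TOY_P) * (b % TOY_P) % TOY_P;
}

/* Horner evaluation, one block per step: h = (h + m[i]) * r. */
static uint64_t poly_serial(const uint32_t *m, size_t n, uint64_t r)
{
    uint64_t h = 0;
    for (size_t i = 0; i < n; i++)
        h = mulmod(h + m[i], r);
    return h;
}

/* Two blocks per step: h = (h + m[i]) * r^2 + m[i+1] * r.
   The two products are independent of each other, which is what lets
   them share one two-lane multiply. Assumes n is even. */
static uint64_t poly_two_way(const uint32_t *m, size_t n, uint64_t r)
{
    uint64_t r2 = mulmod(r, r);
    uint64_t h = 0;
    for (size_t i = 0; i + 1 < n; i += 2)
        h = (mulmod(h + m[i], r2) + mulmod(m[i + 1], r)) % TOY_P;
    return h;
}
```

Both functions return the same value for an even number of blocks; the
two-way form just reaches it with one dependent step per pair of blocks
instead of one per block.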
...
> I'm expecting to complete the assembly implementation tomorrow.

If my analysis of the single-block code is right, I'd expect it to be
rather important to trim the number of instructions per block.

Regards,
/Niels
-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
