Re: [PowerPC] GCM optimization

17 Nov 2020

I replaced the stack-based handling of the leftover bytes with the first
approach. I also changed some vector registers in the defines, because
I had defined `LE_MASK' in a non-volatile register, which is not
always preserved.
This patch is built on top of the ppc-gcm branch.
regards,
Mamone
On Sat, Nov 14, 2020 at 8:11 PM Maamoun TK maamoun.tk@googlemail.com
wrote:
...
For the first approach I can think of this method:
lxvd2x      VSR(C0),0,DATA
IF_LE(`
vperm       C0,C0,C0,LE_MASK
')
slwi        LENGTH,LENGTH,4     (Shift left 4 bits because vsro gets bit[121:124])
vspltisb    v10,-1              (0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF)
mtvrwz      v11,LENGTH          (LENGTH in bit[57:60])
xxspltd     VSR(v11),VSR(v11),0 (LENGTH in bit[121:124])
vsro        v10,v10,v11         (Shift right by octet)
vnot        v10,v10
vand        C0,C0,v10
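
For readers less familiar with the vector instructions, the intended effect of the mask sequence above can be modelled in plain C. This is only a byte-level sketch of the result (keep the first LENGTH bytes of the loaded block, zero the rest); the function name mask_leading_bytes is hypothetical and it does not model the register-level bit positions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 16

/* Hypothetical helper: keep the first `length` bytes of a 16-byte block
   and clear the remaining low-order bytes. */
static void
mask_leading_bytes(uint8_t block[BLOCK_SIZE], size_t length)
{
  uint8_t mask[BLOCK_SIZE];
  size_t i;

  assert(length <= BLOCK_SIZE);

  /* Result of the all-ones / shift-right-by-octets / complement steps:
     0xFF in the first `length` bytes, 0x00 elsewhere. */
  for (i = 0; i < BLOCK_SIZE; i++)
    mask[i] = (i < length) ? 0xFF : 0x00;

  /* Final AND of the loaded block with the mask. */
  for (i = 0; i < BLOCK_SIZE; i++)
    block[i] &= mask[i];
}
```

Note that the assembly loads a full 16 bytes from DATA before masking, so this model also takes a full 16-byte buffer.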
I recommend the third approach, so we don't have to deal with the leftover
bytes in the upcoming implementations. The problem is that
gcm_init_key() initializes the table for the compatible gcm_hash() function,
which means we can't process the remaining bytes using gcm_gf_mul() or
gcm_gf_shift_8(), because the table potentially has not been initialized. So
I'm thinking of keeping the gcm_gf_mul() variant that doesn't need a table
(where GCM_TABLE_BITS == 0) and always processing the remaining bytes with
this function.
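
A minimal sketch of what table-free handling of a trailing partial block could look like in C. The names gf_mul_bitwise and hash_partial_block are hypothetical, not Nettle's actual functions; the multiplication follows the bit-by-bit algorithm from NIST SP 800-38D, which needs only the 16-byte hash key and no precomputed table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16

/* Hypothetical table-free multiply: x = x * y in GF(2^128), using the
   GCM bit ordering (most significant bit of byte 0 is coefficient 0). */
static void
gf_mul_bitwise(uint8_t x[BLOCK_SIZE], const uint8_t y[BLOCK_SIZE])
{
  uint8_t z[BLOCK_SIZE] = {0};
  uint8_t v[BLOCK_SIZE];
  int i, j;

  memcpy(v, y, BLOCK_SIZE);
  for (i = 0; i < 128; i++)
    {
      int lsb;

      /* If bit i of x is set, accumulate the current multiple of y. */
      if (x[i / 8] & (0x80 >> (i % 8)))
        for (j = 0; j < BLOCK_SIZE; j++)
          z[j] ^= v[j];

      /* Multiply v by the field generator: shift right one bit and
         reduce with R = 0xe1 || 0^120 when a bit falls off the end. */
      lsb = v[BLOCK_SIZE - 1] & 1;
      for (j = BLOCK_SIZE - 1; j > 0; j--)
        v[j] = (uint8_t) ((v[j] >> 1) | (v[j - 1] << 7));
      v[0] >>= 1;
      if (lsb)
        v[0] ^= 0xe1;
    }
  memcpy(x, z, BLOCK_SIZE);
}

/* Hypothetical handling of the final partial block: XOR the leftover
   bytes into the state (equivalent to XORing a zero-padded block),
   then multiply by the hash key. */
static void
hash_partial_block(uint8_t state[BLOCK_SIZE], const uint8_t key[BLOCK_SIZE],
                   const uint8_t *data, size_t length)
{
  size_t i;

  assert(length > 0 && length < BLOCK_SIZE);
  for (i = 0; i < length; i++)
    state[i] ^= data[i];
  gf_mul_bitwise(state, key);
}
```

With this split, an assembly gcm_hash() would only ever see whole blocks, at the cost of a slow multiply for at most one block per call.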
The test coverage is fine, I can't think of any potential untested cases.
regards,
Mamone
On Sat, Nov 14, 2020 at 6:54 PM Niels Möller nisse@lysator.liu.se wrote:
...
Maamoun TK maamoun.tk@googlemail.com writes:
...
+Lmod:

C --- process the modulo bytes, padding the low-order bytes with zeros
...

cmpldi         LENGTH,0
beq            Ldone

C load table elements
li             r8,1*TableElemAlign
lxvd2x         VSR(H1M),0,TABLE
lxvd2x         VSR(H1L),r8,TABLE

C push every modulo byte to the stack and load them with padding into vector register

vxor           ZERO,ZERO,ZERO
addi           r8,SP,-16
stvx           ZERO,0,r8

+Lstb_loop:

subic.         LENGTH,LENGTH,1
lbzx           r7,LENGTH,DATA
stbx           r7,LENGTH,r8
bne            Lstb_loop
lxvd2x         VSR(C0),0,r8

It's always a bit annoying to have to deal with leftovers like this
in the assembly code. Can we avoid having to store it to memory and read
back? I can see three other approaches:

1. Loop, reading a byte at a time, and shift into a target register. I
   guess we would need to assemble the bytes in a regular register, and
   then transfer the final value to a vector register. Is that
   expensive?

2. Round the address down to make it aligned, read an aligned word and,
   only if needed, the next word. And shift and mask to get the needed
   bytes. I think it is fine to read a few bytes outside of the input
   area, as long as the reads do *not* cross any word boundary (and
   hence a potential page boundary). We do things like this in some
   other places, but then for reading unaligned data in general, not
   just leftover parts.

3. Adapt the internal C/asm interface, so that the assembly routine only
   needs to handle complete blocks. It could provide a gcm_gf_mul, and
   let the C code handle partial blocks using memxor + gcm_gf_mul.
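
As a rough illustration of the first of these approaches, the byte-at-a-time loop can be modelled in C with two 64-bit integers standing in for a pair of general-purpose registers. The function name load_partial_be and the hi/lo split are assumptions made for this sketch; the transfer of the final value into a vector register is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 16

/* Hypothetical helper: assemble `length` leftover bytes into a 128-bit
   big-endian value (hi:lo), padding the low-order bytes with zeros. */
static void
load_partial_be(const uint8_t *data, size_t length,
                uint64_t *hi, uint64_t *lo)
{
  uint64_t h = 0, l = 0;
  size_t i;

  assert(length <= BLOCK_SIZE);

  for (i = 0; i < BLOCK_SIZE; i++)
    {
      /* Read a real byte while input remains, zero padding afterwards. */
      uint8_t byte = (i < length) ? data[i] : 0;

      /* Shift the 128-bit accumulator left by one byte and insert. */
      h = (h << 8) | (l >> 56);
      l = (l << 8) | byte;
    }
  *hi = h;
  *lo = l;
}
```

For example, three leftover bytes 01 02 03 end up in the top three bytes of hi, with every lower byte zero.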

I would guess (1) or maybe (3) is the most reasonable. I don't think
performance is that important, since it looks like for each message,
this case can happen only for the last call to gcm_update and the last
call to gcm_encrypt/gcm_decrypt.
What about test coverage? It looks like we have test cases for sizes up
to 8 blocks, and for partial blocks, so I guess that should be fine?
Regards,
/Niels
--
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
