I replaced the stack-based handling of the leftover bytes with the first approach. I also changed some vector registers in the defines, because I had defined `LE_MASK' in a non-volatile register, which is not always preserved.
This patch is built on top of the ppc-gcm branch.
regards, Mamone
On Sat, Nov 14, 2020 at 8:11 PM Maamoun TK maamoun.tk@googlemail.com wrote:
For the first approach I can think of this method:

  lxvd2x VSR(C0),0,DATA
  IF_LE(`vperm C0,C0,C0,LE_MASK')
  slwi LENGTH,LENGTH,4 (Shift left 4 bits because vsro gets bit[121:124])
  vspltisb v10,-1 (0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF)
  mtvrwz v11,LENGTH (LENGTH in bit[57:60])
  xxspltd VSR(v11),VSR(v11),0 (LENGTH in bit[121:124])
  vsro v10,v10,v11 (Shift right by octet)
  vnot v10,v10
  vand C0,C0,v10
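The intended effect of the masking sequence can be modelled in plain C. This is only a sketch of what the vector code is meant to compute, not the assembly itself: the helper name mask_partial_block is hypothetical, and the block is treated as a byte array in memory order, so register bit numbering and endianness are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: keep the first `length` bytes of a 16-byte
   block and zero the rest, mirroring vspltisb/vsro/vnot/vand. */
static void mask_partial_block(uint8_t block[16], size_t length)
{
    uint8_t mask[16];
    size_t i;

    assert(length <= 16);

    /* vspltisb v10,-1: all-ones mask */
    memset(mask, 0xff, sizeof mask);

    /* vsro: shift the mask right by `length` octets, so the first
       `length` bytes become zero */
    memmove(mask + length, mask, 16 - length);
    memset(mask, 0, length);

    /* vnot: now the first `length` bytes are 0xff, the rest zero */
    for (i = 0; i < 16; i++)
        mask[i] = (uint8_t)~mask[i];

    /* vand: clear the bytes that lie beyond the input */
    for (i = 0; i < 16; i++)
        block[i] &= mask[i];
}
```

Note that the model only covers the masking. The preceding lxvd2x still loads a full 16 bytes starting at DATA, so the bytes being cleared have already been read.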
I recommend the third approach, so that we don't have to deal with the leftover bytes in upcoming implementations. The problem is that gcm_init_key() initializes the table for the matching gcm_hash() function, which means we can't process the remaining bytes using the gcm_gf_mul() built on gcm_gf_shift_8(), because its table potentially has not been initialized. So I'm thinking of keeping the gcm_gf_mul() variant that doesn't need a table (where GCM_TABLE_BITS == 0) and always processing the remaining bytes with that function.
The test coverage is fine; I can't think of any untested cases.
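To make the idea concrete, here is a rough C sketch of handling a partial block with a table-free multiply. It is not Nettle's code: gf_mul_bitwise and hash_partial_block are hypothetical names, and the multiply is the plain bit-by-bit algorithm from the GCM specification, assumed here as a stand-in for the GCM_TABLE_BITS == 0 variant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical table-free multiply in GF(2^128), GCM bit order:
   x = x * h. Needs no precomputed table. */
static void gf_mul_bitwise(uint8_t x[16], const uint8_t h[16])
{
    uint8_t z[16] = {0};
    uint8_t v[16];
    int i, j;

    memcpy(v, h, 16);
    for (i = 0; i < 128; i++) {
        int carry;

        /* Bit i of x, counted from the most significant bit of x[0] */
        if (x[i / 8] & (0x80 >> (i % 8)))
            for (j = 0; j < 16; j++)
                z[j] ^= v[j];

        /* Shift v right by one bit; reduce if a bit fell off the end */
        carry = v[15] & 1;
        for (j = 15; j > 0; j--)
            v[j] = (uint8_t)((v[j] >> 1) | (v[j - 1] << 7));
        v[0] >>= 1;
        if (carry)
            v[0] ^= 0xe1;
    }
    memcpy(x, z, 16);
}

/* Hypothetical helper: fold `length` leftover bytes (length <= 16)
   into the hash state x. XORing only the valid bytes is the same as
   padding the block with zeros. */
static void hash_partial_block(uint8_t x[16], const uint8_t h[16],
                               const uint8_t *data, size_t length)
{
    size_t i;

    assert(length <= 16);
    for (i = 0; i < length; i++)
        x[i] ^= data[i];
    gf_mul_bitwise(x, h);
}
```

With this split, the assembly routine would only ever see complete blocks, at the cost of a slow multiply for at most one block per call.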
regards, Mamone
On Sat, Nov 14, 2020 at 6:54 PM Niels Möller nisse@lysator.liu.se wrote:
Maamoun TK maamoun.tk@googlemail.com writes:
+Lmod:
+ C --- process the modulo bytes, padding the low-order bytes with zeros
+ cmpldi LENGTH,0
+ beq Ldone
+ C load table elements
+ li r8,1*TableElemAlign
+ lxvd2x VSR(H1M),0,TABLE
+ lxvd2x VSR(H1L),r8,TABLE
+ C push every modulo byte to the stack and load them with padding into vector register
+ vxor ZERO,ZERO,ZERO
+ addi r8,SP,-16
+ stvx ZERO,0,r8
+Lstb_loop:
+ subic. LENGTH,LENGTH,1
+ lbzx r7,LENGTH,DATA
+ stbx r7,LENGTH,r8
+ bne Lstb_loop
+ lxvd2x VSR(C0),0,r8
It's always a bit annoying to have to deal with leftovers like this in the assembly code. Can we avoid having to store it to memory and read back? I can see three other approaches:
1. Loop, reading a byte at a time, and shift into a target register. I guess we would need to assemble the bytes in a regular register, and then transfer the final value to a vector register. Is that expensive?
2. Round the address down to make it aligned, read an aligned word and, only if needed, the next word. And shift and mask to get the needed bytes. I think it is fine to read a few bytes outside of the input area, as long as the reads do *not* cross any word boundary (and hence a potential page boundary). We do things like this in some other places, but then for reading unaligned data in general, not just leftover parts.
3. Adapt the internal C/asm interface, so that the assembly routine only needs to handle complete blocks. It could provide a gcm_gf_mul, and let the C code handle partial blocks using memxor + gcm_gf_mul.
I would guess (1) or maybe (3) is the most reasonable. I don't think performance is that important, since it looks like for each message, this case can happen only for the last call to gcm_update and the last call to gcm_encrypt/gcm_decrypt.
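As an illustration of the byte-at-a-time approach, the loop could look roughly like this in C. This is only a sketch: load_partial_block is a hypothetical name, and two 64-bit integers stand in for the regular registers whose contents would then be moved to a vector register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: assemble `length` leftover bytes into a
   128-bit big-endian value (hi:lo), zero-padded in the low-order
   bytes. Only the `length` valid bytes of `data` are read. */
static void load_partial_block(uint64_t *hi, uint64_t *lo,
                               const uint8_t *data, size_t length)
{
    uint64_t h = 0, l = 0;
    size_t i;

    assert(length <= 16);
    for (i = 0; i < 16; i++) {
        /* Shift in the next input byte, or zero past the end */
        uint8_t byte = (i < length) ? data[i] : 0;

        h = (h << 8) | (l >> 56);
        l = (l << 8) | byte;
    }
    *hi = h;
    *lo = l;
}
```

For example, the three bytes 01 02 03 give hi = 0x0102030000000000 and lo = 0. Unlike loading a full 16-byte vector and masking, this never touches memory beyond the input.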
What about test coverage? It looks like we have test cases for sizes up to 8 blocks, and for partial blocks, so I guess that should be fine?
Regards, /Niels
-- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance.