On Thu, Jan 27, 2022 at 11:28 PM Niels Möller nisse@lysator.liu.se wrote:
nisse@lysator.liu.se (Niels Möller) writes:
Radix 64: 2.75 GByte/s, i.e., faster than current x86_64 asm version.
And I've now tried the same method for the x86_64 implementation. See attached file + needed patch to asm.m4. This gives 2.9 GByte/s.
I'm not entirely sure the cycle numbers are accurate, since the clock frequency is not fixed. I think the machine runs benchmarks at 2.1 GHz, in which case this corresponds to 11.5 cycles per block, 0.7 cycles per byte, 4 instructions per cycle, and 0.5 multiply instructions per cycle.
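As a rough cross-check of those figures, here is a minimal sketch of the arithmetic. It assumes a 16-byte block, that 1 GByte means 10^9 bytes, and that the 2.1 GHz guess is right; the function name is made up for illustration.

```python
def cycle_estimates(clock_hz, bytes_per_second, block_bytes=16):
    """Return (cycles per byte, cycles per block) for a given throughput.

    block_bytes=16 is an assumption; the mail only gives the per-block
    and per-byte numbers, whose ratio is roughly 16.
    """
    cycles_per_byte = clock_hz / bytes_per_second
    cycles_per_block = cycles_per_byte * block_bytes
    return cycles_per_byte, cycles_per_block


# 2.9 GByte/s at an assumed 2.1 GHz clock.
per_byte, per_block = cycle_estimates(2.1e9, 2.9e9)
print(f"{per_byte:.2f} cycles/byte, {per_block:.1f} cycles/block")

# At 4 instructions and 0.5 multiplies per cycle, this is about
# 4 * per_block instructions and 0.5 * per_block multiplies per block.
print(f"~{4 * per_block:.0f} instructions, ~{0.5 * per_block:.0f} multiplies per block")
```

If the real clock during the benchmark is higher than 2.1 GHz, all the per-cycle rates come out proportionally lower.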
This laptop has an AMD Zen 2 processor, which should be capable of issuing four instructions per cycle and completing one multiply instruction per cycle (according to https://gmplib.org/~tege/x86-timing.pdf).
This seems to indicate that on this hardware, speed is not limited by multiplier throughput; instead, the bottleneck is instruction decoding/issuing, with a maximum of four instructions per cycle.
I also benchmarked on my other nearby x86_64 machine (Intel Broadwell processor). It's faster there too (from 1.4 GByte/s to 1.75 GByte/s). I'd expect it to be generally faster, and have pushed it to the master-updates branch.
I haven't looked that carefully at what the old code was doing, but I think the final folding for each block used a multiply instruction that depended on the previous ones for that block, increasing the per-block latency. With the new code, all multiplies done for a block are independent of each other.
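To illustrate the difference, here is a minimal sketch in Python big-integer arithmetic. It assumes the code under discussion is a Poly1305-style multiply modulo 2^130 - 5 with a clamped key r, which this mail does not state explicitly. It is not taken from the attached file, and the function names are hypothetical. The first function folds the high part with a multiply by 5 that has to wait for the full product. The second uses a per-key precomputed value s1 = 5 * (r1 / 4), so every product takes only a limb of h and a key-derived constant as input.

```python
P = (1 << 130) - 5
MASK64 = (1 << 64) - 1
# Assumed Poly1305 key clamping: among other bits, it clears the two
# low bits of the high 64-bit limb, so r1 is divisible by 4.
CLAMP = 0x0ffffffc0ffffffc0ffffffc0fffffff


def mul_with_dependent_fold(h, r):
    """Full product first, then fold: the multiply by 5 depends on h * r."""
    t = h * r
    return (t & ((1 << 130) - 1)) + 5 * (t >> 130)


def mul_with_independent_products(h, r):
    """All products use only limbs of h and values precomputed from r."""
    h0, h1, h2 = h & MASK64, (h >> 64) & MASK64, h >> 128
    r0, r1 = r & MASK64, r >> 64
    s1 = 5 * (r1 >> 2)  # once per key; valid because r1 % 4 == 0
    # 2^130 = 5 (mod P), so h1*r1*2^128 = h1*s1 and h2*r1*2^192 = h2*s1*2^64.
    low = h0 * r0 + h1 * s1
    mid = h0 * r1 + h1 * r0 + h2 * s1
    high = h2 * r0
    return low + (mid << 64) + (high << 128)


r = 0xfedcba9876543210fedcba9876543210 & CLAMP
h = (3 << 128) | 0x0123456789abcdeffedcba9876543210
assert mul_with_independent_products(h, r) % P == (h * r) % P
assert mul_with_dependent_fold(h, r) % P == (h * r) % P
```

Both functions return only partially reduced values. The point is the dependency structure, not the final carry handling.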
Great! I believe this is the best we can get for processing one block. I'm trying to implement two-way interleaving using the AVX extension. The main instruction of interest is 'vpmuludq', which performs two multiplications at once. The main concern is a shortage of XMM registers, since there are only 16 of them. I'm working around this by using memory operands for the key values in 'vpmuludq', and hoping the processor cache does its job. I expect to complete the assembly implementation tomorrow.
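For readers unfamiliar with 'vpmuludq', here is a minimal Python model of its 128-bit (XMM) form, to the best of my understanding: each 64-bit lane contributes only its low 32 bits, and each lane of the result holds the full 64-bit product. The function name and the list-of-lanes representation are just for illustration.

```python
MASK32 = (1 << 32) - 1


def vpmuludq_xmm(a, b):
    """Model of vpmuludq on two XMM registers.

    a and b are lists of two 64-bit lane values. The upper 32 bits of
    each input lane are ignored; the result lanes are 64-bit products.
    """
    return [(x & MASK32) * (y & MASK32) for x, y in zip(a, b)]


# The high halves of the inputs (0xffffffff and 0x1) are ignored.
print(vpmuludq_xmm([0xffffffff00000003, 2], [5, 0x100000007]))
```

Since the inputs are limited to 32 bits, the 64-bit limbs of the radix 64 code cannot be fed to it directly; a two-way version would need limbs that fit in 32 bits.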
regards, Mamone
Regards, /Niels
-- Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677. Internet email is subject to wholesale government surveillance.