Maamoun TK <maamoun.tk@googlemail.com> writes:
Great! I believe this is the best we can get for processing one block.
One may be able to squeeze out one or two cycles more using the mulx extension, which should make it possible to eliminate some of the move instructions (I don't think moves cost any execution unit resources, but they do consume decoding resources).
I'm trying to implement two-way interleaving using the AVX extension, and the main instruction of interest here is 'vpmuludq', which does a double multiply operation.
My manual seems a bit confused about whether it's called pmuludq or vpmuludq. But you're thinking of the instruction that does two 32x32 --> 64 multiplies? It will be interesting to see how that works out! It does half the work compared to a 64 x 64 --> 128 multiply instruction, but accumulation/folding may get more efficient by using vector registers. (There also seems to be an AVX2 variant doing four 32x32 --> 64 multiplies, using 256-bit registers).
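To make the "half the work" comparison concrete, here is a rough portable C model, not the actual assembly. The names vec128, mul_u32_lanes, mul_64x64 and check_example are made up for illustration. The first function models what a 128-bit pmuludq/vpmuludq computes: each 64-bit lane contributes only its low 32 bits. The second builds a 64 x 64 --> 128 product out of four 32x32 --> 64 partial products, which is twice as many as one pmuludq delivers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 128-bit vector register: two 64-bit lanes. */
typedef struct { uint64_t lane[2]; } vec128;

/* Models pmuludq/vpmuludq on 128-bit registers: in each lane, multiply
   the low 32 bits of a by the low 32 bits of b, giving a 64-bit result.
   The high 32 bits of each input lane are ignored. */
static vec128
mul_u32_lanes (vec128 a, vec128 b)
{
  vec128 r;
  for (int i = 0; i < 2; i++)
    r.lane[i] = (uint64_t) (uint32_t) a.lane[i] * (uint32_t) b.lane[i];
  return r;
}

/* A full 64 x 64 --> 128 product assembled from four 32x32 --> 64
   partial products (schoolbook multiplication on 32-bit halves). */
static void
mul_64x64 (uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
  uint64_t a0 = (uint32_t) a, a1 = a >> 32;
  uint64_t b0 = (uint32_t) b, b1 = b >> 32;
  uint64_t p00 = a0 * b0, p01 = a0 * b1;
  uint64_t p10 = a1 * b0, p11 = a1 * b1;

  /* Sum of the three contributions to bits 32..63; at most 3*(2^32-1),
     so it cannot overflow 64 bits. */
  uint64_t mid = (p00 >> 32) + (uint32_t) p01 + (uint32_t) p10;

  *lo = (mid << 32) | (uint32_t) p00;
  *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}

/* Small self-check with hand-computed values. */
static void
check_example (void)
{
  vec128 a = { { 0xFFFFFFFFu, 0x100000003ULL } };  /* bit 32 of lane 1 is ignored */
  vec128 b = { { 0xFFFFFFFFu, 2 } };
  vec128 r = mul_u32_lanes (a, b);
  assert (r.lane[0] == 0xFFFFFFFE00000001ULL);
  assert (r.lane[1] == 6);

  uint64_t hi, lo;
  mul_64x64 (UINT64_MAX, UINT64_MAX, &hi, &lo);   /* (2^64-1)^2 */
  assert (hi == 0xFFFFFFFFFFFFFFFEULL && lo == 1);
}
```

The model says nothing about latency or throughput; it only shows the arithmetic each instruction contributes, and that the vector variant leaves the partial products in separate lanes where they can be accumulated before any carry handling.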
The main concern here is a shortage of XMM registers, as there are only 16 of them. I'm working on addressing this by using memory operands for the key values in 'vpmuludq', and I hope the processor cache does its thing here.
Reading cached values from memory is usually cheap. So probably fine as long as the values that are modified are kept in registers.
I'm expecting to complete the assembly implementation tomorrow.
If my analysis of the single-block code is right, I'd expect it to be rather important to trim the number of instructions per block.
Regards, /Niels