Eric Richter <erichte@linux.ibm.com> writes:
> I suspect with four li instructions, those are issued 4x in parallel,
> and then the subsequent (slower) lxvw4x instructions are queued 2x. By
> removing the other three li instructions, that li is queued with the
> first lxvw4x, but not the second -- causing a stall, as the second lxv
> has to wait for the parallel queue of the li + lxv before it, since it
> depends on the li completing first.
I don't know the details of how powerpc instruction issue and pipelining
work, but some dependence on alignment seems likely. So it's great that
you found that; it would seem rather odd to get a performance regression
from this fix.
Since .align 4 means 16-byte alignment, and instructions are 4 bytes
each, that's enough to group instructions 4-by-4. Is that what you want,
or is it overkill?
I'm also a bit surprised that an align at this point, outside the loop,
makes a significant difference. Maybe it's the alignment of the code in
the loop that matters, and that is changed indirectly by this .align?
Maybe it would make more sense to add the align directive just before
the loop entry, and/or before the blocks of instructions in the loop
that should be aligned. Nettle uses aligned loop entry points in many
places, for several architectures, although I'm not sure how much of
that makes a measurable difference in performance, and how much was
just done out of habit.
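For the loop entry, I mean something like this minimal sketch (untested;
the label name and the closing branch are just for illustration):

	.align 4		C 16-byte align the entry of the block loop
.Lblock_loop:
	C ... process one block ...
	bdnz	.Lblock_loop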
> Additional note: I did also try rearranging the LOAD macros with the
> shifts, as well as moving around the requisite byte-swap vperms, but
> did not see any performance benefit. Doing the load, vperm, shift,
> addi in that order appears to be the fastest.
To what degree do the powerpc processors do out-of-order execution? If
you have the time to experiment more, I'd be curious to see what the
results would be, e.g., either doing all the loads back to back,
	lxvd2x A
	lxvd2x B
	lxvd2x C
	lxvd2x D
	vperm A
	vperm B
	vperm C
	vperm D
	...shifts...
or alternatively, trying to schedule each load a few instructions
before its value is used.
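By the latter I mean roughly the following (hand-waving sketch, real
operands omitted), so that each load gets a few unrelated instructions
between it and the first use of its result:

	lxvd2x A
	lxvd2x B
	vperm A
	lxvd2x C
	vperm B
	...shifts using A...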
> -define(`TC4', `r11')
> -define(`TC8', `r12')
> -define(`TC12', `r14')
> -define(`TC16', `r15')
> +define(`TC16', `r11')
One nice thing is that you can now eliminate the save and restore of r14 and r15. Please do that.
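I.e., with r14 and r15 no longer used, prologue/epilogue code along
these lines can simply go (offsets from memory, the exact form in the
file may differ):

	std	r14, -16(r1)
	std	r15, -8(r1)
	...
	ld	r14, -16(r1)
	ld	r15, -8(r1)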
>  C State registers
>  define(`VSA', `v0')
> @@ -187,24 +184,24 @@ define(`LOAD', `
>  define(`DOLOADS', `
>  	IF_LE(`DATA_LOAD_VEC(VT0, .load_swap, T1)')
>  	LOAD(0, TC0)
> -	LOAD(1, TC4)
> -	LOAD(2, TC8)
> -	LOAD(3, TC12)
> +	vsldoi IV(1), IV(0), IV(0), 4
> +	vsldoi IV(2), IV(0), IV(0), 8
> +	vsldoi IV(3), IV(0), IV(0), 12
>  	addi INPUT, INPUT, 16
>  	LOAD(4, TC0)
You can eliminate 2 of the 4 addi instructions by using
	LOAD(4, TC16)
here and similarly for LOAD(12, TC16).
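I.e., roughly like this, with the byte-swap and vsldoi steps omitted
(untested; INPUT still advances by 64 per block, but with only two
addi):

	LOAD(0, TC0)
	...
	LOAD(4, TC16)
	addi INPUT, INPUT, 32
	LOAD(8, TC0)
	...
	LOAD(12, TC16)
	addi INPUT, INPUT, 32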
> 	.align 4
> 	C Load state values
> 	lxvw4x VSR(VSA), 0, STATE	C VSA contains A,B,C,D
Please add a brief comment on the .align, saying that it appears to enable more efficient issue of the lxvw4x instructions (or your own wording explaining why it's needed).
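E.g., something like:

	C The .align appears to enable more efficient issue of the
	C lxvw4x instructions below.
	.align 4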
(For the .align directive in general, there's also an ALIGN macro,
which takes a non-logarithmic alignment regardless of architecture and
assembler, but it's not used consistently in the nettle assembly
files.)
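E.g. (assuming I remember the macro's convention correctly):

	ALIGN(16)	C 16 bytes, same as .align 4 with gas here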
Regards,
/Niels