nisse@lysator.liu.se (Niels Möller) writes:
If this works, FOLD would turn into something like
    sldi    F0, $1, 32
    srdi    F1, $1, 32
    subfc   F2, $1, F0
    addme   F3, F1
I'm looking at a different approach (experimenting on ARM64, which is quite similar to powerpc, but I don't yet have working code). To understand what the redc code is doing, we need to keep in mind that one folding step computes
<U4,U3,U2,U1,U0> + U0*p
which cancels the low limb, since p = -1 (mod 2^64). Since the low limb always cancels, what we actually need is
<U4,U3,U2,U1> + U0*((p+1)/2^64)
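As a sanity check of that identity, here is a small Python sketch. It assumes p is the NIST P-256 prime (which does satisfy p = -1 mod 2^64) and 64-bit limbs; the function names are made up for illustration and are not from the actual code.

```python
# Assumption: p is the NIST P-256 prime; limbs are 64 bits wide.
P = 2**256 - 2**224 + 2**192 + 2**96 - 1
B = 2**64  # limb base


def fold_full(u):
    """One folding step the long way: add U0*p, then drop the low limb."""
    u0 = u % B
    t = u + u0 * P
    assert t % B == 0  # the low limb cancels because p = -1 (mod 2^64)
    return t // B


def fold_short(u):
    """Same step without ever forming the low limb:
    <U4,U3,U2,U1> + U0*((p+1)/2^64)."""
    u0 = u % B
    return (u // B) + u0 * ((P + 1) // B)
```

Both functions should agree for any non-negative input, e.g. fold_full(u) == fold_short(u) for a 5-limb u.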
The x86_64 code does this by splitting U0*p into 2^{256} U0 - (2^{256} - p) * U0, subtracting in the folding step, and adding in the high part later. But one doesn't have to do it that way. One could instead use a FOLD macro that computes
(2^{192} - 2^{160} + 2^{128} + 2^{32}) U0
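To make the limb layout concrete, here is a rough Python model of one way such a FOLD could produce four 64-bit limbs F0..F3 from U0. This is only a sketch under my own assumptions (64-bit limbs, a single borrow propagated from F2 into F3); the name fold_limbs is hypothetical and it is not meant as the final instruction sequence.

```python
MASK = 2**64 - 1
# The multiplier from the text, i.e. (p+1)/2^64 when p is the P-256 prime.
K = 2**192 - 2**160 + 2**128 + 2**32


def fold_limbs(u0):
    """Return [F0, F1, F2, F3], 64-bit limbs with
    sum(F_i * 2^(64*i)) == K * u0."""
    # 2^32 * U0 occupies the two low limbs.
    f0 = (u0 << 32) & MASK
    f1 = u0 >> 32
    # 2^128 * U0 - 2^160 * U0: low limb is U0 - (U0 << 32), possibly borrowing.
    d = u0 - f0
    f2 = d & MASK
    borrow = 1 if d < 0 else 0
    # 2^192 * U0 minus the high half of 2^160 * U0, minus the borrow.
    f3 = u0 - f1 - borrow
    return [f0, f1, f2, f3]
```

In this model the shifts correspond to a shift-left/shift-right pair, and F2, F3 to a subtract followed by a subtract-with-borrow, so it comes to roughly four instructions per fold, before any scheduling tricks.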
I also wonder if there's some way to use the carry out from one fold step and apply it at the right place while preparing F0, F1, F2, F3 for the next step.
Regards,
/Niels