Maamoun TK <maamoun.tk@googlemail.com> writes:
I'm not aware of a simple way to accomplish either approach on POWER8. I recommend using a stack-allocated buffer
Let's leave that as is, then. Do you want to make another pull request with only the fixes for register usage?
to assist in handling leftovers rather than making it complicated. Alternatively, we can use the POWER9-specific instruction 'lxvll', which loads a vector with the length passed in a general register; it also works in both endian modes without any post-loading operations. Another benefit of switching to POWER ISA 3.0 is that we can use 'lxvb16x/stxvb16x' to load and store input and output data instead of 'lxvd2x/stxvd2x', which eliminates the need for post-load/pre-store permute operations in little-endian mode.
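The stack-buffer idea for leftovers can be sketched in C roughly as follows. This is only an illustration, not code from the patch under discussion: the names xor_block and xor_with_leftovers are hypothetical, and a plain byte-wise XOR stands in for the real vectorized block routine. The point is that the block routine only ever sees full 16-byte blocks, while the final partial block is padded in a local buffer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16

/* Hypothetical stand-in for the vectorized per-block work: XOR one
   16-byte block of src with a 16-byte key block into dst. */
static void
xor_block(uint8_t *dst, const uint8_t *src, const uint8_t *key)
{
  size_t i;
  for (i = 0; i < BLOCK_SIZE; i++)
    dst[i] = src[i] ^ key[i];
}

/* Process full blocks directly. Route a final partial block through a
   zero-padded stack buffer, so the block routine never needs a
   partial load or store. */
static void
xor_with_leftovers(uint8_t *dst, const uint8_t *src, size_t length,
                   const uint8_t *key)
{
  for (; length >= BLOCK_SIZE;
       length -= BLOCK_SIZE, dst += BLOCK_SIZE, src += BLOCK_SIZE)
    xor_block(dst, src, key);

  if (length > 0)
    {
      uint8_t buffer[BLOCK_SIZE] = {0};

      memcpy(buffer, src, length);    /* only the leftover bytes are read */
      xor_block(buffer, buffer, key); /* full-block operation on the copy */
      memcpy(dst, buffer, length);    /* only the leftover bytes are written */
    }
}
```

The cost is two small copies per call, paid only when the length is not a multiple of the block size.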
I was thinking of something similar to how the unaligned input is handled in arm/v6/sha1-compress.asm. And then, to handle leftovers at the end, one would need to compare the leftover size with the alignment-related address bits, to decide whether or not to load one more word. But that's perhaps only worth the effort if there's a performance advantage in avoiding unaligned loads also in the main loop.
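To make the leftover test concrete, here is a rough C sketch of the idea. It is not a transcription of arm/v6/sha1-compress.asm; the function name copy_with_aligned_loads is hypothetical, and a small byte window replaces the shift or permute instructions that real assembly would use, which keeps the sketch independent of endianness. It assumes the input lives in memory made of whole aligned 32-bit words, since an aligned load may touch bytes just outside the requested range.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy length bytes from src to dst, reading src only with aligned
   32-bit loads. Hypothetical illustration, see assumptions above. */
static void
copy_with_aligned_loads(uint8_t *dst, const uint8_t *src, size_t length)
{
  size_t offset = (uintptr_t) src & 3;  /* alignment-related address bits */
  const uint32_t *wp = (const uint32_t *) (src - offset);
  uint8_t window[8];  /* two consecutive aligned words, in memory order */
  uint32_t w;

  if (length == 0)
    return;

  if (offset == 0)
    {
      /* Aligned input: one load per output word. */
      for (; length >= 4; length -= 4, dst += 4)
        {
          w = *wp++;
          memcpy(dst, &w, 4);
        }
      if (length > 0)
        {
          w = *wp;
          memcpy(window, &w, 4);
          memcpy(dst, window, length);
        }
      return;
    }

  /* Unaligned input: the first word holds 4 - offset useful bytes. */
  w = *wp++;
  memcpy(window, &w, 4);

  for (; length >= 4; length -= 4, dst += 4)
    {
      /* Each output word straddles two aligned input words. */
      w = *wp++;
      memcpy(window + 4, &w, 4);
      memcpy(dst, window + offset, 4);
      memcpy(window, window + 4, 4);  /* keep the newest word */
    }

  /* Leftovers (0 to 3 bytes): the held word still has 4 - offset
     unconsumed bytes, so one more load is needed only if the leftover
     size exceeds that. */
  if (length > 4 - offset)
    {
      w = *wp;
      memcpy(window + 4, &w, 4);
    }
  memcpy(dst, window + offset, length);
}
```

The final comparison of length against 4 - offset is the check described above: it prevents a load of a word that contains none of the requested bytes.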
Regards, /Niels