Re: [PATCH] powerpc64/sha256: fix loading overreads by loading less and shifting

6 Sep 2024


On Fri, 2024-09-06 at 14:51 +0200, Niels Möller wrote:
> Eric Richter <erichte@linux.ibm.com> writes:
>
> > I suspect with four li instructions, those are issued 4x in parallel, and
> > then the subsequent (slower) lxvw4x instructions are queued 2x. By removing
> > the other three li instructions, that li is queued with the first lxvw4x,
> > but not the second -- causing a stall as the second lxv has to wait for the
> > parallel queue of the li + lxv before, as it depends on the li completing
> > first.
>
> I don't know any details of how powerpc instruction issue and pipelining
> work. But some dependence on alignment seems likely. So great that you
> found that; it would seem rather odd to get a performance regression for
> this fix.
>
> Since .align 4 means 16 byte alignment, and instructions are 4 bytes,
> that's enough to group instructions 4-by-4, is that what you want or is
> it overkill?

I don't think I tested with .align 1, but .align 2 did hurt
performance. For the sake of minimizing the large amount of trial and
error, I just stuck with .align 4. I'll indicate that in the comment,
unless I find a better value, location, etc.
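
(Aside, as a sketch rather than anything taken from the patch: the
alignment arithmetic discussed above can be checked with a few lines of
Python. It assumes, as stated in the thread, that the .align operand is a
power-of-two exponent and that every instruction is 4 bytes. The helper
names align_bytes and padding_insns, and the example addresses, are made
up for illustration.)

```python
INSN_BYTES = 4  # assumption from the thread: fixed 4-byte instructions


def align_bytes(exponent):
    """Byte boundary requested by '.align exponent', read as 2**exponent."""
    return 1 << exponent


def padding_insns(addr, exponent):
    """Instruction slots of padding needed to move addr up to the boundary.

    addr is a hypothetical, instruction-aligned code address.
    """
    pad = (-addr) % align_bytes(exponent)
    return pad // INSN_BYTES


if __name__ == "__main__":
    for n in (2, 3, 4):
        per_block = align_bytes(n) // INSN_BYTES
        print(f".align {n}: {align_bytes(n)}-byte boundary, "
              f"{per_block} instruction(s) per aligned block, "
              f"up to {per_block - 1} padding slot(s)")
```

Under these assumptions, padding_insns(0x1004, 4) is 3, while .align 2
never adds padding to code that is already instruction-aligned, so only
exponents of 3 or more can move the instructions that follow.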

> I'm also a bit surprised that an align at this point, outside the loop,
> makes a significant difference. Maybe it's the alignment of the code in
> the loop that matters, which is changed indirectly by this .align? Maybe
> it would make more sense to add the align directive just before the
> loop: entry, and/or before the blocks of instructions in the loop that
> should be aligned? Nettle uses aligned loop entry points at many places
> for several architectures, although I'm not sure how much of that makes
> a measurable difference in performance, and how much was just done out
> of habit.

I suspect something similar -- I don't figure aligning that load would
make as much of a measurable difference as, perhaps, aligning the
ROUNDs. I will be experimenting with placing alignments elsewhere to
see if there's a better/more sensible spot.

> > Additional note: I did also try rearranging the LOAD macros with the
> > shifts, as well as moving around the requisite byte-swap vperms, but did
> > not see any performance benefit. Doing the load, vperm, shift, addi in
> > that order appears to be the fastest.
>
> To what degree do the powerpc processors do out of order execution?

I'm not entirely sure -- that will mostly be the subject of the deep-dive
I'm planning to do. I suspect there might be some hidden dependency
bubbles that are interfering with optimal execution.

> If you have the time to experiment more, I'd be curious to see what the
> results would be, e.g., either doing all the loads back to back,
>
>   lxvd2x A
>   lxvd2x B
>   lxvd2x C
>   lxvd2x D
>   vperm A
>   vperm B
>   vperm C
>   vperm D
>   ...shifts...

This was one of my experiments, and it either did not help performance
or hurt it further. Though in my haste, I did not take notes. I will
play around further with these and record the results for posterity; I
suspect this might be useful to capture for future work.

> or alternatively, trying to schedule each load a few instructions
> before the value is used.
