Eric Richter <erichte@linux.ibm.com> writes:
> I suspect with four li instructions, those are issued 4x in parallel,
> and then the subsequent (slower) lxvw4x instructions are queued 2x. By
> removing the other three li instructions, that li is queued with the
> first lxvw4x, but not the second -- causing a stall, as the second lxv
> has to wait for the parallel queue of the li + lxv before it, since it
> depends on the li completing first.
I don't know the details of how powerpc instruction issue and pipelining
work, but some dependence on alignment seems likely. So it's great that
you found that; it would seem rather odd to get a performance regression
from this fix.
Since .align 4 means 16-byte alignment, and instructions are 4 bytes
each, that's enough to group instructions 4-by-4. Is that what you want,
or is it overkill?
I'm also a bit surprised that an align at this point, outside the loop,
makes a significant difference. Maybe it's the alignment of the code in
the loop that matters, and that is changed indirectly by this .align?
Maybe it would make more sense to add the align directive just before
the loop entry, and/or before the blocks of instructions in the loop
that should be aligned. Nettle uses aligned loop entry points in many
places, for several architectures, although I'm not sure how much of
that makes a measurable difference in performance, and how much was
just done out of habit.
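For the loop entry, I mean something like this minimal sketch (untested;
the label name and the closing branch are just for illustration):

	.align 4		C 16-byte align the entry of the block loop
.Lblock_loop:
	C ... process one block ...
	bdnz	.Lblock_loop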
> Additional note: I did also try rearranging the LOAD macros with the
> shifts, as well as moving around the requisite byte-swap vperms, but
> did not see any performance benefit. Doing the load, vperm, shift,
> addi in that order appears to be the fastest.
To what degree do the powerpc processors do out-of-order execution? If
you have the time to experiment more, I'd be curious to see what the
results would be, e.g., either doing all the loads back to back,
	lxvd2x A
	lxvd2x B
	lxvd2x C
	lxvd2x D
	vperm A
	vperm B
	vperm C
	vperm D
	...shifts...
or alternatively, trying to schedule each load a few instructions
before its value is used.
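By the latter I mean roughly the following (hand-waving sketch, real
operands omitted), so that each load gets a few unrelated instructions
between it and the first use of its result:

	lxvd2x A
	lxvd2x B
	vperm A
	lxvd2x C
	vperm B
	...shifts using A...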
> -define(`TC4', `r11')
> -define(`TC8', `r12')
> -define(`TC12', `r14')
> -define(`TC16', `r15')
> +define(`TC16', `r11')
One nice thing is that you can now eliminate the save and restore of r14 and r15. Please do that.
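I.e., with r14 and r15 no longer used, prologue/epilogue code along
these lines can simply go (offsets from memory, the exact form in the
file may differ):

	std	r14, -16(r1)
	std	r15, -8(r1)
	...
	ld	r14, -16(r1)
	ld	r15, -8(r1)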
>  C State registers
>  define(`VSA', `v0')
> @@ -187,24 +184,24 @@ define(`LOAD', `
>  define(`DOLOADS', `
>  	IF_LE(`DATA_LOAD_VEC(VT0, .load_swap, T1)')
>  	LOAD(0, TC0)
> -	LOAD(1, TC4)
> -	LOAD(2, TC8)
> -	LOAD(3, TC12)
> +	vsldoi IV(1), IV(0), IV(0), 4
> +	vsldoi IV(2), IV(0), IV(0), 8
> +	vsldoi IV(3), IV(0), IV(0), 12
>  	addi INPUT, INPUT, 16
>  	LOAD(4, TC0)
You can eliminate 2 of the 4 addi instructions by using
	LOAD(4, TC16)
here and similarly for LOAD(12, TC16).
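I.e., roughly like this, with the byte-swap and vsldoi steps omitted
(untested; INPUT still advances by 64 per block, but with only two
addi):

	LOAD(0, TC0)
	...
	LOAD(4, TC16)
	addi INPUT, INPUT, 32
	LOAD(8, TC0)
	...
	LOAD(12, TC16)
	addi INPUT, INPUT, 32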
> 	.align 4
> 	C Load state values
> 	lxvw4x VSR(VSA), 0, STATE	C VSA contains A,B,C,D
Please add a brief comment on the .align, saying that it appears to enable more efficient issue of the lxvw4x instructions (or your own wording explaining why it's needed).
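E.g., something like:

	C The .align appears to enable more efficient issue of the
	C lxvw4x instructions below.
	.align 4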
(For the .align directive in general, there's also an ALIGN macro,
which takes a non-logarithmic alignment regardless of architecture and
assembler, but it's not used consistently in the nettle assembly
files.)
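E.g. (assuming I remember the macro's convention correctly):

	ALIGN(16)	C 16 bytes, same as .align 4 with gas here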
Regards,
/Niels