Re: CBC-AES (was: Re: [S390x] Optimize AES modes)

19 Sep 2021

On Mon, Sep 13, 2021 at 5:08 PM Niels Möller <nisse@lysator.liu.se> wrote:
...
nisse@lysator.liu.se (Niels Möller) writes:
...
I've also added a cbc-aes128-encrypt.asm.
That gives a more significant speedup, almost 60%. I think the main reason
for the speedup is that we avoid reloading subkeys between blocks.
I've continued this path, see branch aes-cbc. The aes128 variant is at
https://git.lysator.liu.se/nettle/nettle/-/blob/aes-cbc/x86_64/aesni/cbc-aes...
Benchmark results are positive but a bit puzzling. On my laptop (AMD
Ryzen 5) I get
        aes128      ECB encrypt  5450.18

This is the latest version, doing two blocks per iteration.

        aes128      CBC encrypt   547.34

The general CBC mode written in C, with one call to aes128_encrypt per
block. 10(!) times slower than ECB.

        cbc_aes128      encrypt   865.11

The new assembly function. Almost 60% speedup over the old code, which
is nice, and large enough that it seems to justify having the new
function. But still 6 times slower than ECB. I'm not sure why. Let's look
a bit closer at cycle numbers.
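As a quick sanity check on the ratios quoted above, here is a small Python
sketch that recomputes them from the three benchmark figures. The variable
names are made up for illustration (they are not nettle identifiers), and the
numbers are simply copied from this message.

```python
# Benchmark figures copied from the message (same unit for all three).
ecb_asm = 5450.18      # aes128 ECB encrypt, two blocks per iteration
cbc_generic = 547.34   # aes128 CBC encrypt, generic C code
cbc_asm = 865.11       # cbc_aes128 encrypt, new assembly function

# Speedup of the new CBC assembly function over the generic C CBC code.
cbc_speedup = cbc_asm / cbc_generic

# How many times slower each CBC variant is compared to ECB.
generic_slowdown = ecb_asm / cbc_generic
asm_slowdown = ecb_asm / cbc_asm

print(f"CBC asm vs generic CBC: {cbc_speedup:.2f}x faster")
print(f"generic CBC vs ECB:     {generic_slowdown:.2f}x slower")
print(f"CBC asm vs ECB:         {asm_slowdown:.2f}x slower")
```

The three ratios come out at roughly 1.58, 9.96 and 6.3, which matches the
"almost 60%", "10 times" and "6 times" figures in the text.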
Not sure I get accurate cycle numbers (it's a bit tricky with variable
features and turbo modes and whatnot), but it looks like ECB mode is 6
cycles per block, which would be consistent with issue of two aesenc
instructions per cycle. CBC mode, by contrast, is 37 cycles per block,
almost 4 cycles per aesenc.
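The per-instruction figures can be derived from the cycle estimates with a
few lines of Python. This is only a sketch: it assumes the usual AES-NI
sequence for AES-128 of 9 aesenc plus 1 aesenclast per block, counts both as
"round instructions", and ignores the initial key xor and all loads and
stores. The cycle counts are the rough estimates from the message, not new
measurements.

```python
# AES-128 has 10 rounds; with AES-NI that is 9 aesenc + 1 aesenclast
# per block (assumption: the initial key xor is not counted).
ROUND_INSTRUCTIONS_PER_BLOCK = 10

# Approximate cycles per block, as estimated in the message.
ecb_cycles_per_block = 6
cbc_cycles_per_block = 37

# ECB: how many round instructions complete per cycle on average.
ecb_instr_per_cycle = ROUND_INSTRUCTIONS_PER_BLOCK / ecb_cycles_per_block

# CBC: how many cycles each round instruction costs on average.
cbc_cycles_per_instr = cbc_cycles_per_block / ROUND_INSTRUCTIONS_PER_BLOCK

print(f"ECB: {ecb_instr_per_cycle:.2f} round instructions per cycle")
print(f"CBC: {cbc_cycles_per_instr:.2f} cycles per round instruction")
```

ECB lands a little under two round instructions per cycle, while CBC needs
close to four cycles for each one.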
This could be explained if (i) latency of aesenc is 3-4 cycles, and (ii)
the processor's out-of-order machinery results in as many as 7-8 blocks
processed in parallel when executing the ECB loop, i.e., instruction
issue for 3-4 iterations through the loop before the results of the
first iteration are ready.
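A toy latency/throughput model makes this explanation concrete. The sketch
below is not a processor simulation. It assumes 10 round instructions per
block, an aesenc latency between 3.5 and 4 cycles, and a peak issue rate of
two aesenc per cycle; the function names are hypothetical. CBC encryption is
modelled as fully serial (each round waits for the previous one), ECB as
limited only by issue rate.

```python
ROUNDS = 10  # round instructions per AES-128 block (9 aesenc + 1 aesenclast)

def model_cbc_cycles_per_block(latency):
    """CBC encrypt: every round depends on the previous one, so the
    full latency is paid for each round instruction."""
    return ROUNDS * latency

def model_ecb_cycles_per_block(issue_rate):
    """ECB: blocks are independent, so only the issue rate limits speed."""
    return ROUNDS / issue_rate

def model_blocks_in_flight(latency, issue_rate):
    """Independent blocks needed to keep the AES unit busy: each block
    has one round in progress at a time, lasting `latency` cycles."""
    return latency * issue_rate

ISSUE_RATE = 2  # assumed aesenc issued per cycle
for latency in (3.5, 4.0):
    print(f"latency {latency}: "
          f"CBC {model_cbc_cycles_per_block(latency):.0f} cycles/block, "
          f"ECB {model_ecb_cycles_per_block(ISSUE_RATE):.0f} cycles/block, "
          f"{model_blocks_in_flight(latency, ISSUE_RATE):.0f} blocks in flight")
```

Under these assumptions the model gives 35-40 cycles per block for CBC,
5 cycles per block for ECB and 7-8 blocks in flight, which is in the same
range as the estimates of 37 and 6 cycles above.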
I ran the tests on an Intel Comet Lake processor and I can't think of
another explanation: it seems x86_64 processors issue instructions for
multiple blocks simultaneously, without hand-written unrolling of the block
loop. Also, Intel processors, or at least Comet Lake, seem to implement this
machinery closer to the ideal than your test processor (AMD Ryzen 5), so
there you don't even need 2-way interleaving in the AES-ECB implementation
or a separate AES-CBC implementation. I got the same ECB and CBC benchmark
speeds in all cases, with CBC mode always 6 times slower than ECB mode.
regards,
Mamone
