On Mon, Sep 13, 2021 at 5:08 PM Niels Möller nisse@lysator.liu.se wrote:
nisse@lysator.liu.se (Niels Möller) writes:
I've also added a cbc-aes128-encrypt.asm. That gives a more significant speedup, almost 60%. I think the main reason for the speedup is that we avoid reloading subkeys between blocks.
I've continued this path, see branch aes-cbc. The aes128 variant is at
https://git.lysator.liu.se/nettle/nettle/-/blob/aes-cbc/x86_64/aesni/cbc-aes...
Benchmark results are positive but a bit puzzling. On my laptop (AMD Ryzen 5) I get
aes128 ECB encrypt 5450.18
This is the latest version, doing two blocks per iteration.
aes128 CBC encrypt 547.34
The general CBC mode written in C, with one call to aes128_encrypt per block. 10(!) times slower than ECB.
cbc_aes128 encrypt 865.11
The new assembly function. Almost 60% speedup over the old code, which is nice, and large enough that it seems motivated to have the new function. But still 6 times slower than ECB. I'm not sure why. Let's look a bit closer at cycle numbers.
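As a quick sanity check of the ratios quoted above, here is a minimal Python sketch. The three figures are copied from the benchmark output; the variable names are only illustrative, and since only ratios are computed the units of the figures don't matter.

```python
# Throughput figures copied from the benchmark output above.
# Variable names are illustrative; only ratios are used, so units cancel.
ecb = 5450.18          # aes128 ECB encrypt (two blocks per iteration)
cbc_generic = 547.34   # aes128 CBC encrypt (generic C CBC, one call per block)
cbc_asm = 865.11       # cbc_aes128 encrypt (new assembly function)

ecb_vs_generic = ecb / cbc_generic            # how much faster ECB is than generic CBC
asm_speedup = cbc_asm / cbc_generic - 1.0     # relative gain of the assembly CBC
ecb_vs_asm = ecb / cbc_asm                    # remaining gap between ECB and assembly CBC

print(f"ECB vs generic CBC:  {ecb_vs_generic:.2f}x")
print(f"asm CBC speedup:     {asm_speedup:.1%}")
print(f"ECB vs asm CBC:      {ecb_vs_asm:.2f}x")
```

This gives roughly 10x, 58% and 6.3x, matching the figures in the text.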
Not sure I get accurate cycle numbers (it's a bit tricky with variable features and turbo modes and whatnot), but it looks like ECB mode is 6 cycles per block, which would be consistent with issue of two aesenc instructions per cycle, while CBC mode is 37 cycles per block, almost 4 cycles per aesenc.
This could be explained if (i) latency of aesenc is 3-4 cycles, and (ii) the processor's out-of-order machinery results in as many as 7-8 blocks processed in parallel when executing the ECB loop, i.e., instructions are issued for 3-4 iterations through the loop before the results of the first iteration are ready.
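The reasoning above can be sketched as a back-of-the-envelope model in Python. It assumes that AES-128 needs 10 dependent round instructions per block (9 aesenc plus 1 aesenclast), that CBC encryption runs those rounds strictly one after another, and that the measured 37 and 6 cycles per block are accurate. The function names are made up for illustration, and the model ignores everything else in the loop (loads, xors, stores).

```python
AES128_ROUNDS = 10  # 9 aesenc + 1 aesenclast per block


def implied_round_latency(serial_cycles_per_block, rounds=AES128_ROUNDS):
    """Latency of one round instruction, assuming the rounds of a block
    form a single dependency chain (as in CBC encryption)."""
    return serial_cycles_per_block / rounds


def blocks_in_flight(serial_cycles_per_block, pipelined_cycles_per_block):
    """Rough number of blocks that must overlap for the pipelined (ECB)
    loop to finish one block every pipelined_cycles_per_block cycles,
    given that a single block takes serial_cycles_per_block cycles."""
    return serial_cycles_per_block / pipelined_cycles_per_block


cbc_cycles = 37.0  # measured cycles per block, CBC encrypt (serial chain)
ecb_cycles = 6.0   # measured cycles per block, ECB encrypt (independent blocks)

print(f"implied aesenc latency: {implied_round_latency(cbc_cycles):.1f} cycles")
print(f"blocks in flight (ECB): {blocks_in_flight(cbc_cycles, ecb_cycles):.1f}")
```

With these inputs the model gives a latency just under 4 cycles per round and about 6 blocks overlapping in the ECB loop, which is in the same ballpark as the estimate above. It is only a crude consistency check, not a measurement.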
I ran the tests on an Intel Comet Lake processor and I can't think of another explanation: it seems x86_64 processors issue multiple blocks simultaneously without hand-written unrolling of the block loop. Also, Intel processors, or at least the Comet Lake architecture, seem to implement this machinery more effectively than your test processor (AMD Ryzen 5), so there you don't even need 2-way interleaving in the AES-ECB implementation or a separate AES-CBC implementation. I got the same benchmark speeds for ECB and CBC modes in all cases, with CBC mode always 6 times slower than ECB mode.
regards, Mamone