Rewrite the AES-NI implementations of AES-GCM, taking advantage of
things I learned while writing the VAES-AVX10 implementations. This is
a complete rewrite that reduces the AES-NI GCM source code size by about
70% and the binary code size by about 95%, while not regressing
performance and in fact improving it significantly in many cases.
The following summarizes the state before this patch:
- The aesni-intel module registered algorithms "generic-gcm-aesni" and
"rfc4106-gcm-aesni" with the crypto API. These actually delegated to
one of three underlying implementations according to the CPU
capabilities detected at runtime: AES-NI, AES-NI + AVX, or AES-NI +
AVX2 (see the sketch after this list).
- The AES-NI + AVX and AES-NI + AVX2 assembly code was in
aesni-intel_avx-x86_64.S and consisted of 2804 lines of source and
257 KB of binary. This massive binary size was not really
appropriate, and depending on the kconfig it could take up over 1% of
the size of the entire vmlinux. The main loops did 8 blocks per
iteration. The AVX code minimized the use of carryless multiplication
whereas the AVX2 code did not. The "AVX2" code did not actually use
AVX2; the check for AVX2 was really a check for Intel Haswell or later
to detect support for fast carryless multiplication. The source was
long mainly because of significant code duplication.
- The AES-NI-only assembly code was in aesni-intel_asm.S and consisted
of 1501 lines of source and 15 KB of binary. The main loops did 4
blocks per iteration and minimized the use of carryless multiplication
by using Karatsuba multiplication and a multiplication-less reduction.
- The assembly code was contributed in 2010-2013. Maintenance has been
sporadic and most design choices haven't been revisited.
- The assembly function prototypes and the corresponding glue code were
separate from, and not consistent with, the new VAES-AVX10 code I
recently added. The older code had several issues, such as not
precomputing the GHASH key powers, which hurt performance.
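As an illustration of the old dispatch scheme, here is a minimal C
sketch (with hypothetical function names, not the actual kernel code):
a single set of function pointers, bound once at module load based on
CPU feature bits, so the crypto API saw only one algorithm regardless
of which implementation ran underneath.

  #include <linux/types.h>
  #include <linux/linkage.h>
  #include <asm/cpufeature.h>

  typedef void (*gcm_crypt_fn)(void *ctx, u8 *out, const u8 *in,
                               unsigned long len);

  /* Three assembly entry points (hypothetical names): */
  asmlinkage void aesni_gcm_enc_sse(void *ctx, u8 *out, const u8 *in,
                                    unsigned long len);
  asmlinkage void aesni_gcm_enc_avx(void *ctx, u8 *out, const u8 *in,
                                    unsigned long len);
  asmlinkage void aesni_gcm_enc_avx2(void *ctx, u8 *out, const u8 *in,
                                     unsigned long len);

  static gcm_crypt_fn aesni_gcm_enc;

  static void select_gcm_impl(void)
  {
          /* As noted above, the "AVX2" check was really a proxy for
           * Intel Haswell or later, i.e. fast carryless multiplication.
           */
          if (boot_cpu_has(X86_FEATURE_AVX2))
                  aesni_gcm_enc = aesni_gcm_enc_avx2;
          else if (boot_cpu_has(X86_FEATURE_AVX))
                  aesni_gcm_enc = aesni_gcm_enc_avx;
          else
                  aesni_gcm_enc = aesni_gcm_enc_sse;
  }

Hiding the choice behind one registered algorithm meant the self-tests
could only exercise whichever implementation the running CPU selected;
registering separate algorithms, as this rewrite does, avoids that.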
This rewrite achieves the following goals:
- Much shorter source and binary sizes. The assembly source shrinks
from 4300 lines to 1130 lines, and it produces about 9 KB of binary
instead of 272 KB. This is achieved via a better-designed AES-GCM
implementation that doesn't excessively unroll the code and instead
prioritizes the parts that really matter. Sharing the C glue code
with the VAES-AVX10 implementations also saves 250 lines of C source.
- Improve performance on most (possibly all) CPUs on which this code
runs, for most (possibly all) message lengths. Benchmark results are
given in Tables 1 and 2 below.
- Use the same function prototypes and glue code as the new VAES-AVX10
algorithms. This fixes some issues with the integration of the
assembly and results in some significant performance improvements,
primarily on short messages. Also, the AVX and non-AVX
implementations are now registered as separate algorithms with the
crypto API, which makes them both testable by the self-tests.
- Keep support for AES-NI without AVX (for Westmere, Silvermont,
Goldmont, and Tremont), but unify the source code with AES-NI + AVX.
Since 256-bit vectors cannot be used without VAES anyway, this is made
feasible by just using the non-VEX coded form of most instructions.
- Use a unified approach where the main loop does 8 blocks per iteration
and uses Karatsuba multiplication to save one pclmulqdq per block but
does not use the multiplication-less reduction (see the sketch after
this list). This strikes a good balance across the range of CPUs on
which this code runs.
- Don't spam the kernel log with an informational message on every boot.
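To make the Karatsuba trade-off concrete, here is a minimal user-space
sketch using C intrinsics (not the kernel's assembly; names are
illustrative): the 256-bit carryless product of two 128-bit GF(2)
polynomials computed with three pclmulqdq instructions instead of the
four that schoolbook multiplication needs.

  #include <emmintrin.h>   /* SSE2 */
  #include <wmmintrin.h>   /* PCLMULQDQ */

  /*
   * Split a = a1*x^64 + a0 and b = b1*x^64 + b0.  Karatsuba computes
   *   a*b = a1*b1*x^128
   *         + ((a1^a0)*(b1^b0) ^ a1*b1 ^ a0*b0)*x^64
   *         + a0*b0
   * using three carryless multiplications instead of four.
   */
  static void clmul128_karatsuba(__m128i a, __m128i b,
                                 __m128i *prod_lo, __m128i *prod_hi)
  {
          __m128i lo  = _mm_clmulepi64_si128(a, b, 0x00); /* a0 * b0 */
          __m128i hi  = _mm_clmulepi64_si128(a, b, 0x11); /* a1 * b1 */
          __m128i ax  = _mm_xor_si128(a, _mm_srli_si128(a, 8)); /* a1^a0 */
          __m128i bx  = _mm_xor_si128(b, _mm_srli_si128(b, 8)); /* b1^b0 */
          __m128i mid = _mm_clmulepi64_si128(ax, bx, 0x00);

          mid = _mm_xor_si128(mid, _mm_xor_si128(lo, hi));

          /* Add the middle 128-bit term at bit offset 64. */
          *prod_lo = _mm_xor_si128(lo, _mm_slli_si128(mid, 8));
          *prod_hi = _mm_xor_si128(hi, _mm_srli_si128(mid, 8));
  }

With the GHASH key powers H^1..H^8 precomputed, each of the 8 blocks in
the main loop needs one such multiplication, so saving one pclmulqdq
per block adds up. The multiplication-less reduction, by contrast,
replaces pclmulqdq with shifts and XORs, which is no longer a win on
CPUs with fast carryless multiplication.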
The following tables summarize the improvement in AES-GCM throughput on
various CPU microarchitectures as a result of this patch:
Table 1: AES-256-GCM encryption throughput improvement,
CPU microarchitecture vs. message length in bytes:
                   | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
-------------------+-------+-------+-------+-------+-------+-------+
Intel Broadwell    |    2% |    8% |   11% |   18% |   31% |   26% |
Intel Skylake      |    1% |    4% |    7% |   12% |   26% |   19% |
Intel Cascade Lake |    3% |    8% |   10% |   18% |   33% |   24% |
AMD Zen 1          |    6% |   12% |    6% |   15% |   27% |   24% |
AMD Zen 2          |    8% |   13% |   13% |   19% |   26% |   28% |
AMD Zen 3          |    8% |   14% |   13% |   19% |   26% |   25% |

                   |   300 |   200 |    64 |    63 |    16 |
-------------------+-------+-------+-------+-------+-------+
Intel Broadwell    |   35% |   29% |   45% |   55% |   54% |
Intel Skylake      |   25% |   19% |   28% |   33% |   27% |
Intel Cascade Lake |   36% |   28% |   39% |   49% |   54% |
AMD Zen 1          |   27% |   22% |   23% |   29% |   26% |
AMD Zen 2          |   32% |   24% |   22% |   25% |   31% |
AMD Zen 3          |   30% |   24% |   22% |   23% |   26% |
Table 2: AES-256-GCM decryption throughput improvement,
CPU microarchitecture vs. message length in bytes:
                   | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
-------------------+-------+-------+-------+-------+-------+-------+
Intel Broadwell    |    3% |    8% |   11% |   19% |   32% |   28% |
Intel Skylake      |    3% |    4% |    7% |   13% |   28% |   27% |
Intel Cascade Lake |    3% |    9% |   11% |   19% |   33% |   28% |
AMD Zen 1          |   15% |   18% |   14% |   20% |   36% |   33% |
AMD Zen 2          |    9% |   16% |   13% |   21% |   26% |   27% |
AMD Zen 3          |    8% |   15% |   12% |   18% |   23% |   23% |

                   |   300 |   200 |    64 |    63 |    16 |
-------------------+-------+-------+-------+-------+-------+
Intel Broadwell    |   36% |   31% |   40% |   51% |   53% |
Intel Skylake      |   28% |   21% |   23% |   30% |   30% |
Intel Cascade Lake |   36% |   29% |   36% |   47% |   53% |
AMD Zen 1          |   35% |   31% |   32% |   35% |   36% |
AMD Zen 2          |   31% |   30% |   27% |   38% |   30% |
AMD Zen 3          |   27% |   23% |   24% |   32% |   26% |
The above numbers are percentage improvements in single-thread
throughput, so e.g. an increase from 3000 MB/s to 3300 MB/s would be
listed as 10%. They were collected by directly measuring the Linux
crypto API performance using a custom kernel module. Note that indirect
benchmarks (e.g. 'cryptsetup benchmark' or benchmarking dm-crypt I/O)
include more overhead and won't see quite as much of a difference. All
these benchmarks used an associated data length of 16 bytes. Note that
AES-GCM is almost always used with short associated data lengths.
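To make the methodology concrete, a benchmark in that style might look
like the following minimal sketch (illustrative only, not the author's
actual module; error handling, cleanup, and statistics are omitted).
It times synchronous AES-256-GCM encryption of 4096-byte messages with
16 bytes of associated data through the kernel crypto API.

  #include <crypto/aead.h>
  #include <linux/err.h>
  #include <linux/module.h>
  #include <linux/scatterlist.h>
  #include <linux/timekeeping.h>

  #define ADLEN   16
  #define MSGLEN  4096
  #define TAGLEN  16
  #define ITERS   10000

  static int __init gcm_bench_init(void)
  {
          /* Mask out async implementations so encrypt() completes
           * inline.  "gcm(aes)" resolves to the highest-priority
           * implementation; a specific driver name such as
           * "generic-gcm-aesni" could be requested instead. */
          struct crypto_aead *tfm =
                  crypto_alloc_aead("gcm(aes)", 0, CRYPTO_ALG_ASYNC);
          static u8 buf[ADLEN + MSGLEN + TAGLEN];
          u8 key[32] = {}, iv[12] = {};
          struct aead_request *req;
          struct scatterlist sg;
          u64 t0, t1;
          int i;

          if (IS_ERR(tfm))
                  return PTR_ERR(tfm);
          crypto_aead_setkey(tfm, key, sizeof(key));
          crypto_aead_setauthsize(tfm, TAGLEN);
          req = aead_request_alloc(tfm, GFP_KERNEL);
          sg_init_one(&sg, buf, sizeof(buf));
          aead_request_set_ad(req, ADLEN);
          aead_request_set_crypt(req, &sg, &sg, MSGLEN, iv);

          t0 = ktime_get_ns();
          for (i = 0; i < ITERS; i++)
                  crypto_aead_encrypt(req);
          t1 = ktime_get_ns();
          pr_info("AES-GCM: ~%llu MB/s\n",
                  (u64)MSGLEN * ITERS * 1000 / (t1 - t0));
          return 0;
  }
  module_init(gcm_bench_init);
  MODULE_LICENSE("GPL");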
I didn't test Intel CPUs before Broadwell, AMD CPUs before Zen 1, or
Intel low-power CPUs, as these weren't readily available to me.
However, based on the design of the new code and the available
information about these other CPU microarchitectures, I wouldn't expect
any significant regressions, and there's a good chance performance is
improved just as it is above.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>