Rewrite the AES-NI implementations of AES-GCM, taking advantage of
things I learned while writing the VAES-AVX10 implementations. This is
a complete rewrite that reduces the AES-NI GCM source code size by about
70% and the binary code size by about 95%, while not regressing
performance and in fact improving it significantly in many cases.
The following summarizes the state before this patch:
- The aesni-intel module registered algorithms "generic-gcm-aesni" and
"rfc4106-gcm-aesni" with the crypto API. These actually delegated to
one of three underlying implementations according to the CPU
capabilities detected at runtime: AES-NI, AES-NI + AVX, or AES-NI +
AVX2 (see the sketch after this list).
- The AES-NI + AVX and AES-NI + AVX2 assembly code was in
aesni-intel_avx-x86_64.S and consisted of 2804 lines of source and
257 KB of binary. This massive binary size was not really
appropriate, and depending on the kconfig it could take up over 1% of
the size of the entire vmlinux. The main loops did 8 blocks per
iteration. The AVX code minimized the use of carryless multiplication
whereas the AVX2 code did not. The "AVX2" code did not actually use
AVX2; the check for AVX2 was really a check for Intel Haswell or later
to detect support for fast carryless multiplication. The source was
long mainly because of significant code duplication.
- The AES-NI-only assembly code was in aesni-intel_asm.S and consisted
of 1501 lines of source and 15 KB of binary. The main loops did 4
blocks per iteration and minimized the use of carryless multiplication
by using Karatsuba multiplication and a multiplication-less reduction.
- The assembly code was contributed in 2010-2013. Maintenance has been
sporadic and most design choices haven't been revisited.
- The assembly function prototypes and the corresponding glue code were
separate from, and not consistent with, the new VAES-AVX10 code I
recently added. The older code had several issues, such as not
precomputing the GHASH key powers, which hurt performance.
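As an illustration of the old dispatch scheme, here is a minimal C
sketch (with hypothetical function names, not the actual kernel code):
a single set of function pointers, bound once at module load based on
CPU feature bits, so the crypto API saw only one algorithm regardless
of which implementation ran underneath.

  #include <linux/types.h>
  #include <linux/linkage.h>
  #include <asm/cpufeature.h>

  typedef void (*gcm_crypt_fn)(void *ctx, u8 *out, const u8 *in,
                               unsigned long len);

  /* Three assembly entry points (hypothetical names): */
  asmlinkage void aesni_gcm_enc_sse(void *ctx, u8 *out, const u8 *in,
                                    unsigned long len);
  asmlinkage void aesni_gcm_enc_avx(void *ctx, u8 *out, const u8 *in,
                                    unsigned long len);
  asmlinkage void aesni_gcm_enc_avx2(void *ctx, u8 *out, const u8 *in,
                                     unsigned long len);

  static gcm_crypt_fn aesni_gcm_enc;

  static void select_gcm_impl(void)
  {
          /* As noted above, the "AVX2" check was really a proxy for
           * Intel Haswell or later, i.e. fast carryless multiplication.
           */
          if (boot_cpu_has(X86_FEATURE_AVX2))
                  aesni_gcm_enc = aesni_gcm_enc_avx2;
          else if (boot_cpu_has(X86_FEATURE_AVX))
                  aesni_gcm_enc = aesni_gcm_enc_avx;
          else
                  aesni_gcm_enc = aesni_gcm_enc_sse;
  }

Hiding the choice behind one registered algorithm meant the self-tests
could only exercise whichever implementation the running CPU selected;
registering separate algorithms, as this rewrite does, avoids that.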
This rewrite achieves the following goals:
- Much shorter source and binary sizes. The assembly source shrinks
from 4300 lines to 1130 lines, and it produces about 9 KB of binary
instead of 272 KB. This is achieved via a better-designed AES-GCM
implementation that doesn't excessively unroll the code and instead
prioritizes the parts that really matter. Sharing the C glue code
with the VAES-AVX10 implementations also saves 250 lines of C source.
- Improve performance on most (possibly all) CPUs on which this code
runs, for most (possibly all) message lengths. Benchmark results are
given in Tables 1 and 2 below.
- Use the same function prototypes and glue code as the new VAES-AVX10
algorithms. This fixes some issues with the integration of the
assembly and results in some significant performance improvements,
primarily on short messages. Also, the AVX and non-AVX
implementations are now registered as separate algorithms with the
crypto API, which makes them both testable by the self-tests.
- Keep support for AES-NI without AVX (for Westmere, Silvermont,
Goldmont, and Tremont), but unify the source code with AES-NI + AVX.
Since 256-bit vectors cannot be used without VAES anyway, this is made
feasible by just using the non-VEX coded form of most instructions.
- Use a unified approach where the main loop does 8 blocks per iteration
and uses Karatsuba multiplication to save one pclmulqdq per block but
does not use the multiplication-less reduction (see the sketch after
this list). This strikes a good balance across the range of CPUs on
which this code runs.
- Don't spam the kernel log with an informational message on every boot.
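To make the Karatsuba trade-off concrete, here is a minimal user-space
sketch using C intrinsics (not the kernel's assembly; names are
illustrative): the 256-bit carryless product of two 128-bit GF(2)
polynomials computed with three pclmulqdq instructions instead of the
four that schoolbook multiplication needs.

  #include <emmintrin.h>   /* SSE2 */
  #include <wmmintrin.h>   /* PCLMULQDQ */

  /*
   * Split a = a1*x^64 + a0 and b = b1*x^64 + b0.  Karatsuba computes
   *   a*b = a1*b1*x^128
   *         + ((a1^a0)*(b1^b0) ^ a1*b1 ^ a0*b0)*x^64
   *         + a0*b0
   * using three carryless multiplications instead of four.
   */
  static void clmul128_karatsuba(__m128i a, __m128i b,
                                 __m128i *prod_lo, __m128i *prod_hi)
  {
          __m128i lo  = _mm_clmulepi64_si128(a, b, 0x00); /* a0 * b0 */
          __m128i hi  = _mm_clmulepi64_si128(a, b, 0x11); /* a1 * b1 */
          __m128i ax  = _mm_xor_si128(a, _mm_srli_si128(a, 8)); /* a1^a0 */
          __m128i bx  = _mm_xor_si128(b, _mm_srli_si128(b, 8)); /* b1^b0 */
          __m128i mid = _mm_clmulepi64_si128(ax, bx, 0x00);

          mid = _mm_xor_si128(mid, _mm_xor_si128(lo, hi));

          /* Add the middle 128-bit term at bit offset 64. */
          *prod_lo = _mm_xor_si128(lo, _mm_slli_si128(mid, 8));
          *prod_hi = _mm_xor_si128(hi, _mm_srli_si128(mid, 8));
  }

With the GHASH key powers H^1..H^8 precomputed, each of the 8 blocks in
the main loop needs one such multiplication, so saving one pclmulqdq
per block adds up. The multiplication-less reduction, by contrast,
replaces pclmulqdq with shifts and XORs, which is no longer a win on
CPUs with fast carryless multiplication.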
The following tables summarize the improvement in AES-GCM throughput on
various CPU microarchitectures as a result of this patch:
Table 1: AES-256-GCM encryption throughput improvement,
CPU microarchitecture vs. message length in bytes:
                   | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
-------------------+-------+-------+-------+-------+-------+-------+
Intel Broadwell    |    2% |    8% |   11% |   18% |   31% |   26% |
Intel Skylake      |    1% |    4% |    7% |   12% |   26% |   19% |
Intel Cascade Lake |    3% |    8% |   10% |   18% |   33% |   24% |
AMD Zen 1          |    6% |   12% |    6% |   15% |   27% |   24% |
AMD Zen 2          |    8% |   13% |   13% |   19% |   26% |   28% |
AMD Zen 3          |    8% |   14% |   13% |   19% |   26% |   25% |

                   |   300 |   200 |    64 |    63 |    16 |
-------------------+-------+-------+-------+-------+-------+
Intel Broadwell    |   35% |   29% |   45% |   55% |   54% |
Intel Skylake      |   25% |   19% |   28% |   33% |   27% |
Intel Cascade Lake |   36% |   28% |   39% |   49% |   54% |
AMD Zen 1          |   27% |   22% |   23% |   29% |   26% |
AMD Zen 2          |   32% |   24% |   22% |   25% |   31% |
AMD Zen 3          |   30% |   24% |   22% |   23% |   26% |
Table 2: AES-256-GCM decryption throughput improvement,
CPU microarchitecture vs. message length in bytes:
                   | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
-------------------+-------+-------+-------+-------+-------+-------+
Intel Broadwell    |    3% |    8% |   11% |   19% |   32% |   28% |
Intel Skylake      |    3% |    4% |    7% |   13% |   28% |   27% |
Intel Cascade Lake |    3% |    9% |   11% |   19% |   33% |   28% |
AMD Zen 1          |   15% |   18% |   14% |   20% |   36% |   33% |
AMD Zen 2          |    9% |   16% |   13% |   21% |   26% |   27% |
AMD Zen 3          |    8% |   15% |   12% |   18% |   23% |   23% |

                   |   300 |   200 |    64 |    63 |    16 |
-------------------+-------+-------+-------+-------+-------+
Intel Broadwell    |   36% |   31% |   40% |   51% |   53% |
Intel Skylake      |   28% |   21% |   23% |   30% |   30% |
Intel Cascade Lake |   36% |   29% |   36% |   47% |   53% |
AMD Zen 1          |   35% |   31% |   32% |   35% |   36% |
AMD Zen 2          |   31% |   30% |   27% |   38% |   30% |
AMD Zen 3          |   27% |   23% |   24% |   32% |   26% |
The above numbers are percentage improvements in single-thread
throughput, so e.g. an increase from 3000 MB/s to 3300 MB/s would be
listed as 10%. They were collected by directly measuring the Linux
crypto API performance using a custom kernel module. Note that indirect
benchmarks (e.g. 'cryptsetup benchmark' or benchmarking dm-crypt I/O)
include more overhead and won't see quite as much of a difference. All
these benchmarks used an associated data length of 16 bytes. Note that
AES-GCM is almost always used with short associated data lengths.
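To make the methodology concrete, a benchmark in that style might look
like the following minimal sketch (illustrative only, not the author's
actual module; error handling, cleanup, and statistics are omitted).
It times synchronous AES-256-GCM encryption of 4096-byte messages with
16 bytes of associated data through the kernel crypto API.

  #include <crypto/aead.h>
  #include <linux/err.h>
  #include <linux/module.h>
  #include <linux/scatterlist.h>
  #include <linux/timekeeping.h>

  #define ADLEN   16
  #define MSGLEN  4096
  #define TAGLEN  16
  #define ITERS   10000

  static int __init gcm_bench_init(void)
  {
          /* Mask out async implementations so encrypt() completes
           * inline.  "gcm(aes)" resolves to the highest-priority
           * implementation; a specific driver name such as
           * "generic-gcm-aesni" could be requested instead. */
          struct crypto_aead *tfm =
                  crypto_alloc_aead("gcm(aes)", 0, CRYPTO_ALG_ASYNC);
          static u8 buf[ADLEN + MSGLEN + TAGLEN];
          u8 key[32] = {}, iv[12] = {};
          struct aead_request *req;
          struct scatterlist sg;
          u64 t0, t1;
          int i;

          if (IS_ERR(tfm))
                  return PTR_ERR(tfm);
          crypto_aead_setkey(tfm, key, sizeof(key));
          crypto_aead_setauthsize(tfm, TAGLEN);
          req = aead_request_alloc(tfm, GFP_KERNEL);
          sg_init_one(&sg, buf, sizeof(buf));
          aead_request_set_ad(req, ADLEN);
          aead_request_set_crypt(req, &sg, &sg, MSGLEN, iv);

          t0 = ktime_get_ns();
          for (i = 0; i < ITERS; i++)
                  crypto_aead_encrypt(req);
          t1 = ktime_get_ns();
          pr_info("AES-GCM: ~%llu MB/s\n",
                  (u64)MSGLEN * ITERS * 1000 / (t1 - t0));
          return 0;
  }
  module_init(gcm_bench_init);
  MODULE_LICENSE("GPL");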
I didn't test Intel CPUs before Broadwell, AMD CPUs before Zen 1, or
Intel low-power CPUs, as these weren't readily available to me.
However, based on the design of the new code and the available
information about these other CPU microarchitectures, I wouldn't expect
any significant regressions, and there's a good chance performance is
improved just as it is above.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>