1. When reducing 4 vectors to 1, fold across 2 vectors first and then
   across 1, instead of across 1 and then across 2. This works slightly
   better because the constants are then used in order.

2. Use a faster algorithm to reduce 1 vector to a scalar value. This is
   the same approach I used in the assembly code I recently wrote for
   the Linux kernel in the patch series
   https://lore.kernel.org/lkml/20250210174540.161705-1-ebiggers@kernel.org/T/#u
   (search for "reduce_128bits_to_crc").

On Skylake (which uses num_pclmul_streams=2), this improves CRC32C
performance on 2048-byte messages by about 2%. The overall improvement
is small because FinalizePclmulStream() is only called for messages of
at least 2048 bytes, and only num_pclmul_streams times per message. So
it is not really a bottleneck, but the new code is a bit shorter and
faster.

PiperOrigin-RevId: 739002382
Change-Id: I0505e61f012e4a4f8b85958f7f00478f5b1a7026
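
As an illustration of point 1, the following is a minimal portable sketch, not the code from this change. It assumes a simplified scalar model: four 32-bit lanes stand in for the four 128-bit vectors, polynomial bits are in non-reflected order, and every product is fully reduced modulo the CRC32C polynomial instead of using PCLMULQDQ. All function names are hypothetical. It only shows that the two folding orders reduce to the same remainder; it does not model the constant ordering or the faster vector-to-scalar reduction from point 2.

```cpp
#include <array>
#include <cstdint>

// Low 32 bits of the CRC32C polynomial in non-reflected form
// (the x^32 term is implicit).
constexpr uint32_t kPolyLow = 0x1EDC6F41u;

// Multiplies a polynomial of degree < 32 by x, reduced modulo P.
inline uint32_t MulByX(uint32_t r) {
  const uint32_t carry = r >> 31;
  r <<= 1;
  return carry ? (r ^ kPolyLow) : r;
}

// Returns x^n mod P.
inline uint32_t XPowMod(int n) {
  uint32_t r = 1;
  for (int i = 0; i < n; ++i) r = MulByX(r);
  return r;
}

// Returns a * b mod P over GF(2), using Horner's rule on the bits of b.
inline uint32_t MulMod(uint32_t a, uint32_t b) {
  uint32_t r = 0;
  for (int i = 31; i >= 0; --i) {
    r = MulByX(r);
    if ((b >> i) & 1u) r ^= a;
  }
  return r;
}

// Reference: bit-by-bit long division of the 128-bit message
// v[0]*x^96 + v[1]*x^64 + v[2]*x^32 + v[3] by P.
inline uint32_t ReduceBitwise(const std::array<uint32_t, 4>& v) {
  uint32_t r = 0;
  for (uint32_t word : v) {
    for (int i = 31; i >= 0; --i) {
      const uint32_t carry = r >> 31;
      r = (r << 1) | ((word >> i) & 1u);
      if (carry) r ^= kPolyLow;
    }
  }
  return r;
}

// Fold across 2 lanes first (4 -> 2), then across 1 lane (2 -> 1).
inline uint32_t FoldAcrossTwoThenOne(std::array<uint32_t, 4> v) {
  const uint32_t k64 = XPowMod(64);
  const uint32_t k32 = XPowMod(32);
  v[2] ^= MulMod(v[0], k64);
  v[3] ^= MulMod(v[1], k64);
  v[3] ^= MulMod(v[2], k32);
  return v[3];
}

// Fold across 1 lane first (pairs of neighbors), then across 2 lanes.
inline uint32_t FoldAcrossOneThenTwo(std::array<uint32_t, 4> v) {
  const uint32_t k64 = XPowMod(64);
  const uint32_t k32 = XPowMod(32);
  v[1] ^= MulMod(v[0], k32);
  v[3] ^= MulMod(v[2], k32);
  v[3] ^= MulMod(v[1], k64);
  return v[3];
}
```

In this model both folding functions return the same value as ReduceBitwise() for any input, so the choice between them affects only how the work is scheduled, not the result.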