This will let us deprecate the declarations without triggering warnings in Abseil itself.
PiperOrigin-RevId: 906360966
Change-Id: Iee362ac0eac647909ef38003280f1179813f764d
absl::variant, and related types
The corresponding headers are removed from cc files, but kept in
headers to prevent breakages from transitive dependencies.
PiperOrigin-RevId: 872421685
Change-Id: I867d4c3f7c9e422289c63816d44719b0530fb0a6
Imported from GitHub PR https://github.com/abseil/abseil-cpp/pull/1944
Increase the consistency between _mm_loadu_si128 and _mm_stream_si128 by using vector loads/stores of 64-bit elements in both. This should have no impact on existing users. On aarch64 (release build, GCC 15.2), crc_non_temporal_memcpy.cc.o stays effectively the same, the only change being as follows:
```
--- crc_non_temporal_memcpy.cc.o (original)
+++ crc_non_temporal_memcpy.cc.o (patched)
├── objdump --line-numbers --disassemble --demangle --reloc --no-show-raw-insn --section=.text {}
│ @@ -255,15 +255,15 @@
│ add x2, x21, x2
│ mov x0, x21
│ ldp q31, q30, [x0, #32]
│ add x1, x1, #0x40
│ ldp q29, q28, [x0], #64
│ stp q31, q30, [x1, #-32]
│ stp q29, q28, [x1, #-64]
│ - cmp x0, x2
│ + cmp x2, x0
│ b.ne 3b0 <absl::crc_internal::CrcNonTemporalMemcpyEngine::Compute(void*, void const*, unsigned long, absl::crc32c_t) const+0x270> // b.any
│ and x0, x3, #0xffffffffffffffc0
│ and x23, x23, #0x3f
│ dmb ish
│ add x22, x22, x0
│ add x21, x21, x0
│ b 380 <absl::crc_internal::CrcNonTemporalMemcpyEngine::Compute(void*, void const*, unsigned long, absl::crc32c_t) const+0x240>
```
On big-endian Arm (aarch64_be), this fixes a bug in non_temporal_store_memcpy, in which each 32-bit half out of a 64-bit parcel of memory was swapped with the other. For example, the byte sequence 218edf0b 13c68753 would be copied as 13c68753 218edf0b.
Merge 8f08d4c792 into e5c6ccbc96
Merging this change closes #1944
COPYBARA_INTEGRATE_REVIEW=https://github.com/abseil/abseil-cpp/pull/1944 from neuschaefer:nontemp 8f08d4c792
PiperOrigin-RevId: 819779377
Change-Id: I46c8c5540fb4786948c5f16d25630fbbab892602
`CRC32_u64` generates the x86 `CRC32` instruction, which has a 3-cycle latency. Because of that, the `crc` variable below creates a loop-carried dependency of 3 cycles per iteration.
```
for (int i = 0; i < 8; i++) {
crc = CRC32_u64(static_cast<uint32_t>(crc), absl::little_endian::Load64(p));
p += 8;
}
```
Total latency for a 64-byte block is 29 cycles (codegen: https://godbolt.org/z/zxsrGMEPs, llvm-mca: https://godbolt.org/z/xrTMhhd1E).
It is therefore more efficient to interleave the `CRC32_u64` calls at a lower level (up to 3 independent calls, matching the 3-cycle latency). Even with 3 interleaved streams, the total latency for three 64-byte blocks is only 33 cycles (codegen: https://godbolt.org/z/5ojzPdj3h, llvm-mca: https://godbolt.org/z/5cEPxvddW). And this is without considering any inlining.
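The dependency argument above can be sketched portably. In the block below, `ExampleCrc32cU64` is a hypothetical bitwise software stand-in for one hardware `CRC32` step (it is not Abseil's `CRC32_u64`), and `ExampleLoad64`, `ExampleOneStream` and `ExampleThreeStreams` are illustrative names. The sketch only shows the shape of the two loops. It does not reproduce the cycle counts quoted above, and it leaves out the step that combines the three partial CRCs into one.

```cpp
#include <cstdint>

// Hypothetical software model of one 64-bit CRC32C step (reflected
// Castagnoli polynomial 0x82F63B78). Stand-in for the hardware instruction.
inline std::uint32_t ExampleCrc32cU64(std::uint32_t crc, std::uint64_t v) {
  std::uint64_t state = crc ^ v;
  for (int bit = 0; bit < 64; ++bit) {
    state = (state >> 1) ^ (0x82F63B78u & (0 - (state & 1)));
  }
  return static_cast<std::uint32_t>(state);
}

// Little-endian 64-bit load assembled from bytes, so host endianness does
// not matter for this sketch.
inline std::uint64_t ExampleLoad64(const unsigned char* p) {
  std::uint64_t v = 0;
  for (int i = 7; i >= 0; --i) v = (v << 8) | p[i];
  return v;
}

// One chain: every step needs the crc produced by the previous step.
inline std::uint32_t ExampleOneStream(std::uint32_t crc,
                                      const unsigned char* p) {
  for (int i = 0; i < 8; ++i) {
    crc = ExampleCrc32cU64(crc, ExampleLoad64(p));
    p += 8;
  }
  return crc;
}

// Three independent chains advanced in the same loop body. With a pipelined
// 3-cycle instruction, steps from different chains can overlap.
inline void ExampleThreeStreams(std::uint32_t crc[3], const unsigned char* p0,
                                const unsigned char* p1,
                                const unsigned char* p2) {
  std::uint32_t c0 = crc[0], c1 = crc[1], c2 = crc[2];
  for (int i = 0; i < 8; ++i) {
    c0 = ExampleCrc32cU64(c0, ExampleLoad64(p0 + 8 * i));
    c1 = ExampleCrc32cU64(c1, ExampleLoad64(p1 + 8 * i));
    c2 = ExampleCrc32cU64(c2, ExampleLoad64(p2 + 8 * i));
  }
  crc[0] = c0;
  crc[1] = c1;
  crc[2] = c2;
}
```

Each interleaved chain produces the same value as running `ExampleOneStream` over its own 64-byte block; only the scheduling freedom differs.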
PiperOrigin-RevId: 799757460
Change-Id: I80118d5c1736ae31d69e5624c94cc0a6513ef28f
Currently, at the start of the Extend() call we process some number of bytes to align the length to a multiple of 16. However, for large inputs we then process another small number of bytes to align the next load address to 8 bytes, which undoes the length alignment. At the end of the call, we process the remaining bytes anyway.
The initial length alignment is therefore not useful. We also never return early here for small inputs, since this function is only used for lengths > 64. Removing the alignment step reduces the amount of time we spend processing only a small number of bytes at a time.
Also, we can optimize processing the remaining bytes at the end by leveraging the CRC32 instructions for 2, 4, and 8 bytes.
This looks to be about 2-5% faster on various platforms for typical input sizes.
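A minimal sketch of the tail handling described above, assuming the tail is shorter than 16 bytes. `ExampleCrc32cStep` is a hypothetical software model of the 1-, 2-, 4- and 8-byte CRC32 instructions, and `ExampleLoadLE` and `ExampleProcessTail` are illustrative names rather than Abseil functions.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical software model of a CRC32C instruction that consumes
// `nbytes` (1, 2, 4 or 8) bytes held little-endian in `v`.
inline std::uint32_t ExampleCrc32cStep(std::uint32_t crc, std::uint64_t v,
                                       int nbytes) {
  std::uint64_t state = crc ^ v;
  for (int bit = 0; bit < 8 * nbytes; ++bit) {
    state = (state >> 1) ^ (0x82F63B78u & (0 - (state & 1)));
  }
  return static_cast<std::uint32_t>(state);
}

// Little-endian load of `nbytes` bytes, assembled byte by byte.
inline std::uint64_t ExampleLoadLE(const unsigned char* p, int nbytes) {
  std::uint64_t v = 0;
  for (int i = nbytes - 1; i >= 0; --i) v = (v << 8) | p[i];
  return v;
}

// Consumes a tail of fewer than 16 bytes in at most four steps. The bits of
// `len` select which of the 8-, 4-, 2- and 1-byte steps are needed.
inline std::uint32_t ExampleProcessTail(std::uint32_t crc,
                                        const unsigned char* p,
                                        std::size_t len) {
  for (int width = 8; width >= 1; width /= 2) {
    if (len & static_cast<std::size_t>(width)) {
      crc = ExampleCrc32cStep(crc, ExampleLoadLE(p, width), width);
      p += width;
    }
  }
  return crc;
}
```

A 15-byte tail, for instance, takes four steps (8 + 4 + 2 + 1) instead of fifteen single-byte steps.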
PiperOrigin-RevId: 793697720
Change-Id: Ibe71a51c851863ad40acef7d334694a9ac930f4d
Optimize multiply() (renamed to MultiplyWithExtraX33()) to eliminate
several instructions that were present only to avoid introducing an
extra factor of x^33 into the multiplication. It's actually fine to
introduce the extra factor of x^33 as long as it's canceled out with an
extra factor of x^-33 in all the kCRC32CPowers[] entries.
To make this work, the number of bits dropped by ComputeZeroConstant()
had to be increased from 2 to at least 3, since 2^(i + 3 +
kNumDroppedBits) - 33 must be >= 0 for all i including i=0; otherwise
kCRC32CPowers[0] would need a negative power of x. However, this is
fine since it's more efficient to utilize CRC32_u32() and CRC32_u64()
for bits 2 and 3 anyway. So, increase kNumDroppedBits to 4.
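The bound on the number of dropped bits can be checked with a few lines of arithmetic. `ExamplePowerExponent` is a hypothetical helper that simply evaluates the expression 2^(i + 3 + kNumDroppedBits) - 33 from the paragraph above; it is not the generator script.

```cpp
#include <cstdint>

// Exponent of x needed for table entry i, following the expression
// 2^(i + 3 + kNumDroppedBits) - 33 from the description above.
constexpr std::int64_t ExamplePowerExponent(int i, int num_dropped_bits) {
  return (std::int64_t{1} << (i + 3 + num_dropped_bits)) - 33;
}

// With 2 dropped bits, entry 0 would need x^-1, which is not representable.
static_assert(ExamplePowerExponent(0, 2) == -1, "negative power of x");
// 3 dropped bits is the smallest value that keeps entry 0 non-negative.
static_assert(ExamplePowerExponent(0, 3) == 31, "minimum workable value");
// 4 dropped bits is the value this change settles on.
static_assert(ExamplePowerExponent(0, 4) == 95, "value used by this change");
```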
Add a Python script that generates the updated kCRC32CPowers[]. It
isn't wired up to the build system, but rather is just added so that
kCRC32CPowers[] can be reproduced.
Also add a test which tests ExtendCrc32cByZeroes() with all the length
bits, thus testing all the entries of kCRC32CPowers[].
Note that the kCRC32CPowers[] generation script and new test case are
things we should have had anyway, regardless of the x^33 optimization.
This change slightly improves the performance of Extend() for lengths
greater than or equal to 2048 bytes, and also the performance of
ExtendByZeroes(). It also slightly reduces the binary code size.
Before:
Benchmark                      Time         CPU   Iterations
BM_Calculate/2048           84.3 ns     84.3 ns      8307735
BM_Calculate/10000           376 ns      375 ns      1865976
BM_Calculate/500000        18538 ns    18531 ns        37813
BM_ExtendByZeroes/1         3.55 ns     3.55 ns    197111095
BM_ExtendByZeroes/10        3.90 ns     3.89 ns    179773877
BM_ExtendByZeroes/100       6.06 ns     6.06 ns    115242160
BM_ExtendByZeroes/1000      12.0 ns     12.0 ns     58078004
BM_ExtendByZeroes/10000     9.97 ns     9.97 ns     70335772
BM_ExtendByZeroes/100000    12.1 ns     12.1 ns     58157829
BM_ExtendByZeroes/1000000   14.4 ns     14.4 ns     48527365
After:
Benchmark                      Time         CPU   Iterations
BM_Calculate/2048           82.8 ns     82.7 ns      8478296
BM_Calculate/10000           375 ns      375 ns      1869663
BM_Calculate/500000        18547 ns    18538 ns        37846
BM_ExtendByZeroes/1         2.96 ns     2.96 ns    236772500
BM_ExtendByZeroes/10        3.85 ns     3.85 ns    182059238
BM_ExtendByZeroes/100       5.42 ns     5.42 ns    129077546
BM_ExtendByZeroes/1000      9.43 ns     9.42 ns     74232457
BM_ExtendByZeroes/10000     8.14 ns     8.14 ns     86244218
BM_ExtendByZeroes/100000    10.7 ns     10.7 ns     65467391
BM_ExtendByZeroes/1000000   11.0 ns     11.0 ns     63575936
PiperOrigin-RevId: 786828855
Change-Id: I6208625fd1c35c2c137e756cf5fadc1adccfdd5d
My previous CL optimized the Barrett reduction. But since this is CRC32C and
scalar instructions for it are available, there is actually no need for Barrett
reduction at all. Just use two 64-bit CRC32C instructions to reduce fullCRC.
This improves CRC32C performance on 2048-byte messages on Skylake by another 2%
or so.
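The idea of replacing the Barrett reduction with two scalar steps can be modeled in software. This is a sketch under assumptions: `ExampleCrc32cU64` is a hypothetical bitwise model of the 64-bit CRC32C instruction, `ExampleReduce128` and `ExampleReduce128Bytewise` are illustrative names, and feeding the low half first with a zero starting CRC is an assumption about the operand order, not a statement about the actual Abseil code.

```cpp
#include <cstdint>

// Hypothetical software model of one 64-bit CRC32C instruction step.
inline std::uint32_t ExampleCrc32cU64(std::uint32_t crc, std::uint64_t v) {
  std::uint64_t state = crc ^ v;
  for (int bit = 0; bit < 64; ++bit) {
    state = (state >> 1) ^ (0x82F63B78u & (0 - (state & 1)));
  }
  return static_cast<std::uint32_t>(state);
}

// Reduces the 128-bit value (hi:lo) to 32 bits with two 64-bit steps,
// starting from a zero CRC. Assumed order: low half, then high half.
inline std::uint32_t ExampleReduce128(std::uint64_t lo, std::uint64_t hi) {
  const std::uint32_t crc = ExampleCrc32cU64(0, lo);
  return ExampleCrc32cU64(crc, hi);
}

// Reference: the same 16 bytes fed one byte at a time, lowest byte first.
inline std::uint32_t ExampleReduce128Bytewise(std::uint64_t lo,
                                              std::uint64_t hi) {
  std::uint32_t crc = 0;
  const std::uint64_t words[2] = {lo, hi};
  for (std::uint64_t w : words) {
    for (int i = 0; i < 8; ++i) {
      crc ^= static_cast<std::uint32_t>((w >> (8 * i)) & 0xFF);
      for (int bit = 0; bit < 8; ++bit) {
        crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
      }
    }
  }
  return crc;
}
```

Because the CRC step is itself a polynomial reduction modulo the CRC32C polynomial, the two wide steps give the same remainder as the byte-wise reference, with no separate reduction constants.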
PiperOrigin-RevId: 739977426
Change-Id: I4611af88cd32ed7a995e772a13c30e3bdcec8de9
1. When reducing 4 vectors to 1, fold across 2 vectors first and then across 1,
instead of across 1 and then across 2. This works slightly better because it
makes the constants be used in order.
2. Use a faster algorithm to reduce 1 vector to a scalar value.
This approach is the same one I used in the assembly code I recently wrote for
the Linux kernel in the patch series
https://lore.kernel.org/lkml/20250210174540.161705-1-ebiggers@kernel.org/T/#u
(search for "reduce_128bits_to_crc").
On Skylake (which uses num_pclmul_streams=2), this improves CRC32C performance
on 2048-byte messages by about 2%. The overall improvement is relatively small
since FinalizePclmulStream() is only called for messages >= 2048 bytes and is
only called num_pclmul_streams times per message. So it's not really a
bottleneck, but the new code is definitely a bit shorter and faster.
PiperOrigin-RevId: 739002382
Change-Id: I0505e61f012e4a4f8b85958f7f00478f5b1a7026
Imported from GitHub PR https://github.com/abseil/abseil-cpp/pull/1833
`ABSL_INTERNAL_STEP1`, `ABSL_INTERNAL_STEP2`, and `ABSL_INTERNAL_STEP4` assumed that a variable named `p` exists wherever they are used, while the similar macro `ABSL_INTERNAL_STEP8` correctly took the pointer as a macro argument. This PR updates all of them to take an extra parameter instead of relying on `p`'s existence. It also renames the `data` parameter of `ABSL_INTERNAL_STEP8` to `p`, for consistency with the others.
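The difference between the two macro styles can be shown with a simplified sketch. `EXAMPLE_STEP1_IMPLICIT`, `EXAMPLE_STEP1` and `ExampleByteStep` are hypothetical names with a simplified body; they are not the actual `ABSL_INTERNAL_STEP*` definitions.

```cpp
#include <cstdint>

// Hypothetical one-byte CRC32C update, used only to give the macros a body.
inline std::uint32_t ExampleByteStep(std::uint32_t crc, unsigned char b) {
  crc ^= b;
  for (int i = 0; i < 8; ++i) {
    crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
  }
  return crc;
}

// Before: compiles only where a variable literally named `p` is in scope.
#define EXAMPLE_STEP1_IMPLICIT(crc) \
  do { \
    (crc) = ExampleByteStep((crc), *p++); \
  } while (0)

// After: the cursor is an explicit argument, so any variable name works.
#define EXAMPLE_STEP1(crc, p) \
  do { \
    (crc) = ExampleByteStep((crc), *(p)++); \
  } while (0)
```

With the explicit form, a caller whose pointer is called `cursor` can write `EXAMPLE_STEP1(crc, cursor);`, whereas the implicit form would fail to compile there.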
Merge 9a89bb0b62 into e3183f1584
Merging this change closes #1833
COPYBARA_INTEGRATE_REVIEW=https://github.com/abseil/abseil-cpp/pull/1833 from pps83:master-macrofix 9a89bb0b62
PiperOrigin-RevId: 728751982
Change-Id: I48c3635f8d22848115744f6e9869717136385154
non_temporal_store_memcpy_avx uses gnu::target("avx") to use AVX intrinsics
inside its function body even if the compiler was not configured for AVX
support. This is OK because non_temporal_store_memcpy_avx is guarded by a cpuid
check before it is called.
However, non_temporal_memcpy_test.cc performs no such cpuid guard. In practice,
nobody will really notice this bug as CPUs have had AVX for a long time by now.
That said, this does come up if one has compiled absl for x86_64 and runs the
binary on an arm64 Mac. This is because the Rosetta 2 emulation environment does
not support AVX or newer instructions.
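The pattern described above, a function compiled for AVX regardless of the global flags and a runtime check in front of every call, might look roughly like this. `ExampleCopyAvx`, `ExampleHostHasAvx` and `ExampleCopy` are hypothetical names; the sketch uses ordinary unaligned AVX loads and stores rather than non-temporal stores, and it assumes GCC or Clang on x86_64 for the AVX path, falling back to `memcpy` everywhere else.

```cpp
#include <cstddef>
#include <cstring>

#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
#include <immintrin.h>
#define EXAMPLE_HAVE_AVX_PATH 1

// Compiled with AVX enabled for this one function only, whatever flags the
// rest of the translation unit was built with.
__attribute__((target("avx"))) inline void ExampleCopyAvx(void* dst,
                                                          const void* src,
                                                          std::size_t len) {
  auto* d = static_cast<unsigned char*>(dst);
  const auto* s = static_cast<const unsigned char*>(src);
  while (len >= 32) {
    const __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(s));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(d), v);
    s += 32;
    d += 32;
    len -= 32;
  }
  std::memcpy(d, s, len);  // Copy the remaining 0..31 bytes.
}
#endif

// Runtime check that must guard every call to the AVX function, including
// calls made from tests.
inline bool ExampleHostHasAvx() {
#ifdef EXAMPLE_HAVE_AVX_PATH
  return __builtin_cpu_supports("avx") != 0;
#else
  return false;
#endif
}

// Dispatches to the AVX path only when the running CPU (or emulator)
// reports AVX support.
inline void ExampleCopy(void* dst, const void* src, std::size_t len) {
#ifdef EXAMPLE_HAVE_AVX_PATH
  if (ExampleHostHasAvx()) {
    ExampleCopyAvx(dst, src, len);
    return;
  }
#endif
  std::memcpy(dst, src, len);
}
```

A test that calls `ExampleCopyAvx` directly would need the same `ExampleHostHasAvx()` check, or it would execute AVX instructions on a host that does not support them.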
PiperOrigin-RevId: 717991751
Change-Id: Id41bd186ebfd1cf7124ab5211fbfb74a01d5b56c
It seems that this feature is not fully baked on all build configurations, so let's remove it for now.
PiperOrigin-RevId: 716825311
Change-Id: I2ea9d941f8f3f177f9eb2afbd737935d58923780
Otherwise we can observe a build failure when absl::optional != std::optional.
PiperOrigin-RevId: 716275922
Change-Id: I4918a8901530f0daafeec07e319fd79123358bc1
With a newer clang, we can use __builtin_cpu_supports which caches all
the feature bits.
If we are using an older clang, we fall back to querying sysctlbyname
for the relevant processor features.
PiperOrigin-RevId: 715153229
Change-Id: I570fa349f96829d5da3b32c928480ddf67176cad
Linux "latest" containers updated to
GCC 14.2
CMake 3.31.2
Bazel 8.0.0
Included are various fixes to get these versions to work.
Bazel now references repositories by their canonical names from the
Bazel Central Registry. For example, Abseil is now @abseil-cpp instead
of @com_google_absl, and GoogleTest is now @googletest instead of
@com_google_googletest. Users still using the old WORKSPACE system may
need to use `repo_mapping` on repositories using the old names. See
`WORKSPACE.bazel` in this commit for an example.
PiperOrigin-RevId: 709102146
Change-Id: I02327ed4f8fb947766480bdeef2b1930a7f831eb
This removes redundant vector-vector moves and results in Extend being up to 3% faster.
PiperOrigin-RevId: 621948170
Change-Id: Id82816aa6e294d34140ff591103cb20feac79d9a
This change will allow the AVX version of non-temporal memcpy to be compiled
even if the compiler isn't run with AVX support. This allows runtime dispatch
to select the AVX implementation for CPUs that are known to be compatible with
AVX instructions.
PiperOrigin-RevId: 619594422
Change-Id: Ia7d92404ef8d10d152030b29b71948ed954f28f5
This will allow us to give visibility to other Google-internal libraries. The
change is necessary since //visibility:private cannot be combined with other
specifications.
PiperOrigin-RevId: 615779561
Change-Id: I82b1edfa4e1ca280e429cf2a5e4003a1cc316a60
Note that this only changes how we allocate the empty state; the reference counting of `empty` stays the same.
PiperOrigin-RevId: 599526339
Change-Id: I2c6aaf875c144c947e17fe8f69692b1195b55dd7
The layering_check feature ensures that rules that include a header
explicitly depend on a rule that exports that header. Compiler support
is required, and currently only Clang 16+ supports diagnosing
layering_check failures.
The parse_headers feature ensures headers are self-contained by
compiling them with -fsyntax-only on supported compilers.
PiperOrigin-RevId: 572350144
Change-Id: I37297f761566d686d9dd58d318979d688b7e36d1
Siryn's crc32 instruction seems to have latency 3 and throughput 1, which makes the optimal ratio of pmull and crc streams close to that of tested x86 machines. Up to +120% faster for large inputs.
PiperOrigin-RevId: 568645559
Change-Id: I86b85b1b2a5d4fb3680c516c4c9044238b20fe61
This is a temporary workaround for an apparent compiler bug with pmull(2) instructions. The current hot loop looks like this:
mov w14, #0xef02,
lsl x15, x15, #6,
mov x13, xzr,
movk w14, #0x740e, lsl #16,
sub x15, x15, #0x40,
ldr q4, [x16, #0x4e0],
_LOOP_START:
add x16, x9, x13,
add x17, x12, x13,
fmov d19, x14, <--------- This is Loop invariant and expensive
add x13, x13, #0x40,
cmp x15, x13,
prfm pldl1keep, [x16, #0x140],
prfm pldl1keep, [x17, #0x140],
ldp x18, x0, [x16, #0x40],
crc32cx w10, w10, x18,
ldp x2, x18, [x16, #0x50],
crc32cx w10, w10, x0,
crc32cx w10, w10, x2,
ldp x0, x2, [x16, #0x60],
crc32cx w10, w10, x18,
ldp x18, x16, [x16, #0x70],
pmull2 v5.1q, v1.2d, v4.2d,
pmull2 v6.1q, v0.2d, v4.2d,
pmull2 v7.1q, v2.2d, v4.2d,
pmull2 v16.1q, v3.2d, v4.2d,
ldp q17, q18, [x17, #0x40],
crc32cx w10, w10, x0,
pmull v1.1q, v1.1d, v19.1d,
crc32cx w10, w10, x2,
pmull v0.1q, v0.1d, v19.1d,
crc32cx w10, w10, x18,
pmull v2.1q, v2.1d, v19.1d,
crc32cx w10, w10, x16,
pmull v3.1q, v3.1d, v19.1d,
ldp q20, q21, [x17, #0x60],
eor v1.16b, v17.16b, v1.16b,
eor v0.16b, v18.16b, v0.16b,
eor v1.16b, v1.16b, v5.16b,
eor v2.16b, v20.16b, v2.16b,
eor v0.16b, v0.16b, v6.16b,
eor v3.16b, v21.16b, v3.16b,
eor v2.16b, v2.16b, v7.16b,
eor v3.16b, v3.16b, v16.16b,
b.ne _LOOP_START
There is a redundant fmov that moves the same constant into a Neon register every loop iteration to be used in the PMULL instructions. The PMULL2 instructions already have this constant loaded into Neon registers. After this change, both the PMULL and PMULL2 instructions use the values in q4, and they are not reloaded every iteration. This fmov was expensive because it contends for execution units with crc32cx instructions. This is up to 20% faster for large inputs.
PiperOrigin-RevId: 567391972
Change-Id: I4c8e49750cfa5cc5730c3bb713bd9fd67657804a
This CL rolls forward a previous change which we rolled back temporarily due to
compilation errors on x86 when PCLMUL intrinsics were unavailable.
*** Original change description ***
This change replaces inline x86 intrinsics with generic versions that compile
for both x86 and ARM depending on the target arch.
This change does not enable the accelerated crc memcpy engine on ARM. That will
be done in a subsequent change after the optimal number of vector and integer
regions for different CPUs is determined.
***
PiperOrigin-RevId: 563416413
Change-Id: Iee630a15ed83c26659adb0e8a03d3f3d3a46d688
In some configurations this change causes compilation errors. We will roll this
forward again after those issues are addressed.
PiperOrigin-RevId: 562810916
Change-Id: I45b2a8d456273e9eff188f36da8f11323c4dfe66
This change replaces inline x86 intrinsics with generic versions that compile
for both x86 and ARM depending on the target arch.
This change does not enable the accelerated crc memcpy engine on ARM. That will
be done in a subsequent change after the optimal number of vector and integer
regions for different CPUs is determined.
PiperOrigin-RevId: 562785420
Change-Id: I8ba4aa8de17587cedd92532f03767059a481f159