abseil-cpp

mirror of https://github.com/abseil/abseil-cpp.git synced 2026-06-04 12:07:05 +08:00

Author	SHA1	Message	Date
Derek Mauro	e7e7b016aa	CRC: Fix unused variable warnings in no-op implementation PiperOrigin-RevId: 919072538 Change-Id: Id927ef7e9d17dc9f5e83ca12a23851c2fcf60aad	2026-05-21 08:45:36 -07:00
Connal de Souza	0c60e214e9	Remove PCLMUL stream on AMD Rome, as it appears to be marginally faster without it. PiperOrigin-RevId: 914995032 Change-Id: I642095189cf36e7cf1dcaa44e8bfb97246158831	2026-05-13 12:26:43 -07:00
Ilya Tokar	1eb0661e7f	Re-land crc32 optimization on AMD Milan+ BM_Calculate/0 1.136n ± 0% 1.136n ± 0% ~ (p=0.708 n=6) BM_Calculate/1 1.420n ± 0% 1.420n ± 0% ~ (p=0.697 n=6) BM_Calculate/100 9.374n ± 0% 9.374n ± 0% ~ (p=0.859 n=6) BM_Calculate/2048 75.59n ± 1% 66.91n ± 0% -11.49% (p=0.002 n=6) BM_Calculate/10000 312.7n ± 0% 284.9n ± 0% -8.91% (p=0.002 n=6) BM_Calculate/500000 14.78µ ± 1% 13.40µ ± 1% -9.37% (p=0.002 n=6) BM_Extend/0 1.136n ± 0% 1.137n ± 0% ~ (p=0.935 n=6) BM_Extend/1 1.421n ± 0% 1.278n ± 0% -10.03% (p=0.002 n=6) BM_Extend/100 9.376n ± 0% 9.091n ± 0% -3.05% (p=0.002 n=6) BM_Extend/2048 75.43n ± 0% 66.81n ± 0% -11.43% (p=0.002 n=6) BM_Extend/10000 312.5n ± 0% 284.9n ± 0% -8.83% (p=0.002 n=6) BM_Extend/500000 14.82µ ± 1% 13.39µ ± 1% -9.59% (p=0.002 n=6) BM_Extend/100000000 3.185m ± 0% 2.790m ± 0% -12.40% (p=0.002 n=6) BM_ExtendCacheMiss/10 26.06m ± 0% 23.91m ± 1% -8.27% (p=0.002 n=6) BM_ExtendCacheMiss/100 14.06m ± 1% 13.78m ± 1% -1.99% (p=0.002 n=6) BM_ExtendCacheMiss/1000 26.89m ± 4% 26.66m ± 2% ~ (p=0.132 n=6) BM_ExtendCacheMiss/100000 5.120m ± 1% 4.582m ± 1% -10.52% (p=0.002 n=6) PiperOrigin-RevId: 907109111 Change-Id: I5a01870bd85a2c69052cdf1677987d762a8a1a2a	2026-04-28 12:10:12 -07:00
Abseil Team	852fc61f31	Remove more lingering C++17 type traits polyfill usages This will let us deprecate the declarations without triggering warnings in Abseil itself. PiperOrigin-RevId: 906360966 Change-Id: Iee362ac0eac647909ef38003280f1179813f764d	2026-04-27 08:03:53 -07:00
Abseil Team	b85d16902f	Optimize crc32 on AMD Milan+ We have AVX encoded vector PCLMULQDQ on Milan, so use it to make crc32c computations ~10% faster. We need to use inline asm, since building this twice with different compiler flags for dynamic dispatch performed worse due to missing inlining. BM_Calculate/0 1.136n ± 0% 1.136n ± 1% ~ (p=0.968 n=6) BM_Calculate/1 1.420n ± 0% 1.421n ± 1% ~ (p=0.870 n=6) BM_Calculate/100 9.089n ± 0% 9.660n ± 1% +6.29% (p=0.002 n=6) BM_Calculate/2048 75.30n ± 1% 67.67n ± 1% -10.13% (p=0.002 n=6) BM_Calculate/10000 313.1n ± 0% 286.1n ± 0% -8.63% (p=0.002 n=6) BM_Calculate/500000 14.91µ ± 4% 13.49µ ± 1% -9.48% (p=0.002 n=6) BM_Extend/0 1.136n ± 1% 1.136n ± 1% ~ (p=0.636 n=6) BM_Extend/1 1.420n ± 0% 1.420n ± 1% ~ (p=0.636 n=6) BM_Extend/100 9.247n ± 2% 9.800n ± 2% +5.99% (p=0.002 n=6) BM_Extend/2048 75.73n ± 1% 67.37n ± 1% -11.04% (p=0.002 n=6) BM_Extend/10000 313.2n ± 1% 286.2n ± 0% -8.62% (p=0.002 n=6) BM_Extend/500000 14.87µ ± 1% 13.57µ ± 1% -8.74% (p=0.002 n=6) BM_Extend/100000000 3.185m ± 2% 2.816m ± 3% -11.60% (p=0.002 n=6) BM_ExtendCacheMiss/10 26.07m ± 1% 26.06m ± 1% ~ (p=1.000 n=6) BM_ExtendCacheMiss/100 13.86m ± 4% 14.36m ± 2% +3.61% (p=0.026 n=6) BM_ExtendCacheMiss/1000 27.02m ± 4% 27.28m ± 4% ~ (p=0.699 n=6) BM_ExtendCacheMiss/100000 5.114m ± 5% 4.600m ± 8% -10.07% (p=0.002 n=6) BM_ExtendByZeroes/1 1.420n ± 0% 1.420n ± 0% ~ (p=0.670 n=12) BM_ExtendByZeroes/10 1.704n ± 1% 1.704n ± 0% ~ (p=1.000 n=6) BM_ExtendByZeroes/100 3.128n ± 0% 3.128n ± 0% ~ (p=1.000 n=6) BM_ExtendByZeroes/1000 6.758n ± 0% 6.638n ± 1% -1.78% (p=0.002 n=6) BM_ExtendByZeroes/10000 6.619n ± 1% 6.503n ± 0% -1.75% (p=0.002 n=6) BM_ExtendByZeroes/100000 8.537n ± 1% 8.479n ± 0% -0.67% (p=0.019 n=6) BM_ExtendByZeroes/1000000 9.766n ± 1% 9.692n ± 1% -0.75% (p=0.002 n=6) PiperOrigin-RevId: 900897540 Change-Id: I57d8df2bf10690afc07009d61f8c4ea61e88ce50	2026-04-16 13:59:26 -07:00
Ilya Tokar	5f9d5bfcc4	Optimize crc32 on AMD Milan+ We have AVX encoded vector PCLMULQDQ on Milan, so use it to make crc32c computations ~10% faster. We need to use inline asm, since building this twice with different compiler flags for dynamic dispatch performed worse due to missing inlining. BM_Calculate/0 1.136n ± 0% 1.136n ± 1% ~ (p=0.968 n=6) BM_Calculate/1 1.420n ± 0% 1.421n ± 1% ~ (p=0.870 n=6) BM_Calculate/100 9.089n ± 0% 9.660n ± 1% +6.29% (p=0.002 n=6) BM_Calculate/2048 75.30n ± 1% 67.67n ± 1% -10.13% (p=0.002 n=6) BM_Calculate/10000 313.1n ± 0% 286.1n ± 0% -8.63% (p=0.002 n=6) BM_Calculate/500000 14.91µ ± 4% 13.49µ ± 1% -9.48% (p=0.002 n=6) BM_Extend/0 1.136n ± 1% 1.136n ± 1% ~ (p=0.636 n=6) BM_Extend/1 1.420n ± 0% 1.420n ± 1% ~ (p=0.636 n=6) BM_Extend/100 9.247n ± 2% 9.800n ± 2% +5.99% (p=0.002 n=6) BM_Extend/2048 75.73n ± 1% 67.37n ± 1% -11.04% (p=0.002 n=6) BM_Extend/10000 313.2n ± 1% 286.2n ± 0% -8.62% (p=0.002 n=6) BM_Extend/500000 14.87µ ± 1% 13.57µ ± 1% -8.74% (p=0.002 n=6) BM_Extend/100000000 3.185m ± 2% 2.816m ± 3% -11.60% (p=0.002 n=6) BM_ExtendCacheMiss/10 26.07m ± 1% 26.06m ± 1% ~ (p=1.000 n=6) BM_ExtendCacheMiss/100 13.86m ± 4% 14.36m ± 2% +3.61% (p=0.026 n=6) BM_ExtendCacheMiss/1000 27.02m ± 4% 27.28m ± 4% ~ (p=0.699 n=6) BM_ExtendCacheMiss/100000 5.114m ± 5% 4.600m ± 8% -10.07% (p=0.002 n=6) BM_ExtendByZeroes/1 1.420n ± 0% 1.420n ± 0% ~ (p=0.670 n=12) BM_ExtendByZeroes/10 1.704n ± 1% 1.704n ± 0% ~ (p=1.000 n=6) BM_ExtendByZeroes/100 3.128n ± 0% 3.128n ± 0% ~ (p=1.000 n=6) BM_ExtendByZeroes/1000 6.758n ± 0% 6.638n ± 1% -1.78% (p=0.002 n=6) BM_ExtendByZeroes/10000 6.619n ± 1% 6.503n ± 0% -1.75% (p=0.002 n=6) BM_ExtendByZeroes/100000 8.537n ± 1% 8.479n ± 0% -0.67% (p=0.019 n=6) BM_ExtendByZeroes/1000000 9.766n ± 1% 9.692n ± 1% -0.75% (p=0.002 n=6) PiperOrigin-RevId: 900870516 Change-Id: I1382ae2ffeed35e1d55a0916290144cae5256fe0	2026-04-16 13:02:39 -07:00
Derek Mauro	5088cf5194	Cleanup the uses of the polyfills absl::any, absl::optional, absl::variant, and related types The corresponding headers are removed from cc files, but kept in headers to prevent breakages from transitive dependencies. PiperOrigin-RevId: 872421685 Change-Id: I867d4c3f7c9e422289c63816d44719b0530fb0a6	2026-02-19 08:53:17 -08:00
Derek Mauro	569ff20318	Cleanup duplicated bit-rotation code PiperOrigin-RevId: 857286087 Change-Id: Ie79f5b9e7ca8417f6311750c0de469ca6de4a8f9	2026-01-16 13:38:01 -08:00
J. Neuschäfer	55a99fb37a	PR #1944 : Use same element-width for non-temporal loads and stores on Arm Imported from GitHub PR https://github.com/abseil/abseil-cpp/pull/1944 Increase the consistency between _mm_loadu_si128 and _mm_stream_si128 by using vector loads/stores of 64-bit elements in both. This should have no impact on existing users. On aarch64 (release build, GCC 15.2), crc_non_temporal_memcpy.cc.o stays effectively the same, the only change being as follows: ``` --- crc_non_temporal_memcpy.cc.o (original) +++ crc_non_temporal_memcpy.cc.o (patched) ├── objdump --line-numbers --disassemble --demangle --reloc --no-show-raw-insn --section=.text {} │ @@ -255,15 +255,15 @@ │ add x2, x21, x2 │ mov x0, x21 │ ldp q31, q30, [x0, #32] │ add x1, x1, #0x40 │ ldp q29, q28, [x0], #64 │ stp q31, q30, [x1, #-32] │ stp q29, q28, [x1, #-64] │ - cmp x0, x2 │ + cmp x2, x0 │ b.ne 3b0 <absl::crc_internal::CrcNonTemporalMemcpyEngine::Compute(void, void const, unsigned long, absl::crc32c_t) const+0x270> // b.any │ and x0, x3, #0xffffffffffffffc0 │ and x23, x23, #0x3f │ dmb ish │ add x22, x22, x0 │ add x21, x21, x0 │ b 380 <absl::crc_internal::CrcNonTemporalMemcpyEngine::Compute(void, void const, unsigned long, absl::crc32c_t) const+0x240> ``` On big-endian Arm (aarch64_be), this fixes a bug in non_temporal_store_memcpy, in which each 32-bit half out of a 64-bit parcel of memory was swapped with the other. For example, the byte sequence 218edf0b 13c68753 would be copied as 13c68753 218edf0b. Merge `8f08d4c792` into `e5c6ccbc96` Merging this change closes #1944 COPYBARA_INTEGRATE_REVIEW=https://github.com/abseil/abseil-cpp/pull/1944 from neuschaefer:nontemp `8f08d4c792` PiperOrigin-RevId: 819779377 Change-Id: I46c8c5540fb4786948c5f16d25630fbbab892602	2025-10-15 09:03:00 -07:00
Shahriar Rouf	5ad0bfb7ab	Optimize `CRC32AcceleratedX86ARMCombinedMultipleStreams::Extend` by interleaving the `CRC32_u64` calls at a lower level. `CRC32_u64` generates `CRC32` x86 instruction which has 3 cycle latency. Because of that, the `crc` variable below causes a loop carried dependency of 3 cycles per iteration. ``` for (int i = 0; i < 8; i++) { crc = CRC32_u64(static_cast<uint32_t>(crc), absl::little_endian::Load64(p)); p += 8; } ``` Total latency for a 64-byte block is 29 cycles (codegen: https://godbolt.org/z/zxsrGMEPs, llvm-mca: https://godbolt.org/z/xrTMhhd1E). So, it is more efficient to interleave (up to 3 calls because of the 3 cycle latency) the `CRC32_u64` calls at a lower level. Even if we interleave 3 streams, the total latency for (three) 64-byte blocks is 33 cycles (codegen: https://godbolt.org/z/5ojzPdj3h, llvm-mca: https://godbolt.org/z/5cEPxvddW). And this is without considering any inlining. PiperOrigin-RevId: 799757460 Change-Id: I80118d5c1736ae31d69e5624c94cc0a6513ef28f	2025-08-26 16:07:36 -07:00
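The loop-carried dependency described in 5ad0bfb7ab can be illustrated with a portable sketch. It is not Abseil's implementation: `Crc32cStep64`, `LoadLE64`, `ExtendSerial`, and `ExtendInterleaved3` are hypothetical names, and the bitwise step is only a software stand-in for the hardware `CRC32` instruction. The sketch shows one dependency chain next to three independent chains advanced in the same loop body; how the per-stream results are later combined into a single CRC is not shown.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical software stand-in for the hardware CRC32 instruction:
// folds 8 bytes (lowest byte first) into a CRC-32C state, bit by bit.
inline uint32_t Crc32cStep64(uint32_t crc, uint64_t v) {
  constexpr uint32_t kPoly = 0x82F63B78u;  // reflected CRC-32C polynomial
  for (int byte = 0; byte < 8; ++byte) {
    crc ^= static_cast<uint32_t>(v & 0xFFu);
    v >>= 8;
    for (int bit = 0; bit < 8; ++bit) {
      crc = (crc & 1u) ? (crc >> 1) ^ kPoly : (crc >> 1);
    }
  }
  return crc;
}

// Endianness-independent little-endian 64-bit load.
inline uint64_t LoadLE64(const unsigned char* p) {
  uint64_t v = 0;
  for (int i = 7; i >= 0; --i) v = (v << 8) | p[i];
  return v;
}

// One dependency chain: each step needs the result of the previous one.
inline uint32_t ExtendSerial(uint32_t crc, const unsigned char* p,
                             size_t words) {
  for (size_t i = 0; i < words; ++i) {
    crc = Crc32cStep64(crc, LoadLE64(p + 8 * i));
  }
  return crc;
}

// Three independent chains in one loop body. With a 3-cycle instruction,
// the steps of different streams do not wait on each other.
inline std::array<uint32_t, 3> ExtendInterleaved3(
    std::array<uint32_t, 3> crc, const unsigned char* p0,
    const unsigned char* p1, const unsigned char* p2, size_t words) {
  for (size_t i = 0; i < words; ++i) {
    crc[0] = Crc32cStep64(crc[0], LoadLE64(p0 + 8 * i));
    crc[1] = Crc32cStep64(crc[1], LoadLE64(p1 + 8 * i));
    crc[2] = Crc32cStep64(crc[2], LoadLE64(p2 + 8 * i));
  }
  return crc;
}
```

Each stream produces the same value as running `ExtendSerial` over its own block, so interleaving changes only the instruction schedule, not the per-stream result.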
Connal de Souza	274c81389f	Optimize crc32 Extend by removing obsolete length alignment. Currently at the start of the Extend() call we process some number of bytes to align the length to a multiple of 16. However, for large inputs we then process another small number of bytes to align the next load address to 8 bytes, undoing the length alignment. At the end of the call, we process the remaining bytes anyway. The initial length alignment is not useful, since it is undone anyway. We never return early here for small inputs since this function is only used for lengths > 64 anyway. Removing this reduces the amount of time we spend processing only a small number of bytes at a time. Also, we can optimize processing the remaining bytes at the end by leveraging the CRC32 instructions for 2, 4, and 8 bytes. This looks to be about 2-5% faster on various platforms for typical input sizes. PiperOrigin-RevId: 793697720 Change-Id: Ibe71a51c851863ad40acef7d334694a9ac930f4d	2025-08-11 10:12:05 -07:00
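The tail handling mentioned in 274c81389f (using the 8-, 4-, and 2-byte CRC32 steps for the leftover bytes) can be sketched portably as follows. This is an illustration under stated assumptions, not the Abseil code: `StepBytes`, `ExtendTail`, and `Crc32c` are hypothetical names, `StepBytes` is a bitwise stand-in for the hardware instructions of each width, and the tail is assumed to be shorter than 16 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr uint32_t kCrc32cPoly = 0x82F63B78u;  // reflected CRC-32C polynomial

// Hypothetical software stand-in for an n-byte hardware CRC32 step.
inline uint32_t StepBytes(uint32_t crc, const unsigned char* p, size_t n) {
  for (size_t i = 0; i < n; ++i) {
    crc ^= p[i];
    for (int bit = 0; bit < 8; ++bit) {
      crc = (crc & 1u) ? (crc >> 1) ^ kCrc32cPoly : (crc >> 1);
    }
  }
  return crc;
}

// Consume a tail of fewer than 16 bytes with at most one 8-, 4-, 2-, and
// 1-byte step, chosen from the bits of the remaining length, instead of
// looping one byte at a time.
inline uint32_t ExtendTail(uint32_t crc, const unsigned char* p, size_t n) {
  assert(n < 16);
  if (n & 8) { crc = StepBytes(crc, p, 8); p += 8; }
  if (n & 4) { crc = StepBytes(crc, p, 4); p += 4; }
  if (n & 2) { crc = StepBytes(crc, p, 2); p += 2; }
  if (n & 1) { crc = StepBytes(crc, p, 1); }
  return crc;
}

// Full CRC-32C: a simple 16-byte bulk loop (standing in for the real
// accelerated loop), then the tail, with the usual initial/final inversion.
inline uint32_t Crc32c(const unsigned char* p, size_t n) {
  uint32_t crc = 0xFFFFFFFFu;
  while (n >= 16) {
    crc = StepBytes(crc, p, 16);
    p += 16;
    n -= 16;
  }
  return ExtendTail(crc, p, n) ^ 0xFFFFFFFFu;
}
```

For the standard check string `"123456789"` this returns `0xE3069283`, the published CRC-32C check value; the 9-byte input takes one 8-byte step and one 1-byte step.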
Connal de Souza	483951bb49	Update the crc32 dynamic dispatch table with newer platforms. Up to 13% performance improvement for the platforms affected. PiperOrigin-RevId: 789033088 Change-Id: I1d74360377e3c40dfaae2108ec55f907960d177a	2025-07-30 14:01:43 -07:00
Abseil Team	57abc0ee3f	Optimize CRC-32C extension by zeroes Optimize multiply() (renamed to MultiplyWithExtraX33()) to eliminate several instructions that were present only to avoid introducing an extra factor of x^33 into the multiplication. It's actually fine to introduce the extra factor of x^33 as long as it's canceled out with an extra factor of x^-33 in all the kCRC32CPowers[] entries. To make this work, the number of bits dropped by ComputeZeroConstant() had to be increased from 2 to at least 3, since 2^(i + 3 + kNumDroppedBits) - 33 must be >= 0 for all i including i=0; otherwise kCRC32CPowers[0] would need a negative power of x. However, this is fine since it's more efficient to utilize CRC32_u32() and CRC32_u64() for bits 2 and 3 anyway. So, increase kNumDroppedBits to 4. Add a Python script that generates the updated kCRC32CPowers[]. It isn't wired up to the build system, but rather is just added so that kCRC32CPowers[] can be reproduced. Also add a test which tests ExtendCrc32cByZeroes() with all the length bits, thus testing all the entries of kCRC32CPowers[]. Note that the kCRC32CPowers[] generation script and new test case are things we should have had anyway, regardless of the x^33 optimization. This change slightly improves the performance of Extend() for lengths greater than or equal to 2048 bytes, and also the performance of ExtendByZeroes(). It also slightly reduces the binary code size. 
Before: BM_Calculate/2048 84.3 ns 84.3 ns 8307735 BM_Calculate/10000 376 ns 375 ns 1865976 BM_Calculate/500000 18538 ns 18531 ns 37813 BM_ExtendByZeroes/1 3.55 ns 3.55 ns 197111095 BM_ExtendByZeroes/10 3.90 ns 3.89 ns 179773877 BM_ExtendByZeroes/100 6.06 ns 6.06 ns 115242160 BM_ExtendByZeroes/1000 12.0 ns 12.0 ns 58078004 BM_ExtendByZeroes/10000 9.97 ns 9.97 ns 70335772 BM_ExtendByZeroes/100000 12.1 ns 12.1 ns 58157829 BM_ExtendByZeroes/1000000 14.4 ns 14.4 ns 48527365 After: BM_Calculate/2048 82.8 ns 82.7 ns 8478296 BM_Calculate/10000 375 ns 375 ns 1869663 BM_Calculate/500000 18547 ns 18538 ns 37846 BM_ExtendByZeroes/1 2.96 ns 2.96 ns 236772500 BM_ExtendByZeroes/10 3.85 ns 3.85 ns 182059238 BM_ExtendByZeroes/100 5.42 ns 5.42 ns 129077546 BM_ExtendByZeroes/1000 9.43 ns 9.42 ns 74232457 BM_ExtendByZeroes/10000 8.14 ns 8.14 ns 86244218 BM_ExtendByZeroes/100000 10.7 ns 10.7 ns 65467391 BM_ExtendByZeroes/1000000 11.0 ns 11.0 ns 63575936 PiperOrigin-RevId: 786828855 Change-Id: I6208625fd1c35c2c137e756cf5fadc1adccfdd5d	2025-07-24 14:04:51 -07:00
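The idea underlying 57abc0ee3f, that extending a CRC by zero bytes is a multiplication by a power of x modulo the CRC polynomial, can be sketched as below. This is a minimal illustration and not Abseil's code: it does not use `kCRC32CPowers[]`, carry-less multiply instructions, or the x^33 trick, and the names `MulByX`, `MulMod`, `ExtendByZeroesNaive`, and `ExtendByZeroesFast` are hypothetical. It works on the raw CRC register, so a caller holding a finalized CRC-32C would undo the final inversion before the call and reapply it afterwards.

```cpp
#include <cstdint>

constexpr uint32_t kPoly = 0x82F63B78u;  // reflected CRC-32C polynomial

// Multiply by x modulo the polynomial. In the bit-reflected representation,
// bit 31 holds the coefficient of x^0 and bit 0 the coefficient of x^31.
inline uint32_t MulByX(uint32_t a) {
  return (a & 1u) ? (a >> 1) ^ kPoly : (a >> 1);
}

// Carry-less product a * b modulo the polynomial, one bit of a at a time.
inline uint32_t MulMod(uint32_t a, uint32_t b) {
  uint32_t product = 0;
  for (uint32_t mask = 0x80000000u; mask != 0; mask >>= 1) {
    if (a & mask) product ^= b;  // add b * x^i for each set coefficient
    b = MulByX(b);
  }
  return product;
}

// Reference: feed n zero bytes through the register, one bit at a time.
inline uint32_t ExtendByZeroesNaive(uint32_t state, uint64_t n) {
  for (uint64_t i = 0; i < 8 * n; ++i) state = MulByX(state);
  return state;
}

// Same result in O(log n) multiplications: state * x^(8n) mod P, computed
// by square-and-multiply over the bits of n.
inline uint32_t ExtendByZeroesFast(uint32_t state, uint64_t n) {
  uint32_t power = 0x00800000u;  // x^8, i.e. one zero byte
  for (; n != 0; n >>= 1) {
    if (n & 1u) state = MulMod(state, power);
    power = MulMod(power, power);
  }
  return state;
}
```

The squared powers computed in the loop play the role that a precomputed table of powers plays in an optimized implementation: one entry per bit of the length.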
Abseil Team	64a9eafe33	Disable sanitizer bounds checking in ComputeZeroConstant. The code is correct, but the compiler can't optimize away the check. PiperOrigin-RevId: 785603401 Change-Id: I9277e3b71965322691108f08597728dd84737329	2025-07-21 15:47:07 -07:00
Abseil Team	878361312d	Automated Code Change PiperOrigin-RevId: 783054860 Change-Id: I3f84881642f2f77be5d5275983243edf6305178c	2025-07-14 15:00:34 -07:00
Abseil Team	f60bfd822e	Enable SIMD memcpy-crc on ARM cores. PiperOrigin-RevId: 773749299 Change-Id: I798913549298c0993af16fc3ab6215089aab1f18	2025-06-20 10:13:26 -07:00
Abseil Team	99275763ac	Use even faster reduction algorithm in FinalizePclmulStream() My previous CL optimized the Barrett reduction. But since this is CRC32C and scalar instructions for it are available, there is actually no need for Barrett reduction at all. Just use two 64-bit CRC32C instructions to reduce fullCRC. This improves CRC32C performance on 2048-byte messages on Skylake by another 2% or so. PiperOrigin-RevId: 739977426 Change-Id: I4611af88cd32ed7a995e772a13c30e3bdcec8de9	2025-03-24 09:59:22 -07:00
Abseil Team	c4ff4d561c	Use more efficient reduction algorithm in FinalizePclmulStream() 1. When reducing 4 vectors to 1, fold across 2 vectors first and then across 1, instead of across 1 and then across 2. This works slightly better because it makes the constants be used in order. 2. Use a faster algorithm to reduce 1 vector to a scalar value. This approach is the same one I used in the assembly code I recently wrote for the Linux kernel in the patch series https://lore.kernel.org/lkml/20250210174540.161705-1-ebiggers@kernel.org/T/#u (search for "reduce_128bits_to_crc"). On Skylake (which uses num_pclmul_streams=2), this improves CRC32C performance on 2048-byte messages by about 2%. The overall improvement is relatively small since FinalizePclmulStream() is only called for messages >= 2048 bytes and is only called num_pclmul_streams times per message. So it's not really a bottleneck, but the new code is definitely a bit shorter and faster. PiperOrigin-RevId: 739002382 Change-Id: I0505e61f012e4a4f8b85958f7f00478f5b1a7026	2025-03-20 18:06:56 -07:00
Derek Mauro	feb3d276d4	Remove ABSL_INTERNAL_NEED_REDUNDANT_CONSTEXPR_DECL which is no longer needed with the C++17 floor PiperOrigin-RevId: 729365281 Change-Id: Ife5e778ead193bb37150b9799099e92f53252cb4	2025-02-20 21:07:28 -08:00
Pavel P	26b6046ab2	PR #1833 : Make ABSL_INTERNAL_STEP_n macros consistent in crc code Imported from GitHub PR https://github.com/abseil/abseil-cpp/pull/1833 `ABSL_INTERNAL_STEP1`, `ABSL_INTERNAL_STEP2`, `ABSL_INTERNAL_STEP4` assumed that `p` exists where these were used. All while similar macro `ABSL_INTERNAL_STEP8` correctly passed `p` as a macro arg. This PR updates all of them to take extra param instead of relying p's existence. Also, renamed `data` to `p` for `ABSL_INTERNAL_STEP8` to be consistent with others Merge `9a89bb0b62` into `e3183f1584` Merging this change closes #1833 COPYBARA_INTEGRATE_REVIEW=https://github.com/abseil/abseil-cpp/pull/1833 from pps83:master-macrofix `9a89bb0b62` PiperOrigin-RevId: 728751982 Change-Id: I48c3635f8d22848115744f6e9869717136385154	2025-02-19 11:36:50 -08:00
Abseil Team	e3183f1584	Move the implementation of absl::ComputeCrc32c to the header file, to facilitate inlining. PiperOrigin-RevId: 728699475 Change-Id: I444b1aa5b1ea77705175eadf47e05d772446441d	2025-02-19 09:16:21 -08:00
David Majnemer	df8178e26e	Crc: Only test non_temporal_store_memcpy_avx on AVX targets non_temporal_store_memcpy_avx uses gnu::target("avx") to use AVX intrinsics inside its function body even if the compiler was not configured for AVX support. This is OK because non_temporal_store_memcpy_avx is guarded by a cpuid check before it is called. However, non_temporal_memcpy_test.cc performs no such cpuid guard. In practice, nobody will really notice this bug as CPUs have had AVX for a long time by now. That said, this does come up if one has compiled absl for x86_64 and runs the binary on an arm64 Mac. This is because the Rosetta 2 emulation environment does not support AVX or newer instructions. PiperOrigin-RevId: 717991751 Change-Id: Id41bd186ebfd1cf7124ab5211fbfb74a01d5b56c	2025-01-21 11:06:47 -08:00
David Majnemer	3735766b3b	Crc: Remove the __builtin_cpu_supports path for SupportsArmCRC32PMULL It seems that this feature is not fully baked on all build configurations, let's remove it for now. PiperOrigin-RevId: 716825311 Change-Id: I2ea9d941f8f3f177f9eb2afbd737935d58923780	2025-01-17 15:56:53 -08:00
David Majnemer	3ded0b656e	crc: Use absl::nullopt when returning absl::optional Otherwise we can observe a build failure when absl::optional != std::optional. PiperOrigin-RevId: 716275922 Change-Id: I4918a8901530f0daafeec07e319fd79123358bc1	2025-01-16 10:01:05 -08:00
David Majnemer	6effb000ca	Crc: Detect support for pmull and crc instructions on Apple AArch64 With a newer clang, we can use __builtin_cpu_supports which caches all the feature bits. If we are using an older clang, we fall back to querying sysctlbyname for the relevant processor features. PiperOrigin-RevId: 715153229 Change-Id: I570fa349f96829d5da3b32c928480ddf67176cad	2025-01-13 16:45:10 -08:00
Derek Mauro	90a7ba66e8	Updates to CI to support newer versions of tools Linux "latest" containers updated to GCC 14.2 CMake 3.31.2 Bazel 8.0.0 Included are various fixes to get these versions to work. Bazel now references repositories by their canonical names from the Bazel Central Registry. For example, Abseil is now @abseil-cpp instead of @com_google_absl, and GoogleTest is now @googletest instead of @com_google_googletest. Users still using the old WORKSPACE system may need to use `repo_mapping` on repositories using the old names. See `WORKSPACE.bazel` in this commit for an example. PiperOrigin-RevId: 709102146 Change-Id: I02327ed4f8fb947766480bdeef2b1930a7f831eb	2024-12-23 10:58:05 -08:00
Dertosh	fffac1157d	PR #1794 : Update cpu_detect.cc fix hw crc32 and AES capability check, fix undefined Imported from GitHub PR https://github.com/abseil/abseil-cpp/pull/1794 Source and explanation https://github.com/JuliaLang/julia/issues/26458 https://github.com/memcached/memcached/pull/744 For build for aarch64 on v22_clang-16.0.6-centos7 ` abseil-cpp/absl/crc/internal/cpu_detect.cc:273:20: error: use of undeclared identifier 'HWCAP_CRC32' return (hwcaps & HWCAP_CRC32) && (hwcaps & HWCAP_PMULL); ^ abseil-cpp/absl/crc/internal/cpu_detect.cc:273:46: error: use of undeclared identifier 'HWCAP_PMULL' return (hwcaps & HWCAP_CRC32) && (hwcaps & HWCAP_PMULL); ` Merge `3ee325b7a4` into `940e0ec36a` Merging this change closes #1794 COPYBARA_INTEGRATE_REVIEW=https://github.com/abseil/abseil-cpp/pull/1794 from Dertosh:patch-1 `3ee325b7a4` PiperOrigin-RevId: 705936372 Change-Id: Ifebd6d1a854e17acf6cc00bab92053bc0d4c2349	2024-12-13 10:59:49 -08:00
Derek Mauro	29fdacd2e5	Fix the conditional compilation of non_temporal_store_memcpy_avx to verify that AVX can be forced via `gnu::target`. Fixes #1759 PiperOrigin-RevId: 677853230 Change-Id: Ic69045c71ddf8230fd7b0210ba4aef8693053232	2024-09-23 10:39:56 -07:00
Pavel P	77224c28ff	PR #1662 : Replace shift with addition in crc multiply Imported from GitHub PR https://github.com/abseil/abseil-cpp/pull/1662 Merge 4b2c6c909b573d31a1cccba7cb72d4d8badeef8b into `cba31a9562` Merging this change closes #1662 COPYBARA_INTEGRATE_REVIEW=https://github.com/abseil/abseil-cpp/pull/1662 from pps83:crc-add 4b2c6c909b573d31a1cccba7cb72d4d8badeef8b PiperOrigin-RevId: 631470883 Change-Id: I4a72be643ed341ddf0e0007418ab4a613a03db4b	2024-05-07 10:33:09 -07:00
Pavel P	564372fcd6	PR #1653 : Remove unnecessary casts when calling CRC32_u64 Imported from GitHub PR https://github.com/abseil/abseil-cpp/pull/1653 CRC32_u64 returns uint32_t, no need to cast returned result to uint32_t Merge `90e7b063f3` into `9a61b00dde` Merging this change closes #1653 COPYBARA_INTEGRATE_REVIEW=https://github.com/abseil/abseil-cpp/pull/1653 from pps83:CRC32_u64-cast `90e7b063f3` PiperOrigin-RevId: 626462347 Change-Id: I748a2da5fcc66eb6aa07aaf0fbc7eca927fcbb16	2024-04-19 13:59:21 -07:00
Connal de Souza	61e47a454c	Optimize crc32 V128_From2x64 on Arm This removes redundant vector-vector moves and results in Extend being up to 3% faster. PiperOrigin-RevId: 621948170 Change-Id: Id82816aa6e294d34140ff591103cb20feac79d9a	2024-04-04 13:09:48 -07:00
Abseil Team	18018aa45d	Adjust conditional compilation in non_temporal_memcpy.h This change will allow the AVX version of non-temporal memcpy to be compiled even if the compiler isn't run with AVX support. This allows runtime dispatch to select the AVX implementation for CPUs that are known to be compatible with AVX instructions. PiperOrigin-RevId: 619594422 Change-Id: Ia7d92404ef8d10d152030b29b71948ed954f28f5	2024-03-27 11:22:57 -07:00
Abseil Team	2f0591010d	Replace //visibility:private with :__pkg__ for certain targets This will allow us to give visibility to other Google-internal libraries. The change is necessary since //visibility:private cannot be combined with other specifications. PiperOrigin-RevId: 615779561 Change-Id: I82b1edfa4e1ca280e429cf2a5e4003a1cc316a60	2024-03-14 08:01:09 -07:00
Abseil Team	2a7d0da1dd	Add several missing includes in crc/internal PiperOrigin-RevId: 615504707 Change-Id: Ia0e8211bd3c3d28fd0715c8f296ec50f6a700757	2024-03-13 12:21:38 -07:00
Abseil Team	3c1f9be71e	Disable ubsan for benign unaligned access in crc_memcpy PiperOrigin-RevId: 615160537 Change-Id: I29070c898104c55e6563eed0eef7397441bef1d7	2024-03-12 13:51:42 -07:00
Abseil Team	e20285c652	Delete a stray comment PiperOrigin-RevId: 615017130 Change-Id: I73277de8ece31d6a35b47dbdb205b473324b74a2	2024-03-12 06:19:45 -07:00
Stanislaw Halik	d4578efe7c	PR #1617 : fix MSVC 32-bit build with -arch:AVX Imported from GitHub PR https://github.com/abseil/abseil-cpp/pull/1617 The intrinsics used aren't available on `x86_64` processors while running in 32-bit mode. See: - list of 64-bit intrinsics (https://learn.microsoft.com/en-us/cpp/intrinsics/x64-amd64-intrinsics-list?view=msvc-170) - list of 32-bit intrinsics (https://learn.microsoft.com/en-us/cpp/intrinsics/x86-intrinsics-list?view=msvc-170) - list of predefined MSVC macros (https://learn.microsoft.com/en-us/cpp/preprocessor/predefined-macros?view=msvc-170) The error message in question: ```console F:\dev\opentrack-depends\onnxruntime-build\msvc\_deps\abseil_cpp-src\absl/crc/internal/crc32_x86_arm_combined_simd.h(145,32): error C3861: '_mm_crc32_u64': identifier not found return static_cast<uint32_t>(_mm_crc32_u64(crc, v)); ^ F:\dev\opentrack-depends\onnxruntime-build\msvc\_deps\abseil_cpp-src\absl/crc/internal/crc32_x86_arm_combined_simd.h(193,50): error C3861: '_mm_cvtsi128_si64': identifier not found inline int64_t V128_Low64(const V128 l) { return _mm_cvtsi128_si64(l); } ``` Merge `06f5832108` into `797501d12e` Merging this change closes #1617 COPYBARA_INTEGRATE_REVIEW=https://github.com/abseil/abseil-cpp/pull/1617 from sthalik:pr/fix-msvc-32-bit-avx `06f5832108` PiperOrigin-RevId: 607483370 Change-Id: Id2a6f6dd33c2707fe7ffe134e7335916f3fb9da3	2024-02-15 15:58:50 -08:00
Shahriar Rouf	780bfc194d	Replace `testonly = 1` with `testonly = True` in abseil BUILD files. https://bazel.build/build/style-guide#other-conventions PiperOrigin-RevId: 603084345 Change-Id: Ibd7c9573d820f88059d12c46ff82d7d322d002ae	2024-01-31 10:08:35 -08:00
Abseil Team	49ff696cda	Migrate empty CrcCordState to absl::NoDestructor. Note that this only changes how we allocate the empty state, and reference countings of `empty` stay the same. PiperOrigin-RevId: 599526339 Change-Id: I2c6aaf875c144c947e17fe8f69692b1195b55dd7	2024-01-18 09:11:43 -08:00
Derek Mauro	c8087ae8bd	Avoid using the non-portable type __m128i_u. According to https://stackoverflow.com/a/68939636 it is safe to use __m128i instead. https://learn.microsoft.com/en-us/cpp/intrinsics/x86-intrinsics-list?view=msvc-170 also uses this type instead. __m128i_u is just __m128i with a looser alignment requirement, but simply calling _mm_loadu_si128() instead of _mm_load_si128() is enough to tell the compiler when a pointer is unaligned. Fixes #1552 PiperOrigin-RevId: 576931936 Change-Id: I7c3530001149b360c12a1786c7e1832754d0e35c	2023-10-26 11:16:31 -07:00
Derek Mauro	0ef3ef4329	Bazel: Enable the header_modules feature PiperOrigin-RevId: 572575394 Change-Id: Ic1c5ac2423b1634e50c43bad6daa14e82a8f3e2c	2023-10-11 07:58:06 -07:00
Derek Mauro	143e983739	Bazel: Support layering_check and parse_headers The layering_check feature ensures that rules that include a header explicitly depend on a rule that exports that header. Compiler support is required, and currently only Clang 16+ supports diagnosing layering_check failures. The parse_headers feature ensures headers are self-contained by compiling them with -fsyntax-only on supported compilers. PiperOrigin-RevId: 572350144 Change-Id: I37297f761566d686d9dd58d318979d688b7e36d1	2023-10-10 13:30:24 -07:00
Connal de Souza	f3ba72ee55	Add entries for Neoverse N2, V1, and V2 into CRC dynamic dispatch table. PiperOrigin-RevId: 571430428 Change-Id: I4777c37c5287d26a75f37fe059324ac218878f0e	2023-10-06 14:07:43 -07:00
Connal de Souza	ac364eb9d0	Optimize CRC32 for Ampere Siryn Siryn's crc32 instruction seems to have latency 3 and throughput 1, which makes the optimal ratio of pmull and crc streams close to that of tested x86 machines. Up to +120% faster for large inputs. PiperOrigin-RevId: 568645559 Change-Id: I86b85b1b2a5d4fb3680c516c4c9044238b20fe61	2023-09-26 14:13:55 -07:00
Connal de Souza	aa3c949a7f	Optimize CRC32 Extend for large inputs on Arm This is a temporary workaround for an apparent compiler bug with pmull(2) instructions. The current hot loop looks like this: mov w14, #0xef02, lsl x15, x15, #6, mov x13, xzr, movk w14, #0x740e, lsl #16, sub x15, x15, #0x40, ldr q4, [x16, #0x4e0], _LOOP_START: add x16, x9, x13, add x17, x12, x13, fmov d19, x14, <--------- This is Loop invariant and expensive add x13, x13, #0x40, cmp x15, x13, prfm pldl1keep, [x16, #0x140], prfm pldl1keep, [x17, #0x140], ldp x18, x0, [x16, #0x40], crc32cx w10, w10, x18, ldp x2, x18, [x16, #0x50], crc32cx w10, w10, x0, crc32cx w10, w10, x2, ldp x0, x2, [x16, #0x60], crc32cx w10, w10, x18, ldp x18, x16, [x16, #0x70], pmull2 v5.1q, v1.2d, v4.2d, pmull2 v6.1q, v0.2d, v4.2d, pmull2 v7.1q, v2.2d, v4.2d, pmull2 v16.1q, v3.2d, v4.2d, ldp q17, q18, [x17, #0x40], crc32cx w10, w10, x0, pmull v1.1q, v1.1d, v19.1d, crc32cx w10, w10, x2, pmull v0.1q, v0.1d, v19.1d, crc32cx w10, w10, x18, pmull v2.1q, v2.1d, v19.1d, crc32cx w10, w10, x16, pmull v3.1q, v3.1d, v19.1d, ldp q20, q21, [x17, #0x60], eor v1.16b, v17.16b, v1.16b, eor v0.16b, v18.16b, v0.16b, eor v1.16b, v1.16b, v5.16b, eor v2.16b, v20.16b, v2.16b, eor v0.16b, v0.16b, v6.16b, eor v3.16b, v21.16b, v3.16b, eor v2.16b, v2.16b, v7.16b, eor v3.16b, v3.16b, v16.16b, b.ne _LOOP_START There is a redundant fmov that moves the same constant into a Neon register every loop iteration to be used in the PMULL instructions. The PMULL2 instructions already have this constant loaded into Neon registers. After this change, both the PMULL and PMULL2 instructions use the values in q4, and they are not reloaded every iteration. This fmov was expensive because it contends for execution units with crc32cx instructions. This is up to 20% faster for large inputs. PiperOrigin-RevId: 567391972 Change-Id: I4c8e49750cfa5cc5730c3bb713bd9fd67657804a	2023-09-21 12:52:45 -07:00
Abseil Team	c78a3f32c3	Remove implicit int64_t->uint64_t conversion in ARM version of V128_Extract64 PiperOrigin-RevId: 565662176 Change-Id: I18d5d9eb444b0090e3f4ab8f66ad214a67344268	2023-09-15 06:30:25 -07:00
Abseil Team	2c4ce9b2ad	Rename x86 crc_memcpy tests since they cover ARM as well This is a rename only with no other changes. PiperOrigin-RevId: 563428969 Change-Id: Iefc184bf9a233cb72649bc20b8555f6b662cac6d	2023-09-07 07:48:00 -07:00
Abseil Team	433289a258	Roll forward support for ARM intrinsics in crc_memcpy This CL rolls forward a previous change which we rolled back temporarily due to compilation errors on x86 when PCLMUL intrinsics were unavailable. * Original change description * This change replaces inline x86 intrinsics with generic versions that compile for both x86 and ARM depending on the target arch. This change does not enable the accelerated crc memcpy engine on ARM. That will be done in a subsequent change after the optimal number of vector and integer regions for different CPUs is determined. *** PiperOrigin-RevId: 563416413 Change-Id: Iee630a15ed83c26659adb0e8a03d3f3d3a46d688	2023-09-07 06:53:24 -07:00
Abseil Team	461f1e49b3	Rollback adding support for ARM intrinsics In some configurations this change causes compilation errors. We will roll this forward again after those issues are addressed. PiperOrigin-RevId: 562810916 Change-Id: I45b2a8d456273e9eff188f36da8f11323c4dfe66	2023-09-05 09:57:30 -07:00
Abseil Team	1a882833c0	Add support for ARM intrinsics in crc_memcpy This change replaces inline x86 intrinsics with generic versions that compile for both x86 and ARM depending on the target arch. This change does not enable the accelerated crc memcpy engine on ARM. That will be done in a subsequent change after the optimal number of vector and integer regions for different CPUs is determined. PiperOrigin-RevId: 562785420 Change-Id: I8ba4aa8de17587cedd92532f03767059a481f159	2023-09-05 08:24:39 -07:00

94 Commits