Commit Graph

94 Commits

Author SHA1 Message Date
Derek Mauro
e7e7b016aa CRC: Fix unused variable warnings in no-op implementation
PiperOrigin-RevId: 919072538
Change-Id: Id927ef7e9d17dc9f5e83ca12a23851c2fcf60aad
2026-05-21 08:45:36 -07:00
Connal de Souza
0c60e214e9 Remove PCLMUL stream on AMD Rome, as it appears to be marginally faster without it.
PiperOrigin-RevId: 914995032
Change-Id: I642095189cf36e7cf1dcaa44e8bfb97246158831
2026-05-13 12:26:43 -07:00
Ilya Tokar
1eb0661e7f Re-land crc32 optimization on AMD Milan+
BM_Calculate/0                 1.136n ±  0%   1.136n ± 0%        ~ (p=0.708 n=6)
BM_Calculate/1                 1.420n ±  0%   1.420n ± 0%        ~ (p=0.697 n=6)
BM_Calculate/100               9.374n ±  0%   9.374n ± 0%        ~ (p=0.859 n=6)
BM_Calculate/2048              75.59n ±  1%   66.91n ± 0%  -11.49% (p=0.002 n=6)
BM_Calculate/10000             312.7n ±  0%   284.9n ± 0%   -8.91% (p=0.002 n=6)
BM_Calculate/500000            14.78µ ±  1%   13.40µ ± 1%   -9.37% (p=0.002 n=6)
BM_Extend/0                    1.136n ±  0%   1.137n ± 0%        ~ (p=0.935 n=6)
BM_Extend/1                    1.421n ±  0%   1.278n ± 0%  -10.03% (p=0.002 n=6)
BM_Extend/100                  9.376n ±  0%   9.091n ± 0%   -3.05% (p=0.002 n=6)
BM_Extend/2048                 75.43n ±  0%   66.81n ± 0%  -11.43% (p=0.002 n=6)
BM_Extend/10000                312.5n ±  0%   284.9n ± 0%   -8.83% (p=0.002 n=6)
BM_Extend/500000               14.82µ ±  1%   13.39µ ± 1%   -9.59% (p=0.002 n=6)
BM_Extend/100000000            3.185m ±  0%   2.790m ± 0%  -12.40% (p=0.002 n=6)
BM_ExtendCacheMiss/10          26.06m ±  0%   23.91m ± 1%   -8.27% (p=0.002 n=6)
BM_ExtendCacheMiss/100         14.06m ±  1%   13.78m ± 1%   -1.99% (p=0.002 n=6)
BM_ExtendCacheMiss/1000        26.89m ±  4%   26.66m ± 2%        ~ (p=0.132 n=6)
BM_ExtendCacheMiss/100000      5.120m ±  1%   4.582m ± 1%  -10.52% (p=0.002 n=6)

PiperOrigin-RevId: 907109111
Change-Id: I5a01870bd85a2c69052cdf1677987d762a8a1a2a
2026-04-28 12:10:12 -07:00
Abseil Team
852fc61f31 Remove more lingering C++17 type traits polyfill usages
This will let us deprecate the declarations without triggering warnings in Abseil itself.

PiperOrigin-RevId: 906360966
Change-Id: Iee362ac0eac647909ef38003280f1179813f764d
2026-04-27 08:03:53 -07:00
Abseil Team
b85d16902f Optimize crc32 on AMD Milan+
We have AVX-encoded vector PCLMULQDQ on Milan, so use it to make
crc32c computations ~10% faster. We need to use inline asm, since
building this twice with different compiler flags for dynamic
dispatch performed worse due to missing inlining.

BM_Calculate/0                  1.136n ±  0%    1.136n ±  1%        ~ (p=0.968 n=6)
BM_Calculate/1                  1.420n ±  0%    1.421n ±  1%        ~ (p=0.870 n=6)
BM_Calculate/100                9.089n ±  0%    9.660n ±  1%   +6.29% (p=0.002 n=6)
BM_Calculate/2048               75.30n ±  1%    67.67n ±  1%  -10.13% (p=0.002 n=6)
BM_Calculate/10000              313.1n ±  0%    286.1n ±  0%   -8.63% (p=0.002 n=6)
BM_Calculate/500000             14.91µ ±  4%    13.49µ ±  1%   -9.48% (p=0.002 n=6)
BM_Extend/0                     1.136n ±  1%    1.136n ±  1%        ~ (p=0.636 n=6)
BM_Extend/1                     1.420n ±  0%    1.420n ±  1%        ~ (p=0.636 n=6)
BM_Extend/100                   9.247n ±  2%    9.800n ±  2%   +5.99% (p=0.002 n=6)
BM_Extend/2048                  75.73n ±  1%    67.37n ±  1%  -11.04% (p=0.002 n=6)
BM_Extend/10000                 313.2n ±  1%    286.2n ±  0%   -8.62% (p=0.002 n=6)
BM_Extend/500000                14.87µ ±  1%    13.57µ ±  1%   -8.74% (p=0.002 n=6)
BM_Extend/100000000             3.185m ±  2%    2.816m ±  3%  -11.60% (p=0.002 n=6)
BM_ExtendCacheMiss/10           26.07m ±  1%    26.06m ±  1%        ~ (p=1.000 n=6)
BM_ExtendCacheMiss/100          13.86m ±  4%    14.36m ±  2%   +3.61% (p=0.026 n=6)
BM_ExtendCacheMiss/1000         27.02m ±  4%    27.28m ±  4%        ~ (p=0.699 n=6)
BM_ExtendCacheMiss/100000       5.114m ±  5%    4.600m ±  8%  -10.07% (p=0.002 n=6)
BM_ExtendByZeroes/1             1.420n ±  0%    1.420n ±  0%        ~ (p=0.670 n=12)
BM_ExtendByZeroes/10            1.704n ±  1%    1.704n ±  0%        ~ (p=1.000 n=6)
BM_ExtendByZeroes/100           3.128n ±  0%    3.128n ±  0%        ~ (p=1.000 n=6)
BM_ExtendByZeroes/1000          6.758n ±  0%    6.638n ±  1%   -1.78% (p=0.002 n=6)
BM_ExtendByZeroes/10000         6.619n ±  1%    6.503n ±  0%   -1.75% (p=0.002 n=6)
BM_ExtendByZeroes/100000        8.537n ±  1%    8.479n ±  0%   -0.67% (p=0.019 n=6)
BM_ExtendByZeroes/1000000       9.766n ±  1%    9.692n ±  1%   -0.75% (p=0.002 n=6)

PiperOrigin-RevId: 900897540
Change-Id: I57d8df2bf10690afc07009d61f8c4ea61e88ce50
2026-04-16 13:59:26 -07:00
Ilya Tokar
5f9d5bfcc4 Optimize crc32 on AMD Milan+
We have AVX-encoded vector PCLMULQDQ on Milan, so use it to make
crc32c computations ~10% faster. We need to use inline asm, since
building this twice with different compiler flags for dynamic
dispatch performed worse due to missing inlining.

BM_Calculate/0                  1.136n ±  0%    1.136n ±  1%        ~ (p=0.968 n=6)
BM_Calculate/1                  1.420n ±  0%    1.421n ±  1%        ~ (p=0.870 n=6)
BM_Calculate/100                9.089n ±  0%    9.660n ±  1%   +6.29% (p=0.002 n=6)
BM_Calculate/2048               75.30n ±  1%    67.67n ±  1%  -10.13% (p=0.002 n=6)
BM_Calculate/10000              313.1n ±  0%    286.1n ±  0%   -8.63% (p=0.002 n=6)
BM_Calculate/500000             14.91µ ±  4%    13.49µ ±  1%   -9.48% (p=0.002 n=6)
BM_Extend/0                     1.136n ±  1%    1.136n ±  1%        ~ (p=0.636 n=6)
BM_Extend/1                     1.420n ±  0%    1.420n ±  1%        ~ (p=0.636 n=6)
BM_Extend/100                   9.247n ±  2%    9.800n ±  2%   +5.99% (p=0.002 n=6)
BM_Extend/2048                  75.73n ±  1%    67.37n ±  1%  -11.04% (p=0.002 n=6)
BM_Extend/10000                 313.2n ±  1%    286.2n ±  0%   -8.62% (p=0.002 n=6)
BM_Extend/500000                14.87µ ±  1%    13.57µ ±  1%   -8.74% (p=0.002 n=6)
BM_Extend/100000000             3.185m ±  2%    2.816m ±  3%  -11.60% (p=0.002 n=6)
BM_ExtendCacheMiss/10           26.07m ±  1%    26.06m ±  1%        ~ (p=1.000 n=6)
BM_ExtendCacheMiss/100          13.86m ±  4%    14.36m ±  2%   +3.61% (p=0.026 n=6)
BM_ExtendCacheMiss/1000         27.02m ±  4%    27.28m ±  4%        ~ (p=0.699 n=6)
BM_ExtendCacheMiss/100000       5.114m ±  5%    4.600m ±  8%  -10.07% (p=0.002 n=6)
BM_ExtendByZeroes/1             1.420n ±  0%    1.420n ±  0%        ~ (p=0.670 n=12)
BM_ExtendByZeroes/10            1.704n ±  1%    1.704n ±  0%        ~ (p=1.000 n=6)
BM_ExtendByZeroes/100           3.128n ±  0%    3.128n ±  0%        ~ (p=1.000 n=6)
BM_ExtendByZeroes/1000          6.758n ±  0%    6.638n ±  1%   -1.78% (p=0.002 n=6)
BM_ExtendByZeroes/10000         6.619n ±  1%    6.503n ±  0%   -1.75% (p=0.002 n=6)
BM_ExtendByZeroes/100000        8.537n ±  1%    8.479n ±  0%   -0.67% (p=0.019 n=6)
BM_ExtendByZeroes/1000000       9.766n ±  1%    9.692n ±  1%   -0.75% (p=0.002 n=6)

PiperOrigin-RevId: 900870516
Change-Id: I1382ae2ffeed35e1d55a0916290144cae5256fe0
2026-04-16 13:02:39 -07:00
Derek Mauro
5088cf5194 Clean up the uses of the polyfills absl::any, absl::optional,
absl::variant, and related types

The corresponding headers are removed from cc files, but kept in
headers to prevent breakages from transitive dependencies.

PiperOrigin-RevId: 872421685
Change-Id: I867d4c3f7c9e422289c63816d44719b0530fb0a6
2026-02-19 08:53:17 -08:00
Derek Mauro
569ff20318 Clean up duplicated bit-rotation code
PiperOrigin-RevId: 857286087
Change-Id: Ie79f5b9e7ca8417f6311750c0de469ca6de4a8f9
2026-01-16 13:38:01 -08:00
J. Neuschäfer
55a99fb37a PR #1944: Use same element-width for non-temporal loads and stores on Arm
Imported from GitHub PR https://github.com/abseil/abseil-cpp/pull/1944

Increase the consistency between _mm_loadu_si128 and _mm_stream_si128 by using vector loads/stores of 64-bit elements in both. This should have no impact on existing users. On aarch64 (release build, GCC 15.2), crc_non_temporal_memcpy.cc.o stays effectively the same, the only change being as follows:

```
--- crc_non_temporal_memcpy.cc.o (original)
+++ crc_non_temporal_memcpy.cc.o (patched)
├── objdump --line-numbers --disassemble --demangle --reloc --no-show-raw-insn --section=.text {}
│ @@ -255,15 +255,15 @@
│       add     x2, x21, x2
│       mov     x0, x21
│       ldp     q31, q30, [x0, #32]
│       add     x1, x1, #0x40
│       ldp     q29, q28, [x0], #64
│       stp     q31, q30, [x1, #-32]
│       stp     q29, q28, [x1, #-64]
│ -     cmp     x0, x2
│ +     cmp     x2, x0
│       b.ne    3b0 <absl::crc_internal::CrcNonTemporalMemcpyEngine::Compute(void*, void const*, unsigned long, absl::crc32c_t) const+0x270>  // b.any
│       and     x0, x3, #0xffffffffffffffc0
│       and     x23, x23, #0x3f
│       dmb     ish
│       add     x22, x22, x0
│       add     x21, x21, x0
│       b       380 <absl::crc_internal::CrcNonTemporalMemcpyEngine::Compute(void*, void const*, unsigned long, absl::crc32c_t) const+0x240>
```

On big-endian Arm (aarch64_be), this fixes a bug in non_temporal_store_memcpy, in which each 32-bit half out of a 64-bit parcel of memory was swapped with the other. For example, the byte sequence 218edf0b 13c68753 would be copied as 13c68753 218edf0b.
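
To make the symptom concrete, the following minimal sketch reproduces the effect of the bug on a single 64-bit parcel, using the byte sequence from the example above. `SwapHalves` is a hypothetical helper written for illustration; it models only the observable result and does not reproduce the NEON intrinsics involved.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: exchanges the two 32-bit halves of a 64-bit value,
// which is the observable effect described for the big-endian bug.
constexpr uint64_t SwapHalves(uint64_t v) {
  return (v << 32) | (v >> 32);
}
```

Applied to 0x218edf0b13c68753, this yields 0x13c68753218edf0b. A correct copy has to leave the parcel unchanged.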

Merge 8f08d4c792 into e5c6ccbc96

Merging this change closes #1944

COPYBARA_INTEGRATE_REVIEW=https://github.com/abseil/abseil-cpp/pull/1944 from neuschaefer:nontemp 8f08d4c792
PiperOrigin-RevId: 819779377
Change-Id: I46c8c5540fb4786948c5f16d25630fbbab892602
2025-10-15 09:03:00 -07:00
Shahriar Rouf
5ad0bfb7ab Optimize CRC32AcceleratedX86ARMCombinedMultipleStreams::Extend by interleaving the CRC32_u64 calls at a lower level.
`CRC32_u64` generates `CRC32` x86 instruction which has 3 cycle latency. Because of that, the `crc` variable below causes a loop carried dependency of 3 cycles per iteration.
```
for (int i = 0; i < 8; i++) {
  crc = CRC32_u64(static_cast<uint32_t>(crc), absl::little_endian::Load64(p));
  p += 8;
}
```
Total latency for a 64-byte block is 29 cycles (codegen: https://godbolt.org/z/zxsrGMEPs, llvm-mca: https://godbolt.org/z/xrTMhhd1E).

So, it is more efficient to interleave (up to 3 calls because of the 3 cycle latency) the `CRC32_u64` calls at a lower level. Even if we interleave 3 streams, the total latency for (three) 64-byte blocks is 33 cycles (codegen: https://godbolt.org/z/5ojzPdj3h, llvm-mca: https://godbolt.org/z/5cEPxvddW). And this is without considering any inlining.
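
Splitting one buffer into independent streams is possible because the CRC update is linear over GF(2), so partial states can be combined afterwards. The sketch below shows that property in portable C++. It is an illustration under stated assumptions, not the code from this change: `StepByte` is a bitwise software stand-in for the hardware `CRC32` instruction, and all function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// One byte of bitwise CRC-32C (reflected polynomial 0x82F63B78). This stands
// in for the hardware CRC32 instruction.
inline uint32_t StepByte(uint32_t state, uint8_t byte) {
  state ^= byte;
  for (int k = 0; k < 8; ++k) {
    state = (state >> 1) ^ (0x82F63B78u & (0u - (state & 1u)));
  }
  return state;
}

// Reference: a single dependency chain over the whole buffer.
uint32_t Crc32cSerial(const uint8_t* p, size_t n) {
  uint32_t state = 0xFFFFFFFFu;
  for (size_t i = 0; i < n; ++i) state = StepByte(state, p[i]);
  return ~state;
}

// Advances a raw state as if `n` zero bytes had been appended.
uint32_t ExtendRawByZeroes(uint32_t state, size_t n) {
  for (size_t i = 0; i < n; ++i) state = StepByte(state, 0);
  return state;
}

// Three independent chains advanced in the same loop body, then combined.
uint32_t Crc32cThreeStreams(const uint8_t* p, size_t n) {
  const size_t chunk = n / 3;
  const size_t last = n - 2 * chunk;  // the third chunk takes the remainder
  const uint8_t* a = p;
  const uint8_t* b = p + chunk;
  const uint8_t* c = p + 2 * chunk;
  uint32_t sa = 0xFFFFFFFFu, sb = 0, sc = 0;
  for (size_t i = 0; i < chunk; ++i) {
    // No statement here depends on another, so a CPU can overlap them.
    sa = StepByte(sa, a[i]);
    sb = StepByte(sb, b[i]);
    sc = StepByte(sc, c[i]);
  }
  for (size_t i = chunk; i < last; ++i) sc = StepByte(sc, c[i]);
  // Linearity: state(A || B) == extend(state(A), |B|) ^ state_from_zero(B).
  uint32_t state = ExtendRawByZeroes(sa, chunk) ^ sb;
  state = ExtendRawByZeroes(state, last) ^ sc;
  return ~state;
}
```

Both functions return the same value for any input. In accelerated code the combining step would use a fast extend-by-zeroes rather than the byte loop shown here.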

PiperOrigin-RevId: 799757460
Change-Id: I80118d5c1736ae31d69e5624c94cc0a6513ef28f
2025-08-26 16:07:36 -07:00
Connal de Souza
274c81389f Optimize crc32 Extend by removing obsolete length alignment.
Currently at the start of the Extend() call we process some number of bytes to align
the length to a multiple of 16. However, for large inputs we then process another small number of bytes to align the next load address to 8 bytes, undoing the length alignment. At the end of the call, we process the remaining bytes anyway.

The initial length alignment is not useful, since it is undone anyway. We never return early here for small inputs since this function is only used for lengths > 64 anyway. Removing this reduces the amount of time we spend processing only a small number of bytes at a time.

Also, we can optimize processing the remaining bytes at the end by leveraging the CRC32 instructions for 2, 4, and 8 bytes.

This looks to be about 2-5% faster on various platforms for typical input sizes.
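
The tail handling described above can be sketched as follows. This is a portable illustration, not the code from this change: `StepBytes` is a bitwise software stand-in for a CRC32 instruction of the given width, `ProcessTail` is a hypothetical name, and the sketch assumes fewer than 16 bytes remain and also uses a 1-byte step.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bitwise CRC-32C over `width` bytes. In accelerated code each call with
// width 8, 4, 2, or 1 would map to a single CRC32 instruction of that width.
inline uint32_t StepBytes(uint32_t state, const uint8_t* p, size_t width) {
  for (size_t i = 0; i < width; ++i) {
    state ^= p[i];
    for (int k = 0; k < 8; ++k) {
      state = (state >> 1) ^ (0x82F63B78u & (0u - (state & 1u)));
    }
  }
  return state;
}

// Consumes a tail of fewer than 16 bytes in at most four steps, choosing the
// step widths from the set bits of `n` (assumption of this sketch).
uint32_t ProcessTail(uint32_t state, const uint8_t* p, size_t n) {
  assert(n < 16);
  for (size_t width = 8; width >= 1; width >>= 1) {
    if (n & width) {
      state = StepBytes(state, p, width);
      p += width;
    }
  }
  return state;
}
```

A 15-byte tail is handled as 8 + 4 + 2 + 1 bytes, so it takes four steps instead of fifteen single-byte steps.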

PiperOrigin-RevId: 793697720
Change-Id: Ibe71a51c851863ad40acef7d334694a9ac930f4d
2025-08-11 10:12:05 -07:00
Connal de Souza
483951bb49 Update the crc32 dynamic dispatch table with newer platforms.
Up to 13% performance improvement for the platforms affected.

PiperOrigin-RevId: 789033088
Change-Id: I1d74360377e3c40dfaae2108ec55f907960d177a
2025-07-30 14:01:43 -07:00
Abseil Team
57abc0ee3f Optimize CRC-32C extension by zeroes
Optimize multiply() (renamed to MultiplyWithExtraX33()) to eliminate
several instructions that were present only to avoid introducing an
extra factor of x^33 into the multiplication.  It's actually fine to
introduce the extra factor of x^33 as long as it's canceled out with an
extra factor of x^-33 in all the kCRC32CPowers[] entries.

To make this work, the number of bits dropped by ComputeZeroConstant()
had to be increased from 2 to at least 3, since 2^(i + 3 +
kNumDroppedBits) - 33 must be >= 0 for all i including i=0; otherwise
kCRC32CPowers[0] would need a negative power of x.  However, this is
fine since it's more efficient to utilize CRC32_u32() and CRC32_u64()
for bits 2 and 3 anyway.  So, increase kNumDroppedBits to 4.

Add a Python script that generates the updated kCRC32CPowers[].  It
isn't wired up to the build system, but rather is just added so that
kCRC32CPowers[] can be reproduced.

Also add a test which tests ExtendCrc32cByZeroes() with all the length
bits, thus testing all the entries of kCRC32CPowers[].

Note that the kCRC32CPowers[] generation script and new test case are
things we should have had anyway, regardless of the x^33 optimization.

This change slightly improves the performance of Extend() for lengths
greater than or equal to 2048 bytes, and also the performance of
ExtendByZeroes().  It also slightly reduces the binary code size.
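
For readers unfamiliar with the underlying idea, extending a CRC by zero bytes amounts to multiplying the raw CRC state by a power of x modulo the CRC polynomial. The sketch below shows that with a plain shift-and-xor multiply. It is an assumption-laden illustration: it computes the power on the fly instead of using a precomputed table such as kCRC32CPowers[], it does not use carry-less multiply instructions, and it does not reproduce the x^33 bookkeeping described above. All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr uint32_t kPoly = 0x82F63B78u;  // CRC-32C polynomial, bit-reflected

// Multiplies a polynomial by x modulo the CRC-32C polynomial. In the
// reflected representation, bit 31 holds the x^0 coefficient.
inline uint32_t MulByX(uint32_t v) {
  return (v >> 1) ^ (kPoly & (0u - (v & 1u)));
}

// Returns a * b modulo the CRC-32C polynomial.
uint32_t MulMod(uint32_t a, uint32_t b) {
  uint32_t product = 0;
  for (uint32_t bit = 0x80000000u; bit != 0; bit >>= 1) {
    if (a & bit) product ^= b;  // add b * x^i when a has an x^i term
    b = MulByX(b);
  }
  return product;
}

// Returns x^(8 * n) modulo the polynomial, by square-and-multiply.
uint32_t XPow8N(uint64_t n) {
  uint32_t result = 0x80000000u;  // the polynomial 1
  uint32_t base = 0x00800000u;    // the polynomial x^8
  while (n != 0) {
    if (n & 1) result = MulMod(result, base);
    base = MulMod(base, base);
    n >>= 1;
  }
  return result;
}

// Given the finalized CRC of some data, returns the CRC of that data
// followed by `n` zero bytes.
uint32_t ExtendByZeroesSketch(uint32_t crc, uint64_t n) {
  return ~MulMod(~crc, XPow8N(n));
}

// Bitwise reference CRC-32C, used only to check the result.
uint32_t Crc32c(const uint8_t* p, size_t n) {
  uint32_t state = 0xFFFFFFFFu;
  for (size_t i = 0; i < n; ++i) {
    state ^= p[i];
    for (int k = 0; k < 8; ++k) state = MulByX(state);
  }
  return ~state;
}
```

The cost grows with the number of bits in the length rather than with the length itself, which is consistent with the nearly flat BM_ExtendByZeroes timings below.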

Before:
    BM_Calculate/2048                   84.3 ns         84.3 ns      8307735
    BM_Calculate/10000                   376 ns          375 ns      1865976
    BM_Calculate/500000                18538 ns        18531 ns        37813
    BM_ExtendByZeroes/1                 3.55 ns         3.55 ns    197111095
    BM_ExtendByZeroes/10                3.90 ns         3.89 ns    179773877
    BM_ExtendByZeroes/100               6.06 ns         6.06 ns    115242160
    BM_ExtendByZeroes/1000              12.0 ns         12.0 ns     58078004
    BM_ExtendByZeroes/10000             9.97 ns         9.97 ns     70335772
    BM_ExtendByZeroes/100000            12.1 ns         12.1 ns     58157829
    BM_ExtendByZeroes/1000000           14.4 ns         14.4 ns     48527365

After:
    BM_Calculate/2048                   82.8 ns         82.7 ns      8478296
    BM_Calculate/10000                   375 ns          375 ns      1869663
    BM_Calculate/500000                18547 ns        18538 ns        37846
    BM_ExtendByZeroes/1                 2.96 ns         2.96 ns    236772500
    BM_ExtendByZeroes/10                3.85 ns         3.85 ns    182059238
    BM_ExtendByZeroes/100               5.42 ns         5.42 ns    129077546
    BM_ExtendByZeroes/1000              9.43 ns         9.42 ns     74232457
    BM_ExtendByZeroes/10000             8.14 ns         8.14 ns     86244218
    BM_ExtendByZeroes/100000            10.7 ns         10.7 ns     65467391
    BM_ExtendByZeroes/1000000           11.0 ns         11.0 ns     63575936
PiperOrigin-RevId: 786828855
Change-Id: I6208625fd1c35c2c137e756cf5fadc1adccfdd5d
2025-07-24 14:04:51 -07:00
Abseil Team
64a9eafe33 Disable sanitizer bounds checking in ComputeZeroConstant.
The code is correct, but the compiler can't optimize away the check.

PiperOrigin-RevId: 785603401
Change-Id: I9277e3b71965322691108f08597728dd84737329
2025-07-21 15:47:07 -07:00
Abseil Team
878361312d Automated Code Change
PiperOrigin-RevId: 783054860
Change-Id: I3f84881642f2f77be5d5275983243edf6305178c
2025-07-14 15:00:34 -07:00
Abseil Team
f60bfd822e Enable SIMD memcpy-crc on ARM cores.
PiperOrigin-RevId: 773749299
Change-Id: I798913549298c0993af16fc3ab6215089aab1f18
2025-06-20 10:13:26 -07:00
Abseil Team
99275763ac Use even faster reduction algorithm in FinalizePclmulStream()
My previous CL optimized the Barrett reduction.  But since this is CRC32C and
scalar instructions for it are available, there is actually no need for Barrett
reduction at all.  Just use two 64-bit CRC32C instructions to reduce fullCRC.

This improves CRC32C performance on 2048-byte messages on Skylake by another 2%
or so.

PiperOrigin-RevId: 739977426
Change-Id: I4611af88cd32ed7a995e772a13c30e3bdcec8de9
2025-03-24 09:59:22 -07:00
Abseil Team
c4ff4d561c Use more efficient reduction algorithm in FinalizePclmulStream()
1. When reducing 4 vectors to 1, fold across 2 vectors first and then across 1,
   instead of across 1 and then across 2.  This works slightly better because it
   makes the constants be used in order.

2. Use a faster algorithm to reduce 1 vector to a scalar value.

This approach is the same one I used in the assembly code I recently wrote for
the Linux kernel in the patch series
https://lore.kernel.org/lkml/20250210174540.161705-1-ebiggers@kernel.org/T/#u
(search for "reduce_128bits_to_crc").

On Skylake (which uses num_pclmul_streams=2), this improves CRC32C performance
on 2048-byte messages by about 2%.  The overall improvement is relatively small
since FinalizePclmulStream() is only called for messages >= 2048 bytes and is
only called num_pclmul_streams times per message.  So it's not really a
bottleneck, but the new code is definitely a bit shorter and faster.

PiperOrigin-RevId: 739002382
Change-Id: I0505e61f012e4a4f8b85958f7f00478f5b1a7026
2025-03-20 18:06:56 -07:00
Derek Mauro
feb3d276d4 Remove ABSL_INTERNAL_NEED_REDUNDANT_CONSTEXPR_DECL
which is no longer needed with the C++17 floor

PiperOrigin-RevId: 729365281
Change-Id: Ife5e778ead193bb37150b9799099e92f53252cb4
2025-02-20 21:07:28 -08:00
Pavel P
26b6046ab2 PR #1833: Make ABSL_INTERNAL_STEP_n macros consistent in crc code
Imported from GitHub PR https://github.com/abseil/abseil-cpp/pull/1833

`ABSL_INTERNAL_STEP1`, `ABSL_INTERNAL_STEP2`, and `ABSL_INTERNAL_STEP4` assumed that a variable named `p` exists wherever they are used, whereas the similar macro `ABSL_INTERNAL_STEP8` correctly takes `p` as a macro argument. This PR updates all of them to take an extra parameter instead of relying on `p`'s existence. It also renames `data` to `p` in `ABSL_INTERNAL_STEP8` for consistency with the others.
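
A minimal sketch of the pattern, assuming a bitwise software step in place of the real CRC helpers: `SKETCH_STEP1` and `StepByte` are hypothetical names and are not the Abseil macros themselves.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bitwise CRC-32C step for one byte (reflected polynomial 0x82F63B78).
inline uint32_t StepByte(uint32_t state, uint8_t byte) {
  state ^= byte;
  for (int k = 0; k < 8; ++k) {
    state = (state >> 1) ^ (0x82F63B78u & (0u - (state & 1u)));
  }
  return state;
}

// Hypothetical macro that takes the cursor as an argument instead of
// assuming a variable named `p` is in scope at the call site.
#define SKETCH_STEP1(crc, p)         \
  do {                               \
    (crc) = StepByte((crc), *(p)++); \
  } while (0)

uint32_t Crc32cBytewise(const uint8_t* data, size_t n) {
  uint32_t crc = 0xFFFFFFFFu;
  const uint8_t* cursor = data;  // deliberately not named `p`
  for (size_t i = 0; i < n; ++i) SKETCH_STEP1(crc, cursor);
  return ~crc;
}
```

Because the cursor is passed explicitly, the macro works with any pointer name at the call site.
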
Merge 9a89bb0b62 into e3183f1584

Merging this change closes #1833

COPYBARA_INTEGRATE_REVIEW=https://github.com/abseil/abseil-cpp/pull/1833 from pps83:master-macrofix 9a89bb0b62
PiperOrigin-RevId: 728751982
Change-Id: I48c3635f8d22848115744f6e9869717136385154
2025-02-19 11:36:50 -08:00
Abseil Team
e3183f1584 Move the implementation of absl::ComputeCrc32c to the header file, to
facilitate inlining.

PiperOrigin-RevId: 728699475
Change-Id: I444b1aa5b1ea77705175eadf47e05d772446441d
2025-02-19 09:16:21 -08:00
David Majnemer
df8178e26e Crc: Only test non_temporal_store_memcpy_avx on AVX targets
non_temporal_store_memcpy_avx uses gnu::target("avx") to use AVX intrinsics
inside its function body even if the compiler was not configured for AVX
support. This is OK because non_temporal_store_memcpy_avx is guarded by a cpuid
check before it is called.

However, non_temporal_memcpy_test.cc performs no such cpuid guard. In practice,
nobody will really notice this bug as CPUs have had AVX for a long time by now.

That said, this does come up if one has compiled absl for x86_64 and runs the
binary on an arm64 Mac. This is because the Rosetta 2 emulation environment does
not support AVX or newer instructions.
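
The guard pattern discussed here can be sketched as below. This is a simplified illustration, not the Abseil code: `CopyAvx` and `CopyDispatch` are hypothetical names, the copy uses ordinary unaligned stores instead of non-temporal ones, and the AVX path is compiled only for GCC-compatible compilers targeting x86_64.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>
#define SKETCH_HAS_AVX_TARGET 1
#else
#define SKETCH_HAS_AVX_TARGET 0
#endif

#if SKETCH_HAS_AVX_TARGET
// Compiled with AVX enabled for this function only, even when the rest of
// the translation unit is built without -mavx.
__attribute__((target("avx"))) void CopyAvx(void* dst, const void* src,
                                            size_t n) {
  auto* d = static_cast<unsigned char*>(dst);
  const auto* s = static_cast<const unsigned char*>(src);
  for (; n >= 32; n -= 32, d += 32, s += 32) {
    const __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(s));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(d), v);
  }
  std::memcpy(d, s, n);  // copy the remaining 0..31 bytes
}
#endif

// Every caller, tests included, goes through the runtime check, so the AVX
// body never runs on a CPU or emulator that lacks AVX.
void CopyDispatch(void* dst, const void* src, size_t n) {
#if SKETCH_HAS_AVX_TARGET
  if (__builtin_cpu_supports("avx")) {
    CopyAvx(dst, src, n);
    return;
  }
#endif
  std::memcpy(dst, src, n);
}
```

A test that calls `CopyAvx` directly would skip the check, which is the kind of unguarded call this commit describes.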

PiperOrigin-RevId: 717991751
Change-Id: Id41bd186ebfd1cf7124ab5211fbfb74a01d5b56c
2025-01-21 11:06:47 -08:00
David Majnemer
3735766b3b Crc: Remove the __builtin_cpu_supports path for SupportsArmCRC32PMULL
It seems that this feature is not fully baked on all build configurations, so let's remove it for now.

PiperOrigin-RevId: 716825311
Change-Id: I2ea9d941f8f3f177f9eb2afbd737935d58923780
2025-01-17 15:56:53 -08:00
David Majnemer
3ded0b656e crc: Use absl::nullopt when returning absl::optional
Otherwise we can observe a build failure when absl::optional != std::optional.

PiperOrigin-RevId: 716275922
Change-Id: I4918a8901530f0daafeec07e319fd79123358bc1
2025-01-16 10:01:05 -08:00
David Majnemer
6effb000ca Crc: Detect support for pmull and crc instructions on Apple AArch64
With a newer clang, we can use __builtin_cpu_supports which caches all
the feature bits.

If we are using an older clang, we fall back to querying sysctlbyname
for the relevant processor features.

PiperOrigin-RevId: 715153229
Change-Id: I570fa349f96829d5da3b32c928480ddf67176cad
2025-01-13 16:45:10 -08:00
Derek Mauro
90a7ba66e8 Updates to CI to support newer versions of tools
Linux "latest" containers updated to
GCC 14.2
CMake 3.31.2
Bazel 8.0.0

Included are various fixes to get these versions to work.

Bazel now references repositories by their canonical names from the
Bazel Central Registry. For example, Abseil is now @abseil-cpp instead
of @com_google_absl, and GoogleTest is now @googletest instead of
@com_google_googletest. Users still using the old WORKSPACE system may
need to use `repo_mapping` on repositories using the old names. See
`WORKSPACE.bazel` in this commit for an example.

PiperOrigin-RevId: 709102146
Change-Id: I02327ed4f8fb947766480bdeef2b1930a7f831eb
2024-12-23 10:58:05 -08:00
Dertosh
fffac1157d PR #1794: Update cpu_detect.cc
fix hw crc32 and AES capability check, fix undefined

Imported from GitHub PR https://github.com/abseil/abseil-cpp/pull/1794

Source and explanation
https://github.com/JuliaLang/julia/issues/26458
https://github.com/memcached/memcached/pull/744

When building for aarch64 on v22_clang-16.0.6-centos7:
```
abseil-cpp/absl/crc/internal/cpu_detect.cc:273:20: error: use of undeclared identifier 'HWCAP_CRC32'
  return (hwcaps & HWCAP_CRC32) && (hwcaps & HWCAP_PMULL);
                   ^
abseil-cpp/absl/crc/internal/cpu_detect.cc:273:46: error: use of undeclared identifier 'HWCAP_PMULL'
  return (hwcaps & HWCAP_CRC32) && (hwcaps & HWCAP_PMULL);
```
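
One common way to cope with system headers that lack these macros is to define fallbacks before use. The sketch below illustrates that approach. Whether this PR does exactly this is not shown here, the function names are hypothetical, and the fallback values are the aarch64 Linux hwcap bit assignments.

```cpp
#include <cassert>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>  // getauxval, AT_HWCAP
#endif

// Fallbacks for toolchains whose headers do not provide these macros. The
// values match the aarch64 Linux hwcap bit assignments.
#ifndef HWCAP_PMULL
#define HWCAP_PMULL (1 << 4)
#endif
#ifndef HWCAP_CRC32
#define HWCAP_CRC32 (1 << 7)
#endif

// Pure helper so the bit test can be exercised on any platform.
bool HasCrc32AndPmull(unsigned long hwcaps) {
  return (hwcaps & HWCAP_CRC32) && (hwcaps & HWCAP_PMULL);
}

// Hypothetical wrapper: queries the kernel on aarch64 Linux and reports
// "unsupported" everywhere else.
bool SupportsArmCrc32PmullSketch() {
#if defined(__linux__) && defined(__aarch64__)
  return HasCrc32AndPmull(getauxval(AT_HWCAP));
#else
  return false;
#endif
}
```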

Merge 3ee325b7a4 into 940e0ec36a

Merging this change closes #1794

COPYBARA_INTEGRATE_REVIEW=https://github.com/abseil/abseil-cpp/pull/1794 from Dertosh:patch-1 3ee325b7a4
PiperOrigin-RevId: 705936372
Change-Id: Ifebd6d1a854e17acf6cc00bab92053bc0d4c2349
2024-12-13 10:59:49 -08:00
Derek Mauro
29fdacd2e5 Fix the conditional compilation of non_temporal_store_memcpy_avx
to verify that AVX can be forced via `gnu::target`.

Fixes #1759

PiperOrigin-RevId: 677853230
Change-Id: Ic69045c71ddf8230fd7b0210ba4aef8693053232
2024-09-23 10:39:56 -07:00
Pavel P
77224c28ff PR #1662: Replace shift with addition in crc multiply
Imported from GitHub PR https://github.com/abseil/abseil-cpp/pull/1662

Merge 4b2c6c909b573d31a1cccba7cb72d4d8badeef8b into cba31a9562

Merging this change closes #1662

COPYBARA_INTEGRATE_REVIEW=https://github.com/abseil/abseil-cpp/pull/1662 from pps83:crc-add 4b2c6c909b573d31a1cccba7cb72d4d8badeef8b
PiperOrigin-RevId: 631470883
Change-Id: I4a72be643ed341ddf0e0007418ab4a613a03db4b
2024-05-07 10:33:09 -07:00
Pavel P
564372fcd6 PR #1653: Remove unnecessary casts when calling CRC32_u64
Imported from GitHub PR https://github.com/abseil/abseil-cpp/pull/1653

CRC32_u64 returns uint32_t, no need to cast returned result to uint32_t

Merge 90e7b063f3 into 9a61b00dde

Merging this change closes #1653

COPYBARA_INTEGRATE_REVIEW=https://github.com/abseil/abseil-cpp/pull/1653 from pps83:CRC32_u64-cast 90e7b063f3
PiperOrigin-RevId: 626462347
Change-Id: I748a2da5fcc66eb6aa07aaf0fbc7eca927fcbb16
2024-04-19 13:59:21 -07:00
Connal de Souza
61e47a454c Optimize crc32 V128_From2x64 on Arm
This removes redundant vector-vector moves and results in Extend being up to 3% faster.

PiperOrigin-RevId: 621948170
Change-Id: Id82816aa6e294d34140ff591103cb20feac79d9a
2024-04-04 13:09:48 -07:00
Abseil Team
18018aa45d Adjust conditional compilation in non_temporal_memcpy.h
This change will allow the AVX version of non-temporal memcpy to be compiled
even if the compiler isn't run with AVX support. This allows runtime dispatch
to select the AVX implementation for CPUs that are known to be compatible with
AVX instructions.

PiperOrigin-RevId: 619594422
Change-Id: Ia7d92404ef8d10d152030b29b71948ed954f28f5
2024-03-27 11:22:57 -07:00
Abseil Team
2f0591010d Replace //visibility:private with :__pkg__ for certain targets
This will allow us to give visibility to other Google-internal libraries. The
change is necessary since //visibility:private cannot be combined with other
specifications.

PiperOrigin-RevId: 615779561
Change-Id: I82b1edfa4e1ca280e429cf2a5e4003a1cc316a60
2024-03-14 08:01:09 -07:00
Abseil Team
2a7d0da1dd Add several missing includes in crc/internal
PiperOrigin-RevId: 615504707
Change-Id: Ia0e8211bd3c3d28fd0715c8f296ec50f6a700757
2024-03-13 12:21:38 -07:00
Abseil Team
3c1f9be71e Disable ubsan for benign unaligned access in crc_memcpy
PiperOrigin-RevId: 615160537
Change-Id: I29070c898104c55e6563eed0eef7397441bef1d7
2024-03-12 13:51:42 -07:00
Abseil Team
e20285c652 Delete a stray comment
PiperOrigin-RevId: 615017130
Change-Id: I73277de8ece31d6a35b47dbdb205b473324b74a2
2024-03-12 06:19:45 -07:00
Stanislaw Halik
d4578efe7c PR #1617: fix MSVC 32-bit build with -arch:AVX
Imported from GitHub PR https://github.com/abseil/abseil-cpp/pull/1617

The intrinsics used aren't available on `x86_64` processors while running in 32-bit mode. See:

- list of 64-bit intrinsics (https://learn.microsoft.com/en-us/cpp/intrinsics/x64-amd64-intrinsics-list?view=msvc-170)
- list of 32-bit intrinsics (https://learn.microsoft.com/en-us/cpp/intrinsics/x86-intrinsics-list?view=msvc-170)
- list of predefined MSVC macros (https://learn.microsoft.com/en-us/cpp/preprocessor/predefined-macros?view=msvc-170)

The error message in question:

```console
F:\dev\opentrack-depends\onnxruntime-build\msvc\_deps\abseil_cpp-src\absl/crc/internal/crc32_x86_arm_combined_simd.h(145,32): error C3861: '_mm_crc32_u64': identifier not found
  return static_cast<uint32_t>(_mm_crc32_u64(crc, v));
                               ^
F:\dev\opentrack-depends\onnxruntime-build\msvc\_deps\abseil_cpp-src\absl/crc/internal/crc32_x86_arm_combined_simd.h(193,50): error C3861: '_mm_cvtsi128_si64': identifier not found
inline int64_t V128_Low64(const V128 l) { return _mm_cvtsi128_si64(l); }
```
Merge 06f5832108 into 797501d12e

Merging this change closes #1617

COPYBARA_INTEGRATE_REVIEW=https://github.com/abseil/abseil-cpp/pull/1617 from sthalik:pr/fix-msvc-32-bit-avx 06f5832108
PiperOrigin-RevId: 607483370
Change-Id: Id2a6f6dd33c2707fe7ffe134e7335916f3fb9da3
2024-02-15 15:58:50 -08:00
Shahriar Rouf
780bfc194d Replace testonly = 1 with testonly = True in abseil BUILD files.
https://bazel.build/build/style-guide#other-conventions

PiperOrigin-RevId: 603084345
Change-Id: Ibd7c9573d820f88059d12c46ff82d7d322d002ae
2024-01-31 10:08:35 -08:00
Abseil Team
49ff696cda Migrate empty CrcCordState to absl::NoDestructor.
Note that this only changes how we allocate the empty state; the reference counting of `empty` stays the same.
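
The idea behind a never-destroyed static can be sketched in standard C++ as follows. This is not the implementation of absl::NoDestructor or of CrcCordState: `NoDestructorSketch`, `RefcountedStateSketch`, and `EmptyStateSketch` are hypothetical names used only to show the pattern.

```cpp
#include <cassert>
#include <new>
#include <utility>

// Constructs a T in place and never runs its destructor, so the object stays
// valid even during static destruction at program exit.
template <typename T>
class NoDestructorSketch {
 public:
  template <typename... Args>
  explicit NoDestructorSketch(Args&&... args)
      : ptr_(new (storage_) T(std::forward<Args>(args)...)) {}

  NoDestructorSketch(const NoDestructorSketch&) = delete;
  NoDestructorSketch& operator=(const NoDestructorSketch&) = delete;

  T* get() { return ptr_; }

 private:
  alignas(T) unsigned char storage_[sizeof(T)];
  T* ptr_;  // points into storage_; the T is intentionally never destroyed
};

// Hypothetical stand-in for a reference-counted state object.
struct RefcountedStateSketch {
  int refs = 1;
};

// Returns the shared empty state. The function-local static is initialized
// once, on first use.
RefcountedStateSketch* EmptyStateSketch() {
  static NoDestructorSketch<RefcountedStateSketch> empty;
  return empty.get();
}
```

Callers that take or drop references on the returned object behave as before; only the storage and lifetime of the singleton differ.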

PiperOrigin-RevId: 599526339
Change-Id: I2c6aaf875c144c947e17fe8f69692b1195b55dd7
2024-01-18 09:11:43 -08:00
Derek Mauro
c8087ae8bd Avoid using the non-portable type __m128i_u.
According to https://stackoverflow.com/a/68939636 it is safe to use
__m128i instead.

https://learn.microsoft.com/en-us/cpp/intrinsics/x86-intrinsics-list?view=msvc-170 also uses this type instead

__m128i_u is just __m128i with a looser alignment requirement, but
simply calling _mm_loadu_si128() instead of _mm_load_si128() is enough to
tell the compiler when a pointer is unaligned.

Fixes #1552

PiperOrigin-RevId: 576931936
Change-Id: I7c3530001149b360c12a1786c7e1832754d0e35c
2023-10-26 11:16:31 -07:00
Derek Mauro
0ef3ef4329 Bazel: Enable the header_modules feature
PiperOrigin-RevId: 572575394
Change-Id: Ic1c5ac2423b1634e50c43bad6daa14e82a8f3e2c
2023-10-11 07:58:06 -07:00
Derek Mauro
143e983739 Bazel: Support layering_check and parse_headers
The layering_check feature ensures that rules that include a header
explicitly depend on a rule that exports that header. Compiler support
is required, and currently only Clang 16+ supports diagnosing
layering_check failures.

The parse_headers feature ensures headers are self-contained by
compiling them with -fsyntax-only on supported compilers.

PiperOrigin-RevId: 572350144
Change-Id: I37297f761566d686d9dd58d318979d688b7e36d1
2023-10-10 13:30:24 -07:00
Connal de Souza
f3ba72ee55 Add entries for Neoverse N2, V1, and V2 into CRC dynamic dispatch table.
PiperOrigin-RevId: 571430428
Change-Id: I4777c37c5287d26a75f37fe059324ac218878f0e
2023-10-06 14:07:43 -07:00
Connal de Souza
ac364eb9d0 Optimize CRC32 for Ampere Siryn
Siryn's crc32 instruction seems to have latency 3 and throughput 1, which makes the optimal ratio of pmull and crc streams close to that of tested x86 machines. Up to +120% faster for large inputs.

PiperOrigin-RevId: 568645559
Change-Id: I86b85b1b2a5d4fb3680c516c4c9044238b20fe61
2023-09-26 14:13:55 -07:00
Connal de Souza
aa3c949a7f Optimize CRC32 Extend for large inputs on Arm
This is a temporary workaround for an apparent compiler bug with pmull(2) instructions. The current hot loop looks like this:

mov	w14, #0xef02,
lsl	x15, x15, #6,
mov	x13, xzr,
movk	w14, #0x740e, lsl #16,
sub	x15, x15, #0x40,
ldr	q4, [x16, #0x4e0],

_LOOP_START:
add	x16, x9, x13,
add	x17, x12, x13,
fmov	d19, x14,            <--------- This is Loop invariant and expensive
add	x13, x13, #0x40,
cmp	x15, x13,
prfm	pldl1keep, [x16, #0x140],
prfm	pldl1keep, [x17, #0x140],
ldp	x18, x0, [x16, #0x40],
crc32cx	w10, w10, x18,
ldp	x2, x18, [x16, #0x50],
crc32cx	w10, w10, x0,
crc32cx	w10, w10, x2,
ldp	x0, x2, [x16, #0x60],
crc32cx	w10, w10, x18,
ldp	x18, x16, [x16, #0x70],
pmull2	v5.1q, v1.2d, v4.2d,
pmull2	v6.1q, v0.2d, v4.2d,
pmull2	v7.1q, v2.2d, v4.2d,
pmull2	v16.1q, v3.2d, v4.2d,
ldp	q17, q18, [x17, #0x40],
crc32cx	w10, w10, x0,
pmull	v1.1q, v1.1d, v19.1d,
crc32cx	w10, w10, x2,
pmull	v0.1q, v0.1d, v19.1d,
crc32cx	w10, w10, x18,
pmull	v2.1q, v2.1d, v19.1d,
crc32cx	w10, w10, x16,
pmull	v3.1q, v3.1d, v19.1d,
ldp	q20, q21, [x17, #0x60],
eor	v1.16b, v17.16b, v1.16b,
eor	v0.16b, v18.16b, v0.16b,
eor	v1.16b, v1.16b, v5.16b,
eor	v2.16b, v20.16b, v2.16b,
eor	v0.16b, v0.16b, v6.16b,
eor	v3.16b, v21.16b, v3.16b,
eor	v2.16b, v2.16b, v7.16b,
eor	v3.16b, v3.16b, v16.16b,
b.ne	_LOOP_START

There is a redundant fmov that moves the same constant into a Neon register every loop iteration to be used in the PMULL instructions. The PMULL2 instructions already have this constant loaded into Neon registers. After this change, both the PMULL and PMULL2 instructions use the values in q4, and they are not reloaded every iteration. This fmov was expensive because it contends for execution units with crc32cx instructions. This is up to 20% faster for large inputs.

PiperOrigin-RevId: 567391972
Change-Id: I4c8e49750cfa5cc5730c3bb713bd9fd67657804a
2023-09-21 12:52:45 -07:00
Abseil Team
c78a3f32c3 Remove implicit int64_t->uint64_t conversion in ARM version of V128_Extract64
PiperOrigin-RevId: 565662176
Change-Id: I18d5d9eb444b0090e3f4ab8f66ad214a67344268
2023-09-15 06:30:25 -07:00
Abseil Team
2c4ce9b2ad Rename x86 crc_memcpy tests since they cover ARM as well
This is a rename only with no other changes.

PiperOrigin-RevId: 563428969
Change-Id: Iefc184bf9a233cb72649bc20b8555f6b662cac6d
2023-09-07 07:48:00 -07:00
Abseil Team
433289a258 Roll forward support for ARM intrinsics in crc_memcpy
This CL rolls forward a previous change which we rolled back temporarily due to
compilation errors on x86 when PCLMUL intrinsics were unavailable.

*** Original change description ***

This change replaces inline x86 intrinsics with generic versions that compile
for both x86 and ARM depending on the target arch.

This change does not enable the accelerated crc memcpy engine on ARM. That will
be done in a subsequent change after the optimal number of vector and integer
regions for different CPUs is determined.

***

PiperOrigin-RevId: 563416413
Change-Id: Iee630a15ed83c26659adb0e8a03d3f3d3a46d688
2023-09-07 06:53:24 -07:00
Abseil Team
461f1e49b3 Rollback adding support for ARM intrinsics
In some configurations this change causes compilation errors. We will roll this
forward again after those issues are addressed.

PiperOrigin-RevId: 562810916
Change-Id: I45b2a8d456273e9eff188f36da8f11323c4dfe66
2023-09-05 09:57:30 -07:00
Abseil Team
1a882833c0 Add support for ARM intrinsics in crc_memcpy
This change replaces inline x86 intrinsics with generic versions that compile
for both x86 and ARM depending on the target arch.

This change does not enable the accelerated crc memcpy engine on ARM. That will
be done in a subsequent change after the optimal number of vector and integer
regions for different CPUs is determined.

PiperOrigin-RevId: 562785420
Change-Id: I8ba4aa8de17587cedd92532f03767059a481f159
2023-09-05 08:24:39 -07:00