Mirror of https://github.com/huggingface/xet-core.git (synced 2026-06-04 13:30:29 +08:00)
## Summary
- Deduplicate `MDBMinimalShard` file infos by file hash during both the
sync and async streaming parses.
- Keep only the first file info seen for a duplicate file hash; async
callbacks fire only for retained entries.
- Add a focused streaming-shard test covering parse, async callbacks,
and reserialization.
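
The first-wins rule above can be sketched as follows. This is a minimal illustration, not the code from this change: `FileInfo`, `FileHash`, `label`, and `dedup_first_wins` are hypothetical stand-ins for the real shard types, and the callback is modeled as a plain closure rather than the async callback used by the streaming parser.

```rust
use std::collections::HashSet;

// Hypothetical stand-in for the 256-bit file hash used by the real shard types.
type FileHash = [u8; 32];

// Hypothetical simplified file-info record; the real structure carries
// segment, verification, and metadata extension data.
#[derive(Debug, Clone, PartialEq)]
struct FileInfo {
    file_hash: FileHash,
    label: String, // stand-in for the record's payload
}

/// Keeps the first `FileInfo` seen for each `file_hash`, preserving input
/// order, and invokes `on_retained` only for entries that are kept.
fn dedup_first_wins<I, F>(entries: I, mut on_retained: F) -> Vec<FileInfo>
where
    I: IntoIterator<Item = FileInfo>,
    F: FnMut(&FileInfo),
{
    let mut seen: HashSet<FileHash> = HashSet::new();
    let mut retained = Vec::new();
    for entry in entries {
        // `HashSet::insert` returns false when the hash was already present,
        // so later duplicates are skipped without touching the callback.
        if seen.insert(entry.file_hash) {
            on_retained(&entry);
            retained.push(entry);
        }
    }
    retained
}
```

Tracking seen hashes in a side `HashSet` keeps the retained entries in a `Vec`, so the original ordering of the stream is preserved for reserialization.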
## Why
Duplicate file infos can survive the minimal streaming shard
parse/re-serialize path because it stores file entries as a `Vec`, which
does not enforce uniqueness. This change limits canonicalization to that
streaming path while leaving in-memory shard and set-operation behavior
unchanged.
## Impact
- MDBMinimalShard::num_files() now reports unique file hashes for parsed
shards.
- Later duplicate file infos are ignored even if they contain richer
optional verification or metadata extension data.
- Raw full-section readers, MDBInMemoryShard behavior, and shard set
operations remain unchanged.
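
The trade-off in the second bullet can be illustrated with a small sketch. All names here (`MinimalShard`, `FileInfo`, `verification`, `from_entries`) are hypothetical simplifications and do not mirror the actual `MDBMinimalShard` API; a `u64` stands in for the file hash.

```rust
use std::collections::HashSet;

// Hypothetical simplified record; field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
struct FileInfo {
    file_hash: u64,               // stand-in for the 256-bit file hash
    verification: Option<String>, // stand-in for optional verification data
}

// Hypothetical minimal shard that stores first-wins deduplicated entries.
struct MinimalShard {
    files: Vec<FileInfo>,
}

impl MinimalShard {
    /// Builds the shard, dropping any entry whose hash was already seen.
    fn from_entries(entries: Vec<FileInfo>) -> Self {
        let mut seen = HashSet::new();
        let files = entries
            .into_iter()
            .filter(|e| seen.insert(e.file_hash))
            .collect();
        Self { files }
    }

    /// Reports the number of unique file hashes.
    fn num_files(&self) -> usize {
        self.files.len()
    }
}
```

Under these assumptions, parsing an entry without verification data followed by a duplicate that carries it yields a shard whose retained entry still has `verification == None`: the richer later entry is dropped rather than merged.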
## Validation
- cargo test -p xet-core-structures metadata_shard
- cargo test -p xet-client test_global_dedup
- git diff --check
- rustfmt --edition 2024 --check
  xet_core_structures/src/metadata_shard/set_operations.rs
  xet_core_structures/src/metadata_shard/shard_in_memory.rs
  xet_core_structures/src/metadata_shard/streaming_shard.rs
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Medium Risk**
> Changes shard streaming parse semantics by dropping duplicate
`file_hash` entries, which can affect downstream counts/serialization
and may hide later entries’ richer metadata/verification.
>
> **Overview**
> `MDBMinimalShard` now **deduplicates file-info records by
`file_hash`** during both sync (`from_reader`) and async
(`from_reader_async_with_custom_callbacks`) streaming parses, keeping
only the *first* occurrence.
>
> Adds a focused test that constructs a shard stream with duplicate file
infos and asserts first-wins behavior, validates async parsing/callback
behavior, and confirms re-serialization only emits the retained entry.
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
1320ce36ce. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
## xet-core-structures
Core data structures for the Hugging Face Xet storage system, including Merkle hashes, metadata shards, and Xorb objects.
### Overview
- MerkleHash — 256-bit content-addressed hash used throughout the system
- Metadata shards — Compact shard format mapping file ranges to Xorb chunks
- Xorb objects — Content-addressed storage objects with byte-grouping compression
- Data structures — Specialized hash maps and utilities for deduplication
This crate is part of xet-core, the Rust backend for huggingface_hub.
### License
Apache-2.0