Assaf Vayner 654080d080 [codex] Deduplicate shard file infos (#834)
## Summary
- Deduplicate MDBMinimalShard file infos by file hash during sync and
async streaming parse.
- Keep only the first file info seen for a duplicate file hash; async
callbacks fire only for retained entries.
- Add a focused streaming-shard test covering parse, async callbacks,
and reserialization.

## Why
Duplicate file infos can survive the minimal streaming shard
parse/re-serialize path because it stores file entries as a Vec. This
narrows canonicalization to that streaming path while leaving in-memory
shard and set-operation behavior unchanged.
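
The first-wins rule can be illustrated with a minimal sketch. This is not the code from the PR: `FileInfo`, `FileHash`, `has_verification`, and `dedup_first_wins` are hypothetical stand-ins invented for illustration, and the real `MDBMinimalShard` types differ. It only shows the general technique of tracking already-seen hashes in a set while appending retained entries to a `Vec`, with a callback that fires only for retained entries.

```rust
use std::collections::HashSet;

// Hypothetical stand-ins; the real xet-core types are different.
type FileHash = [u8; 32];

#[derive(Debug, Clone, PartialEq)]
struct FileInfo {
    file_hash: FileHash,
    // Stands in for optional verification / metadata extension data.
    has_verification: bool,
}

/// Keeps the first entry seen for each file hash, preserving input order.
/// `on_retained` is invoked only for entries that are kept.
fn dedup_first_wins<I, F>(entries: I, mut on_retained: F) -> Vec<FileInfo>
where
    I: IntoIterator<Item = FileInfo>,
    F: FnMut(&FileInfo),
{
    let mut seen: HashSet<FileHash> = HashSet::new();
    let mut kept = Vec::new();
    for info in entries {
        // `insert` returns false when the hash was already present.
        if seen.insert(info.file_hash) {
            on_retained(&info);
            kept.push(info);
        }
    }
    kept
}

fn main() {
    let a = FileInfo { file_hash: [1; 32], has_verification: false };
    // Same hash as `a`, but with richer optional data; it is still dropped.
    let a_richer = FileInfo { file_hash: [1; 32], has_verification: true };
    let b = FileInfo { file_hash: [2; 32], has_verification: false };

    let mut callbacks = 0;
    let kept = dedup_first_wins(vec![a.clone(), a_richer, b.clone()], |_| callbacks += 1);

    assert_eq!(callbacks, 2);
    assert_eq!(kept, vec![a, b]);
    println!("kept {} unique file infos", kept.len());
}
```

In this sketch the later entry with `has_verification: true` is discarded because its hash was already seen, which mirrors the trade-off the commit message calls out under Impact.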

## Impact
- MDBMinimalShard::num_files() now reports unique file hashes for parsed
shards.
- Later duplicate file infos are ignored even if they contain richer
optional verification or metadata extension data.
- Raw full-section readers, MDBInMemoryShard behavior, and shard set
operations remain unchanged.

## Validation
- cargo test -p xet-core-structures metadata_shard
- cargo test -p xet-client test_global_dedup
- git diff --check
- rustfmt --edition 2024 --check
  xet_core_structures/src/metadata_shard/set_operations.rs
  xet_core_structures/src/metadata_shard/shard_in_memory.rs
  xet_core_structures/src/metadata_shard/streaming_shard.rs

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Changes shard streaming parse semantics by dropping duplicate
`file_hash` entries, which can affect downstream counts/serialization
and may hide later entries’ richer metadata/verification.
> 
> **Overview**
> `MDBMinimalShard` now **deduplicates file-info records by
`file_hash`** during both sync (`from_reader`) and async
(`from_reader_async_with_custom_callbacks`) streaming parses, keeping
only the *first* occurrence.
> 
> Adds a focused test that constructs a shard stream with duplicate file
infos and asserts first-wins behavior, validates async parsing/callback
behavior, and confirms re-serialization only emits the retained entry.
<!-- /CURSOR_SUMMARY -->
2026-05-14 08:49:52 -07:00

xet-core-structures

Core data structures for the Hugging Face Xet storage system, including Merkle hashes, metadata shards, and Xorb objects.

Overview

  • MerkleHash — 256-bit content-addressed hash used throughout the system
  • Metadata shards — Compact shard format mapping file ranges to Xorb chunks
  • Xorb objects — Content-addressed storage objects with byte-grouping compression
  • Data structures — Specialized hash maps and utilities for deduplication

This crate is part of xet-core, the Rust backend for huggingface_hub.

License

Apache-2.0