mirror of
https://github.com/huggingface/xet-core.git
synced 2026-06-04 13:30:29 +08:00
Currently, computing aggregate chunk hashes across independently processed ranges requires recomputing over the full concatenated chunk list. This PR introduces ChunkHashRange, a composable representation that can hash contiguous partial ranges and merge them while preserving equivalence with the existing xorb_hash / file_hash behavior. It provides an intermediate representation of the hash ranges that can be merged in arbitrary order to get the final hash. It uses O(log(n)) storage, and all operations run in linear time. Serialization and deserialization are fully supported.

The main use case is partial file edits. Previously, to edit the middle of a large file, the client had to know all the hashes for the full file, even if only a few in the middle were changed. Because the chunk metadata size is roughly 1/1000 of the data size, this can still be hundreds of MB for a large file. With this change, the unmodified parts of a file can be transmitted in O(log(n)) storage while still allowing the full file hash to be built; a sequence of 10M chunks now takes roughly the storage of ~500 chunks.

Along the way, we also added an optimization to the merge step that avoids an allocation, yielding a 2x speedup.

Co-authored-by: Hoyt Koepke <hoytak@xethub.com>
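To illustrate the general idea of a mergeable hash range, here is a minimal Python sketch. It is not the ChunkHashRange implementation and does not reproduce the xorb_hash / file_hash algorithm: it assumes a fixed binary Merkle tree over chunk indices, SHA-256 as the hash, and hypothetical names (HashRange, Node, collapse). It only merges adjacent ranges, although those merges may be grouped in any order. A contiguous range is stored as the list of maximal aligned subtree roots it fully covers, which is at most about 2·log2(n) entries.

```python
import hashlib
from dataclasses import dataclass


def leaf_hash(chunk: bytes) -> bytes:
    # Prefix byte separates leaves from interior nodes (illustrative choice).
    return hashlib.sha256(b"\x00" + chunk).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()


@dataclass(frozen=True)
class Node:
    level: int     # the subtree covers 2**level chunks
    index: int     # position among the subtrees of this level
    digest: bytes


def collapse(nodes):
    """Combine adjacent sibling subtrees until none remain."""
    stack = []
    for node in nodes:
        stack.append(node)
        while len(stack) >= 2:
            left, right = stack[-2], stack[-1]
            siblings = (
                left.level == right.level
                and left.index % 2 == 0
                and right.index == left.index + 1
            )
            if not siblings:
                break
            parent = Node(left.level + 1, left.index // 2,
                          node_hash(left.digest, right.digest))
            stack[-2:] = [parent]
    return stack


class HashRange:
    """Hypothetical stand-in for a mergeable range of chunk hashes."""

    def __init__(self, start, end, nodes):
        self.start, self.end, self.nodes = start, end, nodes

    @classmethod
    def from_chunks(cls, start, chunks):
        leaves = [Node(0, start + i, leaf_hash(c)) for i, c in enumerate(chunks)]
        return cls(start, start + len(leaves), collapse(leaves))

    def merge(self, other):
        if self.end != other.start:
            raise ValueError("ranges must be adjacent")
        return HashRange(self.start, other.end,
                         collapse(self.nodes + other.nodes))

    def root(self):
        # Only a range starting at chunk 0 describes a whole file.
        if self.start != 0 or not self.nodes:
            raise ValueError("root needs a non-empty range starting at 0")
        acc = self.nodes[-1].digest
        for node in reversed(self.nodes[:-1]):
            acc = node_hash(node.digest, acc)
        return acc
```

In this sketch, a client editing chunks 5 to 19 of a file would only need the HashRange objects for the untouched prefix and suffix, plus the new chunks in the middle, to rebuild the root.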
API changes
This folder contains a record of API changes in main. It's intended for AI agents to read in order to correctly apply merges or update dependencies and PRs.
The updates are listed by date in the form: update_<yymmdd>_<description>.md
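As a small illustration of the naming scheme, the following Python helper builds such a file name. The function name and the rule for turning a description into a lowercase, underscore-separated slug are assumptions made for this sketch; the scheme above only fixes the update_ prefix, the yymmdd date, and the .md extension.

```python
import re
from datetime import date


def api_change_filename(day: date, description: str) -> str:
    # Assumed convention: lowercase words joined by underscores.
    slug = re.sub(r"[^a-z0-9]+", "_", description.lower()).strip("_")
    # %y%m%d yields the two-digit year, month, and day (yymmdd).
    return f"update_{day:%y%m%d}_{slug}.md"


if __name__ == "__main__":
    print(api_change_filename(date(2026, 6, 4), "Chunk hash range"))
```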
When applying a merge, rebase, or downstream update, all AI agents should first scan this folder to understand what relevant information may need to be applied.
When creating a PR that involves an API change potentially requiring downstream updates, an AI agent should create such a file. The file should be human-readable but contain enough information to correctly apply the needed changes without scanning the code.