mirror of
https://github.com/huggingface/xet-core.git
synced 2026-06-04 13:30:29 +08:00
move spec to docs (#515)
publish to hub docs out of xet-core for xet-spec. Need to merge this first before iterating to get the github workflows working right.
20
.github/workflows/build_documentation.yml
vendored
Normal file
@@ -0,0 +1,20 @@
+name: Build Documentation
+
+on:
+  workflow_dispatch:
+  push:
+    branches:
+      - main
+      - doc-builder*
+      - v*-release
+
+jobs:
+  build:
+    uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
+    with:
+      commit_sha: ${{ github.sha }}
+      package: xet-core
+      package_name: xet-spec
+      additional_args: --not_python_module
+    secrets:
+      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
31
.github/workflows/build_pr_documentation.yml
vendored
Normal file
@@ -0,0 +1,31 @@
+name: Build PR Documentation
+
+on:
+  pull_request:
+    paths:
+      - "docs/**"
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
+  cancel-in-progress: true
+
+jobs:
+  debug-workflow-name:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Echo workflow metadata
+        run: |
+          echo "github.workflow = ${{ github.workflow }}"
+          echo "github.workflow_ref = ${{ github.workflow_ref }}"
+          echo "github.workflow_sha = ${{ github.workflow_sha }}"
+          echo "github.run_id = ${{ github.run_id }}"
+          echo "github.event_name = ${{ github.event_name }}"
+          echo "github.event.number = ${{ github.event.number }}"
+  build:
+    uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main
+    with:
+      commit_sha: ${{ github.event.pull_request.head.sha }}
+      pr_number: ${{ github.event.number }}
+      package: xet-core
+      package_name: xet-spec
+      additional_args: --not_python_module
11
.github/workflows/ci.yml
vendored
@@ -105,14 +105,3 @@ jobs:
         working-directory: hf_xet_wasm
         run: |
           ./build_wasm.sh
-
-  lint_markdown:
-    runs-on: ubuntu-latest
-    steps:
-      - name: Checkout repository
-        uses: actions/checkout@v4
-      - name: Install markdownlint-cli
-        run: |
-          npm install -g markdownlint-cli
-      - name: Lint markdown
-        run: markdownlint spec/**/*.md --config markdownlint.toml
15
.github/workflows/upload_pr_documentation.yml
vendored
Normal file
@@ -0,0 +1,15 @@
+name: Upload PR Documentation
+
+on:
+  workflow_run:
+    workflows: ["Build PR Documentation"]
+    types: [completed]
+
+jobs:
+  build:
+    uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@main
+    with:
+      package_name: xet-spec
+    secrets:
+      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
+      comment_bot_token: ${{ secrets.COMMENT_BOT_TOKEN }}
30
docs/source/_toctree.yml
Normal file
@@ -0,0 +1,30 @@
+- local: index
+  title: Xet Protocol Specification
+
+- title: Building a client library for xet storage
+  sections:
+    - local: upload-protocol
+      title: Upload Protocol
+    - local: download-protocol
+      title: Download Protocol
+    - local: api
+      title: CAS API
+    - local: auth
+      title: Authentication and Authorization
+    - local: file-id
+      title: Hugging Face Hub Files Conversion to Xet File ID's
+
+- title: Overall Xet architecture
+  sections:
+    - local: chunking
+      title: Content-Defined Chunking
+    - local: hashing
+      title: Hashing Methods
+    - local: file-reconstruction
+      title: File Reconstruction
+    - local: xorb
+      title: Xorb Format
+    - local: shard
+      title: Shard Format
+    - local: deduplication
+      title: Deduplication
@@ -1,10 +1,10 @@
 # CAS API Documentation

-This document describes the HTTP API endpoints used by the CAS (Content Addressable Storage) client to interact with the remote CAS server.
+This document describes the HTTP API endpoints used by the Content Addressable Storage (CAS) client to interact with the remote CAS server.

 ## Authentication

-To authenticate, authorize, and obtain the API base URL, follow the instructions in [Authentication](../spec/auth.md).
+To authenticate, authorize, and obtain the API base URL, follow the instructions in [Authentication](./auth).

 ## Converting Hashes to Strings

@@ -18,6 +18,9 @@ For every 8 bytes in the hash (indices 0-7, 8-15, 16-23, 24-31) reverse the orde

 Otherwise stated, consider each 8 byte part of a hash as a little endian 64 bit unsigned integer, then concatenate the hexadecimal representation of the 4 numbers in order (each padded with 0's to 16 characters).

+> [!NOTE]
+> In all cases that a hash is represented as a string it is converted from a byte array to a string using this procedure.
+
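The conversion rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the function name `hash_to_string` and the sample input `bytes(range(32))` are assumptions chosen for the example.

```python
def hash_to_string(hash_bytes: bytes) -> str:
    """Convert a 32-byte hash to its 64-character lowercase hex string.

    Each 8-byte group is read as a little-endian unsigned 64-bit integer
    and printed as 16 zero-padded hex digits; the four groups are then
    concatenated in order.
    """
    if len(hash_bytes) != 32:
        raise ValueError("expected a 32-byte hash")
    parts = []
    for start in range(0, 32, 8):
        value = int.from_bytes(hash_bytes[start:start + 8], "little")
        parts.append(f"{value:016x}")
    return "".join(parts)


# Illustrative input (an assumption for this sketch): the bytes 0x00..0x1f.
sample = hash_to_string(bytes(range(32)))
```

Reading each group as a little-endian integer has the same effect as reversing the byte order within each 8-byte group before hex-encoding.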
 ### Example

 Suppose a hash value is:
@@ -38,7 +41,7 @@ It is: `07060504030201000f0e0d0c0b0a0908171615141312111f1e1d1c1b1a1918`.
 - **Method**: `GET`
 - **Parameters**:
   - `file_id`: File hash in hex format (64 lowercase hexadecimal characters).
-See [file hashes](../spec/hashing.md#file-hashes) for computing the file hash and [converting hashes to strings](../spec/api.md#converting-hashes-to-strings).
+See [file hashes](./hashing#file-hashes) for computing the file hash and [converting hashes to strings](./api#converting-hashes-to-strings).
 - **Headers**:
   - `Range`: OPTIONAL. Format: `bytes={start}-{end}` (end is inclusive).
 - **Minimum Token Scope**: `read`
@@ -53,7 +56,7 @@ See [file hashes](../spec/hashing.md#file-hashes) for computing the file hash an
 }
 ```

-- **Error Responses**: See [Error Cases](../spec/api.md#error-cases)
+- **Error Responses**: See [Error Cases](./api#error-cases)
   - `400 Bad Request`: Malformed `file_id` in the path. Fix the path before retrying.
   - `401 Unauthorized`: Refresh the token to continue making requests, or provide a token in the `Authorization` header.
   - `404 Not Found`: The file does not exist. Not retryable.
@@ -67,7 +70,7 @@ OPTIONAL: -H Range: "bytes=0-100000"

 ### Example File Reconstruction Response Body

-See [QueryReconstructionResponse](../spec/download_protocol.md#queryreconstructionresponse-structure) for more details in the download protocol specification.
+See [QueryReconstructionResponse](./download-protocol#queryreconstructionresponse-structure) for more details in the download protocol specification.

 ### 2. Query Chunk Deduplication (Global Deduplication)

@@ -77,11 +80,11 @@ See [QueryReconstructionResponse](../spec/download_protocol.md#queryreconstructi
 - **Parameters**:
   - `prefix`: The only acceptable prefix for the Global Deduplication API is `default-merkledb`.
   - `hash`: Chunk hash in hex format (64 lowercase hexadecimal characters).
-See [Chunk Hashes](../spec/hashing.md#chunk-hashes) to compute the chunk hash and [converting hashes to strings](../spec/api.md#converting-hashes-to-strings).
+See [Chunk Hashes](./hashing#chunk-hashes) to compute the chunk hash and [converting hashes to strings](./api#converting-hashes-to-strings).
 - **Minimum Token Scope**: `read`
 - **Body**: None.
-- **Response**: Shard format bytes (`application/octet-stream`), deserialize as a [shard](../spec/shard.md#global-deduplication).
-- **Error Responses**: See [Error Cases](../spec/api.md#error-cases)
+- **Response**: Shard format bytes (`application/octet-stream`), deserialize as a [shard](./shard#global-deduplication).
+- **Error Responses**: See [Error Cases](./api#error-cases)
   - `400 Bad Request`: Malformed hash in the path. Fix the path before retrying.
   - `401 Unauthorized`: Refresh the token to continue making requests, or provide a token in the `Authorization` header.
   - `404 Not Found`: Chunk not already tracked by global deduplication. Not retryable.
@@ -103,10 +106,10 @@ An example shard response body can be found in [Xet reference files](https://hug
 - **Parameters**:
   - `prefix`: The only acceptable prefix for the Xorb upload API is `default`.
   - `hash`: Xorb hash in hex format (64 lowercase hexadecimal characters).
-See [Xorb Hashes](../spec/hashing.md#xorb-hashes) to compute the hash, and [converting hashes to strings](../spec/api.md#converting-hashes-to-strings).
+See [Xorb Hashes](./hashing#xorb-hashes) to compute the hash, and [converting hashes to strings](./api#converting-hashes-to-strings).
 - **Minimum Token Scope**: `write`
 - **Body**: Serialized Xorb bytes (`application/octet-stream`).
-See [xorb format serialization](../spec/xorb.md).
+See [xorb format serialization](./xorb).
 - **Response**: JSON (`UploadXorbResponse`)

 ```json
@@ -117,7 +120,7 @@ See [xorb format serialization](../spec/xorb.md).

 - Note: `was_inserted` is `false` if the Xorb already exists; this is not an error.

-- **Error Responses**: See [Error Cases](../spec/api.md#error-cases)
+- **Error Responses**: See [Error Cases](./api#error-cases)
   - `400 Bad Request`: Malformed hash in the path, Xorb hash does not match the body, or body is incorrectly serialized.
   - `401 Unauthorized`: Refresh the token to continue making requests, or provide a token in the `Authorization` header.
   - `403 Forbidden`: Token provided but does not have a wide enough scope (for example, a `read` token was provided). Clients MUST retry with a `write` scope token.
@@ -139,7 +142,7 @@ Uploads file reconstructions and new xorb listing, serialized into the shard for
 - **Method**: `POST`
 - **Minimum Token Scope**: `write`
 - **Body**: Serialized Shard data as bytes (`application/octet-stream`).
-See [Shard format guide](../spec/shard.md#shard-upload).
+See [Shard format guide](./shard#shard-upload).
 - **Response**: JSON (`UploadShardResponse`)

 ```json
@@ -154,7 +157,7 @@ See [Shard format guide](../spec/shard.md#shard-upload).

 The value of `result` does not carry any meaning, if the upload shard API returns a `200 OK` status code, the upload was successful and the files listed are considered uploaded.

-- **Error Responses**: See [Error Cases](../spec/api.md#error-cases)
+- **Error Responses**: See [Error Cases](./api#error-cases)
   - `400 Bad Request`: Shard is incorrectly serialized or Shard contents failed verification.
     - Can mean that a referenced Xorb doesn't exist or the shard is too large
   - `401 Unauthorized`: Refresh the token to continue making requests, or provide a token in the `Authorization` header.
@@ -1,6 +1,6 @@
 # Authentication and Authorization

-To invoke any API's mentioned in this specification a client MUST first acquire a token (and the url) to authenticate against the server which serves these API's.
+To invoke any API's mentioned in this specification a client MUST first acquire a token (and the URL) to authenticate against the server which serves these API's.

 The Xet protocol server uses bearer authentication via a token generated by the Hugging Face Hub (<https://huggingface.co>).

@@ -16,14 +16,14 @@ https://huggingface.co/api/{repo_type}s/{repo_id}/xet-{token_type}-token/{revisi

 **Parameters:**

-All parameters are required to form the url.
+All parameters are required to form the URL.

 - `repo_type`: Type of repository - `model`, `dataset`, or `space`
 - `repo_id`: Repository identifier in format `namespace/repo-name`
 - `token_type`: Either `read` or `write`.
 - `revision`: Git revision (branch, tag, or commit hash; default to using `main` if no specific ref is required)

-To understand the distinction for between `token_type` values read onwards in this document to [Token Scope](../spec/auth.md#token-scope).
+To understand the distinction for between `token_type` values read onwards in this document to [Token Scope](./auth#token-scope).

 **Example URLs:**

@@ -101,6 +101,7 @@ Here's a basic implementation flow:
 4. **Token refresh (when needed):**
    Use the same API to generate a new token.

+> [!NOTE]
 > In `xet-core` we SHOULD add 30 seconds of buffer time before the provided `expiration` time to refresh the token.

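The refresh buffer mentioned in the note can be sketched as a small check. This is only an illustration under stated assumptions: the helper name `needs_refresh` is hypothetical, and `expiration` is assumed here to be a Unix timestamp in seconds, which this excerpt does not confirm.

```python
import time
from typing import Optional

# Buffer taken from the note above: refresh 30 seconds before expiry.
REFRESH_BUFFER_SECONDS = 30


def needs_refresh(expiration: float, now: Optional[float] = None) -> bool:
    """Return True once the current time is within the buffer of `expiration`.

    `expiration` is assumed to be a Unix timestamp in seconds (an assumption
    made for this sketch); `now` can be injected to make the check testable.
    """
    current = time.time() if now is None else now
    return current >= expiration - REFRESH_BUFFER_SECONDS
```

A client would call `needs_refresh(expiration)` before each request and request a new token from the same API when it returns `True`.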
 ## Token Scope
@@ -109,7 +110,7 @@ Xet tokens can have either a `read` or a `write` scope.
 `write` scope supersedes `read` scope and all `read` scope API's can be invoked when using a `write` scope token.
 The type of token issued is determined on the `token_type` URI path component when requesting the token from the Hugging Face Hub (see above).

-Revise API specification for what scope level is necessary to invoke each API (briefly, only `POST /shard` and `POST /xorb/*` API's require `write` scope).
+Check API specification for what scope level is necessary to invoke each API (briefly, only `POST /shard` and `POST /xorb/*` API's require `write` scope).

 The scope of the Xet tokens is limited to the repository and ref for which they were issued. To upload or download from different repositories or refs (different branches) clients MUST be issued different tokens.

@@ -9,7 +9,7 @@ File -> | chunk 0 | chunk 1 | chunk 2 | chunk 3 | chunk 4 | chunk 5 | chunk 6 |
 +---------+---------+---------+---------+---------+---------+---------+--------------
 ```

-## Step-by-step algorithm (Gearhash-based CDC)
+## Step-by-step Algorithm (Gearhash-based CDC)

 ### Constant Parameters

@@ -24,7 +24,7 @@ File -> | chunk 0 | chunk 1 | chunk 2 | chunk 3 | chunk 4 | chunk 5 | chunk 6 |
 - h: 64-bit hash, initialized to 0
 - start_offset: start offset of the current chunk, initialized to 0

-### Per-byte update rule (Gearhash)
+### Per-byte Update Rule (Gearhash)

 For each input byte `b`, update the hash with 64-bit wrapping arithmetic:

@@ -32,7 +32,7 @@ For each input byte `b`, update the hash with 64-bit wrapping arithmetic:
 h = (h << 1) + TABLE[b]
 ```

-### Boundary test and size constraints
+### Boundary Test and Size Constraints

 At each position after updating `h`, let `size = current_offset - start_offset + 1`.

@@ -81,7 +81,7 @@ if start_offset < len(data):

 ### Boundary probability and mask selection

-Given that MASK has 16 one-bits, for a random 64-bit hash h, the chance that all those 16 bits are zero is 1 / 2^16. On average, that means you’ll see a match about once every 64 KiB.
+Given that MASK has 16 one-bits, for a random 64-bit hash `h`, the chance that all those 16 bits are zero is 1 / 2^16. On average, that means you’ll see a match about once every 64 KiB.

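The arithmetic behind the paragraph above can be written out explicitly. This assumes, as the text does, that `h` behaves like a uniformly random 64-bit value at every position, so that each position is an independent trial. The gap between matches is then geometrically distributed:

```latex
\Pr\bigl[(h \mathbin{\&} \mathrm{MASK}) = 0\bigr] = \left(\tfrac{1}{2}\right)^{16} = 2^{-16},
\qquad
\mathbb{E}[\text{bytes between matches}] = \frac{1}{2^{-16}} = 2^{16} = 65536\ \text{bytes} = 64\ \text{KiB}.
```
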
 ### Properties

@@ -89,31 +89,31 @@ Given that MASK has 16 one-bits, for a random 64-bit hash h, the chance that all
 - Locality: small edits only affect nearby boundaries
 - Linear time and constant memory: single 64-bit state and counters

-### Intuition and rationale
+### Intuition and Rationale

 - The table `TABLE[256]` injects pseudo-randomness per byte value so that the evolving hash `h` behaves like a random 64-bit value with respect to the mask test. This makes boundaries content-defined yet statistically evenly spaced.
 - The left shift `(h << 1)` amplifies recent bytes, helping small changes affect nearby positions without globally shifting all boundaries.
 - Resetting `h` to 0 at each boundary prevents long-range carryover and keeps boundary decisions for each chunk statistically independent.

-### Implementation notes
+### Implementation Notes

 - Only reset `h` when you emit a boundary. This ensures chunking is stable even when streaming input in pieces.
 - Apply the mask test only once `size >= MIN_CHUNK_SIZE`. This reduces the frequency of tiny chunks and stabilizes average chunk sizes.
 - MUST force a boundary at `MAX_CHUNK_SIZE` even if `(h & MASK) != 0`. This guarantees bounded chunk sizes and prevents pathological long chunks when matches are rare.
 - Use 64-bit wrapping arithmetic for `(h << 1) + TABLE[b]`. This is the behavior in the reference implementation [rust-gearhash].

-### Edge cases
+### Edge Cases

 - Tiny files: if `len(data) < MIN_CHUNK_SIZE`, the entire `data` is emitted as a single chunk.
 - Long runs without a match: if no position matches `(h & MASK) == 0` before `MAX_CHUNK_SIZE`, a boundary is forced at `MAX_CHUNK_SIZE` to cap chunk size.

-### Portability and determinism
+### Portability and Determinism

 - With a fixed `T[256]` table and mask, the algorithm is deterministic across platforms: same input → same chunk boundaries.
 - Endianness does not affect behavior because updates are byte-wise and use scalar 64-bit operations.
 - SIMD-accelerated implementations (when available) are optimizations only; they produce the same boundaries as the scalar path [rust-gearhash].

-## Minimum-size skip-ahead (cut-point skipping optimization)
+## Minimum-size Skip-ahead (Cut-point Skipping Optimization)

 Computing and testing the rolling hash at every byte is expensive for large data, and early tests inside the first few bytes of a chunk are disallowed by the `MIN_CHUNK_SIZE` constraint anyway.
 We are able to intentionally skip testing some data with cut-point skipping to accelerate scanning without affecting correctness.
@@ -141,7 +141,7 @@ The [xet-team/xet-spec-reference-files](https://huggingface.co/datasets/xet-team

 In the same repository in file [Electric_Vehicle_Population_Data_20250917.csv.chunks](https://huggingface.co/datasets/xet-team/xet-spec-reference-files/blob/main/Electric_Vehicle_Population_Data_20250917.csv.chunks)
 the chunks produced out of [Electric_Vehicle_Population_Data_20250917.csv](https://huggingface.co/datasets/xet-team/xet-spec-reference-files/blob/main/Electric_Vehicle_Population_Data_20250917.csv) are listed.
-Each line in the file is a 64 hexadecimal hash of the chunk, followed by a space and then the number of bytes in that chunk.
+Each line in the file is a 64 hexadecimal character string version of the hash of the chunk, followed by a space and then the number of bytes in that chunk.

 Implementors should use the chunk lengths to determine that they are producing the right chunk boundaries for this file with their chunking implementation.

@@ -23,11 +23,11 @@ A **chunk** is a variable-sized content block derived from files using Content-D
 - **Size range**: 8KB to 128KB (minimum and maximum constraints)
 - **Identification**: Each chunk is uniquely identified by its cryptographic hash (MerkleHash)

-[Detailed chunking description](../spec/chunking.md)
+[Detailed chunking description](./chunking)

-### Xorbs (Extended Object Blocks)
+### Xorbs

-**Xorbs** are containers that aggregate multiple chunks for efficient storage and transfer:
+**Xorbs** are objects that aggregate multiple chunks for efficient storage and transfer:

 - **Maximum size**: 64MB
 - **Maximum chunks**: 8,192 chunks per xorb
@@ -96,7 +96,7 @@ Xet employs a three-tiered deduplication strategy to maximize efficiency while m

 #### Level 3: Global Deduplication API

-**Scope**: Entire Xet ecosystem
+**Scope**: Entire Xet system
 **Mechanism**: Global deduplication service with HMAC protection
 **Purpose**: Discover deduplication opportunities across all users and repositories

@@ -143,11 +143,11 @@ They MAY know this chunk hash because they own this data, the match has made the
 ### Chunk Hash Computation

 Each chunk has its content hashed using a cryptographic hash function (Blake3-based MerkleHash) to create a unique identifier for content addressing.
-[See section about hashing](../spec/hashing.md#chunk-hashes).
+[See section about hashing](./hashing#chunk-hashes).

 ### Xorb Formation

-When new chunks need to be stored, they are aggregated into xorbs based on size and count limits. If adding a new chunk would exceed the maximum xorb size or chunk count, the current xorb is finalized and uploaded. [See section about xorb formation](../xorb.md)
+When new chunks need to be stored, they are aggregated into xorbs based on size and count limits. If adding a new chunk would exceed the maximum xorb size or chunk count, the current xorb is finalized and uploaded. [See section about xorb formation](./xorb)

 ### File Reconstruction Information

@@ -164,7 +164,7 @@ This information allows the system to reconstruct files by:
 2. Extracting the specific chunk ranges from each xorb
 3. Concatenating chunks in the correct order

-[See section about file reconstruction](../file_reconstruction.md).
+[See section about file reconstruction](./file-reconstruction).

 ## Fragmentation Prevention

@@ -1,6 +1,6 @@
 # Download Protocol

-This document describes the complete process of downloading a single file from the Xet protocol using the CAS (Content Addressable Storage) reconstruction API.
+This document describes the complete process of downloading a single file from the Xet protocol using the Content Addressable Storage (CAS) reconstruction API.

 ## Overview

@@ -13,10 +13,11 @@ File download in the Xet protocol is a two-stage process:

 ### Single File Reconstruction

-To download a file given a file hash, first call the reconstruction API to get the file reconstruction. Follow the steps in [api.md](../spec/api.md#1-get-file-reconstruction).
+To download a file given a file hash, first call the reconstruction API to get the file reconstruction. Follow the steps in [api](./api#1-get-file-reconstruction).

-Note that you will need at least a `read` scope auth token, [auth reference](../spec/auth.md).
+Note that you will need at least a `read` scope auth token, [auth reference](./auth).

+> [!TIP]
 > For large files it is RECOMMENDED to request the reconstruction in batches i.e. the first 10GB, download all the data, then the next 10GB and so on. Clients can use the `Range` header to specify a range of file data.

 ## Stage 2: Understanding the Reconstruction Response
@@ -25,8 +26,6 @@ The reconstruction API returns a `QueryReconstructionResponse` object with three

 ### QueryReconstructionResponse Structure

-Scroll
-
 ```json
 {
   "offset_into_first_range": 0,
@@ -85,7 +84,7 @@ Scroll
   - Maps xorb hashes to required information to download some of their chunks.
   - The mapping is to an array of 1 or more `CASReconstructionFetchInfo`
   - Each `CASReconstructionFetchInfo` contains:
-    - `url`: HTTP URL for downloading the xorb data, presigned url containing authorization information
+    - `url`: HTTP URL for downloading the xorb data, presigned URL containing authorization information
     - `url_range` (bytes_start, bytes_end): Byte range `{ start: number, end: number }` for the Range header; end-inclusive `[start, end]`
       - The `Range` header MUST be set as `Range: bytes=<start>-<end>` when downloading this chunk range
     - `range` (index_start, index_end): Chunk index range `{ start: number, end: number }` that this URL provides; end-exclusive `[start, end)`
@@ -116,7 +115,7 @@ Scroll

 ```python
 file_id = "0123...abcdef"
-api_endpoint, token = get_token() # follow auth.md instructions
+api_endpoint, token = get_token() # follow auth instructions
 url = api_endpoint + "/reconstructions/" + file_id
 reconstruction = get(url, headers={"Authorization": "Bearer: " + token})

@@ -172,7 +171,7 @@ The downloaded data is in xorb format and MUST be deserialized:
 3. **Extract byte indices**: Track byte boundaries between chunks for range extraction
 4. **Validate length**: Decompressed length MUST match `unpacked_length` from the term

-**Note**: The specific deserialization process depends on the [Xorb format](../xorb.md).
+**Note**: The deserialization process depends on the [Xorb format](./xorb).

 ```python
 for term in terms:
@@ -234,7 +233,7 @@ For partial file downloads, the reconstruction API supports range queries:

 When downloading individual term data:

-A client MUST include the `Range` header formed with the values from the url_range field to specify the exact range of data of a xorb that they are accessing. Not specifying this header will cause result in an authorization failure.
+A client MUST include the `Range` header formed with the values from the `url_range` field to specify the exact range of data of a xorb that they are accessing. Not specifying this header will cause result in an authorization failure.

 Xet global deduplication requires that access to xorbs is only granted to authorized ranges.
 Not specifying this header will result in an authorization failure.
@@ -251,8 +250,8 @@ Consider downloading such content only once and reusing the data.
 ### Caching recommendations

 1. It can be ineffective to cache the reconstruction object
-   1. The fetch_info section provides short-expiration pre-signed url's hence Clients SHOULD NOT cache the urls beyond their short expiration
-   2. To get those url's to access the data you will need to call the reconstruction API again anyway
+   1. The fetch_info section provides short-expiration pre-signed URL's hence Clients SHOULD NOT cache the urls beyond their short expiration
+   2. To get those URL's to access the data you will need to call the reconstruction API again anyway
 2. Cache chunks by range not just individually
    1. If you need a chunk from a xorb it is very likely that you will need another, so cache them close
 3. Caching helps when downloading similar contents. May not be worth to cache data if you are always downloading different things
@@ -327,8 +326,8 @@ This example shows reconstruction of a file that requires:
 - Chunks `[0, 2)` from the second xorb (~144KB of unpacked data)
 - Chunks `[3, 43)` from the same xorb from the first term (~3MB of unpacked data)

-The `fetch_info` provides the HTTP URLs and byte ranges needed to download the required chunk data from each xorb. The ranges provided within fetch_info and term sections are always end-exclusive i.e. `{ "start": 0, "end": 3 }` is a range of 3 chunks at indices 0, 1 and 2.
-The ranges provided under a fetch_info items' url_range key are to be used to form the `Range` header when downloading the chunk range.
+The `fetch_info` provides the HTTP URLs and byte ranges needed to download the required chunk data from each xorb. The ranges provided within `fetch_info` and term sections are always end-exclusive i.e. `{ "start": 0, "end": 3 }` is a range of 3 chunks at indices 0, 1 and 2.
+The ranges provided under a `fetch_info` items' `url_range` key are to be used to form the `Range` header when downloading the chunk range.
 A `"url_range"` value of `{ "start": X, "end": Y }` creates a `Range` header value of `bytes=X-Y`.

 When downloading and deserializing the chunks from xorb `a1b2c3d4e5f6789012345678901234567890abcdef1234567890abcdef123456` we will have the chunks at indices `[1, 43)`.
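The two range conventions in this hunk (end-exclusive chunk index ranges, end-inclusive `url_range` byte ranges) can be sketched as small helpers. This is a minimal illustration, not code from the specification: the function names are hypothetical, and the dictionaries only mirror the `{ "start": ..., "end": ... }` shape quoted in the text.

```python
from typing import Dict, List, Optional


def range_header(url_range: Dict[str, int]) -> str:
    """Form the `Range` header value from an end-inclusive `url_range`."""
    return f"bytes={url_range['start']}-{url_range['end']}"


def chunk_indices(chunk_range: Dict[str, int]) -> List[int]:
    """List the chunk indices covered by an end-exclusive chunk range."""
    return list(range(chunk_range["start"], chunk_range["end"]))


def find_fetch_entry(entries: List[dict], term_range: Dict[str, int]) -> Optional[dict]:
    """Pick the fetch_info entry whose chunk range contains `term_range`.

    `entries` is the list of fetch_info items for one xorb hash; each item
    is assumed to carry a `range` key with the same end-exclusive shape.
    """
    for entry in entries:
        covered = entry["range"]
        if covered["start"] <= term_range["start"] and term_range["end"] <= covered["end"]:
            return entry
    return None
```

For example, `chunk_indices({"start": 0, "end": 3})` yields the indices 0, 1 and 2, matching the end-exclusive reading stated above.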
@@ -340,23 +339,23 @@ Note that in this example the chunk at index 3 is used twice! This is the benefi
 ```mermaid
 sequenceDiagram
     autonumber
-    actor Client as "Client"
-    participant CAS as "CAS API"
-    participant Transfer as "Transfer Service (Xet storage)"
+    actor client as Client
+    participant S as CAS API
+    participant Transfer as Transfer Service (Xet storage)

-    Client->>CAS: GET /reconstructions/{file_id}<br/>Authorization: Bearer <token><br/>Range: bytes=start-end (optional)
-    CAS-->>Client: 200 OK<br/>QueryReconstructionResponse {offset_into_first_range, terms[], fetch_info{}}
+    client->>S: GET /reconstructions/{file_id}<br/>Authorization: Bearer <token><br/>Range: bytes=start-end (optional)
+    S-->>client: 200 OK<br/>QueryReconstructionResponse {offset_into_first_range, terms[], fetch_info{}}

     loop For each term in terms (ordered)
-        Client->>Client: Find fetch_info by xorb hash, entry whose range contains term.range
-        Client->>Transfer: GET {url}<br/>Range: bytes=url_range.start-url_range.end
-        Transfer-->>Client: 206 Partial Content<br/>xorb byte range
-        Client->>Client: Deserialize xorb → chunks for fetch_info.range
-        Client->>Client: Trim to term.range, apply offset for first term
-        Client->>Client: Append chunks to output
+        client->>client: Find fetch_info by xorb hash, entry whose range contains term.range
+        client->>Transfer: GET {url}<br/>Range: bytes=url_range.start-url_range.end
+        Transfer-->>client: 206 Partial Content<br/>xorb byte range
+        client->>client: Deserialize xorb → chunks for fetch_info.range
+        client->>client: Trim to term.range, apply offset for first term
+        client->>client: Append chunks to output
     end

     alt Range requested
-        Client->>Client: Truncate output to requested length
+        client->>client: Truncate output to requested length
     end
 ```
@@ -1,6 +1,6 @@
-# Getting a File ID from the Hugging Face Hub
+# Getting a Xet File ID from the Hugging Face Hub

-This section explains the Xet file ID used in the reconstruction API to download a file from the HuggingFace hub using the xet protocol.
+This section explains the Xet file ID used in the reconstruction API to download a file from the Hugging Face Hub using the xet protocol.

 Given a particular namespace, repository and branch or commit hash and file path from the root of the repository, build the "resolve" URL for the file following this format:

@@ -11,7 +11,7 @@ repository: the repository name e.g. Qwen-Image-Edit
 branch: any git branch or commit hash e.g. main
 filepath: filepath in repository e.g. transformer/diffusion_pytorch_model-00001-of-00009.safetensors

-resolve url:
+resolve URL:

 https://huggingface.co/{namespace}/{repository}/resolve/{branch}/{filepath}

@@ -21,12 +21,13 @@ Example:
|
||||
https://huggingface.co/Qwen/Qwen-Image-Edit/resolve/main/transformer/diffusion_pytorch_model-00001-of-00009.safetensors
|
||||
```
|
||||
|
||||
Then make a `GET` request to the resolve url using your standard Hugging Face Hub credentials/token.
|
||||
Then make a `GET` request to the resolve URL using your standard Hugging Face Hub credentials/token.
|
||||
|
||||
If the file is stored on the xet system then a successful response will have an `X-Xet-Hash` header.
|
||||
|
||||
The string value of this header is the Xet file ID and SHOULD be used in the path of the reconstruction API URL.
|
||||
This is the string representation of the hash and can be used directly in the file reconstruction API on download.
|
||||
|
||||
> [!NOTE]
|
||||
> The resolve URL will return a 302 redirect HTTP status code. Following the redirect will download the content via the old LFS-compatible route rather than through the Xet protocol.
|
||||
In order to use the Xet protocol you MUST NOT follow this redirect.
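
The lookup above can be sketched in Python with only the standard library. This is a minimal sketch, not the reference client: `build_resolve_path` and `fetch_xet_file_id` are hypothetical helper names, and sending the token as an `Authorization: Bearer` header is an assumption about how the Hub credentials are supplied. `http.client` is used because it never follows redirects on its own, which matches the MUST NOT above.

```python
import http.client
from urllib.parse import quote

HUB_HOST = "huggingface.co"


def build_resolve_path(namespace, repository, branch, filepath):
    """Build the path part of the resolve URL (hypothetical helper)."""
    return "/{}/{}/resolve/{}/{}".format(
        quote(namespace, safe=""),
        quote(repository, safe=""),
        quote(branch, safe=""),
        quote(filepath, safe="/"),  # keep directory separators in the file path
    )


def fetch_xet_file_id(namespace, repository, branch, filepath, token=None):
    """Return the X-Xet-Hash header value, or None if the header is absent.

    The redirect is deliberately not followed: only the headers of the
    first response are inspected.
    """
    headers = {}
    if token:
        # Assumption: the Hub token is sent as a bearer token.
        headers["Authorization"] = "Bearer " + token
    conn = http.client.HTTPSConnection(HUB_HOST, timeout=30)
    try:
        path = build_resolve_path(namespace, repository, branch, filepath)
        conn.request("GET", path, headers=headers)
        response = conn.getresponse()
        return response.getheader("X-Xet-Hash")
    finally:
        conn.close()
```

A `None` result would mean the file is not stored on the xet system, or that the request did not succeed.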
|
||||
@@ -12,8 +12,8 @@ This document describes how a file can be represented and reconstructed from a c
|
||||
|
||||
## Core Idea
|
||||
|
||||
After following the [chunking procedure](../spec/chunking.md) a file can be represented as an ordering of chunks.
|
||||
Those chunks are then packed into [xorbs](../spec/xorb.md), and given the set of xorbs we convert the file representation to a "reconstruction" made up of "terms".
|
||||
After following the [chunking procedure](./chunking) a file can be represented as an ordering of chunks.
|
||||
Those chunks are then packed into [xorbs](./xorb), and given the set of xorbs we convert the file representation to a "reconstruction" made up of "terms".
|
||||
When forming xorbs the ordering and grouping of chunks prioritizes contiguous runs of chunks that appear in a file such that when referencing a xorb we maximize the term range length.
|
||||
|
||||
Any file’s raw bytes can be described as the concatenation of data produced by a sequence of terms.
|
||||
@@ -22,7 +22,7 @@ The file is reconstructed by retrieving those chunk ranges, decoding them to raw
|
||||
|
||||
### Diagram
|
||||
|
||||
> A file with 4 terms. Each term is a pointer to a chunk range within a xorb.
|
||||
A file with 4 terms. Each term is a pointer to a chunk range within a xorb.
|
||||
|
||||
```txt
|
||||
File Reconstruction
|
||||
@@ -105,7 +105,7 @@ A file’s reconstruction can be serialized into a shard as part of its file inf
|
||||
Conceptually, this section encodes the complete set of terms that describe the file.
|
||||
When stored this way, the representation is canonical and sufficient to reconstruct the full file solely from its referenced xorb ranges.
|
||||
|
||||
Reference: [shard format file info](../spec/shard.md#2-file-info-section)
|
||||
Reference: [shard format file info](./shard#2-file-info-section)
|
||||
|
||||
### Deserialization from the reconstruction API (JSON)
|
||||
|
||||
@@ -114,7 +114,7 @@ This response is represented by a structure named “QueryReconstructionResponse
|
||||
The `terms` list contains, for each term, the xorb identifier and the contiguous chunk index range to retrieve.
|
||||
Other fields may provide auxiliary details (such as offsets or fetch hints) that optimize retrieval without altering the meaning of the `terms` sequence.
|
||||
|
||||
Reference: [api.md](../spec/api.md), [download protocol](../spec/download_protocol.md)
|
||||
Reference: [api](./api), [download protocol](./download-protocol)
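
To make the term model concrete, here is a minimal in-memory sketch in Python. It assumes the xorbs have already been fetched and decoded into lists of raw chunk bytes; the `Term` field names and the `reconstruct` helper are illustrative and are not the JSON field names of the reconstruction API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    xorb_hash: str    # identifier of the xorb holding the chunks
    chunk_start: int  # first chunk index, inclusive
    chunk_end: int    # last chunk index, exclusive


def reconstruct(terms, xorbs):
    """Concatenate the chunk ranges named by each term, in term order.

    `xorbs` maps a xorb hash to the list of its decoded chunks.
    """
    output = bytearray()
    for term in terms:
        chunks = xorbs[term.xorb_hash]
        if not (0 <= term.chunk_start < term.chunk_end <= len(chunks)):
            raise ValueError("term range outside of xorb: %r" % (term,))
        for chunk in chunks[term.chunk_start:term.chunk_end]:
            output += chunk
    return bytes(output)
```

The same xorb may appear in several terms, which is how deduplicated data is shared between files or between regions of one file.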
|
||||
|
||||
## Fragmentation and Why Longer Ranges Matter
|
||||
|
||||
@@ -137,7 +137,7 @@ Reference files are provided in Hugging Face Dataset repository [xet-team/xet-sp
|
||||
In this repository there are a number of different samples implementors can use to verify hash computations.
|
||||
|
||||
> Note that all hashes are represented as strings.
|
||||
To get the raw value of these hashes you must invert the endianness of each byte octet in the hash string, reversing the procedure described in [api.md](../spec/api.md#converting-hashes-to-strings).
|
||||
To get the raw value of these hashes you must invert the endianness of each byte octet in the hash string, reversing the procedure described in [api](./api#converting-hashes-to-strings).
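
A minimal Python sketch of the conversion in both directions follows. It assumes a 32-byte hash and that the byte order is reversed within each 8-byte group before hex encoding, which is how this sketch reads the procedure in the api document; the function names are illustrative.

```python
def _swap_groups(data):
    # Reverse the byte order inside each 8-byte group; keep the group order.
    return b"".join(data[i:i + 8][::-1] for i in range(0, len(data), 8))


def hash_to_string(raw):
    """Raw 32-byte hash -> string form (assumed procedure)."""
    if len(raw) != 32:
        raise ValueError("expected a 32-byte hash")
    return _swap_groups(raw).hex()


def string_to_hash(text):
    """String form -> raw 32-byte hash, reversing hash_to_string."""
    swapped = bytes.fromhex(text)
    if len(swapped) != 32:
        raise ValueError("expected 64 hexadecimal characters")
    return _swap_groups(swapped)
```

Because reversing each group twice restores the original bytes, the two functions are inverses of each other.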
|
||||
|
||||
### Chunk Hashes Sample
|
||||
|
||||
@@ -1,5 +1,6 @@
|
||||
# Xet Protocol Specification
|
||||
|
||||
> [!NOTE]
|
||||
> Version 0.1.0 (1.0.0 on release)
|
||||
> The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119](https://www.ietf.org/rfc/rfc2119.txt) [RFC8174](https://www.ietf.org/rfc/rfc8174.txt)
|
||||
when, and only when, they appear in all capitals, as shown here.
|
||||
@@ -8,26 +9,28 @@ This specification defines the end-to-end Xet protocol for content-addressed dat
|
||||
Its goal is interoperability and determinism: independent implementations MUST produce the same hashes, objects, and API behavior so data written by one client can be read by another with integrity and performance.
|
||||
Implementors can create their own clients, SDKs, and tools that speak the Xet protocol and interface with the CAS service, as long as they adhere to the requirements defined here.
|
||||
|
||||
## Building a client library for xet storage
|
||||
## Building a Client Library for Xet Storage
|
||||
|
||||
- [Upload Protocol](../spec/upload_protocol.md): End-to-end top level description of the upload flow.
|
||||
- [Download Protocol](../spec/download_protocol.md): Instructions for the download procedure.
|
||||
- [CAS API](../spec/api.md): HTTP endpoints for reconstruction, global chunk dedupe, xorb upload, and shard upload, including error semantics.
|
||||
- [Authentication and Authorization](../spec/auth.md): How to obtain Xet tokens from the Hugging Face Hub, token scopes, and security considerations.
|
||||
- [Hugging Face Hub Files Conversion to Xet File ID's](../spec/file_id.md): How to obtain a Xet file id from the Hugging Face Hub for a particular file in a model or dataset repository.
|
||||
- [Upload Protocol](./upload-protocol): End-to-end top level description of the upload flow.
|
||||
- [Download Protocol](./download-protocol): Instructions for the download procedure.
|
||||
- [CAS API](./api): HTTP endpoints for reconstruction, global chunk dedupe, xorb upload, and shard upload, including error semantics.
|
||||
- [Authentication and Authorization](./auth): How to obtain Xet tokens from the Hugging Face Hub, token scopes, and security considerations.
|
||||
- [Converting Hugging Face Hub Files to Xet File ID's](./file-id): How to obtain a Xet file id from the Hugging Face Hub for a particular file in a model or dataset repository.
|
||||
|
||||
## Overall Xet architecture
|
||||
## Overall Xet Architecture
|
||||
|
||||
- [Content-Defined Chunking](../spec/chunking.md): Gearhash-based CDC with parameters, boundary rules, and performance optimizations.
|
||||
- [Hashing Methods](../spec/hashing.md): Descriptions and definitions of the different hashing functions used for chunks, xorbs and term verification entries.
|
||||
- [File Reconstruction](../spec/file_reconstruction.md): Defining "term"-based representation of files using xorb hash + chunk ranges.
|
||||
- [Xorb Format](../spec/xorb.md): Explains grouping chunks into xorbs, 64 MiB limits, binary layout, and compression schemes.
|
||||
- [Shard Format](../spec/shard.md): Binary shard structure (header, file info, CAS info, footer), offsets, HMAC key usage, and bookends.
|
||||
- [Deduplication](../spec/deduplication.md): Explanation of chunk level dedupe including global system-wide chunk level dedupe.
|
||||
- [Content-Defined Chunking](./chunking): Gearhash-based CDC with parameters, boundary rules, and performance optimizations.
|
||||
- [Hashing Methods](./hashing): Descriptions and definitions of the different hashing functions used for chunks, xorbs and term verification entries.
|
||||
- [File Reconstruction](./file-reconstruction): Defining "term"-based representation of files using xorb hash + chunk ranges.
|
||||
- [Xorb Format](./xorb): Explains grouping chunks into xorbs, 64 MiB limits, binary layout, and compression schemes.
|
||||
- [Shard Format](./shard): Binary shard structure (header, file info, CAS info, footer), offsets, HMAC key usage, and bookends.
|
||||
- [Deduplication](./deduplication): Explanation of chunk level dedupe including global system-wide chunk level dedupe.
|
||||
|
||||
## Reference implementation
|
||||
## Reference Implementation
|
||||
|
||||
The primary reference implementation of the protocol written in rust 🦀 lives in the [xet-core](https://github.com/huggingface/xet-core) repository under multiple crates:
|
||||
### xet-core: hf-xet + git-xet
|
||||
|
||||
The primary reference implementation of the protocol written in Rust 🦀 lives in the [xet-core](https://github.com/huggingface/xet-core) repository under multiple crates:
|
||||
|
||||
- [cas_types](https://github.com/huggingface/xet-core/tree/main/cas_types) - Common re-usable types for interacting with CAS APIs
|
||||
- [cas_client](https://github.com/huggingface/xet-core/tree/main/cas_client) - Client interface that calls CAS APIs, including comprehensive implementation of download protocol.
|
||||
@@ -37,8 +40,9 @@ The primary reference implementation of the protocol written in rust 🦀 lives
|
||||
- [merklehash](https://github.com/huggingface/xet-core/tree/main/merklehash) - Exports a `MerkleHash` type extensively used to represent hashes. Exports functions to compute the different hashes used to track chunks, xorbs and files.
|
||||
- [data](https://github.com/huggingface/xet-core/tree/main/data) - Comprehensive package exposing interfaces to upload and download contents
|
||||
- [hf_xet](https://github.com/huggingface/xet-core/tree/main/hf_xet) - Python bindings to use the Xet protocol for uploads and downloads with the Hugging Face Hub.
|
||||
- [git-xet](https://github.com/huggingface/xet-core/tree/main/git-xet) - git lfs custom transfer agent that uploads files using the xet protocol to the Hugging Face Hub.
|
||||
|
||||
### Huggingface.js
|
||||
### huggingface.js
|
||||
|
||||
There is also a second reference implementation in Huggingface.js that can be used when downloading or uploading files with the `@huggingface/hub` library.
|
||||
|
||||
@@ -8,7 +8,7 @@ The Shard format is the vehicle for uploading the file reconstruction upload and
|
||||
|
||||
The MDB (Merkle Database) shard file format is a binary format used to store file metadata and content-addressable storage (CAS) information for efficient deduplication and retrieval.
|
||||
This document describes the binary layout and deserialization process for the shard format.
|
||||
Implementors of the xet protocol MUST use the shard format when implementing the [upload protocol](../spec/upload_protocol.md).
|
||||
Implementors of the xet protocol MUST use the shard format when implementing the [upload protocol](./upload-protocol).
|
||||
The shard format is used on the shard upload (record files) and global deduplication APIs.
|
||||
|
||||
## Use As API Request and Response Bodies
|
||||
@@ -132,7 +132,8 @@ struct MDBShardFileHeader {
|
||||
4. Verify version equals 2
|
||||
5. Read 8 bytes for footer_size (u64)
|
||||
|
||||
> when serializing, footer_size MUST be the number of bytes that make up the footer, or 0 if the footer is omitted.
|
||||
> [!NOTE]
|
||||
> When serializing, footer_size MUST be the number of bytes that make up the footer, or 0 if the footer is omitted.
|
||||
|
||||
## 2. File Info Section
|
||||
|
||||
@@ -141,7 +142,7 @@ struct MDBShardFileHeader {
|
||||
This section contains a sequence of 0 or more file information (File Info) blocks, each consisting of at least a header and at least 1 data sequence entry, plus OPTIONAL verification entries and an OPTIONAL metadata extension section.
|
||||
The file info section ends when reaching the bookend entry.
|
||||
|
||||
Each File Info block within the overall section is a serialization of a [file reconstruction](../spec/file_reconstruction.md) into a binary format.
|
||||
Each File Info block within the overall section is a serialization of a [file reconstruction](./file-reconstruction) into a binary format.
|
||||
For each file, there is a `FileDataSequenceHeader` and, for each term, a `FileDataSequenceEntry` with an OPTIONAL matching `FileVerificationEntry`, followed at the end by an OPTIONAL `FileMetadataExt`.
|
||||
|
||||
A shard File Info section can contain more than 1 File Info block in series; after all the content for 1 file description has been read, the next one immediately begins.
|
||||
@@ -229,7 +230,7 @@ Given the `file_data_sequence_header.file_flags & MASK` (bitwise AND) operations
|
||||
|
||||
### FileDataSequenceEntry
|
||||
|
||||
Each `FileDataSequenceEntry` is 1 term and is essentially the binary serialization of a [file reconstruction term](../spec/file_reconstruction.md#term-format).
|
||||
Each `FileDataSequenceEntry` is 1 term and is essentially the binary serialization of a [file reconstruction term](./file-reconstruction#term-format).
|
||||
|
||||
```rust
|
||||
struct FileDataSequenceEntry {
|
||||
@@ -241,6 +242,7 @@ struct FileDataSequenceEntry {
|
||||
}
|
||||
```
|
||||
|
||||
> [!NOTE]
|
||||
> Chunk ranges in a `FileDataSequenceEntry` are start-inclusive and end-exclusive, i.e. `[chunk_index_start, chunk_index_end)`.
|
||||
|
||||
**Memory Layout**:
|
||||
@@ -258,7 +260,7 @@ struct FileDataSequenceEntry {
|
||||
|
||||
Verification Entries MUST be set for shard uploads.
|
||||
|
||||
To generate verification hashes for shard upload read the section about [Verification Hashes](../hashing.md#Term%20Verification%20Hashes).
|
||||
To generate verification hashes for shard upload read the section about [Verification Hashes](./hashing#Term-Verification-Hashes).
|
||||
|
||||
```rust
|
||||
struct FileVerificationEntry {
|
||||
@@ -427,6 +429,7 @@ Since the cas info section immediately follows the file info section bookend, a
|
||||
|
||||
## 4. Footer (MDBShardFileFooter)
|
||||
|
||||
> [!NOTE]
|
||||
> Implementors MUST NOT include the footer when serializing the shard as the body for the shard upload API.
|
||||
|
||||
**Location**: End of file minus footer_size
|
||||
@@ -448,6 +451,7 @@ struct MDBShardFileFooter {
|
||||
|
||||
**Memory Layout**:
|
||||
|
||||
> [!NOTE]
|
||||
> Fields are not exactly to scale
|
||||
|
||||
```txt
|
||||
@@ -1,4 +1,4 @@
|
||||
# Upload protocol
|
||||
# Upload Protocol
|
||||
|
||||
This document describes how files are uploaded in the Xet protocol to the Content Addressable Storage (CAS) service.
|
||||
The flow converts input files into chunks, applies deduplication, groups chunks into xorbs, uploads xorbs, then forms and uploads shards that reference those xorbs.
|
||||
@@ -11,7 +11,7 @@ Content addressing uses hashes as stable keys for deduplication and integrity ve
|
||||
|
||||
A chunk is a slice of data from a real file.
|
||||
|
||||
A chunk has an associated hash computed through the [chunk hashing process](../spec/hashing.md#chunk-hashes) and its data is determined by finding chunk boundaries following the chunking algorithm defined in [chunking.md](../spec/chunking.md).
|
||||
A chunk has an associated hash computed through the [chunk hashing process](./hashing#chunk-hashes) and its data is determined by finding chunk boundaries following the chunking algorithm defined in [chunking](./chunking).
|
||||
|
||||
A chunk is ~64KiB of data with a maximum of 128KiB and minimum of 8KiB.
|
||||
However, the minimum chunk size limit is not enforced for the last chunk of a file or if the file is smaller than 8KiB.
|
||||
@@ -20,13 +20,13 @@ However, the minimum chunk size limit is not enforced for the last chunk of a fi
|
||||
|
||||
A Xorb is composed of a sequence of chunks.
|
||||
|
||||
Chunks in a xorb are not simply concatenated but instead compressed and appended after a header as described in [xorb.md](../spec/xorb.md#xorb-format).
|
||||
Chunks in a xorb are not simply concatenated but instead compressed and appended after a header as described in [xorb](./xorb#xorb-format).
|
||||
Chunks are collected in a xorb for more efficient upload and downloads of "ranges" of chunks.
|
||||
Each chunk has an associated index (beginning at 0) and chunks may be addressed within a xorb through an end-exclusive chunk index range, e.g. `[0, 100)`.
|
||||
|
||||
Xorbs are created by grouping sequences of chunks from files and are referenced in file reconstructions to provide instructions to rebuild the file.
|
||||
|
||||
Xorbs have an associated hash computed according to the instructions for the [xorb hashing process](../spec/hashing.md#xorb-hashes).
|
||||
Xorbs have an associated hash computed according to the instructions for the [xorb hashing process](./hashing#xorb-hashes).
|
||||
|
||||
Xorbs are always less than or equal to 64MiB in length and on average contain 1024 chunks, but this number is variable.
|
||||
|
||||
@@ -48,82 +48,86 @@ Shards are used to communicate a "file upload" or registering the file in the CA
|
||||
|
||||
Shards are also used to communicate xorb metadata that can be used for deduplication using the Global Deduplication API.
|
||||
|
||||
The shard format is specified in [shard.md](../spec/shard.md).
|
||||
The shard format is specified in [shard](./shard).
|
||||
|
||||
> [!NOTE]
|
||||
> In xet-core the shard format is used to keep a local cache with fast lookup of known chunks for deduplication, other implementors of the xet protocol may choose to reuse the shard format for that purpose as well, however that is not a requirement of the protocol.
|
||||
|
||||
## Steps
|
||||
|
||||
### 1. Chunking
|
||||
|
||||
Using the chunking algorithm described in [chunking.md](../spec/chunking.md) first split the file into variable sized chunks.
|
||||
Each unique chunk MUST have a unique hash computed as described in the [Chunk Hashing section](../spec/hashing.md#chunk-hashes).
|
||||
Using the chunking algorithm described in [chunking](./chunking) first split the file into variable sized chunks.
|
||||
Each unique chunk MUST have a unique hash computed as described in the [Chunk Hashing section](./hashing#chunk-hashes).
|
||||
This chunk hash will be used to attempt to deduplicate any chunk against other known chunks.
|
||||
|
||||
### 2. Deduplication
|
||||
|
||||
Given a chunk hash, attempt to find if the chunk already exists in the Xet system.
|
||||
|
||||
To deduplicate a chunk is to find if the current chunk hash already exists, either in the current upload process, in a local cache of known chunks or using the [Global Deduplication API](../spec/api.md#2-query-chunk-deduplication-global-deduplication).
|
||||
To deduplicate a chunk is to find if the current chunk hash already exists, either in the current upload process, in a local cache of known chunks or using the [Global Deduplication API](./api#2-query-chunk-deduplication-global-deduplication).
|
||||
|
||||
When a chunk is deduplicated it SHOULD NOT be re-uploaded to the CAS (by being included in a xorb in the next step), but when rebuilding the file, the chunk needs to be included by referencing the xorb that includes it and the specific chunk index.
|
||||
|
||||
> [!NOTE]
|
||||
> Deduplication is considered an optimization and is an OPTIONAL component of the upload process; however, it can save upload bandwidth and storage.
|
||||
|
||||
For more detail visit the [deduplication document](../spec/deduplication.md)
|
||||
For more detail visit the [deduplication document](./deduplication)
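
As an illustration of the lookup described in this step, the sketch below plans deduplication for one file in Python. It is a simplification under stated assumptions: `known_locations` stands in for whatever a client learned from a local cache or from the Global Deduplication API, and `plan_dedupe` and its return shape are hypothetical.

```python
def plan_dedupe(chunk_hashes, known_locations):
    """Decide, per chunk position, whether the chunk needs to be uploaded.

    chunk_hashes: chunk hashes of one file, in file order.
    known_locations: chunk hash -> (xorb_hash, chunk_index) for chunks that
        already exist in the Xet system.
    Returns (to_upload, reused, duplicates).
    """
    to_upload = []   # positions whose chunk must be placed in a new xorb
    reused = {}      # position -> (xorb_hash, chunk_index) of an existing chunk
    duplicates = {}  # position -> earlier position carrying the same chunk
    first_seen = {}  # chunk hash -> first position in the current upload
    for position, chunk_hash in enumerate(chunk_hashes):
        if chunk_hash in known_locations:
            reused[position] = known_locations[chunk_hash]
        elif chunk_hash in first_seen:
            duplicates[position] = first_seen[chunk_hash]
        else:
            first_seen[chunk_hash] = position
            to_upload.append(position)
    return to_upload, reused, duplicates
```

Positions in `reused` and `duplicates` are not uploaded again, but they still need terms in the file reconstruction that point at the xorb and chunk index holding the data.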
|
||||
|
||||
### 3. Xorb formation and hashing
|
||||
### 3. Xorb Formation and Hashing
|
||||
|
||||
Contiguous runs of chunks are collected into xorbs (roughly 64 MiB total length per xorb), preserving order within each run. See formation rules: [xorb.md](../spec/xorb.md#collecting-chunks).
|
||||
The xorb's content-addressed key is computed using the chunks in the xorb. See: [hashing.md](../spec/hashing.md#xorb-hashes).
|
||||
Contiguous runs of chunks are collected into xorbs (roughly 64 MiB total length per xorb), preserving order within each run. See formation rules: [xorb](./xorb#collecting-chunks).
|
||||
The xorb's content-addressed key is computed using the chunks in the xorb. See: [hashing](./hashing#xorb-hashes).
|
||||
|
||||
Given the xorb hash, chunks in the xorb can be referred to in file reconstructions.
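
The grouping in this step can be sketched as a greedy pass over chunk lengths. This Python sketch is a simplification: it applies the 64 MiB bound to the sum of raw chunk lengths only, while the authoritative formation rules live in the xorb document. `group_chunks` and `MAX_XORB_BYTES` are illustrative names.

```python
MAX_XORB_BYTES = 64 * 1024 * 1024  # assumed bound on total chunk bytes per xorb


def group_chunks(chunk_lengths, max_bytes=MAX_XORB_BYTES):
    """Group consecutive chunk indices so that no group exceeds max_bytes.

    Order is preserved, so each group is a contiguous run of chunks.
    """
    groups = []
    current = []
    current_bytes = 0
    for index, length in enumerate(chunk_lengths):
        # Start a new group when adding this chunk would exceed the bound.
        if current and current_bytes + length > max_bytes:
            groups.append(current)
            current = []
            current_bytes = 0
        current.append(index)
        current_bytes += length
    if current:
        groups.append(current)
    return groups
```

Keeping runs contiguous is what lets a file reconstruction reference a long chunk range with a single term.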
|
||||
|
||||
### 4. Xorb serialization and upload
|
||||
### 4. Xorb Serialization and Upload
|
||||
|
||||
Each xorb is serialized into its binary representation as defined by the xorb format. See: [xorb.md](../spec/xorb.md).
|
||||
The client uploads each new xorb via the [Xorb upload API](../spec/api.md#3-upload-xorb).
|
||||
Each xorb is serialized into its binary representation as defined by the xorb format. See: [xorb](./xorb).
|
||||
The client uploads each new xorb via the [Xorb upload API](./api#3-upload-xorb).
|
||||
|
||||
The serialization and upload steps are separated from collecting chunks and hashing as these steps can be done independently while still referencing the xorb in creating file reconstructions.
|
||||
However a xorb MUST be uploaded before a file reconstruction that references it is uploaded in a shard.
|
||||
|
||||
### 5. Shard formation, collect required components
|
||||
### 5. Shard Formation, Collect Required Components
|
||||
|
||||
Map each file to a reconstruction using available xorbs. The file reconstruction MUST point to ranges of chunks within xorbs that together cover every chunk in the file.
|
||||
Terms for chunks that are deduplicated using results from the Global Dedupe API will use xorb hashes that already exist in CAS.
|
||||
|
||||
Then for each file:
|
||||
|
||||
- Compute the file hash using the [file hashing process](../spec/hashing.md#file-hashes).
|
||||
- For each xorb range (a "term") compute a [verification hash](../spec/hashing.md#term-verification-hashes) in order to upload it.
|
||||
- Compute the file hash using the [file hashing process](./hashing#file-hashes).
|
||||
- For each xorb range (a "term") compute a [verification hash](./hashing#term-verification-hashes) in order to upload it.
|
||||
- These hashes are used to ensure that the client uploading the file in the shard authoritatively has access to the actual file data.
|
||||
- Compute the sha256 for the file contents
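
For the last item in the list above, the sha256 of the file contents can be computed in a streaming fashion so that large files are never fully loaded into memory. The sketch below uses Python's `hashlib`; the helper name and the block size are arbitrary choices.

```python
import hashlib


def sha256_of_file(path, block_size=1 << 20):
    """Return the hex sha256 digest of a file, reading it in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops at the first empty read (end of file).
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()
```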
|
||||
|
||||
With these components it is now possible to completely serialize a [file info block](../spec/shard.md#2-file-info-section) in the shard format.
|
||||
With these components it is now possible to completely serialize a [file info block](./shard#2-file-info-section) in the shard format.
|
||||
|
||||
In addition to the file info information, it is also necessary to collect all metadata for new xorbs that were created.
|
||||
This metadata is the xorb hash, the hash and length of each chunk, the serialized length of the xorb and the sum of the chunk lengths for a xorb.
|
||||
With these components it is now possible to serialize for each xorb a [CAS Info block](../spec/shard.md#3-cas-info-section).
|
||||
With these components it is now possible to serialize for each xorb a [CAS Info block](./shard#3-cas-info-section).
|
||||
|
||||
### 6. Shard serialization
|
||||
### 6. Shard Serialization and Upload
|
||||
|
||||
Given the information collected in the previous section, serialize a shard for a batch of files following the format specified in the [shard spec](../spec/shard.md).
|
||||
Given the information collected in the previous section, serialize a shard for a batch of files following the format specified in the [shard spec](./shard).
|
||||
|
||||
The client uploads the shard via the [shard upload](../spec/api.md#4-upload-shard) endpoint on the CAS server.
|
||||
The client uploads the shard via the [shard upload](./api#4-upload-shard) endpoint on the CAS server.
|
||||
For this to succeed, all xorbs referenced by the shard MUST have already completed uploading.
|
||||
|
||||
This API registers files as uploaded.
|
||||
|
||||
> [!NOTE]
|
||||
> For a large batch of files or a batch of large files, if the serialized shard would be greater than 64 MiB you MUST break up the content into multiple shards.
|
||||
|
||||
### Done
|
||||
|
||||
After all xorbs and all shards are successfully uploaded, the full upload is considered complete.
|
||||
Files can then be downloaded by any client using the [download protocol](../spec/download_protocol.md).
|
||||
Files can then be downloaded by any client using the [download protocol](./download-protocol).
|
||||
|
||||
> [!NOTE]
|
||||
> If this file is being uploaded to the Hugging Face Hub, users will need to commit a git lfs pointer file using the sha256 of the file contents.
|
||||
|
||||
## Ordering and concurrency
|
||||
## Ordering and Concurrency
|
||||
|
||||
There are some natural ordering requirements in the upload process, e.g. you MUST have determined a chunk boundary before computing the chunk hash, and you MUST have collected a sequence of chunks to create a xorb to compute the xorb hash etc.
|
||||
|
||||
@@ -131,9 +135,9 @@ However there is one additional enforced requirement about ordering: **all xorbs
|
||||
If any xorb referenced by a shard is not already uploaded when the shard upload API is called, the server will reject the request.
|
||||
All xorbs whose hash is used as an entry in the cas info section and in data entries of the file info section are considered "referenced" by a shard.
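
One way to honor this ordering while still uploading xorbs concurrently is sketched below in Python. The `upload_xorb` and `upload_shard` callables are placeholders supplied by the caller (for example thin wrappers around the xorb and shard upload endpoints); they are not part of any real library.

```python
from concurrent.futures import ThreadPoolExecutor


def upload_all(xorbs, shard, upload_xorb, upload_shard, max_workers=4):
    """Upload every xorb, then the shard that references them.

    xorbs: xorb hash -> serialized xorb bytes.
    The shard is only uploaded once every xorb upload has succeeded.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(upload_xorb, xorb_hash, body)
            for xorb_hash, body in xorbs.items()
        ]
        for future in futures:
            # result() re-raises any upload error, which aborts the shard upload.
            future.result()
    upload_shard(shard)
```

If any xorb upload fails, the exception propagates and the shard is never sent, so the server does not see a shard that references a missing xorb.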
|
||||
|
||||
## Integrity and idempotency
|
||||
## Integrity and Idempotency
|
||||
|
||||
- Hashing of chunks, xorbs, and shards ensures integrity and enables deduplication across local and global scopes. See: [hashing.md](../spec/hashing.md).
|
||||
- Hashing of chunks, xorbs, and shards ensures integrity and enables deduplication across local and global scopes. See: [hashing](./hashing).
|
||||
- the same chunk data produces the same chunk hash
|
||||
- the same set of chunks will produce the same xorb hash
|
||||
- A consistent chunking algorithm ensures that the same data is split into the same chunks at the same boundaries, allowing those chunks to be matched to other data and deduplicated.
|
||||
@@ -144,30 +148,30 @@ All xorbs whose hash is used as an entry in the cas info section and in data ent
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
autonumber
|
||||
participant Client
|
||||
participant CAS as CAS Server
|
||||
participant C as Client
|
||||
participant S as CAS Server
|
||||
|
||||
Client->>Client: Chunking: split file into chunks and compute chunk hashes
|
||||
C->>C: Chunking: split file into chunks and compute chunk hashes
|
||||
|
||||
Note right of Client: 2) Local deduplication (OPTIONAL)
|
||||
Note right of C: 2) Local deduplication (OPTIONAL)
|
||||
|
||||
loop For each chunk if chunk % 1024 == 0<br/>(global dedupe eligible)
|
||||
opt Global deduplication (OPTIONAL)
|
||||
Client->>CAS: GET /v1/chunks/default-merkledb/{chunk_hash}
|
||||
CAS-->>Client: 200 dedupe information or 404 not found
|
||||
C->>S: GET /v1/chunks/default-merkledb/{chunk_hash}
|
||||
S-->>C: 200 dedupe information or 404 not found
|
||||
end
|
||||
end
|
||||
|
||||
Client->>Client: Xorb formation (group chunks ~64 MiB), hashing, serialization
|
||||
C->>C: Xorb formation (group chunks ~64 MiB), hashing, serialization
|
||||
|
||||
loop For each new Xorb
|
||||
Client->>CAS: POST /v1/xorbs/default/{xorb_hash}
|
||||
CAS-->>Client: 200 OK
|
||||
C->>S: POST /v1/xorbs/default/{xorb_hash}
|
||||
S-->>C: 200 OK
|
||||
end
|
||||
|
||||
Client->>Client: Shard formation (files -> reconstructions) and serialization
|
||||
Client->>CAS: POST /v1/shards
|
||||
CAS-->>Client: 200 OK
|
||||
C->>C: Shard formation (files -> reconstructions) and serialization
|
||||
C->>S: POST /v1/shards
|
||||
S-->>C: 200 OK
|
||||
|
||||
Note over Client,CAS: All referenced Xorbs MUST be uploaded before Shard upload.<br/>Endpoints are idempotent by content-addressed keys.
|
||||
Note over C,S: All referenced Xorbs MUST be uploaded before Shard upload.<br/>Endpoints are idempotent by content-addressed keys.
|
||||
```
|
||||
@@ -111,7 +111,7 @@ Note that a Xorb MAY contain chunks that utilize different compression schemes.
|
||||
2. **Best Effort Prediction**
|
||||
|
||||
In `xet-core`, to predict if BG4 will be useful we compute the maximum KL divergence between the distribution of per-byte pop-counts on a sample of each of the 4 groups that would be formed.
|
||||
You can read more about it in [bg4_prediction.rs](../cas_object/src/byte_grouping/bg4_prediction.rs) and accompanying scripts.
|
||||
You can read more about it in [bg4_prediction.rs](./cas_object/src/byte_grouping/bg4_prediction.rs) and accompanying scripts.
|
||||
|
||||
If the predictor does not show that BG4 will be better, we use LZ4. In either case, the chunk is stored uncompressed if the chosen compression scheme does not show any benefit.
|
||||
|
||||