V2 reconstruction with client-side optional single range splitting (#703)

This PR introduces V2 multirange URL fetching for xorbs, with an option
to split each multirange request into multiple single-range requests
that can be executed in parallel. The reconstruction process can thus
generate full multirange presigned URLs, while the client performs the
retrieval stage as a set of parallel single-range queries.

The config variable `client.enable_multirange_fetching` controls this
behavior; it defaults to false because fetching multirange URLs is
currently observed to be slow.

---------

Co-authored-by: Adrien <adrien@huggingface.co>
Hoyt Koepke
2026-03-16 14:10:50 -07:00
committed by GitHub
parent 79df99ad01
commit 9caf7fcc44
30 changed files with 3752 additions and 880 deletions


@@ -0,0 +1,94 @@
# API Update: V2 Reconstruction with Multi-Range Fetch Support (2026-03-16)
## Overview
The CAS reconstruction API now supports a V2 endpoint that returns optimized
multi-range fetch descriptors. The client auto-detects V2 and falls back to V1
transparently. Two new config options control reconstruction behavior.
---
## 1. New CAS Endpoint
`GET /v2/reconstructions/{file_id}` returns `QueryReconstructionResponseV2`:
```json
{
"terms": [...],
"offset_into_first_range": 0,
"xorbs": {
"<hex_hash>": [
{
"url": "https://...",
"ranges": [
{ "chunks": { "start": 0, "end": 3 }, "bytes": { "start": 0, "end": 1023 } },
{ "chunks": { "start": 5, "end": 8 }, "bytes": { "start": 2048, "end": 3071 } }
]
}
]
}
}
```
Each `XorbMultiRangeFetch` entry groups multiple disjoint chunk ranges under a
single presigned URL, enabling multi-range HTTP requests.
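As a minimal sketch of how one such entry maps onto a request, the snippet below builds the `Range` header value for a list of byte ranges. The struct `RangeDescriptor` and both functions are hypothetical names for illustration, not the crate's types, and the sketch assumes byte ends are inclusive, as in an HTTP `Range` header.

```rust
/// Hypothetical, flattened mirror of one `ranges` entry from the JSON above.
/// Assumption: `byte_end` is inclusive, matching HTTP Range semantics.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RangeDescriptor {
    pub chunk_start: u32,
    pub chunk_end: u32,
    pub byte_start: u64,
    pub byte_end: u64,
}

/// Builds a Range header value such as "bytes=0-1023,2048-3071".
/// A single range yields the usual "bytes=S-E" form.
pub fn range_header(ranges: &[RangeDescriptor]) -> String {
    let joined = ranges
        .iter()
        .map(|r| format!("{}-{}", r.byte_start, r.byte_end))
        .collect::<Vec<_>>()
        .join(",");
    format!("bytes={joined}")
}

/// Total number of bytes requested across all ranges (inclusive ends).
pub fn total_bytes(ranges: &[RangeDescriptor]) -> u64 {
    ranges.iter().map(|r| r.byte_end - r.byte_start + 1).sum()
}
```

For the two ranges in the JSON example, this produces `bytes=0-1023,2048-3071` and a total of 2048 bytes.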
The client tries V2 first. On 404 or 501 it falls back to V1 and caches the
detected version so subsequent calls skip the V2 attempt. Setting
`HF_XET_CLIENT_RECONSTRUCTION_API_VERSION=1` or `=2` forces a specific version
with no fallback.
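The probe-and-cache behavior can be sketched as follows. This is a simplified, synchronous illustration and not the client's implementation: `VersionProbe`, `FetchError`, and the `call` closure (which stands in for the HTTP request) are hypothetical names, and `NotSupported` is assumed to represent a 404 or 501 response.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical error type: `NotSupported` carries a 404 or 501 status.
#[derive(Debug, PartialEq, Eq)]
pub enum FetchError {
    NotSupported(u16),
    Other(String),
}

/// Caches the detected API version (0 = not yet probed, 1 = V1, 2 = V2).
pub struct VersionProbe {
    detected: AtomicU32,
}

impl VersionProbe {
    pub fn new() -> Self {
        Self { detected: AtomicU32::new(0) }
    }

    /// Runs `call` with the chosen version and returns the version that succeeded.
    /// With `forced` set, no fallback happens and nothing is cached.
    pub fn fetch<F>(&self, forced: Option<u32>, mut call: F) -> Result<u32, FetchError>
    where
        F: FnMut(u32) -> Result<(), FetchError>,
    {
        // Prefer the forced version, then the cached one, then V2.
        let version = forced.unwrap_or_else(|| match self.detected.load(Ordering::Relaxed) {
            0 => 2,
            v => v,
        });
        match (version, call(version)) {
            (_, Ok(())) => {
                if forced.is_none() {
                    self.detected.store(version, Ordering::Relaxed);
                }
                Ok(version)
            }
            (2, Err(FetchError::NotSupported(_))) if forced.is_none() => {
                // Cache V1 only after the V1 call succeeds, so a transient
                // failure does not pin the client to the wrong version.
                call(1)?;
                self.detected.store(1, Ordering::Relaxed);
                Ok(1)
            }
            (_, Err(e)) => Err(e),
        }
    }
}
```

Once a fallback has succeeded, later calls with `forced = None` go straight to V1 without retrying V2.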
The `Client::get_reconstruction` trait method now always returns
`QueryReconstructionResponseV2`. When the server returns V1, the client
converts it internally.
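One plausible shape for that conversion is sketched below, under the assumption that each V1 fetch entry (one URL covering one chunk range and one byte range) becomes a V2 entry holding a single range. The types `V1FetchInfo`, `V2Range`, and `V2Fetch` are simplified stand-ins invented for this example, not the definitions in `cas_types`.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a V1 fetch entry: one URL per chunk range.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct V1FetchInfo {
    pub chunks: (u32, u32),
    pub url: String,
    pub bytes: (u64, u64),
}

/// Simplified stand-in for a V2 range descriptor.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct V2Range {
    pub chunks: (u32, u32),
    pub bytes: (u64, u64),
}

/// Simplified stand-in for a V2 multi-range fetch entry.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct V2Fetch {
    pub url: String,
    pub ranges: Vec<V2Range>,
}

/// Wraps every V1 entry in a V2 entry that carries exactly one range.
/// Assumption: no merging across entries is attempted, since each V1 URL
/// is signed for its own range.
pub fn v1_to_v2(fetch_info: HashMap<String, Vec<V1FetchInfo>>) -> HashMap<String, Vec<V2Fetch>> {
    fetch_info
        .into_iter()
        .map(|(hash, infos)| {
            let fetches = infos
                .into_iter()
                .map(|fi| V2Fetch {
                    url: fi.url,
                    ranges: vec![V2Range { chunks: fi.chunks, bytes: fi.bytes }],
                })
                .collect();
            (hash, fetches)
        })
        .collect()
}
```

Callers that consume the V2 shape then see V1 data as the degenerate case of one range per URL.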
---
## 2. New Config Options
### `HF_XET_CLIENT_RECONSTRUCTION_API_VERSION`
Forces a specific reconstruction API version (1 or 2). When unset, the client
auto-detects by trying V2 first.
### `HF_XET_CLIENT_ENABLE_MULTIRANGE_FETCHING`
Default: `false`. When false, V2 multi-range fetch entries are split into
individual single-range requests executed in parallel. When true, multi-range
requests are sent as-is (using `multipart/byteranges` responses).
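The effect of the flag on request planning can be illustrated with a small sketch. The function `plan_requests` and the `ByteRange` alias are hypothetical; the sketch covers only the grouping decision, not how the requests are issued or scheduled.

```rust
/// Inclusive (start, end) byte range; a simplified stand-in for `HttpRange`.
pub type ByteRange = (u64, u64);

/// Groups the byte ranges of one fetch entry into HTTP requests.
/// Each inner Vec is the set of ranges sent in a single request.
pub fn plan_requests(ranges: &[ByteRange], enable_multirange: bool) -> Vec<Vec<ByteRange>> {
    if ranges.is_empty() {
        return Vec::new();
    }
    if enable_multirange {
        // One request carrying every range ("bytes=S1-E1,S2-E2,...").
        vec![ranges.to_vec()]
    } else {
        // One request per range; these can then run in parallel.
        ranges.iter().map(|r| vec![*r]).collect()
    }
}
```

With the default (`false`), an entry with two ranges yields two single-range requests; with `true`, it yields one request listing both ranges.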
---
## 3. Default Concurrency Changes
- `ac_initial_upload_concurrency`: 1 → 2
- `ac_initial_download_concurrency`: 1 → 4
These align the defaults with the documented values.
---
## 4. New Types in `xet_client::cas_types`
- `QueryReconstructionResponseV2` — V2 reconstruction response
- `XorbMultiRangeFetch` — A presigned URL with associated chunk/byte ranges
- `XorbRangeDescriptor` — A single chunk range + byte range pair
---
## 5. Multipart/Byteranges Parsing
`xet_client::cas_client::multipart::parse_multipart_byteranges` parses RFC 7233
`multipart/byteranges` HTTP responses. It is used when `enable_multirange_fetching`
is true and the presigned URL server returns multiple byte ranges in a single
response.
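Each part of such a response carries a `Content-Range` header of the form `bytes START-END/TOTAL`, with an inclusive end. The standalone sketch below shows how that value can be turned into a range. It is an illustration only: this `parse_content_range` takes the header value as a string and returns an `Option`, which differs from the helper in the `multipart` module.

```rust
/// Parses a Content-Range value like "bytes 100-199/1000" into an
/// inclusive (start, end) pair. Returns None for anything else,
/// including the unsatisfied form "bytes */1000".
pub fn parse_content_range(value: &str) -> Option<(u64, u64)> {
    let spec = value.trim().strip_prefix("bytes ")?;
    // Split off the total length after '/', which is not needed here.
    let (range, _total) = spec.split_once('/')?;
    let (start, end) = range.split_once('-')?;
    let start: u64 = start.trim().parse().ok()?;
    let end: u64 = end.trim().parse().ok()?;
    // Reject inverted ranges rather than returning a nonsensical span.
    (start <= end).then_some((start, end))
}
```

Because the end is inclusive, the range `100-199` describes exactly 100 bytes of part data.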
---
## 6. Downstream Impact
- `Client::get_reconstruction` return type changed to `QueryReconstructionResponseV2`
(all trait implementations updated).
- `URLProvider::retrieve_url` now returns `Vec<HttpRange>` instead of a single
`HttpRange` to support multi-range blocks.
- No changes to the existing V1 wire format or serialization; V1 responses are converted client-side.


@@ -14,24 +14,27 @@ security:
paths:
/v1/reconstructions/{file_id}:
get:
summary: Get File Reconstruction
summary: Get File Reconstruction (V1)
description: |
Retrieves reconstruction information for a specific file. Supports byte range via the optional `Range` header.
Returns one presigned URL per chunk range per xorb.
Minimum token scope: `read`.
x-required-scope: read
operationId: getReconstruction
operationId: getReconstructionV1
parameters:
- $ref: '#/components/parameters/FileIdParam'
- $ref: '#/components/parameters/RangeHeader'
responses:
'200':
description: Reconstruction object
description: V1 reconstruction object
content:
application/json:
schema:
$ref: '#/components/schemas/QueryReconstructionResponse'
examples:
example:
v1:
summary: V1 response
value:
offset_into_first_range: 0
terms:
@@ -57,6 +60,60 @@ paths:
description: Not Found — File does not exist
'416':
description: Range Not Satisfiable — Requested byte range start exceeds file length
/v2/reconstructions/{file_id}:
get:
summary: Get File Reconstruction (V2)
description: |
V2 reconstruction endpoint optimized for multi-range fetching.
Returns fewer signed URLs by combining multiple byte ranges for the same xorb into a single URL,
enabling multi-range HTTP requests (RFC 7233).
Clients SHOULD try V2 first and fall back to V1 if the server returns 404 or 501.
Minimum token scope: `read`.
x-required-scope: read
operationId: getReconstructionV2
parameters:
- $ref: '#/components/parameters/FileIdParam'
- $ref: '#/components/parameters/RangeHeader'
responses:
'200':
description: V2 reconstruction object
content:
application/json:
schema:
$ref: '#/components/schemas/QueryReconstructionResponseV2'
examples:
v2:
summary: V2 response (multi-range optimized)
value:
offset_into_first_range: 0
terms:
- hash: a1b2c3d4e5f6789012345678901234567890abcdef1234567890abcdef123456
unpacked_length: 263873
range:
start: 0
end: 4
xorbs:
a1b2c3d4e5f6789012345678901234567890abcdef1234567890abcdef123456:
- url: "https://transfer.xethub.hf.co/xorbs/default/a1b2c3...?<signed-params>"
ranges:
- chunks:
start: 0
end: 4
bytes:
start: 0
end: 131071
'400':
description: Bad Request — Malformed file_id
'401':
description: Unauthorized — Missing/expired token
'404':
description: Not Found — File does not exist, or V2 not supported (fall back to V1)
'416':
description: Range Not Satisfiable — Requested byte range start exceeds file length
'501':
description: Not Implemented — V2 not supported by this server (fall back to V1)
/v1/chunks/{prefix}/{hash}:
get:
summary: Query Chunk Deduplication (Global Deduplication)
@@ -286,6 +343,56 @@ components:
$ref: '#/components/schemas/CASReconstructionFetchInfo'
required: [offset_into_first_range, terms, fetch_info]
additionalProperties: false
XorbRangeDescriptor:
type: object
description: A chunk/byte range within a xorb
properties:
chunks:
$ref: '#/components/schemas/IndexRange'
bytes:
$ref: '#/components/schemas/ByteRange'
required: [chunks, bytes]
additionalProperties: false
XorbMultiRangeFetch:
type: object
description: A signed multi-range fetch entry covering a subset of ranges for a xorb
properties:
url:
type: string
format: uri
description: |
Signed URL with all byte ranges encoded.
Client must send exactly the signed range value as the Range header.
ranges:
type: array
items:
$ref: '#/components/schemas/XorbRangeDescriptor'
description: Byte ranges covered by this URL, sorted by chunk start
required: [url, ranges]
additionalProperties: false
QueryReconstructionResponseV2:
type: object
description: V2 reconstruction response optimized for multi-range fetching
properties:
offset_into_first_range:
type: integer
minimum: 0
terms:
type: array
items:
$ref: '#/components/schemas/CASReconstructionTerm'
xorbs:
type: object
description: Map from xorb hash to list of multi-range fetch entries
propertyNames:
$ref: '#/components/schemas/HexString64Lowercase'
additionalProperties:
type: array
items:
$ref: '#/components/schemas/XorbMultiRangeFetch'
minItems: 1
required: [offset_into_first_range, terms, xorbs]
additionalProperties: false
UploadXorbResponse:
type: object
properties:


@@ -6,14 +6,16 @@ use xet_core_structures::xorb_object::SerializedXorbObject;
use super::adaptive_concurrency::ConnectionPermit;
use super::error::Result;
use super::progress_tracked_streams::ProgressCallback;
use crate::cas_types::{BatchQueryReconstructionResponse, FileRange, HttpRange, QueryReconstructionResponse};
use crate::cas_types::{BatchQueryReconstructionResponse, FileRange, HttpRange, QueryReconstructionResponseV2};
#[async_trait::async_trait]
pub trait URLProvider: Send + Sync {
// Retrieves the URL.
async fn retrieve_url(&self) -> Result<(String, HttpRange)>;
/// Retrieves the URL and the byte ranges to fetch.
/// For single-range (V1) blocks, the Vec has one entry.
/// For multi-range (V2) blocks, all ranges are included.
async fn retrieve_url(&self) -> Result<(String, Vec<HttpRange>)>;
// Asks for a refresh of the URL; triggered on 403 errors.
/// Asks for a refresh of the URL; triggered on 403 errors.
async fn refresh_url(&self) -> Result<()>;
}
@@ -30,11 +32,13 @@ pub trait Client: Send + Sync {
file_hash: &MerkleHash,
) -> Result<Option<(MDBFileInfo, Option<MerkleHash>)>>;
/// Returns reconstruction info always in V2 format.
/// Implementations may try V2 first and fall back to V1 + convert.
async fn get_reconstruction(
&self,
file_id: &MerkleHash,
bytes_range: Option<FileRange>,
) -> Result<Option<QueryReconstructionResponse>>;
) -> Result<Option<QueryReconstructionResponseV2>>;
async fn batch_get_reconstruction(&self, file_ids: &[MerkleHash]) -> Result<BatchQueryReconstructionResponse>;


@@ -16,6 +16,7 @@ mod error;
pub mod exports;
pub mod http_client;
mod interface;
pub mod multipart;
pub mod progress_tracked_streams;
pub mod remote_client;
pub mod retry_wrapper;


@@ -0,0 +1,186 @@
use bytes::Bytes;
use crate::cas_client::error::{CasClientError, Result};
use crate::cas_types::HttpRange;
/// A single part from a multipart/byteranges HTTP response.
pub struct MultipartPart {
pub range: HttpRange,
pub data: Bytes,
}
/// Parse a `multipart/byteranges` HTTP response body (RFC 7233 §4.1).
///
/// Extracts the boundary from `content_type`, splits the body by boundary markers,
/// parses `Content-Range` headers from each part, and returns parts sorted by byte range start.
pub fn parse_multipart_byteranges(content_type: &str, body: Bytes) -> Result<Vec<MultipartPart>> {
let boundary = extract_boundary(content_type)?;
let delimiter = format!("\r\n--{boundary}");
let body_slice = body.as_ref();
let mut parts = Vec::new();
let first_delim = format!("--{boundary}");
let Some(start) = find_subsequence(body_slice, first_delim.as_bytes()) else {
return Err(CasClientError::Other("No boundary found in multipart body".to_string()));
};
let mut remaining = &body_slice[start + first_delim.len()..];
loop {
if remaining.starts_with(b"\r\n") {
remaining = &remaining[2..];
} else {
break;
}
let next_boundary = find_subsequence(remaining, delimiter.as_bytes());
let part_data = match next_boundary {
Some(pos) => &remaining[..pos],
None => remaining,
};
let Some(header_end) = find_subsequence(part_data, b"\r\n\r\n") else {
return Err(CasClientError::Other("Malformed multipart part: missing header/data separator".to_string()));
};
let headers = &part_data[..header_end];
let data_start = header_end + 4;
let data = &part_data[data_start..];
let range = parse_content_range(headers)?;
// Compute the absolute byte offset into the original `body` so we can
// use Bytes::slice for zero-copy extraction of this part's data.
// `body_slice` borrows all of `body`, so the pointer difference is already the absolute offset.
let offset = (remaining.as_ptr() as usize - body_slice.as_ptr() as usize) + data_start;
parts.push(MultipartPart {
range,
data: body.slice(offset..offset + data.len()),
});
match next_boundary {
Some(pos) => {
remaining = &remaining[pos + delimiter.len()..];
},
None => break,
}
}
parts.sort_by_key(|p| p.range.start);
Ok(parts)
}
fn extract_boundary(content_type: &str) -> Result<String> {
for part in content_type.split(';') {
let part = part.trim();
if let Some(value) = part.strip_prefix("boundary=") {
let boundary = value.trim_matches('"');
return Ok(boundary.to_string());
}
}
Err(CasClientError::Other(format!("No boundary found in Content-Type: {content_type}")))
}
fn parse_content_range(headers: &[u8]) -> Result<HttpRange> {
let headers_str = std::str::from_utf8(headers)
.map_err(|e| CasClientError::Other(format!("Invalid UTF-8 in part headers: {e}")))?;
for line in headers_str.split("\r\n") {
let line_lower = line.to_ascii_lowercase();
if let Some(value) = line_lower.strip_prefix("content-range:") {
// Digits, dashes, and slashes are case-invariant, so we can parse
// directly from the lowercased value.
if let Some(range_spec) = value.trim().strip_prefix("bytes ") {
let original_value = range_spec.trim();
let slash_pos = original_value
.find('/')
.ok_or_else(|| CasClientError::Other(format!("Invalid Content-Range: {line}")))?;
let range_part = &original_value[..slash_pos];
let dash_pos = range_part
.find('-')
.ok_or_else(|| CasClientError::Other(format!("Invalid Content-Range: {line}")))?;
let start: u64 = range_part[..dash_pos]
.parse()
.map_err(|e| CasClientError::Other(format!("Invalid Content-Range start: {e}")))?;
let end: u64 = range_part[dash_pos + 1..]
.parse()
.map_err(|e| CasClientError::Other(format!("Invalid Content-Range end: {e}")))?;
// RFC 7233 Content-Range uses an inclusive end, which matches HttpRange.
return Ok(HttpRange::new(start, end));
}
}
}
Err(CasClientError::Other("No Content-Range header found in multipart part".to_string()))
}
fn find_subsequence(haystack: &[u8], needle: &[u8]) -> Option<usize> {
haystack.windows(needle.len()).position(|window| window == needle)
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_extract_boundary() {
assert_eq!(extract_boundary("multipart/byteranges; boundary=something").unwrap(), "something");
assert_eq!(extract_boundary("multipart/byteranges; boundary=\"quoted\"").unwrap(), "quoted");
}
#[test]
fn test_extract_boundary_missing() {
assert!(extract_boundary("text/plain").is_err());
}
#[test]
fn test_parse_single_part() {
let boundary = "abc123";
let body = format!(
"--{boundary}\r\nContent-Type: application/octet-stream\r\nContent-Range: bytes 0-99/1000\r\n\r\nHello World\r\n--{boundary}--\r\n"
);
let content_type = format!("multipart/byteranges; boundary={boundary}");
let parts = parse_multipart_byteranges(&content_type, Bytes::from(body)).unwrap();
assert_eq!(parts.len(), 1);
assert_eq!(parts[0].range.start, 0);
assert_eq!(parts[0].range.end, 99);
assert_eq!(&parts[0].data[..], b"Hello World");
}
#[test]
fn test_parse_multiple_parts() {
let boundary = "sep";
let body = format!(
"--{boundary}\r\nContent-Range: bytes 100-199/1000\r\n\r\nPart2Data\r\n--{boundary}\r\nContent-Range: bytes 0-49/1000\r\n\r\nPart1Data\r\n--{boundary}--\r\n"
);
let content_type = format!("multipart/byteranges; boundary={boundary}");
let parts = parse_multipart_byteranges(&content_type, Bytes::from(body)).unwrap();
assert_eq!(parts.len(), 2);
assert_eq!(parts[0].range.start, 0);
assert_eq!(parts[0].range.end, 49);
assert_eq!(&parts[0].data[..], b"Part1Data");
assert_eq!(parts[1].range.start, 100);
assert_eq!(parts[1].range.end, 199);
assert_eq!(&parts[1].data[..], b"Part2Data");
}
#[test]
fn test_parse_empty_body_no_boundary() {
let content_type = "multipart/byteranges; boundary=xyz";
let result = parse_multipart_byteranges(content_type, Bytes::new());
assert!(result.is_err());
}
#[test]
fn test_parse_part_missing_header_separator() {
let boundary = "xyz";
let body = format!("--{boundary}\r\nContent-Range: bytes 0-9/100\r\nMISSING_SEPARATOR\r\n--{boundary}--\r\n");
let content_type = format!("multipart/byteranges; boundary={boundary}");
let result = parse_multipart_byteranges(&content_type, Bytes::from(body));
assert!(result.is_err());
}
}


@@ -1,5 +1,5 @@
use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};
use bytes::Bytes;
use futures::TryStreamExt;
@@ -24,8 +24,8 @@ use super::progress_tracked_streams::{
use super::retry_wrapper::{RetryWrapper, RetryableReqwestError};
use super::{Client, INFORMATION_LOG_LEVEL};
use crate::cas_types::{
BatchQueryReconstructionResponse, FileRange, HttpRange, Key, QueryReconstructionResponse, UploadShardResponse,
UploadShardResponseType, UploadXorbResponse,
BatchQueryReconstructionResponse, FileRange, HttpRange, Key, QueryReconstructionResponse,
QueryReconstructionResponseV2, UploadShardResponse, UploadShardResponseType, UploadXorbResponse,
};
pub const CAS_ENDPOINT: &str = "http://localhost:8080";
@@ -48,6 +48,8 @@ pub struct RemoteClient {
shard_upload_http_client: Arc<ClientWithMiddleware>,
upload_concurrency_controller: Arc<AdaptiveConcurrencyController>,
download_concurrency_controller: Arc<AdaptiveConcurrencyController>,
/// Caches the discovered reconstruction API version (0 = not yet probed, 1 = V1, 2 = V2).
detected_reconstruction_api_version: AtomicU32,
}
impl RemoteClient {
@@ -85,6 +87,7 @@ impl RemoteClient {
),
upload_concurrency_controller: AdaptiveConcurrencyController::new_upload("upload"),
download_concurrency_controller: AdaptiveConcurrencyController::new_download("download"),
detected_reconstruction_api_version: AtomicU32::new(0),
})
}
@@ -168,6 +171,126 @@ impl RemoteClient {
}
}
impl RemoteClient {
async fn get_reconstruction_impl<T>(
&self,
file_id: &MerkleHash,
bytes_range: Option<FileRange>,
api_version: &str,
) -> Result<Option<T>>
where
T: serde::de::DeserializeOwned + 'static,
{
let call_id = FN_CALL_ID.fetch_add(1, Ordering::Relaxed);
let url = Url::parse(&format!("{}/{api_version}/reconstructions/{}", self.endpoint, file_id.hex()))?;
let api_tag = match api_version {
"v1" => "cas::get_reconstruction_v1",
"v2" => "cas::get_reconstruction_v2",
_ => {
return Err(CasClientError::internal(format!("unsupported reconstruction API version: {api_version}")));
},
};
event!(
INFORMATION_LOG_LEVEL,
call_id,
%file_id,
?bytes_range,
api_version,
"Starting get_reconstruction API call",
);
let client = self.authenticated_http_client.clone();
let result: Result<T> = RetryWrapper::new(api_tag)
.run_and_extract_json(move || {
let mut request = client.get(url.clone()).with_extension(Api(api_tag));
if let Some(range) = bytes_range {
request = request.header(RANGE, HttpRange::from(range).range_header())
}
request.send()
})
.await;
match result {
Ok(response) => {
event!(
INFORMATION_LOG_LEVEL,
call_id,
%file_id,
?bytes_range,
api_version,
"Completed get_reconstruction API call"
);
Ok(Some(response))
},
Err(CasClientError::ReqwestError(ref e, _)) if e.status() == Some(StatusCode::RANGE_NOT_SATISFIABLE) => {
Ok(None)
},
Err(e) => Err(e),
}
}
/// V1 reconstruction: returns per-range presigned URLs.
pub async fn get_reconstruction_v1(
&self,
file_id: &MerkleHash,
bytes_range: Option<FileRange>,
) -> Result<Option<QueryReconstructionResponse>> {
self.get_reconstruction_impl(file_id, bytes_range, "v1").await
}
/// V2 reconstruction: returns per-xorb multi-range fetch descriptors.
pub async fn get_reconstruction_v2(
&self,
file_id: &MerkleHash,
bytes_range: Option<FileRange>,
) -> Result<Option<QueryReconstructionResponseV2>> {
self.get_reconstruction_impl(file_id, bytes_range, "v2").await
}
pub(crate) async fn get_reconstruction_with_version_override(
&self,
file_id: &MerkleHash,
bytes_range: Option<FileRange>,
forced_version: Option<u32>,
) -> Result<Option<QueryReconstructionResponseV2>> {
// Prefer V2; fall back to V1 on 404/501; persist detected version to
// avoid repeated fallback attempts.
let version = match forced_version {
Some(v) => v,
None => {
let detected = self.detected_reconstruction_api_version.load(Ordering::Relaxed);
if detected != 0 { detected } else { 2 }
},
};
match version {
2 => match self.get_reconstruction_v2(file_id, bytes_range).await {
Ok(result) => {
if forced_version.is_none() {
self.detected_reconstruction_api_version.store(2, Ordering::Relaxed);
}
Ok(result)
},
Err(e)
if forced_version.is_none()
&& matches!(e.status(), Some(StatusCode::NOT_FOUND) | Some(StatusCode::NOT_IMPLEMENTED)) =>
{
info!(status = ?e.status(), "V2 reconstruction not available, falling back to V1");
let result = self.get_reconstruction_v1(file_id, bytes_range).await?.map(Into::into);
// Store after success to make sure we don't mess up on e.g. network failure.
self.detected_reconstruction_api_version.store(1, Ordering::Relaxed);
Ok(result)
},
Err(e) => Err(e),
},
1 => Ok(self.get_reconstruction_v1(file_id, bytes_range).await?.map(Into::into)),
other => Err(CasClientError::internal(format!("unsupported reconstruction API version: {other}"))),
}
}
}
#[cfg_attr(not(target_family = "wasm"), async_trait::async_trait)]
#[cfg_attr(target_family = "wasm", async_trait::async_trait(?Send))]
impl Client for RemoteClient {
@@ -175,49 +298,10 @@ impl Client for RemoteClient {
&self,
file_id: &MerkleHash,
bytes_range: Option<FileRange>,
) -> Result<Option<QueryReconstructionResponse>> {
let call_id = FN_CALL_ID.fetch_add(1, Ordering::Relaxed);
let url = Url::parse(&format!("{}/v1/reconstructions/{}", self.endpoint, file_id.hex()))?;
event!(
INFORMATION_LOG_LEVEL,
call_id,
%file_id,
?bytes_range,
"Starting get_reconstruction API call",
);
let api_tag = "cas::get_reconstruction";
let client = self.authenticated_http_client.clone();
let result: Result<QueryReconstructionResponse> = RetryWrapper::new(api_tag)
.run_and_extract_json(move || {
let mut request = client.get(url.clone()).with_extension(Api(api_tag));
if let Some(range) = bytes_range {
// convert exclusive-end to inclusive-end range
request = request.header(RANGE, HttpRange::from(range).range_header())
}
request.send()
})
.await;
match result {
Ok(query_reconstruction_response) => {
event!(
INFORMATION_LOG_LEVEL,
call_id,
%file_id,
?bytes_range,
"Completed get_reconstruction API call"
);
Ok(Some(query_reconstruction_response))
},
Err(CasClientError::ReqwestError(ref e, _)) if e.status() == Some(StatusCode::RANGE_NOT_SATISFIABLE) => {
// bytes_range not satisfiable
Ok(None)
},
Err(e) => Err(e),
}
) -> Result<Option<QueryReconstructionResponseV2>> {
let forced_version = xet_config().client.reconstruction_api_version;
self.get_reconstruction_with_version_override(file_id, bytes_range, forced_version)
.await
}
async fn batch_get_reconstruction(&self, file_ids: &[MerkleHash]) -> Result<BatchQueryReconstructionResponse> {
@@ -270,8 +354,8 @@ impl Client for RemoteClient {
let http_client = self.http_client.clone();
let url_info = Arc::new(url_info);
let (_, url_range) = url_info.retrieve_url().await?;
let total_download_bytes = url_range.length();
let (_, url_ranges) = url_info.retrieve_url().await?;
let total_download_bytes: u64 = url_ranges.iter().map(|r| r.length()).sum();
let mut transfer_reporter = StreamProgressReporter::new(total_download_bytes)
.with_adaptive_concurrency_reporter(download_permit.get_partial_completion_reporting_function());
@@ -288,16 +372,28 @@ impl Client for RemoteClient {
let url_info = url_info.clone();
async move {
let (url_string, url_range) = url_info
let (url_string, url_ranges) = url_info
.retrieve_url()
.await
.map_err(|e| reqwest_middleware::Error::Middleware(e.into()))?;
let url =
Url::parse(&url_string).map_err(|e| reqwest_middleware::Error::Middleware(e.into()))?;
// RFC 7233 §2.1: single-range uses "bytes=S-E", multi-range uses "bytes=S1-E1,S2-E2,..."
let range_header_value = if url_ranges.len() == 1 {
url_ranges[0].range_header()
} else {
let joined = url_ranges
.iter()
.map(|r| format!("{}-{}", r.start, r.end))
.collect::<Vec<_>>()
.join(",");
format!("bytes={joined}")
};
let response = http_client
.get(url)
.header(RANGE, url_range.range_header())
.header(RANGE, range_header_value)
.with_extension(Api(api_tag))
.send()
.await?;
@@ -315,6 +411,57 @@ impl Client for RemoteClient {
move |resp: Response| {
let transfer_reporter = transfer_reporter.clone();
async move {
let content_type = resp
.headers()
.get("content-type")
.and_then(|v| v.to_str().ok())
.unwrap_or("")
.to_string();
let is_multipart = content_type.contains("multipart/byteranges");
if is_multipart {
let body = resp
.bytes()
.await
.map_err(|e| RetryableReqwestError::RetryableError(CasClientError::from(e)))?;
let multipart_parts = crate::cas_client::multipart::parse_multipart_byteranges(&content_type, body)
.map_err(RetryableReqwestError::FatalError)?;
let mut all_decompressed = Vec::with_capacity(uncompressed_size_if_known.unwrap_or(0));
let mut all_chunk_indices = Vec::<u32>::new();
let mut total_compressed_bytes = 0u64;
for part in multipart_parts {
total_compressed_bytes += part.data.len() as u64;
let (data, chunk_indices) =
xet_core_structures::xorb_object::deserialize_chunks(&mut std::io::Cursor::new(part.data.as_ref()))
.map_err(|e| {
RetryableReqwestError::RetryableError(CasClientError::FormatError(e))
})?;
xet_core_structures::xorb_object::append_chunk_segment(
&mut all_decompressed,
&mut all_chunk_indices,
&data,
&chunk_indices,
);
transfer_reporter.report_progress(total_compressed_bytes as usize);
}
if let Some(expected) = uncompressed_size_if_known
&& expected != all_decompressed.len()
{
return Err(RetryableReqwestError::RetryableError(CasClientError::Other(format!(
"get_file_term_data: expected {expected} uncompressed bytes, got {}",
all_decompressed.len()
))));
}
Ok((Bytes::from(all_decompressed), all_chunk_indices))
} else {
let incoming_stream = DownloadProgressStream::wrap_stream(
resp.bytes_stream().map_err(std::io::Error::other),
transfer_reporter,
@@ -345,6 +492,7 @@ impl Client for RemoteClient {
Err(e) => Err(RetryableReqwestError::RetryableError(CasClientError::FormatError(e))),
}
}
}
},
)
.await?;


@@ -157,10 +157,13 @@ impl RetryWrapper {
}
},
(Err(e), Some(Retryable::Transient)) => {
// Intercept the too many requests condition in the case of no retrying on 429.
if e.status() == Some(StatusCode::TOO_MANY_REQUESTS) && self.no_retry_on_429 {
let cas_err = process_error("Too Many Requests (retry on 429 disabled)", e, false);
Err(RetryableReqwestError::FatalError(cas_err))
} else if e.status() == Some(StatusCode::NOT_IMPLEMENTED) {
// 501 is permanent -- the server won't implement this on retry.
let cas_err = process_error("Not Implemented", e, true);
Err(RetryableReqwestError::FatalError(cas_err))
} else {
let cas_err = process_error("Retryable Error", e, true);
Err(RetryableReqwestError::RetryableError(cas_err))


@@ -36,6 +36,11 @@ where
test_get_file_data_with_ranges(factory().await).await;
test_get_file_size(factory().await).await;
test_global_dedup(factory().await).await;
test_v2_reconstruction_basic(factory().await).await;
test_v2_reconstruction_ranges(factory().await).await;
test_v2_reconstruction_matches_v1(factory().await).await;
test_v2_max_ranges_per_fetch(factory().await).await;
test_v2_url_encoding(factory().await).await;
}
/// Tests that adjacent chunk ranges from the same xorb are merged into a single fetch_info.
@@ -43,7 +48,7 @@ pub async fn test_reconstruction_merges_adjacent_ranges(client: Arc<dyn DirectAc
let term_spec = &[(1, (0, 2)), (1, (2, 4))];
let file = client.upload_random_file(term_spec, 2048).await.unwrap();
let reconstruction = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
let reconstruction = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();
assert_eq!(reconstruction.terms.len(), 2);
assert_eq!(reconstruction.fetch_info.len(), 1);
@@ -59,7 +64,7 @@ pub async fn test_reconstruction_with_multiple_xorbs(client: Arc<dyn DirectAcces
let term_spec = &[(1, (0, 3)), (2, (0, 2)), (1, (3, 5))];
let file = client.upload_random_file(term_spec, 2048).await.unwrap();
let reconstruction = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
let reconstruction = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();
assert_eq!(reconstruction.terms.len(), 3);
assert_eq!(reconstruction.fetch_info.len(), 2);
}
@@ -73,7 +78,7 @@ pub async fn test_reconstruction_overlapping_range_merging(client: Arc<dyn Direc
let term_spec = &[(1, (0, 3)), (1, (1, 4))];
let file = client.upload_random_file(term_spec, chunk_size).await.unwrap();
let reconstruction = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
let reconstruction = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();
assert_eq!(reconstruction.terms.len(), 2);
assert_eq!(reconstruction.fetch_info.len(), 1);
@@ -89,7 +94,7 @@ pub async fn test_reconstruction_overlapping_range_merging(client: Arc<dyn Direc
let term_spec = &[(1, (0, 5)), (1, (1, 3))];
let file = client.upload_random_file(term_spec, chunk_size).await.unwrap();
let reconstruction = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
let reconstruction = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();
assert_eq!(reconstruction.terms.len(), 2);
assert_eq!(reconstruction.fetch_info.len(), 1);
@@ -105,7 +110,7 @@ pub async fn test_reconstruction_overlapping_range_merging(client: Arc<dyn Direc
let term_spec = &[(1, (0, 2)), (1, (1, 4)), (1, (3, 6))];
let file = client.upload_random_file(term_spec, chunk_size).await.unwrap();
let reconstruction = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
let reconstruction = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();
assert_eq!(reconstruction.terms.len(), 3);
assert_eq!(reconstruction.fetch_info.len(), 1);
@@ -121,7 +126,7 @@ pub async fn test_reconstruction_overlapping_range_merging(client: Arc<dyn Direc
let term_spec = &[(1, (0, 2)), (1, (4, 6))];
let file = client.upload_random_file(term_spec, chunk_size).await.unwrap();
let reconstruction = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
let reconstruction = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();
assert_eq!(reconstruction.terms.len(), 2);
assert_eq!(reconstruction.fetch_info.len(), 1);
@@ -139,7 +144,7 @@ pub async fn test_reconstruction_overlapping_range_merging(client: Arc<dyn Direc
let term_spec = &[(1, (0, 3)), (1, (3, 5))];
let file = client.upload_random_file(term_spec, chunk_size).await.unwrap();
let reconstruction = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
let reconstruction = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();
assert_eq!(reconstruction.terms.len(), 2);
assert_eq!(reconstruction.fetch_info.len(), 1);
@@ -155,7 +160,7 @@ pub async fn test_reconstruction_overlapping_range_merging(client: Arc<dyn Direc
let term_spec = &[(1, (2, 5)), (1, (2, 5)), (1, (2, 5))];
let file = client.upload_random_file(term_spec, chunk_size).await.unwrap();
let reconstruction = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
let reconstruction = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();
assert_eq!(reconstruction.terms.len(), 3);
assert_eq!(reconstruction.fetch_info.len(), 1);
@@ -171,7 +176,7 @@ pub async fn test_reconstruction_overlapping_range_merging(client: Arc<dyn Direc
let term_spec = &[(1, (0, 3)), (1, (2, 4)), (1, (6, 8)), (1, (7, 10))];
let file = client.upload_random_file(term_spec, chunk_size).await.unwrap();
let reconstruction = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
let reconstruction = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();
assert_eq!(reconstruction.terms.len(), 4);
assert_eq!(reconstruction.fetch_info.len(), 1);
@@ -191,12 +196,12 @@ pub async fn test_range_requests(client: Arc<dyn DirectAccessClient>) {
let file = client.upload_random_file(term_spec, 2048).await.unwrap();
// Calculate total file size from terms
let reconstruction_full = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
let reconstruction_full = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();
let total_file_size: u64 = reconstruction_full.terms.iter().map(|t| t.unpacked_length as u64).sum();
// Partial out-of-range truncates
let response = client
.get_reconstruction(&file.file_hash, Some(FileRange::new(total_file_size / 2, total_file_size + 1000)))
.get_reconstruction_v1(&file.file_hash, Some(FileRange::new(total_file_size / 2, total_file_size + 1000)))
.await
.unwrap()
.unwrap();
@@ -205,19 +210,19 @@ pub async fn test_range_requests(client: Arc<dyn DirectAccessClient>) {
// Entire range out of bounds returns Ok(None) (like RemoteClient's 416 handling)
let result = client
.get_reconstruction(&file.file_hash, Some(FileRange::new(total_file_size + 100, total_file_size + 1000)))
.get_reconstruction_v1(&file.file_hash, Some(FileRange::new(total_file_size + 100, total_file_size + 1000)))
.await;
assert!(result.unwrap().is_none());
// Start equals file size returns Ok(None)
let result = client
.get_reconstruction(&file.file_hash, Some(FileRange::new(total_file_size, total_file_size + 100)))
.get_reconstruction_v1(&file.file_hash, Some(FileRange::new(total_file_size, total_file_size + 100)))
.await;
assert!(result.unwrap().is_none());
// Valid range within bounds succeeds
let response = client
.get_reconstruction(&file.file_hash, Some(FileRange::new(0, total_file_size / 2)))
.get_reconstruction_v1(&file.file_hash, Some(FileRange::new(0, total_file_size / 2)))
.await
.unwrap()
.unwrap();
@@ -226,7 +231,7 @@ pub async fn test_range_requests(client: Arc<dyn DirectAccessClient>) {
// End exactly at file size succeeds
let response = client
.get_reconstruction(&file.file_hash, Some(FileRange::new(0, total_file_size)))
.get_reconstruction_v1(&file.file_hash, Some(FileRange::new(0, total_file_size)))
.await
.unwrap()
.unwrap();
@@ -239,7 +244,7 @@ pub async fn test_upload_configurations(client: Arc<dyn DirectAccessClient>) {
// Test 1: Single segment with 3 chunks
{
let file = client.upload_random_file(&[(1, (0, 3))], 2048).await.unwrap();
let reconstruction = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
let reconstruction = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();
assert_eq!(reconstruction.terms.len(), 1);
}
@@ -248,7 +253,7 @@ pub async fn test_upload_configurations(client: Arc<dyn DirectAccessClient>) {
let term_spec = &[(1, (0, 2)), (1, (2, 4)), (1, (4, 6))];
let file = client.upload_random_file(term_spec, 2048).await.unwrap();
let reconstruction = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
let reconstruction = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();
assert_eq!(reconstruction.terms.len(), 3);
assert_eq!(reconstruction.fetch_info.len(), 1);
}
@@ -258,7 +263,7 @@ pub async fn test_upload_configurations(client: Arc<dyn DirectAccessClient>) {
let term_spec = &[(1, (0, 3)), (2, (0, 2)), (3, (0, 4))];
let file = client.upload_random_file(term_spec, 2048).await.unwrap();
let reconstruction = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
let reconstruction = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();
assert_eq!(reconstruction.terms.len(), 3);
assert_eq!(reconstruction.fetch_info.len(), 3);
}
@@ -268,7 +273,7 @@ pub async fn test_upload_configurations(client: Arc<dyn DirectAccessClient>) {
let term_spec = &[(1, (0, 3)), (1, (1, 4)), (1, (2, 5))];
let file = client.upload_random_file(term_spec, 2048).await.unwrap();
let reconstruction = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
let reconstruction = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();
assert_eq!(reconstruction.terms.len(), 3);
assert_eq!(reconstruction.fetch_info.len(), 1);
}
@@ -280,7 +285,7 @@ pub async fn test_chunk_boundary_shrinking(client: Arc<dyn DirectAccessClient>)
let term_spec = &[(1, (0, 5))];
let file = client.upload_random_file(term_spec, chunk_size).await.unwrap();
let reconstruction_full = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
let reconstruction_full = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();
let total_file_size: u64 = reconstruction_full.terms.iter().map(|t| t.unpacked_length as u64).sum();
assert_eq!(total_file_size, (5 * chunk_size) as u64);
@@ -289,7 +294,7 @@ pub async fn test_chunk_boundary_shrinking(client: Arc<dyn DirectAccessClient>)
let start = chunk_size as u64 + 500;
let end = total_file_size;
let response = client
.get_reconstruction(&file.file_hash, Some(FileRange::new(start, end)))
.get_reconstruction_v1(&file.file_hash, Some(FileRange::new(start, end)))
.await
.unwrap()
.unwrap();
@@ -305,7 +310,7 @@ pub async fn test_chunk_boundary_shrinking(client: Arc<dyn DirectAccessClient>)
let start = (chunk_size * 2) as u64;
let end = total_file_size;
let response = client
.get_reconstruction(&file.file_hash, Some(FileRange::new(start, end)))
.get_reconstruction_v1(&file.file_hash, Some(FileRange::new(start, end)))
.await
.unwrap()
.unwrap();
@@ -321,7 +326,7 @@ pub async fn test_chunk_boundary_shrinking(client: Arc<dyn DirectAccessClient>)
let start = 0u64;
let end = (chunk_size * 2) as u64 + 500;
let response = client
.get_reconstruction(&file.file_hash, Some(FileRange::new(start, end)))
.get_reconstruction_v1(&file.file_hash, Some(FileRange::new(start, end)))
.await
.unwrap()
.unwrap();
@@ -337,7 +342,7 @@ pub async fn test_chunk_boundary_shrinking(client: Arc<dyn DirectAccessClient>)
let start = (chunk_size * 2) as u64 + 100;
let end = (chunk_size * 2) as u64 + 500;
let response = client
.get_reconstruction(&file.file_hash, Some(FileRange::new(start, end)))
.get_reconstruction_v1(&file.file_hash, Some(FileRange::new(start, end)))
.await
.unwrap()
.unwrap();
@@ -353,7 +358,7 @@ pub async fn test_chunk_boundary_shrinking(client: Arc<dyn DirectAccessClient>)
let start = chunk_size as u64 - 100;
let end = chunk_size as u64 + 100;
let response = client
.get_reconstruction(&file.file_hash, Some(FileRange::new(start, end)))
.get_reconstruction_v1(&file.file_hash, Some(FileRange::new(start, end)))
.await
.unwrap()
.unwrap();
@@ -371,7 +376,7 @@ pub async fn test_chunk_boundary_multiple_segments(client: Arc<dyn DirectAccessC
let term_spec = &[(1, (0, 4)), (2, (0, 4))];
let file = client.upload_random_file(term_spec, chunk_size).await.unwrap();
let reconstruction_full = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
let reconstruction_full = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();
let total_file_size: u64 = reconstruction_full.terms.iter().map(|t| t.unpacked_length as u64).sum();
assert_eq!(total_file_size, (8 * chunk_size) as u64);
@@ -380,7 +385,7 @@ pub async fn test_chunk_boundary_multiple_segments(client: Arc<dyn DirectAccessC
let start = chunk_size as u64 + 500;
let end = total_file_size;
let response = client
.get_reconstruction(&file.file_hash, Some(FileRange::new(start, end)))
.get_reconstruction_v1(&file.file_hash, Some(FileRange::new(start, end)))
.await
.unwrap()
.unwrap();
@@ -398,7 +403,7 @@ pub async fn test_chunk_boundary_multiple_segments(client: Arc<dyn DirectAccessC
let start = chunk_size as u64;
let end = (chunk_size * 3) as u64;
let response = client
.get_reconstruction(&file.file_hash, Some(FileRange::new(start, end)))
.get_reconstruction_v1(&file.file_hash, Some(FileRange::new(start, end)))
.await
.unwrap()
.unwrap();
@@ -415,7 +420,7 @@ pub async fn test_chunk_boundary_multiple_segments(client: Arc<dyn DirectAccessC
let start = xorb1_size + chunk_size as u64;
let end = xorb1_size + (chunk_size * 3) as u64;
let response = client
.get_reconstruction(&file.file_hash, Some(FileRange::new(start, end)))
.get_reconstruction_v1(&file.file_hash, Some(FileRange::new(start, end)))
.await
.unwrap()
.unwrap();
@@ -432,7 +437,7 @@ pub async fn test_chunk_boundary_multiple_segments(client: Arc<dyn DirectAccessC
let start = (chunk_size * 2) as u64;
let end = xorb1_size + (chunk_size * 2) as u64 + 500;
let response = client
.get_reconstruction(&file.file_hash, Some(FileRange::new(start, end)))
.get_reconstruction_v1(&file.file_hash, Some(FileRange::new(start, end)))
.await
.unwrap()
.unwrap();
@@ -712,7 +717,7 @@ async fn test_url_expiration_within_window(client: Arc<dyn DirectAccessClient>)
// Upload a file and get reconstruction info (which creates URLs with current timestamp)
let file = client.upload_random_file(&[(1, (0, 3))], 2048).await.unwrap();
let reconstruction = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
let reconstruction = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();
// Get the fetch_info for the first term's xorb
let xorb_hash = file.terms[0].xorb_hash;
@@ -738,7 +743,7 @@ async fn test_url_expiration_after_window(client: Arc<dyn DirectAccessClient>) {
// Upload a file and get reconstruction info (which creates URLs with current timestamp)
let file = client.upload_random_file(&[(1, (0, 3))], 2048).await.unwrap();
let reconstruction = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
let reconstruction = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();
// Get the fetch_info for the first term's xorb
let xorb_hash = file.terms[0].xorb_hash;
@@ -764,7 +769,7 @@ async fn test_url_expiration_default_infinite(client: Arc<dyn DirectAccessClient
// Upload a file and get reconstruction info
let file = client.upload_random_file(&[(1, (0, 3))], 2048).await.unwrap();
let reconstruction = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
let reconstruction = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();
// Get the fetch_info for the first term's xorb
let xorb_hash = file.terms[0].xorb_hash;
@@ -790,7 +795,7 @@ async fn test_url_expiration_exact_boundary(client: Arc<dyn DirectAccessClient>)
// Upload a file and get reconstruction info
let file = client.upload_random_file(&[(1, (0, 3))], 2048).await.unwrap();
let reconstruction = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
let reconstruction = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();
// Get the fetch_info for the first term's xorb
let xorb_hash = file.terms[0].xorb_hash;
@@ -916,3 +921,190 @@ async fn test_api_delay_can_be_disabled(client: Arc<dyn DirectAccessClient>) {
"Delay should not be applied after disabling: elapsed={elapsed:?}"
);
}
// ===== V2 Reconstruction Tests =====
/// Tests basic V2 reconstruction response structure.
async fn test_v2_reconstruction_basic(client: Arc<dyn DirectAccessClient>) {
let term_spec = &[(1, (0, 5))];
let file = client.upload_random_file(term_spec, 2048).await.unwrap();
let response = client.get_reconstruction_v2(&file.file_hash, None).await.unwrap().unwrap();
assert!(!response.terms.is_empty());
assert!(!response.xorbs.is_empty());
assert_eq!(response.offset_into_first_range, 0);
for term in &response.terms {
let xorb_descriptor = response.xorbs.get(&term.hash).expect("xorb descriptor missing for term");
assert!(!xorb_descriptor.is_empty());
for fetch in xorb_descriptor {
assert!(!fetch.url.is_empty());
assert!(!fetch.ranges.is_empty());
for range in &fetch.ranges {
assert!(range.bytes.start < range.bytes.end);
assert!(range.chunks.start < range.chunks.end);
}
}
}
}
/// Tests V2 reconstruction with byte range queries.
async fn test_v2_reconstruction_ranges(client: Arc<dyn DirectAccessClient>) {
let term_spec = &[(1, (0, 3)), (2, (0, 3)), (1, (3, 6))];
let file = client.upload_random_file(term_spec, 2048).await.unwrap();
let file_size = file.data.len() as u64;
// Partial range
let range = FileRange::new(file_size / 4, file_size * 3 / 4);
let response = client
.get_reconstruction_v2(&file.file_hash, Some(range))
.await
.unwrap()
.unwrap();
assert!(!response.terms.is_empty());
assert!(!response.xorbs.is_empty());
// Out-of-range query returns None
let out_of_range = FileRange::new(file_size + 100, file_size + 200);
let none_result = client.get_reconstruction_v2(&file.file_hash, Some(out_of_range)).await.unwrap();
assert!(none_result.is_none());
}
/// Tests that V2 reconstruction terms match V1 terms and offsets.
async fn test_v2_reconstruction_matches_v1(client: Arc<dyn DirectAccessClient>) {
let term_spec = &[(1, (0, 3)), (2, (0, 2)), (1, (3, 5))];
let file = client.upload_random_file(term_spec, 2048).await.unwrap();
let v1 = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();
let v2 = client.get_reconstruction_v2(&file.file_hash, None).await.unwrap().unwrap();
assert_eq!(v1.offset_into_first_range, v2.offset_into_first_range);
assert_eq!(v1.terms.len(), v2.terms.len());
for (t1, t2) in v1.terms.iter().zip(v2.terms.iter()) {
assert_eq!(t1.hash, t2.hash);
assert_eq!(t1.range, t2.range);
assert_eq!(t1.unpacked_length, t2.unpacked_length);
}
// Both should have the same xorb hashes
let mut v1_xorb_hashes: Vec<_> = v1.fetch_info.keys().map(|h| h.to_string()).collect();
let mut v2_xorb_hashes: Vec<_> = v2.xorbs.keys().map(|h| h.to_string()).collect();
v1_xorb_hashes.sort();
v2_xorb_hashes.sort();
assert_eq!(v1_xorb_hashes, v2_xorb_hashes);
// Check range with partial file
let file_size = file.data.len() as u64;
let range = FileRange::new(file_size / 4, file_size * 3 / 4);
let v1r = client
.get_reconstruction_v1(&file.file_hash, Some(range))
.await
.unwrap()
.unwrap();
let v2r = client
.get_reconstruction_v2(&file.file_hash, Some(range))
.await
.unwrap()
.unwrap();
assert_eq!(v1r.offset_into_first_range, v2r.offset_into_first_range);
assert_eq!(v1r.terms.len(), v2r.terms.len());
}
/// Tests that max_ranges_per_fetch correctly splits multi-range fetch entries.
async fn test_v2_max_ranges_per_fetch(client: Arc<dyn DirectAccessClient>) {
// Use a file with many non-contiguous segments from the same xorb,
// interleaved with another xorb to prevent merging.
let term_spec = &[
(1, (0, 2)),
(2, (0, 1)),
(1, (2, 4)),
(2, (1, 2)),
(1, (4, 6)),
(2, (2, 3)),
(1, (6, 8)),
];
let file = client.upload_random_file(term_spec, 512).await.unwrap();
// Without limit, xorb 1 should have all its ranges in a single fetch
let response_unlimited = client.get_reconstruction_v2(&file.file_hash, None).await.unwrap().unwrap();
// Find xorb 1's descriptor
let xorb1_hash = &file.terms[0].xorb_hash;
let hex_hash: crate::cas_types::HexMerkleHash = (*xorb1_hash).into();
let desc_unlimited = response_unlimited.xorbs.get(&hex_hash).unwrap();
// Now set max_ranges_per_fetch to 2
client.set_max_ranges_per_fetch(2);
let response_limited = client.get_reconstruction_v2(&file.file_hash, None).await.unwrap().unwrap();
let desc_limited = response_limited.xorbs.get(&hex_hash).unwrap();
// With a limit of 2, the number of fetch entries should be >= the unlimited count
assert!(
desc_limited.len() >= desc_unlimited.len(),
"Limited ({}) should have at least as many fetch entries as unlimited ({})",
desc_limited.len(),
desc_unlimited.len()
);
// Each fetch entry should have at most 2 ranges
for fetch in desc_limited {
assert!(fetch.ranges.len() <= 2, "Expected at most 2 ranges per fetch, got {}", fetch.ranges.len());
}
// Total ranges across all fetches should equal the unlimited total
let total_unlimited: usize = desc_unlimited.iter().map(|f| f.ranges.len()).sum();
let total_limited: usize = desc_limited.iter().map(|f| f.ranges.len()).sum();
assert_eq!(total_unlimited, total_limited, "Total ranges should be preserved");
// Reset for other tests
client.set_max_ranges_per_fetch(usize::MAX);
}
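
The invariants this test checks (no entry exceeds the limit, and the total number of ranges is preserved) follow directly from fixed-size grouping. A minimal sketch of that grouping, assuming a stand-in `RangeSketch` type rather than the crate's real range descriptors; treating a limit of 0 as "no limit" is a choice made here for illustration, not something the commit does:

```rust
/// Hypothetical stand-in for one merged range; not the crate's own type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct RangeSketch {
    start: u64,
    end: u64,
}

/// Groups `ranges` into fetch entries holding at most `max_ranges` each.
/// A limit of 0 is mapped to "no limit" here because `slice::chunks`
/// panics when given a chunk size of 0.
fn group_ranges(ranges: &[RangeSketch], max_ranges: usize) -> Vec<Vec<RangeSketch>> {
    let limit = if max_ranges == 0 { usize::MAX } else { max_ranges };
    ranges.chunks(limit).map(|group| group.to_vec()).collect()
}
```

With five ranges and a limit of 2, this yields groups of sizes 2, 2, and 1, so the entry count grows while the range total stays at five.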
/// Tests that V2 fetch URLs are well-formed.
/// Server-transformed URLs are HTTP(S) and must point at `/fetch_term`; direct-client URLs
/// are URL-safe base64 (no padding) that decodes to a `hash:timestamp:ranges` payload.
async fn test_v2_url_encoding(client: Arc<dyn DirectAccessClient>) {
use base64::Engine;
use base64::engine::general_purpose::URL_SAFE_NO_PAD;
let term_spec = &[(1, (0, 3))];
let file = client.upload_random_file(term_spec, 2048).await.unwrap();
let response = client.get_reconstruction_v2(&file.file_hash, None).await.unwrap().unwrap();
for fetch_entries in response.xorbs.values() {
for fetch in fetch_entries {
assert!(!fetch.url.is_empty(), "URL should not be empty");
if fetch.url.starts_with("http://") || fetch.url.starts_with("https://") {
// Server-transformed URL: should point to fetch_term
assert!(fetch.url.contains("/fetch_term"), "HTTP URL should contain /fetch_term: {}", fetch.url);
} else {
// Direct client URL: should be valid base64
let decoded = URL_SAFE_NO_PAD.decode(&fetch.url);
assert!(decoded.is_ok(), "URL should be valid base64: {}", fetch.url);
let payload = String::from_utf8(decoded.unwrap()).unwrap();
let parts: Vec<&str> = payload.splitn(3, ':').collect();
assert_eq!(parts.len(), 3, "Payload should have 3 colon-separated parts");
let hash = xet_core_structures::merklehash::MerkleHash::from_hex(parts[0]);
assert!(hash.is_ok(), "Hash part should be valid hex");
let ts: std::result::Result<u64, _> = parts[1].parse();
assert!(ts.is_ok(), "Timestamp should be a valid u64");
for range_str in parts[2].split(',').filter(|s| !s.is_empty()) {
let range_parts: Vec<&str> = range_str.split('-').collect();
assert_eq!(range_parts.len(), 2, "Each range should be start-end");
assert!(range_parts[0].parse::<u64>().is_ok());
assert!(range_parts[1].parse::<u64>().is_ok());
}
}
}
}
}
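
The payload layout this test validates, `<hash>:<timestamp>:<start>-<end>,<start>-<end>,...`, can be parsed with the standard library alone once the base64 layer has been removed. The sketch below assumes an already-decoded payload string; `V2PayloadSketch` and `parse_v2_payload` are illustrative names, and the hash check only verifies hex digits rather than calling `MerkleHash::from_hex`. It is not the crate's `parse_v2_fetch_url`.

```rust
/// Parsed form of a decoded direct-client V2 URL payload (illustrative only).
#[derive(Debug, PartialEq, Eq)]
struct V2PayloadSketch {
    hash_hex: String,
    timestamp: u64,
    ranges: Vec<(u64, u64)>,
}

/// Parses `<hash>:<timestamp>:<start>-<end>,...`, returning `None` if any
/// component is malformed.
fn parse_v2_payload(payload: &str) -> Option<V2PayloadSketch> {
    // Split into at most three parts so the range list stays in one piece.
    let mut parts = payload.splitn(3, ':');
    let hash_hex = parts.next()?;
    let timestamp = parts.next()?.parse::<u64>().ok()?;
    let range_list = parts.next()?;

    // Simplified hash validation: non-empty and hex digits only.
    if hash_hex.is_empty() || !hash_hex.chars().all(|c| c.is_ascii_hexdigit()) {
        return None;
    }

    let mut ranges = Vec::new();
    for item in range_list.split(',').filter(|s| !s.is_empty()) {
        let (start, end) = item.split_once('-')?;
        ranges.push((start.parse::<u64>().ok()?, end.parse::<u64>().ok()?));
    }

    Some(V2PayloadSketch {
        hash_hex: hash_hex.to_string(),
        timestamp,
        ranges,
    })
}
```

Like the test, the sketch tolerates an empty range list and rejects any range that is not exactly two integers joined by a hyphen.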

View File

@@ -14,7 +14,9 @@ use xet_core_structures::xorb_object::XorbObject;
use super::super::error::Result;
use super::super::interface::Client;
use crate::cas_types::{FileRange, XorbReconstructionFetchInfo};
use crate::cas_types::{
FileRange, QueryReconstructionResponse, QueryReconstructionResponseV2, XorbReconstructionFetchInfo,
};
/// A Client with direct access to XORB and file storage.
///
@@ -40,6 +42,39 @@ pub trait DirectAccessClient: Client + Send + Sync {
/// Pass `None` to disable the delay.
fn set_api_delay_range(&self, delay_range: Option<Range<Duration>>);
/// Sets the maximum number of byte ranges per `XorbMultiRangeFetch` entry
/// in V2 reconstruction responses.
///
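/// Default is `usize::MAX` (all ranges in one fetch). When set to N,
/// ranges for each xorb are grouped into entries of at most N ranges.
/// N must be non-zero: `LocalClient` groups ranges with `slice::chunks`,
/// which panics on a chunk size of 0.
/// This simulates the CloudFront URL length limit that forces splitting.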
fn set_max_ranges_per_fetch(&self, max_ranges: usize);
/// Disables V2 reconstruction responses with the given HTTP status code.
/// When disabled, the V2 endpoint returns this status, forcing clients to
/// fall back to V1. Pass 0 to re-enable.
fn disable_v2_reconstruction(&self, status_code: u16);
/// Returns the HTTP status code the V2 endpoint should return when disabled,
/// or 0 if V2 is enabled.
fn v2_disabled_status_code(&self) -> u16 {
0
}
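
The disable knob above exists so tests can exercise the client's V2-to-V1 fallback, which per the PR notes triggers on 404 or 501 and is cached so later calls skip the V2 attempt. A rough sketch of that decision logic, assuming a hypothetical `VersionSelector` that is handed the status of a V2 attempt; the real client's types and control flow are not shown in this diff and may differ:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Which reconstruction endpoint a call ends up using.
#[derive(Debug, PartialEq, Eq)]
enum ApiVersion {
    V1,
    V2,
}

/// Hypothetical selector that remembers when V2 was found to be unavailable.
struct VersionSelector {
    v2_unavailable: AtomicBool,
}

impl VersionSelector {
    fn new() -> Self {
        Self { v2_unavailable: AtomicBool::new(false) }
    }

    /// `v2_status` stands in for the HTTP status of a V2 attempt, with 0
    /// meaning "enabled", mirroring `v2_disabled_status_code`.
    fn select(&self, v2_status: u16) -> ApiVersion {
        // A cached negative result skips the V2 attempt entirely.
        if self.v2_unavailable.load(Ordering::Relaxed) {
            return ApiVersion::V1;
        }
        match v2_status {
            // Only "not found" and "not implemented" trigger the fallback.
            404 | 501 => {
                self.v2_unavailable.store(true, Ordering::Relaxed);
                ApiVersion::V1
            }
            // Any other status stays on V2; a real client would surface the error.
            _ => ApiVersion::V2,
        }
    }
}
```

Once a 404 or 501 has been seen, the selector keeps answering `V1` even if the endpoint is re-enabled, which is the caching behavior the PR notes describe.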
/// V1 reconstruction: returns per-range presigned URLs.
async fn get_reconstruction_v1(
&self,
file_id: &MerkleHash,
bytes_range: Option<FileRange>,
) -> Result<Option<QueryReconstructionResponse>>;
/// V2 reconstruction: returns per-xorb multi-range fetch descriptors.
async fn get_reconstruction_v2(
&self,
file_id: &MerkleHash,
bytes_range: Option<FileRange>,
) -> Result<Option<QueryReconstructionResponseV2>>;
/// Applies the configured API delay if set.
///
/// This method sleeps for a random duration within the configured delay range.

View File

@@ -5,14 +5,12 @@ use std::mem::size_of;
use std::ops::Range;
use std::path::{Path, PathBuf};
use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::atomic::{AtomicU16, AtomicU64, AtomicUsize, Ordering};
use anyhow::anyhow;
use async_trait::async_trait;
use bytes::Bytes;
use heed::types::*;
use lazy_static::lazy_static;
use more_asserts::*;
use rand::Rng;
use tempfile::TempDir;
use tokio::time::{Duration, Instant};
@@ -30,25 +28,16 @@ use xet_core_structures::xorb_object::{SerializedXorbObject, XorbObject};
use xet_runtime::file_utils::SafeFileCreator;
use super::direct_access_client::DirectAccessClient;
use super::xorb_utils::{self, REFERENCE_INSTANT};
use crate::cas_client::Client;
use crate::cas_client::adaptive_concurrency::AdaptiveConcurrencyController;
use crate::cas_client::error::{CasClientError, Result};
use crate::cas_client::progress_tracked_streams::ProgressCallback;
use crate::cas_types::{
BatchQueryReconstructionResponse, ChunkRange, FileRange, HexMerkleHash, HttpRange, QueryReconstructionResponse,
XorbReconstructionFetchInfo, XorbReconstructionTerm,
BatchQueryReconstructionResponse, FileRange, HexMerkleHash, HttpRange, QueryReconstructionResponse,
QueryReconstructionResponseV2, XorbMultiRangeFetch, XorbRangeDescriptor, XorbReconstructionFetchInfo,
};
lazy_static! {
/// Reference instant for URL timestamps. Initialized far in the past to allow
/// testing timestamps that are earlier in the current process lifetime.
static ref REFERENCE_INSTANT: Instant = {
let now = Instant::now();
now.checked_sub(Duration::from_secs(365 * 24 * 60 * 60))
.unwrap_or(now)
};
}
pub struct LocalClient {
// Note: Field order matters for Drop! heed::Env must be dropped before _tmp_dir
// because heed holds file handles that need to be closed before the directory is deleted.
@@ -62,6 +51,10 @@ pub struct LocalClient {
url_expiration_ms: AtomicU64,
/// API delay range in milliseconds as (min_ms, max_ms). (0, 0) means disabled.
random_ms_delay_window: (AtomicU64, AtomicU64),
/// Max ranges per XorbMultiRangeFetch entry. usize::MAX means no splitting.
max_ranges_per_fetch: AtomicUsize,
/// HTTP status code to return when V2 is disabled (0 = enabled).
v2_disabled_status: AtomicU16,
_tmp_dir: Option<TempDir>, // Must be last - dropped after heed env is closed
}
@@ -157,6 +150,8 @@ impl LocalClient {
upload_concurrency_controller: AdaptiveConcurrencyController::new_upload("local_uploads"),
url_expiration_ms: AtomicU64::new(u64::MAX),
random_ms_delay_window: (AtomicU64::new(0), AtomicU64::new(0)),
max_ranges_per_fetch: AtomicUsize::new(usize::MAX),
v2_disabled_status: AtomicU16::new(0),
_tmp_dir: tmp_dir, // Must be last - dropped after heed env is closed
})
}
@@ -347,6 +342,34 @@ impl DirectAccessClient for LocalClient {
self.url_expiration_ms.store(expiration.as_millis() as u64, Ordering::Relaxed);
}
fn set_max_ranges_per_fetch(&self, max_ranges: usize) {
self.max_ranges_per_fetch.store(max_ranges, Ordering::Relaxed);
}
fn disable_v2_reconstruction(&self, status_code: u16) {
self.v2_disabled_status.store(status_code, Ordering::Relaxed);
}
fn v2_disabled_status_code(&self) -> u16 {
self.v2_disabled_status.load(Ordering::Relaxed)
}
async fn get_reconstruction_v1(
&self,
file_id: &MerkleHash,
bytes_range: Option<FileRange>,
) -> Result<Option<QueryReconstructionResponse>> {
LocalClient::get_reconstruction_v1(self, file_id, bytes_range).await
}
async fn get_reconstruction_v2(
&self,
file_id: &MerkleHash,
bytes_range: Option<FileRange>,
) -> Result<Option<QueryReconstructionResponseV2>> {
LocalClient::get_reconstruction_v2(self, file_id, bytes_range).await
}
fn set_api_delay_range(&self, delay_range: Option<Range<Duration>>) {
match delay_range {
Some(range) => {
@@ -626,7 +649,126 @@ impl DirectAccessClient for LocalClient {
}
}
/// LocalClient is responsible for writing/reading Xorbs on the local disk.
impl LocalClient {
async fn compute_reconstruction_ranges(
&self,
file_id: &MerkleHash,
bytes_range: Option<FileRange>,
) -> Result<xorb_utils::ReconstructionRangesResult> {
let Some((file_info, _)) = self.shard_manager.get_file_reconstruction_info(file_id).await? else {
return Ok(None);
};
xorb_utils::compute_reconstruction_ranges(&file_info, bytes_range, &mut |hash| self.xorb_footer_sync(hash))
}
fn xorb_footer_sync(&self, hash: &MerkleHash) -> Result<XorbObject> {
let file_path = self.get_path_for_entry(hash);
let mut file = File::open(&file_path).map_err(|_| {
error!("Unable to find file in local CAS {:?}", file_path);
CasClientError::XORBNotFound(*hash)
})?;
XorbObject::deserialize(&mut file).map_err(Into::into)
}
/// V1 reconstruction: returns per-range presigned URLs.
pub async fn get_reconstruction_v1(
&self,
file_id: &MerkleHash,
bytes_range: Option<FileRange>,
) -> Result<Option<QueryReconstructionResponse>> {
self.apply_api_delay().await;
let result = self.compute_reconstruction_ranges(file_id, bytes_range).await?;
let Some((offset_into_first_range, terms, merged_ranges)) = result else {
return Ok(None);
};
if terms.is_empty() {
return Ok(Some(QueryReconstructionResponse {
offset_into_first_range,
terms,
fetch_info: HashMap::new(),
}));
}
let timestamp = Instant::now();
let mut fetch_info: HashMap<HexMerkleHash, Vec<XorbReconstructionFetchInfo>> = HashMap::new();
for (hash, ranges) in merged_ranges {
let file_path = self.get_path_for_entry(&hash);
let entries = ranges
.into_iter()
.map(|r| XorbReconstructionFetchInfo {
range: r.chunk_range,
url: generate_fetch_url(&file_path, &r.byte_range, timestamp),
url_range: HttpRange::from(r.byte_range),
})
.collect();
fetch_info.insert(hash.into(), entries);
}
Ok(Some(QueryReconstructionResponse {
offset_into_first_range,
terms,
fetch_info,
}))
}
/// V2 reconstruction: returns per-xorb multi-range fetch descriptors.
pub async fn get_reconstruction_v2(
&self,
file_id: &MerkleHash,
bytes_range: Option<FileRange>,
) -> Result<Option<QueryReconstructionResponseV2>> {
self.apply_api_delay().await;
let result = self.compute_reconstruction_ranges(file_id, bytes_range).await?;
let Some((offset_into_first_range, terms, merged_ranges)) = result else {
return Ok(None);
};
if terms.is_empty() {
return Ok(Some(QueryReconstructionResponseV2 {
offset_into_first_range,
terms,
xorbs: HashMap::new(),
}));
}
let timestamp = Instant::now();
let max_ranges = self.max_ranges_per_fetch.load(Ordering::Relaxed);
let mut xorbs: HashMap<HexMerkleHash, Vec<XorbMultiRangeFetch>> = HashMap::new();
for (hash, ranges) in merged_ranges {
let mut fetch_entries = Vec::new();
for chunk in ranges.chunks(max_ranges) {
let range_descriptors: Vec<XorbRangeDescriptor> = chunk
.iter()
.map(|r| XorbRangeDescriptor {
chunks: r.chunk_range,
bytes: HttpRange::from(r.byte_range),
})
.collect();
let url = generate_v2_fetch_url(&hash, &range_descriptors, timestamp);
fetch_entries.push(XorbMultiRangeFetch {
url,
ranges: range_descriptors,
});
}
xorbs.insert(hash.into(), fetch_entries);
}
Ok(Some(QueryReconstructionResponseV2 {
offset_into_first_range,
terms,
xorbs,
}))
}
}
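
Both builders above consume `merged_ranges`, which is now produced in `xorb_utils`. The inline code this commit removes built it by sorting each xorb's ranges by chunk start and coalescing any that overlap or touch. A compact sketch of that merge step, assuming an illustrative half-open `ChunkSpan` type instead of the crate's `ChunkRange`, and ignoring the parallel byte ranges:

```rust
/// Illustrative half-open chunk range `[start, end)`; not the crate's `ChunkRange`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ChunkSpan {
    start: u32,
    end: u32,
}

/// Sorts spans by start, then coalesces spans that overlap or are adjacent.
fn merge_spans(mut spans: Vec<ChunkSpan>) -> Vec<ChunkSpan> {
    spans.sort_by_key(|s| s.start);
    let mut merged: Vec<ChunkSpan> = Vec::new();
    for span in spans {
        if let Some(last) = merged.last_mut() {
            // `<=` also merges spans that merely touch, e.g. [0, 2) and [2, 4).
            if span.start <= last.end {
                last.end = last.end.max(span.end);
                continue;
            }
        }
        merged.push(span);
    }
    merged
}
```

For the term spec `(0, 3), (2, 4), (6, 8), (7, 10)` used in the overlapping-range test, this produces two spans, `[0, 4)` and `[6, 10)`.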
#[async_trait]
impl Client for LocalClient {
async fn get_file_reconstruction_info(
@@ -784,196 +926,8 @@ impl Client for LocalClient {
&self,
file_id: &MerkleHash,
bytes_range: Option<FileRange>,
) -> Result<Option<QueryReconstructionResponse>> {
self.apply_api_delay().await;
let Some((file_info, _)) = self.shard_manager.get_file_reconstruction_info(file_id).await? else {
return Ok(None);
};
// Calculate total file size from segments
let total_file_size: u64 = file_info.file_size();
// Handle range validation and truncation
let file_range = if let Some(range) = bytes_range {
// If the entire range is out of bounds, return None (like RemoteClient does for 416)
if range.start >= total_file_size {
// For empty files (size 0), only the first query (start == 0) returns the empty reconstruction.
// All subsequent queries return None to prevent infinite remainder loops.
if total_file_size == 0 && range.start == 0 {
// Empty file - return valid but empty reconstruction
return Ok(Some(QueryReconstructionResponse {
offset_into_first_range: 0,
terms: vec![],
fetch_info: HashMap::new(),
}));
}
return Ok(None);
}
// Truncate end if it extends beyond file size
FileRange::new(range.start, range.end.min(total_file_size))
} else {
// No range specified - handle empty files
if total_file_size == 0 {
return Ok(Some(QueryReconstructionResponse {
offset_into_first_range: 0,
terms: vec![],
fetch_info: HashMap::new(),
}));
}
FileRange::full()
};
// First, skip file segments until we reach the one that contains the start of the file range
let mut s_idx = 0;
let mut cumulative_bytes = 0u64;
let mut first_chunk_byte_start;
loop {
if s_idx >= file_info.segments.len() {
// The requested file range starts beyond the last segment,
// so return a range error.
return Err(CasClientError::InvalidRange);
}
let n = file_info.segments[s_idx].unpacked_segment_bytes as u64;
if cumulative_bytes + n > file_range.start {
assert_ge!(file_range.start, cumulative_bytes);
first_chunk_byte_start = cumulative_bytes;
break;
} else {
cumulative_bytes += n;
s_idx += 1;
}
}
// Now, prepare the response by iterating over the segments and
// adding the terms and fetch info to the response.
let mut terms = Vec::new();
#[derive(Clone)]
struct FetchInfoIntermediate {
chunk_range: ChunkRange,
byte_range: FileRange,
}
let mut fetch_info_map: HashMap<MerkleHash, Vec<FetchInfoIntermediate>> = HashMap::new();
while s_idx < file_info.segments.len() && cumulative_bytes < file_range.end {
let mut segment = file_info.segments[s_idx].clone();
let mut chunk_range = ChunkRange::new(segment.chunk_index_start, segment.chunk_index_end);
// Now get the URL for this segment, which involves reading the actual byte range there.
let xorb_footer = self.xorb_footer(&segment.xorb_hash).await?;
// Do we need to prune the first segment on chunk boundaries to align with the range given?
if cumulative_bytes < file_range.start {
while chunk_range.start < chunk_range.end {
let next_chunk_size = xorb_footer.uncompressed_chunk_length(chunk_range.start)? as u64;
if cumulative_bytes + next_chunk_size <= file_range.start {
cumulative_bytes += next_chunk_size;
first_chunk_byte_start += next_chunk_size;
segment.unpacked_segment_bytes -= next_chunk_size as u32;
chunk_range.start += 1;
// Should find it somewhere in here.
debug_assert_lt!(chunk_range.start, chunk_range.end);
} else {
break;
}
}
}
// Do we need to prune the last segment on chunk boundaries to align with the range given?
if cumulative_bytes + segment.unpacked_segment_bytes as u64 > file_range.end {
while chunk_range.end > chunk_range.start {
let last_chunk_size = xorb_footer.uncompressed_chunk_length(chunk_range.end - 1)?;
if cumulative_bytes + (segment.unpacked_segment_bytes - last_chunk_size) as u64 >= file_range.end {
// We can cut the last chunk off and still contain the requested range.
chunk_range.end -= 1;
segment.unpacked_segment_bytes -= last_chunk_size;
debug_assert_lt!(chunk_range.start, chunk_range.end);
debug_assert_gt!(segment.unpacked_segment_bytes, 0);
} else {
break;
}
}
}
let (byte_start, byte_end) = xorb_footer.get_byte_offset(chunk_range.start, chunk_range.end)?;
let byte_range = FileRange::new(byte_start as u64, byte_end as u64);
let xorb_reconstruction_term = XorbReconstructionTerm {
hash: segment.xorb_hash.into(),
unpacked_length: segment.unpacked_segment_bytes,
range: chunk_range,
};
terms.push(xorb_reconstruction_term);
let fetch_info_intemediate = FetchInfoIntermediate {
chunk_range,
byte_range,
};
fetch_info_map
.entry(segment.xorb_hash)
.or_default()
.push(fetch_info_intemediate);
cumulative_bytes += segment.unpacked_segment_bytes as u64;
s_idx += 1;
}
assert!(!terms.is_empty());
let timestamp = Instant::now();
// Sort and merge adjacent/overlapping ranges in each fetch_info Vec
let mut merged_fetch_info_map: HashMap<HexMerkleHash, Vec<XorbReconstructionFetchInfo>> = HashMap::new();
for (hash, mut fi_vec) in fetch_info_map {
// Sort by chunk_range.start
fi_vec.sort_by_key(|fi| fi.chunk_range.start);
let file_path = self.get_path_for_entry(&hash);
// Merge adjacent or overlapping ranges
let mut merged: Vec<XorbReconstructionFetchInfo> = Vec::new();
let mut idx = 0;
while idx < fi_vec.len() {
// Go through and merge adjacent or overlapping ranges,
// then form the full XorbReconstructionFetchInfo structs.
let mut new_fi = fi_vec[idx].clone();
while idx + 1 < fi_vec.len() {
let next_fi = &fi_vec[idx + 1];
if next_fi.chunk_range.start <= new_fi.chunk_range.end {
new_fi.chunk_range.end = next_fi.chunk_range.end.max(new_fi.chunk_range.end);
new_fi.byte_range.end = next_fi.byte_range.end.max(new_fi.byte_range.end);
idx += 1;
} else {
break;
}
}
merged.push(XorbReconstructionFetchInfo {
range: new_fi.chunk_range,
url: generate_fetch_url(&file_path, &new_fi.byte_range, timestamp),
url_range: HttpRange::from(new_fi.byte_range),
});
idx += 1;
}
merged_fetch_info_map.insert(hash.into(), merged);
}
Ok(Some(QueryReconstructionResponse {
offset_into_first_range: file_range.start - first_chunk_byte_start,
terms,
fetch_info: merged_fetch_info_map,
}))
) -> Result<Option<QueryReconstructionResponseV2>> {
self.get_reconstruction_v2(file_id, bytes_range).await
}
async fn batch_get_reconstruction(&self, file_ids: &[MerkleHash]) -> Result<BatchQueryReconstructionResponse> {
@@ -982,7 +936,7 @@ impl Client for LocalClient {
let mut fetch_info_map: HashMap<HexMerkleHash, Vec<XorbReconstructionFetchInfo>> = HashMap::new();
for file_id in file_ids {
if let Some(response) = self.get_reconstruction(file_id, None).await? {
if let Some(response) = self.get_reconstruction_v1(file_id, None).await? {
let hex_hash: HexMerkleHash = (*file_id).into();
files.insert(hex_hash, response.terms);
@@ -1013,8 +967,14 @@ impl Client for LocalClient {
// Retry loop: try to fetch, and if URL expired, refresh and retry once.
for attempt in 0..2 {
self.apply_api_delay().await;
let (url, range) = url_info.retrieve_url().await?;
let (file_path, _url_byte_range, url_timestamp) = parse_fetch_url(&url)?;
let (url, http_ranges) = url_info.retrieve_url().await?;
let (file_path, url_timestamp) = if let Ok((path, _, ts)) = parse_fetch_url(&url) {
(path, ts)
} else {
let (hash, ts, _) = xorb_utils::parse_v2_fetch_url(&url)?;
(self.get_path_for_entry(&hash), ts)
};
// Check if URL has expired
let expiration_ms = self.url_expiration_ms.load(Ordering::Relaxed);
@@ -1028,34 +988,46 @@ impl Client for LocalClient {
return Err(CasClientError::PresignedUrlExpirationError);
}
// Read the byte range from the file and deserialize
// Read each byte range from the serialized file and deserialize the chunks.
let mut file = File::open(&file_path).map_err(|_| CasClientError::XORBNotFound(MerkleHash::default()))?;
let start = range.start;
let end = range.end + 1; // HttpRange is inclusive end
file.seek(SeekFrom::Start(start))?;
let len = (end - start) as usize;
let mut all_decompressed = Vec::new();
let mut all_chunk_indices = Vec::<u32>::new();
let mut total_transfer = 0u64;
for http_range in &http_ranges {
let len = http_range.length() as usize;
total_transfer += http_range.length();
file.seek(SeekFrom::Start(http_range.start))?;
let mut data = vec![0u8; len];
std::io::Read::read_exact(&mut file, &mut data)?;
// Deserialize the chunks from the raw XORB data
let (decompressed_data, chunk_byte_indices) =
let (decompressed, chunk_indices) =
xet_core_structures::xorb_object::deserialize_chunks(&mut Cursor::new(&data))?;
if let Some(expected) = uncompressed_size_if_known {
debug_assert_eq!(
decompressed_data.len(),
expected,
"get_file_term_data: expected {} bytes, got {}",
expected,
decompressed_data.len()
xet_core_structures::xorb_object::append_chunk_segment(
&mut all_decompressed,
&mut all_chunk_indices,
&decompressed,
&chunk_indices,
);
}
let transfer_len = len as u64;
if let Some(ref cb) = progress_callback {
cb(transfer_len, transfer_len, transfer_len);
if let Some(expected) = uncompressed_size_if_known {
debug_assert_eq!(
all_decompressed.len(),
expected,
"get_file_term_data: expected {} bytes, got {}",
expected,
all_decompressed.len()
);
}
return Ok((Bytes::from(decompressed_data), chunk_byte_indices));
if let Some(ref cb) = progress_callback {
cb(total_transfer, total_transfer, total_transfer);
}
return Ok((Bytes::from(all_decompressed), all_chunk_indices));
}
// Should not reach here, but return error if we do.
@@ -1093,6 +1065,10 @@ fn parse_fetch_url(url: &str) -> Result<(PathBuf, FileRange, Instant)> {
Ok((file_path, byte_range, timestamp))
}
fn generate_v2_fetch_url(hash: &MerkleHash, ranges: &[XorbRangeDescriptor], timestamp: Instant) -> String {
xorb_utils::generate_v2_fetch_url(hash, ranges, timestamp)
}
#[cfg(test)]
mod tests {
use xet_core_structures::xorb_object::xorb_format_test_utils::{
@@ -1102,7 +1078,7 @@ mod tests {
use super::*;
use crate::cas_client::simulation::DeletionControlableClient;
use crate::cas_client::simulation::client_testing_utils::ClientTestingUtils;
use crate::cas_types::XorbReconstructionFetchInfo;
use crate::cas_types::{ChunkRange, XorbReconstructionFetchInfo};
/// Runs the common TestingClient trait test suite for LocalClient.
#[tokio::test]


@@ -32,8 +32,8 @@ use super::super::super::error::CasClientError;
use super::super::super::{DeletionControlableClient, DirectAccessClient};
use super::latency_simulation::{LatencySimulation, ServerLatencyProfile};
use crate::cas_types::{
FileRange, HexKey, HexMerkleHash, UploadShardResponse, UploadShardResponseType, UploadXorbResponse,
XorbReconstructionFetchInfo,
FileRange, HexKey, HexMerkleHash, QueryReconstructionResponseV2, UploadShardResponse, UploadShardResponseType,
UploadXorbResponse, XorbRangeDescriptor, XorbReconstructionFetchInfo,
};
/// Server state passed to all handlers.
@@ -128,27 +128,55 @@ pub(super) fn error_to_response(e: CasClientError) -> Response {
(status, e.to_string()).into_response()
}
/// Encodes term data (file path) into a URL-safe base64 string.
///
/// The term encodes the local file path that the LocalClient uses.
/// This allows the fetch_term endpoint to retrieve the data.
/// Encodes a fetch term for HTTP transport.
///
/// The encoded term contains:
/// - xorb_hash: The XORB hash (hex encoded)
///
/// The byte range to fetch comes from the HTTP Range header, not encoded in the term.
/// Encodes a V1 fetch term for HTTP transport.
/// Contains only the xorb hash; the byte range comes from the HTTP Range header.
fn encode_term(xorb_hash: &MerkleHash) -> String {
URL_SAFE_NO_PAD.encode(xorb_hash.hex().as_bytes())
}
/// Decodes a fetch term back into its components.
///
/// Returns the xorb_hash.
fn decode_term(term: &str) -> Result<MerkleHash, String> {
/// Encodes a V2 fetch term with embedded byte ranges.
/// Format: "{hash_hex}:{start1}-{end1},{start2}-{end2},..."
/// Byte ranges use exclusive end (FileRange convention).
fn encode_term_with_ranges(xorb_hash: &MerkleHash, ranges: &[XorbRangeDescriptor]) -> String {
let ranges_str: Vec<String> = ranges
.iter()
.map(|r| {
let file_range = FileRange::from(r.bytes);
format!("{}-{}", file_range.start, file_range.end)
})
.collect();
let payload = format!("{}:{}", xorb_hash.hex(), ranges_str.join(","));
URL_SAFE_NO_PAD.encode(payload.as_bytes())
}
/// Decoded fetch term: hash and optional byte ranges (exclusive end).
struct DecodedTerm {
hash: MerkleHash,
byte_ranges: Vec<FileRange>,
}
/// Decodes a fetch term. Supports both V1 (hash only) and V2 (hash + ranges).
fn decode_term(term: &str) -> Result<DecodedTerm, String> {
let bytes = URL_SAFE_NO_PAD.decode(term).map_err(|e| format!("Invalid base64: {e}"))?;
let hash_hex = String::from_utf8(bytes).map_err(|e| format!("Invalid UTF-8: {e}"))?;
MerkleHash::from_hex(&hash_hex).map_err(|e| format!("Invalid hash: {e}"))
let payload = String::from_utf8(bytes).map_err(|e| format!("Invalid UTF-8: {e}"))?;
if let Some((hash_hex, ranges_str)) = payload.split_once(':') {
let hash = MerkleHash::from_hex(hash_hex).map_err(|e| format!("Invalid hash: {e}"))?;
let mut byte_ranges = Vec::new();
for r in ranges_str.split(',').filter(|s| !s.is_empty()) {
let (start_s, end_s) = r.split_once('-').ok_or("Invalid range syntax")?;
let start: u64 = start_s.parse().map_err(|e| format!("Invalid range start: {e}"))?;
let end: u64 = end_s.parse().map_err(|e| format!("Invalid range end: {e}"))?;
byte_ranges.push(FileRange::new(start, end));
}
Ok(DecodedTerm { hash, byte_ranges })
} else {
let hash = MerkleHash::from_hex(&payload).map_err(|e| format!("Invalid hash: {e}"))?;
Ok(DecodedTerm {
hash,
byte_ranges: vec![],
})
}
}
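
The V2 term format above can be sketched without the base64 layer. This is a minimal illustration under stated assumptions, not the code from this PR: `ByteRange`, `encode_payload`, and `decode_payload` are hypothetical names, the hash is treated as an opaque hex string, and ranges are passed in as inclusive-end pairs (the `HttpRange` convention) and written out with an exclusive end (the `FileRange` convention).

```rust
/// Half-open byte range [start, end), mirroring the FileRange convention.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ByteRange {
    start: u64,
    end: u64,
}

/// Builds the payload "{hash_hex}:{start}-{end},..." from inclusive-end ranges.
/// Assumes every inclusive end is below u64::MAX.
fn encode_payload(hash_hex: &str, inclusive_ranges: &[(u64, u64)]) -> String {
    let parts: Vec<String> = inclusive_ranges
        .iter()
        // Convert the inclusive end to an exclusive end before formatting.
        .map(|&(start, end_inclusive)| format!("{}-{}", start, end_inclusive + 1))
        .collect();
    format!("{}:{}", hash_hex, parts.join(","))
}

/// Parses a V1 payload (hash only) or a V2 payload (hash plus ranges).
fn decode_payload(payload: &str) -> Result<(String, Vec<ByteRange>), String> {
    // No ':' separator means a V1 payload: the whole string is the hash.
    let Some((hash_hex, ranges_str)) = payload.split_once(':') else {
        return Ok((payload.to_string(), Vec::new()));
    };
    let mut ranges = Vec::new();
    for part in ranges_str.split(',').filter(|s| !s.is_empty()) {
        let (start_s, end_s) = part
            .split_once('-')
            .ok_or_else(|| format!("Invalid range syntax: {part}"))?;
        let start: u64 = start_s.parse().map_err(|e| format!("Invalid range start: {e}"))?;
        let end: u64 = end_s.parse().map_err(|e| format!("Invalid range end: {e}"))?;
        ranges.push(ByteRange { start, end });
    }
    Ok((hash_hex.to_string(), ranges))
}
```

Because a hex hash never contains `:`, the presence of the separator is enough to tell the two term versions apart, which is what lets one decoder serve both.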
/// Extracts the base URL from request headers (Host header).
@@ -220,7 +248,7 @@ pub async fn get_reconstruction(
Err((status, msg)) => return (status, msg).into_response(),
};
match state.client.get_reconstruction(&file_id, range).await {
match state.client.get_reconstruction_v1(&file_id, range).await {
Ok(Some(mut response)) => {
transform_fetch_info_urls(&mut response.fetch_info, &base_url);
Json(response).into_response()
@@ -230,6 +258,78 @@ pub async fn get_reconstruction(
}
}
/// GET /v2/reconstructions/{file_id}
///
/// Returns V2 reconstruction information for a file, including:
/// - List of terms (chunks) needed to reconstruct the file
/// - Per-xorb fetch descriptors with multi-range URLs
///
/// Supports Range header for partial file reconstruction.
/// URLs in the response point to the /v1/fetch_term endpoint.
pub async fn get_reconstruction_v2(
State(state): State<ServerState>,
Path(HexMerkleHash(file_id)): Path<HexMerkleHash>,
headers: HeaderMap,
) -> Response {
let connection_guard = state.latency_simulation.register_connection().await;
if let Some(simulated_error) = connection_guard.simulate_error() {
return simulated_error;
}
// Allow testing V1 fallback by simulating V2 endpoint unavailability.
let disabled_status = state.client.v2_disabled_status_code();
if disabled_status != 0 {
let code = StatusCode::from_u16(disabled_status).unwrap_or(StatusCode::NOT_FOUND);
return (code, "V2 reconstruction endpoint disabled").into_response();
}
let base_url = get_base_url(&headers);
let range = match parse_range_header(headers.get(RANGE)) {
Ok(Some(FileRangeVariant::Normal(range))) => Some(range),
Ok(Some(FileRangeVariant::OpenRHS(start))) => {
let file_size = match state.client.get_file_size(&file_id).await {
Ok(size) => size,
Err(e) => return error_to_response(e),
};
Some(FileRange::new(start, file_size))
},
Ok(Some(FileRangeVariant::Suffix(suffix))) => {
let file_size = match state.client.get_file_size(&file_id).await {
Ok(size) => size,
Err(e) => return error_to_response(e),
};
Some(FileRange::new(file_size.saturating_sub(suffix), file_size))
},
Ok(None) => None,
Err((status, msg)) => return (status, msg).into_response(),
};
match state.client.get_reconstruction_v2(&file_id, range).await {
Ok(Some(mut response)) => {
transform_v2_xorb_urls(&mut response, &base_url);
Json(response).into_response()
},
Ok(None) => (StatusCode::RANGE_NOT_SATISFIABLE, "Range not satisfiable").into_response(),
Err(e) => error_to_response(e),
}
}
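
The Range header handling in this handler reduces to mapping three request shapes onto one half-open byte range. A small sketch of that mapping follows, assuming a simplified `RangeVariant` enum (a hypothetical stand-in for `FileRangeVariant`, whose real definition is not shown in this diff) and a file size that is already known.

```rust
/// Simplified stand-in for the three Range header shapes handled above.
/// Hypothetical type; the real FileRangeVariant may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RangeVariant {
    /// An explicit range, already expressed as half-open [start, end).
    Normal(u64, u64),
    /// "bytes=start-": from `start` to the end of the file.
    OpenRhs(u64),
    /// "bytes=-n": the last `n` bytes of the file.
    Suffix(u64),
}

/// Resolves a variant to a half-open [start, end) range over the file.
fn resolve_range(variant: RangeVariant, file_size: u64) -> (u64, u64) {
    match variant {
        RangeVariant::Normal(start, end) => (start, end),
        RangeVariant::OpenRhs(start) => (start, file_size),
        // saturating_sub clamps to 0 when the suffix is longer than the file.
        RangeVariant::Suffix(n) => (file_size.saturating_sub(n), file_size),
    }
}
```

For a 1000-byte file, `bytes=100-` resolves to `(100, 1000)` and `bytes=-128` resolves to `(872, 1000)`; a suffix longer than the file yields the whole file rather than underflowing.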
/// Transforms V2 xorb URLs from client-internal format to HTTP URLs.
///
/// Each `XorbMultiRangeFetch` URL is replaced with an HTTP URL pointing
/// to the /v1/fetch_term endpoint. The byte ranges from the V2 response
/// are encoded into the term so the endpoint can serve all ranges in one request.
fn transform_v2_xorb_urls(response: &mut QueryReconstructionResponseV2, base_url: &str) {
for (xorb_hash, fetch_entries) in response.xorbs.iter_mut() {
let xorb_hash: MerkleHash = (*xorb_hash).into();
for fetch in fetch_entries.iter_mut() {
let encoded_term = encode_term_with_ranges(&xorb_hash, &fetch.ranges);
fetch.url = format!("{base_url}/v1/fetch_term?term={encoded_term}");
}
}
}
/// GET /reconstructions?file_id=...&file_id=...
///
/// Batch query for reconstruction information for multiple files using query parameters.
@@ -285,10 +385,12 @@ pub async fn batch_get_reconstruction(
/// GET /v1/fetch_term?term=<base64_encoded_term>
///
/// Fetches raw XORB data based on an encoded term.
/// The term contains the xorb hash. The byte range is specified via HTTP Range header.
///
/// This endpoint is called by RemoteClient when fetching reconstruction terms.
/// It returns raw (compressed) bytes that the client will decompress.
/// For V1 terms (hash only), the byte range comes from the HTTP Range header.
/// For V2 terms (hash + ranges), all encoded byte ranges are fetched and
/// concatenated in order, allowing a single request to serve multi-range blocks.
///
/// Returns raw (compressed) bytes that the client will decompress.
pub async fn fetch_term(State(state): State<ServerState>, uri: axum::http::Uri, headers: HeaderMap) -> Response {
let connection_guard = state.latency_simulation.register_connection().await;
if let Some(simulated_error) = connection_guard.simulate_error() {
@@ -304,13 +406,69 @@ pub async fn fetch_term(State(state): State<ServerState>, uri: axum::http::Uri,
return (StatusCode::BAD_REQUEST, "Missing 'term' query parameter").into_response();
};
let xorb_hash = match decode_term(&term) {
Ok(h) => h,
let decoded = match decode_term(&term) {
Ok(d) => d,
Err(e) => return (StatusCode::BAD_REQUEST, format!("Invalid term: {e}")).into_response(),
};
// Get total length of the raw XORB data for Range header handling
let total_length = match state.client.xorb_raw_length(&xorb_hash).await {
if !decoded.byte_ranges.is_empty() {
// If the client sends a single-range HTTP Range header, serve just that range.
// This simulates S3/CDN behavior where the Range header controls the response
// regardless of what ranges are encoded in the presigned URL. This is the
// common path when ranges are split into single-range requests based on
// the multirange thresholds (V2 URLs with individual requests).
if let Ok(Some(FileRangeVariant::Normal(range))) = parse_range_header(headers.get(RANGE)) {
return match state.client.get_xorb_raw_bytes(&decoded.hash, Some(range)).await {
Ok(data) => (StatusCode::PARTIAL_CONTENT, data).into_response(),
Err(e) => error_to_response(e),
};
}
if decoded.byte_ranges.len() == 1 {
let range = &decoded.byte_ranges[0];
return match state.client.get_xorb_raw_bytes(&decoded.hash, Some(*range)).await {
Ok(data) => (StatusCode::PARTIAL_CONTENT, data).into_response(),
Err(e) => error_to_response(e),
};
}
// Multiple ranges with no Range header override: return a multipart/byteranges
// response (RFC 7233 Section 4.1), matching S3/CloudFront multi-range format.
let total_length = match state.client.xorb_raw_length(&decoded.hash).await {
Ok(len) => len,
Err(e) => return error_to_response(e),
};
let boundary = "xet_multipart_boundary";
let mut response_body = Vec::new();
for range in &decoded.byte_ranges {
let data = match state.client.get_xorb_raw_bytes(&decoded.hash, Some(*range)).await {
Ok(d) => d,
Err(e) => return error_to_response(e),
};
// FileRange uses exclusive end; Content-Range header uses inclusive end.
let inclusive_end = range.end.saturating_sub(1);
let part_header = format!(
"--{boundary}\r\nContent-Type: application/octet-stream\r\nContent-Range: bytes {}-{}/{total_length}\r\n\r\n",
range.start, inclusive_end
);
response_body.extend_from_slice(part_header.as_bytes());
response_body.extend_from_slice(&data);
response_body.extend_from_slice(b"\r\n");
}
response_body.extend_from_slice(format!("--{boundary}--\r\n").as_bytes());
let content_type = format!("multipart/byteranges; boundary={boundary}");
let mut headers = HeaderMap::new();
headers.insert(http::header::CONTENT_TYPE, HeaderValue::from_str(&content_type).unwrap());
return (StatusCode::PARTIAL_CONTENT, headers, Bytes::from(response_body)).into_response();
}
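
The multipart assembly above can be tried in isolation with an in-memory buffer. The sketch below assumes a hypothetical helper `build_multipart_body` that slices a byte buffer directly instead of calling `get_xorb_raw_bytes`; the part layout follows the handler's format string, with ranges given as half-open `[start, end)` pairs.

```rust
/// Builds a multipart/byteranges body from half-open [start, end) ranges over `data`.
/// Returns None if any range is empty or extends past the end of `data`.
/// Hypothetical helper for illustration; not part of this PR.
fn build_multipart_body(data: &[u8], ranges: &[(usize, usize)], boundary: &str) -> Option<Vec<u8>> {
    let total_length = data.len();
    let mut body = Vec::new();
    for &(start, end) in ranges {
        if start >= end || end > total_length {
            return None;
        }
        // Content-Range uses an inclusive end, so subtract one from the exclusive end.
        let part_header = format!(
            "--{boundary}\r\nContent-Type: application/octet-stream\r\nContent-Range: bytes {}-{}/{total_length}\r\n\r\n",
            start,
            end - 1
        );
        body.extend_from_slice(part_header.as_bytes());
        body.extend_from_slice(&data[start..end]);
        // Each part's payload is terminated by CRLF before the next boundary.
        body.extend_from_slice(b"\r\n");
    }
    // The closing delimiter carries a trailing "--".
    body.extend_from_slice(format!("--{boundary}--\r\n").as_bytes());
    Some(body)
}
```

A client parsing this body has to split on the boundary rather than on byte counts, since the per-part headers vary in length with the range values.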
// V1 term: byte range comes from the HTTP Range header.
// Get total length of the raw XORB data for Range header handling.
let total_length = match state.client.xorb_raw_length(&decoded.hash).await {
Ok(len) => len,
Err(e) => return error_to_response(e),
};
@@ -327,7 +485,7 @@ pub async fn fetch_term(State(state): State<ServerState>, uri: axum::http::Uri,
};
// Fetch raw (serialized/compressed) bytes from the XORB
match state.client.get_xorb_raw_bytes(&xorb_hash, byte_range).await {
match state.client.get_xorb_raw_bytes(&decoded.hash, byte_range).await {
Ok(data) => (StatusCode::OK, data).into_response(),
Err(e) => error_to_response(e),
}
@@ -713,9 +871,33 @@ mod tests {
let xorb_hash = MerkleHash::from_hex(&format!("{:0>64}", "abc123")).unwrap();
let encoded = encode_term(&xorb_hash);
let decoded_hash = decode_term(&encoded).unwrap();
let decoded = decode_term(&encoded).unwrap();
assert_eq!(decoded.hash, xorb_hash);
assert!(decoded.byte_ranges.is_empty());
}
assert_eq!(decoded_hash, xorb_hash);
#[test]
fn test_encode_decode_term_with_ranges() {
use crate::cas_types::{ChunkRange, HttpRange, XorbRangeDescriptor};
let xorb_hash = MerkleHash::from_hex(&format!("{:0>64}", "abc123")).unwrap();
let ranges = vec![
XorbRangeDescriptor {
chunks: ChunkRange::new(0, 3),
bytes: HttpRange::new(0, 1023),
},
XorbRangeDescriptor {
chunks: ChunkRange::new(5, 8),
bytes: HttpRange::new(2048, 4095),
},
];
let encoded = encode_term_with_ranges(&xorb_hash, &ranges);
let decoded = decode_term(&encoded).unwrap();
assert_eq!(decoded.hash, xorb_hash);
assert_eq!(decoded.byte_ranges.len(), 2);
assert_eq!(decoded.byte_ranges[0], FileRange::new(0, 1024));
assert_eq!(decoded.byte_ranges[1], FileRange::new(2048, 4096));
}
#[test]


@@ -177,6 +177,7 @@ impl LocalServer {
.route("/get_xorb/{prefix}/{hash}/", get(handlers::get_file_term_data))
.route("/fetch_term", get(handlers::fetch_term)),
)
.nest("/v2", Router::new().route("/reconstructions/{file_id}", get(handlers::get_reconstruction_v2)))
.nest(
"/simulation",
super::simulation_handlers::simulation_routes()
@@ -425,7 +426,7 @@ impl Client for LocalTestServer {
&self,
file_id: &xet_core_structures::merklehash::MerkleHash,
bytes_range: Option<crate::cas_types::FileRange>,
) -> Result<Option<crate::cas_types::QueryReconstructionResponse>> {
) -> Result<Option<crate::cas_types::QueryReconstructionResponseV2>> {
self.remote_client.get_reconstruction(file_id, bytes_range).await
}
@@ -492,6 +493,34 @@ impl DirectAccessClient for LocalTestServer {
self.client.set_fetch_term_url_expiration(expiration);
}
fn set_max_ranges_per_fetch(&self, max_ranges: usize) {
self.client.set_max_ranges_per_fetch(max_ranges);
}
fn disable_v2_reconstruction(&self, status_code: u16) {
self.client.disable_v2_reconstruction(status_code);
}
fn v2_disabled_status_code(&self) -> u16 {
self.client.v2_disabled_status_code()
}
async fn get_reconstruction_v1(
&self,
file_id: &xet_core_structures::merklehash::MerkleHash,
bytes_range: Option<crate::cas_types::FileRange>,
) -> Result<Option<crate::cas_types::QueryReconstructionResponse>> {
self.remote_client.get_reconstruction_v1(file_id, bytes_range).await
}
async fn get_reconstruction_v2(
&self,
file_id: &xet_core_structures::merklehash::MerkleHash,
bytes_range: Option<crate::cas_types::FileRange>,
) -> Result<Option<crate::cas_types::QueryReconstructionResponseV2>> {
self.remote_client.get_reconstruction_v2(file_id, bytes_range).await
}
fn set_api_delay_range(&self, delay_range: Option<std::ops::Range<std::time::Duration>>) {
self.client.set_api_delay_range(delay_range);
}
@@ -588,7 +617,7 @@ mod tests {
use crate::cas_client::simulation::client_testing_utils::ClientTestingUtils;
use crate::cas_client::simulation::local_server::SimulationControlClient;
use crate::cas_client::simulation::{DeletionControlableClient, DirectAccessClient};
use crate::cas_types::FileRange;
use crate::cas_types::{FileRange, QueryReconstructionResponseV2};
const CHUNK_SIZE: usize = 123;
@@ -604,16 +633,16 @@ mod tests {
let local_data = server.client().get_file_data(&file.file_hash, None).await.unwrap();
assert_eq!(file.data, local_data);
// Full file reconstruction - compare remote and local
// Full file reconstruction - compare remote and local (V1)
let remote_recon = server
.remote_client()
.get_reconstruction(&file.file_hash, None)
.get_reconstruction_v1(&file.file_hash, None)
.await
.unwrap()
.unwrap();
let local_recon = server
.client()
.get_reconstruction(&file.file_hash, None)
.get_reconstruction_v1(&file.file_hash, None)
.await
.unwrap()
.unwrap();
@@ -629,7 +658,7 @@ mod tests {
let range = FileRange::new(file_size / 4, file_size * 3 / 4);
let range_recon = server
.remote_client()
.get_reconstruction(&file.file_hash, Some(range))
.get_reconstruction_v1(&file.file_hash, Some(range))
.await
.unwrap();
assert!(range_recon.is_some());
@@ -639,7 +668,7 @@ mod tests {
let multi_file = server.client().upload_random_file(term_spec, CHUNK_SIZE).await.unwrap();
let multi_recon = server
.remote_client()
.get_reconstruction(&multi_file.file_hash, None)
.get_reconstruction_v1(&multi_file.file_hash, None)
.await
.unwrap()
.unwrap();
@@ -750,7 +779,7 @@ mod tests {
// Verify single XORB URLs are HTTP
let recon1 = server
.remote_client()
.get_reconstruction(&file1.file_hash, None)
.get_reconstruction_v1(&file1.file_hash, None)
.await
.unwrap()
.unwrap();
@@ -770,7 +799,7 @@ mod tests {
// Verify multi-XORB file has HTTP URLs for all XORBs
let multi_recon = server
.remote_client()
.get_reconstruction(&multi_file.file_hash, None)
.get_reconstruction_v1(&multi_file.file_hash, None)
.await
.unwrap()
.unwrap();
@@ -786,7 +815,7 @@ mod tests {
let range = FileRange::new(file_size / 4, file_size * 3 / 4);
let range_recon = server
.remote_client()
.get_reconstruction(&multi_file.file_hash, Some(range))
.get_reconstruction_v1(&multi_file.file_hash, Some(range))
.await
.unwrap()
.unwrap();
@@ -817,7 +846,7 @@ mod tests {
// Get reconstruction via remote client
let recon = server
.remote_client()
.get_reconstruction(&file.file_hash, None)
.get_reconstruction_v1(&file.file_hash, None)
.await
.unwrap()
.unwrap();
@@ -841,7 +870,7 @@ mod tests {
// Get reconstruction
let recon = server
.remote_client()
.get_reconstruction(&file.file_hash, None)
.get_reconstruction_v1(&file.file_hash, None)
.await
.unwrap()
.unwrap();
@@ -906,6 +935,241 @@ mod tests {
}
}
/// Tests V2 reconstruction endpoint returns valid responses through the server.
async fn check_v2_reconstruction(server: &LocalTestServer) {
let file = server.client().upload_random_file(&[(1, (0, 5))], CHUNK_SIZE).await.unwrap();
// Query V2 endpoint via remote client
let v2 = server
.remote_client()
.get_reconstruction_v2(&file.file_hash, None)
.await
.unwrap()
.unwrap();
assert!(!v2.terms.is_empty());
assert!(!v2.xorbs.is_empty());
assert_eq!(v2.offset_into_first_range, 0);
// V2 URLs should be HTTP URLs pointing to /v1/fetch_term
for fetch_entries in v2.xorbs.values() {
for fetch in fetch_entries {
assert!(fetch.url.starts_with("http://"), "V2 URL should be HTTP, got: {}", fetch.url);
assert!(
fetch.url.contains("/v1/fetch_term?term="),
"V2 URL should point to fetch_term endpoint, got: {}",
fetch.url
);
}
}
// V2 terms should match V1 terms
let v1 = server
.remote_client()
.get_reconstruction_v1(&file.file_hash, None)
.await
.unwrap()
.unwrap();
assert_eq!(v1.terms.len(), v2.terms.len());
assert_eq!(v1.offset_into_first_range, v2.offset_into_first_range);
for (t1, t2) in v1.terms.iter().zip(v2.terms.iter()) {
assert_eq!(t1.hash, t2.hash);
assert_eq!(t1.range, t2.range);
}
}
/// Tests V2 fetch URLs are fetchable via the /v1/fetch_term endpoint.
async fn check_v2_url_transformation(server: &LocalTestServer) {
let http_client = reqwest::Client::new();
let file = server
.client()
.upload_random_file(&[(1, (0, 3)), (2, (0, 2))], CHUNK_SIZE)
.await
.unwrap();
let v2 = server
.remote_client()
.get_reconstruction_v2(&file.file_hash, None)
.await
.unwrap()
.unwrap();
for fetch_entries in v2.xorbs.values() {
for fetch in fetch_entries {
let response = http_client.get(&fetch.url).send().await.unwrap();
assert!(
response.status().is_success(),
"V2 fetch URL should be fetchable: {} (status: {})",
fetch.url,
response.status()
);
let data = response.bytes().await.unwrap();
assert!(!data.is_empty(), "Fetched data should not be empty");
}
}
}
/// Tests V2 with range requests through the server.
async fn check_v2_range_reconstruction(server: &LocalTestServer) {
let term_spec = &[(1, (0, 3)), (2, (0, 2)), (1, (3, 5))];
let file = server.client().upload_random_file(term_spec, CHUNK_SIZE).await.unwrap();
let file_size = file.data.len() as u64;
let range = FileRange::new(file_size / 4, file_size * 3 / 4);
let v2 = server
.remote_client()
.get_reconstruction_v2(&file.file_hash, Some(range))
.await
.unwrap()
.unwrap();
assert!(!v2.terms.is_empty());
for fetch_entries in v2.xorbs.values() {
for fetch in fetch_entries {
assert!(fetch.url.starts_with("http://"));
}
}
// Validate open-ended and suffix range variants through the V2 HTTP endpoint.
let v2_url = format!("{}/v2/reconstructions/{}", server.endpoint(), file.file_hash.hex());
let http_client = reqwest::Client::new();
let open_rhs: QueryReconstructionResponseV2 = http_client
.get(&v2_url)
.header(reqwest::header::RANGE, "bytes=100-")
.send()
.await
.unwrap()
.error_for_status()
.unwrap()
.json()
.await
.unwrap();
assert!(!open_rhs.terms.is_empty());
let suffix: QueryReconstructionResponseV2 = http_client
.get(&v2_url)
.header(reqwest::header::RANGE, "bytes=-128")
.send()
.await
.unwrap()
.error_for_status()
.unwrap()
.json()
.await
.unwrap();
assert!(!suffix.terms.is_empty());
}
/// Tests V2 max_ranges_per_fetch through the server.
async fn check_v2_max_ranges(server: &LocalTestServer) {
let term_spec = &[(1, (0, 2)), (2, (0, 1)), (1, (2, 4)), (2, (1, 2)), (1, (4, 6))];
let file = server.client().upload_random_file(term_spec, 512).await.unwrap();
// Set max_ranges_per_fetch to 1
server.set_max_ranges_per_fetch(1);
let v2 = server
.client()
.get_reconstruction_v2(&file.file_hash, None)
.await
.unwrap()
.unwrap();
let xorb1_hash: crate::cas_types::HexMerkleHash = file.terms[0].xorb_hash.into();
if let Some(desc) = v2.xorbs.get(&xorb1_hash) {
for fetch in desc {
assert!(fetch.ranges.len() <= 1, "Each fetch should have at most 1 range, got {}", fetch.ranges.len());
}
}
// Reset
server.set_max_ranges_per_fetch(usize::MAX);
}
/// Verifies that disabling V2 with various status codes causes the V2 endpoint
/// to return that code, and that get_reconstruction falls back to V1.
async fn check_v2_disabled_fallback(server: &LocalTestServer) {
let file = server
.remote_client()
.upload_random_file(&[(1, (0, 3)), (2, (0, 2))], CHUNK_SIZE)
.await
.unwrap();
// V2 should work before disabling.
let v2_result = server.remote_client().get_reconstruction_v2(&file.file_hash, None).await;
assert!(v2_result.is_ok());
// Test 501 (Not Implemented) fallback first, before the RemoteClient
// caches a V1 preference from a 404 fallback.
server.disable_v2_reconstruction(501);
let v2_result = server.remote_client().get_reconstruction_v2(&file.file_hash, None).await;
assert!(v2_result.is_err(), "V2 should return error when disabled with 501");
// Forced V2 should surface the endpoint error directly with no fallback.
let forced_v2 = server
.remote_client()
.get_reconstruction_with_version_override(&file.file_hash, None, Some(2))
.await;
assert!(forced_v2.is_err());
assert_eq!(forced_v2.unwrap_err().status(), Some(reqwest::StatusCode::NOT_IMPLEMENTED));
// Forced V1 should continue to succeed when V2 is disabled.
let forced_v1 = server
.remote_client()
.get_reconstruction_with_version_override(&file.file_hash, None, Some(1))
.await
.unwrap()
.unwrap();
assert_eq!(forced_v1.terms.len(), 2);
let result = server
.remote_client()
.get_reconstruction(&file.file_hash, None)
.await
.unwrap()
.unwrap();
assert_eq!(result.terms.len(), 2);
// Re-enable V2, then test 404 fallback.
server.disable_v2_reconstruction(0);
// Reset the RemoteClient's cached version by making a successful V2 call.
let v2_result = server.remote_client().get_reconstruction_v2(&file.file_hash, None).await;
assert!(v2_result.is_ok(), "V2 should work again after re-enabling");
server.disable_v2_reconstruction(404);
let v2_result = server.remote_client().get_reconstruction_v2(&file.file_hash, None).await;
assert!(v2_result.is_err(), "V2 should return error when disabled with 404");
let forced_v2 = server
.remote_client()
.get_reconstruction_with_version_override(&file.file_hash, None, Some(2))
.await;
assert!(forced_v2.is_err());
assert_eq!(forced_v2.unwrap_err().status(), Some(reqwest::StatusCode::NOT_FOUND));
let forced_v1 = server
.remote_client()
.get_reconstruction_with_version_override(&file.file_hash, None, Some(1))
.await
.unwrap()
.unwrap();
assert_eq!(forced_v1.terms.len(), 2);
let result = server
.remote_client()
.get_reconstruction(&file.file_hash, None)
.await
.unwrap()
.unwrap();
assert_eq!(result.terms.len(), 2);
}
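
The fallback behavior this test exercises (try V2, fall back to V1 on 404 or 501, remember the result) can be modeled in a few lines. This is a simplified sketch under stated assumptions, not the `RemoteClient` implementation, which is not shown in this diff: `VersionCache` and `fetch` are hypothetical names, errors are bare HTTP status codes, and the cache is never reset once V1 is chosen.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

/// Hypothetical stand-in for a client's cached API version preference.
struct VersionCache {
    /// 0 = undecided, 1 = V1 preferred (V2 known to be unavailable).
    preferred: AtomicU8,
}

impl VersionCache {
    fn new() -> Self {
        Self { preferred: AtomicU8::new(0) }
    }

    /// Tries V2 unless V1 is cached. On 404 or 501 it records the V1 preference
    /// and falls back; any other error is returned unchanged.
    /// Returns the version that served the request together with the payload.
    fn fetch<T>(
        &self,
        try_v2: impl FnOnce() -> Result<T, u16>,
        try_v1: impl FnOnce() -> Result<T, u16>,
    ) -> Result<(u8, T), u16> {
        if self.preferred.load(Ordering::Relaxed) != 1 {
            match try_v2() {
                Ok(value) => return Ok((2, value)),
                // Endpoint missing or not implemented: remember to skip V2 next time.
                Err(404) | Err(501) => self.preferred.store(1, Ordering::Relaxed),
                Err(other) => return Err(other),
            }
        }
        try_v1().map(|value| (1, value))
    }
}
```

Restricting the fallback to 404 and 501 keeps transient failures such as a 500 from permanently downgrading the client to V1.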
/// Runs all server checks for a given test server instance.
async fn run_all_server_checks(server: &LocalTestServer) {
check_basic_correctness(server).await;
@@ -915,6 +1179,11 @@ mod tests {
check_downloaded_terms_match_expected_data(server).await;
check_complete_file_reconstruction(server).await;
check_chunk_hashes_correctness(server).await;
check_v2_reconstruction(server).await;
check_v2_url_transformation(server).await;
check_v2_range_reconstruction(server).await;
check_v2_max_ranges(server).await;
check_v2_disabled_fallback(server).await;
}
async fn all_file_hashes(client: &LocalClient) -> HashSet<MerkleHash> {


@@ -17,7 +17,7 @@ use crate::cas_client::RemoteClient;
use crate::cas_client::error::{CasClientError, Result};
use crate::cas_client::interface::Client;
use crate::cas_client::simulation::{DeletionControlableClient, DirectAccessClient};
use crate::cas_types::{FileRange, HexMerkleHash, XorbReconstructionFetchInfo};
use crate::cas_types::{FileRange, HexMerkleHash, QueryReconstructionResponseV2, XorbReconstructionFetchInfo};
/// A client that connects to a `LocalTestServer` via HTTP and provides access
/// to both `DirectAccessClient` and `DeletionControlableClient` operations
@@ -91,7 +91,7 @@ impl Client for SimulationControlClient {
&self,
file_id: &MerkleHash,
bytes_range: Option<FileRange>,
) -> Result<Option<crate::cas_types::QueryReconstructionResponse>> {
) -> Result<Option<QueryReconstructionResponseV2>> {
self.remote_client.get_reconstruction(file_id, bytes_range).await
}
@@ -172,6 +172,30 @@ impl DirectAccessClient for SimulationControlClient {
// No-op: delays are applied server-side via set_api_delay_range
}
fn set_max_ranges_per_fetch(&self, _max_ranges: usize) {
// No-op: SimulationControlClient configures server via HTTP; endpoint not yet implemented.
}
fn disable_v2_reconstruction(&self, _status_code: u16) {
// No-op: SimulationControlClient configures server via HTTP; endpoint not yet implemented.
}
async fn get_reconstruction_v1(
&self,
file_id: &MerkleHash,
bytes_range: Option<FileRange>,
) -> Result<Option<crate::cas_types::QueryReconstructionResponse>> {
self.remote_client.get_reconstruction_v1(file_id, bytes_range).await
}
async fn get_reconstruction_v2(
&self,
file_id: &MerkleHash,
bytes_range: Option<FileRange>,
) -> Result<Option<QueryReconstructionResponseV2>> {
self.remote_client.get_reconstruction_v2(file_id, bytes_range).await
}
/// Sets the API delay range via the `/simulation/config/api_delay` endpoint.
fn set_api_delay_range(&self, delay_range: Option<Range<Duration>>) {
let url = self.sim_url("/config/api_delay");


@@ -2,11 +2,10 @@ use std::collections::HashMap;
use std::io::{BufReader, Cursor};
use std::ops::Range;
use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::atomic::{AtomicU16, AtomicU64, AtomicUsize, Ordering};
use async_trait::async_trait;
use bytes::Bytes;
use more_asserts::{assert_ge, assert_gt, debug_assert_lt};
use rand::Rng;
use tokio::sync::RwLock;
use tokio::time::{Duration, Instant};
@@ -26,21 +25,12 @@ use super::super::progress_tracked_streams::ProgressCallback;
use super::client_testing_utils::{FileTermReference, RandomFileContents};
use super::direct_access_client::DirectAccessClient;
use super::random_xorb::RandomXorb;
use super::xorb_utils::{self, REFERENCE_INSTANT};
use crate::cas_types::{
BatchQueryReconstructionResponse, ChunkRange, FileRange, HexMerkleHash, HttpRange, QueryReconstructionResponse,
XorbReconstructionFetchInfo, XorbReconstructionTerm,
BatchQueryReconstructionResponse, FileRange, HexMerkleHash, HttpRange, QueryReconstructionResponse,
QueryReconstructionResponseV2, XorbMultiRangeFetch, XorbRangeDescriptor, XorbReconstructionFetchInfo,
};
lazy_static::lazy_static! {
/// Reference instant for URL timestamps. Initialized far in the past to allow
/// testing timestamps that are earlier in the current process lifetime.
static ref REFERENCE_INSTANT: Instant = {
let now = Instant::now();
now.checked_sub(Duration::from_secs(365 * 24 * 60 * 60))
.unwrap_or(now)
};
}
/// Stored XORB data: the serialized data and the deserialized XorbObject (header/footer).
struct MaterializedXorb {
serialized_data: Bytes,
@@ -69,6 +59,10 @@ pub struct MemoryClient {
url_expiration_ms: AtomicU64,
/// API delay range in milliseconds as (min_ms, max_ms). (0, 0) means disabled.
random_ms_delay_window: (AtomicU64, AtomicU64),
/// Max ranges per XorbMultiRangeFetch entry. usize::MAX means no splitting.
max_ranges_per_fetch: AtomicUsize,
/// HTTP status code to return when V2 is disabled (0 = enabled).
v2_disabled_status: AtomicU16,
}
impl MemoryClient {
@@ -81,6 +75,8 @@ impl MemoryClient {
upload_concurrency_controller: AdaptiveConcurrencyController::new_upload("memory_uploads"),
url_expiration_ms: AtomicU64::new(u64::MAX),
random_ms_delay_window: (AtomicU64::new(0), AtomicU64::new(0)),
max_ranges_per_fetch: AtomicUsize::new(usize::MAX),
v2_disabled_status: AtomicU16::new(0),
})
}
@@ -225,6 +221,8 @@ impl Default for MemoryClient {
upload_concurrency_controller: AdaptiveConcurrencyController::new_upload("memory_uploads"),
url_expiration_ms: AtomicU64::new(u64::MAX),
random_ms_delay_window: (AtomicU64::new(0), AtomicU64::new(0)),
max_ranges_per_fetch: AtomicUsize::new(usize::MAX),
v2_disabled_status: AtomicU16::new(0),
}
}
}
@@ -236,6 +234,34 @@ impl DirectAccessClient for MemoryClient {
self.url_expiration_ms.store(expiration.as_millis() as u64, Ordering::Relaxed);
}
fn set_max_ranges_per_fetch(&self, max_ranges: usize) {
self.max_ranges_per_fetch.store(max_ranges, Ordering::Relaxed);
}
fn disable_v2_reconstruction(&self, status_code: u16) {
self.v2_disabled_status.store(status_code, Ordering::Relaxed);
}
fn v2_disabled_status_code(&self) -> u16 {
self.v2_disabled_status.load(Ordering::Relaxed)
}
async fn get_reconstruction_v1(
&self,
file_id: &MerkleHash,
bytes_range: Option<FileRange>,
) -> Result<Option<QueryReconstructionResponse>> {
MemoryClient::get_reconstruction_v1(self, file_id, bytes_range).await
}
async fn get_reconstruction_v2(
&self,
file_id: &MerkleHash,
bytes_range: Option<FileRange>,
) -> Result<Option<QueryReconstructionResponseV2>> {
MemoryClient::get_reconstruction_v2(self, file_id, bytes_range).await
}
fn set_api_delay_range(&self, delay_range: Option<Range<Duration>>) {
match delay_range {
Some(range) => {
@@ -514,6 +540,130 @@ impl DirectAccessClient for MemoryClient {
}
}
impl MemoryClient {
async fn compute_reconstruction_ranges(
&self,
file_id: &MerkleHash,
bytes_range: Option<FileRange>,
) -> Result<xorb_utils::ReconstructionRangesResult> {
let file_info = {
let shard = self.shard.read().await;
match shard.get_file_reconstruction_info(file_id) {
Some(fi) => fi,
None => return Ok(None),
}
};
let xorbs = self.xorbs.read().await;
xorb_utils::compute_reconstruction_ranges(&file_info, bytes_range, &mut |hash| {
let storage = xorbs.get(hash).ok_or_else(|| {
error!("Unable to find xorb in memory CAS {:?}", hash);
CasClientError::XORBNotFound(*hash)
})?;
Ok(match storage {
XorbStorage::Materialized(entry) => entry.xorb_object.clone(),
XorbStorage::Random(xorb) => xorb.get_xorb_object(),
})
})
}
/// V1 reconstruction: returns per-range presigned URLs.
pub async fn get_reconstruction_v1(
&self,
file_id: &MerkleHash,
bytes_range: Option<FileRange>,
) -> Result<Option<QueryReconstructionResponse>> {
self.apply_api_delay().await;
let result = self.compute_reconstruction_ranges(file_id, bytes_range).await?;
let Some((offset_into_first_range, terms, merged_ranges)) = result else {
return Ok(None);
};
if terms.is_empty() {
return Ok(Some(QueryReconstructionResponse {
offset_into_first_range,
terms,
fetch_info: HashMap::new(),
}));
}
let timestamp = Instant::now();
let mut fetch_info: HashMap<HexMerkleHash, Vec<XorbReconstructionFetchInfo>> = HashMap::new();
for (hash, ranges) in merged_ranges {
let entries = ranges
.into_iter()
.map(|r| XorbReconstructionFetchInfo {
range: r.chunk_range,
url: generate_fetch_url(&hash, &r.byte_range, timestamp),
url_range: HttpRange::from(r.byte_range),
})
.collect();
fetch_info.insert(hash.into(), entries);
}
Ok(Some(QueryReconstructionResponse {
offset_into_first_range,
terms,
fetch_info,
}))
}
/// V2 reconstruction: returns per-xorb multi-range fetch descriptors.
pub async fn get_reconstruction_v2(
&self,
file_id: &MerkleHash,
bytes_range: Option<FileRange>,
) -> Result<Option<QueryReconstructionResponseV2>> {
self.apply_api_delay().await;
let result = self.compute_reconstruction_ranges(file_id, bytes_range).await?;
let Some((offset_into_first_range, terms, merged_ranges)) = result else {
return Ok(None);
};
if terms.is_empty() {
return Ok(Some(QueryReconstructionResponseV2 {
offset_into_first_range,
terms,
xorbs: HashMap::new(),
}));
}
let timestamp = Instant::now();
let max_ranges = self.max_ranges_per_fetch.load(Ordering::Relaxed);
let mut xorbs: HashMap<HexMerkleHash, Vec<XorbMultiRangeFetch>> = HashMap::new();
for (hash, ranges) in merged_ranges {
let mut fetch_entries = Vec::new();
for chunk in ranges.chunks(max_ranges) {
let range_descriptors: Vec<XorbRangeDescriptor> = chunk
.iter()
.map(|r| XorbRangeDescriptor {
chunks: r.chunk_range,
bytes: HttpRange::from(r.byte_range),
})
.collect();
let url = generate_v2_fetch_url(&hash, &range_descriptors, timestamp);
fetch_entries.push(XorbMultiRangeFetch {
url,
ranges: range_descriptors,
});
}
xorbs.insert(hash.into(), fetch_entries);
}
Ok(Some(QueryReconstructionResponseV2 {
offset_into_first_range,
terms,
xorbs,
}))
}
}
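
The grouping loop in `get_reconstruction_v2` relies on `slice::chunks`, which yields consecutive groups of at most `max_ranges` elements; with `usize::MAX` all ranges of a xorb end up in a single entry. A minimal standalone sketch of that behavior follows, assuming simplified stand-in types: `ByteRange` and `group_ranges` are hypothetical names for illustration and are not part of this change.

```rust
/// Hypothetical stand-in for an inclusive byte range (not the real `HttpRange`).
#[derive(Clone, Copy, Debug, PartialEq)]
struct ByteRange {
    start: u64,
    end: u64,
}

/// Groups `ranges` into consecutive batches of at most `max_ranges` items,
/// mirroring the `ranges.chunks(max_ranges)` loop. In the real code each
/// batch would become one `XorbMultiRangeFetch` with its own URL.
fn group_ranges(ranges: &[ByteRange], max_ranges: usize) -> Vec<Vec<ByteRange>> {
    // `slice::chunks` panics when the chunk size is 0, so reject it up front.
    assert!(max_ranges > 0, "max_ranges must be non-zero");
    ranges.chunks(max_ranges).map(|batch| batch.to_vec()).collect()
}
```

Calling `group_ranges(&ranges, usize::MAX)` on a non-empty slice gives one batch, while `group_ranges(&ranges, 1)` gives one batch per range, which is the shape a test would use to force single-range entries. Because `chunks` panics on a size of 0, a caller of `set_max_ranges_per_fetch` would presumably need to pass a non-zero value.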
#[cfg_attr(not(target_family = "wasm"), async_trait)]
#[cfg_attr(target_family = "wasm", async_trait(?Send))]
impl Client for MemoryClient {
@@ -651,194 +801,8 @@ impl Client for MemoryClient {
&self,
file_id: &MerkleHash,
bytes_range: Option<FileRange>,
) -> Result<Option<QueryReconstructionResponse>> {
self.apply_api_delay().await;
let file_info = {
let shard = self.shard.read().await;
match shard.get_file_reconstruction_info(file_id) {
Some(fi) => fi,
None => return Ok(None),
}
};
let total_file_size: u64 = file_info.file_size();
// Handle range validation and truncation
let file_range = if let Some(range) = bytes_range {
// If the entire range is out of bounds, return None (like RemoteClient does for 416)
if range.start >= total_file_size {
// For empty files (size 0), only the first query (start == 0) should return the empty reconstruction
// All subsequent queries should return None to prevent infinite remainder loops
if total_file_size == 0 && range.start == 0 {
// Empty file - return valid but empty reconstruction
return Ok(Some(QueryReconstructionResponse {
offset_into_first_range: 0,
terms: vec![],
fetch_info: HashMap::new(),
}));
}
return Ok(None);
}
FileRange::new(range.start, range.end.min(total_file_size))
} else {
// No range specified - handle empty files
if total_file_size == 0 {
return Ok(Some(QueryReconstructionResponse {
offset_into_first_range: 0,
terms: vec![],
fetch_info: HashMap::new(),
}));
}
FileRange::full()
};
// Find the first segment that contains bytes in our range
let mut s_idx = 0;
let mut cumulative_bytes = 0u64;
let mut first_chunk_byte_start;
loop {
if s_idx >= file_info.segments.len() {
return Err(CasClientError::InvalidRange);
}
let n = file_info.segments[s_idx].unpacked_segment_bytes as u64;
if cumulative_bytes + n > file_range.start {
assert_ge!(file_range.start, cumulative_bytes);
first_chunk_byte_start = cumulative_bytes;
break;
} else {
cumulative_bytes += n;
s_idx += 1;
}
}
let mut terms = Vec::new();
#[derive(Clone)]
struct FetchInfoIntermediate {
chunk_range: ChunkRange,
byte_range: FileRange,
}
let mut fetch_info_map: MerkleHashMap<Vec<FetchInfoIntermediate>> = MerkleHashMap::new();
let xorbs = self.xorbs.read().await;
while s_idx < file_info.segments.len() && cumulative_bytes < file_range.end {
let mut segment = file_info.segments[s_idx].clone();
let mut chunk_range = ChunkRange::new(segment.chunk_index_start, segment.chunk_index_end);
let storage = xorbs.get(&segment.xorb_hash).ok_or_else(|| {
error!("Unable to find xorb in memory CAS {:?}", segment.xorb_hash);
CasClientError::XORBNotFound(segment.xorb_hash)
})?;
let xorb_footer = match storage {
XorbStorage::Materialized(entry) => entry.xorb_object.clone(),
XorbStorage::Random(xorb) => xorb.get_xorb_object(),
};
// Prune first segment on chunk boundaries
if cumulative_bytes < file_range.start {
while chunk_range.start < chunk_range.end {
let next_chunk_size = xorb_footer.uncompressed_chunk_length(chunk_range.start)? as u64;
if cumulative_bytes + next_chunk_size <= file_range.start {
cumulative_bytes += next_chunk_size;
first_chunk_byte_start += next_chunk_size;
segment.unpacked_segment_bytes -= next_chunk_size as u32;
chunk_range.start += 1;
debug_assert_lt!(chunk_range.start, chunk_range.end);
} else {
break;
}
}
}
// Prune last segment on chunk boundaries
if cumulative_bytes + segment.unpacked_segment_bytes as u64 > file_range.end {
while chunk_range.end > chunk_range.start {
let last_chunk_size = xorb_footer.uncompressed_chunk_length(chunk_range.end - 1)?;
if cumulative_bytes + (segment.unpacked_segment_bytes - last_chunk_size) as u64 >= file_range.end {
chunk_range.end -= 1;
segment.unpacked_segment_bytes -= last_chunk_size;
debug_assert_lt!(chunk_range.start, chunk_range.end);
assert_gt!(segment.unpacked_segment_bytes, 0);
} else {
break;
}
}
}
let (byte_start, byte_end) = xorb_footer.get_byte_offset(chunk_range.start, chunk_range.end)?;
let byte_range = FileRange::new(byte_start as u64, byte_end as u64);
let xorb_reconstruction_term = XorbReconstructionTerm {
hash: segment.xorb_hash.into(),
unpacked_length: segment.unpacked_segment_bytes,
range: chunk_range,
};
terms.push(xorb_reconstruction_term);
let fetch_info_intermediate = FetchInfoIntermediate {
chunk_range,
byte_range,
};
fetch_info_map
.entry(segment.xorb_hash)
.or_default()
.push(fetch_info_intermediate);
cumulative_bytes += segment.unpacked_segment_bytes as u64;
s_idx += 1;
}
assert!(!terms.is_empty());
let timestamp = Instant::now();
// Sort and merge adjacent/overlapping ranges in each fetch_info Vec
let mut merged_fetch_info_map: HashMap<HexMerkleHash, Vec<XorbReconstructionFetchInfo>> = HashMap::new();
for (hash, mut fi_vec) in fetch_info_map {
fi_vec.sort_by_key(|fi| fi.chunk_range.start);
let mut merged: Vec<XorbReconstructionFetchInfo> = Vec::new();
let mut idx = 0;
while idx < fi_vec.len() {
let mut new_fi = fi_vec[idx].clone();
while idx + 1 < fi_vec.len() {
let next_fi = &fi_vec[idx + 1];
if next_fi.chunk_range.start <= new_fi.chunk_range.end {
new_fi.chunk_range.end = next_fi.chunk_range.end.max(new_fi.chunk_range.end);
new_fi.byte_range.end = next_fi.byte_range.end.max(new_fi.byte_range.end);
idx += 1;
} else {
break;
}
}
merged.push(XorbReconstructionFetchInfo {
range: new_fi.chunk_range,
url: generate_fetch_url(&hash, &new_fi.byte_range, timestamp),
url_range: HttpRange::from(new_fi.byte_range),
});
idx += 1;
}
merged_fetch_info_map.insert(hash.into(), merged);
}
Ok(Some(QueryReconstructionResponse {
offset_into_first_range: file_range.start - first_chunk_byte_start,
terms,
fetch_info: merged_fetch_info_map,
}))
) -> Result<Option<QueryReconstructionResponseV2>> {
self.get_reconstruction_v2(file_id, bytes_range).await
}
async fn batch_get_reconstruction(&self, file_ids: &[MerkleHash]) -> Result<BatchQueryReconstructionResponse> {
@@ -847,7 +811,7 @@ impl Client for MemoryClient {
let mut fetch_info_map: HashMap<HexMerkleHash, Vec<XorbReconstructionFetchInfo>> = HashMap::new();
for file_id in file_ids {
if let Some(response) = self.get_reconstruction(file_id, None).await? {
if let Some(response) = self.get_reconstruction_v1(file_id, None).await? {
let hex_hash: HexMerkleHash = (*file_id).into();
files.insert(hex_hash, response.terms);
@@ -876,8 +840,8 @@ impl Client for MemoryClient {
uncompressed_size_if_known: Option<usize>,
) -> Result<(Bytes, Vec<u32>)> {
self.apply_api_delay().await;
let (url, range) = url_info.retrieve_url().await?;
let (xorb_hash, _url_byte_range, url_timestamp) = parse_fetch_url(&url)?;
let (url, http_ranges) = url_info.retrieve_url().await?;
let (xorb_hash, url_timestamp) = parse_any_fetch_url(&url)?;
// Check if URL has expired
let expiration_ms = self.url_expiration_ms.load(Ordering::Relaxed);
@@ -889,12 +853,17 @@ impl Client for MemoryClient {
let xorbs = self.xorbs.read().await;
let storage = xorbs.get(&xorb_hash).ok_or(CasClientError::XORBNotFound(xorb_hash))?;
// Extract the byte range from the serialized data and deserialize
let start = range.start as usize;
let end = range.end as usize + 1; // HttpRange is inclusive end
let transfer_len = (end - start) as u64;
// Extract each byte range from the serialized data and deserialize
let mut all_decompressed = Vec::new();
let mut all_chunk_indices = Vec::<u32>::new();
let mut total_transfer = 0u64;
let (decompressed_data, chunk_byte_indices) = match storage {
for http_range in &http_ranges {
let start = http_range.start as usize;
let end = http_range.end as usize + 1; // HttpRange is inclusive end
total_transfer += http_range.length();
let (data, chunk_indices) = match storage {
XorbStorage::Materialized(entry) => {
let range_data = &entry.serialized_data[start..end];
xet_core_structures::xorb_object::deserialize_chunks(&mut Cursor::new(range_data))?
@@ -905,20 +874,28 @@ impl Client for MemoryClient {
},
};
xet_core_structures::xorb_object::append_chunk_segment(
&mut all_decompressed,
&mut all_chunk_indices,
&data,
&chunk_indices,
);
}
if let Some(expected) = uncompressed_size_if_known {
debug_assert_eq!(
decompressed_data.len(),
all_decompressed.len(),
expected,
"get_file_term_data: expected {} bytes, got {}",
expected,
decompressed_data.len()
all_decompressed.len()
);
}
if let Some(ref cb) = progress_callback {
cb(transfer_len, transfer_len, transfer_len);
cb(total_transfer, total_transfer, total_transfer);
}
Ok((Bytes::from(decompressed_data), chunk_byte_indices))
Ok((Bytes::from(all_decompressed), all_chunk_indices))
}
}
@@ -946,6 +923,19 @@ fn parse_fetch_url(url: &str) -> Result<(MerkleHash, FileRange, Instant)> {
Ok((hash, byte_range, timestamp))
}
fn generate_v2_fetch_url(hash: &MerkleHash, ranges: &[XorbRangeDescriptor], timestamp: Instant) -> String {
xorb_utils::generate_v2_fetch_url(hash, ranges, timestamp)
}
/// Parse either a V1 or V2 fetch URL, returning (hash, timestamp).
fn parse_any_fetch_url(url: &str) -> Result<(MerkleHash, Instant)> {
if let Ok((hash, _, ts)) = parse_fetch_url(url) {
return Ok((hash, ts));
}
let (hash, ts, _) = xorb_utils::parse_v2_fetch_url(url)?;
Ok((hash, ts))
}
#[cfg(all(test, not(target_family = "wasm")))]
mod tests {
use super::super::client_testing_utils::ClientTestingUtils;
@@ -1062,7 +1052,7 @@ mod tests {
assert_eq!(range_data.as_ref(), &file2.data[start as usize..end as usize]);
// Reconstruction workflow
let recon = client.get_reconstruction(&file2.file_hash, None).await.unwrap().unwrap();
let recon = client.get_reconstruction_v1(&file2.file_hash, None).await.unwrap().unwrap();
for term in &recon.terms {
let xorb_hash: MerkleHash = term.hash.into();
for fetch_info in recon.fetch_info.get(&term.hash).unwrap() {

@@ -34,6 +34,7 @@ mod simulation_server;
#[cfg(unix)]
#[cfg(not(target_family = "wasm"))]
pub mod socket_proxy;
pub(crate) mod xorb_utils;
pub use client_testing_utils::{ClientTestingUtils, RandomFileContents};
#[cfg(not(target_family = "wasm"))]

@@ -132,7 +132,7 @@ impl Client for RemoteSimulationClient {
&self,
file_id: &xet_core_structures::merklehash::MerkleHash,
bytes_range: Option<crate::cas_types::FileRange>,
) -> Result<Option<crate::cas_types::QueryReconstructionResponse>> {
) -> Result<Option<crate::cas_types::QueryReconstructionResponseV2>> {
self.inner.get_reconstruction(file_id, bytes_range).await
}

@@ -440,7 +440,7 @@ impl Client for LocalTestServer {
&self,
file_id: &xet_core_structures::merklehash::MerkleHash,
bytes_range: Option<crate::cas_types::FileRange>,
) -> Result<Option<crate::cas_types::QueryReconstructionResponse>> {
) -> Result<Option<crate::cas_types::QueryReconstructionResponseV2>> {
self.remote_simulation_client.get_reconstruction(file_id, bytes_range).await
}
@@ -512,6 +512,30 @@ impl DirectAccessClient for LocalTestServer {
self.client.set_api_delay_range(delay_range);
}
fn set_max_ranges_per_fetch(&self, max_ranges: usize) {
self.client.set_max_ranges_per_fetch(max_ranges);
}
fn disable_v2_reconstruction(&self, status_code: u16) {
self.client.disable_v2_reconstruction(status_code);
}
async fn get_reconstruction_v1(
&self,
file_id: &xet_core_structures::merklehash::MerkleHash,
bytes_range: Option<crate::cas_types::FileRange>,
) -> Result<Option<crate::cas_types::QueryReconstructionResponse>> {
self.client.get_reconstruction_v1(file_id, bytes_range).await
}
async fn get_reconstruction_v2(
&self,
file_id: &xet_core_structures::merklehash::MerkleHash,
bytes_range: Option<crate::cas_types::FileRange>,
) -> Result<Option<crate::cas_types::QueryReconstructionResponseV2>> {
self.client.get_reconstruction_v2(file_id, bytes_range).await
}
async fn apply_api_delay(&self) {
self.client.apply_api_delay().await;
}
@@ -690,31 +714,22 @@ mod tests {
// Fetch term endpoint - verify URLs are HTTP and data can be fetched
let http_client = reqwest::Client::new();
for fetch_infos in remote_recon.fetch_info.values() {
for fi in fetch_infos {
assert!(fi.url.starts_with("http://"));
assert!(fi.url.contains("/fetch_term?term="));
let response = http_client.get(&fi.url).send().await.unwrap();
for multi_range_fetches in remote_recon.xorbs.values() {
for mrf in multi_range_fetches {
assert!(mrf.url.starts_with("http://"));
assert!(mrf.url.contains("/fetch_term?term="));
let response = http_client.get(&mrf.url).send().await.unwrap();
assert!(response.status().is_success());
assert!(!response.bytes().await.unwrap().is_empty());
}
}
// Fetch term with range request
let first_fi = &remote_recon.fetch_info.values().next().unwrap()[0];
let full_data = http_client.get(&first_fi.url).send().await.unwrap().bytes().await.unwrap();
if full_data.len() > 100 {
let range_resp = http_client
.get(&first_fi.url)
.header(reqwest::header::RANGE, "bytes=0-99")
.send()
.await
.unwrap();
assert!(range_resp.status().is_success());
let range_data = range_resp.bytes().await.unwrap();
assert_eq!(range_data.len(), 100);
assert_eq!(&range_data[..], &full_data[..100]);
}
// Verify V2 fetch URLs return consistent data across multiple requests.
let first_mrf = &remote_recon.xorbs.values().next().unwrap()[0];
let data_1 = http_client.get(&first_mrf.url).send().await.unwrap().bytes().await.unwrap();
let data_2 = http_client.get(&first_mrf.url).send().await.unwrap().bytes().await.unwrap();
assert_eq!(data_1, data_2);
assert!(!data_1.is_empty());
}
/// Tests that invalid requests return appropriate error responses.
@@ -762,16 +777,16 @@ mod tests {
.await
.unwrap()
.unwrap();
for (hash, fetch_infos) in &recon1.fetch_info {
for fi in fetch_infos {
for (hash, multi_range_fetches) in &recon1.xorbs {
for mrf in multi_range_fetches {
assert!(
fi.url.starts_with("http://") || fi.url.starts_with("https://"),
mrf.url.starts_with("http://") || mrf.url.starts_with("https://"),
"URL for hash {} should be HTTP, got: {}",
hash,
fi.url
mrf.url
);
assert!(fi.url.contains("/fetch_term?term="));
assert!(!fi.url.contains("\":"));
assert!(mrf.url.contains("/fetch_term?term="));
assert!(!mrf.url.contains("\":"));
}
}
@@ -782,10 +797,10 @@ mod tests {
.await
.unwrap()
.unwrap();
assert!(multi_recon.fetch_info.len() >= 2);
for fetch_infos in multi_recon.fetch_info.values() {
for fi in fetch_infos {
assert!(fi.url.starts_with("http://"));
assert!(multi_recon.xorbs.len() >= 2);
for multi_range_fetches in multi_recon.xorbs.values() {
for mrf in multi_range_fetches {
assert!(mrf.url.starts_with("http://"));
}
}
@@ -798,18 +813,18 @@ mod tests {
.await
.unwrap()
.unwrap();
for fetch_infos in range_recon.fetch_info.values() {
for fi in fetch_infos {
assert!(fi.url.starts_with("http://"));
assert!(fi.url.contains("/fetch_term?term="));
for multi_range_fetches in range_recon.xorbs.values() {
for mrf in multi_range_fetches {
assert!(mrf.url.starts_with("http://"));
assert!(mrf.url.contains("/fetch_term?term="));
}
}
// Verify all term URLs are fetchable
for term in &recon1.terms {
let fetch_infos = recon1.fetch_info.get(&term.hash).unwrap();
for fi in fetch_infos {
let response = http_client.get(&fi.url).send().await.unwrap();
let multi_range_fetches = recon1.xorbs.get(&term.hash).unwrap();
for mrf in multi_range_fetches {
let response = http_client.get(&mrf.url).send().await.unwrap();
assert!(response.status().is_success());
assert!(!response.bytes().await.unwrap().is_empty());
}
@@ -860,9 +875,9 @@ mod tests {
let expected_term = &file.terms[term_idx];
assert_eq!(recon_term.hash.0, expected_term.xorb_hash);
// Verify fetch_info exists for each XORB
let fetch_infos = recon.fetch_info.get(&recon_term.hash).unwrap();
assert!(!fetch_infos.is_empty());
// Verify xorbs has entry for each term
let multi_range_fetches = recon.xorbs.get(&recon_term.hash).unwrap();
assert!(!multi_range_fetches.is_empty());
}
// Verify the complete file can be retrieved correctly via LocalClient

@@ -0,0 +1,499 @@
//! Shared utilities for reconstruction range computation and V2 URL encoding.
//!
//! This module consolidates logic used by both `MemoryClient` and `LocalClient`
//! for computing reconstruction ranges from file segment info, merging adjacent
//! ranges, and encoding/decoding V2 fetch URLs.
use base64::Engine;
use base64::engine::general_purpose::URL_SAFE_NO_PAD;
use more_asserts::{assert_ge, assert_gt, debug_assert_lt};
use tokio::time::{Duration, Instant};
use xet_core_structures::MerkleHashMap;
use xet_core_structures::merklehash::MerkleHash;
use xet_core_structures::metadata_shard::file_structs::MDBFileInfo;
use xet_core_structures::xorb_object::XorbObject;
use crate::cas_client::error::{CasClientError, Result};
use crate::cas_types::{ChunkRange, FileRange, HttpRange, XorbRangeDescriptor, XorbReconstructionTerm};
lazy_static::lazy_static! {
/// Reference instant for URL timestamps. Initialized far in the past to allow
/// testing timestamps that are earlier in the current process lifetime.
pub(crate) static ref REFERENCE_INSTANT: Instant = {
let now = Instant::now();
now.checked_sub(Duration::from_secs(365 * 24 * 60 * 60))
.unwrap_or(now)
};
}
/// A merged byte/chunk range for a single xorb.
#[derive(Clone, Debug)]
pub(crate) struct MergedRange {
pub chunk_range: ChunkRange,
pub byte_range: FileRange,
}
/// Result of `compute_reconstruction_ranges`: the offset into the first range,
/// the list of reconstruction terms, and the merged ranges per xorb hash.
pub(crate) type ReconstructionRangesResult =
Option<(u64, Vec<XorbReconstructionTerm>, MerkleHashMap<Vec<MergedRange>>)>;
/// Computes reconstruction ranges from file segment info.
///
/// Iterates the segments in `file_info`, prunes chunk boundaries to the
/// requested `bytes_range`, and merges adjacent/overlapping ranges per xorb.
///
/// `get_xorb_footer` is called once for each segment visited (repeated hashes
/// are not deduplicated) to obtain the `XorbObject` metadata needed for
/// chunk-level byte offset calculations.
///
/// Returns `Ok(None)` when the range is out of bounds, or
/// `Ok(Some((offset_into_first_range, terms, merged_ranges_per_xorb)))`.
pub(crate) fn compute_reconstruction_ranges(
file_info: &MDBFileInfo,
bytes_range: Option<FileRange>,
get_xorb_footer: &mut dyn FnMut(&MerkleHash) -> Result<XorbObject>,
) -> Result<ReconstructionRangesResult> {
let total_file_size: u64 = file_info.file_size();
let file_range = if let Some(range) = bytes_range {
if range.start >= total_file_size {
if total_file_size == 0 && range.start == 0 {
return Ok(Some((0, vec![], MerkleHashMap::new())));
}
return Ok(None);
}
FileRange::new(range.start, range.end.min(total_file_size))
} else {
if total_file_size == 0 {
return Ok(Some((0, vec![], MerkleHashMap::new())));
}
FileRange::full()
};
let mut s_idx = 0;
let mut cumulative_bytes = 0u64;
let mut first_chunk_byte_start;
loop {
if s_idx >= file_info.segments.len() {
return Err(CasClientError::InvalidRange);
}
let n = file_info.segments[s_idx].unpacked_segment_bytes as u64;
if cumulative_bytes + n > file_range.start {
assert_ge!(file_range.start, cumulative_bytes);
first_chunk_byte_start = cumulative_bytes;
break;
} else {
cumulative_bytes += n;
s_idx += 1;
}
}
let mut terms = Vec::new();
#[derive(Clone)]
struct FetchInfoIntermediate {
chunk_range: ChunkRange,
byte_range: FileRange,
}
let mut fetch_info_map: MerkleHashMap<Vec<FetchInfoIntermediate>> = MerkleHashMap::new();
while s_idx < file_info.segments.len() && cumulative_bytes < file_range.end {
let mut segment = file_info.segments[s_idx].clone();
let mut chunk_range = ChunkRange::new(segment.chunk_index_start, segment.chunk_index_end);
let xorb_footer = get_xorb_footer(&segment.xorb_hash)?;
if cumulative_bytes < file_range.start {
while chunk_range.start < chunk_range.end {
let next_chunk_size = xorb_footer.uncompressed_chunk_length(chunk_range.start)? as u64;
if cumulative_bytes + next_chunk_size <= file_range.start {
cumulative_bytes += next_chunk_size;
first_chunk_byte_start += next_chunk_size;
segment.unpacked_segment_bytes -= next_chunk_size as u32;
chunk_range.start += 1;
debug_assert_lt!(chunk_range.start, chunk_range.end);
} else {
break;
}
}
}
if cumulative_bytes + segment.unpacked_segment_bytes as u64 > file_range.end {
while chunk_range.end > chunk_range.start {
let last_chunk_size = xorb_footer.uncompressed_chunk_length(chunk_range.end - 1)?;
if cumulative_bytes + (segment.unpacked_segment_bytes - last_chunk_size) as u64 >= file_range.end {
chunk_range.end -= 1;
segment.unpacked_segment_bytes -= last_chunk_size;
debug_assert_lt!(chunk_range.start, chunk_range.end);
assert_gt!(segment.unpacked_segment_bytes, 0);
} else {
break;
}
}
}
let (byte_start, byte_end) = xorb_footer.get_byte_offset(chunk_range.start, chunk_range.end)?;
let byte_range = FileRange::new(byte_start as u64, byte_end as u64);
terms.push(XorbReconstructionTerm {
hash: segment.xorb_hash.into(),
unpacked_length: segment.unpacked_segment_bytes,
range: chunk_range,
});
fetch_info_map
.entry(segment.xorb_hash)
.or_default()
.push(FetchInfoIntermediate {
chunk_range,
byte_range,
});
cumulative_bytes += segment.unpacked_segment_bytes as u64;
s_idx += 1;
}
debug_assert!(!terms.is_empty());
let mut merged: MerkleHashMap<Vec<MergedRange>> = MerkleHashMap::new();
for (hash, mut fi_vec) in fetch_info_map {
fi_vec.sort_by_key(|fi| fi.chunk_range.start);
let mut result: Vec<MergedRange> = Vec::new();
let mut idx = 0;
while idx < fi_vec.len() {
let mut cur = fi_vec[idx].clone();
while idx + 1 < fi_vec.len() {
let next = &fi_vec[idx + 1];
if next.chunk_range.start <= cur.chunk_range.end {
cur.chunk_range.end = cur.chunk_range.end.max(next.chunk_range.end);
cur.byte_range.end = cur.byte_range.end.max(next.byte_range.end);
idx += 1;
} else {
break;
}
}
result.push(MergedRange {
chunk_range: cur.chunk_range,
byte_range: cur.byte_range,
});
idx += 1;
}
merged.insert(hash, result);
}
Ok(Some((file_range.start - first_chunk_byte_start, terms, merged)))
}
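
The merge step at the end of `compute_reconstruction_ranges` can be illustrated in isolation. The sketch below is a simplified version under stated assumptions: `Span` is a hypothetical half-open chunk-index range, `merge_spans` is an illustrative name, and only chunk indices are tracked (the real loop extends `byte_range.end` in the same way).

```rust
/// Hypothetical half-open chunk-index range `[start, end)`.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Span {
    start: u32,
    end: u32,
}

/// Sorts spans by start, then folds any span that begins at or before the
/// end of the previous merged span (adjacent or overlapping) into it.
fn merge_spans(mut spans: Vec<Span>) -> Vec<Span> {
    spans.sort_by_key(|s| s.start);
    let mut merged: Vec<Span> = Vec::new();
    for s in spans {
        if let Some(last) = merged.last_mut() {
            if s.start <= last.end {
                // Extend the current merged span; `max` handles nested spans.
                last.end = last.end.max(s.end);
                continue;
            }
        }
        // Gap before this span: start a new merged span.
        merged.push(s);
    }
    merged
}
```

With this rule, spans `[0, 2)` and `[2, 4)` collapse into `[0, 4)`, while a later `[6, 8)` stays separate, so two terms that reference neighboring chunk ranges of one xorb share a single fetch range.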
/// Generates a V2 fetch URL: URL-safe, unpadded base64 of "{hash_hex}:{timestamp_ms}:{r1_start}-{r1_end},...".
pub(crate) fn generate_v2_fetch_url(hash: &MerkleHash, ranges: &[XorbRangeDescriptor], timestamp: Instant) -> String {
let timestamp_ms = timestamp.saturating_duration_since(*REFERENCE_INSTANT).as_millis() as u64;
let ranges_str: Vec<String> = ranges.iter().map(|r| format!("{}-{}", r.bytes.start, r.bytes.end)).collect();
let payload = format!("{}:{}:{}", hash.hex(), timestamp_ms, ranges_str.join(","));
URL_SAFE_NO_PAD.encode(payload.as_bytes())
}
/// Parses a V2 fetch URL back into (hash, timestamp, byte ranges).
pub(crate) fn parse_v2_fetch_url(url: &str) -> Result<(MerkleHash, Instant, Vec<HttpRange>)> {
let bytes = URL_SAFE_NO_PAD.decode(url).map_err(|_| CasClientError::InvalidArguments)?;
let payload = String::from_utf8(bytes).map_err(|_| CasClientError::InvalidArguments)?;
let mut parts = payload.splitn(3, ':');
let hash_hex = parts.next().ok_or(CasClientError::InvalidArguments)?;
let ts_str = parts.next().ok_or(CasClientError::InvalidArguments)?;
let ranges_str = parts.next().ok_or(CasClientError::InvalidArguments)?;
let hash = MerkleHash::from_hex(hash_hex).map_err(|_| CasClientError::InvalidArguments)?;
let timestamp_ms: u64 = ts_str.parse().map_err(|_| CasClientError::InvalidArguments)?;
let timestamp = *REFERENCE_INSTANT + Duration::from_millis(timestamp_ms);
let mut ranges = Vec::new();
for r in ranges_str.split(',').filter(|s| !s.is_empty()) {
let mut parts = r.splitn(2, '-');
let start: u64 = parts
.next()
.ok_or(CasClientError::InvalidArguments)?
.parse()
.map_err(|_| CasClientError::InvalidArguments)?;
let end: u64 = parts
.next()
.ok_or(CasClientError::InvalidArguments)?
.parse()
.map_err(|_| CasClientError::InvalidArguments)?;
ranges.push(HttpRange::new(start, end));
}
Ok((hash, timestamp, ranges))
}
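
The plain-text payload that these two helpers wrap in base64 can be sketched with the standard library alone. The example below deliberately omits the base64 step and treats the hash as an opaque hex string; `encode_payload` and `decode_payload` are hypothetical names used only for illustration, and errors are collapsed to `None` rather than `CasClientError::InvalidArguments`.

```rust
/// Builds the payload "{hash_hex}:{timestamp_ms}:{s1}-{e1},{s2}-{e2},...".
/// The base64 wrapping applied by the real helper is intentionally left out.
fn encode_payload(hash_hex: &str, timestamp_ms: u64, ranges: &[(u64, u64)]) -> String {
    let parts: Vec<String> = ranges.iter().map(|(s, e)| format!("{}-{}", s, e)).collect();
    format!("{}:{}:{}", hash_hex, timestamp_ms, parts.join(","))
}

/// Parses the payload back into (hash_hex, timestamp_ms, ranges).
/// Returns `None` if a field is missing or a number fails to parse.
fn decode_payload(payload: &str) -> Option<(String, u64, Vec<(u64, u64)>)> {
    // At most three fields: hash, timestamp, and the comma-separated range list.
    let mut fields = payload.splitn(3, ':');
    let hash_hex = fields.next()?.to_string();
    let timestamp_ms: u64 = fields.next()?.parse().ok()?;
    let ranges_str = fields.next()?;

    let mut ranges: Vec<(u64, u64)> = Vec::new();
    for r in ranges_str.split(',').filter(|s| !s.is_empty()) {
        let (start, end) = r.split_once('-')?;
        ranges.push((start.parse().ok()?, end.parse().ok()?));
    }
    Some((hash_hex, timestamp_ms, ranges))
}
```

As in `parse_v2_fetch_url`, empty range fragments are skipped, so a payload with a trailing colon and no ranges decodes to an empty range list instead of failing.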
#[cfg(test)]
mod tests {
use xet_core_structures::metadata_shard::file_structs::{
FileDataSequenceEntry, FileDataSequenceHeader, MDBFileInfo,
};
use super::super::random_xorb::RandomXorb;
use super::*;
fn make_range_descriptor(chunk_start: u32, chunk_end: u32, byte_start: u64, byte_end: u64) -> XorbRangeDescriptor {
XorbRangeDescriptor {
chunks: ChunkRange::new(chunk_start, chunk_end),
bytes: HttpRange::new(byte_start, byte_end),
}
}
fn build_xorb(chunk_sizes: &[usize]) -> (MerkleHash, XorbObject) {
let seed_and_sizes: Vec<(u64, u32)> =
chunk_sizes.iter().enumerate().map(|(i, &s)| (i as u64, s as u32)).collect();
let xorb = RandomXorb::new(&seed_and_sizes);
let xorb_object = xorb.get_xorb_object();
let hash = xorb.xorb_hash();
(hash, xorb_object)
}
fn make_segment(
xorb_hash: MerkleHash,
chunk_start: u32,
chunk_end: u32,
unpacked_bytes: u32,
) -> FileDataSequenceEntry {
FileDataSequenceEntry {
xorb_hash,
xorb_flags: 0,
chunk_index_start: chunk_start,
chunk_index_end: chunk_end,
unpacked_segment_bytes: unpacked_bytes,
}
}
fn make_file_info(segments: Vec<FileDataSequenceEntry>) -> MDBFileInfo {
MDBFileInfo {
metadata: FileDataSequenceHeader {
file_hash: MerkleHash::default(),
..Default::default()
},
segments,
verification: vec![],
metadata_ext: None,
}
}
#[test]
fn test_v2_url_roundtrip() {
let hash = MerkleHash::from_hex("a32d3a2a2e83e4d41b04899f13a8e891f4dd3f2ed940f96f91da7bf55b7ee299").unwrap();
let ranges = vec![
make_range_descriptor(0, 3, 0, 1024),
make_range_descriptor(5, 8, 2048, 4096),
];
let timestamp = Instant::now();
let url = generate_v2_fetch_url(&hash, &ranges, timestamp);
let (parsed_hash, parsed_ts, parsed_ranges) = parse_v2_fetch_url(&url).unwrap();
assert_eq!(hash, parsed_hash);
assert_eq!(parsed_ranges.len(), 2);
assert_eq!(parsed_ranges[0].start, 0);
assert_eq!(parsed_ranges[0].end, 1024);
assert_eq!(parsed_ranges[1].start, 2048);
assert_eq!(parsed_ranges[1].end, 4096);
let diff = if parsed_ts > timestamp {
parsed_ts - timestamp
} else {
timestamp - parsed_ts
};
assert!(diff < Duration::from_millis(2));
}
#[test]
fn test_v2_url_single_range() {
let hash = MerkleHash::default();
let ranges = vec![make_range_descriptor(0, 1, 100, 200)];
let timestamp = Instant::now();
let url = generate_v2_fetch_url(&hash, &ranges, timestamp);
let (_, _, parsed_ranges) = parse_v2_fetch_url(&url).unwrap();
assert_eq!(parsed_ranges.len(), 1);
assert_eq!(parsed_ranges[0].start, 100);
assert_eq!(parsed_ranges[0].end, 200);
}
#[test]
fn test_v2_url_invalid_base64() {
assert!(parse_v2_fetch_url("not-valid!!!").is_err());
}
#[test]
fn test_v2_url_invalid_payload() {
let url = URL_SAFE_NO_PAD.encode(b"bad");
assert!(parse_v2_fetch_url(&url).is_err());
}
#[test]
fn test_compute_ranges_single_segment() {
let (xorb_hash, xorb_object) = build_xorb(&[100, 200, 300]);
let file_info = make_file_info(vec![make_segment(xorb_hash, 0, 3, 600)]);
let result = compute_reconstruction_ranges(&file_info, None, &mut |_| Ok(xorb_object.clone())).unwrap();
let (offset, terms, merged) = result.unwrap();
assert_eq!(offset, 0);
assert_eq!(terms.len(), 1);
assert_eq!(terms[0].unpacked_length, 600);
assert_eq!(terms[0].range.start, 0);
assert_eq!(terms[0].range.end, 3);
let xorb_ranges = merged.get(&xorb_hash).unwrap();
assert_eq!(xorb_ranges.len(), 1);
assert_eq!(xorb_ranges[0].chunk_range.start, 0);
assert_eq!(xorb_ranges[0].chunk_range.end, 3);
}
#[test]
fn test_compute_ranges_partial_range() {
let (xorb_hash, xorb_object) = build_xorb(&[100, 200, 300]);
let file_info = make_file_info(vec![make_segment(xorb_hash, 0, 3, 600)]);
let range = FileRange::new(100, 300);
let result = compute_reconstruction_ranges(&file_info, Some(range), &mut |_| Ok(xorb_object.clone())).unwrap();
let (offset, terms, merged) = result.unwrap();
assert_eq!(offset, 0, "range starts exactly at chunk boundary");
assert_eq!(terms.len(), 1);
assert_eq!(terms[0].range.start, 1);
assert_eq!(terms[0].range.end, 2);
assert_eq!(terms[0].unpacked_length, 200);
let xorb_ranges = merged.get(&xorb_hash).unwrap();
assert_eq!(xorb_ranges.len(), 1);
assert_eq!(xorb_ranges[0].chunk_range.start, 1);
assert_eq!(xorb_ranges[0].chunk_range.end, 2);
}
#[test]
fn test_compute_ranges_out_of_bounds() {
let file_info = make_file_info(vec![make_segment(MerkleHash::default(), 0, 1, 100)]);
let range = FileRange::new(200, 300);
let result = compute_reconstruction_ranges(&file_info, Some(range), &mut |_| {
panic!("should not be called for out-of-range")
})
.unwrap();
assert!(result.is_none());
}
#[test]
fn test_compute_ranges_empty_file() {
let file_info = make_file_info(vec![]);
let result =
compute_reconstruction_ranges(&file_info, None, &mut |_| panic!("should not be called for empty file"))
.unwrap();
let (offset, terms, merged) = result.unwrap();
assert_eq!(offset, 0);
assert!(terms.is_empty());
assert!(merged.is_empty());
let result = compute_reconstruction_ranges(&file_info, Some(FileRange::new(0, 100)), &mut |_| {
panic!("should not be called for empty file")
})
.unwrap();
let (offset, terms, _) = result.unwrap();
assert_eq!(offset, 0);
assert!(terms.is_empty());
let result = compute_reconstruction_ranges(&file_info, Some(FileRange::new(1, 100)), &mut |_| {
panic!("should not be called for empty file")
})
.unwrap();
assert!(result.is_none());
}
#[test]
fn test_compute_ranges_merges_adjacent() {
let (xorb_hash, xorb_object) = build_xorb(&[100, 100, 100, 100]);
let file_info = make_file_info(vec![make_segment(xorb_hash, 0, 2, 200), make_segment(xorb_hash, 2, 4, 200)]);
let result = compute_reconstruction_ranges(&file_info, None, &mut |_| Ok(xorb_object.clone())).unwrap();
let (offset, terms, merged) = result.unwrap();
assert_eq!(offset, 0);
assert_eq!(terms.len(), 2);
let xorb_ranges = merged.get(&xorb_hash).unwrap();
assert_eq!(xorb_ranges.len(), 1);
assert_eq!(xorb_ranges[0].chunk_range.start, 0);
assert_eq!(xorb_ranges[0].chunk_range.end, 4);
}
#[test]
fn test_compute_ranges_multi_xorb_non_contiguous() {
let (hash_a, obj_a) = build_xorb(&[100, 100, 100, 100]);
let (hash_b, obj_b) = build_xorb(&[150, 150]);
let file_info = make_file_info(vec![
make_segment(hash_a, 0, 2, 200),
make_segment(hash_b, 0, 2, 300),
make_segment(hash_a, 2, 4, 200),
]);
let result = compute_reconstruction_ranges(&file_info, None, &mut |hash| {
if *hash == hash_a {
Ok(obj_a.clone())
} else if *hash == hash_b {
Ok(obj_b.clone())
} else {
Err(CasClientError::XORBNotFound(*hash))
}
})
.unwrap();
let (offset, terms, merged) = result.unwrap();
assert_eq!(offset, 0);
assert_eq!(terms.len(), 3);
let a_ranges = merged.get(&hash_a).unwrap();
assert_eq!(a_ranges.len(), 1);
assert_eq!(a_ranges[0].chunk_range.start, 0);
assert_eq!(a_ranges[0].chunk_range.end, 4);
let b_ranges = merged.get(&hash_b).unwrap();
assert_eq!(b_ranges.len(), 1);
assert_eq!(b_ranges[0].chunk_range.start, 0);
assert_eq!(b_ranges[0].chunk_range.end, 2);
}
#[test]
fn test_compute_ranges_truncates_to_file_size() {
let (xorb_hash, xorb_object) = build_xorb(&[500]);
let file_info = make_file_info(vec![make_segment(xorb_hash, 0, 1, 500)]);
let range = FileRange::new(0, 10000);
let result = compute_reconstruction_ranges(&file_info, Some(range), &mut |_| Ok(xorb_object.clone())).unwrap();
let (offset, terms, _) = result.unwrap();
assert_eq!(offset, 0);
assert_eq!(terms.len(), 1);
assert_eq!(terms[0].unpacked_length, 500);
}
#[test]
fn test_compute_ranges_offset_into_first_range() {
let (xorb_hash, xorb_object) = build_xorb(&[100, 200, 300]);
let file_info = make_file_info(vec![make_segment(xorb_hash, 0, 3, 600)]);
let range = FileRange::new(150, 600);
let result = compute_reconstruction_ranges(&file_info, Some(range), &mut |_| Ok(xorb_object.clone())).unwrap();
let (offset, terms, _) = result.unwrap();
assert_eq!(offset, 50);
assert_eq!(terms[0].range.start, 1);
}
}
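
The `compute_reconstruction_ranges` tests above check how a requested byte range maps onto chunk indices and onto `offset_into_first_range`. The arithmetic can be sketched in isolation as follows. This is a simplified illustration, not the function under test: `locate_range` is a hypothetical helper, it looks at a single run of chunks rather than a list of segments, and it does not reproduce the empty-file behaviour checked in `test_compute_ranges_empty_file`.

```rust
/// Hypothetical helper (not part of the PR): maps the half-open byte range
/// `[start, end)` onto chunks whose unpacked sizes are listed in `chunk_sizes`.
///
/// Returns `(first_chunk, end_chunk, offset)`, where `[first_chunk, end_chunk)`
/// is the half-open chunk range covering the request and `offset` is the number
/// of bytes to skip inside the first selected chunk. Returns `None` when the
/// range selects no bytes.
fn locate_range(chunk_sizes: &[u64], start: u64, end: u64) -> Option<(usize, usize, u64)> {
    let total: u64 = chunk_sizes.iter().sum();
    // Truncate the request to the available data.
    let end = end.min(total);
    if start >= end {
        return None;
    }

    let mut chunk_start = 0u64; // byte offset at which the current chunk begins
    let mut first: Option<(usize, u64)> = None;
    for (i, &size) in chunk_sizes.iter().enumerate() {
        let chunk_end = chunk_start + size;
        // The first chunk whose end lies past `start` contains the first byte.
        if first.is_none() && start < chunk_end {
            first = Some((i, start - chunk_start));
        }
        // The first chunk whose end reaches `end` contains the last byte.
        if end <= chunk_end {
            let (first_chunk, offset) = first?;
            return Some((first_chunk, i + 1, offset));
        }
        chunk_start = chunk_end;
    }
    None
}

fn main() {
    // Same shape as test_compute_ranges_offset_into_first_range:
    // chunks of 100, 200 and 300 bytes, request for bytes 150..600.
    println!("{:?}", locate_range(&[100, 200, 300], 150, 600));
}
```

With chunk sizes `[100, 200, 300]`, a request starting at byte 150 lands 50 bytes into chunk 1, which matches the `offset == 50` and `range.start == 1` assertions in the test above.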


@@ -217,6 +217,66 @@ pub struct QueryReconstructionResponse {
pub fetch_info: HashMap<HexMerkleHash, Vec<XorbReconstructionFetchInfo>>,
}
/// V2 reconstruction response - optimized for multi-range fetching.
/// May provide fewer signed URLs per xorb by combining multiple byte ranges
/// into a single URL where possible.
#[derive(Debug, Serialize, Deserialize, Clone)]
pub struct QueryReconstructionResponseV2 {
pub offset_into_first_range: u64,
pub terms: Vec<XorbReconstructionTerm>,
/// Map from xorb hash -> list of multi-range fetch entries.
/// Typically 1 entry per xorb. Multiple entries when the URL length limit
/// (~8 KiB, roughly 500 ranges) forces a split.
pub xorbs: HashMap<HexMerkleHash, Vec<XorbMultiRangeFetch>>,
}
/// A signed multi-range fetch: one URL covering a subset of ranges for a xorb.
#[derive(Debug, Serialize, Deserialize, Clone)]
pub struct XorbMultiRangeFetch {
/// Signed URL with all byte ranges encoded. Client must send exactly the
/// signed range value as the Range header.
pub url: String,
/// Byte ranges covered by this URL, sorted by chunk start.
pub ranges: Vec<XorbRangeDescriptor>,
}
/// A single byte range within a xorb, mapping chunk indices to physical bytes.
#[derive(Debug, Serialize, Deserialize, Clone)]
pub struct XorbRangeDescriptor {
/// Chunk index range [start, end) within the xorb.
pub chunks: ChunkRange,
/// Physical byte range [start, end] (inclusive end) for the HTTP Range header.
pub bytes: HttpRange,
}
impl From<QueryReconstructionResponse> for QueryReconstructionResponseV2 {
fn from(v1: QueryReconstructionResponse) -> Self {
let xorbs = v1
.fetch_info
.into_iter()
.map(|(hash, fetch_infos)| {
let fetch = fetch_infos
.into_iter()
.map(|info| XorbMultiRangeFetch {
url: info.url,
ranges: vec![XorbRangeDescriptor {
chunks: info.range,
bytes: info.url_range,
}],
})
.collect();
(hash, fetch)
})
.collect();
QueryReconstructionResponseV2 {
offset_into_first_range: v1.offset_into_first_range,
terms: v1.terms,
xorbs,
}
}
}
// Request JSON body type for the POST /reconstructions endpoint, used to get
// the reconstructions for multiple files at a time.
// A listing of non-duplicate (enforced by HashSet) keys (file ids) to get reconstructions for.
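
`XorbRangeDescriptor::bytes` is documented above as an inclusive `[start, end]` range intended for the HTTP `Range` header. A minimal sketch of turning a list of such ranges into a multi-range header value is shown below. `InclusiveRange` and `range_header` are hypothetical stand-ins, and the comma-separated form follows the generic HTTP `bytes=` syntax; the exact string that the server signs is not shown in this diff, so treat the formatting as an assumption.

```rust
/// Hypothetical stand-in for an inclusive byte range (both ends included),
/// mirroring the `[start, end]` convention documented on `XorbRangeDescriptor`.
#[derive(Debug, Clone, Copy, PartialEq)]
struct InclusiveRange {
    start: u64,
    end: u64,
}

/// Formats ranges using the generic HTTP syntax `bytes=a-b,c-d`.
/// Assumption: the signed value uses this form without extra whitespace.
fn range_header(ranges: &[InclusiveRange]) -> String {
    let parts: Vec<String> = ranges
        .iter()
        .map(|r| format!("{}-{}", r.start, r.end))
        .collect();
    format!("bytes={}", parts.join(","))
}

/// Total number of bytes covered; `+ 1` because the end is inclusive.
fn total_len(ranges: &[InclusiveRange]) -> u64 {
    ranges.iter().map(|r| r.end - r.start + 1).sum()
}

fn main() {
    // Byte ranges taken from the JSON example in the API notes.
    let ranges = [
        InclusiveRange { start: 0, end: 1023 },
        InclusiveRange { start: 2048, end: 3071 },
    ];
    println!("{} ({} bytes)", range_header(&ranges), total_len(&ranges));
}
```

The inclusive end matters when sizing buffers: the two ranges in the example cover 1024 bytes each, not 1023.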


@@ -0,0 +1,40 @@
//! Integration tests for the shard upload no-read-timeout client (XET-885).
//!
//! Verifies that shard uploads succeed even when the server takes a long time to process,
//! since the shard upload client has no read_timeout.
use std::time::Duration;
use xet_client::cas_client::simulation::ClientTestingUtils;
use xet_client::cas_client::{DirectAccessClient, LocalTestServerBuilder};
use xet_runtime::test_set_config;
test_set_config! {
client {
retry_max_attempts = 1usize;
retry_base_delay = Duration::from_millis(10);
}
}
const CHUNK_SIZE: usize = 123;
#[tokio::test]
async fn test_shard_upload_succeeds_with_no_server_delay() {
let server = LocalTestServerBuilder::new().start().await;
let result = server.remote_client().upload_random_file(&[(1, (0, 5))], CHUNK_SIZE).await;
assert!(result.is_ok(), "Shard upload should succeed with no server delay: {result:?}");
}
#[tokio::test]
async fn test_shard_upload_succeeds_with_slow_server() {
let server = LocalTestServerBuilder::new().start().await;
// Server takes 3s to respond — shard upload client has no read_timeout so this should succeed
server.set_api_delay_range(Some(Duration::from_secs(3)..Duration::from_secs(3)));
let result = server.remote_client().upload_random_file(&[(1, (0, 5))], CHUNK_SIZE).await;
assert!(result.is_ok(), "Shard upload should succeed even with slow server (no read_timeout): {result:?}");
}


@@ -192,6 +192,27 @@ pub fn deserialize_chunks<R: Read>(reader: &mut R) -> Result<(Vec<u8>, Vec<u32>)
Ok((buf, chunk_byte_indices))
}
/// Appends a deserialized chunk segment to existing accumulated buffers.
///
/// `deserialize_chunks` returns `chunk_byte_indices` starting with a leading `0`.
/// When concatenating multiple segments, this function deduplicates that leading
/// zero for subsequent segments and rebases all indices to account for data already
/// accumulated.
pub fn append_chunk_segment(
all_data: &mut Vec<u8>,
all_chunk_indices: &mut Vec<u32>,
segment_data: &[u8],
segment_indices: &[u32],
) {
let base_offset = all_data.len() as u32;
if all_chunk_indices.is_empty() {
all_chunk_indices.extend_from_slice(segment_indices);
} else {
all_chunk_indices.extend(segment_indices.iter().skip(1).map(|&o| o + base_offset));
}
all_data.extend_from_slice(segment_data);
}
/// Reads the next chunk header, returning `None` on clean EOF.
///
/// Uses a single `read()` call to detect EOF (returns 0), then completes
@@ -338,6 +359,37 @@ mod tests {
}
}
#[test]
fn test_append_chunk_segment() {
let mut all_data = Vec::new();
let mut all_indices = Vec::<u32>::new();
// First segment: simulates deserialize_chunks output [0, 10, 25]
append_chunk_segment(&mut all_data, &mut all_indices, &[0u8; 25], &[0, 10, 25]);
assert_eq!(all_data.len(), 25);
assert_eq!(all_indices, vec![0, 10, 25]);
// Second segment: [0, 8, 20] — leading 0 should be skipped, offsets rebased by 25
append_chunk_segment(&mut all_data, &mut all_indices, &[1u8; 20], &[0, 8, 20]);
assert_eq!(all_data.len(), 45);
assert_eq!(all_indices, vec![0, 10, 25, 33, 45]);
// Third segment: single chunk [0, 5] — leading 0 skipped, rebased by 45
append_chunk_segment(&mut all_data, &mut all_indices, &[2u8; 5], &[0, 5]);
assert_eq!(all_data.len(), 50);
assert_eq!(all_indices, vec![0, 10, 25, 33, 45, 50]);
}
#[test]
fn test_append_chunk_segment_single() {
let mut all_data = Vec::new();
let mut all_indices = Vec::<u32>::new();
append_chunk_segment(&mut all_data, &mut all_indices, &[0u8; 10], &[0, 10]);
assert_eq!(all_data.len(), 10);
assert_eq!(all_indices, vec![0, 10]);
}
#[test]
fn test_truncated_stream_returns_error() {
let (_, xorb_data, _, _) = build_xorb_object(3, ChunkSize::Fixed(1024), CompressionScheme::None);
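
The effect of `append_chunk_segment` is easiest to see by reading individual chunks back out of the accumulated buffer. The sketch below copies the function body from this diff so that it compiles on its own; `chunk_slice` is a hypothetical helper added only for illustration.

```rust
/// Copied from the diff above so the example is self-contained.
fn append_chunk_segment(
    all_data: &mut Vec<u8>,
    all_chunk_indices: &mut Vec<u32>,
    segment_data: &[u8],
    segment_indices: &[u32],
) {
    let base_offset = all_data.len() as u32;
    if all_chunk_indices.is_empty() {
        // First segment: keep the leading 0.
        all_chunk_indices.extend_from_slice(segment_indices);
    } else {
        // Later segments: drop the leading 0 and rebase onto existing data.
        all_chunk_indices.extend(segment_indices.iter().skip(1).map(|&o| o + base_offset));
    }
    all_data.extend_from_slice(segment_data);
}

/// Hypothetical helper: returns the bytes of chunk `i`, using the convention
/// that chunk `i` spans `indices[i]..indices[i + 1]`.
fn chunk_slice<'a>(data: &'a [u8], indices: &[u32], i: usize) -> &'a [u8] {
    &data[indices[i] as usize..indices[i + 1] as usize]
}

fn main() {
    let mut data = Vec::new();
    let mut indices = Vec::new();

    // Two segments, each with two chunks and its own zero-based indices.
    append_chunk_segment(&mut data, &mut indices, b"aaabb", &[0, 3, 5]);
    append_chunk_segment(&mut data, &mut indices, b"ccccd", &[0, 4, 5]);

    // After concatenation there is one boundary list for all four chunks.
    for i in 0..indices.len() - 1 {
        println!("chunk {i}: {}", String::from_utf8_lossy(chunk_slice(&data, &indices, i)));
    }
}
```

Because the leading zero of every later segment is dropped, the index vector always has exactly one more entry than the number of chunks, so `indices[i]..indices[i + 1]` stays valid across segment boundaries.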


@@ -375,6 +375,7 @@ mod tests {
use ulid::Ulid;
use xet_client::cas_client::{ClientTestingUtils, DirectAccessClient, LocalClient, RandomFileContents};
use xet_client::cas_types::FileRange;
use xet_runtime::core::XetRuntime;
use super::*;
use crate::progress_tracking::NoOpProgressUpdater;
@@ -405,6 +406,7 @@ mod tests {
file_hash: MerkleHash,
byte_range: Option<FileRange>,
config: &ReconstructionConfig,
semaphore: Option<Arc<AdjustableSemaphore>>,
) -> Result<Vec<u8>> {
let buffer = Arc::new(std::sync::Mutex::new(Cursor::new(Vec::new())));
let writer = StaticCursorWriter(buffer.clone());
@@ -415,6 +417,9 @@ mod tests {
if let Some(range) = byte_range {
reconstructor = reconstructor.with_byte_range(range);
}
if let Some(sem) = semaphore {
reconstructor = reconstructor.with_buffer_semaphore(sem);
}
reconstructor.reconstruct_to_writer(writer).await?;
@@ -528,7 +533,7 @@ mod tests {
config.use_vectored_write = use_vectored;
// Test 1: reconstruct_to_writer
let vec_result = reconstruct_to_vec(client, h, None, &config).await.unwrap();
let vec_result = reconstruct_to_vec(client, h, None, &config, None).await.unwrap();
assert_eq!(vec_result, *expected, "vec failed (vectored={use_vectored})");
// Test 2: reconstruct_to_file
@@ -560,7 +565,7 @@ mod tests {
config.use_vectored_write = use_vectored;
// Test 1: reconstruct_to_writer
let vec_result = reconstruct_to_vec(client, file_contents.file_hash, Some(range), &config)
let vec_result = reconstruct_to_vec(client, file_contents.file_hash, Some(range), &config, None)
.await
.expect("reconstruct_to_vec should succeed");
assert_eq!(vec_result, expected, "vec failed (vectored={use_vectored})");
@@ -911,7 +916,11 @@ mod tests {
#[tokio::test]
async fn test_non_contiguous_chunks() {
let (client, file_contents) = setup_test_file(&[(1, (0, 2)), (1, (4, 6))]).await;
reconstruct_and_verify_full(&client, &file_contents, test_config()).await;
let config = test_config();
let result = reconstruct_to_vec(&client, file_contents.file_hash, None, &config, None)
.await
.unwrap();
assert_eq!(result, file_contents.data);
}
// ==================== Default Config Tests ====================
@@ -1157,7 +1166,7 @@ mod tests {
let mut config = test_config();
config.download_buffer_perfile_size = xet_runtime::utils::ByteSize::from("8kb");
let reconstructed = reconstruct_to_vec(&client, file_contents.file_hash, None, &config)
let reconstructed = reconstruct_to_vec(&client, file_contents.file_hash, None, &config, None)
.await
.unwrap();
assert_eq!(reconstructed, file_contents.data);
@@ -1287,6 +1296,348 @@ mod tests {
assert_eq!(&result[start as usize..end as usize], &file_contents.data[start as usize..end as usize]);
}
// ==================== V1 Fallback Tests ====================
//
// These tests use LocalTestServer with V2 disabled to verify that
// reconstruction works correctly when the client falls back from V2 to V1.
/// Helper to reconstruct through a LocalTestServer (RemoteClient HTTP path).
async fn reconstruct_via_server(
server: &xet_client::cas_client::LocalTestServer,
file_hash: MerkleHash,
byte_range: Option<FileRange>,
config: &ReconstructionConfig,
) -> Result<Vec<u8>> {
let buffer = Arc::new(std::sync::Mutex::new(Cursor::new(Vec::new())));
let writer = StaticCursorWriter(buffer.clone());
let client: Arc<dyn Client> = server.remote_client().clone();
let mut reconstructor = FileReconstructor::new(&client, file_hash).with_config(config);
if let Some(range) = byte_range {
reconstructor = reconstructor.with_byte_range(range);
}
reconstructor.reconstruct_to_writer(writer).await?;
let data = buffer.lock().unwrap().get_ref().clone();
Ok(data)
}
#[tokio::test]
async fn test_v1_fallback_full_reconstruction() {
let server = xet_client::cas_client::LocalTestServerBuilder::new().start().await;
let file_contents = server
.remote_client()
.upload_random_file(&[(1, (0, 3)), (2, (0, 2))], TEST_CHUNK_SIZE)
.await
.unwrap();
// Disable V2 so the remote client falls back to V1 + conversion.
server.disable_v2_reconstruction(404);
let config = test_config();
let result = reconstruct_via_server(&server, file_contents.file_hash, None, &config)
.await
.unwrap();
assert_eq!(result, file_contents.data.as_ref());
}
#[tokio::test]
async fn test_v1_fallback_partial_range() {
let server = xet_client::cas_client::LocalTestServerBuilder::new().start().await;
let file_contents = server
.remote_client()
.upload_random_file(&[(1, (0, 5)), (2, (0, 3))], TEST_CHUNK_SIZE)
.await
.unwrap();
server.disable_v2_reconstruction(404);
let file_len = file_contents.data.len() as u64;
let range = FileRange::new(file_len / 4, file_len * 3 / 4);
let config = test_config();
let result = reconstruct_via_server(&server, file_contents.file_hash, Some(range), &config)
.await
.unwrap();
assert_eq!(result, &file_contents.data[range.start as usize..range.end as usize]);
}
#[tokio::test]
async fn test_v1_fallback_non_contiguous_chunks() {
let server = xet_client::cas_client::LocalTestServerBuilder::new().start().await;
let file_contents = server
.remote_client()
.upload_random_file(&[(1, (0, 2)), (1, (4, 6))], TEST_CHUNK_SIZE)
.await
.unwrap();
server.disable_v2_reconstruction(404);
let config = test_config();
let result = reconstruct_via_server(&server, file_contents.file_hash, None, &config)
.await
.unwrap();
assert_eq!(result, file_contents.data.as_ref());
}
#[tokio::test]
async fn test_v1_fallback_multiple_xorbs() {
let server = xet_client::cas_client::LocalTestServerBuilder::new().start().await;
let file_contents = server
.remote_client()
.upload_random_file(&[(1, (0, 2)), (2, (0, 3)), (3, (0, 2)), (1, (2, 4))], TEST_CHUNK_SIZE)
.await
.unwrap();
server.disable_v2_reconstruction(404);
let config = test_config();
let result = reconstruct_via_server(&server, file_contents.file_hash, None, &config)
.await
.unwrap();
assert_eq!(result, file_contents.data.as_ref());
}
/// V1 fallback with three disjoint ranges from the same xorb.
#[tokio::test]
async fn test_v1_fallback_triple_disjoint_ranges() {
let server = xet_client::cas_client::LocalTestServerBuilder::new().start().await;
let file_contents = server
.remote_client()
.upload_random_file(&[(1, (0, 2)), (1, (4, 6)), (1, (8, 10))], TEST_CHUNK_SIZE)
.await
.unwrap();
server.disable_v2_reconstruction(404);
let config = test_config();
let result = reconstruct_via_server(&server, file_contents.file_hash, None, &config)
.await
.unwrap();
assert_eq!(result, file_contents.data.as_ref());
}
// ==================== Max Ranges Tests ====================
//
// These tests use LocalTestServer with a small max_ranges_per_fetch (1 or 2) to verify that
// multi-range fetch splitting works correctly through the full HTTP path.
/// Helper to set up a server with max_ranges_per_fetch and reconstruct.
async fn reconstruct_via_server_with_max_ranges(
term_spec: &[(u64, (u64, u64))],
max_ranges: usize,
byte_range: Option<FileRange>,
) -> (Vec<u8>, RandomFileContents) {
let server = xet_client::cas_client::LocalTestServerBuilder::new().start().await;
let file_contents = server
.remote_client()
.upload_random_file(term_spec, TEST_CHUNK_SIZE)
.await
.unwrap();
server.set_max_ranges_per_fetch(max_ranges);
let config = test_config();
let result = reconstruct_via_server(&server, file_contents.file_hash, byte_range, &config)
.await
.unwrap();
(result, file_contents)
}
#[tokio::test]
async fn test_max_ranges_simple() {
let (result, file_contents) =
reconstruct_via_server_with_max_ranges(&[(1, (0, 3)), (2, (0, 2))], 2, None).await;
assert_eq!(result, file_contents.data.as_ref());
}
/// A single xorb with two disjoint ranges, split at max_ranges=1.
/// Each range becomes its own fetch entry.
#[tokio::test]
async fn test_max_ranges_1_disjoint() {
let (result, file_contents) =
reconstruct_via_server_with_max_ranges(&[(1, (0, 2)), (1, (4, 6))], 1, None).await;
assert_eq!(result, file_contents.data.as_ref());
}
/// Three disjoint ranges from the same xorb with max_ranges=2.
/// First two ranges are grouped, third gets its own fetch entry.
#[tokio::test]
async fn test_max_ranges_2_triple_disjoint() {
let (result, file_contents) =
reconstruct_via_server_with_max_ranges(&[(1, (0, 2)), (1, (4, 6)), (1, (8, 10))], 2, None).await;
assert_eq!(result, file_contents.data.as_ref());
}
/// Multiple xorbs, each with disjoint ranges, with max_ranges=2.
/// Tests that splitting is applied per-xorb correctly.
#[tokio::test]
async fn test_max_ranges_2_multi_xorb_disjoint() {
let term_spec = &[
(1, (0, 2)),
(2, (0, 2)),
(1, (4, 6)),
(2, (4, 6)),
(1, (8, 10)),
(2, (8, 10)),
];
let (result, file_contents) = reconstruct_via_server_with_max_ranges(term_spec, 2, None).await;
assert_eq!(result, file_contents.data.as_ref());
}
/// Complex interleaved pattern with max_ranges=2 and a partial byte range.
#[tokio::test]
async fn test_max_ranges_2_partial_range() {
let term_spec = &[
(1, (0, 3)),
(2, (0, 2)),
(1, (3, 5)),
(3, (1, 4)),
(2, (4, 6)),
(1, (0, 2)),
];
let server = xet_client::cas_client::LocalTestServerBuilder::new().start().await;
let file_contents = server
.remote_client()
.upload_random_file(term_spec, TEST_CHUNK_SIZE)
.await
.unwrap();
server.set_max_ranges_per_fetch(2);
let file_len = file_contents.data.len() as u64;
let range = FileRange::new(file_len / 4, file_len * 3 / 4);
let config = test_config();
let result = reconstruct_via_server(&server, file_contents.file_hash, Some(range), &config)
.await
.unwrap();
assert_eq!(result, &file_contents.data[range.start as usize..range.end as usize]);
}
// ==================== Multi-Disjoint Range Tests (LocalClient) ====================
//
// These tests exercise complex disjoint range patterns through the LocalClient path
// (no HTTP server), ensuring the reconstruction logic handles V2 multi-range
// XorbBlocks correctly.
/// Single xorb with three disjoint chunk ranges.
#[tokio::test]
async fn test_triple_disjoint_ranges_full() {
let (client, file_contents) = setup_test_file(&[(1, (0, 2)), (1, (4, 6)), (1, (8, 10))]).await;
reconstruct_and_verify_full(&client, &file_contents, test_config()).await;
}
/// Single xorb with three disjoint chunk ranges, partial byte range.
#[tokio::test]
async fn test_triple_disjoint_ranges_partial() {
let (client, file_contents) = setup_test_file(&[(1, (0, 2)), (1, (4, 6)), (1, (8, 10))]).await;
let file_len = file_contents.data.len() as u64;
let range = FileRange::new(file_len / 4, file_len * 3 / 4);
reconstruct_and_verify_range(&client, &file_contents, range, test_config()).await;
}
/// Multiple xorbs, each with multiple disjoint ranges, interleaved.
#[tokio::test]
async fn test_multi_xorb_interleaved_disjoint() {
let term_spec = &[
(1, (0, 2)),
(2, (0, 2)),
(1, (4, 6)),
(2, (4, 6)),
(1, (8, 10)),
(2, (8, 10)),
];
let (client, file_contents) = setup_test_file(term_spec).await;
reconstruct_and_verify_full(&client, &file_contents, test_config()).await;
}
/// Multiple xorbs with interleaved disjoint ranges, partial byte range.
#[tokio::test]
async fn test_multi_xorb_interleaved_disjoint_partial() {
let term_spec = &[
(1, (0, 2)),
(2, (0, 2)),
(1, (4, 6)),
(2, (4, 6)),
(1, (8, 10)),
(2, (8, 10)),
];
let (client, file_contents) = setup_test_file(term_spec).await;
let file_len = file_contents.data.len() as u64;
let range = FileRange::new(file_len / 3, file_len * 2 / 3);
reconstruct_and_verify_range(&client, &file_contents, range, test_config()).await;
}
/// Single xorb with four disjoint ranges (many gaps).
#[tokio::test]
async fn test_four_disjoint_ranges() {
let term_spec = &[(1, (0, 2)), (1, (4, 6)), (1, (8, 10)), (1, (12, 14))];
let (client, file_contents) = setup_test_file(term_spec).await;
reconstruct_and_verify_full(&client, &file_contents, test_config()).await;
}
/// Mix of contiguous and disjoint ranges from the same xorb.
#[tokio::test]
async fn test_mixed_contiguous_and_disjoint() {
let term_spec = &[
(1, (0, 3)), // contiguous block
(1, (3, 5)), // continues contiguously
(1, (8, 10)), // gap, then disjoint
];
let (client, file_contents) = setup_test_file(term_spec).await;
reconstruct_and_verify_full(&client, &file_contents, test_config()).await;
}
/// Disjoint ranges across three xorbs with a complex access pattern.
#[tokio::test]
async fn test_complex_three_xorb_disjoint() {
let term_spec = &[
(1, (0, 2)),
(2, (0, 3)),
(3, (2, 5)),
(1, (5, 8)),
(2, (6, 8)),
(3, (0, 2)),
];
let (client, file_contents) = setup_test_file(term_spec).await;
reconstruct_and_verify_full(&client, &file_contents, test_config()).await;
}
/// LocalClient with max_ranges_per_fetch=2 (tests V2 response splitting without HTTP).
#[tokio::test]
async fn test_local_client_max_ranges_2_disjoint() {
let client = LocalClient::temporary().await.unwrap();
client.set_max_ranges_per_fetch(2);
let term_spec = &[(1, (0, 2)), (1, (4, 6)), (1, (8, 10)), (1, (12, 14))];
let file_contents = client.upload_random_file(term_spec, TEST_CHUNK_SIZE).await.unwrap();
let config = test_config();
let result = reconstruct_to_vec(&client, file_contents.file_hash, None, &config, None)
.await
.unwrap();
assert_eq!(result, file_contents.data.as_ref());
}
/// LocalClient with max_ranges_per_fetch=1 (every range gets its own fetch entry).
#[tokio::test]
async fn test_local_client_max_ranges_1_multi_xorb() {
let client = LocalClient::temporary().await.unwrap();
client.set_max_ranges_per_fetch(1);
let term_spec = &[(1, (0, 2)), (2, (0, 2)), (1, (4, 6)), (2, (4, 6))];
let file_contents = client.upload_random_file(term_spec, TEST_CHUNK_SIZE).await.unwrap();
let config = test_config();
let result = reconstruct_to_vec(&client, file_contents.file_hash, None, &config, None)
.await
.unwrap();
assert_eq!(result, file_contents.data.as_ref());
}
// ==================== Cancellation Flag Tests ====================
#[tokio::test]
@@ -1385,4 +1736,132 @@ mod tests {
assert_eq!(bytes_written, file_contents.data.len() as u64);
assert_eq!(buffer.lock().unwrap().get_ref().clone(), file_contents.data);
}
// ==================== Multirange Fetching Tests ====================
//
// These tests verify that reconstruction works correctly with both values
// of `enable_multirange_fetching`. When true, V2 multi-range fetch entries
// are used as-is (multirange HTTP requests). When false (default), each
// range is split into its own XorbBlock and fetched via a separate
// single-range request in parallel.
//
// Uses XetRuntime::new_with_config() to override the config per-test,
// following the pattern from test_dynamic_buffer_scaling_noop_increment_preserves_total_permits.
fn with_multirange_config(enable: bool) -> Arc<XetRuntime> {
let mut config = xet_runtime::config::XetConfig::new();
config.client.enable_multirange_fetching = enable;
XetRuntime::new_with_config(config).unwrap()
}
/// Exercises multiple disjoint-range scenarios through LocalClient with both
/// enable_multirange_fetching=true and =false.
#[test]
fn test_multirange_local_client() {
for enable in [false, true] {
let rt = with_multirange_config(enable);
rt.external_run_async_task(async move {
let scenarios: Vec<Vec<(u64, (u64, u64))>> = vec![
vec![(1, (0, 2)), (1, (4, 6)), (1, (8, 10))],
vec![
(1, (0, 2)),
(2, (0, 2)),
(1, (4, 6)),
(2, (4, 6)),
(1, (8, 10)),
(2, (8, 10)),
],
vec![
(1, (0, 2)),
(2, (0, 3)),
(3, (2, 5)),
(1, (5, 8)),
(2, (6, 8)),
(3, (0, 2)),
],
];
let config = test_config();
for term_spec in &scenarios {
let (client, fc) = setup_test_file(term_spec).await;
reconstruct_and_verify_full(&client, &fc, config.clone()).await;
let file_len = fc.data.len() as u64;
let range = FileRange::new(file_len / 4, file_len * 3 / 4);
reconstruct_and_verify_range(&client, &fc, range, config.clone()).await;
}
})
.unwrap();
}
}
/// LocalClient with max_ranges_per_fetch constraint, both enable settings.
#[test]
fn test_multirange_max_ranges() {
for enable in [false, true] {
let rt = with_multirange_config(enable);
rt.external_run_async_task(async {
let client = LocalClient::temporary().await.unwrap();
client.set_max_ranges_per_fetch(2);
let term_spec = &[(1, (0, 2)), (1, (4, 6)), (1, (8, 10)), (1, (12, 14))];
let fc = client.upload_random_file(term_spec, TEST_CHUNK_SIZE).await.unwrap();
let config = test_config();
let result = reconstruct_to_vec(&client, fc.file_hash, None, &config, None).await.unwrap();
assert_eq!(result, fc.data.as_ref());
})
.unwrap();
}
}
/// Exercises HTTP server path with full, max-ranges-split, and partial-range
/// reconstruction, both enable_multirange_fetching values.
#[test]
fn test_multirange_via_server() {
for enable in [false, true] {
let rt = with_multirange_config(enable);
rt.external_run_async_task(async {
let config = test_config();
// Full reconstruction with disjoint ranges
let server = xet_client::cas_client::LocalTestServerBuilder::new().start().await;
let fc = server
.remote_client()
.upload_random_file(&[(1, (0, 2)), (1, (4, 6)), (1, (8, 10))], TEST_CHUNK_SIZE)
.await
.unwrap();
let result = reconstruct_via_server(&server, fc.file_hash, None, &config).await.unwrap();
assert_eq!(result, fc.data.as_ref());
// Multi-xorb with max_ranges_per_fetch=2
let server = xet_client::cas_client::LocalTestServerBuilder::new().start().await;
let fc = server
.remote_client()
.upload_random_file(
&[(1, (0, 2)), (2, (0, 2)), (1, (4, 6)), (2, (4, 6)), (1, (8, 10))],
TEST_CHUNK_SIZE,
)
.await
.unwrap();
server.set_max_ranges_per_fetch(2);
let result = reconstruct_via_server(&server, fc.file_hash, None, &config).await.unwrap();
assert_eq!(result, fc.data.as_ref());
// Partial byte range
let server = xet_client::cas_client::LocalTestServerBuilder::new().start().await;
let fc = server
.remote_client()
.upload_random_file(&[(1, (0, 3)), (2, (0, 2)), (1, (3, 5)), (2, (4, 6))], TEST_CHUNK_SIZE)
.await
.unwrap();
let file_len = fc.data.len() as u64;
let range = FileRange::new(file_len / 4, file_len * 3 / 4);
let result = reconstruct_via_server(&server, fc.file_hash, Some(range), &config)
.await
.unwrap();
assert_eq!(result, &fc.data[range.start as usize..range.end as usize]);
})
.unwrap();
}
}
}
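
The tests above combine two independent knobs: a per-fetch range limit (`max_ranges_per_fetch`) and the client setting `enable_multirange_fetching`. The sketch below models both with plain vectors. It is an illustration under simplifying assumptions: `ByteRange`, `group_by_max_ranges` and `plan_requests` are hypothetical names, presigned URLs are left out, and the real implementation works on `XorbBlock` structures rather than nested vectors.

```rust
/// Hypothetical stand-in for one byte range of a xorb.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ByteRange {
    start: u64,
    end: u64,
}

/// Groups the ranges of one xorb into fetch entries holding at most
/// `max_ranges` ranges each, preserving order.
fn group_by_max_ranges(ranges: &[ByteRange], max_ranges: usize) -> Vec<Vec<ByteRange>> {
    assert!(max_ranges > 0, "max_ranges must be at least 1");
    ranges.chunks(max_ranges).map(|group| group.to_vec()).collect()
}

/// Decides which requests to issue for the given fetch entries.
/// With multirange enabled, each entry becomes one multi-range request.
/// With it disabled, every range becomes its own single-range request,
/// which can then be executed in parallel.
fn plan_requests(entries: Vec<Vec<ByteRange>>, enable_multirange: bool) -> Vec<Vec<ByteRange>> {
    if enable_multirange {
        entries
    } else {
        entries.into_iter().flatten().map(|r| vec![r]).collect()
    }
}

fn main() {
    let ranges = [
        ByteRange { start: 0, end: 99 },
        ByteRange { start: 200, end: 299 },
        ByteRange { start: 400, end: 499 },
    ];
    // Three disjoint ranges with a limit of two: the first two share an entry.
    let entries = group_by_max_ranges(&ranges, 2);
    println!("entries: {entries:?}");
    println!("requests (multirange off): {:?}", plan_requests(entries, false));
}
```

In this model the two settings compose: the range limit bounds how many ranges one URL may carry, while the client flag decides whether those ranges travel in one request or in several.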


@@ -7,6 +7,7 @@ use tokio::sync::OnceCell;
use xet_client::cas_client::Client;
use xet_client::cas_types::{ChunkRange, FileRange, HttpRange};
use xet_core_structures::merklehash::MerkleHash;
use xet_runtime::core::xet_config;
use xet_runtime::utils::UniqueId;
use super::super::FileReconstructionError;
@@ -19,17 +20,28 @@ use crate::progress_tracking::download_tracking::DownloadTaskUpdater;
/// in the output file that maps to a chunk range within a xorb block.
#[derive(Clone)]
pub struct FileTerm {
// The byte range in the file of this term.
pub byte_range: FileRange,
// Absolute chunk range within the full xorb, regardless of how much of the xorb is actually downloaded.
pub xorb_chunk_range: ChunkRange,
// The index of the (chunk index, byte offset) pair in the xorb block that starts this file term.
pub xorb_block_start_index: usize,
// The byte offset into the first range of the xorb block should this term not start on a chunk boundary.
pub offset_into_first_range: u64,
// The xorb block that sourced this file term.
pub xorb_block: Arc<XorbBlock>,
// The retrieval URL information for this file term.
pub url_info: Arc<TermBlockRetrievalURLs>,
}
impl FileTerm {
pub fn extract_bytes(&self, xorb_block_data: &XorbBlockData) -> Bytes {
let local_start_chunk = (self.xorb_chunk_range.start - self.xorb_block.chunk_range.start) as usize;
let start_byte_offset = xorb_block_data.chunk_offsets[local_start_chunk];
let (_, start_byte_offset) = xorb_block_data.chunk_offsets[self.xorb_block_start_index];
let start_byte_offset = start_byte_offset + self.offset_into_first_range as usize;
let expected_size = (self.byte_range.end - self.byte_range.start) as usize;
let end_byte_offset = start_byte_offset + expected_size;
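
The updated `extract_bytes` looks up its start position through `xorb_block_start_index` instead of subtracting chunk indices. A plausible reason is that a multi-range block concatenates non-contiguous chunk ranges, so a chunk's position in the block no longer equals `chunk_index - block_start`. The sketch below illustrates the lookup under that assumption; `extract` and the layout of `chunk_offsets` as `(chunk index, byte offset)` pairs are inferred from the lines above, and everything else is hypothetical.

```rust
/// Hypothetical, simplified version of the lookup in `extract_bytes`.
///
/// `chunk_offsets` holds `(chunk index, byte offset into data)` pairs for every
/// chunk stored in the block, in storage order. Returns `None` instead of
/// panicking when the requested span falls outside `data`.
fn extract<'a>(
    data: &'a [u8],
    chunk_offsets: &[(u32, usize)],
    start_index: usize,
    offset_into_first_range: usize,
    len: usize,
) -> Option<&'a [u8]> {
    let (_, chunk_start) = *chunk_offsets.get(start_index)?;
    let start = chunk_start.checked_add(offset_into_first_range)?;
    let end = start.checked_add(len)?;
    data.get(start..end)
}

fn main() {
    // A block holding chunks 0..2 and 4..6 of a xorb, stored back to back.
    // Chunk 4 sits at position 2 in the block, not at position 4.
    let chunk_offsets = [(0u32, 0usize), (1, 10), (4, 25), (5, 33)];
    let data: Vec<u8> = (0..45u8).collect();

    // A term covering chunks 4..6 starts at flattened index 2.
    let bytes = extract(&data, &chunk_offsets, 2, 0, 20);
    println!("{:?}", bytes.map(|b| b.len()));
}
```

Storing the flattened index on the term keeps the lookup constant-time and avoids searching `chunk_offsets` for a chunk index at extraction time.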
@@ -67,6 +79,25 @@ impl FileTerm {
}
}
/// Intermediate data for a single file term, collected during the first pass of
/// `retrieve_file_term_block` before the final `FileTerm` structs are built.
///
/// We need this because `FileTerm` requires `Arc<XorbBlock>` and `Arc<TermBlockRetrievalURLs>`,
/// which can't be constructed until all terms have been processed.
struct FileTermEntry {
/// The byte range in the output file that this term covers.
byte_range: FileRange,
/// The chunk range within the xorb that sources this term's data.
xorb_chunk_range: ChunkRange,
/// Byte offset into the first chunk's data, non-zero only for the first term
/// when the query range starts mid-chunk.
offset_into_first_range: u64,
/// Index into the `xorb_blocks` / `xorb_block_retrieval_urls` vectors.
xorb_block_index: usize,
/// Flattened index into the xorb block's `chunk_offsets` for this term's start chunk.
xorb_block_start_index: usize,
}
/// Retrieve file terms from the client for a given file hash and byte range.
/// Returns None if the requested byte range is past the end of the file.
/// Returns the actual retrieved range and the number of bytes required for the
@@ -77,78 +108,111 @@ pub async fn retrieve_file_term_block(
file_hash: MerkleHash,
query_file_byte_range: FileRange,
) -> Result<Option<(FileRange, u64, Vec<FileTerm>)>> {
// First, get the raw reconstruction.
// get_reconstruction always returns V2 format (the client converts V1 internally).
let Some(raw_reconstruction) = client.get_reconstruction(&file_hash, Some(query_file_byte_range)).await? else {
// None means we've requested a byte range beyond the end of the file.
return Ok(None);
};
// Set a new url acquisition id to ensure that we don't double up the url acquisitions.
// Each acquisition gets a unique ID used for single-flight URL refresh dedup.
let acquisition_id = UniqueId::new();
// Intermediate storage for file term data before we create the actual FileTerm structs.
// (byte_range, xorb_chunk_range, offset_into_first_range, index into xorb_blocks)
let mut file_term_data = Vec::<(FileRange, ChunkRange, u64, usize)>::with_capacity(raw_reconstruction.terms.len());
// First pass: iterate through the reconstruction terms and build up intermediate
// FileTermEntry data, XorbBlock objects, and retrieval URL info. We can't construct
// the final FileTerm structs yet because they need Arc<XorbBlock> and Arc<TermBlockRetrievalURLs>,
// which require all terms to be processed first.
let mut file_term_data = Vec::<FileTermEntry>::with_capacity(raw_reconstruction.terms.len());
let n_xorb_terms = raw_reconstruction.fetch_info.values().map(|v| v.len()).sum();
// Parallel vectors indexed by xorb_block_index:
// - xorb_blocks: the block metadata (hash, chunk ranges, references)
// - xorb_block_retrieval_urls: the download URL and byte ranges for each block
let mut xorb_blocks: Vec<XorbBlock> = Vec::new();
let mut xorb_block_retrieval_urls = Vec::<(String, Vec<HttpRange>)>::new();
// Keep track of the xorb blocks we've created, keyed by (xorb_hash, first chunk index).
let mut xorb_blocks: Vec<XorbBlock> = Vec::with_capacity(n_xorb_terms);
// Dedup map: (xorb_hash, first_range_chunk_start) -> xorb_block_index.
// Multiple terms may reference the same xorb block; this ensures we create
// each block only once and share it across terms.
let mut xorb_index_lookup = HashMap::<(MerkleHash, u32), usize>::new();
// Keep track of the URLs for each.
let mut xorb_block_retrieval_urls = Vec::<(String, HttpRange)>::with_capacity(n_xorb_terms);
// Get a hash map so we can reindex the xorb terms; map of (xorb_hash, first chunk index) -> xorb block index.
let mut xorb_index_lookup = HashMap::<(MerkleHash, u64), usize>::with_capacity(n_xorb_terms);
// Keep track of where we are so as to map the file terms to the byte range within the file.
// Track the current byte offset in the output file as we process terms sequentially.
let mut cur_file_byte_offset = query_file_byte_range.start;
// We'll create the URL info after processing all terms, once we know the actual range.
let enable_multirange = xet_config().client.enable_multirange_fetching;
// Iterate over the terms and build the file terms and xorb terms.
for (local_term_index, term) in raw_reconstruction.terms.iter().enumerate() {
let xorb_hash: MerkleHash = term.hash.into();
// Get the xorb info here.
let Some(xorb_info) = raw_reconstruction.fetch_info.get(&term.hash) else {
let Some(xorb_descriptor) = raw_reconstruction.xorbs.get(&term.hash) else {
return Err(FileReconstructionError::CorruptedReconstruction(format!(
"Xorb info not found for xorb hash {xorb_hash:?}"
)));
};
// Get the xorb block index that this term belongs to.
// Find the XorbBlock for this term's chunk range. The behavior depends on the
// enable_multirange_fetching config:
//
// - When true: one XorbBlock per XorbMultiRangeFetch entry, preserving all ranges in a single block
// (multi-range HTTP request).
// - When false (default): one XorbBlock per individual XorbRangeDescriptor, so each range is fetched as a
// separate single-range HTTP request in parallel.
let xorb_block_index = 'find_xorb_block: {
for raw_xorb_block_info in xorb_info.iter() {
let chunk_range = raw_xorb_block_info.range;
for fetch_entry in xorb_descriptor.iter() {
if enable_multirange {
let term_contained = fetch_entry
.ranges
.iter()
.any(|r| r.chunks.start <= term.range.start && term.range.end <= r.chunks.end);
if chunk_range.start <= term.range.start && term.range.start <= chunk_range.end {
// Verify that the term range is contained within the xorb block.
if term.range.end > chunk_range.end {
return Err(FileReconstructionError::CorruptedReconstruction(format!(
"Term range extends beyond xorb block range for xorb hash {xorb_hash:?}"
)));
if !term_contained {
continue;
}
// Reuse the previous one if it exists, otherwise insert a new one.
let index = match xorb_index_lookup.entry((xorb_hash, chunk_range.start as u64)) {
let first_chunk_start = fetch_entry.ranges[0].chunks.start;
let index = match xorb_index_lookup.entry((xorb_hash, first_chunk_start)) {
Entry::Occupied(entry) => *entry.get(),
Entry::Vacant(entry) => {
let new_index = xorb_blocks.len();
let chunk_ranges: Vec<ChunkRange> = fetch_entry.ranges.iter().map(|r| r.chunks).collect();
let http_ranges: Vec<HttpRange> = fetch_entry.ranges.iter().map(|r| r.bytes).collect();
xorb_blocks.push(XorbBlock {
xorb_hash,
chunk_range,
chunk_ranges,
xorb_block_index: new_index,
references: vec![],
uncompressed_size_if_known: None,
data: OnceCell::new(),
});
// Store the retrieval URL and range for this xorb block.
xorb_block_retrieval_urls
.push((raw_xorb_block_info.url.clone(), raw_xorb_block_info.url_range));
xorb_block_retrieval_urls.push((fetch_entry.url.clone(), http_ranges));
entry.insert(new_index);
new_index
},
};
break 'find_xorb_block index;
} else {
for range in &fetch_entry.ranges {
if range.chunks.start <= term.range.start && term.range.end <= range.chunks.end {
let index = match xorb_index_lookup.entry((xorb_hash, range.chunks.start)) {
Entry::Occupied(entry) => *entry.get(),
Entry::Vacant(entry) => {
let new_index = xorb_blocks.len();
xorb_blocks.push(XorbBlock {
xorb_hash,
chunk_ranges: vec![range.chunks],
xorb_block_index: new_index,
references: vec![],
uncompressed_size_if_known: None,
data: OnceCell::new(),
});
xorb_block_retrieval_urls.push((fetch_entry.url.clone(), vec![range.bytes]));
// Store the index.
entry.insert(new_index);
new_index
},
@@ -157,90 +221,120 @@ pub async fn retrieve_file_term_block(
break 'find_xorb_block index;
}
}
}
}
return Err(FileReconstructionError::CorruptedReconstruction(format!(
"No xorb chunk range found for file term {local_term_index:?} in xorb info for xorb hash {xorb_hash:?}"
"No xorb fetch entry found for file term {local_term_index:?} in xorb info for xorb hash {xorb_hash:?}"
)));
};
// Do we need to adjust for an offset into the first range?
let offset_into_first_range = {
if local_term_index == 0 {
// Only the first term can have a non-zero offset into its first chunk,
// which happens when the query byte range starts mid-chunk.
let offset_into_first_range = if local_term_index == 0 {
raw_reconstruction.offset_into_first_range
} else {
0
}
};
// The effective size of this term in the file.
// The term's contribution to the output file is its full uncompressed size
// minus any offset into the first chunk.
let term_byte_size = term.unpacked_length as u64 - offset_into_first_range;
// Update the references term on the XorbBlock to track where the xorb gets used.
// Record this term as a reference on its xorb block (used later to determine
// whether the block's total uncompressed size can be inferred).
xorb_blocks[xorb_block_index].references.push(XorbReference {
term_chunks: term.range,
uncompressed_size: term.unpacked_length as usize,
});
// Store the file term data (byte_range, xorb_chunk_range, offset_into_first_range, xorb_block_index).
// We'll create the FileTerm structs after we know the actual range.
file_term_data.push((
FileRange::new(cur_file_byte_offset, cur_file_byte_offset + term_byte_size),
term.range,
// Compute the flattened index into the block's chunk_offsets for this term's
// starting chunk. This accounts for disjoint chunk ranges in multi-range blocks.
//
// The containment checks above (in both the multi-range and single-range paths)
// ensure term.range.start falls within one of the block's chunk_ranges, so this
// loop is expected to find a match; the `found` check below is a defensive guard.
let xorb_block_start_index = {
let chunk_start = term.range.start;
let chunk_ranges = &xorb_blocks[xorb_block_index].chunk_ranges;
let mut idx = 0;
let mut found = false;
for range in chunk_ranges {
if chunk_start >= range.start && chunk_start < range.end {
idx += (chunk_start - range.start) as usize;
found = true;
break;
}
idx += (range.end - range.start) as usize;
}
if !found {
return Err(FileReconstructionError::CorruptedReconstruction(format!(
"chunk_start {chunk_start} not found in chunk_ranges {chunk_ranges:?} for file term {local_term_index}"
)));
}
idx
};
file_term_data.push(FileTermEntry {
byte_range: FileRange::new(cur_file_byte_offset, cur_file_byte_offset + term_byte_size),
xorb_chunk_range: term.range,
offset_into_first_range,
xorb_block_index,
));
xorb_block_start_index,
});
cur_file_byte_offset += term_byte_size;
}
// Sort the block references so that we can easily scan the terms to figure out how many references
// a particular chunk may have.
// Sort each block's references by chunk start so that determine_size_if_possible
// can use its forward-chaining DP to check coverage.
for block in &mut xorb_blocks {
block.references.sort_by_key(|r| r.term_chunks.start);
block.uncompressed_size_if_known = XorbBlock::determine_size_if_possible(block.chunk_range, &block.references);
block.uncompressed_size_if_known =
XorbBlock::determine_size_if_possible(&block.chunk_ranges, &block.references);
}
// Now, it's possible that we have to shrink the byte range of the last term, as we may have retrieved more
// due to chunk offsets.
// The last term in the reconstruction may extend beyond the requested range
// (e.g. when the query ends mid-chunk). Trim it to the query boundary.
if cur_file_byte_offset > query_file_byte_range.end {
let last_term_shrinkage = cur_file_byte_offset - query_file_byte_range.end;
debug_assert!(!file_term_data.is_empty());
if let Some(fi) = file_term_data.last_mut() {
fi.0.end -= last_term_shrinkage;
if let Some(entry) = file_term_data.last_mut() {
entry.byte_range.end -= last_term_shrinkage;
}
}
// Calculate the actual retrieved range from the file terms.
// The actual range covered, which may be smaller than requested if the file
// ends before the requested range.
let actual_range = FileRange::new(
file_term_data.first().map(|(br, _, _, _)| br.start).unwrap_or(0),
file_term_data.last().map(|(br, _, _, _)| br.end).unwrap_or(0),
file_term_data.first().map(|e| e.byte_range.start).unwrap_or(0),
file_term_data.last().map(|e| e.byte_range.end).unwrap_or(0),
);
// Now, calculate the total number of bytes that needs to be downloaded given dedup and compression savings.
let total_transfer_bytes = xorb_block_retrieval_urls
// Total compressed bytes that will be transferred across all xorb block downloads.
let total_transfer_bytes: u64 = xorb_block_retrieval_urls
.iter()
.map(|(_, http_range)| {
let file_range = FileRange::from(*http_range);
file_range.end.saturating_sub(file_range.start)
})
.flat_map(|(_, ranges)| ranges)
.map(|r| r.length())
.sum();
// Now create the URL info with the actual range and retrieval URLs.
// Wrap the retrieval URLs in a shared struct so all file terms can share them
// and coordinate URL refreshes through a single lock.
let url_info =
Arc::new(TermBlockRetrievalURLs::new(file_hash, actual_range, acquisition_id, xorb_block_retrieval_urls));
// Convert xorb_blocks to Arc<XorbBlock> for use in FileTerms.
// Second pass: convert the intermediate FileTermEntry data into final FileTerm
// structs, now that we can wrap xorb blocks in Arc and share the url_info.
let xorb_blocks_arc: Vec<Arc<XorbBlock>> = xorb_blocks.into_iter().map(Arc::new).collect();
// Convert the intermediate data to FileTerm structs with the shared url_info.
let file_terms: Vec<FileTerm> = file_term_data
.into_iter()
.map(|(byte_range, xorb_chunk_range, offset_into_first_range, xorb_block_index)| FileTerm {
byte_range,
xorb_chunk_range,
offset_into_first_range,
xorb_block: xorb_blocks_arc[xorb_block_index].clone(),
.map(|entry| FileTerm {
byte_range: entry.byte_range,
xorb_chunk_range: entry.xorb_chunk_range,
xorb_block_start_index: entry.xorb_block_start_index,
offset_into_first_range: entry.offset_into_first_range,
xorb_block: xorb_blocks_arc[entry.xorb_block_index].clone(),
url_info: url_info.clone(),
})
.collect();
@@ -252,7 +346,7 @@ pub async fn retrieve_file_term_block(
mod tests {
use std::sync::Arc;
use more_asserts::{assert_ge, assert_le};
use more_asserts::assert_le;
use xet_client::cas_client::{ClientTestingUtils, LocalClient, RandomFileContents};
use xet_client::cas_types::{ChunkRange, FileRange};
use xet_runtime::utils::UniqueId;
@@ -351,10 +445,18 @@ mod tests {
// Track xorb block index
seen_xorb_indices.insert(file_term.xorb_block.xorb_block_index);
// Verify chunk range is within xorb block boundaries.
// Verify chunk range is within xorb block boundaries: the term's chunk range
// must be contained within at least one of the block's chunk ranges.
let xorb_block = &file_term.xorb_block;
assert_ge!(file_term.xorb_chunk_range.start, xorb_block.chunk_range.start);
assert_le!(file_term.xorb_chunk_range.end, xorb_block.chunk_range.end);
let term_in_some_range = xorb_block
.chunk_ranges
.iter()
.any(|cr| file_term.xorb_chunk_range.start >= cr.start && file_term.xorb_chunk_range.end <= cr.end);
assert!(
term_in_some_range,
"term chunk range {:?} not within any block chunk range {:?}",
file_term.xorb_chunk_range, xorb_block.chunk_ranges
);
// Cross-reference with known file contents.
if expected_term_idx < file_contents.terms.len() {
@@ -365,7 +467,7 @@ mod tests {
// Verify chunk range matches (accounting for partial first term).
if file_term_data_offset == 0 {
assert_eq!(file_term.xorb_chunk_range.start as u32, expected_term.chunk_start);
assert_eq!(file_term.xorb_chunk_range.start, expected_term.chunk_start);
}
}
@@ -549,10 +651,11 @@ mod tests {
// Get the first file term's xorb block to test URL retrieval
let file_term = &file_terms[0];
let xorb_block_index = file_term.xorb_block.xorb_block_index;
let (unique_id, url, http_range) = file_term.url_info.get_retrieval_url(xorb_block_index).await;
let (unique_id, url, http_ranges) = file_term.url_info.get_retrieval_url(xorb_block_index).await;
assert!(!url.is_empty());
assert!(http_range.start < http_range.end);
assert!(!http_ranges.is_empty());
assert!(http_ranges[0].start <= http_ranges[0].end);
assert!(unique_id != UniqueId::null());
}
@@ -591,87 +694,136 @@ mod tests {
#[tokio::test]
async fn test_range_few_bytes_before_end() {
// Test requesting a range that ends just a few bytes before the file end,
// within the same chunk as the file end.
let (client, file_contents) = setup_test_file(&[(1, (0, 5))]).await;
let file_len = file_contents.data.len() as u64;
// Request range ending 3 bytes before the end
let range = FileRange::new(0, file_len - 3);
retrieve_and_verify(&client, &file_contents, Some(range)).await;
// Request range ending 1 byte before the end
let range = FileRange::new(0, file_len - 1);
retrieve_and_verify(&client, &file_contents, Some(range)).await;
}
#[tokio::test]
async fn test_range_few_bytes_after_start() {
// Test requesting a range that starts just a few bytes after the file start,
// within the same chunk as the file start.
let (client, file_contents) = setup_test_file(&[(1, (0, 5))]).await;
let file_len = file_contents.data.len() as u64;
// Request range starting 3 bytes after the start
let range = FileRange::new(3, file_len);
retrieve_and_verify(&client, &file_contents, Some(range)).await;
// Request range starting 1 byte after the start
let range = FileRange::new(1, file_len);
retrieve_and_verify(&client, &file_contents, Some(range)).await;
}
#[tokio::test]
async fn test_range_few_bytes_offset_both_ends() {
// Test requesting a range with small offsets at both ends within the same chunk.
let (client, file_contents) = setup_test_file(&[(1, (0, 5))]).await;
let file_len = file_contents.data.len() as u64;
// Request range with 2 bytes trimmed from start and 2 bytes from end
let range = FileRange::new(2, file_len - 2);
retrieve_and_verify(&client, &file_contents, Some(range)).await;
// Request a two-byte range around the middle of the file
let range = FileRange::new(file_len / 2 - 1, file_len / 2 + 1);
retrieve_and_verify(&client, &file_contents, Some(range)).await;
}
#[tokio::test]
async fn test_range_single_byte_at_various_positions() {
// Test requesting single bytes at various positions in the file.
let (client, file_contents) = setup_test_file(&[(1, (0, 5))]).await;
let file_len = file_contents.data.len() as u64;
// First byte
retrieve_and_verify(&client, &file_contents, Some(FileRange::new(0, 1))).await;
// Last byte
retrieve_and_verify(&client, &file_contents, Some(FileRange::new(file_len - 1, file_len))).await;
// Middle byte
let mid = file_len / 2;
retrieve_and_verify(&client, &file_contents, Some(FileRange::new(mid, mid + 1))).await;
}
#[tokio::test]
async fn test_multi_term_range_ends_mid_chunk() {
// Test with multiple terms where the requested range ends in the middle of the last term's chunk.
let (client, file_contents) = setup_test_file(&[(1, (0, 3)), (2, (0, 3)), (3, (0, 3))]).await;
let file_len = file_contents.data.len() as u64;
// End a few bytes before the file end
let range = FileRange::new(0, file_len - 5);
retrieve_and_verify(&client, &file_contents, Some(range)).await;
}
#[tokio::test]
async fn test_multi_term_range_starts_mid_chunk() {
// Test with multiple terms where the requested range starts in the middle of the first term's chunk.
let (client, file_contents) = setup_test_file(&[(1, (0, 3)), (2, (0, 3)), (3, (0, 3))]).await;
let file_len = file_contents.data.len() as u64;
// Start a few bytes after the file start
let range = FileRange::new(5, file_len);
retrieve_and_verify(&client, &file_contents, Some(range)).await;
}
// ==================== Multi-Disjoint Range Edge Cases ====================
/// Single xorb with three disjoint chunk ranges.
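/// With `enable_multirange_fetching` set, this is expected to create one XorbBlock with
/// chunk_ranges = [(0,2), (4,6), (8,10)]; otherwise each range becomes its own block.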
#[tokio::test]
async fn test_triple_disjoint_same_xorb() {
let (client, file_contents) = setup_test_file(&[(1, (0, 2)), (1, (4, 6)), (1, (8, 10))]).await;
retrieve_and_verify(&client, &file_contents, None).await;
}
/// Triple disjoint ranges with a partial byte range spanning the gap.
#[tokio::test]
async fn test_triple_disjoint_partial_range_across_gap() {
let (client, file_contents) = setup_test_file(&[(1, (0, 2)), (1, (4, 6)), (1, (8, 10))]).await;
let file_len = file_contents.data.len() as u64;
let range = FileRange::new(file_len / 4, file_len * 3 / 4);
retrieve_and_verify(&client, &file_contents, Some(range)).await;
}
/// Two xorbs, each with two disjoint ranges, interleaved in file order.
#[tokio::test]
async fn test_two_xorbs_interleaved_disjoint() {
let term_spec = &[(1, (0, 2)), (2, (0, 2)), (1, (4, 6)), (2, (4, 6))];
let (client, file_contents) = setup_test_file(term_spec).await;
retrieve_and_verify(&client, &file_contents, None).await;
}
/// Two xorbs interleaved with disjoint ranges, partial byte range.
#[tokio::test]
async fn test_two_xorbs_interleaved_disjoint_partial() {
let term_spec = &[(1, (0, 2)), (2, (0, 2)), (1, (4, 6)), (2, (4, 6))];
let (client, file_contents) = setup_test_file(term_spec).await;
let file_len = file_contents.data.len() as u64;
retrieve_and_verify(&client, &file_contents, Some(FileRange::new(file_len / 3, file_len * 2 / 3))).await;
}
/// Single xorb with four disjoint ranges, each a single chunk wide.
#[tokio::test]
async fn test_four_single_chunk_disjoint() {
let term_spec = &[(1, (0, 1)), (1, (3, 4)), (1, (6, 7)), (1, (9, 10))];
let (client, file_contents) = setup_test_file(term_spec).await;
retrieve_and_verify(&client, &file_contents, None).await;
}
/// Mix of contiguous and disjoint ranges from the same xorb.
/// Chunks 0-4 are contiguous, then a gap, then chunks 8-10.
#[tokio::test]
async fn test_contiguous_then_disjoint() {
let term_spec = &[(1, (0, 2)), (1, (2, 4)), (1, (8, 10))];
let (client, file_contents) = setup_test_file(term_spec).await;
retrieve_and_verify(&client, &file_contents, None).await;
}
/// Three xorbs with complex disjoint access patterns.
#[tokio::test]
async fn test_three_xorbs_complex_disjoint() {
let term_spec = &[
(1, (0, 2)),
(2, (0, 3)),
(3, (2, 5)),
(1, (5, 8)),
(2, (6, 8)),
(3, (0, 2)),
];
let (client, file_contents) = setup_test_file(term_spec).await;
retrieve_and_verify(&client, &file_contents, None).await;
}
}
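
The `xorb_block_start_index` computation in `retrieve_file_term_block` can be sketched in isolation. The snippet below is a minimal illustration, not the crate's code: `ChunkRange` here is a hypothetical stand-in with half-open `u32` bounds, and `flattened_start_index` is a name introduced only for this example. It walks the block's chunk ranges in order, counting the chunks in each range that precedes the one containing the term's first chunk.

```rust
/// Hypothetical stand-in for the crate's `ChunkRange` (assumed half-open: [start, end)).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ChunkRange {
    pub start: u32,
    pub end: u32,
}

/// Returns the position of `chunk_start` within the concatenation of `ranges`,
/// or `None` if no range contains it. Illustrative helper, not part of the PR.
pub fn flattened_start_index(ranges: &[ChunkRange], chunk_start: u32) -> Option<usize> {
    let mut idx = 0usize;
    for r in ranges {
        if chunk_start >= r.start && chunk_start < r.end {
            // The chunk lives in this range: add its offset within the range.
            return Some(idx + (chunk_start - r.start) as usize);
        }
        // Skip over every chunk of a range that precedes the target.
        idx += (r.end - r.start) as usize;
    }
    None
}
```

For ranges `[0,2)`, `[4,6)`, `[8,10)`, chunk 4 maps to flattened index 2 and chunk 9 to index 5, while chunk 2 (inside a gap) yields `None`. Returning `Option` rather than panicking mirrors the diff's choice to report a `CorruptedReconstruction` error when no range matches.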

@@ -21,9 +21,11 @@ pub struct TermBlockRetrievalURLs {
// which may be smaller than the originally requested range if the file ends early.
pub byte_range: FileRange,
// The xorb retreival URLs. These could be refreshed if need be.
// The xorb retrieval URLs. These could be refreshed if need be.
// Indexed by xorb_block_index stored in each XorbBlock.
pub(crate) xorb_block_retrieval_urls: RwLock<(UniqueId, Vec<(String, HttpRange)>)>,
// Each entry is (url, http_ranges) to support multi-range V2 blocks.
#[allow(clippy::type_complexity)]
pub(crate) xorb_block_retrieval_urls: RwLock<(UniqueId, Vec<(String, Vec<HttpRange>)>)>,
}
impl TermBlockRetrievalURLs {
@@ -32,7 +34,7 @@ impl TermBlockRetrievalURLs {
file_hash: MerkleHash,
byte_range: FileRange,
acquisition_id: UniqueId,
retrieval_urls: Vec<(String, HttpRange)>,
retrieval_urls: Vec<(String, Vec<HttpRange>)>,
) -> Self {
Self {
file_hash,
@@ -41,15 +43,13 @@ impl TermBlockRetrievalURLs {
}
}
/// Gets the retrieval URL for a given xorb block. All URL requests go through
/// this method in order to manage url refreshes; this function returns the
/// most recent retrieval URL in the case of a refresh.
pub async fn get_retrieval_url(&self, xorb_block_index: usize) -> (UniqueId, String, HttpRange) {
/// Gets the retrieval URL and all byte ranges for a given xorb block.
/// All URL requests go through this method in order to manage URL refreshes;
/// this function returns the most recent retrieval URL in the case of a refresh.
pub async fn get_retrieval_url(&self, xorb_block_index: usize) -> (UniqueId, String, Vec<HttpRange>) {
let xbru = self.xorb_block_retrieval_urls.read().await;
let (url, url_range) = xbru.1[xorb_block_index].clone();
(xbru.0, url, url_range)
let (url, url_ranges) = &xbru.1[xorb_block_index];
(xbru.0, url.clone(), url_ranges.clone())
}
/// Refresh the retrieval URLs for all xorb blocks in this block.
@@ -61,8 +61,7 @@ impl TermBlockRetrievalURLs {
/// the new request will get a new URL.
pub async fn refresh_retrieval_urls(&self, client: Arc<dyn Client>, acquisition_id: UniqueId) -> Result<()> {
if self.xorb_block_retrieval_urls.read().await.0 != acquisition_id {
// This means another process has got in here while we're waiting for the lock and
// refreshed them.
// Another task already refreshed while we were waiting for the read lock.
debug!(
file_hash = %self.file_hash,
byte_range = ?(self.byte_range.start, self.byte_range.end),
@@ -74,7 +73,7 @@ impl TermBlockRetrievalURLs {
let mut retrieval_urls = self.xorb_block_retrieval_urls.write().await;
if retrieval_urls.0 != acquisition_id {
// It's already been refreshed by another process.
// Already refreshed by another task while waiting for the write lock.
debug!(
file_hash = %self.file_hash,
byte_range = ?(self.byte_range.start, self.byte_range.end),
@@ -90,8 +89,7 @@ impl TermBlockRetrievalURLs {
"Refreshing expired retrieval URLs"
);
// Since this hopefully doesn't happen too often, go through and retrieve an
// entire new block, then make sure everything matches up and take in the new stuff.
// Re-fetch the entire block to get fresh URLs, then verify the structure matches.
let Some((returned_range, _transfer_bytes, file_terms)) =
retrieve_file_term_block(client, self.file_hash, self.byte_range).await?
else {
@@ -141,11 +139,13 @@ pub struct XorbURLProvider {
#[async_trait::async_trait]
impl URLProvider for XorbURLProvider {
async fn retrieve_url(&self) -> std::result::Result<(String, HttpRange), xet_client::cas_client::CasClientError> {
let (unique_id, url, http_range) = self.url_info.get_retrieval_url(self.xorb_block_index).await;
async fn retrieve_url(
&self,
) -> std::result::Result<(String, Vec<HttpRange>), xet_client::cas_client::CasClientError> {
let (unique_id, url, http_ranges) = self.url_info.get_retrieval_url(self.xorb_block_index).await;
*self.last_acquisition_id.lock().await = unique_id;
Ok((url, http_range))
Ok((url, http_ranges))
}
async fn refresh_url(&self) -> std::result::Result<(), xet_client::cas_client::CasClientError> {
@@ -155,3 +155,110 @@ impl URLProvider for XorbURLProvider {
.map_err(|e| xet_client::cas_client::CasClientError::Other(e.to_string()))
}
}
#[cfg(test)]
mod tests {
use std::sync::Arc;
use tokio::sync::Mutex;
use xet_client::cas_client::{ClientTestingUtils, LocalClient, URLProvider};
use xet_client::cas_types::{FileRange, HttpRange};
use xet_core_structures::merklehash::MerkleHash;
use xet_runtime::utils::UniqueId;
use super::{TermBlockRetrievalURLs, XorbURLProvider};
fn sample_urls(n: usize) -> Vec<(String, Vec<HttpRange>)> {
(0..n)
.map(|i| (format!("https://example.com/xorb_{i}"), vec![HttpRange::new(0, 100)]))
.collect()
}
#[tokio::test]
async fn test_new_and_get_retrieval_url() {
let id = UniqueId::new();
let urls = sample_urls(3);
let block = TermBlockRetrievalURLs::new(MerkleHash::default(), FileRange::new(0, 100), id, urls.clone());
for (i, expected) in urls.iter().enumerate() {
let (ret_id, url, ranges) = block.get_retrieval_url(i).await;
assert!(ret_id == id, "acquisition ID mismatch for block {i}");
assert_eq!(url, expected.0);
assert_eq!(ranges, expected.1);
}
}
#[tokio::test]
async fn test_refresh_skipped_when_already_refreshed() {
let (client, file_contents) = {
let c = LocalClient::temporary().await.unwrap();
let fc = c.upload_random_file(&[(1, (0, 3))], 64).await.unwrap();
(c, fc)
};
let file_range = FileRange::new(0, file_contents.data.len() as u64);
let dyn_client: Arc<dyn xet_client::cas_client::Client> = client.clone();
let (_, _, file_terms) =
super::retrieve_file_term_block(dyn_client.clone(), file_contents.file_hash, file_range)
.await
.unwrap()
.unwrap();
let url_info = file_terms[0].url_info.clone();
// Get original acquisition ID
let (original_id, _, _) = url_info.get_retrieval_url(0).await;
// Refresh with a stale (different) ID should be a no-op.
let stale_id = UniqueId::new();
url_info.refresh_retrieval_urls(dyn_client.clone(), stale_id).await.unwrap();
let (id_after, _, _) = url_info.get_retrieval_url(0).await;
assert!(id_after == original_id, "refresh with stale ID should not change acquisition ID");
// Refresh with the correct ID should update URLs.
url_info.refresh_retrieval_urls(dyn_client.clone(), original_id).await.unwrap();
let (refreshed_id, _, _) = url_info.get_retrieval_url(0).await;
assert!(refreshed_id != original_id, "refresh with correct ID should change acquisition ID");
}
#[tokio::test]
async fn test_xorb_url_provider_retrieve_and_refresh() {
let (client, file_contents) = {
let c = LocalClient::temporary().await.unwrap();
let fc = c.upload_random_file(&[(1, (0, 3))], 64).await.unwrap();
(c, fc)
};
let file_range = FileRange::new(0, file_contents.data.len() as u64);
let dyn_client: Arc<dyn xet_client::cas_client::Client> = client.clone();
let (_, _, file_terms) =
super::retrieve_file_term_block(dyn_client.clone(), file_contents.file_hash, file_range)
.await
.unwrap()
.unwrap();
let url_info = file_terms[0].url_info.clone();
let provider = XorbURLProvider {
client: dyn_client.clone(),
url_info,
xorb_block_index: 0,
last_acquisition_id: Mutex::new(UniqueId::null()),
};
// retrieve_url should succeed and return a valid URL.
let (url, ranges) = provider.retrieve_url().await.unwrap();
assert!(!url.is_empty());
assert!(!ranges.is_empty());
// refresh_url should succeed (refreshes with the current acquisition ID).
provider.refresh_url().await.unwrap();
// After refresh, retrieve_url should still work with updated URLs.
let (url2, ranges2) = provider.retrieve_url().await.unwrap();
assert!(!url2.is_empty());
assert!(!ranges2.is_empty());
}
}
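
The refresh logic in `refresh_retrieval_urls` follows a check, lock, re-check pattern keyed on the acquisition ID. The sketch below is a simplified, synchronous illustration under stated assumptions: it uses `std::sync::RwLock` instead of the async lock in the diff, a plain `u64` in place of `UniqueId`, and the hypothetical names `UrlTable` and `refresh_if_current`. It is not the crate's implementation.

```rust
use std::sync::RwLock;

/// Hypothetical stand-in for `TermBlockRetrievalURLs`: an ID-tagged list of URLs.
pub struct UrlTable {
    // (acquisition id, urls); a `u64` stands in for `UniqueId`.
    inner: RwLock<(u64, Vec<String>)>,
}

impl UrlTable {
    pub fn new(id: u64, urls: Vec<String>) -> Self {
        Self { inner: RwLock::new((id, urls)) }
    }

    /// Snapshot of the current acquisition id and URLs.
    pub fn current(&self) -> (u64, Vec<String>) {
        let guard = self.inner.read().unwrap();
        (guard.0, guard.1.clone())
    }

    /// Refreshes only if `observed_id` is still current; returns whether it refreshed.
    pub fn refresh_if_current(&self, observed_id: u64, fetch: impl FnOnce() -> Vec<String>) -> bool {
        // Cheap check under the read lock: someone else may have refreshed already.
        if self.inner.read().unwrap().0 != observed_id {
            return false;
        }
        // Re-check under the write lock, since another caller may have won the race.
        let mut guard = self.inner.write().unwrap();
        if guard.0 != observed_id {
            return false;
        }
        guard.1 = fetch();
        // The diff takes the new ID from the re-fetched block; a bump suffices here.
        guard.0 = observed_id.wrapping_add(1);
        true
    }
}
```

A caller that holds a stale ID gets `false` back and simply re-reads the table, which is the behavior `test_refresh_skipped_when_already_refreshed` exercises against the real type.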

@@ -13,27 +13,49 @@ use super::retrieval_urls::{TermBlockRetrievalURLs, XorbURLProvider};
use crate::progress_tracking::download_tracking::DownloadTaskUpdater;
/// Downloaded and decompressed data for a xorb block, including chunk boundary offsets.
///
/// A single `XorbBlockData` may hold data from multiple disjoint chunk ranges
/// (V2 multi-range fetch). The chunks are concatenated in range order, and
/// `chunk_offsets` maps each chunk index to its byte position within `data`.
pub struct XorbBlockData {
pub chunk_offsets: Vec<usize>,
pub uncompressed_size: u64,
/// Pairs of (chunk_index, byte_offset) mapping each chunk to its start position
/// within `data`. Because the block can span multiple disjoint chunk ranges,
/// storing the chunk index alongside the offset avoids ambiguity.
pub chunk_offsets: Vec<(usize, usize)>,
/// The concatenated decompressed chunk data for all ranges in this block.
pub data: Bytes,
}
/// A reference from a file term back to the xorb block it belongs to.
/// Used by `determine_size_if_possible` to check whether the block's total
/// uncompressed size can be inferred from the terms that reference it.
#[derive(Debug)]
pub struct XorbReference {
/// The chunk range within the xorb that this file term covers.
pub term_chunks: ChunkRange,
/// The uncompressed byte size of this term's data.
pub uncompressed_size: usize,
}
/// A downloadable xorb block identified by hash and chunk range, with cached data.
/// Multiple file terms may reference the same xorb block.
/// A downloadable xorb block identified by hash and chunk ranges, with cached data.
///
/// A block may contain multiple disjoint chunk ranges from the same xorb (V2 multi-range).
/// Multiple file terms may reference the same block. Downloaded data is cached in `data`
/// so that the first term to request it triggers the download, and subsequent terms
/// reuse the cached result.
pub struct XorbBlock {
pub xorb_hash: MerkleHash,
pub chunk_range: ChunkRange,
/// The chunk ranges fetched for this block. For V1 this is a single range;
/// for V2 multi-range fetches this may contain multiple disjoint ranges.
pub chunk_ranges: Vec<ChunkRange>,
/// Index into the parent `TermBlockRetrievalURLs` for URL lookup.
pub xorb_block_index: usize,
/// All file-term chunk ranges covered by this xorb block, sorted by range start.
/// All file-term references covered by this block, sorted by chunk range start.
/// Populated during `retrieve_file_term_block` and used to compute `uncompressed_size_if_known`.
pub references: Vec<XorbReference>,
/// Expected decompressed size of the block when known. Used for debug_assert in clients.
/// Expected total decompressed size across all chunk ranges, if it can be determined
/// from the references. Passed to clients as a debug assertion hint.
pub uncompressed_size_if_known: Option<usize>,
pub data: OnceCell<Arc<XorbBlockData>>,
}
@@ -41,7 +63,7 @@ pub struct XorbBlock {
impl PartialEq for XorbBlock {
fn eq(&self, other: &Self) -> bool {
self.xorb_hash == other.xorb_hash
&& self.chunk_range == other.chunk_range
&& self.chunk_ranges == other.chunk_ranges
&& self.xorb_block_index == other.xorb_block_index
}
}
@@ -63,6 +85,7 @@ impl XorbBlock {
) -> Result<Arc<XorbBlockData>> {
let xorb_block_index = self.xorb_block_index;
let uncompressed_size_if_known = self.uncompressed_size_if_known;
let chunk_ranges = self.chunk_ranges.clone();
self.data
.get_or_try_init(|| async {
@@ -89,14 +112,18 @@ impl XorbBlock {
.get_file_term_data(Box::new(url_provider), permit, progress_callback, uncompressed_size_if_known)
.await?;
let chunk_offsets: Vec<usize> = chunk_byte_offsets.iter().map(|&x| x as usize).collect();
let uncompressed_size = data.len() as u64;
// Build chunk_offsets by zipping each chunk index (from all chunk_ranges)
// with the corresponding byte offset from the returned data.
let mut chunk_offsets = Vec::new();
let mut offset_idx = 0;
for range in &chunk_ranges {
for chunk_idx in range.start..range.end {
chunk_offsets.push((chunk_idx as usize, chunk_byte_offsets[offset_idx] as usize));
offset_idx += 1;
}
}
Ok(Arc::new(XorbBlockData {
chunk_offsets,
uncompressed_size,
data,
}))
Ok(Arc::new(XorbBlockData { chunk_offsets, data }))
})
.await
.cloned()
@@ -105,33 +132,67 @@ impl XorbBlock {
/// Determines the total uncompressed size of the xorb block from the reference terms,
/// if possible.
///
/// The size can be determined when:
/// 1. A single term's chunk range exactly matches the full xorb range, or
/// 2. A chain of term chunk ranges exactly covers the full xorb range with no gaps (e.g. [0..3, 3..5] covers 0..5).
/// Uses a forward-chaining DP: starting from the first chunk range's start,
/// we track which chunk positions are "reachable" (i.e., fully covered by a
/// contiguous chain of terms) along with the accumulated uncompressed size.
///
/// The `terms` slice must be sorted by chunk range start index.
pub fn determine_size_if_possible(xorb_range: ChunkRange, terms: &[XorbReference]) -> Option<usize> {
/// For multi-range blocks with disjoint chunk ranges (e.g. `[0,3)` and `[5,8)`),
/// the gaps between ranges are inserted as zero-cost bridges. This lets the DP
/// traverse the full set of ranges in a single pass — a gap `[3,5)` contributes
/// no data but connects the end of one range to the start of the next.
///
/// Returns `Some(total_size)` if every range is fully covered, `None` otherwise.
///
/// The `terms` slice must be sorted by `term_chunks.start`.
pub fn determine_size_if_possible(xorb_ranges: &[ChunkRange], terms: &[XorbReference]) -> Option<usize> {
debug_assert!(
terms.windows(2).all(|w| w[0].term_chunks.start <= w[1].term_chunks.start),
"terms must be sorted by chunk range start"
);
// DP approach: track which chunk endpoints are reachable from xorb_range.start
// via contiguous chains, along with accumulated uncompressed sizes.
// This correctly handles multiple terms with the same start index by
// considering all possible chain continuations.
let mut reachable: BTreeMap<u32, usize> = BTreeMap::new();
reachable.insert(xorb_range.start, 0);
debug_assert!(
terms.iter().all(|term| xorb_ranges
.iter()
.any(|r| term.term_chunks.start >= r.start && term.term_chunks.end <= r.end)),
"all terms must fall within one of the xorb ranges"
);
if xorb_ranges.is_empty() {
return Some(0);
}
// Build a lookup from range-end -> next-range-start for gap bridging.
// E.g. for ranges [0,3) and [5,8), maps 3 -> 5, meaning once chunk 3
// is reachable we can bridge to chunk 5 at zero cost.
let gap_bridges: BTreeMap<u32, u32> = xorb_ranges
.windows(2)
.filter(|pair| pair[0].end < pair[1].start)
.map(|pair| (pair[0].end, pair[1].start))
.collect();
// DP map: chunk position -> accumulated uncompressed size to reach that position.
// Seed with the start of the first range.
let mut reachable: BTreeMap<u32, usize> = BTreeMap::new();
reachable.insert(xorb_ranges[0].start, 0);
// Process terms in sorted order, extending reachable positions.
for term in terms {
if let Some(&accumulated) = reachable.get(&term.term_chunks.start) {
reachable
.entry(term.term_chunks.end)
.or_insert(accumulated + term.uncompressed_size);
let new_end = term.term_chunks.end;
let new_size = accumulated + term.uncompressed_size;
reachable.entry(new_end).or_insert(new_size);
// If this term reaches the end of a range that has a gap bridge,
// make the start of the next range reachable at the same accumulated size.
if let Some(&bridge_target) = gap_bridges.get(&new_end) {
reachable.entry(bridge_target).or_insert(new_size);
}
}
}
reachable.get(&xorb_range.end).copied()
// The block is fully covered if we can reach the end of the last range.
reachable.get(&xorb_ranges.last().unwrap().end).copied()
}
}
@@ -153,197 +214,210 @@ mod tests {
#[test]
fn test_single_term_exact_match() {
let xorb_range = ChunkRange::new(0, 5);
let ranges = &[ChunkRange::new(0, 5)];
let terms = build_refs(&[(ChunkRange::new(0, 5), 1000)]);
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), Some(1000));
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), Some(1000));
}
#[test]
fn test_two_terms_chained() {
let xorb_range = ChunkRange::new(0, 5);
let ranges = &[ChunkRange::new(0, 5)];
let terms = build_refs(&[(ChunkRange::new(0, 3), 600), (ChunkRange::new(3, 5), 400)]);
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), Some(1000));
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), Some(1000));
}
#[test]
fn test_three_terms_chained() {
let xorb_range = ChunkRange::new(0, 6);
let ranges = &[ChunkRange::new(0, 6)];
let terms = build_refs(&[
(ChunkRange::new(0, 2), 200),
(ChunkRange::new(2, 4), 300),
(ChunkRange::new(4, 6), 500),
]);
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), Some(1000));
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), Some(1000));
}
#[test]
fn test_gap_in_chain() {
let xorb_range = ChunkRange::new(0, 6);
let ranges = &[ChunkRange::new(0, 6)];
let terms = build_refs(&[(ChunkRange::new(0, 2), 200), (ChunkRange::new(4, 6), 500)]);
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), None);
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), None);
}
#[test]
fn test_does_not_start_at_xorb_start() {
let xorb_range = ChunkRange::new(0, 5);
let ranges = &[ChunkRange::new(0, 5)];
let terms = build_refs(&[(ChunkRange::new(1, 5), 800)]);
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), None);
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), None);
}
#[test]
fn test_does_not_end_at_xorb_end() {
let xorb_range = ChunkRange::new(0, 5);
let ranges = &[ChunkRange::new(0, 5)];
let terms = build_refs(&[(ChunkRange::new(0, 3), 600)]);
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), None);
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), None);
}
#[test]
fn test_empty_terms() {
let xorb_range = ChunkRange::new(0, 5);
let ranges = &[ChunkRange::new(0, 5)];
let terms: Vec<XorbReference> = vec![];
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), None);
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), None);
}
#[test]
fn test_overlapping_terms_with_exact_cover() {
// Terms [0..3, 1..4, 3..5] - the chain 0..3, 3..5 covers 0..5.
// The overlapping term 1..4 should be skipped.
let xorb_range = ChunkRange::new(0, 5);
let ranges = &[ChunkRange::new(0, 5)];
let terms = build_refs(&[
(ChunkRange::new(0, 3), 600),
(ChunkRange::new(1, 4), 700),
(ChunkRange::new(3, 5), 400),
]);
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), Some(1000));
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), Some(1000));
}
#[test]
fn test_duplicate_terms_first_covers() {
// Two identical terms covering the full range.
let xorb_range = ChunkRange::new(0, 5);
let ranges = &[ChunkRange::new(0, 5)];
let terms = build_refs(&[(ChunkRange::new(0, 5), 1000), (ChunkRange::new(0, 5), 1000)]);
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), Some(1000));
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), Some(1000));
}
#[test]
fn test_nonzero_xorb_start() {
let xorb_range = ChunkRange::new(3, 8);
let ranges = &[ChunkRange::new(3, 8)];
let terms = build_refs(&[(ChunkRange::new(3, 5), 400), (ChunkRange::new(5, 8), 600)]);
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), Some(1000));
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), Some(1000));
}
#[test]
fn test_nonzero_xorb_start_no_match() {
let xorb_range = ChunkRange::new(3, 8);
let ranges = &[ChunkRange::new(3, 8)];
let terms = build_refs(&[(ChunkRange::new(3, 5), 400)]);
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), None);
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), None);
}
#[test]
fn test_single_chunk_range() {
let xorb_range = ChunkRange::new(0, 1);
let ranges = &[ChunkRange::new(0, 1)];
let terms = build_refs(&[(ChunkRange::new(0, 1), 42)]);
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), Some(42));
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), Some(42));
}
#[test]
fn test_chain_with_extra_terms_before_and_after() {
// Extra terms that don't participate in the chain but are within the sorted list.
let xorb_range = ChunkRange::new(2, 8);
fn test_chain_with_overlapping_inner_terms() {
let ranges = &[ChunkRange::new(2, 8)];
// The overlapping term [3,6) is within the range but doesn't form
// a better chain than [2,5) + [5,8), so it's harmlessly ignored.
let terms = build_refs(&[
(ChunkRange::new(0, 2), 100), // before xorb range
(ChunkRange::new(2, 5), 500), // chain start
(ChunkRange::new(3, 6), 999), // overlapping, skipped
(ChunkRange::new(5, 8), 300), // chain end
(ChunkRange::new(8, 10), 200), // after xorb range
(ChunkRange::new(2, 5), 500),
(ChunkRange::new(3, 6), 999),
(ChunkRange::new(5, 8), 300),
]);
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), Some(800));
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), Some(800));
}
#[test]
fn test_partial_overlap_no_cover() {
// Terms partially overlap but don't form a contiguous chain covering the full range.
let xorb_range = ChunkRange::new(0, 10);
let ranges = &[ChunkRange::new(0, 10)];
let terms = build_refs(&[
(ChunkRange::new(0, 4), 400),
(ChunkRange::new(3, 7), 400),
(ChunkRange::new(6, 10), 400),
]);
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), None);
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), None);
}
#[test]
fn test_same_start_short_then_long_covering_full() {
// Short range first, then a long range that covers the full xorb.
let xorb_range = ChunkRange::new(0, 5);
let ranges = &[ChunkRange::new(0, 5)];
let terms = build_refs(&[(ChunkRange::new(0, 3), 300), (ChunkRange::new(0, 5), 500)]);
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), Some(500));
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), Some(500));
}
#[test]
fn test_same_start_short_then_long_with_chain() {
// Short range first, then a longer range, where the short range can also chain.
let xorb_range = ChunkRange::new(0, 6);
// Chain via 0..3 + 3..6 = 600
let ranges = &[ChunkRange::new(0, 6)];
let terms = build_refs(&[
(ChunkRange::new(0, 2), 200),
(ChunkRange::new(0, 3), 300),
(ChunkRange::new(3, 6), 300),
]);
// Chain via 0..3 + 3..6 = 600
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), Some(600));
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), Some(600));
}
#[test]
fn test_same_start_multiple_duplicates_chain_through_second() {
// Multiple terms at start 0 with different lengths; only the middle one chains.
let xorb_range = ChunkRange::new(0, 6);
// Chain via 0..4 + 4..6 = 600
let ranges = &[ChunkRange::new(0, 6)];
let terms = build_refs(&[
(ChunkRange::new(0, 2), 200),
(ChunkRange::new(0, 4), 400),
(ChunkRange::new(0, 5), 500),
(ChunkRange::new(4, 6), 200),
]);
// Chain via 0..4 + 4..6 = 600
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), Some(600));
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), Some(600));
}
#[test]
fn test_same_start_at_midpoint() {
// Duplicate starts at a midpoint in the chain, not just at the beginning.
let xorb_range = ChunkRange::new(0, 8);
// Chain via 0..3 + 3..6 + 6..8 = 800
let ranges = &[ChunkRange::new(0, 8)];
let terms = build_refs(&[
(ChunkRange::new(0, 3), 300),
(ChunkRange::new(3, 5), 200),
(ChunkRange::new(3, 6), 300),
(ChunkRange::new(6, 8), 200),
]);
// Chain via 0..3 + 3..6 + 6..8 = 800
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), Some(800));
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), Some(800));
}
#[test]
fn test_same_start_none_covers() {
// Multiple terms at start 0, but none chain to cover the full range.
let xorb_range = ChunkRange::new(0, 10);
let ranges = &[ChunkRange::new(0, 10)];
let terms = build_refs(&[
(ChunkRange::new(0, 2), 200),
(ChunkRange::new(0, 4), 400),
(ChunkRange::new(0, 6), 600),
]);
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), None);
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), None);
}
#[test]
fn test_same_start_two_groups_chained() {
// Two groups of duplicate-start terms that chain together.
let xorb_range = ChunkRange::new(0, 6);
// Chain via 0..3 + 3..6 = 600
let ranges = &[ChunkRange::new(0, 6)];
let terms = build_refs(&[
(ChunkRange::new(0, 2), 200),
(ChunkRange::new(0, 3), 300),
(ChunkRange::new(3, 5), 200),
(ChunkRange::new(3, 6), 300),
]);
// Chain via 0..3 + 3..6 = 600
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), Some(600));
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), Some(600));
}
#[test]
fn test_multiple_disjoint_ranges_both_covered() {
let ranges = &[ChunkRange::new(0, 3), ChunkRange::new(5, 8)];
let terms = build_refs(&[(ChunkRange::new(0, 3), 300), (ChunkRange::new(5, 8), 400)]);
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), Some(700));
}
#[test]
fn test_multiple_disjoint_ranges_one_uncovered() {
let ranges = &[ChunkRange::new(0, 3), ChunkRange::new(5, 8)];
let terms = build_refs(&[(ChunkRange::new(0, 3), 300)]);
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), None);
}
}
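
For readers following the gap-bridging logic in `determine_size_if_possible`, the forward-chaining idea can be sketched in isolation. This is a minimal standalone illustration, not the crate's code: `Range`, `Term`, and `covered_size` are hypothetical stand-ins for `ChunkRange`, `XorbReference`, and the real method, and it assumes the ranges are sorted and disjoint.

```rust
use std::collections::BTreeMap;

/// Half-open chunk range `[start, end)`; hypothetical stand-in for `ChunkRange`.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Range {
    start: u32,
    end: u32,
}

/// A term's chunk range plus its uncompressed size; hypothetical stand-in for `XorbReference`.
#[derive(Clone, Copy, Debug)]
struct Term {
    chunks: Range,
    size: usize,
}

/// Returns the total size if chains of terms cover every range, `None` otherwise.
/// Assumes `ranges` is sorted and disjoint and `terms` is sorted by start.
fn covered_size(ranges: &[Range], terms: &[Term]) -> Option<usize> {
    let (first, last) = match (ranges.first(), ranges.last()) {
        (Some(f), Some(l)) => (f, l),
        _ => return Some(0),
    };

    // End of one range -> start of the next, for ranges separated by a gap.
    let bridges: BTreeMap<u32, u32> = ranges
        .windows(2)
        .filter(|pair| pair[0].end < pair[1].start)
        .map(|pair| (pair[0].end, pair[1].start))
        .collect();

    // Chunk position -> accumulated size needed to reach it.
    let mut reachable: BTreeMap<u32, usize> = BTreeMap::new();
    reachable.insert(first.start, 0);

    for term in terms {
        if let Some(&acc) = reachable.get(&term.chunks.start) {
            let total = acc + term.size;
            reachable.entry(term.chunks.end).or_insert(total);
            // Crossing a gap adds no bytes: the next range starts at the same total.
            if let Some(&next_start) = bridges.get(&term.chunks.end) {
                reachable.entry(next_start).or_insert(total);
            }
        }
    }

    reachable.get(&last.end).copied()
}
```

With ranges `[0,3)` and `[5,8)`, a term ending at chunk 3 makes chunk 5 reachable at the same accumulated size, so a second term over `[5,8)` can complete the chain. If no term ends exactly at 3, the bridge is never taken and the result is `None`.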

@@ -7,8 +7,8 @@ use anyhow::Result;
use clap::{Args, Parser, Subcommand};
use http::header::{self, HeaderMap, HeaderValue};
use walkdir::WalkDir;
use xet_client::cas_client::RemoteClient;
use xet_client::cas_client::auth::TokenRefresher;
use xet_client::cas_client::{Client, RemoteClient};
use xet_client::cas_types::{FileRange, QueryReconstructionResponse};
use xet_client::hub_client::{BearerCredentialHelper, HubClient, Operation, RepoInfo};
use xet_core_structures::merklehash::MerkleHash;
@@ -230,8 +230,9 @@ async fn query_reconstruction(
cas_storage_config.custom_headers.clone(),
);
// Use V1 directly so the query tool returns the raw QueryReconstructionResponse for inspection.
remote_client
.get_reconstruction(&file_hash, bytes_range)
.get_reconstruction_v1(&file_hash, bytes_range)
.await
.map_err(anyhow::Error::from)
}

@@ -15,6 +15,48 @@ use super::file_cleaner::Sha256Policy;
use super::{FileDownloadSession, FileUploadSession, XetFileInfo};
use crate::progress_tracking::TrackingProgressUpdater;
/// Describes how hydration (download/smudge) should be performed during a test.
///
/// Each variant exercises a different reconstruction path:
/// - `DirectClient`: Uses `LocalClient` directly (no HTTP server).
/// - `ServerV2`: Uses `LocalTestServer` with default V2 reconstruction.
/// - `ServerV1Fallback`: Uses `LocalTestServer` with V2 disabled, forcing V1 fallback.
/// - `ServerMaxRanges2`: Uses `LocalTestServer` with `max_ranges_per_fetch=2`, forcing multi-range fetch splitting in
/// V2 responses.
#[derive(Debug, Clone, Copy)]
pub enum HydrationMode {
DirectClient,
ServerV2,
ServerV1Fallback,
ServerMaxRanges2,
}
impl HydrationMode {
pub fn all() -> &'static [HydrationMode] {
&[
HydrationMode::DirectClient,
HydrationMode::ServerV2,
HydrationMode::ServerV1Fallback,
HydrationMode::ServerMaxRanges2,
]
}
pub fn uses_server(&self) -> bool {
!matches!(self, HydrationMode::DirectClient)
}
}
impl std::fmt::Display for HydrationMode {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self {
HydrationMode::DirectClient => write!(f, "direct_client"),
HydrationMode::ServerV2 => write!(f, "server_v2"),
HydrationMode::ServerV1Fallback => write!(f, "server_v1_fallback"),
HydrationMode::ServerMaxRanges2 => write!(f, "server_max_ranges_2"),
}
}
}
/// Creates or overwrites the file at `path` with `size` bytes of random data.
/// Panics on any I/O error. Returns the number of bytes written (equal to `size`).
pub fn create_random_file(path: impl AsRef<Path>, size: usize, seed: u64) -> usize {
@@ -174,6 +216,44 @@ impl HydrateDehydrateTest {
}
}
/// Creates a new test harness configured for a specific hydration mode.
pub fn for_mode(mode: HydrationMode) -> Self {
Self::new(mode.uses_server())
}
/// Applies hydration mode configuration to the test server.
/// Must be called after `dehydrate()` and before `hydrate()`.
pub async fn apply_hydration_mode(&mut self, mode: HydrationMode) {
match mode {
HydrationMode::DirectClient => {},
HydrationMode::ServerV2 => {
self.ensure_server_created().await;
},
HydrationMode::ServerV1Fallback => {
self.ensure_server_created().await;
self.test_server.as_ref().unwrap().client().disable_v2_reconstruction(404);
},
HydrationMode::ServerMaxRanges2 => {
self.ensure_server_created().await;
self.test_server.as_ref().unwrap().client().set_max_ranges_per_fetch(2);
},
}
}
/// Ensures the test server is running, creating it if necessary.
/// Call this before configuring the server (e.g., disabling V2 or setting max ranges).
pub async fn ensure_server_created(&mut self) {
if self.use_test_server && self.test_server.is_none() {
let local_client = LocalClient::new(self.cas_dir.join("xet/xorbs")).await.unwrap();
self.test_server = Some(LocalTestServerBuilder::new().with_client(local_client).start().await);
}
}
/// Returns a reference to the test server, if one has been created.
pub fn test_server(&self) -> Option<&LocalTestServer> {
self.test_server.as_ref()
}
/// Lazily initializes the test server (if needed) and returns a CAS client.
async fn get_or_create_client(&mut self) -> Arc<dyn Client> {
if self.use_test_server {

@@ -10,18 +10,25 @@ test_set_constants! {
MAX_XORB_CHUNKS = 8;
}
/// Runs clean/smudge test with all combinations of (use_test_server, sequential).
/// Runs clean/smudge test with all combinations of (hydration_mode, sequential).
/// Each combination runs sequentially with its own HydrateDehydrateTest instance to avoid
/// too many open files.
///
/// This exercises every hydration path for every test case:
/// - DirectClient: LocalClient without a server
/// - ServerV2: LocalTestServer with default V2 reconstruction
/// - ServerV1Fallback: LocalTestServer with V2 disabled (tests V1-to-V2 conversion)
/// - ServerMaxRanges2: LocalTestServer with max_ranges_per_fetch=2 (tests fetch splitting)
pub async fn check_clean_smudge_files(file_list: &[(impl AsRef<str> + Clone, usize)]) {
for use_server in [false, true] {
for &mode in HydrationMode::all() {
for sequential in [true, false] {
eprintln!("Testing use_test_server={use_server}, sequential={sequential}");
eprintln!("Testing mode={mode}, sequential={sequential}");
let mut ts = HydrateDehydrateTest::new(use_server);
let mut ts = HydrateDehydrateTest::for_mode(mode);
create_random_files(&ts.src_dir, file_list, 0);
ts.dehydrate(sequential).await;
ts.apply_hydration_mode(mode).await;
ts.hydrate().await;
ts.verify_src_dest_match();
ts.hydrate_partitioned_writers(4).await;
@@ -35,18 +42,21 @@ pub async fn check_clean_smudge_files(file_list: &[(impl AsRef<str> + Clone, usi
/// Helper for multipart tests:
/// - takes a slice of `(String, Vec<(usize, u64)>)` which fully specifies each file.
/// - for each file, calls `create_random_multipart_file` with the given segments.
///
/// Exercises all hydration modes just like `check_clean_smudge_files`.
async fn check_clean_smudge_files_multipart(file_specs: &[(String, Vec<(usize, u64)>)]) {
for use_server in [false, true] {
for &mode in HydrationMode::all() {
for sequential in [true, false] {
eprintln!("Testing use_test_server={use_server}, sequential={sequential}");
eprintln!("Testing mode={mode}, sequential={sequential}");
let mut ts = HydrateDehydrateTest::new(use_server);
let mut ts = HydrateDehydrateTest::for_mode(mode);
for (file_name, segments) in file_specs {
create_random_multipart_file(ts.src_dir.join(file_name), segments);
}
ts.dehydrate(sequential).await;
ts.apply_hydration_mode(mode).await;
ts.hydrate().await;
ts.verify_src_dest_match();
ts.hydrate_partitioned_writers(4).await;

@@ -0,0 +1,71 @@
//! Clean/smudge integration tests with `enable_multirange_fetching = true`.
//!
//! This test binary is a separate copy of a subset of the clean/smudge tests
//! that runs with `enable_multirange_fetching` enabled, exercising the
//! multirange HTTP request path rather than the default single-range splitting.
use xet_data::deduplication::constants::{MAX_XORB_BYTES, MAX_XORB_CHUNKS, TARGET_CHUNK_SIZE};
use xet_data::processing::test_utils::*;
use xet_runtime::{test_set_config, test_set_constants};
test_set_constants! {
TARGET_CHUNK_SIZE = 1024;
MAX_XORB_BYTES = 5 * (*TARGET_CHUNK_SIZE);
MAX_XORB_CHUNKS = 8;
}
test_set_config! {
client {
enable_multirange_fetching = true;
}
}
#[cfg(test)]
mod testing_clean_smudge_multirange {
use super::*;
pub async fn check_clean_smudge_files(file_list: &[(impl AsRef<str> + Clone, usize)]) {
for &mode in HydrationMode::all() {
for sequential in [true, false] {
eprintln!("Testing mode={mode}, sequential={sequential} (forced multirange)");
let mut ts = HydrateDehydrateTest::for_mode(mode);
create_random_files(&ts.src_dir, file_list, 0);
ts.dehydrate(sequential).await;
ts.apply_hydration_mode(mode).await;
ts.hydrate().await;
ts.verify_src_dest_match();
}
}
}
#[tokio::test(flavor = "multi_thread", worker_threads = 2)]
async fn test_simple_directory() {
check_clean_smudge_files(&[("a", 16)]).await;
}
#[tokio::test(flavor = "multi_thread", worker_threads = 2)]
async fn test_multiple() {
check_clean_smudge_files(&[("a", 16), ("b", 8)]).await;
}
#[tokio::test(flavor = "multi_thread", worker_threads = 2)]
async fn test_single_large() {
check_clean_smudge_files(&[("a", *MAX_XORB_BYTES + 1)]).await;
}
#[tokio::test(flavor = "multi_thread", worker_threads = 2)]
async fn test_multiple_large() {
check_clean_smudge_files(&[("a", *MAX_XORB_BYTES + 1), ("b", *MAX_XORB_BYTES + 2)]).await;
}
#[tokio::test(flavor = "multi_thread", worker_threads = 2)]
async fn test_many_small_multiple_xorbs() {
let n = 16;
let size = *MAX_XORB_BYTES / 8 + 1;
let files: Vec<_> = (0..n).map(|idx| (format!("f_{idx}"), size)).collect();
check_clean_smudge_files(&files).await;
}
}
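
The difference between the two fetch paths these tests exercise can be illustrated at the level of HTTP `Range` header values. The sketch below is an assumption-laden illustration, not the client's request code: `ByteRange`, `multirange_header`, `single_range_headers`, and `range_headers` are hypothetical names, and byte ranges are assumed to be inclusive as in the standard HTTP `Range` syntax.

```rust
/// Inclusive byte range, as written in an HTTP `Range` header (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq)]
struct ByteRange {
    start: u64,
    end: u64,
}

/// Multirange form: every range in one `Range` header value.
fn multirange_header(ranges: &[ByteRange]) -> String {
    let parts: Vec<String> = ranges
        .iter()
        .map(|r| format!("{}-{}", r.start, r.end))
        .collect();
    format!("bytes={}", parts.join(","))
}

/// Split form: one single-range header value per range, suitable for parallel requests.
fn single_range_headers(ranges: &[ByteRange]) -> Vec<String> {
    ranges
        .iter()
        .map(|r| format!("bytes={}-{}", r.start, r.end))
        .collect()
}

/// Chooses between the two forms; the flag mirrors `enable_multirange_fetching`.
fn range_headers(ranges: &[ByteRange], enable_multirange_fetching: bool) -> Vec<String> {
    if enable_multirange_fetching {
        vec![multirange_header(ranges)]
    } else {
        single_range_headers(ranges)
    }
}
```

With the flag off, each returned header would correspond to its own request against the same presigned URL. With it on, a single request carries all ranges and the server is expected to answer with a multipart response.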

@@ -217,7 +217,7 @@ crate::config_group!({
/// The default value is 2.
///
/// Use the environment variable `HF_XET_CLIENT_AC_INITIAL_UPLOAD_CONCURRENCY` to set this value.
ref ac_initial_upload_concurrency: usize = 1;
ref ac_initial_upload_concurrency: usize = 2;
/// The maximum number of simultaneous download streams permitted by
/// the adaptive concurrency control.
@@ -238,10 +238,10 @@ crate::config_group!({
/// The starting number of concurrent download streams, which will increase up to max_concurrent_downloads
/// on successful completions.
///
/// The default value is 1.
/// The default value is 4.
///
/// Use the environment variable `HF_XET_CLIENT_AC_INITIAL_DOWNLOAD_CONCURRENCY` to set this value.
ref ac_initial_download_concurrency: usize = 1;
ref ac_initial_download_concurrency: usize = 4;
/// Path to Unix domain socket for CAS HTTP connections.
/// When set, all CAS HTTP traffic uses this socket instead of TCP.
@@ -252,4 +252,24 @@ crate::config_group!({
/// Use the environment variable `HF_XET_CLIENT_UNIX_SOCKET_PATH` to set this value.
ref unix_socket_path: Option<String> = None;
/// The reconstruction API version to request from the CAS server.
/// When set to 1 or 2, forces that version with no fallback.
/// When unset, auto-detects by trying V2 first, falling back to V1 on 404 or 501.
///
/// The default value is None (auto-detect).
///
/// Use the environment variable `HF_XET_CLIENT_RECONSTRUCTION_API_VERSION` to set this value.
ref reconstruction_api_version: Option<u32> = None;
/// Whether to use multi-range HTTP requests when fetching xorb data.
/// When false (default), V2 multi-range fetch entries are split into
/// individual single-range requests executed in parallel, which avoids
/// slow server-side multirange processing.
/// When true, multi-range requests are sent as-is.
///
/// The default value is false.
///
/// Use the environment variable `HF_XET_CLIENT_ENABLE_MULTIRANGE_FETCHING` to set this value.
ref enable_multirange_fetching: bool = false;
});
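
The interaction between `reconstruction_api_version` and the auto-detect fallback can be summarized as a small decision function. This is a sketch of the documented behavior only: `ApiVersion`, `first_attempt`, and `should_fall_back` are hypothetical names, and the config value is modeled as an already-parsed `Option<ApiVersion>` rather than the raw `Option<u32>`.

```rust
/// Reconstruction API version (hypothetical enum for this sketch).
#[derive(Clone, Copy, Debug, PartialEq)]
enum ApiVersion {
    V1,
    V2,
}

/// Picks the endpoint to try first.
/// `forced` models a set `reconstruction_api_version`; `v1_fallback_cached`
/// models the remembered result of an earlier V2 -> V1 fallback.
fn first_attempt(forced: Option<ApiVersion>, v1_fallback_cached: bool) -> ApiVersion {
    match forced {
        Some(version) => version,
        None if v1_fallback_cached => ApiVersion::V1,
        None => ApiVersion::V2,
    }
}

/// Whether a failed attempt should be retried against V1.
/// Only auto-detect mode falls back, and only on 404 or 501 from a V2 attempt.
fn should_fall_back(forced: Option<ApiVersion>, attempted: ApiVersion, status: u16) -> bool {
    forced.is_none() && attempted == ApiVersion::V2 && matches!(status, 404 | 501)
}
```

Keeping the forced case free of fallback means a misconfigured server surfaces as an error instead of being silently masked, which matches the "no fallback" wording in the option's documentation.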