mirror of
https://github.com/huggingface/xet-core.git
synced 2026-06-04 13:30:29 +08:00
V2 reconstruction with client-side optional single range splitting (#703)
This PR introduces V2 multi-range URL fetching for xorbs, with the option to split each multi-range request into multiple single-range requests that are executed in parallel. The reconstruction process can therefore generate full multi-range presigned URLs while the client performs the retrieval stage as a set of parallel single-range queries. The config variable `client.enable_multirange_fetching` controls this behavior; it defaults to false because fetching multi-range URLs is currently observed to be slow.

Co-authored-by: Adrien <adrien@huggingface.co>
94
api_changes/update_260316_v2_reconstruction_multirange.md
Normal file
@@ -0,0 +1,94 @@
# API Update: V2 Reconstruction with Multi-Range Fetch Support (2026-03-16)

## Overview

The CAS reconstruction API now supports a V2 endpoint that returns optimized
multi-range fetch descriptors. The client auto-detects V2 and falls back to V1
transparently. Two new config options control reconstruction behavior.

---

## 1. New CAS Endpoint

`GET /v2/reconstructions/{file_id}` returns `QueryReconstructionResponseV2`:

```json
{
  "terms": [...],
  "offset_into_first_range": 0,
  "xorbs": {
    "<hex_hash>": [
      {
        "url": "https://...",
        "ranges": [
          { "chunks": { "start": 0, "end": 3 }, "bytes": { "start": 0, "end": 1023 } },
          { "chunks": { "start": 5, "end": 8 }, "bytes": { "start": 2048, "end": 3071 } }
        ]
      }
    ]
  }
}
```

Each `XorbMultiRangeFetch` entry groups multiple disjoint chunk ranges under a
single presigned URL, enabling multi-range HTTP requests.
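
To make the grouping concrete, the sketch below models the JSON entry above with simplified stand-in structs (not the crate's actual definitions in `xet_client::cas_types`) and sums the bytes that one entry covers. It assumes byte ranges use an inclusive `end`, as the example values `0-1023` and `2048-3071` suggest; `Span`, `total_bytes`, and `example_fetch` are illustrative names.

```rust
/// A start/end pair; a simplified stand-in for the crate's range types.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span {
    start: u64,
    end: u64,
}

/// One chunk range and the byte range it occupies inside the xorb.
#[derive(Debug, Clone, Copy, PartialEq)]
struct XorbRangeDescriptor {
    chunks: Span,
    bytes: Span,
}

/// One presigned URL covering several disjoint ranges of the same xorb.
#[derive(Debug, Clone, PartialEq)]
struct XorbMultiRangeFetch {
    url: String,
    ranges: Vec<XorbRangeDescriptor>,
}

/// Total number of bytes requested through this URL.
/// Assumption: `bytes.end` is inclusive, hence the `+ 1`.
fn total_bytes(fetch: &XorbMultiRangeFetch) -> u64 {
    fetch.ranges.iter().map(|r| r.bytes.end - r.bytes.start + 1).sum()
}

/// Builds the entry shown in the JSON example.
fn example_fetch() -> XorbMultiRangeFetch {
    XorbMultiRangeFetch {
        url: "https://...".to_string(),
        ranges: vec![
            XorbRangeDescriptor {
                chunks: Span { start: 0, end: 3 },
                bytes: Span { start: 0, end: 1023 },
            },
            XorbRangeDescriptor {
                chunks: Span { start: 5, end: 8 },
                bytes: Span { start: 2048, end: 3071 },
            },
        ],
    }
}
```

With the example values, `total_bytes(&example_fetch())` evaluates to 2048: two ranges of 1024 bytes each behind a single URL.
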

The client tries V2 first. On 404 or 501 it falls back to V1 and caches the
result so subsequent calls skip the V2 attempt. Setting
`HF_XET_CLIENT_RECONSTRUCTION_API_VERSION=1` or `=2` forces a specific version
with no fallback.
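
The fallback-and-cache behavior can be sketched without any networking. In the minimal model below, the two closures stand in for the V2 and V1 HTTP calls and return the HTTP status code on failure; `VersionCache`, `fetch`, and `cached` are hypothetical names, and `forced` is assumed to be `None`, `Some(1)`, or `Some(2)`.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Remembers which reconstruction API version worked (0 = not yet probed).
struct VersionCache {
    detected: AtomicU32,
}

impl VersionCache {
    fn new() -> Self {
        Self { detected: AtomicU32::new(0) }
    }

    fn cached(&self) -> u32 {
        self.detected.load(Ordering::Relaxed)
    }

    /// Runs one reconstruction query and returns the API version that served it.
    /// `call_v2` / `call_v1` are stand-ins for the real requests.
    fn fetch(
        &self,
        forced: Option<u32>,
        call_v2: impl Fn() -> Result<(), u16>,
        call_v1: impl Fn() -> Result<(), u16>,
    ) -> Result<u32, u16> {
        // A forced version wins; otherwise use the cached one, defaulting to V2.
        let version = forced.unwrap_or(match self.cached() {
            0 => 2,
            v => v,
        });
        if version == 1 {
            call_v1()?;
            return Ok(1);
        }
        match call_v2() {
            Ok(()) => {
                if forced.is_none() {
                    self.detected.store(2, Ordering::Relaxed);
                }
                Ok(2)
            }
            // 404 or 501 triggers the fallback, but only in auto-detect mode.
            Err(404 | 501) if forced.is_none() => {
                call_v1()?;
                // Cache V1 only after the V1 call succeeded.
                self.detected.store(1, Ordering::Relaxed);
                Ok(1)
            }
            Err(status) => Err(status),
        }
    }
}
```

Caching V1 only after the V1 request succeeds keeps a transient failure, or a 404 for a file that does not exist, from pinning the client to V1.
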

The `Client::get_reconstruction` trait method now always returns
`QueryReconstructionResponseV2`. When the server returns V1, the client
converts it internally.
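
One plausible shape for that conversion is sketched below. It assumes a V1 response carries one presigned URL per chunk range per xorb, so each V1 entry becomes a V2 fetch entry holding exactly one range. The struct layouts and the names `V1FetchInfo`, `V2Range`, `V2Fetch`, `v1_to_v2`, and `sample_v1` are hypothetical; the crate's real conversion is not reproduced here.

```rust
use std::collections::HashMap;

/// Hypothetical V1 entry: one presigned URL for one chunk range.
struct V1FetchInfo {
    url: String,
    chunks: (u32, u32),
    bytes: (u64, u64),
}

/// Hypothetical V2 range descriptor (chunk range plus byte range).
struct V2Range {
    chunks: (u32, u32),
    bytes: (u64, u64),
}

/// Hypothetical V2 fetch entry: one URL with one or more ranges.
struct V2Fetch {
    url: String,
    ranges: Vec<V2Range>,
}

/// Wraps every V1 entry in a single-range V2 entry, keyed by the same xorb hash.
fn v1_to_v2(fetch_info: HashMap<String, Vec<V1FetchInfo>>) -> HashMap<String, Vec<V2Fetch>> {
    fetch_info
        .into_iter()
        .map(|(xorb_hash, infos)| {
            let fetches: Vec<V2Fetch> = infos
                .into_iter()
                .map(|info| V2Fetch {
                    url: info.url,
                    ranges: vec![V2Range { chunks: info.chunks, bytes: info.bytes }],
                })
                .collect();
            (xorb_hash, fetches)
        })
        .collect()
}

/// Small V1-style input with two ranges of the same xorb (made-up values).
fn sample_v1() -> HashMap<String, Vec<V1FetchInfo>> {
    let mut map = HashMap::new();
    map.insert(
        "abc".to_string(),
        vec![
            V1FetchInfo { url: "https://example.invalid/a".to_string(), chunks: (0, 3), bytes: (0, 1023) },
            V1FetchInfo { url: "https://example.invalid/b".to_string(), chunks: (5, 8), bytes: (2048, 3071) },
        ],
    );
    map
}
```

Under this model a converted V1 response never triggers a multi-range request, because every URL carries a single range.
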

---

## 2. New Config Options

### `HF_XET_CLIENT_RECONSTRUCTION_API_VERSION`

Forces a specific reconstruction API version (1 or 2). When unset, the client
auto-detects by trying V2 first.

### `HF_XET_CLIENT_ENABLE_MULTIRANGE_FETCHING`

Default: `false`. When false, V2 multi-range fetch entries are split into
individual single-range requests executed in parallel. When true, multi-range
requests are sent as-is (using `multipart/byteranges` responses).
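
The difference between the two modes shows up in the `Range` header values. The sketch below builds both forms from inclusive `(start, end)` byte pairs, following the RFC 7233 syntax `bytes=S-E` for one range and `bytes=S1-E1,S2-E2,...` for several. The function names are illustrative, and the sketch does not cover how the split requests are scheduled or whether a given presigned URL accepts a subset of its signed ranges.

```rust
/// One `Range` header value covering all ranges (multi-range mode).
fn multi_range_header(ranges: &[(u64, u64)]) -> String {
    let joined = ranges
        .iter()
        .map(|(start, end)| format!("{start}-{end}"))
        .collect::<Vec<_>>()
        .join(",");
    format!("bytes={joined}")
}

/// One `Range` header value per range (split mode); each value belongs to
/// its own request, and those requests can run in parallel.
fn single_range_headers(ranges: &[(u64, u64)]) -> Vec<String> {
    ranges
        .iter()
        .map(|(start, end)| format!("bytes={start}-{end}"))
        .collect()
}
```

For the ranges `0-1023` and `2048-3071`, the multi-range form is `bytes=0-1023,2048-3071`, while the split form yields `bytes=0-1023` and `bytes=2048-3071` as two separate requests.
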

---

## 3. Default Concurrency Changes

- `ac_initial_upload_concurrency`: 1 → 2
- `ac_initial_download_concurrency`: 1 → 4

These align the defaults with the documented values.

---

## 4. New Types in `xet_client::cas_types`

- `QueryReconstructionResponseV2`: V2 reconstruction response
- `XorbMultiRangeFetch`: a presigned URL with associated chunk/byte ranges
- `XorbRangeDescriptor`: a single chunk range + byte range pair

---

## 5. Multipart/Byteranges Parsing

`xet_client::cas_client::multipart::parse_multipart_byteranges` parses RFC 7233
`multipart/byteranges` HTTP responses. It is used when `enable_multirange_fetching`
is true and the presigned URL server returns multiple byte ranges in a single
response.
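
For readers unfamiliar with the format, the sketch below shows how such a body is laid out: each part starts with a `--<boundary>` line, carries its own headers (including `Content-Range`), and is separated from its data by an empty line; the body ends with `--<boundary>--`. This is a text-only illustration with a hypothetical `split_parts` helper, not the crate's parser, which works on raw bytes.

```rust
/// Splits a textual multipart body into `(headers, data)` pairs.
/// Simplification: real xorb data is binary, so this only suits text bodies.
fn split_parts<'a>(body: &'a str, boundary: &str) -> Vec<(&'a str, &'a str)> {
    let delimiter = format!("--{boundary}");
    let mut parts = Vec::new();
    // Everything before the first delimiter is preamble, so skip it.
    for section in body.split(delimiter.as_str()).skip(1) {
        // The closing delimiter is `--boundary--`, so its section starts with `--`.
        if section.starts_with("--") {
            break;
        }
        // Drop the CRLF that ends the delimiter line and the CRLF that
        // precedes the next delimiter.
        let section = section.strip_prefix("\r\n").unwrap_or(section);
        let section = section.strip_suffix("\r\n").unwrap_or(section);
        // Headers and data are separated by an empty line.
        if let Some((headers, data)) = section.split_once("\r\n\r\n") {
            parts.push((headers, data));
        }
    }
    parts
}
```

Parts come back in wire order here; a caller that needs them ordered by byte offset has to sort them by the start of each `Content-Range`.
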

---

## 6. Downstream Impact

- `Client::get_reconstruction` return type changed to `QueryReconstructionResponseV2`
  (all trait implementations updated).
- `URLProvider::retrieve_url` now returns `Vec<HttpRange>` instead of a single
  `HttpRange` to support multi-range blocks.
- No changes to existing wire formats or serialization; V1 responses are converted client-side.
@@ -14,24 +14,27 @@ security:
|
||||
paths:
|
||||
/v1/reconstructions/{file_id}:
|
||||
get:
|
||||
summary: Get File Reconstruction
|
||||
summary: Get File Reconstruction (V1)
|
||||
description: |
|
||||
Retrieves reconstruction information for a specific file. Supports byte range via the optional `Range` header.
|
||||
Returns one presigned URL per chunk range per xorb.
|
||||
|
||||
Minimum token scope: `read`.
|
||||
x-required-scope: read
|
||||
operationId: getReconstruction
|
||||
operationId: getReconstructionV1
|
||||
parameters:
|
||||
- $ref: '#/components/parameters/FileIdParam'
|
||||
- $ref: '#/components/parameters/RangeHeader'
|
||||
responses:
|
||||
'200':
|
||||
description: Reconstruction object
|
||||
description: V1 reconstruction object
|
||||
content:
|
||||
application/json:
|
||||
schema:
|
||||
$ref: '#/components/schemas/QueryReconstructionResponse'
|
||||
examples:
|
||||
example:
|
||||
v1:
|
||||
summary: V1 response
|
||||
value:
|
||||
offset_into_first_range: 0
|
||||
terms:
|
||||
@@ -57,6 +60,60 @@ paths:
|
||||
description: Not Found — File does not exist
|
||||
'416':
|
||||
description: Range Not Satisfiable — Requested byte range start exceeds file length
|
||||
/v2/reconstructions/{file_id}:
|
||||
get:
|
||||
summary: Get File Reconstruction (V2)
|
||||
description: |
|
||||
V2 reconstruction endpoint optimized for multi-range fetching.
|
||||
Returns fewer signed URLs by combining multiple byte ranges for the same xorb into a single URL,
|
||||
enabling multi-range HTTP requests (RFC 7233).
|
||||
|
||||
Clients SHOULD try V2 first and fall back to V1 if the server returns 404 or 501.
|
||||
|
||||
Minimum token scope: `read`.
|
||||
x-required-scope: read
|
||||
operationId: getReconstructionV2
|
||||
parameters:
|
||||
- $ref: '#/components/parameters/FileIdParam'
|
||||
- $ref: '#/components/parameters/RangeHeader'
|
||||
responses:
|
||||
'200':
|
||||
description: V2 reconstruction object
|
||||
content:
|
||||
application/json:
|
||||
schema:
|
||||
$ref: '#/components/schemas/QueryReconstructionResponseV2'
|
||||
examples:
|
||||
v2:
|
||||
summary: V2 response (multi-range optimized)
|
||||
value:
|
||||
offset_into_first_range: 0
|
||||
terms:
|
||||
- hash: a1b2c3d4e5f6789012345678901234567890abcdef1234567890abcdef123456
|
||||
unpacked_length: 263873
|
||||
range:
|
||||
start: 0
|
||||
end: 4
|
||||
xorbs:
|
||||
a1b2c3d4e5f6789012345678901234567890abcdef1234567890abcdef123456:
|
||||
- url: "https://transfer.xethub.hf.co/xorbs/default/a1b2c3...?<signed-params>"
|
||||
ranges:
|
||||
- chunks:
|
||||
start: 0
|
||||
end: 4
|
||||
bytes:
|
||||
start: 0
|
||||
end: 131071
|
||||
'400':
|
||||
description: Bad Request — Malformed file_id
|
||||
'401':
|
||||
description: Unauthorized — Missing/expired token
|
||||
'404':
|
||||
description: Not Found — File does not exist, or V2 not supported (fall back to V1)
|
||||
'416':
|
||||
description: Range Not Satisfiable — Requested byte range start exceeds file length
|
||||
'501':
|
||||
description: Not Implemented — V2 not supported by this server (fall back to V1)
|
||||
/v1/chunks/{prefix}/{hash}:
|
||||
get:
|
||||
summary: Query Chunk Deduplication (Global Deduplication)
|
||||
@@ -286,6 +343,56 @@ components:
|
||||
$ref: '#/components/schemas/CASReconstructionFetchInfo'
|
||||
required: [offset_into_first_range, terms, fetch_info]
|
||||
additionalProperties: false
|
||||
XorbRangeDescriptor:
|
||||
type: object
|
||||
description: A chunk/byte range within a xorb
|
||||
properties:
|
||||
chunks:
|
||||
$ref: '#/components/schemas/IndexRange'
|
||||
bytes:
|
||||
$ref: '#/components/schemas/ByteRange'
|
||||
required: [chunks, bytes]
|
||||
additionalProperties: false
|
||||
XorbMultiRangeFetch:
|
||||
type: object
|
||||
description: A signed multi-range fetch entry covering a subset of ranges for a xorb
|
||||
properties:
|
||||
url:
|
||||
type: string
|
||||
format: uri
|
||||
description: |
|
||||
Signed URL with all byte ranges encoded.
|
||||
Client must send exactly the signed range value as the Range header.
|
||||
ranges:
|
||||
type: array
|
||||
items:
|
||||
$ref: '#/components/schemas/XorbRangeDescriptor'
|
||||
description: Byte ranges covered by this URL, sorted by chunk start
|
||||
required: [url, ranges]
|
||||
additionalProperties: false
|
||||
QueryReconstructionResponseV2:
|
||||
type: object
|
||||
description: V2 reconstruction response optimized for multi-range fetching
|
||||
properties:
|
||||
offset_into_first_range:
|
||||
type: integer
|
||||
minimum: 0
|
||||
terms:
|
||||
type: array
|
||||
items:
|
||||
$ref: '#/components/schemas/CASReconstructionTerm'
|
||||
xorbs:
|
||||
type: object
|
||||
description: Map from xorb hash to list of multi-range fetch entries
|
||||
propertyNames:
|
||||
$ref: '#/components/schemas/HexString64Lowercase'
|
||||
additionalProperties:
|
||||
type: array
|
||||
items:
|
||||
$ref: '#/components/schemas/XorbMultiRangeFetch'
|
||||
minItems: 1
|
||||
required: [offset_into_first_range, terms, xorbs]
|
||||
additionalProperties: false
|
||||
UploadXorbResponse:
|
||||
type: object
|
||||
properties:
|
||||
|
||||
@@ -6,14 +6,16 @@ use xet_core_structures::xorb_object::SerializedXorbObject;
 use super::adaptive_concurrency::ConnectionPermit;
 use super::error::Result;
 use super::progress_tracked_streams::ProgressCallback;
-use crate::cas_types::{BatchQueryReconstructionResponse, FileRange, HttpRange, QueryReconstructionResponse};
+use crate::cas_types::{BatchQueryReconstructionResponse, FileRange, HttpRange, QueryReconstructionResponseV2};

 #[async_trait::async_trait]
 pub trait URLProvider: Send + Sync {
-    // Retrieves the URL.
-    async fn retrieve_url(&self) -> Result<(String, HttpRange)>;
+    /// Retrieves the URL and the byte ranges to fetch.
+    /// For single-range (V1) blocks, the Vec has one entry.
+    /// For multi-range (V2) blocks, all ranges are included.
+    async fn retrieve_url(&self) -> Result<(String, Vec<HttpRange>)>;

-    // Asks for a refresh of the URL; triggered on 403 errors.
+    /// Asks for a refresh of the URL; triggered on 403 errors.
     async fn refresh_url(&self) -> Result<()>;
 }

@@ -30,11 +32,13 @@ pub trait Client: Send + Sync {
         file_hash: &MerkleHash,
     ) -> Result<Option<(MDBFileInfo, Option<MerkleHash>)>>;

+    /// Returns reconstruction info always in V2 format.
+    /// Implementations may try V2 first and fall back to V1 + convert.
     async fn get_reconstruction(
         &self,
         file_id: &MerkleHash,
         bytes_range: Option<FileRange>,
-    ) -> Result<Option<QueryReconstructionResponse>>;
+    ) -> Result<Option<QueryReconstructionResponseV2>>;

     async fn batch_get_reconstruction(&self, file_ids: &[MerkleHash]) -> Result<BatchQueryReconstructionResponse>;

@@ -16,6 +16,7 @@ mod error;
 pub mod exports;
 pub mod http_client;
 mod interface;
+pub mod multipart;
 pub mod progress_tracked_streams;
 pub mod remote_client;
 pub mod retry_wrapper;
186
xet_client/src/cas_client/multipart.rs
Normal file
@@ -0,0 +1,186 @@
use bytes::Bytes;

use crate::cas_client::error::{CasClientError, Result};
use crate::cas_types::HttpRange;

/// A single part from a multipart/byteranges HTTP response.
pub struct MultipartPart {
    pub range: HttpRange,
    pub data: Bytes,
}

/// Parse a `multipart/byteranges` HTTP response body (RFC 7233 §4.1).
///
/// Extracts the boundary from `content_type`, splits the body by boundary markers,
/// parses `Content-Range` headers from each part, and returns parts sorted by byte range start.
pub fn parse_multipart_byteranges(content_type: &str, body: Bytes) -> Result<Vec<MultipartPart>> {
    let boundary = extract_boundary(content_type)?;

    let delimiter = format!("\r\n--{boundary}");
    let body_slice = body.as_ref();

    let mut parts = Vec::new();

    let first_delim = format!("--{boundary}");
    let Some(start) = find_subsequence(body_slice, first_delim.as_bytes()) else {
        return Err(CasClientError::Other("No boundary found in multipart body".to_string()));
    };

    let mut remaining = &body_slice[start + first_delim.len()..];

    loop {
        if remaining.starts_with(b"\r\n") {
            remaining = &remaining[2..];
        } else {
            break;
        }

        let next_boundary = find_subsequence(remaining, delimiter.as_bytes());
        let part_data = match next_boundary {
            Some(pos) => &remaining[..pos],
            None => remaining,
        };

        let Some(header_end) = find_subsequence(part_data, b"\r\n\r\n") else {
            return Err(CasClientError::Other("Malformed multipart part: missing header/data separator".to_string()));
        };

        let headers = &part_data[..header_end];
        let data_start = header_end + 4;
        let data = &part_data[data_start..];

        let range = parse_content_range(headers)?;
        // Compute the absolute byte offset into the original `body` so we can
        // use Bytes::slice for zero-copy extraction of this part's data.
        let offset =
            body.len() - body_slice.len() + (remaining.as_ptr() as usize - body_slice.as_ptr() as usize) + data_start;
        parts.push(MultipartPart {
            range,
            data: body.slice(offset..offset + data.len()),
        });

        match next_boundary {
            Some(pos) => {
                remaining = &remaining[pos + delimiter.len()..];
            },
            None => break,
        }
    }

    parts.sort_by_key(|p| p.range.start);

    Ok(parts)
}

fn extract_boundary(content_type: &str) -> Result<String> {
    for part in content_type.split(';') {
        let part = part.trim();
        if let Some(value) = part.strip_prefix("boundary=") {
            let boundary = value.trim_matches('"');
            return Ok(boundary.to_string());
        }
    }
    Err(CasClientError::Other(format!("No boundary found in Content-Type: {content_type}")))
}

fn parse_content_range(headers: &[u8]) -> Result<HttpRange> {
    let headers_str = std::str::from_utf8(headers)
        .map_err(|e| CasClientError::Other(format!("Invalid UTF-8 in part headers: {e}")))?;

    for line in headers_str.split("\r\n") {
        let line_lower = line.to_ascii_lowercase();
        if let Some(value) = line_lower.strip_prefix("content-range:") {
            // Digits, dashes, and slashes are case-invariant, so we can parse
            // directly from the lowercased value.
            if let Some(range_spec) = value.trim().strip_prefix("bytes ") {
                let original_value = range_spec.trim();
                let slash_pos = original_value
                    .find('/')
                    .ok_or_else(|| CasClientError::Other(format!("Invalid Content-Range: {line}")))?;
                let range_part = &original_value[..slash_pos];
                let dash_pos = range_part
                    .find('-')
                    .ok_or_else(|| CasClientError::Other(format!("Invalid Content-Range: {line}")))?;
                let start: u64 = range_part[..dash_pos]
                    .parse()
                    .map_err(|e| CasClientError::Other(format!("Invalid Content-Range start: {e}")))?;
                let end: u64 = range_part[dash_pos + 1..]
                    .parse()
                    .map_err(|e| CasClientError::Other(format!("Invalid Content-Range end: {e}")))?;
                // RFC 7233 Content-Range uses an inclusive end, which matches HttpRange.
                return Ok(HttpRange::new(start, end));
            }
        }
    }

    Err(CasClientError::Other("No Content-Range header found in multipart part".to_string()))
}

fn find_subsequence(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    haystack.windows(needle.len()).position(|window| window == needle)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_extract_boundary() {
        assert_eq!(extract_boundary("multipart/byteranges; boundary=something").unwrap(), "something");
        assert_eq!(extract_boundary("multipart/byteranges; boundary=\"quoted\"").unwrap(), "quoted");
    }

    #[test]
    fn test_extract_boundary_missing() {
        assert!(extract_boundary("text/plain").is_err());
    }

    #[test]
    fn test_parse_single_part() {
        let boundary = "abc123";
        let body = format!(
            "--{boundary}\r\nContent-Type: application/octet-stream\r\nContent-Range: bytes 0-99/1000\r\n\r\nHello World\r\n--{boundary}--\r\n"
        );
        let content_type = format!("multipart/byteranges; boundary={boundary}");

        let parts = parse_multipart_byteranges(&content_type, Bytes::from(body)).unwrap();
        assert_eq!(parts.len(), 1);
        assert_eq!(parts[0].range.start, 0);
        assert_eq!(parts[0].range.end, 99);
        assert_eq!(&parts[0].data[..], b"Hello World");
    }

    #[test]
    fn test_parse_multiple_parts() {
        let boundary = "sep";
        let body = format!(
            "--{boundary}\r\nContent-Range: bytes 100-199/1000\r\n\r\nPart2Data\r\n--{boundary}\r\nContent-Range: bytes 0-49/1000\r\n\r\nPart1Data\r\n--{boundary}--\r\n"
        );
        let content_type = format!("multipart/byteranges; boundary={boundary}");

        let parts = parse_multipart_byteranges(&content_type, Bytes::from(body)).unwrap();
        assert_eq!(parts.len(), 2);
        assert_eq!(parts[0].range.start, 0);
        assert_eq!(parts[0].range.end, 49);
        assert_eq!(&parts[0].data[..], b"Part1Data");
        assert_eq!(parts[1].range.start, 100);
        assert_eq!(parts[1].range.end, 199);
        assert_eq!(&parts[1].data[..], b"Part2Data");
    }

    #[test]
    fn test_parse_empty_body_no_boundary() {
        let content_type = "multipart/byteranges; boundary=xyz";
        let result = parse_multipart_byteranges(content_type, Bytes::new());
        assert!(result.is_err());
    }

    #[test]
    fn test_parse_part_missing_header_separator() {
        let boundary = "xyz";
        let body = format!("--{boundary}\r\nContent-Range: bytes 0-9/100\r\nMISSING_SEPARATOR\r\n--{boundary}--\r\n");
        let content_type = format!("multipart/byteranges; boundary={boundary}");
        let result = parse_multipart_byteranges(&content_type, Bytes::from(body));
        assert!(result.is_err());
    }
}
@@ -1,5 +1,5 @@
|
||||
use std::sync::Arc;
|
||||
use std::sync::atomic::{AtomicU64, Ordering};
|
||||
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};
|
||||
|
||||
use bytes::Bytes;
|
||||
use futures::TryStreamExt;
|
||||
@@ -24,8 +24,8 @@ use super::progress_tracked_streams::{
|
||||
use super::retry_wrapper::{RetryWrapper, RetryableReqwestError};
|
||||
use super::{Client, INFORMATION_LOG_LEVEL};
|
||||
use crate::cas_types::{
|
||||
BatchQueryReconstructionResponse, FileRange, HttpRange, Key, QueryReconstructionResponse, UploadShardResponse,
|
||||
UploadShardResponseType, UploadXorbResponse,
|
||||
BatchQueryReconstructionResponse, FileRange, HttpRange, Key, QueryReconstructionResponse,
|
||||
QueryReconstructionResponseV2, UploadShardResponse, UploadShardResponseType, UploadXorbResponse,
|
||||
};
|
||||
|
||||
pub const CAS_ENDPOINT: &str = "http://localhost:8080";
|
||||
@@ -48,6 +48,8 @@ pub struct RemoteClient {
|
||||
shard_upload_http_client: Arc<ClientWithMiddleware>,
|
||||
upload_concurrency_controller: Arc<AdaptiveConcurrencyController>,
|
||||
download_concurrency_controller: Arc<AdaptiveConcurrencyController>,
|
||||
/// Caches the discovered reconstruction API version (0 = not yet probed, 1 = V1, 2 = V2).
|
||||
detected_reconstruction_api_version: AtomicU32,
|
||||
}
|
||||
|
||||
impl RemoteClient {
|
||||
@@ -85,6 +87,7 @@ impl RemoteClient {
|
||||
),
|
||||
upload_concurrency_controller: AdaptiveConcurrencyController::new_upload("upload"),
|
||||
download_concurrency_controller: AdaptiveConcurrencyController::new_download("download"),
|
||||
detected_reconstruction_api_version: AtomicU32::new(0),
|
||||
})
|
||||
}
|
||||
|
||||
@@ -168,6 +171,126 @@ impl RemoteClient {
|
||||
}
|
||||
}
|
||||
|
||||
impl RemoteClient {
|
||||
async fn get_reconstruction_impl<T>(
|
||||
&self,
|
||||
file_id: &MerkleHash,
|
||||
bytes_range: Option<FileRange>,
|
||||
api_version: &str,
|
||||
) -> Result<Option<T>>
|
||||
where
|
||||
T: serde::de::DeserializeOwned + 'static,
|
||||
{
|
||||
let call_id = FN_CALL_ID.fetch_add(1, Ordering::Relaxed);
|
||||
let url = Url::parse(&format!("{}/{api_version}/reconstructions/{}", self.endpoint, file_id.hex()))?;
|
||||
let api_tag = match api_version {
|
||||
"v1" => "cas::get_reconstruction_v1",
|
||||
"v2" => "cas::get_reconstruction_v2",
|
||||
_ => {
|
||||
return Err(CasClientError::internal(format!("unsupported reconstruction API version: {api_version}")));
|
||||
},
|
||||
};
|
||||
|
||||
event!(
|
||||
INFORMATION_LOG_LEVEL,
|
||||
call_id,
|
||||
%file_id,
|
||||
?bytes_range,
|
||||
api_version,
|
||||
"Starting get_reconstruction API call",
|
||||
);
|
||||
|
||||
let client = self.authenticated_http_client.clone();
|
||||
|
||||
let result: Result<T> = RetryWrapper::new(api_tag)
|
||||
.run_and_extract_json(move || {
|
||||
let mut request = client.get(url.clone()).with_extension(Api(api_tag));
|
||||
if let Some(range) = bytes_range {
|
||||
request = request.header(RANGE, HttpRange::from(range).range_header())
|
||||
}
|
||||
request.send()
|
||||
})
|
||||
.await;
|
||||
|
||||
match result {
|
||||
Ok(response) => {
|
||||
event!(
|
||||
INFORMATION_LOG_LEVEL,
|
||||
call_id,
|
||||
%file_id,
|
||||
?bytes_range,
|
||||
api_version,
|
||||
"Completed get_reconstruction API call"
|
||||
);
|
||||
Ok(Some(response))
|
||||
},
|
||||
Err(CasClientError::ReqwestError(ref e, _)) if e.status() == Some(StatusCode::RANGE_NOT_SATISFIABLE) => {
|
||||
Ok(None)
|
||||
},
|
||||
Err(e) => Err(e),
|
||||
}
|
||||
}
|
||||
|
||||
/// V1 reconstruction: returns per-range presigned URLs.
|
||||
pub async fn get_reconstruction_v1(
|
||||
&self,
|
||||
file_id: &MerkleHash,
|
||||
bytes_range: Option<FileRange>,
|
||||
) -> Result<Option<QueryReconstructionResponse>> {
|
||||
self.get_reconstruction_impl(file_id, bytes_range, "v1").await
|
||||
}
|
||||
|
||||
/// V2 reconstruction: returns per-xorb multi-range fetch descriptors.
|
||||
pub async fn get_reconstruction_v2(
|
||||
&self,
|
||||
file_id: &MerkleHash,
|
||||
bytes_range: Option<FileRange>,
|
||||
) -> Result<Option<QueryReconstructionResponseV2>> {
|
||||
self.get_reconstruction_impl(file_id, bytes_range, "v2").await
|
||||
}
|
||||
|
||||
pub(crate) async fn get_reconstruction_with_version_override(
|
||||
&self,
|
||||
file_id: &MerkleHash,
|
||||
bytes_range: Option<FileRange>,
|
||||
forced_version: Option<u32>,
|
||||
) -> Result<Option<QueryReconstructionResponseV2>> {
|
||||
// Prefer V2; fall back to V1 on 404/501; persist detected version to
|
||||
// avoid repeated fallback attempts.
|
||||
let version = match forced_version {
|
||||
Some(v) => v,
|
||||
None => {
|
||||
let detected = self.detected_reconstruction_api_version.load(Ordering::Relaxed);
|
||||
if detected != 0 { detected } else { 2 }
|
||||
},
|
||||
};
|
||||
|
||||
match version {
|
||||
2 => match self.get_reconstruction_v2(file_id, bytes_range).await {
|
||||
Ok(result) => {
|
||||
if forced_version.is_none() {
|
||||
self.detected_reconstruction_api_version.store(2, Ordering::Relaxed);
|
||||
}
|
||||
Ok(result)
|
||||
},
|
||||
Err(e)
|
||||
if forced_version.is_none()
|
||||
&& matches!(e.status(), Some(StatusCode::NOT_FOUND) | Some(StatusCode::NOT_IMPLEMENTED)) =>
|
||||
{
|
||||
info!(status = ?e.status(), "V2 reconstruction not available, falling back to V1");
|
||||
let result = self.get_reconstruction_v1(file_id, bytes_range).await?.map(Into::into);
|
||||
// Store after success to make sure we don't mess up on e.g. network failure.
|
||||
self.detected_reconstruction_api_version.store(1, Ordering::Relaxed);
|
||||
Ok(result)
|
||||
},
|
||||
Err(e) => Err(e),
|
||||
},
|
||||
1 => Ok(self.get_reconstruction_v1(file_id, bytes_range).await?.map(Into::into)),
|
||||
other => Err(CasClientError::internal(format!("unsupported reconstruction API version: {other}"))),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg_attr(not(target_family = "wasm"), async_trait::async_trait)]
|
||||
#[cfg_attr(target_family = "wasm", async_trait::async_trait(?Send))]
|
||||
impl Client for RemoteClient {
|
||||
@@ -175,49 +298,10 @@ impl Client for RemoteClient {
|
||||
&self,
|
||||
file_id: &MerkleHash,
|
||||
bytes_range: Option<FileRange>,
|
||||
) -> Result<Option<QueryReconstructionResponse>> {
|
||||
let call_id = FN_CALL_ID.fetch_add(1, Ordering::Relaxed);
|
||||
let url = Url::parse(&format!("{}/v1/reconstructions/{}", self.endpoint, file_id.hex()))?;
|
||||
event!(
|
||||
INFORMATION_LOG_LEVEL,
|
||||
call_id,
|
||||
%file_id,
|
||||
?bytes_range,
|
||||
"Starting get_reconstruction API call",
|
||||
);
|
||||
|
||||
let api_tag = "cas::get_reconstruction";
|
||||
let client = self.authenticated_http_client.clone();
|
||||
|
||||
let result: Result<QueryReconstructionResponse> = RetryWrapper::new(api_tag)
|
||||
.run_and_extract_json(move || {
|
||||
let mut request = client.get(url.clone()).with_extension(Api(api_tag));
|
||||
if let Some(range) = bytes_range {
|
||||
// convert exclusive-end to inclusive-end range
|
||||
request = request.header(RANGE, HttpRange::from(range).range_header())
|
||||
}
|
||||
|
||||
request.send()
|
||||
})
|
||||
.await;
|
||||
|
||||
match result {
|
||||
Ok(query_reconstruction_response) => {
|
||||
event!(
|
||||
INFORMATION_LOG_LEVEL,
|
||||
call_id,
|
||||
%file_id,
|
||||
?bytes_range,
|
||||
"Completed get_reconstruction API call"
|
||||
);
|
||||
Ok(Some(query_reconstruction_response))
|
||||
},
|
||||
Err(CasClientError::ReqwestError(ref e, _)) if e.status() == Some(StatusCode::RANGE_NOT_SATISFIABLE) => {
|
||||
// bytes_range not satisfiable
|
||||
Ok(None)
|
||||
},
|
||||
Err(e) => Err(e),
|
||||
}
|
||||
) -> Result<Option<QueryReconstructionResponseV2>> {
|
||||
let forced_version = xet_config().client.reconstruction_api_version;
|
||||
self.get_reconstruction_with_version_override(file_id, bytes_range, forced_version)
|
||||
.await
|
||||
}
|
||||
|
||||
async fn batch_get_reconstruction(&self, file_ids: &[MerkleHash]) -> Result<BatchQueryReconstructionResponse> {
|
||||
@@ -270,8 +354,8 @@ impl Client for RemoteClient {
|
||||
let http_client = self.http_client.clone();
|
||||
let url_info = Arc::new(url_info);
|
||||
|
||||
let (_, url_range) = url_info.retrieve_url().await?;
|
||||
let total_download_bytes = url_range.length();
|
||||
let (_, url_ranges) = url_info.retrieve_url().await?;
|
||||
let total_download_bytes: u64 = url_ranges.iter().map(|r| r.length()).sum();
|
||||
|
||||
let mut transfer_reporter = StreamProgressReporter::new(total_download_bytes)
|
||||
.with_adaptive_concurrency_reporter(download_permit.get_partial_completion_reporting_function());
|
||||
@@ -288,16 +372,28 @@ impl Client for RemoteClient {
|
||||
let url_info = url_info.clone();
|
||||
|
||||
async move {
|
||||
let (url_string, url_range) = url_info
|
||||
let (url_string, url_ranges) = url_info
|
||||
.retrieve_url()
|
||||
.await
|
||||
.map_err(|e| reqwest_middleware::Error::Middleware(e.into()))?;
|
||||
let url =
|
||||
Url::parse(&url_string).map_err(|e| reqwest_middleware::Error::Middleware(e.into()))?;
|
||||
|
||||
// RFC 7233 §2.1: single-range uses "bytes=S-E", multi-range uses "bytes=S1-E1,S2-E2,..."
|
||||
let range_header_value = if url_ranges.len() == 1 {
|
||||
url_ranges[0].range_header()
|
||||
} else {
|
||||
let joined = url_ranges
|
||||
.iter()
|
||||
.map(|r| format!("{}-{}", r.start, r.end))
|
||||
.collect::<Vec<_>>()
|
||||
.join(",");
|
||||
format!("bytes={joined}")
|
||||
};
|
||||
|
||||
let response = http_client
|
||||
.get(url)
|
||||
.header(RANGE, url_range.range_header())
|
||||
.header(RANGE, range_header_value)
|
||||
.with_extension(Api(api_tag))
|
||||
.send()
|
||||
.await?;
|
||||
@@ -315,6 +411,57 @@ impl Client for RemoteClient {
|
||||
move |resp: Response| {
|
||||
let transfer_reporter = transfer_reporter.clone();
|
||||
async move {
|
||||
let content_type = resp
|
||||
.headers()
|
||||
.get("content-type")
|
||||
.and_then(|v| v.to_str().ok())
|
||||
.unwrap_or("")
|
||||
.to_string();
|
||||
|
||||
let is_multipart = content_type.contains("multipart/byteranges");
|
||||
|
||||
if is_multipart {
|
||||
let body = resp
|
||||
.bytes()
|
||||
.await
|
||||
.map_err(|e| RetryableReqwestError::RetryableError(CasClientError::from(e)))?;
|
||||
|
||||
let multipart_parts = crate::cas_client::multipart::parse_multipart_byteranges(&content_type, body)
|
||||
.map_err(RetryableReqwestError::FatalError)?;
|
||||
|
||||
let mut all_decompressed = Vec::with_capacity(uncompressed_size_if_known.unwrap_or(0));
|
||||
let mut all_chunk_indices = Vec::<u32>::new();
|
||||
let mut total_compressed_bytes = 0u64;
|
||||
|
||||
for part in multipart_parts {
|
||||
total_compressed_bytes += part.data.len() as u64;
|
||||
|
||||
let (data, chunk_indices) =
|
||||
xet_core_structures::xorb_object::deserialize_chunks(&mut std::io::Cursor::new(part.data.as_ref()))
|
||||
.map_err(|e| {
|
||||
RetryableReqwestError::RetryableError(CasClientError::FormatError(e))
|
||||
})?;
|
||||
|
||||
xet_core_structures::xorb_object::append_chunk_segment(
|
||||
&mut all_decompressed,
|
||||
&mut all_chunk_indices,
|
||||
&data,
|
||||
&chunk_indices,
|
||||
);
|
||||
|
||||
transfer_reporter.report_progress(total_compressed_bytes as usize);
|
||||
}
|
||||
|
||||
if let Some(expected) = uncompressed_size_if_known
|
||||
&& expected != all_decompressed.len()
|
||||
{
|
||||
return Err(RetryableReqwestError::RetryableError(CasClientError::Other(format!(
|
||||
"get_file_term_data: expected {expected} uncompressed bytes, got {}",
|
||||
all_decompressed.len()
|
||||
))));
|
||||
}
|
||||
Ok((Bytes::from(all_decompressed), all_chunk_indices))
|
||||
} else {
|
||||
let incoming_stream = DownloadProgressStream::wrap_stream(
|
||||
resp.bytes_stream().map_err(std::io::Error::other),
|
||||
transfer_reporter,
|
||||
@@ -345,6 +492,7 @@ impl Client for RemoteClient {
|
||||
Err(e) => Err(RetryableReqwestError::RetryableError(CasClientError::FormatError(e))),
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
)
|
||||
.await?;
|
||||
|
||||
@@ -157,10 +157,13 @@ impl RetryWrapper {
|
||||
}
|
||||
},
|
||||
(Err(e), Some(Retryable::Transient)) => {
|
||||
// Intercept the too many requests condition in the case of no retrying on 429.
|
||||
if e.status() == Some(StatusCode::TOO_MANY_REQUESTS) && self.no_retry_on_429 {
|
||||
let cas_err = process_error("Too Many Requests (retry on 429 disabled)", e, false);
|
||||
Err(RetryableReqwestError::FatalError(cas_err))
|
||||
} else if e.status() == Some(StatusCode::NOT_IMPLEMENTED) {
|
||||
// 501 is permanent -- the server won't implement this on retry.
|
||||
let cas_err = process_error("Not Implemented", e, true);
|
||||
Err(RetryableReqwestError::FatalError(cas_err))
|
||||
} else {
|
||||
let cas_err = process_error("Retryable Error", e, true);
|
||||
Err(RetryableReqwestError::RetryableError(cas_err))
|
||||
|
||||
@@ -36,6 +36,11 @@ where
    test_get_file_data_with_ranges(factory().await).await;
    test_get_file_size(factory().await).await;
    test_global_dedup(factory().await).await;
    test_v2_reconstruction_basic(factory().await).await;
    test_v2_reconstruction_ranges(factory().await).await;
    test_v2_reconstruction_matches_v1(factory().await).await;
    test_v2_max_ranges_per_fetch(factory().await).await;
    test_v2_url_encoding(factory().await).await;
}

/// Tests that adjacent chunk ranges from the same xorb are merged into a single fetch_info.
@@ -43,7 +48,7 @@ pub async fn test_reconstruction_merges_adjacent_ranges(client: Arc<dyn DirectAc
    let term_spec = &[(1, (0, 2)), (1, (2, 4))];
    let file = client.upload_random_file(term_spec, 2048).await.unwrap();

    let reconstruction = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
    let reconstruction = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();
    assert_eq!(reconstruction.terms.len(), 2);
    assert_eq!(reconstruction.fetch_info.len(), 1);

@@ -59,7 +64,7 @@ pub async fn test_reconstruction_with_multiple_xorbs(client: Arc<dyn DirectAcces
    let term_spec = &[(1, (0, 3)), (2, (0, 2)), (1, (3, 5))];
    let file = client.upload_random_file(term_spec, 2048).await.unwrap();

    let reconstruction = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
    let reconstruction = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();
    assert_eq!(reconstruction.terms.len(), 3);
    assert_eq!(reconstruction.fetch_info.len(), 2);
}
@@ -73,7 +78,7 @@ pub async fn test_reconstruction_overlapping_range_merging(client: Arc<dyn Direc
    let term_spec = &[(1, (0, 3)), (1, (1, 4))];
    let file = client.upload_random_file(term_spec, chunk_size).await.unwrap();

    let reconstruction = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
    let reconstruction = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();
    assert_eq!(reconstruction.terms.len(), 2);
    assert_eq!(reconstruction.fetch_info.len(), 1);

@@ -89,7 +94,7 @@ pub async fn test_reconstruction_overlapping_range_merging(client: Arc<dyn Direc
    let term_spec = &[(1, (0, 5)), (1, (1, 3))];
    let file = client.upload_random_file(term_spec, chunk_size).await.unwrap();

    let reconstruction = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
    let reconstruction = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();
    assert_eq!(reconstruction.terms.len(), 2);
    assert_eq!(reconstruction.fetch_info.len(), 1);

@@ -105,7 +110,7 @@ pub async fn test_reconstruction_overlapping_range_merging(client: Arc<dyn Direc
    let term_spec = &[(1, (0, 2)), (1, (1, 4)), (1, (3, 6))];
    let file = client.upload_random_file(term_spec, chunk_size).await.unwrap();

    let reconstruction = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
    let reconstruction = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();
    assert_eq!(reconstruction.terms.len(), 3);
    assert_eq!(reconstruction.fetch_info.len(), 1);

@@ -121,7 +126,7 @@ pub async fn test_reconstruction_overlapping_range_merging(client: Arc<dyn Direc
    let term_spec = &[(1, (0, 2)), (1, (4, 6))];
    let file = client.upload_random_file(term_spec, chunk_size).await.unwrap();

    let reconstruction = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
    let reconstruction = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();
    assert_eq!(reconstruction.terms.len(), 2);
    assert_eq!(reconstruction.fetch_info.len(), 1);

@@ -139,7 +144,7 @@ pub async fn test_reconstruction_overlapping_range_merging(client: Arc<dyn Direc
    let term_spec = &[(1, (0, 3)), (1, (3, 5))];
    let file = client.upload_random_file(term_spec, chunk_size).await.unwrap();

    let reconstruction = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
    let reconstruction = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();
    assert_eq!(reconstruction.terms.len(), 2);
    assert_eq!(reconstruction.fetch_info.len(), 1);

@@ -155,7 +160,7 @@ pub async fn test_reconstruction_overlapping_range_merging(client: Arc<dyn Direc
    let term_spec = &[(1, (2, 5)), (1, (2, 5)), (1, (2, 5))];
    let file = client.upload_random_file(term_spec, chunk_size).await.unwrap();

    let reconstruction = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
    let reconstruction = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();
    assert_eq!(reconstruction.terms.len(), 3);
    assert_eq!(reconstruction.fetch_info.len(), 1);

@@ -171,7 +176,7 @@ pub async fn test_reconstruction_overlapping_range_merging(client: Arc<dyn Direc
    let term_spec = &[(1, (0, 3)), (1, (2, 4)), (1, (6, 8)), (1, (7, 10))];
    let file = client.upload_random_file(term_spec, chunk_size).await.unwrap();

    let reconstruction = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
    let reconstruction = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();
    assert_eq!(reconstruction.terms.len(), 4);
    assert_eq!(reconstruction.fetch_info.len(), 1);

@@ -191,12 +196,12 @@ pub async fn test_range_requests(client: Arc<dyn DirectAccessClient>) {
    let file = client.upload_random_file(term_spec, 2048).await.unwrap();

    // Calculate total file size from terms
    let reconstruction_full = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
    let reconstruction_full = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();
    let total_file_size: u64 = reconstruction_full.terms.iter().map(|t| t.unpacked_length as u64).sum();

    // Partial out-of-range truncates
    let response = client
        .get_reconstruction(&file.file_hash, Some(FileRange::new(total_file_size / 2, total_file_size + 1000)))
        .get_reconstruction_v1(&file.file_hash, Some(FileRange::new(total_file_size / 2, total_file_size + 1000)))
        .await
        .unwrap()
        .unwrap();
@@ -205,19 +210,19 @@ pub async fn test_range_requests(client: Arc<dyn DirectAccessClient>) {

    // Entire range out of bounds returns Ok(None) (like RemoteClient's 416 handling)
    let result = client
        .get_reconstruction(&file.file_hash, Some(FileRange::new(total_file_size + 100, total_file_size + 1000)))
        .get_reconstruction_v1(&file.file_hash, Some(FileRange::new(total_file_size + 100, total_file_size + 1000)))
        .await;
    assert!(result.unwrap().is_none());

    // Start equals file size returns Ok(None)
    let result = client
        .get_reconstruction(&file.file_hash, Some(FileRange::new(total_file_size, total_file_size + 100)))
        .get_reconstruction_v1(&file.file_hash, Some(FileRange::new(total_file_size, total_file_size + 100)))
        .await;
    assert!(result.unwrap().is_none());

    // Valid range within bounds succeeds
    let response = client
        .get_reconstruction(&file.file_hash, Some(FileRange::new(0, total_file_size / 2)))
        .get_reconstruction_v1(&file.file_hash, Some(FileRange::new(0, total_file_size / 2)))
        .await
        .unwrap()
        .unwrap();
@@ -226,7 +231,7 @@ pub async fn test_range_requests(client: Arc<dyn DirectAccessClient>) {

    // End exactly at file size succeeds
    let response = client
        .get_reconstruction(&file.file_hash, Some(FileRange::new(0, total_file_size)))
        .get_reconstruction_v1(&file.file_hash, Some(FileRange::new(0, total_file_size)))
        .await
        .unwrap()
        .unwrap();
@@ -239,7 +244,7 @@ pub async fn test_upload_configurations(client: Arc<dyn DirectAccessClient>) {
    // Test 1: Single segment with 3 chunks
    {
        let file = client.upload_random_file(&[(1, (0, 3))], 2048).await.unwrap();
        let reconstruction = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
        let reconstruction = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();
        assert_eq!(reconstruction.terms.len(), 1);
    }

@@ -248,7 +253,7 @@ pub async fn test_upload_configurations(client: Arc<dyn DirectAccessClient>) {
        let term_spec = &[(1, (0, 2)), (1, (2, 4)), (1, (4, 6))];
        let file = client.upload_random_file(term_spec, 2048).await.unwrap();

        let reconstruction = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
        let reconstruction = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();
        assert_eq!(reconstruction.terms.len(), 3);
        assert_eq!(reconstruction.fetch_info.len(), 1);
    }
@@ -258,7 +263,7 @@ pub async fn test_upload_configurations(client: Arc<dyn DirectAccessClient>) {
        let term_spec = &[(1, (0, 3)), (2, (0, 2)), (3, (0, 4))];
        let file = client.upload_random_file(term_spec, 2048).await.unwrap();

        let reconstruction = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
        let reconstruction = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();
        assert_eq!(reconstruction.terms.len(), 3);
        assert_eq!(reconstruction.fetch_info.len(), 3);
    }
@@ -268,7 +273,7 @@ pub async fn test_upload_configurations(client: Arc<dyn DirectAccessClient>) {
        let term_spec = &[(1, (0, 3)), (1, (1, 4)), (1, (2, 5))];
        let file = client.upload_random_file(term_spec, 2048).await.unwrap();

        let reconstruction = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
        let reconstruction = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();
        assert_eq!(reconstruction.terms.len(), 3);
        assert_eq!(reconstruction.fetch_info.len(), 1);
    }
@@ -280,7 +285,7 @@ pub async fn test_chunk_boundary_shrinking(client: Arc<dyn DirectAccessClient>)
    let term_spec = &[(1, (0, 5))];
    let file = client.upload_random_file(term_spec, chunk_size).await.unwrap();

    let reconstruction_full = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
    let reconstruction_full = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();
    let total_file_size: u64 = reconstruction_full.terms.iter().map(|t| t.unpacked_length as u64).sum();
    assert_eq!(total_file_size, (5 * chunk_size) as u64);

@@ -289,7 +294,7 @@ pub async fn test_chunk_boundary_shrinking(client: Arc<dyn DirectAccessClient>)
    let start = chunk_size as u64 + 500;
    let end = total_file_size;
    let response = client
        .get_reconstruction(&file.file_hash, Some(FileRange::new(start, end)))
        .get_reconstruction_v1(&file.file_hash, Some(FileRange::new(start, end)))
        .await
        .unwrap()
        .unwrap();
@@ -305,7 +310,7 @@ pub async fn test_chunk_boundary_shrinking(client: Arc<dyn DirectAccessClient>)
    let start = (chunk_size * 2) as u64;
    let end = total_file_size;
    let response = client
        .get_reconstruction(&file.file_hash, Some(FileRange::new(start, end)))
        .get_reconstruction_v1(&file.file_hash, Some(FileRange::new(start, end)))
        .await
        .unwrap()
        .unwrap();
@@ -321,7 +326,7 @@ pub async fn test_chunk_boundary_shrinking(client: Arc<dyn DirectAccessClient>)
    let start = 0u64;
    let end = (chunk_size * 2) as u64 + 500;
    let response = client
        .get_reconstruction(&file.file_hash, Some(FileRange::new(start, end)))
        .get_reconstruction_v1(&file.file_hash, Some(FileRange::new(start, end)))
        .await
        .unwrap()
        .unwrap();
@@ -337,7 +342,7 @@ pub async fn test_chunk_boundary_shrinking(client: Arc<dyn DirectAccessClient>)
    let start = (chunk_size * 2) as u64 + 100;
    let end = (chunk_size * 2) as u64 + 500;
    let response = client
        .get_reconstruction(&file.file_hash, Some(FileRange::new(start, end)))
        .get_reconstruction_v1(&file.file_hash, Some(FileRange::new(start, end)))
        .await
        .unwrap()
        .unwrap();
@@ -353,7 +358,7 @@ pub async fn test_chunk_boundary_shrinking(client: Arc<dyn DirectAccessClient>)
    let start = chunk_size as u64 - 100;
    let end = chunk_size as u64 + 100;
    let response = client
        .get_reconstruction(&file.file_hash, Some(FileRange::new(start, end)))
        .get_reconstruction_v1(&file.file_hash, Some(FileRange::new(start, end)))
        .await
        .unwrap()
        .unwrap();
@@ -371,7 +376,7 @@ pub async fn test_chunk_boundary_multiple_segments(client: Arc<dyn DirectAccessC
    let term_spec = &[(1, (0, 4)), (2, (0, 4))];
    let file = client.upload_random_file(term_spec, chunk_size).await.unwrap();

    let reconstruction_full = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
    let reconstruction_full = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();
    let total_file_size: u64 = reconstruction_full.terms.iter().map(|t| t.unpacked_length as u64).sum();
    assert_eq!(total_file_size, (8 * chunk_size) as u64);

@@ -380,7 +385,7 @@ pub async fn test_chunk_boundary_multiple_segments(client: Arc<dyn DirectAccessC
    let start = chunk_size as u64 + 500;
    let end = total_file_size;
    let response = client
        .get_reconstruction(&file.file_hash, Some(FileRange::new(start, end)))
        .get_reconstruction_v1(&file.file_hash, Some(FileRange::new(start, end)))
        .await
        .unwrap()
        .unwrap();
@@ -398,7 +403,7 @@ pub async fn test_chunk_boundary_multiple_segments(client: Arc<dyn DirectAccessC
    let start = chunk_size as u64;
    let end = (chunk_size * 3) as u64;
    let response = client
        .get_reconstruction(&file.file_hash, Some(FileRange::new(start, end)))
        .get_reconstruction_v1(&file.file_hash, Some(FileRange::new(start, end)))
        .await
        .unwrap()
        .unwrap();
@@ -415,7 +420,7 @@ pub async fn test_chunk_boundary_multiple_segments(client: Arc<dyn DirectAccessC
    let start = xorb1_size + chunk_size as u64;
    let end = xorb1_size + (chunk_size * 3) as u64;
    let response = client
        .get_reconstruction(&file.file_hash, Some(FileRange::new(start, end)))
        .get_reconstruction_v1(&file.file_hash, Some(FileRange::new(start, end)))
        .await
        .unwrap()
        .unwrap();
@@ -432,7 +437,7 @@ pub async fn test_chunk_boundary_multiple_segments(client: Arc<dyn DirectAccessC
    let start = (chunk_size * 2) as u64;
    let end = xorb1_size + (chunk_size * 2) as u64 + 500;
    let response = client
        .get_reconstruction(&file.file_hash, Some(FileRange::new(start, end)))
        .get_reconstruction_v1(&file.file_hash, Some(FileRange::new(start, end)))
        .await
        .unwrap()
        .unwrap();
@@ -712,7 +717,7 @@ async fn test_url_expiration_within_window(client: Arc<dyn DirectAccessClient>)

    // Upload a file and get reconstruction info (which creates URLs with current timestamp)
    let file = client.upload_random_file(&[(1, (0, 3))], 2048).await.unwrap();
    let reconstruction = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
    let reconstruction = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();

    // Get the fetch_info for the first term's xorb
    let xorb_hash = file.terms[0].xorb_hash;
@@ -738,7 +743,7 @@ async fn test_url_expiration_after_window(client: Arc<dyn DirectAccessClient>) {

    // Upload a file and get reconstruction info (which creates URLs with current timestamp)
    let file = client.upload_random_file(&[(1, (0, 3))], 2048).await.unwrap();
    let reconstruction = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
    let reconstruction = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();

    // Get the fetch_info for the first term's xorb
    let xorb_hash = file.terms[0].xorb_hash;
@@ -764,7 +769,7 @@ async fn test_url_expiration_default_infinite(client: Arc<dyn DirectAccessClient

    // Upload a file and get reconstruction info
    let file = client.upload_random_file(&[(1, (0, 3))], 2048).await.unwrap();
    let reconstruction = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
    let reconstruction = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();

    // Get the fetch_info for the first term's xorb
    let xorb_hash = file.terms[0].xorb_hash;
@@ -790,7 +795,7 @@ async fn test_url_expiration_exact_boundary(client: Arc<dyn DirectAccessClient>)

    // Upload a file and get reconstruction info
    let file = client.upload_random_file(&[(1, (0, 3))], 2048).await.unwrap();
    let reconstruction = client.get_reconstruction(&file.file_hash, None).await.unwrap().unwrap();
    let reconstruction = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();

    // Get the fetch_info for the first term's xorb
    let xorb_hash = file.terms[0].xorb_hash;
@@ -916,3 +921,190 @@ async fn test_api_delay_can_be_disabled(client: Arc<dyn DirectAccessClient>) {
        "Delay should not be applied after disabling: elapsed={elapsed:?}"
    );
}

// ===== V2 Reconstruction Tests =====

/// Tests basic V2 reconstruction response structure.
async fn test_v2_reconstruction_basic(client: Arc<dyn DirectAccessClient>) {
    let term_spec = &[(1, (0, 5))];
    let file = client.upload_random_file(term_spec, 2048).await.unwrap();

    let response = client.get_reconstruction_v2(&file.file_hash, None).await.unwrap().unwrap();

    assert!(!response.terms.is_empty());
    assert!(!response.xorbs.is_empty());
    assert_eq!(response.offset_into_first_range, 0);

    for term in &response.terms {
        let xorb_descriptor = response.xorbs.get(&term.hash).expect("xorb descriptor missing for term");
        assert!(!xorb_descriptor.is_empty());
        for fetch in xorb_descriptor {
            assert!(!fetch.url.is_empty());
            assert!(!fetch.ranges.is_empty());
            for range in &fetch.ranges {
                assert!(range.bytes.start < range.bytes.end);
                assert!(range.chunks.start < range.chunks.end);
            }
        }
    }
}

/// Tests V2 reconstruction with byte range queries.
async fn test_v2_reconstruction_ranges(client: Arc<dyn DirectAccessClient>) {
    let term_spec = &[(1, (0, 3)), (2, (0, 3)), (1, (3, 6))];
    let file = client.upload_random_file(term_spec, 2048).await.unwrap();

    let file_size = file.data.len() as u64;

    // Partial range
    let range = FileRange::new(file_size / 4, file_size * 3 / 4);
    let response = client
        .get_reconstruction_v2(&file.file_hash, Some(range))
        .await
        .unwrap()
        .unwrap();

    assert!(!response.terms.is_empty());
    assert!(!response.xorbs.is_empty());

    // Out-of-range query returns None
    let out_of_range = FileRange::new(file_size + 100, file_size + 200);
    let none_result = client.get_reconstruction_v2(&file.file_hash, Some(out_of_range)).await.unwrap();
    assert!(none_result.is_none());
}

/// Tests that V2 reconstruction terms match V1 terms and offsets.
async fn test_v2_reconstruction_matches_v1(client: Arc<dyn DirectAccessClient>) {
    let term_spec = &[(1, (0, 3)), (2, (0, 2)), (1, (3, 5))];
    let file = client.upload_random_file(term_spec, 2048).await.unwrap();

    let v1 = client.get_reconstruction_v1(&file.file_hash, None).await.unwrap().unwrap();
    let v2 = client.get_reconstruction_v2(&file.file_hash, None).await.unwrap().unwrap();

    assert_eq!(v1.offset_into_first_range, v2.offset_into_first_range);
    assert_eq!(v1.terms.len(), v2.terms.len());
    for (t1, t2) in v1.terms.iter().zip(v2.terms.iter()) {
        assert_eq!(t1.hash, t2.hash);
        assert_eq!(t1.range, t2.range);
        assert_eq!(t1.unpacked_length, t2.unpacked_length);
    }

    // Both should have the same xorb hashes
    let mut v1_xorb_hashes: Vec<_> = v1.fetch_info.keys().map(|h| h.to_string()).collect();
    let mut v2_xorb_hashes: Vec<_> = v2.xorbs.keys().map(|h| h.to_string()).collect();
    v1_xorb_hashes.sort();
    v2_xorb_hashes.sort();
    assert_eq!(v1_xorb_hashes, v2_xorb_hashes);

    // Check range with partial file
    let file_size = file.data.len() as u64;
    let range = FileRange::new(file_size / 4, file_size * 3 / 4);
    let v1r = client
        .get_reconstruction_v1(&file.file_hash, Some(range))
        .await
        .unwrap()
        .unwrap();
    let v2r = client
        .get_reconstruction_v2(&file.file_hash, Some(range))
        .await
        .unwrap()
        .unwrap();
    assert_eq!(v1r.offset_into_first_range, v2r.offset_into_first_range);
    assert_eq!(v1r.terms.len(), v2r.terms.len());
}

/// Tests that max_ranges_per_fetch correctly splits multi-range fetch entries.
async fn test_v2_max_ranges_per_fetch(client: Arc<dyn DirectAccessClient>) {
    // Use a file with many non-contiguous segments from the same xorb,
    // interleaved with another xorb to prevent merging.
    let term_spec = &[
        (1, (0, 2)),
        (2, (0, 1)),
        (1, (2, 4)),
        (2, (1, 2)),
        (1, (4, 6)),
        (2, (2, 3)),
        (1, (6, 8)),
    ];
    let file = client.upload_random_file(term_spec, 512).await.unwrap();

    // Without limit, xorb 1 should have all its ranges in a single fetch
    let response_unlimited = client.get_reconstruction_v2(&file.file_hash, None).await.unwrap().unwrap();

    // Find xorb 1's descriptor
    let xorb1_hash = &file.terms[0].xorb_hash;
    let hex_hash: crate::cas_types::HexMerkleHash = (*xorb1_hash).into();
    let desc_unlimited = response_unlimited.xorbs.get(&hex_hash).unwrap();

    // Now set max_ranges_per_fetch to 2
    client.set_max_ranges_per_fetch(2);

    let response_limited = client.get_reconstruction_v2(&file.file_hash, None).await.unwrap().unwrap();

    let desc_limited = response_limited.xorbs.get(&hex_hash).unwrap();

    // With a limit of 2, the number of fetch entries should be >= the unlimited count
    assert!(
        desc_limited.len() >= desc_unlimited.len(),
        "Limited ({}) should have at least as many fetch entries as unlimited ({})",
        desc_limited.len(),
        desc_unlimited.len()
    );

    // Each fetch entry should have at most 2 ranges
    for fetch in desc_limited {
        assert!(fetch.ranges.len() <= 2, "Expected at most 2 ranges per fetch, got {}", fetch.ranges.len());
    }

    // Total ranges across all fetches should equal the unlimited total
    let total_unlimited: usize = desc_unlimited.iter().map(|f| f.ranges.len()).sum();
    let total_limited: usize = desc_limited.iter().map(|f| f.ranges.len()).sum();
    assert_eq!(total_unlimited, total_limited, "Total ranges should be preserved");

    // Reset for other tests
    client.set_max_ranges_per_fetch(usize::MAX);
}

/// Tests that V2 URLs are valid base64 and decode correctly.
/// When going through a server, URLs are HTTP; when direct, they're base64.
async fn test_v2_url_encoding(client: Arc<dyn DirectAccessClient>) {
    use base64::Engine;
    use base64::engine::general_purpose::URL_SAFE_NO_PAD;

    let term_spec = &[(1, (0, 3))];
    let file = client.upload_random_file(term_spec, 2048).await.unwrap();

    let response = client.get_reconstruction_v2(&file.file_hash, None).await.unwrap().unwrap();

    for fetch_entries in response.xorbs.values() {
        for fetch in fetch_entries {
            assert!(!fetch.url.is_empty(), "URL should not be empty");

            if fetch.url.starts_with("http://") || fetch.url.starts_with("https://") {
                // Server-transformed URL: should point to fetch_term
                assert!(fetch.url.contains("/fetch_term"), "HTTP URL should contain /fetch_term: {}", fetch.url);
            } else {
                // Direct client URL: should be valid base64
                let decoded = URL_SAFE_NO_PAD.decode(&fetch.url);
                assert!(decoded.is_ok(), "URL should be valid base64: {}", fetch.url);

                let payload = String::from_utf8(decoded.unwrap()).unwrap();
                let parts: Vec<&str> = payload.splitn(3, ':').collect();
                assert_eq!(parts.len(), 3, "Payload should have 3 colon-separated parts");

                let hash = xet_core_structures::merklehash::MerkleHash::from_hex(parts[0]);
                assert!(hash.is_ok(), "Hash part should be valid hex");

                let ts: std::result::Result<u64, _> = parts[1].parse();
                assert!(ts.is_ok(), "Timestamp should be a valid u64");

                for range_str in parts[2].split(',').filter(|s| !s.is_empty()) {
                    let range_parts: Vec<&str> = range_str.split('-').collect();
                    assert_eq!(range_parts.len(), 2, "Each range should be start-end");
                    assert!(range_parts[0].parse::<u64>().is_ok());
                    assert!(range_parts[1].parse::<u64>().is_ok());
                }
            }
        }
    }
}
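The payload layout that `test_v2_url_encoding` checks for direct-client URLs (`hash:timestamp:start-end,start-end,...` after base64 decoding) can be parsed with a few lines of standard-library code. The sketch below is illustrative only: `parse_fetch_payload` is a hypothetical helper, it takes the already-decoded payload string so that it needs no `base64` dependency, and it merely checks that the hash is made of hex digits rather than calling `MerkleHash::from_hex`.

```rust
/// Parses an already base64-decoded payload of the form `hash:timestamp:ranges`,
/// where `ranges` is a comma-separated list of `start-end` pairs.
/// Returns `None` if any part is missing or malformed.
pub fn parse_fetch_payload(payload: &str) -> Option<(String, u64, Vec<(u64, u64)>)> {
    // Split into at most three parts so the range list stays in one piece.
    let mut parts = payload.splitn(3, ':');
    let hash = parts.next()?;
    let timestamp = parts.next()?.parse::<u64>().ok()?;
    let ranges_part = parts.next()?;

    // Assumption for this sketch: a non-empty run of hex digits is enough to accept the hash.
    if hash.is_empty() || !hash.chars().all(|c| c.is_ascii_hexdigit()) {
        return None;
    }

    let mut ranges: Vec<(u64, u64)> = Vec::new();
    for range_str in ranges_part.split(',').filter(|s| !s.is_empty()) {
        let (start, end) = range_str.split_once('-')?;
        ranges.push((start.parse::<u64>().ok()?, end.parse::<u64>().ok()?));
    }
    Some((hash.to_string(), timestamp, ranges))
}
```

As in the test, empty entries in the range list are skipped, so a payload with a trailing colon and no ranges parses to an empty list.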
@@ -14,7 +14,9 @@ use xet_core_structures::xorb_object::XorbObject;

use super::super::error::Result;
use super::super::interface::Client;
use crate::cas_types::{FileRange, XorbReconstructionFetchInfo};
use crate::cas_types::{
    FileRange, QueryReconstructionResponse, QueryReconstructionResponseV2, XorbReconstructionFetchInfo,
};

/// A Client with direct access to XORB and file storage.
///
@@ -40,6 +42,39 @@ pub trait DirectAccessClient: Client + Send + Sync {
    /// Pass `None` to disable the delay.
    fn set_api_delay_range(&self, delay_range: Option<Range<Duration>>);

    /// Sets the maximum number of byte ranges per `XorbMultiRangeFetch` entry
    /// in V2 reconstruction responses.
    ///
    /// Default is `usize::MAX` (all ranges in one fetch). When set to N,
    /// ranges for each xorb are grouped into entries of at most N ranges.
    /// This simulates the CloudFront URL length limit that forces splitting.
    fn set_max_ranges_per_fetch(&self, max_ranges: usize);

    /// Disables V2 reconstruction responses with the given HTTP status code.
    /// When disabled, the V2 endpoint returns this status, forcing clients to
    /// fall back to V1. Pass 0 to re-enable.
    fn disable_v2_reconstruction(&self, status_code: u16);

    /// Returns the HTTP status code the V2 endpoint should return when disabled,
    /// or 0 if V2 is enabled.
    fn v2_disabled_status_code(&self) -> u16 {
        0
    }

    /// V1 reconstruction: returns per-range presigned URLs.
    async fn get_reconstruction_v1(
        &self,
        file_id: &MerkleHash,
        bytes_range: Option<FileRange>,
    ) -> Result<Option<QueryReconstructionResponse>>;

    /// V2 reconstruction: returns per-xorb multi-range fetch descriptors.
    async fn get_reconstruction_v2(
        &self,
        file_id: &MerkleHash,
        bytes_range: Option<FileRange>,
    ) -> Result<Option<QueryReconstructionResponseV2>>;

    /// Applies the configured API delay if set.
    ///
    /// This method sleeps for a random duration within the configured delay range.
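The grouping rule documented on `set_max_ranges_per_fetch` (at most N ranges per fetch entry, with `usize::MAX` meaning no splitting) can be illustrated with slice chunking. This is a sketch under stated assumptions, not the code from this change: `group_ranges` is a hypothetical helper, plain `(start, end)` tuples stand in for `XorbRangeDescriptor`, and mapping a limit of 0 to 1 is a choice made here only to keep the example panic-free.

```rust
/// Splits one xorb's ranges into groups of at most `max_ranges` entries,
/// preserving order. Each group would become one multi-range fetch entry.
pub fn group_ranges(ranges: &[(u64, u64)], max_ranges: usize) -> Vec<Vec<(u64, u64)>> {
    // `slice::chunks` panics on a chunk size of zero, so clamp the limit to at least 1.
    ranges
        .chunks(max_ranges.max(1))
        .map(|group| group.to_vec())
        .collect()
}
```

With a limit of 2, four ranges yield two groups of two; with `usize::MAX` they stay in a single group. In both cases the total number of ranges is unchanged, which is the invariant `test_v2_max_ranges_per_fetch` asserts.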
@@ -5,14 +5,12 @@ use std::mem::size_of;
use std::ops::Range;
use std::path::{Path, PathBuf};
use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::atomic::{AtomicU16, AtomicU64, AtomicUsize, Ordering};

use anyhow::anyhow;
use async_trait::async_trait;
use bytes::Bytes;
use heed::types::*;
use lazy_static::lazy_static;
use more_asserts::*;
use rand::Rng;
use tempfile::TempDir;
use tokio::time::{Duration, Instant};
@@ -30,25 +28,16 @@ use xet_core_structures::xorb_object::{SerializedXorbObject, XorbObject};
use xet_runtime::file_utils::SafeFileCreator;

use super::direct_access_client::DirectAccessClient;
use super::xorb_utils::{self, REFERENCE_INSTANT};
use crate::cas_client::Client;
use crate::cas_client::adaptive_concurrency::AdaptiveConcurrencyController;
use crate::cas_client::error::{CasClientError, Result};
use crate::cas_client::progress_tracked_streams::ProgressCallback;
use crate::cas_types::{
    BatchQueryReconstructionResponse, ChunkRange, FileRange, HexMerkleHash, HttpRange, QueryReconstructionResponse,
    XorbReconstructionFetchInfo, XorbReconstructionTerm,
    BatchQueryReconstructionResponse, FileRange, HexMerkleHash, HttpRange, QueryReconstructionResponse,
    QueryReconstructionResponseV2, XorbMultiRangeFetch, XorbRangeDescriptor, XorbReconstructionFetchInfo,
};

lazy_static! {
    /// Reference instant for URL timestamps. Initialized far in the past to allow
    /// testing timestamps that are earlier in the current process lifetime.
    static ref REFERENCE_INSTANT: Instant = {
        let now = Instant::now();
        now.checked_sub(Duration::from_secs(365 * 24 * 60 * 60))
            .unwrap_or(now)
    };
}

pub struct LocalClient {
    // Note: Field order matters for Drop! heed::Env must be dropped before _tmp_dir
    // because heed holds file handles that need to be closed before the directory is deleted.
@@ -62,6 +51,10 @@ pub struct LocalClient {
|
||||
url_expiration_ms: AtomicU64,
|
||||
/// API delay range in milliseconds as (min_ms, max_ms). (0, 0) means disabled.
|
||||
random_ms_delay_window: (AtomicU64, AtomicU64),
|
||||
/// Max ranges per XorbMultiRangeFetch entry. usize::MAX means no splitting.
|
||||
max_ranges_per_fetch: AtomicUsize,
|
||||
/// HTTP status code to return when V2 is disabled (0 = enabled).
|
||||
v2_disabled_status: AtomicU16,
|
||||
_tmp_dir: Option<TempDir>, // Must be last - dropped after heed env is closed
|
||||
}
|
||||
|
||||
@@ -157,6 +150,8 @@ impl LocalClient {
|
||||
upload_concurrency_controller: AdaptiveConcurrencyController::new_upload("local_uploads"),
|
||||
url_expiration_ms: AtomicU64::new(u64::MAX),
|
||||
random_ms_delay_window: (AtomicU64::new(0), AtomicU64::new(0)),
|
||||
max_ranges_per_fetch: AtomicUsize::new(usize::MAX),
|
||||
v2_disabled_status: AtomicU16::new(0),
|
||||
_tmp_dir: tmp_dir, // Must be last - dropped after heed env is closed
|
||||
})
|
||||
}
|
||||
@@ -347,6 +342,34 @@ impl DirectAccessClient for LocalClient {
         self.url_expiration_ms.store(expiration.as_millis() as u64, Ordering::Relaxed);
     }

+    fn set_max_ranges_per_fetch(&self, max_ranges: usize) {
+        self.max_ranges_per_fetch.store(max_ranges, Ordering::Relaxed);
+    }
+
+    fn disable_v2_reconstruction(&self, status_code: u16) {
+        self.v2_disabled_status.store(status_code, Ordering::Relaxed);
+    }
+
+    fn v2_disabled_status_code(&self) -> u16 {
+        self.v2_disabled_status.load(Ordering::Relaxed)
+    }
+
+    async fn get_reconstruction_v1(
+        &self,
+        file_id: &MerkleHash,
+        bytes_range: Option<FileRange>,
+    ) -> Result<Option<QueryReconstructionResponse>> {
+        LocalClient::get_reconstruction_v1(self, file_id, bytes_range).await
+    }
+
+    async fn get_reconstruction_v2(
+        &self,
+        file_id: &MerkleHash,
+        bytes_range: Option<FileRange>,
+    ) -> Result<Option<QueryReconstructionResponseV2>> {
+        LocalClient::get_reconstruction_v2(self, file_id, bytes_range).await
+    }
+
     fn set_api_delay_range(&self, delay_range: Option<Range<Duration>>) {
         match delay_range {
             Some(range) => {
|
||||
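The `v2_disabled_status` field added in this change uses `0` as a sentinel for "V2 enabled" and any other value as the status code to return. The following is a minimal sketch of that pattern using only the standard library. `V2Toggle` and `rejection_status` are hypothetical names, and the 404 fallback for codes outside 100..=999 is an assumption that mirrors the handler's `StatusCode::from_u16(..).unwrap_or(StatusCode::NOT_FOUND)`.

```rust
use std::sync::atomic::{AtomicU16, Ordering};

/// Hypothetical stand-in for the V2 toggle stored on `LocalClient`.
struct V2Toggle {
    /// 0 means V2 is enabled; any other value is the status code to return.
    disabled_status: AtomicU16,
}

impl V2Toggle {
    fn new() -> Self {
        Self { disabled_status: AtomicU16::new(0) }
    }

    /// Disables V2 by recording the status code the endpoint should answer with.
    fn disable(&self, status_code: u16) {
        self.disabled_status.store(status_code, Ordering::Relaxed);
    }

    /// Returns `None` when V2 is enabled, otherwise the status code to send.
    /// Codes outside 100..=999 fall back to 404 (assumed valid range).
    fn rejection_status(&self) -> Option<u16> {
        match self.disabled_status.load(Ordering::Relaxed) {
            0 => None,
            code if (100..=999).contains(&code) => Some(code),
            _ => Some(404),
        }
    }
}
```

A single atomic avoids a separate `bool` plus status pair, at the cost of reserving `0` as a value that can never be returned as a status.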
@@ -626,7 +649,126 @@ impl DirectAccessClient for LocalClient {
|
||||
}
|
||||
}
|
||||
|
||||
/// LocalClient is responsible for writing/reading Xorbs on the local disk.
|
||||
impl LocalClient {
|
||||
async fn compute_reconstruction_ranges(
|
||||
&self,
|
||||
file_id: &MerkleHash,
|
||||
bytes_range: Option<FileRange>,
|
||||
) -> Result<xorb_utils::ReconstructionRangesResult> {
|
||||
let Some((file_info, _)) = self.shard_manager.get_file_reconstruction_info(file_id).await? else {
|
||||
return Ok(None);
|
||||
};
|
||||
|
||||
xorb_utils::compute_reconstruction_ranges(&file_info, bytes_range, &mut |hash| self.xorb_footer_sync(hash))
|
||||
}
|
||||
|
||||
fn xorb_footer_sync(&self, hash: &MerkleHash) -> Result<XorbObject> {
|
||||
let file_path = self.get_path_for_entry(hash);
|
||||
let mut file = File::open(&file_path).map_err(|_| {
|
||||
error!("Unable to find file in local CAS {:?}", file_path);
|
||||
CasClientError::XORBNotFound(*hash)
|
||||
})?;
|
||||
XorbObject::deserialize(&mut file).map_err(Into::into)
|
||||
}
|
||||
|
||||
/// V1 reconstruction: returns per-range presigned URLs.
|
||||
pub async fn get_reconstruction_v1(
|
||||
&self,
|
||||
file_id: &MerkleHash,
|
||||
bytes_range: Option<FileRange>,
|
||||
) -> Result<Option<QueryReconstructionResponse>> {
|
||||
self.apply_api_delay().await;
|
||||
|
||||
let result = self.compute_reconstruction_ranges(file_id, bytes_range).await?;
|
||||
let Some((offset_into_first_range, terms, merged_ranges)) = result else {
|
||||
return Ok(None);
|
||||
};
|
||||
|
||||
if terms.is_empty() {
|
||||
return Ok(Some(QueryReconstructionResponse {
|
||||
offset_into_first_range,
|
||||
terms,
|
||||
fetch_info: HashMap::new(),
|
||||
}));
|
||||
}
|
||||
|
||||
let timestamp = Instant::now();
|
||||
let mut fetch_info: HashMap<HexMerkleHash, Vec<XorbReconstructionFetchInfo>> = HashMap::new();
|
||||
for (hash, ranges) in merged_ranges {
|
||||
let file_path = self.get_path_for_entry(&hash);
|
||||
let entries = ranges
|
||||
.into_iter()
|
||||
.map(|r| XorbReconstructionFetchInfo {
|
||||
range: r.chunk_range,
|
||||
url: generate_fetch_url(&file_path, &r.byte_range, timestamp),
|
||||
url_range: HttpRange::from(r.byte_range),
|
||||
})
|
||||
.collect();
|
||||
fetch_info.insert(hash.into(), entries);
|
||||
}
|
||||
|
||||
Ok(Some(QueryReconstructionResponse {
|
||||
offset_into_first_range,
|
||||
terms,
|
||||
fetch_info,
|
||||
}))
|
||||
}
|
||||
|
||||
/// V2 reconstruction: returns per-xorb multi-range fetch descriptors.
|
||||
pub async fn get_reconstruction_v2(
|
||||
&self,
|
||||
file_id: &MerkleHash,
|
||||
bytes_range: Option<FileRange>,
|
||||
) -> Result<Option<QueryReconstructionResponseV2>> {
|
||||
self.apply_api_delay().await;
|
||||
|
||||
let result = self.compute_reconstruction_ranges(file_id, bytes_range).await?;
|
||||
let Some((offset_into_first_range, terms, merged_ranges)) = result else {
|
||||
return Ok(None);
|
||||
};
|
||||
|
||||
if terms.is_empty() {
|
||||
return Ok(Some(QueryReconstructionResponseV2 {
|
||||
offset_into_first_range,
|
||||
terms,
|
||||
xorbs: HashMap::new(),
|
||||
}));
|
||||
}
|
||||
|
||||
let timestamp = Instant::now();
|
||||
let max_ranges = self.max_ranges_per_fetch.load(Ordering::Relaxed);
|
||||
|
||||
let mut xorbs: HashMap<HexMerkleHash, Vec<XorbMultiRangeFetch>> = HashMap::new();
|
||||
for (hash, ranges) in merged_ranges {
|
||||
let mut fetch_entries = Vec::new();
|
||||
|
||||
for chunk in ranges.chunks(max_ranges) {
|
||||
let range_descriptors: Vec<XorbRangeDescriptor> = chunk
|
||||
.iter()
|
||||
.map(|r| XorbRangeDescriptor {
|
||||
chunks: r.chunk_range,
|
||||
bytes: HttpRange::from(r.byte_range),
|
||||
})
|
||||
.collect();
|
||||
|
||||
let url = generate_v2_fetch_url(&hash, &range_descriptors, timestamp);
|
||||
fetch_entries.push(XorbMultiRangeFetch {
|
||||
url,
|
||||
ranges: range_descriptors,
|
||||
});
|
||||
}
|
||||
|
||||
xorbs.insert(hash.into(), fetch_entries);
|
||||
}
|
||||
|
||||
Ok(Some(QueryReconstructionResponseV2 {
|
||||
offset_into_first_range,
|
||||
terms,
|
||||
xorbs,
|
||||
}))
|
||||
}
|
||||
}
|
||||
|
||||
#[async_trait]
|
||||
impl Client for LocalClient {
|
||||
async fn get_file_reconstruction_info(
|
||||
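The V2 path above splits each xorb's merged ranges with `ranges.chunks(max_ranges)`, so one xorb may yield several `XorbMultiRangeFetch` entries. A minimal sketch of that grouping follows; `ByteRange` and `group_ranges` are hypothetical simplifications, not types from the repository. Note that `slice::chunks` panics when the chunk size is 0; the clamp to 1 below is an addition of this sketch and does not appear in the diff.

```rust
/// Hypothetical simplified range: byte offsets with an inclusive end.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ByteRange {
    start: u64,
    end: u64,
}

/// Splits one xorb's ranges into groups of at most `max_ranges` entries.
/// Each group would back one multi-range fetch URL.
fn group_ranges(ranges: &[ByteRange], max_ranges: usize) -> Vec<Vec<ByteRange>> {
    // `chunks(0)` panics, so clamp to at least one range per group.
    ranges
        .chunks(max_ranges.max(1))
        .map(|group| group.to_vec())
        .collect()
}
```

With `max_ranges = usize::MAX` (the default in the diff) every xorb produces a single group, which is why that value means "no splitting".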
@@ -784,196 +926,8 @@ impl Client for LocalClient {
|
||||
&self,
|
||||
file_id: &MerkleHash,
|
||||
bytes_range: Option<FileRange>,
|
||||
) -> Result<Option<QueryReconstructionResponse>> {
|
||||
self.apply_api_delay().await;
|
||||
let Some((file_info, _)) = self.shard_manager.get_file_reconstruction_info(file_id).await? else {
|
||||
return Ok(None);
|
||||
};
|
||||
|
||||
// Calculate total file size from segments
|
||||
let total_file_size: u64 = file_info.file_size();
|
||||
// Handle range validation and truncation
|
||||
let file_range = if let Some(range) = bytes_range {
|
||||
// If the entire range is out of bounds, return None (like RemoteClient does for 416)
|
||||
if range.start >= total_file_size {
|
||||
// For empty files (size 0), only the first query (start == 0) should return the empty reconstruction
|
||||
// All subsequent queries should return None to prevent infinite remainder loops
|
||||
if total_file_size == 0 && range.start == 0 {
|
||||
// Empty file - return valid but empty reconstruction
|
||||
return Ok(Some(QueryReconstructionResponse {
|
||||
offset_into_first_range: 0,
|
||||
terms: vec![],
|
||||
fetch_info: HashMap::new(),
|
||||
}));
|
||||
}
|
||||
return Ok(None);
|
||||
}
|
||||
// Truncate end if it extends beyond file size
|
||||
FileRange::new(range.start, range.end.min(total_file_size))
|
||||
} else {
|
||||
// No range specified - handle empty files
|
||||
if total_file_size == 0 {
|
||||
return Ok(Some(QueryReconstructionResponse {
|
||||
offset_into_first_range: 0,
|
||||
terms: vec![],
|
||||
fetch_info: HashMap::new(),
|
||||
}));
|
||||
}
|
||||
FileRange::full()
|
||||
};
|
||||
|
||||
// First skip file segments until we find the first one that starts before the file range start
|
||||
let mut s_idx = 0;
|
||||
let mut cumulative_bytes = 0u64;
|
||||
let mut first_chunk_byte_start;
|
||||
|
||||
loop {
|
||||
if s_idx >= file_info.segments.len() {
|
||||
// We have here that the requested file range is out of bounds,
|
||||
// so return a range error.
|
||||
return Err(CasClientError::InvalidRange);
|
||||
}
|
||||
|
||||
let n = file_info.segments[s_idx].unpacked_segment_bytes as u64;
|
||||
if cumulative_bytes + n > file_range.start {
|
||||
assert_ge!(file_range.start, cumulative_bytes);
|
||||
first_chunk_byte_start = cumulative_bytes;
|
||||
break;
|
||||
} else {
|
||||
cumulative_bytes += n;
|
||||
s_idx += 1;
|
||||
}
|
||||
}
|
||||
|
||||
// Now, prepare the response by iterating over the segments and
|
||||
// adding the terms and fetch info to the response.
|
||||
let mut terms = Vec::new();
|
||||
|
||||
#[derive(Clone)]
|
||||
struct FetchInfoIntermediate {
|
||||
chunk_range: ChunkRange,
|
||||
byte_range: FileRange,
|
||||
}
|
||||
|
||||
let mut fetch_info_map: HashMap<MerkleHash, Vec<FetchInfoIntermediate>> = HashMap::new();
|
||||
|
||||
while s_idx < file_info.segments.len() && cumulative_bytes < file_range.end {
|
||||
let mut segment = file_info.segments[s_idx].clone();
|
||||
let mut chunk_range = ChunkRange::new(segment.chunk_index_start, segment.chunk_index_end);
|
||||
|
||||
// Now get the URL for this segment, which involves reading the actual byte range there.
|
||||
let xorb_footer = self.xorb_footer(&segment.xorb_hash).await?;
|
||||
|
||||
// Do we need to prune the first segment on chunk boundaries to align with the range given?
|
||||
if cumulative_bytes < file_range.start {
|
||||
while chunk_range.start < chunk_range.end {
|
||||
let next_chunk_size = xorb_footer.uncompressed_chunk_length(chunk_range.start)? as u64;
|
||||
|
||||
if cumulative_bytes + next_chunk_size <= file_range.start {
|
||||
cumulative_bytes += next_chunk_size;
|
||||
first_chunk_byte_start += next_chunk_size;
|
||||
segment.unpacked_segment_bytes -= next_chunk_size as u32;
|
||||
|
||||
chunk_range.start += 1;
|
||||
|
||||
// Should find it somewhere in here.
|
||||
debug_assert_lt!(chunk_range.start, chunk_range.end);
|
||||
} else {
|
||||
break;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Do we need to prune the last segment on chunk boundaries to align with the range given?
|
||||
if cumulative_bytes + segment.unpacked_segment_bytes as u64 > file_range.end {
|
||||
while chunk_range.end > chunk_range.start {
|
||||
let last_chunk_size = xorb_footer.uncompressed_chunk_length(chunk_range.end - 1)?;
|
||||
|
||||
if cumulative_bytes + (segment.unpacked_segment_bytes - last_chunk_size) as u64 >= file_range.end {
|
||||
// We can cut the last chunk off and still contain the requested range.
|
||||
chunk_range.end -= 1;
|
||||
segment.unpacked_segment_bytes -= last_chunk_size;
|
||||
debug_assert_lt!(chunk_range.start, chunk_range.end);
|
||||
debug_assert_gt!(segment.unpacked_segment_bytes, 0);
|
||||
} else {
|
||||
break;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
let (byte_start, byte_end) = xorb_footer.get_byte_offset(chunk_range.start, chunk_range.end)?;
|
||||
let byte_range = FileRange::new(byte_start as u64, byte_end as u64);
|
||||
|
||||
let xorb_reconstruction_term = XorbReconstructionTerm {
|
||||
hash: segment.xorb_hash.into(),
|
||||
unpacked_length: segment.unpacked_segment_bytes,
|
||||
range: chunk_range,
|
||||
};
|
||||
|
||||
terms.push(xorb_reconstruction_term);
|
||||
|
||||
let fetch_info_intemediate = FetchInfoIntermediate {
|
||||
chunk_range,
|
||||
byte_range,
|
||||
};
|
||||
|
||||
fetch_info_map
|
||||
.entry(segment.xorb_hash)
|
||||
.or_default()
|
||||
.push(fetch_info_intemediate);
|
||||
|
||||
cumulative_bytes += segment.unpacked_segment_bytes as u64;
|
||||
s_idx += 1;
|
||||
}
|
||||
|
||||
assert!(!terms.is_empty());
|
||||
|
||||
let timestamp = Instant::now();
|
||||
|
||||
// Sort and merge adjacent/overlapping ranges in each fetch_info Vec
|
||||
let mut merged_fetch_info_map: HashMap<HexMerkleHash, Vec<XorbReconstructionFetchInfo>> = HashMap::new();
|
||||
for (hash, mut fi_vec) in fetch_info_map {
|
||||
// Sort by chunk_range.start
|
||||
fi_vec.sort_by_key(|fi| fi.chunk_range.start);
|
||||
let file_path = self.get_path_for_entry(&hash);
|
||||
|
||||
// Merge adjacent or overlapping ranges
|
||||
let mut merged: Vec<XorbReconstructionFetchInfo> = Vec::new();
|
||||
let mut idx = 0;
|
||||
|
||||
while idx < fi_vec.len() {
|
||||
// Go through and merge adjacent or overlapping ranges,
|
||||
// then form the full XorbReconstructionFetchInfo structs.
|
||||
let mut new_fi = fi_vec[idx].clone();
|
||||
|
||||
while idx + 1 < fi_vec.len() {
|
||||
let next_fi = &fi_vec[idx + 1];
|
||||
if next_fi.chunk_range.start <= new_fi.chunk_range.end {
|
||||
new_fi.chunk_range.end = next_fi.chunk_range.end.max(new_fi.chunk_range.end);
|
||||
new_fi.byte_range.end = next_fi.byte_range.end.max(new_fi.byte_range.end);
|
||||
idx += 1;
|
||||
} else {
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
merged.push(XorbReconstructionFetchInfo {
|
||||
range: new_fi.chunk_range,
|
||||
url: generate_fetch_url(&file_path, &new_fi.byte_range, timestamp),
|
||||
url_range: HttpRange::from(new_fi.byte_range),
|
||||
});
|
||||
|
||||
idx += 1;
|
||||
}
|
||||
|
||||
merged_fetch_info_map.insert(hash.into(), merged);
|
||||
}
|
||||
|
||||
Ok(Some(QueryReconstructionResponse {
|
||||
offset_into_first_range: file_range.start - first_chunk_byte_start,
|
||||
terms,
|
||||
fetch_info: merged_fetch_info_map,
|
||||
}))
|
||||
) -> Result<Option<QueryReconstructionResponseV2>> {
|
||||
self.get_reconstruction_v2(file_id, bytes_range).await
|
||||
}
|
||||
|
||||
async fn batch_get_reconstruction(&self, file_ids: &[MerkleHash]) -> Result<BatchQueryReconstructionResponse> {
|
||||
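The removed V1 body sorted each xorb's fetch entries by chunk start and merged adjacent or overlapping ranges before generating URLs. That step can be sketched on its own as follows; `Span` and `merge_spans` are hypothetical names, and the sketch assumes exclusive-end chunk and byte ranges as in the removed code.

```rust
/// Hypothetical stand-in for the intermediate fetch info (exclusive ends).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span {
    chunk_start: u32,
    chunk_end: u32,
    byte_start: u64,
    byte_end: u64,
}

/// Sorts spans by chunk start and merges any that touch or overlap.
fn merge_spans(mut spans: Vec<Span>) -> Vec<Span> {
    spans.sort_by_key(|s| s.chunk_start);
    let mut merged: Vec<Span> = Vec::new();
    for s in spans {
        if let Some(last) = merged.last_mut() {
            // Adjacent (start == previous end) or overlapping: extend in place.
            if s.chunk_start <= last.chunk_end {
                last.chunk_end = last.chunk_end.max(s.chunk_end);
                last.byte_end = last.byte_end.max(s.byte_end);
                continue;
            }
        }
        merged.push(s);
    }
    merged
}
```

Merging before URL generation keeps the number of fetch entries per xorb small, so only genuinely disjoint ranges become separate requests.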
@@ -982,7 +936,7 @@ impl Client for LocalClient {
         let mut fetch_info_map: HashMap<HexMerkleHash, Vec<XorbReconstructionFetchInfo>> = HashMap::new();

         for file_id in file_ids {
-            if let Some(response) = self.get_reconstruction(file_id, None).await? {
+            if let Some(response) = self.get_reconstruction_v1(file_id, None).await? {
                 let hex_hash: HexMerkleHash = (*file_id).into();
                 files.insert(hex_hash, response.terms);
|
||||
@@ -1013,8 +967,14 @@ impl Client for LocalClient {
         // Retry loop: try to fetch, and if URL expired, refresh and retry once.
         for attempt in 0..2 {
             self.apply_api_delay().await;
-            let (url, range) = url_info.retrieve_url().await?;
-            let (file_path, _url_byte_range, url_timestamp) = parse_fetch_url(&url)?;
+            let (url, http_ranges) = url_info.retrieve_url().await?;
+
+            let (file_path, url_timestamp) = if let Ok((path, _, ts)) = parse_fetch_url(&url) {
+                (path, ts)
+            } else {
+                let (hash, ts, _) = xorb_utils::parse_v2_fetch_url(&url)?;
+                (self.get_path_for_entry(&hash), ts)
+            };

             // Check if URL has expired
             let expiration_ms = self.url_expiration_ms.load(Ordering::Relaxed);
@@ -1028,34 +988,46 @@ impl Client for LocalClient {
|
||||
return Err(CasClientError::PresignedUrlExpirationError);
|
||||
}
|
||||
|
||||
// Read the byte range from the file and deserialize
|
||||
// Read each byte range from the serialized file and deserialize the chunks.
|
||||
let mut file = File::open(&file_path).map_err(|_| CasClientError::XORBNotFound(MerkleHash::default()))?;
|
||||
let start = range.start;
|
||||
let end = range.end + 1; // HttpRange is inclusive end
|
||||
file.seek(SeekFrom::Start(start))?;
|
||||
let len = (end - start) as usize;
|
||||
|
||||
let mut all_decompressed = Vec::new();
|
||||
let mut all_chunk_indices = Vec::<u32>::new();
|
||||
let mut total_transfer = 0u64;
|
||||
|
||||
for http_range in &http_ranges {
|
||||
let len = http_range.length() as usize;
|
||||
total_transfer += http_range.length();
|
||||
|
||||
file.seek(SeekFrom::Start(http_range.start))?;
|
||||
let mut data = vec![0u8; len];
|
||||
std::io::Read::read_exact(&mut file, &mut data)?;
|
||||
|
||||
// Deserialize the chunks from the raw XORB data
|
||||
let (decompressed_data, chunk_byte_indices) =
|
||||
let (decompressed, chunk_indices) =
|
||||
xet_core_structures::xorb_object::deserialize_chunks(&mut Cursor::new(&data))?;
|
||||
|
||||
if let Some(expected) = uncompressed_size_if_known {
|
||||
debug_assert_eq!(
|
||||
decompressed_data.len(),
|
||||
expected,
|
||||
"get_file_term_data: expected {} bytes, got {}",
|
||||
expected,
|
||||
decompressed_data.len()
|
||||
xet_core_structures::xorb_object::append_chunk_segment(
|
||||
&mut all_decompressed,
|
||||
&mut all_chunk_indices,
|
||||
&decompressed,
|
||||
&chunk_indices,
|
||||
);
|
||||
}
|
||||
|
||||
let transfer_len = len as u64;
|
||||
if let Some(ref cb) = progress_callback {
|
||||
cb(transfer_len, transfer_len, transfer_len);
|
||||
if let Some(expected) = uncompressed_size_if_known {
|
||||
debug_assert_eq!(
|
||||
all_decompressed.len(),
|
||||
expected,
|
||||
"get_file_term_data: expected {} bytes, got {}",
|
||||
expected,
|
||||
all_decompressed.len()
|
||||
);
|
||||
}
|
||||
return Ok((Bytes::from(decompressed_data), chunk_byte_indices));
|
||||
|
||||
if let Some(ref cb) = progress_callback {
|
||||
cb(total_transfer, total_transfer, total_transfer);
|
||||
}
|
||||
return Ok((Bytes::from(all_decompressed), all_chunk_indices));
|
||||
}
|
||||
|
||||
// Should not reach here, but return error if we do.
|
||||
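The read loop above relies on `HttpRange` having an inclusive end (the removed code notes "HttpRange is inclusive end"), while `FileRange` uses an exclusive end. A minimal sketch of the two conventions and the transfer-size accounting follows. `InclusiveRange` and `ExclusiveRange` are hypothetical stand-ins for `HttpRange` and `FileRange`, and `length()` is assumed to be `end - start + 1`.

```rust
/// Hypothetical stand-in for `HttpRange`: `end` is the last byte included.
#[derive(Debug, Clone, Copy, PartialEq)]
struct InclusiveRange {
    start: u64,
    end: u64,
}

/// Hypothetical stand-in for `FileRange`: `end` is one past the last byte.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ExclusiveRange {
    start: u64,
    end: u64,
}

impl InclusiveRange {
    /// Number of bytes covered, counting both endpoints.
    fn length(&self) -> u64 {
        self.end - self.start + 1
    }

    /// Converts to the exclusive-end convention.
    fn to_exclusive(self) -> ExclusiveRange {
        ExclusiveRange { start: self.start, end: self.end + 1 }
    }
}

/// Total bytes transferred when every range is read once.
fn total_transfer(ranges: &[InclusiveRange]) -> u64 {
    ranges.iter().map(|r| r.length()).sum()
}
```

This matches the test data in the diff, where `HttpRange::new(0, 1023)` round-trips to `FileRange::new(0, 1024)`.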
@@ -1093,6 +1065,10 @@ fn parse_fetch_url(url: &str) -> Result<(PathBuf, FileRange, Instant)> {
|
||||
|
||||
Ok((file_path, byte_range, timestamp))
|
||||
}
|
||||
|
||||
fn generate_v2_fetch_url(hash: &MerkleHash, ranges: &[XorbRangeDescriptor], timestamp: Instant) -> String {
|
||||
xorb_utils::generate_v2_fetch_url(hash, ranges, timestamp)
|
||||
}
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use xet_core_structures::xorb_object::xorb_format_test_utils::{
|
||||
@@ -1102,7 +1078,7 @@ mod tests {
     use super::*;
     use crate::cas_client::simulation::DeletionControlableClient;
     use crate::cas_client::simulation::client_testing_utils::ClientTestingUtils;
-    use crate::cas_types::XorbReconstructionFetchInfo;
+    use crate::cas_types::{ChunkRange, XorbReconstructionFetchInfo};

     /// Runs the common TestingClient trait test suite for LocalClient.
     #[tokio::test]
|
||||
@@ -32,8 +32,8 @@ use super::super::super::error::CasClientError;
 use super::super::super::{DeletionControlableClient, DirectAccessClient};
 use super::latency_simulation::{LatencySimulation, ServerLatencyProfile};
 use crate::cas_types::{
-    FileRange, HexKey, HexMerkleHash, UploadShardResponse, UploadShardResponseType, UploadXorbResponse,
-    XorbReconstructionFetchInfo,
+    FileRange, HexKey, HexMerkleHash, QueryReconstructionResponseV2, UploadShardResponse, UploadShardResponseType,
+    UploadXorbResponse, XorbRangeDescriptor, XorbReconstructionFetchInfo,
 };

 /// Server state passed to all handlers.
@@ -128,27 +128,55 @@ pub(super) fn error_to_response(e: CasClientError) -> Response {
     (status, e.to_string()).into_response()
 }

-/// Encodes term data (file path) into a URL-safe base64 string.
-///
-/// The term encodes the local file path that the LocalClient uses.
-/// This allows the fetch_term endpoint to retrieve the data.
-/// Encodes a fetch term for HTTP transport.
-///
-/// The encoded term contains:
-/// - xorb_hash: The XORB hash (hex encoded)
-///
-/// The byte range to fetch comes from the HTTP Range header, not encoded in the term.
+/// Encodes a V1 fetch term for HTTP transport.
+/// Contains only the xorb hash; the byte range comes from the HTTP Range header.
 fn encode_term(xorb_hash: &MerkleHash) -> String {
     URL_SAFE_NO_PAD.encode(xorb_hash.hex().as_bytes())
 }

-/// Decodes a fetch term back into its components.
-///
-/// Returns the xorb_hash.
-fn decode_term(term: &str) -> Result<MerkleHash, String> {
+/// Encodes a V2 fetch term with embedded byte ranges.
+/// Format: "{hash_hex}:{start1}-{end1},{start2}-{end2},..."
+/// Byte ranges use exclusive end (FileRange convention).
+fn encode_term_with_ranges(xorb_hash: &MerkleHash, ranges: &[XorbRangeDescriptor]) -> String {
+    let ranges_str: Vec<String> = ranges
+        .iter()
+        .map(|r| {
+            let file_range = FileRange::from(r.bytes);
+            format!("{}-{}", file_range.start, file_range.end)
+        })
+        .collect();
+    let payload = format!("{}:{}", xorb_hash.hex(), ranges_str.join(","));
+    URL_SAFE_NO_PAD.encode(payload.as_bytes())
+}
+
+/// Decoded fetch term: hash and optional byte ranges (exclusive end).
+struct DecodedTerm {
+    hash: MerkleHash,
+    byte_ranges: Vec<FileRange>,
+}
+
+/// Decodes a fetch term. Supports both V1 (hash only) and V2 (hash + ranges).
+fn decode_term(term: &str) -> Result<DecodedTerm, String> {
     let bytes = URL_SAFE_NO_PAD.decode(term).map_err(|e| format!("Invalid base64: {e}"))?;
-    let hash_hex = String::from_utf8(bytes).map_err(|e| format!("Invalid UTF-8: {e}"))?;
-    MerkleHash::from_hex(&hash_hex).map_err(|e| format!("Invalid hash: {e}"))
+    let payload = String::from_utf8(bytes).map_err(|e| format!("Invalid UTF-8: {e}"))?;
+
+    if let Some((hash_hex, ranges_str)) = payload.split_once(':') {
+        let hash = MerkleHash::from_hex(hash_hex).map_err(|e| format!("Invalid hash: {e}"))?;
+        let mut byte_ranges = Vec::new();
+        for r in ranges_str.split(',').filter(|s| !s.is_empty()) {
+            let (start_s, end_s) = r.split_once('-').ok_or("Invalid range syntax")?;
+            let start: u64 = start_s.parse().map_err(|e| format!("Invalid range start: {e}"))?;
+            let end: u64 = end_s.parse().map_err(|e| format!("Invalid range end: {e}"))?;
+            byte_ranges.push(FileRange::new(start, end));
+        }
+        Ok(DecodedTerm { hash, byte_ranges })
+    } else {
+        let hash = MerkleHash::from_hex(&payload).map_err(|e| format!("Invalid hash: {e}"))?;
+        Ok(DecodedTerm {
+            hash,
+            byte_ranges: vec![],
+        })
+    }
 }

 /// Extracts the base URL from request headers (Host header).
|
||||
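The V2 term payload documented above has the form `{hash_hex}:{start1}-{end1},{start2}-{end2},...`, with exclusive range ends, while a V1 payload is the bare hash. The round trip can be sketched with the standard library alone. This sketch deliberately omits the URL-safe base64 layer and hash validation that the real `encode_term_with_ranges` and `decode_term` apply; `encode_payload`, `decode_payload`, and the `ByteRange` alias are hypothetical names.

```rust
/// Exclusive-end byte range; a hypothetical stand-in for `FileRange`.
type ByteRange = (u64, u64);

/// Builds the plain-text payload: "{hash_hex}:{s1}-{e1},{s2}-{e2},...".
fn encode_payload(hash_hex: &str, ranges: &[ByteRange]) -> String {
    let parts: Vec<String> = ranges
        .iter()
        .map(|(start, end)| format!("{start}-{end}"))
        .collect();
    format!("{hash_hex}:{}", parts.join(","))
}

/// Parses a payload; a payload without ':' is treated as V1 (hash only).
fn decode_payload(payload: &str) -> Result<(String, Vec<ByteRange>), String> {
    let Some((hash_hex, ranges_str)) = payload.split_once(':') else {
        return Ok((payload.to_string(), Vec::new()));
    };
    let mut ranges = Vec::new();
    for part in ranges_str.split(',').filter(|s| !s.is_empty()) {
        let (start_s, end_s) = part.split_once('-').ok_or("Invalid range syntax")?;
        let start = start_s
            .parse::<u64>()
            .map_err(|e| format!("Invalid range start: {e}"))?;
        let end = end_s
            .parse::<u64>()
            .map_err(|e| format!("Invalid range end: {e}"))?;
        ranges.push((start, end));
    }
    Ok((hash_hex.to_string(), ranges))
}
```

Because a hex hash never contains `:`, the presence of the separator is enough to tell the two term versions apart, which lets a single `fetch_term` endpoint serve both.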
@@ -220,7 +248,7 @@ pub async fn get_reconstruction(
         Err((status, msg)) => return (status, msg).into_response(),
     };

-    match state.client.get_reconstruction(&file_id, range).await {
+    match state.client.get_reconstruction_v1(&file_id, range).await {
         Ok(Some(mut response)) => {
             transform_fetch_info_urls(&mut response.fetch_info, &base_url);
             Json(response).into_response()
@@ -230,6 +258,78 @@ pub async fn get_reconstruction(
|
||||
}
|
||||
}
|
||||
|
||||
/// GET /v2/reconstructions/{file_id}
|
||||
///
|
||||
/// Returns V2 reconstruction information for a file, including:
|
||||
/// - List of terms (chunks) needed to reconstruct the file
|
||||
/// - Per-xorb fetch descriptors with multi-range URLs
|
||||
///
|
||||
/// Supports Range header for partial file reconstruction.
|
||||
/// URLs in the response point to the /v1/fetch_term endpoint.
|
||||
pub async fn get_reconstruction_v2(
|
||||
State(state): State<ServerState>,
|
||||
Path(HexMerkleHash(file_id)): Path<HexMerkleHash>,
|
||||
headers: HeaderMap,
|
||||
) -> Response {
|
||||
let connection_guard = state.latency_simulation.register_connection().await;
|
||||
if let Some(simulated_error) = connection_guard.simulate_error() {
|
||||
return simulated_error;
|
||||
}
|
||||
|
||||
// Allow testing V1 fallback by simulating V2 endpoint unavailability.
|
||||
let disabled_status = state.client.v2_disabled_status_code();
|
||||
if disabled_status != 0 {
|
||||
let code = StatusCode::from_u16(disabled_status).unwrap_or(StatusCode::NOT_FOUND);
|
||||
return (code, "V2 reconstruction endpoint disabled").into_response();
|
||||
}
|
||||
|
||||
let base_url = get_base_url(&headers);
|
||||
|
||||
let range = match parse_range_header(headers.get(RANGE)) {
|
||||
Ok(Some(FileRangeVariant::Normal(range))) => Some(range),
|
||||
Ok(Some(FileRangeVariant::OpenRHS(start))) => {
|
||||
let file_size = match state.client.get_file_size(&file_id).await {
|
||||
Ok(size) => size,
|
||||
Err(e) => return error_to_response(e),
|
||||
};
|
||||
Some(FileRange::new(start, file_size))
|
||||
},
|
||||
Ok(Some(FileRangeVariant::Suffix(suffix))) => {
|
||||
let file_size = match state.client.get_file_size(&file_id).await {
|
||||
Ok(size) => size,
|
||||
Err(e) => return error_to_response(e),
|
||||
};
|
||||
Some(FileRange::new(file_size.saturating_sub(suffix), file_size))
|
||||
},
|
||||
Ok(None) => None,
|
||||
Err((status, msg)) => return (status, msg).into_response(),
|
||||
};
|
||||
|
||||
match state.client.get_reconstruction_v2(&file_id, range).await {
|
||||
Ok(Some(mut response)) => {
|
||||
transform_v2_xorb_urls(&mut response, &base_url);
|
||||
Json(response).into_response()
|
||||
},
|
||||
Ok(None) => (StatusCode::RANGE_NOT_SATISFIABLE, "Range not satisfiable").into_response(),
|
||||
Err(e) => error_to_response(e),
|
||||
}
|
||||
}
|
||||
|
||||
/// Transforms V2 xorb URLs from client-internal format to HTTP URLs.
|
||||
///
|
||||
/// Each `XorbMultiRangeFetch` URL is replaced with an HTTP URL pointing
|
||||
/// to the /v1/fetch_term endpoint. The byte ranges from the V2 response
|
||||
/// are encoded into the term so the endpoint can serve all ranges in one request.
|
||||
fn transform_v2_xorb_urls(response: &mut QueryReconstructionResponseV2, base_url: &str) {
|
||||
for (xorb_hash, fetch_entries) in response.xorbs.iter_mut() {
|
||||
let xorb_hash: MerkleHash = (*xorb_hash).into();
|
||||
for fetch in fetch_entries.iter_mut() {
|
||||
let encoded_term = encode_term_with_ranges(&xorb_hash, &fetch.ranges);
|
||||
fetch.url = format!("{base_url}/v1/fetch_term?term={encoded_term}");
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// GET /reconstructions?file_id=...&file_id=...
|
||||
///
|
||||
/// Batch query for reconstruction information for multiple files using query parameters.
|
||||
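The V2 handler above resolves three forms of the `Range` header into a concrete file range: a normal range is used as given, an open-ended range runs to the file size, and a suffix range covers the last N bytes. A minimal sketch of that resolution follows; `RangeRequest` and `resolve_range` are hypothetical names modeled on `FileRangeVariant`, and the result is an exclusive-end `(start, end)` pair.

```rust
/// Hypothetical mirror of the three `Range` header forms the handler matches on.
#[derive(Debug, Clone, Copy)]
enum RangeRequest {
    /// Both bounds given (exclusive end).
    Normal { start: u64, end: u64 },
    /// "bytes=START-": from `start` to the end of the file.
    OpenRhs(u64),
    /// "bytes=-N": the last `N` bytes of the file.
    Suffix(u64),
}

/// Resolves a request into an exclusive-end `(start, end)` pair.
fn resolve_range(req: RangeRequest, file_size: u64) -> (u64, u64) {
    match req {
        RangeRequest::Normal { start, end } => (start, end),
        RangeRequest::OpenRhs(start) => (start, file_size),
        // `saturating_sub` clamps to 0 when the suffix exceeds the file size.
        RangeRequest::Suffix(n) => (file_size.saturating_sub(n), file_size),
    }
}
```

Only the open-ended and suffix forms need the file size, which is why the handler calls `get_file_size` in those two arms and not for a normal range.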
@@ -285,10 +385,12 @@ pub async fn batch_get_reconstruction(
 /// GET /v1/fetch_term?term=<base64_encoded_term>
 ///
 /// Fetches raw XORB data based on an encoded term.
-/// The term contains the xorb hash. The byte range is specified via HTTP Range header.
 ///
-/// This endpoint is called by RemoteClient when fetching reconstruction terms.
-/// It returns raw (compressed) bytes that the client will decompress.
+/// For V1 terms (hash only), the byte range comes from the HTTP Range header.
+/// For V2 terms (hash + ranges), all encoded byte ranges are fetched and
+/// concatenated in order, allowing a single request to serve multi-range blocks.
+///
+/// Returns raw (compressed) bytes that the client will decompress.
 pub async fn fetch_term(State(state): State<ServerState>, uri: axum::http::Uri, headers: HeaderMap) -> Response {
     let connection_guard = state.latency_simulation.register_connection().await;
     if let Some(simulated_error) = connection_guard.simulate_error() {
@@ -304,13 +406,69 @@ pub async fn fetch_term(State(state): State<ServerState>, uri: axum::http::Uri,
|
||||
return (StatusCode::BAD_REQUEST, "Missing 'term' query parameter").into_response();
|
||||
};
|
||||
|
||||
let xorb_hash = match decode_term(&term) {
|
||||
Ok(h) => h,
|
||||
let decoded = match decode_term(&term) {
|
||||
Ok(d) => d,
|
||||
Err(e) => return (StatusCode::BAD_REQUEST, format!("Invalid term: {e}")).into_response(),
|
||||
};
|
||||
|
||||
// Get total length of the raw XORB data for Range header handling
|
||||
let total_length = match state.client.xorb_raw_length(&xorb_hash).await {
|
||||
if !decoded.byte_ranges.is_empty() {
|
||||
// If the client sends a single-range HTTP Range header, serve just that range.
|
||||
// This simulates S3/CDN behavior where the Range header controls the response
|
||||
// regardless of what ranges are encoded in the presigned URL. This is the
|
||||
// common path when ranges are split into single-range requests based on
|
||||
// the multirange thresholds (V2 URLs with individual requests).
|
||||
if let Ok(Some(FileRangeVariant::Normal(range))) = parse_range_header(headers.get(RANGE)) {
|
||||
return match state.client.get_xorb_raw_bytes(&decoded.hash, Some(range)).await {
|
||||
Ok(data) => (StatusCode::PARTIAL_CONTENT, data).into_response(),
|
||||
Err(e) => error_to_response(e),
|
||||
};
|
||||
}
|
||||
|
||||
if decoded.byte_ranges.len() == 1 {
|
||||
let range = &decoded.byte_ranges[0];
|
||||
return match state.client.get_xorb_raw_bytes(&decoded.hash, Some(*range)).await {
|
||||
Ok(data) => (StatusCode::PARTIAL_CONTENT, data).into_response(),
|
||||
Err(e) => error_to_response(e),
|
||||
};
|
||||
}
|
||||
|
||||
// Multiple ranges with no Range header override: return a multipart/byteranges
|
||||
// response (RFC 7233 Section 4.1), matching S3/CloudFront multi-range format.
|
||||
let total_length = match state.client.xorb_raw_length(&decoded.hash).await {
|
||||
Ok(len) => len,
|
||||
Err(e) => return error_to_response(e),
|
||||
};
|
||||
|
||||
let boundary = "xet_multipart_boundary";
|
||||
let mut response_body = Vec::new();
|
||||
|
||||
for range in &decoded.byte_ranges {
|
||||
let data = match state.client.get_xorb_raw_bytes(&decoded.hash, Some(*range)).await {
|
||||
Ok(d) => d,
|
||||
Err(e) => return error_to_response(e),
|
||||
};
|
||||
// FileRange uses exclusive end; Content-Range header uses inclusive end.
|
||||
let inclusive_end = range.end.saturating_sub(1);
|
||||
let part_header = format!(
|
||||
"--{boundary}\r\nContent-Type: application/octet-stream\r\nContent-Range: bytes {}-{}/{total_length}\r\n\r\n",
|
||||
range.start, inclusive_end
|
||||
);
|
||||
response_body.extend_from_slice(part_header.as_bytes());
|
||||
response_body.extend_from_slice(&data);
|
||||
response_body.extend_from_slice(b"\r\n");
|
||||
}
|
||||
response_body.extend_from_slice(format!("--{boundary}--\r\n").as_bytes());
|
||||
|
||||
let content_type = format!("multipart/byteranges; boundary={boundary}");
|
||||
let mut headers = HeaderMap::new();
|
||||
headers.insert(http::header::CONTENT_TYPE, HeaderValue::from_str(&content_type).unwrap());
|
||||
|
||||
return (StatusCode::PARTIAL_CONTENT, headers, Bytes::from(response_body)).into_response();
|
||||
}
|
||||
|
||||
// V1 term: byte range comes from the HTTP Range header.
|
||||
// Get total length of the raw XORB data for Range header handling.
|
||||
let total_length = match state.client.xorb_raw_length(&decoded.hash).await {
|
||||
Ok(len) => len,
|
||||
Err(e) => return error_to_response(e),
|
||||
};
|
||||
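When a V2 term carries several ranges and no `Range` header overrides them, the handler above assembles a `multipart/byteranges` body: one part per range, each with a `Content-Range` header that uses an inclusive end. The body construction can be sketched in isolation as below. `build_multipart_body` is a hypothetical helper; the part layout follows the `format!` string in the diff, and the input ranges are assumed to use exclusive ends.

```rust
/// Builds a multipart/byteranges body from (exclusive-end range, data) pairs.
fn build_multipart_body(
    boundary: &str,
    total_length: u64,
    parts: &[((u64, u64), &[u8])],
) -> Vec<u8> {
    let mut body = Vec::new();
    for ((start, end), data) in parts {
        // Content-Range uses an inclusive end, so convert from exclusive.
        let inclusive_end = end.saturating_sub(1);
        let header = format!(
            "--{boundary}\r\nContent-Type: application/octet-stream\r\nContent-Range: bytes {start}-{inclusive_end}/{total_length}\r\n\r\n"
        );
        body.extend_from_slice(header.as_bytes());
        body.extend_from_slice(data);
        body.extend_from_slice(b"\r\n");
    }
    // Closing delimiter marks the end of the multipart body.
    body.extend_from_slice(format!("--{boundary}--\r\n").as_bytes());
    body
}
```

The response would then be sent with status 206 and `Content-Type: multipart/byteranges; boundary=<boundary>`, as the handler does with its fixed `xet_multipart_boundary` string.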
@@ -327,7 +485,7 @@ pub async fn fetch_term(State(state): State<ServerState>, uri: axum::http::Uri,
     };

     // Fetch raw (serialized/compressed) bytes from the XORB
-    match state.client.get_xorb_raw_bytes(&xorb_hash, byte_range).await {
+    match state.client.get_xorb_raw_bytes(&decoded.hash, byte_range).await {
         Ok(data) => (StatusCode::OK, data).into_response(),
         Err(e) => error_to_response(e),
     }
@@ -713,9 +871,33 @@ mod tests {
         let xorb_hash = MerkleHash::from_hex(&format!("{:0>64}", "abc123")).unwrap();

         let encoded = encode_term(&xorb_hash);
-        let decoded_hash = decode_term(&encoded).unwrap();
+        let decoded = decode_term(&encoded).unwrap();
+        assert_eq!(decoded.hash, xorb_hash);
+        assert!(decoded.byte_ranges.is_empty());
+    }

-        assert_eq!(decoded_hash, xorb_hash);
+    #[test]
+    fn test_encode_decode_term_with_ranges() {
+        use crate::cas_types::{ChunkRange, HttpRange, XorbRangeDescriptor};
+
+        let xorb_hash = MerkleHash::from_hex(&format!("{:0>64}", "abc123")).unwrap();
+        let ranges = vec![
+            XorbRangeDescriptor {
+                chunks: ChunkRange::new(0, 3),
+                bytes: HttpRange::new(0, 1023),
+            },
+            XorbRangeDescriptor {
+                chunks: ChunkRange::new(5, 8),
+                bytes: HttpRange::new(2048, 4095),
+            },
+        ];
+
+        let encoded = encode_term_with_ranges(&xorb_hash, &ranges);
+        let decoded = decode_term(&encoded).unwrap();
+        assert_eq!(decoded.hash, xorb_hash);
+        assert_eq!(decoded.byte_ranges.len(), 2);
+        assert_eq!(decoded.byte_ranges[0], FileRange::new(0, 1024));
+        assert_eq!(decoded.byte_ranges[1], FileRange::new(2048, 4096));
     }

     #[test]
|
||||
@@ -177,6 +177,7 @@ impl LocalServer {
                     .route("/get_xorb/{prefix}/{hash}/", get(handlers::get_file_term_data))
                     .route("/fetch_term", get(handlers::fetch_term)),
             )
+            .nest("/v2", Router::new().route("/reconstructions/{file_id}", get(handlers::get_reconstruction_v2)))
             .nest(
                 "/simulation",
                 super::simulation_handlers::simulation_routes()
@@ -425,7 +426,7 @@ impl Client for LocalTestServer {
|
||||
&self,
|
||||
file_id: &xet_core_structures::merklehash::MerkleHash,
|
||||
bytes_range: Option<crate::cas_types::FileRange>,
|
||||
) -> Result<Option<crate::cas_types::QueryReconstructionResponse>> {
|
||||
) -> Result<Option<crate::cas_types::QueryReconstructionResponseV2>> {
|
||||
self.remote_client.get_reconstruction(file_id, bytes_range).await
|
||||
}
|
||||
|
||||
@@ -492,6 +493,34 @@ impl DirectAccessClient for LocalTestServer {
         self.client.set_fetch_term_url_expiration(expiration);
     }

+    fn set_max_ranges_per_fetch(&self, max_ranges: usize) {
+        self.client.set_max_ranges_per_fetch(max_ranges);
+    }
+
+    fn disable_v2_reconstruction(&self, status_code: u16) {
+        self.client.disable_v2_reconstruction(status_code);
+    }
+
+    fn v2_disabled_status_code(&self) -> u16 {
+        self.client.v2_disabled_status_code()
+    }
+
+    async fn get_reconstruction_v1(
+        &self,
+        file_id: &xet_core_structures::merklehash::MerkleHash,
+        bytes_range: Option<crate::cas_types::FileRange>,
+    ) -> Result<Option<crate::cas_types::QueryReconstructionResponse>> {
+        self.remote_client.get_reconstruction_v1(file_id, bytes_range).await
+    }
+
+    async fn get_reconstruction_v2(
+        &self,
+        file_id: &xet_core_structures::merklehash::MerkleHash,
+        bytes_range: Option<crate::cas_types::FileRange>,
+    ) -> Result<Option<crate::cas_types::QueryReconstructionResponseV2>> {
+        self.remote_client.get_reconstruction_v2(file_id, bytes_range).await
+    }
+
     fn set_api_delay_range(&self, delay_range: Option<std::ops::Range<std::time::Duration>>) {
         self.client.set_api_delay_range(delay_range);
     }
@@ -588,7 +617,7 @@ mod tests {
     use crate::cas_client::simulation::client_testing_utils::ClientTestingUtils;
     use crate::cas_client::simulation::local_server::SimulationControlClient;
     use crate::cas_client::simulation::{DeletionControlableClient, DirectAccessClient};
-    use crate::cas_types::FileRange;
+    use crate::cas_types::{FileRange, QueryReconstructionResponseV2};

     const CHUNK_SIZE: usize = 123;

@@ -604,16 +633,16 @@ mod tests {
         let local_data = server.client().get_file_data(&file.file_hash, None).await.unwrap();
         assert_eq!(file.data, local_data);

-        // Full file reconstruction - compare remote and local
+        // Full file reconstruction - compare remote and local (V1)
         let remote_recon = server
             .remote_client()
-            .get_reconstruction(&file.file_hash, None)
+            .get_reconstruction_v1(&file.file_hash, None)
             .await
             .unwrap()
             .unwrap();
         let local_recon = server
             .client()
-            .get_reconstruction(&file.file_hash, None)
+            .get_reconstruction_v1(&file.file_hash, None)
             .await
             .unwrap()
             .unwrap();
@@ -629,7 +658,7 @@ mod tests {
         let range = FileRange::new(file_size / 4, file_size * 3 / 4);
         let range_recon = server
             .remote_client()
-            .get_reconstruction(&file.file_hash, Some(range))
+            .get_reconstruction_v1(&file.file_hash, Some(range))
             .await
             .unwrap();
         assert!(range_recon.is_some());
@@ -639,7 +668,7 @@ mod tests {
         let multi_file = server.client().upload_random_file(term_spec, CHUNK_SIZE).await.unwrap();
         let multi_recon = server
             .remote_client()
-            .get_reconstruction(&multi_file.file_hash, None)
+            .get_reconstruction_v1(&multi_file.file_hash, None)
             .await
             .unwrap()
             .unwrap();
@@ -750,7 +779,7 @@ mod tests {
         // Verify single XORB URLs are HTTP
         let recon1 = server
             .remote_client()
-            .get_reconstruction(&file1.file_hash, None)
+            .get_reconstruction_v1(&file1.file_hash, None)
             .await
             .unwrap()
             .unwrap();
@@ -770,7 +799,7 @@ mod tests {
         // Verify multi-XORB file has HTTP URLs for all XORBs
         let multi_recon = server
             .remote_client()
-            .get_reconstruction(&multi_file.file_hash, None)
+            .get_reconstruction_v1(&multi_file.file_hash, None)
             .await
             .unwrap()
             .unwrap();
@@ -786,7 +815,7 @@ mod tests {
         let range = FileRange::new(file_size / 4, file_size * 3 / 4);
         let range_recon = server
             .remote_client()
-            .get_reconstruction(&multi_file.file_hash, Some(range))
+            .get_reconstruction_v1(&multi_file.file_hash, Some(range))
             .await
             .unwrap()
             .unwrap();
@@ -817,7 +846,7 @@ mod tests {
         // Get reconstruction via remote client
         let recon = server
             .remote_client()
-            .get_reconstruction(&file.file_hash, None)
+            .get_reconstruction_v1(&file.file_hash, None)
             .await
             .unwrap()
             .unwrap();
@@ -841,7 +870,7 @@ mod tests {
         // Get reconstruction
         let recon = server
             .remote_client()
-            .get_reconstruction(&file.file_hash, None)
+            .get_reconstruction_v1(&file.file_hash, None)
             .await
             .unwrap()
             .unwrap();
@@ -906,6 +935,241 @@ mod tests {
         }
     }

+    /// Tests V2 reconstruction endpoint returns valid responses through the server.
+    async fn check_v2_reconstruction(server: &LocalTestServer) {
+        let file = server.client().upload_random_file(&[(1, (0, 5))], CHUNK_SIZE).await.unwrap();
+
+        // Query V2 endpoint via remote client
+        let v2 = server
+            .remote_client()
+            .get_reconstruction_v2(&file.file_hash, None)
+            .await
+            .unwrap()
+            .unwrap();
+
+        assert!(!v2.terms.is_empty());
+        assert!(!v2.xorbs.is_empty());
+        assert_eq!(v2.offset_into_first_range, 0);
+
+        // V2 URLs should be HTTP URLs pointing to /v1/fetch_term
+        for fetch_entries in v2.xorbs.values() {
+            for fetch in fetch_entries {
+                assert!(fetch.url.starts_with("http://"), "V2 URL should be HTTP, got: {}", fetch.url);
+                assert!(
+                    fetch.url.contains("/v1/fetch_term?term="),
+                    "V2 URL should point to fetch_term endpoint, got: {}",
+                    fetch.url
+                );
+            }
+        }
+
+        // V2 terms should match V1 terms
+        let v1 = server
+            .remote_client()
+            .get_reconstruction_v1(&file.file_hash, None)
+            .await
+            .unwrap()
+            .unwrap();
+
+        assert_eq!(v1.terms.len(), v2.terms.len());
+        assert_eq!(v1.offset_into_first_range, v2.offset_into_first_range);
+        for (t1, t2) in v1.terms.iter().zip(v2.terms.iter()) {
+            assert_eq!(t1.hash, t2.hash);
+            assert_eq!(t1.range, t2.range);
+        }
+    }
+
+    /// Tests V2 fetch URLs are fetchable via the /v1/fetch_term endpoint.
+    async fn check_v2_url_transformation(server: &LocalTestServer) {
+        let http_client = reqwest::Client::new();
+
+        let file = server
+            .client()
+            .upload_random_file(&[(1, (0, 3)), (2, (0, 2))], CHUNK_SIZE)
+            .await
+            .unwrap();
+
+        let v2 = server
+            .remote_client()
+            .get_reconstruction_v2(&file.file_hash, None)
+            .await
+            .unwrap()
+            .unwrap();
+
+        for fetch_entries in v2.xorbs.values() {
+            for fetch in fetch_entries {
+                let response = http_client.get(&fetch.url).send().await.unwrap();
+                assert!(
+                    response.status().is_success(),
+                    "V2 fetch URL should be fetchable: {} (status: {})",
+                    fetch.url,
+                    response.status()
+                );
+                let data = response.bytes().await.unwrap();
+                assert!(!data.is_empty(), "Fetched data should not be empty");
+            }
+        }
+    }
+
+    /// Tests V2 with range requests through the server.
+    async fn check_v2_range_reconstruction(server: &LocalTestServer) {
+        let term_spec = &[(1, (0, 3)), (2, (0, 2)), (1, (3, 5))];
+        let file = server.client().upload_random_file(term_spec, CHUNK_SIZE).await.unwrap();
+        let file_size = file.data.len() as u64;
+
+        let range = FileRange::new(file_size / 4, file_size * 3 / 4);
+        let v2 = server
+            .remote_client()
+            .get_reconstruction_v2(&file.file_hash, Some(range))
+            .await
+            .unwrap()
+            .unwrap();
+
+        assert!(!v2.terms.is_empty());
+        for fetch_entries in v2.xorbs.values() {
+            for fetch in fetch_entries {
+                assert!(fetch.url.starts_with("http://"));
+            }
+        }
+
+        // Validate open-ended and suffix range variants through the V2 HTTP endpoint.
+        let v2_url = format!("{}/v2/reconstructions/{}", server.endpoint(), file.file_hash.hex());
+        let http_client = reqwest::Client::new();
+
+        let open_rhs: QueryReconstructionResponseV2 = http_client
+            .get(&v2_url)
+            .header(reqwest::header::RANGE, "bytes=100-")
+            .send()
+            .await
+            .unwrap()
+            .error_for_status()
+            .unwrap()
+            .json()
+            .await
+            .unwrap();
+        assert!(!open_rhs.terms.is_empty());
+
+        let suffix: QueryReconstructionResponseV2 = http_client
+            .get(&v2_url)
+            .header(reqwest::header::RANGE, "bytes=-128")
+            .send()
+            .await
+            .unwrap()
+            .error_for_status()
+            .unwrap()
+            .json()
+            .await
+            .unwrap();
+        assert!(!suffix.terms.is_empty());
+    }
+
+    /// Tests V2 max_ranges_per_fetch through the server.
+    async fn check_v2_max_ranges(server: &LocalTestServer) {
+        let term_spec = &[(1, (0, 2)), (2, (0, 1)), (1, (2, 4)), (2, (1, 2)), (1, (4, 6))];
+        let file = server.client().upload_random_file(term_spec, 512).await.unwrap();
+
+        // Set max_ranges_per_fetch to 1
+        server.set_max_ranges_per_fetch(1);
+
+        let v2 = server
+            .client()
+            .get_reconstruction_v2(&file.file_hash, None)
+            .await
+            .unwrap()
+            .unwrap();
+
+        let xorb1_hash: crate::cas_types::HexMerkleHash = file.terms[0].xorb_hash.into();
+        if let Some(desc) = v2.xorbs.get(&xorb1_hash) {
+            for fetch in desc {
+                assert!(fetch.ranges.len() <= 1, "Each fetch should have at most 1 range, got {}", fetch.ranges.len());
+            }
+        }
+
+        // Reset
+        server.set_max_ranges_per_fetch(usize::MAX);
+    }
+
+    /// Verifies that disabling V2 with various status codes causes the V2 endpoint
+    /// to return that code, and that get_reconstruction falls back to V1.
+    async fn check_v2_disabled_fallback(server: &LocalTestServer) {
+        let file = server
+            .remote_client()
+            .upload_random_file(&[(1, (0, 3)), (2, (0, 2))], CHUNK_SIZE)
+            .await
+            .unwrap();
+
+        // V2 should work before disabling.
+        let v2_result = server.remote_client().get_reconstruction_v2(&file.file_hash, None).await;
+        assert!(v2_result.is_ok());
+
+        // Test 501 (Not Implemented) fallback first, before the RemoteClient
+        // caches a V1 preference from a 404 fallback.
+        server.disable_v2_reconstruction(501);
+
+        let v2_result = server.remote_client().get_reconstruction_v2(&file.file_hash, None).await;
+        assert!(v2_result.is_err(), "V2 should return error when disabled with 501");
+
+        // Forced V2 should surface the endpoint error directly with no fallback.
+        let forced_v2 = server
+            .remote_client()
+            .get_reconstruction_with_version_override(&file.file_hash, None, Some(2))
+            .await;
+        assert!(forced_v2.is_err());
+        assert_eq!(forced_v2.unwrap_err().status(), Some(reqwest::StatusCode::NOT_IMPLEMENTED));
+
+        // Forced V1 should continue to succeed when V2 is disabled.
+        let forced_v1 = server
+            .remote_client()
+            .get_reconstruction_with_version_override(&file.file_hash, None, Some(1))
+            .await
+            .unwrap()
+            .unwrap();
+        assert_eq!(forced_v1.terms.len(), 2);
+
+        let result = server
+            .remote_client()
+            .get_reconstruction(&file.file_hash, None)
+            .await
+            .unwrap()
+            .unwrap();
+        assert_eq!(result.terms.len(), 2);
+
+        // Re-enable V2, then test 404 fallback.
+        server.disable_v2_reconstruction(0);
+
+        // Reset the RemoteClient's cached version by making a successful V2 call.
+        let v2_result = server.remote_client().get_reconstruction_v2(&file.file_hash, None).await;
+        assert!(v2_result.is_ok(), "V2 should work again after re-enabling");
+
+        server.disable_v2_reconstruction(404);
+
+        let v2_result = server.remote_client().get_reconstruction_v2(&file.file_hash, None).await;
+        assert!(v2_result.is_err(), "V2 should return error when disabled with 404");
+
+        let forced_v2 = server
+            .remote_client()
+            .get_reconstruction_with_version_override(&file.file_hash, None, Some(2))
+            .await;
+        assert!(forced_v2.is_err());
+        assert_eq!(forced_v2.unwrap_err().status(), Some(reqwest::StatusCode::NOT_FOUND));
+
+        let forced_v1 = server
+            .remote_client()
+            .get_reconstruction_with_version_override(&file.file_hash, None, Some(1))
+            .await
+            .unwrap()
+            .unwrap();
+        assert_eq!(forced_v1.terms.len(), 2);
+
+        let result = server
+            .remote_client()
+            .get_reconstruction(&file.file_hash, None)
+            .await
+            .unwrap()
+            .unwrap();
+        assert_eq!(result.terms.len(), 2);
+    }
+
     /// Runs all server checks for a given test server instance.
     async fn run_all_server_checks(server: &LocalTestServer) {
         check_basic_correctness(server).await;
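The behavior that `check_v2_disabled_fallback` exercises (try V2, fall back to V1 when the endpoint answers 404 or 501) can be sketched roughly as below. This is an illustration under stated assumptions, not `RemoteClient`'s actual logic: `Fetch`, `ApiVersion`, and `fetch_with_fallback` are hypothetical names, and the choice to cache a V1 preference only after a 404 is a guess based on the test comments.

```rust
/// Outcome of one version-specific request (hypothetical, for illustration).
#[derive(Debug, PartialEq)]
pub enum Fetch<T> {
    Ok(T),
    HttpError(u16),
}

/// Which endpoint version the caller currently prefers (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ApiVersion {
    V1,
    V2,
}

/// Tries V2 when it is preferred. A 404 or 501 triggers a fallback to V1;
/// in this sketch only a 404 also switches the cached preference to V1.
/// Any other V2 outcome (success or a different error) is returned as-is.
pub fn fetch_with_fallback<T>(
    preferred: &mut ApiVersion,
    mut call_v2: impl FnMut() -> Fetch<T>,
    mut call_v1: impl FnMut() -> Fetch<T>,
) -> Fetch<T> {
    if *preferred == ApiVersion::V2 {
        match call_v2() {
            Fetch::HttpError(404) => *preferred = ApiVersion::V1,
            Fetch::HttpError(501) => {}
            other => return other,
        }
    }
    call_v1()
}
```

A forced-version call, as in `get_reconstruction_with_version_override`, would bypass this helper entirely and surface the endpoint error directly.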
@@ -915,6 +1179,11 @@ mod tests {
         check_downloaded_terms_match_expected_data(server).await;
         check_complete_file_reconstruction(server).await;
         check_chunk_hashes_correctness(server).await;
+        check_v2_reconstruction(server).await;
+        check_v2_url_transformation(server).await;
+        check_v2_range_reconstruction(server).await;
+        check_v2_max_ranges(server).await;
+        check_v2_disabled_fallback(server).await;
     }

     async fn all_file_hashes(client: &LocalClient) -> HashSet<MerkleHash> {
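`check_v2_range_reconstruction` sends two `Range` header shapes to the V2 endpoint: open-ended (`bytes=100-`) and suffix (`bytes=-128`). The sketch below shows one way to classify such single-range headers. It is illustrative only: `ByteRangeSpec` and `parse_range_header` are hypothetical names, and this is not the server's actual parser.

```rust
/// The three single-range shapes of an HTTP `Range: bytes=...` header.
#[derive(Debug, PartialEq, Eq)]
pub enum ByteRangeSpec {
    /// `bytes=a-b`, both ends inclusive.
    Bounded { start: u64, end: u64 },
    /// `bytes=a-`, from offset `a` to the end of the resource.
    From { start: u64 },
    /// `bytes=-n`, the final `n` bytes of the resource.
    Suffix { len: u64 },
}

/// Parses a single-range header value. Returns `None` for other units,
/// multi-range lists, inverted ranges, or malformed numbers.
pub fn parse_range_header(value: &str) -> Option<ByteRangeSpec> {
    let spec = value.strip_prefix("bytes=")?;
    let (lhs, rhs) = spec.split_once('-')?;
    match (lhs.is_empty(), rhs.is_empty()) {
        (false, false) => {
            let start: u64 = lhs.parse().ok()?;
            let end: u64 = rhs.parse().ok()?;
            (start <= end).then_some(ByteRangeSpec::Bounded { start, end })
        }
        (false, true) => Some(ByteRangeSpec::From { start: lhs.parse().ok()? }),
        (true, false) => Some(ByteRangeSpec::Suffix { len: rhs.parse().ok()? }),
        (true, true) => None,
    }
}
```

Resolving a `From` or `Suffix` spec into concrete offsets needs the total file size, which is why the server rather than the client has to interpret these variants.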
@@ -17,7 +17,7 @@ use crate::cas_client::RemoteClient;
 use crate::cas_client::error::{CasClientError, Result};
 use crate::cas_client::interface::Client;
 use crate::cas_client::simulation::{DeletionControlableClient, DirectAccessClient};
-use crate::cas_types::{FileRange, HexMerkleHash, XorbReconstructionFetchInfo};
+use crate::cas_types::{FileRange, HexMerkleHash, QueryReconstructionResponseV2, XorbReconstructionFetchInfo};

 /// A client that connects to a `LocalTestServer` via HTTP and provides access
 /// to both `DirectAccessClient` and `DeletionControlableClient` operations
@@ -91,7 +91,7 @@ impl Client for SimulationControlClient {
         &self,
         file_id: &MerkleHash,
         bytes_range: Option<FileRange>,
-    ) -> Result<Option<crate::cas_types::QueryReconstructionResponse>> {
+    ) -> Result<Option<QueryReconstructionResponseV2>> {
         self.remote_client.get_reconstruction(file_id, bytes_range).await
     }

@@ -172,6 +172,30 @@ impl DirectAccessClient for SimulationControlClient {
         // No-op: delays are applied server-side via set_api_delay_range
     }

+    fn set_max_ranges_per_fetch(&self, _max_ranges: usize) {
+        // No-op: SimulationControlClient configures server via HTTP; endpoint not yet implemented.
+    }
+
+    fn disable_v2_reconstruction(&self, _status_code: u16) {
+        // No-op: SimulationControlClient configures server via HTTP; endpoint not yet implemented.
+    }
+
+    async fn get_reconstruction_v1(
+        &self,
+        file_id: &MerkleHash,
+        bytes_range: Option<FileRange>,
+    ) -> Result<Option<crate::cas_types::QueryReconstructionResponse>> {
+        self.remote_client.get_reconstruction_v1(file_id, bytes_range).await
+    }
+
+    async fn get_reconstruction_v2(
+        &self,
+        file_id: &MerkleHash,
+        bytes_range: Option<FileRange>,
+    ) -> Result<Option<QueryReconstructionResponseV2>> {
+        self.remote_client.get_reconstruction_v2(file_id, bytes_range).await
+    }
+
     /// Sets the API delay range via the `/simulation/config/api_delay` endpoint.
     fn set_api_delay_range(&self, delay_range: Option<Range<Duration>>) {
         let url = self.sim_url("/config/api_delay");
@@ -2,11 +2,10 @@ use std::collections::HashMap;
 use std::io::{BufReader, Cursor};
 use std::ops::Range;
 use std::sync::Arc;
-use std::sync::atomic::{AtomicU64, Ordering};
+use std::sync::atomic::{AtomicU16, AtomicU64, AtomicUsize, Ordering};

 use async_trait::async_trait;
 use bytes::Bytes;
-use more_asserts::{assert_ge, assert_gt, debug_assert_lt};
 use rand::Rng;
 use tokio::sync::RwLock;
 use tokio::time::{Duration, Instant};
@@ -26,21 +25,12 @@ use super::super::progress_tracked_streams::ProgressCallback;
 use super::client_testing_utils::{FileTermReference, RandomFileContents};
 use super::direct_access_client::DirectAccessClient;
 use super::random_xorb::RandomXorb;
+use super::xorb_utils::{self, REFERENCE_INSTANT};
 use crate::cas_types::{
-    BatchQueryReconstructionResponse, ChunkRange, FileRange, HexMerkleHash, HttpRange, QueryReconstructionResponse,
-    XorbReconstructionFetchInfo, XorbReconstructionTerm,
+    BatchQueryReconstructionResponse, FileRange, HexMerkleHash, HttpRange, QueryReconstructionResponse,
+    QueryReconstructionResponseV2, XorbMultiRangeFetch, XorbRangeDescriptor, XorbReconstructionFetchInfo,
 };

-lazy_static::lazy_static! {
-    /// Reference instant for URL timestamps. Initialized far in the past to allow
-    /// testing timestamps that are earlier in the current process lifetime.
-    static ref REFERENCE_INSTANT: Instant = {
-        let now = Instant::now();
-        now.checked_sub(Duration::from_secs(365 * 24 * 60 * 60))
-            .unwrap_or(now)
-    };
-}
-
 /// Stored XORB data: the serialized data and the deserialized XorbObject (header/footer).
 struct MaterializedXorb {
     serialized_data: Bytes,
@@ -69,6 +59,10 @@ pub struct MemoryClient {
     url_expiration_ms: AtomicU64,
     /// API delay range in milliseconds as (min_ms, max_ms). (0, 0) means disabled.
     random_ms_delay_window: (AtomicU64, AtomicU64),
+    /// Max ranges per XorbMultiRangeFetch entry. usize::MAX means no splitting.
+    max_ranges_per_fetch: AtomicUsize,
+    /// HTTP status code to return when V2 is disabled (0 = enabled).
+    v2_disabled_status: AtomicU16,
 }

 impl MemoryClient {
@@ -81,6 +75,8 @@ impl MemoryClient {
             upload_concurrency_controller: AdaptiveConcurrencyController::new_upload("memory_uploads"),
             url_expiration_ms: AtomicU64::new(u64::MAX),
             random_ms_delay_window: (AtomicU64::new(0), AtomicU64::new(0)),
+            max_ranges_per_fetch: AtomicUsize::new(usize::MAX),
+            v2_disabled_status: AtomicU16::new(0),
         })
     }

@@ -225,6 +221,8 @@ impl Default for MemoryClient {
             upload_concurrency_controller: AdaptiveConcurrencyController::new_upload("memory_uploads"),
             url_expiration_ms: AtomicU64::new(u64::MAX),
             random_ms_delay_window: (AtomicU64::new(0), AtomicU64::new(0)),
+            max_ranges_per_fetch: AtomicUsize::new(usize::MAX),
+            v2_disabled_status: AtomicU16::new(0),
         }
     }
 }
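The `v2_disabled_status` field uses `0` as a sentinel for "V2 enabled" and any other value as the HTTP status the V2 endpoint should return. A small sketch of that pattern, with the sentinel surfaced as an `Option`, might look like this; `V2Switch` is a hypothetical type for illustration and is not part of `MemoryClient`.

```rust
use std::sync::atomic::{AtomicU16, Ordering};

/// Hypothetical wrapper around the "0 means enabled" status sentinel.
pub struct V2Switch {
    disabled_status: AtomicU16,
}

impl V2Switch {
    /// Starts with V2 enabled (sentinel value 0).
    pub const fn new() -> Self {
        Self { disabled_status: AtomicU16::new(0) }
    }

    /// Disables V2 so the endpoint would answer with `status_code`;
    /// passing 0 re-enables it, mirroring the convention in the diff.
    pub fn disable(&self, status_code: u16) {
        self.disabled_status.store(status_code, Ordering::Relaxed);
    }

    /// Returns `Some(status)` while V2 is disabled, `None` when it is enabled.
    pub fn disabled_status(&self) -> Option<u16> {
        match self.disabled_status.load(Ordering::Relaxed) {
            0 => None,
            code => Some(code),
        }
    }
}
```

`Ordering::Relaxed` is sufficient here because the flag is a standalone value and no other memory is published through it.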
@@ -236,6 +234,34 @@ impl DirectAccessClient for MemoryClient {
         self.url_expiration_ms.store(expiration.as_millis() as u64, Ordering::Relaxed);
     }

+    fn set_max_ranges_per_fetch(&self, max_ranges: usize) {
+        self.max_ranges_per_fetch.store(max_ranges, Ordering::Relaxed);
+    }
+
+    fn disable_v2_reconstruction(&self, status_code: u16) {
+        self.v2_disabled_status.store(status_code, Ordering::Relaxed);
+    }
+
+    fn v2_disabled_status_code(&self) -> u16 {
+        self.v2_disabled_status.load(Ordering::Relaxed)
+    }
+
+    async fn get_reconstruction_v1(
+        &self,
+        file_id: &MerkleHash,
+        bytes_range: Option<FileRange>,
+    ) -> Result<Option<QueryReconstructionResponse>> {
+        MemoryClient::get_reconstruction_v1(self, file_id, bytes_range).await
+    }
+
+    async fn get_reconstruction_v2(
+        &self,
+        file_id: &MerkleHash,
+        bytes_range: Option<FileRange>,
+    ) -> Result<Option<QueryReconstructionResponseV2>> {
+        MemoryClient::get_reconstruction_v2(self, file_id, bytes_range).await
+    }
+
     fn set_api_delay_range(&self, delay_range: Option<Range<Duration>>) {
         match delay_range {
             Some(range) => {
@@ -514,6 +540,130 @@ impl DirectAccessClient for MemoryClient {
     }
 }

+impl MemoryClient {
+    async fn compute_reconstruction_ranges(
+        &self,
+        file_id: &MerkleHash,
+        bytes_range: Option<FileRange>,
+    ) -> Result<xorb_utils::ReconstructionRangesResult> {
+        let file_info = {
+            let shard = self.shard.read().await;
+            match shard.get_file_reconstruction_info(file_id) {
+                Some(fi) => fi,
+                None => return Ok(None),
+            }
+        };
+
+        let xorbs = self.xorbs.read().await;
+        xorb_utils::compute_reconstruction_ranges(&file_info, bytes_range, &mut |hash| {
+            let storage = xorbs.get(hash).ok_or_else(|| {
+                error!("Unable to find xorb in memory CAS {:?}", hash);
+                CasClientError::XORBNotFound(*hash)
+            })?;
+            Ok(match storage {
+                XorbStorage::Materialized(entry) => entry.xorb_object.clone(),
+                XorbStorage::Random(xorb) => xorb.get_xorb_object(),
+            })
+        })
+    }
+
+    /// V1 reconstruction: returns per-range presigned URLs.
+    pub async fn get_reconstruction_v1(
+        &self,
+        file_id: &MerkleHash,
+        bytes_range: Option<FileRange>,
+    ) -> Result<Option<QueryReconstructionResponse>> {
+        self.apply_api_delay().await;
+
+        let result = self.compute_reconstruction_ranges(file_id, bytes_range).await?;
+        let Some((offset_into_first_range, terms, merged_ranges)) = result else {
+            return Ok(None);
+        };
+
+        if terms.is_empty() {
+            return Ok(Some(QueryReconstructionResponse {
+                offset_into_first_range,
+                terms,
+                fetch_info: HashMap::new(),
+            }));
+        }
+
+        let timestamp = Instant::now();
+        let mut fetch_info: HashMap<HexMerkleHash, Vec<XorbReconstructionFetchInfo>> = HashMap::new();
+        for (hash, ranges) in merged_ranges {
+            let entries = ranges
+                .into_iter()
+                .map(|r| XorbReconstructionFetchInfo {
+                    range: r.chunk_range,
+                    url: generate_fetch_url(&hash, &r.byte_range, timestamp),
+                    url_range: HttpRange::from(r.byte_range),
+                })
+                .collect();
+            fetch_info.insert(hash.into(), entries);
+        }
+
+        Ok(Some(QueryReconstructionResponse {
+            offset_into_first_range,
+            terms,
+            fetch_info,
+        }))
+    }
+
+    /// V2 reconstruction: returns per-xorb multi-range fetch descriptors.
+    pub async fn get_reconstruction_v2(
+        &self,
+        file_id: &MerkleHash,
+        bytes_range: Option<FileRange>,
+    ) -> Result<Option<QueryReconstructionResponseV2>> {
+        self.apply_api_delay().await;
+
+        let result = self.compute_reconstruction_ranges(file_id, bytes_range).await?;
+        let Some((offset_into_first_range, terms, merged_ranges)) = result else {
+            return Ok(None);
+        };
+
+        if terms.is_empty() {
+            return Ok(Some(QueryReconstructionResponseV2 {
+                offset_into_first_range,
+                terms,
+                xorbs: HashMap::new(),
+            }));
+        }
+
+        let timestamp = Instant::now();
+        let max_ranges = self.max_ranges_per_fetch.load(Ordering::Relaxed);
+
+        let mut xorbs: HashMap<HexMerkleHash, Vec<XorbMultiRangeFetch>> = HashMap::new();
+        for (hash, ranges) in merged_ranges {
+            let mut fetch_entries = Vec::new();
+
+            for chunk in ranges.chunks(max_ranges) {
+                let range_descriptors: Vec<XorbRangeDescriptor> = chunk
+                    .iter()
+                    .map(|r| XorbRangeDescriptor {
+                        chunks: r.chunk_range,
+                        bytes: HttpRange::from(r.byte_range),
+                    })
+                    .collect();
+
+                let url = generate_v2_fetch_url(&hash, &range_descriptors, timestamp);
+                fetch_entries.push(XorbMultiRangeFetch {
+                    url,
+                    ranges: range_descriptors,
+                });
+            }
+
+            xorbs.insert(hash.into(), fetch_entries);
+        }
+
+        Ok(Some(QueryReconstructionResponseV2 {
+            offset_into_first_range,
+            terms,
+            xorbs,
+        }))
+    }
+}
+
 #[cfg_attr(not(target_family = "wasm"), async_trait)]
 #[cfg_attr(target_family = "wasm", async_trait(?Send))]
 impl Client for MemoryClient {
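`get_reconstruction_v2` relies on `slice::chunks(max_ranges)` to cap how many ranges share one `XorbMultiRangeFetch` URL. The sketch below isolates that grouping step. `group_ranges` is a hypothetical helper, and the clamp to at least 1 is an addition of this sketch (the standard library's `chunks` panics when the chunk size is 0); the code in the diff passes the configured value through unchanged.

```rust
/// Splits `ranges` into ordered groups of at most `max_ranges` entries.
/// `max_ranges` is clamped to 1 because `slice::chunks(0)` panics.
/// With `usize::MAX` every range lands in a single group (no splitting).
pub fn group_ranges<T: Clone>(ranges: &[T], max_ranges: usize) -> Vec<Vec<T>> {
    ranges
        .chunks(max_ranges.max(1))
        .map(|group| group.to_vec())
        .collect()
}
```

With `max_ranges = 1`, as in `check_v2_max_ranges`, every group holds exactly one range, so each fetch entry degenerates to a single-range request.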
@@ -651,194 +801,8 @@ impl Client for MemoryClient {
         &self,
         file_id: &MerkleHash,
         bytes_range: Option<FileRange>,
-    ) -> Result<Option<QueryReconstructionResponse>> {
-        self.apply_api_delay().await;
-        let file_info = {
-            let shard = self.shard.read().await;
-            match shard.get_file_reconstruction_info(file_id) {
-                Some(fi) => fi,
-                None => return Ok(None),
-            }
-        };
-
-        let total_file_size: u64 = file_info.file_size();
-
-        // Handle range validation and truncation
-        let file_range = if let Some(range) = bytes_range {
-            // If the entire range is out of bounds, return None (like RemoteClient does for 416)
-            if range.start >= total_file_size {
-                // For empty files (size 0), only the first query (start == 0) should return the empty reconstruction
-                // All subsequent queries should return None to prevent infinite remainder loops
-                if total_file_size == 0 && range.start == 0 {
-                    // Empty file - return valid but empty reconstruction
-                    return Ok(Some(QueryReconstructionResponse {
-                        offset_into_first_range: 0,
-                        terms: vec![],
-                        fetch_info: HashMap::new(),
-                    }));
-                }
-                return Ok(None);
-            }
-            FileRange::new(range.start, range.end.min(total_file_size))
-        } else {
-            // No range specified - handle empty files
-            if total_file_size == 0 {
-                return Ok(Some(QueryReconstructionResponse {
-                    offset_into_first_range: 0,
-                    terms: vec![],
-                    fetch_info: HashMap::new(),
-                }));
-            }
-            FileRange::full()
-        };
-
-        // Find the first segment that contains bytes in our range
-        let mut s_idx = 0;
-        let mut cumulative_bytes = 0u64;
-        let mut first_chunk_byte_start;
-
-        loop {
-            if s_idx >= file_info.segments.len() {
-                return Err(CasClientError::InvalidRange);
-            }
-
-            let n = file_info.segments[s_idx].unpacked_segment_bytes as u64;
-            if cumulative_bytes + n > file_range.start {
-                assert_ge!(file_range.start, cumulative_bytes);
-                first_chunk_byte_start = cumulative_bytes;
-                break;
-            } else {
-                cumulative_bytes += n;
-                s_idx += 1;
-            }
-        }
-
-        let mut terms = Vec::new();
-
-        #[derive(Clone)]
-        struct FetchInfoIntermediate {
-            chunk_range: ChunkRange,
-            byte_range: FileRange,
-        }
-
-        let mut fetch_info_map: MerkleHashMap<Vec<FetchInfoIntermediate>> = MerkleHashMap::new();
-
-        let xorbs = self.xorbs.read().await;
-
-        while s_idx < file_info.segments.len() && cumulative_bytes < file_range.end {
-            let mut segment = file_info.segments[s_idx].clone();
-            let mut chunk_range = ChunkRange::new(segment.chunk_index_start, segment.chunk_index_end);
-
-            let storage = xorbs.get(&segment.xorb_hash).ok_or_else(|| {
-                error!("Unable to find xorb in memory CAS {:?}", segment.xorb_hash);
-                CasClientError::XORBNotFound(segment.xorb_hash)
-            })?;
-            let xorb_footer = match storage {
-                XorbStorage::Materialized(entry) => entry.xorb_object.clone(),
-                XorbStorage::Random(xorb) => xorb.get_xorb_object(),
-            };
-
-            // Prune first segment on chunk boundaries
-            if cumulative_bytes < file_range.start {
-                while chunk_range.start < chunk_range.end {
-                    let next_chunk_size = xorb_footer.uncompressed_chunk_length(chunk_range.start)? as u64;
-
-                    if cumulative_bytes + next_chunk_size <= file_range.start {
-                        cumulative_bytes += next_chunk_size;
-                        first_chunk_byte_start += next_chunk_size;
-                        segment.unpacked_segment_bytes -= next_chunk_size as u32;
-                        chunk_range.start += 1;
-                        debug_assert_lt!(chunk_range.start, chunk_range.end);
-                    } else {
-                        break;
-                    }
-                }
-            }
-
-            // Prune last segment on chunk boundaries
-            if cumulative_bytes + segment.unpacked_segment_bytes as u64 > file_range.end {
-                while chunk_range.end > chunk_range.start {
-                    let last_chunk_size = xorb_footer.uncompressed_chunk_length(chunk_range.end - 1)?;
-
-                    if cumulative_bytes + (segment.unpacked_segment_bytes - last_chunk_size) as u64 >= file_range.end {
-                        chunk_range.end -= 1;
-                        segment.unpacked_segment_bytes -= last_chunk_size;
-                        debug_assert_lt!(chunk_range.start, chunk_range.end);
-                        assert_gt!(segment.unpacked_segment_bytes, 0);
-                    } else {
-                        break;
-                    }
-                }
-            }
-
-            let (byte_start, byte_end) = xorb_footer.get_byte_offset(chunk_range.start, chunk_range.end)?;
-            let byte_range = FileRange::new(byte_start as u64, byte_end as u64);
-
-            let xorb_reconstruction_term = XorbReconstructionTerm {
-                hash: segment.xorb_hash.into(),
-                unpacked_length: segment.unpacked_segment_bytes,
-                range: chunk_range,
-            };
-
-            terms.push(xorb_reconstruction_term);
-
-            let fetch_info_intermediate = FetchInfoIntermediate {
-                chunk_range,
-                byte_range,
-            };
-
-            fetch_info_map
-                .entry(segment.xorb_hash)
-                .or_default()
-                .push(fetch_info_intermediate);
-
-            cumulative_bytes += segment.unpacked_segment_bytes as u64;
-            s_idx += 1;
-        }
-
-        assert!(!terms.is_empty());
-
-        let timestamp = Instant::now();
-
-        // Sort and merge adjacent/overlapping ranges in each fetch_info Vec
-        let mut merged_fetch_info_map: HashMap<HexMerkleHash, Vec<XorbReconstructionFetchInfo>> = HashMap::new();
-        for (hash, mut fi_vec) in fetch_info_map {
-            fi_vec.sort_by_key(|fi| fi.chunk_range.start);
-
-            let mut merged: Vec<XorbReconstructionFetchInfo> = Vec::new();
-            let mut idx = 0;
-
-            while idx < fi_vec.len() {
-                let mut new_fi = fi_vec[idx].clone();
-
-                while idx + 1 < fi_vec.len() {
-                    let next_fi = &fi_vec[idx + 1];
-                    if next_fi.chunk_range.start <= new_fi.chunk_range.end {
-                        new_fi.chunk_range.end = next_fi.chunk_range.end.max(new_fi.chunk_range.end);
-                        new_fi.byte_range.end = next_fi.byte_range.end.max(new_fi.byte_range.end);
-                        idx += 1;
-                    } else {
-                        break;
-                    }
-                }
-
-                merged.push(XorbReconstructionFetchInfo {
-                    range: new_fi.chunk_range,
-                    url: generate_fetch_url(&hash, &new_fi.byte_range, timestamp),
-                    url_range: HttpRange::from(new_fi.byte_range),
-                });
-
-                idx += 1;
-            }
-
-            merged_fetch_info_map.insert(hash.into(), merged);
-        }
-
-        Ok(Some(QueryReconstructionResponse {
-            offset_into_first_range: file_range.start - first_chunk_byte_start,
-            terms,
-            fetch_info: merged_fetch_info_map,
-        }))
+    ) -> Result<Option<QueryReconstructionResponseV2>> {
+        self.get_reconstruction_v2(file_id, bytes_range).await
     }

     async fn batch_get_reconstruction(&self, file_ids: &[MerkleHash]) -> Result<BatchQueryReconstructionResponse> {
@@ -847,7 +811,7 @@ impl Client for MemoryClient {
|
||||
let mut fetch_info_map: HashMap<HexMerkleHash, Vec<XorbReconstructionFetchInfo>> = HashMap::new();
|
||||
|
||||
for file_id in file_ids {
|
||||
if let Some(response) = self.get_reconstruction(file_id, None).await? {
|
||||
if let Some(response) = self.get_reconstruction_v1(file_id, None).await? {
|
||||
let hex_hash: HexMerkleHash = (*file_id).into();
|
||||
files.insert(hex_hash, response.terms);
|
||||
|
||||
@@ -876,8 +840,8 @@ impl Client for MemoryClient {
|
||||
uncompressed_size_if_known: Option<usize>,
|
||||
) -> Result<(Bytes, Vec<u32>)> {
|
||||
self.apply_api_delay().await;
|
||||
let (url, range) = url_info.retrieve_url().await?;
|
||||
let (xorb_hash, _url_byte_range, url_timestamp) = parse_fetch_url(&url)?;
|
||||
let (url, http_ranges) = url_info.retrieve_url().await?;
|
||||
let (xorb_hash, url_timestamp) = parse_any_fetch_url(&url)?;
|
||||
|
||||
// Check if URL has expired
|
||||
let expiration_ms = self.url_expiration_ms.load(Ordering::Relaxed);
|
||||
@@ -889,12 +853,17 @@ impl Client for MemoryClient {
|
||||
let xorbs = self.xorbs.read().await;
|
||||
let storage = xorbs.get(&xorb_hash).ok_or(CasClientError::XORBNotFound(xorb_hash))?;
|
||||
|
||||
// Extract the byte range from the serialized data and deserialize
|
||||
let start = range.start as usize;
|
||||
let end = range.end as usize + 1; // HttpRange is inclusive end
|
||||
let transfer_len = (end - start) as u64;
|
||||
// Extract each byte range from the serialized data and deserialize
|
||||
let mut all_decompressed = Vec::new();
|
||||
let mut all_chunk_indices = Vec::<u32>::new();
|
||||
let mut total_transfer = 0u64;
|
||||
|
||||
let (decompressed_data, chunk_byte_indices) = match storage {
|
||||
for http_range in &http_ranges {
|
||||
let start = http_range.start as usize;
|
||||
let end = http_range.end as usize + 1;
|
||||
total_transfer += http_range.length();
|
||||
|
||||
let (data, chunk_indices) = match storage {
|
||||
XorbStorage::Materialized(entry) => {
|
||||
let range_data = &entry.serialized_data[start..end];
|
||||
xet_core_structures::xorb_object::deserialize_chunks(&mut Cursor::new(range_data))?
|
||||
@@ -905,20 +874,28 @@ impl Client for MemoryClient {
|
||||
},
|
||||
};
|
||||
|
||||
xet_core_structures::xorb_object::append_chunk_segment(
|
||||
&mut all_decompressed,
|
||||
&mut all_chunk_indices,
|
||||
&data,
|
||||
&chunk_indices,
|
||||
);
|
||||
}
|
||||
|
||||
if let Some(expected) = uncompressed_size_if_known {
|
||||
debug_assert_eq!(
|
||||
decompressed_data.len(),
|
||||
all_decompressed.len(),
|
||||
expected,
|
||||
"get_file_term_data: expected {} bytes, got {}",
|
||||
expected,
|
||||
decompressed_data.len()
|
||||
all_decompressed.len()
|
||||
);
|
||||
}
|
||||
|
||||
if let Some(ref cb) = progress_callback {
|
||||
cb(transfer_len, transfer_len, transfer_len);
|
||||
cb(total_transfer, total_transfer, total_transfer);
|
||||
}
|
||||
Ok((Bytes::from(decompressed_data), chunk_byte_indices))
|
||||
Ok((Bytes::from(all_decompressed), all_chunk_indices))
|
||||
}
|
||||
}
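
The multi-range loop in the hunk above walks every inclusive HTTP byte range, slices the stored xorb bytes, and sums the transfer size. The following is a minimal sketch of just that slicing step, assuming a hypothetical `InclusiveRange` type in place of the crate's `HttpRange` and a hypothetical `gather_ranges` helper. It does not deserialize chunks or call `append_chunk_segment`, which the real code does for each range.

```rust
/// Inclusive byte range, mirroring HTTP `Range: bytes=start-end` semantics.
/// Hypothetical stand-in for the crate's `HttpRange`.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct InclusiveRange {
    pub start: u64,
    pub end: u64,
}

impl InclusiveRange {
    /// Number of bytes covered; `end` is inclusive, hence the `+ 1`.
    pub fn length(&self) -> u64 {
        self.end - self.start + 1
    }
}

/// Copies each requested range out of `blob`, in order, and reports the total
/// number of bytes transferred. Returns `None` if a range is inverted or
/// reaches past the end of `blob`.
pub fn gather_ranges(blob: &[u8], ranges: &[InclusiveRange]) -> Option<(Vec<u8>, u64)> {
    let mut out = Vec::new();
    let mut total_transfer = 0u64;
    for r in ranges {
        if r.end < r.start {
            return None;
        }
        let start = r.start as usize;
        // Convert the inclusive end into an exclusive slice bound.
        let end = (r.end as usize).checked_add(1)?;
        // `get` returns `None` instead of panicking on out-of-bounds slices.
        out.extend_from_slice(blob.get(start..end)?);
        total_transfer += r.length();
    }
    Some((out, total_transfer))
}
```

Returning `None` on a bad range is a choice made for this sketch; the diff itself indexes the slice directly.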
|
||||
|
||||
@@ -946,6 +923,19 @@ fn parse_fetch_url(url: &str) -> Result<(MerkleHash, FileRange, Instant)> {
|
||||
Ok((hash, byte_range, timestamp))
|
||||
}
|
||||
|
||||
fn generate_v2_fetch_url(hash: &MerkleHash, ranges: &[XorbRangeDescriptor], timestamp: Instant) -> String {
    xorb_utils::generate_v2_fetch_url(hash, ranges, timestamp)
}

/// Parse either a V1 or V2 fetch URL, returning (hash, timestamp).
fn parse_any_fetch_url(url: &str) -> Result<(MerkleHash, Instant)> {
    if let Ok((hash, _, ts)) = parse_fetch_url(url) {
        return Ok((hash, ts));
    }
    let (hash, ts, _) = xorb_utils::parse_v2_fetch_url(url)?;
    Ok((hash, ts))
}
|
||||
|
||||
#[cfg(all(test, not(target_family = "wasm")))]
|
||||
mod tests {
|
||||
use super::super::client_testing_utils::ClientTestingUtils;
|
||||
@@ -1062,7 +1052,7 @@ mod tests {
|
||||
assert_eq!(range_data.as_ref(), &file2.data[start as usize..end as usize]);
|
||||
|
||||
// Reconstruction workflow
|
||||
let recon = client.get_reconstruction(&file2.file_hash, None).await.unwrap().unwrap();
|
||||
let recon = client.get_reconstruction_v1(&file2.file_hash, None).await.unwrap().unwrap();
|
||||
for term in &recon.terms {
|
||||
let xorb_hash: MerkleHash = term.hash.into();
|
||||
for fetch_info in recon.fetch_info.get(&term.hash).unwrap() {
|
||||
|
||||
@@ -34,6 +34,7 @@ mod simulation_server;
|
||||
#[cfg(unix)]
|
||||
#[cfg(not(target_family = "wasm"))]
|
||||
pub mod socket_proxy;
|
||||
pub(crate) mod xorb_utils;
|
||||
|
||||
pub use client_testing_utils::{ClientTestingUtils, RandomFileContents};
|
||||
#[cfg(not(target_family = "wasm"))]
|
||||
|
||||
@@ -132,7 +132,7 @@ impl Client for RemoteSimulationClient {
|
||||
&self,
|
||||
file_id: &xet_core_structures::merklehash::MerkleHash,
|
||||
bytes_range: Option<crate::cas_types::FileRange>,
|
||||
) -> Result<Option<crate::cas_types::QueryReconstructionResponse>> {
|
||||
) -> Result<Option<crate::cas_types::QueryReconstructionResponseV2>> {
|
||||
self.inner.get_reconstruction(file_id, bytes_range).await
|
||||
}
|
||||
|
||||
|
||||
@@ -440,7 +440,7 @@ impl Client for LocalTestServer {
|
||||
&self,
|
||||
file_id: &xet_core_structures::merklehash::MerkleHash,
|
||||
bytes_range: Option<crate::cas_types::FileRange>,
|
||||
) -> Result<Option<crate::cas_types::QueryReconstructionResponse>> {
|
||||
) -> Result<Option<crate::cas_types::QueryReconstructionResponseV2>> {
|
||||
self.remote_simulation_client.get_reconstruction(file_id, bytes_range).await
|
||||
}
|
||||
|
||||
@@ -512,6 +512,30 @@ impl DirectAccessClient for LocalTestServer {
|
||||
self.client.set_api_delay_range(delay_range);
|
||||
}
|
||||
|
||||
fn set_max_ranges_per_fetch(&self, max_ranges: usize) {
|
||||
self.client.set_max_ranges_per_fetch(max_ranges);
|
||||
}
|
||||
|
||||
fn disable_v2_reconstruction(&self, status_code: u16) {
|
||||
self.client.disable_v2_reconstruction(status_code);
|
||||
}
|
||||
|
||||
async fn get_reconstruction_v1(
|
||||
&self,
|
||||
file_id: &xet_core_structures::merklehash::MerkleHash,
|
||||
bytes_range: Option<crate::cas_types::FileRange>,
|
||||
) -> Result<Option<crate::cas_types::QueryReconstructionResponse>> {
|
||||
self.client.get_reconstruction_v1(file_id, bytes_range).await
|
||||
}
|
||||
|
||||
async fn get_reconstruction_v2(
|
||||
&self,
|
||||
file_id: &xet_core_structures::merklehash::MerkleHash,
|
||||
bytes_range: Option<crate::cas_types::FileRange>,
|
||||
) -> Result<Option<crate::cas_types::QueryReconstructionResponseV2>> {
|
||||
self.client.get_reconstruction_v2(file_id, bytes_range).await
|
||||
}
|
||||
|
||||
async fn apply_api_delay(&self) {
|
||||
self.client.apply_api_delay().await;
|
||||
}
|
||||
@@ -690,31 +714,22 @@ mod tests {
|
||||
|
||||
// Fetch term endpoint - verify URLs are HTTP and data can be fetched
|
||||
let http_client = reqwest::Client::new();
|
||||
for fetch_infos in remote_recon.fetch_info.values() {
|
||||
for fi in fetch_infos {
|
||||
assert!(fi.url.starts_with("http://"));
|
||||
assert!(fi.url.contains("/fetch_term?term="));
|
||||
let response = http_client.get(&fi.url).send().await.unwrap();
|
||||
for multi_range_fetches in remote_recon.xorbs.values() {
|
||||
for mrf in multi_range_fetches {
|
||||
assert!(mrf.url.starts_with("http://"));
|
||||
assert!(mrf.url.contains("/fetch_term?term="));
|
||||
let response = http_client.get(&mrf.url).send().await.unwrap();
|
||||
assert!(response.status().is_success());
|
||||
assert!(!response.bytes().await.unwrap().is_empty());
|
||||
}
|
||||
}
|
||||
|
||||
// Fetch term with range request
|
||||
let first_fi = &remote_recon.fetch_info.values().next().unwrap()[0];
|
||||
let full_data = http_client.get(&first_fi.url).send().await.unwrap().bytes().await.unwrap();
|
||||
if full_data.len() > 100 {
|
||||
let range_resp = http_client
|
||||
.get(&first_fi.url)
|
||||
.header(reqwest::header::RANGE, "bytes=0-99")
|
||||
.send()
|
||||
.await
|
||||
.unwrap();
|
||||
assert!(range_resp.status().is_success());
|
||||
let range_data = range_resp.bytes().await.unwrap();
|
||||
assert_eq!(range_data.len(), 100);
|
||||
assert_eq!(&range_data[..], &full_data[..100]);
|
||||
}
|
||||
// Verify V2 fetch URLs return consistent data across multiple requests.
|
||||
let first_mrf = &remote_recon.xorbs.values().next().unwrap()[0];
|
||||
let data_1 = http_client.get(&first_mrf.url).send().await.unwrap().bytes().await.unwrap();
|
||||
let data_2 = http_client.get(&first_mrf.url).send().await.unwrap().bytes().await.unwrap();
|
||||
assert_eq!(data_1, data_2);
|
||||
assert!(!data_1.is_empty());
|
||||
}
|
||||
|
||||
/// Tests that invalid requests return appropriate error responses.
|
||||
@@ -762,16 +777,16 @@ mod tests {
|
||||
.await
|
||||
.unwrap()
|
||||
.unwrap();
|
||||
for (hash, fetch_infos) in &recon1.fetch_info {
|
||||
for fi in fetch_infos {
|
||||
for (hash, multi_range_fetches) in &recon1.xorbs {
|
||||
for mrf in multi_range_fetches {
|
||||
assert!(
|
||||
fi.url.starts_with("http://") || fi.url.starts_with("https://"),
|
||||
mrf.url.starts_with("http://") || mrf.url.starts_with("https://"),
|
||||
"URL for hash {} should be HTTP, got: {}",
|
||||
hash,
|
||||
fi.url
|
||||
mrf.url
|
||||
);
|
||||
assert!(fi.url.contains("/fetch_term?term="));
|
||||
assert!(!fi.url.contains("\":"));
|
||||
assert!(mrf.url.contains("/fetch_term?term="));
|
||||
assert!(!mrf.url.contains("\":"));
|
||||
}
|
||||
}
|
||||
|
||||
@@ -782,10 +797,10 @@ mod tests {
|
||||
.await
|
||||
.unwrap()
|
||||
.unwrap();
|
||||
assert!(multi_recon.fetch_info.len() >= 2);
|
||||
for fetch_infos in multi_recon.fetch_info.values() {
|
||||
for fi in fetch_infos {
|
||||
assert!(fi.url.starts_with("http://"));
|
||||
assert!(multi_recon.xorbs.len() >= 2);
|
||||
for multi_range_fetches in multi_recon.xorbs.values() {
|
||||
for mrf in multi_range_fetches {
|
||||
assert!(mrf.url.starts_with("http://"));
|
||||
}
|
||||
}
|
||||
|
||||
@@ -798,18 +813,18 @@ mod tests {
|
||||
.await
|
||||
.unwrap()
|
||||
.unwrap();
|
||||
for fetch_infos in range_recon.fetch_info.values() {
|
||||
for fi in fetch_infos {
|
||||
assert!(fi.url.starts_with("http://"));
|
||||
assert!(fi.url.contains("/fetch_term?term="));
|
||||
for multi_range_fetches in range_recon.xorbs.values() {
|
||||
for mrf in multi_range_fetches {
|
||||
assert!(mrf.url.starts_with("http://"));
|
||||
assert!(mrf.url.contains("/fetch_term?term="));
|
||||
}
|
||||
}
|
||||
|
||||
// Verify all term URLs are fetchable
|
||||
for term in &recon1.terms {
|
||||
let fetch_infos = recon1.fetch_info.get(&term.hash).unwrap();
|
||||
for fi in fetch_infos {
|
||||
let response = http_client.get(&fi.url).send().await.unwrap();
|
||||
let multi_range_fetches = recon1.xorbs.get(&term.hash).unwrap();
|
||||
for mrf in multi_range_fetches {
|
||||
let response = http_client.get(&mrf.url).send().await.unwrap();
|
||||
assert!(response.status().is_success());
|
||||
assert!(!response.bytes().await.unwrap().is_empty());
|
||||
}
|
||||
@@ -860,9 +875,9 @@ mod tests {
|
||||
let expected_term = &file.terms[term_idx];
|
||||
assert_eq!(recon_term.hash.0, expected_term.xorb_hash);
|
||||
|
||||
// Verify fetch_info exists for each XORB
|
||||
let fetch_infos = recon.fetch_info.get(&recon_term.hash).unwrap();
|
||||
assert!(!fetch_infos.is_empty());
|
||||
// Verify xorbs has entry for each term
|
||||
let multi_range_fetches = recon.xorbs.get(&recon_term.hash).unwrap();
|
||||
assert!(!multi_range_fetches.is_empty());
|
||||
}
|
||||
|
||||
// Verify the complete file can be retrieved correctly via LocalClient
|
||||
|
||||
499
xet_client/src/cas_client/simulation/xorb_utils.rs
Normal file
@@ -0,0 +1,499 @@
|
||||
//! Shared utilities for reconstruction range computation and V2 URL encoding.
//!
//! This module consolidates logic used by both `MemoryClient` and `LocalClient`
//! for computing reconstruction ranges from file segment info, merging adjacent
//! ranges, and encoding/decoding V2 fetch URLs.
|
||||
|
||||
use base64::Engine;
|
||||
use base64::engine::general_purpose::URL_SAFE_NO_PAD;
|
||||
use more_asserts::{assert_ge, assert_gt, debug_assert_lt};
|
||||
use tokio::time::{Duration, Instant};
|
||||
use xet_core_structures::MerkleHashMap;
|
||||
use xet_core_structures::merklehash::MerkleHash;
|
||||
use xet_core_structures::metadata_shard::file_structs::MDBFileInfo;
|
||||
use xet_core_structures::xorb_object::XorbObject;
|
||||
|
||||
use crate::cas_client::error::{CasClientError, Result};
|
||||
use crate::cas_types::{ChunkRange, FileRange, HttpRange, XorbRangeDescriptor, XorbReconstructionTerm};
|
||||
|
||||
lazy_static::lazy_static! {
    /// Reference instant for URL timestamps. Set one year before first use (when
    /// representable) so that tests can encode timestamps earlier than the current time.
    pub(crate) static ref REFERENCE_INSTANT: Instant = {
        let now = Instant::now();
        now.checked_sub(Duration::from_secs(365 * 24 * 60 * 60))
            .unwrap_or(now)
    };
}
|
||||
|
||||
/// A merged byte/chunk range for a single xorb.
|
||||
#[derive(Clone, Debug)]
|
||||
pub(crate) struct MergedRange {
|
||||
pub chunk_range: ChunkRange,
|
||||
pub byte_range: FileRange,
|
||||
}
|
||||
|
||||
/// Result of `compute_reconstruction_ranges`: the offset into the first range,
|
||||
/// the list of reconstruction terms, and the merged ranges per xorb hash.
|
||||
pub(crate) type ReconstructionRangesResult =
|
||||
Option<(u64, Vec<XorbReconstructionTerm>, MerkleHashMap<Vec<MergedRange>>)>;
|
||||
|
||||
/// Computes reconstruction ranges from file segment info.
///
/// Iterates the segments in `file_info`, prunes chunk boundaries to the
/// requested `bytes_range`, and merges adjacent/overlapping ranges per xorb.
///
/// `get_xorb_footer` is called for each unique xorb hash encountered to obtain
/// the `XorbObject` metadata needed for chunk-level byte offset calculations.
///
/// Returns `Ok(None)` when the range is out of bounds, or
/// `Ok(Some((offset_into_first_range, terms, merged_ranges_per_xorb)))`.
|
||||
pub(crate) fn compute_reconstruction_ranges(
|
||||
file_info: &MDBFileInfo,
|
||||
bytes_range: Option<FileRange>,
|
||||
get_xorb_footer: &mut dyn FnMut(&MerkleHash) -> Result<XorbObject>,
|
||||
) -> Result<ReconstructionRangesResult> {
|
||||
let total_file_size: u64 = file_info.file_size();
|
||||
|
||||
let file_range = if let Some(range) = bytes_range {
|
||||
if range.start >= total_file_size {
|
||||
if total_file_size == 0 && range.start == 0 {
|
||||
return Ok(Some((0, vec![], MerkleHashMap::new())));
|
||||
}
|
||||
return Ok(None);
|
||||
}
|
||||
FileRange::new(range.start, range.end.min(total_file_size))
|
||||
} else {
|
||||
if total_file_size == 0 {
|
||||
return Ok(Some((0, vec![], MerkleHashMap::new())));
|
||||
}
|
||||
FileRange::full()
|
||||
};
|
||||
|
||||
let mut s_idx = 0;
|
||||
let mut cumulative_bytes = 0u64;
|
||||
let mut first_chunk_byte_start;
|
||||
|
||||
loop {
|
||||
if s_idx >= file_info.segments.len() {
|
||||
return Err(CasClientError::InvalidRange);
|
||||
}
|
||||
|
||||
let n = file_info.segments[s_idx].unpacked_segment_bytes as u64;
|
||||
if cumulative_bytes + n > file_range.start {
|
||||
assert_ge!(file_range.start, cumulative_bytes);
|
||||
first_chunk_byte_start = cumulative_bytes;
|
||||
break;
|
||||
} else {
|
||||
cumulative_bytes += n;
|
||||
s_idx += 1;
|
||||
}
|
||||
}
|
||||
|
||||
let mut terms = Vec::new();
|
||||
|
||||
#[derive(Clone)]
|
||||
struct FetchInfoIntermediate {
|
||||
chunk_range: ChunkRange,
|
||||
byte_range: FileRange,
|
||||
}
|
||||
|
||||
let mut fetch_info_map: MerkleHashMap<Vec<FetchInfoIntermediate>> = MerkleHashMap::new();
|
||||
|
||||
while s_idx < file_info.segments.len() && cumulative_bytes < file_range.end {
|
||||
let mut segment = file_info.segments[s_idx].clone();
|
||||
let mut chunk_range = ChunkRange::new(segment.chunk_index_start, segment.chunk_index_end);
|
||||
|
||||
let xorb_footer = get_xorb_footer(&segment.xorb_hash)?;
|
||||
|
||||
if cumulative_bytes < file_range.start {
|
||||
while chunk_range.start < chunk_range.end {
|
||||
let next_chunk_size = xorb_footer.uncompressed_chunk_length(chunk_range.start)? as u64;
|
||||
if cumulative_bytes + next_chunk_size <= file_range.start {
|
||||
cumulative_bytes += next_chunk_size;
|
||||
first_chunk_byte_start += next_chunk_size;
|
||||
segment.unpacked_segment_bytes -= next_chunk_size as u32;
|
||||
chunk_range.start += 1;
|
||||
debug_assert_lt!(chunk_range.start, chunk_range.end);
|
||||
} else {
|
||||
break;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
if cumulative_bytes + segment.unpacked_segment_bytes as u64 > file_range.end {
|
||||
while chunk_range.end > chunk_range.start {
|
||||
let last_chunk_size = xorb_footer.uncompressed_chunk_length(chunk_range.end - 1)?;
|
||||
if cumulative_bytes + (segment.unpacked_segment_bytes - last_chunk_size) as u64 >= file_range.end {
|
||||
chunk_range.end -= 1;
|
||||
segment.unpacked_segment_bytes -= last_chunk_size;
|
||||
debug_assert_lt!(chunk_range.start, chunk_range.end);
|
||||
assert_gt!(segment.unpacked_segment_bytes, 0);
|
||||
} else {
|
||||
break;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
let (byte_start, byte_end) = xorb_footer.get_byte_offset(chunk_range.start, chunk_range.end)?;
|
||||
let byte_range = FileRange::new(byte_start as u64, byte_end as u64);
|
||||
|
||||
terms.push(XorbReconstructionTerm {
|
||||
hash: segment.xorb_hash.into(),
|
||||
unpacked_length: segment.unpacked_segment_bytes,
|
||||
range: chunk_range,
|
||||
});
|
||||
|
||||
fetch_info_map
|
||||
.entry(segment.xorb_hash)
|
||||
.or_default()
|
||||
.push(FetchInfoIntermediate {
|
||||
chunk_range,
|
||||
byte_range,
|
||||
});
|
||||
|
||||
cumulative_bytes += segment.unpacked_segment_bytes as u64;
|
||||
s_idx += 1;
|
||||
}
|
||||
|
||||
debug_assert!(!terms.is_empty());
|
||||
|
||||
let mut merged: MerkleHashMap<Vec<MergedRange>> = MerkleHashMap::new();
|
||||
for (hash, mut fi_vec) in fetch_info_map {
|
||||
fi_vec.sort_by_key(|fi| fi.chunk_range.start);
|
||||
|
||||
let mut result: Vec<MergedRange> = Vec::new();
|
||||
let mut idx = 0;
|
||||
|
||||
while idx < fi_vec.len() {
|
||||
let mut cur = fi_vec[idx].clone();
|
||||
|
||||
while idx + 1 < fi_vec.len() {
|
||||
let next = &fi_vec[idx + 1];
|
||||
if next.chunk_range.start <= cur.chunk_range.end {
|
||||
cur.chunk_range.end = cur.chunk_range.end.max(next.chunk_range.end);
|
||||
cur.byte_range.end = cur.byte_range.end.max(next.byte_range.end);
|
||||
idx += 1;
|
||||
} else {
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
result.push(MergedRange {
|
||||
chunk_range: cur.chunk_range,
|
||||
byte_range: cur.byte_range,
|
||||
});
|
||||
idx += 1;
|
||||
}
|
||||
|
||||
merged.insert(hash, result);
|
||||
}
|
||||
|
||||
Ok(Some((file_range.start - first_chunk_byte_start, terms, merged)))
|
||||
}
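
The final loop of `compute_reconstruction_ranges` sorts each xorb's ranges by chunk start and folds any range that touches or overlaps its predecessor. Below is a minimal sketch of that merge step on its own, assuming a simplified `Span` type (tuples instead of `ChunkRange`/`FileRange`) and a hypothetical `merge_spans` helper; neither name comes from the crate.

```rust
/// Half-open chunk range `[start, end)` paired with the byte span it occupies.
/// Simplified, hypothetical stand-in for `MergedRange`.
#[derive(Clone, Debug, PartialEq)]
pub struct Span {
    pub chunks: (u32, u32),
    pub bytes: (u64, u64),
}

/// Sorts spans by chunk start, then merges every span whose start is at or
/// before the end of the span accumulated so far.
pub fn merge_spans(mut spans: Vec<Span>) -> Vec<Span> {
    spans.sort_by_key(|s| s.chunks.0);
    let mut merged: Vec<Span> = Vec::new();
    for s in spans {
        if let Some(cur) = merged.last_mut() {
            // `<=` folds touching ranges (start == previous end) as well as overlaps.
            if s.chunks.0 <= cur.chunks.1 {
                cur.chunks.1 = cur.chunks.1.max(s.chunks.1);
                cur.bytes.1 = cur.bytes.1.max(s.bytes.1);
                continue;
            }
        }
        merged.push(s);
    }
    merged
}
```

Taking the `max` of the two ends, rather than overwriting with the next end, keeps the result correct when a later range is fully contained in an earlier one.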
|
||||
|
||||
/// Generates a V2 fetch URL: the URL-safe, unpadded base64 encoding of
/// "{hash_hex}:{timestamp_ms}:{r1_start}-{r1_end},...".
pub(crate) fn generate_v2_fetch_url(hash: &MerkleHash, ranges: &[XorbRangeDescriptor], timestamp: Instant) -> String {
    let timestamp_ms = timestamp.saturating_duration_since(*REFERENCE_INSTANT).as_millis() as u64;
    let ranges_str: Vec<String> = ranges.iter().map(|r| format!("{}-{}", r.bytes.start, r.bytes.end)).collect();
    let payload = format!("{}:{}:{}", hash.hex(), timestamp_ms, ranges_str.join(","));
    URL_SAFE_NO_PAD.encode(payload.as_bytes())
}
|
||||
|
||||
/// Parses a V2 fetch URL back into (hash, timestamp, byte ranges).
|
||||
pub(crate) fn parse_v2_fetch_url(url: &str) -> Result<(MerkleHash, Instant, Vec<HttpRange>)> {
|
||||
let bytes = URL_SAFE_NO_PAD.decode(url).map_err(|_| CasClientError::InvalidArguments)?;
|
||||
let payload = String::from_utf8(bytes).map_err(|_| CasClientError::InvalidArguments)?;
|
||||
|
||||
let mut parts = payload.splitn(3, ':');
|
||||
let hash_hex = parts.next().ok_or(CasClientError::InvalidArguments)?;
|
||||
let ts_str = parts.next().ok_or(CasClientError::InvalidArguments)?;
|
||||
let ranges_str = parts.next().ok_or(CasClientError::InvalidArguments)?;
|
||||
|
||||
let hash = MerkleHash::from_hex(hash_hex).map_err(|_| CasClientError::InvalidArguments)?;
|
||||
let timestamp_ms: u64 = ts_str.parse().map_err(|_| CasClientError::InvalidArguments)?;
|
||||
let timestamp = *REFERENCE_INSTANT + Duration::from_millis(timestamp_ms);
|
||||
|
||||
let mut ranges = Vec::new();
|
||||
for r in ranges_str.split(',').filter(|s| !s.is_empty()) {
|
||||
let mut parts = r.splitn(2, '-');
|
||||
let start: u64 = parts
|
||||
.next()
|
||||
.ok_or(CasClientError::InvalidArguments)?
|
||||
.parse()
|
||||
.map_err(|_| CasClientError::InvalidArguments)?;
|
||||
let end: u64 = parts
|
||||
.next()
|
||||
.ok_or(CasClientError::InvalidArguments)?
|
||||
.parse()
|
||||
.map_err(|_| CasClientError::InvalidArguments)?;
|
||||
ranges.push(HttpRange::new(start, end));
|
||||
}
|
||||
|
||||
Ok((hash, timestamp, ranges))
|
||||
}
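
The two functions above wrap a plain-text payload of the form `{hash_hex}:{timestamp_ms}:{start}-{end},...` in base64. The sketch below covers only that inner payload layer with the standard library, so the base64 step and the `MerkleHash`/`Instant` conversions are deliberately left out. `encode_payload` and `decode_payload` are hypothetical names, and ranges are plain `(u64, u64)` tuples rather than `HttpRange`.

```rust
/// Builds the plain-text payload "{hash_hex}:{timestamp_ms}:{s1}-{e1},{s2}-{e2}".
pub fn encode_payload(hash_hex: &str, timestamp_ms: u64, ranges: &[(u64, u64)]) -> String {
    let ranges_str: Vec<String> = ranges.iter().map(|(s, e)| format!("{}-{}", s, e)).collect();
    format!("{}:{}:{}", hash_hex, timestamp_ms, ranges_str.join(","))
}

/// Parses a payload produced by `encode_payload`.
/// Returns `None` if a field is missing or is not a valid number.
pub fn decode_payload(payload: &str) -> Option<(String, u64, Vec<(u64, u64)>)> {
    // At most three fields: hash, timestamp, and the comma-separated range list.
    let mut parts = payload.splitn(3, ':');
    let hash_hex = parts.next()?.to_string();
    let timestamp_ms = parts.next()?.parse::<u64>().ok()?;
    let ranges_str = parts.next()?;

    let mut ranges: Vec<(u64, u64)> = Vec::new();
    // Empty items are skipped, so an empty range list decodes to an empty Vec.
    for item in ranges_str.split(',').filter(|s| !s.is_empty()) {
        let (start, end) = item.split_once('-')?;
        ranges.push((start.parse::<u64>().ok()?, end.parse::<u64>().ok()?));
    }
    Some((hash_hex, timestamp_ms, ranges))
}
```

Because the hash is hex and the remaining fields are decimal digits, `:`, `,`, and `-` can serve as separators without escaping.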
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use xet_core_structures::metadata_shard::file_structs::{
|
||||
FileDataSequenceEntry, FileDataSequenceHeader, MDBFileInfo,
|
||||
};
|
||||
|
||||
use super::super::random_xorb::RandomXorb;
|
||||
use super::*;
|
||||
|
||||
fn make_range_descriptor(chunk_start: u32, chunk_end: u32, byte_start: u64, byte_end: u64) -> XorbRangeDescriptor {
|
||||
XorbRangeDescriptor {
|
||||
chunks: ChunkRange::new(chunk_start, chunk_end),
|
||||
bytes: HttpRange::new(byte_start, byte_end),
|
||||
}
|
||||
}
|
||||
|
||||
fn build_xorb(chunk_sizes: &[usize]) -> (MerkleHash, XorbObject) {
|
||||
let seed_and_sizes: Vec<(u64, u32)> =
|
||||
chunk_sizes.iter().enumerate().map(|(i, &s)| (i as u64, s as u32)).collect();
|
||||
let xorb = RandomXorb::new(&seed_and_sizes);
|
||||
let xorb_object = xorb.get_xorb_object();
|
||||
let hash = xorb.xorb_hash();
|
||||
(hash, xorb_object)
|
||||
}
|
||||
|
||||
fn make_segment(
|
||||
xorb_hash: MerkleHash,
|
||||
chunk_start: u32,
|
||||
chunk_end: u32,
|
||||
unpacked_bytes: u32,
|
||||
) -> FileDataSequenceEntry {
|
||||
FileDataSequenceEntry {
|
||||
xorb_hash,
|
||||
xorb_flags: 0,
|
||||
chunk_index_start: chunk_start,
|
||||
chunk_index_end: chunk_end,
|
||||
unpacked_segment_bytes: unpacked_bytes,
|
||||
}
|
||||
}
|
||||
|
||||
fn make_file_info(segments: Vec<FileDataSequenceEntry>) -> MDBFileInfo {
|
||||
MDBFileInfo {
|
||||
metadata: FileDataSequenceHeader {
|
||||
file_hash: MerkleHash::default(),
|
||||
..Default::default()
|
||||
},
|
||||
segments,
|
||||
verification: vec![],
|
||||
metadata_ext: None,
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_v2_url_roundtrip() {
|
||||
let hash = MerkleHash::from_hex("a32d3a2a2e83e4d41b04899f13a8e891f4dd3f2ed940f96f91da7bf55b7ee299").unwrap();
|
||||
let ranges = vec![
|
||||
make_range_descriptor(0, 3, 0, 1024),
|
||||
make_range_descriptor(5, 8, 2048, 4096),
|
||||
];
|
||||
let timestamp = Instant::now();
|
||||
|
||||
let url = generate_v2_fetch_url(&hash, &ranges, timestamp);
|
||||
let (parsed_hash, parsed_ts, parsed_ranges) = parse_v2_fetch_url(&url).unwrap();
|
||||
|
||||
assert_eq!(hash, parsed_hash);
|
||||
assert_eq!(parsed_ranges.len(), 2);
|
||||
assert_eq!(parsed_ranges[0].start, 0);
|
||||
assert_eq!(parsed_ranges[0].end, 1024);
|
||||
assert_eq!(parsed_ranges[1].start, 2048);
|
||||
assert_eq!(parsed_ranges[1].end, 4096);
|
||||
|
||||
let diff = if parsed_ts > timestamp {
|
||||
parsed_ts - timestamp
|
||||
} else {
|
||||
timestamp - parsed_ts
|
||||
};
|
||||
assert!(diff < Duration::from_millis(2));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_v2_url_single_range() {
|
||||
let hash = MerkleHash::default();
|
||||
let ranges = vec![make_range_descriptor(0, 1, 100, 200)];
|
||||
let timestamp = Instant::now();
|
||||
|
||||
let url = generate_v2_fetch_url(&hash, &ranges, timestamp);
|
||||
let (_, _, parsed_ranges) = parse_v2_fetch_url(&url).unwrap();
|
||||
|
||||
assert_eq!(parsed_ranges.len(), 1);
|
||||
assert_eq!(parsed_ranges[0].start, 100);
|
||||
assert_eq!(parsed_ranges[0].end, 200);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_v2_url_invalid_base64() {
|
||||
assert!(parse_v2_fetch_url("not-valid!!!").is_err());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_v2_url_invalid_payload() {
|
||||
let url = URL_SAFE_NO_PAD.encode(b"bad");
|
||||
assert!(parse_v2_fetch_url(&url).is_err());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_compute_ranges_single_segment() {
|
||||
let (xorb_hash, xorb_object) = build_xorb(&[100, 200, 300]);
|
||||
let file_info = make_file_info(vec![make_segment(xorb_hash, 0, 3, 600)]);
|
||||
|
||||
let result = compute_reconstruction_ranges(&file_info, None, &mut |_| Ok(xorb_object.clone())).unwrap();
|
||||
let (offset, terms, merged) = result.unwrap();
|
||||
|
||||
assert_eq!(offset, 0);
|
||||
assert_eq!(terms.len(), 1);
|
||||
assert_eq!(terms[0].unpacked_length, 600);
|
||||
assert_eq!(terms[0].range.start, 0);
|
||||
assert_eq!(terms[0].range.end, 3);
|
||||
|
||||
let xorb_ranges = merged.get(&xorb_hash).unwrap();
|
||||
assert_eq!(xorb_ranges.len(), 1);
|
||||
assert_eq!(xorb_ranges[0].chunk_range.start, 0);
|
||||
assert_eq!(xorb_ranges[0].chunk_range.end, 3);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_compute_ranges_partial_range() {
|
||||
let (xorb_hash, xorb_object) = build_xorb(&[100, 200, 300]);
|
||||
let file_info = make_file_info(vec![make_segment(xorb_hash, 0, 3, 600)]);
|
||||
|
||||
let range = FileRange::new(100, 300);
|
||||
let result = compute_reconstruction_ranges(&file_info, Some(range), &mut |_| Ok(xorb_object.clone())).unwrap();
|
||||
let (offset, terms, merged) = result.unwrap();
|
||||
|
||||
assert_eq!(offset, 0, "range starts exactly at chunk boundary");
|
||||
assert_eq!(terms.len(), 1);
|
||||
assert_eq!(terms[0].range.start, 1);
|
||||
assert_eq!(terms[0].range.end, 2);
|
||||
assert_eq!(terms[0].unpacked_length, 200);
|
||||
|
||||
let xorb_ranges = merged.get(&xorb_hash).unwrap();
|
||||
assert_eq!(xorb_ranges.len(), 1);
|
||||
assert_eq!(xorb_ranges[0].chunk_range.start, 1);
|
||||
assert_eq!(xorb_ranges[0].chunk_range.end, 2);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_compute_ranges_out_of_bounds() {
|
||||
let file_info = make_file_info(vec![make_segment(MerkleHash::default(), 0, 1, 100)]);
|
||||
|
||||
let range = FileRange::new(200, 300);
|
||||
let result = compute_reconstruction_ranges(&file_info, Some(range), &mut |_| {
|
||||
panic!("should not be called for out-of-range")
|
||||
})
|
||||
.unwrap();
|
||||
assert!(result.is_none());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_compute_ranges_empty_file() {
|
||||
let file_info = make_file_info(vec![]);
|
||||
|
||||
let result =
|
||||
compute_reconstruction_ranges(&file_info, None, &mut |_| panic!("should not be called for empty file"))
|
||||
.unwrap();
|
||||
let (offset, terms, merged) = result.unwrap();
|
||||
assert_eq!(offset, 0);
|
||||
assert!(terms.is_empty());
|
||||
assert!(merged.is_empty());
|
||||
|
||||
let result = compute_reconstruction_ranges(&file_info, Some(FileRange::new(0, 100)), &mut |_| {
|
||||
panic!("should not be called for empty file")
|
||||
})
|
||||
.unwrap();
|
||||
let (offset, terms, _) = result.unwrap();
|
||||
assert_eq!(offset, 0);
|
||||
assert!(terms.is_empty());
|
||||
|
||||
let result = compute_reconstruction_ranges(&file_info, Some(FileRange::new(1, 100)), &mut |_| {
|
||||
panic!("should not be called for empty file")
|
||||
})
|
||||
.unwrap();
|
||||
assert!(result.is_none());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_compute_ranges_merges_adjacent() {
|
||||
let (xorb_hash, xorb_object) = build_xorb(&[100, 100, 100, 100]);
|
||||
let file_info = make_file_info(vec![make_segment(xorb_hash, 0, 2, 200), make_segment(xorb_hash, 2, 4, 200)]);
|
||||
|
||||
let result = compute_reconstruction_ranges(&file_info, None, &mut |_| Ok(xorb_object.clone())).unwrap();
|
||||
let (offset, terms, merged) = result.unwrap();
|
||||
|
||||
assert_eq!(offset, 0);
|
||||
assert_eq!(terms.len(), 2);
|
||||
|
||||
let xorb_ranges = merged.get(&xorb_hash).unwrap();
|
||||
assert_eq!(xorb_ranges.len(), 1);
|
||||
assert_eq!(xorb_ranges[0].chunk_range.start, 0);
|
||||
assert_eq!(xorb_ranges[0].chunk_range.end, 4);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_compute_ranges_multi_xorb_non_contiguous() {
|
||||
let (hash_a, obj_a) = build_xorb(&[100, 100, 100, 100]);
|
||||
let (hash_b, obj_b) = build_xorb(&[150, 150]);
|
||||
|
||||
let file_info = make_file_info(vec![
|
||||
make_segment(hash_a, 0, 2, 200),
|
||||
make_segment(hash_b, 0, 2, 300),
|
||||
make_segment(hash_a, 2, 4, 200),
|
||||
]);
|
||||
|
||||
let result = compute_reconstruction_ranges(&file_info, None, &mut |hash| {
|
||||
if *hash == hash_a {
|
||||
Ok(obj_a.clone())
|
||||
} else if *hash == hash_b {
|
||||
Ok(obj_b.clone())
|
||||
} else {
|
||||
Err(CasClientError::XORBNotFound(*hash))
|
||||
}
|
||||
})
|
||||
.unwrap();
|
||||
|
||||
let (offset, terms, merged) = result.unwrap();
|
||||
assert_eq!(offset, 0);
|
||||
assert_eq!(terms.len(), 3);
|
||||
|
||||
let a_ranges = merged.get(&hash_a).unwrap();
|
||||
assert_eq!(a_ranges.len(), 1);
|
||||
assert_eq!(a_ranges[0].chunk_range.start, 0);
|
||||
assert_eq!(a_ranges[0].chunk_range.end, 4);
|
||||
|
||||
let b_ranges = merged.get(&hash_b).unwrap();
|
||||
assert_eq!(b_ranges.len(), 1);
|
||||
assert_eq!(b_ranges[0].chunk_range.start, 0);
|
||||
assert_eq!(b_ranges[0].chunk_range.end, 2);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_compute_ranges_truncates_to_file_size() {
|
||||
let (xorb_hash, xorb_object) = build_xorb(&[500]);
|
||||
let file_info = make_file_info(vec![make_segment(xorb_hash, 0, 1, 500)]);
|
||||
|
||||
let range = FileRange::new(0, 10000);
|
||||
let result = compute_reconstruction_ranges(&file_info, Some(range), &mut |_| Ok(xorb_object.clone())).unwrap();
|
||||
let (offset, terms, _) = result.unwrap();
|
||||
assert_eq!(offset, 0);
|
||||
assert_eq!(terms.len(), 1);
|
||||
assert_eq!(terms[0].unpacked_length, 500);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_compute_ranges_offset_into_first_range() {
|
||||
let (xorb_hash, xorb_object) = build_xorb(&[100, 200, 300]);
|
||||
let file_info = make_file_info(vec![make_segment(xorb_hash, 0, 3, 600)]);
|
||||
|
||||
let range = FileRange::new(150, 600);
|
||||
let result = compute_reconstruction_ranges(&file_info, Some(range), &mut |_| Ok(xorb_object.clone())).unwrap();
|
||||
let (offset, terms, _) = result.unwrap();
|
||||
|
||||
assert_eq!(offset, 50);
|
||||
assert_eq!(terms[0].range.start, 1);
|
||||
}
|
||||
}
|
||||
@@ -217,6 +217,66 @@ pub struct QueryReconstructionResponse {
|
||||
pub fetch_info: HashMap<HexMerkleHash, Vec<XorbReconstructionFetchInfo>>,
|
||||
}
|
||||
|
||||
/// V2 reconstruction response - optimized for multi-range fetching.
|
||||
/// May provide fewer signed URLs per xorb by combining multiple byte ranges
|
||||
/// into a single URL where possible.
|
||||
#[derive(Debug, Serialize, Deserialize, Clone)]
|
||||
pub struct QueryReconstructionResponseV2 {
|
||||
pub offset_into_first_range: u64,
|
||||
pub terms: Vec<XorbReconstructionTerm>,
|
||||
/// Map from xorb hash -> list of multi-range fetch entries.
|
||||
/// Typically 1 entry per xorb. Multiple entries when the URL length limit
|
||||
/// (~8 KiB, roughly ~500 ranges) forces a split.
|
||||
pub xorbs: HashMap<HexMerkleHash, Vec<XorbMultiRangeFetch>>,
|
||||
}
|
||||
|
||||
/// A signed multi-range fetch: one URL covering a subset of ranges for a xorb.
|
||||
#[derive(Debug, Serialize, Deserialize, Clone)]
|
||||
pub struct XorbMultiRangeFetch {
|
||||
/// Signed URL with all byte ranges encoded. Client must send exactly the
|
||||
/// signed range value as the Range header.
|
||||
pub url: String,
|
||||
/// Byte ranges covered by this URL, sorted by chunk start.
|
||||
pub ranges: Vec<XorbRangeDescriptor>,
|
||||
}
|
||||
|
||||
/// A single byte range within a xorb, mapping chunk indices to physical bytes.
|
||||
#[derive(Debug, Serialize, Deserialize, Clone)]
|
||||
pub struct XorbRangeDescriptor {
|
||||
/// Chunk index range [start, end) within the xorb.
|
||||
pub chunks: ChunkRange,
|
||||
/// Physical byte range [start, end] (inclusive end) for the HTTP Range header.
|
||||
pub bytes: HttpRange,
|
||||
}
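The two range fields use different conventions: `chunks` is half-open `[start, end)` while `bytes` is inclusive on both ends, as HTTP `Range` headers are. The sketch below shows how inclusive byte ranges could be rendered as a multi-range header value. It uses the generic `bytes=a-b,c-d` form from the HTTP specification. The exact string the server signs is not shown in this diff, so treat the formatting as an assumption. `InclusiveByteRange`, `range_header_value` and `byte_len` are hypothetical names.

```rust
/// Hypothetical stand-in for an inclusive `[start, end]` byte range.
#[derive(Debug, Clone, Copy, PartialEq)]
struct InclusiveByteRange {
    start: u64,
    end: u64,
}

/// Number of bytes covered by an inclusive range (note the `+ 1`).
fn byte_len(range: InclusiveByteRange) -> u64 {
    range.end - range.start + 1
}

/// Render ranges in the generic HTTP multi-range form `bytes=a-b,c-d`.
/// Assumption: the real client must send whatever value was signed, which
/// may be formatted differently.
fn range_header_value(ranges: &[InclusiveByteRange]) -> String {
    let parts: Vec<String> = ranges
        .iter()
        .map(|r| format!("{}-{}", r.start, r.end))
        .collect();
    format!("bytes={}", parts.join(","))
}
```

For the two ranges in the JSON example at the top of this change (bytes 0 to 1023 and 2048 to 3071), each range covers 1024 bytes and the header value would be `bytes=0-1023,2048-3071` under this formatting.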
|
||||
|
||||
impl From<QueryReconstructionResponse> for QueryReconstructionResponseV2 {
|
||||
fn from(v1: QueryReconstructionResponse) -> Self {
|
||||
let xorbs = v1
|
||||
.fetch_info
|
||||
.into_iter()
|
||||
.map(|(hash, fetch_infos)| {
|
||||
let fetch = fetch_infos
|
||||
.into_iter()
|
||||
.map(|info| XorbMultiRangeFetch {
|
||||
url: info.url,
|
||||
ranges: vec![XorbRangeDescriptor {
|
||||
chunks: info.range,
|
||||
bytes: info.url_range,
|
||||
}],
|
||||
})
|
||||
.collect();
|
||||
(hash, fetch)
|
||||
})
|
||||
.collect();
|
||||
|
||||
QueryReconstructionResponseV2 {
|
||||
offset_into_first_range: v1.offset_into_first_range,
|
||||
terms: v1.terms,
|
||||
xorbs,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Request json body type representation for the POST /reconstructions endpoint
|
||||
// to get the reconstruction for multiple files at a time.
|
||||
// listing of non-duplicate (enforced by HashSet) keys (file ids) to get reconstructions for
|
||||
|
||||
40
xet_client/tests/test_shard_upload_timeout.rs
Normal file
@@ -0,0 +1,40 @@
|
||||
//! Integration tests for the shard upload no-read-timeout client (XET-885).
|
||||
//!
|
||||
//! Verifies that shard uploads succeed even when the server takes a long time to process,
|
||||
//! since the shard upload client has no read_timeout.
|
||||
|
||||
use std::time::Duration;
|
||||
|
||||
use xet_client::cas_client::simulation::ClientTestingUtils;
|
||||
use xet_client::cas_client::{DirectAccessClient, LocalTestServerBuilder};
|
||||
use xet_runtime::test_set_config;
|
||||
|
||||
test_set_config! {
|
||||
client {
|
||||
retry_max_attempts = 1usize;
|
||||
retry_base_delay = Duration::from_millis(10);
|
||||
}
|
||||
}
|
||||
|
||||
const CHUNK_SIZE: usize = 123;
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_shard_upload_succeeds_with_no_server_delay() {
|
||||
let server = LocalTestServerBuilder::new().start().await;
|
||||
|
||||
let result = server.remote_client().upload_random_file(&[(1, (0, 5))], CHUNK_SIZE).await;
|
||||
|
||||
assert!(result.is_ok(), "Shard upload should succeed with no server delay: {result:?}");
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_shard_upload_succeeds_with_slow_server() {
|
||||
let server = LocalTestServerBuilder::new().start().await;
|
||||
|
||||
// Server takes 3s to respond — shard upload client has no read_timeout so this should succeed
|
||||
server.set_api_delay_range(Some(Duration::from_secs(3)..Duration::from_secs(3)));
|
||||
|
||||
let result = server.remote_client().upload_random_file(&[(1, (0, 5))], CHUNK_SIZE).await;
|
||||
|
||||
assert!(result.is_ok(), "Shard upload should succeed even with slow server (no read_timeout): {result:?}");
|
||||
}
|
||||
@@ -192,6 +192,27 @@ pub fn deserialize_chunks<R: Read>(reader: &mut R) -> Result<(Vec<u8>, Vec<u32>)
|
||||
Ok((buf, chunk_byte_indices))
|
||||
}
|
||||
|
||||
/// Appends a deserialized chunk segment to existing accumulated buffers.
|
||||
///
|
||||
/// `deserialize_chunks` returns `chunk_byte_indices` starting with a leading `0`.
|
||||
/// When concatenating multiple segments, this function deduplicates that leading
|
||||
/// zero for subsequent segments and rebases all indices to account for data already
|
||||
/// accumulated.
|
||||
pub fn append_chunk_segment(
|
||||
all_data: &mut Vec<u8>,
|
||||
all_chunk_indices: &mut Vec<u32>,
|
||||
segment_data: &[u8],
|
||||
segment_indices: &[u32],
|
||||
) {
|
||||
let base_offset = all_data.len() as u32;
|
||||
if all_chunk_indices.is_empty() {
|
||||
all_chunk_indices.extend_from_slice(segment_indices);
|
||||
} else {
|
||||
all_chunk_indices.extend(segment_indices.iter().skip(1).map(|&o| o + base_offset));
|
||||
}
|
||||
all_data.extend_from_slice(segment_data);
|
||||
}
|
||||
|
||||
/// Reads the next chunk header, returning `None` on clean EOF.
|
||||
///
|
||||
/// Uses a single `read()` call to detect EOF (returns 0), then completes
|
||||
@@ -338,6 +359,37 @@ mod tests {
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_append_chunk_segment() {
|
||||
let mut all_data = Vec::new();
|
||||
let mut all_indices = Vec::<u32>::new();
|
||||
|
||||
// First segment: simulates deserialize_chunks output [0, 10, 25]
|
||||
append_chunk_segment(&mut all_data, &mut all_indices, &[0u8; 25], &[0, 10, 25]);
|
||||
assert_eq!(all_data.len(), 25);
|
||||
assert_eq!(all_indices, vec![0, 10, 25]);
|
||||
|
||||
// Second segment: [0, 8, 20] — leading 0 should be skipped, offsets rebased by 25
|
||||
append_chunk_segment(&mut all_data, &mut all_indices, &[1u8; 20], &[0, 8, 20]);
|
||||
assert_eq!(all_data.len(), 45);
|
||||
assert_eq!(all_indices, vec![0, 10, 25, 33, 45]);
|
||||
|
||||
// Third segment: single chunk [0, 5] — leading 0 skipped, rebased by 45
|
||||
append_chunk_segment(&mut all_data, &mut all_indices, &[2u8; 5], &[0, 5]);
|
||||
assert_eq!(all_data.len(), 50);
|
||||
assert_eq!(all_indices, vec![0, 10, 25, 33, 45, 50]);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_append_chunk_segment_single() {
|
||||
let mut all_data = Vec::new();
|
||||
let mut all_indices = Vec::<u32>::new();
|
||||
|
||||
append_chunk_segment(&mut all_data, &mut all_indices, &[0u8; 10], &[0, 10]);
|
||||
assert_eq!(all_data.len(), 10);
|
||||
assert_eq!(all_indices, vec![0, 10]);
|
||||
}
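Once segments have been appended, the accumulated index vector holds one boundary per chunk plus a final end offset, so chunk `i` occupies `all_indices[i]..all_indices[i + 1]` in the data buffer. The sketch below repeats `append_chunk_segment` from this diff so that it compiles on its own, and adds a hypothetical `chunk_slice` helper (not part of the PR) that reads a chunk back out.

```rust
/// Copy of `append_chunk_segment` from this diff, repeated so the sketch is
/// self-contained.
fn append_chunk_segment(
    all_data: &mut Vec<u8>,
    all_chunk_indices: &mut Vec<u32>,
    segment_data: &[u8],
    segment_indices: &[u32],
) {
    let base_offset = all_data.len() as u32;
    if all_chunk_indices.is_empty() {
        all_chunk_indices.extend_from_slice(segment_indices);
    } else {
        // Skip the segment's leading 0 and rebase the remaining offsets.
        all_chunk_indices.extend(segment_indices.iter().skip(1).map(|&o| o + base_offset));
    }
    all_data.extend_from_slice(segment_data);
}

/// Hypothetical helper: borrow the bytes of chunk `i` from the accumulated
/// buffer, or `None` if `i` is out of range.
fn chunk_slice<'a>(all_data: &'a [u8], all_chunk_indices: &[u32], i: usize) -> Option<&'a [u8]> {
    let start = *all_chunk_indices.get(i)? as usize;
    let end = *all_chunk_indices.get(i + 1)? as usize;
    all_data.get(start..end)
}
```

After appending the first two segments from `test_append_chunk_segment`, chunk 2 is the first chunk of the second segment and spans bytes 25 to 33 of the combined buffer.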
|
||||
|
||||
#[test]
|
||||
fn test_truncated_stream_returns_error() {
|
||||
let (_, xorb_data, _, _) = build_xorb_object(3, ChunkSize::Fixed(1024), CompressionScheme::None);
|
||||
|
||||
@@ -375,6 +375,7 @@ mod tests {
|
||||
use ulid::Ulid;
|
||||
use xet_client::cas_client::{ClientTestingUtils, DirectAccessClient, LocalClient, RandomFileContents};
|
||||
use xet_client::cas_types::FileRange;
|
||||
use xet_runtime::core::XetRuntime;
|
||||
|
||||
use super::*;
|
||||
use crate::progress_tracking::NoOpProgressUpdater;
|
||||
@@ -405,6 +406,7 @@ mod tests {
|
||||
file_hash: MerkleHash,
|
||||
byte_range: Option<FileRange>,
|
||||
config: &ReconstructionConfig,
|
||||
semaphore: Option<Arc<AdjustableSemaphore>>,
|
||||
) -> Result<Vec<u8>> {
|
||||
let buffer = Arc::new(std::sync::Mutex::new(Cursor::new(Vec::new())));
|
||||
let writer = StaticCursorWriter(buffer.clone());
|
||||
@@ -415,6 +417,9 @@ mod tests {
|
||||
if let Some(range) = byte_range {
|
||||
reconstructor = reconstructor.with_byte_range(range);
|
||||
}
|
||||
if let Some(sem) = semaphore {
|
||||
reconstructor = reconstructor.with_buffer_semaphore(sem);
|
||||
}
|
||||
|
||||
reconstructor.reconstruct_to_writer(writer).await?;
|
||||
|
||||
@@ -528,7 +533,7 @@ mod tests {
|
||||
config.use_vectored_write = use_vectored;
|
||||
|
||||
// Test 1: reconstruct_to_writer
|
||||
- let vec_result = reconstruct_to_vec(client, h, None, &config).await.unwrap();
+ let vec_result = reconstruct_to_vec(client, h, None, &config, None).await.unwrap();
|
||||
assert_eq!(vec_result, *expected, "vec failed (vectored={use_vectored})");
|
||||
|
||||
// Test 2: reconstruct_to_file
|
||||
@@ -560,7 +565,7 @@ mod tests {
|
||||
config.use_vectored_write = use_vectored;
|
||||
|
||||
// Test 1: reconstruct_to_writer
|
||||
- let vec_result = reconstruct_to_vec(client, file_contents.file_hash, Some(range), &config)
+ let vec_result = reconstruct_to_vec(client, file_contents.file_hash, Some(range), &config, None)
|
||||
.await
|
||||
.expect("reconstruct_to_vec should succeed");
|
||||
assert_eq!(vec_result, expected, "vec failed (vectored={use_vectored})");
|
||||
@@ -911,7 +916,11 @@ mod tests {
|
||||
#[tokio::test]
|
||||
async fn test_non_contiguous_chunks() {
|
||||
let (client, file_contents) = setup_test_file(&[(1, (0, 2)), (1, (4, 6))]).await;
|
||||
- reconstruct_and_verify_full(&client, &file_contents, test_config()).await;
+ let config = test_config();
+ let result = reconstruct_to_vec(&client, file_contents.file_hash, None, &config, None)
+     .await
+     .unwrap();
+ assert_eq!(result, file_contents.data);
|
||||
}
|
||||
|
||||
// ==================== Default Config Tests ====================
|
||||
@@ -1157,7 +1166,7 @@ mod tests {
|
||||
let mut config = test_config();
|
||||
config.download_buffer_perfile_size = xet_runtime::utils::ByteSize::from("8kb");
|
||||
|
||||
- let reconstructed = reconstruct_to_vec(&client, file_contents.file_hash, None, &config)
+ let reconstructed = reconstruct_to_vec(&client, file_contents.file_hash, None, &config, None)
|
||||
.await
|
||||
.unwrap();
|
||||
assert_eq!(reconstructed, file_contents.data);
|
||||
@@ -1287,6 +1296,348 @@ mod tests {
|
||||
assert_eq!(&result[start as usize..end as usize], &file_contents.data[start as usize..end as usize]);
|
||||
}
|
||||
|
||||
// ==================== V1 Fallback Tests ====================
|
||||
//
|
||||
// These tests use LocalTestServer with V2 disabled to verify that
|
||||
// reconstruction works correctly when the client falls back from V2 to V1.
|
||||
|
||||
/// Helper to reconstruct through a LocalTestServer (RemoteClient HTTP path).
|
||||
async fn reconstruct_via_server(
|
||||
server: &xet_client::cas_client::LocalTestServer,
|
||||
file_hash: MerkleHash,
|
||||
byte_range: Option<FileRange>,
|
||||
config: &ReconstructionConfig,
|
||||
) -> Result<Vec<u8>> {
|
||||
let buffer = Arc::new(std::sync::Mutex::new(Cursor::new(Vec::new())));
|
||||
let writer = StaticCursorWriter(buffer.clone());
|
||||
|
||||
let client: Arc<dyn Client> = server.remote_client().clone();
|
||||
let mut reconstructor = FileReconstructor::new(&client, file_hash).with_config(config);
|
||||
|
||||
if let Some(range) = byte_range {
|
||||
reconstructor = reconstructor.with_byte_range(range);
|
||||
}
|
||||
|
||||
reconstructor.reconstruct_to_writer(writer).await?;
|
||||
|
||||
let data = buffer.lock().unwrap().get_ref().clone();
|
||||
Ok(data)
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_v1_fallback_full_reconstruction() {
|
||||
let server = xet_client::cas_client::LocalTestServerBuilder::new().start().await;
|
||||
let file_contents = server
|
||||
.remote_client()
|
||||
.upload_random_file(&[(1, (0, 3)), (2, (0, 2))], TEST_CHUNK_SIZE)
|
||||
.await
|
||||
.unwrap();
|
||||
|
||||
// Disable V2 so the remote client falls back to V1 + conversion.
|
||||
server.disable_v2_reconstruction(404);
|
||||
|
||||
let config = test_config();
|
||||
let result = reconstruct_via_server(&server, file_contents.file_hash, None, &config)
|
||||
.await
|
||||
.unwrap();
|
||||
assert_eq!(result, file_contents.data.as_ref());
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_v1_fallback_partial_range() {
|
||||
let server = xet_client::cas_client::LocalTestServerBuilder::new().start().await;
|
||||
let file_contents = server
|
||||
.remote_client()
|
||||
.upload_random_file(&[(1, (0, 5)), (2, (0, 3))], TEST_CHUNK_SIZE)
|
||||
.await
|
||||
.unwrap();
|
||||
|
||||
server.disable_v2_reconstruction(404);
|
||||
|
||||
let file_len = file_contents.data.len() as u64;
|
||||
let range = FileRange::new(file_len / 4, file_len * 3 / 4);
|
||||
|
||||
let config = test_config();
|
||||
let result = reconstruct_via_server(&server, file_contents.file_hash, Some(range), &config)
|
||||
.await
|
||||
.unwrap();
|
||||
assert_eq!(result, &file_contents.data[range.start as usize..range.end as usize]);
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_v1_fallback_non_contiguous_chunks() {
|
||||
let server = xet_client::cas_client::LocalTestServerBuilder::new().start().await;
|
||||
let file_contents = server
|
||||
.remote_client()
|
||||
.upload_random_file(&[(1, (0, 2)), (1, (4, 6))], TEST_CHUNK_SIZE)
|
||||
.await
|
||||
.unwrap();
|
||||
|
||||
server.disable_v2_reconstruction(404);
|
||||
|
||||
let config = test_config();
|
||||
let result = reconstruct_via_server(&server, file_contents.file_hash, None, &config)
|
||||
.await
|
||||
.unwrap();
|
||||
assert_eq!(result, file_contents.data.as_ref());
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_v1_fallback_multiple_xorbs() {
|
||||
let server = xet_client::cas_client::LocalTestServerBuilder::new().start().await;
|
||||
let file_contents = server
|
||||
.remote_client()
|
||||
.upload_random_file(&[(1, (0, 2)), (2, (0, 3)), (3, (0, 2)), (1, (2, 4))], TEST_CHUNK_SIZE)
|
||||
.await
|
||||
.unwrap();
|
||||
|
||||
server.disable_v2_reconstruction(404);
|
||||
|
||||
let config = test_config();
|
||||
let result = reconstruct_via_server(&server, file_contents.file_hash, None, &config)
|
||||
.await
|
||||
.unwrap();
|
||||
assert_eq!(result, file_contents.data.as_ref());
|
||||
}
|
||||
|
||||
/// V1 fallback with three disjoint ranges from the same xorb.
|
||||
#[tokio::test]
|
||||
async fn test_v1_fallback_triple_disjoint_ranges() {
|
||||
let server = xet_client::cas_client::LocalTestServerBuilder::new().start().await;
|
||||
let file_contents = server
|
||||
.remote_client()
|
||||
.upload_random_file(&[(1, (0, 2)), (1, (4, 6)), (1, (8, 10))], TEST_CHUNK_SIZE)
|
||||
.await
|
||||
.unwrap();
|
||||
|
||||
server.disable_v2_reconstruction(404);
|
||||
|
||||
let config = test_config();
|
||||
let result = reconstruct_via_server(&server, file_contents.file_hash, None, &config)
|
||||
.await
|
||||
.unwrap();
|
||||
assert_eq!(result, file_contents.data.as_ref());
|
||||
}
|
||||
|
||||
// ==================== Max Ranges Tests ====================
|
||||
//
|
||||
// These tests use LocalTestServer with max_ranges_per_fetch=2 to verify that
|
||||
// multi-range fetch splitting works correctly through the full HTTP path.
|
||||
|
||||
/// Helper to set up a server with max_ranges_per_fetch and reconstruct.
|
||||
async fn reconstruct_via_server_with_max_ranges(
|
||||
term_spec: &[(u64, (u64, u64))],
|
||||
max_ranges: usize,
|
||||
byte_range: Option<FileRange>,
|
||||
) -> (Vec<u8>, RandomFileContents) {
|
||||
let server = xet_client::cas_client::LocalTestServerBuilder::new().start().await;
|
||||
let file_contents = server
|
||||
.remote_client()
|
||||
.upload_random_file(term_spec, TEST_CHUNK_SIZE)
|
||||
.await
|
||||
.unwrap();
|
||||
|
||||
server.set_max_ranges_per_fetch(max_ranges);
|
||||
|
||||
let config = test_config();
|
||||
let result = reconstruct_via_server(&server, file_contents.file_hash, byte_range, &config)
|
||||
.await
|
||||
.unwrap();
|
||||
(result, file_contents)
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_max_ranges_simple() {
|
||||
let (result, file_contents) =
|
||||
reconstruct_via_server_with_max_ranges(&[(1, (0, 3)), (2, (0, 2))], 2, None).await;
|
||||
assert_eq!(result, file_contents.data.as_ref());
|
||||
}
|
||||
|
||||
/// A single xorb with two disjoint ranges, split at max_ranges=1.
|
||||
/// Each range becomes its own fetch entry.
|
||||
#[tokio::test]
|
||||
async fn test_max_ranges_1_disjoint() {
|
||||
let (result, file_contents) =
|
||||
reconstruct_via_server_with_max_ranges(&[(1, (0, 2)), (1, (4, 6))], 1, None).await;
|
||||
assert_eq!(result, file_contents.data.as_ref());
|
||||
}
|
||||
|
||||
/// Three disjoint ranges from the same xorb with max_ranges=2.
|
||||
/// First two ranges are grouped, third gets its own fetch entry.
|
||||
#[tokio::test]
|
||||
async fn test_max_ranges_2_triple_disjoint() {
|
||||
let (result, file_contents) =
|
||||
reconstruct_via_server_with_max_ranges(&[(1, (0, 2)), (1, (4, 6)), (1, (8, 10))], 2, None).await;
|
||||
assert_eq!(result, file_contents.data.as_ref());
|
||||
}
|
||||
|
||||
/// Multiple xorbs, each with disjoint ranges, with max_ranges=2.
|
||||
/// Tests that splitting is applied per-xorb correctly.
|
||||
#[tokio::test]
|
||||
async fn test_max_ranges_2_multi_xorb_disjoint() {
|
||||
let term_spec = &[
|
||||
(1, (0, 2)),
|
||||
(2, (0, 2)),
|
||||
(1, (4, 6)),
|
||||
(2, (4, 6)),
|
||||
(1, (8, 10)),
|
||||
(2, (8, 10)),
|
||||
];
|
||||
let (result, file_contents) = reconstruct_via_server_with_max_ranges(term_spec, 2, None).await;
|
||||
assert_eq!(result, file_contents.data.as_ref());
|
||||
}
|
||||
|
||||
/// Complex interleaved pattern with max_ranges=2 and a partial byte range.
|
||||
#[tokio::test]
|
||||
async fn test_max_ranges_2_partial_range() {
|
||||
let term_spec = &[
|
||||
(1, (0, 3)),
|
||||
(2, (0, 2)),
|
||||
(1, (3, 5)),
|
||||
(3, (1, 4)),
|
||||
(2, (4, 6)),
|
||||
(1, (0, 2)),
|
||||
];
|
||||
let server = xet_client::cas_client::LocalTestServerBuilder::new().start().await;
|
||||
let file_contents = server
|
||||
.remote_client()
|
||||
.upload_random_file(term_spec, TEST_CHUNK_SIZE)
|
||||
.await
|
||||
.unwrap();
|
||||
|
||||
server.set_max_ranges_per_fetch(2);
|
||||
|
||||
let file_len = file_contents.data.len() as u64;
|
||||
let range = FileRange::new(file_len / 4, file_len * 3 / 4);
|
||||
|
||||
let config = test_config();
|
||||
let result = reconstruct_via_server(&server, file_contents.file_hash, Some(range), &config)
|
||||
.await
|
||||
.unwrap();
|
||||
assert_eq!(result, &file_contents.data[range.start as usize..range.end as usize]);
|
||||
}
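The max-ranges tests above rely on the server splitting a xorb's ranges into several fetch entries once a per-fetch limit is reached: with a limit of 2, three disjoint ranges become one entry with two ranges and one entry with a single range. The sketch below shows that grouping in isolation. `group_ranges` is a hypothetical name, and the real server logic is not shown in this diff (it may also account for URL length), so this is an illustration of the expected shape only.

```rust
/// Hypothetical sketch: split a xorb's sorted, disjoint chunk ranges into
/// groups of at most `max_ranges` entries, one group per fetch entry (URL).
fn group_ranges(ranges: &[(u32, u32)], max_ranges: usize) -> Vec<Vec<(u32, u32)>> {
    // `slice::chunks` panics on a chunk size of 0, so reject it explicitly.
    assert!(max_ranges > 0, "max_ranges must be at least 1");
    ranges.chunks(max_ranges).map(|group| group.to_vec()).collect()
}
```

With `max_ranges = 1`, every range ends up in its own group, which matches the `test_max_ranges_1_disjoint` description.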
|
||||
|
||||
// ==================== Multi-Disjoint Range Tests (LocalClient) ====================
|
||||
//
|
||||
// These tests exercise complex disjoint range patterns through the LocalClient path
|
||||
// (no HTTP server), ensuring the reconstruction logic handles V2 multi-range
|
||||
// XorbBlocks correctly.
|
||||
|
||||
/// Single xorb with three disjoint chunk ranges.
|
||||
#[tokio::test]
|
||||
async fn test_triple_disjoint_ranges_full() {
|
||||
let (client, file_contents) = setup_test_file(&[(1, (0, 2)), (1, (4, 6)), (1, (8, 10))]).await;
|
||||
reconstruct_and_verify_full(&client, &file_contents, test_config()).await;
|
||||
}
|
||||
|
||||
/// Single xorb with three disjoint chunk ranges, partial byte range.
|
||||
#[tokio::test]
|
||||
async fn test_triple_disjoint_ranges_partial() {
|
||||
let (client, file_contents) = setup_test_file(&[(1, (0, 2)), (1, (4, 6)), (1, (8, 10))]).await;
|
||||
let file_len = file_contents.data.len() as u64;
|
||||
let range = FileRange::new(file_len / 4, file_len * 3 / 4);
|
||||
reconstruct_and_verify_range(&client, &file_contents, range, test_config()).await;
|
||||
}
|
||||
|
||||
/// Multiple xorbs, each with multiple disjoint ranges, interleaved.
|
||||
#[tokio::test]
|
||||
async fn test_multi_xorb_interleaved_disjoint() {
|
||||
let term_spec = &[
|
||||
(1, (0, 2)),
|
||||
(2, (0, 2)),
|
||||
(1, (4, 6)),
|
||||
(2, (4, 6)),
|
||||
(1, (8, 10)),
|
||||
(2, (8, 10)),
|
||||
];
|
||||
let (client, file_contents) = setup_test_file(term_spec).await;
|
||||
reconstruct_and_verify_full(&client, &file_contents, test_config()).await;
|
||||
}
|
||||
|
||||
/// Multiple xorbs with interleaved disjoint ranges, partial byte range.
|
||||
#[tokio::test]
|
||||
async fn test_multi_xorb_interleaved_disjoint_partial() {
|
||||
let term_spec = &[
|
||||
(1, (0, 2)),
|
||||
(2, (0, 2)),
|
||||
(1, (4, 6)),
|
||||
(2, (4, 6)),
|
||||
(1, (8, 10)),
|
||||
(2, (8, 10)),
|
||||
];
|
||||
let (client, file_contents) = setup_test_file(term_spec).await;
|
||||
let file_len = file_contents.data.len() as u64;
|
||||
let range = FileRange::new(file_len / 3, file_len * 2 / 3);
|
||||
reconstruct_and_verify_range(&client, &file_contents, range, test_config()).await;
|
||||
}
|
||||
|
||||
/// Single xorb with four disjoint ranges (many gaps).
|
||||
#[tokio::test]
|
||||
async fn test_four_disjoint_ranges() {
|
||||
let term_spec = &[(1, (0, 2)), (1, (4, 6)), (1, (8, 10)), (1, (12, 14))];
|
||||
let (client, file_contents) = setup_test_file(term_spec).await;
|
||||
reconstruct_and_verify_full(&client, &file_contents, test_config()).await;
|
||||
}
|
||||
|
||||
/// Mix of contiguous and disjoint ranges from the same xorb.
|
||||
#[tokio::test]
|
||||
async fn test_mixed_contiguous_and_disjoint() {
|
||||
let term_spec = &[
|
||||
(1, (0, 3)), // contiguous block
|
||||
(1, (3, 5)), // continues contiguously
|
||||
(1, (8, 10)), // gap, then disjoint
|
||||
];
|
||||
let (client, file_contents) = setup_test_file(term_spec).await;
|
||||
reconstruct_and_verify_full(&client, &file_contents, test_config()).await;
|
||||
}
|
||||
|
||||
/// Disjoint ranges across three xorbs with a complex access pattern.
|
||||
#[tokio::test]
|
||||
async fn test_complex_three_xorb_disjoint() {
|
||||
let term_spec = &[
|
||||
(1, (0, 2)),
|
||||
(2, (0, 3)),
|
||||
(3, (2, 5)),
|
||||
(1, (5, 8)),
|
||||
(2, (6, 8)),
|
||||
(3, (0, 2)),
|
||||
];
|
||||
let (client, file_contents) = setup_test_file(term_spec).await;
|
||||
reconstruct_and_verify_full(&client, &file_contents, test_config()).await;
|
||||
}
|
||||
|
||||
/// LocalClient with max_ranges_per_fetch=2 (tests V2 response splitting without HTTP).
|
||||
#[tokio::test]
|
||||
async fn test_local_client_max_ranges_2_disjoint() {
|
||||
let client = LocalClient::temporary().await.unwrap();
|
||||
client.set_max_ranges_per_fetch(2);
|
||||
|
||||
let term_spec = &[(1, (0, 2)), (1, (4, 6)), (1, (8, 10)), (1, (12, 14))];
|
||||
let file_contents = client.upload_random_file(term_spec, TEST_CHUNK_SIZE).await.unwrap();
|
||||
|
||||
let config = test_config();
|
||||
let result = reconstruct_to_vec(&client, file_contents.file_hash, None, &config, None)
|
||||
.await
|
||||
.unwrap();
|
||||
assert_eq!(result, file_contents.data.as_ref());
|
||||
}
|
||||
|
||||
/// LocalClient with max_ranges_per_fetch=1 (every range gets its own fetch entry).
|
||||
#[tokio::test]
|
||||
async fn test_local_client_max_ranges_1_multi_xorb() {
|
||||
let client = LocalClient::temporary().await.unwrap();
|
||||
client.set_max_ranges_per_fetch(1);
|
||||
|
||||
let term_spec = &[(1, (0, 2)), (2, (0, 2)), (1, (4, 6)), (2, (4, 6))];
|
||||
let file_contents = client.upload_random_file(term_spec, TEST_CHUNK_SIZE).await.unwrap();
|
||||
|
||||
let config = test_config();
|
||||
let result = reconstruct_to_vec(&client, file_contents.file_hash, None, &config, None)
|
||||
.await
|
||||
.unwrap();
|
||||
assert_eq!(result, file_contents.data.as_ref());
|
||||
}
|
||||
|
||||
// ==================== Cancellation Flag Tests ====================
|
||||
|
||||
#[tokio::test]
|
||||
@@ -1385,4 +1736,132 @@ mod tests {
|
||||
assert_eq!(bytes_written, file_contents.data.len() as u64);
|
||||
assert_eq!(buffer.lock().unwrap().get_ref().clone(), file_contents.data);
|
||||
}
|
||||
|
||||
// ==================== Multirange Fetching Tests ====================
|
||||
//
|
||||
// These tests verify that reconstruction works correctly with both values
|
||||
// of `enable_multirange_fetching`. When true, V2 multi-range fetch entries
|
||||
// are used as-is (multirange HTTP requests). When false (default), each
|
||||
// range is split into its own XorbBlock and fetched via a separate
|
||||
// single-range request in parallel.
|
||||
//
|
||||
// Uses XetRuntime::new_with_config() to override the config per-test,
|
||||
// following the pattern from test_dynamic_buffer_scaling_noop_increment_preserves_total_permits.
|
||||
|
||||
fn with_multirange_config(enable: bool) -> Arc<XetRuntime> {
|
||||
let mut config = xet_runtime::config::XetConfig::new();
|
||||
config.client.enable_multirange_fetching = enable;
|
||||
XetRuntime::new_with_config(config).unwrap()
|
||||
}
|
||||
|
||||
/// Exercises multiple disjoint-range scenarios through LocalClient with both
|
||||
/// enable_multirange_fetching=true and =false.
|
||||
#[test]
|
||||
fn test_multirange_local_client() {
|
||||
for enable in [false, true] {
|
||||
let rt = with_multirange_config(enable);
|
||||
rt.external_run_async_task(async move {
|
||||
let scenarios: Vec<Vec<(u64, (u64, u64))>> = vec![
|
||||
vec![(1, (0, 2)), (1, (4, 6)), (1, (8, 10))],
|
||||
vec![
|
||||
(1, (0, 2)),
|
||||
(2, (0, 2)),
|
||||
(1, (4, 6)),
|
||||
(2, (4, 6)),
|
||||
(1, (8, 10)),
|
||||
(2, (8, 10)),
|
||||
],
|
||||
vec![
|
||||
(1, (0, 2)),
|
||||
(2, (0, 3)),
|
||||
(3, (2, 5)),
|
||||
(1, (5, 8)),
|
||||
(2, (6, 8)),
|
||||
(3, (0, 2)),
|
||||
],
|
||||
];
|
||||
let config = test_config();
|
||||
for term_spec in &scenarios {
|
||||
let (client, fc) = setup_test_file(term_spec).await;
|
||||
reconstruct_and_verify_full(&client, &fc, config.clone()).await;
|
||||
|
||||
let file_len = fc.data.len() as u64;
|
||||
let range = FileRange::new(file_len / 4, file_len * 3 / 4);
|
||||
reconstruct_and_verify_range(&client, &fc, range, config.clone()).await;
|
||||
}
|
||||
})
|
||||
.unwrap();
|
||||
}
|
||||
}
|
||||
|
||||
/// LocalClient with max_ranges_per_fetch constraint, both enable settings.
|
||||
#[test]
|
||||
fn test_multirange_max_ranges() {
|
||||
for enable in [false, true] {
|
||||
let rt = with_multirange_config(enable);
|
||||
rt.external_run_async_task(async {
|
||||
let client = LocalClient::temporary().await.unwrap();
|
||||
client.set_max_ranges_per_fetch(2);
|
||||
|
||||
let term_spec = &[(1, (0, 2)), (1, (4, 6)), (1, (8, 10)), (1, (12, 14))];
|
||||
let fc = client.upload_random_file(term_spec, TEST_CHUNK_SIZE).await.unwrap();
|
||||
|
||||
let config = test_config();
|
||||
let result = reconstruct_to_vec(&client, fc.file_hash, None, &config, None).await.unwrap();
|
||||
assert_eq!(result, fc.data.as_ref());
|
||||
})
|
||||
.unwrap();
|
||||
}
|
||||
}
|
||||
|
||||
/// Exercises HTTP server path with full, max-ranges-split, and partial-range
|
||||
/// reconstruction, both enable_multirange_fetching values.
|
||||
#[test]
|
||||
fn test_multirange_via_server() {
|
||||
for enable in [false, true] {
|
||||
let rt = with_multirange_config(enable);
|
||||
rt.external_run_async_task(async {
|
||||
let config = test_config();
|
||||
|
||||
// Full reconstruction with disjoint ranges
|
||||
let server = xet_client::cas_client::LocalTestServerBuilder::new().start().await;
|
||||
let fc = server
|
||||
.remote_client()
|
||||
.upload_random_file(&[(1, (0, 2)), (1, (4, 6)), (1, (8, 10))], TEST_CHUNK_SIZE)
|
||||
.await
|
||||
.unwrap();
|
||||
let result = reconstruct_via_server(&server, fc.file_hash, None, &config).await.unwrap();
|
||||
assert_eq!(result, fc.data.as_ref());
|
||||
|
||||
// Multi-xorb with max_ranges_per_fetch=2
|
||||
let server = xet_client::cas_client::LocalTestServerBuilder::new().start().await;
|
||||
let fc = server
|
||||
.remote_client()
|
||||
.upload_random_file(
|
||||
&[(1, (0, 2)), (2, (0, 2)), (1, (4, 6)), (2, (4, 6)), (1, (8, 10))],
|
||||
TEST_CHUNK_SIZE,
|
||||
)
|
||||
.await
|
||||
.unwrap();
|
||||
server.set_max_ranges_per_fetch(2);
|
||||
let result = reconstruct_via_server(&server, fc.file_hash, None, &config).await.unwrap();
|
||||
assert_eq!(result, fc.data.as_ref());
|
||||
|
||||
// Partial byte range
|
||||
let server = xet_client::cas_client::LocalTestServerBuilder::new().start().await;
|
||||
let fc = server
|
||||
.remote_client()
|
||||
.upload_random_file(&[(1, (0, 3)), (2, (0, 2)), (1, (3, 5)), (2, (4, 6))], TEST_CHUNK_SIZE)
|
||||
.await
|
||||
.unwrap();
|
||||
let file_len = fc.data.len() as u64;
|
||||
let range = FileRange::new(file_len / 4, file_len * 3 / 4);
|
||||
let result = reconstruct_via_server(&server, fc.file_hash, Some(range), &config)
|
||||
.await
|
||||
.unwrap();
|
||||
assert_eq!(result, &fc.data[range.start as usize..range.end as usize]);
|
||||
})
|
||||
.unwrap();
|
||||
}
|
||||
}
|
||||
}
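The multirange tests above run every scenario with `enable_multirange_fetching` both on and off. When it is off, each range of a multi-range fetch entry is retrieved by its own single-range request. The sketch below shows only the planning step, under stated assumptions: `FetchPlan` and `plan_requests` are hypothetical names, and reusing the same URL for each single-range request is an assumption based on the PR description rather than something this diff shows.

```rust
/// Hypothetical stand-in for one planned HTTP request: a URL plus the
/// inclusive byte ranges to ask for in that request.
#[derive(Debug, Clone, PartialEq)]
struct FetchPlan {
    url: String,
    byte_ranges: Vec<(u64, u64)>,
}

/// Hypothetical sketch: keep a multi-range entry as one request when
/// multirange fetching is enabled, otherwise fan it out into one request
/// per range so the requests can be issued in parallel.
fn plan_requests(fetch: &FetchPlan, enable_multirange: bool) -> Vec<FetchPlan> {
    if enable_multirange || fetch.byte_ranges.len() <= 1 {
        return vec![fetch.clone()];
    }
    fetch
        .byte_ranges
        .iter()
        .map(|&range| FetchPlan {
            url: fetch.url.clone(),
            byte_ranges: vec![range],
        })
        .collect()
}
```

Either plan covers the same bytes, which is why the tests expect identical reconstructed output for both settings.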
|
||||
|
||||
@@ -7,6 +7,7 @@ use tokio::sync::OnceCell;
|
||||
use xet_client::cas_client::Client;
|
||||
use xet_client::cas_types::{ChunkRange, FileRange, HttpRange};
|
||||
use xet_core_structures::merklehash::MerkleHash;
|
||||
use xet_runtime::core::xet_config;
|
||||
use xet_runtime::utils::UniqueId;
|
||||
|
||||
use super::super::FileReconstructionError;
|
||||
@@ -19,17 +20,28 @@ use crate::progress_tracking::download_tracking::DownloadTaskUpdater;
|
||||
/// in the output file that maps to a chunk range within a xorb block.
|
||||
#[derive(Clone)]
|
||||
pub struct FileTerm {
|
||||
// The byte range in the file of this term.
|
||||
pub byte_range: FileRange,
|
||||
|
||||
// Absolute chunk range within the full xorb. Doesn't account for only a partial xorb being downloaded.
|
||||
pub xorb_chunk_range: ChunkRange,
|
||||
|
||||
// The index of the (chunk index, byte offset) pair in the xorb block that starts this file term.
|
||||
pub xorb_block_start_index: usize,
|
||||
|
||||
// The byte offset into the first range of the xorb block should this term not start on a chunk boundary.
|
||||
pub offset_into_first_range: u64,
|
||||
|
||||
// The xorb block that sourced this file term.
|
||||
pub xorb_block: Arc<XorbBlock>,
|
||||
|
||||
// The retrieval URL information for this file term.
|
||||
pub url_info: Arc<TermBlockRetrievalURLs>,
|
||||
}
|
||||
|
||||
impl FileTerm {
|
||||
pub fn extract_bytes(&self, xorb_block_data: &XorbBlockData) -> Bytes {
|
||||
- let local_start_chunk = (self.xorb_chunk_range.start - self.xorb_block.chunk_range.start) as usize;
- let start_byte_offset = xorb_block_data.chunk_offsets[local_start_chunk];
+ let (_, start_byte_offset) = xorb_block_data.chunk_offsets[self.xorb_block_start_index];
|
||||
let start_byte_offset = start_byte_offset + self.offset_into_first_range as usize;
|
||||
let expected_size = (self.byte_range.end - self.byte_range.start) as usize;
|
||||
let end_byte_offset = start_byte_offset + expected_size;
|
||||
@@ -67,6 +79,25 @@ impl FileTerm {
|
||||
}
|
||||
}
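The updated `extract_bytes` looks up the term's start through a precomputed index into `chunk_offsets`, whose entries are `(chunk index, byte offset)` pairs, instead of subtracting the block's first chunk index. That matters once a block can hold several disjoint chunk ranges, because chunk indices are then no longer contiguous within the block. The sketch below illustrates the lookup with simplified types. `BlockData` and `extract_term` are hypothetical stand-ins, not the PR's `XorbBlockData` or `FileTerm`.

```rust
/// Hypothetical stand-in for decoded block data: the unpacked bytes plus
/// `(chunk index, byte offset)` pairs, one per chunk start.
struct BlockData {
    data: Vec<u8>,
    chunk_offsets: Vec<(u32, usize)>,
}

/// Hypothetical sketch: return the bytes of a term that starts at the chunk
/// stored at `start_index`, skipping `offset_into_first_range` bytes and
/// covering `term_len` bytes. Returns `None` if anything is out of bounds.
fn extract_term(
    block: &BlockData,
    start_index: usize,
    offset_into_first_range: u64,
    term_len: u64,
) -> Option<&[u8]> {
    let (_, chunk_start) = *block.chunk_offsets.get(start_index)?;
    let start = chunk_start + offset_into_first_range as usize;
    let end = start + term_len as usize;
    block.data.get(start..end)
}
```

In this sketch the block holds chunks 0, 1 and 4, so the flattened position of chunk 4 is 2 even though its chunk index is 4. An index-based lookup handles that gap, while subtracting the block's first chunk index would not.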
|
||||
|
||||
/// Intermediate data for a single file term, collected during the first pass of
|
||||
/// `retrieve_file_term_block` before the final `FileTerm` structs are built.
|
||||
///
|
||||
/// We need this because `FileTerm` requires `Arc<XorbBlock>` and `Arc<TermBlockRetrievalURLs>`,
|
||||
/// which can't be constructed until all terms have been processed.
|
||||
struct FileTermEntry {
|
||||
/// The byte range in the output file that this term covers.
|
||||
byte_range: FileRange,
|
||||
/// The chunk range within the xorb that sources this term's data.
|
||||
xorb_chunk_range: ChunkRange,
|
||||
/// Byte offset into the first chunk's data, non-zero only for the first term
|
||||
/// when the query range starts mid-chunk.
|
||||
offset_into_first_range: u64,
|
||||
/// Index into the `xorb_blocks` / `xorb_block_retrieval_urls` vectors.
|
||||
xorb_block_index: usize,
|
||||
/// Flattened index into the xorb block's `chunk_offsets` for this term's start chunk.
|
||||
xorb_block_start_index: usize,
|
||||
}
|
||||
|
||||
/// Retrieve file terms from the client for a given file hash and byte range.
|
||||
/// Returns None if the requested byte range is past the end of the file.
|
||||
/// Returns the actual retrieved range and the number of bytes required for the
|
||||
@@ -77,78 +108,111 @@ pub async fn retrieve_file_term_block(
|
||||
file_hash: MerkleHash,
|
||||
query_file_byte_range: FileRange,
|
||||
) -> Result<Option<(FileRange, u64, Vec<FileTerm>)>> {
|
||||
- // First, get the raw reconstruction.
+ // get_reconstruction always returns V2 format (the client converts V1 internally).
|
||||
let Some(raw_reconstruction) = client.get_reconstruction(&file_hash, Some(query_file_byte_range)).await? else {
|
||||
// None means we've requested a byte range beyond the end of the file.
|
||||
return Ok(None);
|
||||
};
|
||||
|
||||
// Set a new url acquisition id to ensure that we don't double up the url acquisitions.
|
||||
// Each acquisition gets a unique ID used for single-flight URL refresh dedup.
|
||||
let acquisition_id = UniqueId::new();
|
||||
|
||||
// Intermediate storage for file term data before we create the actual FileTerm structs.
|
||||
// (byte_range, xorb_chunk_range, offset_into_first_range, index into xorb_blocks)
|
||||
let mut file_term_data = Vec::<(FileRange, ChunkRange, u64, usize)>::with_capacity(raw_reconstruction.terms.len());
|
||||
// First pass: iterate through the reconstruction terms and build up intermediate
|
||||
// FileTermEntry data, XorbBlock objects, and retrieval URL info. We can't construct
|
||||
// the final FileTerm structs yet because they need Arc<XorbBlock> and Arc<TermBlockRetrievalURLs>,
|
||||
// which require all terms to be processed first.
|
||||
let mut file_term_data = Vec::<FileTermEntry>::with_capacity(raw_reconstruction.terms.len());
|
||||
|
||||
let n_xorb_terms = raw_reconstruction.fetch_info.values().map(|v| v.len()).sum();
|
||||
// Parallel vectors indexed by xorb_block_index:
|
||||
// - xorb_blocks: the block metadata (hash, chunk ranges, references)
|
||||
// - xorb_block_retrieval_urls: the download URL and byte ranges for each block
|
||||
let mut xorb_blocks: Vec<XorbBlock> = Vec::new();
|
||||
let mut xorb_block_retrieval_urls = Vec::<(String, Vec<HttpRange>)>::new();
|
||||
|
||||
// Keep track of the xorb blocks we've created, keyed by (xorb_hash, first chunk index).
|
||||
let mut xorb_blocks: Vec<XorbBlock> = Vec::with_capacity(n_xorb_terms);
|
||||
// Dedup map: (xorb_hash, first_range_chunk_start) -> xorb_block_index.
|
||||
// Multiple terms may reference the same xorb block; this ensures we create
|
||||
// each block only once and share it across terms.
|
||||
let mut xorb_index_lookup = HashMap::<(MerkleHash, u32), usize>::new();
|
||||
|
||||
// Keep track of the URLs for each.
|
||||
let mut xorb_block_retrieval_urls = Vec::<(String, HttpRange)>::with_capacity(n_xorb_terms);
|
||||
|
||||
// Get a hash map so we can reindex the xorb terms; map of (xorb_hash, first chunk index) -> xorb block index.
|
||||
let mut xorb_index_lookup = HashMap::<(MerkleHash, u64), usize>::with_capacity(n_xorb_terms);
|
||||
|
||||
// Keep track of where we are so as to map the file terms to the byte range within the file.
|
||||
// Track the current byte offset in the output file as we process terms sequentially.
|
||||
let mut cur_file_byte_offset = query_file_byte_range.start;
|
||||
|
||||
// We'll create the URL info after processing all terms, once we know the actual range.
|
||||
let enable_multirange = xet_config().client.enable_multirange_fetching;
|
||||
|
||||
// Iterate over the terms and build the file terms and xorb terms.
|
||||
for (local_term_index, term) in raw_reconstruction.terms.iter().enumerate() {
|
||||
let xorb_hash: MerkleHash = term.hash.into();
|
||||
|
||||
// Get the xorb info here.
|
||||
let Some(xorb_info) = raw_reconstruction.fetch_info.get(&term.hash) else {
|
||||
let Some(xorb_descriptor) = raw_reconstruction.xorbs.get(&term.hash) else {
|
||||
return Err(FileReconstructionError::CorruptedReconstruction(format!(
|
||||
"Xorb info not found for xorb hash {xorb_hash:?}"
|
||||
)));
|
||||
};
|
||||
|
||||
// Get the xorb block index that this term belongs to.
|
||||
// Find the XorbBlock for this term's chunk range. The behavior depends on the
|
||||
// enable_multirange_fetching config:
|
||||
//
|
||||
// - When true: one XorbBlock per XorbMultiRangeFetch entry, preserving all ranges in a single block
|
||||
// (multi-range HTTP request).
|
||||
// - When false (default): one XorbBlock per individual XorbRangeDescriptor, so each range is fetched as a
|
||||
// separate single-range HTTP request in parallel.
|
||||
let xorb_block_index = 'find_xorb_block: {
|
||||
for raw_xorb_block_info in xorb_info.iter() {
|
||||
let chunk_range = raw_xorb_block_info.range;
|
||||
for fetch_entry in xorb_descriptor.iter() {
|
||||
if enable_multirange {
|
||||
let term_contained = fetch_entry
|
||||
.ranges
|
||||
.iter()
|
||||
.any(|r| r.chunks.start <= term.range.start && term.range.end <= r.chunks.end);
|
||||
|
||||
if chunk_range.start <= term.range.start && term.range.start <= chunk_range.end {
|
||||
// Verify that the term range is contained within the xorb block.
|
||||
if term.range.end > chunk_range.end {
|
||||
return Err(FileReconstructionError::CorruptedReconstruction(format!(
|
||||
"Term range extends beyond xorb block range for xorb hash {xorb_hash:?}"
|
||||
)));
|
||||
if !term_contained {
|
||||
continue;
|
||||
}
|
||||
|
||||
// Reuse the previous one if it exists, otherwise insert a new one.
|
||||
let index = match xorb_index_lookup.entry((xorb_hash, chunk_range.start as u64)) {
|
||||
let first_chunk_start = fetch_entry.ranges[0].chunks.start;
|
||||
|
||||
let index = match xorb_index_lookup.entry((xorb_hash, first_chunk_start)) {
|
||||
Entry::Occupied(entry) => *entry.get(),
|
||||
Entry::Vacant(entry) => {
|
||||
let new_index = xorb_blocks.len();
|
||||
|
||||
let chunk_ranges: Vec<ChunkRange> = fetch_entry.ranges.iter().map(|r| r.chunks).collect();
|
||||
let http_ranges: Vec<HttpRange> = fetch_entry.ranges.iter().map(|r| r.bytes).collect();
|
||||
|
||||
xorb_blocks.push(XorbBlock {
|
||||
xorb_hash,
|
||||
chunk_range,
|
||||
chunk_ranges,
|
||||
xorb_block_index: new_index,
|
||||
references: vec![],
|
||||
uncompressed_size_if_known: None,
|
||||
data: OnceCell::new(),
|
||||
});
|
||||
|
||||
// Store the retrieval URL and range for this xorb block.
|
||||
xorb_block_retrieval_urls
|
||||
.push((raw_xorb_block_info.url.clone(), raw_xorb_block_info.url_range));
|
||||
xorb_block_retrieval_urls.push((fetch_entry.url.clone(), http_ranges));
|
||||
|
||||
entry.insert(new_index);
|
||||
new_index
|
||||
},
|
||||
};
|
||||
|
||||
break 'find_xorb_block index;
|
||||
} else {
|
||||
for range in &fetch_entry.ranges {
|
||||
if range.chunks.start <= term.range.start && term.range.end <= range.chunks.end {
|
||||
let index = match xorb_index_lookup.entry((xorb_hash, range.chunks.start)) {
|
||||
Entry::Occupied(entry) => *entry.get(),
|
||||
Entry::Vacant(entry) => {
|
||||
let new_index = xorb_blocks.len();
|
||||
|
||||
xorb_blocks.push(XorbBlock {
|
||||
xorb_hash,
|
||||
chunk_ranges: vec![range.chunks],
|
||||
xorb_block_index: new_index,
|
||||
references: vec![],
|
||||
uncompressed_size_if_known: None,
|
||||
data: OnceCell::new(),
|
||||
});
|
||||
|
||||
xorb_block_retrieval_urls.push((fetch_entry.url.clone(), vec![range.bytes]));
|
||||
|
||||
// Store the index.
|
||||
entry.insert(new_index);
|
||||
new_index
|
||||
},
|
||||
@@ -157,90 +221,120 @@ pub async fn retrieve_file_term_block(
|
||||
break 'find_xorb_block index;
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
return Err(FileReconstructionError::CorruptedReconstruction(format!(
|
||||
"No xorb chunk range found for file term {local_term_index:?} in xorb info for xorb hash {xorb_hash:?}"
|
||||
"No xorb fetch entry found for file term {local_term_index:?} in xorb info for xorb hash {xorb_hash:?}"
|
||||
)));
|
||||
};
|
||||
|
||||
// Do we need to adjust for an offset into the first range?
|
||||
let offset_into_first_range = {
|
||||
if local_term_index == 0 {
|
||||
// Only the first term can have a non-zero offset into its first chunk,
|
||||
// which happens when the query byte range starts mid-chunk.
|
||||
let offset_into_first_range = if local_term_index == 0 {
|
||||
raw_reconstruction.offset_into_first_range
|
||||
} else {
|
||||
0
|
||||
}
|
||||
};
|
||||
|
||||
// The effective size of this term in the file.
|
||||
// The term's contribution to the output file is its full uncompressed size
|
||||
// minus any offset into the first chunk.
|
||||
let term_byte_size = term.unpacked_length as u64 - offset_into_first_range;
|
||||
|
||||
// Update the references term on the XorbBlock to track where the xorb gets used.
|
||||
// Record this term as a reference on its xorb block (used later to determine
|
||||
// whether the block's total uncompressed size can be inferred).
|
||||
xorb_blocks[xorb_block_index].references.push(XorbReference {
|
||||
term_chunks: term.range,
|
||||
uncompressed_size: term.unpacked_length as usize,
|
||||
});
|
||||
|
||||
// Store the file term data (byte_range, xorb_chunk_range, offset_into_first_range, xorb_block_index).
|
||||
// We'll create the FileTerm structs after we know the actual range.
|
||||
file_term_data.push((
|
||||
FileRange::new(cur_file_byte_offset, cur_file_byte_offset + term_byte_size),
|
||||
term.range,
|
||||
// Compute the flattened index into the block's chunk_offsets for this term's
// starting chunk. This accounts for disjoint chunk ranges in multi-range blocks.
//
// The containment checks above should guarantee that term.range.start falls within
// one of the block's chunk_ranges; the `found` flag below turns any mismatch into a
// CorruptedReconstruction error rather than a silently wrong index.
let xorb_block_start_index = {
    let chunk_start = term.range.start;
    let chunk_ranges = &xorb_blocks[xorb_block_index].chunk_ranges;
    let mut idx = 0;
    let mut found = false;
    for range in chunk_ranges {
        if chunk_start >= range.start && chunk_start < range.end {
            idx += (chunk_start - range.start) as usize;
            found = true;
            break;
        }
        idx += (range.end - range.start) as usize;
    }
    if !found {
        return Err(FileReconstructionError::CorruptedReconstruction(format!(
            "chunk_start {chunk_start} not found in chunk_ranges {chunk_ranges:?} for file term {local_term_index}"
        )));
    }
    idx
};
|
||||
|
||||
file_term_data.push(FileTermEntry {
|
||||
byte_range: FileRange::new(cur_file_byte_offset, cur_file_byte_offset + term_byte_size),
|
||||
xorb_chunk_range: term.range,
|
||||
offset_into_first_range,
|
||||
xorb_block_index,
|
||||
));
|
||||
xorb_block_start_index,
|
||||
});
|
||||
|
||||
cur_file_byte_offset += term_byte_size;
|
||||
}
|
||||
|
||||
// Sort the block references so that we can easily scan the terms to figure out how many references
|
||||
// a particular chunk may have.
|
||||
// Sort each block's references by chunk start so that determine_size_if_possible
|
||||
// can use its forward-chaining DP to check coverage.
|
||||
for block in &mut xorb_blocks {
|
||||
block.references.sort_by_key(|r| r.term_chunks.start);
|
||||
block.uncompressed_size_if_known = XorbBlock::determine_size_if_possible(block.chunk_range, &block.references);
|
||||
block.uncompressed_size_if_known =
|
||||
XorbBlock::determine_size_if_possible(&block.chunk_ranges, &block.references);
|
||||
}
|
||||
|
||||
// Now, it's possible that we have to shrink the byte range of the last term, as we may have retrieved more
|
||||
// due to chunk offsets.
|
||||
// The last term in the reconstruction may extend beyond the requested range
|
||||
// (e.g. when the query ends mid-chunk). Trim it to the query boundary.
|
||||
if cur_file_byte_offset > query_file_byte_range.end {
|
||||
let last_term_shrinkage = cur_file_byte_offset - query_file_byte_range.end;
|
||||
|
||||
debug_assert!(!file_term_data.is_empty());
|
||||
|
||||
if let Some(fi) = file_term_data.last_mut() {
|
||||
fi.0.end -= last_term_shrinkage;
|
||||
if let Some(entry) = file_term_data.last_mut() {
|
||||
entry.byte_range.end -= last_term_shrinkage;
|
||||
}
|
||||
}
|
||||
|
||||
// Calculate the actual retrieved range from the file terms.
|
||||
// The actual range covered, which may be smaller than requested if the file
// ends before the end of the requested range.
|
||||
let actual_range = FileRange::new(
|
||||
file_term_data.first().map(|(br, _, _, _)| br.start).unwrap_or(0),
|
||||
file_term_data.last().map(|(br, _, _, _)| br.end).unwrap_or(0),
|
||||
file_term_data.first().map(|e| e.byte_range.start).unwrap_or(0),
|
||||
file_term_data.last().map(|e| e.byte_range.end).unwrap_or(0),
|
||||
);
|
||||
|
||||
// Now, calculate the total number of bytes that needs to be downloaded given dedup and compression savings.
|
||||
let total_transfer_bytes = xorb_block_retrieval_urls
|
||||
// Total compressed bytes that will be transferred across all xorb block downloads.
|
||||
let total_transfer_bytes: u64 = xorb_block_retrieval_urls
|
||||
.iter()
|
||||
.map(|(_, http_range)| {
|
||||
let file_range = FileRange::from(*http_range);
|
||||
file_range.end.saturating_sub(file_range.start)
|
||||
})
|
||||
.flat_map(|(_, ranges)| ranges)
|
||||
.map(|r| r.length())
|
||||
.sum();
|
||||
|
||||
// Now create the URL info with the actual range and retrieval URLs.
|
||||
// Wrap the retrieval URLs in a shared struct so all file terms can share them
|
||||
// and coordinate URL refreshes through a single lock.
|
||||
let url_info =
|
||||
Arc::new(TermBlockRetrievalURLs::new(file_hash, actual_range, acquisition_id, xorb_block_retrieval_urls));
|
||||
|
||||
// Convert xorb_blocks to Arc<XorbBlock> for use in FileTerms.
|
||||
// Second pass: convert the intermediate FileTermEntry data into final FileTerm
|
||||
// structs, now that we can wrap xorb blocks in Arc and share the url_info.
|
||||
let xorb_blocks_arc: Vec<Arc<XorbBlock>> = xorb_blocks.into_iter().map(Arc::new).collect();
|
||||
|
||||
// Convert the intermediate data to FileTerm structs with the shared url_info.
|
||||
let file_terms: Vec<FileTerm> = file_term_data
|
||||
.into_iter()
|
||||
.map(|(byte_range, xorb_chunk_range, offset_into_first_range, xorb_block_index)| FileTerm {
|
||||
byte_range,
|
||||
xorb_chunk_range,
|
||||
offset_into_first_range,
|
||||
xorb_block: xorb_blocks_arc[xorb_block_index].clone(),
|
||||
.map(|entry| FileTerm {
|
||||
byte_range: entry.byte_range,
|
||||
xorb_chunk_range: entry.xorb_chunk_range,
|
||||
xorb_block_start_index: entry.xorb_block_start_index,
|
||||
offset_into_first_range: entry.offset_into_first_range,
|
||||
xorb_block: xorb_blocks_arc[entry.xorb_block_index].clone(),
|
||||
url_info: url_info.clone(),
|
||||
})
|
||||
.collect();
|
||||
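The `xorb_block_start_index` computation in the hunk above can be looked at in isolation. The following is a minimal sketch, not the code from this PR: `flattened_chunk_index` is a hypothetical helper name, and `std::ops::Range<u32>` stands in for the crate's `ChunkRange` type (assumed here to be half-open, as the `chunk_start < range.end` test in the diff suggests).

```rust
use std::ops::Range;

/// Hypothetical stand-alone helper (not part of xet-core): maps an absolute chunk
/// index to its position in a flattened list of chunks drawn from disjoint,
/// half-open chunk ranges, taken in range order.
///
/// Assumes every range is well-formed (start <= end).
fn flattened_chunk_index(chunk_ranges: &[Range<u32>], chunk_start: u32) -> Option<usize> {
    let mut idx = 0usize;
    for range in chunk_ranges {
        if range.contains(&chunk_start) {
            // Offset within this range, plus all chunks held by earlier ranges.
            return Some(idx + (chunk_start - range.start) as usize);
        }
        // Skip every chunk held by this range.
        idx += (range.end - range.start) as usize;
    }
    // The chunk falls into a gap or outside all ranges.
    None
}
```

With ranges `0..2`, `4..6`, `8..10`, chunk 5 is the second chunk of the second range, so it sits at flattened position 3; chunk 3 lies in a gap and yields `None`, which corresponds to the `CorruptedReconstruction` error path in the diff.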
@@ -252,7 +346,7 @@ pub async fn retrieve_file_term_block(
|
||||
mod tests {
|
||||
use std::sync::Arc;
|
||||
|
||||
use more_asserts::{assert_ge, assert_le};
|
||||
use more_asserts::assert_le;
|
||||
use xet_client::cas_client::{ClientTestingUtils, LocalClient, RandomFileContents};
|
||||
use xet_client::cas_types::{ChunkRange, FileRange};
|
||||
use xet_runtime::utils::UniqueId;
|
||||
@@ -351,10 +445,18 @@ mod tests {
|
||||
// Track xorb block index
|
||||
seen_xorb_indices.insert(file_term.xorb_block.xorb_block_index);
|
||||
|
||||
// Verify chunk range is within xorb block boundaries.
|
||||
// Verify chunk range is within xorb block boundaries: the term's chunk range
|
||||
// must be contained within at least one of the block's chunk ranges.
|
||||
let xorb_block = &file_term.xorb_block;
|
||||
assert_ge!(file_term.xorb_chunk_range.start, xorb_block.chunk_range.start);
|
||||
assert_le!(file_term.xorb_chunk_range.end, xorb_block.chunk_range.end);
|
||||
let term_in_some_range = xorb_block
|
||||
.chunk_ranges
|
||||
.iter()
|
||||
.any(|cr| file_term.xorb_chunk_range.start >= cr.start && file_term.xorb_chunk_range.end <= cr.end);
|
||||
assert!(
|
||||
term_in_some_range,
|
||||
"term chunk range {:?} not within any block chunk range {:?}",
|
||||
file_term.xorb_chunk_range, xorb_block.chunk_ranges
|
||||
);
|
||||
|
||||
// Cross-reference with known file contents.
|
||||
if expected_term_idx < file_contents.terms.len() {
|
||||
@@ -365,7 +467,7 @@ mod tests {
|
||||
|
||||
// Verify chunk range matches (accounting for partial first term).
|
||||
if file_term_data_offset == 0 {
|
||||
assert_eq!(file_term.xorb_chunk_range.start as u32, expected_term.chunk_start);
|
||||
assert_eq!(file_term.xorb_chunk_range.start, expected_term.chunk_start);
|
||||
}
|
||||
}
|
||||
|
||||
@@ -549,10 +651,11 @@ mod tests {
|
||||
// Get the first file term's xorb block to test URL retrieval
|
||||
let file_term = &file_terms[0];
|
||||
let xorb_block_index = file_term.xorb_block.xorb_block_index;
|
||||
let (unique_id, url, http_range) = file_term.url_info.get_retrieval_url(xorb_block_index).await;
|
||||
let (unique_id, url, http_ranges) = file_term.url_info.get_retrieval_url(xorb_block_index).await;
|
||||
|
||||
assert!(!url.is_empty());
|
||||
assert!(http_range.start < http_range.end);
|
||||
assert!(!http_ranges.is_empty());
|
||||
assert!(http_ranges[0].start <= http_ranges[0].end);
|
||||
assert!(unique_id != UniqueId::null());
|
||||
}
|
||||
|
||||
@@ -591,87 +694,136 @@ mod tests {
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_range_few_bytes_before_end() {
|
||||
// Test requesting a range that ends just a few bytes before the file end,
|
||||
// within the same chunk as the file end.
|
||||
let (client, file_contents) = setup_test_file(&[(1, (0, 5))]).await;
|
||||
let file_len = file_contents.data.len() as u64;
|
||||
|
||||
// Request range ending 3 bytes before the end
|
||||
let range = FileRange::new(0, file_len - 3);
|
||||
retrieve_and_verify(&client, &file_contents, Some(range)).await;
|
||||
|
||||
// Request range ending 1 byte before the end
|
||||
let range = FileRange::new(0, file_len - 1);
|
||||
retrieve_and_verify(&client, &file_contents, Some(range)).await;
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_range_few_bytes_after_start() {
|
||||
// Test requesting a range that starts just a few bytes after the file start,
|
||||
// within the same chunk as the file start.
|
||||
let (client, file_contents) = setup_test_file(&[(1, (0, 5))]).await;
|
||||
let file_len = file_contents.data.len() as u64;
|
||||
|
||||
// Request range starting 3 bytes after the start
|
||||
let range = FileRange::new(3, file_len);
|
||||
retrieve_and_verify(&client, &file_contents, Some(range)).await;
|
||||
|
||||
// Request range starting 1 byte after the start
|
||||
let range = FileRange::new(1, file_len);
|
||||
retrieve_and_verify(&client, &file_contents, Some(range)).await;
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_range_few_bytes_offset_both_ends() {
|
||||
// Test requesting a range with small offsets at both ends within the same chunk.
|
||||
let (client, file_contents) = setup_test_file(&[(1, (0, 5))]).await;
|
||||
let file_len = file_contents.data.len() as u64;
|
||||
|
||||
// Request range with 2 bytes trimmed from start and 2 bytes from end
|
||||
let range = FileRange::new(2, file_len - 2);
|
||||
retrieve_and_verify(&client, &file_contents, Some(range)).await;
|
||||
|
||||
// Request just the two bytes around the midpoint of the file
|
||||
let range = FileRange::new(file_len / 2 - 1, file_len / 2 + 1);
|
||||
retrieve_and_verify(&client, &file_contents, Some(range)).await;
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_range_single_byte_at_various_positions() {
|
||||
// Test requesting single bytes at various positions in the file.
|
||||
let (client, file_contents) = setup_test_file(&[(1, (0, 5))]).await;
|
||||
let file_len = file_contents.data.len() as u64;
|
||||
|
||||
// First byte
|
||||
retrieve_and_verify(&client, &file_contents, Some(FileRange::new(0, 1))).await;
|
||||
|
||||
// Last byte
|
||||
retrieve_and_verify(&client, &file_contents, Some(FileRange::new(file_len - 1, file_len))).await;
|
||||
|
||||
// Middle byte
|
||||
let mid = file_len / 2;
|
||||
retrieve_and_verify(&client, &file_contents, Some(FileRange::new(mid, mid + 1))).await;
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_multi_term_range_ends_mid_chunk() {
|
||||
// Test with multiple terms where the requested range ends in the middle of the last term's chunk.
|
||||
let (client, file_contents) = setup_test_file(&[(1, (0, 3)), (2, (0, 3)), (3, (0, 3))]).await;
|
||||
let file_len = file_contents.data.len() as u64;
|
||||
|
||||
// End a few bytes before the file end
|
||||
let range = FileRange::new(0, file_len - 5);
|
||||
retrieve_and_verify(&client, &file_contents, Some(range)).await;
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_multi_term_range_starts_mid_chunk() {
|
||||
// Test with multiple terms where the requested range starts in the middle of the first term's chunk.
|
||||
let (client, file_contents) = setup_test_file(&[(1, (0, 3)), (2, (0, 3)), (3, (0, 3))]).await;
|
||||
let file_len = file_contents.data.len() as u64;
|
||||
|
||||
// Start a few bytes after the file start
|
||||
let range = FileRange::new(5, file_len);
|
||||
retrieve_and_verify(&client, &file_contents, Some(range)).await;
|
||||
}
|
||||
|
||||
// ==================== Multi-Disjoint Range Edge Cases ====================
|
||||
|
||||
/// Single xorb with three disjoint chunk ranges.
|
||||
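/// When the ranges arrive in a single fetch entry and `client.enable_multirange_fetching` is true,
/// this creates one XorbBlock with chunk_ranges = [(0,2), (4,6), (8,10)]; with the default (false),
/// each range gets its own XorbBlock.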
|
||||
#[tokio::test]
|
||||
async fn test_triple_disjoint_same_xorb() {
|
||||
let (client, file_contents) = setup_test_file(&[(1, (0, 2)), (1, (4, 6)), (1, (8, 10))]).await;
|
||||
retrieve_and_verify(&client, &file_contents, None).await;
|
||||
}
|
||||
|
||||
/// Triple disjoint ranges with a partial byte range spanning both gaps.
|
||||
#[tokio::test]
|
||||
async fn test_triple_disjoint_partial_range_across_gap() {
|
||||
let (client, file_contents) = setup_test_file(&[(1, (0, 2)), (1, (4, 6)), (1, (8, 10))]).await;
|
||||
let file_len = file_contents.data.len() as u64;
|
||||
let range = FileRange::new(file_len / 4, file_len * 3 / 4);
|
||||
retrieve_and_verify(&client, &file_contents, Some(range)).await;
|
||||
}
|
||||
|
||||
/// Two xorbs, each with two disjoint ranges, interleaved in file order.
|
||||
#[tokio::test]
|
||||
async fn test_two_xorbs_interleaved_disjoint() {
|
||||
let term_spec = &[(1, (0, 2)), (2, (0, 2)), (1, (4, 6)), (2, (4, 6))];
|
||||
let (client, file_contents) = setup_test_file(term_spec).await;
|
||||
retrieve_and_verify(&client, &file_contents, None).await;
|
||||
}
|
||||
|
||||
/// Two xorbs interleaved with disjoint ranges, partial byte range.
|
||||
#[tokio::test]
|
||||
async fn test_two_xorbs_interleaved_disjoint_partial() {
|
||||
let term_spec = &[(1, (0, 2)), (2, (0, 2)), (1, (4, 6)), (2, (4, 6))];
|
||||
let (client, file_contents) = setup_test_file(term_spec).await;
|
||||
let file_len = file_contents.data.len() as u64;
|
||||
retrieve_and_verify(&client, &file_contents, Some(FileRange::new(file_len / 3, file_len * 2 / 3))).await;
|
||||
}
|
||||
|
||||
/// Single xorb with four disjoint ranges, each a single chunk wide.
|
||||
#[tokio::test]
|
||||
async fn test_four_single_chunk_disjoint() {
|
||||
let term_spec = &[(1, (0, 1)), (1, (3, 4)), (1, (6, 7)), (1, (9, 10))];
|
||||
let (client, file_contents) = setup_test_file(term_spec).await;
|
||||
retrieve_and_verify(&client, &file_contents, None).await;
|
||||
}
|
||||
|
||||
/// Mix of contiguous and disjoint ranges from the same xorb.
|
||||
/// Chunks 0-4 are contiguous, then a gap, then chunks 8-10.
|
||||
#[tokio::test]
|
||||
async fn test_contiguous_then_disjoint() {
|
||||
let term_spec = &[(1, (0, 2)), (1, (2, 4)), (1, (8, 10))];
|
||||
let (client, file_contents) = setup_test_file(term_spec).await;
|
||||
retrieve_and_verify(&client, &file_contents, None).await;
|
||||
}
|
||||
|
||||
/// Three xorbs with complex disjoint access patterns.
|
||||
#[tokio::test]
|
||||
async fn test_three_xorbs_complex_disjoint() {
|
||||
let term_spec = &[
|
||||
(1, (0, 2)),
|
||||
(2, (0, 3)),
|
||||
(3, (2, 5)),
|
||||
(1, (5, 8)),
|
||||
(2, (6, 8)),
|
||||
(3, (0, 2)),
|
||||
];
|
||||
let (client, file_contents) = setup_test_file(term_spec).await;
|
||||
retrieve_and_verify(&client, &file_contents, None).await;
|
||||
}
|
||||
}
|
||||
|
||||
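The tests above exercise disjoint chunk ranges, and how those ranges turn into HTTP requests depends on `client.enable_multirange_fetching`. The following is a simplified sketch of that split, under stated assumptions: `FetchEntry`, `PlannedRequest`, and `plan_requests` are hypothetical names, chunk ranges are plain `(start, end)` tuples, and the eager planning shown here differs from the PR, which creates blocks lazily as terms reference them.

```rust
/// Hypothetical, simplified stand-in for a V2 fetch entry: one presigned URL
/// plus the half-open chunk ranges it covers.
struct FetchEntry {
    url: String,
    chunk_ranges: Vec<(u32, u32)>,
}

/// Hypothetical planned request: the URL and the chunk ranges one HTTP request covers.
#[derive(Debug, PartialEq)]
struct PlannedRequest {
    url: String,
    chunk_ranges: Vec<(u32, u32)>,
}

/// With `enable_multirange`, each fetch entry becomes one multi-range request.
/// Without it, each range becomes its own single-range request against the same
/// URL, so the requests can be issued in parallel.
fn plan_requests(entries: &[FetchEntry], enable_multirange: bool) -> Vec<PlannedRequest> {
    let mut out = Vec::new();
    for entry in entries {
        if enable_multirange {
            out.push(PlannedRequest {
                url: entry.url.clone(),
                chunk_ranges: entry.chunk_ranges.clone(),
            });
        } else {
            for &range in &entry.chunk_ranges {
                out.push(PlannedRequest {
                    url: entry.url.clone(),
                    chunk_ranges: vec![range],
                });
            }
        }
    }
    out
}
```

For a single entry covering `(0, 2)`, `(4, 6)`, and `(8, 10)`, the sketch plans one request when multirange is enabled and three when it is disabled, all against the same URL.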
@@ -21,9 +21,11 @@ pub struct TermBlockRetrievalURLs {
|
||||
// which may be smaller than the originally requested range if the file ends early.
|
||||
pub byte_range: FileRange,
|
||||
|
||||
// The xorb retreival URLs. These could be refreshed if need be.
|
||||
// The xorb retrieval URLs. These could be refreshed if need be.
|
||||
// Indexed by xorb_block_index stored in each XorbBlock.
|
||||
pub(crate) xorb_block_retrieval_urls: RwLock<(UniqueId, Vec<(String, HttpRange)>)>,
|
||||
// Each entry is (url, http_ranges) to support multi-range V2 blocks.
|
||||
#[allow(clippy::type_complexity)]
|
||||
pub(crate) xorb_block_retrieval_urls: RwLock<(UniqueId, Vec<(String, Vec<HttpRange>)>)>,
|
||||
}
|
||||
|
||||
impl TermBlockRetrievalURLs {
|
||||
@@ -32,7 +34,7 @@ impl TermBlockRetrievalURLs {
|
||||
file_hash: MerkleHash,
|
||||
byte_range: FileRange,
|
||||
acquisition_id: UniqueId,
|
||||
retrieval_urls: Vec<(String, HttpRange)>,
|
||||
retrieval_urls: Vec<(String, Vec<HttpRange>)>,
|
||||
) -> Self {
|
||||
Self {
|
||||
file_hash,
|
||||
@@ -41,15 +43,13 @@ impl TermBlockRetrievalURLs {
|
||||
}
|
||||
}
|
||||
|
||||
/// Gets the retrieval URL for a given xorb block. All URL requests go through
|
||||
/// this method in order to manage url refreshes; this function returns the
|
||||
/// most recent retrieval URL in the case of a refresh.
|
||||
pub async fn get_retrieval_url(&self, xorb_block_index: usize) -> (UniqueId, String, HttpRange) {
|
||||
/// Gets the retrieval URL and all byte ranges for a given xorb block.
|
||||
/// All URL requests go through this method in order to manage URL refreshes;
|
||||
/// this function returns the most recent retrieval URL in the case of a refresh.
|
||||
pub async fn get_retrieval_url(&self, xorb_block_index: usize) -> (UniqueId, String, Vec<HttpRange>) {
|
||||
let xbru = self.xorb_block_retrieval_urls.read().await;
|
||||
|
||||
let (url, url_range) = xbru.1[xorb_block_index].clone();
|
||||
|
||||
(xbru.0, url, url_range)
|
||||
let (url, url_ranges) = &xbru.1[xorb_block_index];
|
||||
(xbru.0, url.clone(), url_ranges.clone())
|
||||
}
|
||||
|
||||
/// Refresh the retrieval URLs for all xorb blocks in this block.
|
||||
@@ -61,8 +61,7 @@ impl TermBlockRetrievalURLs {
|
||||
/// the new request will get a new URL.
|
||||
pub async fn refresh_retrieval_urls(&self, client: Arc<dyn Client>, acquisition_id: UniqueId) -> Result<()> {
|
||||
if self.xorb_block_retrieval_urls.read().await.0 != acquisition_id {
|
||||
// This means another process has got in here while we're waiting for the lock and
|
||||
// refreshed them.
|
||||
// Another task already refreshed while we were waiting for the read lock.
|
||||
debug!(
|
||||
file_hash = %self.file_hash,
|
||||
byte_range = ?(self.byte_range.start, self.byte_range.end),
|
||||
@@ -74,7 +73,7 @@ impl TermBlockRetrievalURLs {
|
||||
let mut retrieval_urls = self.xorb_block_retrieval_urls.write().await;
|
||||
|
||||
if retrieval_urls.0 != acquisition_id {
|
||||
// It's already been refreshed by another process.
|
||||
// Already refreshed by another task while waiting for the write lock.
|
||||
debug!(
|
||||
file_hash = %self.file_hash,
|
||||
byte_range = ?(self.byte_range.start, self.byte_range.end),
|
||||
@@ -90,8 +89,7 @@ impl TermBlockRetrievalURLs {
|
||||
"Refreshing expired retrieval URLs"
|
||||
);
|
||||
|
||||
// Since this hopefully doesn't happen too often, go through and retrieve an
|
||||
// entire new block, then make sure everything matches up and take in the new stuff.
|
||||
// Re-fetch the entire block to get fresh URLs, then verify the structure matches.
|
||||
let Some((returned_range, _transfer_bytes, file_terms)) =
|
||||
retrieve_file_term_block(client, self.file_hash, self.byte_range).await?
|
||||
else {
|
||||
@@ -141,11 +139,13 @@ pub struct XorbURLProvider {
|
||||
|
||||
#[async_trait::async_trait]
|
||||
impl URLProvider for XorbURLProvider {
|
||||
async fn retrieve_url(&self) -> std::result::Result<(String, HttpRange), xet_client::cas_client::CasClientError> {
|
||||
let (unique_id, url, http_range) = self.url_info.get_retrieval_url(self.xorb_block_index).await;
|
||||
async fn retrieve_url(
|
||||
&self,
|
||||
) -> std::result::Result<(String, Vec<HttpRange>), xet_client::cas_client::CasClientError> {
|
||||
let (unique_id, url, http_ranges) = self.url_info.get_retrieval_url(self.xorb_block_index).await;
|
||||
*self.last_acquisition_id.lock().await = unique_id;
|
||||
|
||||
Ok((url, http_range))
|
||||
Ok((url, http_ranges))
|
||||
}
|
||||
|
||||
async fn refresh_url(&self) -> std::result::Result<(), xet_client::cas_client::CasClientError> {
|
||||
@@ -155,3 +155,110 @@ impl URLProvider for XorbURLProvider {
|
||||
.map_err(|e| xet_client::cas_client::CasClientError::Other(e.to_string()))
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use std::sync::Arc;
|
||||
|
||||
use tokio::sync::Mutex;
|
||||
use xet_client::cas_client::{ClientTestingUtils, LocalClient, URLProvider};
|
||||
use xet_client::cas_types::{FileRange, HttpRange};
|
||||
use xet_core_structures::merklehash::MerkleHash;
|
||||
use xet_runtime::utils::UniqueId;
|
||||
|
||||
use super::{TermBlockRetrievalURLs, XorbURLProvider};
|
||||
|
||||
fn sample_urls(n: usize) -> Vec<(String, Vec<HttpRange>)> {
|
||||
(0..n)
|
||||
.map(|i| (format!("https://example.com/xorb_{i}"), vec![HttpRange::new(0, 100)]))
|
||||
.collect()
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_new_and_get_retrieval_url() {
|
||||
let id = UniqueId::new();
|
||||
let urls = sample_urls(3);
|
||||
let block = TermBlockRetrievalURLs::new(MerkleHash::default(), FileRange::new(0, 100), id, urls.clone());
|
||||
|
||||
for (i, expected) in urls.iter().enumerate() {
|
||||
let (ret_id, url, ranges) = block.get_retrieval_url(i).await;
|
||||
assert!(ret_id == id, "acquisition ID mismatch for block {i}");
|
||||
assert_eq!(url, expected.0);
|
||||
assert_eq!(ranges, expected.1);
|
||||
}
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_refresh_skipped_when_already_refreshed() {
|
||||
let (client, file_contents) = {
|
||||
let c = LocalClient::temporary().await.unwrap();
|
||||
let fc = c.upload_random_file(&[(1, (0, 3))], 64).await.unwrap();
|
||||
(c, fc)
|
||||
};
|
||||
|
||||
let file_range = FileRange::new(0, file_contents.data.len() as u64);
|
||||
let dyn_client: Arc<dyn xet_client::cas_client::Client> = client.clone();
|
||||
|
||||
let (_, _, file_terms) =
|
||||
super::retrieve_file_term_block(dyn_client.clone(), file_contents.file_hash, file_range)
|
||||
.await
|
||||
.unwrap()
|
||||
.unwrap();
|
||||
|
||||
let url_info = file_terms[0].url_info.clone();
|
||||
|
||||
// Get original acquisition ID
|
||||
let (original_id, _, _) = url_info.get_retrieval_url(0).await;
|
||||
|
||||
// Refresh with a stale (different) ID should be a no-op.
|
||||
let stale_id = UniqueId::new();
|
||||
url_info.refresh_retrieval_urls(dyn_client.clone(), stale_id).await.unwrap();
|
||||
let (id_after, _, _) = url_info.get_retrieval_url(0).await;
|
||||
assert!(id_after == original_id, "refresh with stale ID should not change acquisition ID");
|
||||
|
||||
// Refresh with the correct ID should update URLs.
|
||||
url_info.refresh_retrieval_urls(dyn_client.clone(), original_id).await.unwrap();
|
||||
let (refreshed_id, _, _) = url_info.get_retrieval_url(0).await;
|
||||
assert!(refreshed_id != original_id, "refresh with correct ID should change acquisition ID");
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_xorb_url_provider_retrieve_and_refresh() {
|
||||
let (client, file_contents) = {
|
||||
let c = LocalClient::temporary().await.unwrap();
|
||||
let fc = c.upload_random_file(&[(1, (0, 3))], 64).await.unwrap();
|
||||
(c, fc)
|
||||
};
|
||||
|
||||
let file_range = FileRange::new(0, file_contents.data.len() as u64);
|
||||
let dyn_client: Arc<dyn xet_client::cas_client::Client> = client.clone();
|
||||
|
||||
let (_, _, file_terms) =
|
||||
super::retrieve_file_term_block(dyn_client.clone(), file_contents.file_hash, file_range)
|
||||
.await
|
||||
.unwrap()
|
||||
.unwrap();
|
||||
|
||||
let url_info = file_terms[0].url_info.clone();
|
||||
|
||||
let provider = XorbURLProvider {
|
||||
client: dyn_client.clone(),
|
||||
url_info,
|
||||
xorb_block_index: 0,
|
||||
last_acquisition_id: Mutex::new(UniqueId::null()),
|
||||
};
|
||||
|
||||
// retrieve_url should succeed and return a valid URL.
|
||||
let (url, ranges) = provider.retrieve_url().await.unwrap();
|
||||
assert!(!url.is_empty());
|
||||
assert!(!ranges.is_empty());
|
||||
|
||||
// refresh_url should succeed (refreshes with the current acquisition ID).
|
||||
provider.refresh_url().await.unwrap();
|
||||
|
||||
// After refresh, retrieve_url should still work with updated URLs.
|
||||
let (url2, ranges2) = provider.retrieve_url().await.unwrap();
|
||||
assert!(!url2.is_empty());
|
||||
assert!(!ranges2.is_empty());
|
||||
}
|
||||
}
|
||||
|
||||
@@ -13,27 +13,49 @@ use super::retrieval_urls::{TermBlockRetrievalURLs, XorbURLProvider};
|
||||
use crate::progress_tracking::download_tracking::DownloadTaskUpdater;
|
||||
|
||||
/// Downloaded and decompressed data for a xorb block, including chunk boundary offsets.
|
||||
///
|
||||
/// A single `XorbBlockData` may hold data from multiple disjoint chunk ranges
|
||||
/// (V2 multi-range fetch). The chunks are concatenated in range order, and
|
||||
/// `chunk_offsets` maps each chunk index to its byte position within `data`.
|
||||
pub struct XorbBlockData {
|
||||
pub chunk_offsets: Vec<usize>,
|
||||
pub uncompressed_size: u64,
|
||||
/// Pairs of (chunk_index, byte_offset) mapping each chunk to its start position
|
||||
/// within `data`. Because the block can span multiple disjoint chunk ranges,
|
||||
/// storing the chunk index alongside the offset avoids ambiguity.
|
||||
pub chunk_offsets: Vec<(usize, usize)>,
|
||||
|
||||
/// The concatenated decompressed chunk data for all ranges in this block.
|
||||
pub data: Bytes,
|
||||
}
|
||||
|
||||
/// A reference from a file term back to the xorb block it belongs to.
|
||||
/// Used by `determine_size_if_possible` to check whether the block's total
|
||||
/// uncompressed size can be inferred from the terms that reference it.
|
||||
#[derive(Debug)]
|
||||
pub struct XorbReference {
|
||||
/// The chunk range within the xorb that this file term covers.
|
||||
pub term_chunks: ChunkRange,
|
||||
/// The uncompressed byte size of this term's data.
|
||||
pub uncompressed_size: usize,
|
||||
}
|
||||
|
||||
/// A downloadable xorb block identified by hash and chunk range, with cached data.
|
||||
/// Multiple file terms may reference the same xorb block.
|
||||
/// A downloadable xorb block identified by hash and chunk ranges, with cached data.
|
||||
///
|
||||
/// A block may contain multiple disjoint chunk ranges from the same xorb (V2 multi-range).
|
||||
/// Multiple file terms may reference the same block. Downloaded data is cached in `data`
|
||||
/// so that the first term to request it triggers the download, and subsequent terms
|
||||
/// reuse the cached result.
|
||||
pub struct XorbBlock {
|
||||
pub xorb_hash: MerkleHash,
|
||||
pub chunk_range: ChunkRange,
|
||||
/// The chunk ranges fetched for this block. For V1 this is a single range;
|
||||
/// for V2 multi-range fetches this may contain multiple disjoint ranges.
|
||||
pub chunk_ranges: Vec<ChunkRange>,
|
||||
/// Index into the parent `TermBlockRetrievalURLs` for URL lookup.
|
||||
pub xorb_block_index: usize,
|
||||
/// All file-term chunk ranges covered by this xorb block, sorted by range start.
|
||||
/// All file-term references covered by this block, sorted by chunk range start.
|
||||
/// Populated during `retrieve_file_term_block` and used to compute `uncompressed_size_if_known`.
|
||||
pub references: Vec<XorbReference>,
|
||||
/// Expected decompressed size of the block when known. Used for debug_assert in clients.
|
||||
/// Expected total decompressed size across all chunk ranges, if it can be determined
|
||||
/// from the references. Passed to clients as a debug assertion hint.
|
||||
pub uncompressed_size_if_known: Option<usize>,
|
||||
pub data: OnceCell<Arc<XorbBlockData>>,
|
||||
}
|
||||
@@ -41,7 +63,7 @@ pub struct XorbBlock {
|
||||
impl PartialEq for XorbBlock {
|
||||
fn eq(&self, other: &Self) -> bool {
|
||||
self.xorb_hash == other.xorb_hash
|
||||
&& self.chunk_range == other.chunk_range
|
||||
&& self.chunk_ranges == other.chunk_ranges
|
||||
&& self.xorb_block_index == other.xorb_block_index
|
||||
}
|
||||
}
|
||||
@@ -63,6 +85,7 @@ impl XorbBlock {
|
||||
) -> Result<Arc<XorbBlockData>> {
|
||||
let xorb_block_index = self.xorb_block_index;
|
||||
let uncompressed_size_if_known = self.uncompressed_size_if_known;
|
||||
let chunk_ranges = self.chunk_ranges.clone();
|
||||
|
||||
self.data
|
||||
.get_or_try_init(|| async {
|
||||
@@ -89,14 +112,18 @@ impl XorbBlock {
|
||||
.get_file_term_data(Box::new(url_provider), permit, progress_callback, uncompressed_size_if_known)
|
||||
.await?;
|
||||
|
||||
let chunk_offsets: Vec<usize> = chunk_byte_offsets.iter().map(|&x| x as usize).collect();
|
||||
let uncompressed_size = data.len() as u64;
|
||||
// Build chunk_offsets by zipping each chunk index (from all chunk_ranges)
|
||||
// with the corresponding byte offset from the returned data.
|
||||
let mut chunk_offsets = Vec::new();
|
||||
let mut offset_idx = 0;
|
||||
for range in &chunk_ranges {
|
||||
for chunk_idx in range.start..range.end {
|
||||
chunk_offsets.push((chunk_idx as usize, chunk_byte_offsets[offset_idx] as usize));
|
||||
offset_idx += 1;
|
||||
}
|
||||
}
|
||||
|
||||
Ok(Arc::new(XorbBlockData {
|
||||
chunk_offsets,
|
||||
uncompressed_size,
|
||||
data,
|
||||
}))
|
||||
Ok(Arc::new(XorbBlockData { chunk_offsets, data }))
|
||||
})
|
||||
.await
|
||||
.cloned()
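The pairing of chunk indices with byte offsets in the hunk above can be illustrated in isolation. This is a sketch under stated assumptions: `Range` is a simplified stand-in for `ChunkRange`, the function names are hypothetical, the ranges are sorted and disjoint, and `byte_offsets` holds at least one entry per chunk in range order. Unlike the indexed loop in the diff, it returns `None` rather than panicking when offsets run short.

```rust
/// Half-open chunk range `[start, end)`; a simplified stand-in for `ChunkRange`.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Range {
    pub start: u32,
    pub end: u32,
}

/// Pairs every chunk index covered by `ranges` with the next byte offset.
/// Returns `None` if there are fewer offsets than chunks.
pub fn pair_chunk_offsets(ranges: &[Range], byte_offsets: &[u32]) -> Option<Vec<(usize, usize)>> {
    let mut offsets = byte_offsets.iter();
    let mut pairs = Vec::new();
    for chunk_idx in ranges.iter().flat_map(|r| r.start..r.end) {
        let offset = offsets.next()?;
        pairs.push((chunk_idx as usize, *offset as usize));
    }
    Some(pairs)
}

/// Looks up the byte offset of `chunk_idx`. Relies on `pairs` being sorted
/// by chunk index, which holds when the input ranges are sorted and disjoint.
pub fn offset_of(pairs: &[(usize, usize)], chunk_idx: usize) -> Option<usize> {
    pairs
        .binary_search_by_key(&chunk_idx, |&(idx, _)| idx)
        .ok()
        .map(|pos| pairs[pos].1)
}
```

Storing the chunk index next to the offset is what makes a lookup for a chunk inside a gap (for example chunk 3 between `[0,3)` and `[5,7)`) fail cleanly instead of silently returning a neighbour's offset.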
|
||||
@@ -105,33 +132,67 @@ impl XorbBlock {
|
||||
/// Determines the total uncompressed size of the xorb block from the reference terms,
|
||||
/// if possible.
|
||||
///
|
||||
/// The size can be determined when:
|
||||
/// 1. A single term's chunk range exactly matches the full xorb range, or
|
||||
/// 2. A chain of term chunk ranges exactly covers the full xorb range with no gaps (e.g. [0..3, 3..5] covers 0..5).
|
||||
/// Uses a forward-chaining DP: starting from the first chunk range's start,
|
||||
/// we track which chunk positions are "reachable" (i.e., fully covered by a
|
||||
/// contiguous chain of terms) along with the accumulated uncompressed size.
|
||||
///
|
||||
/// The `terms` slice must be sorted by chunk range start index.
|
||||
pub fn determine_size_if_possible(xorb_range: ChunkRange, terms: &[XorbReference]) -> Option<usize> {
|
||||
/// For multi-range blocks with disjoint chunk ranges (e.g. `[0,3)` and `[5,8)`),
|
||||
/// the gaps between ranges are inserted as zero-cost bridges. This lets the DP
|
||||
/// traverse the full set of ranges in a single pass — a gap `[3,5)` contributes
|
||||
/// no data but connects the end of one range to the start of the next.
|
||||
///
|
||||
/// Returns `Some(total_size)` if every range is fully covered, `None` otherwise.
|
||||
///
|
||||
/// The `terms` slice must be sorted by `term_chunks.start`.
|
||||
pub fn determine_size_if_possible(xorb_ranges: &[ChunkRange], terms: &[XorbReference]) -> Option<usize> {
|
||||
debug_assert!(
|
||||
terms.windows(2).all(|w| w[0].term_chunks.start <= w[1].term_chunks.start),
|
||||
"terms must be sorted by chunk range start"
|
||||
);
|
||||
|
||||
// DP approach: track which chunk endpoints are reachable from xorb_range.start
|
||||
// via contiguous chains, along with accumulated uncompressed sizes.
|
||||
// This correctly handles multiple terms with the same start index by
|
||||
// considering all possible chain continuations.
|
||||
let mut reachable: BTreeMap<u32, usize> = BTreeMap::new();
|
||||
reachable.insert(xorb_range.start, 0);
|
||||
debug_assert!(
|
||||
terms.iter().all(|term| xorb_ranges
|
||||
.iter()
|
||||
.any(|r| term.term_chunks.start >= r.start && term.term_chunks.end <= r.end)),
|
||||
"all terms must fall within one of the xorb ranges"
|
||||
);
|
||||
|
||||
if xorb_ranges.is_empty() {
|
||||
return Some(0);
|
||||
}
|
||||
|
||||
// Build a lookup from range-end -> next-range-start for gap bridging.
|
||||
// E.g. for ranges [0,3) and [5,8), maps 3 -> 5, meaning once chunk 3
|
||||
// is reachable we can bridge to chunk 5 at zero cost.
|
||||
let gap_bridges: BTreeMap<u32, u32> = xorb_ranges
|
||||
.windows(2)
|
||||
.filter(|pair| pair[0].end < pair[1].start)
|
||||
.map(|pair| (pair[0].end, pair[1].start))
|
||||
.collect();
|
||||
|
||||
// DP map: chunk position -> accumulated uncompressed size to reach that position.
|
||||
// Seed with the start of the first range.
|
||||
let mut reachable: BTreeMap<u32, usize> = BTreeMap::new();
|
||||
reachable.insert(xorb_ranges[0].start, 0);
|
||||
|
||||
// Process terms in sorted order, extending reachable positions.
|
||||
for term in terms {
|
||||
if let Some(&accumulated) = reachable.get(&term.term_chunks.start) {
|
||||
reachable
|
||||
.entry(term.term_chunks.end)
|
||||
.or_insert(accumulated + term.uncompressed_size);
|
||||
let new_end = term.term_chunks.end;
|
||||
let new_size = accumulated + term.uncompressed_size;
|
||||
|
||||
reachable.entry(new_end).or_insert(new_size);
|
||||
|
||||
// If this term reaches the end of a range that has a gap bridge,
|
||||
// make the start of the next range reachable at the same accumulated size.
|
||||
if let Some(&bridge_target) = gap_bridges.get(&new_end) {
|
||||
reachable.entry(bridge_target).or_insert(new_size);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
reachable.get(&xorb_range.end).copied()
|
||||
// The block is fully covered if we can reach the end of the last range.
|
||||
reachable.get(&xorb_ranges.last().unwrap().end).copied()
|
||||
}
|
||||
}
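For readers who want to trace the gap-bridging DP by hand, here is a simplified re-statement using plain tuples instead of `ChunkRange` and `XorbReference`. The function name `total_size` is hypothetical, and the sketch assumes the same preconditions as the code above: ranges sorted and disjoint, terms sorted by start.

```rust
use std::collections::BTreeMap;

/// `ranges` are half-open `(start, end)` chunk ranges.
/// `terms` are `((start, end), uncompressed_size)` entries.
pub fn total_size(ranges: &[(u32, u32)], terms: &[((u32, u32), usize)]) -> Option<usize> {
    let (first, last) = match (ranges.first(), ranges.last()) {
        (Some(f), Some(l)) => (f, l),
        _ => return Some(0), // no ranges, nothing to cover
    };

    // Map each range end to the start of the next range when a gap separates them.
    let bridges: BTreeMap<u32, u32> = ranges
        .windows(2)
        .filter(|w| w[0].1 < w[1].0)
        .map(|w| (w[0].1, w[1].0))
        .collect();

    // Chunk position -> accumulated size needed to reach it.
    let mut reachable: BTreeMap<u32, usize> = BTreeMap::new();
    reachable.insert(first.0, 0);

    for &((start, end), size) in terms {
        if let Some(&acc) = reachable.get(&start) {
            let total = acc + size;
            reachable.entry(end).or_insert(total);
            // Crossing a gap adds no bytes.
            if let Some(&next_start) = bridges.get(&end) {
                reachable.entry(next_start).or_insert(total);
            }
        }
    }

    reachable.get(&last.1).copied()
}
```

With ranges `[0,3)` and `[5,8)` and terms of 300 and 400 bytes, the map evolves as `{0: 0}`, then `{0: 0, 3: 300, 5: 300}` after the first term and its bridge, then gains `8: 700` after the second term, so the result is `Some(700)`.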
|
||||
|
||||
@@ -153,197 +214,210 @@ mod tests {
|
||||
|
||||
#[test]
|
||||
fn test_single_term_exact_match() {
|
||||
let xorb_range = ChunkRange::new(0, 5);
|
||||
let ranges = &[ChunkRange::new(0, 5)];
|
||||
let terms = build_refs(&[(ChunkRange::new(0, 5), 1000)]);
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), Some(1000));
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), Some(1000));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_two_terms_chained() {
|
||||
let xorb_range = ChunkRange::new(0, 5);
|
||||
let ranges = &[ChunkRange::new(0, 5)];
|
||||
let terms = build_refs(&[(ChunkRange::new(0, 3), 600), (ChunkRange::new(3, 5), 400)]);
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), Some(1000));
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), Some(1000));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_three_terms_chained() {
|
||||
let xorb_range = ChunkRange::new(0, 6);
|
||||
let ranges = &[ChunkRange::new(0, 6)];
|
||||
let terms = build_refs(&[
|
||||
(ChunkRange::new(0, 2), 200),
|
||||
(ChunkRange::new(2, 4), 300),
|
||||
(ChunkRange::new(4, 6), 500),
|
||||
]);
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), Some(1000));
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), Some(1000));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_gap_in_chain() {
|
||||
let xorb_range = ChunkRange::new(0, 6);
|
||||
let ranges = &[ChunkRange::new(0, 6)];
|
||||
let terms = build_refs(&[(ChunkRange::new(0, 2), 200), (ChunkRange::new(4, 6), 500)]);
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), None);
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), None);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_does_not_start_at_xorb_start() {
|
||||
let xorb_range = ChunkRange::new(0, 5);
|
||||
let ranges = &[ChunkRange::new(0, 5)];
|
||||
let terms = build_refs(&[(ChunkRange::new(1, 5), 800)]);
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), None);
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), None);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_does_not_end_at_xorb_end() {
|
||||
let xorb_range = ChunkRange::new(0, 5);
|
||||
let ranges = &[ChunkRange::new(0, 5)];
|
||||
let terms = build_refs(&[(ChunkRange::new(0, 3), 600)]);
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), None);
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), None);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_empty_terms() {
|
||||
let xorb_range = ChunkRange::new(0, 5);
|
||||
let ranges = &[ChunkRange::new(0, 5)];
|
||||
let terms: Vec<XorbReference> = vec![];
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), None);
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), None);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_overlapping_terms_with_exact_cover() {
|
||||
// Terms [0..3, 1..4, 3..5] - the chain 0..3, 3..5 covers 0..5.
|
||||
// The overlapping term 1..4 should be skipped.
|
||||
let xorb_range = ChunkRange::new(0, 5);
|
||||
let ranges = &[ChunkRange::new(0, 5)];
|
||||
let terms = build_refs(&[
|
||||
(ChunkRange::new(0, 3), 600),
|
||||
(ChunkRange::new(1, 4), 700),
|
||||
(ChunkRange::new(3, 5), 400),
|
||||
]);
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), Some(1000));
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), Some(1000));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_duplicate_terms_first_covers() {
|
||||
// Two identical terms covering the full range.
|
||||
let xorb_range = ChunkRange::new(0, 5);
|
||||
let ranges = &[ChunkRange::new(0, 5)];
|
||||
let terms = build_refs(&[(ChunkRange::new(0, 5), 1000), (ChunkRange::new(0, 5), 1000)]);
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), Some(1000));
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), Some(1000));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_nonzero_xorb_start() {
|
||||
let xorb_range = ChunkRange::new(3, 8);
|
||||
let ranges = &[ChunkRange::new(3, 8)];
|
||||
let terms = build_refs(&[(ChunkRange::new(3, 5), 400), (ChunkRange::new(5, 8), 600)]);
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), Some(1000));
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), Some(1000));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_nonzero_xorb_start_no_match() {
|
||||
let xorb_range = ChunkRange::new(3, 8);
|
||||
let ranges = &[ChunkRange::new(3, 8)];
|
||||
let terms = build_refs(&[(ChunkRange::new(3, 5), 400)]);
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), None);
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), None);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_single_chunk_range() {
|
||||
let xorb_range = ChunkRange::new(0, 1);
|
||||
let ranges = &[ChunkRange::new(0, 1)];
|
||||
let terms = build_refs(&[(ChunkRange::new(0, 1), 42)]);
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), Some(42));
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), Some(42));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_chain_with_extra_terms_before_and_after() {
|
||||
// Extra terms that don't participate in the chain but are within the sorted list.
|
||||
let xorb_range = ChunkRange::new(2, 8);
|
||||
fn test_chain_with_overlapping_inner_terms() {
|
||||
let ranges = &[ChunkRange::new(2, 8)];
|
||||
// The overlapping term [3,6) is within the range but doesn't form
|
||||
// a better chain than [2,5) + [5,8), so it's harmlessly ignored.
|
||||
let terms = build_refs(&[
|
||||
(ChunkRange::new(0, 2), 100), // before xorb range
|
||||
(ChunkRange::new(2, 5), 500), // chain start
|
||||
(ChunkRange::new(3, 6), 999), // overlapping, skipped
|
||||
(ChunkRange::new(5, 8), 300), // chain end
|
||||
(ChunkRange::new(8, 10), 200), // after xorb range
|
||||
(ChunkRange::new(2, 5), 500),
|
||||
(ChunkRange::new(3, 6), 999),
|
||||
(ChunkRange::new(5, 8), 300),
|
||||
]);
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), Some(800));
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), Some(800));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_partial_overlap_no_cover() {
|
||||
// Terms partially overlap but don't form a contiguous chain covering the full range.
|
||||
let xorb_range = ChunkRange::new(0, 10);
|
||||
let ranges = &[ChunkRange::new(0, 10)];
|
||||
let terms = build_refs(&[
|
||||
(ChunkRange::new(0, 4), 400),
|
||||
(ChunkRange::new(3, 7), 400),
|
||||
(ChunkRange::new(6, 10), 400),
|
||||
]);
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), None);
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), None);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_same_start_short_then_long_covering_full() {
|
||||
// Short range first, then a long range that covers the full xorb.
|
||||
let xorb_range = ChunkRange::new(0, 5);
|
||||
let ranges = &[ChunkRange::new(0, 5)];
|
||||
let terms = build_refs(&[(ChunkRange::new(0, 3), 300), (ChunkRange::new(0, 5), 500)]);
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), Some(500));
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), Some(500));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_same_start_short_then_long_with_chain() {
|
||||
// Short range first, then a longer range, where the short range can also chain.
|
||||
let xorb_range = ChunkRange::new(0, 6);
|
||||
// Chain via 0..3 + 3..6 = 600
|
||||
let ranges = &[ChunkRange::new(0, 6)];
|
||||
let terms = build_refs(&[
|
||||
(ChunkRange::new(0, 2), 200),
|
||||
(ChunkRange::new(0, 3), 300),
|
||||
(ChunkRange::new(3, 6), 300),
|
||||
]);
|
||||
// Chain via 0..3 + 3..6 = 600
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), Some(600));
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), Some(600));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_same_start_multiple_duplicates_chain_through_second() {
|
||||
// Multiple terms at start 0 with different lengths; only the middle one chains.
|
||||
let xorb_range = ChunkRange::new(0, 6);
|
||||
// Chain via 0..4 + 4..6 = 600
|
||||
let ranges = &[ChunkRange::new(0, 6)];
|
||||
let terms = build_refs(&[
|
||||
(ChunkRange::new(0, 2), 200),
|
||||
(ChunkRange::new(0, 4), 400),
|
||||
(ChunkRange::new(0, 5), 500),
|
||||
(ChunkRange::new(4, 6), 200),
|
||||
]);
|
||||
// Chain via 0..4 + 4..6 = 600
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), Some(600));
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), Some(600));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_same_start_at_midpoint() {
|
||||
// Duplicate starts at a midpoint in the chain, not just at the beginning.
|
||||
let xorb_range = ChunkRange::new(0, 8);
|
||||
// Chain via 0..3 + 3..6 + 6..8 = 800
|
||||
let ranges = &[ChunkRange::new(0, 8)];
|
||||
let terms = build_refs(&[
|
||||
(ChunkRange::new(0, 3), 300),
|
||||
(ChunkRange::new(3, 5), 200),
|
||||
(ChunkRange::new(3, 6), 300),
|
||||
(ChunkRange::new(6, 8), 200),
|
||||
]);
|
||||
// Chain via 0..3 + 3..6 + 6..8 = 800
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), Some(800));
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), Some(800));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_same_start_none_covers() {
|
||||
// Multiple terms at start 0, but none chain to cover the full range.
|
||||
let xorb_range = ChunkRange::new(0, 10);
|
||||
let ranges = &[ChunkRange::new(0, 10)];
|
||||
let terms = build_refs(&[
|
||||
(ChunkRange::new(0, 2), 200),
|
||||
(ChunkRange::new(0, 4), 400),
|
||||
(ChunkRange::new(0, 6), 600),
|
||||
]);
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), None);
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), None);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_same_start_two_groups_chained() {
|
||||
// Two groups of duplicate-start terms that chain together.
|
||||
let xorb_range = ChunkRange::new(0, 6);
|
||||
// Chain via 0..3 + 3..6 = 600
|
||||
let ranges = &[ChunkRange::new(0, 6)];
|
||||
let terms = build_refs(&[
|
||||
(ChunkRange::new(0, 2), 200),
|
||||
(ChunkRange::new(0, 3), 300),
|
||||
(ChunkRange::new(3, 5), 200),
|
||||
(ChunkRange::new(3, 6), 300),
|
||||
]);
|
||||
// Chain via 0..3 + 3..6 = 600
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(xorb_range, &terms), Some(600));
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), Some(600));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_multiple_disjoint_ranges_both_covered() {
|
||||
let ranges = &[ChunkRange::new(0, 3), ChunkRange::new(5, 8)];
|
||||
let terms = build_refs(&[(ChunkRange::new(0, 3), 300), (ChunkRange::new(5, 8), 400)]);
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), Some(700));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_multiple_disjoint_ranges_one_uncovered() {
|
||||
let ranges = &[ChunkRange::new(0, 3), ChunkRange::new(5, 8)];
|
||||
let terms = build_refs(&[(ChunkRange::new(0, 3), 300)]);
|
||||
assert_eq!(XorbBlock::determine_size_if_possible(ranges, &terms), None);
|
||||
}
|
||||
}
|
||||
|
||||
@@ -7,8 +7,8 @@ use anyhow::Result;
|
||||
use clap::{Args, Parser, Subcommand};
|
||||
use http::header::{self, HeaderMap, HeaderValue};
|
||||
use walkdir::WalkDir;
|
||||
use xet_client::cas_client::RemoteClient;
|
||||
use xet_client::cas_client::auth::TokenRefresher;
|
||||
use xet_client::cas_client::{Client, RemoteClient};
|
||||
use xet_client::cas_types::{FileRange, QueryReconstructionResponse};
|
||||
use xet_client::hub_client::{BearerCredentialHelper, HubClient, Operation, RepoInfo};
|
||||
use xet_core_structures::merklehash::MerkleHash;
|
||||
@@ -230,8 +230,9 @@ async fn query_reconstruction(
|
||||
cas_storage_config.custom_headers.clone(),
|
||||
);
|
||||
|
||||
// Use V1 directly so the query tool returns the raw QueryReconstructionResponse for inspection.
|
||||
remote_client
|
||||
.get_reconstruction(&file_hash, bytes_range)
|
||||
.get_reconstruction_v1(&file_hash, bytes_range)
|
||||
.await
|
||||
.map_err(anyhow::Error::from)
|
||||
}
|
||||
|
||||
@@ -15,6 +15,48 @@ use super::file_cleaner::Sha256Policy;
|
||||
use super::{FileDownloadSession, FileUploadSession, XetFileInfo};
|
||||
use crate::progress_tracking::TrackingProgressUpdater;
|
||||
|
||||
/// Describes how hydration (download/smudge) should be performed during a test.
|
||||
///
|
||||
/// Each variant exercises a different reconstruction path:
|
||||
/// - `DirectClient`: Uses `LocalClient` directly (no HTTP server).
|
||||
/// - `ServerV2`: Uses `LocalTestServer` with default V2 reconstruction.
|
||||
/// - `ServerV1Fallback`: Uses `LocalTestServer` with V2 disabled, forcing V1 fallback.
|
||||
/// - `ServerMaxRanges2`: Uses `LocalTestServer` with `max_ranges_per_fetch=2`, forcing multi-range fetch splitting in
|
||||
/// V2 responses.
|
||||
#[derive(Debug, Clone, Copy)]
|
||||
pub enum HydrationMode {
|
||||
DirectClient,
|
||||
ServerV2,
|
||||
ServerV1Fallback,
|
||||
ServerMaxRanges2,
|
||||
}
|
||||
|
||||
impl HydrationMode {
|
||||
pub fn all() -> &'static [HydrationMode] {
|
||||
&[
|
||||
HydrationMode::DirectClient,
|
||||
HydrationMode::ServerV2,
|
||||
HydrationMode::ServerV1Fallback,
|
||||
HydrationMode::ServerMaxRanges2,
|
||||
]
|
||||
}
|
||||
|
||||
pub fn uses_server(&self) -> bool {
|
||||
!matches!(self, HydrationMode::DirectClient)
|
||||
}
|
||||
}
|
||||
|
||||
impl std::fmt::Display for HydrationMode {
|
||||
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
|
||||
match self {
|
||||
HydrationMode::DirectClient => write!(f, "direct_client"),
|
||||
HydrationMode::ServerV2 => write!(f, "server_v2"),
|
||||
HydrationMode::ServerV1Fallback => write!(f, "server_v1_fallback"),
|
||||
HydrationMode::ServerMaxRanges2 => write!(f, "server_max_ranges_2"),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// Creates or overwrites a single file at `path` with `size` bytes of random data.
|
||||
/// Panics on any I/O error. Returns the total number of bytes written (=`size`).
|
||||
pub fn create_random_file(path: impl AsRef<Path>, size: usize, seed: u64) -> usize {
|
||||
@@ -174,6 +216,44 @@ impl HydrateDehydrateTest {
|
||||
}
|
||||
}
|
||||
|
||||
/// Creates a new test harness configured for a specific hydration mode.
|
||||
pub fn for_mode(mode: HydrationMode) -> Self {
|
||||
Self::new(mode.uses_server())
|
||||
}
|
||||
|
||||
/// Applies hydration mode configuration to the test server.
|
||||
/// Must be called after `dehydrate()` and before `hydrate()`.
|
||||
pub async fn apply_hydration_mode(&mut self, mode: HydrationMode) {
|
||||
match mode {
|
||||
HydrationMode::DirectClient => {},
|
||||
HydrationMode::ServerV2 => {
|
||||
self.ensure_server_created().await;
|
||||
},
|
||||
HydrationMode::ServerV1Fallback => {
|
||||
self.ensure_server_created().await;
|
||||
self.test_server.as_ref().unwrap().client().disable_v2_reconstruction(404);
|
||||
},
|
||||
HydrationMode::ServerMaxRanges2 => {
|
||||
self.ensure_server_created().await;
|
||||
self.test_server.as_ref().unwrap().client().set_max_ranges_per_fetch(2);
|
||||
},
|
||||
}
|
||||
}
|
||||
|
||||
/// Ensures the test server is running, creating it if necessary.
|
||||
/// Call this before configuring the server (e.g., disabling V2 or setting max ranges).
|
||||
pub async fn ensure_server_created(&mut self) {
|
||||
if self.use_test_server && self.test_server.is_none() {
|
||||
let local_client = LocalClient::new(self.cas_dir.join("xet/xorbs")).await.unwrap();
|
||||
self.test_server = Some(LocalTestServerBuilder::new().with_client(local_client).start().await);
|
||||
}
|
||||
}
|
||||
|
||||
/// Returns a reference to the test server, if one has been created.
|
||||
pub fn test_server(&self) -> Option<&LocalTestServer> {
|
||||
self.test_server.as_ref()
|
||||
}
|
||||
|
||||
/// Lazily initializes the test server (if needed) and returns a CAS client.
|
||||
async fn get_or_create_client(&mut self) -> Arc<dyn Client> {
|
||||
if self.use_test_server {
|
||||
|
||||
@@ -10,18 +10,25 @@ test_set_constants! {
|
||||
MAX_XORB_CHUNKS = 8;
|
||||
}
|
||||
|
||||
/// Runs clean/smudge test with all combinations of (use_test_server, sequential).
|
||||
/// Runs clean/smudge test with all combinations of (hydration_mode, sequential).
|
||||
/// Each combination runs sequentially with its own HydrateDehydrateTest instance to avoid
|
||||
/// too many open files.
|
||||
///
|
||||
/// This exercises every hydration path for every test case:
|
||||
/// - DirectClient: LocalClient without a server
|
||||
/// - ServerV2: LocalTestServer with default V2 reconstruction
|
||||
/// - ServerV1Fallback: LocalTestServer with V2 disabled (tests V1-to-V2 conversion)
|
||||
/// - ServerMaxRanges2: LocalTestServer with max_ranges_per_fetch=2 (tests fetch splitting)
|
||||
pub async fn check_clean_smudge_files(file_list: &[(impl AsRef<str> + Clone, usize)]) {
|
||||
for use_server in [false, true] {
|
||||
for &mode in HydrationMode::all() {
|
||||
for sequential in [true, false] {
|
||||
eprintln!("Testing use_test_server={use_server}, sequential={sequential}");
|
||||
eprintln!("Testing mode={mode}, sequential={sequential}");
|
||||
|
||||
let mut ts = HydrateDehydrateTest::new(use_server);
|
||||
let mut ts = HydrateDehydrateTest::for_mode(mode);
|
||||
create_random_files(&ts.src_dir, file_list, 0);
|
||||
|
||||
ts.dehydrate(sequential).await;
|
||||
ts.apply_hydration_mode(mode).await;
|
||||
ts.hydrate().await;
|
||||
ts.verify_src_dest_match();
|
||||
ts.hydrate_partitioned_writers(4).await;
|
||||
@@ -35,18 +42,21 @@ pub async fn check_clean_smudge_files(file_list: &[(impl AsRef<str> + Clone, usi
|
||||
/// Helper for multipart tests:
|
||||
/// - takes a slice of `(String, Vec<(usize, u64)>)` which fully specifies each file.
|
||||
/// - for each file, calls `create_random_multipart_file` with the given segments.
|
||||
///
|
||||
/// Exercises all hydration modes just like `check_clean_smudge_files`.
|
||||
async fn check_clean_smudge_files_multipart(file_specs: &[(String, Vec<(usize, u64)>)]) {
|
||||
for use_server in [false, true] {
|
||||
for &mode in HydrationMode::all() {
|
||||
for sequential in [true, false] {
|
||||
eprintln!("Testing use_test_server={use_server}, sequential={sequential}");
|
||||
eprintln!("Testing mode={mode}, sequential={sequential}");
|
||||
|
||||
let mut ts = HydrateDehydrateTest::new(use_server);
|
||||
let mut ts = HydrateDehydrateTest::for_mode(mode);
|
||||
|
||||
for (file_name, segments) in file_specs {
|
||||
create_random_multipart_file(ts.src_dir.join(file_name), segments);
|
||||
}
|
||||
|
||||
ts.dehydrate(sequential).await;
|
||||
ts.apply_hydration_mode(mode).await;
|
||||
ts.hydrate().await;
|
||||
ts.verify_src_dest_match();
|
||||
ts.hydrate_partitioned_writers(4).await;
|
||||
|
||||
71
xet_data/tests/test_clean_smudge_multirange.rs
Normal file
@@ -0,0 +1,71 @@
|
||||
//! Clean/smudge integration tests with `enable_multirange_fetching = true`.
|
||||
//!
|
||||
//! This test binary is a separate copy of a subset of the clean/smudge tests
|
||||
//! that runs with `enable_multirange_fetching` enabled, exercising the
|
||||
//! multirange HTTP request path rather than the default single-range splitting.
|
||||
|
||||
use xet_data::deduplication::constants::{MAX_XORB_BYTES, MAX_XORB_CHUNKS, TARGET_CHUNK_SIZE};
|
||||
use xet_data::processing::test_utils::*;
|
||||
use xet_runtime::{test_set_config, test_set_constants};
|
||||
|
||||
test_set_constants! {
|
||||
TARGET_CHUNK_SIZE = 1024;
|
||||
MAX_XORB_BYTES = 5 * (*TARGET_CHUNK_SIZE);
|
||||
MAX_XORB_CHUNKS = 8;
|
||||
}
|
||||
|
||||
test_set_config! {
|
||||
client {
|
||||
enable_multirange_fetching = true;
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod testing_clean_smudge_multirange {
|
||||
use super::*;
|
||||
|
||||
pub async fn check_clean_smudge_files(file_list: &[(impl AsRef<str> + Clone, usize)]) {
|
||||
for &mode in HydrationMode::all() {
|
||||
for sequential in [true, false] {
|
||||
eprintln!("Testing mode={mode}, sequential={sequential} (forced multirange)");
|
||||
|
||||
let mut ts = HydrateDehydrateTest::for_mode(mode);
|
||||
create_random_files(&ts.src_dir, file_list, 0);
|
||||
|
||||
ts.dehydrate(sequential).await;
|
||||
ts.apply_hydration_mode(mode).await;
|
||||
ts.hydrate().await;
|
||||
ts.verify_src_dest_match();
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
#[tokio::test(flavor = "multi_thread", worker_threads = 2)]
|
||||
async fn test_simple_directory() {
|
||||
check_clean_smudge_files(&[("a", 16)]).await;
|
||||
}
|
||||
|
||||
#[tokio::test(flavor = "multi_thread", worker_threads = 2)]
|
||||
async fn test_multiple() {
|
||||
check_clean_smudge_files(&[("a", 16), ("b", 8)]).await;
|
||||
}
|
||||
|
||||
#[tokio::test(flavor = "multi_thread", worker_threads = 2)]
|
||||
async fn test_single_large() {
|
||||
check_clean_smudge_files(&[("a", *MAX_XORB_BYTES + 1)]).await;
|
||||
}
|
||||
|
||||
#[tokio::test(flavor = "multi_thread", worker_threads = 2)]
|
||||
async fn test_multiple_large() {
|
||||
check_clean_smudge_files(&[("a", *MAX_XORB_BYTES + 1), ("b", *MAX_XORB_BYTES + 2)]).await;
|
||||
}
|
||||
|
||||
#[tokio::test(flavor = "multi_thread", worker_threads = 2)]
|
||||
async fn test_many_small_multiple_xorbs() {
|
||||
let n = 16;
|
||||
let size = *MAX_XORB_BYTES / 8 + 1;
|
||||
|
||||
let files: Vec<_> = (0..n).map(|idx| (format!("f_{idx}"), size)).collect();
|
||||
check_clean_smudge_files(&files).await;
|
||||
}
|
||||
}
|
||||
@@ -217,7 +217,7 @@ crate::config_group!({
     /// The default value is 2.
     ///
     /// Use the environment variable `HF_XET_CLIENT_AC_INITIAL_UPLOAD_CONCURRENCY` to set this value.
-    ref ac_initial_upload_concurrency: usize = 1;
+    ref ac_initial_upload_concurrency: usize = 2;
 
     /// The maximum number of simultaneous download streams permitted by
     /// the adaptive concurrency control.
@@ -238,10 +238,10 @@ crate::config_group!({
     /// The starting number of concurrent download streams, which will increase up to max_concurrent_downloads
     /// on successful completions.
     ///
-    /// The default value is 1.
+    /// The default value is 4.
     ///
     /// Use the environment variable `HF_XET_CLIENT_AC_INITIAL_DOWNLOAD_CONCURRENCY` to set this value.
-    ref ac_initial_download_concurrency: usize = 1;
+    ref ac_initial_download_concurrency: usize = 4;
 
     /// Path to Unix domain socket for CAS HTTP connections.
     /// When set, all CAS HTTP traffic uses this socket instead of TCP.
@@ -252,4 +252,24 @@ crate::config_group!({
     /// Use the environment variable `HF_XET_CLIENT_UNIX_SOCKET_PATH` to set this value.
     ref unix_socket_path: Option<String> = None;
+
+    /// The reconstruction API version to request from the CAS server.
+    /// When set to 1 or 2, forces that version with no fallback.
+    /// When unset, auto-detects by trying V2 first, falling back to V1 on 404 or 501.
+    ///
+    /// The default value is None (auto-detect).
+    ///
+    /// Use the environment variable `HF_XET_CLIENT_RECONSTRUCTION_API_VERSION` to set this value.
+    ref reconstruction_api_version: Option<u32> = None;
+
+    /// Whether to use multi-range HTTP requests when fetching xorb data.
+    /// When false (default), V2 multi-range fetch entries are split into
+    /// individual single-range requests executed in parallel, which avoids
+    /// slow server-side multirange processing.
+    /// When true, multi-range requests are sent as-is.
+    ///
+    /// The default value is false.
+    ///
+    /// Use the environment variable `HF_XET_CLIENT_ENABLE_MULTIRANGE_FETCHING` to set this value.
+    ref enable_multirange_fetching: bool = false;
 
 });
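
The auto-detect behavior documented for `reconstruction_api_version` can be illustrated with a small sketch. This is not the client's actual code: `ReconstructionApi`, `choose_api`, and the `probe_v2` closure (a stand-in for issuing the V2 request and returning its HTTP status code) are hypothetical names. Rejecting values other than 1 or 2 is an assumption, since the source does not say how they are handled.

```rust
/// Hypothetical enum naming the two reconstruction API versions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ReconstructionApi {
    V1,
    V2,
}

/// Picks the API version from the configured value.
///
/// `configured` mirrors `reconstruction_api_version`. `probe_v2` is a
/// hypothetical stand-in for sending the V2 request and returning the
/// HTTP status code of the response.
fn choose_api(
    configured: Option<u32>,
    probe_v2: impl FnOnce() -> u16,
) -> Result<ReconstructionApi, String> {
    match configured {
        // A forced version is used as-is, with no probing and no fallback.
        Some(1) => Ok(ReconstructionApi::V1),
        Some(2) => Ok(ReconstructionApi::V2),
        // Assumption: any other value is treated as a configuration error.
        Some(other) => Err(format!("unsupported reconstruction API version: {other}")),
        // Unset: try V2 first, fall back to V1 only on 404 or 501.
        None => match probe_v2() {
            404 | 501 => Ok(ReconstructionApi::V1),
            _ => Ok(ReconstructionApi::V2),
        },
    }
}
```

With `None`, a 404 or 501 from the probe selects V1 and any other status keeps V2; error handling for other failures is outside this sketch. `Some(1)` and `Some(2)` never call the probe, which matches the "no fallback" wording in the doc comment.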
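
The effect of `enable_multirange_fetching` on a single V2 fetch entry can be sketched as a request-planning step. The types and the `plan_requests` function below are hypothetical and only show which `Range` headers would be produced; they do not perform any HTTP calls or parallel execution. The sketch assumes the byte ranges use inclusive end offsets, as HTTP `Range` headers do.

```rust
/// Hypothetical byte range; `end` is assumed to be inclusive.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ByteRange {
    start: u64,
    end: u64,
}

/// Hypothetical description of one HTTP GET to issue.
#[derive(Debug, Clone, PartialEq, Eq)]
struct FetchRequest {
    url: String,
    range_header: String,
}

/// Turns one multi-range fetch entry into the requests to send.
///
/// When `enable_multirange` is true, a single request carries every range.
/// When false, each range becomes its own single-range request against the
/// same presigned URL, so a caller could run them in parallel.
fn plan_requests(url: &str, ranges: &[ByteRange], enable_multirange: bool) -> Vec<FetchRequest> {
    if ranges.is_empty() {
        return Vec::new();
    }
    if enable_multirange {
        // One request with a comma-separated range set: bytes=a-b,c-d
        let spec = ranges
            .iter()
            .map(|r| format!("{}-{}", r.start, r.end))
            .collect::<Vec<_>>()
            .join(",");
        vec![FetchRequest {
            url: url.to_string(),
            range_header: format!("bytes={spec}"),
        }]
    } else {
        // One request per range, all sharing the same URL.
        ranges
            .iter()
            .map(|r| FetchRequest {
                url: url.to_string(),
                range_header: format!("bytes={}-{}", r.start, r.end),
            })
            .collect()
    }
}
```

For the two ranges `0-1023` and `2048-3071`, the default (`false`) yields two requests with `bytes=0-1023` and `bytes=2048-3071`, while `true` yields one request with `bytes=0-1023,2048-3071`.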