
Xet-Core

Purpose

Xet-core is the repo for the Rust code that runs on the client machine: the chunking and deduplication implementation, network communication with backend services (e.g., CAS), and local disk caching. Some crates in this repo are shared with xetcas.

Included Crates

  • cas_client: communication with CAS backend services, which include APIs for Xorbs and Shards.
  • cas_object: CAS object (Xorb) format and associated APIs, including chunks (ranges within Xorbs).
  • cas_types: common types shared across crates in xet-core and xetcas.
  • chunk_cache: local disk cache of Xorb chunks.
  • chunk_cache_bench: benchmarking crate for chunk_cache.
  • data: main driver for client operations - FilePointerTranslator drives hydrating or shrinking files.
  • error_printer: utility for printing errors conveniently.
  • file_utils: SafeFileCreator utility, used by chunk_cache.
  • hf_xet: Python integration with the Rust code; uses maturin to build the hfxet Python package. This is the main integration point with the HF Hub Python package.
  • mdb_shard: Shard operations, including Shard format, dedupe probing, benchmarks, and utilities.
  • merkledb: Chunking + deduplication implementation. Scanning files, building chunks, organizing them, etc.
  • merklehash: DataHash type, 256-bit hash, widely used across many crates.
  • parutils: parallel execution utilities built on Tokio (e.g., parallel foreach).
  • progress_reporting: offers ReportedWriter so progress for Writer operations can be displayed.
  • utils: general utilities; it is unclear how much of this crate is currently in use.
  • xet_error: Error utility crate, widely used for anyhow! logging in other crates.
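
To make the chunking and deduplication idea behind crates such as merkledb, merklehash, and chunk_cache more concrete, here is a minimal, dependency-free sketch. It is not the xet-core implementation: the names `ChunkHash`, `hash_chunk`, and `ChunkStore` are hypothetical, the chunks are fixed-size purely for simplicity, and a 64-bit std hash stands in for the 256-bit DataHash. Hash collisions are ignored.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a content hash. xet-core's DataHash is 256-bit;
/// this sketch uses std's 64-bit DefaultHasher only to avoid dependencies.
type ChunkHash = u64;

/// Hashes the bytes of one chunk.
fn hash_chunk(bytes: &[u8]) -> ChunkHash {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical chunk store: each distinct chunk is kept once, keyed by hash.
#[derive(Default)]
struct ChunkStore {
    chunks: HashMap<ChunkHash, Vec<u8>>,
}

impl ChunkStore {
    /// Splits `data` into fixed-size chunks, stores the ones not seen before,
    /// and returns the ordered list of chunk hashes needed to rebuild `data`.
    /// `chunk_size` must be non-zero (slice::chunks panics on zero).
    fn add_file(&mut self, data: &[u8], chunk_size: usize) -> Vec<ChunkHash> {
        data.chunks(chunk_size)
            .map(|chunk| {
                let h = hash_chunk(chunk);
                // Only the first occurrence of a chunk is copied into the store.
                self.chunks.entry(h).or_insert_with(|| chunk.to_vec());
                h
            })
            .collect()
    }

    /// Rebuilds a file from its list of chunk hashes.
    /// Returns None if any referenced chunk is missing from the store.
    fn rebuild(&self, recipe: &[ChunkHash]) -> Option<Vec<u8>> {
        let mut out = Vec::new();
        for h in recipe {
            out.extend_from_slice(self.chunks.get(h)?);
        }
        Some(out)
    }
}
```

In this sketch, adding two 16-byte files that differ in only one 4-byte chunk stores five chunks rather than eight, and either file can be rebuilt from its list of hashes. By loose analogy only, the hash list plays the role of a small stand-in for the file contents; the actual formats and APIs live in the crates listed above.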

Local Development

To build xet-core, check the GitHub Actions CI workflow for the Rust toolchain version to install, then follow the Rust documentation to install rustup and that toolchain version. Use the steps below to build, test, and benchmark.

Many of us on the team use VSCode, so some settings are checked in under the .vscode directory. If you use VSCode, install the rust-analyzer extension.

Build:

cargo build

Test:

cargo test

Benchmark:

cargo bench

Building Python package and running locally:

  1. Create a Python 3 virtualenv: python3 -m venv ~/venv
  2. Activate the virtualenv: source ~/venv/bin/activate
  3. Install maturin and ipython: pip3 install maturin ipython
  4. Go to hf_xet crate: cd hf_xet
  5. Build: maturin develop
  6. Test:
ipython
import hfxet
hfxet.upload_files()
hfxet.download_files()

Testing

Unit tests are run with cargo test; benchmarks are run with cargo bench. Some crates also have a main.rs that can be run for manual testing.

References

Historical Design Documents

Repo History

This repo is a trimmed version of xetdata/xet-core, which contains deep git integration along with a very different backend services implementation.
