Usage
=====

Quick Start
~~~~~~~~~~~

Embed sequences with language model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Sequences should be in ``.fasta`` format.

.. code-block:: bash

    dscript embed --seqs [sequences] --outfile [embedding file]

Predict a new network using a trained model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Pre-trained models can be downloaded from `here <https://d-script.readthedocs.io/en/main/data.html#trained-models>`_.

To predict between all pairs of proteins, list the protein names one per line with no header.
Alternatively, candidate pairs should be in tab-separated (``.tsv``) format with no header, and columns for [protein name 1], [protein name 2].
For a list of pairs, additional columns (for example, a [label] in training or test data files) can exist but are ignored.

.. code-block:: bash

    dscript predict --proteins [list of proteins] --embeddings [embedding file] --outfile [outfile] --model [model file]
    dscript predict --pairs [list of pairs] --embeddings [embedding file] --outfile [outfile] --model [model file]

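
The two input formats can be sketched with plain Python. This is only an illustration: the protein names are hypothetical, and treating "all pairs" as unordered pairs without self-pairs is an assumption, not a statement of how ``dscript predict`` enumerates them.

.. code-block:: python

    from itertools import combinations


    def format_protein_list(names):
        """Return one protein name per line, with no header."""
        return "".join(f"{name}\n" for name in names)


    def format_pairs(pairs):
        """Return tab-separated pairs, one per line, with no header."""
        return "".join(f"{first}\t{second}\n" for first, second in pairs)


    # Hypothetical protein names, for illustration only.
    proteins = ["protA", "protB", "protC"]

    # Assumption: "all pairs" means unordered pairs without self-pairs.
    all_pairs = list(combinations(proteins, 2))

    protein_list_text = format_protein_list(proteins)  # contents for --proteins
    pairs_text = format_pairs(all_pairs)  # contents for --pairs

Writing ``protein_list_text`` or ``pairs_text`` to a file gives an input for ``--proteins`` or ``--pairs`` respectively.
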
Train and save a model
^^^^^^^^^^^^^^^^^^^^^^

Training and validation data should be in tab-separated (``.tsv``) format with no header, and columns for [protein name 1], [protein name 2], [label].

.. code-block:: bash

    dscript train --train [training data] --val [validation data] --embedding [embedding file] --save-prefix [prefix]

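
A quick way to sanity-check a training or validation file before a long run is to parse it by hand. The sketch below is not part of D-SCRIPT; the function name is hypothetical, and it assumes the label column is numeric (for example ``0`` or ``1``).

.. code-block:: python

    def parse_labeled_pairs(text):
        """Parse header-less TSV lines of (name 1, name 2, label)."""
        records = []
        for line_number, line in enumerate(text.splitlines(), start=1):
            if not line.strip():
                continue  # tolerate blank lines
            fields = line.split("\t")
            if len(fields) < 3:
                raise ValueError(
                    f"line {line_number}: expected 3 columns, got {len(fields)}"
                )
            name_1, name_2, label = fields[:3]
            # Assumption: labels are numeric; float() raises ValueError otherwise.
            records.append((name_1, name_2, float(label)))
        return records


    # Hypothetical file contents, for illustration only.
    example = "protA\tprotB\t1\nprotA\tprotC\t0\n"
    records = parse_labeled_pairs(example)
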
Evaluate a trained model
^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

    dscript evaluate --model [model file] --test [test data] --embeddings [embedding file] --outfile [result file]

Blocked, Multi-GPU Prediction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

    usage: dscript predict [-h] [--proteins PROTEINS] [--pairs PAIRS] [--model MODEL] --embeddings EMBEDDINGS [--foldseek_fasta FOLDSEEK_FASTA] [-o OUTFILE] [-d DEVICE]
                           [--store_cmaps] [--thresh THRESH] [--load_proc LOAD_PROC] [--blocks BLOCKS] [--sparse_loading]

    Make new predictions with a pre-trained model using blocked, multi-GPU pairwise inference. One of --proteins and --pairs is required.

    options:
      -h, --help            show this help message and exit
      --proteins PROTEINS   File with protein IDs for which to predict all pairs, one per line; specify one of proteins or pairs
      --pairs PAIRS         File with candidate protein pairs to predict, one pair per line; specify one of proteins or pairs
      --model MODEL         Pretrained Model. If this is a `.sav` or `.pt` file, it will be loaded. Otherwise, we will try to load `[model]` from HuggingFace hub
                            [default: samsl/topsy_turvy_human_v1]
      --embeddings EMBEDDINGS
                            h5 file with (a superset of) pre-embedded sequences. Generate with dscript embed.
      --foldseek_fasta FOLDSEEK_FASTA
                            3di sequences in .fasta format. Can be generated using `dscript extract-3di`. Default is None. If provided, TT3D will be run, otherwise default
                            D-SCRIPT/TT will be run.
      -o OUTFILE, --outfile OUTFILE
                            File for predictions
      -d DEVICE, --device DEVICE
                            Compute device to use. Options: 'cpu', 'all' (all GPUs), or GPU index (0, 1, 2, etc.). To use specific GPUs, set CUDA_VISIBLE_DEVICES
                            beforehand and use 'all'. [default: all]
      --store_cmaps         Store contact maps for predicted pairs above `--thresh` in an h5 file
      --thresh THRESH       Positive prediction threshold - used to store contact maps and predictions in a separate file. [default: 0.5]
      --load_proc LOAD_PROC
                            Number of processes to use when loading embeddings (-1 = # of available CPUs, default=16). Because loading is IO-bound, values larger than the
                            # of CPUs are allowed.
      --blocks BLOCKS       Number of equal-sized blocks to split proteins into. In the multi-block case, maximum (embedding) memory usage should be 3 blocks' worth. When
                            multiple GPUs are used, memory usage may briefly be higher when different GPUs are working on tasks from different blocks. Also, small blocks
                            may lead to occasional brief hangs with multiple GPUs. Default 1.
      --sparse_loading      Load only the proteins required from each block, but do not reuse loaded blocks in memory. Recommended when predicting with many blocks on
                            sparse pairs, such that many pairs of blocks might contain no pairs of proteins of interest. Only available when blocks > 1 and pairs
                            specified. Maximum (embedding) memory usage with this option is 4 blocks' worth.

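
To make the ``--blocks`` and ``--sparse_loading`` descriptions more concrete, the sketch below splits a protein list into near-equal blocks and finds which block pairs contain at least one candidate pair. It is a simplified model under stated assumptions (contiguous blocks, hypothetical function names and protein IDs) and does not reproduce D-SCRIPT's internal scheduling.

.. code-block:: python

    def split_into_blocks(proteins, n_blocks):
        """Split a list into contiguous blocks whose sizes differ by at most one."""
        base, extra = divmod(len(proteins), n_blocks)
        blocks, start = [], 0
        for index in range(n_blocks):
            size = base + (1 if index < extra else 0)
            blocks.append(proteins[start:start + size])
            start += size
        return blocks


    def block_pairs_with_work(blocks, pairs):
        """Return the (i, j) block index pairs, i <= j, that hold a candidate pair."""
        block_of = {name: i for i, block in enumerate(blocks) for name in block}
        needed = set()
        for first, second in pairs:
            i, j = sorted((block_of[first], block_of[second]))
            needed.add((i, j))
        return needed


    # Hypothetical protein IDs and candidate pairs, for illustration only.
    proteins = [f"p{i}" for i in range(6)]
    blocks = split_into_blocks(proteins, 3)
    needed = block_pairs_with_work(blocks, [("p0", "p1"), ("p0", "p5")])

    # Number of unordered block pairs, including a block paired with itself.
    total_block_pairs = len(blocks) * (len(blocks) + 1) // 2

In this toy case only two of the six block pairs contain a candidate pair, which is the kind of sparsity the ``--sparse_loading`` help text describes.
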
Bipartite Prediction
~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

    usage: dscript predict_bipartite [-h] --protA PROTA --protB PROTB [--model MODEL] --embedA EMBEDA [--embedB EMBEDB] [--foldseekA FOLDSEEKA] [--foldseekB FOLDSEEKB] [-o OUTFILE] [-d DEVICE] [--store_cmaps] [--thresh THRESH] [--load_proc LOAD_PROC] [--blocksA BLOCKSA]
                                     [--blocksB BLOCKSB]

    Make new predictions between two protein sets using blocked, multi-GPU pairwise inference with a pre-trained model.

    options:
      -h, --help            show this help message and exit
      --protA PROTA         A text file with protein IDs, one on each line. All pairs between proteins in this file and proteins in protB will be predicted
      --protB PROTB         A text file with protein IDs, one on each line. All pairs between proteins in protA and proteins in this file will be predicted
      --model MODEL         Pretrained Model. If this is a `.sav` or `.pt` file, it will be loaded. Otherwise, we will try to load `[model]` from HuggingFace hub [default: samsl/topsy_turvy_human_v1]
      --embedA EMBEDA       h5 file with (a superset of) pre-embedded sequences from the file protA. Generate with dscript embed. If a single file contains embeddings for both protA and protB, specify it as embedA.
      --embedB EMBEDB       h5 file with (a superset of) pre-embedded sequences from the file protB. Generate with dscript embed.
      --foldseekA FOLDSEEKA
                            3di sequences in .fasta format for proteins in protA. Can be generated using `dscript extract-3di`. Default is None. If provided, TT3D will be run, otherwise default D-SCRIPT/TT will be run. If a single file contains 3di sequences for both protA and protB,
                            specify it as foldseekA.
      --foldseekB FOLDSEEKB
                            3di sequences in .fasta format for proteins in protB. Can be generated using `dscript extract-3di`. Default is None. If provided, TT3D will be run, otherwise default D-SCRIPT/TT will be run.
      -o OUTFILE, --outfile OUTFILE
                            File for predictions
      -d DEVICE, --device DEVICE
                            Compute device to use. Options: 'cpu', 'all' (all GPUs), or GPU index (0, 1, 2, etc.). To use specific GPUs, set CUDA_VISIBLE_DEVICES
                            beforehand and use 'all'. [default: all]
      --store_cmaps         Store contact maps for predicted pairs above `--thresh` in an h5 file
      --thresh THRESH       Positive prediction threshold - used to store contact maps and predictions in a separate file. [default: 0.5]
      --load_proc LOAD_PROC
                            Number of processes to use when loading embeddings (-1 = # of available CPUs, default=16). Because loading is IO-bound, values larger than the # of CPUs are allowed.
      --blocksA BLOCKSA     Number of equal-sized blocks to split proteins in protA into. If one set is much smaller, it is recommended to set the corresponding # of blocks to 1. Default 1.
      --blocksB BLOCKSB     Number of equal-sized blocks to split proteins in protB into. Default 1.

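
The bipartite inputs can be pictured as a cross product of the two protein sets, with ``embedA`` doing double duty when ``embedB`` is omitted. The following sketch only illustrates that reading of the help text; the function, protein IDs, and file name are hypothetical.

.. code-block:: python

    from itertools import product


    def plan_bipartite(prot_a, prot_b, embed_a, embed_b=None):
        """Return every (A, B) pair and the embedding file used for each set."""
        # When no second file is given, assume embed_a holds both sets.
        embed_b = embed_a if embed_b is None else embed_b
        pairs = list(product(prot_a, prot_b))
        return pairs, {"A": embed_a, "B": embed_b}


    # Hypothetical protein IDs and file name, for illustration only.
    pairs, sources = plan_bipartite(["a1", "a2"], ["b1", "b2", "b3"], "all.h5")

Here ``pairs`` has ``2 * 3 = 6`` entries, and both protein sets read from the same embedding file.
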
Serial Prediction
~~~~~~~~~~~~~~~~~

.. code-block:: bash

    usage: dscript predict_serial [-h] --pairs PAIRS [--model MODEL] [--seqs SEQS] [--embeddings EMBEDDINGS] [--foldseek_fasta FOLDSEEK_FASTA] [-o OUTFILE] [-d DEVICE]
                                  [--store_cmaps] [--thresh THRESH] [--load_proc LOAD_PROC]

    Make new predictions with a pre-trained model using legacy (serial) inference. One of --seqs or --embeddings is required.

    options:
      -h, --help            show this help message and exit
      --pairs PAIRS         Candidate protein pairs to predict
      --model MODEL         Pretrained Model. If this is a `.sav` or `.pt` file, it will be loaded. Otherwise, we will try to load `[model]` from HuggingFace hub [default:
                            samsl/topsy_turvy_human_v1]
      --seqs SEQS           Protein sequences in .fasta format
      --embeddings EMBEDDINGS
                            h5 file with embedded sequences
      --foldseek_fasta FOLDSEEK_FASTA
                            3di sequences in .fasta format. Can be generated using `dscript extract-3di`. Default is None. If provided, TT3D will be run, otherwise default
                            D-SCRIPT/TT will be run.
      -o OUTFILE, --outfile OUTFILE
                            File for predictions
      -d DEVICE, --device DEVICE
                            Compute device to use. Options: 'cpu' or GPU index (0, 1, 2, etc.).
      --store_cmaps         Store contact maps for predicted pairs above `--thresh` in an h5 file
      --thresh THRESH       Positive prediction threshold - used to store contact maps and predictions in a separate file. [default: 0.5]
      --load_proc LOAD_PROC
                            Number of processes to use when loading embeddings (-1 = # of CPUs, default=32)

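
The ``--model`` help text describes two cases: a local ``.sav`` or ``.pt`` file, or a HuggingFace hub identifier. A minimal sketch of that decision is shown below. The function name is hypothetical, and deciding purely by file suffix is an assumption; D-SCRIPT's own check may differ.

.. code-block:: python

    from pathlib import Path

    # Default value quoted in the help text above.
    DEFAULT_MODEL = "samsl/topsy_turvy_human_v1"


    def classify_model_argument(model=DEFAULT_MODEL):
        """Return ("file", model) for .sav/.pt paths, otherwise ("hub", model)."""
        # Assumption: the decision depends only on the file suffix.
        if Path(model).suffix in {".sav", ".pt"}:
            return ("file", model)
        return ("hub", model)


    # Hypothetical local path, for illustration only.
    local = classify_model_argument("models/my_model.sav")
    default = classify_model_argument()
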
Embedding
~~~~~~~~~

.. code-block:: bash

    usage: dscript embed [-h] --seqs SEQS -o OUTFILE [-d DEVICE]

    Generate new embeddings using pre-trained language model

    optional arguments:
      -h, --help            show this help message and exit
      --seqs SEQS           Sequences to be embedded
      -o, --outfile OUTFILE h5 file to write results
      -d DEVICE, --device DEVICE
                            Compute device to use. Options: 'cpu' or GPU index (0, 1, 2, etc.).

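
The accepted ``--device`` values for this command (``'cpu'`` or a GPU index) can be illustrated with a small validator. This is a hypothetical helper written for this page, not D-SCRIPT's own device parsing.

.. code-block:: python

    def interpret_device(value):
        """Interpret a --device string as ("cpu", None) or ("gpu", index)."""
        text = value.strip().lower()
        if text == "cpu":
            return ("cpu", None)
        if text.isdecimal():
            return ("gpu", int(text))
        raise ValueError(f"unrecognized device: {value!r}")


    cpu_choice = interpret_device("cpu")
    gpu_choice = interpret_device("1")
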
Training
~~~~~~~~

.. code-block:: bash

    usage: dscript train [-h] --train TRAIN --test TEST --embedding EMBEDDING
                         [--no-augment] [--input-dim INPUT_DIM]
                         [--projection-dim PROJECTION_DIM] [--dropout-p DROPOUT_P]
                         [--hidden-dim HIDDEN_DIM] [--kernel-width KERNEL_WIDTH]
                         [--no-w] [--no-sigmoid] [--do-pool]
                         [--pool-width POOL_WIDTH] [--num-epochs NUM_EPOCHS]
                         [--batch-size BATCH_SIZE] [--weight-decay WEIGHT_DECAY]
                         [--lr LR] [--lambda INTERACTION_WEIGHT] [--topsy-turvy]
                         [--glider-weight GLIDER_WEIGHT]
                         [--glider-thresh GLIDER_THRESH] [-o OUTFILE]
                         [--save-prefix SAVE_PREFIX] [-d DEVICE]
                         [--checkpoint CHECKPOINT]

    Train a new model.

    optional arguments:
      -h, --help            show this help message and exit

    Data:
      --train TRAIN         list of training pairs
      --test TEST           list of validation/testing pairs
      --embedding EMBEDDING
                            h5py path containing embedded sequences
      --no-augment          data is automatically augmented by adding (B A) for
                            all pairs (A B). Set this flag to not augment data

    Projection Module:
      --input-dim INPUT_DIM
                            dimension of input language model embedding (per amino
                            acid) (default: 6165)
      --projection-dim PROJECTION_DIM
                            dimension of embedding projection layer (default: 100)
      --dropout-p DROPOUT_P
                            parameter p for embedding dropout layer (default: 0.5)

    Contact Module:
      --hidden-dim HIDDEN_DIM
                            number of hidden units for comparison layer in contact
                            prediction (default: 50)
      --kernel-width KERNEL_WIDTH
                            width of convolutional filter for contact prediction
                            (default: 7)

    Interaction Module:
      --no-w                don't use weight matrix in interaction prediction
                            model
      --no-sigmoid          don't use sigmoid activation at end of interaction
                            model
      --do-pool             use max pool layer in interaction prediction model
      --pool-width POOL_WIDTH
                            size of max-pool in interaction model (default: 9)

    Training:
      --num-epochs NUM_EPOCHS
                            number of epochs (default: 10)
      --batch-size BATCH_SIZE
                            minibatch size (default: 25)
      --weight-decay WEIGHT_DECAY
                            L2 regularization (default: 0)
      --lr LR               learning rate (default: 0.001)
      --lambda INTERACTION_WEIGHT
                            weight on the similarity objective (default: 0.35)
      --topsy-turvy         run in Topsy-Turvy mode -- use top-down GLIDER scoring
                            to guide training (reference TBD)
      --glider-weight GLIDER_WEIGHT
                            weight on the GLIDER accuracy objective (default: 0.2)
      --glider-thresh GLIDER_THRESH
                            proportion of GLIDER scores treated as positive edges
                            (0 < gt < 1) (default: 0.925)

    Output and Device:
      -o OUTPUT, --output OUTPUT
                            output file path (default: stdout)
      --save-prefix SAVE_PREFIX
                            path prefix for saving models
      -d DEVICE, --device DEVICE
                            compute device to use
      --checkpoint CHECKPOINT
                            checkpoint model to start training from

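
The default augmentation described under ``--no-augment`` (adding (B A) for every pair (A B)) can be sketched as follows. This is a simplified illustration with hypothetical names; how D-SCRIPT orders the augmented pairs or handles duplicates is not specified here.

.. code-block:: python

    def augment_pairs(records):
        """Append (B, A, label) for every (A, B, label) record."""
        swapped = [(second, first, label) for first, second, label in records]
        return list(records) + swapped


    # Hypothetical labeled pairs, for illustration only.
    training = [("p1", "p2", 1), ("p1", "p3", 0)]
    augmented = augment_pairs(training)

Under this reading, the augmented set is twice the size of the input, and each swapped pair keeps the label of the original.
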
Evaluation
~~~~~~~~~~

.. code-block:: bash

    usage: dscript eval [-h] --model MODEL --test TEST --embedding EMBEDDING
                        [-o OUTFILE] [-d DEVICE]

    Evaluate a trained model

    optional arguments:
      -h, --help            show this help message and exit
      --model MODEL         Trained prediction model
      --test TEST           Test Data
      --embedding EMBEDDING
                            h5 file with embedded sequences
      -o OUTFILE, --outfile OUTFILE
                            Output file to write results
      -d DEVICE, --device DEVICE
                            Compute device to use