Usage
=====

Quick Start
~~~~~~~~~~~

Embed sequences with language model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Sequences should be in ``.fasta`` format.

.. code-block:: bash

    dscript embed --seqs [sequences] --outfile [embedding file]

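As a minimal sketch of the expected input, a two-record FASTA file can be written and sanity-checked before embedding. The protein names ``P1``/``P2``, the sequences, and the file names are made up for illustration.

.. code-block:: bash

    # Write a tiny FASTA file; names and sequences are illustrative only.
    printf '>P1\nMKTAYIAKQRQISFVKSHFSRQ\n>P2\nMSDNELLKRAEELGIDLSGK\n' > example_seqs.fasta

    # Count the records (each header line starts with '>').
    n_records=$(grep -c '^>' example_seqs.fasta)
    echo "records: ${n_records}"

    # The embedding step shown above would then be (not run here):
    # dscript embed --seqs example_seqs.fasta --outfile example_embeddings.h5
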
Predict a new network using a trained model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Pre-trained models can be downloaded from `here <https://d-script.readthedocs.io/en/main/data.html#trained-models>`_.
To predict all pairs among a set of proteins, list the protein names one per line with no header.
Alternatively, provide candidate pairs in tab-separated (``.tsv``) format with no header and columns for [protein name 1], [protein name 2].
Additional columns in a pairs file (for example, a [label] column in training or test data files) may be present but are ignored.

.. code-block:: bash

    dscript predict --proteins [list of proteins] --embeddings [embedding file] --outfile [outfile] --model [model file]
    dscript predict --pairs [list of pairs] --embeddings [embedding file] --outfile [outfile] --model [model file]

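The two input formats described above can be sketched with standard shell tools. The protein names and file names here are hypothetical. Note that a protein list derived this way requests *all* pairs among the listed proteins, not only the original candidates.

.. code-block:: bash

    # Hypothetical candidate pairs; the third column is an optional label.
    printf 'P1\tP2\t1\nP1\tP3\t0\nP2\tP3\t0\n' > example_pairs.tsv

    # Check that every row has at least the two required columns.
    bad_rows=$(awk -F'\t' 'NF < 2' example_pairs.tsv | wc -l | tr -d ' ')
    echo "rows with fewer than two columns: ${bad_rows}"

    # Derive a one-name-per-line protein list from the first two columns.
    cut -f1,2 example_pairs.tsv | tr '\t' '\n' | sort -u > example_proteins.txt
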
Train and save a model
^^^^^^^^^^^^^^^^^^^^^^

Training and validation data should be in tab-separated (``.tsv``) format with no header, and columns for [protein name 1], [protein name 2], [label].

.. code-block:: bash

    dscript train --train [training data] --val [validation data] --embedding [embedding file] --save-prefix [prefix]

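Before training, it can help to confirm that a data file really has three tab-separated columns and to see how the labels are distributed. The sketch below uses a made-up file and assumes labels are written as ``0``/``1``, which this page does not specify.

.. code-block:: bash

    # Hypothetical training file: [protein name 1], [protein name 2], [label].
    printf 'P1\tP2\t1\nP1\tP3\t0\nP2\tP3\t0\n' > example_train.tsv

    # Count rows that do not have exactly three tab-separated fields.
    malformed=$(awk -F'\t' 'NF != 3 { n++ } END { print n + 0 }' example_train.tsv)
    echo "malformed rows: ${malformed}"

    # Tally how many rows carry each label value (printed as label=count).
    label_counts=$(cut -f3 example_train.tsv | sort | uniq -c | awk '{ printf "%s=%s ", $2, $1 }')
    echo "label counts: ${label_counts}"
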
Evaluate a trained model
^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

    dscript evaluate --model [model file] --test [test data] --embeddings [embedding file] --outfile [result file]

Blocked, Multi-GPU Prediction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

    usage: dscript predict [-h] [--proteins PROTEINS] [--pairs PAIRS] [--model MODEL] --embeddings EMBEDDINGS [--foldseek_fasta FOLDSEEK_FASTA] [-o OUTFILE] [-d DEVICE]
                           [--store_cmaps] [--thresh THRESH] [--load_proc LOAD_PROC] [--blocks BLOCKS] [--sparse_loading]

    Make new predictions with a pre-trained model using blocked, multi-GPU pairwise inference. One of --proteins and --pairs is required.

    options:
      -h, --help            show this help message and exit
      --proteins PROTEINS   File with protein IDs for which to predict all pairs, one per line; specify one of proteins or pairs
      --pairs PAIRS         File with candidate protein pairs to predict, one pair per line; specify one of proteins or pairs
      --model MODEL         Pretrained Model. If this is a `.sav` or `.pt` file, it will be loaded. Otherwise, we will try to load `[model]` from HuggingFace hub
                            [default: samsl/topsy_turvy_human_v1]
      --embeddings EMBEDDINGS
                            h5 file with (a superset of) pre-embedded sequences. Generate with dscript embed.
      --foldseek_fasta FOLDSEEK_FASTA
                            3di sequences in .fasta format. Can be generated using `dscript extract-3di`. Default is None. If provided, TT3D will be run, otherwise default
                            D-SCRIPT/TT will be run.
      -o OUTFILE, --outfile OUTFILE
                            File for predictions
      -d DEVICE, --device DEVICE
                            Compute device to use. Options: 'cpu', 'all' (all GPUs), or GPU index (0, 1, 2, etc.). To use specific GPUs, set CUDA_VISIBLE_DEVICES
                            beforehand and use 'all'. [default: all]
      --store_cmaps         Store contact maps for predicted pairs above `--thresh` in an h5 file
      --thresh THRESH       Positive prediction threshold - used to store contact maps and predictions in a separate file. [default: 0.5]
      --load_proc LOAD_PROC
                            Number of processes to use when loading embeddings (-1 = # of available CPUs, default=16). Because loading is IO-bound, values larger than the
                            # of CPUs are allowed.
      --blocks BLOCKS       Number of equal-sized blocks to split proteins into. In the multi-block case, maximum (embedding) memory usage should be 3 blocks' worth. When
                            multiple GPUs are used, memory usage may briefly be higher when different GPUs are working on tasks from different blocks. And, small blocks
                            may lead to occasional brief hangs with multiple GPUs. Default 1.
      --sparse_loading      Load only the proteins required from each block, but do not reuse loaded blocks in memory. Recommended when predicting with many blocks on
                            sparse pairs, such that many pairs of blocks might contain no pairs of proteins of interest. Only available when blocks > 1 and pairs
                            specified. Maximum (embedding) memory usage with this option is 4 blocks' worth.

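The memory notes for ``--blocks`` and ``--sparse_loading`` can be turned into a rough back-of-the-envelope estimate. This is only a sketch: the protein count is hypothetical, and it assumes equal-sized blocks are formed by ceiling division, which may not match the actual partitioning.

.. code-block:: bash

    n_proteins=10000   # hypothetical number of proteins to predict on
    n_blocks=8         # value that would be passed to --blocks

    # Ceiling division: proteins per block, assuming equal-sized blocks.
    per_block=$(( (n_proteins + n_blocks - 1) / n_blocks ))

    # Per the help text: up to 3 blocks' worth of embeddings are held in
    # memory, or 4 blocks' worth when --sparse_loading is used.
    max_resident=$(( 3 * per_block ))
    max_resident_sparse=$(( 4 * per_block ))

    echo "proteins per block:               ${per_block}"
    echo "max resident embeddings:          ${max_resident}"
    echo "max resident with sparse loading: ${max_resident_sparse}"

    # Restricting the run to particular GPUs, as the --device help describes
    # (not run here; the bracketed arguments are placeholders):
    # CUDA_VISIBLE_DEVICES=0,1 dscript predict --proteins [list of proteins] --embeddings [embedding file] --blocks 8 --device all

Multiplying the resident protein count by the average per-protein embedding size in the h5 file gives an approximate upper bound on embedding memory.
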
Bipartite Prediction
~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

    usage: dscript predict_bipartite [-h] --protA PROTA --protB PROTB [--model MODEL] --embedA EMBEDA [--embedB EMBEDB] [--foldseekA FOLDSEEKA] [--foldseekB FOLDSEEKB] [-o OUTFILE] [-d DEVICE] [--store_cmaps] [--thresh THRESH] [--load_proc LOAD_PROC] [--blocksA BLOCKSA]
                                     [--blocksB BLOCKSB]

    Make new predictions between two protein sets using blocked, multi-GPU pairwise inference with a pre-trained model.

    options:
      -h, --help            show this help message and exit
      --protA PROTA         A text file with protein IDs, one on each line. All pairs between proteins in this file and proteins in protB will be predicted
      --protB PROTB         A text file with protein IDs, one on each line. All pairs between proteins in protA and proteins in this file will be predicted
      --model MODEL         Pretrained Model. If this is a `.sav` or `.pt` file, it will be loaded. Otherwise, we will try to load `[model]` from HuggingFace hub [default: samsl/topsy_turvy_human_v1]
      --embedA EMBEDA       h5 file with (a superset of) pre-embedded sequences from the file protA. Generate with dscript embed. If a single file contains embeddings for both protA and protB, specify it as embedA.
      --embedB EMBEDB       h5 file with (a superset of) pre-embedded sequences from the file protB. Generate with dscript embed.
      --foldseekA FOLDSEEKA
                            3di sequences in .fasta format for proteins in protA. Can be generated using `dscript extract-3di`. Default is None. If provided, TT3D will be run, otherwise default D-SCRIPT/TT will be run. If a single file contains 3di sequences for both protA and protB,
                            specify it as foldseekA.
      --foldseekB FOLDSEEKB
                            3di sequences in .fasta format for proteins in protB. Can be generated using `dscript extract-3di`. Default is None. If provided, TT3D will be run, otherwise default D-SCRIPT/TT will be run.
      -o OUTFILE, --outfile OUTFILE
                            File for predictions
      -d DEVICE, --device DEVICE
                            Compute device to use. Options: 'cpu', 'all' (all GPUs), or GPU index (0, 1, 2, etc.). To use specific GPUs, set CUDA_VISIBLE_DEVICES
                            beforehand and use 'all'. [default: all]
      --store_cmaps         Store contact maps for predicted pairs above `--thresh` in an h5 file
      --thresh THRESH       Positive prediction threshold - used to store contact maps and predictions in a separate file. [default: 0.5]
      --load_proc LOAD_PROC
                            Number of processes to use when loading embeddings (-1 = # of available CPUs, default=16). Because loading is IO-bound, values larger than the # of CPUs are allowed.
      --blocksA BLOCKSA     Number of equal-sized blocks to split proteins in protA into. If one set is much smaller, it is recommended to set the corresponding # of blocks to 1. Default 1.
      --blocksB BLOCKSB     Number of equal-sized blocks to split proteins in protB into. Default 1.

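Because every protein in ``protA`` is paired with every protein in ``protB``, the number of predictions is the product of the two list lengths. A small sketch with made-up ID files shows how to estimate that before launching a run:

.. code-block:: bash

    # Hypothetical ID lists, one protein per line.
    printf 'A1\nA2\nA3\n' > example_protA.txt
    printf 'B1\nB2\n' > example_protB.txt

    # Count non-empty lines in each file.
    n_a=$(grep -c '.' example_protA.txt)
    n_b=$(grep -c '.' example_protB.txt)

    # All-vs-all between the two sets.
    n_pairs=$(( n_a * n_b ))
    echo "pairs to predict: ${n_pairs}"
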
Serial Prediction
~~~~~~~~~~~~~~~~~

.. code-block:: bash

    usage: dscript predict_serial [-h] --pairs PAIRS [--model MODEL] [--seqs SEQS] [--embeddings EMBEDDINGS] [--foldseek_fasta FOLDSEEK_FASTA] [-o OUTFILE] [-d DEVICE]
                                  [--store_cmaps] [--thresh THRESH] [--load_proc LOAD_PROC]

    Make new predictions with a pre-trained model using legacy (serial) inference. One of --seqs or --embeddings is required.

    options:
      -h, --help            show this help message and exit
      --pairs PAIRS         Candidate protein pairs to predict
      --model MODEL         Pretrained Model. If this is a `.sav` or `.pt` file, it will be loaded. Otherwise, we will try to load `[model]` from HuggingFace hub [default:
                            samsl/topsy_turvy_human_v1]
      --seqs SEQS           Protein sequences in .fasta format
      --embeddings EMBEDDINGS
                            h5 file with embedded sequences
      --foldseek_fasta FOLDSEEK_FASTA
                            3di sequences in .fasta format. Can be generated using `dscript extract-3di`. Default is None. If provided, TT3D will be run, otherwise default
                            D-SCRIPT/TT will be run.
      -o OUTFILE, --outfile OUTFILE
                            File for predictions
      -d DEVICE, --device DEVICE
                            Compute device to use. Options: 'cpu' or GPU index (0, 1, 2, etc.).
      --store_cmaps         Store contact maps for predicted pairs above `--thresh` in an h5 file
      --thresh THRESH       Positive prediction threshold - used to store contact maps and predictions in a separate file. [default: 0.5]
      --load_proc LOAD_PROC
                            Number of processes to use when loading embeddings (-1 = # of CPUs, default=32)

Embedding
~~~~~~~~~

.. code-block:: bash

    usage: dscript embed [-h] --seqs SEQS -o OUTFILE [-d DEVICE]

    Generate new embeddings using pre-trained language model

    optional arguments:
      -h, --help            show this help message and exit
      --seqs SEQS           Sequences to be embedded
      -o, --outfile OUTFILE h5 file to write results
      -d DEVICE, --device DEVICE
                            Compute device to use. Options: 'cpu' or GPU index (0, 1, 2, etc.).

Training
~~~~~~~~

.. code-block:: bash

    usage: dscript train [-h] --train TRAIN --test TEST --embedding EMBEDDING
                         [--no-augment] [--input-dim INPUT_DIM]
                         [--projection-dim PROJECTION_DIM] [--dropout-p DROPOUT_P]
                         [--hidden-dim HIDDEN_DIM] [--kernel-width KERNEL_WIDTH]
                         [--no-w] [--no-sigmoid] [--do-pool]
                         [--pool-width POOL_WIDTH] [--num-epochs NUM_EPOCHS]
                         [--batch-size BATCH_SIZE] [--weight-decay WEIGHT_DECAY]
                         [--lr LR] [--lambda INTERACTION_WEIGHT] [--topsy-turvy]
                         [--glider-weight GLIDER_WEIGHT]
                         [--glider-thresh GLIDER_THRESH] [-o OUTFILE]
                         [--save-prefix SAVE_PREFIX] [-d DEVICE]
                         [--checkpoint CHECKPOINT]

    Train a new model.

    optional arguments:
      -h, --help            show this help message and exit

    Data:
      --train TRAIN         list of training pairs
      --test TEST           list of validation/testing pairs
      --embedding EMBEDDING
                            h5py path containing embedded sequences
      --no-augment          data is automatically augmented by adding (B A) for
                            all pairs (A B). Set this flag to not augment data

    Projection Module:
      --input-dim INPUT_DIM
                            dimension of input language model embedding (per amino
                            acid) (default: 6165)
      --projection-dim PROJECTION_DIM
                            dimension of embedding projection layer (default: 100)
      --dropout-p DROPOUT_P
                            parameter p for embedding dropout layer (default: 0.5)

    Contact Module:
      --hidden-dim HIDDEN_DIM
                            number of hidden units for comparison layer in contact
                            prediction (default: 50)
      --kernel-width KERNEL_WIDTH
                            width of convolutional filter for contact prediction
                            (default: 7)

    Interaction Module:
      --no-w                don't use weight matrix in interaction prediction
                            model
      --no-sigmoid          don't use sigmoid activation at end of interaction
                            model
      --do-pool             use max pool layer in interaction prediction model
      --pool-width POOL_WIDTH
                            size of max-pool in interaction model (default: 9)

    Training:
      --num-epochs NUM_EPOCHS
                            number of epochs (default: 10)
      --batch-size BATCH_SIZE
                            minibatch size (default: 25)
      --weight-decay WEIGHT_DECAY
                            L2 regularization (default: 0)
      --lr LR               learning rate (default: 0.001)
      --lambda INTERACTION_WEIGHT
                            weight on the similarity objective (default: 0.35)
      --topsy-turvy         run in Topsy-Turvy mode -- use top-down GLIDER scoring
                            to guide training (reference TBD)
      --glider-weight GLIDER_WEIGHT
                            weight on the GLIDER accuracy objective (default: 0.2)
      --glider-thresh GLIDER_THRESH
                            proportion of GLIDER scores treated as positive edges
                            (0 < gt < 1) (default: 0.925)

    Output and Device:
      -o OUTPUT, --output OUTPUT
                            output file path (default: stdout)
      --save-prefix SAVE_PREFIX
                            path prefix for saving models
      -d DEVICE, --device DEVICE
                            compute device to use
      --checkpoint CHECKPOINT
                            checkpoint model to start training from

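The ``--no-augment`` help text says that, by default, a pair (B A) is added for every pair (A B). The following sketch reproduces that idea on a made-up labeled pairs file to show what the augmented list looks like. It illustrates the concept only and is not the code that ``dscript train`` runs internally.

.. code-block:: bash

    # Hypothetical labeled pairs: [protein name 1], [protein name 2], [label].
    printf 'P1\tP2\t1\nP3\tP4\t0\n' > example_labeled_pairs.tsv

    # Emit each row followed by the same row with the two proteins swapped.
    awk -F'\t' -v OFS='\t' '{ print $1, $2, $3; print $2, $1, $3 }' \
        example_labeled_pairs.tsv > example_augmented.tsv

    n_augmented=$(wc -l < example_augmented.tsv | tr -d ' ')
    echo "rows after augmentation: ${n_augmented}"

Since this augmentation is applied by default, a training file only needs each pair listed once.
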
Evaluation
~~~~~~~~~~

.. code-block:: bash

    usage: dscript eval [-h] --model MODEL --test TEST --embedding EMBEDDING
                        [-o OUTFILE] [-d DEVICE]

    Evaluate a trained model

    optional arguments:
      -h, --help            show this help message and exit
      --model MODEL         Trained prediction model
      --test TEST           Test Data
      --embedding EMBEDDING
                            h5 file with embedded sequences
      -o OUTFILE, --outfile OUTFILE
                            Output file to write results
      -d DEVICE, --device DEVICE
                            Compute device to use