mirror of https://github.com/samsledje/D-SCRIPT.git synced 2026-06-04 15:04:24 +08:00

Samuel Sledzieski 1bed6a048a Claude/expand test coverage (#91)

* Expand test coverage with comprehensive test suites

Add extensive test coverage for previously untested modules:

- test_utils.py: Comprehensive tests for utility functions (setup_logger, log, RBF,
  parse_device, load_hdf5_parallel, PairedDataset, collate_paired_sequences)

- test_glider.py: Complete test suite for graph-based link prediction module
  (get_dim, densify, compute_X_normalized, scoring functions, GLIDE algorithms)

- test_loading.py: Tests for parallel HDF5 data loading with LoadingPool,
  including edge cases, error handling, and integration tests

- test_language_model.py: Expanded from 2 to 13 test methods, adding coverage
  for lm_embed, embed_from_fasta with various edge cases and validations

These additions significantly improve test coverage for:
- dscript/utils.py (167 lines, previously untested)
- dscript/glider.py (346 lines, previously untested)
- dscript/loading.py (92 lines, previously untested)
- dscript/language_model.py (minimal coverage expanded)

Total: more than 200 new assertions across 4 test modules

* Add comprehensive tests for command modules and worker functions

Create four new test modules to expand coverage of previously untested code:

1. test_extract_3di.py (19 test methods, ~370 lines)
   - Tests for 3Di sequence extraction from PDB/CIF files
   - Argument parsing, file filtering, FASTA output validation
   - Integration tests for full workflow
   - Covers dscript/commands/extract_3di.py (~58 lines)

2. test_par_writer.py (24 test methods, ~400 lines)
   - Tests for parallel prediction writer process
   - TSV output writing, threshold filtering, contact map storage
   - HDF5 contact map dataset handling
   - Progress tracking and data type validation
   - Covers dscript/commands/par_writer.py (~40 lines)

3. test_main.py (24 test methods, ~320 lines)
   - Tests for CLI entry point and argument parsing
   - CitationAction class testing
   - All subcommand registration and invocation
   - Version and help flag handling
   - Integration tests for command dispatch
   - Covers dscript/__main__.py (~87 lines, increasing from ~85% to ~95%)

4. test_load_worker.py (23 test methods, ~330 lines)
   - Direct unit tests for HDF5 loading worker function
   - Queue handling, data type conversion, memory sharing
   - Error handling for corrupted/missing files
   - Multi-dimensional array support
   - Covers dscript/load_worker.py (~25 lines, previously only indirect coverage)

Total additions:
- ~1,420 lines of new test code
- 90+ test methods with comprehensive assertions
- ~210 lines of source code now directly tested
- Addresses high-priority gaps identified in coverage analysis

These tests complement the existing suite and focus on command-line
interface components and parallel processing infrastructure.

* Fix linting issues and apply code formatting

- Remove unused variables flagged by ruff
- Apply ruff formatting to all test files
- Ensure all pre-commit hooks pass

Changes:
- test_loading.py: Remove unused 'f' variable
- test_main.py: Remove unused 'fake_out' and 'output' variables
- test_utils.py: Remove unused 'log_file' variable and tmp_path param
- Applied ruff formatting to maintain code style consistency

* Fix test_load_worker.py hanging issue in CI

Rewrote test_load_worker.py to prevent CI hangs that occurred when
tests called the blocking worker function directly. The worker function
_hdf5_load_partial_func runs in an infinite loop waiting on a queue,
which caused tests to hang indefinitely.

Changes:
- Created run_worker_with_timeout() helper that wraps worker execution
  in a daemon thread with configurable timeout (default 5 seconds)
- Modified all tests to use this helper and assert successful completion
- Changed queue operations from blocking get() to non-blocking get_nowait()
- Reduced test count from 23 to 16 focused tests
- Added documentation noting worker is primarily tested via LoadingPool

This should resolve the CI timeout issue where tests hung at 43% completion.
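
The timeout pattern described above can be sketched roughly as follows. This is a minimal illustration and not the code in test_load_worker.py: `blocking_worker` is a hypothetical stand-in for the real worker, the signature of `run_worker_with_timeout` is assumed, and plain `queue.Queue` objects are used for simplicity.

```python
import queue
import threading


def blocking_worker(tasks, results):
    """Hypothetical stand-in: loops on a queue until it receives a None sentinel."""
    while True:
        item = tasks.get()  # blocks until an item is available
        if item is None:
            break
        results.put(item * 2)


def run_worker_with_timeout(target, args, timeout=5.0):
    """Run target(*args) in a daemon thread; return True if it finished in time."""
    thread = threading.Thread(target=target, args=args, daemon=True)
    thread.start()
    thread.join(timeout)
    return not thread.is_alive()


tasks, results = queue.Queue(), queue.Queue()
for item in (1, 2, 3, None):  # None tells the worker to exit
    tasks.put(item)

finished = run_worker_with_timeout(blocking_worker, (tasks, results))

# Drain results without blocking, so a missing result cannot hang the test.
collected = []
while True:
    try:
        collected.append(results.get_nowait())
    except queue.Empty:
        break
```

Because the thread is a daemon, a worker that never returns is abandoned at interpreter exit, and the test can fail on the returned flag instead of hanging.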

* Rewrite test_language_model.py to use mocks instead of real model

The original tests were calling the real language model which:
- Downloads/loads pretrained model weights (slow, can fail)
- Runs actual neural network inference (resource intensive)
- Causes test failures when model files aren't available

Changes:
- Rewrote unit tests to mock get_pretrained() function
- Mock model returns realistic tensor shapes but doesn't load weights
- Tests are now fast, reliable, and don't require model files
- Moved real model tests to TestLanguageModelIntegration class
- Marked integration tests with @pytest.mark.slow so they can be skipped
- Removed unnecessary loguru import that caused import errors
- Removed problematic setup.py install step from setup_class

This should fix the 4 failing tests reported by CI.
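
A minimal sketch of the mocking approach, under stated assumptions: the `lm` namespace below is a hypothetical stand-in for a module that exposes `get_pretrained()`, and the fake model returns nested lists in place of tensors so the example needs only the standard library. The embedding width of 4 is arbitrary.

```python
from types import SimpleNamespace
from unittest import mock


def _get_pretrained():
    # Stand-in for the slow path that would download or load model weights.
    raise RuntimeError("would load real model weights")


def _embed(sequence):
    # Looks up get_pretrained on the namespace at call time, so patching works.
    model = lm.get_pretrained()
    return model(sequence)


# Hypothetical module-like object; the real D-SCRIPT layout may differ.
lm = SimpleNamespace(get_pretrained=_get_pretrained, embed=_embed)


def fake_model(sequence):
    # One row per residue with a fixed width, mimicking an embedding's shape.
    return [[0.0] * 4 for _ in sequence]


with mock.patch.object(lm, "get_pretrained", return_value=fake_model) as patched:
    embedding = lm.embed("MKV")
```

The patch is undone when the `with` block exits, so the slow path is only replaced for the duration of the test.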

* fix failing tests

* Update .github/workflows/autorun-tests.yml

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update .github/workflows/autorun-tests.yml

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

2025-12-16 10:24:04 -05:00

D-SCRIPT

D-SCRIPT is a deep learning method for predicting a physical interaction between two proteins given just their sequences. It generalizes well to new species and is robust to limitations in training data size. Its design reflects the intuition that for two proteins to physically interact, a subset of amino acids from each protein should be in contact with the other. The intermediate stages of D-SCRIPT directly implement this intuition, with the penultimate stage in D-SCRIPT being a rough estimate of the inter-protein contact map of the protein dimer. This structurally-motivated design enhances the interpretability of the results and, since structure is more conserved evolutionarily than sequence, improves generalizability across species.

You can now make predictions with D-SCRIPT via the interface on HuggingFace!

Installation

pip install dscript

Usage

Protein sequences must first be embedded with the Bepler+Berger protein language model, which takes a .fasta file as input. In each FASTA header, everything before the first space is used as the protein key.

dscript embed --seqs [sequences] --outfile [embedding file]

#Example
dscript embed --seqs data/seqs/ecoli.fasta --outfile ecoli_embed.h5
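
To preview which keys a .fasta file will produce, the rule about the first space can be sketched in plain Python. This is an illustration of the stated rule only, not D-SCRIPT's own parser; the function name `fasta_keys` and the example headers are hypothetical.

```python
def fasta_keys(lines):
    """Yield the key of each FASTA record: the header text before the first space."""
    for line in lines:
        if line.startswith(">"):
            # Drop the ">" marker, then keep everything before the first space.
            yield line[1:].rstrip("\n").split(" ", 1)[0]


# Hypothetical records; a header without a space is used whole.
example = [
    ">prot_A first example protein\n",
    "MAIDENKQKAL\n",
    ">prot_B\n",
    "MRVLKFGGTSV\n",
]
keys = list(fasta_keys(example))
```

These keys are what the candidate pairs file has to reference, so checking them early can catch mismatched identifiers.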

Candidate pairs should be in tab-separated (.tsv) format with no header, with columns for [protein key 1] and [protein key 2]. Optionally, a third column with [label] can be provided, so that training or test data files can be used directly for prediction (the label does not affect the predictions; only the first two columns are read).
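
As a rough sketch of this layout, the snippet below writes a headerless, tab-separated pairs table with an optional label column and then reads back only the first two columns. The protein keys and labels are hypothetical, and the in-memory buffer stands in for a real .tsv file.

```python
import csv
import io

# Hypothetical candidate pairs; the third column is an optional label.
rows = [
    ("prot_A", "prot_B", "1"),
    ("prot_A", "prot_C", "0"),
]

# Write tab-separated rows with no header line.
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
tsv_text = buffer.getvalue()

# Read back only the first two columns; the label column is ignored.
pairs = [tuple(r[:2]) for r in csv.reader(io.StringIO(tsv_text), delimiter="\t")]
```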

While pre-trained model files can be downloaded directly, we recommend instead passing the name of a pre-trained model that will be automatically downloaded from HuggingFace. Available models include:

samsl/dscript_human_v1
samsl/topsy_turvy_human_v1 (recommended, default)
samsl/tt3d_human_v1

dscript predict --pairs [input data] --embeddings [embedding file] --model [model file] --outfile [predictions file]

#Example
dscript predict --pairs data/pairs/ecoli_toy.tsv --embeddings ecoli_embed.h5 --outfile ecoli_toy_predict

For inference, proteins can be divided into blocks with --blocks to reduce the memory needed for embeddings. By default, the CPU is used; to run on a GPU, pass -d followed by a GPU index, or all to use every available GPU.

#Example with 16 blocks (using 3/16th the maximum embedding memory) and a GPU
dscript predict --pairs data/pairs/ecoli_test.tsv --embeddings ecoli_embed.h5 --outfile ecoli_test_predict --blocks 16 -d 0
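
When predictions are launched from a script, the same command line can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of D-SCRIPT; it uses only the flags shown in this README and builds the argument list without running anything.

```python
import shlex


def predict_command(pairs, embeddings, outfile, model=None, blocks=None, device=None):
    """Assemble a `dscript predict` argument list from the flags shown above."""
    cmd = ["dscript", "predict", "--pairs", pairs, "--embeddings", embeddings]
    if model is not None:
        cmd += ["--model", model]
    cmd += ["--outfile", outfile]
    if blocks is not None:
        cmd += ["--blocks", str(blocks)]
    if device is not None:
        cmd += ["-d", str(device)]  # a GPU index, or "all"
    return cmd


cmd = predict_command(
    "data/pairs/ecoli_test.tsv",
    "ecoli_embed.h5",
    "ecoli_test_predict",
    model="samsl/topsy_turvy_human_v1",
    blocks=16,
    device=0,
)
print(shlex.join(cmd))
```

With dscript installed, the list can be passed to `subprocess.run(cmd, check=True)`; keeping it as a list avoids shell quoting problems with unusual file names.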

For more information on prediction modes, such as all-pair and bipartite predictions, see our complete documentation.

References

The original D-SCRIPT model is described in the paper “D-SCRIPT translates genome to phenome with sequence-based, structure-aware, genome-scale predictions of protein-protein interactions.”
We have updated D-SCRIPT to incorporate network information (Topsy Turvy) and structure information (TT3D).
The addition of blocked, multi-GPU parallel inference to D-SCRIPT is described in the application note “Memory-Efficient, Accelerated Protein Interaction Inference with Blocked, Multi-GPU D-SCRIPT.”
Documentation
