mirror of
https://github.com/rdk/p2rank.git
synced 2026-06-04 12:44:24 +08:00
- export-pocket-{grid,descriptors}.md: audit-driven cleanup. Em-dashes
replaced, type notation unified to f64/i32, vis_pocket_grid_volume_radius
explanation consolidated into the parameters table, PyMOL/ChimeraX
section split into Files produced / Layers and toggles / Renderer notes,
preview warning rephrased.
- export-points.md, dev/evaluation-metric-fixes-2.6.md: em-dashes
replaced (colons, commas, parentheses).
- conservation.md: Background paragraph tightened to match release-notes
"previously only usable through PrankWeb" phrasing.
305 lines
12 KiB
Markdown
# Conservation Scores in P2Rank

## Background

Sequence conservation is a strong signal for predicting binding sites.
Conservation-aware models have been available in [PrankWeb](https://prankweb.cz) for several years,
but were previously only usable through PrankWeb (the conservation pipeline was bundled with
PrankWeb's infrastructure rather than runnable on its own).

Since **P2Rank 2.6.0** (including early `-dev` builds), users can obtain conservation scores
on the fly from an external conservation server and use conservation-aware prediction models
directly from the command line.

## Conservation-Aware Models

P2Rank ships with pre-trained models that include conservation features:

| Config | Description | Usage |
|--------|-------------|-------|
| `conservation_hmm` | For standard experimental structures (X-ray) | `-c conservation_hmm` |
| `alphafold_conservation_hmm` | For AlphaFold, cryo-EM, and NMR models (no B-factor feature) | `-c alphafold_conservation_hmm` |

Note: The pre-trained conservation-aware models shipped with P2Rank were trained
using HMM-based conservation scores (Jensen-Shannon divergence, `.hom` format).
Using conservation scores from a different method (e.g. one with a different score distribution or format)
is not expected to produce good results and may not work at all.
To use a different conservation method, you would need to retrain the model.

## Quick Start

```bash
# Predict with conservation (fetching scores from a server)
prank predict protein.pdb \
    -c conservation_hmm \
    -conservation_type hmm \
    -conservation_provider hmm_server \
    -conservation_provider_url http://localhost:8030

# Preload conservation cache for a dataset, then predict
prank preload-conservation dataset.ds \
    -conservation_type hmm \
    -conservation_provider hmm_server \
    -conservation_provider_url http://localhost:8030

prank predict dataset.ds \
    -c conservation_hmm \
    -conservation_type hmm \
    -conservation_provider hmm_server \
    -conservation_provider_url http://localhost:8030
```

## Obtaining Conservation Scores

There are two ways to provide conservation scores to P2Rank:

### 1. Pre-computed Score Files

Place `.hom` score files next to the protein files (or in directories specified by `-conservation_dirs`).
Files are matched by naming convention: `{proteinBaseName}_{chainId}.hom` (e.g., `2W83_A.hom`).

```bash
prank predict protein.pdb -c conservation_hmm -conservation_dirs ./my_scores/
```

Note: `-conservation_type` is only needed when using a provider or the cache mechanism.
With pre-computed files and `-conservation_dirs`, files are matched by naming convention alone.

### 2. External Conservation Server (since 2.6.0)

Configure P2Rank to fetch scores from an HTTP server on demand.
Fetched scores are cached locally so subsequent runs reuse them.

```bash
prank predict protein.pdb \
    -c conservation_hmm \
    -conservation_type hmm \
    -conservation_provider hmm_server \
    -conservation_provider_url http://localhost:8030
```

The server receives a POST request to `{url}/conservation` with a JSON body:
```json
{"fasta_content": ">proteinName_chainId\nSEQUENCE"}
```
and returns the raw `.hom` TSV content.

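
For testing a server by hand, the request described above can be reproduced with `curl`. The following is a minimal sketch, not P2Rank's own client: the function names `build_payload` and `fetch_conservation` are made up for illustration, the `Content-Type` header is an assumption (it is not specified here), and the sketch assumes the names and sequence contain no characters that need JSON escaping.

```bash
# Build the JSON request body for one chain.
# The "\n" between the FASTA header and the sequence is emitted as a literal
# backslash-n, which is how a newline is written inside a JSON string.
build_payload() {
    printf '{"fasta_content": ">%s_%s\\n%s"}\n' "$1" "$2" "$3"
}

# POST the body to {url}/conservation and save the returned .hom content.
# Usage: fetch_conservation URL PROTEIN_NAME CHAIN_ID SEQUENCE
# The Content-Type header is an assumption, not taken from the server docs.
fetch_conservation() {
    url="$1"; name="$2"; chain="$3"; seq="$4"
    build_payload "$name" "$chain" "$seq" |
        curl -fsS -X POST "${url%/}/conservation" \
             -H 'Content-Type: application/json' \
             --data-binary @- \
             -o "${name}_${chain}.hom"
}
```

For example, `fetch_conservation http://localhost:8030 2W83 A MKVLAAGIVG` would write the response to `2W83_A.hom` in the current directory (the sequence here is a placeholder). With `-f`, `curl` exits with a non-zero status on HTTP errors, so a failed request does not pass silently.
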
## Running the Conservation Server with Docker

The HMM conservation pipeline is available as a Docker image
(not published on Docker Hub yet, but it can be built from the `prankweb` repo).

Note: These instructions will be simplified in the near future
when the conservation server becomes part of the official P2Rank Docker image
or is published as a separate dedicated Docker image.

See also https://github.com/rdk/prankweb/tree/conservation-server/executor-p2rank/conservation.

```bash
# for now, you need to build the image from the `conservation-server` branch of the prankweb repo fork
git clone -b conservation-server git@github.com:rdk/prankweb.git
cd prankweb

# to avoid running docker as root, add the current user to the docker group (you may need to log out and back in for this to take effect)
sudo usermod -aG docker "$USER"

# prepare data/cache directory on the host
mkdir -p /ssd/p2rank-conservation-docker-data/hmm-based

# download the latest UniRef50 database
# (in a subshell, so the working directory stays in the prankweb repo for the docker compose commands below)
(
    cd /ssd/p2rank-conservation-docker-data/hmm-based || exit 1
    wget https://ftp.expasy.org/databases/uniprot/current_release/uniref/uniref50/uniref50.fasta.gz && gunzip uniref50.fasta.gz
)

# build/rebuild docker image (only necessary after changes in this repo)
docker compose build conservation-server

# run server with mounted data directory
docker compose run --rm -p 8030:8030 \
    --user "$(id -u):$(id -g)" \
    -v /ssd/p2rank-conservation-docker-data:/data/conservation \
    conservation-server
```

Once the server is running, you can verify it is available:
```bash
curl http://localhost:8030/health
```

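
In scripts it can be handy to wait for the server before starting a run. The sketch below is a generic retry helper, not part of P2Rank or the server; `wait_until` is a hypothetical name. It relies only on the exit status of the probe command, so it makes no assumption about what `/health` returns in its body.

```bash
# Run a probe command until it succeeds or the attempts run out.
# Usage: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as the probe succeeds, 1 if every attempt fails.
wait_until() (
    attempts="$1"; delay="$2"; shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            exit 0
        fi
        # Sleep only between attempts, not after the last one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    exit 1
)
```

A possible use is `wait_until 30 2 curl -fsS http://localhost:8030/health && prank predict ...`, which probes the health endpoint up to 30 times, two seconds apart, and starts the prediction only once a probe succeeds.
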
### Server-side Cache

The conservation server maintains a sequence-indexed result cache
in `/ssd/p2rank-conservation-docker-data/hmm-based-cache` (inside the mounted data directory).
Repeated queries for the same sequence are served immediately from this cache.

There is also an option to download a cache of conservation scores pre-computed by the PrankWeb server
for most PDB entries and some AlphaFold DB subsets (available upon request).
However, these were calculated with older versions of the UniRef database (from approximately 2023 or earlier),
so results may differ from scores computed with a current UniRef release.

## Local Cache of Conservation Scores

By default, cache files are stored next to each protein file in a `.p2rank-cache` directory.
For example, for protein `{protein_dir}/{baseName}.pdb`, conservation files would be cached in:
```
{protein_dir}/.p2rank-cache/conservation/{conservation_type}/{baseName}_{chainId}.hom
```

To use a custom cache directory instead, use the `-conservation_cache_dir` parameter:
```bash
prank predict protein.pdb \
    -c conservation_hmm \
    -conservation_type hmm \
    -conservation_provider hmm_server \
    -conservation_provider_url http://localhost:8030 \
    -conservation_cache_dir /path/to/custom/cache
```

Cache layout: `{conservation_cache_dir}/{conservation_type}/{baseName}_{chainId}.hom`

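
To check by hand whether a chain is already cached, the two layouts above can be turned into a small path helper. This is only a sketch: `cache_path` is a hypothetical function, and deriving `{baseName}` by stripping the last file extension is an assumption that may not match how P2Rank handles names such as `*.pdb.gz`.

```bash
# Print the expected cache file path for one chain.
# Usage: cache_path PROTEIN_FILE CHAIN_ID CONSERVATION_TYPE [CACHE_DIR]
# Without CACHE_DIR the default per-protein layout is used.
cache_path() (
    protein_file="$1"; chain="$2"; ctype="$3"; cache_dir="${4:-}"
    file_name="$(basename "$protein_file")"
    base="${file_name%.*}"   # assumption: baseName = file name without its last extension
    if [ -z "$cache_dir" ]; then
        cache_dir="$(dirname "$protein_file")/.p2rank-cache/conservation"
    fi
    printf '%s/%s/%s_%s.hom\n' "${cache_dir%/}" "$ctype" "$base" "$chain"
)
```

For instance, `cache_path data/2W83.pdb A hmm` prints `data/.p2rank-cache/conservation/hmm/2W83_A.hom`, and passing `/path/to/custom/cache` as a fourth argument prints `/path/to/custom/cache/hmm/2W83_A.hom`.
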
### Disabling Cache

To disable caching entirely (always fetch from server), add `-conservation_disable_cache true`
to your command.

### Score Resolution Order

When a conservation provider is configured, scores are resolved in this order:
1. Pre-computed files (from `conservation_dirs` or the protein file directory)
2. Local cache
3. Fetch from the conservation server (result is cached)

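
The resolution order can be mimicked with plain file checks, which is useful when trying to work out why a run contacts the server. The sketch below only illustrates the documented order and is not P2Rank's implementation; `score_source` is a hypothetical name, and the case where the cache is disabled is not modelled.

```bash
# Print where the scores for one chain would come from:
# "precomputed", "cache", or "server".
# Usage: score_source HOM_FILE_NAME CACHE_TYPE_DIR DIR [DIR...]
#   HOM_FILE_NAME   e.g. 2W83_A.hom
#   CACHE_TYPE_DIR  cache directory including the {conservation_type} part
#   DIR...          directories searched for pre-computed files
score_source() (
    hom="$1"; cache_type_dir="$2"; shift 2
    # 1. Pre-computed files take precedence.
    for dir in "$@"; do
        if [ -f "${dir%/}/$hom" ]; then
            echo precomputed
            exit 0
        fi
    done
    # 2. Then the local cache.
    if [ -f "${cache_type_dir%/}/$hom" ]; then
        echo cache
        exit 0
    fi
    # 3. Otherwise the server would be queried.
    echo server
)
```

For example, `score_source 2W83_A.hom ./data/.p2rank-cache/conservation/hmm ./my_scores ./data` reports which of the three sources would supply the scores for chain A of `2W83`.
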
## Preloading Conservation Cache (Pre-calculating Conservation Scores)

Note that conservation calculation is computationally intensive
and typically takes several minutes per chain (depending on sequence length and hardware).

The `preload-conservation` command fetches and caches conservation scores for all chains in a dataset
without running predictions. This is useful on large datasets for catching any errors related to conservation
retrieval/calculation before starting a full prediction run.

```bash
prank preload-conservation dataset.ds \
    -conservation_type hmm \
    -conservation_provider hmm_server \
    -conservation_provider_url http://localhost:8030
```

The command reports progress and a summary of cached/fetched/failed chains.

Some long sequences may exceed the default timeout.
If some chains fail with timeout errors, you can safely re-run the same command.
Already cached chains will be skipped, and only the failed ones will be retried.
Use `-conservation_provider_timeout` to increase the per-request timeout:

```bash
prank preload-conservation dataset.ds \
    -conservation_type hmm \
    -conservation_provider hmm_server \
    -conservation_provider_url http://localhost:8030 \
    -conservation_provider_timeout 1200 # 20 minutes, default is 600 seconds (10 minutes)
```

You can also control the number of concurrent requests to avoid overloading the server:

```bash
prank preload-conservation dataset.ds \
    -conservation_type hmm \
    -conservation_provider hmm_server \
    -conservation_provider_url http://localhost:8030 \
    -conservation_provider_threads 4
```

### Preloading to a Dedicated Directory for Reproducible Experiments

For reproducible experiments, it is better to preload conservation scores into a dedicated directory
next to the dataset file and then use `-conservation_dirs` to point to it during prediction.
This way the scores are stored in a known location, are easy to version or share,
and you don't rely on the implicit per-protein cache mechanism.

```bash
# 1. Preload scores into a dedicated directory next to the dataset file
prank preload-conservation dataset.ds \
    -conservation_type hmm \
    -conservation_provider hmm_server \
    -conservation_provider_url http://localhost:8030 \
    -conservation_cache_dir ./conservation

# 2. Run prediction using -conservation_dirs, with cache disabled
# (scores are loaded directly from the directory, no cache is read or written)
prank predict dataset.ds \
    -c alphafold_conservation_hmm \
    -conservation_dirs ./conservation/hmm \
    -conservation_disable_cache true
```

Note: `-conservation_dirs` points to `./conservation/hmm` because preloading
creates the layout `{conservation_cache_dir}/{conservation_type}/`, so score files
end up in `./conservation/hmm/*.hom`.

## Parameters Reference

### Core

| Parameter | Default | Description |
|-----------|---------|-------------|
| `conservation_type` | `null` | Type of conservation scores. Determines cache subdirectory. Currently: `hmm` |
| `conservation_dirs` | `[]` | Directories to search for pre-computed score files. Relative to dataset dir. |

### Provider

| Parameter | Default | Description |
|-----------|---------|-------------|
| `conservation_provider` | `null` | External provider type. Currently: `hmm_server`. Null = use pre-computed files only. |
| `conservation_provider_url` | `null` | Base URL of the conservation server. Required when provider is set. |
| `conservation_provider_timeout` | `600` | Per-request timeout in seconds. |
| `conservation_provider_threads` | `0` | Max concurrent requests to the server. 0 = use `-threads` value. |

### Cache

| Parameter | Default | Description |
|-----------|---------|-------------|
| `conservation_cache_dir` | `null` | Centralized cache directory. Null = cache next to each protein file. |
| `conservation_disable_cache` | `false` | When true, cache is neither read nor written. |

### Other

| Parameter | Default | Description |
|-----------|---------|-------------|
| `fail_on_conserv_seq_mismatch` | `false` | Fail when sequences in the structure and score file do not match exactly. |

## Non-Standard Residue Mapping

PDB structures often contain modified (non-canonical) amino acid residues (e.g., selenomethionine MSE, phosphoserine SEP).
Before sending sequences to the conservation server or matching them against score files,
P2Rank maps these residues to standard amino acids according to the `-aa_mapping` parameter.

For example, with `-aa_mapping pdbfixer`, MSE is mapped to MET, SEP to SER, etc.
This affects the FASTA sequence that P2Rank sends to the conservation provider
and the sequence used for alignment between structure residues and conservation scores.

See [aa-mapping.md](aa-mapping.md) for details on available mapping modes and custom mapping files.

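
The effect of such a mapping on the resulting sequence can be illustrated with a toy converter. This is a sketch of the idea only, not the `pdbfixer` mapping table: the two modified residues handled are the ones named above (MSE and SEP), the function names are hypothetical, and turning every other unknown residue name into `X` is an assumption made for this example.

```bash
# Map a three-letter residue name to a one-letter code.
# MSE and SEP are folded into MET and SER; unknown names become X
# (an assumption for this sketch, not P2Rank's documented behaviour).
residue_to_letter() {
    case "$1" in
        ALA) echo A ;; ARG) echo R ;; ASN) echo N ;; ASP) echo D ;;
        CYS) echo C ;; GLN) echo Q ;; GLU) echo E ;; GLY) echo G ;;
        HIS) echo H ;; ILE) echo I ;; LEU) echo L ;; LYS) echo K ;;
        PHE) echo F ;; PRO) echo P ;; THR) echo T ;; TRP) echo W ;;
        TYR) echo Y ;; VAL) echo V ;;
        MET|MSE) echo M ;;
        SER|SEP) echo S ;;
        *) echo X ;;
    esac
}

# Convert a list of residue names into a one-letter sequence.
# Usage: to_fasta_sequence RES [RES...]
to_fasta_sequence() (
    seq=""
    for res in "$@"; do
        seq="${seq}$(residue_to_letter "$res")"
    done
    printf '%s\n' "$seq"
)
```

With this sketch, `to_fasta_sequence MSE ALA SEP GLY` yields `MASG`: the modified residues contribute the letters of their standard counterparts, which is the sequence a conservation score file would need to match.
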
## Debugging Sequence and Conservation Mapping

P2Rank provides several `analyze` subcommands useful for inspecting how sequences are processed:

```bash
# Export raw FASTA sequences as P2Rank sees them (after aa_mapping, before masking)
prank analyze fasta-raw dataset.ds

# Export masked FASTA sequences (non-standard residue codes replaced with X)
prank analyze fasta-masked dataset.ds

# Analyze conservation scores mapped to residues and optionally produce PyMOL visualizations
prank analyze conservation dataset.ds -threads 1 -visualizations true
```

`fasta-raw` shows the sequence after applying `-aa_mapping` but without any further masking.
`fasta-masked` additionally replaces non-standard one-letter codes with `X`.
This is the version sent to the conservation server and used for mapping scores to residues.
Comparing the two outputs helps identify residues that may cause mismatches with conservation score files.

The `conservation` subcommand loads and prints per-residue conservation scores for each chain.
With `-visualizations true`, it also generates PyMOL visualization scripts.

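
Comparing the raw and masked sequences of a chain by eye gets tedious for long chains. The helper below is a hedged sketch of that comparison: `diff_positions` is a hypothetical function that takes the two sequences as plain strings (extracting them from the exported FASTA files is left out, since the output file names are not covered here).

```bash
# Print the 1-based positions at which two equal-length sequences differ,
# one line per position: "<position> <raw letter> <masked letter>".
# Usage: diff_positions RAW_SEQUENCE MASKED_SEQUENCE
diff_positions() (
    raw="$1"; masked="$2"
    if [ "${#raw}" -ne "${#masked}" ]; then
        echo "sequences differ in length" >&2
        exit 2
    fi
    i=1
    while [ "$i" -le "${#raw}" ]; do
        a="$(printf '%s' "$raw" | cut -c "$i")"
        b="$(printf '%s' "$masked" | cut -c "$i")"
        if [ "$a" != "$b" ]; then
            printf '%s %s %s\n' "$i" "$a" "$b"
        fi
        i=$((i + 1))
    done
)
```

For example, `diff_positions MKUV MKXV` prints `3 U X`, pointing at the third residue as the one that was masked.
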