mirror of
https://github.com/rdk/p2rank.git
synced 2026-06-04 20:54:23 +08:00
# Exporting SAS Points with Feature Vectors

Export SAS (solvent accessible surface) points with feature vectors and (optionally) predicted ligandability scores and pocket assignments.

## Commands

There are two ways to export SAS points:

### `export-points` - standalone export (no model needed)

```bash
prank export-points -f protein.pdb
prank export-points -f protein.pdb -export_points_format parquet
prank export-points dataset.ds -export_points_format arrow.zst
```

The `export-points` command calculates SAS points with feature vectors and exports them directly - **no model is loaded and no prediction is made**.
This means the output does **not** contain `score` or `pocket` columns, but you are free to use any custom feature setup via the `-features` and `-extra_features` parameters.

### `predict` / `rescore` - export alongside prediction

```bash
prank predict -f protein.pdb -export_points 1
prank predict -f protein.pdb -export_points 1 -export_points_format csv.gz
prank predict -f protein.pdb -export_points 1 -export_points_format arrow
prank predict -f protein.pdb -export_points 1 -export_points_format parquet
prank predict dataset.ds -export_points 1 -export_points_format arrow.zst
```

The `rescore` command also supports export (pocket points only):
```bash
prank rescore joined-fpocket.ds -export_points 1 -export_points_format arrow.zst
```

With `predict`/`rescore`, the output includes a `score` column with predicted ligandability and a `pocket` column with the predicted pocket rank (`0` if the point is not assigned to any pocket).
However, because prediction relies on a pre-trained model that expects a particular set and order of features,
you **cannot** customize the feature setup (changing `-features` or `-extra_features` would break the model).

### Which command to use?

| | `export-points` | `predict -export_points 1` |
|---|---|---|
| Custom feature setup | Yes | No (must match the model) |
| Predicted `score` and `pocket` columns | No | Yes |
| Requires a model | No | Yes |

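
When the export is driven from a script, the choice in the table above comes down to which subcommand and flags are passed. The following is a minimal sketch, not part of P2Rank: `build_export_command` is a hypothetical helper, and the rule that `.ds` dataset files are passed positionally while single structures use `-f` is inferred only from the example commands in this document.

```python
from typing import List, Optional


def build_export_command(
    input_path: str,
    with_prediction: bool,
    points_format: Optional[str] = None,
) -> List[str]:
    """Build an argument list for exporting SAS points (hypothetical helper)."""
    # Assumption taken from the examples: datasets (*.ds) are positional,
    # single structure files are passed with -f.
    target = [input_path] if input_path.endswith(".ds") else ["-f", input_path]
    if with_prediction:
        # predict only exports points when -export_points is switched on.
        args = ["prank", "predict", *target, "-export_points", "1"]
    else:
        # export-points always exports, so no -export_points flag is needed.
        args = ["prank", "export-points", *target]
    if points_format is not None:
        args += ["-export_points_format", points_format]
    return args


print(" ".join(build_export_command("protein.pdb", True, "parquet")))
```

The returned list can be handed to `subprocess.run` as is, which avoids shell quoting issues with unusual file names.
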
## Output

For each protein file, a `{protein_file}_points.{format}` file is generated:

| Column | Description |
|--------|-------------|
| `x`, `y`, `z` | SAS point coordinates |
| `score` | Predicted ligandability [0-1] (`predict`/`rescore` only) |
| `pocket` | Predicted pocket rank (1, 2, …); `0` = point not assigned to any pocket. Integer column, present in `predict`/`rescore` output, absent in standalone `export-points` |
| `feature1`, ... | Feature values based on effective feature setup (`-features`, `-extra_features`) |

Example - `predict -export_points 1` (CSV):
```csv
x,y,z,score,pocket,chem.hydrophobic,chem.aromatic,protrusion,...
12.3456,23.4567,34.5678,0.8234,1,0.5123,-0.2345,15.0000,...
```

Example - `export-points` (CSV):
```csv
x,y,z,chem.hydrophobic,chem.aromatic,protrusion,...
12.3456,23.4567,34.5678,0.5123,-0.2345,15.0000,...
```

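
Because the two commands produce different column sets, code that consumes these files may want to detect which layout it received. The sketch below uses only the Python standard library and an inline sample shortened from the example above (the elided `...` columns are left out). `split_columns` and `NON_FEATURE_COLUMNS` are illustrative names, and the code assumes no feature column is itself called `score` or `pocket`.

```python
import csv
import io

# Inline stand-in for a points file; real files carry many more feature columns.
SAMPLE = """x,y,z,score,pocket,chem.hydrophobic,chem.aromatic,protrusion
12.3456,23.4567,34.5678,0.8234,1,0.5123,-0.2345,15.0000
"""

NON_FEATURE_COLUMNS = ("x", "y", "z", "score", "pocket")


def split_columns(header):
    """Return (fixed columns present, feature columns) for a points header."""
    fixed = [name for name in header if name in NON_FEATURE_COLUMNS]
    features = [name for name in header if name not in NON_FEATURE_COLUMNS]
    return fixed, features


reader = csv.DictReader(io.StringIO(SAMPLE))
fixed, features = split_columns(reader.fieldnames)
# score and pocket are only present in predict / rescore output.
has_prediction = "score" in fixed and "pocket" in fixed

for row in reader:
    coords = tuple(float(row[axis]) for axis in ("x", "y", "z"))
    vector = [float(row[name]) for name in features]
    pocket = int(row["pocket"]) if has_prediction else None
    print(coords, pocket, vector)
```
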
## Parameters

| Parameter | Default | Values |
|-----------|---------|--------|
| `export_points` | `false` | `true` / `false` |
| `export_points_format` | `csv` | `csv`, `csv.gz`, `csv.zst`, `arrow`, `arrow.gz`, `arrow.zst`, `parquet` |

**Arrow format** preserves full double precision and offers faster loading and lower memory usage than CSV.

**Parquet format** is a columnar storage format widely supported by data analysis tools (pandas, polars, DuckDB, Spark). It uses SNAPPY compression internally.

## Format Recommendations

| Use Case | Recommended Format |
|----------|-------------------|
| Smallest file size | `csv.zst` or `arrow.zst` |
| Python/R analysis | `parquet` or `csv.gz` |
| Streaming/pipes | `arrow` (uncompressed) |
| Maximum compatibility | `csv` |

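
A script that has to accept several of these formats can first split the file name into a container format and an outer compression, then pick a reader accordingly. This is a small sketch under the assumption that output files end in `_points.{format}` as described in the Output section; `split_points_format` is a hypothetical helper name.

```python
from typing import Optional, Tuple

# Values accepted by -export_points_format, as listed in the Parameters table.
KNOWN_FORMATS = ("csv", "csv.gz", "csv.zst", "arrow", "arrow.gz", "arrow.zst", "parquet")


def split_points_format(file_name: str) -> Tuple[str, Optional[str]]:
    """Return (container, outer compression) for a points file name."""
    for fmt in KNOWN_FORMATS:
        if file_name.endswith("_points." + fmt):
            container, _, compression = fmt.partition(".")
            # An empty string means the file has no outer compression layer.
            return container, compression or None
    raise ValueError(f"not a recognized points file: {file_name}")


print(split_points_format("protein_points.arrow.zst"))
print(split_points_format("protein_points.parquet"))
```

For Parquet the second element is `None` because its compression is applied inside the file rather than as a wrapper around it.
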
## Notes

- `export-points` and `predict` export all SAS points; `rescore` exports only pocket points
- `export-points` does not require `-export_points 1` - exporting is always on
- CSV format uses up to 7 decimal places for floating-point values; the `pocket` column is written as a plain integer
- Arrow uses the IPC streaming format with 64-bit floats; the `pocket` column uses Int32
- Parquet uses SNAPPY compression (not configurable); the `pocket` column uses INT32
- Zstd compression uses level 16 for a good compression ratio
- Export is disabled when using `-output_only_stats 1`
- `pocket` matches the `rank` column of `*_predictions.csv`. Boundary points that fall within the extended shells of two pockets (controlled by `extended_pocket_cutoff`) are labeled with the **best** (lowest) rank they belong to.

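
The labeling rule in the last note can be stated in a few lines of code. This is only an illustration of the rule as described, not P2Rank's implementation; `pocket_label` is a hypothetical function that receives the ranks of all pockets whose extended shell contains a given point.

```python
from typing import Iterable


def pocket_label(candidate_ranks: Iterable[int]) -> int:
    """Pick the `pocket` value for one SAS point (illustrative only)."""
    ranks = [rank for rank in candidate_ranks if rank > 0]
    # Best pocket means lowest rank; 0 marks a point outside every pocket.
    return min(ranks) if ranks else 0


print(pocket_label([3, 1]))  # point shared by pockets 3 and 1
print(pocket_label([]))      # point outside all pockets
```
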
## Example Analysis

**Python (CSV):**
~~~python
import pandas as pd
df = pd.read_csv('protein_points.csv.gz')
print(df.describe())

# Points with high predicted ligandability (predict / rescore output only):
high_score = df[df['score'] > 0.5]

# Per-pocket aggregated descriptors (predict / rescore output only):
pocket_descriptors = df[df['pocket'] > 0].groupby('pocket').mean()
~~~

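
The per-pocket aggregation above can also be done with the standard library alone, which may be handy where pandas is not installed. The sketch below computes the mean `score` per pocket from a gzip-compressed CSV. The rows are made-up sample data held in memory, and `mean_score_per_pocket` is an illustrative name.

```python
import csv
import gzip
import io
from collections import defaultdict


def mean_score_per_pocket(text_stream):
    """Return {pocket rank: mean score}, skipping unassigned points (pocket 0)."""
    totals = defaultdict(lambda: [0.0, 0])  # rank -> [score sum, point count]
    for row in csv.DictReader(text_stream):
        rank = int(row["pocket"])
        if rank == 0:
            continue
        totals[rank][0] += float(row["score"])
        totals[rank][1] += 1
    return {rank: total / count for rank, (total, count) in sorted(totals.items())}


# Made-up rows standing in for a real points file.
raw = (
    b"x,y,z,score,pocket\n"
    b"1.0,2.0,3.0,0.9,1\n"
    b"1.5,2.5,3.5,0.7,1\n"
    b"4.0,5.0,6.0,0.2,0\n"
    b"7.0,8.0,9.0,0.6,2\n"
)
compressed = io.BytesIO(gzip.compress(raw))

# For a file on disk, pass its path instead of the in-memory buffer.
with gzip.open(compressed, "rt", newline="") as stream:
    print(mean_score_per_pocket(stream))
```
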
**Python (Arrow):**
~~~python
import pyarrow as pa

# Uncompressed
df = pa.ipc.open_stream('protein_points.arrow').read_pandas()

# Gzip-compressed - streaming format allows direct reading
import gzip
with gzip.open('protein_points.arrow.gz', 'rb') as f:
    df = pa.ipc.open_stream(f).read_pandas()

# Zstd-compressed
import zstandard as zstd
with open('protein_points.arrow.zst', 'rb') as f:
    df = pa.ipc.open_stream(zstd.ZstdDecompressor().stream_reader(f)).read_pandas()
~~~

**Python (Parquet):**
~~~python
import pandas as pd
df = pd.read_parquet('protein_points.parquet')

# Or with PyArrow directly
import pyarrow.parquet as pq
table = pq.read_table('protein_points.parquet')
df = table.to_pandas()
~~~

**Python (Polars):**
~~~python
import polars as pl

# Parquet (fastest)
df = pl.read_parquet('protein_points.parquet')

# CSV with compression
df = pl.read_csv('protein_points.csv.gz')

# Filter high-scoring points (predict / rescore output only)
high_score = df.filter(pl.col('score') > 0.5)
~~~

## See also

- [`export-pocket-grid.md`](export-pocket-grid.md): 3D grid of points
  covering empty space around the protein, tagged by predicted pocket
- [`export-pocket-descriptors.md`](export-pocket-descriptors.md):
  per-pocket geometric descriptors (volume, sphericity, ...)