p2rank/documentation/export-points.md
rdk 59bc84c265 Mention pocket column alongside score in export-points docs
The score and pocket columns share the same predict/rescore-only
origin, so describe them together in the prose, the export-points
"not contained" caveat, the predict/rescore output description, and
the "Which command to use?" table.
2026-05-07 03:21:38 +02:00


Exporting SAS Points with Feature Vectors

Export SAS (solvent accessible surface) points with feature vectors and (optionally) predicted ligandability scores and pocket assignments.

Commands

There are two ways to export SAS points:

export-points - standalone export (no model needed)

prank export-points -f protein.pdb
prank export-points -f protein.pdb -export_points_format parquet
prank export-points dataset.ds     -export_points_format arrow.zst

The export-points command calculates SAS points with feature vectors and exports them directly: no model is loaded and no prediction is made. As a result, the output contains no score or pocket columns, but you are free to use any custom feature setup via the -features and -extra_features parameters.

predict / rescore - export alongside prediction

prank predict -f protein.pdb -export_points 1
prank predict -f protein.pdb -export_points 1 -export_points_format csv.gz
prank predict -f protein.pdb -export_points 1 -export_points_format arrow
prank predict -f protein.pdb -export_points 1 -export_points_format parquet
prank predict dataset.ds     -export_points 1 -export_points_format arrow.zst

The rescore command also supports export (pocket points only):

prank rescore joined-fpocket.ds -export_points 1 -export_points_format arrow.zst

With predict/rescore, the output includes a score column with predicted ligandability and a pocket column with the predicted pocket rank (0 if the point is not assigned to any pocket). However, because prediction relies on a pre-trained model that expects a particular set and order of features, you cannot customize the feature setup (changing -features or -extra_features would break the model).

Which command to use?

| | export-points | predict -export_points 1 |
|---|---|---|
| Custom feature setup | Yes | No (must match the model) |
| Predicted score and pocket columns | No | Yes |
| Requires a model | No | Yes |

Output

For each protein file, a {protein_file}_points.{format} file is generated:

| Column | Description |
|---|---|
| x, y, z | SAS point coordinates |
| score | Predicted ligandability [0-1] (predict/rescore only) |
| pocket | Predicted pocket rank (1, 2, …); 0 = point not assigned to any pocket. Integer column, present in predict/rescore output, absent in standalone export-points |
| feature1, ... | Feature values based on effective feature setup (-features, -extra_features) |

Example - predict -export_points 1 (CSV):

x,y,z,score,pocket,chem.hydrophobic,chem.aromatic,protrusion,...
12.3456,23.4567,34.5678,0.8234,1,0.5123,-0.2345,15.0000,...

Example - export-points (CSV):

x,y,z,chem.hydrophobic,chem.aromatic,protrusion,...
12.3456,23.4567,34.5678,0.5123,-0.2345,15.0000,...
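
Because the two commands produce different column sets, downstream code usually has to work out which columns are coordinates, which are predictions, and which are features. The following is a minimal sketch using only the standard library. It assumes that every column other than x, y, z, score and pocket is a feature; the helper name classify_columns is hypothetical, and the sample rows are the two examples above with the elided "..." columns dropped.

```python
import csv
import io

# Column names taken from the examples above; anything else is treated as a feature.
COORD_COLUMNS = ("x", "y", "z")
PREDICTION_COLUMNS = ("score", "pocket")  # present only in predict/rescore output


def classify_columns(header):
    """Split a points-file header into coordinate, prediction and feature columns."""
    missing = [c for c in COORD_COLUMNS if c not in header]
    if missing:
        raise ValueError(f"missing coordinate columns: {missing}")
    reserved = COORD_COLUMNS + PREDICTION_COLUMNS
    return {
        "coords": list(COORD_COLUMNS),
        "prediction": [c for c in header if c in PREDICTION_COLUMNS],
        "features": [c for c in header if c not in reserved],  # original order kept
    }


# Shortened copies of the two CSV examples (trailing "..." columns omitted).
predict_csv = (
    "x,y,z,score,pocket,chem.hydrophobic,chem.aromatic,protrusion\n"
    "12.3456,23.4567,34.5678,0.8234,1,0.5123,-0.2345,15.0000\n"
)
standalone_csv = (
    "x,y,z,chem.hydrophobic,chem.aromatic,protrusion\n"
    "12.3456,23.4567,34.5678,0.5123,-0.2345,15.0000\n"
)

for name, text in [("predict", predict_csv), ("export-points", standalone_csv)]:
    reader = csv.DictReader(io.StringIO(text))
    layout = classify_columns(reader.fieldnames)
    print(name, layout["prediction"], layout["features"])
```

An empty "prediction" list is a quick way to tell that a file came from standalone export-points rather than predict or rescore.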

Parameters

| Parameter | Default | Values |
|---|---|---|
| export_points | false | true / false |
| export_points_format | csv | csv, csv.gz, csv.zst, arrow, arrow.gz, arrow.zst, parquet |

Arrow format preserves full double precision and offers faster loading and lower memory usage than CSV.

Parquet format is a columnar storage format widely supported by data analysis tools (pandas, polars, DuckDB, Spark). It uses SNAPPY compression internally.
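
One practical consequence of the precision difference: the same value read from a CSV export (up to 7 decimal places, per the Notes) and from an Arrow export (64-bit floats) may not compare equal. The sketch below shows a tolerance-based comparison. The function to_csv_text is a hypothetical stand-in for the CSV writer, not P2Rank code, and the tolerance of 1e-7 is an assumption chosen to cover both rounding and truncation at 7 decimals.

```python
import math

CSV_DECIMALS = 7                 # "up to 7 decimal places", as stated in the Notes
TOLERANCE = 10 ** -CSV_DECIMALS  # assumed bound; covers rounding and truncation


def to_csv_text(value, decimals=CSV_DECIMALS):
    """Hypothetical stand-in for the CSV writer: fixed-point text, limited precision."""
    return f"{value:.{decimals}f}"


def same_value(csv_value, arrow_value, tol=TOLERANCE):
    """Compare a CSV-derived value with a full-precision one using an absolute tolerance."""
    return math.isclose(csv_value, arrow_value, rel_tol=0.0, abs_tol=tol)


full = 0.123456789012345             # value as it would be stored in Arrow
from_csv = float(to_csv_text(full))  # value after a round trip through CSV text

print(from_csv == full)            # False: exact comparison fails
print(same_value(from_csv, full))  # True: equal within the CSV precision
```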

Format Recommendations

| Use Case | Recommended Format |
|---|---|
| Smallest file size | csv.zst or arrow.zst |
| Python/R analysis | parquet or csv.gz |
| Streaming/pipes | arrow (uncompressed) |
| Maximum compatibility | csv |
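
Whichever format you pick, a small loader that dispatches on the file suffix keeps analysis code independent of that choice. This is a sketch, not part of P2Rank: the names parse_points_format and load_points are hypothetical, and the loader assumes pandas, pyarrow and (for .zst files) zstandard are installed. Third-party imports are deferred so that the suffix parsing works on its own.

```python
from pathlib import Path

CONTAINERS = ("csv", "arrow", "parquet")
COMPRESSIONS = ("gz", "zst")


def parse_points_format(path):
    """Return (container, compression) for a points file name, e.g. ('arrow', 'zst')."""
    suffixes = [s.lstrip(".") for s in Path(path).suffixes]
    compression = None
    if suffixes and suffixes[-1] in COMPRESSIONS:
        compression = suffixes.pop()
    if not suffixes or suffixes[-1] not in CONTAINERS:
        raise ValueError(f"unrecognised points file: {path}")
    container = suffixes[-1]
    if container == "parquet" and compression is not None:
        raise ValueError("parquet is not listed with an outer compression suffix")
    return container, compression


def load_points(path):
    """Load a points file into a pandas DataFrame, whatever the export format."""
    container, compression = parse_points_format(path)
    import pandas as pd
    if container == "parquet":
        return pd.read_parquet(path)  # needs a parquet engine such as pyarrow
    if container == "csv":
        return pd.read_csv(path)      # compression is inferred from the suffix
    import pyarrow as pa
    with open(path, "rb") as raw:
        if compression == "gz":
            import gzip
            stream = gzip.GzipFile(fileobj=raw)
        elif compression == "zst":
            import zstandard
            stream = zstandard.ZstdDecompressor().stream_reader(raw)
        else:
            stream = raw
        # The Arrow export is an IPC stream, so it can be read through a decompressor.
        return pa.ipc.open_stream(stream).read_pandas()


for fmt in ("csv", "csv.gz", "csv.zst", "arrow", "arrow.gz", "arrow.zst", "parquet"):
    print(fmt, parse_points_format(f"protein_points.{fmt}"))
```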

Notes

  • export-points and predict export all SAS points; rescore exports only pocket points
  • export-points does not require -export_points 1; exporting is always on
  • CSV format uses up to 7 decimal places for floating-point values; the pocket column is written as a plain integer
  • Arrow uses IPC streaming format with 64-bit floats; the pocket column uses Int32
  • Parquet uses SNAPPY compression (not configurable); the pocket column uses INT32
  • Zstd compression uses level 16 for a good compression ratio
  • Export is disabled when using -output_only_stats 1
  • pocket matches the rank column of *_predictions.csv. Boundary points that fall within the extended shells of two pockets (controlled by extended_pocket_cutoff) are labeled with the best (lowest) rank they belong to.
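
The labeling rule in the last note can be illustrated with a few lines of Python. This only demonstrates the stated rule (lowest rank wins, 0 when no pocket claims the point); it is not P2Rank's implementation, and the memberships mapping is made-up illustration data.

```python
def label_point(candidate_ranks):
    """Return the pocket label for one point, given the ranks of all pockets claiming it."""
    # Lowest rank is the best pocket; an unclaimed point gets 0.
    return min(candidate_ranks) if candidate_ranks else 0


# Hypothetical memberships: point id -> ranks of pockets whose extended shell contains it.
memberships = {
    "p1": [2],     # inside pocket 2 only
    "p2": [3, 1],  # boundary point shared by pockets 3 and 1
    "p3": [],      # not assigned to any pocket
}

labels = {point_id: label_point(ranks) for point_id, ranks in memberships.items()}
print(labels)  # {'p1': 2, 'p2': 1, 'p3': 0}
```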

Example Analysis

Python (CSV):

import pandas as pd
df = pd.read_csv('protein_points.csv.gz')
high_score = df[df['score'] > 0.5]
print(df.describe())

# Per-pocket aggregated descriptors (predict / rescore output only):
pocket_descriptors = df[df['pocket'] > 0].groupby('pocket').mean()

Python (Arrow):

import pyarrow as pa

# Uncompressed
df = pa.ipc.open_stream('protein_points.arrow').read_pandas()

# Gzip-compressed - streaming format allows direct reading
import gzip
with gzip.open('protein_points.arrow.gz', 'rb') as f:
    df = pa.ipc.open_stream(f).read_pandas()

# Zstd-compressed
import zstandard as zstd
with open('protein_points.arrow.zst', 'rb') as f:
    df = pa.ipc.open_stream(zstd.ZstdDecompressor().stream_reader(f)).read_pandas()

Python (Parquet):

import pandas as pd
df = pd.read_parquet('protein_points.parquet')

# Or with PyArrow directly
import pyarrow.parquet as pq
table = pq.read_table('protein_points.parquet')
df = table.to_pandas()

Python (Polars):

import polars as pl

# Parquet (fastest)
df = pl.read_parquet('protein_points.parquet')

# CSV with compression
df = pl.read_csv('protein_points.csv.gz')

# Filter high-scoring points
high_score = df.filter(pl.col('score') > 0.5)