pub_soft/ReQFlow

mirror of https://github.com/AngxiaoYue/ReQFlow.git synced 2026-06-04 12:14:23 +08:00

Angxiao Yue 8194619462 update

2025-02-22 00:32:08 +08:00

Data processing guide

This guide covers data preparation for training (base and reflow).

Demo Dataset

Demo data is provided in the data folder. The most important files are metadata_debug.csv (in the demo_processed_pdb folder) and clusters-by-entity-30.txt (in the metadata folder). Make sure their paths are set correctly in _datasets.yaml:

scope_dataset:
  ...
  csv_path: ./metadata/scope_metadata.csv # For normal training only
  rectify_csv_path: path/to/rectify_scope_metadata.csv # For reflow training only. Can be ignored in base training
  ...

pdb_dataset:
  ...
  csv_path: path/to/pdb_metadata.csv # For normal training only
  rectify_csv_path: path/to/rectify_pdb_metadata.csv  # For reflow training only. Can be ignored in base training
  cluster_path: ./metadata/clusters-by-entity-30.txt # For normal training only

For testing or debugging, we recommend running on the demo dataset.
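
A quick check before training can catch a misconfigured path early. The following is a minimal sketch, assuming the paths have been copied by hand from _datasets.yaml into a plain dict; the helper name `missing_paths` and the dict are hypothetical and not part of the repo.

```python
from pathlib import Path


def missing_paths(paths: dict[str, str]) -> list[str]:
    """Return the keys whose configured path does not point to an existing file."""
    return [key for key, value in paths.items() if not Path(value).is_file()]


# Hypothetical values copied from the pdb_dataset section of _datasets.yaml.
pdb_paths = {
    "csv_path": "path/to/pdb_metadata.csv",
    "cluster_path": "./metadata/clusters-by-entity-30.txt",
}

for key in missing_paths(pdb_paths):
    print(f"{key} does not exist: {pdb_paths[key]}")
```

Relative paths are resolved against the current working directory, so run the check from the same directory you launch training from.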

Download and process Full PDB Dataset

The procedure below is the same as FrameDiff's, but you do not need to clone an additional repository.

WARNING: Downloading PDB can take up to 400GB of space.

To start the download, run:

nohup rsync -rlpt -v -z --delete --port=33444 rsync.rcsb.org::ftp_data/structures/divided/mmCIF/ ./mmCIF > download.log 2>&1 &

The download (~80GB) can take hours, so running it under nohup or tmux is recommended (the command above uses nohup).

After downloading, you should have a directory formatted like this: https://files.rcsb.org/pub/pdb/data/structures/divided/mmCIF/

00/
01/
02/
..
zz/

Each folder is named after the two middle characters of the PDB IDs of the entries it contains (for example, entry 100d is stored under 00/).
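
To locate a single entry in the downloaded tree, the folder rule can be written as a small helper. This is a sketch, assuming the standard divided layout in which the folder name is the two middle characters of a four-character PDB ID; the function name `mmcif_path` is hypothetical and not part of the repo.

```python
from pathlib import Path


def mmcif_path(root: str, pdb_id: str, compressed: bool = True) -> Path:
    """Build the expected location of one entry inside the divided mmCIF tree."""
    pdb_id = pdb_id.lower()
    if len(pdb_id) != 4:
        raise ValueError(f"expected a 4-character PDB ID, got {pdb_id!r}")
    suffix = ".cif.gz" if compressed else ".cif"
    # The subfolder is the middle two characters of the ID.
    return Path(root) / pdb_id[1:3] / f"{pdb_id}{suffix}"


print(mmcif_path("./mmCIF", "100D"))
```

Pass `compressed=False` after the files have been unzipped.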

Then unzip all files (up to 300GB):

cd mmCIF
find . -name '*.gz' -exec gzip -d {} \;
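
If you prefer to stay in Python, the same decompression step can be sketched with the standard library. This is an illustrative alternative to the command above, not a script shipped with the repo; the function name `gunzip_tree` is hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_tree(root: str) -> int:
    """Decompress every *.gz file under root in place and return how many were handled."""
    count = 0
    # Collect the paths first so the tree is not modified while it is being walked.
    for gz_path in sorted(Path(root).rglob("*.gz")):
        target = gz_path.with_suffix("")  # 100d.cif.gz -> 100d.cif
        with gzip.open(gz_path, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        gz_path.unlink()  # like gzip -d, remove the compressed file afterwards
        count += 1
    return count
```

Calling `gunzip_tree("./mmCIF")` would process the whole download in one pass.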

Then run the following command to process the files:

python data/process_pdb_dataset.py --mmcif_dir <mmcif_dir> --write_dir <path_to_write>

See the script for more options. Each mmCIF file is written as a pickle file that is read and processed in the data loading pipeline. A metadata.csv is also saved; it contains the pickle path of each example along with additional information used for faster filtering.
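
To inspect the processed output, you can read one row of metadata.csv and load the pickle it points to. This is a minimal sketch: the column name `processed_path` is an assumption, so check the header of your own metadata.csv, and the function name `load_first_example` is hypothetical.

```python
import csv
import pickle


def load_first_example(metadata_csv: str, path_column: str = "processed_path"):
    """Read the first row of metadata.csv and unpickle the example it points to."""
    with open(metadata_csv, newline="") as handle:
        row = next(csv.DictReader(handle), None)
    if row is None:
        raise ValueError(f"{metadata_csv} contains no examples")
    # path_column is an assumed column name; pass the real one if it differs.
    with open(row[path_column], "rb") as handle:
        return row, pickle.load(handle)
```

If the stored pickle paths are relative, run this from the directory they are relative to.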

The clustering file is provided in the repo (in the metadata folder), but it can also be downloaded from:

https://cdn.rcsb.org/resources/sequence/clusters/clusters-by-entity-30.txt
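
To see how the clustering file groups entries, it can be parsed into a lookup table. The sketch below assumes one cluster per line with whitespace-separated IDs of the form `<ID>_<entity number>`; the function name `read_clusters` is hypothetical, the sample IDs are made up, and the repo's own loader may differ.

```python
def read_clusters(lines):
    """Map each lowercase entry ID to the index of the cluster (line) it appears in."""
    entry_to_cluster = {}
    for index, line in enumerate(lines):
        for entity in line.split():
            # Strip the trailing entity number: "1ABC_1" -> "1abc".
            entry_id = entity.rsplit("_", 1)[0].lower()
            # Keep the first cluster seen when an entry has several entities.
            entry_to_cluster.setdefault(entry_id, index)
    return entry_to_cluster


# Made-up IDs for illustration only.
sample = ["1ABC_1 2XYZ_2", "3DEF_1 1ABC_2"]
print(read_clusters(sample))
```

For the real file, pass an open file handle, for example `read_clusters(open("./metadata/clusters-by-entity-30.txt"))`.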

Be sure to configure the file paths correctly, as described in the Demo Dataset section.

Reflow training

To reflow, you first need to run inference to generate data for the next training stage. Run the following command to convert the generated .pdb files into .pkl files that can be used for training. Make sure your inference folder contains paired pdb files (it should by default).

python data/rectify_process_pdb_files.py --pdb_dir path/to/inference/dir --write_dir path/to/write/dir

The output folder will contain a metadata.csv. Set rectify_csv_path in _datasets.yaml to the path of that file, and data preparation is complete.