mirror of
https://github.com/rdkit/rdkit.git
synced 2026-06-04 21:54:27 +08:00
FastCluster
This is simple workflow for clustering molecules
- Author: iwatobipen
- Date: 201712010
- Current version uses Morgan FP; rad2 is used for clustering
Requirements
- python3.x
- bayon https://github.com/fujimizu/bayon
- rdkit
Description
- Users need to install bayon at first, and can find tutorial at following URL. https://github.com/fujimizu/bayon/wiki/Tutorial_English
- Also it is needed to install RDKit for parsing SMILES.
- That's all!
Basic usage
- input file format is tab delimited text format, "ID" \t "SMILES" \n .....
- $ python fastcluster.py {input; inputfile} {N; number of clusters} { --output; filename of output} {--centroid; filename of centroid information}
- Example usage
- $ ptyhon fastcluster.py cdk2.smi 5 # clustering 47 compounds to 5 clusters.
Fastcluster iwatobipen$ python fastcluster.py cdk2.smi 5
real 0m0.015s
user 0m0.006s
sys 0m0.002s
Output format
- clusterd.tsv is default output format of bayon. List of clusters with similarity points.
cluster_1 \t molid1 \t point \t molid2 \t point ... \n
cluster_2 \t molid4 \t point \t molid5 \t point ... \n
....
- cluser_parse.tsv is rectangle format of cluster.tsv
molid1 \t point \t clusterID1 \n
molid2 \t point \t clusterID2 \n
molid3 \t point \t clusterID3 \n
molid4 \t point \t clusterID4 \n
....
Memo
- It will need more cpu time compared with directly using bayon. Because this script converts smiles to fingerprint dataset at first then performs clustering.