[DGL-Go] Change name to dglgo (#3778)

* add

* remove

* fix

* rework the readme and some changes

* add png

* update png

* add recipe get

Co-authored-by: Minjie Wang <wmjlyjemaine@gmail.com>
Co-authored-by: Quan (Andy) Gan <coin2028@hotmail.com>
Jinjing Zhou
2022-02-28 10:40:01 +08:00
committed by GitHub
parent d41d07d0f6
commit 266b21e535
58 changed files with 657 additions and 330 deletions

397
dglgo/README.md Normal file

@@ -0,0 +1,397 @@
# DGL-Go
DGL-Go is a command line tool for users to get started with training, using and
studying Graph Neural Networks (GNNs). Data scientists can quickly apply GNNs
to their problems, whereas researchers will find it useful to customize their
experiments.
## Installation and getting started
DGL-Go requires DGL v0.8+ so please make sure DGL is updated properly.
Install DGL-Go by `pip install dglgo` and type `dgl` in your console:
```
Usage: dgl [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  configure  Generate a configuration file
  export     Export a runnable python script
  recipe     Get example recipes
  train      Launch training
```
![img](./dglgo.png)
Using DGL-Go is as easy as three steps:
1. Use `dgl configure` to pick the task, dataset and model that interest you. It generates
a configuration file for later use. You could also use `dgl recipe get` to retrieve
a configuration file we provided.
1. Use `dgl train` to launch training according to the configuration and see the results.
1. Use `dgl export` to generate a *self-contained, reproducible* Python script for advanced
customization, or try the model on custom data stored in CSV format.
Next, we will walk through all these steps one-by-one.
## Training GraphSAGE for node classification on Cora
Let's use one of the most classical setups -- training a GraphSAGE model for node
classification on the Cora citation graph dataset as an
example.
### Step 1: `dgl configure`
First, use `dgl configure` to generate a YAML configuration file.
```
dgl configure nodepred --data cora --model sage --cfg cora_sage.yaml
```
Note that `nodepred` is the name of a DGL-Go *pipeline*. For now, you can think of
a pipeline as a training task: `nodepred` is for the node prediction task; other
options include `linkpred` for the link prediction task, etc. The command will
generate a configuration file `cora_sage.yaml` which includes:
* Options for the selected dataset (i.e., `cora` here).
* Model hyperparameters (e.g., number of layers, hidden size, etc.).
* Training hyperparameters (e.g., learning rate, loss function, etc.).
Different choices of task, model and datasets may give very different options,
so DGL-Go also adds a comment for what each option does in the file.
At this point you can also change options to explore potential improvements.
The configuration file generated by the command above is shown below.
```yaml
version: 0.0.1
pipeline_name: nodepred
device: cpu
data:
  name: cora
  split_ratio: # Ratio to generate split masks, for example set to [0.8, 0.1, 0.1] for 80% train/10% val/10% test. Leave blank to use builtin split in original dataset
model:
  name: sage
  embed_size: -1 # The dimension of created embedding table. -1 means using original node embedding
  hidden_size: 16 # Hidden size.
  num_layers: 1 # Number of hidden layers.
  activation: relu # Activation function name under torch.nn.functional
  dropout: 0.5 # Dropout rate.
  aggregator_type: gcn # Aggregator type to use (``mean``, ``gcn``, ``pool``, ``lstm``).
general_pipeline:
  early_stop:
    patience: 20 # Steps before early stop
    checkpoint_path: checkpoint.pth # Early stop checkpoint model file path
  num_epochs: 200 # Number of training epochs
  eval_period: 5 # Interval epochs between evaluations
  optimizer:
    name: Adam
    lr: 0.01
    weight_decay: 0.0005
  loss: CrossEntropyLoss
  save_path: model.pth # Path to save the model
  num_runs: 1 # Number of experiments to run
```
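To make the `eval_period` and `early_stop.patience` options more concrete, here is a minimal sketch of how they could interact. It assumes that validation runs on every epoch satisfying `epoch != 0 and epoch % eval_period == 0` (as in the exported training loop shown later in this README) and that `patience` counts consecutive evaluations without improvement. The `EarlyStopper` class and `evaluation_epochs` function are hypothetical stand-ins written for illustration, not DGL-Go's own helpers.

```python
class EarlyStopper:
    """Hypothetical early-stopping helper; not DGL-Go's own class."""

    def __init__(self, patience: int = 20):
        self.patience = patience
        self.best_score = None
        self.counter = 0

    def step(self, score: float) -> bool:
        """Record one evaluation result; return True when training should stop."""
        if self.best_score is None or score > self.best_score:
            # A strictly better score resets the counter.
            self.best_score = score
            self.counter = 0
            return False
        self.counter += 1
        return self.counter >= self.patience


def evaluation_epochs(num_epochs: int = 200, eval_period: int = 5) -> list:
    """Epochs on which validation runs under the assumed schedule."""
    return [e for e in range(num_epochs) if e != 0 and e % eval_period == 0]


if __name__ == "__main__":
    epochs = evaluation_epochs()
    print(len(epochs), epochs[:3], epochs[-1])  # 39 [5, 10, 15] 195

    stopper = EarlyStopper(patience=3)
    # Validation accuracy improves once, then stalls for three evaluations.
    print([stopper.step(acc) for acc in [0.60, 0.68, 0.68, 0.67, 0.68]])
    # [False, False, False, False, True]
```

Under these assumptions, the default configuration evaluates 39 times in 200 epochs, so a patience of 20 only triggers after 100 epochs without a better validation score.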
Apart from `dgl configure`, you could also get one of DGL-Go's built-in configuration files
(called *recipes*) using `dgl recipe`. Two sub-commands are relevant here:
```
dgl recipe list
```
will list the available recipes:
```
➜ dgl recipe list
===============================================================================
| Filename | Pipeline | Dataset |
===============================================================================
| linkpred_citation2_sage.yaml | linkpred | ogbl-citation2 |
| linkpred_collab_sage.yaml | linkpred | ogbl-collab |
| nodepred_citeseer_sage.yaml | nodepred | citeseer |
| nodepred_citeseer_gcn.yaml | nodepred | citeseer |
| nodepred-ns_arxiv_gcn.yaml | nodepred-ns | ogbn-arxiv |
| nodepred_cora_gat.yaml | nodepred | cora |
| nodepred_pubmed_sage.yaml | nodepred | pubmed |
| linkpred_cora_sage.yaml | linkpred | cora |
| nodepred_pubmed_gcn.yaml | nodepred | pubmed |
| nodepred_pubmed_gat.yaml | nodepred | pubmed |
| nodepred_cora_gcn.yaml | nodepred | cora |
| nodepred_cora_sage.yaml | nodepred | cora |
| nodepred_citeseer_gat.yaml | nodepred | citeseer |
| nodepred-ns_product_sage.yaml | nodepred-ns | ogbn-products |
===============================================================================
```
Then use
```
dgl recipe get nodepred_cora_sage.yaml
```
to copy the YAML configuration file to your local folder.
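The recipe filenames listed above appear to follow a `<pipeline>_<dataset>_<model>.yaml` pattern. The sketch below splits a filename into those three parts; `parse_recipe_name` and `RecipeName` are hypothetical names introduced for illustration and are not part of DGL-Go. Note that the dataset token in the filename can be a short form (for example `arxiv` for `ogbn-arxiv`), so it is not always identical to the Dataset column.

```python
from pathlib import Path
from typing import NamedTuple


class RecipeName(NamedTuple):
    pipeline: str
    dataset: str
    model: str


def parse_recipe_name(filename: str) -> RecipeName:
    """Split `<pipeline>_<dataset>_<model>.yaml` into its three parts."""
    parts = Path(filename).stem.split("_")
    if len(parts) != 3:
        raise ValueError(f"unexpected recipe name: {filename!r}")
    return RecipeName(*parts)


if __name__ == "__main__":
    print(parse_recipe_name("nodepred_cora_sage.yaml"))
    print(parse_recipe_name("nodepred-ns_arxiv_gcn.yaml"))
```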
### Step 2: `dgl train`
Simply running `dgl train --cfg cora_sage.yaml` will start the training process.
```log
...
Epoch 00190 | Loss 1.5225 | TrainAcc 0.9500 | ValAcc 0.6840
Epoch 00191 | Loss 1.5416 | TrainAcc 0.9357 | ValAcc 0.6840
Epoch 00192 | Loss 1.5391 | TrainAcc 0.9357 | ValAcc 0.6840
Epoch 00193 | Loss 1.5257 | TrainAcc 0.9643 | ValAcc 0.6840
Epoch 00194 | Loss 1.5196 | TrainAcc 0.9286 | ValAcc 0.6840
EarlyStopping counter: 12 out of 20
Epoch 00195 | Loss 1.4862 | TrainAcc 0.9643 | ValAcc 0.6760
Epoch 00196 | Loss 1.5142 | TrainAcc 0.9714 | ValAcc 0.6760
Epoch 00197 | Loss 1.5145 | TrainAcc 0.9714 | ValAcc 0.6760
Epoch 00198 | Loss 1.5174 | TrainAcc 0.9571 | ValAcc 0.6760
Epoch 00199 | Loss 1.5235 | TrainAcc 0.9714 | ValAcc 0.6760
Test Accuracy 0.7740
Accuracy across 1 runs: 0.774 ± 0.0
```
That's all! Basically you only need two commands to train a graph neural network.
### Step 3: `dgl export` for more advanced customization
That's not everything yet. You may want to look under the hood and make deeper
customization. DGL-Go can export a **self-contained, reproducible** Python
script for you to do anything you like.
Try `dgl export --cfg cora_sage.yaml --output script.py`,
and you'll get the script used to train the model. Here's the code snippet:
```python
...

class GraphSAGE(nn.Module):
    def __init__(self,
                 data_info: dict,
                 embed_size: int = -1,
                 hidden_size: int = 16,
                 num_layers: int = 1,
                 activation: str = "relu",
                 dropout: float = 0.5,
                 aggregator_type: str = "gcn"):
        """GraphSAGE model

        Parameters
        ----------
        data_info : dict
            The information about the input dataset.
        embed_size : int
            The dimension of created embedding table. -1 means using original node embedding
        hidden_size : int
            Hidden size.
        num_layers : int
            Number of hidden layers.
        dropout : float
            Dropout rate.
        activation : str
            Activation function name under torch.nn.functional
        aggregator_type : str
            Aggregator type to use (``mean``, ``gcn``, ``pool``, ``lstm``).
        """
        super(GraphSAGE, self).__init__()
        self.data_info = data_info
        self.embed_size = embed_size
        if embed_size > 0:
            self.embed = nn.Embedding(data_info["num_nodes"], embed_size)
            in_size = embed_size
        else:
            in_size = data_info["in_size"]
        self.layers = nn.ModuleList()
        self.dropout = nn.Dropout(dropout)
        self.activation = getattr(nn.functional, activation)
        for i in range(num_layers):
            in_hidden = hidden_size if i > 0 else in_size
            out_hidden = hidden_size if i < num_layers - 1 else data_info["out_size"]
            self.layers.append(dgl.nn.SAGEConv(in_hidden, out_hidden, aggregator_type))

    def forward(self, graph, node_feat, edge_feat=None):
        if self.embed_size > 0:
            dgl_warning(
                "The embedding for node feature is used, and input node_feat is ignored, due to the provided embed_size.",
                norepeat=True)
            h = self.embed.weight
        else:
            h = node_feat
        h = self.dropout(h)
        for l, layer in enumerate(self.layers):
            h = layer(graph, h, edge_feat)
            if l != len(self.layers) - 1:
                h = self.activation(h)
                h = self.dropout(h)
        return h

...

def train(cfg, pipeline_cfg, device, data, model, optimizer, loss_fcn):
    g = data[0]  # Only train on the first graph
    g = dgl.remove_self_loop(g)
    g = dgl.add_self_loop(g)
    g = g.to(device)

    node_feat = g.ndata.get('feat', None)
    edge_feat = g.edata.get('feat', None)
    label = g.ndata['label']
    train_mask, val_mask, test_mask = g.ndata['train_mask'].bool(
    ), g.ndata['val_mask'].bool(), g.ndata['test_mask'].bool()

    stopper = EarlyStopping(**pipeline_cfg['early_stop'])

    val_acc = 0.
    for epoch in range(pipeline_cfg['num_epochs']):
        model.train()
        logits = model(g, node_feat, edge_feat)
        loss = loss_fcn(logits[train_mask], label[train_mask])

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        train_acc = accuracy(logits[train_mask], label[train_mask])
        if epoch != 0 and epoch % pipeline_cfg['eval_period'] == 0:
            val_acc = accuracy(logits[val_mask], label[val_mask])

            if stopper.step(val_acc, model):
                break

        print("Epoch {:05d} | Loss {:.4f} | TrainAcc {:.4f} | ValAcc {:.4f}".
              format(epoch, loss.item(), train_acc, val_acc))

    stopper.load_checkpoint(model)

    model.eval()
    with torch.no_grad():
        logits = model(g, node_feat, edge_feat)
        test_acc = accuracy(logits[test_mask], label[test_mask])
    return test_acc


def main():
    cfg = {
        'version': '0.0.1',
        'device': 'cuda:0',
        'model': {
            'embed_size': -1,
            'hidden_size': 16,
            'num_layers': 2,
            'activation': 'relu',
            'dropout': 0.5,
            'aggregator_type': 'gcn'},
        'general_pipeline': {
            'early_stop': {
                'patience': 100,
                'checkpoint_path': 'checkpoint.pth'},
            'num_epochs': 200,
            'eval_period': 5,
            'optimizer': {
                'lr': 0.01,
                'weight_decay': 0.0005},
            'loss': 'CrossEntropyLoss',
            'save_path': 'model.pth',
            'num_runs': 10}}
    device = cfg['device']
    pipeline_cfg = cfg['general_pipeline']
    # load data
    data = AsNodePredDataset(CoraGraphDataset())
    # create model
    model_cfg = cfg["model"]
    cfg["model"]["data_info"] = {
        "in_size": model_cfg['embed_size'] if model_cfg['embed_size'] > 0 else data[0].ndata['feat'].shape[1],
        "out_size": data.num_classes,
        "num_nodes": data[0].num_nodes()
    }
    model = GraphSAGE(**cfg["model"])
    model = model.to(device)
    loss = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(
        model.parameters(),
        **pipeline_cfg["optimizer"])
    # train
    test_acc = train(cfg, pipeline_cfg, device, data, model, optimizer, loss)
    torch.save(model, pipeline_cfg["save_path"])
    return test_acc

...
```
You can see that everything is collected into one Python script which includes the
entire `GraphSAGE` model definition, data processing and training loop. Simply running
`python script.py` will give you the *exact same* result as you've seen by `dgl train`.
At this point, you can change any part as you wish such as plugging your own GNN module,
changing the loss function and so on.
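Before plugging in your own module, it can help to see how the constructor in the exported script chains layer sizes. The sketch below reproduces only the `in_hidden`/`out_hidden` rule from the `GraphSAGE.__init__` loop in plain Python, without DGL or PyTorch. The function name `sage_layer_sizes` is hypothetical, and the numbers 1433 and 7 are the feature and class counts commonly reported for Cora, used here as assumed inputs.

```python
def sage_layer_sizes(in_size, hidden_size, out_size, num_layers):
    """Return the (input, output) size of each layer, following the exported script's rule."""
    sizes = []
    for i in range(num_layers):
        # The first layer reads the input features; later layers read hidden states.
        in_hidden = hidden_size if i > 0 else in_size
        # The last layer emits class scores; earlier layers emit hidden states.
        out_hidden = hidden_size if i < num_layers - 1 else out_size
        sizes.append((in_hidden, out_hidden))
    return sizes


if __name__ == "__main__":
    print(sage_layer_sizes(1433, 16, 7, num_layers=1))  # [(1433, 7)]
    print(sage_layer_sizes(1433, 16, 7, num_layers=2))  # [(1433, 16), (16, 7)]
```

One consequence of this rule is that with `num_layers: 1` the `hidden_size` option has no effect, because the single layer maps input features directly to class scores.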
## Use DGL-Go on your own dataset
DGL-Go supports training a model on a custom dataset through DGL's `CSVDataset`.
### Step 1: Prepare your CSV and metadata file.
Follow the tutorial at [Loading data from CSV
files](https://docs.dgl.ai/en/latest/guide/data-loadcsv.html#guide-data-pipeline-loadcsv)
to prepare your dataset. Generally, the dataset folder should include:
* At least one CSV file for node data.
* At least one CSV file for edge data.
* A metadata file called `meta.yaml`.
### Step 2: `dgl configure` with `--data csv` option
Run
```
dgl configure nodepred --data csv --model sage --cfg csv_sage.yaml
```
to generate the configuration file. You will see that the file includes a section like
the following:
```yaml
...
data:
  name: csv
  split_ratio: # Ratio to generate split masks, for example set to [0.8, 0.1, 0.1] for 80% train/10% val/10% test. Leave blank to use builtin split in original dataset
  data_path: ./ # metadata.yaml, nodes.csv, edges.csv should in this folder
...
```
Fill in the `data_path` option with the path to your dataset folder.
If your dataset does not have any native split for training, validation and test sets,
you can set the split ratio in the `split_ratio` option, which will
generate a random split for you.
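To illustrate what a `split_ratio` such as `[0.8, 0.1, 0.1]` means, here is a minimal sketch that turns a ratio into random train/validation/test masks over node IDs. It uses only the Python standard library and is an assumption about the general idea, not the code DGL runs internally; `random_split_masks` and its `seed` parameter are hypothetical.

```python
import random


def random_split_masks(num_nodes, split_ratio, seed=0):
    """Build boolean train/val/test masks from a [train, val, test] ratio."""
    if len(split_ratio) != 3 or abs(sum(split_ratio) - 1.0) > 1e-6:
        raise ValueError("split_ratio must hold three values that sum to 1")
    order = list(range(num_nodes))
    random.Random(seed).shuffle(order)

    n_train = int(num_nodes * split_ratio[0])
    n_val = int(num_nodes * split_ratio[1])
    train_ids = set(order[:n_train])
    val_ids = set(order[n_train:n_train + n_val])
    # Every remaining node goes to the test set, so no node is left unassigned.
    return {
        "train_mask": [i in train_ids for i in range(num_nodes)],
        "val_mask": [i in val_ids for i in range(num_nodes)],
        "test_mask": [i not in train_ids and i not in val_ids for i in range(num_nodes)],
    }


if __name__ == "__main__":
    masks = random_split_masks(10, [0.8, 0.1, 0.1])
    print({name: sum(mask) for name, mask in masks.items()})
```

Each node ends up in exactly one of the three masks, which is the property the `train_mask`, `val_mask` and `test_mask` node fields rely on during training.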
### Step 3: `train` the model / `export` the script
Then you can do the same as the tutorial above, either train the model by
`dgl train --cfg csv_sage.yaml` or use `dgl export --cfg csv_sage.yaml
--output script.py` to get the training script.
## FAQ
**Q: What are the available options for each command?**
A: You can use `--help` for all commands. For example, use `dgl --help` for general
help message; use `dgl configure --help` for the configuration options; use
`dgl configure nodepred --help` for the configuration options of node prediction pipeline.
**Q: What exactly is nodepred/linkpred? How many are there?**
A: They are called DGL-Go pipelines. A pipeline represents the training methodology for
a certain task. Therefore, its naming convention is *<task_name>[-<method_name>]*. For example,
`nodepred` trains the selected GNN model for node classification using the full-graph training method,
while `nodepred-ns` trains the model for node classification using neighbor sampling.
The first release included three training pipelines (`nodepred`, `nodepred-ns` and `linkpred`),
and more are expected in the future. Use `dgl configure --help` to see
all the available pipelines.
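As a small illustration of the *<task_name>[-<method_name>]* convention, the sketch below splits a pipeline name into its task and optional method. `parse_pipeline_name` is a hypothetical helper written for this example and is not a DGL-Go function.

```python
def parse_pipeline_name(name):
    """Split `<task_name>[-<method_name>]` into (task, method); method is None if absent."""
    task, sep, method = name.partition("-")
    return task, (method if sep else None)


if __name__ == "__main__":
    for name in ["nodepred", "nodepred-ns", "linkpred"]:
        print(name, "->", parse_pipeline_name(name))
```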
**Q: How to add my model to the official model recipe zoo?**
A: Currently not supported. We will enable this feature soon. Please stay tuned!
**Q: After training a model on some dataset, how can I apply it to another one?**
A: The `save_path` option in the generated configuration file allows you to specify where
to save the model after training. You can then modify the script generated by `dgl export`
to load the model checkpoint and evaluate it on another dataset.

BIN
dglgo/dglgo.png Normal file

Binary file not shown.


20
dglgo/dglgo/cli/cli.py Normal file

@@ -0,0 +1,20 @@
import typer
from ..pipeline import *
from ..model import *
from .config_cli import config_app
from .train_cli import train
from .export_cli import export
from .recipe_cli import recipe_app
no_args_is_help = False
app = typer.Typer(no_args_is_help=True, add_completion=False)
app.add_typer(config_app, name="configure", no_args_is_help=no_args_is_help)
app.add_typer(recipe_app, name="recipe", no_args_is_help=True)
app.command(help="Launch training", no_args_is_help=no_args_is_help)(train)
app.command(help="Export a runnable python script", no_args_is_help=no_args_is_help)(export)
def main():
app()
if __name__ == "__main__":
app()

View File

@@ -6,9 +6,9 @@ import typing
import yaml
from pathlib import Path
config_app = typer.Typer(help="Generate the config files")
config_app = typer.Typer(help="Generate a configuration file")
for key, pipeline in PipelineFactory.registry.items():
config_app.command(key, help=pipeline.get_description())(pipeline.get_cfg_func())
if __name__ == "__main__":
config_app()
config_app()

View File

@@ -10,8 +10,8 @@ import isort
import autopep8
def export(
cfg: str = typer.Option("cfg.yml", help="config yaml file name"),
output: str = typer.Option("output.py", help="output python file name")
cfg: str = typer.Option("cfg.yaml", help="config yaml file name"),
output: str = typer.Option("script.py", help="output python file name")
):
user_cfg = yaml.safe_load(Path(cfg).open("r"))
pipeline_name = user_cfg["pipeline_name"]

View File

@@ -0,0 +1,54 @@
from pathlib import Path
from typing import Optional
import typer
import os
import shutil
import yaml
def list_recipes():
file_current_dir = Path(__file__).resolve().parent
recipe_dir = file_current_dir.parent.parent / "recipes"
file_list = list(recipe_dir.glob("*.yaml"))
header = "| {:<30} | {:<18} | {:<20} |".format("Filename", "Pipeline", "Dataset")
typer.echo("="*len(header))
typer.echo(header)
typer.echo("="*len(header))
for file in file_list:
cfg = yaml.safe_load(Path(file).open("r"))
typer.echo("| {:<30} | {:<18} | {:<20} |".format(file.name, cfg["pipeline_name"], cfg["data"]["name"]))
typer.echo("="*len(header))
def copy_recipes(dir: str = typer.Option("dglgo_example_recipes", help="directory name for recipes")):
file_current_dir = Path(__file__).resolve().parent
recipe_dir = file_current_dir.parent.parent / "recipes"
current_dir = Path(os.getcwd())
new_dir = current_dir / dir
new_dir.mkdir(parents=True, exist_ok=True)
for file in recipe_dir.glob("*.yaml"):
shutil.copy(file, new_dir)
print("Example recipes are copied to {}".format(new_dir.absolute()))
def get_recipe(recipe_name: Optional[str] = typer.Argument(None, help="The recipe filename to get, e.q. nodepred_citeseer_gcn.yaml")):
if recipe_name is None:
typer.echo("Usage: dgl recipe get [RECIPE_NAME] \n")
typer.echo(" Copy the recipe to current directory \n")
typer.echo(" Arguments:")
typer.echo(" [RECIPE_NAME] The recipe filename to get, e.q. nodepred_citeseer_gcn.yaml\n")
typer.echo("Here are all avaliable recipe filename")
list_recipes()
else:
file_current_dir = Path(__file__).resolve().parent
recipe_dir = file_current_dir.parent.parent / "recipes"
current_dir = Path(os.getcwd())
recipe_path = recipe_dir / recipe_name
shutil.copy(recipe_path, current_dir)
print("Recipe {} is copied to {}".format(recipe_path.absolute(), current_dir.absolute()))
recipe_app = typer.Typer(help="Get example recipes")
recipe_app.command(name="list", help="List all available example recipes")(list_recipes)
recipe_app.command(name="copy", help="Copy all available example recipes to current directory")(copy_recipes)
recipe_app.command(name="get", help="Copy the recipe to current directory")(get_recipe)
if __name__ == "__main__":
recipe_app()

View File

@@ -5,12 +5,11 @@ from enum import Enum
import typing
import yaml
from pathlib import Path
import isort
import autopep8
def train(
cfg: str = typer.Option("cfg.yml", help="config yaml file name"),
cfg: str = typer.Option("cfg.yaml", help="config yaml file name"),
):
user_cfg = yaml.safe_load(Path(cfg).open("r"))
pipeline_name = user_cfg["pipeline_name"]
@@ -18,8 +17,8 @@ def train(
f_code = autopep8.fix_code(output_file_content, options={'aggressive': 1})
f_code = isort.code(f_code)
exec(f_code, {'__name__': '__main__'})
code = compile(f_code, 'dglgo_tmp.py', 'exec')
exec(code, {'__name__': '__main__'})
if __name__ == "__main__":
train_app = typer.Typer()

View File

@@ -49,7 +49,7 @@ class GCN(nn.Module):
in_hidden = hidden_size if i > 0 else in_size
out_hidden = hidden_size if i < num_layers - 1 else data_info["out_size"]
self.layers.append(dgl.nn.GraphConv(in_hidden, out_hidden, norm=norm))
self.layers.append(dgl.nn.GraphConv(in_hidden, out_hidden, norm=norm, allow_zero_in_degree=True))
self.dropout = nn.Dropout(p=dropout)
self.act = getattr(torch, activation)

View File

@@ -12,6 +12,8 @@ class GIN(nn.Module):
aggregator_type='sum'):
"""Graph Isomophism Networks
Edge feature is ignored in this model.
Parameters
----------
data_info : dict

View File

@@ -55,7 +55,7 @@ class GraphSAGE(nn.Module):
h = node_feat
h = self.dropout(h)
for l, layer in enumerate(self.layers):
h = layer(graph, h)
h = layer(graph, h, edge_feat)
if l != len(self.layers) - 1:
h = self.activation(h)
h = self.dropout(h)
@@ -64,7 +64,7 @@ class GraphSAGE(nn.Module):
def forward_block(self, blocks, node_feat, edge_feat = None):
h = node_feat
for l, (layer, block) in enumerate(zip(self.layers, blocks)):
h = layer(block, h)
h = layer(block, h, edge_feat)
if l != len(self.layers) - 1:
h = self.activation(h)
h = self.dropout(h)

View File

@@ -14,6 +14,8 @@ class SGC(nn.Module):
bias=True, k=2):
""" Simplifying Graph Convolutional Networks
Edge feature is ignored in this model.
Parameters
----------
data_info : dict

View File

@@ -20,6 +20,7 @@ class LinkpredPipelineCfg(BaseModel):
eval_period: int = 5
optimizer: dict = {"name": "Adam", "lr": 0.005}
loss: str = "BCELoss"
save_path: str = "model.pth"
num_runs: int = 1
@@ -29,6 +30,7 @@ pipeline_comments = {
"train_batch_size": "Edge batch size when training",
"num_epochs": "Number of training epochs",
"eval_period": "Interval epochs between evaluations",
"save_path": "Path to save the model",
"num_runs": "Number of experiments to run",
}
@@ -67,20 +69,18 @@ class LinkpredPipeline(PipelineBase):
def config(
data: DataFactory.filter("linkpred").get_dataset_enum() = typer.Option(..., help="input data name"),
cfg: str = typer.Option(
"cfg.yml", help="output configuration path"),
"cfg.yaml", help="output configuration path"),
node_model: NodeModelFactory.get_model_enum() = typer.Option(...,
help="Model name"),
edge_model: EdgeModelFactory.get_model_enum() = typer.Option(...,
help="Model name"),
neg_sampler: NegativeSamplerFactory.get_model_enum() = typer.Option(
"uniform", help="Negative sampler name"),
device: DeviceEnum = typer.Option(
"cpu", help="Device, cpu or cuda"),
"persource", help="Negative sampler name"),
):
self.__class__.setup_user_cfg_cls()
generated_cfg = {
"pipeline_name": "linkpred",
"device": device.value,
"device": "cpu",
"data": {"name": data.name},
"neg_sampler": {"name": neg_sampler.value},
"node_model": {"name": node_model.value},
@@ -89,6 +89,7 @@ class LinkpredPipeline(PipelineBase):
output_cfg = self.user_cfg_cls(**generated_cfg).dict()
output_cfg = deep_convert_dict(output_cfg)
comment_dict = {
"device": "Torch device name, e.q. cpu or cuda or cuda:0",
"general_pipeline": pipeline_comments,
"node_model": NodeModelFactory.get_constructor_doc_dict(node_model.value),
"edge_model": EdgeModelFactory.get_constructor_doc_dict(edge_model.value),
@@ -99,6 +100,9 @@ class LinkpredPipeline(PipelineBase):
},
}
comment_dict = merge_comment(output_cfg, comment_dict)
if cfg is None:
cfg = "_".join(["linkpred", data.value, node_model.value, edge_model.value]) + ".yaml"
yaml = ruamel.yaml.YAML()
yaml.dump(comment_dict, Path(cfg).open("w"))
print("Configuration file is generated at {}".format(Path(cfg).absolute()))

View File

@@ -112,6 +112,7 @@ def main():
loss = torch.nn.{{ loss }}()
optimizer = torch.optim.Adam(params, **pipeline_cfg["optimizer"])
test_hits = train(cfg, pipeline_cfg, device, dataset, model, optimizer, loss)
torch.save(model, pipeline_cfg["save_path"])
return test_hits
if __name__ == '__main__':

View File

@@ -18,6 +18,7 @@ pipeline_comments = {
"patience": "Steps before early stop",
"checkpoint_path": "Early stop checkpoint model file path"
},
"save_path": "Path to save the model",
"num_runs": "Number of experiments to run",
}
@@ -27,6 +28,7 @@ class NodepredPipelineCfg(BaseModel):
eval_period: int = 5
optimizer: dict = {"name": "Adam", "lr": 0.01, "weight_decay": 5e-4}
loss: str = "CrossEntropyLoss"
save_path: str = "model.pth"
num_runs: int = 1
@PipelineFactory.register("nodepred")
@@ -54,15 +56,14 @@ class NodepredPipeline(PipelineBase):
def get_cfg_func(self):
def config(
data: DataFactory.filter("nodepred").get_dataset_enum() = typer.Option(..., help="input data name"),
cfg: str = typer.Option(
"cfg.yml", help="output configuration path"),
cfg: Optional[str] = typer.Option(
None, help="output configuration path"),
model: NodeModelFactory.get_model_enum() = typer.Option(..., help="Model name"),
device: DeviceEnum = typer.Option("cpu", help="Device, cpu or cuda"),
):
self.__class__.setup_user_cfg_cls()
generated_cfg = {
"pipeline_name": self.pipeline_name,
"device": device,
"device": "cpu",
"data": {"name": data.name},
"model": {"name": model.value},
"general_pipeline": {}
@@ -70,6 +71,7 @@ class NodepredPipeline(PipelineBase):
output_cfg = self.user_cfg_cls(**generated_cfg).dict()
output_cfg = deep_convert_dict(output_cfg)
comment_dict = {
"device": "Torch device name, e.q. cpu or cuda or cuda:0",
"data": {
"split_ratio": 'Ratio to generate split masks, for example set to [0.8, 0.1, 0.1] for 80% train/10% val/10% test. Leave blank to use builtin split in original dataset'
},
@@ -79,6 +81,8 @@ class NodepredPipeline(PipelineBase):
comment_dict = merge_comment(output_cfg, comment_dict)
yaml = ruamel.yaml.YAML()
if cfg is None:
cfg = "_".join(["nodepred", data.value, model.value]) + ".yaml"
yaml.dump(comment_dict, Path(cfg).open("w"))
print("Configuration file is generated at {}".format(Path(cfg).absolute()))
@@ -88,7 +92,7 @@ class NodepredPipeline(PipelineBase):
def gen_script(cls, user_cfg_dict):
# Check validation
cls.setup_user_cfg_cls()
user_cfg = cls.user_cfg_cls(**user_cfg_dict)
user_cfg = cls.user_cfg_cls(**user_cfg_dict)
file_current_dir = Path(__file__).resolve().parent
with open(file_current_dir / "nodepred.jinja-py", "r") as f:
template = Template(f.read())
@@ -102,6 +106,8 @@ class NodepredPipeline(PipelineBase):
render_cfg.update(DataFactory.get_generated_code_dict(user_cfg_dict["data"]["name"], '**cfg["data"]'))
generated_user_cfg = copy.deepcopy(user_cfg_dict)
if "split_ratio" in generated_user_cfg["data"]:
generated_user_cfg["data"].pop("split_ratio")
if len(generated_user_cfg["data"]) == 1:
generated_user_cfg.pop("data")
else:
@@ -116,9 +122,6 @@ class NodepredPipeline(PipelineBase):
if user_cfg_dict["data"].get("split_ratio", None) is not None:
render_cfg["data_initialize_code"] = "{}, split_ratio={}".format(render_cfg["data_initialize_code"], user_cfg_dict["data"]["split_ratio"])
if "split_ratio" in generated_user_cfg["data"]:
generated_user_cfg["data"].pop("split_ratio")
render_cfg["user_cfg_str"] = f"cfg = {str(generated_user_cfg)}"
render_cfg["user_cfg"] = user_cfg_dict
return template.render(**render_cfg)

View File

@@ -112,6 +112,7 @@ def main():
optimizer = torch.optim.{{ user_cfg.general_pipeline.optimizer.name }}(model.parameters(), **pipeline_cfg["optimizer"])
# train
test_acc = train(cfg, pipeline_cfg, device, data, model, optimizer, loss)
torch.save(model, pipeline_cfg["save_path"])
return test_acc
if __name__ == '__main__':

View File

@@ -36,6 +36,14 @@ pipeline_comments = {
"patience": "Steps before early stop",
"checkpoint_path": "Early stop checkpoint model file path"
},
"sampler": {
"fan_out": "List of neighbors to sample per edge type for each GNN layer, with the i-th element being the fanout for the i-th GNN layer. Length should be the same as num_layers in model setting",
"batch_size": "Batch size of seed nodes in training stage",
"num_workers": "Number of workers to accelerate the graph data processing step",
"eval_batch_size": "Batch size of seed nodes in training stage in evaluation stage",
"eval_num_workers": "Number of workers to accelerate the graph data processing step in evaluation stage"
},
"save_path": "Path to save the model",
"num_runs": "Number of experiments to run",
}
@@ -47,6 +55,7 @@ class NodepredNSPipelineCfg(BaseModel):
optimizer: dict = {"name": "Adam", "lr": 0.005, "weight_decay": 0.0}
loss: str = "CrossEntropyLoss"
num_runs: int = 1
save_path: str = "model.pth"
@PipelineFactory.register("nodepred-ns")
class NodepredNsPipeline(PipelineBase):
@@ -60,7 +69,7 @@ class NodepredNsPipeline(PipelineBase):
class NodePredUserConfig(UserConfig):
eval_device: DeviceEnum = Field("cpu")
data: DataFactory.filter("nodepred-ns").get_pydantic_config() = Field(..., discriminator="name")
model : NodeModelFactory.get_pydantic_model_config() = Field(..., discriminator="name")
model : NodeModelFactory.filter(lambda cls: hasattr(cls, "forward_block")).get_pydantic_model_config() = Field(..., discriminator="name")
general_pipeline: NodepredNSPipelineCfg
cls.user_cfg_cls = NodePredUserConfig
@@ -72,16 +81,14 @@ class NodepredNsPipeline(PipelineBase):
def get_cfg_func(self):
def config(
data: DataFactory.filter("nodepred-ns").get_dataset_enum() = typer.Option(..., help="input data name"),
cfg: str = typer.Option(
"cfg.yml", help="output configuration path"),
model: NodeModelFactory.get_model_enum() = typer.Option(..., help="Model name"),
device: DeviceEnum = typer.Option(
"cpu", help="Device, cpu or cuda"),
cfg: Optional[str] = typer.Option(
None, help="output configuration path"),
model: NodeModelFactory.filter(lambda cls: hasattr(cls, "forward_block")).get_model_enum() = typer.Option(..., help="Model name"),
):
self.__class__.setup_user_cfg_cls()
generated_cfg = {
generated_cfg = {
"pipeline_name": "nodepred-ns",
"device": device,
"device": "cpu",
"data": {"name": data.name},
"model": {"name": model.value},
"general_pipeline": {"sampler":{"name": "neighbor"}}
@@ -89,14 +96,21 @@ class NodepredNsPipeline(PipelineBase):
output_cfg = self.user_cfg_cls(**generated_cfg).dict()
output_cfg = deep_convert_dict(output_cfg)
comment_dict = {
"device": "Torch device name, e.q. cpu or cuda or cuda:0",
"data": {
"split_ratio": 'Ratio to generate split masks, for example set to [0.8, 0.1, 0.1] for 80% train/10% val/10% test. Leave blank to use builtin split in original dataset'
},
"general_pipeline": pipeline_comments,
"model": NodeModelFactory.get_constructor_doc_dict(model.value)
"model": NodeModelFactory.get_constructor_doc_dict(model.value),
}
comment_dict = merge_comment(output_cfg, comment_dict)
# truncate length fan_out to be the same as num_layers in model
if "num_layers" in comment_dict["model"]:
comment_dict['general_pipeline']["sampler"]["fan_out"] = [5,10,15,15,15][:int(comment_dict['model']["num_layers"])]
if cfg is None:
cfg = "_".join(["nodepred-ns", data.value, model.value]) + ".yaml"
yaml = ruamel.yaml.YAML()
yaml.dump(comment_dict, Path(cfg).open("w"))
print("Configuration file is generated at {}".format(
@@ -112,6 +126,10 @@ class NodepredNsPipeline(PipelineBase):
template = Template(f.read())
pipeline_cfg = NodepredNSPipelineCfg(
**user_cfg_dict["general_pipeline"])
if "num_layers" in user_cfg_dict["model"]:
assert user_cfg_dict["model"]["num_layers"] == len(user_cfg_dict["general_pipeline"]["sampler"]["fan_out"]), \
"The num_layers in model config should be the same as the length of fan_out in sampler. For example, if num_layers is 1, the fan_out cannot be [5, 10]"
render_cfg = copy.deepcopy(user_cfg_dict)
model_code = NodeModelFactory.get_source_code(
@@ -123,6 +141,8 @@ class NodepredNsPipeline(PipelineBase):
user_cfg_dict["data"]["name"], '**cfg["data"]'))
generated_user_cfg = copy.deepcopy(user_cfg_dict)
if "split_ratio" in generated_user_cfg["data"]:
generated_user_cfg["data"].pop("split_ratio")
if len(generated_user_cfg["data"]) == 1:
generated_user_cfg.pop("data")
else:
@@ -135,8 +155,6 @@ class NodepredNsPipeline(PipelineBase):
if user_cfg_dict["data"].get("split_ratio", None) is not None:
render_cfg["data_initialize_code"] = "{}, split_ratio={}".format(render_cfg["data_initialize_code"], user_cfg_dict["data"]["split_ratio"])
if "split_ratio" in generated_user_cfg["data"]:
generated_user_cfg["data"].pop("split_ratio")
render_cfg["user_cfg_str"] = f"cfg = {str(generated_user_cfg)}"
render_cfg["user_cfg"] = user_cfg_dict
@@ -145,4 +163,4 @@ class NodepredNsPipeline(PipelineBase):
@staticmethod
def get_description() -> str:
return "Node classification sampling pipeline"
return "Node classification neighbor sampling pipeline"

View File

@@ -157,8 +157,8 @@ def main():
model = model.to(device)
loss = torch.nn.{{ user_cfg.general_pipeline.loss }}()
optimizer = torch.optim.{{ user_cfg.general_pipeline.optimizer.name }}(model.parameters(), **pipeline_cfg["optimizer"])
# train
test_acc = train(cfg, pipeline_cfg, device, data, model, optimizer, loss)
torch.save(model, pipeline_cfg["save_path"])
return test_acc
if __name__ == '__main__':

View File

@@ -334,6 +334,14 @@ class ModelFactory:
type_annotation_dict[k] = param.annotation
return type_annotation_dict
def filter(self, filter_func):
new_fac = ModelFactory()
for name in self.registry:
if filter_func(self.registry[name]):
new_fac.registry[name] = self.registry[name]
new_fac.code_registry[name] = self.code_registry[name]
return new_fac
class SamplerFactory:
""" The factory class for creating executors"""
@@ -411,7 +419,7 @@ class SamplerFactory:
NegativeSamplerFactory = SamplerFactory()
NegativeSamplerFactory.register("uniform")(GlobalUniform)
NegativeSamplerFactory.register("global")(GlobalUniform)
NegativeSamplerFactory.register("persource")(PerSourceUniform)
NodeModelFactory = ModelFactory()


@@ -31,4 +31,5 @@ general_pipeline:
    name: Adam
    lr: 0.005
  loss: BCELoss
  save_path: "model.pth"
  num_runs: 1 # Number of experiments to run


@@ -31,4 +31,5 @@ general_pipeline:
    name: Adam
    lr: 0.005
  loss: BCELoss
  save_path: "model.pth"
  num_runs: 1 # Number of experiments to run


@@ -31,4 +31,5 @@ general_pipeline:
    name: Adam
    lr: 0.005
  loss: BCELoss
  save_path: "model.pth"
  num_runs: 1 # Number of experiments to run


@@ -31,4 +31,5 @@ general_pipeline:
    lr: 0.005
    weight_decay: 0.0
  loss: CrossEntropyLoss
  save_path: "model.pth"
  num_runs: 5


@@ -35,4 +35,5 @@ general_pipeline:
    lr: 0.005
    weight_decay: 0.0
  loss: CrossEntropyLoss
  save_path: "model.pth"
  num_runs: 5 # Number of experiments to run


@@ -28,4 +28,5 @@ general_pipeline:
    lr: 0.005
    weight_decay: 0.0005
  loss: CrossEntropyLoss
  save_path: "model.pth"
  num_runs: 10 # Number of experiments to run


@@ -24,4 +24,5 @@ general_pipeline:
    lr: 0.01
    weight_decay: 0.0005
  loss: CrossEntropyLoss
  save_path: "model.pth"
  num_runs: 10 # Number of experiments to run


@@ -23,4 +23,5 @@ general_pipeline:
    lr: 0.01
    weight_decay: 0.0005
  loss: CrossEntropyLoss
  save_path: "model.pth"
  num_runs: 10 # Number of experiments to run


@@ -28,4 +28,5 @@ general_pipeline:
    lr: 0.005
    weight_decay: 0.0005
  loss: CrossEntropyLoss
  save_path: "model.pth"
  num_runs: 10 # Number of experiments to run


@@ -24,4 +24,5 @@ general_pipeline:
    lr: 0.01
    weight_decay: 0.0005
  loss: CrossEntropyLoss
  save_path: "model.pth"
  num_runs: 10 # Number of experiments to run


@@ -23,4 +23,5 @@ general_pipeline:
    lr: 0.01
    weight_decay: 0.0005
  loss: CrossEntropyLoss
  save_path: "model.pth"
  num_runs: 10 # Number of experiments to run


@@ -28,4 +28,5 @@ general_pipeline:
    lr: 0.005
    weight_decay: 0.001
  loss: CrossEntropyLoss
  save_path: "model.pth"
  num_runs: 10 # Number of experiments to run


@@ -24,4 +24,5 @@ general_pipeline:
    lr: 0.01
    weight_decay: 0.0005
  loss: CrossEntropyLoss
  save_path: "model.pth"
  num_runs: 10 # Number of experiments to run


@@ -23,4 +23,5 @@ general_pipeline:
    lr: 0.01
    weight_decay: 0.0005
  loss: CrossEntropyLoss
  save_path: "model.pth"
  num_runs: 10 # Number of experiments to run


@@ -3,7 +3,7 @@
from setuptools import find_packages
from distutils.core import setup
setup(name='dglenter',
setup(name='dglgo',
version='0.0.1',
description='DGL',
author='DGL Team',
@@ -15,12 +15,15 @@ setup(name='dglenter',
'autopep8>=1.6.0',
'numpydoc>=1.1.0',
"pydantic>=1.9.0",
"ruamel.yaml>=0.17.20"
"ruamel.yaml>=0.17.20",
"PyYAML>=5.1"
],
license='APACHE',
package_data={"": ["./*"]},
include_package_data=True,
license='APACHE',
entry_points={
'console_scripts': [
"dgl-enter = dglenter.cli.cli:main"
"dgl = dglgo.cli.cli:main"
]
},
url='https://github.com/dmlc/dgl',

dglgo/tests/cfg.yml Normal file

@@ -0,0 +1,26 @@
version: 0.0.1
pipeline_name: nodepred
device: cpu
data:
  name: cora
  split_ratio: # Ratio to generate split masks, for example set to [0.8, 0.1, 0.1] for 80% train/10% val/10% test. Leave blank to use builtin split in original dataset
model:
  name: sage
  embed_size: -1 # The dimension of created embedding table. -1 means using original node embedding
  hidden_size: 16 # Hidden size.
  num_layers: 1 # Number of hidden layers.
  activation: relu # Activation function name under torch.nn.functional
  dropout: 0.5 # Dropout rate.
  aggregator_type: gcn # Aggregator type to use (``mean``, ``gcn``, ``pool``, ``lstm``).
general_pipeline:
  early_stop:
    patience: 20 # Steps before early stop
    checkpoint_path: checkpoint.pth # Early stop checkpoint model file path
  num_epochs: 200 # Number of training epochs
  eval_period: 5 # Interval epochs between evaluations
  optimizer:
    name: Adam
    lr: 0.01
    weight_decay: 0.0005
  loss: CrossEntropyLoss
  num_runs: 1 # Number of experiments to run

dglgo/tests/run_test.sh Normal file

@@ -0,0 +1 @@
python -m pytest --pdb -vv --capture=tee-sys test_pipeline.py::test_recipe


@@ -0,0 +1,62 @@
import subprocess
from typing import NamedTuple
import pytest
from pathlib import Path

# class DatasetSpec:

dataset_spec = {
    "cora": {"timeout": 30}
}


class ExperimentSpec(NamedTuple):
    pipeline: str
    dataset: str
    model: str
    timeout: int
    extra_cfg: dict = {}

exps = [ExperimentSpec(pipeline="nodepred", dataset="cora", model="sage", timeout=0.5)]

@pytest.mark.parametrize("spec", exps)
def test_train(spec):
    cfg_path = "/tmp/test.yaml"
    run = subprocess.run(["dgl", "config", spec.pipeline, "--data", spec.dataset, "--model", spec.model, "--cfg", cfg_path], timeout=spec.timeout, capture_output=True)
    assert run.stderr is None or len(run.stderr) == 0, "Found error message: {}".format(run.stderr)
    output = run.stdout.decode("utf-8")
    print(output)

    run = subprocess.run(["dgl", "train", "--cfg", cfg_path], timeout=spec.timeout, capture_output=True)
    assert run.stderr is None or len(run.stderr) == 0, "Found error message: {}".format(run.stderr)
    output = run.stdout.decode("utf-8")
    print(output)

TEST_RECIPE_FOLDER = "my_recipes"

@pytest.fixture
def setup_recipe_folder():
    run = subprocess.run(["dgl", "recipe", "copy", "--dir", TEST_RECIPE_FOLDER], timeout=15, capture_output=True)

@pytest.mark.parametrize("file", [str(f) for f in Path(TEST_RECIPE_FOLDER).glob("*.yaml")])
def test_recipe(file, setup_recipe_folder):
    print("DGL enter train {}".format(file))
    try:
        run = subprocess.run(["dgl", "train", "--cfg", file], timeout=5, capture_output=True)
        sh_stdout, sh_stderr = run.stdout, run.stderr
    except subprocess.TimeoutExpired as e:
        sh_stdout = e.stdout
        sh_stderr = e.stderr
    if sh_stderr is not None and len(sh_stderr) != 0:
        error_str = sh_stderr.decode("utf-8")
        lines = error_str.split("\n")
        for line in lines:
            line = line.strip()
            if line.startswith("WARNING") or line.startswith("Aborted") or line.startswith("0%"):
                continue
            else:
                assert len(line) == 0, error_str
    print("{} stdout: {}".format(file, sh_stdout))
    print("{} stderr: {}".format(file, sh_stderr))

# test_recipe( , None)
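
The stderr check inside `test_recipe` can be read as a small filter: any non-empty stderr line that does not start with `WARNING`, `Aborted` or `0%` counts as a failure. Below is a minimal sketch of that rule as a standalone helper. The names `ALLOWED_PREFIXES` and `unexpected_stderr_lines` are hypothetical and not part of this commit.

```python
# Prefixes the test tolerates on stderr (taken from the check in test_recipe).
ALLOWED_PREFIXES = ("WARNING", "Aborted", "0%")


def unexpected_stderr_lines(stderr_bytes):
    """Return stripped, non-empty stderr lines that lack an allowed prefix.

    Hypothetical helper: an empty result corresponds to the test passing.
    """
    if not stderr_bytes:
        return []
    lines = (line.strip() for line in stderr_bytes.decode("utf-8").split("\n"))
    return [line for line in lines if line and not line.startswith(ALLOWED_PREFIXES)]


print(unexpected_stderr_lines(b"WARNING: slow\n\nTraceback (most recent call last):\n"))
```

With such a helper, the loop in the test body would reduce to a single assertion that the returned list is empty.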


@@ -1,270 +0,0 @@
# DGL-Enter
(What is DGL-Enter? Why design this? What is it for?)
DGL-Enter is a command-line tool for users to quickly bootstrap models with multiple datasets. It also lets users fully customize the pipeline for their own tasks.
## Installation guide
You can install DGL-Enter with `pip install dglenter`. You should then be able to use DGL-Enter from your command line by typing `dgl-enter`:
```
Usage: dgl-enter [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
Commands:
config Generate the config files
export Export the python file from config
train Train the model
```
## Train GraphSAGE on Cora from scratch
Here we'll use one of the most classic models, GraphSAGE, and the Cora citation graph dataset as an example to show how easy it is to train a model with DGL-Enter.
### Step 1: Use `dgl-enter config` to generate a yaml configuration file
Run `dgl-enter config nodepred --data cora --model sage --cfg cora_sage.yml`. You'll get a configuration file `cora_sage.yml` that includes all the tunable settings, with comments.
Optionally, you can change the config as you like to achieve better performance. Below is a modified sample based on the template generated by the command above.
The early stop part is removed for simplicity.
```yaml
version: 0.0.1
pipeline_name: nodepred
device: cpu
data:
  name: cora
  split_ratio: # Ratio to generate split masks, for example set to [0.8, 0.1, 0.1] for 80% train/10% val/10% test. Leave blank to use builtin split in original dataset
model:
  name: sage
  embed_size: -1 # The dimension of created embedding table. -1 means using original node embedding
  hidden_size: 16 # Hidden size.
  num_layers: 1 # Number of hidden layers.
  activation: relu # Activation function name under torch.nn.functional
  dropout: 0.5 # Dropout rate.
  aggregator_type: gcn # Aggregator type to use (``mean``, ``gcn``, ``pool``, ``lstm``).
general_pipeline:
  num_epochs: 200 # Number of training epochs
  eval_period: 5 # Interval epochs between evaluations
  optimizer:
    name: Adam
    lr: 0.01
    weight_decay: 0.0005
  loss: CrossEntropyLoss
  num_runs: 1 # Number of experiments to run
```
### Step 2: Use `dgl-enter train` to initiate the training process.
Simply running `dgl-enter train --cfg cora_sage.yml` will start the training process:
```log
...
Epoch 00190 | Loss 1.5225 | TrainAcc 0.9500 | ValAcc 0.6840
Epoch 00191 | Loss 1.5416 | TrainAcc 0.9357 | ValAcc 0.6840
Epoch 00192 | Loss 1.5391 | TrainAcc 0.9357 | ValAcc 0.6840
Epoch 00193 | Loss 1.5257 | TrainAcc 0.9643 | ValAcc 0.6840
Epoch 00194 | Loss 1.5196 | TrainAcc 0.9286 | ValAcc 0.6840
EarlyStopping counter: 12 out of 20
Epoch 00195 | Loss 1.4862 | TrainAcc 0.9643 | ValAcc 0.6760
Epoch 00196 | Loss 1.5142 | TrainAcc 0.9714 | ValAcc 0.6760
Epoch 00197 | Loss 1.5145 | TrainAcc 0.9714 | ValAcc 0.6760
Epoch 00198 | Loss 1.5174 | TrainAcc 0.9571 | ValAcc 0.6760
Epoch 00199 | Loss 1.5235 | TrainAcc 0.9714 | ValAcc 0.6760
Test Accuracy 0.7740
Accuracy across 1 runs: 0.774 ± 0.0
```
That's all! You only need two commands to train a graph neural network.
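The final log line, `Accuracy across 1 runs: 0.774 ± 0.0`, summarizes the test accuracy over `num_runs` runs as a mean and a spread. Here is a minimal sketch of how such a summary could be computed. It assumes a population standard deviation, and the helper name `summarize_runs` is hypothetical; the statistic DGL-Enter actually uses is not shown in this README.

```python
import statistics


def summarize_runs(accuracies):
    """Return (mean, std) of per-run test accuracies.

    Assumption: population standard deviation, so a single run yields 0.0.
    """
    mean = statistics.fmean(accuracies)
    std = statistics.pstdev(accuracies)
    return mean, std


accs = [0.774]  # one entry per run; here num_runs is 1
mean, std = summarize_runs(accs)
print("Accuracy across {} runs: {} ± {}".format(len(accs), mean, std))
```

Raising `num_runs` in the configuration gives a more meaningful spread, since a single run always reports a deviation of zero.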
## Debug your model and advanced customization
That's not everything yet. We believe you may want to change more than the configuration file: to change the training pipeline, calculate new metrics, or look into the code for details.
DGL-Enter can export a self-contained, runnable Python script for you to do anything you like.
Try `dgl-enter export --cfg cora_sage.yml --output script.py`, and you'll get the script used to train the model, like magic!
Below is an excerpt of the exported script:
```python
...
def train(cfg, pipeline_cfg, device, data, model, optimizer, loss_fcn):
    g = data[0]  # Only train on the first graph
    g = dgl.remove_self_loop(g)
    g = dgl.add_self_loop(g)
    g = g.to(device)
    node_feat = g.ndata.get('feat', None)
    edge_feat = g.edata.get('feat', None)
    label = g.ndata['label']
    train_mask, val_mask, test_mask = g.ndata['train_mask'].bool(
    ), g.ndata['val_mask'].bool(), g.ndata['test_mask'].bool()

    val_acc = 0.
    for epoch in range(pipeline_cfg['num_epochs']):
        model.train()
        logits = model(g, node_feat, edge_feat)
        loss = loss_fcn(logits[train_mask], label[train_mask])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        train_acc = accuracy(logits[train_mask], label[train_mask])
        if epoch != 0 and epoch % pipeline_cfg['eval_period'] == 0:
            val_acc = accuracy(logits[val_mask], label[val_mask])

        print("Epoch {:05d} | Loss {:.4f} | TrainAcc {:.4f} | ValAcc {:.4f}".
              format(epoch, loss.item(), train_acc, val_acc))

    model.eval()
    with torch.no_grad():
        logits = model(g, node_feat, edge_feat)
        test_acc = accuracy(logits[test_mask], label[test_mask])
    return test_acc


def main():
    cfg = {
        'version': '0.0.1',
        'device': 'cpu',
        'data': {
            'split_ratio': None},
        'model': {
            'embed_size': -1,
            'hidden_size': 16,
            'num_layers': 1,
            'activation': 'relu',
            'dropout': 0.5,
            'aggregator_type': 'gcn'},
        'general_pipeline': {
            'num_epochs': 200,
            'eval_period': 5,
            'optimizer': {
                'lr': 0.01,
                'weight_decay': 0.0005},
            'loss': 'CrossEntropyLoss',
            'num_runs': 1}}
    device = cfg['device']
    pipeline_cfg = cfg['general_pipeline']
    # load data
    data = AsNodePredDataset(CoraGraphDataset())
    # create model
    model_cfg = cfg["model"]
    cfg["model"]["data_info"] = {
        "in_size": model_cfg['embed_size'] if model_cfg['embed_size'] > 0 else data[0].ndata['feat'].shape[1],
        "out_size": data.num_classes,
        "num_nodes": data[0].num_nodes()
    }
    model = GraphSAGE(**cfg["model"])
    model = model.to(device)
    loss = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(
        model.parameters(),
        **pipeline_cfg["optimizer"])
    # train
    test_acc = train(cfg, pipeline_cfg, device, data, model, optimizer, loss)
    return test_acc
...
```
## Recipes
We've prepared a set of fine-tuned configs under `enter/recipes` that you can try to get reproducible results.
For example, to use GCN on the PubMed dataset, you can use `enter/recipes/nodepred_pubmed_gcn.yml`.
To try it, run `dgl-enter train --cfg recipes/nodepred_pubmed_gcn.yml` to train the model, or `dgl-enter export --cfg recipes/nodepred_pubmed_gcn.yml` to get the full training script.
## Use DGL-Enter on your own dataset
You can modify the generated script in any way you want. However, we also provide an end-to-end way to use your own dataset through our `CSVDataset`.
Step 1: Prepare your CSV and metadata files.
Following the tutorial at [Loading data from CSV files](https://docs.dgl.ai/en/latest/guide/data-loadcsv.html#guide-data-pipeline-loadcsv), prepare your own CSV dataset, which includes at least three files: a node data CSV, an edge data CSV, and the metadata file (`meta.yaml`).
```yml
dataset_name: my_csv_dataset
edge_data:
- file_name: edges.csv
node_data:
- file_name: nodes.csv
```
Step 2: Choose the csv dataset in the `dgl-enter config` stage
Try `dgl-enter config nodepred --data csv --model sage --cfg csv_sage.yml` to use the SAGE model on your dataset. The data part of the configuration now holds the settings for the CSV dataset. `data_path` specifies the data folder, and `./` means the current folder.
If your dataset doesn't have a built-in train/val/test split on the nodes, you need to set the split ratio manually in the config YAML file, and DGL will randomly generate the split for you.
```yml
data:
  name: csv
  split_ratio: # Ratio to generate split masks, for example set to [0.8, 0.1, 0.1] for 80% train/10% val/10% test. Leave blank to use builtin split in original dataset
  data_path: ./ # meta.yaml, nodes.csv and edges.csv should be in this folder
```
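To illustrate what a `split_ratio` such as `[0.8, 0.1, 0.1]` means, here is a minimal pure-Python sketch of a random train/val/test split over node IDs. The helper name `random_split`, the `seed` parameter and the rounding rule are assumptions made for illustration; DGL's own implementation may differ.

```python
import random


def random_split(num_nodes, split_ratio, seed=0):
    """Randomly partition node IDs into train/val/test lists.

    Hypothetical helper: train and val sizes are floored, and the test
    set takes whatever remains.
    """
    train_ratio, val_ratio, _ = split_ratio
    ids = list(range(num_nodes))
    random.Random(seed).shuffle(ids)  # seeded, so the split is reproducible
    num_train = int(num_nodes * train_ratio)
    num_val = int(num_nodes * val_ratio)
    train_ids = ids[:num_train]
    val_ids = ids[num_train:num_train + num_val]
    test_ids = ids[num_train + num_val:]
    return train_ids, val_ids, test_ids


train_ids, val_ids, test_ids = random_split(10, [0.8, 0.1, 0.1])
print(len(train_ids), len(val_ids), len(test_ids))
```

Every node lands in exactly one of the three sets, which is the property the train/val/test masks need.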
Step 3: `train` the model/`export` the script
Then you can do the same as in the tutorial above: either train the model with `dgl-enter train --cfg csv_sage.yml`, or use `dgl-enter export --cfg csv_sage.yml --output my_dataset.py` to get the training script.
## API Reference
DGL-Enter is a new tool for users to bootstrap datasets and common models.
The entry point is `dgl-enter`, and it has three subcommands: `config`, `train` and `export`.
### Config
The config stage generates a configuration file for a specific pipeline.
`dgl-enter` currently provides 3 pipelines:
- nodepred (Node prediction tasks, suitable for prototyping on small datasets)
- nodepred-ns (Node prediction tasks with neighbor sampling, suitable for medium and large datasets)
- linkpred (Link prediction tasks, which predict whether an edge exists between a pair of nodes based on node features)
You can get the full list with `dgl-enter config --help`:
```
Usage: dgl-enter config [OPTIONS] COMMAND [ARGS]...
Generate the config files
Options:
--help Show this message and exit.
Commands:
linkpred Link prediction pipeline
nodepred Node classification pipeline
nodepred-ns Node classification sampling pipeline
```
Each pipeline has different options to specify. For example, for the node prediction pipeline, run `dgl-enter config nodepred --help` and you'll get:
```
Usage: dgl-enter config nodepred [OPTIONS]
Node classification pipeline
Options:
--data [cora|citeseer|ogbl-collab|csv|reddit|co-buy-computer]
input data name [required]
--cfg TEXT output configuration path [default:
cfg.yml]
--model [gcn|gat|sage|sgc|gin] Model name [required]
--device [cpu|cuda] Device, cpu or cuda [default: cpu]
--help Show this message and exit.
```
You can always get the detailed help information by adding `--help` to the command line
### Train
Use `dgl-enter train` to train a model on the dataset according to the configuration file generated by `dgl-enter config`:
```
Usage: dgl-enter train [OPTIONS]
Train the model
Options:
--cfg TEXT yaml file name [default: cfg.yml]
--help Show this message and exit.
```
### Export
Use `dgl-enter export` to get a self-contained, runnable Python script derived from the configuration file.


@@ -1,18 +0,0 @@
import typer
from ..pipeline import *
from ..model import *
from .config_cli import config_app
from .train_cli import train
from .export_cli import export
no_args_is_help = False
app = typer.Typer(no_args_is_help=no_args_is_help, add_completion=False)
app.add_typer(config_app, name="config", no_args_is_help=no_args_is_help)
app.command(help="Train the model", no_args_is_help=no_args_is_help)(train)
app.command(help="Export the python file from config", no_args_is_help=no_args_is_help)(export)
def main():
app()
if __name__ == "__main__":
app()