mirror of https://github.com/dmlc/dgl.git synced 2026-06-04 19:44:23 +08:00

README.md

Overview

This project demonstrates how to use GraphBolt to train and evaluate a GraphSAGE model for a node classification task on large graphs whose node features reside on disk and are fetched through DiskBasedFeature. GraphBolt provides its own implementations of several cache replacement policies (SIEVE, S3-FIFO, LRU and CLOCK) to keep frequently requested features in memory, and uses io_uring to read features that miss the cache from disk. SIEVE is the default policy.
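To make the default policy more concrete, here is a minimal Python sketch of the SIEVE eviction scheme as it is generally described: entries sit in insertion order, a hit only sets a "visited" bit, and a moving hand evicts the first unvisited entry it finds while clearing the bits it passes. This is an illustration only and not GraphBolt's implementation; the class name `SieveCache` and its methods are hypothetical.

```python
class _Node:
    """One cache entry in a doubly linked list (head = newest, tail = oldest)."""
    __slots__ = ("key", "value", "visited", "prev", "next")

    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.visited = False
        self.prev = None  # neighbor toward the head (newer)
        self.next = None  # neighbor toward the tail (older)


class SieveCache:
    """Hypothetical minimal SIEVE cache; not GraphBolt's code."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.table = {}
        self.head = None
        self.tail = None
        self.hand = None  # None means "start the next scan at the tail"

    def get(self, key):
        node = self.table.get(key)
        if node is None:
            return None  # cache miss
        node.visited = True  # a hit only flips a bit; nothing is reordered
        return node.value

    def put(self, key, value):
        node = self.table.get(key)
        if node is not None:
            node.value = value
            node.visited = True
            return
        if len(self.table) >= self.capacity:
            self._evict()
        node = _Node(key, value)
        node.next = self.head
        if self.head is not None:
            self.head.prev = node
        self.head = node
        if self.tail is None:
            self.tail = node
        self.table[key] = node

    def _evict(self):
        node = self.hand if self.hand is not None else self.tail
        # Skip visited entries, clearing their bit, and wrap around at the head.
        while node.visited:
            node.visited = False
            node = node.prev if node.prev is not None else self.tail
        self.hand = node.prev  # the next scan resumes from here
        # Unlink the victim from the list.
        if node.prev is not None:
            node.prev.next = node.next
        else:
            self.head = node.next
        if node.next is not None:
            node.next.prev = node.prev
        else:
            self.tail = node.prev
        del self.table[node.key]


cache = SieveCache(3)
for k, v in (("a", 1), ("b", 2), ("c", 3)):
    cache.put(k, v)
cache.get("a")      # marks "a" as visited
cache.put("d", 4)   # "a" is spared once, so "b" is evicted
print(sorted(cache.table))  # ['a', 'c', 'd']
```

Unlike LRU, a hit does not move the entry within the list, which keeps the hit path cheap.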

Node classification task

This example demonstrates how to run a node classification task with GraphBolt.DiskBasedFeature. All results were collected on an AWS EC2 g5.8xlarge instance with 128GB RAM, 32 cores, a 24GB A10G GPU and instance storage with 250K IOPS.

Run on `ogbn-papers100M` dataset

Dataset	Graph Size	Feature Size	Feature Dim
ogbn-papers100M	13 GB	53 GB	128
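As a rough sanity check on the feature size in the table above, the arithmetic below reproduces the 53 GB figure. It relies on two assumptions that are not stated in this README: the node count published for ogbn-papers100M (111,059,956) and 4-byte float32 feature values. It also assumes "GB" here means GiB.

```python
NUM_NODES = 111_059_956   # assumption: published ogbn-papers100M node count
FEATURE_DIM = 128         # from the table above
BYTES_PER_VALUE = 4       # assumption: float32 features

total_bytes = NUM_NODES * FEATURE_DIM * BYTES_PER_VALUE
size_gib = total_bytes / 2**30
print(f"Feature size: {size_gib:.1f} GiB")

# Fraction of the feature data that a 10 GiB CPU cache could hold at once.
cache_fraction = 10 / size_gib
print(f"10 GiB cache covers {cache_fraction:.1%} of the features")
```

Under these assumptions, a 10GB cache holds a bit under one fifth of the feature data.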

Results with various caching policies

This part trains a three-layer GraphSAGE model for 3 epochs on the ogbn-papers100M dataset with a 10GB CPU cache, using neighbor sampling.

Run default SIEVE policy

Instruction:

python node_classification.py --gpu-cache-size-in-gigabytes=0 --cpu-cache-size-in-gigabytes=10 --dataset=ogbn-papers100M --epochs=3

Result:

Training: 1178it [03:00,  6.53it/s, num_nodes=671260, gpu_cache_miss=1, cpu_cache_miss=0.0578]                                             
Evaluating: 123it [00:16,  7.47it/s, num_nodes=624816, gpu_cache_miss=1, cpu_cache_miss=0.0569]
Epoch 00, Loss: 1.4173, Approx. Train: 0.5787, Approx. Val: 0.6353, Time: 180.33928060531616s                                              
Training: 1178it [01:39, 11.79it/s, num_nodes=648380, gpu_cache_miss=1, cpu_cache_miss=0.0451]                                             
Evaluating: 123it [00:15,  7.90it/s, num_nodes=625373, gpu_cache_miss=1, cpu_cache_miss=0.0451]
Epoch 01, Loss: 1.1446, Approx. Train: 0.6386, Approx. Val: 0.6382, Time: 99.92613315582275s                                               
Training: 1178it [01:36, 12.15it/s, num_nodes=674194, gpu_cache_miss=1, cpu_cache_miss=0.0408]                                             
Evaluating: 123it [00:15,  8.08it/s, num_nodes=628233, gpu_cache_miss=1, cpu_cache_miss=0.0409]
Epoch 02, Loss: 1.0975, Approx. Train: 0.6507, Approx. Val: 0.6535, Time: 96.95083212852478s
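If you want to collect these per-epoch numbers automatically, a small parser along the following lines may help. It is a sketch that assumes the summary lines keep exactly the format shown above; the function name `parse_epochs` and the dictionary keys are made up for illustration.

```python
import re

# Epoch summary lines copied from the output above.
LOG = """\
Epoch 00, Loss: 1.4173, Approx. Train: 0.5787, Approx. Val: 0.6353, Time: 180.33928060531616s
Epoch 01, Loss: 1.1446, Approx. Train: 0.6386, Approx. Val: 0.6382, Time: 99.92613315582275s
Epoch 02, Loss: 1.0975, Approx. Train: 0.6507, Approx. Val: 0.6535, Time: 96.95083212852478s
"""

EPOCH_LINE = re.compile(
    r"Epoch (\d+), Loss: ([\d.]+), Approx\. Train: ([\d.]+), "
    r"Approx\. Val: ([\d.]+), Time: ([\d.]+)s"
)


def parse_epochs(text):
    """Return one dict per epoch summary line found in `text`."""
    rows = []
    for match in EPOCH_LINE.finditer(text):
        epoch, loss, train_acc, val_acc, seconds = match.groups()
        rows.append(
            {
                "epoch": int(epoch),
                "loss": float(loss),
                "train_acc": float(train_acc),
                "val_acc": float(val_acc),
                "seconds": float(seconds),
            }
        )
    return rows


rows = parse_epochs(LOG)
for row in rows:
    print(f"epoch {row['epoch']}: {row['seconds']:.3f}s, val {row['val_acc']:.4f}")
```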

Performance Comparison of four caching policies

The results below show the epoch time for each of the four caching policies.

Policy	Epoch 1 (s)	Epoch 2 (s)	Epoch 3 (s)
SIEVE	180.339	99.926	96.951
S3-FIFO	181.438	110.054	108.310
LRU	194.583	138.352	138.369
CLOCK	188.915	129.372	129.388
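To read the table in relative terms, the snippet below divides each policy's third-epoch time by SIEVE's. The numbers are copied from the table above; the third epoch is used as a warm-cache reference, which is a choice made for this illustration.

```python
# Epoch times in seconds, copied from the table above.
EPOCH_TIMES = {
    "SIEVE": (180.339, 99.926, 96.951),
    "S3-FIFO": (181.438, 110.054, 108.310),
    "LRU": (194.583, 138.352, 138.369),
    "CLOCK": (188.915, 129.372, 129.388),
}

baseline = EPOCH_TIMES["SIEVE"][2]
relative = {policy: times[2] / baseline for policy, times in EPOCH_TIMES.items()}

for policy, ratio in sorted(relative.items(), key=lambda item: item[1]):
    print(f"{policy:8s} {ratio:.2f}x the SIEVE epoch-3 time")
```

On these measurements, LRU's third epoch takes roughly 1.4 times as long as SIEVE's.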

Results with Layer-Neighbor Sampling

This part trains a three-layer GraphSAGE model for 3 epochs on the ogbn-papers100M dataset with a 10GB CPU cache, using Layer-Neighbor Sampling and the default SIEVE policy.

Run default `--batch-dependency=1`

Instruction:

python node_classification.py --gpu-cache-size-in-gigabytes=0 --cpu-cache-size-in-gigabytes=10 --dataset=ogbn-papers100M --sample-mode=sample_layer_neighbor --batch-dependency=1 --epochs=3

Result:

Training: 1178it [02:51,  6.88it/s, num_nodes=463495, gpu_cache_miss=1, cpu_cache_miss=0.0774]                                             
Evaluating: 123it [00:15,  7.94it/s, num_nodes=465592, gpu_cache_miss=1, cpu_cache_miss=0.0762]
Epoch 00, Loss: 1.4173, Approx. Train: 0.5774, Approx. Val: 0.6300, Time: 171.11454963684082s                                              
Training: 1178it [01:34, 12.43it/s, num_nodes=474446, gpu_cache_miss=1, cpu_cache_miss=0.0604]                                             
Evaluating: 123it [00:14,  8.45it/s, num_nodes=462042, gpu_cache_miss=1, cpu_cache_miss=0.0603]
Epoch 01, Loss: 1.1463, Approx. Train: 0.6384, Approx. Val: 0.6395, Time: 94.7821741104126s                                                
Training: 1178it [01:31, 12.82it/s, num_nodes=479331, gpu_cache_miss=1, cpu_cache_miss=0.0545]                                             
Evaluating: 123it [00:14,  8.67it/s, num_nodes=463628, gpu_cache_miss=1, cpu_cache_miss=0.0546]
Epoch 02, Loss: 1.1000, Approx. Train: 0.6501, Approx. Val: 0.6516, Time: 91.8746063709259s

Performance Comparison on different `--batch-dependency`

batch-dependency	Epoch 1 (s)	Epoch 2 (s)	Epoch 3 (s)
1	171.114	94.782	91.875
64	144.241	78.749	75.270
4096	92.494	56.111	57.647
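One way to generate the commands for such a sweep is sketched below. It only uses flags that appear in the instruction above and assumes that the other rows of the table were produced with the same remaining flags, which this README does not state explicitly. The helper name `build_command` is hypothetical.

```python
import shlex

# Flags copied from the instruction above; only --batch-dependency varies.
PREFIX = [
    "python",
    "node_classification.py",
    "--gpu-cache-size-in-gigabytes=0",
    "--cpu-cache-size-in-gigabytes=10",
    "--dataset=ogbn-papers100M",
    "--sample-mode=sample_layer_neighbor",
]
SUFFIX = ["--epochs=3"]


def build_command(batch_dependency):
    """Return the argument list for one --batch-dependency value."""
    return PREFIX + [f"--batch-dependency={batch_dependency}"] + SUFFIX


for value in (1, 64, 4096):
    # Print instead of executing; pass the list to subprocess.run to launch it.
    print(shlex.join(build_command(value)))
```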

Effect of `--layer-dependency`

The results below show the effect of enabling --layer-dependency on epoch time with --batch-dependency=1.

layer-dependency	Epoch 1 (s)	Epoch 2 (s)	Epoch 3 (s)
False	171.114	94.782	91.875
True	159.625	86.209	83.171

Comparison with In-memory Performance

This part trains a three-layer GraphSAGE model for 3 epochs on the ogbn-papers100M dataset with a 20GB CPU cache and a 5GB GPU cache, using neighbor sampling, and compares it with an in-memory feature store that uses a 5GB GPU cache. The results below show that, with sufficient cache memory, DiskBasedFeature is not bottlenecked by the cache itself and performs comparably to the in-memory feature store. Note that the first epoch also populates the cache and therefore takes longer.

Instruction:

python node_classification.py --gpu-cache-size-in-gigabytes=5 --cpu-cache-size-in-gigabytes=20 --dataset=ogbn-papers100M --epochs=3

Result:

Feature Store	Epoch 1 (s)	Epoch 2 (s)	Epoch 3 (s)
DiskBasedFeature	143.761	32.018	31.889
In-memory	28.861	28.330	28.305
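The gap between the two feature stores can be quantified directly from the table above. The snippet below computes the warm-cache overhead of DiskBasedFeature in the third epoch and the extra time spent in the first epoch while the cache fills; the variable names are chosen for illustration.

```python
# Epoch times in seconds, copied from the table above.
disk_based = (143.761, 32.018, 31.889)
in_memory = (28.861, 28.330, 28.305)

# Relative slowdown once the cache is warm (third epoch).
warm_overhead = disk_based[2] / in_memory[2] - 1
# Extra time the first epoch spends compared with a warm epoch.
warmup_seconds = disk_based[0] - disk_based[2]

print(f"Warm-cache overhead: {warm_overhead:.1%}")
print(f"First-epoch warm-up cost: {warmup_seconds:.1f}s")
```

With these numbers, the warm-cache overhead is a little under 13%, while the first epoch takes close to two minutes longer than a warm one.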
