Tutorial
Data Loading API
Welcome to the tutorial! Let’s start with the data loading API that is used to assemble a dataset from the given graph(s) and task. To load a dataset from the remote data repository, simply use the gli.dataloading.get_gli_dataset() function:
>>> import gli
>>> dataset = gli.get_gli_dataset(dataset="cora", task="NodeClassification", device="cpu")
>>> dataset
Dataset("CORA dataset. NodeClassification", num_graphs=1, save_path=/Users/jimmy/.dgl/CORA dataset. NodeClassification)
The above code loads the Cora dataset on the NodeClassification task (gli.task.NodeClassificationTask) that is predefined in the GLI repository. gli.dataloading.get_gli_dataset() essentially does three things:

1. Load the requested graph(s).
2. Load the requested task configuration.
3. Combine them to return a dataset instance.
Alternatively, one can do the same thing step by step with the helper functions provided by GLI. Specifically, methods starting with get download data from the remote repository, while methods starting with read load data files from local directories.
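For example, if the dataset files already exist on disk (say, in a local clone of the GLI repository), the read-style functions can assemble the same dataset without downloading anything. The following is a minimal sketch only: the function locations (gli.graph.read_gli_graph(), gli.task.read_gli_task()) and the local file layout shown here are assumptions for illustration, not verbatim paths.
>>> import gli
>>> # Hypothetical local paths to the metadata and task configuration files.
>>> g = gli.graph.read_gli_graph("./datasets/cora/metadata.json", device="cpu")
>>> task = gli.task.read_gli_task("./datasets/cora/task.json")
>>> dataset = gli.combine_graph_and_task(g, task)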
GLI adopts the graph classes of DGL. Therefore, gli.dataloading.get_gli_graph() will return a DGLGraph instance, or a list of DGLGraph instances if the dataset contains multiple graphs. In addition, GLI provides class implementations for various tasks (e.g., gli.task.NodeClassificationTask, gli.task.LinkPredictionTask); gli.dataloading.get_gli_task() will return such a gli.task.GLITask object. One can then call gli.dataloading.combine_graph_and_task() to assemble the corresponding dataset (e.g., gli.dataset.NodeClassificationDataset, gli.dataset.LinkPredictionDataset):
>>> import gli
>>> g = gli.get_gli_graph(dataset="cora", device="cpu", verbose=False)
>>> g
Graph(num_nodes=2708, num_edges=10556,
ndata_schemes={'NodeFeature': Scheme(shape=(1433,), dtype=torch.float32), 'NodeLabel': Scheme(shape=(), dtype=torch.int64)}
edata_schemes={})
>>> task = gli.get_gli_task(dataset="cora", task="NodeClassification", verbose=False)
>>> task
<gli.task.NodeClassificationTask object at 0x100eff640>
>>> dataset = gli.combine_graph_and_task(g, task)
>>> dataset
Dataset("CORA dataset. NodeClassification", num_graphs=1, save_path=/Users/jimmy/.dgl/CORA dataset. NodeClassification)
The returned dataset inherits from DGLDataset. Therefore, it can be seamlessly incorporated into DGL’s infrastructure:
>>> type(dataset)
<class 'gli.dataset.NodeClassificationDataset'>
>>> import dgl
>>> isinstance(dataset, dgl.data.DGLDataset)
True
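Since the dataset follows the standard DGLDataset interface, the underlying graph can be retrieved by indexing, which is exactly what the full example below does:
>>> g = dataset[0]  # single-graph dataset, so index 0 holds the Cora graph
>>> g.num_nodes()
2708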
Example
Next, let’s walk through a full example of data loading and training on GLI datasets.
First, import all required modules.
import gli
import torch
from torch import nn
import torch.nn.functional as F
from dgl.nn.pytorch import GraphConv
from gli.utils import to_dense
Then, load the Cora dataset on the node classification task.
data = gli.dataloading.get_gli_dataset("cora", "NodeClassification")
g = data[0]
g = to_dense(g)
features = g.ndata["NodeFeature"]
labels = g.ndata["NodeLabel"]
train_mask = g.ndata["train_mask"]
val_mask = g.ndata["val_mask"]
test_mask = g.ndata["test_mask"]
in_feats = features.shape[1]
n_classes = data.num_labels
Since the node features in the Cora dataset are stored as sparse tensors, we need to convert them to dense tensors for later computation.
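As a quick sanity check (not part of the original example), a dense tensor has PyTorch’s default strided layout, whereas a sparse tensor would report a sparse layout:
# Dense tensors use torch.strided; a sparse COO tensor would report torch.sparse_coo.
assert features.layout == torch.strided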
We then define the evaluation functions as below.
def accuracy(logits, labels):
"""Calculate accuracy."""
_, indices = torch.max(logits, dim=1)
correct = torch.sum(indices == labels)
return correct.item() * 1.0 / len(labels)
def evaluate(model, features, labels, mask, eval_func):
"""Evaluate model."""
model.eval()
with torch.no_grad():
logits = model(features)
logits = logits[mask]
labels = labels[mask]
return eval_func(logits, labels)
Next, we define a GCN model and start training.
class GCN(nn.Module):
"""GCN network."""
def __init__(self,
g,
in_feats,
n_hidden,
n_classes,
n_layers,
activation,
dropout):
"""Initiate model."""
super().__init__()
self.g = g
self.layers = nn.ModuleList()
# input layer
self.layers.append(GraphConv(in_feats, n_hidden,
activation=activation))
# hidden layers
for _ in range(n_layers - 2):
self.layers.append(GraphConv(n_hidden, n_hidden,
activation=activation))
# output layer
self.layers.append(GraphConv(n_hidden, n_classes))
self.dropout = nn.Dropout(p=dropout)
def forward(self, features):
"""Forward."""
h = features
for i, layer in enumerate(self.layers):
if i != 0:
h = self.dropout(h)
h = layer(self.g, h)
return h
model = GCN(g=g,
in_feats=in_feats,
n_hidden=8,
n_classes=n_classes,
n_layers=2,
activation=F.relu,
dropout=.6)
optimizer = torch.optim.AdamW(model.parameters(), lr=.01, weight_decay=.001)
eval_func = accuracy
loss_fcn = nn.CrossEntropyLoss()
for epoch in range(200):
model.train()
# forward
logits = model(features)
loss = loss_fcn(logits[train_mask], labels[train_mask])
optimizer.zero_grad()
loss.backward()
optimizer.step()
train_acc = eval_func(logits[train_mask], labels[train_mask])
val_acc = evaluate(model, features, labels, val_mask, eval_func)
print(f"Epoch {epoch:05d} | Loss {loss.item():.4f} |"
f"TrainAcc {train_acc:.4f} | ValAcc {val_acc:.4f}")
test_acc = evaluate(model, features, labels, test_mask, eval_func)
print(f"Test Accuracy: {test_acc:.4f}")
Output:
Epoch 00000 | Loss 1.9454 |TrainAcc 0.1429 | ValAcc 0.3180
Epoch 00001 | Loss 1.9375 |TrainAcc 0.2500 | ValAcc 0.3580
Epoch 00002 | Loss 1.9318 |TrainAcc 0.3286 | ValAcc 0.3940
Epoch 00003 | Loss 1.9242 |TrainAcc 0.3357 | ValAcc 0.4100
Epoch 00004 | Loss 1.9138 |TrainAcc 0.4214 | ValAcc 0.4420
Epoch 00005 | Loss 1.9039 |TrainAcc 0.5143 | ValAcc 0.4720
Epoch 00006 | Loss 1.9002 |TrainAcc 0.4143 | ValAcc 0.4740
Epoch 00007 | Loss 1.8891 |TrainAcc 0.4643 | ValAcc 0.4660
Epoch 00008 | Loss 1.8787 |TrainAcc 0.5071 | ValAcc 0.4760
Epoch 00009 | Loss 1.8733 |TrainAcc 0.4286 | ValAcc 0.5020
Epoch 00010 | Loss 1.8581 |TrainAcc 0.5857 | ValAcc 0.5280
...