Graph: Train, valid, and test dataset split for link prediction
Link Prediction
- Link prediction is a common task for link completion in knowledge graphs.
- Link prediction is usually an unsupervised or self-supervised task, which means we often need to split the dataset and create the corresponding labels ourselves.
How to prepare train, valid, and test datasets?
For link prediction, we will split edges twice:
- Step 1: Assign 2 types of edges in the original graph
- Message edges: Used for GNN message passing
- Supervision edges: Used for computing objectives
- After step 1:
- Only message edges will remain in the graph
- Supervision edges are used as supervision for the edge predictions made by the model; they will not be fed into the GNN!
- Step 2: Split edges into train / validation / test
Option 1: Inductive setting
- training / validation / test sets are on different graphs
- The dataset consists of multiple graphs
- Each split can only observe the graph(s) within the split. A successful model should generalize to unseen graphs
- Applicable to node / edge / graph tasks
Option 2: Transductive setting
- training / validation / test sets are on the same graph
- The dataset consists of one graph
- The entire graph can be observed in all dataset splits, we only split the labels
- Only applicable to node / edge prediction tasks
Code
Option 1: PyG’s RandomLinkSplit
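A minimal sketch of using `RandomLinkSplit` (the Cora dataset and the split ratios here are illustrative assumptions, not part of the original notes):

```python
import torch_geometric.transforms as T
from torch_geometric.datasets import Planetoid

# Split edges into train / validation / test supervision edges;
# negative edges are sampled automatically.
transform = T.RandomLinkSplit(
    num_val=0.1,       # fraction of edges used as validation supervision edges
    num_test=0.2,      # fraction of edges used as test supervision edges
    is_undirected=True,
)

dataset = Planetoid(root='data/Planetoid', name='Cora')  # illustrative dataset
train_data, val_data, test_data = transform(dataset[0])

# Each split keeps message passing edges in `edge_index` and
# supervision edges with labels in `edge_label_index` / `edge_label`.
print(train_data, val_data, test_data)
```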
Option 2: deepsnap's GraphDataset
- The `GraphDataset` is compatible with PyTorch Geometric!
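A minimal sketch of the DeepSNAP workflow (again assuming Cora as the input; the split ratios are illustrative):

```python
from deepsnap.dataset import GraphDataset
from torch_geometric.datasets import Planetoid

# Convert a PyG dataset into a list of deepsnap Graph objects.
pyg_dataset = Planetoid(root='data/Planetoid', name='Cora')
graphs = GraphDataset.pyg_to_graphs(pyg_dataset)

# Wrap the graphs for the link prediction task and split transductively.
dataset = GraphDataset(graphs, task='link_pred')
dataset_train, dataset_val, dataset_test = dataset.split(
    transductive=True, split_ratio=[0.8, 0.1, 0.1])
```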
Check all the docs here
The content below is almost the same as in the colab notebooks. It's just for easy and quick viewing on any device.
General rules
In general, edges in the graph will be split into two types:
- `message passing` edges: used for GNN message passing.
- `supervision` edges: used in the loss function for backpropagation.
- We also need to include `negative sampling edges`: edges that do not exist in the original graph.
- DeepSNAP's `GraphDataset` will automatically generate labels for all edges:
  - Negative edges: label 0.
  - Positive supervision edges: usually label 1.
  - If the original edges already have labels (starting from 0), all labels will be shifted up by 1.
In addition to the edge split and negative edge sampling, the supervision edges in the train, validation, and test sets usually need to be disjoint.
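To see the generated labels, one can inspect a split graph; a sketch reusing `dataset_train` from the DeepSNAP snippet above:

```python
# Supervision edges (including sampled negatives) live in edge_label_index,
# and their labels in edge_label: 0 for negatives, 1 for positives.
train_graph = dataset_train[0]
print(train_graph.edge_label_index.shape)
print(train_graph.edge_label)  # e.g. tensor([1, 1, ..., 0, 0])
```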
Transductive Link Prediction Split
DeepSNAP link prediction has two main split modes (`edge_train_mode`: `all`, `disjoint`).
Split Mode: All
The figure below shows the supervision edges in the train (blue), validation (red), and test (green) sets. Notice that in `all` mode, all original edges will be included in the supervision edges.
To be more specific:
- At `training` time: the training supervision edges are the same as the training message passing edges.
  - $\text{training supervision edges} = \text{training message passing edges}$
- At `validation` time: the message passing edges are the training message passing edges and the training supervision edges (still the training message passing edges in this case). The validation supervision edges are disjoint with the training supervision edges.
  - $\text{validation message passing edges} = \text{training message passing edges} + \text{training supervision edges}$
  - $\text{validation supervision edges} \notin \text{training supervision edges}$
- At `test` time: the message passing edges are the union of the training message passing edges, training supervision edges, and validation supervision edges. The test supervision edges are disjoint with the training supervision edges and validation supervision edges.
  - $\text{test message passing edges} = \text{training message passing edges} + \text{training supervision edges} + \text{validation supervision edges}$
  - $\text{test supervision edges} \notin \lbrace \text{training supervision edges}, \text{validation supervision edges} \rbrace$
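A sketch of selecting `all` mode (reusing `graphs` from the earlier snippet):

```python
# Transductive split in `all` mode: every original edge also serves
# as a supervision edge in one of the three sets.
dataset = GraphDataset(graphs, task='link_pred', edge_train_mode='all')
dataset_train, dataset_val, dataset_test = dataset.split(
    transductive=True, split_ratio=[0.8, 0.1, 0.1])

# Message passing edges are stored in edge_index; supervision edges
# (plus negatives) in edge_label_index.
g = dataset_train[0]
print(g.edge_index.shape, g.edge_label_index.shape)
```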
Split Mode: Disjoint
The figure below shows the supervision edges in the train (blue), validation (red), and test (green) sets, as well as the training message passing edges (grey). Notice that in `disjoint` mode, not all original edges will be included in the supervision edges.
To be more specific:
- At `training` time: the training supervision edges are disjoint with the training message passing edges.
  - $\text{training supervision edges} \notin \text{training message passing edges}$
- At `validation` time: the message passing edges are the union of the training message passing edges and the training supervision edges. Notice that the validation supervision edges are disjoint with the training supervision edges.
  - $\text{validation message passing edges} = \text{training message passing edges} + \text{training supervision edges}$
  - $\text{validation supervision edges} \notin \text{training supervision edges}$
- At `test` time: the message passing edges are the union of the training message passing edges, training supervision edges, and validation supervision edges. The test supervision edges are disjoint with the training supervision edges and validation supervision edges.
  - $\text{test message passing edges} = \text{training message passing edges} + \text{training supervision edges} + \text{validation supervision edges}$
  - $\text{test supervision edges} \notin \lbrace \text{training supervision edges}, \text{validation supervision edges} \rbrace$
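The corresponding sketch for `disjoint` mode:

```python
# Transductive split in `disjoint` mode: the training supervision edges
# are held out from the training message passing edges.
dataset = GraphDataset(graphs, task='link_pred', edge_train_mode='disjoint')
dataset_train, dataset_val, dataset_test = dataset.split(
    transductive=True, split_ratio=[0.8, 0.1, 0.1])
```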
Inductive Link Prediction Split
For inductive link prediction in DeepSNAP, the graphs will be split into different (train, validation, and test) sets. Each graph in a set has its own message passing edges and supervision edges (which are the same in this case). Since the sets contain different graphs, the supervision and message passing edges of graphs in different sets are disjoint.
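A sketch of the inductive split (this assumes `graphs` contains multiple deepsnap Graph objects, e.g. converted from a multi-graph PyG dataset):

```python
# Inductive split: whole graphs, not edges, are assigned to
# train / validation / test.
dataset = GraphDataset(graphs, task='link_pred')
dataset_train, dataset_val, dataset_test = dataset.split(
    transductive=False, split_ratio=[0.8, 0.1, 0.1])
```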
Negative Sampling Ratio and Resampling
For the `link_pred` task, DeepSNAP will automatically and randomly sample negative edges when:
- The dataset is split into several datasets, such as when one dataset is split into train, validation, and test.
- The `Batch` of the graph is called or used (this will resample all negative edges).
The number or ratio of negative edges can be controlled by specifying `edge_negative_sampling_ratio`, which has a default value of 1. Resampling can be disabled by setting `resample_negatives` to `False`. The example below shows how to set a different number or ratio of negative edges.
Training set negative edges will be resampled whenever the `Batch` object is used. However, to reduce the computation cost, the negative edges in the validation and test sets will not be resampled.
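A sketch (the ratio value is illustrative; both parameters are passed to the `GraphDataset` constructor):

```python
# Sample twice as many negative edges as positive supervision edges,
# and do not resample negatives when a Batch is used.
dataset = GraphDataset(
    graphs,
    task='link_pred',
    edge_train_mode='disjoint',
    edge_negative_sampling_ratio=2.0,  # default is 1
    resample_negatives=False,
)
```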
Message Passing Ratio
Here is an example of adjusting the number of message passing edges and supervision edges in `disjoint` mode. We can control the number of edges by adjusting `edge_message_ratio`, which defines the ratio between message passing edges and supervision edges in the training set.
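A sketch under the same assumptions as above:

```python
# In disjoint mode, keep 60% of the training edges for message passing
# and use the remaining 40% as supervision edges (illustrative value).
dataset = GraphDataset(
    graphs,
    task='link_pred',
    edge_train_mode='disjoint',
    edge_message_ratio=0.6,
)
dataset_train, dataset_val, dataset_test = dataset.split(
    transductive=True, split_ratio=[0.8, 0.1, 0.1])
```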
Node Split
See also: dataset split for node classification.