Graph Convolutional Network and Graph Attention
Limitations of Shallow Encoders (e.g. node2vec)
- $O(|V|)$ parameters are needed: no parameters are shared between nodes, and every node has its own unique embedding.
- Inherently "transductive": cannot generate embeddings for nodes that were not seen during training.
- Do not incorporate node features: many graphs have features that we can and should leverage.

Graph neural networks address these limitations; in particular, they can generate embeddings for unseen nodes.
Aggregate Neighbors: generate node embeddings based on local network neighborhoods.
Intuition: nodes aggregate information from their neighbors using neural networks.
Computation graph: defined by the network neighborhood.
Layers: the model can be of arbitrary depth.
- Nodes have embeddings at each layer.
- Layer-0 embedding: node $u$'s input feature $x_u$.
- The layer-$K$ embedding gets information from nodes that are $K$ hops away.
Training can be unsupervised or supervised:

- Unsupervised training: use only the graph structure; "similar" nodes have similar embeddings. The unsupervised loss function could be based on:
  - Random walks (node2vec, DeepWalk, struc2vec)
  - Graph factorization
  - Node proximity in the graph
- Supervised training: directly train the model for a supervised task (e.g. node classification); a minimal sketch of this case follows below.
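As a rough illustration of the supervised case, a single training step might look like the following sketch (the names `model`, `x`, `edge_index`, `labels`, and `train_mask` are assumptions for illustration, not from the lecture):

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, x, edge_index, labels, train_mask):
    """One supervised step: embeddings -> logits -> cross-entropy on labeled nodes."""
    model.train()
    optimizer.zero_grad()
    logits = model(x, edge_index)          # per-node class scores
    loss = F.cross_entropy(logits[train_mask], labels[train_mask])
    loss.backward()
    optimizer.step()
    return loss.item()
```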
Each layer aggregates the previous-layer embeddings of a node's neighbors and combines them with the node's own previous-layer embedding:

$$\mathbf{h}_v^k = \sigma\left(\mathbf{W}_k \cdot \text{AGG}\left(\left\{\mathbf{h}_u^{k-1}, \forall u \in N(v)\right\}\right) + \mathbf{B}_k \mathbf{h}_v^{k-1}\right)$$

where,

- $\mathbf{W}_k$, $\mathbf{B}_k$: learnable weight matrices.
- $\mathbf{h}_v^0 = \mathbf{x}_v$: the initial 0-th layer embeddings are equal to the node features.
- $\mathbf{h}_v^{k-1}$: the previous-layer embedding of $v$.
- $\mathbf{z}_v = \mathbf{h}_v^K$: the embedding after $K$ layers of neighborhood aggregation.
- $\sigma$: non-linearity, e.g. ReLU.
- AGG: the aggregation function; common choices are listed below.
- Mean: take a weighted average of neighbors.

  $$\text{AGG} = \sum_{u \in N(v)} \frac{\mathbf{h}_u^{k-1}}{|N(v)|}$$
- Pool: transform neighbor vectors and apply a symmetric vector function.

  $$\text{AGG} = \gamma\left(\left[\mathbf{Q}\mathbf{h}_u^{k-1}, \forall u \in N(v)\right]\right)$$
  $\gamma$: element-wise mean/max.

- LSTM: apply an LSTM to a random permutation $\pi$ of the neighbors.

  $$\text{AGG} = \text{LSTM}\left(\left[\mathbf{h}_u^{k-1}, \forall u \in \pi(N(v))\right]\right)$$
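The three aggregators can be sketched in a few lines of PyTorch, operating on one node's stacked neighbor embeddings (`Q` and `lstm` are assumed modules, not lecture code):

```python
import torch
import torch.nn as nn

dim = 16
Q = nn.Linear(dim, dim)                   # transform used by the pool aggregator
lstm = nn.LSTM(dim, dim, batch_first=True)

def agg_mean(h_neigh):                    # h_neigh: [num_neighbors, dim]
    return h_neigh.mean(dim=0)

def agg_pool(h_neigh):                    # gamma = element-wise max here
    return Q(h_neigh).max(dim=0).values

def agg_lstm(h_neigh):                    # LSTM over a random permutation pi
    perm = torch.randperm(h_neigh.size(0))
    out, _ = lstm(h_neigh[perm].unsqueeze(0))
    return out[0, -1]                     # last hidden state
```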
The formula, with mean aggregation:

$$\mathbf{h}_v^k = \sigma\left(\mathbf{W}_k \sum_{u \in N(v)} \frac{\mathbf{h}_u^{k-1}}{|N(v)|} + \mathbf{B}_k \mathbf{h}_v^{k-1}\right)$$
Equivalently, rewritten in matrix form:

$$\mathbf{H}^{(l+1)} = \sigma\left(\mathbf{H}^{(l)} \mathbf{W}_0^{(l)} + \tilde{\mathbf{A}} \mathbf{H}^{(l)} \mathbf{W}_1^{(l)}\right)$$

with $\tilde{\mathbf{A}} = \mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}}$; here $\mathbf{W}_0^{(l)}$ plays the role of $\mathbf{B}_k$ (self-transform) and $\mathbf{W}_1^{(l)}$ the role of $\mathbf{W}_k$ (neighbor transform).
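One propagation step in this matrix form fits in a few lines of NumPy (a toy sketch on a random graph; all names are illustrative):

```python
import numpy as np

n, d = 5, 8
A = (np.random.rand(n, n) < 0.4).astype(float)
A = np.maximum(A, A.T)                    # symmetrize the adjacency matrix
np.fill_diagonal(A, 0)
deg = A.sum(axis=1).clip(min=1)           # guard isolated nodes
D_inv_sqrt = np.diag(deg ** -0.5)
A_tilde = D_inv_sqrt @ A @ D_inv_sqrt     # D^{-1/2} A D^{-1/2}

H = np.random.randn(n, d)                 # layer-l embeddings H^(l)
W0 = np.random.randn(d, d)                # self-transform
W1 = np.random.randn(d, d)                # neighbor transform
H_next = np.maximum(H @ W0 + A_tilde @ H @ W1, 0)   # sigma = ReLU
```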
Graph Attention Network (GAT)

GCN and GraphSAGE aggregate messages across the neighborhood $N(v)$ with a fixed weighting: $\alpha_{vu} = \frac{1}{|N(v)|}$ is the weighting factor of node $u$'s message to node $v$. This $\alpha_{vu}$ is defined explicitly from the structural properties of the graph, so all neighbors $u \in N(v)$ are equally important to node $v$. GAT instead allows (implicitly) specifying different importance values $\alpha_{vu}$ for different neighbors.
Compute the embedding $\mathbf{h}_v^k$ of each node in the graph as follows:
- Nodes attend over their neighborhood's messages.
- Implicitly specify different weights to different nodes in a neighborhood.

Attention Mechanism
Compute attention coefficients $e_{vu}$ across pairs of nodes $u$, $v$ based on their messages:

$$e_{vu} = a\left(\mathbf{W}_k \mathbf{h}_u^{k-1}, \mathbf{W}_k \mathbf{h}_v^{k-1}\right)$$

Normalize the coefficients using the softmax function so that they are comparable across different neighborhoods:

$$\begin{aligned}
\alpha_{vu} &= \frac{\exp(e_{vu})}{\sum_{k \in N(v)} \exp(e_{vk})} \\
\mathbf{h}_v^k &= \sigma\left(\sum_{u \in N(v)} \alpha_{vu} \mathbf{W}_k \mathbf{h}_u^{k-1}\right)
\end{aligned}$$
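For a single node $v$ this computation is easy to sketch; following the GAT paper, $a(\cdot,\cdot)$ is assumed here to be a dot product of a learned vector with the concatenated messages (all tensors below are illustrative):

```python
import torch
import torch.nn.functional as F

d_in, d_out, n_neigh = 8, 8, 4
W = torch.randn(d_out, d_in)              # shared weight matrix W_k
a = torch.randn(2 * d_out)                # learned attention vector

h_v = torch.randn(d_in)                   # previous-layer embedding of v
h_neigh = torch.randn(n_neigh, d_in)      # previous-layer neighbor embeddings

Wh_v = W @ h_v
Wh_u = h_neigh @ W.T                                    # [n_neigh, d_out]
# e_{vu}: concatenate the two messages, dot with the attention vector
e = torch.cat([Wh_u, Wh_v.expand(n_neigh, -1)], dim=1) @ a
alpha = torch.softmax(F.leaky_relu(e, 0.2), dim=0)      # normalize over N(v)
h_v_new = torch.relu((alpha.unsqueeze(1) * Wh_u).sum(dim=0))
```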
Multi-head attention
- Attention operations in a given layer are independently replicated $R$ times, each replica with different parameters.
- Outputs are aggregated, by concatenating or adding; see the sketch below.
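A compact sketch of the multi-head combination, assuming each element of `heads` is a single-head attention layer such as the one sketched above:

```python
import torch

def multi_head(x, edge_index, heads, concat=True):
    # R independent replicas, each with its own parameters
    outs = [head(x, edge_index) for head in heads]
    if concat:
        return torch.cat(outs, dim=-1)    # concatenate head outputs
    return torch.stack(outs).mean(dim=0)  # or aggregate by averaging
```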
For GCN, PyTorch Geometric computes the layer as

$$\mathbf{X}' = \tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}} \mathbf{X} \mathbf{\Theta}$$

Actually, this is the same as the per-node form

$$\mathbf{x}_i^{(k)} = \sum_{j \in \mathcal{N}(i) \cup \{i\}} \frac{1}{\sqrt{\deg(i)} \cdot \sqrt{\deg(j)}} \cdot \left(\mathbf{\Theta} \cdot \mathbf{x}_j^{(k-1)}\right)$$

which the following `MessagePassing` implementation makes explicit:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch_geometric.nn as pyg_nn
import torch_geometric.utils as pyg_utils


class GCNConv(pyg_nn.MessagePassing):
    def __init__(self, in_channels, out_channels):
        super(GCNConv, self).__init__(aggr='add')  # sum aggregation
        self.lin = nn.Linear(in_channels, out_channels)

    def forward(self, x, edge_index):
        # add self-loops so each node also receives its own message
        edge_index, _ = pyg_utils.add_self_loops(edge_index, num_nodes=x.size(0))
        # initial feature transform: x -> x @ Theta
        x = self.lin(x)
        return self.propagate(edge_index, size=(x.size(0), x.size(0)), x=x)

    def message(self, x_j, edge_index, size):
        # \phi: normalize each neighbor message by 1 / sqrt(deg(i) * deg(j))
        row, col = edge_index
        deg = pyg_utils.degree(row, size[0], dtype=x_j.dtype)
        deg_inv_sqrt = deg.pow(-0.5)
        norm = deg_inv_sqrt[row] * deg_inv_sqrt[col]
        return norm.view(-1, 1) * x_j

    def update(self, aggr_out):
        # \gamma: identity, the aggregated sum is the new embedding
        return aggr_out
```
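A quick smoke test on a toy graph (the graph and shapes are assumptions for illustration):

```python
x = torch.randn(4, 16)                          # 4 nodes, 16 input features
edge_index = torch.tensor([[0, 1, 1, 2, 2, 3],  # undirected edges in COO format
                           [1, 0, 2, 1, 3, 2]])
conv = GCNConv(16, 32)
out = conv(x, edge_index)                       # -> torch.Size([4, 32])
```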
```python
class GraphSage(pyg_nn.MessagePassing):
    """Non-minibatch version of GraphSage."""
    def __init__(self, in_channels, out_channels, reducer='mean',
                 normalize_embedding=True):
        super(GraphSage, self).__init__(aggr=reducer)  # neighbor aggregation
        # transform for the concatenated [self, aggregated-neighbors] vector
        self.agg_lin = nn.Linear(2 * in_channels, out_channels)
        self.normalize_emb = normalize_embedding

    def forward(self, x, edge_index):
        num_nodes = x.size(0)
        return self.propagate(edge_index, size=(num_nodes, num_nodes), x=x)

    def message(self, x_j, edge_index, size):
        # \phi: pass the neighbor feature through unchanged
        return x_j

    def update(self, aggr_out, x):
        # \gamma: concatenate self and aggregated-neighbor features, then transform
        concat_out = torch.cat((x, aggr_out), 1)
        aggr_out = F.relu(self.agg_lin(concat_out))
        if self.normalize_emb:
            aggr_out = F.normalize(aggr_out, p=2, dim=1)
        return aggr_out
```
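Reusing the toy graph from the GCN example above:

```python
sage = GraphSage(16, 32)
out = sage(x, edge_index)   # -> torch.Size([4, 32]), rows L2-normalized
```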
```python
class GAT(pyg_nn.MessagePassing):
    def __init__(self, in_channels, out_channels, num_heads=1, concat=True,
                 dropout=0, bias=True, **kwargs):
        super(GAT, self).__init__(aggr='add', **kwargs)
        self.in_channels = in_channels
        self.out_channels = int(out_channels / num_heads)
        self.heads = num_heads
        self.concat = concat
        self.dropout = dropout
        self.lin = nn.Linear(in_channels, self.out_channels * num_heads)
        # one attention vector a per head, applied to [W h_i || W h_j]
        self.att = nn.Parameter(torch.Tensor(1, self.heads, self.out_channels * 2))
        if bias and concat:
            self.bias = nn.Parameter(torch.Tensor(self.heads * self.out_channels))
        elif bias and not concat:
            self.bias = nn.Parameter(torch.Tensor(self.out_channels))
        else:
            self.register_parameter('bias', None)
        nn.init.xavier_uniform_(self.att)
        if self.bias is not None:
            nn.init.zeros_(self.bias)

    def forward(self, x, edge_index, size=None):
        if size is None and torch.is_tensor(x):
            edge_index, _ = pyg_utils.remove_self_loops(edge_index)
            edge_index, _ = pyg_utils.add_self_loops(edge_index, num_nodes=x.size(0))
        # \theta: shared linear transform for all heads
        x = self.lin(x)
        return self.propagate(edge_index, size=size, x=x)

    def message(self, edge_index_i, x_i, x_j, size_i):
        # \phi: compute attention coefficients
        x_i = x_i.view(-1, self.heads, self.out_channels)  # split hidden into heads
        x_j = x_j.view(-1, self.heads, self.out_channels)
        # concatenate [x_i || x_j], then inner product with the attention vector a
        alpha = (torch.cat([x_i, x_j], dim=-1) * self.att).sum(dim=-1)
        alpha = F.leaky_relu(alpha, 0.2)
        # pyg softmax normalizes per target node (calls scatter_add internally)
        alpha = pyg_utils.softmax(alpha, edge_index_i, num_nodes=size_i)
        alpha = F.dropout(alpha, p=self.dropout, training=self.training)
        return x_j * alpha.view(-1, self.heads, 1)  # attention-weighted messages

    def update(self, aggr_out):
        # \gamma: combine the heads
        if self.concat:
            aggr_out = aggr_out.view(-1, self.heads * self.out_channels)
        else:
            aggr_out = aggr_out.mean(dim=1)
        if self.bias is not None:
            aggr_out = aggr_out + self.bias
        return aggr_out
```
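And likewise for GAT on the same toy graph; note that `out_channels` is split evenly across the heads:

```python
gat = GAT(16, 32, num_heads=4)   # 4 heads of 8 channels each, concatenated
out = gat(x, edge_index)         # -> torch.Size([4, 32])
```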
Jure Leskovec, Stanford CS224W: Machine Learning with Graphs