
Graph: GCN and GAT

Graph Convolutional Network and Graph Attention Network

Why deep graph encoders?

Limitations of Shallow Encoders (e.g. node2vec)

  • $O( | V | )$ parameters are needed:
    • No sharing of parameters between nodes
    • Every node has its own unique embedding
  • Inherently “transductive”:
    • Cannot generate embeddings for nodes not seen during training
  • Do not incorporate node features
    • Many graphs have features that we can and should leverage

Graph Convolutional Network

Deep graph encoders can generate embeddings even for nodes unseen during training.

Aggregate neighbors: generate node embeddings based on local network neighborhoods.

  1. Intuition: Nodes aggregate information from their neighbors using neural networks. /images/ml/gcn1.png

  2. Computation graph: defined by each node's network neighborhood /images/ml/gcn2.png

  3. Layers: the model can be of arbitrary depth

    • Nodes have embeddings at each layer
    • Layer-0 embedding: Node $u$'s input feature $x_u$
    • Layer-K embedding gets information from nodes that are K hops away

/images/ml/gcn3.png

Training

  1. Unsupervised training
    • use only the graph structure
    • “similar” nodes have similar embeddings
    • the unsupervised loss function can be based on:
      • Random walks (node2vec, DeepWalk, struc2vec)
      • Graph factorization
      • Node proximity in the graph
  2. Supervised training
    • directly train the model for a supervised task (e.g. node classification), as sketched below
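
For the supervised case, a minimal sketch of what such training could look like, assuming a standard PyG Data object (with x, edge_index, y, and a train_mask) and layer sizes chosen purely for illustration:

# Minimal sketch (assumed setup): supervised node classification with a two-layer GCN.
# `data` is assumed to be a torch_geometric.data.Data object with x, edge_index, y, train_mask.
import torch
import torch.nn.functional as F
import torch_geometric.nn as pyg_nn

class TwoLayerGCN(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        self.conv1 = pyg_nn.GCNConv(in_dim, hidden_dim)
        self.conv2 = pyg_nn.GCNConv(hidden_dim, num_classes)

    def forward(self, x, edge_index):
        h = F.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)

def train(model, data, epochs=200, lr=0.01):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        optimizer.zero_grad()
        logits = model(data.x, data.edge_index)
        # supervised loss: cross-entropy on the labeled training nodes only
        loss = F.cross_entropy(logits[data.train_mask], data.y[data.train_mask])
        loss.backward()
        optimizer.step()
    return model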

GraphSAGE: Generalized neighborhood aggregation

/images/ml/gcn4.png

where,

  • $\mathbf{W}_k$, $\mathbf{B}_k$: learnable weight matrices.
  • $h_v^0 = x_v$: initial 0-th layer embeddings are equal to node features.
  • $\mathbf{h}_{v}^{k-1}$: previous layer embedding of $v$.
  • $\mathbf{z} _v = \mathbf{h} _{v}^{k}$: embedding after $k$ layers of neighborhood aggregation
  • $\sigma$: Non-linearity, e.g., ReLU

AGG:

  • Mean: take the average of the neighbors' previous-layer embeddings $$\text{AGG} = \sum _{u \in N(v)} \frac{ \mathbf{h} _{u}^{k-1} } { | N(v) | }$$

  • Pool: transform neighbor vectors and apply a symmetric vector function $$\text{AGG} = \gamma ([\mathbf{Q}\mathbf{h}_{u}^{k-1}, \forall u \in N(v)])$$

    • $\gamma$: element-wise mean/max
  • LSTM: apply an LSTM to a random permutation $\pi$ of the neighbors (all three variants are sketched below) $$\text{AGG} = \text{LSTM} ([\mathbf{h}_{u}^{k-1}, \forall u \in \pi(N(v))])$$
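
A minimal sketch of the three aggregators in plain PyTorch, assuming a toy neighborhood of 5 neighbors with 8-dimensional messages (all sizes and variable names here are illustrative assumptions):

# Minimal sketch (not from the original notes): the three AGG variants applied to a
# toy stack of neighbor messages h_neigh with shape [num_neighbors, dim].
import torch
import torch.nn as nn

dim = 8
h_neigh = torch.randn(5, dim)          # messages h_u^{k-1} from 5 neighbors of v

# Mean: average the neighbor messages
agg_mean = h_neigh.mean(dim=0)

# Pool: transform each message with Q, then take an element-wise max
Q = nn.Linear(dim, dim)
agg_pool = Q(h_neigh).max(dim=0).values

# LSTM: run an LSTM over a random permutation pi of the neighbors
lstm = nn.LSTM(input_size=dim, hidden_size=dim, batch_first=True)
perm = torch.randperm(h_neigh.size(0))
_, (h_last, _) = lstm(h_neigh[perm].unsqueeze(0))
agg_lstm = h_last.squeeze()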

Graph Attention Network

Simple Neighborhood Aggregation

The formula

$$ \mathbf{h} _{v}^{k} = \sigma (\mathbf{W} _{k} \sum _{u \in N(v)} \frac{ \mathbf{h} _{u}^{k-1}}{ | N(v) | } + \mathbf{B} _{k} \mathbf{h} _{v}^{k-1} ) $$

Equivalently, in matrix form:

$$ \mathbf{H} ^{(l+1)} = \sigma ( \mathbf{H} ^{(l)} \mathbf{W} _{0}^{(l)} + \tilde{\mathbf{A}} \mathbf{H} ^{(l)} \mathbf{W} _{1}^{(l)}) $$

with $\tilde{A} = D^{- \frac{1}{2}} A D^{- \frac{1}{2}}$; here $\mathbf{W} _{0}^{(l)}$ plays the role of the self term $\mathbf{B} _{k}$ and $\mathbf{W} _{1}^{(l)}$ the role of the neighbor term $\mathbf{W} _{k}$.
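
As a quick sketch of how $\tilde{A}$ is formed (the 3-node toy adjacency below is an assumption, used only for illustration):

# Minimal sketch (toy graph assumed): build the normalized adjacency D^{-1/2} A D^{-1/2}.
import torch

A = torch.tensor([[0., 1., 1.],
                  [1., 0., 0.],
                  [1., 0., 0.]])          # adjacency of a small undirected 3-node graph
deg = A.sum(dim=1)                        # node degrees
D_inv_sqrt = torch.diag(deg.pow(-0.5))    # D^{-1/2}
A_tilde = D_inv_sqrt @ A @ D_inv_sqrt     # normalized adjacency used in the matrix form

Multiplying by $D^{- \frac{1}{2}}$ on both sides rescales each entry $A_{ij}$ by $\frac{1}{\sqrt{\operatorname{deg}(i)} \sqrt{\operatorname{deg}(j)}}$, the per-edge weight that reappears in the node-wise GCN formula later in this post.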

Graph convolutional operator

  • Aggregates messages across the neighborhood $N(v)$
  • $\alpha _{vu} = \frac{1}{ | N(v) |}$ is the weighting factor of node $u$'s message to node $v$
  • $\alpha _{vu}$ is defined explicitly based on the structural properties of the graph
  • All neighbors $u \in N(v)$ are equally important to node $v$

Attention strategy

Attention allows for (implicitly) assigning different importance values ($\alpha_{vu}$) to different neighbors.

Compute the embedding $\mathbf{h}_{v}^{k}$ of each node in the graph as follows:

  • Nodes attend over their neighborhoods' messages
  • Implicitly specify different weights to different nodes in a neighborhood
  1. Attention Mechanism

    • Compute attention coefficients $e_{vu}$ across pairs of nodes $u$, $v$ based on their messages: $$e_{vu} = a (\mathbf{W}_{k} \mathbf{h}_{u}^{k-1}, \mathbf{W}_{k} \mathbf{h}_{v}^{k-1})$$
    • Normalize the coefficients using the softmax function so that they are comparable across different neighborhoods:

    $$ \begin{aligned} \alpha_{vu} &=\frac{ \exp (e_{v u} )} {\sum_{k \in N(v)} \exp (e_{v k} )} \cr \boldsymbol{h}_{v}^{k} &=\sigma (\sum_{u \in N(v)} \alpha_{v u} \boldsymbol{W}_{k} \boldsymbol{h}_{u}^{k-1} ) \end{aligned} $$

  2. Multi-head attention

    • Attention operations in a given layer are independently replicated R times (each replica with different parameters)
    • Outputs are aggregated by concatenating or averaging (both steps are sketched below)
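
A minimal sketch of both steps for a single node $v$, in plain PyTorch (the toy sizes, the 4-neighbor neighborhood, and the single-layer attention function $a$ are assumptions made only for illustration):

# Minimal sketch (assumed toy sizes): attention coefficients for one node v with 4
# neighbors, followed by a 2-head combination. Names (a, W, h) mirror the formulas above.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, heads = 8, 2
h_v = torch.randn(dim)                    # h_v^{k-1}
h_neigh = torch.randn(4, dim)             # h_u^{k-1} for u in N(v)

outputs = []
for _ in range(heads):                    # each head has its own W and a
    W = nn.Linear(dim, dim, bias=False)
    a = nn.Linear(2 * dim, 1, bias=False)
    # e_{vu} = a(W h_u, W h_v), computed for every neighbor u at once
    e = a(torch.cat([W(h_neigh), W(h_v).expand(4, -1)], dim=-1)).squeeze(-1)
    alpha = F.softmax(F.leaky_relu(e, 0.2), dim=0)              # normalize over N(v)
    outputs.append((alpha.unsqueeze(-1) * W(h_neigh)).sum(0))   # sum_u alpha_{vu} W h_u

h_v_new = torch.cat(outputs)              # concatenate the head outputs (or average them)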

GCN

For GCN,

$$X^{\prime} = \tilde{D} ^{- \frac{1}{2}} \tilde{A} \tilde{D}^{- \frac{1}{2}} X \Theta$$

In node-wise form, this is equivalent to

$$ \mathbf{x} _{i}^{(k)}=\sum _{j \in \mathcal{N}(i) \cup\lbrace i \rbrace } \frac{1}{\sqrt{\operatorname{deg}(i)} \cdot \sqrt{\operatorname{deg}(j)}} \cdot (\mathbf{\Theta} \cdot \mathbf{x} _{j}^{(k-1)} ) $$

import torch
import torch_geometric.nn as pyg_nn
import torch_geometric.utils as pyg_utils

class GCNConv(pyg_nn.MessagePassing):
    def __init__(self, in_channels, out_channels):
        super(GCNConv, self).__init__(aggr='add')  # sum aggregation over neighbors
        self.lin = torch.nn.Linear(in_channels, out_channels)

    def forward(self, x, edge_index):
        # add self-loops so every node also receives its own message
        edge_index, _ = pyg_utils.add_self_loops(edge_index, num_nodes=x.size(0))
        # initial feature transform: x -> x @ Theta
        x = self.lin(x)
        return self.propagate(edge_index, size=(x.size(0), x.size(0)), x=x)

    def message(self, x_j, edge_index, size):
        # \phi: scale each message by 1 / (sqrt(deg(i)) * sqrt(deg(j)))
        row, col = edge_index
        deg = pyg_utils.degree(row, size[0], dtype=x_j.dtype)
        deg_inv_sqrt = deg.pow(-0.5)
        norm = deg_inv_sqrt[row] * deg_inv_sqrt[col]
        return norm.view(-1, 1) * x_j

    def update(self, aggr_out):
        # \gamma: identity, the summed messages are the new embeddings
        return aggr_out
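
A quick usage check of the layer above (the random features and the tiny edge list are assumptions, not from the notes):

# Minimal usage sketch with assumed sizes: 4 nodes, 16 input features, 32 output features.
x = torch.randn(4, 16)
edge_index = torch.tensor([[0, 1, 1, 2, 2, 3],
                           [1, 0, 2, 1, 3, 2]])   # undirected edges stored as two directed rows
conv = GCNConv(in_channels=16, out_channels=32)
out = conv(x, edge_index)                         # -> shape [4, 32]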

GraphSAGE

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch_geometric.nn as pyg_nn

class GraphSage(pyg_nn.MessagePassing):
    """Non-minibatch version of GraphSage."""
    def __init__(self, in_channels, out_channels, reducer='mean',
                 normalize_embedding=True):
        super(GraphSage, self).__init__(aggr=reducer)  # neighbor aggregation (mean)
        # transform applied to the concatenation [h_v, AGG(h_u)]
        self.agg_lin = nn.Linear(in_channels + in_channels, out_channels)
        self.normalize_emb = normalize_embedding

    def forward(self, x, edge_index):
        num_nodes = x.size(0)
        return self.propagate(edge_index, size=(num_nodes, num_nodes), x=x)

    def message(self, x_j, edge_index, size):
        # \phi: pass the neighbor features through unchanged
        return x_j

    def update(self, aggr_out, x):
        # \gamma: concatenate self embedding with the aggregated neighbors, then transform
        concat_out = torch.cat((x, aggr_out), 1)
        aggr_out = F.relu(self.agg_lin(concat_out))
        if self.normalize_emb:
            aggr_out = F.normalize(aggr_out, p=2, dim=1)
        return aggr_out
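
The layer is used the same way as the GCN layer above (sizes again are illustrative assumptions):

# Minimal usage sketch; x and edge_index as in the GCNConv example above.
sage = GraphSage(in_channels=16, out_channels=32)
out = sage(x, edge_index)   # -> shape [4, 32]; rows are L2-normalized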

GAT

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch_geometric.nn as pyg_nn
import torch_geometric.utils as pyg_utils
from torch_geometric.utils import add_self_loops, remove_self_loops

class GAT(pyg_nn.MessagePassing):

    def __init__(self, in_channels, out_channels, num_heads=1, concat=True,
                 dropout=0, bias=True, **kwargs):
        super(GAT, self).__init__(aggr='add', **kwargs)

        self.in_channels = in_channels
        self.out_channels = int(out_channels / num_heads)  # per-head output size
        self.heads = num_heads
        self.concat = concat
        self.dropout = dropout

        # shared linear transform W_k, applied before splitting into heads
        self.lin = nn.Linear(in_channels, self.out_channels * num_heads)
        # attention vector a, one per head, applied to the concatenation [W h_i, W h_j]
        self.att = nn.Parameter(torch.Tensor(1, self.heads, self.out_channels * 2))

        if bias and concat:
            self.bias = nn.Parameter(torch.Tensor(self.heads * self.out_channels))
        elif bias and not concat:
            self.bias = nn.Parameter(torch.Tensor(self.out_channels))
        else:
            self.register_parameter('bias', None)

        nn.init.xavier_uniform_(self.att)
        if self.bias is not None:
            nn.init.zeros_(self.bias)

    def forward(self, x, edge_index, size=None):
        if size is None and torch.is_tensor(x):
            edge_index, _ = remove_self_loops(edge_index)
            edge_index, _ = add_self_loops(edge_index, num_nodes=x.size(0))
        # \theta: linear transform of the input features
        x = self.lin(x)
        return self.propagate(edge_index, size=size, x=x)

    def message(self, edge_index_i, x_i, x_j, size_i):
        # \phi: compute attention coefficients
        x_i = x_i.view(-1, self.heads, self.out_channels)  # split hidden state into heads
        x_j = x_j.view(-1, self.heads, self.out_channels)
        # concatenate [W h_i, W h_j], then take the inner product with the attention vector a
        alpha = (torch.cat([x_i, x_j], dim=-1) * self.att).sum(dim=-1)
        alpha = F.leaky_relu(alpha, 0.2)
        # PyG softmax: normalizes over each target node's neighborhood (scatter-based)
        alpha = pyg_utils.softmax(alpha, edge_index_i, size_i)
        alpha = F.dropout(alpha, p=self.dropout, training=self.training)
        return x_j * alpha.view(-1, self.heads, 1)  # attention-weighted messages

    def update(self, aggr_out):
        # \gamma: combine the heads, then add the bias
        if self.concat is True:
            aggr_out = aggr_out.view(-1, self.heads * self.out_channels)
        else:
            aggr_out = aggr_out.mean(dim=1)
        if self.bias is not None:
            aggr_out = aggr_out + self.bias
        return aggr_out
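
And a quick usage check of the attention layer (sizes are assumptions): with num_heads=4 and out_channels=32, each head produces 8 dimensions and the concatenated output is 32-dimensional.

# Minimal usage sketch; x and edge_index as in the GCNConv example above.
gat = GAT(in_channels=16, out_channels=32, num_heads=4)
out = gat(x, edge_index)    # -> shape [4, 32] (4 heads x 8 dims, concatenated)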

Reference

Jure Leskovec, Stanford CS224W: Machine Learning with Graphs