
Graph: GNN review

More about Graph Neural Networks

Algebraic Representation of Graphs

1. Adjacency matrix

$$ A_{i j}= \begin{cases} 1 & \text { if }\lbrace v_{i}, v_{j}\rbrace \in E \text { and } i \neq j \cr 0 & \text { otherwise } \end{cases} $$
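
As a concrete illustration, here is a minimal NumPy sketch that builds the adjacency matrix of a small undirected graph; the edge list and node count are made up for the example.

```python
import numpy as np

# Hypothetical example: 4 nodes and undirected edges {v_i, v_j}
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4

A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = 1  # symmetric entries because the edges are undirected
    A[j, i] = 1
```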

2. Degree matrix: $D$ is a diagonal matrix, where

$$ D_{ii} = d(v_i) $$

3. Laplacian matrix

What and why Laplacian matrix

If we consider all edges in graph $G$ to be undirected, then the Laplacian matrix $L$ can be defined as

$$ L = D-A $$

Thus, we have the elements:

$$ L_{i j}=\begin{cases} d\left(v_{i}\right) & \text { if } i=j \cr -1 & \text { if }\lbrace v_{i}, v_{j}\rbrace \in E \text { and } i \neq j \cr 0 & \text { otherwise. } \end{cases} $$
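
Continuing with the same toy adjacency matrix, the degree matrix and the unnormalized Laplacian follow directly from the two definitions above; a minimal sketch:

```python
import numpy as np

# Adjacency matrix of the 4-node example graph from the previous sketch
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

D = np.diag(A.sum(axis=1))  # D_ii = d(v_i)
L = D - A                   # unnormalized graph Laplacian

# Sanity check: every row of L sums to zero
assert np.allclose(L.sum(axis=1), 0)
```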

4. Symmetric normalized Laplacian

The symmetric normalized Laplacian is defined as:

$$ \begin{aligned} L^{sym} &=D^{-\frac{1}{2}} L D^{-\frac{1}{2}} \cr &=I-D^{-\frac{1}{2}} A D^{-\frac{1}{2}} \end{aligned} $$

The elements are given by:

$$ L_{i j}^{s y m}=\begin{cases} 1 & \text { if } i=j \text { and } d\left(v_{i}\right) \neq 0 \cr -\frac{1}{\sqrt{d\left(v_{i}\right) d\left(v_{j}\right)}} & \text { if } \lbrace v_{i}, v_{j} \rbrace \in E \text { and } i \neq j \cr 0 & \text { otherwise. } \end{cases} $$

5. Random walk normalized Laplacian

$$ L^{rw} = D^{-1}L = I - D^{-1}A $$

The elements can be computed by:

$$ L_{i j}^{r w}= \begin{cases} 1 & \text { if } i=j \text { and } d\left(v_{i}\right) \neq 0 \cr -\frac{1}{d\left(v_{i}\right)} & \text { if }\lbrace v_{i}, v_{j}\rbrace \in E \text { and } i \neq j \cr 0 & \text { otherwise } \end{cases} $$
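
Both normalized Laplacians can be computed from the same toy graph; a minimal sketch, assuming every node has nonzero degree:

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
deg = A.sum(axis=1)
I = np.eye(len(A))

D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))  # valid because d(v_i) != 0 for every node here
D_inv = np.diag(1.0 / deg)

L_sym = I - D_inv_sqrt @ A @ D_inv_sqrt   # symmetric normalized Laplacian
L_rw = I - D_inv @ A                      # random walk normalized Laplacian
```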

6. Incidence Matrix

For a directed graph, the incidence matrix is defined as

$$ M_{i j}= \begin{cases} 1 & \text { if } \exists k \text { s.t. } e_{j}=\lbrace v_{i}, v_{k} \rbrace \cr -1 & \text { if } \exists k \text { s.t. } e_{j}=\lbrace v_{k}, v_{i} \rbrace \cr 0 & \text { otherwise. } \end{cases} $$

For an undirected graph, the corresponding incidence matrix satisfies

$$ M_{i j}=\begin{cases} 1 & \text { if } \exists k \text { s.t. } e_{j}=\lbrace v_{i}, v_{k}\rbrace \cr 0 & \text { otherwise. } \end{cases} $$
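
A small sketch of both incidence matrices on a toy graph; the edge ordering is an assumption of the example, and in the directed case each edge contributes $+1$ at the node it leaves and $-1$ at the node it enters:

```python
import numpy as np

# Hypothetical example: 3 nodes, edges e_0 = (v_0, v_1) and e_1 = (v_1, v_2)
edges = [(0, 1), (1, 2)]
n = 3

M_directed = np.zeros((n, len(edges)))
M_undirected = np.zeros((n, len(edges)))
for j, (i, k) in enumerate(edges):
    M_directed[i, j] = 1     # edge e_j leaves v_i
    M_directed[k, j] = -1    # edge e_j enters v_k
    M_undirected[i, j] = 1   # undirected case: both endpoints get 1
    M_undirected[k, j] = 1
```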

Vanilla Graph Neural Networks

A node is defined by its features and its related nodes in the graph. The aim of a GNN is to learn, for each node, a state embedding $h_v \in \mathbb{R}^s$ that encodes the information of its neighborhood. The state embedding $h_v$ is used to produce an output $o_v$, such as the distribution of the predicted node label.

Model

$$ h_v = f(X_v, X_{co[v]}, h_{ne[v]}, X_{ne[v]}) $$

$$ o_v = g(h_v, X_v) $$

  • $f$: local transition function, shared among all nodes
  • $g$: local output function
  • $X$: the input feature
  • $h$: hidden state
  • $co[v]$: the set of edges connected to node $v$
  • $ne[v]$: the set of neighbors of node $v$

Now, we have a compact form as

$$ \begin{aligned} H &= F(H,X) \cr O &= G(H, X_N) \end{aligned} $$

  • $F$: the global transition function
  • $G$: the global output function

GNN uses the classic iterative scheme to compute the state:

$$ H^{t+1} = F(H^t, X) $$
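
A schematic NumPy sketch of this fixed-point iteration; the transition map used here (degree-normalized neighbor aggregation through a small linear layer and a tanh) is a made-up contraction for illustration, not the parameterization from the original GNN paper.

```python
import numpy as np

n, s = 5, 8                               # number of nodes, state dimension (arbitrary)
rng = np.random.default_rng(0)

A = (rng.random((n, n)) < 0.4).astype(float)
np.fill_diagonal(A, 0)
A = np.maximum(A, A.T)                    # toy undirected adjacency matrix
A_norm = A / np.maximum(A.sum(axis=1, keepdims=True), 1)  # normalize the aggregation

X = rng.normal(size=(n, s))               # node input features
W = 0.1 * rng.normal(size=(s, s))         # small weights keep the update a contraction

H = np.zeros((n, s))                      # initial states
for _ in range(100):
    # Each node aggregates its neighbors' states and combines them with its own features.
    H_new = np.tanh(A_norm @ H @ W + X)
    if np.linalg.norm(H_new - H) < 1e-6:  # (approximate) fixed point reached
        H = H_new
        break
    H = H_new
```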

The next question is how to learn the parameters of $f$ and $g$. The loss can be written as

$$ loss = \sum_{i = 1}^p(t_i - o_i) $$

where

  • $p$: the number of supervised nodes
  • the states $h_{v}^{t}$ are iteratively updated until a time step $T$.

Graph Convolutional Networks

Graph convolutional networks fall into spectral approaches and spatial approaches. Four classic spectral models are covered first: Spectral Network, ChebNet, GCN, and AGCN.

Spectral approaches

Spectral Network

The convolution operation is defined in the Fourier domain by computing the eigendecomposition of the graph Laplacian. It can be defined as the multiplication of a signal $\mathbf{x}$ (a scalar for each node) with a filter $g_{\theta} = \text{diag}(\theta)$:

$$ \mathbf{g}_{\theta} \star \mathbf{x}=\mathbf{U} \mathbf{g}_{\theta}(\Lambda) \mathbf{U}^{T} \mathbf{x} $$

  • $\mathbf{U}$: the matrix of eigenvectors of the normalized graph Laplacian $L = I_N - D^{-\frac{1}{2}}AD^{-\frac{1}{2}} = \mathbf{U}\Lambda\mathbf{U}^T$
  • This operation results in potentially intense computations and non-spatially localized filters
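
A minimal sketch of this spectral filtering on a toy path graph; the filter coefficients $\theta$ (one per eigenvalue) are arbitrary values chosen for illustration.

```python
import numpy as np

# Toy symmetric normalized Laplacian of a 4-node path graph
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
deg = A.sum(axis=1)
L = np.eye(4) - np.diag(deg ** -0.5) @ A @ np.diag(deg ** -0.5)

eigvals, U = np.linalg.eigh(L)               # eigendecomposition L = U Lambda U^T

x = np.array([1.0, 0.0, 0.0, 0.0])           # a scalar signal on the nodes
theta = np.array([0.5, 1.0, 0.0, 2.0])       # made-up spectral filter parameters

filtered = U @ np.diag(theta) @ U.T @ x      # g_theta * x in the Fourier domain
```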

CHEBNET

The operation is

$$ \mathbf{g_\theta} \star \mathbf{x} \approx \sum_{k=0}^{K} \boldsymbol{\theta}_{k} \mathbf{T}_{k}(\tilde{\mathbf{L}}) \mathbf{x} $$

  • $\tilde{\mathbf{L}} = \frac{2}{\lambda_{max}}L - I_N$
  • $\lambda_{max}$: the largest eigenvalue of $L$.
  • $\theta$: a vector of Chebyshev coefficients
  • $T_{k}(x)$: the Chebyshev polynomials, defined by the recurrence $T_k(x) = 2xT_{k-1}(x) - T_{k-2}(x)$ with $T_0(x) = 1$, $T_1(x) = x$.
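
A sketch of the Chebyshev filtering using the recurrence above, on a toy graph with made-up coefficients; $\lambda_{max}$ is computed exactly here rather than approximated.

```python
import numpy as np

A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
deg = A.sum(axis=1)
L = np.eye(4) - np.diag(deg ** -0.5) @ A @ np.diag(deg ** -0.5)

lam_max = np.linalg.eigvalsh(L).max()
L_tilde = 2.0 / lam_max * L - np.eye(4)      # rescaled Laplacian

x = np.array([1.0, 0.0, 0.0, 0.0])
K = 3
theta = np.array([1.0, 0.5, 0.25, 0.125])    # made-up Chebyshev coefficients theta_0..theta_K

# Chebyshev recurrence: T_0 x = x, T_1 x = L_tilde x, T_k = 2 L_tilde T_{k-1} - T_{k-2}
T_prev, T_curr = x, L_tilde @ x
out = theta[0] * T_prev + theta[1] * T_curr
for k in range(2, K + 1):
    T_prev, T_curr = T_curr, 2 * L_tilde @ T_curr - T_prev
    out = out + theta[k] * T_curr
```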

GCN

$$ \mathbf{g_\theta^{\prime}} \star \mathbf{x} \approx \theta_{0}^{\prime} \mathbf{x}+\theta_{1}^{\prime}\left(\mathbf{L}-\mathbf{I}_{N}\right) \mathbf{x}=\theta_{0}^{\prime} \mathbf{x}-\theta_{1}^{\prime} \mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}} \mathbf{x} $$

Then, constraining the number of parameters with $\theta = \theta_{0}^{\prime} = -\theta_{1}^{\prime}$, we get

$$ \mathbf{g_\theta} \star \mathbf{x} \approx \theta\left(\mathbf{I}_{N}+\mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}}\right) \mathbf{x} $$

This operator could lead to numerical instabilities and exploding/vanishing gradients, so the renormalization trick is introduced: $I_N + \mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}} \rightarrow \tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}}$, with $\tilde{A} = A + I_N$ and $\tilde{D}_{ii} = \sum_j\tilde{A}_{ij}$. Finally,

$$ \mathbf{Z}=\tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}} \mathbf{X} \Theta $$

  • $\Theta \in \mathbb{R}^{C \times F}$: a matrix of filter parameters
  • $\mathbf{Z} \in \mathbb{R}^{N \times F}$: the convolved signal matrix
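
A minimal forward pass of this layer on random features and weights; the ReLU on top is an assumption about the surrounding architecture, not part of the propagation rule itself.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, F = 4, 3, 2                          # nodes, input channels, output filters

A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)  # toy path graph
X = rng.normal(size=(N, C))                # input signal matrix
Theta = rng.normal(size=(C, F))            # filter parameters (learned in practice)

A_tilde = A + np.eye(N)                            # renormalization trick: add self-loops
D_inv_sqrt = np.diag(A_tilde.sum(axis=1) ** -0.5)  # D~^{-1/2}
Z = D_inv_sqrt @ A_tilde @ D_inv_sqrt @ X @ Theta  # convolved signal matrix
H = np.maximum(Z, 0)                               # e.g. a ReLU nonlinearity (assumption)
```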

AGCN

The Adaptive Graph Convolution Network (AGCN) is proposed to learn the underlying relations. AGCN learns a "residual" graph Laplacian $\mathbf{L}_{res}$ and adds it to the original Laplacian matrix:

$$ \widehat{\mathbf{L}}=\mathbf{L}+\alpha \mathbf{L}_{r e s} $$

  • $\mathbf{L}_{res}$ is computed from a learned graph adjacency matrix $\widehat{\mathbf{A}}$

$$ \begin{aligned} \mathbf{L_{res}} &=\mathbf{I}-\widehat{\mathbf{D}}^{-\frac{1}{2}} \widehat{\mathbf{A}} \widehat{\mathbf{D}}^{-\frac{1}{2}} \cr \widehat{\mathbf{D}} &=\operatorname{degree}(\widehat{\mathbf{A}}) \end{aligned} $$

  • $\widehat{\mathbf{A}}$ is computed via a learned metric

The idea behind the adaptive metric is that the Euclidean distance is not suitable for graph-structured data, so the metric should be adaptive to the task and input features. AGCN uses the generalized Mahalanobis distance

$$ D(\mathbf x_{i}, \mathbf x_{j})= \sqrt{ (\mathbf x_{i}-\mathbf x_{j})^{T} \mathbf{M}(\mathbf x_{i}-\mathbf x_{j})} $$

where $M$ is a learned parameter that satisfies $M = W_d W_{d}^T$.
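
A sketch of the learned distance; $W_d$ is random here rather than learned, and the Gaussian kernel used to turn distances into a dense adjacency is one plausible choice for illustration, not necessarily the exact construction in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 5, 4, 2                       # nodes, feature dim, rank of W_d (all arbitrary)
X = rng.normal(size=(n, d))             # node features x_i
W_d = rng.normal(size=(d, r))           # learned in practice; random stand-in here
M = W_d @ W_d.T                         # M = W_d W_d^T, positive semi-definite

# Generalized Mahalanobis distance between every pair of node features
diff = X[:, None, :] - X[None, :, :]              # (n, n, d) pairwise differences x_i - x_j
D2 = np.einsum('ijd,de,ije->ij', diff, M, diff)   # (x_i - x_j)^T M (x_i - x_j)
dist = np.sqrt(np.maximum(D2, 0))

# One possible way (an assumption) to map distances to a dense learned adjacency
sigma = 1.0
A_hat = np.exp(-dist / (2 * sigma ** 2))
```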

Spatial Methods

Spatial approaches define convolutions directly on the graph, operating on spatially close neighbors. The major challenge of spatial approaches is defining the convolution operation over differently sized neighborhoods while maintaining the local invariance of CNNs.

Neural FPs

Neural FPs uses different weight matrices for nodes with different degrees:

$$ \begin{aligned} \mathbf{x} &=\mathbf{h}_{v}^{t-1}+\sum_{i=1}^{\left|N_{v}\right|} \mathbf{h}_{i}^{t-1} \cr \mathbf{h}_{v}^{t} &=\sigma\left(\mathbf{x} \mathbf{W}_{t}^{\left|N_{v}\right|}\right) \end{aligned} $$
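
A sketch of this degree-specific update; the weight matrices indexed by neighbor count are random stand-ins for learned parameters, and tanh stands in for $\sigma$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, max_deg = 4, 3, 3
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = rng.normal(size=(n, d))                     # states h_v^{t-1} for all nodes

# One weight matrix per possible degree (learned in practice, random here)
W = {deg: rng.normal(size=(d, d)) for deg in range(1, max_deg + 1)}

H_new = np.zeros_like(H)
for v in range(n):
    neighbors = np.nonzero(A[v])[0]
    x = H[v] + H[neighbors].sum(axis=0)         # x = h_v^{t-1} + sum of neighbor states
    H_new[v] = np.tanh(x @ W[len(neighbors)])   # degree-specific weight matrix W_t^{|N_v|}
```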

Patchy-SAN

Patchy-SAN first selects and normalizes exactly $k$ neighbors for each node; the normalized neighborhood then serves as the receptive field, and the convolution operation is applied. The four steps are:

  1. Node sequence selection
  2. Neighorhood assembly
  3. Graph normalization
  4. Convolution architecture.

(Figure: the Patchy-SAN pipeline; /images/ml/patchy-san.png)

DCNN

The diffusion-convolutional neural network (DCNN) defines convolutions with a power series of the graph's transition matrix.

For node classification:

$$ H = \sigma(W^c \odot P^*X) $$

  • $P^*$: an $N \times K \times N$ tensor containing the power series $\lbrace P, P^2, \cdots, P^K \rbrace$ of the transition matrix $P$
  • $X$: an $N \times F$ matrix of input features
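
A sketch of the diffusion convolution for node classification on a toy graph; $P$ is taken to be the degree-normalized transition matrix $D^{-1}A$, and tanh stands in for $\sigma$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, F, K = 4, 3, 2
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(N, F))                      # input features
W_c = rng.normal(size=(K, F))                    # diffusion-convolution weights

P = A / A.sum(axis=1, keepdims=True)             # transition matrix D^{-1} A
P_star = np.stack([np.linalg.matrix_power(P, k) for k in range(1, K + 1)])  # {P, ..., P^K}

# For each node: K diffusion hops over the features, weighted elementwise by W^c
PX = np.einsum('knm,mf->nkf', P_star, X)         # shape (N, K, F)
H = np.tanh(W_c[None, :, :] * PX)                # node representations, shape (N, K, F)
```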

For graph classification:

$$ H = \sigma(W^c \odot 1^T_N P^* X /N) $$

$1_N$: an $N \times 1$ vector of ones.

DGCN

The dual graph convolutional network (DGCN) jointly considers the local consistency and the global consistency on graphs. It uses two convolutional networks to capture the local/global consistency and adopts an unsupervised loss to ensemble them.

The first network operates on the adjacency matrix:

$$ \mathbf{Z}=\tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}} \mathbf{X} \Theta $$

The second network replaces the adjacency matrix with a positive pointwise mutual information (PPMI) matrix $\mathbf{X}_P$:

$$ \mathbf H^{\prime}=\sigma(\mathbf D_{P}^{-\frac{1}{2}} \mathbf{X}_{P} \mathbf D_{P}^{-\frac{1}{2}} \mathbf H \Theta) $$

The two convolutions are assembled via the final loss function:

$$ L=L_{0}\left(\operatorname{Conv}_{A}\right)+\lambda(t) L_{r e g}\left(\operatorname{Conv}_{A}, \operatorname{Conv}_{P}\right) $$

$\lambda(t)$ is a dynamic weight that balances the importance of the two loss terms.

$$ L_{0}\left(\operatorname{Conv}_{A}\right)=-\frac{1}{\left|y_{L}\right|} \sum_{l \in y_{L}} \sum_{i=1}^{c} Y_{l, i} \ln \left(\widehat{Z}_{l, i}^{A}\right) $$

$$ L_{r e g}\left(\operatorname{Conv}_{A}, \operatorname{Conv}_{P}\right)=\frac{1}{n} \sum_{i=1}^{n}\left|\widehat{Z}_{i,:}^{P}-\widehat{Z}_{i,:}^{A}\right|^{2} $$
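
A sketch of how the two losses could be combined; the network outputs $\widehat{Z}^{A}$ and $\widehat{Z}^{P}$ are random placeholders, and the ramp-up schedule used for $\lambda(t)$ is a made-up example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, c = 6, 3                              # nodes, classes (arbitrary)
Z_A = rng.dirichlet(np.ones(c), size=n)  # placeholder softmax outputs of Conv_A
Z_P = rng.dirichlet(np.ones(c), size=n)  # placeholder softmax outputs of Conv_P

labeled = [0, 1, 2]                      # indices of the supervised nodes y_L
Y = np.eye(c)[[0, 2, 1]]                 # their one-hot labels

# Supervised cross-entropy on Conv_A over the labeled nodes
L0 = -np.mean(np.sum(Y * np.log(Z_A[labeled]), axis=1))

# Unsupervised regularizer: mean squared difference between the two predictions
L_reg = np.mean(np.sum((Z_P - Z_A) ** 2, axis=1))

t, T = 10, 100
lam = min(1.0, t / T)                    # hypothetical ramp-up for lambda(t)
loss = L0 + lam * L_reg
```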

LGCN

learnable graph convolutional networks

MONET

MoNet is a spatial-domain model on non-Euclidean domains that can generalize several previous networks: GCNN, ACNN, GCN, and DCNN.

GRAPHSAGE

GraphSAGE is a general inductive framework that generates embeddings by sampling and aggregating features from a node's local neighborhood:

$$ \begin{aligned} \mathbf h_{N_{v}}^{t} &=\text { AGGREGATE }_{t}\left(\lbrace \mathbf{h}_{u}^{t-1}, \forall u \in N_{v}\rbrace \right) \cr \mathbf{h}_{v}^{t} &=\sigma\left(\mathbf{W}^{t} \cdot\left[\mathbf{h}_{v}^{t-1} | \mathbf{h}_{N_{v}}^{t}\right]\right) \end{aligned} $$

  1. Mean aggregator

$$ \mathbf h_{v}^{t}=\sigma\left(\mathbf{W} \cdot \operatorname{MEAN}\left(\lbrace \mathbf{h}_{v}^{t-1}\rbrace \cup\lbrace \mathbf{h}_{u}^{t-1}, \forall u \in N_{v}\rbrace \right)\right) $$

  2. LSTM aggregator

  3. Pooling aggregator

$$ \mathbf h_{N_{v}}^{t} =\max \left(\lbrace \sigma\left(\mathbf{W}_{\mathrm{pool}} \mathbf{h}_{u}^{t-1}+\mathbf{b}\right), \forall u \in N_{v}\rbrace \right) $$
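
A sketch of one GraphSAGE layer following the general update above, using a mean over neighbor states as the AGGREGATE step; the weights are random stand-ins and tanh stands in for $\sigma$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 4, 3, 5
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)  # toy adjacency matrix
H = rng.normal(size=(n, d_in))              # previous states h_v^{t-1}
W = rng.normal(size=(2 * d_in, d_out))      # W^t, applied to the concatenation

H_new = np.zeros((n, d_out))
for v in range(n):
    neighbors = np.nonzero(A[v])[0]
    h_neigh = H[neighbors].mean(axis=0)     # AGGREGATE_t: mean of neighbor states
    h_cat = np.concatenate([H[v], h_neigh]) # [h_v^{t-1} | h_{N_v}^t]
    H_new[v] = np.tanh(h_cat @ W)           # tanh stands in for sigma
```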

An unsupervised loss function encourages nearby nodes to have similar representations while distant nodes have distinct representations:

$$ J_{G}\left(\mathbf{z}_{u}\right)=-\log \left(\sigma\left(\mathbf{z}_{u}^{T} \mathbf{z}_{v}\right)\right)-Q \cdot E_{v_{n} \sim P_{n}(v)} \log \left(\sigma\left(-\mathbf{z}_{u}^{T} \mathbf{z}_{v_{n}}\right)\right) $$
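
A sketch of this loss for a single node $u$; the embeddings, the co-occurring node $v$, and the negative-sampling distribution $P_n(v)$ are all placeholders, with the expectation approximated by an average over $Q$ samples.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, Q = 10, 16, 5                        # nodes, embedding dim, negative samples
Z = rng.normal(size=(n, d))                # placeholder embeddings z_u


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


u, v = 0, 1                                # v: a node that co-occurs with u nearby
negatives = rng.integers(2, n, size=Q)     # v_n ~ P_n(v), here a uniform placeholder

positive_term = -np.log(sigmoid(Z[u] @ Z[v]))
negative_term = -Q * np.mean(np.log(sigmoid(-Z[negatives] @ Z[u])))
J_u = positive_term + negative_term
```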