Homework 1 GNN Expressiveness
Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit
Homework 1
1 GNN Expressiveness (28 points)
For Q1.1, write down number of layers needed. For Q1.2, write down the transition matrix M and the limiting distribution r. For Q1.3 and 1.4, write down the transition matrix w.r.t A and D. For Q1.5, write down your proof in a few sentences (equations if necessary). For Q1.6, describe the message function, aggregate function, and update rule in a few sentences or equations.
Graph Neural Networks (GNNs) are a class of neural network architectures used for deep learning on graph-structured data. Broadly, GNNs aim to generate high-quality embeddings of nodes by iteratively aggregating feature information from local graph neighborhoods using neural networks; embeddings can then be used for recommendations, classification, link prediction, or other downstream tasks. Two important types of GNNs are GCNs (graph convolutional networks) and GraphSAGE (graph sampling and aggregation).
Let G = (V, E) denote a graph with node feature vectors Xu for u ∈ V. To generate the embedding for a node u, we use the neighborhood of the node as the computation graph. At every layer l , for each pair of nodes u ∈ V and its neighbor v ∈ V , we compute a message function via neural networks, and apply a convolutional operation that aggregates the messages from the node’s local graph neighborhood (Figure 1.1), and updates the node’s representation at the next layer. By repeating this process through K GNN layers, we capture feature and structural information from a node’s local K-hop neighborhood. For each of the message computation, aggregation, and update functions, the learnable parameters are shared across all nodes in the same layer.
We initialize the feature vector for node Xu based on its individual node attributes. If we already have outside information about the nodes, we can embed that as a feature vector. Otherwise, we can use a constant feature (vector of 1) or the degree of the node as the feature vector.
These are the key steps in each layer of a GNN:
. Message computation: We use a neural network to learn a message function between nodes. For each pair of nodes u and its neighbor v, the
neural network message function can be expressed as M(hu(k), hv(k), eu,v ). In
GCN and GraphSAGE, this can simply be σ(Whv + b), where W and b
are the weights and bias of a neural network linear layer. Here hu(k) refers to
the hidden representation of node u at layer k, and eu,v denotes available information about the edge (u,v), like the edge weight or other features. For GCN and GraphSAGE, the neighbors of u are simply defined as nodes that are connected to u. However, many other variants of GNNs have different definitions of neighborhood.
. Aggregation: At each layer, we apply a function to aggregate information from all of the neighbors of each node. The aggregation function is usually permutation invariant, to reflect the fact that nodes’ neighbors have no canonical ordering. In a GCN, the aggregation is done by a weighted sum, where the weight for aggregating from v to u corresponds to the (u,v) entry of the normalized adjacency matrix D−1/2AD −1/2 .
. Update: We update the representation of anode based on the aggregated representation of the neighborhood. For example, in GCNs, a multi-layer perceptron (MLP) is used; Graph-SAGE combines a skip layer with the MLP.
. Pooling: The representation of an entire graph can be obtained by adding a pooling layer at the end. The simplest pooling methods are just taking the mean, max, or sum of all of the individual node representations. This is usually done for the purposes of graph classification.
We can formulate the Message computation, Aggregation, and Update steps
hk+1 = σ(D−1/2AD −1/2hkW k ) (1)
where hk represents the matrix of activations in the k-th layer, D−1/2AD −1/2 is the normalized adjacency of graph G, Wk is a layer-specific learnable matrix, and σ is a non-linearity function. Dropout and other forms of regularization can also be used.
We provide the pseudo-code for GraphSAGE embedding generation below. This will also be relevant to the questions below.
In this question, we investigate the effect of the number of message passing layers on the expressive power of Graph Convolutional Networks. In neural networks, expressiveness refers to the set of functions (usually the loss function for classification or regression tasks) a neural network is able to compute, which depends on the structural properties of a neural network architecture.
.
1.1 Effect of Depth on Expressiveness (4 points)
Consider the following 2 graphs in figure 1.2, where all nodes have 1-dimensional initial feature vector x = [1]. We use a simplified version of GNN, with no non- linearity, no learned linear transformation, and sum aggregation. Specifically, at every layer, the embedding of node v is updated as the sum over the embed-
dings of its neighbors (Nv ) and its current embedding hv(t) to get hv(t)+1 . We run
the GNN to compute node embeddings for the 2 red nodes respectively. Note that the 2 red nodes have different 5-hop neighborhood structure (note this is not the minimum number of hops for which the neighborhood structure of the 2 nodes differs). How many layers of message passing are needed so that these 2 nodes can be distinguished (i.e., have different GNN embeddings)? Explain your answer in a few sentences.
Figure 1.2: Figure for Question 1.1
1.2 Random Walk Matrix (4 points)
Consider the graph shown below (figure 1.3).
1. Assume that the current distribution over nodes is r = [0, 0, 1, 0], and after the random walk, the distribution is M · r. What is the random walk transition matrix M, where each row of M corresponds with the node ID in the graph?
2. What is the limiting distribution r, namely the eigenvector of M that has an eigenvalue of 1 (r = Mr)? Write your answer in fraction form or round it to the nearest thousandth place and in the following form, e.g. [1.200, 0.111, 0.462, 0.000]. Note that before reporting you should normalize r (Hint: r is a probability distribution).
Figure 1.3: Figure for Question 1.2
1.3 Relation to Random Walk (i) (4 points)
Let’s explore the similarity between message passing and random walks. Let
hl) be the embedding of node i at layer l. Suppose that we are using a mean aggregator for message passing, and omit the learned linear transformation and non-linearity: hl+1) = Σj∈Ni hl) . If we start at a node u and take a uniform random walk for 1 step, the expectation over the layer-l embeddings of nodes we can end up with is h+1) , exactly the embedding of u in the next
GNN layer. What is the transition matrix of the random walk? Describe the transition matrix using the adjacency matrix A, and degree matrix D, a diagonal matrix where Di,i is the degree of node i.
1.4 Relation to Random Walk (ii) (4 points)
Suppose that we add a skip connection to the aggregation from Question 1.3:
hl+1) = hl) + ji hl)
What is the new corresponding transition matrix?
1.5 Over-Smoothing Effect (5 points)
In Question 1.1 we see that increasing depth could give more expressive power. On the other hand, however, a very large depth also gives rise to the undesirable effect of over smoothing. Assume we are still using the aggregation function from
Question 1.3: hl+1) = Σj∈Ni hl). Show that the node embedding h(l) will
converge as l → ∞ . Here we assume that the graph is connected and has no bipartite components. We also assume that the graph is undirected.
Over-smoothing thus refers to the problem of node embedding convergence. Namely, if all node embeddings converge to the same value then we can no longer distinguish them and our node embeddings become useless for down- stream tasks. However, in practice, learnable weights, non-linearity, and other architecture choices can alleviate the over-smoothing effect.
Hint: Think about the Markov Convergence Theorem: Is the Markov chain irreducible and aperiodic? You don’t need to be super rigorous with your proof.
1.6 Learning BFS with GNN (7 points)
Next, we investigate the expressive power of GNN for learning simple graph algorithms. Consider breadth-first search (BFS), where at every step, nodes that are connected to already visited nodes become visited. Suppose that we use GNN to learn to execute the BFS algorithm. Suppose that the embeddings are 1-dimensional. Initially, all nodes have input feature 0, except a source node which has input feature 1. At every step, nodes reached by BFS have embedding 1, and nodes not reached by BFS have embedding 0. Describe a message function, an aggregation function, and an update rule for the GNN such that it learns the task perfectly.
2 Node Embedding and its Relation to Matrix Factorization (24 points)
What to submit: For Q2.1, one or more sentences/equations describ- ing the decoder. For Q2.2, write down the objective function. For Q2.3, describe the characteristics of W in one or more sentences. For Q2.4, write down the objective function. For Q2.5, characterize the embeddings, whether you think it will reflect structural similarity, and your justification. For Q2.6, one or more sentences for node2vec and struct2vec respectively. For Q2.7, one or more sentences of expla- nation. For Q2.8, one or more sentences characterizing embeddings from struct2vec.
Recall that matrix factorization and the encoder-decoder view of node em- beddings are closely related. For the embeddings, when properly formulating the encoder-decoder and the objective function, we can find equivalent matrix factorization formulation approaches.
Note that in matrix factorization we are optimizing for L2 distance; in encoder- decoder examples such as DeepWalk and node2vec, we often use log-likelihood as in lecture slides. The goal to approximate A with ZTZ is the same, but for this question, stick with the L2 objective function.
2.1 Simple matrix factorization (3 points)
In the simple matrix factorization, the objective is to approximate adjacency matrix A by the product of embedding matrix with its transpose. The opti- mization objective is minZ ||A − ZTZ|| 2 .
In the encoder-decoder perspective of node embeddings, what is the decoder? (Please provide a mathematical expression for the decoder)
2.2 Alternate matrix factorization (3 points)
In linear algebra, we define bilinear form as zi(T)Wzj , where W is a matrix.
Suppose that we define the decoder as the bilinear form, what would be the objective function for the corresponding matrix factorization? (Assume that the W matrix is fixed)
2.3 BONUS: Relation to eigen-decomposition (3 points)
Recall eigen-decomposition of a matrix (link). What would be the condition of W, such that the matrix factorization in the previous question (2.2) is equivalent to learning the eigen-decomposition of matrix A?
2.4 Multi-hop node similarity (3 points)
Define node similarity with the multi-hop definition: 2 nodes are similar if they are connected by at least one path of length at most k, where k is a parameter (e.g. k = 2). Suppose that we use the same encoder (embedding lookup) and decoder (inner product) as before. What would be the corresponding matrix factorization problem we want to solve?
2.5 node2vec & struct2vec (i) (3 points)
Finally, we’ll explore some limitations of node2vec that are introduced in the lecture, and look at algorithms that try to overcome them.
As mentioned in the lecture, due to the way random walk works, it’s hard for node2vec to learn structural embedding from the graph. Think about how a new algorithm called struct2vec works. For this question, we define a clique to be a fully connected graph, where any two nodes are connected.
Given a graph G(V, E), it defines K functions gk (u,v), k = 1, 2,..,K, which measure the structural similarity between nodes. The parameter k means that only the local structures within distance k of the node are taken into account. With all the nodes in G, regardless of the existing edges, it forms a new clique graph where any two nodes are connected by an edge whose weight is equal to the structural similarity between them. Since struct2vec defines K structural similarity functions, each edge has a set of possible weights corresponding to g1 , g2 ,...,gK .
The random walks are then performed on the clique. During each step, weights are assigned according to different gk ’s selected by some rule (omitted here for simplification). Then, the algorithm chooses the next node with probability proportional to the edge weights.
Characterize the vector representations (i.e. the embedding space) of the 10- node cliques after running the node2vec algorithm on the graph in figure 2.1. Assume through the random walk, nodes that are close to each other have sim- ilar embeddings. Do you think the node embeddings will reflect the structural similarity? Justify your answer.
Figure 2.1: Two 10-node cliques
2.6 node2vec & struct2vec (ii) (3 points)
In the above figure 2.1, suppose you arrive at node w. What are the nodes that you can reach after taking one step further with the node2vec algorithm? What about with the struct2vec algorithm (suppose that for this graph, gk (u,v) > 0 for any u,v,k)?
2.7 node2vec & struct2vec (iii) (3 points)
Why is it necessary to consider different gk ’s during the random walk?
2.8 node2vec & struct2vec (iv) (3 points)
Characterize the vector representations (i.e. the embedding space) of the two 10-node cliques after running the struct2vec algorithm on the graph in the above figure (Figure 2.1).
3 GCN (11 points)
Consider a graph G = (V, E), with node features x(v) for each v ∈ V. For each
node v ∈ V , let h = x(v) be the node’s initial embedding. At each iteration
k, the embeddings are updated as
hv) = AGGREGATE ({ hk−1) , Au ∈ Ⅵ(v)})
hk) = COMBINE(hk−1), hv)),
for some functions AGGREGATE( ·) and COMBINE( ·). Note that the argument
to the AGGREGATE( ·) function, hk−1) , Au ∈ Ⅵ(v), is a multi-set. That is,
since multiple nodes can have the same embedding, the same element can occur
in hk−1) , Au ∈ Ⅵ(v) multiple times. Finally, a graph itself may be embedded
by computing some function applied to the multi-set of all the node embeddings at some final iteration K, which we notate as
READOUT ({hK) , Av e V})
We want to use the graph embeddings above to test whether two graphs G1 = (V1 , E1 ) and G2 = (V2 , E2 ) are isomorphic. Recall that this is true if and only if there is some bijection ϕ : V1 一 V2 between nodes of G1 and nodes of G2 such that for any u,v e V1 ,
(u,v) e E1 今 (ϕ(u),ϕ(v)) e E2
The way we use the model above to test isomorphism is the following. For the two graphs, if their readout functions differ, that is
READOUT ({hK) , Av e V1 }) READOUT ({hK) , Av e V2 }) ,
we conclude the graphs are not isomorphic. Otherwise, we conclude the graphs are isomorphic. Note that this algorithm is not perfect: graph isomor- phism is thought to be hard! Below, we will explore the expressiveness of these graph embeddings.
3.1 Isomorphism Check (2 points)
Are the following two graphs isomorphic? If so, demonstrate an isomorphism between the sets of vertices. To demonstrate an isomorphism between two graphs, you need to find a 1-to-1 correspondence between their nodes and edges. If these two graphs are not isomorphic, prove it by finding a structure (node and/or edge) in one graph which is not present in the other.
3.2 Aggregation Choice (3 points)
The choice of the AGGREGATE( ·) is important for model above. Three common choices are:
AGGREGATEmax ({hk−1) , Au e Ⅵ(v)})i = u)
AGGREGATEmean ({hk−1) , Au e Ⅵ(v)}) =
AGGREGATEsum ({hk−1) , Au e Ⅵ(v)}) = Σ
u∈Ⅵ(v)
the expressiveness of the
(hk−1))i (element-wise max)
Σ (hk−1))
u∈Ⅵ(v)
(hk−1))
Give an example of two graphs G1 = (V1 , E1 ) and G2 = (V2 , E2 ) and their initial node features, such that for some node v1 ∈ V1 and some node v2 ∈ V2 with the
same initial features h) = h) , the updated features h) and h) are equal if
we use mean and max aggregation, but different if we use sum aggregation.
Hint: Your node features can be scalars rather than vectors, i.e. one dimen- sional node features instead of n-dimensional. Also, You are free to arbitrarily choose the number of nodes (e.g. 3 nodes),their connections (i.e. edges between nodes) in your example.
3.3 Weisfeiler-Lehman Test (6 points)
Our isomorphism-test algorithm is known to be at most as powerful as the well- known Weisfeiler-Lehman test (WL test). At each iteration, this algorithm updates the representation of each node to be the set containing its previous representation and the previous representations of all its neighbors. The full algorithm is below.
Prove that our neural model is at most as powerful as the WL test. More precisely, let G1 and G2 be non-isomorphic, and suppose that their node em- beddings are updated over K iterations with the same AGGREGATE( ·) and COMBINE( ·) functions. Show that if
READOUT ({ hK) , Av ∈ V1 }) READOUT ({ hK) , Av ∈ V2 }) ,
then the WL test also decides the graphs are not isomorphic.
Note: The proof has to be generic to any AGGREGATE, COMBINE, READOUT functions. Namely, it’s not sufficient to show this for a specific instance of the GNN model.
Hint: You can use proof by contradiction by first assuming that Weisfeiler- Lehman test cannot decide whether G1 and G2 are isomorphic at the end of K’th iteration.
2023-10-17
Node Embedding and its Relation to Matrix Factorization