COMPSCI 752 Big Data Management Assignment 3 / Semester 1, 2023
Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit
COMPSCI 752
Big Data Management
Assignment 3 / Semester 1, 2023
Knowledge Graph and Fundamentals Problems
1 Knowledge graph [25 marks]
We will build a knowledge graph based on the text from Ninh’s homepage.
"Prior to joining University of Auckland in December 2018, Ninh worked in
Copenhagen for 7 years at the University of Copenhagen and IT University of Copenhagen . He received his PhD at IT University of Copenhagen under the supervision of Professor Rasmus Pagh in 2014 . After that, he spent 4 years in postdoctoral positions in Copenhagen . He was the recipient of the best paper awards in WWW Conference 2014 and PKDD 2020 . AMiner has recognized him as the 2022 AI 2000 Most Influential Scholar Honorable Mention in Data Mining (Rising Star) for his outstanding and vibrant contributions to this field between 2012 and 2021 . "
1. Unsupervised method: Assume that nouns will be entities, and verbs form relations. Using NLP techniques (e.g. nltk package), write a small Python script to parse the above text into entities and relationships. [7 marks]
Construct a knowledge graph based on the parsing result. [3 marks]
2. Supervised method: Using a pre-trained model (e.g. https://spacy .io/models/ en) to parse the above text for entities and verbs. [7 marks]
Assume that verbs form relations, construct a knowledge graph based on the pars- ing result. [3 marks]
3. If we use some specific nouns as verbs, e.g. supervision, award, contribution, how do the constructed knowledge graphs above change? [5 marks]
Though handwriting graphs are acceptable (without any mark deduction), you are en- couraged to find suitable tools to draw knowledge graphs.
Solutions: See the Python code attached. Note that the goal of this question is to motivate students on using public Python libraries for NLP tasks (nltk or spacy). Student’s solutions might be different de- pending on their parsing results. It is fine given the knowledge graph has most of basic information. |
2 Odd sketch [15 marks]
Given two set A and B, the symmetric difference size between A and B is lAl + lBl - 2lA n Bl. Consider 4 sets below:
S1 = {1, 2, 4, 6, 8} ,
S2 = {1, 2, 4, 6, 9} ,
S3 = {1, 2, 4, 7, 9} ,
S4 = {1, 3, 5, 7, 9} .
1. Finding all pairs of sets such that their symmetric difference size is at most 2. [3 marks]
Solution: Simply compute all pairs of similarity using the set difference size, we have |
||
sym(S1, S2) = 2 sym(S2, S3) = 2 sym(S3, S4) = 4 |
sym(S1, S3) = 4 sym(S2, S4) = 6 |
sym(S1, S4) = 8 |
We have the similar pairs {(S1, S2), (S2, S3)}. |
2. Using the hash function h(x) = x + 1 mod 4, construct odd sketches of 4 bits length for these sets above. [8 marks]
Solution: The odd sketches of the sets are S1 = {0, 0, 1, 0} S2 = {0, 1, 0, 0} S3 = {1, 1, 0, 1} S4 = {0, 0, 1, 0} |
3. How can we use these odd sketches to solve the question 1? Explain your solution. [4 marks]
Solution: We observe that if the set difference size of two sets is at mОst 2, the odd sketches of these two sets will be different at mОst 2 positions. For example, considering two sets A1 = {1, 2, 4, 6}, A2 = {1, 2, 4, 6}, there odd sketches are identical. Now consider S1 = A1 n {8}, S2 = A2 n {9}. Since sym(S1, S2) = 2, there are mauimum two positions that their odd sketches are different (i.e. when 8 and 9 hash into different positions). Hence we can use odd sketches to answer the question 1 by considering all pairs whose odd sketches are different at most 2 positions. Using their odd sketches, we have the result {(S1, S2), (S1, S4), (S2, S3), (S2, S4)}. There might be the false positives introduced by Odd Sketches, e.g. (S1, S4). It should be fine given the efficiency of binary operations. You might filter out the false positives by computing the symmetric difference of pairs returned by Odd Sketches or increasing the size of Odd Sketches. |
3 PageRank [10 marks]
We first use Graph 1 in Figure 1 for executing the PageRank algorithm with the damping factor d = 1.
Figure 1: Graph 1
1. Compute the transition matrix M for Graph 1. [1 mark]
2. Compute the PageRank of each vertex until it converges. [4 marks]
3. We remove the edge from E to A and C to E in Graph 1 to form Graph 2 in Figure 2. Assume the sum of values of the initialized page rank vector uО is 1. We run PageRank on Graph 2 until it converges. How can we initialize uО such that, in the end, node E has
(a) the largest page rank.
(b) the smallest page rank.
Explain the result. [5 marks]
Figure 2: Graph 2
Solution:
z 0 0 0 0 1r
'(')1/3 0 0 1/2 0'(')
1. M = '(')1/3 0 0 1/2 0 '(')
' '
' 0 0 1 0 0'
2. We compute u = MuО until it converges to {0.1875, 0.1875, 0.1875, 0.2500, 0.1875}.
3. We construct new transition matrix.
z 0 0 0 0 0r
'(')1/3 0 0 1/2 0'(')
M\ = '(')1/3 0 0 1/2 0'(') .
' '
' 0 0 0 0 0'
Whatever we initialize the vector uО , there are always lost of contributions due to the first zero-row and last zero-row. In the end, all nodes have the same page rank, which is 0.
2023-06-08