Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

COMPSCI 752

Big Data Management

Assignment 3 / Semester 1, 2023

Knowledge Graph and Fundamentals Problems

1    Knowledge graph                                       [25 marks]

We will build a knowledge graph based on the text from Ninh’s homepage.

"Prior  to  joining University  of  Auckland  in  December  2018,  Ninh worked  in

Copenhagen  for  7  years  at  the University  of  Copenhagen  and  IT University  of Copenhagen .    He  received  his  PhD  at  IT  University  of  Copenhagen  under  the

supervision  of  Professor  Rasmus  Pagh  in  2014 .    After  that,  he  spent  4  years in postdoctoral positions  in  Copenhagen .    He  was  the  recipient  of  the best   paper  awards  in  WWW  Conference  2014  and  PKDD  2020 .     AMiner  has  recognized

him  as  the  2022  AI  2000 Most  Influential  Scholar  Honorable Mention  in  Data Mining  (Rising  Star)  for his  outstanding  and  vibrant  contributions  to  this field between  2012  and  2021 . "

1. Unsupervised method: Assume that nouns will be entities,  and verbs form relations.  Using NLP techniques (e.g. nltk package), write a small Python script to parse the above text into entities and relationships.     [7 marks]

Construct a knowledge graph based on the parsing result.                      [3 marks]

2. Supervised method: Using a pre-trained model (e.g. https://spacy. io/models/ en) to parse the above text for entities and verbs.                                   [7 marks]

Assume that verbs form relations, construct a knowledge graph based on the pars-    ing result.        [3 marks]

3. If we use some specific nouns as verbs,e.g. supervision,  award,  contribution, how do the constructed knowledge graphs above change?                       [5 marks]

Though handwriting graphs are acceptable (without any mark deduction), you are en- couraged to find suitable tools to draw knowledge graphs.

Solutions:

See the Python code attached.

Note that the goal of this question is to motivate students on using public Python libraries for NLP tasks (nltk or spacy). Student’s solutions might be different de- pending on their parsing results.  It is fine given the knowledge graph has most of basic information.

2 Odd sketch [15 marks]

Given two set A and B, the symmetric difference size between A and B is lAl + lBl - 2lA n Bl .  Consider 4 sets below:

S1 = {1, 2, 4, 6, 8} ,

S2 = {1, 2, 4, 6, 9} ,

S3 = {1, 2, 4, 7, 9} ,

S4 = {1, 3, 5, 7, 9} .

1.  Finding all pairs of sets such that their symmetric difference size is at most 2. [3 marks]

Solution: Simply compute all pairs of similarity using the set difference size, we have

sym(S1, S2) = 2 sym(S2, S3) = 2

sym(S3, S4) = 4

sym(S1, S3) = 4

sym(S2, S4) = 6

sym(S1, S4) = 8

We have the similar pairs {(S1, S2), (S2, S3)}.

2.  Using the hash function h(x)  = x + 1  mod  4, construct odd sketches of 4 bits length for these sets above.       [8 marks]

Solution: The odd sketches of the sets are

S1 = {0, 0, 1, 0}

S2 = {0, 1, 0, 0}

S3 = {1, 1, 0, 1}

S4 = {0, 0, 1, 0}

3.  How can we use these odd sketches to solve the question 1?  Explain your solution. [4 marks]

Solution: We observe that if the set difference size of two sets is at most 2, the odd sketches of these two sets will be different at most 2 positions.

For example, considering two sets A1 = {1, 2, 4, 6}, A2 = {1, 2, 4, 6}, there odd sketches are identical. Now consider S1 = A1 n {8}, S2  = A2 n {9}.   Since sym(S1, S2) = 2, there are maximum two positions that their odd sketches are different (i.e. when 8 and 9 hash into different positions).

Hence we can use odd sketches to answer the question 1 by considering all pairs whose odd sketches are different at most 2 positions.  Using their odd sketches, we have the result {(S1, S2), (S1, S4), (S2, S3), (S2, S4)}.

There might be the false positives introduced by Odd Sketches, e.g.  (S1, S4). It should be ne given the eciency of binary operations. You might lter out the false positives by computing the symmetric difference of pairs returned by Odd Sketches or increasing the size of Odd Sketches.

3 PageRank [10 marks]

We first use Graph 1 in Figure 1 for executing the PageRank algorithm with the damping factor d = 1.

Figure 1: Graph 1

1. Compute the transition matrix M for Graph 1.

[1 mark]

2.  Compute the PageRank of each vertex until it converges.

[4 marks]

3. We remove the edge from E to A and C to E in Graph 1 to form Graph 2 in Figure 2. Assume the sum of values of the initialized page rank vector u0 is 1.  We run PageRank on Graph 2 until it converges. How can we initialize u0 such that, in the end, node E has

(a) the largest page rank.

(b) the smallest page rank.

Explain the result.                                                                                   [5 marks]

Figure 2: Graph 2

Solution:

0     0    0     0     1

' 1/3   0   0   1/2   0 '

1.  M = ' 1/3   0   0   1/2   0 '

' 1/3   1   0     0     0 '

' 0     0   1     0     0'

2. We compute u = Mu0 until it converges to {0.1875, 0.1875, 0.1875, 0.2500, 0.1875}.

3. We construct new transition matrix.

0     0    0     0     0

' 1/3   0   0   1/2   0 '

M/  = ' 1/3   0   0   1/2   0 ' .

' 1/3   1   0     0     0 '

' 0     0   0     0     0'

Whatever we initialize the vector u0 , there are always lost of contributions due to the first zero-row and last zero-row.  In the end, all nodes have the same page rank, which is 0.