Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

COMPSCI 752

Big Data Management

Assignment 3 / Semester 1, 2023

Knowledge Graph and Fundamentals Problems

1   Knowledge graph                                    [25 marks]

We will build a knowledge graph based on the text from Ninh’s homepage.

"Prior  to  joining University  of  Auckland  in  December  2018,  Ninh worked  in

Copenhagen  for  7  years  at  the University  of  Copenhagen  and  IT University  of Copenhagen .   He  received his  PhD  at  IT University  of  Copenhagen under  the    supervision  of  Professor  Rasmus  Pagh  in  2014 .   After  that, he  spent  4  years in postdoctoral positions  in  Copenhagen .   He was  the  recipient  of  the best  paper  awards  in WWW  Conference  2014  and  PKDD  2020 .   AMiner has  recognized    him  as  the  2022  AI  2000 Most  Influential  Scholar  Honorable Mention  in  Data  Mining  (Rising  Star)  for his  outstanding  and  vibrant  contributions  to  this  field between  2012  and  2021 . "

1. Unsupervised method:  Assume that nouns will be entities, and verbs form relations. Using NLP techniques (e.g. nltk package), write a small Python script to parse the above text into entities and relationships. [7 marks]

Construct a knowledge graph based on the parsing result. [3 marks]

2.  Supervised method: Using a pre-trained model (e.g. https://spacy .io/models/ en) to parse the above text for entities and verbs.        [7 marks]

Assume that verbs form relations, construct a knowledge graph based on the pars-   ing result.              [3 marks]

3. If we use some specific nouns as verbs, e.g. supervision,  award,  contribution, how do the constructed knowledge graphs above change?                      [5 marks]

Though handwriting graphs are acceptable (without any mark deduction), you are en- couraged to nd suitable tools to draw knowledge graphs.

Solutions:

See the Python code attached.

Note that the goal of this question is to motivate students on using public Python libraries for NLP tasks (nltk or spacy). Students solutions might be dierent de- pending on their parsing results. It is ne given the knowledge graph has most of basic information.

2    Odd sketch                                               [15 marks]

Given two set A and B, the symmetric difference size between A and B is lAl + lBl - 2lA n Bl. Consider 4 sets below:

S1 = {1, 2, 4, 6, 8} ,

S2 = {1, 2, 4, 6, 9} ,

S3 = {1, 2, 4, 7, 9} ,

S4 = {1, 3, 5, 7, 9} .

1. Finding all pairs of sets such that their symmetric difference size is at most 2. [3 marks]

Solution: Simply compute all pairs of similarity using the set dierence size, we have

sym(S1, S2) = 2 sym(S2, S3) = 2 sym(S3, S4) = 4

sym(S1, S3) = 4

sym(S2, S4) = 6

sym(S1, S4) = 8

We have the similar pairs {(S1, S2), (S2, S3)}.

2. Using the hash function h(x) = x + 1 mod  4, construct odd sketches of 4 bits length for these sets above.       [8 marks]

Solution: The odd sketches of the sets are

S1 = {0, 0, 1, 0}

S2 = {0, 1, 0, 0}

S3 = {1, 1, 0, 1}

S4 = {0, 0, 1, 0}

3. How can we use these odd sketches to solve the question 1? Explain your solution. [4 marks]

Solution: We observe that if the set dierence size of two sets is at mОst 2, the odd sketches of these two sets will be dierent at mОst 2 positions.

For example, considering two sets A1 = {1, 2, 4, 6}, A2 = {1, 2, 4, 6}, there odd sketches are identical. Now consider S1 = A1 n {8}, S2 = A2 n {9}. Since sym(S1, S2) = 2, there are mauimum two positions that their odd sketches are dierent (i.e. when 8 and 9 hash into dierent positions).

Hence we can use odd sketches to answer the question 1 by considering all pairs whose odd sketches are dierent at most 2 positions. Using their odd sketches, we have the result {(S1, S2), (S1, S4), (S2, S3), (S2, S4)}.

There might be the false positives introduced by Odd Sketches, e.g.  (S1, S4). It should be ne given the eciency of binary operations. You might lter out the false positives by computing the symmetric dierence of pairs returned by Odd Sketches or increasing the size of Odd Sketches.

3   PageRank                                                [10 marks]

We first use Graph 1 in Figure 1 for executing the PageRank algorithm with the damping factor d = 1.

Figure 1: Graph 1

1.  Compute the transition matrix M for Graph 1.                                        [1 mark]

2.  Compute the PageRank of each vertex until it converges.                      [4 marks]

3. We remove the edge from E to A and C to E in Graph 1 to form Graph 2 in Figure 2. Assume the sum of values of the initialized page rank vector uО is 1. We run PageRank on Graph 2 until it converges. How can we initialize uО such that, in the end, node E has

(a) the largest page rank.

(b) the smallest page rank.

Explain the result.                                                                                    [5 marks]

Figure 2: Graph 2

Solution:

z 0     0   0     0     1r

'(')1/3   0   0   1/2   0'(')

1.  M =  '(')1/3   0   0   1/2   0 '(')

'                             '

' 0     0   1     0     0'

2. We compute u = MuО until it converges to {0.1875, 0.1875, 0.1875, 0.2500, 0.1875}.

3. We construct new transition matrix.

z 0     0   0     0     0r

'(')1/3   0   0   1/2   0'(')

M\  =  '(')1/3   0   0   1/2   0'(') .

'                             '

' 0     0   0     0     0'

Whatever we initialize the vector uО , there are always lost of contributions due to the rst zero-row and last zero-row.  In the end, all nodes have the same page rank, which is 0.