闪电代写 -代写CS作业_CS代写_Finance代写_Economic代写_Statistics代写_代码代做_IT代写_加急帮助

Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

COMPSCI 752

Big Data Management

Assignment 4 / Semester 1, 2023

MapReduce and Data Stream

1 Similarity join [25 marks]

Consider the collection of 4 pre-processed documents. Each document is represented as a list of terms together with its weight as follows:

d1 {(andrew, 6), (teach, 4), (logic, 2), (database, 5)}

d2 {(peter, 6), (study, 2), (python, 6), (logic, 3)}

d3 {(andrew, 5), (teach, 1), (python, 5)}

d4 {(peter, 5), (study, 4), (like, 2), (python, 5)}

We represent each document as a high-dimensional vector and use the inner product as the similarity measure. We will use the preﬁx-ﬁltering algorithm to ﬁnd similar pairs of documents whose inner product values are larger than a threshold t.

1. Compute the similarity of each pair of non-identical documents. [5 marks]

2. Assume that the dimension indexes are as follows:

{(andrew, 1), (peter, 2), (teach, 3), (study, 4), (like, 5),

(logic, 6), (database, 7), (python, 8)}.

Construct the signature of each document given the threshold t = 37. What are the candidate pairs that we need to compute the similarity? [10 marks]

3. There are several way to index the dimensions, for example

{(peter, 1), (andrew, 2), (teach, 3), (study, 4), (like, 5),

(logic, 6), (database, 7), (python, 8)}.

Are there any way to index the dimensions so that we use less space to store our inverted index? Explain your solution. [10 marks]

Solution:

1. We compute all pair similarity as follows.

sim(d1, d2) = 6, sim(d1, d3) = 34, sim(d1, d4) = 0

sim(d2, d3) = 30, sim(d2, d4) = 68

sim(d3, d4) = 25

2. First, we need to construct the maximum vector m and sort the documents based on the dimension indexes. It makes the signature construction easier.

Dim = {(andrew, 1), (peter, 2), (teach, 3), (study, 4), (like, 5),

(logic, 6), (database, 7), (python, 8)}

m = {(andrew, 6), (peter, 6), (teach, 4), (study, 2), (like, 2),

(logic, 3), (database, 5), (python, 6)}

d1 = {(andrew, 6), (teach, 4), (logic, 2), (database, 5)}

d2 = {(peter, 6), (study, 2), (logic, 3), (python, 6)}

d3 = {(andrew, 5), (teach, 1), (python, 5)}

d4 = {(peter, 5), (study, 4), (like, 2), (python, 5)}

We construct the signatures of the documents are as follows:

sig(d1) = {(teach, 4), (logic, 2), (database, 5)} p(d1) = 2

sig(d2) = {(study, 2), (logic, 3), (python, 6)} p(d2) = 3

sig(d3) = {(python, 5)} p(d3) = 7

sig(d4) = {(study, 4), (like, 2), (python, 5)} p(d4) = 3

The p(d1), p(d2), p(d3), p(d4) are not important when the signatures are correct. Constructing the inverted index on these 4 signatures, the candidate pairs are (d1, d2), (d2, d3), (d2, d4), (d3, d4).

3. Since the data set is very sparse, we can index the dimensions based on their sparsity/occurrences, e.g. (database, 1), (logic, 2), (like, 3)...

By doing this way, the inverted index tends to contain frequent terms and hence we can reduce the storage space of our data structure

An alternative solution might be that the way of indexing the dimensions highly depends on the weight of terms. If we select terms with small weights ﬁrst, then we can ﬁlter out many terms and only index terms with large weight.

2 MapReduce [10 marks]

Consider the transaction checkout at Pak’nSave. For each item sold, it generates a record in the form [Item-ID, Supplier-ID, Quantity, Price], where Item-ID is the unique identiﬁer of an item, Supplier-ID is the unique identiﬁer of a supplier, Quantity is the number of items Item-ID purchased, and Price is the sales price. Assume that Pak’nSave has accumulated many terabytes of the transaction data over a year. As a data scientist, you want to know the following information for decision marking.

For each Supplier-ID, compute the average sales price of all items provided by Supplier- ID.

For example, given three records of Supplier-1: [Item-1, Supplier-1, 5, 50], [Item-2, Supplier-1, 2, 2], [Item-1, Supplier-1, 10, 100], the average sales price for Supplier-1 is (50 + 2 + 100) / (5 + 2 + 10) = 152/17.

Write a MapReduce function to solve the question above. For Map() and Reduce(), indicate the key and value of both input and output.

Solution:

● For the map function, we need to output Supplier-ID as the key, value is the combination between the price and quantity.

● Map.Input = (key = null, value = transaction); Map.Output = (key = Supplier- ID, value = pair(Quantity, Price))

● For the reduce function, we need sum of quantities, sum of prices, and then com- pute the ratio.

● Reduce.Input = (key = Supplier-ID, value = (Quantity, Price). Reduce.Output = (key = Supplier-ID, value = sum of prices / sum of quantities)

3 Data stream [15 marks]

In our lecture, we have studied the reservoir sampling which samples an element from a stream of size m with the same probability. If we use the reservoir sampling with the summary size s = 1, each element of a stream will be sampled with probability 1/m. We name this method as RS1 . The generalized version of reservoir sampling with the summary size s > 1 guarantees that each element in a stream will be sampled with the same probability s/m. We name this method as RSs.

In the assignment, we consider a new algorithm, called pRS1 , that simulates RSs for s > 1 by running s independent RS1 instances in parallel. pRS1 also uses a summary of size s, as shown in Figure 1.

1. As a function of m and s, what is the probability an element of a stream is sampled by pRS1 ? [10 marks]

2. Let fi be the number of occurrences of the element ai in a stream. Explain how we can use RSs and pRS1 to estimate fi. [5 marks]

Solution:

Figure 1: Illustration of how pRS1 works.

● Consider a ﬁrst cell of the summary, the probability an element of a stream has been sample is 1/m. Since we have s cells, the probability an element has not been sampled is (1 _ 1/m)s. Hence, an element is sampled by pRS1 with probability 1 _ (1 _ 1/m)s.

● We deﬁne a random variable Xj , j e [s] where Xj = 1 if ai appears at the cell j of our summary; otherwise, Xj = 0. We have E[Xj ] = fi /m. Let X =j Xj be the random variable corresponding to the number of occurrences of ai in our summary. We have E[X] =j E[Xj ] = sfi /m. Hence we estimate fi by Xm/s. The estimator is the same for both approaches even though the sampling probabilities are slightly diﬀerent.