COMP60711 Data Engineering RESIT

Section A

a) Give two arguments to persuade an IT Company’s CEO to invest financial resources in a new Data-Lake-based IT Architecture for his company. Make sure you add a few lines explaining each argument and justifying its relevance and potential impact on the IT Company.

b) Explain why you agree or disagree with the following statement, illustrating your answer with an example:

“Some data preparation strategies can negatively impact data analysis”.

c) Describe how parallelism can be used to scale out the calculation of the following data profiling operations, emphasising any challenges:

i. Second Quartile.

ii. Mean.
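As an illustration of the contrast this question points at, the following is a minimal sketch rather than a model answer. It assumes the data is already split into in-memory partitions; the `partitions` values and all function names are hypothetical. The mean can be assembled from per-partition `(sum, count)` pairs, whereas combining per-partition medians does not, in general, give the true second quartile.

```python
from functools import reduce
from statistics import median

# Hypothetical partitions of one column, e.g. held by two workers.
partitions = [[1, 2, 9], [3, 4, 5, 6, 7]]

def partial(part):
    """Per-partition summary: (sum, count). Computable independently."""
    return (sum(part), len(part))

def combine(a, b):
    """Merge two partial summaries; associative, so order does not matter."""
    return (a[0] + b[0], a[1] + b[1])

def parallel_mean(parts):
    """Mean obtained from partial summaries only."""
    total, count = reduce(combine, map(partial, parts))
    return total / count

def median_of_medians(parts):
    """Naive attempt: combine the local medians (generally NOT the true median)."""
    return median(median(p) for p in parts)

def exact_median(parts):
    """Reference value: gather all values, then take the median."""
    return median(x for p in parts for x in p)

print(parallel_mean(partitions))      # mean from (sum, count) pairs
print(median_of_medians(partitions))  # differs from the exact median here
print(exact_median(partitions))
```

On these partitions the naive median of medians differs from the exact median, which suggests why the second quartile needs a global ordering (or an approximate sketch) while the mean does not.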

d) Give an example of an unclean dataset that, when submitted to a Data Transformation step (in a Data Cleaning Process), ends up with more data discrepancies. Make sure you describe the dataset by providing schema information, instances (i.e., values), and the reason why the data is unclean, as well as details of the data transformation applied to the data and the discrepancy that the transformation generates.

e) Provide two example situations where multi-column data profiling is useful, providing an explanation for each.

Section B

a) In the context of business intelligence:

(i) Characterise the differences between OLTP and OLAP. Why might different database systems be used for each type?

(ii) Compare and contrast the use of row store and column store.

(iii) Discuss the current state of data warehousing and how it has adapted to the emerging “big and complex data” revolution, including many and often dynamic data sources and requirements for analytics.

b) In the context of classification:

(i) Outline a decision tree classification algorithm; discuss how the attribute used at each node is chosen and what effect different training sets may have.

(ii) Tests on a classifier give the following confusion matrix:

                         Predicted
                         Disease=yes   Disease=no    Total
Actual   Disease=yes              90          210      300
         Disease=no              140         9560     9700
         Total                   230         9770    10000

Calculate the precision, recall, and specificity for the classifier based on this table. Provide your working as appropriate.
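One way to check the working is sketched below. It assumes Disease=yes is the positive class and reads the four cell counts from the confusion matrix above; the variable and function names are illustrative only.

```python
# Cell counts read from the confusion matrix, with Disease=yes as positive.
TP = 90     # actual yes, predicted yes
FN = 210    # actual yes, predicted no
FP = 140    # actual no,  predicted yes
TN = 9560   # actual no,  predicted no

def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual positives that are detected (sensitivity)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of actual negatives that are correctly rejected."""
    return tn / (tn + fp)

print(f"precision   = {TP}/{TP + FP} = {precision(TP, FP):.4f}")
print(f"recall      = {TP}/{TP + FN} = {recall(TP, FN):.4f}")
print(f"specificity = {TN}/{TN + FP} = {specificity(TN, FP):.4f}")
```

The denominators match the table margins: 230 predicted positives, 300 actual positives and 9700 actual negatives.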

(iii) Explain the potential effects of unbalanced data on the usefulness of a classifier. Give an example to support your answer.

c) In the context of association rule (itemset) mining:

(i) Outline the working of the Apriori algorithm. Explain the importance of the subset property.

(ii) Using Apriori, suppose that L4 is the list:

{ {p,q,r,s}, {p,q,r,t}, {p,q,r,z}, {p,q,s,z}, {p,r,s,z}, {q,r,s,z},

{r,s,w,x}, {r,s,w,z}, {r,t,v,x}, {r,t,v,z}, {r,t,x,z}, {r,v,x,y},

{r,v,x,z}, {r,v,y,z}, {r,x,y,z}, {t,v,x,z}, {v,x,y,z} }

a. At the join step of the algorithm, which itemsets are placed in C5 (the candidate set)?

b. Which itemsets are discarded by the prune step of the algorithm?

Provide your working as appropriate.
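The join and prune steps can be checked mechanically with a sketch along the following lines. It assumes the common textbook formulation of candidate generation: items within an itemset are kept in alphabetical order, two frequent k-itemsets are joined when they share their first k-1 items, and a candidate is pruned if any of its k-subsets is not frequent. The function name `apriori_gen` and the string encoding of itemsets are choices made here for illustration.

```python
from itertools import combinations

# L4 from the question, each itemset written as a string of its items.
L4 = [
    "pqrs", "pqrt", "pqrz", "pqsz", "prsz", "qrsz",
    "rswx", "rswz", "rtvx", "rtvz", "rtxz", "rvxy",
    "rvxz", "rvyz", "rxyz", "tvxz", "vxyz",
]

def apriori_gen(frequent):
    """Return (joined, kept, pruned) candidates for the next level."""
    frequent = sorted(tuple(sorted(s)) for s in frequent)
    known = set(frequent)
    k = len(frequent[0])

    # Join step: merge pairs that agree on their first k-1 items.
    joined = []
    for a, b in combinations(frequent, 2):
        if a[:-1] == b[:-1]:
            joined.append(a + (b[-1],))

    # Prune step: every k-subset of a candidate must itself be frequent.
    kept, pruned = [], []
    for cand in joined:
        if all(sub in known for sub in combinations(cand, k)):
            kept.append(cand)
        else:
            pruned.append(cand)
    return joined, kept, pruned

joined, kept, pruned = apriori_gen(L4)
print("join  :", ["".join(c) for c in joined])
print("kept  :", ["".join(c) for c in kept])
print("pruned:", ["".join(c) for c in pruned])
```

Under these assumptions the join yields six candidates, of which {p,q,r,s,t}, {p,q,r,t,z} and {r,s,w,x,z} are pruned because {p,q,s,t}, {p,q,t,z} and {r,s,x,z} respectively are not in L4.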