FIT3152 Mock eExam Questions

R Coding (10 Marks)

eExam Q1 (4 Marks)

The DunHumby (DH) data frame records the Date a Customer shops at a store, the number of Days since their last shopping visit, and amount Spent for 20 customers. The first 4 rows are shown below.

Describe the action and output(s) of the R code.

eExam Q2 (6 Marks)

Describe the function performed by each line of code or code fragment.

(a) DHY = DH[as.Date(DH$visit_date,"%d-%m-%y") < as.Date("01-01-11","%d-%m-%y"),] [1 Mark]

(b) CustSpend = as.table(by(DHY$visit_spend, DHY$customer_id, sum)) [1 Mark]

(c) CustSpend = sort(CustSpend, decreasing = TRUE) [1 Mark]

(d) CustSpend = head(CustSpend, 12) [1 Mark]

(e) DHYZ = DHY[(DHY$customer_id %in% CustSpend$customer_id),] [1 Mark]

(f) ... + facet_wrap(~ customer_id, nrow = 3) [1 Mark]
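For revision, fragments (a) to (d) can be tried on a small stand-in data frame. This is a minimal sketch only: the column names come from the code above, but the six rows of DH are invented for illustration, and head() keeps 2 customers rather than 12 so that its effect is visible on so little data.

```r
# Hypothetical stand-in for DH (values invented; not the exam data).
DH <- data.frame(
  customer_id = c(1, 1, 2, 2, 3, 3),
  visit_date  = c("05-03-10", "12-02-11", "20-06-10",
                  "01-12-10", "15-08-10", "03-01-11"),
  visit_spend = c(10.5, 99, 20, 30, 5.25, 40),
  stringsAsFactors = FALSE
)

# (a) Keep only the visits made before 1 January 2011.
DHY <- DH[as.Date(DH$visit_date, "%d-%m-%y") < as.Date("01-01-11", "%d-%m-%y"), ]

# (b) Total spend per customer, stored as a table.
CustSpend <- as.table(by(DHY$visit_spend, DHY$customer_id, sum))

# (c) Order customers from highest to lowest total spend.
CustSpend <- sort(CustSpend, decreasing = TRUE)

# (d) Keep the top spenders (2 here; the exam code keeps 12).
CustSpend <- head(CustSpend, 2)

print(CustSpend)
```

Printing the object after each step is a quick way to check a written description of what that step does.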

Regression (10 Marks)

A subset of the ‘diamonds’ data set from the R package ‘ggplot2’ was created. The data set reports price, size (carat) and quality (cut, color and clarity) information as well as specific measurements (x, y and z). The first 6 rows are printed below.

The least squares regression of log(price) on log(size) and color is given below. Note that ‘log’ in this context means log_e(X), the natural logarithm. Based on this output, answer the following questions.

eExam Q3 (4 Marks)

(a) Write down the regression equation predicting log(price) as a function of size and color. [1 Mark]

(b) Explain the different data types present in the variables: carat and color. What is the effect of this difference on the regression equation? [2 Marks]

(c) What is the predicted price for a diamond of 1 carat of color H? [1 Mark]

eExam Q4 (6 Marks)

(a) Which colour diamonds can be reliably assumed to have the highest value? Explain your reasoning. How sure can you be? [2 Marks]

(b) Which colour diamonds have the lowest value? How reliable is the evidence? Explain your reasoning. [2 Marks]

(c) Comment on the reliability of the model as a whole, giving reasons. [2 Marks]

Networks (10 Marks)

eExam Q5 (5 Marks)

The social network of a group of friends (numbered from 1 – 7) is drawn below.

(a) Calculate the betweenness centrality for nodes 1 to 7. [2 Marks]

(b) Calculate the closeness centrality for nodes 1 to 7. [2 Marks]

(c) Giving reasons based on your results in Parts a and b, which node is most central in the network? [1 Mark]

eExam Q6 (3 Marks)

(a) Calculate the density of the graph. [1 Mark]

(b) Calculate the clustering coefficient of the graph. [1 Mark]

(c) Calculate the diameter of the graph. [1 Mark]

eExam Q7 (2 Marks)

Write down the adjacency matrix for the network. [2 Marks]
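Several of the network measures above can be checked in base R. The sketch below uses a hypothetical 4-node undirected graph, not the exam network (which is not reproduced here). Closeness is taken as the reciprocal of the sum of shortest-path distances; that definition is an assumption, and the unit's formula sheet may use a normalised version. Betweenness and the clustering coefficient are not covered.

```r
# Hypothetical undirected graph on 4 nodes (NOT the exam network).
edges <- rbind(c(1, 2), c(2, 3), c(3, 4), c(2, 4))
n <- 4

# Adjacency matrix: A[i, j] = 1 when nodes i and j share an edge.
A <- matrix(0, n, n)
A[edges] <- 1
A[edges[, 2:1]] <- 1   # mirror each edge so the matrix is symmetric

# Density: edges present divided by the n(n-1)/2 edges possible.
density <- nrow(edges) / (n * (n - 1) / 2)

# Shortest-path lengths between all pairs (Floyd-Warshall).
D <- ifelse(A == 1, 1, Inf)
diag(D) <- 0
for (k in seq_len(n)) {
  D <- pmin(D, outer(D[, k], D[k, ], "+"))
}

# Diameter: the longest of the shortest paths.
diameter <- max(D)

# Closeness (unnormalised): 1 / sum of distances to all other nodes.
closeness <- 1 / rowSums(D)

print(A)
print(c(density = density, diameter = diameter))
print(closeness)
```

For this graph, node 2 is adjacent to every other node and so has the largest closeness.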

Naïve Bayes (4 Marks)

eExam Q8 (3 Marks)

Use the data below and Naïve Bayes classification to predict whether the following test instance will be happy or not.

Test instance: (Age Range = young, Occupation = professor, Gender = F, Happy = ? )

eExam Q9 (1 Mark)

Use the complete Naïve Bayes formula to evaluate the confidence of predicting Happy = yes, based on the same attributes as the previous question: (Age Range = young, Occupation = professor, Gender = F).
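The two Naïve Bayes calculations can be sketched in base R. The data frame below is invented for illustration, because the exam table is not reproduced here; the column names and the helper nb_score are likewise hypothetical. The score for a class is its prior multiplied by the conditional probability of each attribute value, and the "complete" formula divides that score by the sum of the scores over all classes.

```r
# Hypothetical survey data (invented; not the exam table).
staff <- data.frame(
  Age        = c("young", "young", "old", "old", "young", "old"),
  Occupation = c("professor", "lecturer", "professor",
                 "lecturer", "professor", "lecturer"),
  Gender     = c("F", "M", "F", "M", "M", "F"),
  Happy      = c("yes", "yes", "yes", "no", "no", "no"),
  stringsAsFactors = FALSE
)

# Score for one class: P(class) * product of P(attribute = value | class).
nb_score <- function(data, class_value, instance) {
  in_class <- data[data$Happy == class_value, ]
  prior <- nrow(in_class) / nrow(data)
  likelihoods <- sapply(names(instance), function(a) {
    mean(in_class[[a]] == instance[[a]])
  })
  prior * prod(likelihoods)
}

instance <- list(Age = "young", Occupation = "professor", Gender = "F")
score_yes <- nb_score(staff, "yes", instance)
score_no  <- nb_score(staff, "no", instance)

# Classification: the class with the larger score.
prediction <- if (score_yes > score_no) "yes" else "no"

# Complete formula: normalise by the total score over both classes.
confidence_yes <- score_yes / (score_yes + score_no)

print(prediction)
print(confidence_yes)
```

No smoothing is applied, so an attribute value never seen within a class would give that class a score of zero.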

Visualisation (6 Marks)

eExam Q10 (6 Marks)

A World Health study is examining how life expectancy varies between men and women in different countries and at different times in history. The table below shows a sample of the data that has been recorded. There are approximately 15,000 records in all.

Using one of the graphic types from the Visualization Zoo (see formulae and references for a list of types) suggest a suitable graphic to help the researcher display as many variables as clearly as possible.

Explain your decision. Which graph elements correspond to the variables you want to display?

Decision Trees (10 Marks)

eExam Q11 (4 Marks)

Eight university staff completed a questionnaire on happiness. The results are given below.

A decision tree was generated from the data.

(a) Using the decision tree generated from the data provided, assuming a required confidence level greater than 60% to classify as ‘Happy’, what is the predicted classification for the following instances: [2 Marks]

Instance 1: (Age Range = Young, Occupation = Professor, Gender = F, Happy = ? )

Instance 2: (Age Range = Old, Occupation = Professor, Gender = F, Happy = ? )

(b) Is it possible to generate a 100% accurate decision tree using this data? Explain your answer. [1 Mark]

(c) Explain how the concept of entropy is used in some decision tree algorithms. [1 Mark]

eExam Q12 (6 Marks)

(a) Do you think entropy was used to generate the decision tree above? Explain your answer. [2 Marks]

(b) What is the entropy of “Happy”? [1 Mark]

(c) What is the information gain after the first node of the decision tree (Age Range) has been introduced? [2 Marks]

(d) Explain why some decision tree algorithms are referred to as greedy algorithms. [1 Mark]
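Entropy and information gain can be computed with two short base R functions. This is a sketch under stated assumptions: the eight labels and the age split below are invented (the exam table is not reproduced here), entropy is measured in bits (log base 2), and the function names are hypothetical.

```r
# Entropy in bits of a vector of class labels.
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}

# Information gain of splitting `labels` by the values of `attribute`:
# parent entropy minus the weighted average entropy of the children.
information_gain <- function(labels, attribute) {
  weights <- table(attribute) / length(attribute)
  child_entropy <- tapply(labels, attribute, entropy)
  entropy(labels) - sum(weights[names(child_entropy)] * child_entropy)
}

# Hypothetical data (invented): 4 happy and 4 unhappy staff.
happy <- c("yes", "yes", "yes", "no", "yes", "no", "no", "no")
age   <- c("young", "young", "young", "young", "old", "old", "old", "old")

print(entropy(happy))                 # 1 bit: the classes are evenly split
print(information_gain(happy, age))
```

A greedy tree builder would evaluate information_gain for every candidate attribute at a node and split on the largest, without revisiting that choice later.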

ROC and Lift (10 Marks)

eExam Q13 (4 Marks)

The following table shows the outcome of a classification model for customer data. The table lists customers by code and provides the following information: the model confidence of a customer buying/not buying a new product (confidence-buy); and whether in fact the customer did or did not buy the product (buy = 1 if the customer purchased the product, buy = 0 if the customer did not buy the product).

(a) Calculate the True Positive Rate and the False Positive Rate when a confidence level of 20% is required for a positive classification. [2 Marks]

(b) Calculate the True Positive Rate and the False Positive Rate when a confidence level of 80% is required for a positive classification. [2 Marks]

eExam Q14 (2 Marks)

The ROC chart for the previous question is shown below. Comment on the quality of the model overall. Give a single measure of classifier performance.

eExam Q15 (4 Marks)

(a) What is the lift value if you target the top 40% of customers that the classifier is most confident of? [2 Marks]

(b) Explain what the value of lift means in the previous question. [2 Marks]
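The rate and lift calculations in this section can be sketched in base R. The ten customers below are invented for illustration (the exam table is not reproduced here), and the helper names rates_at and lift_at are hypothetical. The sketch assumes a customer is classified positive when the confidence is at or above the threshold; whether the boundary is inclusive is an assumption.

```r
# Hypothetical model output (invented; not the exam table).
confidence_buy <- c(0.95, 0.90, 0.85, 0.70, 0.60, 0.50, 0.40, 0.30, 0.15, 0.10)
buy            <- c(1, 1, 0, 1, 0, 1, 0, 0, 0, 0)

# True and false positive rates at a given confidence threshold.
rates_at <- function(threshold) {
  predicted <- confidence_buy >= threshold
  c(TPR = sum(predicted & buy == 1) / sum(buy == 1),
    FPR = sum(predicted & buy == 0) / sum(buy == 0))
}

# Lift when targeting the top `fraction` of customers by confidence:
# response rate among those targeted divided by the overall response rate.
lift_at <- function(fraction) {
  ranked <- buy[order(confidence_buy, decreasing = TRUE)]
  n_top <- round(fraction * length(buy))
  mean(ranked[seq_len(n_top)]) / mean(buy)
}

print(rates_at(0.2))
print(rates_at(0.8))
print(lift_at(0.4))
```

A lift above 1 means the targeted group responds at a higher rate than customers chosen at random.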

Clustering (10 Marks)

eExam Q16 (4 Marks)

A k-Means clustering algorithm is fitted to the iris data, as shown below.

(a) Comment on the quality of the clustering giving at least one quantitative measure. [2 Marks]

(b) What actions could be performed to improve the quality of the clustering? [2 Marks]

eExam Q17 (2 Marks)

For the previous question, if clustering was used to discriminate between the irises, what would be the accuracy of the model? Explain your reasoning. [2 Marks]
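A k-means fit of this kind can be reproduced in outline with the built-in iris data. This is a sketch, not the exam's actual output: it assumes three clusters, the four unscaled numeric columns, and an arbitrary seed. The ratio of between-cluster to total sum of squares is one common quantitative measure, and comparing clusters with species gives a rough accuracy.

```r
set.seed(3152)  # arbitrary seed so the random starts are repeatable

# Fit k-means with 3 clusters on the four measurement columns.
fit <- kmeans(iris[, 1:4], centers = 3, nstart = 20)

# Share of total variation explained by the clustering (closer to 1 is better).
quality <- fit$betweenss / fit$totss

# Cross-tabulate cluster membership against the known species.
confusion <- table(Cluster = fit$cluster, Species = iris$Species)

# Accuracy if each cluster is labelled with its most common species.
accuracy <- sum(apply(confusion, 1, max)) / nrow(iris)

print(quality)
print(confusion)
print(accuracy)
```

Scaling the columns, changing the number of clusters, or increasing nstart are typical things to vary when looking for a better fit.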

eExam Q18 (4 Marks)

15 observations were sampled at random from the Iris data set. The dendrogram resulting from clustering, based on their sepal and petal measurements, is below.

(a) If you wanted just three clusters, which items would be in each cluster? [1 Mark]

(b) Based on the dendrogram, comment on the ease or difficulty of distinguishing between the three species of iris based on their sepal and petal measurements. Explain your reasoning with an example from the graph. [2 Marks]

(c) What does ‘Height’ mean in this context? [1 Mark]
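A comparable dendrogram can be built in base R. The sketch below will not match the exam figure: the seed, and therefore the 15 sampled rows, are arbitrary, and Euclidean distance with hclust's default complete linkage is an assumption because the exam does not state the method used.

```r
set.seed(1)  # arbitrary seed; the exam's sample is not known

# Sample 15 irises and cluster them on the four measurements.
idx <- sample(nrow(iris), 15)
hc <- hclust(dist(iris[idx, 1:4]))   # Euclidean distance, complete linkage

# Cut the tree so that exactly three clusters remain.
groups <- cutree(hc, k = 3)

# Species found in each of the three clusters.
print(split(as.character(iris$Species[idx]), groups))

# The heights at which successive merges occur: the distance between
# the two clusters being joined under the chosen linkage.
print(hc$height)

# plot(hc, labels = as.character(iris$Species[idx]))  # draws the dendrogram
```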

Text Analytics (8 Marks)

eExam Q19 (2 Marks)

Explain what is meant by the ‘bag of words’ approach to text mining.

eExam Q20 (2 Marks)

Apply the five main steps required to pre-process text documents for analysis to the corpus below. Write your processed documents in the space provided.

Doc1 = { The church choir sang loudly. }

Doc2 = { The boys were singing in the church choir. }

Doc3 = { The boy asked to sing a song. }

eExam Q21 (2 Marks)

Construct the term document frequency matrix for the processed text documents above. [2 Marks]

eExam Q22 (2 Marks)

Using the term document frequency matrix, calculate the Cosine Distance between each pair of documents. [2 Marks]
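The pairwise calculation can be sketched in base R. The matrix below is hypothetical (three made-up documents and terms, not the processed corpus above), with documents as rows. Cosine distance is taken here as 1 minus cosine similarity, which is an assumption about the definition the unit uses.

```r
# Hypothetical term counts (invented): rows are documents, columns are terms.
tdm <- rbind(DocA = c(1, 1, 0),
             DocB = c(1, 1, 1),
             DocC = c(0, 0, 1))
colnames(tdm) <- c("term1", "term2", "term3")

# Length (Euclidean norm) of each document vector.
norms <- sqrt(rowSums(tdm^2))

# Cosine similarity: dot product divided by the product of the norms.
cosine_similarity <- (tdm %*% t(tdm)) / outer(norms, norms)

# Cosine distance between every pair of documents.
cosine_distance <- 1 - cosine_similarity

print(round(cosine_distance, 3))
```

DocA and DocC share no terms, so their cosine distance is 1, the largest value possible for non-negative counts.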

Ensemble Methods (7 Marks)

eExam Q23 (2 Marks)

Describe the main similarities of the three ensemble classifiers (bagging, boosting and random forests) studied.

eExam Q24 (2 Marks)

How do boosting and random forests differ from bagging?

eExam Q25 (3 Marks)

An artificial neural network (ANN) is to be used to classify whether or not to Buy a certain product based on Popularity, Sales and Performance. An extract of the data is below.

ID   Popularity   Sales    Performance   Buy
1    low          330000   0.87          Maybe
2    medium       40000    0.22          No
3    low          50000    NA            Yes
4    high         30000    0             Yes
5    low          100000   0.1           No
6    medium       NA       0.06          No
...  ...          ...      ...           ...

(a) How many input nodes does the ANN require for this problem? [1 Mark]

(b) How many output nodes does the ANN require for this problem? [1 Mark]

(c) What pre-processing and data transformations are required before applying the ANN? [1 Mark]
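One possible preparation of the six rows in the extract can be sketched in base R. The choices made here are assumptions rather than the unit's prescribed answer: ID is dropped as an identifier, rows with NA are removed (imputation is an alternative), Popularity and Buy are coded as indicator columns, and the numeric columns are rescaled to the range 0 to 1. A different coding, such as treating Popularity as an ordered number, would change the node counts.

```r
# The six rows shown in the extract.
extract <- data.frame(
  ID          = 1:6,
  Popularity  = c("low", "medium", "low", "high", "low", "medium"),
  Sales       = c(330000, 40000, 50000, 30000, 100000, NA),
  Performance = c(0.87, 0.22, NA, 0, 0.1, 0.06),
  Buy         = c("Maybe", "No", "Yes", "Yes", "No", "No"),
  stringsAsFactors = TRUE
)

# Drop rows with missing values (one option among several).
complete <- extract[complete.cases(extract), ]

# Indicator (one-hot) columns for the categorical input.
popularity <- model.matrix(~ Popularity - 1, data = complete)

# Min-max rescaling so numeric inputs share a common 0 to 1 range.
rescale <- function(x) (x - min(x)) / (max(x) - min(x))

inputs <- cbind(popularity,
                Sales       = rescale(complete$Sales),
                Performance = rescale(complete$Performance))

# Indicator columns for the class to be predicted.
targets <- model.matrix(~ Buy - 1, data = complete)

print(ncol(inputs))    # input nodes under this coding
print(ncol(targets))   # output nodes under this coding
```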

Dirty and Tidy Data (7 Marks)


eExam Q26 (5 Marks)

The table below is an extract from the list of books in the British Library. Identify the instances of dirty data present, stating the way in which the data is dirty. One mark will be given for each correct instance up to a maximum of 6 marks.