BUSI3122-E1

A LEVEL 3 MODULE, AUTUMN SEMESTER 2022-2023

INTRODUCTION TO DATA SCIENCE: BIG DATA ANALYSIS IN BUSINESS

Question 1. Assorted True/False Questions (20 marks)

Please write on the answer booklet whether each of the following statements is True or False.

a) ‘Whether a customer would like to purchase an iPhone 14 or not’ is an example of a classification problem in data mining. (2 marks)

b) Predicting the relationship between the number of customers and the price of the product is an example of a regression task in data mining. (2 marks)

c) Jane and Rob share 10 friends, and we would like to predict whether they are also friends with each other. This is an example of a co-occurrence grouping task. (2 marks)

d) In the data mining terminology, one variable is the same as one feature. (2 marks)

e) Logistic regression is one kind of predictive model for regression tasks. (2 marks)

f) Generalization tries to find the real pattern that can be applied to new data, while overfitting may find some patterns that only fit the training data. (2 marks)

g) Similarity measures are most essential for Naïve Bayes. (2 marks)

h) SVM chooses the line to minimize the margin between two classes. (2 marks)

i) Profit Curve and Lift Curve share the same X-axis. (2 marks)

j) Each individual tree in the Random Forest is built on all observations. (2 marks)

Question 2. Naïve Bayesian (25 marks)

The following dataset contains loan information and can be used to predict whether a borrower will default (the last column is the classification). We are going to build a Naïve Bayes model to determine whether a loan X should be classified as a Defaulted Borrower or not, that is, to determine which is larger, P(Yes|X) or P(No|X):

Tid	Home Owner	Marital Status	Annual Income	Defaulted Borrower
1	Yes	Single	High	No
2	No	Married	High	No
3	No	Single	Low	No
4	Yes	Married	High	No
5	No	Divorced	Low	Yes
6	No	Married	Low	No
7	Yes	Divorced	High	No
8	No	Single	Low	Yes
9	No	Married	Low	No
10	No	Single	Low	Yes

a) First, please calculate the prior probabilities for Defaulted Borrower (P(YES)) and Non-Defaulted Borrower (P(NO)), and all the necessary parameters for a Naïve Bayesian classifier. (10 marks)

b) Given a new customer X (Home Owner = No, Marital Status=Married, Income=High), calculate the probability that this customer is a Defaulted Borrower or Non-Defaulted Borrower respectively. Based on the Naïve Bayesian classifier, what will be the predicted class of this customer? (10 marks)

c) If we set the priors to P(YES) = 0.1 and P(NO) = 0.9 and the other parameters remain the same, answer question b) again, and briefly explain how the prior probabilities influence the judgement of the classifier. (5 marks)
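As a way of checking hand calculations for this question, the frequency counts behind a Naïve Bayes classifier can be tallied with a short script. This is only a minimal sketch: it assumes raw relative frequencies with no Laplace smoothing, and the names `ROWS` and `class_scores` are illustrative, not part of the exam. It returns the unnormalised scores P(X|c) × P(c), which is enough to compare the two classes.

```python
from collections import Counter
from fractions import Fraction

# Rows copied from the loan table: (home_owner, marital_status, income, defaulted)
ROWS = [
    ("Yes", "Single", "High", "No"),
    ("No", "Married", "High", "No"),
    ("No", "Single", "Low", "No"),
    ("Yes", "Married", "High", "No"),
    ("No", "Divorced", "Low", "Yes"),
    ("No", "Married", "Low", "No"),
    ("Yes", "Divorced", "High", "No"),
    ("No", "Single", "Low", "Yes"),
    ("No", "Married", "Low", "No"),
    ("No", "Single", "Low", "Yes"),
]


def class_scores(x, priors=None):
    """Return the unnormalised score P(x | c) * P(c) for each class c.

    Conditionals are raw relative frequencies (no smoothing; an assumption
    of this sketch). `priors`, if given, overrides the empirical priors.
    """
    class_counts = Counter(row[-1] for row in ROWS)
    scores = {}
    for c, n_c in class_counts.items():
        score = Fraction(n_c, len(ROWS)) if priors is None else priors[c]
        for j, value in enumerate(x):
            # Count rows of class c whose j-th attribute equals the query value.
            matches = sum(1 for row in ROWS if row[-1] == c and row[j] == value)
            score *= Fraction(matches, n_c)
        scores[c] = score
    return scores


x = ("No", "Married", "High")  # Home Owner, Marital Status, Income
empirical = class_scores(x)
adjusted = class_scores(x, priors={"Yes": Fraction(1, 10), "No": Fraction(9, 10)})
print({c: float(s) for c, s in empirical.items()})
print({c: float(s) for c, s in adjusted.items()})
```

Exact fractions are used so that a zero count shows up as exactly 0 rather than as a rounding artefact; with smoothing the zero would disappear, which is why the no-smoothing assumption matters here.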

Question 3. Logistic Regression (30 marks)

The dating web site Jiayuan.com requires its users to create profiles based on a survey in which they rate their interest (on a scale from 0 to 3) in five categories: physical fitness, music, spirituality, education, and alcohol consumption. A new Jiayuan customer, Joseph NoBody, has reviewed the profiles of 20 prospective dates and classified whether he is interested in learning more about them.

Based on Joseph's classification of these 20 profiles, Jiayuan has applied a logistic regression to predict whether Joseph is interested in other profiles that he has not yet viewed. The resulting logistic regression model is as follows:

Log odds of Interested = -0.920 + 0.325 × Fitness - 3.611 × Music + 5.535 × Education - 2.927 × Alcohol
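To turn the log odds from this equation into a probability, the standard logistic (sigmoid) transform p = 1 / (1 + e^(-log odds)) can be applied. The sketch below uses the coefficients as printed; the function names and the dictionary layout are illustrative assumptions, and the example profile is the one given in part d) of this question.

```python
import math

# Coefficients copied from the fitted model above.
INTERCEPT = -0.920
COEFS = {"Fitness": 0.325, "Music": -3.611, "Education": 5.535, "Alcohol": -2.927}


def log_odds(profile):
    """Linear predictor: intercept plus the weighted sum of the ratings."""
    return INTERCEPT + sum(COEFS[name] * profile[name] for name in COEFS)


def interest_probability(profile):
    """Logistic transform of the log odds into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-log_odds(profile)))


profile = {"Fitness": 3, "Music": 1, "Education": 3, "Alcohol": 1}
print(log_odds(profile), interest_probability(profile))
```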

For the 20 profiles (observations) that Joseph has viewed and classified, this logistic regression model generates the following predicted probabilities of Interested.

Observation ID	Interested (Actual)	Probability of Interested (Predicted)	Observation ID	Interested (Actual)	Probability of Interested (Predicted)
20	1	1.000	16	1	0.512
17	1	0.999	9	0	0.485
4	1	0.999	6	0	0.419
12	0	0.877	18	1	0.368
14	1	0.853	3	0	0.365
19	1	0.767	2	0	0.330
11	1	0.754	8	0	0.322
7	0	0.666	5	0	0.200
13	1	0.657	1	0	0.168
10	1	0.602	15	0	0.128

a) Use a cut-off value of 0.5 to classify whether Joseph is interested or not, and construct the confusion matrix for this 20-observation data set. (5 marks)

b) According to Jiayuan, it costs the website 10 cents to recommend one profile to Joseph. If Joseph is interested, he will spend 1 rmb to gain full access to the profile. Please calculate the expected profit the website can gain from Joseph if it applies the classifier. (10 marks)

c) Based on the logistic regression result table, construct the ROC curve and calculate the AUC. Hint: the ROC curve is a set of line segments parallel or perpendicular to the x-axis. (10 marks)

d) If we have a new profile that has values of Fitness = 3, Music = 1, Education = 3, and Alcohol = 1, use the estimated logistic regression equation to compute the probability of Joseph's interest in this profile. (5 marks)
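The evaluation steps in parts a) to c) can be cross-checked with a small script. This is a sketch under stated assumptions rather than a model answer: it treats a predicted probability of at least 0.5 as "Interested", it assumes that "10 cents" means 0.1 rmb and that only profiles predicted as Interested are recommended, and the names `DATA`, `confusion_matrix`, `roc_points`, `auc` and `expected_profit_per_profile` are illustrative.

```python
# (actual, predicted probability) pairs copied from the 20-observation table.
DATA = [
    (1, 1.000), (1, 0.999), (1, 0.999), (0, 0.877), (1, 0.853),
    (1, 0.767), (1, 0.754), (0, 0.666), (1, 0.657), (1, 0.602),
    (1, 0.512), (0, 0.485), (0, 0.419), (1, 0.368), (0, 0.365),
    (0, 0.330), (0, 0.322), (0, 0.200), (0, 0.168), (0, 0.128),
]


def confusion_matrix(data, cutoff=0.5):
    """Count TP, FP, FN, TN, predicting Interested when p >= cutoff."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for actual, p in data:
        predicted = 1 if p >= cutoff else 0
        key = ("T" if predicted == actual else "F") + ("P" if predicted else "N")
        counts[key] += 1
    return counts


def roc_points(data):
    """Walk down the ranking: a positive steps up, a negative steps right.

    Assumes no tied scores between a positive and a negative (true here).
    """
    n_pos = sum(actual for actual, _ in data)
    n_neg = len(data) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for actual, _ in sorted(data, key=lambda row: -row[1]):
        if actual == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the piecewise-linear curve through the ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def expected_profit_per_profile(cm, revenue=1.0, cost=0.1):
    """Expected value per profile; revenue and cost in rmb (cost is assumed)."""
    total = sum(cm.values())
    return (cm["TP"] * (revenue - cost) - cm["FP"] * cost) / total


cm = confusion_matrix(DATA)
print(cm)
print(auc(roc_points(DATA)))
print(expected_profit_per_profile(cm))
```

Because every step of the curve is either vertical or horizontal, the trapezoid sum in `auc` reduces to adding up rectangles, which matches the hint in part c).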

Question 4. Decision Analytic Thinking (25 marks)

As the World Cup 2022 is ongoing in Qatar now, you are about to launch a legal betting shop in your neighbourhood. Thanks to your friend Lucky Yu, a veteran in the betting shop business, you have access to an extensive dataset on existing customers, including gender, age, demographic information by zip code, and their betting history. You plan to send invitations to residents in your area (with a targeting cost) and run experiments under the following assumptions:

Customers’ bets may vary.

Customers may place a bet when they pass by your betting shop (even when they did not receive the invitation).

Targeting cost is fixed.

Other than the targeting cost, there are no additional costs.

You have been asked to build several data mining models that would suggest which customers should be targeted to maximize your profit. Use the expected value framework to determine what models should be used to address the problem, and explain the dataset you need for each of them.

Note: It is sufficient to write down the correct expected value equations to identify the models that should be constructed. You need to consider different situations and build individual models for different situations.
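One possible way to organise the expected value comparison is sketched below. It assumes four hypothetical model outputs per customer: the probability of betting and the expected profit from the bet, each estimated separately for the targeted and the untargeted case. The function name, the parameter names and the example numbers are all illustrative and do not come from the question.

```python
def expected_gain_from_targeting(p_bet_targeted, value_targeted,
                                 p_bet_untargeted, value_untargeted,
                                 targeting_cost):
    """Difference in expected profit between targeting and not targeting.

    EV(target)     = p(bet | x, targeted)   * value(x, targeted) - cost
    EV(not target) = p(bet | x, untargeted) * value(x, untargeted)
    A customer is worth targeting when the difference is positive.
    """
    ev_targeted = p_bet_targeted * value_targeted - targeting_cost
    ev_untargeted = p_bet_untargeted * value_untargeted
    return ev_targeted - ev_untargeted


# Hypothetical customer: the invitation raises both the chance and size of a bet.
gain = expected_gain_from_targeting(
    p_bet_targeted=0.30, value_targeted=20.0,
    p_bet_untargeted=0.10, value_untargeted=15.0,
    targeting_cost=2.0,
)
print(gain)
```

Under these assumptions the two probabilities would come from classification models and the two values from regression models, each trained on customers who were or were not sent an invitation.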