First Semester Examination

2020/2021 Academic Session

CDS503 – Machine Learning

1. (a). Questions 1(a)(i) to (iii) are based on Table 1.

Table 1 shows a set of training data on bank loan application approvals, based on the applicant’s average monthly saving and outstanding mortgage loan. The “+” sign means “loan granted”, while the “-” sign means “loan not granted”.

Table 1

Average monthly saving (RM)	Outstanding mortgage loan (in ‘000 RM)	Loan granted
250	150	-
300	60	-
450	175	+
500	135	+
500	230	-
600	150	+
600	90	-
700	185	+
700	230	-
780	140	+
950	55	-
970	153	-

(52/100)

(i). Based on the dataset in Table 1, construct Class C in a 2D graph showing the positive and negative instances. Identify the values of S1, S2, M1 and M2, where

(S1 <= average monthly saving <= S2) AND (M1 <= outstanding mortgage loan <= M2).
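As a rough check on the bounds read off the graph, the sketch below computes the tightest axis-aligned rectangle around the positive instances of Table 1. Taking the tightest rectangle is an assumption about what Class C is meant to be (any larger rectangle that still excludes the negatives would also be consistent), and the names `table1` and `in_rectangle` are illustrative only.

```python
# Table 1 rows: (average monthly saving in RM, outstanding mortgage loan in '000 RM, label)
table1 = [
    (250, 150, "-"), (300, 60, "-"), (450, 175, "+"), (500, 135, "+"),
    (500, 230, "-"), (600, 150, "+"), (600, 90, "-"), (700, 185, "+"),
    (700, 230, "-"), (780, 140, "+"), (950, 55, "-"), (970, 153, "-"),
]

positives = [(s, m) for s, m, label in table1 if label == "+"]

# Assumption: Class C is the tightest rectangle enclosing every positive instance.
s1 = min(s for s, _ in positives)
s2 = max(s for s, _ in positives)
m1 = min(m for _, m in positives)
m2 = max(m for _, m in positives)


def in_rectangle(saving, mortgage):
    """Return True when the point satisfies both interval conditions."""
    return s1 <= saving <= s2 and m1 <= mortgage <= m2


# Negative instances that fall inside the rectangle would be false positives.
negatives_inside = [
    (s, m) for s, m, label in table1 if label == "-" and in_rectangle(s, m)
]

print("S1, S2, M1, M2 =", s1, s2, m1, m2)
print("negatives inside:", negatives_inside)
```

Under this assumption no negative instance falls inside the rectangle, so it is consistent with the training data.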

(ii). Figure 1 shows the formalized supervised learning setting: an instance space X, a target function y = f(x), and a label space Y. The hypothesis space is the set of candidate functions that the learner searches.

Figure 1

Construct a hypothesis space Class H that also shows the potential areas of false positives (FP) and false negatives (FN). Explain the error E(h|X) of false positives and false negatives in the context of this bank loan application approval.

(iii). Construct a hypothesis space Class S that may lead to overfitting. Describe how overfitting happens in the context of this bank loan application approval.

(b). Table 2 shows the training data for classifying the sport type of an athlete, based on the athlete’s height, maximal voluntary contraction (MVC) and maximal oxygen uptake (Max O2).

Table 2

Name	Height	Maximal Voluntary Contraction (MVC)	Maximal Oxygen Uptake (Max O2)	Sport type
Osman	tall	moderate	moderate	Running
Thevan	medium	low	high	Badminton
Seng Huat	tall	high	moderate	Running
Ramu	medium	moderate	low	Running
Ahmad	short	low	high	Badminton
Muhamad	tall	low	high	Badminton
Chin Huat	medium	high	moderate	Running

(48/100)

2. (a).

(i). Using the K-Nearest Neighbour (KNN) algorithm, determine the sport type suitable for a person with medium height, medium Maximal Voluntary Contraction (MVC) and high Maximal Oxygen Uptake (Max O2).

The value of k is 3 and the proximity metric used is Euclidean distance.

Show your workings.

(ii). Determine the sport type of the same person with k = 7.

Compare your result with that of 1(b)(i) and state your conclusion.
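The KNN computation can be sketched as follows. Table 2 is categorical, so Euclidean distance needs a numeric encoding; the ordinal encoding below (short/low = 0, medium/moderate = 1, tall/high = 2) is an assumption, as is reading the "medium" MVC in the question as the table's "moderate". With this encoding four athletes are tied at the same distance, so the k = 3 result depends on how ties are broken; the sketch simply keeps table order.

```python
import math
from collections import Counter

# Assumed ordinal encodings (not given in the question).
HEIGHT = {"short": 0, "medium": 1, "tall": 2}
LEVEL = {"low": 0, "moderate": 1, "high": 2}

# Table 2 rows: (name, height, MVC, Max O2, sport type)
table2 = [
    ("Osman", "tall", "moderate", "moderate", "Running"),
    ("Thevan", "medium", "low", "high", "Badminton"),
    ("Seng Huat", "tall", "high", "moderate", "Running"),
    ("Ramu", "medium", "moderate", "low", "Running"),
    ("Ahmad", "short", "low", "high", "Badminton"),
    ("Muhamad", "tall", "low", "high", "Badminton"),
    ("Chin Huat", "medium", "high", "moderate", "Running"),
]


def encode(height, mvc, max_o2):
    """Map the three categorical features to a numeric point."""
    return (HEIGHT[height], LEVEL[mvc], LEVEL[max_o2])


def neighbours(query):
    """Return (distance, name, sport) sorted by distance; ties keep table order."""
    q = encode(*query)
    rows = [
        (math.dist(q, encode(h, mvc, o2)), name, sport)
        for name, h, mvc, o2, sport in table2
    ]
    return sorted(rows, key=lambda row: row[0])  # stable sort


def knn_predict(query, k):
    """Majority vote among the k nearest neighbours."""
    votes = Counter(sport for _, _, sport in neighbours(query)[:k])
    return votes.most_common(1)[0][0]


query = ("medium", "moderate", "high")  # "medium" MVC read as "moderate"
for dist, name, sport in neighbours(query):
    print(f"{name:10s} {sport:10s} distance = {dist:.3f}")
print("k = 3:", knn_predict(query, 3))
print("k = 7:", knn_predict(query, 7))
```

With k = 7 every training instance votes, so the prediction is simply the majority class of Table 2 regardless of the encoding.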

(iii). Using the Naïve Bayes algorithm, determine the sport type suitable for a person with medium height, high Maximal Voluntary Contraction (MVC) and low Maximal Oxygen Uptake (Max O2).

Show your workings.
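A minimal sketch of the Naïve Bayes working is given below. It uses plain relative frequencies from Table 2 with no Laplace smoothing, which is an assumption; with smoothing, a class whose count for some feature value is zero would not be eliminated outright. The function name `naive_bayes_scores` is illustrative.

```python
from collections import Counter

# Table 2 rows without names: (height, MVC, Max O2, sport type)
table2 = [
    ("tall", "moderate", "moderate", "Running"),
    ("medium", "low", "high", "Badminton"),
    ("tall", "high", "moderate", "Running"),
    ("medium", "moderate", "low", "Running"),
    ("short", "low", "high", "Badminton"),
    ("tall", "low", "high", "Badminton"),
    ("medium", "high", "moderate", "Running"),
]


def naive_bayes_scores(query):
    """Return the unnormalised score P(class) * prod P(feature | class) per class."""
    class_counts = Counter(row[-1] for row in table2)
    total = len(table2)
    scores = {}
    for sport, n in class_counts.items():
        score = n / total  # prior P(class)
        for i, value in enumerate(query):
            matches = sum(
                1 for row in table2 if row[-1] == sport and row[i] == value
            )
            score *= matches / n  # likelihood, no smoothing (assumption)
        scores[sport] = score
    return scores


query = ("medium", "high", "low")
scores = naive_bayes_scores(query)
prediction = max(scores, key=scores.get)
print(scores)
print("predicted sport type:", prediction)
```

Because no Badminton athlete in Table 2 has high MVC, the unsmoothed Badminton score is zero and the prediction is Running.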

Figure 2

(32/100)

(i). Based on the Decision Tree (DT) shown in Figure 2, explain why the feature ‘Age’ is the root of the DT. Your justification should refer to information gain and the degree of purity.

(ii). List the labels involved in this DT.

(b). Email spam is annoying: it fills up our inboxes and makes it hard to find genuine emails. To protect an email server from being overloaded with non-essential emails, spam filters are used.

Consider the following information:

• 2% of the emails in the inbox, filtered based on a specific keyword, are considered spam.

• 85% of the emails that are spam contain the keyword “you win”.

• 8.5% of the emails that are not spam also contain the keyword “you win”.

An email in the inbox is found to contain the keyword “you win” when filtered. Using Bayes’ theorem, calculate the probability that this email is spam. Show your working.
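The calculation can be checked with a short sketch. It assumes the 2% figure is the prior probability P(spam) and that the two keyword percentages are the conditional probabilities P(keyword | spam) and P(keyword | not spam); the variable names are illustrative.

```python
# Figures from the question, read as probabilities (assumption noted above).
p_spam = 0.02               # P(spam)
p_kw_given_spam = 0.85      # P("you win" | spam)
p_kw_given_not_spam = 0.085 # P("you win" | not spam)

# Law of total probability: P("you win")
p_kw = p_kw_given_spam * p_spam + p_kw_given_not_spam * (1 - p_spam)

# Bayes' theorem: P(spam | "you win")
p_spam_given_kw = p_kw_given_spam * p_spam / p_kw

print(f"P(keyword) = {p_kw:.4f}")
print(f"P(spam | keyword) = {p_spam_given_kw:.4f}")
```

Under this reading the posterior is about 0.17: the keyword raises the probability of spam well above the 2% prior, but most emails containing it are still not spam because non-spam emails are far more common.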

(32/100)

(c). Suppose you are using a Support Vector Machine (SVM) classifier for a two-class classification problem, as shown in Figure 3. In the data given, the circled points represent the support vectors.

Figure 3

(36/100)

(i). Determine whether the decision boundary will change if you remove any one of circled points.

(ii). Determine whether the decision boundary will change if you remove any one of the non-circled points.

(iii). Explain the cost parameter in SVM and how it affects the smoothness of the decision boundary.

3. Answer all questions

(a) Given the following dataset in Table 3.

(36/100)

(i) Compute the parameters (coefficients) w0 and w1 of the linear regression model using the least squares method.

Table 3

Y	X
1.5	2.0
2.1	2.4
1.9	2.5
2.8	2.8
2.1	2.9
2.0	3.0
2.6	2.9
2.2	3.2
2.7	3.3
3.1	3.6

(ii) Compute the predicted Y value given X = 4.5.
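The least squares working for Table 3 can be sketched as below, assuming the simple linear model Y = w0 + w1 X with the closed-form estimates w1 = Sxy / Sxx and w0 = mean(Y) - w1 mean(X). The helper name `predict` is illustrative.

```python
# Table 3 columns
ys = [1.5, 2.1, 1.9, 2.8, 2.1, 2.0, 2.6, 2.2, 2.7, 3.1]
xs = [2.0, 2.4, 2.5, 2.8, 2.9, 3.0, 2.9, 3.2, 3.3, 3.6]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Sum of cross-deviations and sum of squared deviations of X
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
s_xx = sum((x - x_mean) ** 2 for x in xs)

w1 = s_xy / s_xx           # slope
w0 = y_mean - w1 * x_mean  # intercept


def predict(x):
    """Predicted Y under the fitted line."""
    return w0 + w1 * x


print(f"w0 = {w0:.4f}, w1 = {w1:.4f}")
print(f"predicted Y at X = 4.5: {predict(4.5):.4f}")
```

Note that X = 4.5 lies outside the range of X values in Table 3 (2.0 to 3.6), so the prediction is an extrapolation.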

(b) A two-dimensional dataset is shown in Figure 4. Suppose the centroids are (3,2) and (5,5). Compute the new centroids of the clusters after the K-means method is applied for one iteration.

Figure 4

(32/100)

(c) Consider performing the hierarchical agglomerative clustering algorithm on the set of data points shown in Figure 5, assuming we stop when only two clusters remain. State the linkage method that ensures two balanced clusters will be formed (each with two data points). Explain your answer.

Figure 5

(32/100)

4. Answer all questions

(a) Given a two-dimensional dataset, we want to represent the data in only one dimension. There are two methods available: Principal Component Analysis and Linear Discriminant Analysis. State which method should be used to reduce the dataset. Explain your answer.

(24/100)

(b) You are tasked with building a classification model for a prediction problem. The dataset is large, with a low number of noisy samples. However, a single model is found to perform very poorly. It is therefore decided that an ensemble learning method, either bagging or boosting, will be used to build a better classification model for the prediction problem. Choose one method and explain your reason for choosing it over the other.

(40/100)

(c) A data scientist builds an ensemble classifier using the stacking method. The ensemble classifier consists of three decision trees as the base models and a support vector machine as the meta-model. The ensemble classifier is evaluated on a test set, and its accuracy is observed to be lower than expected. Suggest an improvement that could be made to increase the accuracy of the classifier.

(36/100)