Advanced Data Management

Data Mining Assignment 2022/23

Module Leader: Dr Kostas Domdouzis

Module Name: Advanced Data Management

Assignment Title: Data Mining

Weighting: 40%

Level: 6

Module Code: 55-600092

Individual/Group: Individual

Magnitude: Blackboard Phase Test

Submission date/time: 27th April 2023 at 15:00 (the submission area will be made available 1.5 hours prior to the deadline and you have 1.5 hours to complete the phase test)

Blackboard Submission: Y/N

Planned feedback date: 18th May 2023

In module retrieval available: Yes/No

Turnitin Submission: Y/N

Mode of Feedback: Blackboard

1.1 Learning Outcomes

This assignment assesses your ability to:

•   Apply appropriate statistical business analytics to data to obtain summary, graphical and interactive presentations that support insights into it.

1.2 Assessment Criteria

This module will be assessed via a case study. This will involve the analysis of the data set outlined in Appendix 1. This assignment contributes 40% of the final module mark. Each student is expected to carry out their own analysis of the data based on the questions outlined below; marks are indicated next to each question.

Note that this is an individual assignment. Upon completing your analysis you are expected to input your answers to the questions set through the “002 Project - Submission Point” on Blackboard by 27th April 2023 at 15:00.

The “002 Project - Submission Point” can be found on Blackboard in the “Assessment” tab.

1.3 Submission Details

Each student is expected to submit their responses to the “002 Project - Submission Point” form on Blackboard by the given deadline. This will involve selecting the most appropriate answer from a list of given responses to the questions provided below.

1.4 Problem Outline

For this assignment you are required to analyse a data set concerning financial transactions and details for customers at a Czech bank. The relationships are shown below:

You are required to analyse one table resulting from a query from this database as detailed below. Full details of the fields in this table are given below and in Appendix 1.

1.5 Data Provided

The final query is saved as a SAS dataset for use in Enterprise Miner. It is called czechbk15.sas7bdat. It is available on the SHU server in the path:

E:\SHUUsers\!SharedData\Rich\ADM2223

You will need to create a library to access the data.

1.6 Details of the Query and resulting data

In this assignment you will investigate whether there are any groups of accounts with similar properties. You will also build a model to predict which accounts have a second account holder attached to them. For this purpose a subset of variables is selected from the final combination of tables for each account. For each account, these variables represent credits and the different types of withdrawals that take place:

•    Credits (payments in): there is one pair of variables giving the total paid in to the account (creditt) and the number of times money is paid in (creditn).

Withdrawals (taking money out): there are two separate variables for each of the following methods of withdrawing money:

•    Cash

•    Insurance payment

•    Overdraft Penalty

•    Statement Payment

•    Household Payment

•    Other bank withdrawal

•    Loan Payments

For each of these types of payments, the number of payments (ending in n) and the value of payments (ending in t) have been recorded for a period of five years.

Finally, additional information is held about each account:

Account id, age of the primary account holder, whether they have a credit card (with this bank), number of days the account has been open, whether they have a loan, whether there is a second user of the account, and the gender of the main account holder (sex). There is one nominal variable: the frequency of their bank statements, which is monthly, weekly or after transaction. This gives the set of variables shown in the appendix. Make sure you fully understand what these variables represent; for a full list see Appendix 1.

The bank wishes to see if different customers have similar profiles and has therefore asked that the data be clustered. They are looking for about five clusters.

For this assignment we will be using only the following variables in the data set. While you are working on the assignment, set all the other variables to Reject so that you do not have to keep changing them.

Variables to use:

You will need to understand what these variables are so make sure you read Appendix 1.

1.7 Analysis Required

Instruction: Take a note (TAN) of your responses to the questions below along with screen shots (SS) of the SAS outputs, as you will need these answers to complete the “ADM Data Mining - Answer Submission Area” form on Blackboard.

Since the cluster analysis requires the use of fields that are as symmetrical as possible, you should first investigate each of the interval fields in the data.
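
As an informal illustration of what "as symmetrical as possible" means numerically, the sketch below computes the moment skewness of a column of values. It is written in Python rather than Enterprise Miner, the function name skewness is hypothetical, and the two sample lists are made up; they are not taken from czechbk15. A value near 0 suggests a roughly symmetric field, while a large positive value suggests a long right tail.

```python
def skewness(values):
    """Moment skewness: 0 for a perfectly symmetric sample."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # constant column: no spread, treat as symmetric
    return m3 / m2 ** 1.5


# Made-up values for illustration only (assumption, not the bank data).
symmetric = [1, 2, 3, 4, 5]
right_skewed = [0, 0, 0, 1, 2, 3, 50]

print(skewness(symmetric))
print(skewness(right_skewed))
```

A histogram remains the main tool for the questions below; a single statistic like this only summarises what the plot already shows.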

Question 1

Q1AA. Obtain (SS) suitable plots of the interval variables. (1 mark)

Q1AB. Discuss (TAN) the plots in detail. Hint: when looking at these plots you may wish to consider the following: are there any unusual features? What are the shapes of the plots? What does this mean in relation to how customers behave at the bank? (4 marks)

Question 2

Q2AA. Obtain (SS) suitable plots of the nominal and binary variables. (1 mark)

Q2AB. Discuss (TAN) the plots in detail. What do these show in relation to how customers behave at the bank? (4 marks)

Instruction: Use the transform node in Enterprise Miner and the "Maximum Normal" option for interval variables to find suitable transformations of the interval variables. You should ensure that in your scoring settings, you still retain a copy of the original variables (set both Hide and Reject to "no").
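
The idea of trying several candidate transformations and keeping the one that looks most normal can be sketched as follows. This is a simplified stand-in, not Enterprise Miner's procedure: the candidate set, the use of absolute skewness as the selection score, and the names CANDIDATES and most_symmetric are all assumptions made for illustration, and the sample values are invented. The sketch also assumes non-negative inputs.

```python
import math


def skewness(values):
    """Moment skewness: 0 for a perfectly symmetric sample."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5


# Hypothetical candidate set; inputs are assumed to be non-negative.
CANDIDATES = {
    "none": lambda x: x,
    "log": lambda x: math.log(x + 1),
    "sqrt": math.sqrt,
    "square": lambda x: x * x,
}


def most_symmetric(values):
    """Score each candidate by |skewness| and return the lowest scorer."""
    scores = {
        name: abs(skewness([f(v) for v in values]))
        for name, f in CANDIDATES.items()
    }
    return min(scores, key=scores.get), scores


# Invented right-skewed payment totals (assumption, not the bank data).
payments = [1, 2, 4, 8, 16, 32, 64, 128]
best, scores = most_symmetric(payments)
print(best)
```

When "none" scores lowest, the field is better left untransformed, which is the situation Q3AB asks you to look out for.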

Question 3

Q3AA. Obtain (SS) a suitable screen shot of the SAS transformation table. (1 mark)

Q3AB. Explain (TAN) what actual transformations the software has picked. Are there any that haven’t been transformed? (2 marks)

Q3AC. Produce further plots (SS) of the transformed and original variables. (3 marks)

Q3AD. Present evidence (TAN) as to whether or not the transformations have been successful. Hint: state for each interval variable whether subsequent analysis should use the original (untransformed) variable or the new transformed variable. (3 marks)

Q3AE. List (TAN) which set of interval variables you would use for clustering going forward. Hint: this can be a combination of both transformed and original variables, but the same variable should not be used twice, regardless of whether it is transformed or not. (2 marks)

Instruction: You now need to use the transformed data - interval variables only. Firstly, set those variables you chose in question Q3AE to Yes and those you are not using to No. Once you have done this, change the settings of the clustering node so as to fit five clusters.
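
For intuition about what fitting five clusters involves, here is a minimal k-means sketch in plain Python. It is not the Enterprise Miner clustering node, whose seeding and standardisation options are not reproduced here; the function names, the evenly spaced starting centres, and the two-variable sample points are all assumptions chosen to keep the example small and deterministic.

```python
from collections import Counter


def nearest(point, centres):
    """Index of the centre closest to point (squared Euclidean distance)."""
    return min(
        range(len(centres)),
        key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centres[c])),
    )


def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm on a list of equal-length numeric tuples."""
    # Simplistic deterministic start: k points spread evenly through the list.
    step = len(points) // k
    centres = [points[i * step] for i in range(k)]
    labels = []
    for iteration in range(iterations):
        new_labels = [nearest(p, centres) for p in points]
        if iteration > 0 and new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster empties
                centres[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centres


# Invented two-variable accounts forming five obvious groups (assumption).
points = [
    (0, 0), (0, 1), (1, 0),
    (10, 10), (10, 11), (11, 10),
    (20, 0), (20, 1), (21, 0),
    (0, 20), (0, 21), (1, 20),
    (30, 30), (30, 31), (31, 30),
]
labels, centres = kmeans(points, k=5)
sizes = Counter(labels)  # analogous to a segment size summary
print(sizes)
```

The cluster sizes and the per-cluster centres in this sketch play the same role as the segment size plot and the cluster mean statistics requested in Question 4.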

Question 4

4AA. Obtain (SS) a segment size plot of the output. (1 mark)

4AB. Discuss (TAN) the segment size plot obtained in question 4AA. (1 mark)

4AC. Obtain (SS) the cluster mean statistics of the output. (1 mark)

4AD. Discuss (TAN) the cluster mean statistics obtained in question 4AC. (2 marks)

Question 5

5AA. Using the original variables, rather than the transformed variables, produce suitable plots (SS) to investigate the nature of each cluster for all the payment related variables (those that end in t). (2 marks)

5AB. Using the plots in Question 5AA, illustrate the validity of your cluster solution by profiling clusters 1 and 3 and interpret the factors that make these unique (TAN). (5 marks)

Instruction: In the data node, set second as the target variable. Now add a data partition node to the data node and set training to 70%, validation to 30% and test to 0%. Add a decision tree node to the data partition node as in Figure 1.0 and adjust the tree settings as per Figure 1.1.

Figure 1.0 - Enterprise Miner Stream

Figure 1.1 - Decision Tree Settings

Now run the Decision Tree node.
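
The 70% / 30% / 0% partition set up above can be pictured as a seeded random split of the accounts. The sketch below is an assumption-laden illustration in Python: it uses a simple random shuffle, which may differ from the sampling method the data partition node applies, and the function name partition, the seed value and the account ids are hypothetical.

```python
import random


def partition(rows, train_fraction=0.7, seed=12345):
    """Shuffle a copy of rows and cut it into training and validation parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical account ids (assumption, not the bank data).
account_ids = list(range(1, 101))
train, validate = partition(account_ids)
print(len(train), len(validate))
```

The tree is grown on the training part only, and the validation part is held back to check how well it generalises, which is what the fit statistics in Question 6 compare.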

Question 6

6AA. Obtain (SS) a tree diagram of the output. (1 mark)

6AB. Fully interpret (TAN) the derived tree. (2 marks)

6AC. Obtain (SS) the Fit Statistics from the derived tree. (1 mark)

6AD. Discuss (TAN) the Fit Statistics for the derived tree. (2 marks)

Instruction: Use the decision tree in conjunction with the attributes given below:

Age = ., creditn = 0.01, creditt = 200, stmentn = 0.02, stmentt = 10, card = y, cardwdn = 0, cardwdt = 0, insuren = 0, insuret = 0, overdtn = 0.42, overdtt = 600, days = 800, frequency = monthly, householdn = 0, householdt = 0, othbwdn = 1000, othbwdt = 500, loanpayn = 6000, loanpayt = 98894, sex = M, cashwdn = 0, cashwdt = 0

6B. Record (TAN) whether or not the account is likely to have a second account holder. (2 marks)

6C. Explain (TAN) the important factors that impact on having a second account holder. What reservations might you have? (3 marks)

Question 7

7AA. Use all of your results above to discuss (TAN) how all the analysis you have carried out may be utilised by the bank. (2 marks)

7AB. Discuss (TAN) how the results of the analyses above may be utilised by the bank to carry out further supervised data mining. (2 marks)

7AC. Record what other possible targets might be appropriate for future data mining at the bank. (2 marks)

Total Marks available: 50 marks

(Data Mining Assignment contributes 40% to the final module mark)