Advanced Data Management Data Mining Assignment 2022/23
Module Leader: Dr Kostas Domdouzis
Module Name: Advanced Data Management
Assignment Title: Data Mining
Weighting: 40%
Level: 6
Module Code: 55-600092
Individual/Group: Individual
Magnitude: Blackboard Phase Test
Submission date/time: 27th April 2023 at 15:00 (submission area will be made available 1.5hrs prior to the deadline and you have 1.5hrs to complete the phase test)
Blackboard Submission: Y/N
Planned feedback date: 18th May 2023
In module retrieval available: Yes/No
Turnitin Submission: Y/N
Mode of Feedback: Blackboard
1.1 Learning Outcomes
This assignment assesses your ability to:
• Apply appropriate statistical business analytics to data to obtain summary, graphical and interactive presentations that support insights into it.
1.2 Assessment Criteria
This module will be assessed via a case study involving the analysis of the data set outlined in Appendix 1. This assignment contributes 40% of the final module mark, and each student is expected to carry out their own analysis of the data based on the questions outlined below. Marks are indicated next to each question.
Note that this is an individual assignment. Upon completing your analysis, you are expected to input your answers to the questions set through the “002 Project - Submission Point” on Blackboard by the 27th of April 2023 at 15:00.
The “002 Project - Submission Point” can be found on Blackboard in the “Assessment” tab.
1.3 Submission Details
Each student is expected to submit their responses to the “002 Project - Submission Point” form on Blackboard by the given deadline. This will involve selecting the most appropriate answer from a list of given responses to the questions provided below.
1.4 Problem Outline
For this assignment you are required to analyse a data set concerning financial transactions and details for customers at a Czech bank. The relationships are shown below:
You are required to analyse one table resulting from a query from this database as detailed below. Full details of the fields in this table are given below and in Appendix 1.
1.5 Data Provided
The final query is saved as a SAS dataset for use in Enterprise Miner. It is called czechbk15.sas7bdat. It is available on the SHU server in the path:
E:\SHUUsers\!SharedData\Rich\ADM2223
You will need to create a library to access the data.
1.6 Details of the Query and resulting data
In this assignment you will investigate whether there are any groups of accounts with similar properties. You will also build a model to predict which accounts have a second account holder attached. For this purpose, a subset of variables is selected from the final combination of tables for each account. For each account, these variables represent credits and the different types of withdrawals that take place:
• Credits (payments in): one pair of variables gives the total paid in to the account (credit) and the number of times money is paid in (creditn).
Withdrawals (taking money out): there are two separate variables for each of the following methods of withdrawing money:
• Cash
• Insurance payment
• Overdraft Penalty
• Statement Payment
• Household Payment
• Other bank withdrawal
• Loan Payments
For each of these types of payment, the number of payments (variable names ending in –n) and the value of payments (variable names ending in –t) have been recorded for a period of five years.
Finally, additional information is held about each account:
Account id, age of the primary account holder, whether they have a credit card with this bank, the number of days the account has been open, whether they have a loan, whether there is a second user of the account, and the gender of the main account holder (sex). There is one nominal variable: the frequency of their bank statements, which is monthly, weekly or after transaction. This gives the set of variables shown in Appendix 1. Make sure you fully understand what these variables represent; for a full list see Appendix 1.
The bank wishes to see if different customers have similar profiles and have therefore asked that the data be clustered. They are looking for about five clusters.
For this assignment we will be using only the following variables in the data set. Whilst you are working on the assignment set all the other variables to reject and then you will not have to keep changing them.
Variables to use:
You will need to understand what these variables are so make sure you read Appendix 1.
1.7 Analysis Required
Instruction: Take a note (TAN) of your response to the questions below along with screen shots (SS) of the SAS outputs, as you will need these answers to complete the “ADM Data Mining - Answer Submission Area” form on blackboard.
Since the cluster analysis requires the use of fields that are as symmetrical as possible you should first investigate each of the interval fields in the data.
Question 1
Q1AA. Obtain (SS) suitable plots of the interval variables. (1 mark)
Q1AB. Discuss (TAN) the plots in detail. Hint: when looking at these plots you may wish to consider the following: are there any unusual features? What are the shapes of the plots? What does this mean in relation to how customers behave at the bank? (4 marks)
Question 2
Q2AA. Obtain (SS) suitable plots of the nominal and binary variables. (1 mark)
Q2AB. Discuss (TAN) the plots in detail. What do these show in relation to how customers behave at the bank? (4 marks)
Instruction: Use the transform node in Enterprise Miner and the "Maximum Normal" option for interval variables to find suitable transformations of the interval variables. You should ensure that in your scoring settings, you still retain a copy of the original variables (set both Hide and Reject to "no").
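The idea of choosing a transformation that makes a variable more symmetrical can be sketched outside Enterprise Miner. The Python below is only an illustration under stated assumptions: it uses absolute skewness as the criterion and a small hand-picked set of candidate transformations, neither of which is claimed to match what the "Maximum Normal" option does internally. The data values are made up, and the helper names `skewness` and `best_transform` are hypothetical.

```python
import math

def skewness(values):
    """Skewness as the third standardised moment (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # a constant variable has no skew
    return m3 / m2 ** 1.5

# Candidate transformations (an assumption for illustration only).
# All of them require non-negative input values.
CANDIDATES = {
    "none": lambda v: v,
    "sqrt": lambda v: math.sqrt(v),
    "log": lambda v: math.log(v + 1),
}

def best_transform(values):
    """Return (name, transformed values) with the smallest absolute skewness."""
    scored = {}
    for name, fn in CANDIDATES.items():
        transformed = [fn(v) for v in values]
        scored[name] = (abs(skewness(transformed)), transformed)
    name = min(scored, key=lambda k: scored[k][0])
    return name, scored[name][1]

# Made-up, right-skewed withdrawal totals (not taken from the assignment data).
example_totals = [10, 12, 15, 20, 22, 30, 45, 80, 150, 900]
chosen, transformed = best_transform(example_totals)
print("chosen transformation:", chosen)
print("skewness before:", round(skewness(example_totals), 3))
print("skewness after:", round(skewness(transformed), 3))
```

Because "none" is one of the candidates, the chosen result is never more skewed than the original, which mirrors the possibility that some variables are best left untransformed.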
Question 3
Q3AA. Obtain (SS) a suitable screenshot of the SAS transformation table. (1 mark)
Q3AB. Explain (TAN) what actual transformations the software has picked. Are there any that haven’t been transformed? (2 marks)
Q3AC. Produce further plots (SS) of the transformed and original variables. (3 marks)
Q3AD. Present evidence (TAN) as to whether or not the transformations have been successful. Hint: state for each interval variable whether subsequent analysis should use the original (untransformed) variable or the new transformed variable. (3 marks)
Q3AE. List (TAN) which set of interval variables you would use for clustering going forward. Hint: this can be a combination of both transformed and original variables, but the same variable should not be used twice, regardless of whether it is transformed or not. (2 marks)
Instruction: You now need to use the transformed data - Interval variables only. Firstly, set those variables you chose in question Q3AE to Yes and those you are not using to No. Once you have done this change the settings of the clustering node so as to fit five clusters.
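For readers who want to see what fitting five clusters to interval variables involves, here is a minimal k-means sketch in Python. It is an illustration only: the points are synthetic, the function name `kmeans` is hypothetical, and the algorithm shown is plain Lloyd-style k-means, which is an assumption and not a description of how the Enterprise Miner clustering node is implemented.

```python
import random

def kmeans(points, k, iterations=50, seed=0):
    """Plain k-means: returns (cluster label per point, list of centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return assignment, centroids

# Synthetic two-variable data scattered around five made-up centres.
data_rng = random.Random(1)
centres = [(0, 0), (10, 0), (0, 10), (10, 10), (5, 5)]
points = [(cx + data_rng.gauss(0, 0.5), cy + data_rng.gauss(0, 0.5))
          for cx, cy in centres for _ in range(20)]

labels, centroids = kmeans(points, k=5)
segment_sizes = [labels.count(c) for c in range(5)]
print("segment sizes:", segment_sizes)
```

The list of segment sizes and the centroid coordinates play roughly the same role as the "segment size" plot and "cluster mean statistics" requested in Question 4.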
Question 4
4AA. Obtain (SS) a “segment size” plot of the output. (1 mark)
4AB. Discuss (TAN) the “segment size plot” obtained in question 4AA. (1 mark)
4AC. Obtain (SS) the “cluster mean statistics” of the output. (1 mark)
4AD. Discuss (TAN) the “cluster mean statistics” obtained in question 4AC. (2 marks)
Question 5
5AA. Using the original variables, rather than the transformed variables, produce suitable plots (SS) to investigate the nature of each cluster for all the payment related variables (those that end in –t). (2 marks)
5AB. Using the plots in Question 5AA, illustrate the validity of your cluster solution by profiling clusters 1 and 3 and interpret the factors that make these unique (TAN). (5 marks)
Instruction: In the data node, set second as the target variable. Now add a data partition node to the data node and set the training level to 70%, the validation to 30% and the test to 0%. Add a decision tree node to the data partition node as in Figure 1.0 and adjust the tree settings as per Figure 1.1.
Figure 1.0 - Enterprise Miner Stream
Figure 1.1 - Decision Tree Settings
Now run the Decision Tree node.
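The 70%/30%/0% partition described in the instruction above can be illustrated with a short Python sketch. It assumes a simple random split with a fixed seed, which may differ from the sampling method the data partition node applies. The account ids are made up and the function name `partition` is hypothetical.

```python
import random

def partition(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and split it into (training, validation)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Made-up account ids standing in for the rows of the data set.
account_ids = list(range(1, 101))
train, validation = partition(account_ids)
print(len(train), "training rows,", len(validation), "validation rows")
```

The model is fitted on the training rows only, and the validation rows are held back to check how well the fitted tree generalises.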
Question 6
6AA. Obtain (SS) a “tree diagram” of the output. (1 mark)
6AB. Fully interpret (TAN) the derived tree. (2 marks)
6AC. Obtain (SS) the Fit Statistics from the derived tree. (1 mark)
6AD. Discuss (TAN) the Fit Statistics for the derived tree. (2 marks)
Instruction: Use the decision tree in conjunction with the attribute values given below:
Age = ., creditn = 0.01, creditt = 200, stmentn = 0.02, stmentt = 10, card = y, cardwdn = 0, cardwdt = 0, insuren = 0, insuret = 0, overdtn = 0.42, overdtt = 600, days = 800, frequency = monthly, householdn = 0, householdt = 0, othbwdn = 1000, othbwdt = 500, loanpayn = 6000, loanpayt = 98894, sex = M, cashwdn = 0, cashwdt = 0
6B. Record (TAN) whether or not the account is likely to have a second account holder. (2 marks)
6C. Explain (TAN) the important factors that impact on having a second account holder. What reservations might you have? (3 marks)
Question 7
7AA. Use all of your results above to discuss (TAN) how all the analysis you have carried out may be utilised by the bank. (2 marks)
7AB. Discuss (TAN) how the results of the analyses above may be utilised by the bank to carry out further supervised data mining. (2 marks)
7AC. Record what other possible targets might be appropriate for future data mining at the bank. (2 marks)
Total Marks available: 50 marks
(Data Mining Assignment contributes 40% to the final module mark)
2023-04-27