BANKING DATASET CLASSIFICATION
SYSTEM REQUIREMENTS AND ALGORITHM
This is a research-based paper in which I identify a classification problem and solve it with a decision tree classifier implemented from scratch, with Python as the main programming language. The problem is a business problem: banks at times experience a low flow of revenue from their customers, which leads to a decline in revenue. The bank management would therefore like to find a way of increasing revenue so that the bank's capital can grow too.
Faced with this problem, the bank's top managers set out to find the root cause of the revenue decline. During their investigation they found that their customers, at large, were not investing enough in long-term deposits, which are the bank's main source of revenue (Carapeto et al., 2011).
Following this finding, the management wanted to know which existing customers have a high chance of subscribing to long-term deposits. With this information, the top management can focus its marketing efforts on those customers so that, by the end of the revenue collection period, the bank achieves its target revenue increase.
To be clear, the functional requirement of the system is to classify customers and identify those who have a high chance of making a long-term deposit in the bank. This is achieved using related features such as the customer's type of employment, the job type, and any other features with a high correlation to the target variable, which records whether or not the customer makes a long-term investment.
For non-functional requirements: since the system deals with customer deposits held in the bank over a long period, it must be well protected against intruders exploiting vulnerabilities, so that the integrity of customer data is preserved and no losses are incurred (Grigoroudis et al., 2016).
The algorithm works by sorting customers into two groups: the first group contains customers with a low chance of subscribing to long-term deposits, while the second contains customers with a high chance of subscribing to long-term deposits.
EXPLANATION OF HOW THE CODE WORKS
My codebase is implemented largely with functions. What follows is an explanation of how the code works and how these functions depend on one another. The first function is train_test_split(). It takes the dataset and splits it into a training set and a testing set. Before splitting, row indices are sampled at random so that the test records are not taken in the original order of the dataset. This reduces the risk that the ordering of the records biases the evaluation. The test set is stored in the test_df variable and the training set in the train_df variable for later use in the code flow.
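The behaviour of train_test_split() can be tried on a small table. The sketch below reuses the function from the appendix; the toy DataFrame and its `age` column are made up for illustration and are not part of the bank dataset.

```python
import random

import pandas as pd


def train_test_split(df, test_size):
    """Hold out `test_size` rows (or that fraction of the rows) as a test set."""
    if isinstance(test_size, float):
        test_size = round(test_size * len(df))
    # Pick the test rows at random so they do not follow the file order.
    test_indices = random.sample(population=df.index.tolist(), k=test_size)
    test_df = df.loc[test_indices]
    train_df = df.drop(test_indices)
    return train_df, test_df


# Hypothetical toy data, only to show the shapes of the two splits.
toy_df = pd.DataFrame({
    "age": [25, 40, 33, 51, 29, 46, 38, 60, 22, 35],
    "label": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

random.seed(0)
train_df, test_df = train_test_split(toy_df, test_size=0.2)
print(len(train_df), len(test_df))  # 8 2
```

Passing a float is read as a fraction of the rows, while an integer (as in the appendix, test_size=20) is read as an absolute number of test rows.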
The check_purity() function checks whether the data is pure, that is, whether the target column contains only one unique class. If more than one class is present, the data is impure and needs further splitting; if exactly one class remains, the function returns True and the node can be turned into a leaf. The classify_data() function performs the classification using the target column: it returns the class that occurs most often in the data. Remember that the target column is the target variable showing whether a customer has a high or low chance of investing in long-term deposits.
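As a small illustration of these two helpers, the sketch below applies compact versions of them to two toy arrays. The arrays are invented for the example; the last column plays the role of the target.

```python
import numpy as np


def check_purity(data):
    """Return True when every row carries the same label (last column)."""
    return len(np.unique(data[:, -1])) == 1


def classify_data(data):
    """Return the most frequent label in the last column."""
    classes, counts = np.unique(data[:, -1], return_counts=True)
    return classes[counts.argmax()]


# Hypothetical toy arrays: one feature column followed by the label column.
pure = np.array([[30, 1], [45, 1], [52, 1]])
mixed = np.array([[30, 0], [45, 1], [52, 1]])

print(check_purity(pure))    # True
print(check_purity(mixed))   # False
print(classify_data(mixed))  # 1
```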
get_potential_splits() is another function. For every feature column, it collects the candidate values at which the data could be split. Since I want to predict the target variable, the function skips the target column (the last column) and considers only the features; for each feature, the candidates are the midpoints between consecutive unique values. The split_data() function splits the data on a given column and split value. The two resulting subsets are returned as data_below (rows whose value is less than or equal to the split value) and data_above (rows whose value is greater). Note that this is the splitting of a tree node into sub-nodes, not the train/test split.
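To make the midpoint idea concrete, here is a minimal sketch on a toy array with one feature column. It computes the midpoints with array slicing instead of the explicit inner loop used in the appendix, which is a stylistic substitution of mine; the values are invented.

```python
import numpy as np


def get_potential_splits(data):
    """Map each feature column to the midpoints between its sorted unique values."""
    splits = {}
    for column_index in range(data.shape[1] - 1):  # skip the label column
        unique_values = np.unique(data[:, column_index])
        midpoints = (unique_values[1:] + unique_values[:-1]) / 2
        splits[column_index] = midpoints.tolist()
    return splits


def split_data(data, split_column, split_value):
    """Split rows into those at or below the threshold and those above it."""
    column_values = data[:, split_column]
    return data[column_values <= split_value], data[column_values > split_value]


# Hypothetical toy array: one feature column followed by the label column.
toy = np.array([[20.0, 0], [30.0, 0], [50.0, 1]])

splits = get_potential_splits(toy)
print(splits)  # {0: [25.0, 40.0]}

data_below, data_above = split_data(toy, split_column=0, split_value=40.0)
print(len(data_below), len(data_above))  # 2 1
```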
calculate_entropy() is a function with one argument, data, which is the dataset as a NumPy array (the values of the data frame) passed in during data splitting. The function measures the impurity of the data by calculating the entropy of the target column. This helps the decision tree decide how to split the dataset: the split with the lowest overall entropy is chosen. In this way, the function largely determines where the decision tree draws its boundaries on the dataset in use.
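The entropy of a set of labels is H = -sum(p_i * log2(p_i)), where p_i is the share of class i. The sketch below evaluates it, together with the weighted overall entropy of a split, on an invented four-row array to show the two extreme cases.

```python
import numpy as np


def calculate_entropy(data):
    """Entropy (in bits) of the label column, which is the last column."""
    _, counts = np.unique(data[:, -1], return_counts=True)
    probabilities = counts / counts.sum()
    return float(np.sum(probabilities * -np.log2(probabilities)))


def calculate_overall_entropy(data_below, data_above):
    """Entropy of both sides of a split, weighted by their share of the rows."""
    n = len(data_below) + len(data_above)
    return (len(data_below) / n * calculate_entropy(data_below)
            + len(data_above) / n * calculate_entropy(data_above))


# Hypothetical toy array: two rows of class 0 followed by two rows of class 1.
balanced = np.array([[1, 0], [2, 0], [3, 1], [4, 1]])

# A 50/50 mix of two classes is maximally impure: entropy is 1 bit.
print(calculate_entropy(balanced))  # 1.0

# Splitting between rows 2 and 3 gives two pure halves, so the
# weighted entropy of the split drops to zero.
perfect_split = calculate_overall_entropy(balanced[:2], balanced[2:])
```

A split that lowers the overall entropy the most is the one determine_best_split() keeps.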
RESULTS OF THE CLASSIFICATION TREE
Below is a decision tree showing the classification of customers who are most likely to make a long-term investment in the bank, based on different factors.
[Figure: classification tree]
Following the above classification tree, it is clear that customers who are employed and take a loan from the bank are the most likely to invest in the bank for the long term. Based on these results, I recommend that the bank target more customers who are employed and willing to take a loan. As a result, the bank's revenue will increase, which is the main objective of the company.
Generally, the flow of the whole program is as follows. First, I connected my Google Colab notebook to the Google Drive where I had uploaded and stored the train and test datasets used in the program. After connecting the two, I loaded the dataset into a new Python Colab notebook, where I performed the data manipulation and analysis. After loading the dataset, I split the data, since it was already clean and ready to use. I also tested the purity of my dataset to check whether all records belong to a single class. Later on, I wrote the code to find the overall entropy of the dataset. I also set up an example input to the program so that I could test how accurate the model was. From here, I could perform the classification on the training and testing parts. Lastly, I produced the decision tree that classifies the chance of a customer making a long-term deposit based on the various features that affect this chance (Doumpos & Zopounidis, 2012).
In my code, the dataset is represented in a tabular format. The rows (records) are the observations in the dataset. The columns are the features, and the column names act as their labels. The dataset has 14 features, including the target variable, which is the outcome variable used in the classification. The dataset is loaded from a CSV (comma-separated values) file into a pandas DataFrame so that data manipulation can be done.
Several Python features have been used in the program. Lists hold the candidate split values of each column and the two answers (branches) of each question in the tree. Python dictionaries map each column index to its candidate splits and represent the decision tree itself, with each question as a key. Tuples are used to return several values from one function, for example the train and test sets from train_test_split().
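The nested dictionary-and-list representation of the tree can be illustrated with a short traversal. This is a condensed sketch of the classify_example() function from the appendix; the toy tree, its thresholds and the `age` feature are invented for the example.

```python
def classify_example(example, tree):
    """Walk a nested {question: [yes_answer, no_answer]} tree down to a leaf."""
    question = next(iter(tree))  # each node has exactly one question
    feature_name, _, value = question.split()
    branch = 0 if example[feature_name] <= float(value) else 1
    answer = tree[question][branch]
    # A dict is another question; anything else is the predicted class.
    if isinstance(answer, dict):
        return classify_example(example, answer)
    return answer


# Hypothetical tree: dictionary keys are questions, list items are the branches.
toy_tree = {"duration <= 200.5": ["no", {"age <= 35.0": ["yes", "no"]}]}

print(classify_example({"duration": 120, "age": 50}, toy_tree))  # no
print(classify_example({"duration": 300, "age": 30}, toy_tree))  # yes
```

Because the question is split on whitespace, this representation assumes that feature names contain no spaces.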
CODE COMPLEXITY ANALYSIS
The data inputs to this program vary, since different customers of the bank have different values for the features. An increase in the size of the input increases the running time; the relationship between running time and input size is linear.
Below is the running time of the code for a range of different input sizes.
For linear running time O(n), where n is the input size, the running time grows in proportion to n: for example, going from n = 2 to n = 4 doubles the running time.
Below is a Big-O plot showing how the running time responds when different input sizes are used to test the complexity of the code. It is clear that the running time is linear in the input size.
According to the graph above, the complexity of the code is O(n). This means that the running time is linear in the size of the input. In addition, the code is limited to variables such as duration as its input value. Counting a single, non-nested loop over the input gives 1 loop = O(n) (Sueyoshi & Kirihara, 2018). Note that get_potential_splits() and determine_best_split() in the appendix do contain nested loops, so the linear behaviour should be read as an empirical observation over the measured input sizes rather than as a worst-case bound.
There is minimal usage of loops in the code; instead, many functions were used to achieve the results. This is because many loop iterations lower the speed of execution, so loops were minimized in order to increase the running speed of my code (Carapeto et al., 2011).
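One way to obtain running times for different input sizes is to time a helper with time.perf_counter(). The sketch below is only an assumption about how such a measurement could be set up: it times the entropy calculation alone, on randomly generated labels, and the sizes are arbitrary. It is not the measurement behind the plot above.

```python
import time

import numpy as np


def calculate_entropy(labels):
    """Entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probabilities = counts / counts.sum()
    return float(np.sum(probabilities * -np.log2(probabilities)))


def time_entropy(n, repeats=5):
    """Best-of-`repeats` wall-clock time for one entropy call on n labels."""
    rng = np.random.default_rng(0)
    labels = rng.integers(0, 2, size=n)  # synthetic binary labels
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        calculate_entropy(labels)
        best = min(best, time.perf_counter() - start)
    return best


# Hypothetical input sizes; the timings depend on the machine.
sizes = [1_000, 10_000, 100_000]
timings = {n: time_entropy(n) for n in sizes}
for n, seconds in timings.items():
    print(f"n={n:>7}: {seconds:.6f} s")
```

Plotting these timings against n shows how the running time of that one helper grows; the full tree-building routine would need to be timed in the same way to check the overall claim.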
TEST DATA DESCRIPTION
I used a dataset of banks located in Portugal. This is open-source data that shows how customer investments affect the rise and decline of bank revenue in Portugal. I chose this open-source dataset because, after identifying the business problem, I came across it and found that the same problem occurs in Portugal, so I decided to analyze it. Using open-source data also saved time, and had I generated data randomly I could have ended up solving a problem that does not exist. The dataset I chose has several features that are relevant to the classification and help in answering the business-related questions.
Using this dataset and its rich features, the classification tree algorithm has been able to identify the type of customers that are expected to make long-term deposits in the bank. This is the benefit to the bank.
Just after training my code on the training dataset, I tested it on the testing data. Since I had converted all of my categorical data into numerical values, it was easy to use the test data. The code was considered correct if it generated the expected output, that is, a classification for each test record. I would identify a fault in the code if it generated unexpected output, that is, anything other than the classification it is intended to produce.
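The conversion of categorical columns to numbers is not shown in the appendix. One common way to do it, given here only as an assumption about how it might have been done, is to turn each text column into pandas category codes. The toy rows and the list of categorical columns are invented.

```python
import pandas as pd

# Hypothetical rows with two text columns and one numeric column.
toy_df = pd.DataFrame({
    "job": ["admin", "technician", "admin", "retired"],
    "loan": ["yes", "no", "no", "yes"],
    "duration": [120, 300, 45, 210],
})

categorical_columns = ["job", "loan"]  # assumed list of text columns

encoded = toy_df.copy()
for column in categorical_columns:
    # Categories are sorted alphabetically and replaced by 0, 1, 2, ...
    encoded[column] = encoded[column].astype("category").cat.codes

print(encoded["job"].tolist())   # [0, 2, 0, 1]
print(encoded["loan"].tolist())  # [1, 0, 0, 1]
```

The codes carry no real ordering, so a threshold such as "job <= 1.5" only groups categories by their alphabetical position. The same mapping must also be applied to the train and the test data.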
CONCLUSIONS
In summary, the code has both benefits and limitations. One benefit is that the classification of the test data is done by the code itself: I only need to provide the dataset to be classified and the code does the rest, which saves time. The code can also be applied to classify other datasets that have the same format and structure. Where the format differs, some of the functions can still be reused in other code to perform the same task as they do in this program.
However, the code also has limitations. It is procedural, so if one step is omitted the next step will fail, since each step depends on the previous one. And since it is function-based, a lot of code is needed to implement a given piece of functionality.
I believe that my code follows good design principles, since I decided not to combine two different paradigms and instead used only one, the function-based paradigm, to achieve my functionality. This business problem could also be solved with the k-nearest neighbors (KNN) algorithm; however, I preferred the decision tree, because choosing a value of K that gives accurate results becomes a problem when using KNN.
Despite using the function-based paradigm in this implementation, the working and design of the code could still be improved by using the object-oriented paradigm, since it is more precise and clear than the function-based one.
APPENDIX
from google.colab import drive
drive.mount('/content/drive')
# Import Statements
# Load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random
from pprint import pprint
%matplotlib inline
# Load and prepare the train data
df = pd.read_csv("/content/drive/MyDrive/data/new_train.csv")
df = df.rename(columns={"y": "label"})
# Preview the train data
df.head()
# Load and prepare the test data
# Note: this reassigns df, so the code below operates on new_test.csv
df = pd.read_csv("/content/drive/MyDrive/data/new_test.csv")
df = df.rename(columns={"y": "label"})
# Preview the test data
df.head()
# Train-Test-Split
def train_test_split(df, test_size):
    # A float test_size is treated as a fraction of the rows
    if isinstance(test_size, float):
        test_size = round(test_size * len(df))
    indices = df.index.tolist()
    test_indices = random.sample(population=indices, k=test_size)
    test_df = df.loc[test_indices]
    train_df = df.drop(test_indices)
    return train_df, test_df
random.seed(0)
train_df, test_df = train_test_split(df, test_size=20)
# Helper Functions
# The helper functions operate on a NumPy 2d-array. Therefore,
#let’s create a variable called “data” to see what we will be working with.
data = train_df.values
data[:14]

# Data pure? check_purity() returns True when every row has the same label.
# In this case the data is impure, since we have two classes and
# not all rows belong to one class.
def check_purity(data):
    label_column = data[:, -1]
    unique_classes = np.unique(label_column)
    return len(unique_classes) == 1
# Classify the data: return the most frequent class in the label column
def classify_data(data):
    label_column = data[:, -1]
    unique_classes, counts_unique_classes = np.unique(label_column, return_counts=True)
    index = counts_unique_classes.argmax()
    classification = unique_classes[index]
    return classification
# Potential splits to split nodes into sub-nodes
def get_potential_splits(data):
    potential_splits = {}
    _, n_columns = data.shape
    for column_index in range(n_columns - 1):  # excluding the last column which is the label
        potential_splits[column_index] = []
        values = data[:, column_index]
        unique_values = np.unique(values)
        for index in range(len(unique_values)):
            if index != 0:
                # candidate split: midpoint between two consecutive unique values
                current_value = unique_values[index]
                previous_value = unique_values[index - 1]
                potential_split = (current_value + previous_value) / 2
                potential_splits[column_index].append(potential_split)
    return potential_splits
# Split the data of a node on one column and one split value
def split_data(data, split_column, split_value):
    split_column_values = data[:, split_column]
    data_below = data[split_column_values <= split_value]
    data_above = data[split_column_values > split_value]
    return data_below, data_above
# Finding the lowest overall entropy
def calculate_entropy(data):
    label_column = data[:, -1]
    _, counts = np.unique(label_column, return_counts=True)
    probabilities = counts / counts.sum()
    entropy = sum(probabilities * -np.log2(probabilities))
    return entropy
def calculate_overall_entropy(data_below, data_above):
    # entropy of both subsets, weighted by their share of the rows
    n = len(data_below) + len(data_above)
    p_data_below = len(data_below) / n
    p_data_above = len(data_above) / n
    overall_entropy = (p_data_below * calculate_entropy(data_below)
                       + p_data_above * calculate_entropy(data_above))
    return overall_entropy
def determine_best_split(data, potential_splits):
    overall_entropy = 9999
    for column_index in potential_splits:
        for value in potential_splits[column_index]:
            data_below, data_above = split_data(data, split_column=column_index, split_value=value)
            current_overall_entropy = calculate_overall_entropy(data_below, data_above)
            # keep the split with the lowest overall entropy seen so far
            if current_overall_entropy <= overall_entropy:
                overall_entropy = current_overall_entropy
                best_split_column = column_index
                best_split_value = value
    return best_split_column, best_split_value
# Decision Tree Algorithm
# Representation of the Decision Tree
sub_tree = {"question": ["yes_answer",
"no_answer"]}
example_tree = {"job <= 5": ["No", {"job <= 8": [{"education <= 4.9": ["Yes", "No"]}] }]}
# The Algorithm
def decision_tree_algorithm(df, counter=0, min_samples=3, max_depth=14):
    # data preparations
    if counter == 0:
        global COLUMN_HEADERS
        COLUMN_HEADERS = df.columns
        data = df.values
    else:
        data = df
    # base cases
    if (check_purity(data)) or (len(data) < min_samples) or (counter == max_depth):
        classification = classify_data(data)
        return classification
    # recursive part
    else:
        counter += 1
        # helper functions
        potential_splits = get_potential_splits(data)
        split_column, split_value = determine_best_split(data, potential_splits)
        data_below, data_above = split_data(data, split_column, split_value)
        # instantiate sub-tree
        feature_name = COLUMN_HEADERS[split_column]
        question = "{} <= {}".format(feature_name, split_value)
        sub_tree = {question: []}
        # find answers (recursion)
        yes_answer = decision_tree_algorithm(data_below, counter, min_samples, max_depth)
        no_answer = decision_tree_algorithm(data_above, counter, min_samples, max_depth)
        # If the answers are the same, then there is no point in asking the question.
        # This could happen when the data is classified even though it is not pure
        # yet (min_samples or max_depth base cases).
        if yes_answer == no_answer:
            sub_tree = yes_answer
        else:
            sub_tree[question].append(yes_answer)
            sub_tree[question].append(no_answer)
        return sub_tree
tree = decision_tree_algorithm(train_df, max_depth=5)
# print the tree
pprint(tree)
#Classification
sub_tree
example = test_df.iloc[0]
example
def classify_example(example, tree):
    question = list(tree.keys())[0]
    feature_name, comparison_operator, value = question.split()
    # ask question
    if example[feature_name] <= float(value):
        answer = tree[question][0]
    else:
        answer = tree[question][1]
    # base case
    if not isinstance(answer, dict):
        return answer
    # recursive part
    else:
        residual_tree = answer
        return classify_example(example, residual_tree)
# Calculate Accuracy
def calculate_accuracy(df, tree):
    df["classification"] = df.apply(classify_example, axis=1, args=(tree,))
    # the target column "y" was renamed to "label" when the data was loaded
    df["classification_correct"] = df["classification"] == df["label"]
    accuracy = df["classification_correct"].mean()
    return accuracy
REFERENCES
Huang, J., Chai, J., & Cho, S. (2020). Deep learning in finance and banking: A literature review and classification. Frontiers of Business Research in China, 14(1), 1-24.
Doumpos, M., & Zopounidis, C. (2012). Multi–criteria classification methods in financial and banking decisions. International Transactions in Operational Research, 9(5), 567-581.
Grigoroudis, E., Politis, Y., & Siskos, Y. (2016). Satisfaction benchmarking and customer classification: An application to the branches of a banking organization. International Transactions in Operational Research, 9(5), 599-618.
Carapeto, M., Moeller, S., Faelten, A., Vitkova, V., & Bortolotto, L. (2011). Distress classification measures in the banking sector. Risk Governance and Control: Financial Markets & Institutions, 1(4), 19-30.
Sueyoshi, T., & Kirihara, Y. (2018). Efficiency measurement and strategic classification of Japanese banking institutions. International Journal of Systems Science, 29(11), 1249-1263.
2022-02-18