
AFM 346 - Final Project

Spring 2022 - Week 11

Introduction

In this final project, you will have a chance to pull together the skills and knowledge that you have gained this term. The task is this: predicting credit card defaults. Data comes from Taiwanese consumers in 2005, courtesy of I.C. Yeh and C.H. Lien. You’ll be using an adapted form of this data. (References are at the end of this document, including an optional research paper based on the original data.)

Getting the Data

The project data is in the default of credit card clients - adapted file in the Week 11 module on Learn.

Overview of the Data

Note that all monetary amounts are in New Taiwan dollars (or TWD).

· Outcome variable: is_default: default on payment next month (yes, no)

· Predictor variables:

General 

§ id: ID of each client

§ limit_balance: amount of given credit (includes individual and family/supplementary credit)

§ gender: gender (male, female)

§ educ_level: (graduate school, university, high school, others, unknown 1, unknown 2)

§ marital: marital status (married, single, others)

§ age: age in years

Repayment status (-2 = ?, -1 = pay duly, 0 = ?, 1 = payment delay for one month, 2 = payment delay for two months, … 8 = payment delay for eight months, 9 = payment delay for nine months and above; the meanings of the -2 and 0 codes are not documented in the source data)

§ pay_0: repayment status in September 2005

§ pay_2: repayment status in August 2005

§ pay_3: repayment status in July 2005

§ pay_4: repayment status in June 2005

§ pay_5: repayment status in May 2005

§ pay_6: repayment status in April 2005

Bill amount 

§ bill_amount1: amount of bill statement in September 2005

§ bill_amount2: amount of bill statement in August 2005

§ bill_amount3: amount of bill statement in July 2005

§ bill_amount4: amount of bill statement in June 2005

§ bill_amount5: amount of bill statement in May 2005

§ bill_amount6: amount of bill statement in April 2005

Payment amount 

§ pay_amount1: amount of previous payment in September 2005

§ pay_amount2: amount of previous payment in August 2005

§ pay_amount3: amount of previous payment in July 2005

§ pay_amount4: amount of previous payment in June 2005

§ pay_amount5: amount of previous payment in May 2005

§ pay_amount6: amount of previous payment in April 2005

Guidelines

Your submission should include the items listed below. For each item, please submit both an Rmd file and an MHTML file.

· Summary report

· Working papers (3)

The report objective is to recommend two models to predict defaults on credit card payments. The report will also include a third model as a baseline.

Audience for this Report

The audience for this report will be the two people listed below. Approval for the continuation of your project will depend on their evaluations.

1. An executive who will read only the summary report

2. A data scientist who will read the summary report and the working papers, and then provide advisory input to the executive

Problem Statement

The problem statement should be your assumption about the most important business problem to solve with this project:

· Operating within a limited department budget?

· Reducing the bank’s losses from credit-card loans?

· Something else?

Summary Report

Please organize your summary report according to the items below.

· Introduction 

o Problem statement

o Project objective

o Overview of report sections

o Executive summary of model recommendations

· Methodology 

Overview of modeling process: the main steps used for selecting the models in your summary report. (It may also be useful to mention any models that were tested but omitted from this report.)

Model selection: description of the tuning process for hyperparameters. This may include regular grids, Latin hypercubes, Bayesian optimization, or a combination of these methods. Be sure to include the rationale for your choices. Discuss your process for selecting among different model types, such as KNN or boosting. (Computing requirements are a reasonable consideration for certain model types.)

§ Note: For each model in the working papers that you submit, complete at least two rounds of tuning. Examples of this process are in the applied lectures for Week 9, in sections titled “Tune the First Model” and “Tune the Second Model”. Your second round of tuning should build on the results from your first round of tuning.
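As a sketch of that two-round pattern (assuming a hypothetical tidymodels workflow `knn_wf`, resamples `folds`, and metric set `my_metrics` — all names are illustrative, not prescribed):

```r
# Round 1: a coarse regular grid over a wide range of neighbors.
grid_1 <- tibble(neighbors = c(5, 15, 30, 60, 100))
tune_1 <- tune_grid(knn_wf, resamples = folds,
                    grid = grid_1, metrics = my_metrics)

# Inspect round 1, then zoom in around the best region in round 2.
show_best(tune_1, metric = "roc_auc")
grid_2 <- tibble(neighbors = 20:40)
tune_2 <- tune_grid(knn_wf, resamples = folds,
                    grid = grid_2, metrics = my_metrics)
```

The second grid should be chosen by looking at the first round's results, so that round 2 genuinely builds on round 1.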

Performance metrics: the metric to optimize, along with rationale for your choice and a definition of the metric. If other metrics were used, please define them and explain their use in your process.
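For example, a yardstick metric set that optimizes the area under the ROC curve while also tracking a few supporting metrics might look like this (the object name `my_metrics` is illustrative):

```r
library(tidymodels)

# Optimize ROC AUC; also record sensitivity, specificity, and accuracy.
my_metrics <- metric_set(roc_auc, sensitivity, specificity, accuracy)
```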

Data splits and resampling: A description of your proportions for splitting the data into training and test sets. Also a description of your cross-validation approach, the number of folds, and the number of repeats.

§ Please set each of your two random seeds using Google’s random-number generator. You can access this feature by entering “random number” into a Google search box. Each of your random seeds should be an integer between one thousand and one billion.

§ When splitting the data, you may use a training percentage as low as 50%. Please explain the cost-benefit trade-offs for whatever train/test split you choose.

§ When creating resamples, you may use a number of folds as low as 5, and a number of repeats as low as 3. Please explain the cost-benefit trade-offs for whatever folds and repeats you choose.
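Putting those choices together, a split-and-resample setup might be sketched as below. The seed values are placeholders (replace them with your own Google-generated integers), and the 70% split with 5 folds and 3 repeats is just one defensible combination:

```r
library(tidymodels)

set.seed(123456789)  # placeholder: use your own Google-generated seed
credit_split <- initial_split(credit, prop = 0.70, strata = is_default)
credit_train <- training(credit_split)
credit_test  <- testing(credit_split)

set.seed(987654321)  # placeholder: second Google-generated seed
credit_folds <- vfold_cv(credit_train, v = 5, repeats = 3,
                         strata = is_default)
```

Stratifying on the outcome keeps the class proportions similar across the training set, test set, and folds, which matters when defaults are a minority class.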

Feature engineering: A brief discussion of your approach to selecting standard preprocessing steps. Also a description of any non-standard preprocessing steps.

· Data 

Data description: number of rows, number of columns, and column names and descriptions

Exploratory data analysis: 

§ Data quality: total number of missing values (if any) in each data column

§ Data visualizations 

§ Outcome: proportion of each class

§ Each numeric predictor: overall distribution, and relationship with outcome

§ Each categorical predictor: overall frequency, and relationship with outcome

§ Summary table: summary statistics for all numeric variables. There should be one variable in each table row and one statistic in each table column. Statistics should include minimum, maximum, mean (or median), and standard deviation (or inter-quartile range). Additional statistics may also be included.
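The missing-value count and the numeric summary table can both be built with base R. A minimal, self-contained sketch on a tiny made-up data frame (apply the same pattern to the real credit data):

```r
# Toy data standing in for the credit data.
toy <- data.frame(
  limit_balance = c(20000, 120000, NA, 50000),
  age           = c(24, 26, 34, 37)
)

# Missing values per column.
missing_counts <- sapply(toy, function(x) sum(is.na(x)))

# One variable per row, one statistic per column.
summary_table <- data.frame(
  variable  = names(toy),
  min       = sapply(toy, min,  na.rm = TRUE),
  max       = sapply(toy, max,  na.rm = TRUE),
  mean      = sapply(toy, mean, na.rm = TRUE),
  sd        = sapply(toy, sd,   na.rm = TRUE),
  row.names = NULL
)
```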

· Baseline Model 

o Choose a model that:

§ Can be explained easily

§ Can be trained quickly

§ Has low performance, but better than random

o Items to report (also in the next two sections below)

§ Description of the model with rationale for choosing it

§ Brief description of the hyperparameters

§ Brief description of the preprocessing steps

§ Results from cross-validation

§ Best combination of hyperparameters

· Low-complexity Model 

o A moderately well-performing model. Relative to the best-performing model, this moderate model should be chosen for one of two reasons:

1. The moderate model is simpler, having fewer hyperparameters to tune. (The select_by_one_std_err() or select_by_pct_loss() function can be useful for this.)

2. The moderate model is easier to explain (e.g., ridge/lasso logistic regression).
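For instance, with a tuned lasso logistic regression (the result object `lasso_tuned` and workflow `lasso_wf` are hypothetical names), the one-standard-error rule picks the simplest candidate whose performance is within one standard error of the best:

```r
# Prefer larger penalties (simpler models) among near-best candidates.
simpler_params <- select_by_one_std_err(lasso_tuned, desc(penalty),
                                        metric = "roc_auc")
final_lasso_wf <- finalize_workflow(lasso_wf, simpler_params)
```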

o Items to report (same items as with baseline model)

§ …

· Best-performing Model 

o The top-performing model in your experiments

§ By how much did this model outperform other models (significantly or marginally)?

§ Training time for this model, with some discussion of the trade-off between training time and performance benefits

o Items to report (same items as with baseline model)

§ …

· Test Results 

o Fit your moderate and best models on the full training data

o Evaluate these models with the test data

o Report the results. Along with your narrative, include a confusion matrix, a ROC curve, and a precision-recall curve.
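One way to produce those items with tidymodels and yardstick, assuming the split object from earlier and a finalized workflow (`final_wf` is an illustrative name):

```r
# last_fit() trains on the full training set and predicts the test set.
final_res  <- last_fit(final_wf, credit_split)
test_preds <- collect_predictions(final_res)

# Confusion matrix, ROC curve, and precision-recall curve.
# The .pred_yes column assumes "yes" is the event level of is_default;
# check the factor levels in your own data.
test_preds %>% conf_mat(truth = is_default, estimate = .pred_class)
test_preds %>% roc_curve(truth = is_default, .pred_yes) %>% autoplot()
test_preds %>% pr_curve(truth = is_default, .pred_yes) %>% autoplot()
```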

· Conclusion 

o Recommended model with rationale

o Explanation of either one of the following two items:

§ Why your recommended model is always superior

§ Circumstances under which an alternative model should be preferred

o Technical suggestions for improving your recommended model in advance of deployment, to improve modeling performance, robustness, generality, etc.

o Business suggestions for utilizing your recommended model in practice, to address the underlying problem with credit-card defaults

Working Papers

· Working papers on your three chosen models:

1. Baseline

2. Low-complexity

3. Best-performing

· The working papers will document your experiments. These papers may be used as a basis for awarding partial credit (if needed).

· The working papers should be adequate for an informed predictive analyst to reproduce your experiments, so include sufficient detail and organization for this hypothetical reproducibility. No specific format is required for these documents.

· It’s recommended to create one R Markdown document per model type. These documents should share a common template of your own design. It’s useful to use the save() function to store results after cross-validation has been completed, as shown in lectures. When creating the summary report, you can use the load() function to utilize these results in your discussion.
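The save/load pattern itself is plain base R. A self-contained illustration, using a temporary file and a stand-in results object (in practice you would save your real cross-validation results under a meaningful filename):

```r
# Stand-in for a cross-validation results object.
knn_cv_results <- list(best_neighbors = 30, mean_roc_auc = 0.77)

# In a working paper: persist results after cross-validation finishes.
results_file <- tempfile(fileext = ".RData")
save(knn_cv_results, file = results_file)

# In the summary report: restore the object under its original name.
rm(knn_cv_results)
load(results_file)
```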

Preparing for Modeling

Converting Categoricals to Factors

For modeling in R, it’s useful to convert any categorical variables stored as character type to factor type. The sample code below will do this.

credit <- credit %>%
    # Convert every character column to a factor.
    mutate_if(is.character, as.factor)

Feature Engineering

Here are some suggestions for feature engineering.

First, there are some negative values in the monthly bill amounts. Presumably these values reflect credit balances. It’s useful to do three things, as described below. Some sample code is provided.

1. For each month, create a new, Boolean column to indicate whether a positive credit balance exists.

2. Change the type of this new column to numeric for modeling.

3. For each month, ‘clip’ any negative balance to zero for easier processing.

my_recipe %>%
    # Ensure all bill/payment amount columns are numeric.
    step_mutate_at(matches('_amount[1-6]'), fn = as.numeric) %>%

    # Flag months where the bill shows a credit (negative) balance.
    step_mutate(
        bill_amount_credit1 = bill_amount1 < 0,
        bill_amount_credit2 = bill_amount2 < 0,
        bill_amount_credit3 = bill_amount3 < 0,
        bill_amount_credit4 = bill_amount4 < 0,
        bill_amount_credit5 = bill_amount5 < 0,
        bill_amount_credit6 = bill_amount6 < 0
    ) %>%

    # Convert the Boolean flags to numeric for modeling.
    step_mutate_at(matches('credit'), fn = as.numeric) %>%

    # Clip negative bill amounts to zero.
    step_mutate_at(matches('bill_amount[1-6]'), fn = ~ if_else(. < 0, 0, .))

Second, for each month, you may want to handle the repayment-status columns the same way that you handled the bill-amount columns.

Third, for each month, you may want to create a new feature, such as the difference between the bill amount and the payment amount. Or the ratio of the payment amount to the bill amount.
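Those monthly difference and ratio features could be sketched in a recipe step like this (shown for month 1 only; `pay_gap1` and `pay_ratio1` are invented names, and the `+ 1` in the denominator is one simple guard against division by zero):

```r
my_recipe %>%
    step_mutate(
        # Unpaid portion of the month-1 bill.
        pay_gap1   = bill_amount1 - pay_amount1,
        # Share of the month-1 bill that was paid.
        pay_ratio1 = pay_amount1 / (bill_amount1 + 1)
    )
```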

If you engineer any of these features, it’s important to test their utility for your models. Cross-validation will give you feedback on this utility. It may be useful to design one or more of your experiments to use two recipes, so that you can evaluate different preprocessing options with different workflows.

References

· Default of Credit Card Clients Data Set at the UCI Machine Learning Repository.

· Yeh, I. C., & Lien, C. H. (2009). “The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients”. Expert Systems with Applications, 36(2), 2473-2480.