BUS-888 ML for Finance | Summer 2025 | Assignment 1

09 July 2025

Notes:

A. All deliverables must begin with the following pledge, typed or written:

I pledge on my honor that I have neither received nor given unauthorized assistance on this deliverable.

B. You are free to consult the internet or other resources. If you do use any, please cite them or provide a reference.

C. LLM / AI-tool Policy: Limited use of generative AI to produce code or for debugging is allowed, but you must (i) disclose the prompt in an appendix, (ii) check for errors, and (iii) cite the tool. Use of LLMs/AI for narratives is NOT permitted.

D. Please list any assumptions you make.

E. Upload a PDF document with your answers and attach your code (.py or .ipynb files) separately. You are welcome to use any Python environment (Spyder, Google Colab, PyCharm, etc.).

1.  Exploratory Data Analysis (6 points)

Problem Description:

Quantitative financial analysts must be able to visualize datasets, find patterns, and gain insights. In this problem, we analyze stock data and calculate returns. The dataset contains daily closing prices, over a period, for the instruments listed below. The data is available here (click to download): stock_data.csv.

# AAPL = Apple Stock

# BA = Boeing

# T = AT&T

# MGM = MGM Resorts International (Hotel Industry)

# AMZN = Amazon

# IBM = IBM

# TSLA = Tesla Motors

# GOOG = Google

# sp500 = US Stock Market (S&P 500 index)

Please answer the following questions (below) based on this data. Paste your answers/outputs into a pdf document and attach your code (py or ipynb files) separately.

1.1.    What are the average daily return and the annualized return of the S&P 500 in this period? (1pt)

1.2.    Which instrument had the minimum standard deviation from the mean in dollar value? (1pt)

1.3.    Plot the daily price data for all stocks (including the index), both with and without normalization. (1pt)

1.4.    Plot the correlation heat-map of daily returns for all instruments in the data (1pt)

1.5.    List the three stocks most correlated with the S&P 500. (1pt)

1.6.    Compare T and TSLA over this period: which stock is riskier, and why? (Use a quantitative metric to support your answer.) (1pt)
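The quantities behind questions 1.1 to 1.6 (daily returns, annualized figures, normalized prices, and return correlations) can be sketched as follows. This is a minimal illustration, not a solution: the prices are synthetic and are not taken from stock_data.csv, the 252-trading-day convention and compounding formula are assumptions, and only the column labels `T`, `TSLA`, and `sp500` follow the instrument list above.

```python
import numpy as np
import pandas as pd

# Synthetic prices for illustration only; NOT values from stock_data.csv.
prices = pd.DataFrame(
    {
        "T": [30.0, 30.3, 30.0, 30.6, 30.3],
        "TSLA": [100.0, 110.0, 99.0, 108.9, 119.79],
        "sp500": [3000.0, 3030.0, 3000.0, 3060.0, 3090.6],
    }
)

# Simple daily returns: r_t = P_t / P_{t-1} - 1 (the first row has no prior price).
daily_returns = (prices / prices.shift(1) - 1).dropna()

# Average daily return per instrument.
mean_daily = daily_returns.mean()

# Assumption: 252 trading days per year, annualized by compounding the mean daily return.
TRADING_DAYS = 252
annualized_return = (1 + mean_daily) ** TRADING_DAYS - 1

# Annualized volatility of returns, one common quantitative risk metric.
annualized_vol = daily_returns.std() * np.sqrt(TRADING_DAYS)

# Normalized prices: each series is divided by its first value, so all start at 1.0.
normalized = prices / prices.iloc[0]

# Pairwise correlation of daily returns, then each stock's correlation with the index.
corr = daily_returns.corr()
corr_with_index = corr["sp500"].drop("sp500").sort_values(ascending=False)

print(mean_daily.round(4))
print(corr_with_index.round(3))
```

With matplotlib installed, `prices.plot()` and `normalized.plot()` would give the two price charts, and the `corr` matrix is the input a heat-map function such as `seaborn.heatmap` expects. Note that question 1.2 asks about dispersion in dollar terms, which would use `prices.std()` rather than the standard deviation of returns.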

2.  Supervised Learning (9 points)

Case Description:

Suppose you are a financial analyst intern at a retail bank. Given the scale of daily credit-card transactions, detecting credit-card fraud is challenging and is an ideal use case for ML applications. In one of your internship projects, you are tasked with prototyping ML models to flag credit-card fraud. You are given some initial sample training data (click to download: training_data.csv). The data has the following features (feature/variable name, data type, description) and target label, as shown below.

Feature                          Type      Description
-------------------------------  --------  ----------------------------------------------------
distance_from_home               numeric   Distance from cardholder's home for this transaction
distance_from_last_transaction   numeric   Distance from previous transaction
ratio_to_median_purchase_price   numeric   Ratio of transaction price to median historical price
repeat_retailer                  binary    1 if same retailer has been used before, else 0
used_chip                        binary    1 if chip used, else 0
used_pin_number                  binary    1 if PIN entered, else 0
online_order                     binary    1 if online transaction, else 0
Fraud (Target Label)             binary    1 if fraudulent, else 0

Your task is to build and test ML prediction models to accurately predict fraudulent transactions. Test data has also been provided (click to download: test_data.csv) to help you build and identify the best model. You are expected to use the following algorithms with 5-fold cross-validation:

a) LOGIT (baseline)

b) LOGIT with LASSO

c) LOGIT with RIDGE

d) CART

e) Random Forest (n_estimators = 25)
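One way to set up the five algorithms with 5-fold cross-validation is sketched below using scikit-learn. Several choices here are assumptions rather than requirements of the assignment: the data is a synthetic stand-in generated by `make_classification` (not training_data.csv), the folds are stratified and shuffled, ROC AUC is used as the cross-validation score, the penalty strength `C=1.0` is untuned, and the baseline approximates an unpenalized logit with a very large `C`. The helper `logit` is a hypothetical convenience function.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the training data (7 features, roughly 10% positives).
X, y = make_classification(
    n_samples=400, n_features=7, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)


def logit(**kwargs):
    """Hypothetical helper: standardize features, then fit a logistic regression."""
    return make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000, **kwargs))


models = {
    # A very large C makes the default L2 penalty negligible (approximately unpenalized).
    "LOGIT (baseline)": logit(C=1e6),
    # L1 penalty (LASSO); the liblinear solver supports it.
    "LOGIT with LASSO": logit(penalty="l1", solver="liblinear", C=1.0),
    # L2 penalty (RIDGE) is scikit-learn's default for LogisticRegression.
    "LOGIT with RIDGE": logit(C=1.0),
    "CART": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=25, random_state=0),
}

# Assumption: stratified folds, so each fold keeps a similar fraud rate.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One array of five fold scores per model.
cv_auc = {
    name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    for name, model in models.items()
}

for name, scores in cv_auc.items():
    print(f"{name:18s} mean CV AUC = {scores.mean():.3f}")
```

Scaling inside the pipeline matters for the penalized logits, because L1 and L2 penalties act on coefficient size and the numeric features are on different scales. The tree-based models do not need it.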

Please answer the following questions (below) in preparation for a meeting with the VP and his team to present your initial analyses. Paste your answers/outputs into a PDF document and attach your code separately.

2.1.  Fit each model and evaluate prediction performance on the test data. Please report, in a neat table, the following metrics for the provided test data: accuracy, recall, precision, F1 score, Positive Predictive Value, and AUC. (3 pts)

2.2.  Of these reported metrics, which primary metric would you recommend be used for business deployment? Please justify (<150 words) (2 pts)

2.3.  Of these preliminary models, select the best one for further training, testing, tuning/optimizing, and eventual deployment. Briefly justify your choice in 2-3 bullet points. (Hint: think in terms of performance, interpretability, potential cost). (2 pts)

2.4.  Out of the seven features described above, identify the two most important predictive features. Quantify importance for these two features and briefly explain the intuition behind your reasoning. (2 pts)
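For reference, the metrics named in question 2.1 can be written out from their standard definitions. The sketch below uses only the Python standard library; the labels and scores are made up for illustration, the 0.5 decision threshold is an assumption, and the function names are hypothetical. Positive Predictive Value is computed with the same formula as precision, TP / (TP + FP), since the two terms denote the same quantity.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels where 1 = fraud."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    """Accuracy, recall, precision (= PPV), and F1 from hard predictions."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "recall": recall,
        "precision": precision,
        "ppv": precision,  # Positive Predictive Value is another name for precision.
        "f1": f1,
    }


def auc(y_true, scores):
    """ROC AUC as the share of (fraud, non-fraud) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted fraud probabilities, for illustration only.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.35, 0.8, 0.9, 0.65]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
metrics["auc"] = auc(y_true, scores)
for name, value in metrics.items():
    print(f"{name:9s} {value:.3f}")
```

On this toy input, accuracy is 0.700, recall 0.750, precision and PPV 0.600, F1 about 0.667, and AUC about 0.833. Accuracy looks acceptable even though one of the four fraud cases is missed, which is the kind of trade-off question 2.2 asks about.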