Homework 1

KNN, Classification, Metrics

Homework Due: Monday, Sep 29 at 11:59pm

This HW consists of 5 problems, all using the same College.csv dataset.

Colleges in the United States are often compared and evaluated using a variety of institutional statistics, such as admissions rate, graduation rate, tuition, SAT scores, and faculty credentials. By analyzing these data, we can explore what factors are most predictive of whether a college can be considered “elite” or not.

The College.csv dataset contains the following features (among others):

. Private: whether the school is private (Yes/No)

. Apps: number of applications received

. Accept: number of acceptances

. Enroll: number of students who actually enrolled

. Top10perc: percentage of new students from the top 10% of their high school class

. Top25perc: percentage of new students from the top 25% of their high school class

. F.Undergrad: number of full-time undergraduates

. P.Undergrad: number of part-time undergraduates

. Outstate: out-of-state tuition

. Room.Board: room and board costs

. Books: estimated cost of books

. Personal: estimated personal expenses

. PhD: percentage of faculty with PhDs

. Terminal: percentage of faculty with terminal degrees

. S.F.Ratio: student-faculty ratio

. perc.alumni: percentage of alumni who donate

. Expend: instructional expenditure per student

. Grad.Rate: graduation rate

1. Data Preparation [5 pts]

Load College.csv into a DataFrame. Add a new column called Elite that classifies schools based on the percentage of new students from the top 10% of their high school class:

If Top10perc ≥ 50, label the school as Elite; otherwise, label the school as Non-Elite.
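The labeling rule above can be sketched as follows, assuming pandas and NumPy are available. The small inline DataFrame uses made-up values purely so the example runs without the file; in the notebook, the commented-out read_csv line would take its place (the file path is an assumption).

```python
import numpy as np
import pandas as pd

# In the notebook, load the real data instead (path is an assumption):
# df = pd.read_csv("College.csv")

# Stand-in frame with made-up values so the sketch runs on its own.
df = pd.DataFrame({
    "Top10perc": [23, 50, 75, 49],
    "Grad.Rate": [60, 85, 95, 70],
})

# Top10perc >= 50 -> "Elite"; everything else -> "Non-Elite".
df["Elite"] = np.where(df["Top10perc"] >= 50, "Elite", "Non-Elite")

print(df["Elite"].value_counts())
```

Note that the boundary value 50 is labeled Elite, because the rule uses ≥ rather than >.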

2. Data Visualization [5 pts]

Choose four features from the dataset (not including the Elite classification) that you think are most indicative of whether a school is Elite or Non-Elite.

Use Seaborn’s pairplot function to create pairwise scatter plots for these four features. Color the scatter plot points according to the Elite classification.

3. Data Analysis [8 pts]

Based on your plots from Q2, write a paragraph analyzing the results:

. Are there any clear patterns or trends?

. Which features or feature pairs appear to be most correlated with being Elite vs Non-Elite?

. If a school wanted to improve its chances of being classified as Elite, what strategies or improvements would you recommend (e.g., recruiting higher-performing students, improving the student-faculty ratio, increasing expenditures)?

4. Model Evaluation [12 pts]

Build and evaluate two classifier models to predict whether a school is Elite.  Choose one from each of the categories below.

. Category 1 (choose one):

o  KNN with a small k of your choice

o  KNN with a large k of your choice

. Category 2 (choose one):

o  A classifier that predicts all schools are Elite

o  A classifier that predicts all schools are Non-Elite
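To see how a constant classifier from Category 2 behaves, the four metrics can be computed by hand. The sketch below is plain Python with made-up labels. Treating Elite as the positive class and reporting 0.0 when a denominator is zero are both assumptions, and the helper name classification_metrics is hypothetical.

```python
def classification_metrics(y_true, y_pred, positive="Elite"):
    """Return precision, recall, accuracy, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn

    # Convention (an assumption): report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs)
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Made-up labels: 2 Elite schools out of 10.
y_true = ["Elite"] * 2 + ["Non-Elite"] * 8

all_elite = classification_metrics(y_true, ["Elite"] * 10)
all_non_elite = classification_metrics(y_true, ["Non-Elite"] * 10)
print("all Elite:    ", all_elite)
print("all Non-Elite:", all_non_elite)
```

With these toy labels, the all-Non-Elite classifier reaches 80% accuracy while its recall is zero, which is why accuracy alone can mislead on imbalanced classes.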

Requirements:

A.   Normalize the input data

B.   Split the dataset into training and testing sets (no overlap, reasonable split)

C.   Report the following metrics for both training and testing data: Precision, Recall, Accuracy, and F1-score
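Requirements A through C might be wired together as in the sketch below, assuming scikit-learn is installed. The data are synthetic (not from College.csv), k = 5 and the 75/25 split are arbitrary illustrative choices, and StandardScaler is only one way to normalize (MinMaxScaler is another). The scaler is fitted on the training set only so that no information from the test set leaks into preprocessing.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not from College.csv): two features, label 1 = Elite.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(80, 2)),
               rng.normal(3.0, 1.0, size=(20, 2))])
y = np.array([0] * 80 + [1] * 20)

# Hold out 25% for testing; stratify keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Fit the scaler on training data only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train_s, y_train)

# Report all four metrics on both the training and the testing data.
results = {}
for name, Xs, ys in [("train", X_train_s, y_train), ("test", X_test_s, y_test)]:
    pred = knn.predict(Xs)
    results[name] = {
        "precision": precision_score(ys, pred, zero_division=0),
        "recall": recall_score(ys, pred, zero_division=0),
        "accuracy": accuracy_score(ys, pred),
        "f1": f1_score(ys, pred, zero_division=0),
    }
    print(name, results[name])
```

Rerunning the loop with a different n_neighbors value is a quick way to compare a small k against a large k on the same split.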

Be sure to include answers to all of the following in paragraph form:

A. For both training and testing data, what are the precision, recall, accuracy, and F1-scores of each classifier?

B.   Which classifier performed best and worst on the training data, and why?

C.   Which classifier performed best and worst on the testing data, and why?

D. Which classifier would you recommend for real-world use, and why?

5. Applying the Model [10 pts]

Consider a new, hypothetical college with the following stats:

. Private = Yes

. Apps = 5000

. Accept = 2000

. Enroll = 800

. Top10perc = 55

. Top25perc = 80

. F.Undergrad = 3500

. P.Undergrad = 500

. Outstate = 20000

. Room.Board = 9000

. Books = 500

. Personal = 2000

. PhD = 80

. Terminal = 85

. S.F.Ratio = 12

. perc.alumni = 25

. Expend = 15000

. Grad.Rate = 90
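The stats above can be loaded into a one-row DataFrame so that the new school passes through the same preprocessing as the training data. This is a sketch using pandas; mapping Private to 1/0 is one possible encoding (an assumption), and the acceptance rate is computed only as a sanity check.

```python
import pandas as pd

# Stats for the hypothetical college, copied from the list above.
new_college = pd.DataFrame([{
    "Private": "Yes", "Apps": 5000, "Accept": 2000, "Enroll": 800,
    "Top10perc": 55, "Top25perc": 80, "F.Undergrad": 3500,
    "P.Undergrad": 500, "Outstate": 20000, "Room.Board": 9000,
    "Books": 500, "Personal": 2000, "PhD": 80, "Terminal": 85,
    "S.F.Ratio": 12, "perc.alumni": 25, "Expend": 15000, "Grad.Rate": 90,
}])

# One possible encoding (an assumption): map Yes/No to 1/0 for distance-based models.
new_college["Private"] = new_college["Private"].map({"Yes": 1, "No": 0})

# Derived ratio, handy as a sanity check on the inputs.
accept_rate = new_college.loc[0, "Accept"] / new_college.loc[0, "Apps"]

print(new_college.T)
print(f"Acceptance rate: {accept_rate:.2f}")
```

Before calling predict, the row would need the same feature columns, in the same order, and the same fitted scaler as the training data.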

Answer the following:

A.   Do you expect this university to be public or private? Why?

B.   Would you predict this school to be Elite or Non-Elite?

C.   Using KNN, what value of k did you choose and why?

D.   Which features did you use, and how did you preprocess the data?

Extra Credit [2 pts]: What is the smallest change to the school’s stats that would flip your prediction? (Use any reasonable definition of “smallest.”)
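One reasonable definition of "smallest" is the smallest absolute change to a single feature, measured in the scaled units that KNN sees. The sketch below searches a grid of changes under that definition, assuming scikit-learn. The training data are made-up toy points, the function name smallest_single_feature_flip is hypothetical, and the step size limits how precise the answer can be.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy, already-scaled training data (made up): label 0 = Non-Elite, 1 = Elite.
X_train = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                    [4, 4], [4, 5], [5, 4], [5, 5]], dtype=float)
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)


def smallest_single_feature_flip(model, x, max_delta=5.0, step=0.1):
    """Search each feature in both directions for the smallest flipping change."""
    base = model.predict(x.reshape(1, -1))[0]
    best = None  # (absolute change, feature index, signed change)
    n_steps = int(round(max_delta / step))
    for j in range(x.size):
        for sign in (1.0, -1.0):
            for k in range(1, n_steps + 1):
                delta = sign * k * step
                trial = x.copy()
                trial[j] += delta
                if model.predict(trial.reshape(1, -1))[0] != base:
                    if best is None or abs(delta) < best[0]:
                        best = (abs(delta), j, delta)
                    break  # first flip in this direction is the smallest on the grid
    return base, best


query = np.array([1.5, 1.5])
base_label, flip = smallest_single_feature_flip(knn, query)
print("baseline prediction:", base_label)
print("smallest flip (|change|, feature index, signed change):", flip)
```

Other definitions, such as the smallest Euclidean change across several features at once, would need a different search and may give a different answer.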

Deliverables:

.    A Jupyter notebook .ipynb file that you used to answer the questions

o  Make sure your notebook file, when run, generates at least a pandas DataFrame (Q1) and scatter plots (Q2)

o  It can also generate other analyses

.    A report in the form of a PDF file

o  Questions 3, 4, and 5 should be answered in this document

o  This report should clearly delineate which parts respond to which question. For full points, the report should make use of plots/figures to justify or support your answers.

.    Alternatively, you can write your report in your Jupyter notebook.

o  However, if you decide to do that, you must be sure that each question is answered in full English text, that the questions are clearly marked, and that there is no extraneous code unrelated to the questions. If you submit a PDF file, then we will not grade the code directly (feel free to have messy code!), but be sure to include all figures and results needed for the PDF file to be a standalone document.

Intention of the assignment:

.     Q1 is designed to make sure you understand the basics of loading and manipulating datasets from raw CSV files.

.     Q2 asks you to extend this skill to graphing data, and in the process also requires you to thoughtfully consider which features are the most meaningful to graph.

.     Q3 exercises your ability to thoughtfully analyze data, and tie in concepts from class like bias, variance, flexibility, and how to select good models.

.     Q4 focuses on your ability to create and tune classification models, and checks to make sure you can correctly compute classification metrics.

.     Q5 asks you to actually make predictions on a new data point, and justify this prediction – this is one of the key skills we want you to take from this class!

Grading Criteria:

A.   Does the Python code correctly load the College.csv data into a single DataFrame? Does this one DataFrame contain both the original data and the derived Elite classification?

B.   Do the four chosen features make sense and show an interesting relationship to Elite vs Non-Elite status? Do the pairplot results look correct? Is the plot easy to read?

C.   Are the written arguments backed up with figures and plots? Do the arguments make sense? Is there a clear transition from numbers to meaningful real-world advice?

D.   Are the two classifiers implemented correctly? Are the four metrics (precision, recall, accuracy, F1-score) computed correctly? Is there a clear understanding of the difference between training and testing data? Is the train/test split correct (e.g., no overlap) and reasonable (neither too small nor too large)? Does the analysis show an ability to weigh the relative importance of precision vs recall vs accuracy vs F1-score for real-world tasks?

E.   Does the example school data (in Q5) get loaded correctly into Python for classification? Is the choice of k and the preprocessing steps well justified with both charts and clear explanation? Do I believe the student’s reasoning for whether this new school should be classified as Elite or Non-Elite?

F.   [Extra Credit]: How convincing is the analysis that a small change in the school’s statistics would flip its predicted classification?