BU MET CS-677: Data Science With Python, v.2.0
CS-677 Assignment: kNN & Log. Regression (banknotes)
Assignment
In this assignment, we will implement k-NN and logistic regression classifiers to detect "fake" banknotes and analyze the comparative importance of features in predicting accuracy.
For the dataset, we use the "banknote authentication" dataset from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/banknote+authentication
Dataset Description: From the website: "This dataset contains 1,372 examples of both fake and real banknotes. Data were extracted from images that were taken from genuine and forged banknote-like specimens. For digitization, an industrial camera usually used for print inspection was used. The final images have 400 x 400 pixels. Due to the object lens and distance to the investigated object, gray-scale pictures with a resolution of about 660 dpi were gained. A Wavelet Transform tool was used to extract features from the images."
There are 4 continuous attributes (features) and a class:
1. f1 - variance of wavelet transformed image
2. f2 - skewness of wavelet transformed image
3. f3 - curtosis of wavelet transformed image
4. f4 - entropy of image
5. class (integer)
In other words, assume that you have a machine that examines a banknote and computes 4 attributes (step 1). Then each banknote is examined by a much more expensive machine and/or by human expert(s) and classified as fake or real (step 2). The second step is very time-consuming and expensive. You want to build a classifier that gives you results after step 1 only.
We assume that class 0 are good banknotes. We will use color "green" or "+" for legitimate banknotes. Class 1 are assumed to be fake banknotes, and we will use color "red" or "-" for counterfeit banknotes. These are the "true" labels.
Question 1:
1. load the data into a dataframe and add a column "color". For each class 0 row this should contain "green", and for each class 1 row it should contain "red"
2. for each class and for each feature f1, f2, f3, f4, compute its mean μ() and standard deviation σ(). Round the results to 2 decimal places and summarize them in a table as shown below:
3. examine your table. Are there any obvious patterns in the distribution of banknotes in each class?
class | μ(f1) | σ(f1) | μ(f2) | σ(f2) | μ(f3) | σ(f3) | μ(f4) | σ(f4)
0     |       |       |       |       |       |       |       |
1     |       |       |       |       |       |       |       |
all   |       |       |       |       |       |       |       |
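The steps above can be sketched with pandas. The frame below is a small synthetic stand-in for the UCI file (the real file is a headerless CSV, usually named data_banknote_authentication.txt; swap in pd.read_csv on the downloaded file for the actual assignment):

```python
import numpy as np
import pandas as pd

# synthetic stand-in for the banknote data; for the real assignment use e.g.
# df = pd.read_csv("data_banknote_authentication.txt", header=None,
#                  names=["f1", "f2", "f3", "f4", "class"])
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(10, 4)), columns=["f1", "f2", "f3", "f4"])
df["class"] = [0, 1] * 5

# add the "color" column: green for class 0 (real), red for class 1 (fake)
df["color"] = df["class"].map({0: "green", 1: "red"})

# per-class mean and std, rounded to 2 decimals
stats = df.groupby("class")[["f1", "f2", "f3", "f4"]].agg(["mean", "std"]).round(2)
# the "all" row of the table: mean and std over both classes together
overall = df[["f1", "f2", "f3", "f4"]].agg(["mean", "std"]).round(2)
print(stats)
print(overall)
```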
Question 2:
1. split your dataset X into training Xtrain and testing Xtest parts (50/50 split). Using "pairplot" from the seaborn package, plot pairwise relationships in Xtrain separately for class 0 and class 1. Save your results into 2 pdf files "good bills.pdf" and "fake bills.pdf"
2. visually examine your results. Come up with three simple comparisons that you think may be sufficient to detect a fake bill. For example, your classifier may look like this:
# assume you are examining a bill
# with features f_1, f_2, f_3 and f_4
# your rule may look like this:
if (f_1 > 4) and (f_2 > 8) and (f_4 < 25):
    x = "good"
else:
    x = "fake"
3. apply your simple classifier to Xtest and compute predicted class labels
4. comparing your predicted class labels with the true labels, compute the following:
(a) TP - true positives (your predicted label is + and true label is +)
(b) FP - false positives (your predicted label is + but true label is -)
(c) TN - true negatives (your predicted label is - and true label is -)
(d) FN - false negatives (your predicted label is - but true label is +)
(e) TPR = TP/(TP + FN) - true positive rate. This is the fraction of positive labels that you predicted correctly. This is also called sensitivity, recall or hit rate.
(f) TNR = TN/(TN + FP) - true negative rate. This is the fraction of negative labels that you predicted correctly. This is also called specificity or selectivity.
5. summarize your findings in a table as shown below:
6. does your simple classifier give you higher accuracy on identifying "fake" bills or "real" bills? Is your accuracy better than 50% ("coin" flipping)?
TP | FP | TN | FN | accuracy | TPR | TNR
   |    |    |    |          |     |
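A minimal sketch of the split and the confusion-matrix arithmetic above, using synthetic stand-in data and placeholder thresholds (pick your own rule from the pairplots; class 0 / "good" is treated as the positive "+" label, per the assignment):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic stand-in for the 4 banknote features and true labels
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)          # 0 = real (+), 1 = fake (-)

# 50/50 split, as required; stratify keeps both classes in each half
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1, stratify=y)

# simple hand-made rule (placeholder thresholds): predict 0 ("good")
# when all comparisons hold, else 1 ("fake")
pred = np.where((X_test[:, 0] > 0) & (X_test[:, 1] > -1), 0, 1)

# confusion counts, with class 0 as the positive label
TP = np.sum((pred == 0) & (y_test == 0))
FP = np.sum((pred == 0) & (y_test == 1))
TN = np.sum((pred == 1) & (y_test == 1))
FN = np.sum((pred == 1) & (y_test == 0))
accuracy = (TP + TN) / len(y_test)
TPR = TP / (TP + FN)                      # sensitivity / recall
TNR = TN / (TN + FP)                      # specificity
print(TP, FP, TN, FN, round(accuracy, 2), round(TPR, 2), round(TNR, 2))
```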
Question 3 (use the k-NN classifier from the sklearn library)
1. take k = 3, 5, 7, 9, 11. Use the same Xtrain and Xtest as before. For each k, train your k-NN classifier on Xtrain and compute its accuracy on Xtest
2. plot a graph showing the accuracy. On the x-axis you plot k and on the y-axis you plot accuracy. What is the optimal value k* of k?
3. use the optimal value k* to compute performance measures and summarize them in the table
TP | FP | TN | FN | accuracy | TPR | TNR
   |    |    |    |          |     |
4. is your k-NN classifier better than your simple classifier for any of the measures from the previous table?
5. consider a bill x that contains the last 4 digits of your BUID as feature values. What is the class label predicted for this bill by your simple classiier? What is the label for this bill predicted by k-NN using the best k* ?
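The k sweep above can be sketched as follows; the features and labels here are a synthetic stand-in for the banknote split, and the plot line is left as a comment:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in; replace with your banknote Xtrain/Xtest from before
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # toy separable-ish labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1)

accuracies = {}
for k in [3, 5, 7, 9, 11]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    accuracies[k] = knn.score(X_test, y_test)    # accuracy on Xtest

# optimal k* = the k with the highest test accuracy
k_star = max(accuracies, key=accuracies.get)
print(accuracies, k_star)
# plot with matplotlib, e.g.:
# plt.plot(list(accuracies), list(accuracies.values()), marker="o")
```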
Question 4: One of the fundamental questions in machine learning is ”feature selection”. We try to come up with a least number of features and still retain good accuracy. The natural question is whether some of the features are important or can be dropped.
1. take your best value k*. For each of the four features f1, ..., f4, drop that feature from both Xtrain and Xtest. Train your classifier on the "truncated" Xtrain and predict labels on Xtest using just the 3 remaining features. You will repeat this for 4 cases: (1) just f1 is missing, (2) just f2 is missing, (3) just f3 is missing, and (4) just f4 is missing. Compute the accuracy for each of these scenarios.
2. did accuracy increase in any of the 4 cases compared with accuracy when all 4 features are used?
3. which feature, when removed, contributed the most to loss of accuracy?
4. which feature, when removed, contributed the least to loss of accuracy?
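The drop-one-feature loop can be sketched like this; the data and the value of k_star are placeholders (use your split and your best k* from Question 3):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in for the banknote split
rng = np.random.default_rng(3)
X_train = rng.normal(size=(100, 4))
y_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.normal(size=(100, 4))
y_test = (X_test[:, 0] > 0).astype(int)
k_star = 5                                   # placeholder for your best k*

drop_acc = {}
for i in range(4):
    keep = [j for j in range(4) if j != i]   # drop feature f_{i+1}
    knn = KNeighborsClassifier(n_neighbors=k_star)
    knn.fit(X_train[:, keep], y_train)
    drop_acc[f"f{i + 1}"] = knn.score(X_test[:, keep], y_test)
print(drop_acc)                              # accuracy with each feature removed
```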
Question 5 (use the logistic regression classifier from the sklearn library)
1. Use the same Xtrain and Xtest as before. Train your logistic regression classifier on Xtrain and compute its accuracy on Xtest
2. summarize your performance measures in the table
TP | FP | TN | FN | accuracy | TPR | TNR
   |    |    |    |          |     |
3. is your logistic regression better than your simple classifier for any of the measures from the previous table?
4. is your logistic regression better than your k-NN classiier (using the best k* ) for any of the measures from the previous table?
5. consider a bill x that contains the last 4 digits of your BUID as feature values. What is the class label predicted for this bill x by logistic regression? Is it the same label as predicted by k-NN?
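A minimal logistic regression sketch for the steps above, again on synthetic stand-in data; the "BUID bill" feature values below are made up for illustration (use the last 4 digits of your own BUID):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# synthetic stand-in for the banknote split
rng = np.random.default_rng(4)
X_train = rng.normal(size=(100, 4))
y_train = (X_train[:, 0] - X_train[:, 2] > 0).astype(int)
X_test = rng.normal(size=(100, 4))
y_test = (X_test[:, 0] - X_test[:, 2] > 0).astype(int)

clf = LogisticRegression()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)         # accuracy on Xtest

# classify one bill whose features are the last 4 BUID digits
# (these digits are hypothetical placeholders)
buid_bill = np.array([[1.0, 2.0, 3.0, 4.0]])
label = clf.predict(buid_bill)[0]
print(round(accuracy, 2), label)
```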
Question 6: We will investigate change in accuracy when removing one feature. This is similar to question 4 but now we use logistic regression.
1. For each of the four features f1, ..., f4, drop that feature from both Xtrain and Xtest. Train your logistic regression classifier on the "truncated" Xtrain and predict labels on Xtest using just the 3 remaining features. You will repeat this for 4 cases: (1) just f1 is missing, (2) just f2 is missing, (3) just f3 is missing, and (4) just f4 is missing. Compute the accuracy for each of these scenarios.
2. did accuracy increase in any of the 4 cases compared with accuracy when all 4 features are used?
3. which feature, when removed, contributed the most to loss of accuracy?
4. which feature, when removed, contributed the least to loss of accuracy?
5. is the relative significance of the features the same as you obtained using k-NN?
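This mirrors the Question 4 loop with the estimator swapped for logistic regression; the data below is again a synthetic stand-in for your split:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# synthetic stand-in for the banknote split
rng = np.random.default_rng(5)
X_train = rng.normal(size=(100, 4))
y_train = (X_train[:, 1] > 0).astype(int)
X_test = rng.normal(size=(100, 4))
y_test = (X_test[:, 1] > 0).astype(int)

drop_acc = {}
for i in range(4):
    keep = [j for j in range(4) if j != i]   # drop feature f_{i+1}
    clf = LogisticRegression().fit(X_train[:, keep], y_train)
    drop_acc[f"f{i + 1}"] = clf.score(X_test[:, keep], y_test)
print(drop_acc)                              # accuracy with each feature removed
```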
2023-10-24