ETF5952 Questions for Final Exam B
Question 1 (25 points=5+5+5+10)
We consider a data set on cars. The data set includes the following variables:
mpg: miles per gallon
cylinders: Number of cylinders between 4 and 8
displacement: Engine displacement (cu. inches)
horsepower: Engine horsepower
weight: Vehicle weight (lbs.)
acceleration: Time to accelerate from 0 to 60 mph (sec.)
year: Model year (modulo 100)
origin: Origin of car (1. American, 2. European, 3. Japanese)
name: Vehicle name
1. We estimate a regression as below. Explain the effect of weight on mpg (no more than 20 words, 4 decimal places).
glm(formula = mpg ~ weight + cylinders + displacement + horsepower +
acceleration, data = DATA)
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-11.5816   -2.8618   -0.3404    2.2438   16.3416
Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.626e+01  2.669e+00  17.331   <2e-16 ***
weight       -5.187e-03  8.167e-04  -6.351    6e-10 ***
cylinders    -3.979e-01  4.105e-01  -0.969   0.3330
displacement -8.313e-05  9.072e-03  -0.009   0.9927
horsepower   -4.526e-02  1.666e-02  -2.716   0.0069 **
acceleration -2.910e-02  1.258e-01  -0.231   0.8171
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2. We estimate a regression as below. Explain the effect of origin on mpg (no more than 30 words, 4 decimal places).
glm(formula = mpg ~ weight + cylinders + displacement + horsepower +
acceleration + origin, data = DATA)
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-12.6303   -2.8009   -0.2871    2.0945   14.8931
Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)  44.7687911  2.6398457  16.959  < 2e-16 ***
weight       -0.0048119  0.0008089  -5.948  6.1e-09 ***
cylinders    -0.5661876  0.4042069  -1.401 0.162100
displacement  0.0114270  0.0095737   1.194 0.233376
horsepower   -0.0613339  0.0168679  -3.636 0.000314 ***
acceleration -0.0319841  0.1232529  -0.259 0.795389
origin2       1.1255451  0.7015566   1.604 0.109458
origin3       2.9325397  0.6955675   4.216  3.1e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
3. We estimate a regression as below. Explain the effect of weight on mpg (no more than 30 words, 4 decimal places).
glm(formula = mpg ~ weight * origin + cylinders + displacement +
horsepower + acceleration + origin, data = DATA)
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-13.3865   -2.6755   -0.4621    2.0962   14.5518
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)    43.9946277  2.7178880  16.187  < 2e-16 ***
weight         -0.0038512  0.0009205  -4.184 3.56e-05 ***
origin2         4.1783349  2.9695460   1.407 0.160222
origin3        11.9740394  3.6294558   3.299 0.001061 **
cylinders      -0.6249505  0.4029298  -1.551 0.121726
displacement    0.0041280  0.0100292   0.412 0.680867
horsepower     -0.0580462  0.0168325  -3.448 0.000627 ***
acceleration   -0.0774625  0.1243493  -0.623 0.533694
weight:origin2 -0.0012627  0.0011885  -1.062 0.288710
weight:origin3 -0.0040237  0.0015922  -2.527 0.011900 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
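Because weight enters this model through an interaction with origin, the slope of mpg with respect to weight differs by origin. The arithmetic can be sketched as follows, using only the estimates printed above and assuming origin 1 (American) is the baseline level. The names `b_weight`, `b_int`, and `slope` are illustrative and not part of the exam code.

```r
# Estimates copied from the summary output above.
b_weight <- -0.0038512                 # weight (baseline level, assumed origin 1)
b_int <- c(origin2 = -0.0012627,       # weight:origin2
           origin3 = -0.0040237)       # weight:origin3

# Slope of mpg with respect to weight for each origin level:
# baseline slope plus the matching interaction term.
slope <- c(origin1 = b_weight, b_weight + b_int)
print(round(slope, 4))
```

Each entry is the estimated change in mpg per additional pound of weight for that origin, holding the other regressors fixed.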
4. We estimate a regression as below. Explain the effect of weight on mpg (no more than 30 words, 4 decimal places).
glm(formula = log(mpg) ~ log(weight) * origin + cylinders + displacement +
horsepower + acceleration + origin, data = DATA)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.51675  -0.10216  -0.00386   0.10077   0.51237
Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)          9.0859329  0.7847542  11.578  < 2e-16 ***
log(weight)         -0.6930137  0.1096725  -6.319 7.34e-10 ***
origin2             -1.7792391  0.9726568  -1.829   0.0681 .
origin3             -1.2022192  1.1471960  -1.048   0.2953
cylinders           -0.0268622  0.0149654  -1.795   0.0735 .
displacement         0.0004781  0.0003746   1.276   0.2026
horsepower          -0.0033636  0.0006076  -5.536 5.76e-08 ***
acceleration        -0.0051126  0.0045646  -1.120   0.2634
log(weight):origin2  0.2302492  0.1248918   1.844   0.0660 .
log(weight):origin3  0.1633573  0.1485229   1.100   0.2721
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
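In a log-log specification, the coefficient on log(weight) is usually read as an elasticity: a 1% change in weight is associated with approximately that percentage change in mpg. A minimal sketch of the per-origin elasticities, using only the estimates printed above and assuming origin 1 is the baseline level (`b_logw`, `b_int`, and `elasticity` are illustrative names):

```r
# Estimates copied from the summary output above.
b_logw <- -0.6930137                   # log(weight) (baseline level, assumed origin 1)
b_int <- c(origin2 = 0.2302492,        # log(weight):origin2
           origin3 = 0.1633573)        # log(weight):origin3

# Elasticity of mpg with respect to weight for each origin level.
elasticity <- c(origin1 = b_logw, b_logw + b_int)
print(round(elasticity, 4))
```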
Question 2 (25 points=5+5+5+10)
1. We obtain a correlation matrix as below. Identify the variables that have the highest positive and the highest negative correlation with lprice (no more than 20 words).
cor(DATA)
             crime        nox      rooms       dist     radial    proptax    stratio    lowstat     lprice
crime    1.0000000  0.4211523 -0.2188157 -0.3799093  0.6254423  0.5828192  0.2886909  0.4470330 -0.5274947
nox      0.4211523  1.0000000 -0.3028280 -0.7702225  0.6103279  0.6669806  0.1868634  0.5856131 -0.5087672
rooms   -0.2188157 -0.3028280  1.0000000  0.2054095 -0.2097727 -0.2921202 -0.3540075 -0.6096048  0.6329095
dist    -0.3799093 -0.7702225  0.2054095  1.0000000 -0.4950646 -0.5343788 -0.2292694 -0.4956025  0.3420084
radial   0.6254423  0.6103279 -0.2097727 -0.4950646  1.0000000  0.9102282  0.4642446  0.4760376 -0.4809716
proptax  0.5828192  0.6669806 -0.2921202 -0.5343788  0.9102282  1.0000000  0.4542378  0.5276241 -0.5596710
stratio  0.2886909  0.1868634 -0.3540075 -0.2292694  0.4642446  0.4542378  1.0000000  0.3654023 -0.4976345
lowstat  0.4470330  0.5856131 -0.6096048 -0.4956025  0.4760376  0.5276241  0.3654023  1.0000000 -0.7914387
lprice  -0.5274947 -0.5087672  0.6329095  0.3420084 -0.4809716 -0.5596710 -0.4976345 -0.7914387  1.0000000
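Reading the extremes off the lprice row can be sketched as below. The values are copied from the matrix above with the trivial self-correlation of 1 left out; `r_lprice`, `most_pos`, and `most_neg` are illustrative names. With the data loaded, the same vector could presumably be taken from `cor(DATA)["lprice", ]`.

```r
# Correlations of each variable with lprice, copied from the matrix above
# (lprice's correlation with itself is omitted).
r_lprice <- c(crime = -0.5274947, nox = -0.5087672, rooms = 0.6329095,
              dist = 0.3420084, radial = -0.4809716, proptax = -0.5596710,
              stratio = -0.4976345, lowstat = -0.7914387)

most_pos <- names(which.max(r_lprice))   # largest positive correlation
most_neg <- names(which.min(r_lprice))   # most negative correlation
print(c(most_positive = most_pos, most_negative = most_neg))
```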
2. We run K-means clustering (K=3) and obtain the result as below. Explain the result related to lprice (no more than 20 words).
> group5$centers
       crime      nox    rooms     dist    radial  proptax  stratio  lowstat    lprice
1 55.6069989 6.747500 5.758750 1.585000 24.000000 66.60000 20.20000 22.54875  9.022747
2  0.3914715 5.122358 6.387453 4.442195  4.455285 31.19268 17.81436 10.48515 10.070670
3  9.5978914 6.698140 6.020853 2.083721 23.224806 66.77442 20.19612 18.43054  9.627254
3. From the K-means clustering result, explain the relation between lprice and crime (no more than 20 words, 2 decimal places).
4. We apply principal component analysis to the data and obtain the result below. According to the estimation result for the first principal component, explain the relation among crime, nox and lprice through the factor (no more than 40 words).
> pc = prcomp(DATA, scale=TRUE)
> round(pc$rotation, 2)
          PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8   PC9
crime    0.31  0.15 -0.26 -0.67  0.47 -0.31  0.04  0.16 -0.12
nox      0.35  0.27  0.41  0.13 -0.10  0.06  0.56  0.53 -0.04
rooms   -0.25  0.60 -0.21 -0.01  0.29  0.67  0.03 -0.03 -0.03
dist    -0.31 -0.33 -0.43 -0.34 -0.43  0.24  0.23  0.43 -0.09
radial   0.37  0.30 -0.33  0.00 -0.37 -0.02 -0.23  0.07  0.68
proptax  0.39  0.25 -0.23  0.04 -0.45  0.01 -0.04 -0.31 -0.66
stratio  0.25 -0.26 -0.55  0.61  0.37  0.00  0.08  0.22 -0.08
lowstat  0.36 -0.28  0.27 -0.11  0.06  0.47 -0.60  0.31 -0.13
lprice  -0.37  0.37 -0.01  0.19 -0.08 -0.41 -0.45  0.51 -0.23
Question 3 (25 points=5+5+5+10)
● Income: Income in $10,000s
● Limit: Credit limit
● Rating: Credit rating taking 1 for good customers and 0 for bad customers
● Cards: Number of credit cards
● Age: Age in years
● Education: Number of years of education
● Gender: A factor with levels Male and Female
● Student: A factor with levels No and Yes indicating whether the individual was a student
● Married: A factor with levels No and Yes indicating whether the individual was married
● Ethnicity: A factor with levels African American, Asian, and Caucasian indicating the individual's ethnicity
● Balance: Average credit card balance in $.
1. We obtain an estimation result as below. Explain the effect of Education on Rating.
glm(formula = Rating ~ . , family = "binomial", data = DATA)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.50873   0.00000   0.00000   0.00001   2.07528
Coefficients:
                     Estimate Std. Error z value Pr(>|z|)
(Intercept)        -116.63752   74.03912  -1.575   0.1152
Income               -0.54971    0.90624  -0.607   0.5441
Limit                 0.03335    0.03126   1.067   0.2862
Cards                 1.89375    2.33613   0.811   0.4176
Age                  -0.05553    0.11428  -0.486   0.6270
Education            -0.24079    0.25734  -0.936   0.3494
GenderFemale          1.21706    1.49002   0.817   0.4140
StudentYes           27.48922   45.32207   0.607   0.5442
MarriedYes            4.72963    2.19800   2.152   0.0314 *
EthnicityAsian       -0.76642    1.73813  -0.441   0.6593
EthnicityCaucasian    1.24550    1.73212   0.719   0.4721
Balance              -0.04779    0.09097  -0.525   0.5993
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 554.358  on 399  degrees of freedom
Residual deviance:  22.972  on 388  degrees of freedom
AIC: 46.972
Number of Fisher Scoring iterations: 13
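A logistic regression coefficient is a change in the log-odds, so exponentiating it gives a multiplicative change in the odds of Rating = 1. A minimal sketch for the Education estimate printed above (`b_edu`, `p_edu`, and `odds_ratio` are illustrative names; the 5% threshold is an assumed convention, not stated in the exam):

```r
# Estimate and p-value for Education, copied from the output above.
b_edu <- -0.24079
p_edu <- 0.3494

# Multiplicative change in the odds of Rating = 1 per extra year of education.
odds_ratio <- exp(b_edu)
print(round(odds_ratio, 4))

# Whether the estimate is significant at the (assumed) 5% level.
significant <- p_edu < 0.05
print(significant)
```

The same transformation applies to any other coefficient in the table, for example StudentYes.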
2. Using the above estimation result, explain the effect of Student on Rating.
3. We obtain the estimation result as below. Explain the effect of Limit on Rating.
glm(formula = Rating ~ . * Student, family = "binomial", data = DATA)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.349   0.000   0.000   0.000   2.071
Coefficients:
                                Estimate Std. Error z value Pr(>|z|)
(Intercept)                   -1.276e+02  7.934e+01  -1.609   0.1077
Income                        -6.945e-01  9.247e-01  -0.751   0.4526
Limit                          3.817e-02  3.269e-02   1.168   0.2430
Cards                          2.283e+00  2.480e+00   0.921   0.3572
Age                           -7.293e-02  1.180e-01  -0.618   0.5367
Education                     -2.659e-01  2.477e-01  -1.073   0.2832
GenderFemale                   1.320e+00  1.462e+00   0.903   0.3667
StudentYes                    -3.495e+02  1.941e+05  -0.002   0.9986
MarriedYes                     5.325e+00  2.623e+00   2.030   0.0423 *
EthnicityAsian                -5.208e-01  1.874e+00  -0.278   0.7810
EthnicityCaucasian             2.008e+00  1.957e+00   1.026   0.3049
Balance                       -6.242e-02  9.258e-02  -0.674   0.5002
Income:StudentYes             -1.875e+01  8.781e+03  -0.002   0.9983
Limit:StudentYes               6.206e-01  2.854e+02   0.002   0.9983
Cards:StudentYes               4.215e+01  2.031e+04   0.002   0.9983
Age:StudentYes                -2.199e+00  8.374e+02  -0.003   0.9979
Education:StudentYes          -2.578e+00  7.695e+03   0.000   0.9997
GenderFemale:StudentYes        1.975e+01  1.062e+04   0.002   0.9985
StudentYes:MarriedYes         -3.319e+00  9.854e+03   0.000   0.9997
StudentYes:EthnicityAsian     -2.865e+01  2.266e+04  -0.001   0.9990
StudentYes:EthnicityCaucasian -2.950e+01  9.090e+03  -0.003   0.9974
StudentYes:Balance            -1.822e+00  8.712e+02  -0.002   0.9983
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
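With every regressor interacted with Student, the log-odds slope of Limit is the main effect for non-students and the main effect plus the Limit:StudentYes term for students. A minimal sketch of that sum from the point estimates printed above (`b_limit`, `b_limit_student`, and `slope` are illustrative names). The interaction's standard error in the output is very large, so the student slope should be read with caution.

```r
# Point estimates copied from the output above.
b_limit <- 3.817e-02            # Limit (non-students)
b_limit_student <- 6.206e-01    # Limit:StudentYes

# Log-odds change in Rating = 1 per unit of Limit, by student status.
slope <- c(non_student = b_limit,
           student = b_limit + b_limit_student)
print(round(slope, 5))

# Corresponding multiplicative change in the odds.
print(round(exp(slope), 4))
```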
4. We obtain an estimation result as below. Explain the effect of Limit on Rating and the effect of Gender on Rating.
> library(gamlr)
> x = sparse.model.matrix(Rating ~ .*Student, data=DATA)[,-1]
> y = DATA$Rating
> sclasso = gamlr(x, y, family = "binomial", nfold=10)
> coef(sclasso)
22 x 1 sparse Matrix of class "dgCMatrix"
                                     seg100
intercept                     -24.479492961
Income                          .
Limit                           0.004813959
Cards                           0.119308039
Age                             .
Education                      -0.024437478
GenderFemale                    0.458552143
StudentYes                      .
MarriedYes                      1.253561287
EthnicityAsian                 -0.117659315
EthnicityCaucasian              .
Balance                         0.002295561
Income:StudentYes               .
Limit:StudentYes                .
Cards:StudentYes                .
Age:StudentYes                  .
Education:StudentYes            .
GenderFemale:StudentYes         .
StudentYes:MarriedYes           .
StudentYes:EthnicityAsian       .
StudentYes:EthnicityCaucasian   .
StudentYes:Balance              .
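In the sparse output, a `.` marks a coefficient the lasso has set to zero, and the retained coefficients are still on the log-odds scale. A minimal sketch that converts the Limit and GenderFemale estimates printed above into odds ratios (`coefs` and `odds_ratio` are illustrative names):

```r
# Nonzero lasso estimates copied from the output above.
coefs <- c(Limit = 0.004813959, GenderFemale = 0.458552143)

# Multiplicative change in the odds of Rating = 1:
# per unit of Limit, and for Female relative to Male.
odds_ratio <- exp(coefs)
print(round(odds_ratio, 4))
```

Lasso estimates are shrunk toward zero and come without standard errors, so these odds ratios describe the selected model rather than a test of significance.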
Question 4 (25 points=5+5+5+10)
We consider a data set on credit card default. The variables in the data set are
● default: A factor with levels No and Yes indicating whether the customer defaulted on their debt
● student: A factor with levels No and Yes indicating whether the customer is a student
● balance: The average balance that the customer has remaining on their credit card after making their monthly payment
● income: Income of customer
1. We estimate a classification tree as below. Explain “Yes”, 0.76 and 1% at the far right node (no more than 30 words).
fit1 = rpart(default~balance+income+student, data = DATA, method="class")
2. From the estimation result, explain the probability that balance is strictly less than 1800 in the sample (no more than 30 words).
3. Describe the characteristics of individuals in the second node from the left.
4. We compare actual and predicted outcomes as below. Compute the false positive and false negative rates.
> fit1 = rpart(default~balance+income+student, data = DATA, method="class")
> pred.tree = predict(fit1, type="class")
> table(pred.tree, DATA$default)
pred.tree   No  Yes
      No  9611  203
      Yes   56  130
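The rates can be computed directly from the four counts in the table. The sketch below assumes the common definitions, false positive rate = FP / (FP + TN) and false negative rate = FN / (FN + TP), with "Yes" (default) as the positive class. Some textbooks instead divide by the number of predictions, so the course's own definition should be checked. `tab`, `fpr`, and `fnr` are illustrative names.

```r
# Confusion table copied from the output above:
# rows are predictions, columns are actual outcomes.
tab <- matrix(c(9611, 56, 203, 130), nrow = 2,
              dimnames = list(pred = c("No", "Yes"),
                              actual = c("No", "Yes")))

# False positive rate: predicted "Yes" among actual "No".
fpr <- tab["Yes", "No"] / sum(tab[, "No"])

# False negative rate: predicted "No" among actual "Yes".
fnr <- tab["No", "Yes"] / sum(tab[, "Yes"])

print(round(c(false_positive_rate = fpr, false_negative_rate = fnr), 4))
```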
2022-06-16