ETF5952

Questions for Final Exam

Question 1 (25 points=5+5+5+10)

We consider a data set on cars. The data set includes the following variables:

mpg: miles per gallon

cylinders: Number of cylinders between 4 and 8

displacement: Engine displacement (cu. inches)

horsepower: Engine horsepower

weight: Vehicle weight (lbs.)

acceleration: Time to accelerate from 0 to 60 mph (sec.)

year: Model year (modulo 100)

origin: Origin of car (1. American, 2. European, 3. Japanese)

name: Vehicle name

1. We estimate a regression as below. Explain the effect of weight on mpg (no more than 20 words, 4 decimal places).

glm(formula = mpg ~ weight + cylinders + displacement + horsepower +
    acceleration, data = DATA)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-11.5816   -2.8618   -0.3404    2.2438   16.3416

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.626e+01  2.669e+00  17.331   <2e-16 ***
weight       -5.187e-03  8.167e-04  -6.351    6e-10 ***
cylinders    -3.979e-01  4.105e-01  -0.969   0.3330
displacement -8.313e-05  9.072e-03  -0.009   0.9927
horsepower   -4.526e-02  1.666e-02  -2.716   0.0069 **
acceleration -2.910e-02  1.258e-01  -0.231   0.8171
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
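
As a rough aid for reading a coefficient printed in scientific notation, the sketch below converts the weight estimate to 4 decimal places and rescales it to a 1000 lb change. The value is copied from the table above; the variable names are illustrative and not part of the exam.

```r
# Weight coefficient copied from the coefficient table (scientific notation).
b_weight <- -5.187e-03

# Rounded to 4 decimal places, as the question requests.
per_lb <- round(b_weight, 4)

# The same effect per 1000 lb, which is often easier to read.
per_1000_lb <- b_weight * 1000

cat(sprintf("Change in mpg per extra lb:      %.4f\n", per_lb))
cat(sprintf("Change in mpg per extra 1000 lb: %.3f\n", per_1000_lb))
```

The interpretation holds the other regressors (cylinders, displacement, horsepower, acceleration) fixed.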

2. We estimate a regression as below. Explain the effect of origin on mpg (no more than 30 words, 4 decimal places).

glm(formula = mpg ~ weight + cylinders + displacement + horsepower +
    acceleration + origin, data = DATA)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-12.6303   -2.8009   -0.2871    2.0945   14.8931

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)  44.7687911  2.6398457  16.959  < 2e-16 ***
weight       -0.0048119  0.0008089  -5.948  6.1e-09 ***
cylinders    -0.5661876  0.4042069  -1.401 0.162100
displacement  0.0114270  0.0095737   1.194 0.233376
horsepower   -0.0613339  0.0168679  -3.636 0.000314 ***
acceleration -0.0319841  0.1232529  -0.259 0.795389
origin2       1.1255451  0.7015566   1.604 0.109458
origin3       2.9325397  0.6955675   4.216  3.1e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3. We estimate a regression as below. Explain the effect of weight on mpg (no more than 30 words, 4 decimal places).

glm(formula = mpg ~ weight * origin + cylinders + displacement +
    horsepower + acceleration + origin, data = DATA)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-13.3865   -2.6755   -0.4621    2.0962   14.5518

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)    43.9946277  2.7178880  16.187  < 2e-16 ***
weight         -0.0038512  0.0009205  -4.184 3.56e-05 ***
origin2         4.1783349  2.9695460   1.407 0.160222
origin3        11.9740394  3.6294558   3.299 0.001061 **
cylinders      -0.6249505  0.4029298  -1.551 0.121726
displacement    0.0041280  0.0100292   0.412 0.680867
horsepower     -0.0580462  0.0168325  -3.448 0.000627 ***
acceleration   -0.0774625  0.1243493  -0.623 0.533694
weight:origin2 -0.0012627  0.0011885  -1.062 0.288710
weight:origin3 -0.0040237  0.0015922  -2.527 0.011900 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
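
With a weight-by-origin interaction, the slope of weight differs by origin: it is the baseline weight coefficient plus the matching interaction term. The sketch below does that arithmetic with the estimates copied from the table above. It assumes origin = 1 is the baseline level, and the variable names are illustrative.

```r
# Estimates copied from the coefficient table.
b_weight         <- -0.0038512  # slope of weight for the baseline (origin = 1)
b_weight_origin2 <- -0.0012627  # additional slope when origin = 2
b_weight_origin3 <- -0.0040237  # additional slope when origin = 3

# Slope of weight for each origin: baseline plus the interaction term.
slopes <- c(
  origin1 = b_weight,
  origin2 = b_weight + b_weight_origin2,
  origin3 = b_weight + b_weight_origin3
)

print(round(slopes, 4))
```

Only the origin3 interaction is significant at the 5% level in the table, so the difference between the origin1 and origin2 slopes should be read with caution.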

4. We estimate a regression as below. Explain the effect of weight on mpg (no more than 30 words, 4 decimal places).

glm(formula = log(mpg) ~ log(weight) * origin + cylinders + displacement +
    horsepower + acceleration + origin, data = DATA)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-0.51675  -0.10216  -0.00386   0.10077   0.51237

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)          9.0859329  0.7847542  11.578  < 2e-16 ***
log(weight)         -0.6930137  0.1096725  -6.319 7.34e-10 ***
origin2             -1.7792391  0.9726568  -1.829   0.0681 .
origin3             -1.2022192  1.1471960  -1.048   0.2953
cylinders           -0.0268622  0.0149654  -1.795   0.0735 .
displacement         0.0004781  0.0003746   1.276   0.2026
horsepower          -0.0033636  0.0006076  -5.536 5.76e-08 ***
acceleration        -0.0051126  0.0045646  -1.120   0.2634
log(weight):origin2  0.2302492  0.1248918   1.844   0.0660 .
log(weight):origin3  0.1633573  0.1485229   1.100   0.2721
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
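
In a log-log specification, the coefficient on log(weight) can be read as an elasticity: the approximate percentage change in mpg for a 1% change in weight. The sketch below computes the elasticity for each origin from the estimates above and, as an illustration, the exact percentage change implied by a 10% weight increase. The 10% figure and the variable names are assumptions made for this example.

```r
# Estimates copied from the coefficient table.
b_logw         <- -0.6930137  # elasticity for the baseline (origin = 1)
b_logw_origin2 <-  0.2302492  # adjustment when origin = 2
b_logw_origin3 <-  0.1633573  # adjustment when origin = 3

# Elasticity of mpg with respect to weight, by origin.
elasticity <- c(
  origin1 = b_logw,
  origin2 = b_logw + b_logw_origin2,
  origin3 = b_logw + b_logw_origin3
)
print(round(elasticity, 4))

# Exact percentage change in mpg for a 10% increase in weight,
# using mpg_new / mpg_old = 1.10 ^ elasticity.
pct_change_10 <- (1.10^elasticity - 1) * 100
print(round(pct_change_10, 2))
```

Neither interaction term is significant at the 5% level here, so the differences across origins are weakly supported.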

Question 2 (25 points=5+5+5+10)

1. We obtain a correlation matrix as below. Identify the variables that have the strongest positive and the strongest negative correlation with lprice (no more than 20 words).

cor(DATA)

             crime        nox      rooms       dist     radial    proptax    stratio    lowstat     lprice
crime    1.0000000  0.4211523 -0.2188157 -0.3799093  0.6254423  0.5828192  0.2886909  0.4470330 -0.5274947
nox      0.4211523  1.0000000 -0.3028280 -0.7702225  0.6103279  0.6669806  0.1868634  0.5856131 -0.5087672
rooms   -0.2188157 -0.3028280  1.0000000  0.2054095 -0.2097727 -0.2921202 -0.3540075 -0.6096048  0.6329095
dist    -0.3799093 -0.7702225  0.2054095  1.0000000 -0.4950646 -0.5343788 -0.2292694 -0.4956025  0.3420084
radial   0.6254423  0.6103279 -0.2097727 -0.4950646  1.0000000  0.9102282  0.4642446  0.4760376 -0.4809716
proptax  0.5828192  0.6669806 -0.2921202 -0.5343788  0.9102282  1.0000000  0.4542378  0.5276241 -0.5596710
stratio  0.2886909  0.1868634 -0.3540075 -0.2292694  0.4642446  0.4542378  1.0000000  0.3654023 -0.4976345
lowstat  0.4470330  0.5856131 -0.6096048 -0.4956025  0.4760376  0.5276241  0.3654023  1.0000000 -0.7914387
lprice  -0.5274947 -0.5087672  0.6329095  0.3420084 -0.4809716 -0.5596710 -0.4976345 -0.7914387  1.0000000
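
One way to read off the extremes is to take the lprice row of the matrix, drop lprice itself, and look for the largest and smallest entries. The sketch below hard-codes that row from the output above; with the data loaded, the same vector could be taken from cor(DATA)["lprice", ]. The variable names are illustrative.

```r
# Correlations of each variable with lprice, copied from the matrix above
# (lprice's correlation with itself is left out).
lprice_cor <- c(
  crime   = -0.5274947,
  nox     = -0.5087672,
  rooms   =  0.6329095,
  dist    =  0.3420084,
  radial  = -0.4809716,
  proptax = -0.5596710,
  stratio = -0.4976345,
  lowstat = -0.7914387
)

# Variables with the largest positive and the most negative correlation.
most_positive <- names(which.max(lprice_cor))
most_negative <- names(which.min(lprice_cor))

cat("Highest positive correlation:", most_positive, "\n")
cat("Highest negative correlation:", most_negative, "\n")
```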

2. We run K-means clustering (K=3) and obtain the result as below. Explain the result related to lprice (no more than 20 words).

>  group5$centers

       crime      nox    rooms     dist    radial  proptax  stratio  lowstat    lprice
1 55.6069989 6.747500 5.758750 1.585000 24.000000 66.60000 20.20000 22.54875  9.022747
2  0.3914715 5.122358 6.387453 4.442195  4.455285 31.19268 17.81436 10.48515 10.070670
3  9.5978914 6.698140 6.020853 2.083721 23.224806 66.77442 20.19612 18.43054  9.627254

3. From the K-means clustering result, explain the relation between lprice and crime (no more than 20 words, 2 decimal places).
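
A simple way to compare the clusters is to sort the centers by crime and see how lprice moves along with it. The sketch below hard-codes the crime and lprice centers from the output above; the data frame and its column names are illustrative.

```r
# Cluster centers for crime and lprice, copied from group5$centers.
centers <- data.frame(
  cluster = 1:3,
  crime   = c(55.6069989, 0.3914715, 9.5978914),
  lprice  = c(9.022747, 10.070670, 9.627254)
)

# Sort clusters from lowest to highest average crime.
ordered <- centers[order(centers$crime), ]

# Report with 2 decimal places, as the question requests.
ordered$crime  <- round(ordered$crime, 2)
ordered$lprice <- round(ordered$lprice, 2)
print(ordered, row.names = FALSE)
```

Across the three centers, lprice falls as crime rises. This describes cluster averages only, not a causal effect.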

4. We apply principal component analysis to the data and obtain the result as below. According to the estimation result for the first principal component, explain the relation among crime, nox and lprice through the factor (no more than 40 words).

> pc = prcomp(DATA, scale=TRUE)
> round(pc$rotation, 2)
          PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8   PC9
crime    0.31  0.15 -0.26 -0.67  0.47 -0.31  0.04  0.16 -0.12
nox      0.35  0.27  0.41  0.13 -0.10  0.06  0.56  0.53 -0.04
rooms   -0.25  0.60 -0.21 -0.01  0.29  0.67  0.03 -0.03 -0.03
dist    -0.31 -0.33 -0.43 -0.34 -0.43  0.24  0.23  0.43 -0.09
radial   0.37  0.30 -0.33  0.00 -0.37 -0.02 -0.23  0.07  0.68
proptax  0.39  0.25 -0.23  0.04 -0.45  0.01 -0.04 -0.31 -0.66
stratio  0.25 -0.26 -0.55  0.61  0.37  0.00  0.08  0.22 -0.08
lowstat  0.36 -0.28  0.27 -0.11  0.06  0.47 -0.60  0.31 -0.13
lprice  -0.37  0.37 -0.01  0.19 -0.08 -0.41 -0.45  0.51 -0.23

Question 3 (25 points=5+5+5+10)

● Income: Income in $10,000s

● Limit: Credit limit

● Rating: Credit rating taking 1 for good customers and 0 for bad customers

● Cards: Number of credit cards

● Age: Age in years

● Education: Number of years of education

● Gender: A factor with levels Male and Female

● Student: A factor with levels No and Yes indicating whether the individual was a student

● Married: A factor with levels No and Yes indicating whether the individual was married

● Ethnicity: A factor with levels African American, Asian, and Caucasian indicating the individual's ethnicity

● Balance: Average credit card balance in $.

1. We obtain an estimation result as below. Explain the effect of Education on Rating.

glm(formula = Rating ~ ., family = "binomial", data = DATA)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.50873   0.00000   0.00000   0.00001   2.07528

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)
(Intercept)        -116.63752   74.03912  -1.575   0.1152
Income               -0.54971    0.90624  -0.607   0.5441
Limit                 0.03335    0.03126   1.067   0.2862
Cards                 1.89375    2.33613   0.811   0.4176
Age                  -0.05553    0.11428  -0.486   0.6270
Education            -0.24079    0.25734  -0.936   0.3494
GenderFemale          1.21706    1.49002   0.817   0.4140
StudentYes           27.48922   45.32207   0.607   0.5442
MarriedYes            4.72963    2.19800   2.152   0.0314 *
EthnicityAsian       -0.76642    1.73813  -0.441   0.6593
EthnicityCaucasian    1.24550    1.73212   0.719   0.4721
Balance              -0.04779    0.09097  -0.525   0.5993
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 554.358  on 399  degrees of freedom
Residual deviance:  22.972  on 388  degrees of freedom
AIC: 46.972

Number of Fisher Scoring iterations: 13
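
Logistic regression coefficients are on the log-odds scale, so exponentiating a coefficient gives the multiplicative change in the odds of Rating = 1 for a one-unit increase in that variable. The sketch below applies this to the Education estimate copied from the table above; the variable names are illustrative.

```r
# Education coefficient and its p-value, copied from the table.
b_education <- -0.24079
p_education <-  0.3494

# Multiplicative change in the odds of a good rating per extra year
# of education, holding the other variables fixed.
odds_ratio <- exp(b_education)

# The same effect expressed as a percentage change in the odds.
pct_change <- (odds_ratio - 1) * 100

cat(sprintf("Odds ratio per extra year of education: %.4f\n", odds_ratio))
cat(sprintf("Percentage change in the odds: %.2f%%\n", pct_change))
cat(sprintf("Significant at the 5%% level: %s\n", p_education < 0.05))
```

The same transformation applies to the StudentYes coefficient, where the comparison is between students and non-students rather than a one-unit change.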

2.  Using the above estimation result, explain the effect of Student on Rating.

3. We obtain the estimation result as below. Explain the effect of Limit on Rating.

glm(formula = Rating ~ . * Student, family = "binomial", data = DATA)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-2.349   0.000   0.000   0.000   2.071

Coefficients:
                                Estimate Std. Error z value Pr(>|z|)
(Intercept)                   -1.276e+02  7.934e+01  -1.609   0.1077
Income                        -6.945e-01  9.247e-01  -0.751   0.4526
Limit                          3.817e-02  3.269e-02   1.168   0.2430
Cards                          2.283e+00  2.480e+00   0.921   0.3572
Age                           -7.293e-02  1.180e-01  -0.618   0.5367
Education                     -2.659e-01  2.477e-01  -1.073   0.2832
GenderFemale                   1.320e+00  1.462e+00   0.903   0.3667
StudentYes                    -3.495e+02  1.941e+05  -0.002   0.9986
MarriedYes                     5.325e+00  2.623e+00   2.030   0.0423 *
EthnicityAsian                -5.208e-01  1.874e+00  -0.278   0.7810
EthnicityCaucasian             2.008e+00  1.957e+00   1.026   0.3049
Balance                       -6.242e-02  9.258e-02  -0.674   0.5002
Income:StudentYes             -1.875e+01  8.781e+03  -0.002   0.9983
Limit:StudentYes               6.206e-01  2.854e+02   0.002   0.9983
Cards:StudentYes               4.215e+01  2.031e+04   0.002   0.9983
Age:StudentYes                -2.199e+00  8.374e+02  -0.003   0.9979
Education:StudentYes          -2.578e+00  7.695e+03   0.000   0.9997
GenderFemale:StudentYes        1.975e+01  1.062e+04   0.002   0.9985
StudentYes:MarriedYes         -3.319e+00  9.854e+03   0.000   0.9997
StudentYes:EthnicityAsian     -2.865e+01  2.266e+04  -0.001   0.9990
StudentYes:EthnicityCaucasian -2.950e+01  9.090e+03  -0.003   0.9974
StudentYes:Balance            -1.822e+00  8.712e+02  -0.002   0.9983
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

4. We obtain an estimation result as below.  Explain the effect of Limit on Rating and the effect of Gender on Rating.

> library(gamlr)
> x = sparse.model.matrix(Rating ~ .*Student, data=DATA)[,-1]
> y = DATA$Rating
> sclasso = gamlr(x, y, family = "binomial", nfold=10)
> coef(sclasso)
22 x 1 sparse Matrix of class "dgCMatrix"
                                     seg100
intercept                     -24.479492961
Income                          .
Limit                           0.004813959
Cards                           0.119308039
Age                             .
Education                      -0.024437478
GenderFemale                    0.458552143
StudentYes                      .
MarriedYes                      1.253561287
EthnicityAsian                 -0.117659315
EthnicityCaucasian              .
Balance                         0.002295561
Income:StudentYes               .
Limit:StudentYes                .
Cards:StudentYes                .
Age:StudentYes                  .
Education:StudentYes            .
GenderFemale:StudentYes         .
StudentYes:MarriedYes           .
StudentYes:EthnicityAsian       .
StudentYes:EthnicityCaucasian   .
StudentYes:Balance              .
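
In the lasso output, an entry printed as "." is a coefficient shrunk exactly to zero, so that variable is dropped from the selected model. The remaining coefficients are still on the log-odds scale. The sketch below lists the nonzero estimates and exponentiates them. It assumes the names and values in the output pair up in the order printed, and the vector name is illustrative.

```r
# Nonzero lasso coefficients, paired with names in the order printed
# (an assumption about the layout of the output above).
lasso_coef <- c(
  Limit          =  0.004813959,
  Cards          =  0.119308039,
  Education      = -0.024437478,
  GenderFemale   =  0.458552143,
  MarriedYes     =  1.253561287,
  EthnicityAsian = -0.117659315,
  Balance        =  0.002295561
)

# Multiplicative change in the odds of Rating = 1 per one-unit increase
# (or relative to the baseline level, for the dummy variables).
odds_ratio <- exp(lasso_coef)

print(round(odds_ratio[c("Limit", "GenderFemale")], 4))
```

Because every Student interaction is zero, the selected model implies the same Limit and Gender effects for students and non-students.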

Question 4 (25 points=5+5+5+10)

We consider a data set on credit card default. The variables in the data set are

● default: A factor with levels No and Yes indicating whether the customer defaulted on their debt

● student: A factor with levels No and Yes indicating whether the customer is a student

● balance:  The average balance that the customer has remaining on their credit card after making their monthly payment

● income: Income of customer

1. We estimate a classification tree as below. Explain "Yes", 0.76 and 1% at the far right node (no more than 30 words).

fit1 = rpart(default~balance+income+student, data = DATA, method="class")

[Figure: plot of the classification tree fit1, not reproduced here]

2.  From the estimation result, explain the probability that balance is strictly less than 1800 in the sample (no more than 30 words).

3. Describe the characteristics of individuals in the second node from the left.

4. We compare actual and predicted outcomes as below. Compute the false positive and false negative rates.

> fit1 = rpart(default~balance+income+student, data = DATA, method="class")
> pred.tree = predict(fit1, type="class")
> table(pred.tree, DATA$default)

pred.tree   No  Yes
      No  9611  203
      Yes   56  130
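
The rates can be computed directly from the four cells of the table. The sketch below treats "Yes" (default) as the positive class and divides each error count by the number of actual cases in that class. That is one common convention, assumed here. Some texts instead divide false positives by the number of predicted positives, which gives a different figure. The object names are illustrative.

```r
# Confusion table copied from the output: rows are predictions,
# columns are actual outcomes. matrix() fills column by column.
conf <- matrix(
  c(9611, 56, 203, 130),
  nrow = 2,
  dimnames = list(predicted = c("No", "Yes"), actual = c("No", "Yes"))
)

# False positive rate: actual "No" cases predicted as "Yes".
fpr <- conf["Yes", "No"] / sum(conf[, "No"])

# False negative rate: actual "Yes" cases predicted as "No".
fnr <- conf["No", "Yes"] / sum(conf[, "Yes"])

cat(sprintf("False positive rate: %.4f\n", fpr))
cat(sprintf("False negative rate: %.4f\n", fnr))
```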