BA222 - Lecture Notes 09: Introduction to Regression Analysis
Table of Contents
• Introduction
• Statistical Models
• Linear Equations
• Linear Regressions
• Linear Regression Examples
• Demand Model
• Wage Model
• Summary of Linear Regression Models
• Estimation of Beta Coefficient
• Fitted Values
• Residuals
• Mean Square Error
• Minimization and Estimation on Python
• Limitations of Linear Regression Models
• Evaluating Regression Models
• Visualizing Regression Results
• Goodness of Fit (R-Squared)
• Statistical Significance of Beta Coefficients
• Hypothesis Testing in Regression Analysis
• Statistical Significance
• T-Statistics and P-Value
Introduction
Up to this point we have reviewed the basics of statistical analysis: how to describe the distribution of a variable and how to identify statistical associations. Now we are going to start the topic of Regression Analysis, which will take up most of the rest of the semester and will be central to the final project of the course.
The goal of Regression Analysis is to quantify the statistical association between variables. In other words, we are going to develop a methodology for answering questions of the form "if x changes, by how much will y change on average?". These kinds of questions are essential for making the right business decisions. They are especially useful for:
1. Policy Making: Understanding the causal relations between variables (and not just simple correlation) is fundamental for making unbiased policy recommendations.
2. Forecasting: Predicting what is going to happen in the future given information about the present, by identifying time trends and the relations among variables.
You will learn that regression models are powerful tools for statistical analysis and, at least in my experience, can change the way you think about statistical problems in general.
Statistical Models
A statistical model is a formalization of the statistical relation between two variables. A model expresses, generally with an equation, the idea that the change in one variable is associated with a change in the average value of another variable (and not each specific value necessarily). The purpose of having a model is to be explicit about what we know (and what we don't know) and the type of relation that we expect to find. It is almost certain that whatever model we propose is going to be wrong in some way. The goal is not to come up with a model that is a perfect representation of reality, but one that is useful.
Linear Equations
When representing the statistical relation between two variables with a scatterplot, it is natural to fill in the gaps: we may be inclined to draw a line across the points in the graph to summarize how the two variables are related. Generally we do this with a straight line.
Recall from basic algebra that a straight line can be specified using two parameters: the intercept and the slope.
y = β0 + β1 x
• The intercept, or β0, represents the value of the y variable when x is equal to zero.
• The slope, or β1, represents by how much y changes when x increases by one unit.
A linear equation is an example of an exact linear relation, because knowing the value of x allows us to immediately calculate the value of y.
Linear Regressions
To acknowledge the fact that the relation between x and y is statistical and not exact, we are going to add a new term to the linear equation that represents all other factors besides x that are related to y but not explicitly included in the model.
y = β0 + β1 x + error
The new term is called the error term. When we add an error term to a linear equation, we call it a linear regression model.
Linear Regression Examples
It is a good idea to think about what each term in the model represents using some concrete examples.
Demand Model
From economic theory we know that there is an inverse relation between the demand for a product and its price. We can express this idea with a linear equation:
Demand = β0 + β1 Price
Here the demand is taking the role of the y variable because we are trying to explain how the quantity demanded changes with changes in prices. The demand in this model is an example of a dependent variable. It is called that way because the quantity demanded depends on the price level.
Price, which is taking the role of the x variable, is an example of an independent variable, because we are not assuming that changes in the quantity demanded lead to changes in prices, but the other way around.
Now let's think about the beta coefficients. β0, or the intercept, represents the quantity demanded if the price is equal to zero. β1, or the slope, represents by how much the demand for the product will change if the price increases by one unit. In this case, it is reasonable to assume that β1 is going to be a negative number.
As of now this is a simple linear equation that establishes an exact relation between price and quantity demanded. In reality we see that for the same price level, the quantity demanded for a product may go up or down depending on other factors. For instance, the demand for gasoline goes up or down on different days of the week even if the price remains the same. The demand for ice cream has a strong cyclical component, meaning that at the same price there is a greater demand in the summer than in the winter. As you can see, if we don't think in statistical terms, a linear equation model may be too simplistic to be useful. We can be explicit about the fact that other factors affect the demand by adding an error term to the regression model:
Demand = β0 + β1 Price + error
By adding the error term the model changes in a fundamental way. The dependent variable, demand, is not only going to be determined by the value of Price, but also whatever value the error term takes. When the error term is positive, the demand will be greater and when the error term is negative the demand will be lower. The error term may be positive because of favorable conditions for the demand of the product that are independent of the price, and negative when the conditions are adverse.
In that way, our model now allows for the demand to have multiple values for the same price level and is now simply expressing that there is a statistical relation between the demand and price, i.e. when the price changes, the average value of the demand changes.
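To see how the error term generates multiple demand values at the same price, here is a minimal simulation sketch. The coefficients (500 and −25) and the error distribution are hypothetical numbers chosen for illustration, not estimates from data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical coefficients for illustration (not estimated from data)
beta0, beta1 = 500, -25

price = 10
# Draw several error terms: other demand factors (season, day of week, ...)
errors = rng.normal(loc=0, scale=40, size=5)

# At the same price, demand differs because the error term differs
demand = beta0 + beta1 * price + errors
print(demand)  # five different demand values at the same price
```

Each draw of the error term shifts demand above or below the value implied by the price alone, which is exactly what the statistical (rather than exact) relation expresses.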
Wage Model
Imagine that we want to express the relation between having a college degree and wages. Let's start with a simple linear equation:
Wage = β0 + β1 College
College is represented with a dummy variable (or boolean), where zero means that the person has no college degree, and one means that they do. According to this representation, everyone without a college degree should earn:
Wage = β0 + β1 College = β0 + β1 (0) = β0
and everyone with a college degree earns:
Wage = β0 + β1 College = β0 + β1 (1) = β0 + β1
Therefore, this model assumes that there are only two possible wages: one with no college degree (β0) and one with a degree (β0 + β1). But in reality we observe many different wages, and sometimes (though it is not the norm) individuals without a college education earn more than those with college degrees. We can fix this by adding an error term:
Wage = β0 + β1 College + error
Now the model works like this. This is the wage of individuals without a college degree:
Wage = β0 + β1 College + error = β0 + β1 (0) + error = β0 + error
and with a college degree:
Wage = β0 + β1 College + error = β0 + β1 (1) + error = β0 + β1 + error
In this way, depending on the error term, some individuals may have greater or lower earnings for the same education level.
Summary of Linear Regression Models
Previously we defined the linear regression model as:
y = β0 + β1 x + error
It is important that you remember the following concepts going forward:
1. Variables:
1. Dependent Variable: That's the y variable; it is the variable that we are trying to explain with the model.
2. Independent Variable: That's the x variable; it is the variable that we are using to explain y.
2. Beta Coefficients (Parameters):
1. Intercept: That's the β0 coefficient; it represents the average value of y when x is equal to zero. To understand why, simply replace x = 0 in the regression model and note that you are left with y = β0 + error. In the following sections we'll argue that the average value of the error term is equal to zero, making the average of y (when x = 0) equal to β0. In some models this coefficient will be nonsensical because the possibility that x = 0 makes no sense.
2. Slope: That's the β1 coefficient; it represents by how much the average value of y changes when x increases by one unit. The sign of the slope coefficient coincides with the sign of the correlation coefficient, but they are not equal to each other: the correlation coefficient is bounded between −1 and 1, while the slope is unbounded. The slope allows us to quantify the statistical relation between the variables x and y.
3. Error Term: The error term represents all other factors that are associated with y that are different from x.
Practice:
1. Write a linear regression model for the sales of pizza, where the independent variable is the price level.
1. Start by identifying the dependent and independent variables
2. Then write the linear regression model equation using the name of the variables instead of x and y
3. Interpret the value of the beta coefficients
4. What other factors may affect the sales of pizza besides the price? Where in the model are those factors represented?
2. Say we want to use a regression model to predict the number of goals scored by a soccer player using the number of minutes played.
1. Which variable should be the independent variable? The dependent variable?
2. Write down the equation for the regression model (don't forget the error term!)
3. What would be the interpretation of the intercept coefficient? Does it make sense?
4. What sign do you expect the slope to have? What would be the interpretation of the slope?
5. What other variables are related to the number of goals that are different from the minutes played? Where in the regression model are these factors represented?
Estimation of Beta Coefficient
We can think of many different methods to come up with values for β0 and β1. We are going to use a methodology called Ordinary Least Squares (OLS). This method consists of minimizing the Mean Square Error (MSE).
Fitted Values
The fitted values (predicted values) of a regression model, denoted ŷ, are the values of y expected by the model assuming an average value of zero for the error term and some specific values for the beta coefficients and the x variable:
ŷ = β0* + β1* x
For instance, assume for now that for the Pizza Sales model we have β0* = 500 and β1* = −25. Then for a price of $10 the expected value for sales is:
ŷ = β0* + β1* x
ŷ = 500 − 25 x
ŷ = 500 − 25 (10)
ŷ = 250
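The calculation above can be written as a small Python function (a sketch; the function name is ours):

```python
def fitted_values(intercept, slope, x):
    """Compute the model's prediction: intercept + slope * x."""
    return intercept + slope * x

# Worked example from the text: beta0* = 500, beta1* = -25, price = $10
print(fitted_values(500, -25, 10))  # 250
```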
Practice:
1. In Python, write a function that takes the following inputs: intercept, slope, and x. The function should compute the fitted values according to the formula presented above.
2. Using the pizzaSales.csv data, estimate the fitted values of Sales using the information on prices. Assume a value of 500 for the intercept and −25 for the slope.
3. Are the predictions any good? How can you tell?
Residuals
The difference between the observed (actual) value of y and the fitted value (ŷ) is called the residual. It is a measure of how different the model's prediction is from the actual value of y.
residual = y − y^
A positive residual indicates that the observed value is greater than the prediction of the model (an underestimation of y). A negative residual is one where the prediction is greater than the observed value (an overestimation of y). A residual close to zero means that the model almost perfectly predicted the value of y. If a model is good at predicting the value of y, it will produce fitted values (ŷ) that are close to the actual values and, thus, generate residuals that are close to zero.
By construction, the residual provides an estimate of the value of the error term:
y = β0 + β1 x + error
y = ŷ + error
y − ŷ = error
residual = error
Practice:
1. Now write a function that given y and y_hat computes the residuals.
2. Use the predicted values from the last part as y_hat and the actual sales as y.
3. Produce a histogram of the residuals. Are the model's predictions any good?
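As a sketch of step 1, a residual function can be as simple as the following (the sales numbers are made-up toy values for illustration):

```python
def residuals(y, y_hat):
    """residual = observed value minus fitted value."""
    return [yi - yh for yi, yh in zip(y, y_hat)]

# Toy example: actual sales vs. model predictions (hypothetical numbers)
y = [260, 240, 255]
y_hat = [250, 250, 250]
print(residuals(y, y_hat))  # [10, -10, 5]
```

The first observation is underestimated (positive residual), the second overestimated (negative residual).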
Mean Square Error
The OLS methodology consists of minimizing the residuals, that is, making each residual as close to zero as possible. To achieve that, we are going to define the Mean Square Error (MSE):
MSE = (1/n) Σ (yi − ŷi)²
The residuals are squared because we want negative and positive values to have the same effect on the MSE. The MSE is simply the average of the squared residuals; it represents how close the model's predictions are to the actual values.
Practice:
1. Write a function to compute the MSE using the residuals.
2. Use the residuals from the previous part and compute the MSE.
3. Test other values for the intercept and the slope, are you improving or worsening the model's predictions?
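A minimal sketch of an MSE function (the residual values below are made-up toy numbers):

```python
def mse(residuals):
    """Mean Square Error: the average of the squared residuals."""
    return sum(r ** 2 for r in residuals) / len(residuals)

# (100 + 100 + 25) / 3
print(mse([10, -10, 5]))  # 75.0
```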
Minimization and Estimation on Python
We could, in theory, test all possible values of the coefficients and find the pair that minimizes the MSE. But there is a built-in package that allows us to find the coefficients without having to search for them manually.
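Such a brute-force search can be sketched with synthetic data. The "true" intercept (500) and slope (−25) here are invented for illustration, so we can check that the search recovers them:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "pizza" data for illustration: true intercept 500, true slope -25
price = rng.uniform(4, 8, size=100)
sales = 500 - 25 * price + rng.normal(0, 10, size=100)

# Brute-force search: try every (intercept, slope) pair on a coarse grid
best = (None, None, float("inf"))
for b0 in np.arange(400, 600, 5):
    for b1 in np.arange(-50, 0, 1):
        m = np.mean((sales - (b0 + b1 * price)) ** 2)
        if m < best[2]:
            best = (b0, b1, m)

print("intercept, slope:", best[0], best[1])
```

This is slow and only as precise as the grid; statsmodels instead finds the minimizing coefficients directly.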
Let's start by loading a new package:
import statsmodels.formula.api as smf
The statsmodels package (imported as smf for short) allows Python to estimate (run) regressions using only a few lines of code. The pizza data is saved in a data frame called pz:
model = smf.ols('pie_sales ~ price', data = pz)
This line specifies the model that we want to estimate. It uses the function .ols() from the smf package. The function takes two arguments: the first one is a string and the second a data frame. The string, also called the formula, is used to identify the dependent and the independent variables. To the left of the ~ symbol we specify the dependent variable and to the right the independent variable. The second argument is simply the data frame that contains those variables.
This function is simply a specification; it is like writing out the equation. In order to obtain the coefficients that minimize the MSE we need to use the function .fit() on the model:
estimation = model.fit()
From here we can call three types of results: the coefficients, the fitted values and the residuals:
estimation.params        # This will produce the coefficients
estimation.fittedvalues  # This will produce the fitted values
estimation.resid         # This will produce the residuals
Finally, you can get a summary of the estimation procedure using the function .summary() on the estimation results:
estimation.summary()
This table produces many results, more than what we need for now. But you'll be able to find the coefficients in the second section of the results.
You can get to the results in a single line by concatenating the commands:
results = smf.ols('pie_sales ~ price', data = pz).fit().summary()
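Putting the whole workflow together, here is a self-contained sketch. Since the pizzaSales data is not reproduced in these notes, it generates synthetic data with a known intercept (500) and slope (−25), chosen for illustration, so the recovered coefficients can be checked:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data standing in for pizzaSales.csv (hypothetical numbers)
rng = np.random.default_rng(42)
pz = pd.DataFrame({"price": rng.uniform(4, 8, size=200)})
pz["pie_sales"] = 500 - 25 * pz["price"] + rng.normal(0, 10, size=200)

# Specify and estimate the regression in one step
estimation = smf.ols("pie_sales ~ price", data=pz).fit()

print(estimation.params)        # Intercept and price coefficients
print(estimation.resid.mean())  # residuals average out to ~0 under OLS
```

The estimated Intercept and price coefficients should land close to 500 and −25, and the residuals average out to essentially zero, which is a property of OLS.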
Practice:
1. Using the statsmodels package, estimate a regression model with pie_sales as the dependent variable and adv as the independent variable.
2. Using the regression results. What are the expected average sales if there is no expenditure in advertising?
3. Using the regression results. Predict the level of sales if the advertising expenditure is equal to $5,500.
4. Interpret the value of the beta coefficients.
5. What is the mean of the residuals?
Limitations of Linear Regression Models
Linear regression models are a great technique to quickly quantify the statistical relation between two variables. We can use them to make specific predictions of a variable y for a given value of x. Yet, there are some issues that you should be aware of (this is not an exhaustive list, more issues will be discussed in following lecture notes):
• Linearity: The model, as presented here, is limited to representing linear relations. In this class we have seen many scatterplots suggesting a non-linear relation between two variables. Using a linear model in those cases leads to what we call functional form bias: incorrectly estimating a statistical relation by choosing to estimate a linear model when the relation is not linear (or vice versa).
• No Diminishing or Increasing Returns: This is the most common example of functional form bias. A linear model assumes a constant statistical relation between x and y. But diminishing or increasing returns are so common in business and economics that the assumption of linearity becomes an issue. We cannot expect the same productivity from workers who are overworked as from workers who are fresh. We should not expect the same academic performance from individuals who come from wealthy families as from those with less advantaged backgrounds.
• Outliers: This methodology is particularly sensitive to outliers. The slope coefficient estimates how the average value of y changes when the value of x increases by one unit. Because averages are sensitive to the presence of outliers, the slope coefficient will also be influenced by outliers. We call this issue bias due to influential observations. An influential observation is a single observation whose inclusion in the model changes the estimated coefficients in a significant manner. In practice, if you suspect that an observation is causing bias, show the results both with and without the influential observations.
• Limited Range: The regression model is estimated using a specific range of values for the x variable. For instance, in the case of the Pizza database, the price variable was always in the range of about $4 to $8 dollars. Any prediction outside that range is considered an extrapolation of the data. That is, fitted values for which no actual data for the x variable was used. As a result, extrapolations have to be interpreted making the often unrealistic assumption that the statistical relation within the range used for the estimation is going to remain constant outside that range. This can lead to very unrealistic predictions. For instance, imagine that we estimate the price of a house using a regression model with number of bathrooms as the independent variable and obtained a positive slope. In the sample the maximum number of bathrooms is 4. Any price prediction using the model and a number of bathrooms greater than 4 is an extrapolation and we should be critical about the predicted values. There is a reason why we don’t have houses with more than 4 bathrooms.
Most of these limitations have a fix. Depending on how much time we have left in the semester, we may come back to this topic and address them in more detail.
Evaluating Regression Models
Now that we understand the basics of the linear regression model, and how to implement it in Python, we are going to work on evaluating the regression results. That is, after we go through the procedure of estimating the regression coefficients, we need to interpret and judge the regression results to check whether we can apply them to the problem at hand or whether there are issues with the model that need to be fixed.
First, we are going to work on how to judge if the regression fits the data well or not using the R2 coefficient and plotting a scatterplot of the data with the regression line.
Then we'll use the concepts of standard errors, confidence intervals, t-statistics, and p-values to test whether the estimated beta coefficients are statistically equal to or different from a hypothesized value, generally zero. This is important because the beta coefficients, just like the sample average and standard deviation, are statistics derived from a sample. As a result they will be slightly different for each sample, even if the samples come from the same population.
Practice:
Let's start with a quick re-cap of what we have learned about regression. Start by loading some data, the relevant packages and estimating a regression:
1. Load the packages pandas, matplotlib.pyplot, and statsmodels.formula.api
2. Load the CASchools.csv data
3. Estimate a regression of math on income (math is the dependent variable and income is the independent variable). The math variable measures the average SAT scores of schools in California. The income variable is the median annual household income in thousands of dollars. (Hint: Use smf.ols() to specify the model and .fit() to estimate it. You can extract the parameters using .params or produce a summary table using .summary(). See previous notes if you need more guidance.)
4. Estimate the intercept and slope coefficients (beta coefficients). Is the interpretation of the intercept nonsensical? Explain.
5. How much extra income is necessary to increase the average SAT scores of a school by 25 points? (Keeping everything else constant)
6. Make a histogram of the residuals. Describe the distribution.
You should conclude that:
1. Intercept: When the median household income is equal to zero, the average math score is 625.54. But this value makes no sense as having a median income of zero is not realistic.
2. Slope: For each additional thousand dollars in median household income, the average math SAT score increases by 1.81 points.
The answer to part 5 can be solved with basic algebra:
y = β0 + β1 x
Δy = y2 − y1 = (β0 + β1 x2) − (β0 + β1 x1) = β1 (x2 − x1)
Δy = β1 Δx
If the change in math scores is 25, then Δy = 25. Recall that we estimated β1 = 1.81. Thus:
25 = 1.81 Δx
Δx = 25 / 1.81 ≈ 13.81
Therefore, the median household income needs to increase by about $13,800 for the average math SAT score to increase by 25 points.
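The same arithmetic can be checked in Python:

```python
beta1 = 1.81            # estimated slope: SAT points per $1,000 of income
delta_y = 25            # desired increase in the average math score
delta_x = delta_y / beta1

print(round(delta_x, 2))  # 13.81 (thousands of dollars)
```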
2023-04-25