Linear regression is a method we can use to understand the relationship between one or more explanatory variables and a response variable.
When we perform linear regression on a dataset, we end up with a regression equation which can be used to predict the values of a response variable, given the values for the explanatory variables.
We can then subtract each predicted value from the corresponding actual value to get the residual for that observation. The residuals give us an idea of how well our regression model is able to predict the response values.
This tutorial explains how to obtain both the predicted values and the residuals for a regression model in Stata.
Example: How to Obtain Predicted Values and Residuals
For this example we will use the built-in Stata dataset called auto. We’ll use mpg and displacement as the explanatory variables and price as the response variable.
Use the following steps to perform linear regression and subsequently obtain the predicted values and residuals for the regression model.
Step 1: Load and view the data.
First, we’ll load the data using the following command:
sysuse auto
Next, we’ll get a quick summary of the data using the following command:
summarize
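Since only three variables are used in this example, it may help to restrict the overview to them. The following is a minimal sketch, assuming the auto dataset that ships with Stata; note that the clear option discards any unsaved data currently in memory.

```stata
* Load the example dataset (clear drops whatever is currently in memory)
sysuse auto, clear

* Show storage types and labels for the three variables used in this tutorial
describe price mpg displacement

* Summary statistics for just those variables
summarize price mpg displacement

* Display the number of observations in the dataset
count
```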
Step 2: Fit the regression model.
Next, we’ll use the following command to fit the regression model:
regress price mpg displacement
The estimated regression equation is as follows:
estimated price = 6672.766 - 121.1833*(mpg) + 10.50885*(displacement)
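To see where the numbers in this equation come from, the coefficients can be read back from the fitted model with _b[]. The sketch below also plugs in a purely hypothetical car (20 mpg and a displacement of 200, values chosen for illustration and not taken from the dataset) to show how the equation produces a prediction.

```stata
* Refit the model so the coefficients are available in memory
sysuse auto, clear
regress price mpg displacement

* Display each stored coefficient
display _b[_cons]
display _b[mpg]
display _b[displacement]

* Predicted price for a hypothetical car: 20 mpg, displacement of 200
display _b[_cons] + _b[mpg]*20 + _b[displacement]*200
```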
Step 3: Obtain the predicted values.
We can obtain the predicted values by using the predict command, which stores them in a new variable with any name we choose. In this case, we’ll use the name pred_price:
predict pred_price
We can view the actual prices and the predicted prices side-by-side using the list command. There are 74 predicted values in total, but we’ll view just the first 10 by adding the in 1/10 qualifier:
list price pred_price in 1/10
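As a sanity check, the predicted values can be rebuilt by hand from the stored coefficients and compared with what predict returns. This is a minimal sketch; pred_check and manual_pred are hypothetical variable names used only for illustration, and double storage is requested so the comparison is not affected by float rounding.

```stata
* Refit the model (clear drops whatever is currently in memory)
sysuse auto, clear
regress price mpg displacement

* Linear prediction from predict; xb is the default, stated here explicitly
predict double pred_check, xb

* The same prediction computed by hand from the stored coefficients
generate double manual_pred = _b[_cons] + _b[mpg]*mpg + _b[displacement]*displacement

* Stops with an error if the two versions disagree
assert reldif(pred_check, manual_pred) < 1e-8
```

If the assert line runs without an error, the two sets of predictions agree for every observation.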
Step 4: Obtain the residuals.
We can obtain the residual for each prediction by using the predict command again, this time with the residuals option, and storing the values in a new variable with any name we choose. In this case, we’ll use the name resid_price:
predict resid_price, residuals
We can view the actual price, the predicted price, and the residuals all side-by-side using the list command again:
list price pred_price resid_price in 1/10
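Each residual is simply the actual price minus the predicted price, which can be confirmed directly. The sketch below uses hypothetical variable names (pred_check, resid_check, manual_resid) so that it does not clash with the variables created in the steps above.

```stata
* Refit the model (clear drops whatever is currently in memory)
sysuse auto, clear
regress price mpg displacement

* Predicted values and residuals from predict, stored as doubles
predict double pred_check, xb
predict double resid_check, residuals

* Residual computed by hand: actual minus predicted
generate double manual_resid = price - pred_check

* Stops with an error if the two versions disagree
assert abs(resid_check - manual_resid) < 1e-6

* With an intercept in the model, the mean residual should be essentially zero
summarize resid_check
```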
Step 5: Create a predicted values vs. residuals plot.
Lastly, we can create a scatterplot to visualize the relationship between the predicted values and the residuals:
scatter resid_price pred_price
We can see that, on average, the residuals tend to spread out more as the fitted values grow larger. This could be a sign of heteroscedasticity, which occurs when the variance of the residuals is not constant across the range of fitted values.
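Stata also has a built-in postestimation command, rvfplot, that draws a residual-versus-fitted plot without creating any new variables first. The following is a minimal sketch; the yline(0) option is an optional addition that draws a horizontal reference line at zero to make changes in spread easier to judge.

```stata
* Refit the model (clear drops whatever is currently in memory)
sysuse auto, clear
regress price mpg displacement

* Residual-versus-fitted plot with a reference line at zero
rvfplot, yline(0)
```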
We could formally test for heteroscedasticity using the Breusch-Pagan test, and we could address the problem by using robust standard errors.
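Both follow-up steps can be sketched in a few lines. This is only an outline under default settings: estat hettest runs the Breusch-Pagan / Cook-Weisberg test after regress, and the vce(robust) option refits the same model with robust standard errors. The coefficient estimates stay the same in the second fit; only the standard errors and the quantities derived from them change.

```stata
* Refit the model with conventional standard errors
sysuse auto, clear
regress price mpg displacement

* Breusch-Pagan / Cook-Weisberg test; a small p-value suggests heteroscedasticity
estat hettest

* Refit the same model with robust standard errors
regress price mpg displacement, vce(robust)
```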