When we fit a linear regression model, we often calculate the R-squared value of the model.
The R-squared value is the proportion of the variance in the response variable that can be explained by the predictor variables in the model.
The value of R-squared can range from 0 to 1, where:
- A value of 0 indicates that the response variable cannot be explained by the predictor variables at all.
- A value of 1 indicates that the response variable can be perfectly explained by the predictor variables.
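To make the definition concrete, R-squared can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. Here is a minimal sketch in Python using NumPy, with made-up observed values and model predictions chosen purely for illustration:

```python
import numpy as np

# Hypothetical observed values and model predictions (made-up numbers for illustration)
y = np.array([65, 70, 74, 80, 85, 90])
y_hat = np.array([66, 69, 75, 79, 86, 89])

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
r_squared = 1 - ss_res / ss_tot

print(round(r_squared, 3))  # ≈ 0.986
```

A value close to 1, as in this sketch, means the predictions track the observed values closely.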
Although this metric is commonly used to assess how well a regression model fits a dataset, it has one serious drawback:
The drawback of R-squared:
R-squared will always increase when a new predictor variable is added to the regression model.
Even if a new predictor variable is almost completely unrelated to the response variable, the R-squared value of the model will increase, if only by a small amount.
For this reason, a regression model with a large number of predictor variables can have a high R-squared value even if the model doesn’t fit the data well.
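To see this behavior in action, the following sketch simulates a response that depends only on one predictor and then adds a second, completely unrelated predictor. The data, sample size, and use of scikit-learn here are assumptions made purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 50
hours = rng.uniform(0, 10, n)
# Response depends only on hours studied, plus noise
score = 60 + 3 * hours + rng.normal(0, 5, n)
# A predictor that has nothing to do with the response
unrelated = rng.normal(size=n)

X_small = hours.reshape(-1, 1)
X_big = np.column_stack([hours, unrelated])

r2_small = LinearRegression().fit(X_small, score).score(X_small, score)
r2_big = LinearRegression().fit(X_big, score).score(X_big, score)

# R-squared with the extra (unrelated) predictor is never lower
print(r2_small, r2_big)
```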
Fortunately there is an alternative to R-squared known as adjusted R-squared.
The adjusted R-squared is a modified version of R-squared that adjusts for the number of predictors in a regression model.
It is calculated as:
Adjusted R2 = 1 - [(1 - R2) * (n - 1) / (n - k - 1)]
where:
- R2: The R2 of the model
- n: The number of observations
- k: The number of predictor variables
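As a quick sketch, the formula above translates directly into a small Python function; the sample values of R2, n, and k below are assumptions chosen only to show the calculation:

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R2 = 1 - [(1 - R2) * (n - 1) / (n - k - 1)]."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# Assumed example values: R2 = 0.955 with n = 50 observations and k = 2 predictors
print(round(adjusted_r_squared(0.955, n=50, k=2), 3))  # 0.953
```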
Because R-squared always increases as you add more predictors to a model, adjusted R-squared tells you how useful a model is once the number of predictors is taken into account.
The advantage of Adjusted R-squared:
Adjusted R-squared tells us how well a set of predictor variables is able to explain the variation in the response variable, adjusted for the number of predictors in a model.
Because of the way it’s calculated, adjusted R-squared can be used to compare the fit of regression models with different numbers of predictor variables.
To gain a better understanding of adjusted R-squared, check out the following example.
Example: Understanding Adjusted R-Squared in Regression Models
Suppose a professor collects data on students in his class and fits the following regression model to understand how hours spent studying and current grade in the class affect the score a student receives on the final exam.
Exam Score = β0 + β1(hours spent studying) + β2(current grade)
Suppose this regression model has the following metrics:
- R-squared: 0.955
- Adjusted R-squared: 0.946
Now suppose the professor decides to collect data on another variable for each student: shoe size.
Although this variable should be completely unrelated to the final exam score, he decides to fit the following regression model:
Exam Score = β0 + β1(hours spent studying) + β2(current grade) + β3(shoe size)
Suppose this regression model has the following metrics:
- R-squared: 0.965
- Adjusted R-squared: 0.902
If we only looked at the R-squared values for each of these two regression models, we would conclude that the second model is better to use because it has a higher R-squared value!
However, if we look at the adjusted R-squared values then we come to a different conclusion: The first model is better to use because it has a higher adjusted R-squared value.
The second model has a higher R-squared value only because it has more predictor variables than the first model.
However, the predictor variable that we added (shoe size) was a poor predictor of final exam score, so the adjusted R-squared value penalized the model for adding this predictor variable.
This example illustrates why adjusted R-squared is a better metric to use when comparing the fit of regression models with different numbers of predictor variables.
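A comparison along these lines can be reproduced with a few lines of Python using statsmodels. The simulated data below (including the sample size and coefficients) are assumptions standing in for the professor's class, since the actual data isn't shown:

```python
import numpy as np
import statsmodels.api as sm

# Simulated data standing in for the professor's class (illustrative only)
rng = np.random.default_rng(1)
n = 30
hours = rng.uniform(0, 10, n)
grade = rng.uniform(60, 100, n)
shoe_size = rng.uniform(6, 13, n)  # unrelated to exam score
exam_score = 20 + 3 * hours + 0.5 * grade + rng.normal(0, 3, n)

# Model 1: hours spent studying + current grade
X1 = sm.add_constant(np.column_stack([hours, grade]))
# Model 2: hours spent studying + current grade + shoe size
X2 = sm.add_constant(np.column_stack([hours, grade, shoe_size]))

fit1 = sm.OLS(exam_score, X1).fit()
fit2 = sm.OLS(exam_score, X2).fit()

print("Model 1: R2 =", round(fit1.rsquared, 3), "Adj. R2 =", round(fit1.rsquared_adj, 3))
print("Model 2: R2 =", round(fit2.rsquared, 3), "Adj. R2 =", round(fit2.rsquared_adj, 3))
# R-squared never decreases when shoe size is added, but adjusted R-squared typically does
```

Because the shoe-size column carries no real information about the exam score, its only effect is to consume a degree of freedom, which is exactly what adjusted R-squared penalizes.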
Additional Resources
The following tutorials explain how to calculate adjusted R-squared values using different statistical software:
How to Calculate Adjusted R-Squared in R
How to Calculate Adjusted R-Squared in Excel
How to Calculate Adjusted R-Squared in Python