R-squared is a measure of how well a linear regression model “fits” a dataset. Also commonly called the coefficient of determination, R-squared is the proportion of the variance in the response variable that can be explained by the predictor variable.
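Formally, R-squared is defined in terms of sums of squares:

R² = 1 − SSresidual / SStotal

where SSresidual is the sum of squared residuals (the variation the model leaves unexplained) and SStotal is the total sum of squares (the total variation of the response variable around its mean).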
The value for R-squared can range from 0 to 1. A value of 0 indicates that the response variable cannot be explained by the predictor variable at all. A value of 1 indicates that the response variable can be perfectly explained without error by the predictor variable.
In practice, you will likely never see a value of 0 or 1 for R-squared. Instead, you’ll likely encounter some value between 0 and 1.
For example, suppose you have a dataset that contains the population size and number of flower shops in 30 different cities. You fit a simple linear regression model to the dataset, using population size as the predictor variable and the number of flower shops as the response variable. In the output of the regression results, you see that R² = 0.2. This indicates that 20% of the variance in the number of flower shops can be explained by population size.
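To make this concrete, here is a minimal sketch of fitting such a model in Python with statsmodels. The dataset, variable names, and coefficient values below are made up for illustration (they are not the data behind the R² = 0.2 figure above); the snippet simply shows where the R-squared value comes from, both as reported by the model and computed directly from the definition.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: population size and number of flower shops for 30 cities
rng = np.random.default_rng(0)
population = rng.uniform(10_000, 100_000, size=30)
shops = 12 + 0.0005 * population + rng.normal(0, 10, size=30)  # noisy, made-up relationship

# Fit a simple linear regression: flower shops ~ population size
X = sm.add_constant(population)      # adds the intercept column
model = sm.OLS(shops, X).fit()

# R-squared as reported by the fitted model
print(f"R-squared: {model.rsquared:.3f}")

# The same value computed from the definition: 1 - SS_residual / SS_total
ss_res = np.sum(model.resid ** 2)
ss_tot = np.sum((shops - shops.mean()) ** 2)
print(f"1 - SS_res / SS_tot: {1 - ss_res / ss_tot:.3f}")
```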
This leads to an important question: is this a “good” value for R-squared?
The answer to this question depends on your objective for the regression model. Namely:
1. Are you interested in explaining the relationship between the predictor(s) and the response variable?
OR
2. Are you interested in predicting the response variable?
Depending on the objective, the answer to “What is a good value for R-squared?” will be different.
Explaining the Relationship Between the Predictor(s) and the Response Variable
If your main objective for your regression model is to explain the relationship between the predictor(s) and the response variable, the R-squared is mostly irrelevant.
For example, suppose in the regression example from above, you see that the coefficient for the predictor population size is 0.005 and that it is statistically significant. This means that each one-person increase in population size is associated with an average increase of 0.005 in the number of flower shops in a city; in other words, population size is a statistically significant predictor of the number of flower shops.
Whether the R-squared value for this regression model is 0.2 or 0.9 doesn’t change this interpretation. Since you are simply interested in the relationship between population size and the number of flower shops, you don’t have to be overly concerned with the R-squared value of the model.
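If you are working from a fitted model like the hypothetical sketch above, the coefficient and its p-value (the quantities that actually matter for this kind of interpretation) can be read off directly; note that none of this output depends on how large the R-squared is. This continues the made-up example, so the specific numbers it prints are illustrative only.

```python
# Slope and p-value for the population predictor (using the model fitted in the earlier sketch)
intercept, slope = model.params
slope_p_value = model.pvalues[1]

print(f"Average change in flower shops per additional person: {slope:.6f}")
print(f"p-value for population size: {slope_p_value:.4f}")
```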
Predicting the Response Variable
If your main objective is to predict the value of the response variable accurately using the predictor variable, then R-squared is important.
In general, the larger the R-squared value, the more precisely the predictor variables are able to predict the value of the response variable.
How high an R-squared value needs to be depends on how precise you need to be. For example, in scientific studies, the R-squared may need to be above 0.95 for a regression model to be considered reliable. In other domains, an R-squared of just 0.3 may be sufficient if there is extreme variability in the dataset.
To find out what is considered a “good” R-squared value, you will need to explore what R-squared values are generally accepted in your particular field of study. If you’re performing a regression analysis for a client or a company, you may be able to ask them what is considered an acceptable R-squared value.
Prediction Intervals
A prediction interval specifies a range in which a new observation is likely to fall, given particular values of the predictor variables. Narrower prediction intervals indicate that the predictor variables can predict the response variable with more precision.
Often a prediction interval can be more useful than an R-squared value because it gives you a concrete range of values in which a new observation is likely to fall. This is particularly useful if the primary objective of your regression is to predict new values of the response variable.
For example, suppose a population size of 40,000 produces a prediction interval of 30 to 35 flower shops in a particular city. This may or may not be considered an acceptable range of values, depending on what the regression model is being used for.
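As a rough sketch of how such an interval could be computed, the snippet below uses the get_prediction method on the hypothetical statsmodels fit from earlier; the population value of 40,000 matches the example above, but the interval it prints comes from the made-up data, not from the numbers quoted here.

```python
# 95% prediction interval for a new city with a population of 40,000
new_city = np.array([[1.0, 40_000.0]])   # [intercept term, population size]
pred = model.get_prediction(new_city).summary_frame(alpha=0.05)

print(f"Predicted number of flower shops: {pred['mean'].iloc[0]:.1f}")
print(f"95% prediction interval: {pred['obs_ci_lower'].iloc[0]:.1f} "
      f"to {pred['obs_ci_upper'].iloc[0]:.1f}")
```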
Conclusion
In general, the larger the R-squared value, the more precisely the predictor variables are able to predict the value of the response variable.
How high an R-squared value needs to be in order to be considered “good” varies by field. Some fields require higher precision than others.
To find out what is considered a “good” R-squared value, consider what is generally accepted in the field you’re working in, ask someone with specific subject area knowledge, or ask the client/company you’re performing the regression analysis for what they consider to be acceptable.
If you’re interested in explaining the relationship between the predictor and response variable, the R-squared is largely irrelevant since it doesn’t impact the interpretation of the regression model.
If you’re interested in predicting the response variable, prediction intervals are generally more useful than R-squared values.
Further Reading:
Pearson Correlation Coefficient
Introduction to Simple Linear Regression