Multicollinearity in regression analysis occurs when two or more predictor variables are highly correlated with each other, such that they do not provide unique or independent information in the regression model. If the correlation between variables is high enough, it can cause problems when fitting and interpreting the regression model.
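One way to detect multicollinearity is by using a metric known as the variance inflation factor (VIF), which measures how strongly each predictor variable is correlated with the other predictor variables in a regression model. The VIF for a given predictor equals 1 / (1 - R²), where R² is obtained by regressing that predictor on all of the other predictors, so it shows how much the variance of that predictor's coefficient estimate is inflated by the correlation.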
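To make the definition concrete, here is a minimal sketch that computes VIF by hand as 1 / (1 - R²), where R² comes from regressing each predictor on the others. It is not what SPSS runs internally; the helper name `vif` and the randomly generated data are assumptions made purely for illustration.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X (rows are observations).

    Hypothetical helper: regresses column j on the remaining columns
    (plus an intercept) and returns 1 / (1 - R^2) for each j.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    result = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # add intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r_squared = 1.0 - resid.var() / y.var()
        result.append(1.0 / (1.0 - r_squared))
    return np.array(result)

# Synthetic data for illustration only (not the dataset in this tutorial).
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)             # unrelated to a
c = a + 0.1 * rng.normal(size=200)   # nearly a copy of a

vifs = vif(np.column_stack([a, b, c]))
print(vifs)
```

In this made-up example, the first and third columns should receive large VIF values because each is almost fully explained by the other, while the second column should stay close to 1.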
This tutorial explains how to use VIF to detect multicollinearity in a regression analysis in SPSS.
Example: Multicollinearity in SPSS
Suppose we have the following dataset that shows the exam score of 10 students along with the number of hours they spent studying, the number of prep exams they took, and their current grade in the course:
We would like to perform a linear regression using score as the response variable and hours, prep_exams, and current_grade as the predictor variables, but we want to make sure that the three predictor variables aren’t highly correlated.
To determine if multicollinearity is a problem, we can produce VIF values for each of the predictor variables.
To do so, click on the Analyze tab, then Regression, then Linear:
In the new window that pops up, drag score into the box labelled Dependent and drag the three predictor variables into the box labelled Independent(s). Then click Statistics and check the box next to Collinearity diagnostics. Click Continue, then click OK.
Once you click OK, the output will include a table that shows the VIF value for each predictor variable:
The VIF values for each of the predictor variables are as follows:
- hours: 1.169
- prep_exams: 1.403
- current_grade: 1.522
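SPSS reports a Tolerance value next to each VIF, which is simply 1 / VIF. As a rough sketch, the snippet below converts the three VIF values above into tolerance and the implied R² (the share of each predictor's variance explained by the other predictors). The variable names are chosen here for illustration.

```python
# VIF values taken from the SPSS output listed above.
vif_values = {"hours": 1.169, "prep_exams": 1.403, "current_grade": 1.522}

implied = {}
for name, v in vif_values.items():
    tolerance = 1.0 / v          # tolerance = 1 - R^2
    r_squared = 1.0 - tolerance  # R^2 from regressing this predictor on the others
    implied[name] = (tolerance, r_squared)
    print(f"{name}: tolerance = {tolerance:.3f}, implied R^2 = {r_squared:.3f}")
```

For hours, for instance, the implied R² is roughly 0.14, meaning only about 14% of its variance is explained by the other two predictors.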
The value for VIF starts at 1 and has no upper limit. A general rule of thumb for interpreting VIFs is as follows:
- A value of 1 indicates there is no correlation between a given predictor variable and any other predictor variables in the model.
- A value between 1 and 5 indicates moderate correlation between a given predictor variable and other predictor variables in the model, but this is often not severe enough to require attention.
- A value greater than 5 indicates potentially severe correlation between a given predictor variable and other predictor variables in the model. In this case, the coefficient estimates and p-values in the regression output are likely unreliable.
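The rule of thumb above can be written as a small helper. This is only a sketch of the thresholds listed in this tutorial; the function name `interpret_vif` is hypothetical, and other sources use different cutoffs.

```python
def interpret_vif(vif):
    """Classify a VIF value using the rule of thumb from this tutorial."""
    if vif < 1:
        raise ValueError("VIF cannot be less than 1")
    if vif == 1:
        return "no correlation"
    if vif <= 5:
        return "moderate correlation"
    return "potentially severe correlation"

# Apply the rule to the VIF values from the SPSS output in this example.
vif_values = {"hours": 1.169, "prep_exams": 1.403, "current_grade": 1.522}
labels = {name: interpret_vif(v) for name, v in vif_values.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```

All three predictors in this example fall into the moderate band, well below the cutoff of 5.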
We can see that none of the VIF values for the predictor variables in this example are greater than 5, which suggests that multicollinearity is unlikely to be a problem in the regression model.