The Mahalanobis distance measures the distance between two points in a multivariate space while accounting for the variances of the variables and the correlations among them. It’s often used to find outliers in statistical analyses that involve several variables.
This tutorial explains how to calculate the Mahalanobis distance in SPSS.
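For reference, the quantity behind the steps below can be written out. This is a sketch of the usual definition in its squared form, which is the form typically compared against a Chi-Square distribution. The symbols are notation introduced here and do not appear in SPSS: \mathbf{x} is the vector of values for one observation, \bar{\mathbf{x}} is the vector of variable means, and \mathbf{S} is the sample covariance matrix of the variables.

```latex
D^2(\mathbf{x}) = (\mathbf{x} - \bar{\mathbf{x}})^{\mathsf{T}} \, \mathbf{S}^{-1} \, (\mathbf{x} - \bar{\mathbf{x}})
```

If the variables were uncorrelated with unit variance, this would reduce to the squared Euclidean distance from the mean.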
Example: Mahalanobis Distance in SPSS
Suppose we have the following dataset that displays the exam score of 20 students along with the number of hours they spent studying, the number of prep exams they took, and their current grade in the course:
We can use the following steps to calculate the Mahalanobis distance for each observation in the dataset to determine if there are any multivariate outliers.
Step 1: Select the linear regression option.
Click the Analyze tab, then Regression, then Linear:
Step 2: Select the Mahalanobis option.
Drag the response variable score into the box labelled Dependent. Drag the other three predictor variables into the box labelled Independent(s). Then click the Save button. In the new window that pops up, make sure the box next to Mahalanobis is checked. Then click Continue. Then click OK.
Once you click OK, the Mahalanobis distance for each observation in the dataset will appear in a new column titled MAH_1:
We can see that some of the distances are much larger than others. To determine if any of the distances are statistically significant, we need to calculate their p-values.
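To make the saved column less of a black box, here is a minimal Python sketch of how a squared Mahalanobis distance from the centroid of the predictors can be computed by hand. It assumes the sample covariance matrix (divisor n - 1) and a non-singular covariance matrix. The function names and the small dataset are made up for illustration; they are not the tutorial's data, and this is not SPSS's own code.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Build an augmented copy so the inputs are not modified.
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - known) / aug[i][i]
    return x


def squared_mahalanobis(rows):
    """Squared Mahalanobis distance of each row from the column means."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in rows]
    # Sample covariance matrix (divisor n - 1).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    result = []
    for c in centered:
        z = solve_linear_system(cov, c)  # z = cov^-1 @ c
        result.append(sum(ci * zi for ci, zi in zip(c, z)))
    return result


# Hypothetical predictor values (three predictors per row), for illustration only.
example_rows = [
    [1.0, 2.0, 70.0],
    [2.0, 1.0, 75.0],
    [3.0, 4.0, 80.0],
    [4.0, 3.0, 78.0],
    [5.0, 0.0, 90.0],
    [16.0, 2.0, 60.0],
]
distances = squared_mahalanobis(example_rows)
for row, d in zip(example_rows, distances):
    print(row, round(d, 3))
```

Solving the linear system for each row avoids forming the inverse covariance matrix explicitly, which keeps the sketch short and dependency-free.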
Step 3: Calculate the p-values of each Mahalanobis distance.
Click the Transform tab, then Compute Variable.
In the Target Variable box, choose a new name for the variable you’re creating. We chose “pvalue.” In the Numeric Expression box, type the following:
1 - CDF.CHISQ(MAH_1, 3)
Then click OK.
For each observation, this computes the right-tail p-value of its Mahalanobis distance under a Chi-Square distribution with 3 degrees of freedom. We use 3 degrees of freedom because there are 3 predictor variables in our regression model.
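The same calculation can be sketched outside SPSS. The Python snippet below mirrors the expression 1 - CDF.CHISQ(MAH_1, 3) using a closed-form right-tail probability that holds only for 3 degrees of freedom, then applies the .001 cutoff used in this tutorial. The function names and the example distances are hypothetical and chosen purely for illustration.

```python
import math


def chi2_sf_3df(x):
    """Right-tail probability P(X > x) for a Chi-Square variable with 3 df.

    Uses the closed form valid only for 3 degrees of freedom:
    erfc(sqrt(x / 2)) + sqrt(2x / pi) * exp(-x / 2).
    """
    if x <= 0:
        return 1.0
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))


def flag_outliers(distances, alpha=0.001):
    """Return (distance, p_value, is_outlier) for each squared distance."""
    flagged = []
    for d in distances:
        p = chi2_sf_3df(d)
        flagged.append((d, p, p < alpha))
    return flagged


# Hypothetical squared Mahalanobis distances, not taken from the tutorial.
example_distances = [0.8, 3.1, 7.4, 18.5]
for d, p, is_outlier in flag_outliers(example_distances):
    print(f"distance={d:5.1f}  p-value={p:.5f}  outlier={is_outlier}")
```

A model with a different number of predictors would need a general Chi-Square survival function instead of this 3-df shortcut.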
Step 4: Interpret the p-values.
Once you click OK, the p-value for each Mahalanobis distance will be displayed in a new column:
By default, SPSS only displays the p-values to two decimal places. You can increase the number of decimal places by clicking Variable View at the bottom of SPSS and increasing the number in the Decimals column:
Once you return to the Data View, you can see each p-value shown to five decimal places. Any observation with a p-value less than .001 is considered to be an outlier.
We can see that the first observation is the only outlier in the dataset because it has a p-value less than .001:
How to Handle Outliers
If an outlier is present in your data, you have a couple of options:
1. Make sure the outlier is not the result of a data entry error.
Sometimes an individual simply enters the wrong data value when recording data. If an outlier is present, first verify that the data value was entered correctly and that it wasn’t an error.
2. Remove the outlier.
If the value is a true outlier, you may choose to remove it if it will have a significant impact on your overall analysis. Just make sure to mention in your final report or analysis that you removed an outlier.