A Box-Cox transformation is a commonly used method for transforming a non-normally distributed dataset into a more normally distributed one.
The basic idea behind this method is to find some value for λ such that the transformed data is as close to normally distributed as possible, using the following formula:
- y(λ) = (y^λ - 1) / λ if λ ≠ 0
- y(λ) = log(y) if λ = 0
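The two cases of the formula can be sketched as a small R function. This is only an illustration: `box_cox()` and its `tol` argument are hypothetical names rather than part of any package, and the check for positive values reflects the fact that the transformation is only defined for y > 0.

```r
# Hypothetical helper (not part of MASS): apply the Box-Cox formula for a given lambda
box_cox <- function(y, lambda, tol = 1e-8) {
  # The transformation is only defined for strictly positive values
  stopifnot(is.numeric(y), all(y > 0))
  if (abs(lambda) < tol) {
    log(y)                    # case lambda = 0
  } else {
    (y^lambda - 1) / lambda   # case lambda != 0
  }
}

box_cox(c(1, 2, 4), lambda = 0)    # same values as log(c(1, 2, 4))
box_cox(c(1, 4, 9), lambda = 0.5)  # (sqrt(y) - 1) / 0.5, i.e. 0, 2, 4
```

Treating values of lambda very close to zero as zero avoids dividing by a tiny number; the tolerance used here is an arbitrary choice.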
We can perform a Box-Cox transformation in R by using the boxcox() function from the MASS package. The following example shows how to use this function in practice.
Refer to this paper from the University of Connecticut for a nice summary of the development of the Box-Cox transformation.
Example: Box-Cox Transformation in R
The following code shows how to fit a linear regression model to a dataset, then use the boxcox() function to find an optimal lambda to transform the response variable and fit a new model.
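library(MASS)

#create data
y=c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 6, 7, 8)
x=c(7, 7, 8, 3, 2, 4, 4, 6, 6, 7, 5, 3, 3, 5, 8)

#fit linear regression model
model <- lm(y ~ x)

#find optimal lambda for Box-Cox transformation
#(boxcox() also plots the profile log-likelihood over a grid of lambda values)
bc <- boxcox(y ~ x)
lambda <- bc$x[which.max(bc$y)]
lambda

#fit new linear regression model using the Box-Cox transformation
new_model <- lm(((y^lambda - 1) / lambda) ~ x)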
The optimal lambda was found to be -0.4242424. Thus, the new regression model replaced the original response variable y with the transformed variable (y^-0.4242424 - 1) / -0.4242424.
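Because the new model is fit on the transformed response, its fitted values are on the transformed scale. The sketch below applies the formula by hand with the reported lambda and then inverts it by solving z = (y^λ - 1) / λ for y. The helper name `inv_box_cox()` is hypothetical, and the lambda value is hard-coded from the result above rather than recomputed.

```r
# Data and lambda taken from the example above
y <- c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 6, 7, 8)
x <- c(7, 7, 8, 3, 2, 4, 4, 6, 6, 7, 5, 3, 3, 5, 8)
lambda <- -0.4242424

# Forward transform (the lambda != 0 case of the formula)
y_bc <- (y^lambda - 1) / lambda

# Hypothetical helper: inverse transform, valid while lambda * z + 1 > 0
inv_box_cox <- function(z, lambda) (lambda * z + 1)^(1 / lambda)

# Round trip: inverting the transformed response should recover y
round_trip_ok <- isTRUE(all.equal(inv_box_cox(y_bc, lambda), y))

# Fitted values of the transformed model, expressed on the original scale of y
new_model <- lm(y_bc ~ x)
fitted_original_scale <- inv_box_cox(fitted(new_model), lambda)
```

Note that back-transformed fitted values are not the same thing as fitted means on the original scale, so they are best read as a rough guide.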
The following code shows how to create two Q-Q plots in R to visualize the differences in residuals between the two regression models:
#define plotting area (two plots side by side)
op <- par(mfrow = c(1, 2))

#Q-Q plot for original model
qqnorm(model$residuals)
qqline(model$residuals)

#Q-Q plot for Box-Cox transformed model
qqnorm(new_model$residuals)
qqline(new_model$residuals)

#restore the original plotting parameters
par(op)
As a rule of thumb, if the data points fall along a straight diagonal line in a Q-Q plot then the dataset likely follows a normal distribution.
Notice how the Box-Cox transformed model produces a Q-Q plot in which the points follow the reference line much more closely than in the original regression model.
This indicates that the residuals of the Box-Cox transformed model are closer to normally distributed, which better satisfies one of the assumptions of linear regression.
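The visual impression from the Q-Q plots can be complemented with a formal check. The sketch below runs a Shapiro-Wilk test on the residuals of both models using shapiro.test() from base R. It assumes the same data as the example and hard-codes the lambda reported above; the variable names `sw_original` and `sw_boxcox` are illustrative.

```r
# Data and lambda taken from the example above
y <- c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 6, 7, 8)
x <- c(7, 7, 8, 3, 2, 4, 4, 6, 6, 7, 5, 3, 3, 5, 8)
lambda <- -0.4242424

# Refit both models so that this snippet is self-contained
model <- lm(y ~ x)
new_model <- lm(((y^lambda - 1) / lambda) ~ x)

# Shapiro-Wilk test on each set of residuals (null hypothesis: normality)
sw_original <- shapiro.test(residuals(model))
sw_boxcox <- shapiro.test(residuals(new_model))

# Compare the two p-values side by side
p_values <- c(original = sw_original$p.value, box_cox = sw_boxcox$p.value)
p_values
```

A larger p-value means less evidence against normality. With only 15 observations the test has limited power, so it is best read alongside the Q-Q plots rather than instead of them.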