One of the most common problems that you’ll encounter in machine learning is multicollinearity. This occurs when two or more predictor variables in a dataset are highly correlated.
When this occurs, a model may be able to fit a training dataset well but it may perform poorly on a new dataset it has never seen because it overfits the training set.
One way to get around this problem is to use a method known as partial least squares, which works as follows:
- Standardize both the predictor and response variables.
- Calculate M linear combinations (called “PLS components”) of the original p predictor variables that explain a significant amount of variation in both the response variable and the predictor variables.
- Use the method of least squares to fit a linear regression model using the PLS components as predictors.
- Use k-fold cross-validation to find the optimal number of PLS components to keep in the model.
This tutorial provides a step-by-step example of how to perform partial least squares in R.
Step 1: Load Necessary Packages
The easiest way to perform partial least squares in R is by using functions from the pls package.
#install pls package (if not already installed) install.packages("pls") load pls package library(pls)
Step 2: Fit Partial Least Squares Model
For this example, we’ll use the built-in R dataset called mtcars which contains data about various types of cars:
#view first six rows of mtcars dataset
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
For this example we’ll fit a partial least squares (PLS) model using hp as the response variable and the following variables as the predictor variables:
- mpg
- disp
- drat
- wt
- qsec
The following code shows how to fit the PLS model to this data. Note the following arguments:
- scale=TRUE: This tells R that each of the variables in the dataset should be scaled to have a mean of 0 and a standard deviation of 1. This ensures that no predictor variable is overly influential in the model if it happens to be measured in different units.
- validation=”CV”: This tells R to use k-fold cross-validation to evaluate the performance of the model. Note that this uses k=10 folds by default. Also note that you can specify “LOOCV” instead to perform leave-one-out cross-validation.
#make this example reproducible set.seed(1) #fit PCR model model TRUE, validation="CV")
Step 3: Choose the Number of PLS Components
Once we’ve fit the model, we need to determine the number of PLS components worth keeping.
The way to do so is by looking at the test root mean squared error (test RMSE) calculated by the k-fold cross-validation:
#view summary of model fitting
summary(model)
Data: X dimension: 32 5
Y dimension: 32 1
Fit method: kernelpls
Number of components considered: 5
VALIDATION: RMSEP
Cross-validated using 10 random segments.
(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps
CV 69.66 40.57 35.48 36.22 36.74 36.67
adjCV 69.66 40.41 35.12 35.80 36.27 36.20
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps
X 68.66 89.27 95.82 97.94 100.00
hp 71.84 81.74 82.00 82.02 82.03
There are two tables of interest in the output:
1. VALIDATION: RMSEP
This table tells us the test RMSE calculated by the k-fold cross validation. We can see the following:
- If we only use the intercept term in the model, the test RMSE is 69.66.
- If we add in the first PLS component, the test RMSE drops to 40.57.
- If we add in the second PLS component, the test RMSE drops to 35.48.
We can see that adding additional PLS components actually leads to an increase in test RMSE. Thus, it appears that it would be optimal to only use two PLS components in the final model.
2. TRAINING: % variance explained
This table tells us the percentage of the variance in the response variable explained by the PLS components. We can see the following:
- By using just the first PLS component, we can explain 68.66% of the variation in the response variable.
- By adding in the second PLS component, we can explain 89.27% of the variation in the response variable.
Note that we’ll always be able to explain more variance by using more PLS components, but we can see that adding in more than two PLS components doesn’t actually increase the percentage of explained variance by much.
We can also visualize the test RMSE (along with the test MSE and R-squared) based on the number of PLS components by using the validationplot() function.
#visualize cross-validation plots validationplot(model) validationplot(model, val.type="MSEP") validationplot(model, val.type="R2")
In each plot we can see that the model fit improves by adding in two PLS components, yet it tends to get worse when we add more PLS components.
Thus, the optimal model includes just the first two PLS components.
Step 4: Use the Final Model to Make Predictions
We can use the final model with two PLS components to make predictions on new observations.
The following code shows how to split the original dataset into a training and testing set and use the final model with two PLS components to make predictions on the testing set.
#define training and testing sets train nrow(mtcars), c("hp")] test nrow(mtcars), c("mpg", "disp", "drat", "wt", "qsec")] #use model to make predictions on a test set model TRUE, validation="CV") pcr_pred 2) #calculate RMSE sqrt(mean((pcr_pred - y_test)^2)) [1] 54.89609
We can see that the test RMSE turns out to be 54.89609. This is the average deviation between the predicted value for hp and the observed value for hp for the observations in the testing set.
Note that an equivalent principal components regression model with two principal components produced a test RMSE of 56.86549. Thus, the PLS model slightly outperformed the PCR model for this dataset.
The complete R code use in this example can be found here.