
Ridge Regression in Python (Step-by-Step)

by Erma Khan

Ridge regression is a method we can use to fit a regression model when multicollinearity is present in the data.

In a nutshell, least squares regression tries to find coefficient estimates that minimize the sum of squared residuals (RSS):

RSS = Σ(yi – ŷi)²

where:

  • Σ: A Greek symbol that means sum
  • yi: The actual response value for the ith observation
  • ŷi: The predicted response value based on the multiple linear regression model

Conversely, ridge regression seeks to minimize the following:

RSS + λΣβj²

where j ranges from 1 to p predictor variables and λ ≥ 0.

This second term in the equation is known as a shrinkage penalty. In ridge regression, we select a value for λ that produces the lowest possible test MSE (mean squared error).
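
To see the shrinkage penalty in action, here is a minimal sketch (using synthetic data, not the mtcars dataset used below) comparing ordinary least squares coefficients with ridge coefficients at a large alpha. Because the penalty pulls coefficients toward zero, the ridge coefficient vector has a smaller L2 norm:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

#generate synthetic data (illustrative only)
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([3.0, -2.0, 1.5]) + rng.normal(scale=0.5, size=50)

#fit ordinary least squares and ridge with a large penalty
ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10).fit(X, y)

#the shrinkage penalty pulls the ridge coefficients toward zero
print("OLS coefficient norm:  ", np.linalg.norm(ols.coef_))
print("Ridge coefficient norm:", np.linalg.norm(ridge.coef_))
```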

This tutorial provides a step-by-step example of how to perform ridge regression in Python.

Step 1: Import Necessary Packages

First, we’ll import the necessary packages to perform ridge regression in Python:

import pandas as pd
from numpy import arange
from sklearn.linear_model import Ridge
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import RepeatedKFold

Step 2: Load the Data

For this example, we’ll use a dataset called mtcars, which contains information about 32 different cars. We’ll use hp as the response variable and the following variables as the predictors:

  • mpg
  • wt
  • drat
  • qsec

The following code shows how to load and view this dataset:

#define URL where data is located
url = "https://raw.githubusercontent.com/Statology/Python-Guides/main/mtcars.csv"

#read in data
data_full = pd.read_csv(url)

#select subset of data
data = data_full[["mpg", "wt", "drat", "qsec", "hp"]]

#view first six rows of data
data[0:6]

	mpg	wt	drat	qsec	hp
0	21.0	2.620	3.90	16.46	110
1	21.0	2.875	3.90	17.02	110
2	22.8	2.320	3.85	18.61	93
3	21.4	3.215	3.08	19.44	110
4	18.7	3.440	3.15	17.02	175
5	18.1	3.460	2.76	20.22	105
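
Since ridge regression targets multicollinearity, it can be useful to check how correlated the predictors are before fitting. The following sketch (not part of the original tutorial) computes pairwise correlations for the six rows shown above; with the full dataset you would simply call .corr() on the predictor columns of data:

```python
import pandas as pd

#first six rows of the mtcars predictor subset shown above
data = pd.DataFrame({
    "mpg":  [21.0, 21.0, 22.8, 21.4, 18.7, 18.1],
    "wt":   [2.620, 2.875, 2.320, 3.215, 3.440, 3.460],
    "drat": [3.90, 3.90, 3.85, 3.08, 3.15, 2.76],
    "qsec": [16.46, 17.02, 18.61, 19.44, 17.02, 20.22],
})

#pairwise correlations among predictors; values near ±1 signal multicollinearity
print(data.corr().round(2))
```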

Step 3: Fit the Ridge Regression Model

Next, we’ll use the RidgeCV() function from sklearn to fit the ridge regression model, along with the RepeatedKFold() function to perform k-fold cross-validation and find the optimal alpha value for the penalty term.

Note: scikit-learn uses the term “alpha” instead of “lambda” for the penalty parameter.

For this example we’ll choose k = 10 folds and repeat the cross-validation process 3 times.

Also note that RidgeCV() only tests the alpha values 0.1, 1, and 10 by default. However, we can define our own grid of alpha values from 0 to 1 in increments of 0.01 (arange() excludes the upper endpoint, so the largest value tested is 0.99):

#define predictor and response variables
X = data[["mpg", "wt", "drat", "qsec"]]
y = data["hp"]

#define cross-validation method to evaluate model
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

#define model
model = RidgeCV(alphas=arange(0, 1, 0.01), cv=cv, scoring='neg_mean_absolute_error')

#fit model
model.fit(X, y)

#display alpha that produced the lowest test mean absolute error
print(model.alpha_)

0.99

The alpha value that minimizes the test mean absolute error (the scoring metric we specified above) turns out to be 0.99.
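
After fitting, the coefficient estimates are available through the model’s coef_ and intercept_ attributes. As a self-contained sketch, the snippet below refits Ridge at the selected alpha on just the six rows shown earlier, so the printed numbers are illustrative rather than the full-data estimates:

```python
import pandas as pd
from sklearn.linear_model import Ridge

#six rows of the mtcars subset shown earlier (illustrative only)
X_small = pd.DataFrame({
    "mpg":  [21.0, 21.0, 22.8, 21.4, 18.7, 18.1],
    "wt":   [2.620, 2.875, 2.320, 3.215, 3.440, 3.460],
    "drat": [3.90, 3.90, 3.85, 3.08, 3.15, 2.76],
    "qsec": [16.46, 17.02, 18.61, 19.44, 17.02, 20.22],
})
y_small = pd.Series([110, 110, 93, 110, 175, 105], name="hp")

#refit ridge at the alpha selected by cross-validation
m = Ridge(alpha=0.99).fit(X_small, y_small)

#inspect the fitted coefficients and intercept
for name, coef in zip(X_small.columns, m.coef_):
    print(f"{name}: {coef:.3f}")
print(f"intercept: {m.intercept_:.3f}")
```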

Step 4: Use the Model to Make Predictions

Lastly, we can use the final ridge regression model to make predictions on new observations. For example, the following code shows how to define a new car with the following attributes:

  • mpg: 24
  • wt: 2.5
  • drat: 3.5
  • qsec: 18.5

The following code shows how to use the fitted ridge regression model to predict the value for hp of this new observation:

#define new observation (as a DataFrame so feature names match the training data)
new = pd.DataFrame([[24, 2.5, 3.5, 18.5]], columns=X.columns)

#predict hp value using ridge regression model
model.predict(new)

array([104.16398018])

Based on the input values, the model predicts this car to have an hp value of about 104.16.

