Logistic regression is a method we can use to fit a regression model when the response variable is binary.
Logistic regression uses a method known as maximum likelihood estimation to find an equation of the following form:
log[p(X) / (1-p(X))] = β0 + β1X1 + β2X2 + … + βpXp
where:
- Xj: The jth predictor variable
- βj: The coefficient estimate for the jth predictor variable
The formula on the right side of the equation predicts the log odds of the response variable taking on a value of 1.
Thus, when we fit a logistic regression model we can use the following equation to calculate the probability that a given observation takes on a value of 1:
p(X) = e^(β0 + β1X1 + β2X2 + … + βpXp) / (1 + e^(β0 + β1X1 + β2X2 + … + βpXp))
We then use some probability threshold to classify the observation as either 1 or 0.
For example, we might say that observations with a probability greater than or equal to 0.5 will be classified as “1” and all other observations will be classified as “0.”
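To make this concrete, here is a minimal sketch of how a linear combination of predictors is converted into a probability and then a class label. The coefficient and predictor values are made up purely for illustration; a real model estimates the coefficients via maximum likelihood:

import math

#hypothetical coefficient estimates (made up for illustration)
b0, b1, b2 = -10.0, 0.005, 0.0001

#example values for two predictors
x1, x2 = 1500, 40000

log_odds = b0 + b1*x1 + b2*x2                      #the linear predictor (right-hand side)
p = math.exp(log_odds) / (1 + math.exp(log_odds))  #p(X), here about 0.82
label = 1 if p >= 0.5 else 0                       #apply the 0.5 cutoff -> classified as "1"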
This tutorial provides a step-by-step example of how to perform logistic regression in Python.
Step 1: Import Necessary Packages
First, we’ll import the necessary packages to perform logistic regression in Python:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import matplotlib.pyplot as plt
Step 2: Load the Data
For this example, we’ll use the Default dataset from the Introduction to Statistical Learning book. We can use the following code to load and view a summary of the dataset:
#import dataset from CSV file on Github
url = "https://raw.githubusercontent.com/Statology/Python-Guides/main/default.csv"
data = pd.read_csv(url)

#view first six rows of dataset
data[0:6]

   default  student      balance        income
0        0        0   729.526495  44361.625074
1        0        1   817.180407  12106.134700
2        0        0  1073.549164  31767.138947
3        0        0   529.250605  35704.493935
4        0        0   785.655883  38463.495879
5        0        1   919.588530   7491.558572

#find total observations in dataset
len(data.index)

10000
This dataset contains the following information about 10,000 individuals:
- default: Indicates whether or not an individual defaulted (1 = defaulted, 0 = did not).
- student: Indicates whether or not an individual is a student (1 = student, 0 = not).
- balance: Average balance carried by an individual.
- income: Income of the individual.
We will use student status, bank balance, and income to build a logistic regression model that predicts the probability that a given individual defaults.
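Since only a small minority of individuals in this dataset defaulted, it can be useful to check the class balance before modeling. One quick, optional way to do this:

#count how many individuals defaulted (1) vs. did not default (0)
data['default'].value_counts()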
Step 3: Create Training and Test Samples
Next, we’ll split the dataset into a training set to train the model on and a testing set to test the model on.
#define the predictor variables and the response variable
X = data[['student', 'balance', 'income']]
y = data['default']

#split the dataset into training (70%) and testing (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
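Because the response classes are imbalanced, one optional refinement (not used in the rest of this tutorial) is to pass stratify=y so that the training and testing sets preserve the overall default rate:

#alternative split that preserves the proportion of defaulters in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)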
Step 4: Fit the Logistic Regression Model
Next, we’ll use the LogisticRegression() function to fit a logistic regression model to the dataset:
#instantiate the model
log_regression = LogisticRegression()

#fit the model using the training data
log_regression.fit(X_train, y_train)

#use model to make predictions on test data
y_pred = log_regression.predict(X_test)
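To relate the fitted model back to the equation from the introduction, we can inspect the estimated intercept (β0) and coefficients (βj), which scikit-learn stores as attributes of the fitted model:

#view the estimated intercept and coefficients
print(log_regression.intercept_)   #β0
print(log_regression.coef_)        #coefficients for student, balance, income

Depending on your scikit-learn version, you may also see a convergence warning here, since balance and income are on very different scales; standardizing the predictors or increasing max_iter typically resolves it.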
Step 5: Model Diagnostics
Once we fit the regression model, we can then analyze how well our model performs on the test dataset.
First, we’ll create the confusion matrix for the model:
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix
array([[2886, 1],
[ 113, 0]])
From the confusion matrix we can see that (note that scikit-learn arranges the matrix with true labels on the rows and predicted labels on the columns):
- True negative predictions: 2886
- False positive predictions: 1
- False negative predictions: 113
- True positive predictions: 0
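Since the model predicts almost no defaults, it helps to look beyond raw counts. One way to see precision, recall, and F1-score for each class is scikit-learn's built-in report:

#view precision, recall, and F1-score for both classes
print(metrics.classification_report(y_test, y_pred))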
We can also obtain the accuracy of the model, which tells us the percentage of correct predictions the model made:
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))l Accuracy: 0.962
This tells us that the model made the correct prediction for whether or not an individual would default 96.2% of the time. Keep in mind, however, that only a small fraction of individuals in the test set actually defaulted, so a model that never predicts a default would achieve a similar accuracy; this is why the confusion matrix and the ROC curve below are important complements to accuracy.
Lastly, we can plot the ROC (Receiver Operating Characteristic) curve, which displays the true positive rate against the false positive rate as the prediction probability cutoff is lowered from 1 to 0.
The higher the AUC (area under the curve), the better the model is at distinguishing between the two classes:
#define metrics
y_pred_proba = log_regression.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
#create ROC curve
plt.plot(fpr, tpr, label="AUC="+str(auc))
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc=4)
plt.show()
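Finally, recall from the introduction that the 0.5 probability cutoff is just a convention. If missing a defaulter is costlier than a false alarm, we can lower the cutoff using the predicted probabilities we already computed; the 0.25 value below is arbitrary and chosen purely for illustration:

#classify using a lower probability cutoff to flag more potential defaulters
y_pred_lower = (y_pred_proba >= 0.25).astype(int)
print(metrics.confusion_matrix(y_test, y_pred_lower))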