The statsmodels module in Python offers a variety of functions and classes that allow you to fit various statistical models.
The following step-by-step example shows how to perform logistic regression using functions from statsmodels.
Step 1: Create the Data
First, let’s create a pandas DataFrame that contains three variables:
- Hours Studied (Integer value)
- Study Method (Method A or B)
- Exam Result (Pass or Fail)
We’ll fit a logistic regression model using hours studied and study method to predict whether or not a student passes a given exam.
The following code shows how to create the pandas DataFrame:
import pandas as pd

#create DataFrame
df = pd.DataFrame({'result': [0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1],
                   'hours': [1, 2, 2, 2, 3, 2, 5, 4, 3, 6, 5, 8, 8, 7, 6, 7, 5, 4, 8, 9],
                   'method': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'A', 'B', 'A',
                              'B', 'B', 'A', 'A', 'B', 'A', 'B', 'A']})

#view first five rows of DataFrame
df.head()

   result  hours method
0       0      1      A
1       1      2      A
2       0      2      A
3       0      2      B
4       0      3      B
Step 2: Fit the Logistic Regression Model
Next, we’ll fit the logistic regression model using the logit() function:
import statsmodels.formula.api as smf
#fit logistic regression model
model = smf.logit('result ~ hours + method', data=df).fit()
#view model summary
print(model.summary())
Optimization terminated successfully.
Current function value: 0.557786
Iterations 5
Logit Regression Results
==============================================================================
Dep. Variable: result No. Observations: 20
Model: Logit Df Residuals: 17
Method: MLE Df Model: 2
Date: Mon, 22 Aug 2022 Pseudo R-squ.: 0.1894
Time: 09:53:35 Log-Likelihood: -11.156
converged: True LL-Null: -13.763
Covariance Type: nonrobust LLR p-value: 0.07375
===============================================================================
coef std err z P>|z| [0.025 0.975]
-------------------------------------------------------------------------------
Intercept -2.1569 1.416 -1.523 0.128 -4.932 0.618
method[T.B] 0.0875 1.051 0.083 0.934 -1.973 2.148
hours 0.4909 0.245 2.002 0.045 0.010 0.972
===============================================================================
The values in the coef column of the output tell us the average change in the log odds of passing the exam.
For example:
- Using study method B is associated with an average increase of .0875 in the log odds of passing the exam compared to using study method A.
- Each additional hour studied is associated with an average increase of .4909 in the log odds of passing the exam.
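Because log odds are hard to interpret directly, a common follow-up step is to exponentiate the coefficients to obtain odds ratios. This is not shown in the summary output above, but here is a sketch that recreates the tutorial's data and pulls the odds ratios from the fitted model's params attribute:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

#recreate the tutorial's data
df = pd.DataFrame({'result': [0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1],
                   'hours': [1, 2, 2, 2, 3, 2, 5, 4, 3, 6, 5, 8, 8, 7, 6, 7, 5, 4, 8, 9],
                   'method': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'A', 'B', 'A',
                              'B', 'B', 'A', 'A', 'B', 'A', 'B', 'A']})

#fit the model (disp=0 suppresses the optimization message)
model = smf.logit('result ~ hours + method', data=df).fit(disp=0)

#exponentiate the log-odds coefficients to get odds ratios
odds_ratios = np.exp(model.params)
print(odds_ratios)
```

For example, exponentiating the hours coefficient (.4909) gives an odds ratio of about 1.63, meaning each extra hour of studying multiplies the odds of passing by roughly 1.63.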
The values in the P>|z| column represent the p-values for each coefficient.
For example:
- Study method has a p-value of .934. Since this value is not less than .05, it means there is not a statistically significant relationship between study method and whether or not a student passes the exam.
- Hours studied has a p-value of .045. Since this value is less than .05, it means there is a statistically significant relationship between hours studied and whether or not a student passes the exam.
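Rather than reading p-values off the printed summary, you can also access them programmatically through the fitted model's pvalues attribute. A quick sketch, using the same data and model as above:

```python
import pandas as pd
import statsmodels.formula.api as smf

#recreate the tutorial's data
df = pd.DataFrame({'result': [0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1],
                   'hours': [1, 2, 2, 2, 3, 2, 5, 4, 3, 6, 5, 8, 8, 7, 6, 7, 5, 4, 8, 9],
                   'method': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'A', 'B', 'A',
                              'B', 'B', 'A', 'A', 'B', 'A', 'B', 'A']})

model = smf.logit('result ~ hours + method', data=df).fit(disp=0)

#p-values for each coefficient as a pandas Series
pvalues = model.pvalues
print(pvalues)

#keep only the predictors significant at the 5% level
significant = pvalues[pvalues < 0.05]
print(significant.index.tolist())
```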
Step 3: Evaluate Model Performance
To assess the quality of the logistic regression model, we can look at two metrics in the output:
1. Pseudo R-Squared
This value can be thought of as a substitute for the R-squared value of a linear regression model.
It is McFadden's pseudo R-squared, calculated as one minus the ratio of the maximized log-likelihood of the full model to that of the null model: 1 − (−11.156 / −13.763) ≈ .1894.
This value can range from 0 to 1, with higher values indicating a better model fit.
In this example, the pseudo R-squared value is .1894, which is quite low. This tells us that the predictor variables in the model don’t do a very good job of predicting the value of the response variable.
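The pseudo R-squared can be verified directly from the fitted model: statsmodels exposes the full-model and null-model log-likelihoods as llf and llnull, and the value itself as prsquared. A sketch using the same data and model as above:

```python
import pandas as pd
import statsmodels.formula.api as smf

#recreate the tutorial's data
df = pd.DataFrame({'result': [0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1],
                   'hours': [1, 2, 2, 2, 3, 2, 5, 4, 3, 6, 5, 8, 8, 7, 6, 7, 5, 4, 8, 9],
                   'method': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'A', 'B', 'A',
                              'B', 'B', 'A', 'A', 'B', 'A', 'B', 'A']})

model = smf.logit('result ~ hours + method', data=df).fit(disp=0)

#McFadden's pseudo R-squared: 1 - (log-likelihood of full model / log-likelihood of null model)
pseudo_r2 = 1 - model.llf / model.llnull
print(pseudo_r2)

#statsmodels stores the same value on the results object
print(model.prsquared)
```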
2. LLR p-value
This value can be thought of as a substitute for the p-value of the overall F-test of a linear regression model.
If this value is below a certain threshold (e.g. α = .05) then we can conclude that the model overall is “useful” and is better at predicting the values of the response variable compared to a model with no predictor variables.
In this example, the LLR p-value is .07375. Depending on the significance level we choose (e.g. .01, .05, .1) we may or may not conclude that the model as a whole is useful.
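The LLR p-value is also available on the results object as llr_pvalue, and once fitted, the model can produce predicted pass probabilities for new students with predict(). A sketch using the same data and model as above (the two new rows are made up for illustration):

```python
import pandas as pd
import statsmodels.formula.api as smf

#recreate the tutorial's data
df = pd.DataFrame({'result': [0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1],
                   'hours': [1, 2, 2, 2, 3, 2, 5, 4, 3, 6, 5, 8, 8, 7, 6, 7, 5, 4, 8, 9],
                   'method': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'A', 'B', 'A',
                              'B', 'B', 'A', 'A', 'B', 'A', 'B', 'A']})

model = smf.logit('result ~ hours + method', data=df).fit(disp=0)

#overall model significance (the LLR p-value from the summary)
print(model.llr_pvalue)

#predicted probability of passing for two hypothetical new students
new_students = pd.DataFrame({'hours': [3, 7], 'method': ['A', 'B']})
probs = model.predict(new_students)
print(probs)
```

As expected given the positive hours coefficient, the student who studied 7 hours receives a higher predicted probability of passing than the one who studied 3 hours.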
Additional Resources
The following tutorials explain how to perform other common tasks in Python:
How to Perform Linear Regression in Python
How to Perform Logarithmic Regression in Python
How to Perform Quantile Regression in Python