The statsmodels module in Python offers a variety of functions and classes that allow you to fit various statistical models.
The following step-by-step example shows how to perform logistic regression using functions from statsmodels.
Step 1: Create the Data
First, let’s create a pandas DataFrame that contains three variables:
- Hours Studied (Integer value)
- Study Method (Method A or B)
- Exam Result (Pass or Fail)
We’ll fit a logistic regression model using hours studied and study method to predict whether or not a student passes a given exam.
The following code shows how to create the pandas DataFrame:
import pandas as pd

#create DataFrame
df = pd.DataFrame({'result': [0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1],
                   'hours': [1, 2, 2, 2, 3, 2, 5, 4, 3, 6, 5, 8, 8, 7, 6, 7, 5, 4, 8, 9],
                   'method': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'A', 'B', 'A',
                              'B', 'B', 'A', 'A', 'B', 'A', 'B', 'A']})

#view first five rows of DataFrame
df.head()

   result  hours method
0       0      1      A
1       1      2      A
2       0      2      A
3       0      2      B
4       0      3      B
Step 2: Fit the Logistic Regression Model
Next, we’ll fit the logistic regression model using the logit() function:
import statsmodels.formula.api as smf
#fit logistic regression model
model = smf.logit('result ~ hours + method', data=df).fit()
#view model summary
print(model.summary())
Optimization terminated successfully.
Current function value: 0.557786
Iterations 5
Logit Regression Results
==============================================================================
Dep. Variable: result No. Observations: 20
Model: Logit Df Residuals: 17
Method: MLE Df Model: 2
Date: Mon, 22 Aug 2022 Pseudo R-squ.: 0.1894
Time: 09:53:35 Log-Likelihood: -11.156
converged: True LL-Null: -13.763
Covariance Type: nonrobust LLR p-value: 0.07375
===============================================================================
coef std err z P>|z| [0.025 0.975]
-------------------------------------------------------------------------------
Intercept -2.1569 1.416 -1.523 0.128 -4.932 0.618
method[T.B] 0.0875 1.051 0.083 0.934 -1.973 2.148
hours 0.4909 0.245 2.002 0.045 0.010 0.972
===============================================================================
The values in the coef column of the output tell us the average change in the log odds of passing the exam.
For example:
- Using study method B is associated with an average increase of .0875 in the log odds of passing the exam compared to using study method A.
- Each additional hour studied is associated with an average increase of .4909 in the log odds of passing the exam.
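Because log odds are hard to interpret directly, a common follow-up step is to exponentiate the coefficients to obtain odds ratios. This is not shown in the summary output above, but here is a sketch that recreates the tutorial's data and pulls the odds ratios from the fitted model's params attribute:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

#recreate the tutorial's data
df = pd.DataFrame({'result': [0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1],
                   'hours': [1, 2, 2, 2, 3, 2, 5, 4, 3, 6, 5, 8, 8, 7, 6, 7, 5, 4, 8, 9],
                   'method': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'A', 'B', 'A',
                              'B', 'B', 'A', 'A', 'B', 'A', 'B', 'A']})

#fit the model (disp=0 suppresses the optimization message)
model = smf.logit('result ~ hours + method', data=df).fit(disp=0)

#exponentiate the log-odds coefficients to get odds ratios
odds_ratios = np.exp(model.params)
print(odds_ratios)
```

For example, exponentiating the hours coefficient (.4909) gives an odds ratio of about 1.63, meaning each extra hour of studying multiplies the odds of passing by roughly 1.63.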
The values in the P>|z| column represent the p-values for each coefficient.
For example:
- Study method has a p-value of .934. Since this value is not less than .05, it means there is not a statistically significant relationship between study method and whether or not a student passes the exam.
- Hours studied has a p-value of .045. Since this value is less than .05, it means there is a statistically significant relationship between hours studied and whether or not a student passes the exam.
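Rather than reading p-values off the printed summary, you can also access them programmatically through the fitted model's pvalues attribute. A quick sketch, using the same data and model as above:

```python
import pandas as pd
import statsmodels.formula.api as smf

#recreate the tutorial's data
df = pd.DataFrame({'result': [0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1],
                   'hours': [1, 2, 2, 2, 3, 2, 5, 4, 3, 6, 5, 8, 8, 7, 6, 7, 5, 4, 8, 9],
                   'method': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'A', 'B', 'A',
                              'B', 'B', 'A', 'A', 'B', 'A', 'B', 'A']})

model = smf.logit('result ~ hours + method', data=df).fit(disp=0)

#p-values for each coefficient as a pandas Series
pvalues = model.pvalues
print(pvalues)

#keep only the predictors significant at the 5% level
significant = pvalues[pvalues < 0.05]
print(significant.index.tolist())
```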
Step 3: Evaluate Model Performance
To assess the quality of the logistic regression model, we can look at two metrics in the output:
1. Pseudo R-Squared
This value can be thought of as a substitute for the R-squared value of a linear regression model.
It is McFadden's pseudo R-squared, calculated as one minus the ratio of the maximized log-likelihood of the full model to that of the null model: 1 − (−11.156 / −13.763) ≈ .1894.
This value can range from 0 to 1, with higher values indicating a better model fit.
In this example, the pseudo R-squared value is .1894, which is quite low. This tells us that the predictor variables in the model don’t do a very good job of predicting the value of the response variable.
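The pseudo R-squared can be verified directly from the fitted model: statsmodels exposes the full-model and null-model log-likelihoods as llf and llnull, and the value itself as prsquared. A sketch using the same data and model as above:

```python
import pandas as pd
import statsmodels.formula.api as smf

#recreate the tutorial's data
df = pd.DataFrame({'result': [0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1],
                   'hours': [1, 2, 2, 2, 3, 2, 5, 4, 3, 6, 5, 8, 8, 7, 6, 7, 5, 4, 8, 9],
                   'method': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'A', 'B', 'A',
                              'B', 'B', 'A', 'A', 'B', 'A', 'B', 'A']})

model = smf.logit('result ~ hours + method', data=df).fit(disp=0)

#McFadden's pseudo R-squared: 1 - (log-likelihood of full model / log-likelihood of null model)
pseudo_r2 = 1 - model.llf / model.llnull
print(pseudo_r2)

#statsmodels stores the same value on the results object
print(model.prsquared)
```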
2. LLR p-value
This value can be thought of as a substitute for the p-value of the overall F-test of a linear regression model.
If this value is below a certain threshold (e.g. α = .05) then we can conclude that the model overall is “useful” and is better at predicting the values of the response variable compared to a model with no predictor variables.
In this example, the LLR p-value is .07375. Depending on the significance level we choose (e.g. .01, .05, .1) we may or may not conclude that the model as a whole is useful.
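The LLR p-value is also available on the results object as llr_pvalue, and once fitted, the model can produce predicted pass probabilities for new students with predict(). A sketch using the same data and model as above (the two new rows are made up for illustration):

```python
import pandas as pd
import statsmodels.formula.api as smf

#recreate the tutorial's data
df = pd.DataFrame({'result': [0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1],
                   'hours': [1, 2, 2, 2, 3, 2, 5, 4, 3, 6, 5, 8, 8, 7, 6, 7, 5, 4, 8, 9],
                   'method': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'A', 'B', 'A',
                              'B', 'B', 'A', 'A', 'B', 'A', 'B', 'A']})

model = smf.logit('result ~ hours + method', data=df).fit(disp=0)

#overall model significance (the LLR p-value from the summary)
print(model.llr_pvalue)

#predicted probability of passing for two hypothetical new students
new_students = pd.DataFrame({'hours': [3, 7], 'method': ['A', 'B']})
probs = model.predict(new_students)
print(probs)
```

As expected given the positive hours coefficient, the student who studied 7 hours receives a higher predicted probability of passing than the one who studied 3 hours.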
Additional Resources
The following tutorials explain how to perform other common tasks in Python:
How to Perform Linear Regression in Python
How to Perform Logarithmic Regression in Python
How to Perform Quantile Regression in Python