A five number summary is a way to summarize a dataset using the following five values:
- The minimum
- The first quartile
- The median
- The third quartile
- The maximum
The five number summary is useful because it concisely describes the distribution of the data in three ways:
- It tells us where the middle value is located, using the median.
- It tells us how spread out the middle 50% of the data is, using the first and third quartiles.
- It tells us the range of the data, using the minimum and the maximum.
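As a minimal sketch of the three points above, the five values can be computed one at a time and then combined into the range and the interquartile range (IQR). The `values` Series below is hypothetical data chosen only for illustration, and the variable names are assumptions rather than part of the original example.

```python
import pandas as pd

# Hypothetical values used only to illustrate the idea
values = pd.Series([1, 2, 3, 4, 5])

# The five values that make up the summary
minimum = values.min()
q1 = values.quantile(0.25)   # first quartile
median = values.median()     # middle value
q3 = values.quantile(0.75)   # third quartile
maximum = values.max()

# Two measures of spread derived from the summary
data_range = maximum - minimum  # full range of the data
iqr = q3 - q1                   # spread of the middle 50%

print(minimum, q1, median, q3, maximum)
print(data_range, iqr)
```

Note that pandas interpolates linearly between data points when computing quartiles, so other tools that use a different quartile convention may report slightly different values.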
The easiest way to calculate a five number summary for variables in a pandas DataFrame is to call the describe() method and keep only the relevant rows:
df.describe().loc[['min', '25%', '50%', '75%', 'max']]
The following example shows how to use this syntax in practice.
Example: Calculate Five Number Summary in Pandas DataFrame
Suppose we have the following pandas DataFrame that contains information about various basketball players:
import pandas as pd

#create DataFrame
df = pd.DataFrame({'team': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
                   'points': [18, 22, 19, 14, 14, 11, 20, 28],
                   'assists': [5, 7, 7, 9, 12, 9, 9, 4],
                   'rebounds': [11, 8, 10, 6, 6, 5, 9, 12]})

#view DataFrame
print(df)

  team  points  assists  rebounds
0    A      18        5        11
1    B      22        7         8
2    C      19        7        10
3    D      14        9         6
4    E      14       12         6
5    F      11        9         5
6    G      20        9         9
7    H      28        4        12
We can use the following syntax to calculate the five number summary for each numeric variable in the DataFrame. By default, describe() skips the non-numeric team column:
#calculate five number summary for each numeric variable
df.describe().loc[['min', '25%', '50%', '75%', 'max']]

     points  assists  rebounds
min    11.0      4.0      5.00
25%    14.0      6.5      6.00
50%    18.5      8.0      8.50
75%    20.5      9.0     10.25
max    28.0     12.0     12.00
Here’s how to interpret the output for the points variable:
- The minimum value is 11.
- The first quartile (25th percentile) is 14.
- The median (50th percentile) is 18.5.
- The third quartile (75th percentile) is 20.5.
- The maximum value is 28.
We can interpret the values for the assists and rebounds variables in a similar manner.
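To make the interpretation more concrete, the sketch below derives the interquartile range and the full range of each variable from the same summary table. The names `summary`, `iqr`, and `value_range` are introduced here only for illustration.

```python
import pandas as pd

# Same data as in the example above
df = pd.DataFrame({'team': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
                   'points': [18, 22, 19, 14, 14, 11, 20, 28],
                   'assists': [5, 7, 7, 9, 12, 9, 9, 4],
                   'rebounds': [11, 8, 10, 6, 6, 5, 9, 12]})

# Five number summary for each numeric variable
summary = df.describe().loc[['min', '25%', '50%', '75%', 'max']]

# Interquartile range: third quartile minus first quartile
iqr = summary.loc['75%'] - summary.loc['25%']

# Full range: maximum minus minimum
value_range = summary.loc['max'] - summary.loc['min']

print(iqr)
print(value_range)
```

For the points variable, this gives an interquartile range of 6.5 (20.5 - 14) and a range of 17 (28 - 11).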
If you’d only like to calculate the five number summary for one specific variable in the DataFrame, you can use the following syntax:
#calculate five number summary for the points variable
df['points'].describe().loc[['min', '25%', '50%', '75%', 'max']]

min    11.0
25%    14.0
50%    18.5
75%    20.5
max    28.0
Name: points, dtype: float64
The output now displays the five number summary only for the points variable.
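As an alternative sketch (not the approach used in this tutorial), the same five values can be requested directly with the quantile() method, since the minimum and maximum correspond to the 0 and 1 quantiles. The variable names and the use of select_dtypes() to drop the non-numeric team column are choices made here for illustration.

```python
import pandas as pd

# Same data as in the example above
df = pd.DataFrame({'team': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
                   'points': [18, 22, 19, 14, 14, 11, 20, 28],
                   'assists': [5, 7, 7, 9, 12, 9, 9, 4],
                   'rebounds': [11, 8, 10, 6, 6, 5, 9, 12]})

# Quantiles that correspond to min, Q1, median, Q3, and max
probs = [0, 0.25, 0.5, 0.75, 1]

# Five number summary for a single variable
points_summary = df['points'].quantile(probs)

# Five number summary for every numeric variable
all_summary = df.select_dtypes(include='number').quantile(probs)

print(points_summary)
print(all_summary)
```

The values match the describe() output, but the result is indexed by the quantile levels (0.00, 0.25, 0.50, 0.75, 1.00) rather than by labels such as min and 25%.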