You can use the following basic syntax to calculate the correlation between two variables by group in pandas:
df.groupby('group_var')[['values1','values2']].corr().unstack().iloc[:,1]
The following example shows how to use this syntax in practice.
Example: Calculate Correlation By Group in Pandas
Suppose we have the following pandas DataFrame:
import pandas as pd
#create DataFrame
df = pd.DataFrame({'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'points': [18, 22, 19, 14, 14, 11, 20, 28],
                   'assists': [2, 7, 9, 3, 12, 10, 14, 21]})
#view DataFrame
print(df)
We can use the following code to calculate the correlation between points and assists, grouped by team:
#calculate correlation between points and assists, grouped by team
df.groupby('team')[['points','assists']].corr().unstack().iloc[:,1]
team
A 0.603053
B 0.981798
Name: (points, assists), dtype: float64
From the output we can see:
- The correlation coefficient between points and assists for team A is 0.603053.
- The correlation coefficient between points and assists for team B is 0.981798.
Since both correlation coefficients are positive, points and assists are positively related on both teams.
That is, players who score more points also tend to record more assists, and this relationship is considerably stronger for team B than for team A.
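The .iloc[:,1] step works because of the column order that unstack() produces. As a minimal sketch that reuses the DataFrame from this example (the variable names wide and by_label are only illustrative), we can inspect the unstacked columns and select the same pair by label instead of by position:

```python
import pandas as pd

# same DataFrame as in the example above
df = pd.DataFrame({'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'points': [18, 22, 19, 14, 14, 11, 20, 28],
                   'assists': [2, 7, 9, 3, 12, 10, 14, 21]})

# one row per team; each column is a (variable, variable) pair
wide = df.groupby('team')[['points', 'assists']].corr().unstack()

# show which pair sits at each column position
print(wide.columns.tolist())

# select the (points, assists) pair by label rather than by position
by_label = wide[('points', 'assists')]
print(by_label)
```

Selecting by label should return the same values as .iloc[:,1] here, and it keeps working if more columns are added to the selection later.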
Related: What is Considered to Be a “Strong” Correlation?
Note that we could shorten the syntax by dropping the unstack() and iloc steps, but the output is harder to read:
df.groupby('team')[['points','assists']].corr()
                points   assists
team
A    points   1.000000  0.603053
     assists  0.603053  1.000000
B    points   1.000000  0.981798
     assists  0.981798  1.000000
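This syntax produces a full correlation matrix for each team, which is more information than we need: each matrix includes the 1.000000 values on the diagonal and lists the correlation between points and assists twice.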
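If only the single coefficient per group is needed, one alternative is to skip the correlation matrix entirely. The following is a sketch of that approach, not the only way to do it: it applies Series.corr() within each group, which uses the Pearson method by default. The variable name corr_by_team is only illustrative.

```python
import pandas as pd

# same DataFrame as in the example above
df = pd.DataFrame({'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'points': [18, 22, 19, 14, 14, 11, 20, 28],
                   'assists': [2, 7, 9, 3, 12, 10, 14, 21]})

# compute one coefficient per team directly, without building the full matrix
corr_by_team = (
    df.groupby('team')[['points', 'assists']]
      .apply(lambda g: g['points'].corr(g['assists']))
)

# view the result: a Series indexed by team
print(corr_by_team)
```

This should return one value per team that matches the results above, at the cost of a lambda that can be slower than the built-in corr() on large DataFrames.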
Additional Resources
The following tutorials explain how to perform other common operations in pandas:
How to Perform a GroupBy Sum in Pandas
How to Use Groupby and Plot in Pandas
How to Count Unique Values Using GroupBy in Pandas