How to Create a Correlation Matrix in Python

by Erma Khan

One way to quantify the relationship between two variables is to use the Pearson correlation coefficient, which is a measure of the linear association between two variables.

It takes on a value between -1 and 1 where:

  • -1 indicates a perfectly negative linear correlation.
  • 0 indicates no linear correlation.
  • 1 indicates a perfectly positive linear correlation.

The further the correlation coefficient is from zero, the stronger the linear relationship between the two variables.
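To make the definition concrete, here is a minimal sketch that computes the coefficient by hand as the covariance of the two variables divided by the product of their standard deviations. The helper name pearson_r and the sample values are hypothetical and are used only for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y, scaled by their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of x from its mean
    dy = y - y.mean()  # deviations of y from its mean
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

#hypothetical values, for illustration only
x = [1, 2, 3, 4, 5]

r_pos = pearson_r(x, [2, 4, 6, 8, 10])  # y rises in a straight line with x
r_neg = pearson_r(x, [10, 8, 6, 4, 2])  # y falls in a straight line with x
r_zero = pearson_r(x, [4, 1, 0, 1, 4])  # y = (x - 3)^2, a U-shaped relationship

print(r_pos, r_neg, r_zero)
```

The third case shows why the coefficient is described as a measure of linear association: y depends entirely on x, yet the coefficient is 0 because the relationship is not a straight line.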

But in some cases we want to understand the correlation between more than one pair of variables. In these cases, we can create a correlation matrix, which is a square table that shows the correlation coefficients between every pairwise combination of several variables.

This tutorial explains how to create and interpret a correlation matrix in Python.

How to Create a Correlation Matrix in Python

Use the following steps to create a correlation matrix in Python.

Step 1: Create the dataset.

import pandas as pd

#create dataset with three variables
data = {'assists': [4, 5, 5, 6, 7, 8, 8, 10],
        'rebounds': [12, 14, 13, 7, 8, 8, 9, 13],
        'points': [22, 24, 26, 26, 29, 32, 20, 14]}

#create DataFrame
df = pd.DataFrame(data, columns=['assists','rebounds','points'])

#view DataFrame
df

   assists  rebounds  points
0        4        12      22
1        5        14      24
2        5        13      26
3        6         7      26
4        7         8      29
5        8         8      32
6        8         9      20
7       10        13      14

Step 2: Create the correlation matrix.

#create correlation matrix (df.corr() uses the Pearson method by default)
df.corr()

                assists   rebounds     points
assists        1.000000  -0.244861  -0.329573
rebounds      -0.244861   1.000000  -0.522092
points        -0.329573  -0.522092   1.000000

#create same correlation matrix with coefficients rounded to 3 decimals
df.corr().round(3)

          assists  rebounds  points
assists     1.000    -0.245  -0.330
rebounds   -0.245     1.000  -0.522
points     -0.330    -0.522   1.000

Step 3: Interpret the correlation matrix.

The correlation coefficients along the diagonal of the table are all equal to 1 because each variable is perfectly correlated with itself.

All of the other correlation coefficients indicate the correlation between different pairwise combinations of variables. For example:

  • The correlation coefficient between assists and rebounds is -0.245.
  • The correlation coefficient between assists and points is -0.330.
  • The correlation coefficient between rebounds and points is -0.522.
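The sketch below shows one way to pull these values out of the matrix in code rather than reading them off the table. It rebuilds the same dataset so that it can run on its own. Because the matrix is symmetric, each pair only needs to be listed once. The variable names r_ab, r_direct, and pairs are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({'assists': [4, 5, 5, 6, 7, 8, 8, 10],
                   'rebounds': [12, 14, 13, 7, 8, 8, 9, 13],
                   'points': [22, 24, 26, 26, 29, 32, 20, 14]})
corr = df.corr()

#look up a single coefficient by row and column label
r_ab = corr.loc['assists', 'rebounds']

#compute the same coefficient directly from the two columns
r_direct = df['assists'].corr(df['rebounds'])

#list each unique pair once, sorted from strongest to weakest relationship
cols = list(corr.columns)
pairs = [(a, b, corr.loc[a, b]) for i, a in enumerate(cols) for b in cols[i + 1:]]
pairs.sort(key=lambda p: abs(p[2]), reverse=True)

for a, b, r in pairs:
    print(f'{a} vs {b}: {r:.3f}')
```

Sorting by absolute value puts the pair furthest from zero first, which here is rebounds and points.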

Step 4: Visualize the correlation matrix (optional).

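You can visualize the correlation matrix by using the styling options available in pandas. The styled table is displayed when the code is run in a Jupyter notebook, and background_gradient requires matplotlib to be installed: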

corr = df.corr()
corr.style.background_gradient(cmap='coolwarm')

[Figure: correlation matrix shaded with the coolwarm colormap]

You can also pass a different colormap name to the cmap argument to produce a correlation matrix with different colors.

corr = df.corr()
corr.style.background_gradient(cmap='RdYlGn')

[Figure: correlation matrix shaded with the RdYlGn colormap]

corr = df.corr()
corr.style.background_gradient(cmap='bwr')

[Figure: correlation matrix shaded with the bwr colormap]

corr = df.corr()
corr.style.background_gradient(cmap='PuOr')

[Figure: correlation matrix shaded with the PuOr colormap]

Note: For a complete list of cmap arguments, refer to the matplotlib documentation.