How to Perform Exploratory Data Analysis in Python

by Erma Khan

One of the first steps of any data analysis project is exploratory data analysis.

This involves exploring a dataset in three ways:

1. Summarizing a dataset using descriptive statistics.

2. Visualizing a dataset using charts.

3. Identifying missing values.

By performing these three actions, you can understand how the values in a dataset are distributed and detect any problematic values before you perform a hypothesis test or fit a statistical model.

The following step-by-step example shows how to perform exploratory data analysis for a dataset in Python.

Step 1: Create the Data

First, let’s create the following pandas DataFrame:

import pandas as pd
import numpy as np

#create DataFrame
df = pd.DataFrame({'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'points': [18, 22, 19, 14, 14, 11, 20, 28],
                   'assists': [5, 7, 7, 9, 12, 9, 9, 4],
                   'rebounds': [11, 8, 10, 6, 6, np.nan, 9, 12]})

We can take a look at the first five rows of the DataFrame by using the head() function:

#view first five rows of dataset
df.head()

	team	points	assists	rebounds
0	A	18	5	11.0
1	A	22	7	8.0
2	A	19	7	10.0
3	A	14	9	6.0
4	B	14	12	6.0

Step 2: Summarize the Data

We can use the describe() function to quickly summarize each numerical variable in the dataset:

#summarize numerical variables
df.describe()

          points   assists   rebounds
count   8.000000   8.00000   7.000000
mean   18.250000   7.75000   8.857143
std     5.365232   2.54951   2.340126
min    11.000000   4.00000   6.000000
25%    14.000000   6.50000   7.000000
50%    18.500000   8.00000   9.000000
75%    20.500000   9.00000  10.500000
max    28.000000  12.00000  12.000000

For each of the numeric variables we can see the following information:

  • count: Total number of non-missing values
  • mean: The mean value
  • std: The standard deviation
  • min: The minimum value
  • 25%: The value of the first quartile (25th percentile)
  • 50%: The median value (50th percentile)
  • 75%: The value of the third quartile (75th percentile)
  • max: The maximum value
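As a quick sanity check, each statistic reported by describe() can be reproduced with the corresponding Series method. The following is a minimal sketch for the points column; it recreates the DataFrame from Step 1 so that it runs on its own, and the name summary is just an illustrative variable, not part of pandas.

```python
import numpy as np
import pandas as pd

# same data as in Step 1
df = pd.DataFrame({'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'points': [18, 22, 19, 14, 14, 11, 20, 28],
                   'assists': [5, 7, 7, 9, 12, 9, 9, 4],
                   'rebounds': [11, 8, 10, 6, 6, np.nan, 9, 12]})

points = df['points']

# reproduce each row of describe() for one column
summary = {
    'count': points.count(),         # number of non-missing values
    'mean': points.mean(),
    'std': points.std(),             # sample standard deviation (ddof=1)
    'min': points.min(),
    '25%': points.quantile(0.25),
    '50%': points.median(),
    '75%': points.quantile(0.75),
    'max': points.max(),
}

for name, value in summary.items():
    print(name, value)
```

Note that count() skips missing values, which is why describe() reports a count of 7 rather than 8 for the rebounds column.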

For the categorical variables in the dataset, we can use value_counts to get a frequency count of each value:

#display frequency counts for team variable
df['team'].value_counts()

A    4
B    4
Name: team, dtype: int64

From the output we can see:

  • A: This value occurs 4 times.
  • B: This value occurs 4 times.
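If proportions are more useful than raw counts, value_counts() also accepts normalize=True. The sketch below uses only the team column from Step 1; the variable names are illustrative.

```python
import pandas as pd

# team column from Step 1
team = pd.Series(['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'], name='team')

# relative frequency of each team instead of a raw count
proportions = team.value_counts(normalize=True)
print(proportions)
```

Here each team accounts for half of the rows, so both proportions are 0.5.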

We can use the shape attribute to get the dimensions of the DataFrame as a tuple of the number of rows and the number of columns:

#display rows and columns
df.shape

(8, 4)

We can see that the DataFrame has 8 rows and 4 columns.

Step 3: Visualize the Data

We can also create charts to visualize the values in the dataset.

For example, we can use the pandas hist() function to create a histogram of the values for each numerical variable:

#create histogram for each numerical variable
df.hist(grid=False, edgecolor='black')

The x-axis of each histogram shows the range of values for each variable, divided into bins, and the y-axis shows how many values fall into each bin.
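To see the numbers behind a histogram, we can compute the bin edges and counts directly with NumPy. This is a rough sketch for the points column, assuming 10 equal-width bins (the default value of the bins argument in the pandas hist() function); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# points column from Step 1
points = pd.Series([18, 22, 19, 14, 14, 11, 20, 28], name='points')

# split the range of the data into 10 equal-width bins and count the values in each
counts, edges = np.histogram(points, bins=10)

# print each bin as an interval with its frequency
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f'{left:.1f} to {right:.1f}: {count}')
```

Each printed line corresponds to one bar: the interval is the bar's position on the x-axis and the count is its height on the y-axis.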

We can also use the pandas boxplot() function to create a boxplot for each numerical variable:

#create boxplot for each numerical variable
df.boxplot(grid=False)

The pandas boxplot() function also accepts a by argument, which creates a boxplot of one variable grouped by another variable (for example, points grouped by team).
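A minimal sketch of a grouped boxplot in pandas, using the column and by arguments of boxplot() to draw one box of points per team. The Agg backend line is an assumption so that the example also runs without a display; it can be omitted in a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# same data as in Step 1
df = pd.DataFrame({'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'points': [18, 22, 19, 14, 14, 11, 20, 28],
                   'assists': [5, 7, 7, 9, 12, 9, 9, 4],
                   'rebounds': [11, 8, 10, 6, 6, np.nan, 9, 12]})

# create boxplot of points grouped by team
df.boxplot(column='points', by='team', grid=False)

# replace the automatic super-title with a single plain title
plt.suptitle('')
plt.title('points by team')
```

The resulting chart places one box for team A and one for team B side by side, which makes it easy to compare the two distributions.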

We can also use the pandas corr() function to create a correlation matrix to view the correlation coefficient between each pairwise combination of numeric variables in the DataFrame:

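#create correlation matrix (numeric_only=True skips the non-numeric team column,
#which is required in pandas 2.0 and later)
df.corr(numeric_only=True)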

          points	  assists	 rebounds
points	 1.000000	-0.725841	 0.767007
assists	-0.725841	 1.000000	-0.882046
rebounds 0.767007	-0.882046	 1.000000
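To pull out a single coefficient rather than the full matrix, the Series corr() method can be applied to one pair of columns. The sketch below recreates the DataFrame from Step 1 so that it runs on its own; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# same data as in Step 1
df = pd.DataFrame({'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'points': [18, 22, 19, 14, 14, 11, 20, 28],
                   'assists': [5, 7, 7, 9, 12, 9, 9, 4],
                   'rebounds': [11, 8, 10, 6, 6, np.nan, 9, 12]})

# Pearson correlation between points and assists
r_points_assists = df['points'].corr(df['assists'])

# rows where either value is missing are excluded from the calculation
r_points_rebounds = df['points'].corr(df['rebounds'])

print(round(r_points_assists, 6))
print(round(r_points_rebounds, 6))
```

These values should match the corresponding entries in the correlation matrix above.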

Related: What is Considered to Be a “Strong” Correlation?

Step 4: Identify Missing Values

We can use the following code to count the total number of missing values in each column of the DataFrame:

#count total missing values in each column
df.isnull().sum()

team        0
points      0
assists     0
rebounds    1
dtype: int64

From the output we can see that there is only one missing value in the rebounds column.

All other columns have no missing values.
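Once we know which column contains missing values, it is often useful to locate the affected rows and to express the missing values as a percentage of each column. The following is a minimal sketch that recreates the DataFrame from Step 1; the names missing_rows and pct_missing are illustrative.

```python
import numpy as np
import pandas as pd

# same data as in Step 1
df = pd.DataFrame({'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'points': [18, 22, 19, 14, 14, 11, 20, 28],
                   'assists': [5, 7, 7, 9, 12, 9, 9, 4],
                   'rebounds': [11, 8, 10, 6, 6, np.nan, 9, 12]})

# view the rows where rebounds is missing
missing_rows = df[df['rebounds'].isnull()]
print(missing_rows)

# percentage of missing values in each column
pct_missing = df.isnull().mean() * 100
print(pct_missing)
```

In this dataset the only missing value is in the row with index 5, which is 12.5% of the values in the rebounds column.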

We have now completed a basic exploratory data analysis on this dataset and have a good understanding of how the values are distributed for each variable in this dataset.

Related: How to Impute Missing Values in Pandas

Additional Resources

The following tutorials explain how to perform other common tasks in Python:

How to Create Frequency Tables in Python
How to Create Boxplot from Pandas DataFrame
How to Create a Histogram from Pandas DataFrame
