You can use the following syntax to calculate a difference between two dates in a pandas DataFrame:
df['diff_days'] = (df['end_date'] - df['start_date']) / np.timedelta64(1, 'D')
This particular example calculates the difference between the dates in the end_date and start_date columns in terms of days.
Note that we can replace the ‘D’ in the timedelta64() function with the following values to calculate the date difference in different units:
- W: Weeks
- M: Months
- Y: Years
The following examples show how to calculate a date difference in a pandas DataFrame in practice.
Example 1: Calculate Difference Between Two Dates with Datetime Columns
Suppose we have the following pandas DataFrame:
import pandas as pd #create DataFrame df = pd.DataFrame({'start_date': pd.date_range(start='1/5/2020', periods=6, freq='W'), 'end_date': pd.date_range(start='6/1/2020', periods=6, freq='M')}) #view DataFrame print(df) start_date end_date 0 2020-01-05 2020-06-30 1 2020-01-12 2020-07-31 2 2020-01-19 2020-08-31 3 2020-01-26 2020-09-30 4 2020-02-02 2020-10-31 5 2020-02-09 2020-11-30 #view dtype of each column in DataFrame df.dtypes start_date datetime64[ns] end_date datetime64[ns] dtype: object
Since both columns in the DataFrame already have a dtype of datetime64, we can use the following syntax to calculate the difference between the start and end dates:
import numpy as np
#create new columns that contains date differences
df['diff_days'] = (df['end_date'] - df['start_date']) / np.timedelta64(1, 'D')
df['diff_weeks'] = (df['end_date'] - df['start_date']) / np.timedelta64(1, 'W')
df['diff_months'] = (df['end_date'] - df['start_date']) / np.timedelta64(1, 'M')
df['diff_years'] = (df['end_date'] - df['start_date']) / np.timedelta64(1, 'Y')
#view updated DataFrame
print(df)
start_date end_date diff_days diff_weeks diff_months diff_years
0 2020-01-05 2020-06-30 177.0 25.285714 5.815314 0.484610
1 2020-01-12 2020-07-31 201.0 28.714286 6.603832 0.550319
2 2020-01-19 2020-08-31 225.0 32.142857 7.392349 0.616029
3 2020-01-26 2020-09-30 248.0 35.428571 8.148011 0.679001
4 2020-02-02 2020-10-31 272.0 38.857143 8.936528 0.744711
5 2020-02-09 2020-11-30 295.0 42.142857 9.692191 0.807683
The new columns contain the date differences between the start and end dates in terms of days, weeks, months, and years.
Example 2: Calculate Difference Between Two Dates with String Columns
Suppose we have the following pandas DataFrame:
import pandas as pd #create DataFrame df = pd.DataFrame({'start_date': ['2020-01-05', '2020-01-12', '2020-01-19'], 'end_date': ['2020-06-30', '2020-07-31', '2020-08-31']}) #view dtype of each column print(df.dtypes) start_date object end_date object dtype: object
Since neither column in the DataFrame has a dtype of datetime64, we will receive an error if we attempt to calculate the difference between the dates:
import numpy as np
#attempt to calculate date difference
df['diff_days'] = (df['end_date'] - df['start_date']) / np.timedelta64(1, 'D')
TypeError: unsupported operand type(s) for -: 'str' and 'str'
We must first use pd.to_datetime to convert each column to a datetime format before calculating the difference between the dates:
import numpy as np
#convert columns to datetime
df[['start_date','end_date']] = df[['start_date','end_date']].apply(pd.to_datetime)
#calculate difference between dates
df['diff_days'] = (df['end_date'] - df['start_date']) / np.timedelta64(1, 'D')
#view updated DataFrame
print(df)
start_date end_date diff_days
0 2020-01-05 2020-06-30 177.0
1 2020-01-12 2020-07-31 201.0
2 2020-01-19 2020-08-31 225.0
Since we first converted each column to a datetime format, we were able to successfully calculate the difference between the dates without any errors.
Additional Resources
The following tutorials explain how to perform other common operations in pandas:
How to Create a Date Range in Pandas
How to Extract Month from Date in Pandas
How to Convert Timestamp to Datetime in Pandas