From an existing Pandas DataFrame, how can I create a summary DataFrame based on the union of overlapping date ranges (given a start and an end date) and an additional column?
Aaron
aaron.christensen at gmail.com
Tue Jun 2 13:47:34 EDT 2020
Hello,
Given a dateframe with trips made by employees of different companies, I am
trying to generate a new dataframe with only the company names. I am
looking to combine the overlapping travel times from employees of the SAME
company into a single row. If there are no overlapping travel times, then
that row just transfers over as-is. When there are overlapping travel
times, then the following will happen:
--The name field is removed b/c that is no longer relevant (company name
stays), the Depart date will be the earliest date of any of the trip dates
regardless of the employee, the Return date will be the latest date of any
of the trip dates regardless of the employee, the charges for the trip will
be summed
For example, if trips had dates 01/01/20 - 01/31/20, 01/15/20 - 02/15/20,
02/01-20 - 02/28/20, then all three would be combined. The starting date
will be 1/1/20 and ending as of 2/28/20. Basically, the company was on
that trip from start to finish… kinda like a relay run handing off the
baton. Also, the charges will be summed for each of those trips and
transferred over to the single row.
Here is the starting dataframe code/output (note: the row order is
typically not already sorted by company name as in this example):
import pandas as pd
emp_trips = {'Name': ['Bob','Joe','Sue','Jack', 'Henry', 'Frank',
'Lee', 'Jack'],
'Company': ['ABC', 'ABC', 'ABC', 'HIJ', 'HIJ', 'DEF', 'DEF', 'DEF'],
'Depart' : ['01/01/2020', '01/01/2020', '01/06/2020',
'01/01/2020', '05/01/2020', '01/13/2020', '01/12/2020', '01/14/2020'],
'Return' : ['01/31/2020', '02/15/2020', '02/20/2020',
'03/01/2020', '05/05/2020', '01/15/2020', '01/30/2020', '02/02/2020'],
'Charges': [10.10, 20.25, 30.32, 40.00, 50.01, 60.32, 70.99, 80.87]
}
df = pd.DataFrame(emp_trips, columns = ['Name', 'Company', 'Depart',
'Return', 'Charges'])
# Convert to date format
df['Return']= pd.to_datetime(df['Return'])
df['Depart']= pd.to_datetime(df['Depart'])
Name Company Depart Return Charges0 Bob ABC
2020-01-01 2020-01-31 10.101 Joe ABC 2020-01-01 2020-02-15
20.252 Sue ABC 2020-01-06 2020-02-20 30.323 Jack HIJ
2020-01-01 2020-03-01 40.004 Henry HIJ 2020-05-01 2020-05-05
50.015 Frank DEF 2020-01-13 2020-01-15 60.326 Lee DEF
2020-01-12 2020-01-30 70.997 Jack DEF 2020-01-14 2020-02-02
80.87
And, here is the desired/generated dataframe:
Company Depart Return Charges0 ABC 01/01/2020
02/20/2020 60.671 HIJ 01/01/2020 03/01/2020 40.002 HIJ
05/01/2020 05/05/2020 50.013 DEF 01/12/2020 02/02/2020
212.18
I have been trying to use a combination of sorting and grouping but
the best I've achieved is reordering the dataframe. Even though I am
able to sort/group based on values, I still run into the issues of
finding overlapping date ranges and pulling out all trips based on a
single company per aggregate/overlapping date range.
Thank you in advance for any help!
Aaron
More information about the Python-list
mailing list