From an existing Pandas DataFrame, how can I create a summary DataFrame based on the union of overlapping date ranges (given a start and an end date) and an additional column?
joseph pareti
joepareti54 at gmail.com
Tue Jun 9 05:44:46 EDT 2020
i gave it a shot, see attached
Am Mi., 3. Juni 2020 um 23:38 Uhr schrieb Aaron <aaron.christensen at gmail.com
>:
> Hello,
>
> Given a dateframe with trips made by employees of different companies, I am
> trying to generate a new dataframe with only the company names. I am
> looking to combine the overlapping travel times from employees of the SAME
> company into a single row. If there are no overlapping travel times, then
> that row just transfers over as-is. When there are overlapping travel
> times, then the following will happen:
>
> --The name field is removed b/c that is no longer relevant (company name
> stays), the Depart date will be the earliest date of any of the trip dates
> regardless of the employee, the Return date will be the latest date of any
> of the trip dates regardless of the employee, the charges for the trip will
> be summed
>
> For example, if trips had dates 01/01/20 - 01/31/20, 01/15/20 - 02/15/20,
> 02/01-20 - 02/28/20, then all three would be combined. The starting date
> will be 1/1/20 and ending as of 2/28/20. Basically, the company was on
> that trip from start to finish… kinda like a relay run handing off the
> baton. Also, the charges will be summed for each of those trips and
> transferred over to the single row.
>
> Here is the starting dataframe code/output (note: the row order is
> typically not already sorted by company name as in this example):
>
> import pandas as pd
>
>
> emp_trips = {'Name': ['Bob','Joe','Sue','Jack', 'Henry', 'Frank',
> 'Lee', 'Jack'],
> 'Company': ['ABC', 'ABC', 'ABC', 'HIJ', 'HIJ', 'DEF', 'DEF',
> 'DEF'],
> 'Depart' : ['01/01/2020', '01/01/2020', '01/06/2020',
> '01/01/2020', '05/01/2020', '01/13/2020', '01/12/2020', '01/14/2020'],
> 'Return' : ['01/31/2020', '02/15/2020', '02/20/2020',
> '03/01/2020', '05/05/2020', '01/15/2020', '01/30/2020', '02/02/2020'],
> 'Charges': [10.10, 20.25, 30.32, 40.00, 50.01, 60.32, 70.99, 80.87]
> }
>
> df = pd.DataFrame(emp_trips, columns = ['Name', 'Company', 'Depart',
> 'Return', 'Charges'])
> # Convert to date format
> df['Return']= pd.to_datetime(df['Return'])
> df['Depart']= pd.to_datetime(df['Depart'])
>
> Name Company Depart Return Charges0 Bob ABC
> 2020-01-01 2020-01-31 10.101 Joe ABC 2020-01-01 2020-02-15
> 20.252 Sue ABC 2020-01-06 2020-02-20 30.323 Jack HIJ
> 2020-01-01 2020-03-01 40.004 Henry HIJ 2020-05-01 2020-05-05
> 50.015 Frank DEF 2020-01-13 2020-01-15 60.326 Lee DEF
> 2020-01-12 2020-01-30 70.997 Jack DEF 2020-01-14 2020-02-02
> 80.87
>
> And, here is the desired/generated dataframe:
>
> Company Depart Return Charges0 ABC 01/01/2020
> 02/20/2020 60.671 HIJ 01/01/2020 03/01/2020 40.002 HIJ
> 05/01/2020 05/05/2020 50.013 DEF 01/12/2020 02/02/2020
> 212.18
>
> I have been trying to use a combination of sorting and grouping but
> the best I've achieved is reordering the dataframe. Even though I am
> able to sort/group based on values, I still run into the issues of
> finding overlapping date ranges and pulling out all trips based on a
> single company per aggregate/overlapping date range.
>
> Thank you in advance for any help!
>
> Aaron
> --
> https://mail.python.org/mailman/listinfo/python-list
>
--
Regards,
Joseph Pareti - Artificial Intelligence consultant
Joseph Pareti's AI Consulting Services
https://www.joepareti54-ai.com/
cell +49 1520 1600 209
cell +39 339 797 0644
More information about the Python-list
mailing list