Efficient counting of results

Thomas Jollans tjol at tjol.eu
Thu Oct 19 15:36:28 EDT 2017


On 19/10/17 20:04, Israel Brewster wrote:
>> then loop through the records, find the schedule for that record (if any, if not move on as mentioned earlier), compare t1 and t2 against the schedule, and increment the appropriate bin counts using a bunch of if statements. Functional, if ugly. But then I got to thinking: I keep hearing about all these fancy numerical analysis tools for python like pandas and numpy - could something like that help? Might there be a way to simply set up a table with "rules" for the columns and rows, and drop my records into the table, having them automatically counted into the proper bins or something? Or am I over thinking this, and the "simple", if ugly approach is best?

The numerical packages can do a lot of things; in a case like this, it
mostly comes down to hiding the loops in fast C code. You'll have to be
the judge of whether that helps. I'll just try to give you an idea of
what you might do.

If you had two pandas/numpy series/arrays of numpy datetimes called
"scheduled" and "actual", you could get all the delays

import pandas as pd
delays = pd.Series(actual - scheduled)

then index those by the actual time

delays.index = actual  # sort_index() afterwards if not in time order

and select bits of it as you please, e.g.

from pandas.tseries.offsets import Week
today = pd.to_datetime('today')
wtd_delays = delays[today - Week():]

You can construct boolean mask arrays for certain conditions

minute_delta = pd.Timedelta(minutes=1)
wtd_is_between_1_and_5_min = ((wtd_delays >= 1*minute_delta) &
                              (wtd_delays < 5*minute_delta))

and either get all the affected datapoints

wtd_delays[wtd_is_between_1_and_5_min]

or count them

import numpy as np
np.count_nonzero(wtd_is_between_1_and_5_min)

If you have a larger table in a pandas DataFrame with more information,
and you want to get those rows which fit a particular requirement, you
can do that too. Something like

some_table = dataframe_conjured_out_of_thin_air()
# assumes some_table has a DatetimeIndex, so rows can be sliced by date
delay = some_table['actual_time'] - some_table['proper_civilised_time']
mtd = today - pd.DateOffset(months=1)  # one month back from today

data_mtd_5min_or_later = some_table[mtd:][delay[mtd:]
                                          >= 5*minute_delta]

Or something like that.
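
Coming back to your original binning question: pandas can also do the
"drop the records in and count them" step for you. Here's a minimal
sketch with made-up bin edges and made-up data; pd.cut assigns each
delay to a labelled bin, and value_counts() tallies the bins.

import pandas as pd

# hypothetical example data: scheduled vs. actual timestamps
scheduled = pd.to_datetime(['2017-10-19 08:00', '2017-10-19 09:00',
                            '2017-10-19 10:00'])
actual = pd.to_datetime(['2017-10-19 08:02', '2017-10-19 09:07',
                         '2017-10-19 10:01'])
delays = pd.Series(actual - scheduled, index=actual)

# express delays in minutes and cut them into labelled bins
delay_min = delays.dt.total_seconds() / 60
bins = pd.cut(delay_min, bins=[0, 1, 5, 15, 60], right=False,
              labels=['<1 min', '1-5 min', '5-15 min', '15-60 min'])

# counts per bin, in bin order
print(bins.value_counts().sort_index())

If you want a two-dimensional table (say, delay bin by day of week),
pd.crosstab(delays.index.dayofweek, bins) does the counting in one call.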

If you do a lot of this kind of stuff (sifting through largish
datasets), learning how to use pandas might be an excellent idea, but it
will of course involve a fair amount of scrolling through documentation,
googling, and reading stack overflow posts.

Some pointers:
http://pandas.pydata.org/pandas-docs/stable/
http://pandas.pydata.org/pandas-docs/stable/timeseries.html
https://docs.scipy.org/doc/numpy-1.13.0/user/basics.indexing.html#boolean-or-mask-index-arrays


> 
> I suppose I should mention: my data source is the results of a psycopg2 query, so a "record" is a tuple or dictionary (depending on how I want to set up the cursor)

In this case, if you want speed, you're far better off doing most of the
work in SQL rather than Python! If you want clarity, maybe what you're
doing now is already good enough. Or maybe using more complex SQL would
actually be clearer.
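
For example — a sketch only, with made-up table and column names — the
whole binning can be pushed into a single query, so only the per-bin
counts ever leave the database:

import psycopg2

conn = psycopg2.connect('dbname=flights')  # hypothetical database
cur = conn.cursor()
cur.execute("""
    SELECT CASE WHEN actual - scheduled < interval '1 minute'
                     THEN '<1 min'
                WHEN actual - scheduled < interval '5 minutes'
                     THEN '1-5 min'
                WHEN actual - scheduled < interval '15 minutes'
                     THEN '5-15 min'
                ELSE '15+ min'
           END AS bin,
           count(*)
      FROM arrivals  -- hypothetical table with timestamp columns
     GROUP BY bin
""")
for bin_label, n in cur.fetchall():
    print(bin_label, n)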

Pandas sits somewhere in between. If your data is already in a
database, IMHO it only adds significant value when you want to do some
form of statistical analysis or other more complex computation, or
when you want to create plots.
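
If you do take the middle road, pd.read_sql_query pulls a query result
straight into a DataFrame — a sketch, reusing the hypothetical
connection and table from above:

import pandas as pd

df = pd.read_sql_query(
    'SELECT scheduled, actual FROM arrivals', conn,
    parse_dates=['scheduled', 'actual'])
delays = df['actual'] - df['scheduled']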


-- Thomas


