speed up pandas calculation

Wed Jul 30 20:57:59 EDT 2014

On Wed, 30 Jul 2014 17:04:04 -0600, Vincent Davis wrote:

> I know this is a general Python list and I am asking about pandas, but
> this question is probably not a good fit for Stack Overflow. I have
> a list of files (~80 files, ~30,000 rows) that I need to process. With
> my current code it is taking minutes for each file.

Hmmm, is that 30,000 rows per file, or 30,000 rows in total?

Not that it really matters; I wouldn't expect it to make much 
difference either way.

> Any suggestions for a faster
> way? I am trying to stick with pandas for educational purposes. Any
> suggestions would be great. If you are curious, you can find the data
> file I am using here:
> http://www.nber.org/nhamcs/data/nhamcsopd2010.csv

For brevity, I've trimmed back the dictionaries to something smaller. 
That's just for ease of reading.

> drugs_current = {'CITALOPRAM': 4332,
>          'ESCITALOPRAM': 4812,
           [...]
>          'CLONIDINE': 44,
>          'GUANFACINE': 717}
> 
> drugs_98_05 = { 'SERTRALINE': 56635,
>                 'CITALOPRAM': 59829,
                  [...]
>                 'CLONIDINE2': 70357,
>                 'GUANFACINE': 52498
>                }
> 
> df = pd.read_csv('nhamcsopd2010.csv' , index_col='PATCODE',
>                  low_memory=False)
> col_init = list(df.columns.values)
> keep_col = ['PATCODE', 'PATWT', 'VDAY', 'VMONTH', 'VYEAR', 'MED1',
>             'MED2', 'MED3', 'MED4', 'MED5']
> for col in col_init:
>     if col not in keep_col:
>         del df[col]

I expect that this could be your problem here. Deleting from the start or 
middle of lists is slow, and df may be a list or at least have list-like 
performance. Suppose you have a list like this:

['a', 'b', 'c', 'd', 'e', ..., 'zzz']

that is, a total of 26 + 26**2 + 26**3 = 18278 items. Now suppose you 
delete item 0, 'a':

=> ['b', 'c', 'd', 'e', ..., 'zzz']

Python has to move the remaining 18277 items across one space. Then you 
delete 'b':

=> ['c', 'd', 'e', ..., 'zzz']

Python has to move the remaining 18276 items across one space, making a 
total of 36553 moves. And that's just to delete two items. Roughly 
speaking, if you delete N items from the front of a list of M items, 
Python may have to make about N*M moves, which approaches N**2 when you 
delete most of the list. If you have 5 or 10 columns, that's not too 
bad, but if you have (say) 80 columns, and delete 70 of them, that could 
be *horribly* expensive.
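
The arithmetic can be checked with a small counting model. This is only a 
sketch: it assumes each delete from the front shifts every remaining item 
by exactly one slot, which is how a CPython list behaves, and the helper 
names make_names and moves_for_front_deletes are made up for this 
illustration.

```python
from itertools import product
from string import ascii_lowercase


def make_names():
    """Return ['a', ..., 'z', 'aa', ..., 'zzz'] in order."""
    return ["".join(chars)
            for length in (1, 2, 3)
            for chars in product(ascii_lowercase, repeat=length)]


def moves_for_front_deletes(size, n_deletes):
    """Count items shifted when deleting n_deletes items from the front."""
    moves = 0
    for _ in range(n_deletes):
        size -= 1       # one item is removed...
        moves += size   # ...and every remaining item shifts one slot
    return moves


names = make_names()
total_items = len(names)                       # 26 + 26**2 + 26**3
two_deletes = moves_for_front_deletes(total_items, 2)
print(total_items, two_deletes)
```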

If you must *repeatedly* use del on lists, it's best to ensure you're 
deleting from the end, not the start. But even better, and this applies 
to anything not just lists, is not to delete at all, but to create a new 
list, copying the columns you want, rather than deleting the columns you 
don't want.
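
Here is a plain-list sketch of the two approaches side by side. It makes 
no claim about what pandas does internally; the column list is a made-up 
stand-in (AGE, SEX and RACE are hypothetical unwanted columns) and the 
function names are invented for the example.

```python
def drop_with_del(items, keep):
    """Delete unwanted items one at a time (the slow pattern)."""
    items = list(items)        # work on a copy so the caller's list survives
    i = 0
    while i < len(items):
        if items[i] in keep:
            i += 1
        else:
            del items[i]       # shifts everything after position i
    return items


def keep_by_copying(items, keep):
    """Build a new list containing only the wanted items."""
    return [item for item in items if item in keep]


# Hypothetical column list; only some of these are wanted.
columns = ['AGE', 'SEX', 'PATWT', 'VDAY', 'RACE', 'VMONTH', 'VYEAR', 'MED1']
keep = {'PATCODE', 'PATWT', 'VDAY', 'VMONTH', 'VYEAR', 'MED1',
        'MED2', 'MED3', 'MED4', 'MED5'}

by_del = drop_with_del(columns, keep)
by_copy = keep_by_copying(columns, keep)
```

Both give the same result, but the second makes a single pass and never 
shifts anything.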

I'm not familiar with pandas and am not sure about the exact syntax 
needed, but something like:

# Iterating over a DataFrame yields its column labels.
wanted = [col for col in df if col in keep_col]
new_df = df[wanted]  # a new frame holding only the wanted columns
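
Another option, if I'm reading the pandas documentation correctly, is to 
never load the unwanted columns in the first place: read_csv accepts a 
usecols argument. The sketch below uses a tiny made-up CSV in place of 
the real file, and AGE and SEX are hypothetical extra columns, so treat 
it as an illustration rather than a tested fix for your data.

```python
import io

import pandas as pd

# Made-up stand-in for nhamcsopd2010.csv; AGE and SEX are hypothetical
# extra columns that we do not want to keep.
csv_text = """PATCODE,AGE,SEX,PATWT,VDAY,VMONTH,VYEAR,MED1,MED2,MED3,MED4,MED5
1,34,1,120.5,3,7,2010,4332,0,0,0,0
2,51,2,98.0,5,7,2010,44,717,0,0,0
"""

keep_col = ['PATCODE', 'PATWT', 'VDAY', 'VMONTH', 'VYEAR', 'MED1',
            'MED2', 'MED3', 'MED4', 'MED5']

# Only the listed columns are kept; PATCODE must be among them because
# it is used as the index.
df = pd.read_csv(io.StringIO(csv_text), usecols=keep_col,
                 index_col='PATCODE')
print(list(df.columns))
```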

> if f[-3:] == 'csv' and f[-6:-4] in ('93', '94', '95', '96', '97', '98',
> '99', '00', '91', '02', '03', '04', '05'):

Where does f come from? You haven't shown the definition of that.

-- 
Steven