speed up pandas calculation
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Wed Jul 30 20:57:59 EDT 2014
On Wed, 30 Jul 2014 17:04:04 -0600, Vincent Davis wrote:
> I know this is a general python list and I am asking about pandas but
> this question is probably not great for asking on stackoverflow. I have
> a list of files (~80 files, ~30,000 rows) I need to process with my
> current code it is take minutes for each file.
Hmmm, is that 30,000 rows per file, or 30,000 files in total?
Not that it really matters, I shouldn't expect that it makes that much
difference either way.
> Any suggestions of a fast
> way. I am try to stick with pandas for educational purposes. Any
> suggestions would be great. If you are curious the can find the data
> file I am using below here.
> http://www.nber.org/nhamcs/data/nhamcsopd2010.csv
For brevity, I've trimmed back the dictionaries to something smaller.
That's just for ease of reading.
> drugs_current = {'CITALOPRAM': 4332,
> 'ESCITALOPRAM': 4812,
[...]
> 'CLONIDINE': 44,
> 'GUANFACINE': 717}
>
> drugs_98_05 = { 'SERTRALINE': 56635,
> 'CITALOPRAM': 59829,
[...]
> 'CLONIDINE2': 70357,
> 'GUANFACINE': 52498
> }
>
> df = pd.read_csv('nhamcsopd2010.csv' , index_col='PATCODE',
> low_memory=False)
> col_init = list(df.columns.values)
> keep_col = ['PATCODE', 'PATWT', 'VDAY', 'VMONTH', 'VYEAR', 'MED1',
> 'MED2', 'MED3', 'MED4', 'MED5']
> for col in col_init:
> if col not in keep_col:
> del df[col]
I expect that this could be your problem here. Deleting from the start or
middle of lists is slow, and df may be a list or at least have list-like
performance. Suppose you have a list like this:
['a', 'b', 'c', 'd', 'e', ..., 'zzz']
that is, a total of 26 + 26**2 + 26**3 = 18278 items. Now suppose you
delete item 0, 'a':
=> ['b', 'c', 'd', 'e', ..., 'zzz']
Python has to move the remaining 18278 items across one space. Then you
delete 'b':
=> ['c', 'd', 'e', ..., 'zzz']
Python has to move the remaining 18276 items across one space, making a
total of 36559 moves. And that's just to delete two items. Roughly
speaking, if you end up deleting N items from a list starting from the
front, Python may have to move as many as N**2 items into their final
positions. If you have 5 or 10 columns, that's not too bad, but if you
have (say) 80 columns, and delete 70 of them, that could be *horribly*
expensive.
If you must *repeatedly* use del on lists, it's best to ensure you're
deleting from the end, not the start. But even better, and this applies
to anything not just lists, is not to delete at all, but to create a new
list, copying the columns you want, rather than deleting the columns you
don't want.
I'm not familiar with pandas and am not sure about the exact syntax
needed, but something like:
new_df = [] # Assuming df is a list.
for col in df:
if col.value in keep_col:
new_df.append(col)
> if f[-3:] == 'csv' and f[-6:-4] in ('93', '94', '95', '96', '97', '98',
> '99', '00', '91', '02', '03', '04', '05'):
Where does f come from? You haven't shown the definition of that.
--
Steven
More information about the Python-list
mailing list