<div dir="ltr"><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Jul 30, 2014 at 5:57 PM, Skip Montanaro <span dir="ltr"><<a href="mailto:skip.montanaro@gmail.com" target="_blank">skip.montanaro@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div class=""><p dir="ltr">> df = pd.read_csv('nhamcsopd2010.csv' , index_col='PATCODE', low_memory=False)<br>
> col_init = list(df.columns.values)<br>
> keep_col = ['PATCODE', 'PATWT', 'VDAY', 'VMONTH', 'VYEAR', 'MED1', 'MED2', 'MED3', 'MED4', 'MED5']<br>
> for col in col_init:<br>
> &nbsp;&nbsp;&nbsp;&nbsp;if col not in keep_col:<br>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;del df[col]</p>
</div><p dir="ltr">I'm no pandas expert, but a couple things come to mind. First, where is your code slow (profile it, even with a few well-placed prints)? If it's in read_csv there might be little you can do unless you load those data repeatedly, and can save a pickled data frame as a caching measure. Second, you loop over columns deciding one by one whether to keep or toss a column. Instead try<br>
<br>
df = df[keep_col]</p>
<p dir="ltr"> Third, if deleting those other columns is costly, can you perhaps just ignore them? </p>
<p dir="ltr">Can't be more investigative right now. I don't have pandas on Android. :-)</p></blockquote></div><br><div class="gmail_default"><font face="verdana, sans-serif">So the </font><font face="arial, sans-serif">df = df[keep_col] approach is not fast, but it is not that slow either. You made me think of a solution to that part: just slice and copy. The only gotcha is that every column named in keep_col has to actually exist.</font></div>
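<div class="gmail_default">A minimal sketch of guarding against that gotcha, assuming a small made-up frame in place of the real CSV. The column EXTRA, the sample values, and the names present, missing, and subset are illustrative only. It filters keep_col down to the names that are actually in the frame before slicing, so a misspelled name shows up in a list instead of raising a KeyError.</div>
<pre>
```python
import pandas as pd

# Toy frame standing in for the real data; the values are made up.
df = pd.DataFrame({
    "PATWT": [1.0, 2.0],
    "VMONTH": [7, 8],
    "MED1": [101, 102],
    "EXTRA": ["x", "y"],
})

# "VDAY" is deliberately absent from the toy frame.
keep_col = ["PATWT", "VDAY", "VMONTH", "MED1"]

# Split the requested names into those present and those missing.
present = [c for c in keep_col if c in df.columns]
missing = [c for c in keep_col if c not in df.columns]

# Selecting with a list returns a new frame; .copy() makes that explicit.
subset = df[present].copy()

print("missing columns:", missing)
print(subset)
```
</pre>
<div class="gmail_default">Whether to skip missing names silently or stop with an error is a judgment call; printing them at least makes the mismatch visible.</div>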
<div class="gmail_default" style="font-family:verdana,sans-serif"><span style="font-family:arial,sans-serif;font-size:13.333333969116211px"><div class="gmail_default" style="font-family:verdana,sans-serif;font-size:small">
keep_col = ['PATCODE', 'PATWT', 'VDAYR', 'VMONTH', 'MED1', 'MED2', 'MED3', 'MED4', 'MED5']</div><div class="gmail_default" style="font-family:verdana,sans-serif;font-size:small">
df = df[keep_col]</div><div class="gmail_default" style="font-family:verdana,sans-serif;font-size:small"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif;font-size:small">The really slow part seems to be this loop:</div>
<div class="gmail_default" style="font-family:verdana,sans-serif;font-size:small"><div class="gmail_default">for n in drugs:</div><div class="gmail_default">&nbsp;&nbsp;&nbsp;&nbsp;df[n] = df[['MED1','MED2','MED3','MED4','MED5']].isin([drugs[n]]).any(axis=1)</div>
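<div class="gmail_default">One possible way to cut the per-drug cost, sketched under assumptions: drugs is taken to be a dict mapping a drug name to a single code, and the frame, names, and codes below are made up. The idea is to pull the five MED columns out as a NumPy array once, rather than re-slicing the frame on every pass, and then compare that array against each code. Whether this is actually faster on the real data would need timing.</div>
<pre>
```python
import pandas as pd

med_cols = ["MED1", "MED2", "MED3", "MED4", "MED5"]

# Toy data: each row is a visit, each MED column holds a drug code (0 = none).
df = pd.DataFrame(
    [[10, 20, 0, 0, 0],
     [30, 0, 0, 0, 0],
     [20, 30, 10, 0, 0]],
    columns=med_cols,
)

# Hypothetical mapping of drug name to a single code.
drugs = {"drug_a": 10, "drug_b": 30}

# Extract the MED columns once instead of once per drug.
meds = df[med_cols].to_numpy()

# For each drug, flag the rows where any MED column equals its code.
flags = {name: (meds == code).any(axis=1) for name, code in drugs.items()}

# Attach all flag columns in one step.
df = df.assign(**flags)
print(df)
```
</pre>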
</div><div class="gmail_default" style="font-family:verdana,sans-serif;font-size:small"><br></div><div><br></div></span></div><br clear="all"><div><div>Vincent Davis</div><div>720-301-3003<span></span><span></span></div>
</div>
</div></div>