<div class="gmail_quote">On Fri, Feb 19, 2010 at 10:22 AM, Vincent Davis <span dir="ltr"><<a href="mailto:vincent@vincentdavis.net">vincent@vincentdavis.net</a>></span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<div><div><div>I have some (~50) text files that have about 250,000 rows each. I am reading them in using the following, which gets me what I want, but it is not fast. Is there something I am missing that would help? This is mostly a question to help me learn more about Python. It takes about 4 min right now.</div>


<div><br></div><div>def read_data_file(filename):</div><div>    reader = csv.reader(open(filename, "U"),delimiter='\t')</div><div>    read = list(reader)</div></div></div></blockquote><div><br>You're slurping the entire file here when it's not necessary.<br>

 </div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div><div><div>    data_rows = takewhile(lambda trow: '[MASKS]' not in trow, [x for x in read])</div>

</div></div></blockquote><div><br>[x for x in read] just copies the entire list. This isn't necessary: takewhile accepts any iterable, so read can be passed to it directly.<br> <br></div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<div><div>

<div>    data = [x for x in data_rows][1:]</div><div>    </div></div></div></blockquote><div><br>Again, copying here is unnecessary.<br><br>[x for x in y] isn't idiomatic Python. If you really need a copy of a list, x = y[:] is the idiom.<br>
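<br>As a rough illustration (the sample rows are made up, and this is Python 3 syntax), here is an explicit slice copy next to a version that skips the header row without building a throwaway list first:<br>
<pre>
```python
from itertools import islice, takewhile

# Made-up sample rows standing in for the parsed file contents.
read = [["header"], ["a", "1"], ["b", "2"], ["[MASKS]"], ["c", "3"]]

# An explicit shallow copy, for the cases where one is really needed.
read_copy = read[:]

# takewhile returns a lazy iterator. islice drops the first (header) row
# as it goes, and list() materializes the result exactly once.
data_rows = takewhile(lambda row: "[MASKS]" not in row, read)
data = list(islice(data_rows, 1, None))
```
</pre>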

 </div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div><div><div>    mask_rows = takewhile(lambda trow: '[OUTLIERS]' not in trow, list(dropwhile(lambda drow: '[MASKS]' not in drow, read)))</div>

</div></div></blockquote><div><br><br><br> </div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div><div><div>    mask = [row for row in mask_rows if row][3:]</div>

</div></div></blockquote><div><br>Here's another unnecessary list copy.<br> </div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div>

<div>

<div>    </div><div>    outlier_rows = dropwhile(lambda drows: '[OUTLIERS]' not in drows, read)</div><div>    outlier = [row for row in outlier_rows if row][3:]</div></div></div></blockquote><div><br><br>And another.<br>

</div></div><br>Just because you're using Python doesn't mean you get to be silly in how you move data around. Avoid copies as much as possible, and try to avoid slurping in large files all at once. Line-by-line processing is best.<br>
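<br>Here is a minimal sketch of what a single pass with a plain for loop could look like. It is written for Python 3 and makes a few assumptions that are not in your post: the function returns a (data, mask, outlier) tuple, each marker row appears once and in that order, and the SKIP table is just my reading of your [1:] and [3:] slices.<br>
<pre>
```python
import csv

# Rows to drop at the start of each section, mirroring the [1:] and [3:]
# slices in the quoted code (the marker row itself counts as one of them).
# This table is an assumption about the file layout, not a known format.
SKIP = {"data": 1, "mask": 3, "outlier": 3}


def read_data_file(filename):
    """Split one tab-delimited file into data, mask and outlier rows."""
    sections = {"data": [], "mask": [], "outlier": []}
    current = "data"
    to_skip = SKIP[current]
    with open(filename, newline="") as handle:
        # csv.reader yields one row at a time, so the file is never
        # held in memory as a second, complete list.
        for row in csv.reader(handle, delimiter="\t"):
            if "[MASKS]" in row:
                current, to_skip = "mask", SKIP["mask"]
            elif "[OUTLIERS]" in row:
                current, to_skip = "outlier", SKIP["outlier"]
            if current != "data" and not row:
                continue  # the quoted code drops empty rows in these sections
            if to_skip:
                to_skip -= 1
                continue
            sections[current].append(row)
    return sections["data"], sections["mask"], sections["outlier"]
```
</pre>
Each row is appended to exactly one list, so nothing is copied or re-scanned after the file has been read.<br>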

<br>I think you should invert this operation into a for loop. Most people find a loop easier to reason about than chained iterators. It also helps you avoid duplicating data unnecessarily.<br><br>-- <br>Jonathan Gardner<br>

<a href="mailto:jgardner@jonathangardner.net">jgardner@jonathangardner.net</a><br>