Subsetting a dataset

Dan Stromberg drsalists at gmail.com
Mon Jun 13 02:47:23 EDT 2011


On Sun, Jun 12, 2011 at 9:53 PM, Kumar Mainali <kpmainali at gmail.com> wrote:

> I have a huge dataset containing millions of rows and several dozen columns
> in a tab delimited text file.  I need to extract a small subset of rows and
> only three columns. One of the three columns has a two-word string with header
> “Scientific Name”. The other two columns carry numbers for Longitude and
> Latitude, as below.
>
> Sci Name Longitude Latitude Column4
> Gen sp1 82.5 28.4 …
> Gen sp2 45.9 29.7 …
> Gen sp1 57.9 32.9 …
> … … … …
>
> Of the many species listed under the column “Sci Name”, I am interested in
> only one species which will have multiple records interspersed in the
> millions of rows, and I will probably have to use filename.readline() to
> read the rows one at a time. How would I search for a particular species in
> the dataset and create a new dataset for the species with only the three
> columns?
>
> Next, I have to create such datasets for hundreds of species. All these
> species are listed in another text file. There must be a way to define an
> iterative function that looks at one species at a time in the list of
> species and creates a separate dataset for each species. The huge dataset
> contains more species than those listed in the list of my interest.
>
> I very much appreciate any help. I am a beginner in Python. So, complete
> code would be more helpful.
>

You could use the csv module, which has been in CPython since 2.3.  Don't be
fooled by the name: it lets you redefine the delimiter and other details of
the format, so it works for tab-separated values as well:
http://docs.python.org/release/3.2/library/csv.html
http://docs.python.org/release/2.7.2/library/csv.html
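
Here is a rough sketch of how that might look, assuming Python 3.  The
column names come from the sample in the question ("Sci Name", "Longitude",
"Latitude"), so adjust them to match the real header line.  The function names,
the one-name-per-line layout of the species list, and the output file naming
are made up for illustration.  It reads the big file once, keeps only the rows
for the species you listed, and writes one small tab-separated file per
species:

```python
import csv
import os
from collections import defaultdict

# Assumed column names, taken from the sample in the question;
# change these to match the actual header of the big file.
COLUMNS = ["Sci Name", "Longitude", "Latitude"]


def read_species_list(species_path):
    """Return the set of wanted species, assuming one name per line."""
    with open(species_path) as f:
        return {line.strip() for line in f if line.strip()}


def subset_by_species(data_path, species_path, out_dir):
    """Write one tab-separated file per wanted species, keeping only COLUMNS."""
    wanted = read_species_list(species_path)
    rows_by_species = defaultdict(list)

    # Single pass over the big file; the reader yields one row at a time,
    # so the millions of rows are never all in memory together.
    with open(data_path, newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        for row in reader:
            name = row[COLUMNS[0]]
            if name in wanted:
                rows_by_species[name].append([row[c] for c in COLUMNS])

    # Only the selected rows are held in memory (assumed to be a small subset).
    for name, rows in rows_by_species.items():
        # Hypothetical naming scheme: "Gen sp1" -> "Gen_sp1.txt"
        out_path = os.path.join(out_dir, name.replace(" ", "_") + ".txt")
        with open(out_path, "w", newline="") as out:
            writer = csv.writer(out, delimiter="\t", lineterminator="\n")
            writer.writerow(COLUMNS)
            writer.writerows(rows)

    # Names of the species that were actually found in the data.
    return sorted(rows_by_species)
```

You would call it as something like subset_by_species("big.txt",
"species.txt", "out"), where those file names are placeholders and the output
directory already exists.  Species in your list that never show up in the
data simply get no file.  If the selected rows turn out to be too many to
hold in memory, you could open the output files up front and write each
matching row as you go instead.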