Subsetting a dataset

Kumar Mainali kpmainali at utexas.edu
Mon Jun 13 00:57:50 EDT 2011


I have a huge dataset, with millions of rows and several dozen columns, in
a tab-delimited text file. I need to extract a small subset of the rows and
only three of the columns. One of the three columns holds a two-word string
under the header “Scientific Name” (shortened to “Sci Name” below). The
other two columns hold numbers for Longitude and Latitude, as below.

Sci Name    Longitude    Latitude    Column4
Gen sp1     82.5         28.4        …
Gen sp2     45.9         29.7        …
Gen sp1     57.9         32.9        …
…           …            …           …

Of the many species listed under the column “Sci Name”, I am interested in
only one. Its records are interspersed among the millions of rows, so I
will probably have to use filename.readline() to read the rows one at a
time. How would I search for a particular species in the dataset and create
a new dataset for that species with only the three columns?
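
A minimal sketch of the single-species step, assuming the file has a header
row whose column names are exactly "Sci Name", "Longitude" and "Latitude",
and that fields are separated by tabs. The function name extract_species
and the file paths are hypothetical. It streams the file one row at a time
with the standard csv module instead of calling readline() by hand, so the
whole dataset never has to fit in memory.

```python
import csv

# Assumed header names; adjust to match the real file.
COLUMNS = ("Sci Name", "Longitude", "Latitude")


def extract_species(src_path, dst_path, species, columns=COLUMNS):
    """Copy the rows for one species from src_path to dst_path.

    Only the given columns are kept. Returns the number of rows written.
    """
    kept = 0
    with open(src_path, newline="") as src, \
            open(dst_path, "w", newline="") as dst:
        # DictReader takes the column names from the first line of the file.
        reader = csv.DictReader(src, delimiter="\t")
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        writer.writerow(columns)
        for row in reader:  # one row at a time
            if row[columns[0]] == species:
                writer.writerow([row[c] for c in columns])
                kept += 1
    return kept
```

It might be called as extract_species("huge.txt", "Gen_sp1.txt", "Gen sp1"),
where both file names are placeholders.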

Next, I have to create such datasets for hundreds of species, all of which
are listed in another text file. There must be a way to define an iterative
function that takes one species at a time from that list and creates a
separate dataset for each. The huge dataset contains more species than the
ones on my list.
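
A minimal sketch of the multi-species step, under several assumptions: the
species list has one name per line, the header names are exactly "Sci Name",
"Longitude" and "Latitude", and the matching rows (a small subset) fit in
memory. The function name split_by_species and the output naming scheme are
hypothetical. Rather than rescanning the huge file once per species, it
reads the file a single time and sorts the matching rows into per-species
groups, then writes one file per group.

```python
import csv
import os
from collections import defaultdict

# Assumed header names; adjust to match the real file.
COLUMNS = ("Sci Name", "Longitude", "Latitude")


def split_by_species(src_path, species_list_path, out_dir, columns=COLUMNS):
    """Write one tab-delimited file per wanted species into out_dir.

    Returns a dict mapping each species found to its number of rows.
    """
    # Assumption: one species name per line; blank lines are skipped.
    with open(species_list_path) as f:
        wanted = {line.strip() for line in f if line.strip()}

    # Single pass over the big file; only matching rows are kept in memory.
    rows_by_species = defaultdict(list)
    with open(src_path, newline="") as src:
        for row in csv.DictReader(src, delimiter="\t"):
            name = row[columns[0]]
            if name in wanted:
                rows_by_species[name].append([row[c] for c in columns])

    os.makedirs(out_dir, exist_ok=True)
    for name, rows in rows_by_species.items():
        # Hypothetical naming scheme: "Gen sp1" becomes "Gen_sp1.txt".
        out_path = os.path.join(out_dir, name.replace(" ", "_") + ".txt")
        with open(out_path, "w", newline="") as dst:
            writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
            writer.writerow(columns)
            writer.writerows(rows)
    return {name: len(rows) for name, rows in rows_by_species.items()}
```

Species on the list that never occur in the dataset simply produce no file,
and species in the dataset that are not on the list are ignored.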

I would very much appreciate any help. I am a beginner in Python, so
complete code would be most helpful.

- Kumar


-- 
Section of Integrative Biology
University of Texas at Austin
Austin, Texas 78712, USA