[Tutor] quick data structures question
steve at pearwood.info
Thu Sep 22 02:49:47 CEST 2011
Fred G wrote:
> Hey guys,
> I want to write a short script that takes from an input excel file w/ a
> bunch of rows and columns. The two columns I'm interested in are "high
> gains" and "genes." I want the user to write:
>>>> Which genes are associated with gains over 20%?
Depending on your requirements, the simplest data structure would be
just a sorted list of (high gain, genes):
data = [ (1, "GAT"),
Then you can answer questions like:
"Which genes are associated with gains between 17% and 48%?"
with a simple helper function:
def extract_genes(data, low=0, high=100):
"""Get the genes between low% and high% high gain."""
for gain, genes in data:
if gain < low: continue
if gain > high: break
genes = list(extract_genes(data, 17, 48)
> and then I want the script to search through the excel file, find in the
> "high gains" column when the number is greater than 20%, and then store the
> corresponding genes in that row (but in the "genes" column).
Unless you have hundreds of megabytes of data, you certainly don't want
to be reading from the Excel file every single time you do a query. That
will be very slow.
Does it have to be an .xls file? If so, you will need to install and use
a third-part package to read data from the file. Google on "python
excel" for more:
Otherwise, a simple CSV file is 100% Excel compatible: use Excel to do a
Save As and choose Comma Separated Values. You can use the csv module to
read the data into a list, and then do all your processing in memory
instead of on disk, which is like a thousand times faster.
To get the data out of the CSV file into a list, something like this
(again, untested) will work:
f = open("my data file.csv", "r")
reader = csv.reader(f)
# Skip the first line, if it is a header.
header = next(reader) # or f.next() if using Python 2.5
data = 
for row in reader:
# I don't know the format of your data file, so I'll make it up.
# Ignore everything except high_gain and genes.
colour, truth, beauty, strangeness, low_gain, high_gain, genes = row
pair = convert(high_gain), genes
You have to write your own "convert" function to turn the gains into
numbers. Something like this might do it:
"""Convert strings like '0.42' or '42%' to floats."""
num = float(s[:-1])
More information about the Tutor