[Tutor] quick data structures question
Steven D'Aprano
steve at pearwood.info
Thu Sep 22 02:49:47 CEST 2011
Fred G wrote:
> Hey guys,
>
> I want to write a short script that takes from an input excel file w/ a
> bunch of rows and columns. The two columns I'm interested in are "high
> gains" and "genes." I want the user to write:
>
>>>> Which genes are associated with gains over 20%?
Depending on your requirements, the simplest data structure would be
just a sorted list of (high gain, genes):
data = [ (1, "GAT"),
(3, "CCT"),
...
(99, "TTG")]
Then you can answer questions like:
"Which genes are associated with gains between 17% and 48%?"
with a simple helper function:
# untested
def extract_genes(data, low=0, high=100):
"""Get the genes between low% and high% high gain."""
for gain, genes in data:
if gain < low: continue
if gain > high: break
yield genes
genes = list(extract_genes(data, 17, 48)
> and then I want the script to search through the excel file, find in the
> "high gains" column when the number is greater than 20%, and then store the
> corresponding genes in that row (but in the "genes" column).
Unless you have hundreds of megabytes of data, you certainly don't want
to be reading from the Excel file every single time you do a query. That
will be very slow.
Does it have to be an .xls file? If so, you will need to install and use
a third-part package to read data from the file. Google on "python
excel" for more:
https://duckduckgo.com/html/?q=python+excel
Otherwise, a simple CSV file is 100% Excel compatible: use Excel to do a
Save As and choose Comma Separated Values. You can use the csv module to
read the data into a list, and then do all your processing in memory
instead of on disk, which is like a thousand times faster.
To get the data out of the CSV file into a list, something like this
(again, untested) will work:
import csv
f = open("my data file.csv", "r")
reader = csv.reader(f)
# Skip the first line, if it is a header.
header = next(reader) # or f.next() if using Python 2.5
data = []
for row in reader:
# I don't know the format of your data file, so I'll make it up.
# Ignore everything except high_gain and genes.
colour, truth, beauty, strangeness, low_gain, high_gain, genes = row
pair = convert(high_gain), genes
data.append(pair)
f.close()
data.sort()
You have to write your own "convert" function to turn the gains into
numbers. Something like this might do it:
def convert(s):
"""Convert strings like '0.42' or '42%' to floats."""
if s.endswith("%"):
num = float(s[:-1])
return num/100
return float(s)
--
Steven
More information about the Tutor
mailing list