[Tutor] quick data structures question

Steven D'Aprano steve at pearwood.info
Thu Sep 22 02:49:47 CEST 2011


Fred G wrote:
> Hey guys,
> 
> I want to write a short script that takes from an input excel file w/ a
> bunch of rows and columns.  The two columns I'm interested in are "high
> gains" and "genes."  I want the user to write:
> 
>>>> Which genes are associated with gains over 20%?


Depending on your requirements, the simplest data structure would be 
just a sorted list of (high gain, genes):

data = [ (1, "GAT"),
          (3, "CCT"),
          ...
          (99, "TTG")]

Then you can answer questions like:

"Which genes are associated with gains between 17% and 48%?"

with a simple helper function:

# untested
def extract_genes(data, low=0, high=100):
     """Get the genes between low% and high% high gain."""
     for gain, genes in data:
         if gain < low: continue
         if gain > high: break
         yield genes


genes = list(extract_genes(data, 17, 48)


> and then I want the script to search through the excel file, find in the
> "high gains" column when the number is greater than 20%, and then store the
> corresponding genes in that row (but in the "genes" column).  

Unless you have hundreds of megabytes of data, you certainly don't want 
to be reading from the Excel file every single time you do a query. That 
will be very slow.

Does it have to be an .xls file? If so, you will need to install and use 
a third-part package to read data from the file. Google on "python 
excel" for more:

https://duckduckgo.com/html/?q=python+excel

Otherwise, a simple CSV file is 100% Excel compatible: use Excel to do a 
Save As and choose Comma Separated Values. You can use the csv module to 
read the data into a list, and then do all your processing in memory 
instead of on disk, which is like a thousand times faster.

To get the data out of the CSV file into a list, something like this 
(again, untested) will work:

import csv

f = open("my data file.csv", "r")
reader = csv.reader(f)
# Skip the first line, if it is a header.
header = next(reader)  # or f.next() if using Python 2.5
data = []
for row in reader:
     # I don't know the format of your data file, so I'll make it up.
     # Ignore everything except high_gain and genes.
     colour, truth, beauty, strangeness, low_gain, high_gain, genes = row
     pair = convert(high_gain), genes
     data.append(pair)

f.close()
data.sort()


You have to write your own "convert" function to turn the gains into 
numbers. Something like this might do it:

def convert(s):
     """Convert strings like '0.42' or '42%' to floats."""
     if s.endswith("%"):
         num = float(s[:-1])
         return num/100
     return float(s)




-- 
Steven




More information about the Tutor mailing list