A little advice please? (Convert my boss to Python)
Duncan Smith
buzzard at urubu.freeserve.co.uk
Mon Apr 15 17:48:05 EDT 2002
My boss is considering moving to Python (from Poplog), so I've coded up
something to read in records from a text file (as below) and return a
statistic based on the numbers of unique and paired records (as he
requested). Easy enough (code below), but now he wants to compare it for
speed against some existing code (Poplog?). He also wants to be able to
calculate the statistic for only a subset of variables.
So what I'm looking for is speed, and some advice so that I don't end up
trying too many alternatives. Maybe I should be using Numeric arrays (slice
out the superfluous columns)? Maybe an n-dimensional array (for n
variables) and just count the cells with 1 and 2? (Then slice / marginalise
and recount for queries on subsets of variables. I like the sound of this,
but maybe there are limitations on array size / number of dimensions?)
Maybe I could avoid reading the data for superfluous variables and compare
records without the need to 'line.split()'? Maybe my approach of using the
record as a dictionary key and incrementing the values is not the best way
of counting uniques and pairs?
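
For the subset-of-variables question, one possible shape is sketched below. It is only an illustration under assumptions, not a tested answer: the name pair_statistic and its columns parameter are hypothetical, it reads records one line at a time instead of via readlines(), and it uses collections.Counter, which is in the standard library of current Python versions.

```python
from collections import Counter

def pair_statistic(rows, S, columns=None):
    """Return U*S / (U*S + P*(1-S)) for an iterable of record lines.

    rows    -- record lines with the header already removed (assumption)
    columns -- hypothetical parameter: 0-based indices of the variables
               to keep, counted after the identification field is dropped;
               None keeps every variable
    """
    counts = Counter()
    for line in rows:
        fields = line.split()[1:]      # drop the identification field
        if not fields:
            continue                   # skip blank lines
        if columns is not None:
            fields = [fields[i] for i in columns]
        counts[tuple(fields)] += 1
    # Count how many distinct records occur exactly once and exactly twice.
    freq = Counter(counts.values())
    U, P = freq[1], freq[2]
    # Raises ZeroDivisionError if the denominator is zero (e.g. U == P == 0).
    return U * S / (U * S + P * (1 - S))

sample = [
    "0000000000001 N 0 2 3 0",
    "0000000000002 N 0 2 2 0",
    "0000000000003 N 1 3 3 1",
    "0000000000004 Y 0 2 2 2",
    "0000000000005 N 0 2 1 0",
]
all_vars = pair_statistic(sample, 0.5)              # every record is unique
var3_var4 = pair_statistic(sample, 0.5, [2, 3])     # Var3 and Var4 only
```

On the five sample records, restricting to Var3 and Var4 gives three unique records and one pair, so with S = 0.5 the statistic is 1.5 / (1.5 + 0.5) = 0.75.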
Anyone done anything similar? Any advice? TIA.
Duncan
Identification  Var1  Var2  Var3  Var4  Var5  ...
0000000000001   N     0     2     3     0
0000000000002   N     0     2     2     0
0000000000003   N     1     3     3     1
0000000000004   Y     0     2     2     2
0000000000005   N     0     2     1     0
def do_stuff(filename, S):  # S is floating point
    f = open(filename, 'r')
    lines = f.readlines()
    f.close()
    # Drop the identification field from each record
    lines = [line.split()[1:] for line in lines]
    # Discard blank lines, then the header row
    lines = [line for line in lines if line != []]
    lines = lines[1:]
    # Count occurrences of each distinct record
    counts = {}
    for line in lines:
        my_key = tuple(line)
        counts[my_key] = counts.get(my_key, 0) + 1
    # U = number of unique records, P = number of records occurring exactly twice
    U = P = 0
    for value in counts.values():
        if value == 1:
            U += 1
        elif value == 2:
            P += 1
    return U*S/(U*S+P*(1-S))
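
For the speed comparison, a rough timing harness might look like the sketch below. Everything in it is an assumption made for illustration: the synthetic records, the function names count_dict and count_counter, and the choice of candidates. It only times the counting step on in-memory tuples and says nothing about how the Poplog code would perform.

```python
import random
import timeit
from collections import Counter

def count_dict(records):
    """Count records with a plain dictionary, as in the code above."""
    counts = {}
    for rec in records:
        counts[rec] = counts.get(rec, 0) + 1
    return counts

def count_counter(records):
    """Count records with collections.Counter (one alternative to try)."""
    return Counter(records)

# Synthetic data: 20000 records of 5 variables, each taking a value in "012".
# A fixed seed keeps the data the same from run to run.
random.seed(0)
records = [tuple(random.choice("012") for _ in range(5))
           for _ in range(20000)]

# Time each candidate over the same data; number=5 repeats each call 5 times.
t_dict = timeit.timeit(lambda: count_dict(records), number=5)
t_counter = timeit.timeit(lambda: count_counter(records), number=5)
print("dict.get: %.4f s, Counter: %.4f s" % (t_dict, t_counter))
```

Timings depend on the machine, the Python version and the shape of the real data, so the real file would be a better benchmark input than synthetic records.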