A little advice please? (Convert my boss to Python)
Alex Martelli
aleax at aleax.it
Mon Apr 15 18:27:56 EDT 2002
Duncan Smith wrote:
...
> Identification Var1 Var2 Var3 Var4 Var5 ...
>
> 0000000000001 N 0 2 3 0
> 0000000000002 N 0 2 2 0
> 0000000000003 N 1 3 3 1
> 0000000000004 Y 0 2 2 2
> 0000000000005 N 0 2 1 0
>
> def do_stuff(filename, S): #(S is floating point)
>     f = open(filename, 'r')
>     lines = f.readlines()
>     f.close()
>     lines = [line.split()[1:] for line in lines]
>     lines = [line for line in lines if line != []]
>     lines = lines[1:]
You can collapse this as much as you wish -- hard to say which form
will have the best performance, but perhaps something like (untested,
but I hope the intention is clear):
    lines = [ tuple(x) for line in lines
              for x in (line.split()[1:],) if x ][1:]   # [1:] drops the header row
If the file is very large, in Python 2.2 you might be better off
iterating over the file rather than slurping it all in at once:
    lines = [ tuple(x) for line in open(filename)
              for x in (line.split()[1:],) if x ][1:]   # [1:] drops the header row
If you need only some of each line's variables, as you indicate is
one possibility, then maybe:
    lines = [ tuple([x[i] for i in takethese])
              for line in open(filename)
              for x in (line.split()[1:],) if x ][1:]   # [1:] drops the header row
where takethese is the sequence of indices you need.
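As a rough illustration of the takethese idea, here is a minimal sketch written for a current Python 3 rather than 2.2. The name read_selected and the SAMPLE text are hypothetical, and io.StringIO stands in for the real file so the snippet runs on its own; the layout is assumed to match the excerpt quoted above.

```python
import io

# Hypothetical sample standing in for the real file: a header row,
# a blank line, then the data rows from the quoted excerpt.
SAMPLE = """\
Identification Var1 Var2 Var3 Var4 Var5

0000000000001 N 0 2 3 0
0000000000002 N 0 2 2 0
0000000000003 N 1 3 3 1
0000000000004 Y 0 2 2 2
0000000000005 N 0 2 1 0
"""

def read_selected(fileobj, takethese):
    """Return one tuple per data row, keeping only the fields whose
    indices (counted after the Identification column) are in takethese."""
    rows = [tuple(x[i] for i in takethese)
            for line in fileobj
            for x in (line.split()[1:],) if x]   # 'if x' skips blank lines
    return rows[1:]                              # drop the header row

rows = read_selected(io.StringIO(SAMPLE), takethese=(0, 2))
print(rows)
```

With takethese=(0, 2) each row keeps only Var1 and Var3, so rows that differ only in the other variables collapse onto the same key later on.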
Each of these ideas ends up with lines as a list of tuples in
memory. Alternatively, you could produce each tuple on the fly
inside the counting loop below, in place of 'for line in lines:'.
Again, either approach could be faster, depending on file size
versus available memory.
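One way to produce the tuples on the fly is a generator, so that only one line is held at a time. This is only a sketch for a current Python 3; iter_tuples and the sample text are hypothetical names and data, and io.StringIO again stands in for an open file.

```python
import io

def iter_tuples(fileobj):
    """Yield one tuple of variables per non-blank line, skipping the
    first non-blank line (assumed to be the header row)."""
    seen_header = False
    for line in fileobj:
        fields = line.split()[1:]      # drop the Identification column
        if not fields:                 # blank line: nothing to yield
            continue
        if not seen_header:            # first real line is the header
            seen_header = True
            continue
        yield tuple(fields)

# Hypothetical miniature input, counted without building a list first.
sample = io.StringIO("Identification Var1 Var2\n\n001 N 0\n002 N 0\n003 Y 1\n")
adict = {}
for tup in iter_tuples(sample):
    adict[tup] = 1 + adict.get(tup, 0)
print(adict)
```

The trade-off is the one described above: the list version pays in memory, the generator version re-reads the file if you need a second pass.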
>     dict = {}
>     for line in lines:
>         my_key = tuple(line)
>         if dict.has_key(my_key):
>             dict[my_key] += 1
>         else: dict[my_key] = 1
Once lines is a sequence of tuples, I'd code this as:
    adict = {}
    for tup in lines:
        adict[tup] = 1 + adict.get(tup,0)
This avoids shadowing the name of the builtin type dict (in 2.2)
with a local variable (one of my pet little hates:-), but for speed
purposes the win is avoiding the if/else test.
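For readers on a current Python 3, the same tally can be sketched with collections.Counter from the standard library. Counter did not exist when this was written, so treat it as a later alternative rather than part of the suggestion above; the sample tuples are made up for illustration.

```python
from collections import Counter

# Hypothetical rows, already converted to tuples.
lines = [("N", "0", "2"), ("N", "0", "2"), ("Y", "1", "3")]

# The dict.get idiom: no if/else, a missing key counts as 0.
adict = {}
for tup in lines:
    adict[tup] = 1 + adict.get(tup, 0)

# The same tally with Counter, which does the bookkeeping itself.
tally = Counter(lines)

print(adict)
print(dict(tally) == adict)
```

Which of the two is faster depends on the interpreter version and the data, so it is worth timing both on a real file.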
>     U = P = 0
>     for value in dict.values():
>         if value == 1:
>             U += 1
>         elif value == 2:
>             P += 1
Again I think it might be faster to avoid the if/else with yet
another dictionary:
    counts = {}
    for count in adict.values():
        counts[count] = 1 + counts.get(count, 0)
    U = counts.get(1, 0)
    P = counts.get(2, 0)
>     return U*S/(U*S+P*(1-S))
This one can stay:-).
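The counts-of-counts step and the final expression can be sketched together as follows, for a current Python 3. The function name ratio_from_counts and the sample dictionary are hypothetical; only the expression itself comes from the quoted code.

```python
def ratio_from_counts(adict, S):
    """Given a mapping key -> number of occurrences and a float S,
    return U*S/(U*S+P*(1-S)), where U is the number of keys seen
    exactly once and P the number seen exactly twice."""
    counts = {}
    for count in adict.values():
        counts[count] = 1 + counts.get(count, 0)
    U = counts.get(1, 0)   # keys that occur once
    P = counts.get(2, 0)   # keys that occur twice
    return U * S / (U * S + P * (1 - S))

# Hypothetical tallies: two keys seen once, one seen twice, one seen three times.
sample = {"a": 1, "b": 2, "c": 1, "d": 3}
result = ratio_from_counts(sample, 0.25)
print(result)
```

Here U is 2 and P is 1, so the value is 0.5 / (0.5 + 0.75). Note that the expression divides by zero when the denominator is 0 (for example, when no key occurs once or twice); the quoted code does not say how that case should be handled, so the sketch leaves it alone.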
Alex