A little advice please? (Convert my boss to Python)

Alex Martelli aleax at aleax.it
Mon Apr 15 18:27:56 EDT 2002


Duncan Smith wrote:
        ...
> Identification        Var1    Var2    Var3    Var4    Var5 ...
> 
> 0000000000001    N        0           2           3        0
> 0000000000002    N        0           2           2        0
> 0000000000003    N        1           3           3        1
> 0000000000004    Y        0           2           2        2
> 0000000000005    N        0           2           1        0
> 
> def do_stuff(filename, S):  #(S is floating point)
>     f = open(filename, 'r')
>     lines = f.readlines()
>     f.close()
>     lines = [line.split()[1:] for line in lines]
>     lines = [line for line in lines if line != []]
>     lines = lines[1:]

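You can collapse the split and the filter into a single pass -- hard to
say which form will perform best, but perhaps something like this
(untested, but I hope the intention is clear; note that the header row
still has to be dropped afterwards, as your lines[1:] does):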

    lines = [ tuple(x) for line in lines for x in (line.split()[1:],) if x]

If the file is very large, then in Python 2.2 you may be better off
iterating over the file rather than slurping it all in at once:

    lines = [ tuple(x) for line in open(filename)
        for x in (line.split()[1:],) if x]

If you need to select only some of the variables from each line's tuple,
as you indicate as one possibility, then maybe:

    lines = [ tuple([x[i] for i in takethese])
        for line in open(filename)
            for x in (line.split()[1:],) if x]

where takethese is the sequence of indices you need.
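
As a minimal sketch of the selection idea, spelled out as a plain loop and
run against an in-memory sample rather than a real file: the names
parse_rows and SAMPLE are hypothetical, the sample rows are copied from the
table quoted above, and the code assumes a current Python rather than 2.2.
It also drops the header row, which the one-liners above leave in place.

```python
import io

# Three rows from the quoted table, with the header and a blank line.
SAMPLE = """\
Identification  Var1  Var2  Var3  Var4  Var5

0000000000001   N     0     2     3     0
0000000000002   N     0     2     2     0
0000000000003   N     1     3     3     1
"""

def parse_rows(fileobj, takethese=None):
    """Return a list of tuples, one per non-blank data line."""
    rows = []
    for line in fileobj:
        fields = line.split()[1:]      # drop the identification column
        if not fields:                 # skip blank lines
            continue
        if takethese is not None:      # keep only the requested columns
            fields = [fields[i] for i in takethese]
        rows.append(tuple(fields))
    return rows[1:]                    # drop the header row

rows = parse_rows(io.StringIO(SAMPLE), takethese=(0, 2))
print(rows)   # [('N', '2'), ('N', '2'), ('N', '3')]
```

Passing an open file object instead of io.StringIO(SAMPLE) works the same
way, since both are iterated line by line.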

Each of these ideas leaves lines as a list of tuples in memory.
Alternatively, you could produce each tuple on the fly in the loop
below, in place of 'for line in lines:'; again, either approach could
be faster, depending on file size versus available memory.
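
One way to produce the tuples on the fly is a generator. This is only a
sketch under assumptions: iter_rows is a hypothetical name, the header is
taken to be the first non-blank line, and the code targets a current
Python (2.2 itself needs "from __future__ import generators" for yield).

```python
import io

def iter_rows(fileobj):
    """Yield one tuple per non-blank data line, skipping the header row."""
    seen_header = False
    for line in fileobj:
        fields = line.split()[1:]      # drop the identification column
        if not fields:                 # skip blank lines
            continue
        if not seen_header:            # first non-blank line is the header
            seen_header = True
            continue
        yield tuple(fields)

sample = io.StringIO("Id A B\n\n1 x y\n2 x z\n")
for tup in iter_rows(sample):
    print(tup)
```

Only one line's worth of data is held at a time, so the counting loop can
consume iter_rows(open(filename)) directly instead of a list.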

>     dict = {}
>     for line in lines:
>         my_key = tuple(line)
>         if dict.has_key(my_key):
>             dict[my_key] += 1
>         else: dict[my_key] = 1

Having made lines a sequence of tuples, I'd code this as:

adict = {}
for tup in lines:
    adict[tup] = 1 + adict.get(tup,0)

particularly to avoid shadowing the name of the builtin type dict
(in 2.2) with a local variable (one of my pet little hates:-); for
speed purposes, though, the win would be avoiding the if/else test.
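
To see the get-based tally at work on a tiny input, here is a sketch with
made-up rows; tally is a hypothetical wrapper name, and the data is
invented purely for illustration.

```python
def tally(rows):
    """Count how many times each tuple occurs, using dict.get."""
    adict = {}
    for tup in rows:
        # get() returns 0 for a tuple not seen yet, so no if/else is needed
        adict[tup] = 1 + adict.get(tup, 0)
    return adict

rows = [("N", "0"), ("Y", "1"), ("N", "0"), ("N", "0")]
print(tally(rows))   # {('N', '0'): 3, ('Y', '1'): 1}
```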

>     U = P = 0
>     for value in dict.values():
>         if value == 1:
>             U += 1
>         elif value == 2:
>             P += 1

Again, I think it might be faster to avoid the if/else with yet
another dictionary:

counts = {}
for count in adict.values():
    counts[count] = 1 + counts.get(count, 0)

U = counts.get(1, 0)
P = counts.get(2, 0)

>     return U*S/(U*S+P*(1-S))

This one can stay:-).
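
As a worked check of the last two steps together, here is a small sketch;
estimate is a hypothetical name, the tallies are invented, and the
zero-denominator guard is my own assumption, not part of the original
function.

```python
def estimate(adict, S):
    """Compute U*S/(U*S+P*(1-S)) from a dict of per-tuple tallies."""
    counts = {}
    for count in adict.values():
        counts[count] = 1 + counts.get(count, 0)
    U = counts.get(1, 0)               # tuples that occur exactly once
    P = counts.get(2, 0)               # tuples that occur exactly twice
    denominator = U * S + P * (1 - S)
    if denominator == 0:               # assumption: fail loudly, not silently
        raise ValueError("no tuples occur exactly once or twice")
    return U * S / denominator

# Two tuples seen once, one seen twice, one seen three times: U=2, P=1.
adict = {"a": 1, "b": 1, "c": 2, "d": 3}
result = estimate(adict, 0.5)
print(result)   # 2*0.5 / (2*0.5 + 1*0.5) = 1.0 / 1.5, about 0.667
```

With S = 0.5 the weights on U and P are equal, so the result reduces to
U/(U+P) = 2/3.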


Alex



