optimization

Nathan Clegg nathan at islanddata.com
Mon Jul 5 21:12:59 EDT 1999


I'm writing an application in Python that receives data in a specified
format, but can't decide which method would be the most efficient.  The
data format is simple.  The data is a set of (key,value) pairs.  One-line
values are of the format:

Key: one line value

Multiline values appear like this:

.KEY
multiline value
multiline value
multiline value
multiline value
.KEY

where each value line is escaped if a period actually falls into the first
column.  I want to read these pairs into a dict.  The process for the
first case seems simple:

(key, value) = string.split(line, ':', 1)
dict[key] = value

For the multiline case, though, I'm not sure which approach to take
because of the possible performance repercussions.  Some of my initial
thoughts, minus the nitpicky stuff (like escaping periods and chomping lines):

1)
temp = ''
for line in file.readlines():
        if line[0] == '.':
                if temp:
                        temp = ''               # closing .KEY marker
                else:
                        temp = line[1:]         # opening .KEY marker
                        dict[temp] = ''
        else:
                if temp:
                        dict[temp] = dict[temp] + '\n' + line
                else:
                        (key, value) = string.split(line, ':', 1)
                        dict[key] = value

This approach is unappealing for two reasons.  First, I don't like the
idea of keeping track of too much state (temp).  I know it's not much, but
it just seems unnecessary.  Second, and more importantly, the multiline
values are likely to make up far more of the data than the single-line
ones.  I would rather optimize for the multiline case, and I know
a = a + b is not the way to do that.
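
For comparison, here is a minimal sketch of the same single-pass loop that
collects the lines of a multiline value in a list and joins them once at the
closing marker, instead of growing a string with a = a + b.  It uses modern
Python string methods rather than the string module, and the names
parse_pairs, key and chunks are made up for illustration.  Stripping
whitespace around single-line values and skipping lines without a colon are
assumptions; escaped periods are still not handled.

```python
def parse_pairs(lines):
    """Parse 'Key: value' lines and .KEY ... .KEY blocks into a dict."""
    result = {}
    key = None      # name of the multiline block being read, if any
    chunks = []     # lines collected so far for that block
    for raw in lines:
        line = raw.rstrip('\n')
        if line.startswith('.'):
            if key is None:
                # opening marker: start collecting lines
                key = line[1:]
                chunks = []
            else:
                # closing marker: join once, then leave the block
                result[key] = '\n'.join(chunks)
                key = None
        elif key is not None:
            chunks.append(line)
        elif ':' in line:
            # assumption: whitespace around the value is not significant
            name, value = line.split(':', 1)
            result[name] = value.strip()
    return result
```

The only state kept between lines is the current key and the list of pending
lines, and each multiline value is built with a single join.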

2)
lines = file.readlines()
while lines:
        line = lines[0]; del lines[0]
        if line[0] == '.':
                index = lines.index(line)
                dict[line[1:]] = string.join(lines[:index], '\n')
                del lines[:index+1]
        else:
                (key, value) = string.split(line, ':', 1)
                dict[key] = value

This approach seems to handle the worst case better, by joining rather
than handling lines one by one.  However, I would much rather use a for
loop in this case than a while.  I'm also worried about all the slicing
going on, though individual lines are likely to be 80 characters or less.
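
One way the join-based idea could be written with a for loop and no list
slicing is sketched below.  It relies on itertools.takewhile from the modern
standard library, and it assumes, as in approach 2, that the closing marker is
identical to the opening one.  The function name parse_pairs_iter is
hypothetical, and escaped periods are again left out.

```python
from itertools import takewhile

def parse_pairs_iter(lines):
    """Join-based parsing driven by a single shared iterator."""
    result = {}
    it = iter(line.rstrip('\n') for line in lines)
    for line in it:
        if line.startswith('.'):
            marker = line
            # Pull lines from the same iterator until the closing marker.
            # takewhile also consumes the marker itself, so the outer
            # loop resumes on the line after the block.
            body = list(takewhile(lambda l: l != marker, it))
            result[marker[1:]] = '\n'.join(body)
        elif ':' in line:
            # assumption: whitespace around the value is not significant
            key, value = line.split(':', 1)
            result[key] = value.strip()
    return result
```

Because the outer loop and takewhile share one iterator, no line is visited
twice and nothing is deleted from the list.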

3)
As yet uncoded...reading the file into a single string rather than based
on lines, pulling out the multiline stuff based on a regular expression
search (something like \.(\w+)\n.*\.\1\n), but that sounds like it would
suffer too much from regex and string buffer copies.
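
A rough sketch of what approach 3 might look like follows.  The pattern is a
variation on the one above: it adds the MULTILINE and DOTALL flags and a
non-greedy body, which are needed for the match to span several lines and
stop at the first closing marker.  The names BLOCK and parse_pairs_regex are
invented for illustration, the text is assumed to end with a newline, and
nothing here measures whether the copying is actually a problem.

```python
import re

# .KEY line, then the shortest body that ends at a matching .KEY line
BLOCK = re.compile(r'^\.(\w+)\n(.*?)^\.\1\n', re.MULTILINE | re.DOTALL)

def parse_pairs_regex(text):
    """Extract multiline blocks by regex, then parse what is left."""
    result = {}

    def store(match):
        # A non-empty body always ends with a newline; drop that one.
        result[match.group(1)] = match.group(2)[:-1]
        return ''   # remove the whole block from the text

    rest = BLOCK.sub(store, text)
    for line in rest.splitlines():
        if ':' in line:
            # assumption: whitespace around the value is not significant
            key, value = line.split(':', 1)
            result[key] = value.strip()
    return result
```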

Any advice or new methods?

Please forgive typos and syntax issues...the above really is pseudocode.


----------------------------------
Nathan Clegg
 nathan at islanddata.com





