
Nathan Clegg nathan at islanddata.com
Mon Jul 5 21:12:59 EDT 1999

I'm writing an application in Python that receives data in a specified
format, but can't decide which method would be the most efficient.  The
data format is simple.  The data is a set of (key,value) pairs.  One-line
values are of the format:

Key: one line value

Multiline values appear like this:

.Key
multiline value
multiline value
multiline value
multiline value
.Key

where each value line is escaped if a period actually falls into the first
column.  I want to read these pairs into a dict.  The process for the
first case seems simple:

(key, value) = string.split(line, ':', 1)
dict[key] = value
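
For reference, a minimal sketch of the same single-line step in current Python, where string methods replace the string module functions. The helper name parse_single_line is hypothetical, and stripping the whitespace around the key and value is an assumption about what the format wants.

```python
def parse_single_line(line):
    """Split 'Key: value' into (key, value), trimming surrounding whitespace."""
    # partition() splits at the first colon only, so colons in the value survive.
    key, sep, value = line.partition(':')
    if not sep:
        raise ValueError('no colon in line: %r' % line)
    return key.strip(), value.strip()


pairs = {}
key, value = parse_single_line('Key: one line value\n')
pairs[key] = value
```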

But I'm not sure which approach I should take because of possible
repercussions.  Some of my initial thoughts, minus the nitpicky stuff
(like escaping periods and chomping lines):

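temp = ''
for line in file.readlines():
        if line[0] == '.':
                # delimiter line: toggles multiline mode on or off
                if temp:
                        temp = ''
                else:
                        temp = line
        else:
                if temp:
                        # inside a multiline value: append to it
                        dict[temp] = dict[temp] + '\n' + line
                else:
                        (key, value) = string.split(line, ':', 1)
                        dict[key] = value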

This approach is unappealing for two reasons.  First, I don't like the
idea of keeping track of too much state (temp).  I know it's not much, but
it just seems unnecessary.  Second, and more importantly, the multiline
case is likely to be much larger than the single-line case is numerous.  I
would rather optimize for it, and I know a = a + b is not the way to do it.

lines = file.readlines()
while lines:
        line = lines[0]; del lines[0]
        if line[0] == '.':
                # find the matching closing delimiter and join everything before it
                index = lines.index(line)
                dict[line[1:]] = string.join(lines[:index], '\n')
                del lines[:index+1]
        else:
                (key, value) = string.split(line, ':', 1)
                dict[key] = value

This approach seems to handle the worst case better, by joining rather
than handling lines one by one.  However, I would much rather use a for
loop in this case than a while.  I'm also worried about all the slicing
going on, though individual lines are likely to be 80 characters or less.
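
One possible way to keep the for loop and avoid both the slicing and the repeated a = a + b is to walk a single iterator and let an inner loop consume the lines of a multiline value. This is only a sketch in current Python, assuming a multiline value is opened and closed by the same '.Key' line (as the lines.index(line) lookup above implies). The name parse_pairs and the sample data are made up for illustration, and unescaping of leading periods is left out, as in the attempts above.

```python
import io


def parse_pairs(stream):
    """Read 'Key: value' lines and '.Key' ... '.Key' blocks into a dict."""
    pairs = {}
    lines = iter(stream)
    for line in lines:
        line = line.rstrip('\n')
        if line.startswith('.'):
            # Opening delimiter: gather lines until the same delimiter recurs.
            chunk = []
            for body in lines:  # keeps consuming the same iterator
                body = body.rstrip('\n')
                if body == line:
                    break
                chunk.append(body)
            # One join per value instead of repeated string concatenation.
            pairs[line[1:]] = '\n'.join(chunk)
        elif line:
            key, _, value = line.partition(':')
            pairs[key.strip()] = value.strip()
    return pairs


sample = io.StringIO(
    'Name: example\n'
    '.Body\n'
    'first line\n'
    'second line\n'
    '.Body\n'
    'Size: 2\n'
)
result = parse_pairs(sample)
```

The only state left is the list for the value currently being read, and nothing is ever deleted from the front of a list. An unterminated block simply runs to the end of the input in this sketch; whether that should be an error depends on the application.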

As yet uncoded...reading the file into a single string rather than based
on lines, pulling out the multiline stuff based on a regular expression
search (something like \.(\w+)\n.*\.\1\n), but that sounds like it would
suffer too much from regex and string buffer copies.
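
For comparison, the regular-expression idea might look roughly like the sketch below in current Python. It is not measured against the line-based versions, so it says nothing about the copying cost. The pattern is a variation on the one above: it anchors both delimiters to the start of a line (re.MULTILINE), lets the dot match newlines (re.DOTALL), and uses a non-greedy body so that each block stops at its own closing delimiter. The names BLOCK, extract_blocks, strip_blocks and sample_text are hypothetical.

```python
import re

# '.Key' line, then the shortest body that is followed by the same '.Key' line.
BLOCK = re.compile(r'^\.(\w+)\n(.*?)^\.\1\n', re.MULTILINE | re.DOTALL)


def extract_blocks(text):
    """Return {key: value} for every '.Key' ... '.Key' block in text."""
    blocks = {}
    for match in BLOCK.finditer(text):
        # rstrip drops the newline(s) at the end of the block body.
        blocks[match.group(1)] = match.group(2).rstrip('\n')
    return blocks


def strip_blocks(text):
    """Return text with all multiline blocks removed, leaving single-line pairs."""
    return BLOCK.sub('', text)


sample_text = (
    'Name: example\n'
    '.Body\n'
    'first line\n'
    'second line\n'
    '.Body\n'
    'Size: 2\n'
)
blocks = extract_blocks(sample_text)
remainder = strip_blocks(sample_text)
```

What remains after strip_blocks can then be split line by line on the first colon.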

Any advice or new methods?

Please forgive typos and syntax issues...the above really is pseudocode.

Nathan Clegg
 nathan at islanddata.com
