[Tutor] Perl to Python code migration

Jeff Shannon jeff@ccvcorp.com
Fri, 06 Sep 2002 09:56:58 -0700


Erik Price wrote:

> On Thursday, September 5, 2002, at 10:24  PM, Jeff Shannon wrote:
>
> > I didn't see any explicit specification along those lines...
>
> Surrounded by the question itself and the code, it was pretty easy to
> miss, but Levy did say that:

Whoops, guess I need to read a bit more closely before spouting off.  ;)
(What makes this even more embarrassing is that I went back and re-read the
original before sending that reply, so I missed it *twice*.)


> > Given that restriction, then the best solution probably would be to
> > read in small chunks of the file (possibly using xreadlines() if it's
> > broken into lines) until the delimiter is found, and accumulating
> > those chunks manually.
>
> I was thinking about this earlier today, but it's a hard one.
> xreadlines() seems like the best solution, except that the data is not
> separated into separate lines (it's multi-line).

Well, multiline data can still be handled by xreadlines(), though it's a
bit more complicated.  If we can assume that a record always starts at the
beginning of a line (the delimiter will always be the first group of
characters on any line that it appears on), then we can use code that looks
something like this:

block = []
for line in infile.xreadlines():
    # A line beginning with the delimiter marks the start of the next
    # record, so process the block gathered so far (unless this is the
    # very first record and nothing has accumulated yet).
    if line.startswith("(ID   1)") and block:
        process_data(block)
        block = []
    block.append(line)
# The final record has no delimiter after it, so process it explicitly.
process_data(block)

This could doubtless be made a bit cleaner, but it should work.  It reads a
line at a time, adding each line to the current block.  When we find the
line that starts the next record, we process the block that's already
accumulated and then start a new block with that new line.  Since the final
record has nothing following it to trigger processing, we have to add one
more call to process it after the for-loop finishes.  The process_data()
function could perhaps join() the list of lines into a single string for
easier processing.
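
Just as an illustration (the real parsing is only a stub here, since what it
should actually do depends on the records), process_data() might start out
as simply as:

def process_data(block):
    # Glue the accumulated lines back together into one string, so that
    # the whole record can be searched or split in a single operation.
    record = "".join(block)
    # ...the real record parsing would go here; for now, just show the
    # start of each record to confirm the blocks are being split right.
    print repr(record[:60])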

This is, in principle, somewhat similar to the event-based sax parsing that
you mentioned, but in a very rough quick-and-dirty form.

Of course, as noted, this assumes that the delimiter always starts a line.
If that assumption isn't valid, or if there are no line breaks in the file,
then you'd need to use find() to determine if the delimiter is present, add
any data before the delimiter to the block before you send it, and start
the new block with only data from after the delimiter.  (If there aren't
line breaks, you'd need to use, say, file.read(1024) instead of
file.xreadlines(), but the principle is the same.)  You might also need to
check each line (or newly read chunk) to ensure that it doesn't contain
more than one delimiter.
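
To give a rough idea only (assuming the same infile and process_data() as
above, except that here process_data() would get one string rather than a
list of lines, and with the delimiter and chunk size as placeholders), that
version might look something like:

delim = "(ID   1)"
block = ""
while 1:
    chunk = infile.read(1024)
    if not chunk:
        break    # end of file
    block = block + chunk
    # Split off a complete record each time another copy of the delimiter
    # shows up past the start of the current block; the inner loop takes
    # care of a chunk that happens to contain more than one delimiter.
    pos = block.find(delim, 1)
    while pos != -1:
        process_data(block[:pos])
        block = block[pos:]
        pos = block.find(delim, 1)
# Whatever is left when the file runs out is the final record.
if block:
    process_data(block)

Searching from position 1 (rather than 0) keeps the delimiter that begins
the current block from matching itself, and since a partial delimiter at the
end of a chunk simply stays in block until the next read completes it,
delimiters split across chunk boundaries are handled automatically.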

Jeff Shannon
Technician/Programmer
Credit International