[Tutor] Suggestions as to how to read a file in paragraphs

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Thu Sep 2 08:31:56 CEST 2004



> > First, do you need the efficiency? Is it actually causing a problem
> > just now?
>
> Before yes, but not now. The efficiency is needed when very large files
> are needed and not enough RAM is available to process huge files (eg. >
> 3 million lines)
>
> > Lots of modules might help (re, parser, string, fileinput etc) but
> > nothing specifically for splitting text files by paragraph. (And how
> > do you define a paragraph separator? Is it a blank line or is it an
> > indented first line? Both are valid...)
>
> Depending on the biological sequence files, the paragraph separator
> could be a "//" or newlines or any string really.


Hi Tzu-Ming,


An iterator approach might work here.  It's possible to do something like
this:

###
def breakIntoRecords(read, delimiter):
    """A generated that, given a read() function (like the one provided
    by files), will yield records, separated by the given delimiter."""
    buffer = []
    while True:
        nextChar = read(1)
        if not nextChar: break
        buffer.append(nextChar)
        if buffer[-len(delimiter):] == list(delimiter):
            yield ''.join(buffer)
            buffer = []
    if buffer: yield ''.join(buffer)
###

Forgive me for the ugly code; it's late.  *grin*


This uses Python's "generator" support, which makes it easy to write
functions that yield chunks of data one at a time.  The function above
breaks a file into records, based on the delimiter we give it.


Let's test this function by using the StringIO module to wrap a string
in a file-like object:

###
>>> from StringIO import StringIO
>>> textFile = StringIO('hello world\nthis is a test\ncan you see this?')
>>> print list(breakIntoRecords(textFile.read, '\n'))
['hello world\n', 'this is a test\n', 'can you see this?']
>>> textFile.seek(0)
>>> print list(breakIntoRecords(textFile.read, ' '))
['hello ', 'world\nthis ', 'is ', 'a ', 'test\ncan ', 'you ', 'see ',
 'this?']
###

Here, we call 'list()' to make it easy to see how the system is breaking
things into records: in reality, we would NOT call list() on the result of
breakIntoRecords(), since that would suck everything into memory at once.
And that would be bad.  *grin*


Instead, we'd read the records piecemeal, probably in a for loop.  The
generator reads only as many characters as it needs to produce the next
record, and we ask for each record by calling next():

###
>>> textFile.seek(0)
>>> iterator = breakIntoRecords(textFile.read, 'i')
>>> iterator.next()
'hello world\nthi'
>>> iterator.next()
's i'
>>> iterator.next()
's a test\ncan you see thi'
>>> iterator.next()
's?'
>>> iterator.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
StopIteration
###

So as long as none of your records is larger than system memory, this
should be efficient.  This solution assumes that your delimiters are
fixed, literal strings.
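
For example, to walk over one of those big sequence files record by
record, using the "//" separator you mentioned, we might write something
like this (just a sketch: 'sequences.txt' and doSomethingWith() are
hypothetical stand-ins for your real file and your per-record work):

###
## A sketch: 'sequences.txt' and doSomethingWith() are hypothetical.
inputFile = open('sequences.txt')
for record in breakIntoRecords(inputFile.read, '//'):
    doSomethingWith(record)    ## only one record is in memory at a time
inputFile.close()
###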

(The function above can be improved: we could probably do better by
reading the file in blocks of characters, rather than checking for the
delimiter after every single-character read.)
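
Here's a rough sketch of what that block-reading version might look like
(the name breakIntoRecordsBlocked and the 4096-character block size are
just my choices, nothing standard):

###
def breakIntoRecordsBlocked(read, delimiter, blocksize=4096):
    """A sketch of a block-reading variant: pulls blocksize characters
    at a time, splits on the delimiter, and holds any incomplete tail
    in a buffer until the next block arrives."""
    buffer = ''
    while True:
        block = read(blocksize)
        if not block: break
        buffer += block
        pieces = buffer.split(delimiter)
        ## Every piece except the last is a complete record.
        for piece in pieces[:-1]:
            yield piece + delimiter
        buffer = pieces[-1]
    if buffer: yield buffer
###

It should yield the same records as the single-character version, since
the leftover tail in the buffer carries any delimiter that straddles a
block boundary over to the next read.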


> I know in perl there is an input record separator which by default is
> the newline and one can specify this to a specific delimiter. Is there
> one in python?

Not certain.  Python has adopted 'universal newlines' support,

    http://www.python.org/doc/2.3.4/whatsnew/node7.html

This lets Python recognize any of the three standard line-ending
conventions.  So there may be machinery in Python that we could reuse to
redefine a "line" as something else, but I'm not sure how easily
accessible that is.
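
For what it's worth, here's a sketch of the universal-newline mode itself
('data.txt' is just a hypothetical filename).  Note that it only covers
the three newline conventions; it won't redefine the separator to
something like "//":

###
## 'U' mode (Python 2.3 and later) treats '\n', '\r', and '\r\n'
## all as line endings.  'data.txt' is a hypothetical filename.
for line in open('data.txt', 'U'):
    print line,
###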


Good luck to you!


