[Tutor] Suggestions as to how to read a file in paragraphs
Danny Yoo
dyoo at hkn.eecs.berkeley.edu
Thu Sep 2 08:31:56 CEST 2004
> > First, do you need the efficiency? Is it actually causing a problem
> > just now?
>
> Before yes, but not now. The efficiency is needed when very large files
> are needed and not enough RAM is available to process huge files (eg. >
> 3 million lines)
>
> > Lots of modules might help (re, parser, string, fileinput etc) but
> > nothing specifically for splitting text files by paragraph. (And how
> > do you define a paragraph separator? Is it a blank line or is it an
> > indented first line? Both are valid...)
>
> Depending on the biological sequence files, the paragraph separator
> could be a "//" or newlines or any string really.
Hi Tzu-Ming,
An iterator approach might work here. It's possible to do something like
this:
###
def breakIntoRecords(read, delimiter):
    """A generator that, given a read() function (like the one provided
    by files), will yield records, separated by the given delimiter."""
    buffer = []
    while True:
        nextChar = read(1)
        if not nextChar: break
        buffer.append(nextChar)
        if buffer[-len(delimiter):] == list(delimiter):
            yield ''.join(buffer)
            buffer = []
    if buffer: yield ''.join(buffer)
###
Forgive me for the ugly code; it's late. *grin*
This uses Python's "generator" support, which allows us to easily write
things that can yield chunks of data at a time. The function above will
chunk up a file, based on the delimiter we give it.
Let's test this function by using the StringIO module to treat a string
as a file-like object:
###
>>> from StringIO import StringIO
>>> textFile = StringIO('hello world\nthis is a test\ncan you see this?')
>>> print list(breakIntoRecords(textFile.read, '\n'))
['hello world\n', 'this is a test\n', 'can you see this?']
>>> textFile.seek(0)
>>> print list(breakIntoRecords(textFile.read, ' '))
['hello ', 'world\nthis ', 'is ', 'a ', 'test\ncan ', 'you ', 'see ',
'this?']
###
Here, we call 'list()' to make it easy to see how the system is breaking
things into records: in reality, we would NOT call list() on the result of
breakIntoRecords(), since that would suck everything into memory at once.
And that would be bad. *grin*
Instead, we'd read it piecemeal, probably in a for loop.  The generator
reads only as many characters as it needs to see the next record, and
hands one record over each time next() is called:
###
>>> textFile.seek(0)
>>> iterator = breakIntoRecords(textFile.read, 'i')
>>> iterator.next()
'hello world\nthi'
>>> iterator.next()
's i'
>>> iterator.next()
's a test\ncan you see thi'
>>> iterator.next()
's?'
>>> iterator.next()
Traceback (most recent call last):
File "<stdin>", line 1, in ?
StopIteration
###
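In case it helps, here's roughly what that piecemeal for loop might look
like.  This is my own sketch, not part of the examples above: the
generator is repeated so the snippet runs on its own, and the '//'
delimiter just echoes the biological-record case you mentioned.

```python
from io import StringIO  # stands in for a real (possibly huge) file


def breakIntoRecords(read, delimiter):
    """Same generator as above, repeated so this snippet is self-contained."""
    buffer = []
    while True:
        nextChar = read(1)
        if not nextChar:
            break
        buffer.append(nextChar)
        if buffer[-len(delimiter):] == list(delimiter):
            yield ''.join(buffer)
            buffer = []
    if buffer:
        yield ''.join(buffer)


textFile = StringIO('record one//record two//record three')
for record in breakIntoRecords(textFile.read, '//'):
    # Only one record lives in memory at a time.
    print(record)
```

With a real file, you'd pass someFile.read instead of textFile.read, and
the loop body would do whatever per-record processing you need.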
So as long as no single record is larger than system memory, this should
be efficient.  This solution assumes that your delimiters are static
strings.
(The function above can be improved: we could probably do better by
reading the file as blocks of characters, rather than check for the
delimiter on each character read.)
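Here's a rough sketch of that block-reading idea (my own follow-up, so
take it with a grain of salt): pull in a few thousand characters per
read() call and let str.split() locate the delimiter, instead of
inspecting every character in Python code.

```python
from io import StringIO  # the examples above used the old StringIO module


def breakIntoRecordsBlocked(read, delimiter, blockSize=4096):
    """Like breakIntoRecords, but reads blockSize characters at a time
    and uses str.split() to find delimiters, rather than checking after
    every single character."""
    buffer = ''
    while True:
        block = read(blockSize)
        if not block:
            break
        buffer += block
        pieces = buffer.split(delimiter)
        # Every piece except the last is a complete record; the last
        # piece may be cut off mid-record (or mid-delimiter), so keep
        # it in the buffer for the next round.
        for piece in pieces[:-1]:
            yield piece + delimiter
        buffer = pieces[-1]
    if buffer:
        yield buffer


textFile = StringIO('hello world\nthis is a test\ncan you see this?')
print(list(breakIntoRecordsBlocked(textFile.read, '\n')))
```

Keeping the unsplit tail in the buffer also handles a multi-character
delimiter that straddles a block boundary.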
> I know in perl there is an input record separator which by default is
> the newline and one can specify this to a specific delimiter. Is there
> one in python?
Not certain. Python has adopted 'universal newlines' support,
http://www.python.org/doc/2.3.4/whatsnew/node7.html
This allows us to tell Python to guess between the three main standard
ways to define a line. So there may be something in Python that we might
be able to reuse, to redefine a "line" as something else. But I'm not
sure how easily accessible this might be.
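I'm not aware of a direct equivalent of Perl's $/ variable, but for the
common blank-line "paragraph mode" ($/ = ""), something like
itertools.groupby (Python 2.4 and up) can fake it.  Again, this is just
an illustration of mine, not a built-in record-separator setting:

```python
from io import StringIO
from itertools import groupby


def paragraphs(lines):
    """Yield blank-line-separated paragraphs from an iterable of lines."""
    # groupby clusters consecutive lines by whether they are blank;
    # each non-blank cluster is one paragraph.
    for isBlank, group in groupby(lines, key=lambda line: line.strip() == ''):
        if not isBlank:
            yield ''.join(group)


textFile = StringIO('first paragraph\nstill first\n\nsecond paragraph\n')
print(list(paragraphs(textFile)))
```

Since files iterate line by line, this also reads only as much as it
needs to produce each paragraph.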
Good luck to you!