Can python read up to where a certain pattern is matched?

Andrew Bennetts andrew-pythonlist at puzzling.org
Sat Mar 6 03:56:48 CET 2004


On Fri, Mar 05, 2004 at 06:16:34PM -0800, Anthony Liu wrote:
[...]
> read(), readline(), readlines() is what I want. I want
> to read a text file sentence by sentence. 
> 
> A sentence by definition is roughly the part between a
> full stop and another full stop or !, ?
> 
> So, for example, for the following text:
> 
> "Some words here, and some other words. Then another
> segment follows, and more. This is a question, a junk
> question, followed by a question mark?"
> 
> It has 3 sentences (2 full stops and 1 question mark),
> and therefore I want to read it in 3 lumps and each
> lump gives me one complete sentence as follows:
> 
> lump 1: Some words here, and some other words.
> 
> lump 2: Then another segment follows, and more.
> 
> lump 3: This is a question, a junk question, followed
> by a question mark?
> 
> How can I achieve this?  Do we have a readsentence()
> function?

You can easily write iterators yourself using generators:

----
from io import StringIO

def chars(f):
    # Yield one character at a time; read(1) returns '' at end of file.
    for char in iter(lambda: f.read(1), ''):
        yield char

def sentences(iterable):
    # Accumulate characters and yield a sentence at each terminator.
    sentence = ''
    for char in iterable:
        sentence += char
        if char in ('.', '!', '?'):
            yield sentence.strip()
            sentence = ''
    # Yield any trailing text that has no terminator.
    sentence = sentence.strip()
    if sentence:
        yield sentence

testFile = StringIO(
"""Some words here, and some other words. Then another
segment follows, and more. This is a question, a junk
question, followed by a question mark?""")

for count, s in enumerate(sentences(chars(testFile))):
    print('lump %d: %r' % (count+1, s))
----

This gives the following output:

lump 1: 'Some words here, and some other words.'
lump 2: 'Then another\nsegment follows, and more.'
lump 3: 'This is a question, a junk\nquestion, followed by a question mark?'
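
The lumps keep the line breaks from the file, as the '\n' in lumps 2 and 3 shows. If each sentence should come out on a single line, one option is to collapse runs of whitespace afterwards. A minimal sketch; the helper name one_line is made up for illustration:

```python
def one_line(sentence):
    # split() with no arguments splits on any run of whitespace,
    # including '\n', so joining with ' ' collapses line breaks.
    return ' '.join(sentence.split())

print(one_line('Then another\nsegment follows, and more.'))
# Then another segment follows, and more.
```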

Unfortunately (but not surprisingly), the re module only accepts strings,
not iterables; otherwise you could use re.finditer rather than writing your
own less general function.
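
For a file small enough to read into memory in one go, re.finditer can be applied to the resulting string. A minimal sketch under that assumption; the name sentences_from_string and the pattern are illustrative choices, not anything standard:

```python
import re

# Either a run of text ending in a terminator, or trailing text that
# reaches the end of the string without one.
SENTENCE = re.compile(r'[^.!?]*[.!?]|[^.!?]+$')

def sentences_from_string(text):
    # Hypothetical helper: assumes the whole text is already one string.
    for match in SENTENCE.finditer(text):
        sentence = match.group().strip()
        if sentence:
            yield sentence

text = "Some words. More here! Tail"
for count, s in enumerate(sentences_from_string(text)):
    print('lump %d: %r' % (count + 1, s))
```

With a real file this would be called as sentences_from_string(f.read()), which trades the streaming behaviour for simplicity.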

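Note that I've only lightly tested that code, and it's probably very
inefficient: it reads the file one character at a time and builds each
sentence by repeated string concatenation.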
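
One way to cut the per-character overhead is to read a block at a time, pull complete sentences out of the buffer with a regular expression, and carry the unfinished remainder over to the next block. The following is a sketch under those assumptions, not a tuned implementation; the name sentences_chunked and the default chunk size are arbitrary:

```python
import re
from io import StringIO

# A run of text (possibly empty) followed by one sentence terminator.
TERMINATED = re.compile(r'[^.!?]*[.!?]')

def sentences_chunked(f, chunk_size=4096):
    # Hypothetical helper: f is any object with a read(n) method
    # that returns '' at end of file.
    buffer = ''
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        end = 0
        for match in TERMINATED.finditer(buffer):
            sentence = match.group().strip()
            if sentence:
                yield sentence
            end = match.end()
        # Keep the unterminated remainder for the next chunk.
        buffer = buffer[end:]
    # Yield any trailing text that has no terminator.
    tail = buffer.strip()
    if tail:
        yield tail

test_file = StringIO("One. Two! Three? Four")
print(list(sentences_chunked(test_file, chunk_size=5)))
# ['One.', 'Two!', 'Three?', 'Four']
```

The tiny chunk_size in the usage line is only there to exercise sentences that straddle a chunk boundary.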

-Andrew.




