matching a sentence, greedy up!
Helmut Jarausch
jarausch at skynet.be
Mon Aug 11 05:03:55 EDT 2003
Christian Buck wrote:
> Hi,
>
> I'm writing a regexp that matches complete sentences in a German text
> and correctly ignores abbreviations. Here is a very simplified version of
> it. As soon as it works I could post the complete regexp if anyone is
> interested (actually 11 kB):
>
> [A-Z](?:[^\.\?\!]+|[^a-zA-Z0-9\-_](?:[a-zA-Z0-9\-_]\.|\d+\.|a\.[\s\-]?A\.)){3,}[\.\?\!]+(?!\s[a-z])
>
> As you see, I use [] for character sets because I don't want to depend on
> locales, and speed doesn't matter. (I removed the German characters in the
> above example.) I also allow - and _ within a sentence.
>
> OK, this is what I think it should do:
> [A-Z] - start with an uppercase char.
> (?: - don't make a group
> [^\.\?\!]+ - eat everything that does not look like an end
> | - OR
> [^a-zA-Z0-9\-_] - accept a non-word character
> (?: - followed by ...
> [a-zA-Z0-9\-_]\. - a char and a dot like 'i.', '1.' (doesn't work!!!)
> | - OR
> \d+\. - a number and a dot
> | - OR
> z\.[\s\-]?B\. - some common abbreviations (only one here)
> )){3,} - some times, at least 3
> [\.\?\!]+ - this is the end, and should also match '...'
> (?!\s[a-z]) - not followed by lowercase chars
>
> Here is a sample script:
>
> - snip -
> import re
>
> s = 'My text may i. E. look like this: This is the end.'
> # Same pattern as above, split over several string literals.
> re_satz = re.compile(r'[A-Z](?:[^\.\?\!]+|'
>                      r'[^a-zA-Z0-9\-_](?:[a-zA-Z0-9\-_]\.|'
>                      r'\d+\.|a\.[\s\-]?A\.)){3,}[\.\?\!]+('
>                      r'?:(?!\s[a-z]))')
> mo = re_satz.search(s)
> if mo:
>     print("found:")
>     # findall() returns every non-overlapping match as a string.
>     sentences = re_satz.findall(s)
>     for sentence in sentences:
>         print("Sentence: " + sentence)
> else:
>     print("not found :-(")
>
> - snip -
>
> Output:
> found!
> Sentence: My text may i.
> Sentence: This is the end.
>
> Why isn't the above regexp greedier, so that it matches the whole sentence?
>
First, you don't need to escape '.', '?' or '!' within a character class [].
The very first part, r'[A-Z](?:[^\.\?\!]+, cannot be greedier since
you exclude the '.'. So it matches up to, but not including, the first dot.
Now, as far as I can see, the match is already complete there: the final
[\.\?\!]+ consumes that dot, and because the dot is followed by ' E' rather
than by a lowercase letter, the lookahead accepts it as the end of a
sentence. So the output is just what I expected. How do you think you can
differentiate between the end of a sentence and (the first part of) an
abbreviation?
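
A minimal sketch of the points above, written for a current Python 3
interpreter rather than the Python of the original script. The names
escaped, plain, first_part and s_lower are only illustrative, and the
unescaped spelling of the full pattern is my own rewrite of the one in the
post, assumed to be equivalent to it.

```python
import re

s = 'My text may i. E. look like this: This is the end.'

# 1. Escaping '.', '?' and '!' inside [] is optional: both spellings
#    describe the same character class and match the same text.
escaped = re.compile(r'[A-Z](?:[^\.\?\!]+)')
plain = re.compile(r'[A-Z](?:[^.?!]+)')
assert escaped.match(s).group() == plain.match(s).group()

# 2. The negated class is greedy, but it can never consume the dot itself.
first_part = plain.match(s)
print(repr(first_part.group()))   # the text before the first dot
print(repr(s[first_part.end()]))  # the character that stopped it

# 3. The full pattern from the post, without the redundant escapes
#    (assumed equivalent; '-' is literal when it is last in a class).
re_satz = re.compile(r'[A-Z](?:[^.?!]+|'
                     r'[^a-zA-Z0-9_-](?:[a-zA-Z0-9_-]\.|'
                     r'\d+\.|a\.[\s-]?A\.)){3,}[.?!]+(?!\s[a-z])')
print(re_satz.findall(s))

# Same text, but with a lowercase letter after the first dot, so the
# lookahead (?!\s[a-z]) no longer accepts that dot as a sentence end.
s_lower = s.replace('i. E.', 'i. e.')
print(re_satz.findall(s_lower))
```

Comparing the two findall() results shows that, in this simplified pattern,
the case of the letter after the dot is the only thing that separates a
sentence end from an abbreviation.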
--
Helmut Jarausch
Lehrstuhl fuer Numerische Mathematik
RWTH - Aachen University
D 52056 Aachen, Germany