matching a sentence, greedy up!

Mon Aug 11 05:03:55 EDT 2003

Christian Buck wrote:
> Hi,
> 
> I'm writing a regexp that matches complete sentences in a German text
> and correctly ignores abbreviations. Here is a very simplified version of
> it; as soon as it works I could post the complete regexp if anyone is
> interested (actually 11 KB):
> 
> [A-Z](?:[^\.\?\!]+|[^a-zA-Z0-9\-_](?:[a-zA-Z0-9\-_]\.|\d+\.|a\.[\s\-]?A
> \.)){3,}[\.\?\!]+(?!\s[a-z])
> 
> As you see, I use [] for charsets because I don't want to depend on
> locales, and speed doesn't matter. (I removed German chars in the above
> example.) I also allow - and _ within a sentence.
> 
> OK, this is what I think it should do:
> [A-Z]                 - start with an uppercase char
> (?:                   - group without capturing
> [^\.\?\!]+            - eat everything that does not look like an end
> |                     - OR
> [^a-zA-Z0-9\-_]       - accept a non-word character
> (?:                   - followed by ...
> [a-zA-Z0-9\-_]\.      - a char and a dot like 'i.', '1.' (doesn't work!)
> |                     - OR
> \d+\.                 - a number and a dot
> |                     - OR
> a\.[\s\-]?A\.         - some common abbreviations (only one here)
> )){3,}                - some times, at least 3
> [\.\?\!]+             - this is the end, and should also match '...'
> (?!\s[a-z])           - not followed by a space and a lowercase char
> 
> Here is a sample script:
> 
> - snip -
> import re
> 
> s = 'My text may i. E. look like this: This is the end.'
> # same pattern as above, split over several raw-string pieces
> re_satz = re.compile(r'[A-Z](?:[^\.\?\!]+|'
>     r'[^a-zA-Z0-9\-_](?:[a-zA-Z0-9\-_]\.|'
>     r'\d+\.|a\.[\s\-]?A\.)){3,}[\.\?\!]+'
>     r'(?!\s[a-z])')
> mo = re_satz.search(s)
> if mo:
>     print("found!")
>     # findall returns whole matches, since the pattern has no capturing groups
>     sentences = re_satz.findall(s)
>     for sentence in sentences:  # do not reuse the name s
>         print("Sentence:  " + sentence)
> else:
>     print("not found :-(")
> 
> - snip -
> 
> Output:
>     	found!
>     	Sentence:  My text may i.
>     	Sentence:  This is the end.
> 
> Why isn't the above regexp greedier, so that it matches the whole sentence?
> 

First, you don't need to escape '.', '?' or '!' within a character group [].
Only ']', '\' and, depending on their position, '^' and '-' are special there.
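
As a small illustration of the point about escaping, here is a sketch that is
not part of the original script. The sample string `text` and the variable
names are made up for this example; only standard `re` calls are used.

```python
import re

# Made-up sample text, only for this illustration.
text = "Is this it? Yes! Done."

# Escaped and unescaped spellings of the same character class.
escaped = re.findall(r"[\.\?\!]", text)
plain = re.findall(r"[.?!]", text)

print(escaped == plain)  # True
print(plain)             # ['?', '!', '.']

# A hyphen is the character that needs some care: placed last, it is literal.
hyphen = re.findall(r"[a-z_-]", "a-b_c")
print(hyphen)            # ['a', '-', 'b', '_', 'c']
```

Both spellings behave the same, so the unescaped form is simply easier to read.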

The very first part, r'[A-Z](?:[^\.\?\!]+, cannot be greedier since
you exclude the '.', so it matches up to but not including the first dot.
That dot then satisfies [\.\?\!]+, and because the next word starts with an
uppercase 'E', the lookahead (?!\s[a-z]) does not object. As far as I can
see, nothing else fits, so the output is just what I expected. How do you
think you can differentiate between the end of a sentence and (the first
part of) an abbreviation?
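
To make this concrete, here is a minimal sketch that reuses the string and the
pattern from the quoted script, copied unchanged. The names `first`,
`pattern` and `sentences` are only for illustration.

```python
import re

s = 'My text may i. E. look like this: This is the end.'

# The first alternative on its own: it cannot cross a '.', '?' or '!'.
first = re.match(r'[A-Z][^.?!]+', s)
print(first.group())  # My text may i

# The pattern from the quoted script, unchanged.
pattern = re.compile(
    r'[A-Z](?:[^\.\?\!]+|'
    r'[^a-zA-Z0-9\-_](?:[a-zA-Z0-9\-_]\.|'
    r'\d+\.|a\.[\s\-]?A\.)){3,}[\.\?\!]+'
    r'(?!\s[a-z])'
)

# The dot after 'i' is accepted as a sentence end because the text that
# follows is ' E', and the lookahead only rejects a lowercase letter.
sentences = pattern.findall(s)
print(sentences)  # ['My text may i.', 'This is the end.']
```

The second result starts at 'This' rather than at 'E', because after 'E' the
group cannot complete its three required repetitions.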

-- 
Helmut Jarausch

Lehrstuhl fuer Numerische Mathematik
RWTH - Aachen University
D 52056 Aachen, Germany