[Tutor] regex eats even when not hungry

Kent Johnson kent37 at tds.net
Fri Feb 16 18:27:54 CET 2007


Thomas wrote:
> I have the following mostly working function to strip the first 4
> digit year out of some text. But a leading space confounds it for
> years starting 20..:
> 
> import re
> def getyear(text):
>     s = """(?:.*?(19\d\d)|(20\d\d).*?)"""
>     p = re.compile(s,re.IGNORECASE|re.DOTALL) #|re.VERBOSE
>     y = p.match(text)
>     try:
>         return y.group(1) or y.group(2)
>     except:
>         return ''
> 
> 
> 
> >>> getyear('2002')
> '2002'
> >>> getyear(' 2002')
> ''
> >>> getyear(' 1902')
> '1902'
> 
> A regex of ".*?" means any number of any characters, with a non-greedy
> hunger (so to speak) right?
> 
> Any ideas on what is causing this to fail?

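The | character has very low precedence in a regex, so it splits 
everything inside your group into two alternatives. You are matching either
- any number of characters followed by 19xx
or
- 20xx followed by any number of characters

Because match() only matches at the start of the string, the second 
alternative succeeds only when the text begins with 20xx. That is why a 
leading space breaks ' 2002' but not ' 1902'.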
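
To see the two alternatives at work, here is a small sketch that runs the 
pattern from the question through match() and shows which group captured. 
The helper name groups_at_start is made up for this illustration; the 
pattern and flags are copied from the original function.

```python
import re

# Pattern from the question: the | splits the whole group into two branches,
#   branch 1:  .*?(19\d\d)
#   branch 2:  (20\d\d).*?
original = re.compile(r"(?:.*?(19\d\d)|(20\d\d).*?)", re.IGNORECASE | re.DOTALL)

def groups_at_start(text):
    """Return the captured groups from match(), or None if nothing matched."""
    m = original.match(text)
    return m.groups() if m else None

print(groups_at_start("2002"))   # (None, '2002')  branch 2 matches at position 0
print(groups_at_start(" 1902"))  # ('1902', None)  branch 1: .*? absorbs the space
print(groups_at_start(" 2002"))  # None            branch 2 cannot start on a space
```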

You could use this instead:
.*?(?:(19\d\d)|(20\d\d)).*?
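
As a rough sketch, this is how that regrouped pattern could be dropped into 
the original function while still using match(). The name getyear_match is 
only for illustration, and the explicit None check stands in for the bare 
try/except of the original.

```python
import re

# The alternation is now confined to the inner group, so the leading .*?
# applies to both the 19xx and the 20xx case.
fixed = re.compile(r".*?(?:(19\d\d)|(20\d\d)).*?", re.DOTALL)

def getyear_match(text):
    """Return the first 19xx or 20xx year in text, or '' if there is none."""
    m = fixed.match(text)
    if m is None:
        return ''
    # Exactly one of the two groups is set when the pattern matches.
    return m.group(1) or m.group(2)

print(getyear_match('2002'))    # 2002
print(getyear_match(' 2002'))   # 2002
print(getyear_match(' 1902'))   # 1902
```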

But why not use p.search(), which will find the pattern anywhere in the 
string without needing the wildcards? Then your regex could be just
19\d\d|20\d\d

and you return just y.group(), since the whole match is the year and 
there are no groups to choose between.
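
A minimal sketch of the search() version might look like this. The 
module-level name YEAR is my own choice, and the check for None replaces 
the try/except from the original, since search() returns None when 
nothing is found.

```python
import re

# No wildcards and no groups: search() scans the string for the first match.
YEAR = re.compile(r"19\d\d|20\d\d")

def getyear(text):
    """Return the first 19xx or 20xx year found anywhere in text, or ''."""
    m = YEAR.search(text)
    return m.group() if m else ''

print(getyear('2002'))    # 2002
print(getyear(' 2002'))   # 2002
print(getyear(' 1902'))   # 1902
```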

Kent


