[Tutor] regex eats even when not hungry
Kent Johnson
kent37 at tds.net
Fri Feb 16 18:27:54 CET 2007
Thomas wrote:
> I have the following mostly working function to strip the first 4
> digit year out of some text. But a leading space confounds it for
> years starting 20..:
>
> import re
> def getyear(text):
>     s = """(?:.*?(19\d\d)|(20\d\d).*?)"""
>     p = re.compile(s,re.IGNORECASE|re.DOTALL) #|re.VERBOSE
>     y = p.match(text)
>     try:
>         return y.group(1) or y.group(2)
>     except:
>         return ''
>
> >>> getyear('2002')
> '2002'
> >>> getyear(' 2002')
> ''
> >>> getyear(' 1902')
> '1902'
>
> A regex of ".*?" means any number of any characters, with a non-greedy
> hunger (so to speak) right?
>
> Any ideas on what is causing this to fail?
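The | character has very low precedence in a regex, so it splits everything
inside your (?:...) group into two alternatives. You are matching either
- any number of characters followed by 19xx
or
- 20xx followed by any number of characters
Since match() only tries at the start of the string, the 20xx alternative
fails as soon as anything, even a single space, comes before the year.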
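
To make the split concrete, here is a small sketch that writes the two alternatives as separate patterns. The names left and right are only for illustration, they are not in Thomas's code:

```python
import re

# Thomas's original pattern: the top-level | splits it in two.
original = re.compile(r"(?:.*?(19\d\d)|(20\d\d).*?)", re.IGNORECASE | re.DOTALL)

# The same two alternatives written out on their own (illustrative names).
left = re.compile(r".*?(19\d\d)", re.DOTALL)    # anything, then 19xx
right = re.compile(r"(20\d\d).*?", re.DOTALL)   # 20xx, then anything

text = " 2002"
# match() only tries position 0, and neither alternative fits there:
# left never finds a 19xx, right needs "20" as the very first characters.
print(left.match(text))      # None
print(right.match(text))     # None
print(original.match(text))  # None
```

With ' 1902' the left alternative succeeds, which is why only the 20xx years are affected by the leading space.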
You could use this instead:
    .*?(?:(19\d\d)|(20\d\d)).*?
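
Dropped into the original function, that pattern might look like the sketch below. The explicit None check in place of the bare try/except is my own choice, not something the original code requires:

```python
import re

def getyear(text):
    # The inner (?:...) confines the | to the two year forms, so the
    # leading .*? now applies to both of them. The trailing .*? is lazy
    # and matches nothing; it is kept only to mirror the pattern above.
    p = re.compile(r".*?(?:(19\d\d)|(20\d\d)).*?", re.DOTALL)
    y = p.match(text)
    if y is None:
        return ''
    # Only one of the two groups can have matched.
    return y.group(1) or y.group(2)

print(getyear('2002'))   # 2002
print(getyear(' 2002'))  # 2002
print(getyear(' 1902'))  # 1902
```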
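But why not use p.search(), which finds the pattern anywhere in the string
without needing the wildcards? Then your regex could be just
    19\d\d|20\d\d
and you return just y.group(). Keep the check for a failed match, since
search() returns None when the text contains no year.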
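
A minimal sketch of the search() version, assuming you still want an empty string when no year is found. The module-level name YEAR is just for illustration:

```python
import re

# No groups and no wildcards needed: search() scans the whole string.
YEAR = re.compile(r"19\d\d|20\d\d")

def getyear(text):
    y = YEAR.search(text)
    # search() returns None when nothing matches.
    return y.group() if y else ''

print(getyear('2002'))   # 2002
print(getyear(' 2002'))  # 2002
print(getyear(' 1902'))  # 1902
```

Since search() returns the leftmost match, this still gives you the first year in the text.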
Kent