No explanation for weird behavior in re module!

John Machin sjmachin at lexicon.net
Sun Feb 10 21:14:20 EST 2002


synthespian <synthespian at uol.com.br> wrote in message news:<a470re$1cl30l$1 at ID-78052.news.dfncis.de>...
> Hi-
> 
> 	I'm really intrigued by this behavior:
> 
> >>> import re
> >>> p = re.compile('^(der|die|das(\s\w+))')
> >>> m = p.match('die Tür, Türen')
> >>> n = p.match('das Auto, Autos')
> >>> m.group(0)
>  'die'
> >>> m.group(1)
>  'die'
> >>> m.group(2)
>  [nothing!!!!]
> >>> n.group(0)
>  'das Auto'
> >>> n.group(1)
>  'das Auto'
> >>> n.group(2)
>  ' Auto'
> 
> 	I'm using Python2.0 on a Debian potato system.
> 	Why didn't m.group(2) produce 'Tür' as the output???
> 	Python2.0 is supposed to have Unicode support built into the re module, right?
> 	Other than the fact that 'Tür' has the 'ü' Unicode character, I fail to see any difference!

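Your pattern is asking to match one of the following:
(a) 'der'
(b) 'die'
(c) 'das' followed by one whitespace character and then one or more
word characters.
In other words, this has nothing to do with Unicode, and everything to
do with operator precedence: alternation (|) binds more loosely than
concatenation, so your second group belongs to the 'das' alternative
only. When 'die' matches, that group takes no part in the match and
m.group(2) is None. You need to put parentheses around the articles.
Forget the grammar and try your pattern on 'die Auto, Autos'; you'll
see that this also matches just 'die'.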
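To make the precedence point concrete, here is a minimal sketch that runs the posted pattern and a regrouped one side by side. It assumes a current Python 3 interpreter rather than the 2.0 in the question, and the names `original` and `regrouped` are only illustrative.

```python
import re

# Posted pattern: '|' binds loosest, so the inner group (\s\w+)
# is part of the 'das' alternative only.
original = re.compile(r'^(der|die|das(\s\w+))')

# Regrouped pattern: the three articles are enclosed on their own,
# so the following word is captured whichever article matched.
regrouped = re.compile(r'^(der|die|das)(\s\w+)')

for text in ('die Auto, Autos', 'das Auto, Autos'):
    print(text)
    print('  original :', original.match(text).groups())
    print('  regrouped:', regrouped.match(text).groups())
```

With the posted pattern, 'die Auto, Autos' yields ('die', None): the second group never participates. The regrouped pattern yields ('die', ' Auto').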

Anticipating the next raft of problems: (1) You will need to use the
re.UNICODE flag when you call re.compile(), otherwise \w will not
recognise the Unicode alphabetics (this *is* documented). (2) You may
need to give it an input whose Python type is 'unicode'; being able
to see the umlaut on your screen is not sufficient evidence of this
:-) (3) You should get into the habit of using the raw string notation
with your regexes whether it is necessary or not, else you will be
bitten in the future.
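For anyone reading this on a Python 3 interpreter, the following is a rough sketch of the same three points, not a transcript from the 2.x versions discussed here. It assumes Python 3, where str is already Unicode and \w is Unicode-aware by default for str patterns; re.ASCII is used only as an approximation of what leaving out re.UNICODE did in 2.x.

```python
import re

text = 'die T\u00fcr, T\u00fcren'   # 'die Tür, Türen' as a Python 3 str

# (1) and (2): with a str pattern and str input, \w matches the umlaut.
unicode_aware = re.compile(r'^(der|die|das)(\s\w+)')

# re.ASCII limits \w to [a-zA-Z0-9_], so the word is cut off at the umlaut.
ascii_only = re.compile(r'^(der|die|das)(\s\w+)', re.ASCII)

print(unicode_aware.match(text).groups())
print(ascii_only.match(text).groups())

# (3): without the r prefix, '\b' is a backspace character,
# not the regex word-boundary assertion.
print(re.search('\bdie\b', text))    # no match
print(re.search(r'\bdie\b', text))   # matches 'die'
```

The ASCII-only pattern captures just ' T', which is the kind of silent truncation point (1) warns about.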

Anyhow, the following works for me:

Python 2.2 (#28, Dec 21 2001, 12:21:22) [MSC 32 bit (Intel)] on win32
>>> import re
>>> p = re.compile(r'^((der|die|das)(\s\w+))', re.UNICODE)
>>> p.match('das Auto').groups()
('das Auto', 'das', ' Auto')
>>> z = u'die T\u00FCr, T\u00FCren'
>>> p.match(z).groups()
(u'die T\xfcr', u'die', u' T\xfcr')


