[ python-Bugs-1611131 ] \b in unicode regex gives strange results
SourceForge.net
noreply at sourceforge.net
Fri Dec 8 21:51:12 CET 2006
Bugs item #1611131, was opened at 2006-12-07 21:44
Message generated for change (Comment added) made by gbrandl
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1611131&group_id=5470
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Regular Expressions
Group: Python 2.5
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: akaihola (akaihola)
Assigned to: Gustavo Niemeyer (niemeyer)
Summary: \b in unicode regex gives strange results
Initial Comment:
The problem: This doesn't give a match:
>>> re.match(r'ä\b', 'ä ', re.UNICODE)
This works ok and gives a match:
>>> re.match(r'.\b', 'ä ', re.UNICODE)
Both of these work as well:
>>> re.match(r'a\b', 'a ', re.UNICODE)
>>> re.match(r'.\b', 'a ', re.UNICODE)
The docs define \b as the empty string between \w and \W. These match accordingly:
>>> re.match(r'\w', 'ä', re.UNICODE)
>>> re.match(r'\w', 'a', re.UNICODE)
>>> re.match(r'\W', ' ', re.UNICODE)
So something strange happens in my first example, and I can only assume it's a bug.
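For reference, the documented behaviour can be sketched with Unicode strings on both sides. This is only an illustration, assuming Python 3 (where every str is Unicode), not the 2.5 interpreter the report was filed against; the helper name word_boundaries and the sample text are made up for the example.

```python
import re

def word_boundaries(text):
    # Offsets where \b matches: one side is \w, the other is \W
    # or the start/end of the string.
    return [m.start() for m in re.finditer(r'\b', text, re.UNICODE)]

# 'äiti' occupies offsets 0-3, the space is at 4, 'on' is at 5-6.
print(word_boundaries('äiti on'))   # [0, 4, 5, 7]

# With Unicode strings the first example from the report does match.
print(re.match(r'ä\b', 'ä ', re.UNICODE) is not None)   # True
```

Here 'ä' counts as \w, so a boundary is found after it just as after an ASCII letter.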
----------------------------------------------------------------------
>Comment By: Georg Brandl (gbrandl)
Date: 2006-12-08 20:51
Message:
Logged In: YES
user_id=849994
Originator: NO
FWIW, the first example works fine for me with and without Unicode
strings.
----------------------------------------------------------------------
Comment By: Martin v. Löwis (loewis)
Date: 2006-12-08 17:18
Message:
Logged In: YES
user_id=21627
Originator: NO
Notice that the re.UNICODE flag is only meaningful if you are using
Unicode strings; in the examples you give, you are using byte strings.
Please re-test with Unicode strings both as the expression and as the
string to match.
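One possible reading of the reported results, offered as a guess rather than a diagnosis: if the reporter's terminal was UTF-8, the byte string 'ä' was really two bytes (0xC3 0xA4), and each byte was classified on its own. The sketch below imitates that in Python 3 by decoding the UTF-8 bytes as Latin-1, so each byte becomes one character; the UTF-8 terminal and the per-byte classification are assumptions, and the names A_UMLAUT and matched are hypothetical.

```python
import re

# Assumption: 'ä' arrived as the two UTF-8 bytes 0xC3 0xA4.
# Decoding them as Latin-1 turns each byte into one code point:
# U+00C3 (a letter, so \w) and U+00A4 (currency sign, so \W).
A_UMLAUT = 'ä'.encode('utf-8').decode('latin-1')
per_byte_subject = A_UMLAUT + ' '

def matched(pattern, text):
    # Return the matched text, or None if there is no match.
    m = re.match(pattern, text, re.UNICODE)
    return m.group(0) if m else None

# No boundary between U+00A4 and the space: both are \W.
print(matched(A_UMLAUT + r'\b', per_byte_subject))   # None

# '.' takes only the first "byte"; the boundary falls inside 'ä'.
print(matched(r'.\b', per_byte_subject) == '\u00c3')   # True

# \w also matches only the first "byte".
print(matched(r'\w', A_UMLAUT) == '\u00c3')   # True

# With genuine Unicode strings the original pattern matches.
print(matched(r'ä\b', 'ä '))   # ä
```

Under these assumptions the sketch reproduces every result listed in the initial comment.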
----------------------------------------------------------------------
Comment By: akaihola (akaihola)
Date: 2006-12-07 22:18
Message:
Logged In: YES
user_id=1432932
Originator: YES
As a work-around I currently use a regex like r'ä(?=\W)'. It seems to work
ok.
Also, the \b problem doesn't seem to exist in the \W\w case, i.e. at the
beginning of words.
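A caveat on the lookahead work-around, sketched under the assumption of Python 3 Unicode strings: (?=\W) needs an actual non-word character after the 'ä', so unlike \b it fails at the end of the string. A negative lookahead, (?!\w), is shown as one possible alternative; it is not something proposed in this thread, and the helper name matches is hypothetical.

```python
import re

def matches(pattern, text):
    # True if the pattern matches at the start of text.
    return re.match(pattern, text, re.UNICODE) is not None

# The work-around behaves like \b when a non-word character follows...
print(matches(r'ä(?=\W)', 'ä '))   # True
# ...but not at the end of the string, where \b still matches.
print(matches(r'ä(?=\W)', 'ä'))    # False
print(matches(r'ä\b', 'ä'))        # True

# A negative lookahead accepts both cases and still rejects 'äb'.
print(matches(r'ä(?!\w)', 'ä '))   # True
print(matches(r'ä(?!\w)', 'ä'))    # True
print(matches(r'ä(?!\w)', 'äb'))   # False
```

For a pattern that ends in a word character, (?!\w) accepts the same positions as a trailing \b.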
----------------------------------------------------------------------