[pypy-issue] [issue786] Module re, \d does with re.U option does not work the same way as with CPython

Fri Jul 8 05:38:10 CEST 2011

New submission from Simon <simon.corston at nuance.com>:

PyPy always interprets \d as [0-9], but with the re.U option CPython interprets
it as any character that has the Unicode digit property.

This means that in CPython, \d matches SUPERSCRIPT THREE (U+00B3) when used with
re.U. In PyPy it does not, so the same regex produces different output.
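
The distinction can be checked independently of any regex engine with the
standard unicodedata module. This is only a minimal sketch: it assumes the
"digit" attribute mentioned above is the Unicode digit property, and the two
helper names are illustrative, not part of CPython or PyPy.

```python
import unicodedata

ch = u'\N{SUPERSCRIPT THREE}'

def is_ascii_digit(c):
    # Hypothetical helper: the set that [0-9] matches.
    return u'0' <= c <= u'9'

def has_digit_property(c):
    # Hypothetical helper: True when the character carries a Unicode
    # digit value (unicodedata.digit returns the default otherwise).
    return unicodedata.digit(c, None) is not None

print(is_ascii_digit(ch))         # False
print(has_digit_property(ch))     # True
print(unicodedata.category(ch))   # No
```

The character is therefore outside [0-9] but still carries a digit value, which
is the case where the two interpretations of \d disagree.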

This behavior is analogous to the interpretation of \w under the re.U
switch, which _does_ work correctly in PyPy.
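
For comparison, here is a minimal sketch of the \w case. The sample character is
an arbitrary choice made for illustration and is not taken from the attached
repro.

```python
import re

# Assumed sample: a letter outside the ASCII range.
word_char = u'\N{LATIN SMALL LETTER E WITH ACUTE}'

# With re.U, \w is expected to match Unicode word characters,
# not only [a-zA-Z0-9_].
unicode_word = re.compile(r'\w', re.U)

print(unicode_word.match(word_char) is not None)   # True in CPython
```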

Repro code attached.

----------
files: repro.py
messages: 2747
nosy: linguist, pypy-issue
priority: bug
release: 1.5
status: unread
title: Module re, \d does with re.U option does not work the same way as with CPython

________________________________________
PyPy bug tracker <tracker at bugs.pypy.org>
<https://bugs.pypy.org/issue786>
________________________________________
-------------- next part --------------
import re

foo=u'\N{SUPERSCRIPT THREE}'

# With the re.U switch, \d matches more than just 0-9
unicodeDigitMatcher=re.compile(r'\d', re.U)

# Prints True in CPython, False in PyPy
print unicodeDigitMatcher.match(foo) is not None

# Without the re.U switch, \d matches only 0-9
arabicDigitMatcher=re.compile(r'\d')

# Prints False in both CPython and PyPy
print arabicDigitMatcher.match(foo) is not None