[issue2636] Regexp 2.7 (modifications to current re 2.2.2)
report at bugs.python.org
Tue Sep 21 01:51:37 CEST 2010
Vlastimil Brom <vlastimil.brom at gmail.com> added the comment:
I like the idea of a general "new" flag introducing the reasonable, backwards-incompatible behaviour; one doesn't have to remember a list of non-standard flags to get these features.
While I recognise that the module probably can't work correctly with wide unicode characters on a narrow Python build (py 2.7, win XP in this case), I noticed a difference from re in this regard (it might be due to the absence of the wide unicode literal in the latter).
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "C:\Python27\lib\regex.py", line 203, in findall
return _compile(pattern, flags).findall(string, pos, endpos,
File "C:\Python27\lib\regex.py", line 310, in _compile
parsed = parsed.optimise(info)
File "C:\Python27\lib\_regex_core.py", line 1735, in optimise
File "C:\Python27\lib\_regex_core.py", line 1727, in is_case_sensitive
return char_type(self.value).lower() != char_type(self.value).upper()
ValueError: unichr() arg not in range(0x10000) (narrow Python build)
I.e. re fails to match this pattern (as it actually looks for the literal "U00010337"); regex doesn't recognise the wide unicode character as a surrogate pair either, but it additionally raises an error from the narrow unichr. I'm not sure whether or how this should be fixed, but the difference depending on the i-flag seems unusual.
Of course it would be nice if surrogate pairs were interpreted, but I can imagine that it would open a whole can of worms, as this is not thoroughly supported in the builtin unicode type either (len, indices, slicing).
I am trying to make wide unicode characters somehow usable in my app, mainly with hacks like extended unichr
or likewise for ord
surrog_ord = (ord(first) - 0xD800) * 0x400 + (ord(second) - 0xDC00) + 0x10000
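For illustration, the surrogate arithmetic above and its inverse can be sketched on plain integers, so that it runs on any build. The function names to_surrogates and from_surrogates are hypothetical and are not taken from the original helpers, which are not shown here.

```python
# A minimal sketch of UTF-16 surrogate arithmetic on integer code values.
# Function names are hypothetical; the original "extended unichr/ord"
# helpers are not reproduced here.

def to_surrogates(code_point):
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside U+10000..U+10FFFF")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # Same formula as surrog_ord above, applied to integer values.
    return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000


high, low = to_surrogates(0x10337)
print(hex(high), hex(low))              # 0xd800 0xdf37
print(hex(from_surrogates(high, low)))  # 0x10337
```

On a narrow build, an extended unichr would then amount to unichr(high) + unichr(low), and an extended ord to from_surrogates(ord(first), ord(second)).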
Actually, using regex, one can work around some of these limitations of len, index or slice by using a list form of the string containing surrogates:
[u'a', u'b', u'\U00010337', u'\U00010338', u'\U00010339', u'c', u'd']
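A list like the one above could be produced roughly as follows. This is only a sketch under assumptions: the pattern actually used is not shown in this message, the standard re module stands in for regex, and the helper name split_surrogate_aware is hypothetical. The surrogate pairs are spelled out explicitly so that the example behaves the same on builds where strings are not stored as UTF-16 code units.

```python
import re

# Match a high surrogate followed by a low surrogate as one item,
# otherwise fall back to any single code unit (including newlines).
# This pattern is an assumption, not the one from the original report.
_PAIR_OR_CHAR = re.compile(u"[\ud800-\udbff][\udc00-\udfff]|.", re.DOTALL)


def split_surrogate_aware(text):
    """Return the string as a list with each surrogate pair kept together."""
    return _PAIR_OR_CHAR.findall(text)


# U+10337..U+10339 written as explicit surrogate pairs.
sample = u"ab\ud800\udf37\ud800\udf38\ud800\udf39cd"
items = split_surrogate_aware(sample)

print(len(sample))  # 10 code units
print(len(items))   # 7 logical characters
```

len, indexing and slicing on the resulting list then count each pair as a single element, which is the workaround described above.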
but apparently things like wide unicode literals or character sets (or even extending shorthands like \w) are much more complicated.
Python tracker <report at bugs.python.org>