[Python-Dev] [issue2636] Regexp 2.7 (modifications to current re 2.2.2)

Thu Feb 18 01:50:26 CET 2010

Vlastimil Brom wrote:
> Vlastimil Brom <vlastimil.brom at gmail.com> added the comment:
> 
> I just tested the fix for unicode tracebacks and found some possibly weird results (not sure how/whether it should be fixed, as these inputs are indeed rather artificial...).
> (win XPp SP3 Czech, Python 2.6.4)
> 
> Using the cmd console, the output is fine (for the characters it can accept and display)
> 
>>>> regex.findall(ur"\p{InBasicLatinĚ}", u"aé")
> Traceback (most recent call last):
> ...
>   File "C:\Python26\lib\regex.py", line 1244, in _parse_property
>     raise error("undefined property name '%s'" % name)
> regex.error: undefined property name 'InBasicLatinĚ'
> 
> (same result for other distorted "property names" containing e.g. ěščřžýáíéúůßäëiöüîô ...)
> 
> However, in Idle the output differs depending on the characters present
> 
>>>> regex.findall(ur"\p{InBasicLatinÉ}", u"ab c")
> yields the expected
> ...
>   File "C:\Python26\lib\regex.py", line 1244, in _parse_property
>     raise error("undefined property name '%s'" % name)
> error: undefined property name 'InBasicLatinÉ'
> 
> but
> 
>>>> regex.findall(ur"\p{InBasicLatinĚ}", u"ab c")
> 
> Traceback (most recent call last):
> ...
>   File "C:\Python26\lib\regex.py", line 1244, in _parse_property
>     raise error("undefined property name '%s'" % name)
>   File "C:\Python26\lib\regex.py", line 167, in __init__
>     message = message.encode(sys.stdout.encoding)
>   File "C:\Python26\lib\encodings\cp1250.py", line 12, in encode
>     return codecs.charmap_encode(input,errors,encoding_table)
> UnicodeEncodeError: 'charmap' codec can't encode character u'\xcc' in position 37: character maps to <undefined>
> 
> which might be surprising, as cp1250 should be able to encode "Ě"; maybe there is some intermediate ASCII step?
> 
> using the wxpython pyShell I get its specific encoding error:
> 
> regex.findall(ur"\p{InBasicLatinÉ}", u"ab c")
> Traceback (most recent call last):
> ...
>   File "C:\Python26\lib\regex.py", line 1102, in _parse_escape
>     return _parse_property(source, info, in_set, ch)
>   File "C:\Python26\lib\regex.py", line 1244, in _parse_property
>     raise error("undefined property name '%s'" % name)
>   File "C:\Python26\lib\regex.py", line 167, in __init__
>     message = message.encode(sys.stdout.encoding)
> AttributeError: PseudoFileOut instance has no attribute 'encoding'
> 
> (the same for \p{InBasicLatinĚ} etc.)
> 
Maybe it shouldn't show the property name at all. That would avoid the
problem.
> 
> In python 3.1 in Idle, all of these exceptions are displayed correctly, also in other scripts or with special characters.
> 
> Maybe in python 2.x e.g. repr(...) of the unicode error messages could be used in order to avoid these problems, but I don't know what the conventions are in these cases.
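
A repr-style fallback along the lines suggested above could also guard
against streams that have no usable encoding. The following is only a
minimal sketch: safe_message is a hypothetical helper, not something in
regex.py, and it assumes that falling back to ASCII with backslash
escapes is acceptable when the stream does not report an encoding.

```python
import sys

def safe_message(message, stream=None):
    """Return `message` in a form that `stream` should be able to display.

    Hypothetical helper for illustration; not part of the regex module.
    """
    stream = sys.stdout if stream is None else stream
    # Replacement streams in some GUI shells have no `encoding` attribute,
    # or have it set to None, so fall back to plain ASCII in that case.
    encoding = getattr(stream, "encoding", None) or "ascii"
    # "backslashreplace" turns unencodable characters into escapes such as
    # \u011a instead of raising UnicodeEncodeError.
    return message.encode(encoding, "backslashreplace").decode(encoding)
```

With a stream that lacks an encoding, the name is shown in escaped form
rather than triggering a second exception while the first is being built.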
> 
> 
> Another issue I found here (unrelated to tracebacks) is backslashes or punctuation (except the handled -_) in the property names, which just lead to failed matches and no exceptions about unknown property names
> 
> regex.findall(u"\p{InBasic.Latin}", u"ab c")
> []
> 
In the re module a malformed pattern is sometimes treated as a literal:

 >>> re.match(r"a{1,2", r"a{1,2").group()
'a{1,2'

which is what I'm trying to replicate, as far as possible.

Which characters should it accept when parsing the property name, even
if it subsequently rejects the name? I don't want it to accept every
character until it sees the closing '}'. I currently include
alphanumeric, whitespace, '&', '_' and '-'. '.' might be a reasonable
addition.
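
To make the question concrete, here is a rough sketch of that scanning
rule. It is not the code in regex.py: scan_property_name and its return
convention are made up for illustration, and it uses exactly the
character set listed above (so '.' is still rejected).

```python
# Extra characters allowed in a property name besides alphanumerics and
# whitespace. Adding "." here would be the change discussed above.
_EXTRA_NAME_CHARS = set("&_-")

def scan_property_name(pattern, pos):
    """Scan a '{name}' group starting at index `pos` of `pattern`.

    Returns (name, index_after_closing_brace) if the group is well formed,
    or None if it is not, in which case the caller could treat the text
    as a literal. Whether `name` is a known property is checked elsewhere.
    """
    if pos >= len(pattern) or pattern[pos] != "{":
        return None
    end = pos + 1
    while end < len(pattern):
        ch = pattern[end]
        if ch == "}":
            name = pattern[pos + 1:end]
            # An empty "{}" is treated as malformed.
            return (name, end + 1) if name else None
        if not (ch.isalnum() or ch.isspace() or ch in _EXTRA_NAME_CHARS):
            # Unexpected character: stop here instead of reading on
            # until some later '}'.
            return None
        end += 1
    # Ran off the end of the pattern without seeing '}'.
    return None
```

Under this rule a name with a stray accented letter is still scanned (and
can then be reported as undefined), while "InBasic.Latin" is not
recognised as a property name at all.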
> 
> I was also surprised by the added pos/endpos parameters, as I used flags as a non-keyword third parameter for the re functions in my code (probably my fault ...)
> 
> re.findall(pattern, string, flags=0)
> 
> regex.findall(pattern, string, pos=None, endpos=None, flags=0, overlapped=False)
> 
> (is there a specific reason for this order, or could it be changed to maintain compatibility with the current re module?)
> 
Oops! I'll fix that.
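
One compatible ordering keeps flags as the third positional parameter and
puts the new ones after it. The sketch below is only an illustration
built on the existing re module, not the regex implementation; it leaves
out overlapped, which re does not support.

```python
import re

def findall(pattern, string, flags=0, pos=None, endpos=None):
    """findall with `flags` kept in the third position, as in re.findall."""
    compiled = re.compile(pattern, flags)
    # Compiled pattern objects already accept pos and endpos.
    start = 0 if pos is None else pos
    stop = len(string) if endpos is None else endpos
    return compiled.findall(string, start, stop)
```

With this order, existing calls such as findall(p, s, re.IGNORECASE) keep
their meaning, instead of the flag value being silently taken as a start
position.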

> I hope at least some of these remarks make some sense;
> thanks for the continued work on this module!
> 
All constructive remarks are welcome! :-)