[issue10703] Regex 0.1.20101210

Steve Moran report at bugs.python.org
Tue Dec 14 18:41:57 CET 2010


New submission from Steve Moran <stiv at uw.edu>:

The regex package doesn't seem to correctly implement the single grapheme match "\X" (\P{M}\p{M}*) on Python versions before 3. I'm using the string "íi-te" (i, U+0301, i, -, t, e -- where U+0301 is Unicode COMBINING ACUTE ACCENT), reading it in from a file to bypass Unicode copy-and-paste issues in the older IDLEs.
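For anyone wanting to reproduce this, here is a minimal sketch of how a "test_data" file like the one described above could be created without pasting the accented character anywhere. The file contents are an assumption based on the description (one line, UTF-8, the six code points listed). It uses io.open so the same script should run on 2.7 and on current Python 3; the u"" prefix is not accepted by 3.1 or 3.2, so drop it there.

```python
# -*- coding: utf-8 -*-
import io

# Code points: i, COMBINING ACUTE ACCENT (U+0301), i, -, t, e, newline.
# Written with an escape so no literal combining character is pasted.
text = u"i\u0301i-te\n"

# Assumed file name, taken from the sessions in this report.
with io.open("test_data", "w", encoding="utf-8") as f:
    f.write(text)

# Read it back the same way the sessions do and show the code points.
with io.open("test_data", "r", encoding="utf-8") as f:
    s = f.readline()

print([hex(ord(c)) for c in s])
```

The printed list makes it easy to confirm that the accent is stored as a separate code point (0x301) after the first "i" rather than as a precomposed character.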


stiv at x$ python3.1
Python 3.1.2 (r312:79147, May 19 2010, 11:50:28) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-46)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex
>>> file = open("test_data", "rt", encoding="utf-8")
>>> g = regex.compile("\X")
>>> s = file.readline()
>>> print (s)
íi-te
>>> print (g.findall(s))
['í', 'i', '-', 't', 'e']

* Correct in 3.1 -- i + U+0301 is treated as one grapheme.

stiv at x$ python2.7
Python 2.7 (r27:82500, Oct  4 2010, 14:49:53) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import codecs                                
>>> import regex
>>> file = codecs.open("test_data", "r", "utf-8")
>>> g = regex.compile("\X")
>>> s = file.readline()
>>> s
u'i\u0301i-te'
>>> print s.encode("utf-8")
íi-te
>>> print g.findall(s)
[u'i', u'\u0301', u'i', u'-', u't', u'e']

* Not correct in 2.7 -- the accent is treated as a separate character.
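For comparison, the grouping I would expect under the \P{M}\p{M}* definition can be sketched with only the standard library. This is not how the regex package implements "\X"; split_graphemes is a hypothetical helper written just for this report, and it handles only the simple base-plus-combining-marks case.

```python
import unicodedata

def split_graphemes(text):
    """Group each non-mark code point with the combining marks after it.

    Hypothetical helper approximating \\P{M}\\p{M}*. Unlike that pattern,
    a mark at the very start of the text (no base character) is kept as
    its own cluster here instead of being skipped.
    """
    clusters = []
    for ch in text:
        # General categories Mn, Mc and Me are the Unicode "Mark" classes.
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and clusters:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

s = u"i\u0301i-te"
print(split_graphemes(s))
```

On the test string this yields five clusters, with i + U+0301 kept together as the first one, which matches the 3.1 result above.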

Thanks.

----------
components: Regular Expressions
messages: 123961
nosy: stiv
priority: normal
severity: normal
status: open
title: Regex 0.1.20101210
type: behavior
versions: Python 2.7

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue10703>
_______________________________________

