[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

Tue Nov 23 16:58:03 CET 2010

Steve Moran <stiv at uw.edu> added the comment:

Forgive me if this is just a stupid oversight. 

I'm a linguist and use UTF-8 for "special" characters in linguistic data. This often includes sequences of several Unicode code points that together form one grapheme. For example, the í̵ (if it's displaying correctly for you) is a LATIN SMALL LETTER I WITH STROKE \u0268 combined with COMBINING ACUTE ACCENT \u0301. E.g. a word I'm parsing:

jí̵-e-gɨ

I was pretty excited to find out that this regex library implements the grapheme match \X (equivalent to \P{M}\p{M}*). For the above example I needed to evaluate which sequences of characters can occur across syllable boundaries (here the hyphen "-"), so I'm aiming for:

í̵-e
e-g
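
To make the target concrete, here is a minimal stdlib-only sketch of the pairs I'm after. It only approximates \X as "a base character plus any following combining marks" (Unicode general category M), and the helper names clusters and pairs_across are made up for illustration; they are not part of the regex module.

```python
import unicodedata

def clusters(text):
    # Approximate \P{M}\p{M}*: start a new cluster at each non-mark character
    # and attach any following combining marks (category Mn, Mc, Me) to it.
    # Assumption: a mark at the very start of the text becomes its own cluster.
    out = []
    for ch in text:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch
        else:
            out.append(ch)
    return out

def pairs_across(text, boundary="-"):
    # Collect the cluster on each side of every boundary character.
    units = clusters(text)
    return [
        (units[i - 1], units[i + 1])
        for i in range(1, len(units) - 1)
        if units[i] == boundary
    ]

# The example word, spelled with escapes: j, U+0268 U+0301, -, e, -, g, U+0268
word = "j\u0268\u0301-e-g\u0268"
print(pairs_across(word))
```

Each pair appears exactly once here, which is the behavior I was hoping to get from the regex version.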

Just when I thought regex couldn't get any better, you awesome developers implemented an overlapped=True flag for findall and finditer.

Python 3.1.2 (r312:79147, May 19 2010, 11:50:28) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-46)] on linux2
>>> import regex
>>> s = "jí̵-e-gɨ"
>>> s
'jí̵-e-gɨ'
>>> m = regex.compile("(\X)(-)(\X)")
>>> m.findall(s, overlapped=False)
[('í̵', '-', 'e')]

But these results are weird to me:

>>> m.findall(s, overlapped=True)
[('í̵', '-', 'e'), ('í̵', '-', 'e'), ('e', '-', 'g'), ('e', '-', 'g'), ('e', '-', 'g')]

Why the extra matches? At first I figured this had something to do with the overlapping match of the grapheme, since it consists of multiple characters. So I tried it without the grapheme match:

>>> m = regex.compile("(.)(-)(.)")
>>> s2 = "a-b-cd-e-f"
>>> m.findall(s2, overlapped=False)
[('a', '-', 'b'), ('d', '-', 'e')]

That's right. But with overlap...

>>> m.findall(s2, overlapped=True)
[('a', '-', 'b'), ('b', '-', 'c'), ('b', '-', 'c'), ('d', '-', 'e'), ('d', '-', 'e'), ('d', '-', 'e'), ('e', '-', 'f'), ('e', '-', 'f')]

Those 'extra' matches are confusing me. 2x b-c, 3x d-e, 2x e-f? Or even more simply:

>>> s2 = "a-b-c"
>>> m.findall(s2, overlapped=False)
[('a', '-', 'b')]
>>> m.findall(s2, overlapped=True)
[('a', '-', 'b'), ('b', '-', 'c'), ('b', '-', 'c')]
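
For comparison, this is roughly the result I expected from overlapped=True, sketched with the standard re module by wrapping the pattern in a lookahead so that a match is attempted at every start position. It assumes that "overlapped" means at most one match per start position, which may not be what the regex module intends. overlapped_findall is a hypothetical helper, not an existing function.

```python
import re

def overlapped_findall(pattern, text):
    # Hypothetical helper: the lookahead makes every match zero-width, so the
    # scanner moves forward one character at a time and can report matches
    # that share characters, but never the same start position twice.
    return re.findall("(?=" + pattern + ")", text)

print(overlapped_findall(r"(.)(-)(.)", "a-b-c"))
# [('a', '-', 'b'), ('b', '-', 'c')]
print(overlapped_findall(r"(.)(-)(.)", "a-b-cd-e-f"))
# [('a', '-', 'b'), ('b', '-', 'c'), ('d', '-', 'e'), ('e', '-', 'f')]
```

With this approach b-c, d-e and e-f each show up once, not two or three times.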

Thanks!

----------
nosy: +stiv
type: feature request -> behavior

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue2636>
_______________________________________