<div class="gmail_quote">On Thu, Sep 15, 2011 at 1:16 PM, Alan Plum <span dir="ltr"><<a href="mailto:me@alanplum.com">me@alanplum.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<div class="im">On 2011-09-15 15:02, MRAB wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
The regex module at <a href="http://pypi.python.org/pypi/regex" target="_blank">http://pypi.python.org/pypi/regex</a> currently uses a<br>
compromise, where it matches 'I' with 'i' and also 'I' with 'ı' and 'İ'<br>
with 'i'.<br>
<br>
I was wondering if it would be preferable to have a TURKIC flag instead<br>
("(?T)" or "(?T:...)" in the pattern).<br>
</blockquote>
<br></div>
I think the problem many people ignore when coming up with solutions like this is that while this behaviour is pretty much unique for Turkish script, there is no guarantee that Turkish substrings won't appear in other language strings (or vice versa).<br>
<br>
For example, foreign names in Turkish are often given as spelled in their native (non-Turkish) script variants. Likewise, Turkish names in other languages are often given as spelled in Turkish.<br>
<br>
The Turkish 'I' is a peculiarity that will probably haunt us programmers until hell freezes over. Unless Turkey abandons its traditional orthography or people start speaking only a single language at a time (including names), there's no easy way to deal with this.<br>
<br>
In other words: the only way to make use of your proposed flag is if you have a fully language-tagged input (e.g. an XML document making extensive use of xml:lang) and only ever apply regular expressions to substrings containing one culture at a time.<div>
<div></div><div class="h5"><br>
-- <br>
<a href="http://mail.python.org/mailman/listinfo/python-list" target="_blank">http://mail.python.org/mailman/listinfo/python-list</a><br>
</div></div></blockquote></div><br>Python does not appear to support special case mappings; in effect, it is not 100% compliant with the Unicode standard.<br><br>The locale-specific casing of 'i' in Turkic languages is covered in section 5.18 (<a href="http://www.unicode.org/versions/Unicode6.0.0/ch05.pdf#G21180">Case Mappings</a>) of the Unicode standard.<br>
<a href="http://www.unicode.org/versions/Unicode6.0.0/ch05.pdf#G21180">http://www.unicode.org/versions/Unicode6.0.0/ch05.pdf#G21180</a><br><br>AFAIK, the case methods of Python strings seem to be built around the assumption that len("string") == len("string".upper()), but some of these casing rules require the string to grow, such as uppercasing the German sharp s "ß", which should be expanded to the string "SS".<br>
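<br>To make the two kinds of special case concrete, here is a minimal sketch of a helper that applies them by hand. The name full_upper, its turkic parameter and both tables are hypothetical (nothing here is provided by Python or the regex module), the tables hold only the characters discussed in this thread rather than the full SpecialCasing.txt data, and the input is assumed to be a Unicode string:<br><br>

```python
# Hypothetical helper for illustration only; not part of the standard library.
# The tables cover just the characters discussed in this thread, not the
# complete data from Unicode's SpecialCasing.txt.

# Unconditional mapping that makes the string grow: sharp s to "SS".
EXPANDING_UPPER = {"\u00df": "SS"}

# Turkic tailoring: dotted i to U+0130, dotless i (U+0131) to plain I.
TURKIC_UPPER = {"i": "\u0130", "\u0131": "I"}


def full_upper(text, turkic=False):
    """Uppercase text one character at a time, consulting the tables first."""
    out = []
    for ch in text:
        if turkic and ch in TURKIC_UPPER:
            out.append(TURKIC_UPPER[ch])
        elif ch in EXPANDING_UPPER:
            out.append(EXPANDING_UPPER[ch])
        else:
            # Fall back to whatever the built-in method does.
            out.append(ch.upper())
    return "".join(out)
```

With these tables, full_upper("stra\u00dfe") returns "STRASSE", one character longer than the input, and full_upper("i", turkic=True) returns U+0130 where the plain call returns "I". Note that the caller has to say explicitly that the text is Turkic; the helper does not consult the process locale.<br><br>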
The locale-dependent cases should only be triggered under specific locales, but I have not been able to get the Turkic uppercasing of "i" to work on Python 2.6, 2.7 or 3.1:<br><br>import locale<br> locale.setlocale(locale.LC_ALL, "tr_TR.utf8") # warning: requires the Turkish locale to be installed on your system<br>
ord("i".upper()) == 0x130 # is False for me, but should be True<br><br>I wouldn't be surprised if these issues carry over into the 're' module.<br><br>The only locale support there appears to be the 'L' flag (re.LOCALE), but it only makes "<tt class="docutils literal"><span class="pre">\w</span></tt>, <tt class="docutils literal"><span class="pre">\W</span></tt>, <tt class="docutils literal"><span class="pre">\b</span></tt>, <tt class="docutils literal"><span class="pre">\B</span></tt>, <tt class="docutils literal"><span class="pre">\s</span></tt> and <tt class="docutils literal"><span class="pre">\S</span></tt> dependent on the
current locale",<br>which probably does not extend to the special rules mentioned above, but I could be wrong. Make sure that your locale is correct and test again.<br><br>If you are unsuccessful, I don't see a 'Turkic flag' being introduced into the re module any time soon, given the following from PEP 20:<br>
"Special cases aren't special enough to break the rules"<br><br>Cheers,<br>-- John-John Tedro<br>