Hi,<div><br></div><div><span class="Apple-style-span">I am a Turkish self-taught Python user. Personally, I don't think I am in a position to discuss an issue of this scale, but in my opinion the Pardus* developers</span><span class="Apple-style-span"> should be invited to join this discussion. As they use Python heavily in most of their projects**, I think they would have something valuable to say on this subject. Here is the pardus-devel mailing list: <a href="http://liste.pardus.org.tr/mailman/listinfo/pardus-devel">http://liste.pardus.org.tr/mailman/listinfo/pardus-devel</a></span></div>

<div><br></div><div>As for me, I always expect that the Turkish locale might cause problems, and I use workarounds where necessary. For example, if I needed to match lower-case or upper-case Turkish "i", I would probably go with [iİ] and the Unicode flag.</div>
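<div><br></div><div>A minimal sketch of that workaround, assuming Python 3 (where the Unicode flag is already the default for str patterns, so passing it is only for emphasis). The names DOTTED_I, DOTLESS_I and sample are purely illustrative:</div>
<pre>
```python
import re

# Explicit character class for the dotted pair: lower-case "i" (U+0069)
# and capital "İ" (U+0130). No case folding is relied upon.
DOTTED_I = re.compile("[i\u0130]", re.UNICODE)

# Explicit character class for the dotless pair: lower-case "ı" (U+0131)
# and capital "I" (U+0049).
DOTLESS_I = re.compile("[\u0131I]", re.UNICODE)

sample = "İstanbul, izmir, ISPARTA, ılık"

dotted = DOTTED_I.findall(sample)    # every dotted i, either case
dotless = DOTLESS_I.findall(sample)  # every dotless i, either case
```
</pre>
<div>Spelling out both members of each pair means the result does not depend on how a given regex engine folds these four letters under IGNORECASE.</div>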

<div><br></div><div><br></div><div>*) <span class="Apple-style-span">a Linux distro developed by the</span><span class="Apple-style-span" style="font-family: 'Liberation Sans', Arial, Helvetica, sans-serif, generic; font-size: 13px; background-color: rgb(253, 255, 227); "> Scientific &amp; Technological Research Council of Turkey</span></div>

<div><span class="Apple-style-span" style="font-family: 'Liberation Sans', Arial, Helvetica, sans-serif, generic; font-size: 13px; background-color: rgb(253, 255, 227); ">**) </span><a href="http://developer.pardus.org.tr/projects/index.html">http://developer.pardus.org.tr/projects/index.html</a></div>

<div><br></div><div> </div><div><br><div class="gmail_quote">2011/9/15 MRAB <span dir="ltr"><<a href="mailto:python@mrabarnett.plus.com">python@mrabarnett.plus.com</a>></span><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<div class="im">On 15/09/2011 14:44, John-John Tedro wrote:<br>

</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">

On Thu, Sep 15, 2011 at 1:16 PM, Alan Plum <<a href="mailto:me@alanplum.com" target="_blank">me@alanplum.com</a><br></div><div><div></div><div class="h5">

> wrote:<br>

<br>

    On 2011-09-15 15:02, MRAB wrote:<br>

<br>

        The regex module at <a href="http://pypi.python.org/pypi/regex" target="_blank">http://pypi.python.org/pypi/regex</a><br>

        currently uses a compromise, where it matches 'I' with 'i', 'I' with 'ı',<br>

        and 'İ' with 'i'.<br>

<br>

        I was wondering if it would be preferable to have a TURKIC flag<br>

        instead<br>

        ("(?T)" or "(?T:...)" in the pattern).<br>

<br>

<br>

    I think the problem many people ignore when coming up with solutions<br>

    like this is that while this behaviour is pretty much unique for<br>

    Turkish script, there is no guarantee that Turkish substrings won't<br>

    appear in other language strings (or vice versa).<br>

<br>

    For example, foreign names in Turkish are often given as spelled in<br>

    their native (non-Turkish) script variants. Likewise, Turkish names<br>

    in other languages are often given as spelled in Turkish.<br>

<br>

    The Turkish 'I' is a peculiarity that will probably haunt us<br>

    programmers until hell freezes over. Unless Turkey abandons its<br>

    traditional orthography or people start speaking only a single<br>

    language at a time (including names), there's no easy way to deal<br>

    with this.<br>

<br>

    In other words: the only way to make use of your proposed flag is if<br>

    you have a fully language-tagged input (e.g. an XML document making<br>

    extensive use of xml:lang) and only ever apply regular expressions<br>

    to substrings containing one culture at a time.<br>

<br>

    --<br>

    <a href="http://mail.python.org/mailman/listinfo/python-list" target="_blank">http://mail.python.org/mailman/listinfo/python-list</a><br>

<br>

<br>

Python does not appear to support special-case mappings; in effect, it<br>

is not 100% compliant with the Unicode standard.<br>

<br>

The locale-specific 'i' casing in Turkic languages is covered in section 5.18 (Case<br></div></div>

Mappings) of the Unicode standard:<div class="im"><br>

<a href="http://www.unicode.org/versions/Unicode6.0.0/ch05.pdf#G21180" target="_blank">http://www.unicode.org/versions/Unicode6.0.0/ch05.pdf#G21180</a><br>

<br>

AFAIK, the case methods of Python strings seem to be built around the<br>

assumption that len("string") == len("string".upper()), but some of<br>

these casing rules require the string to grow, such as uppercasing the<br>

German sharp s "ß", which should expand to "SS".<br>

These special cases should be triggered by specific locales, but I have<br>

not been able to verify that the Turkic uppercasing of "i" works on<br>

Python 2.6, 2.7 or 3.1:<br>

<br>

   import locale<br>

   # Warning: requires the Turkish locale to be installed on your system.<br>

   locale.setlocale(locale.LC_ALL, "tr_TR.utf8")<br>

   ord("i".upper()) == 0x130 # is False for me, but should be True<br>

<br>

I wouldn't be surprised if these issues carry over into the 're' module.<br>

<br>

</div></blockquote>
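For reference, a small sketch of how the built-in case methods behave, assuming Python 3.3 or later (newer than the versions tested above). The variable names are only illustrative:<br>
<pre>
```python
# str.upper() and str.lower() take no locale argument; on Python 3.3+
# they apply the default, language-independent Unicode case mappings.
sharp_s = "\u00df"            # German sharp s
dotted_capital_i = "\u0130"   # Turkish dotted capital I

sharp_s_upper = sharp_s.upper()          # grows to two characters
plain_i_upper = "i".upper()              # U+0049, whatever the locale
dotted_lower = dotted_capital_i.lower()  # "i" plus combining dot U+0307

print(len(sharp_s), len(sharp_s_upper))
print(hex(ord(plain_i_upper)))
print([hex(ord(ch)) for ch in dotted_lower])
```
</pre>
So the length assumption no longer holds on such versions, but the Turkic mapping of "i" is still not applied by the plain string methods.<br>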

There has been some discussion on the Python-dev list about improving<br>

Unicode support in Python 3.<br>

<br>

It's somewhat unlikely that Unicode case conversion will become locale-dependent in<br>

Python because it would cause problems; you don't want:<br>

<br>

    "i".upper() == "I"<br>

<br>

to be maybe true, maybe false.<br>

<br>

An option would be to specify whether it should be locale-dependent.<div class="im"><br>
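One way to make that choice explicit is a helper that applies the Turkic i rules before the default mapping. This is only a sketch for Python 3; turkic_upper and turkic_lower are hypothetical names, not part of the standard library or of the regex module:<br>
<pre>
```python
def turkic_upper(text):
    """Upper-case text with Turkic rules: i -> U+0130, U+0131 -> I."""
    # Handle the two special letters first, then fall back to the
    # default Unicode mapping for everything else.
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()


def turkic_lower(text):
    """Lower-case text with Turkic rules: U+0130 -> i, I -> U+0131."""
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


default_upper = "izmir".upper()          # default mapping: plain "I"
turkic_result = turkic_upper("izmir")    # dotted capitals instead
round_trip = turkic_lower(turkic_result)
```
</pre>
The caller decides per string which rules apply, so "i".upper() itself keeps a single, predictable answer.<br>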

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

The only support appears to be the 'L' switch, but it only makes "\w, \W,<br>

\b, \B, \s and \S dependent on the current locale".<br>

</blockquote>

<br></div>

That flag is for locale-dependent 8-bit encodings. The ASCII (Python<br>

3), LOCALE and UNICODE flags are mutually exclusive.<div class="im"><br>
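A quick way to see the ASCII/UNICODE half of that, assuming Python 3; flags_conflict is a hypothetical helper name used only for this illustration:<br>
<pre>
```python
import re


def flags_conflict(flags, pattern="a"):
    """Return True if re.compile rejects this flag combination."""
    try:
        re.compile(pattern, flags)
    except ValueError:
        return True
    return False


# Combining ASCII and UNICODE on a str pattern is rejected.
ascii_and_unicode = flags_conflict(re.ASCII | re.UNICODE)
# Either flag on its own is accepted.
unicode_alone = flags_conflict(re.UNICODE)
ascii_alone = flags_conflict(re.ASCII)
```
</pre>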

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

That probably does not apply the special rules mentioned above, but<br>

I could be wrong. Make sure that your locale is correct and test again.<br>

<br>

If you are unsuccessful, I don't see a 'Turkic flag' being introduced<br>

into the re module any time soon, given the following from PEP 20:<br>

"Special cases aren't special enough to break the rules"<br>

<br>

</blockquote></div>

That's why I'm interested in the view of Turkish users. The rest of us<br>

will probably never have to worry about it! :-)<br>

<br>

(There's a report in the Python bug tracker about this issue, which is<br>

why the regex module has the compromise.)<div><div></div><div class="h5"><br>

-- <br>

<a href="http://mail.python.org/mailman/listinfo/python-list" target="_blank">http://mail.python.org/mailman/listinfo/python-list</a><br>

</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br><a href="http://yasar.serveblog.net/" target="_blank">http://yasar.serveblog.net/</a><br><br>

</div>