Turkic I and re
python at mrabarnett.plus.com
Thu Sep 15 16:06:08 CEST 2011
On 15/09/2011 14:44, John-John Tedro wrote:
> On Thu, Sep 15, 2011 at 1:16 PM, Alan Plum <me at alanplum.com> wrote:
> On 2011-09-15 15:02, MRAB wrote:
> The regex module at http://pypi.python.org/pypi/regex currently uses a
> compromise, where it matches 'I' with 'i' and also 'I' with 'ı' and
> 'İ' with 'i'.
> I was wondering if it would be preferable to have a TURKIC flag
> ("(?T)" or "(?T:...)" in the pattern).
> I think the problem many people ignore when coming up with solutions
> like this is that while this behaviour is pretty much unique for
> Turkish script, there is no guarantee that Turkish substrings won't
> appear in other language strings (or vice versa).
> For example, foreign names in Turkish are often given as spelled in
> their native (non-Turkish) script variants. Likewise, Turkish names
> in other languages are often given as spelled in Turkish.
> The Turkish 'I' is a peculiarity that will probably haunt us
> programmers until hell freezes over. Unless Turkey abandons its
> traditional orthography or people start speaking only a single
> language at a time (including names), there's no easy way to deal
> with this.
> In other words: the only way to make use of your proposed flag is if
> you have a fully language-tagged input (e.g. an XML document making
> extensive use of xml:lang) and only ever apply regular expressions
> to substrings containing one culture at a time.
> Python does not appear to support special-case mappings; in effect, it
> is not 100% compliant with the Unicode Standard.
> The locale-specific 'i' casing in Turkic is mentioned in section 5.18 (Case
> Mappings <http://www.unicode.org/versions/Unicode6.0.0/ch05.pdf#G21180>)
> of the Unicode Standard.
> AFAIK, the case methods of Python strings seem to be built around the
> assumption that len("string") == len("string".upper()), but some of
> these casing rules require the string to grow, such as uppercasing the
> German sharp s "ß", which should become the expanded string "SS".
> These special cases should be triggered in specific locales, but I have
> not been able to verify that the Turkic uppercasing of "i" works on
> any of Python 2.6, 2.7 or 3.1:
> import locale
> # Warning: requires a Turkish locale on your system.
> locale.setlocale(locale.LC_ALL, "tr_TR.utf8")
> ord("i".upper()) == 0x130  # is False for me, but should be True
> I wouldn't be surprised if these issues carry over into the 're' module.
There has been some discussion on the Python-dev list about improving
Unicode support in Python 3.
It's somewhat unlikely that Unicode case conversion will become
locale-dependent in Python because it would cause problems; you don't want:
"i".upper() == "I"
to be true under one locale and false under another.
An option would be to specify whether it should be locale-dependent.
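A minimal sketch of what such an opt-in might look like, assuming two
hypothetical helpers, upper() and lower(), with a turkic keyword that is
not part of Python's str API. It only handles the precomposed letters and
ignores 'I' followed by a combining dot above (U+0307):

```python
DOTTED_CAPITAL_I = "\u0130"   # İ
DOTLESS_SMALL_I = "\u0131"    # ı


def upper(text, turkic=False):
    """Uppercase text; apply the Turkic i -> İ rule only when asked to."""
    if turkic:
        # Map dotted small i to dotted capital I before the default
        # mapping turns it into a plain 'I'. ı -> I is already the default.
        text = text.replace("i", DOTTED_CAPITAL_I)
    return text.upper()


def lower(text, turkic=False):
    """Lowercase text; apply the Turkic I -> ı and İ -> i rules on request."""
    if turkic:
        # Handle İ first, then plain I, so the two rules cannot interfere.
        text = text.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I)
    return text.lower()
```

With the keyword left at its default, "i".upper() == "I" stays true
everywhere; the Turkic mapping only applies when the caller asks for it.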
> The only support appears to be the 'L' switch, but it only makes "\w, \W,
> \b, \B, \s and \S dependent on the current locale".
That flag is for locale-dependent 8-bit encodings. The ASCII (Python
3), LOCALE and UNICODE flags are mutually exclusive.
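For illustration, a small check of that exclusivity with the standard re
module, assuming a current Python 3 (flags_conflict() is a hypothetical
helper, and older releases may behave differently):

```python
import re


def flags_conflict(pattern, flags):
    """Return True if re refuses to compile pattern with these flags."""
    try:
        re.compile(pattern, flags)
    except ValueError:
        return True
    return False


# ASCII and UNICODE cannot be combined on a str pattern.
str_conflict = flags_conflict(r"\w", re.ASCII | re.UNICODE)

# ASCII and LOCALE cannot be combined on a bytes pattern.
bytes_conflict = flags_conflict(rb"\w", re.ASCII | re.LOCALE)

# A single flag on its own is accepted.
single_ok = not flags_conflict(r"\w", re.ASCII)
```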
> That probably does not cover the special rules mentioned above, but
> I could be wrong. Make sure that your locale is correct and test again.
> If you are unsuccessful, I don't see a 'Turkic flag' being introduced
> into the re module any time soon, given the following from PEP 20:
> "Special cases aren't special enough to break the rules"
That's why I'm interested in the view of Turkish users. The rest of us
will probably never have to worry about it! :-)
(There's a report in the Python bug tracker about this issue, which is
why the regex module has the compromise.)
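Spelled out for single characters, the compromise amounts to the pairings
below. This is only a model of the behaviour described in this message,
not the regex module's actual code; COMPROMISE_PAIRS and
same_ignoring_case() are made-up names:

```python
# The three pairings listed in the message: I/i, I/ı and İ/i.
COMPROMISE_PAIRS = {
    frozenset(("I", "i")),
    frozenset(("I", "\u0131")),   # I with dotless ı
    frozenset(("\u0130", "i")),   # dotted İ with i
}


def same_ignoring_case(a, b):
    """Compare two single characters case-insensitively under the compromise."""
    if a == b:
        return True
    if frozenset((a, b)) in COMPROMISE_PAIRS:
        return True
    # Everything else falls back to the default lowercase mapping.
    return a.lower() == b.lower()
```

In this model 'i' and 'ı' are not treated as the same letter, and neither
are 'İ' and 'ı'; only the listed pairs are added on top of the default
behaviour.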