Turkic I and re

Thu Sep 15 09:16:15 EDT 2011

On 2011-09-15 15:02, MRAB wrote:
> The regex module at http://pypi.python.org/pypi/regex currently uses a
> compromise, where it matches 'I' with 'i' and also 'I' with 'ı' and 'İ'
> with 'i'.
>
> I was wondering if it would be preferable to have a TURKIC flag instead
> ("(?T)" or "(?T:...)" in the pattern).

I think the problem many people overlook when coming up with solutions
like this is that, while this behaviour is pretty much unique to Turkish
orthography, there is no guarantee that Turkish substrings won't appear
in strings in other languages (or vice versa).

For example, foreign names in Turkish text are often written in their
original (non-Turkish) spelling. Likewise, Turkish names in other
languages are often written as they are spelled in Turkish.
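
To make the mixed-language problem concrete, here is a minimal sketch.
The helper turkish_lower is hypothetical (it is not part of re or the
regex module) and only approximates Turkish lowercasing by mapping
'I' to 'ı' and 'İ' to 'i' before calling str.lower():

```python
def turkish_lower(text):
    """Hypothetical helper: lowercase with the Turkish I/ı and İ/i pairing."""
    return text.replace("I", "ı").replace("İ", "i").lower()

# A Turkish place name next to a foreign company name in one string.
sentence = "IBM ISPARTA"

# Turkish rules get the place name right but mangle the foreign name.
assert turkish_lower(sentence) == "ıbm ısparta"

# Language-neutral rules get the foreign name right but mangle the place name.
assert sentence.lower() == "ibm isparta"
```

Whichever rule is chosen for the whole string, one of the two names ends
up folded incorrectly.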

The Turkish 'I' is a peculiarity that will probably haunt us programmers 
until hell freezes over. Unless Turkey abandons its traditional 
orthography or people start speaking only a single language at a time 
(including names), there's no easy way to deal with this.

In other words: the only way to make use of your proposed flag is if you
have fully language-tagged input (e.g. an XML document making extensive
use of xml:lang) and only ever apply regular expressions to substrings
written in a single language at a time.
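
A rough sketch of what that per-segment approach might look like, under
the assumption that the input has already been split into (lang, text)
pairs. The names fold and contains and the segment layout are made up
for illustration, and plain substring search stands in for a regular
expression:

```python
TURKIC_LANGS = {"tr", "az"}  # assumption: tags that use the dotted/dotless I pairing

def fold(text, lang):
    """Lowercase one segment according to its language tag."""
    if lang in TURKIC_LANGS:
        # Turkic rule: I pairs with ı, and İ pairs with i.
        text = text.replace("I", "ı").replace("İ", "i")
    return text.lower()

def contains(segments, needle, needle_lang):
    """Case-insensitive search restricted to segments tagged needle_lang."""
    target = fold(needle, needle_lang)
    return any(
        lang == needle_lang and target in fold(text, lang)
        for lang, text in segments
    )

# Hypothetical tagged input, e.g. derived from xml:lang attributes.
segments = [("tr", "ISPARTA'da "), ("en", "IBM"), ("tr", " ofisi")]

assert contains(segments, "ibm", "en")           # English rules for the English segment
assert contains(segments, "ısparta", "tr")       # Turkish rules for the Turkish segment
assert not contains(segments, "isparta", "tr")   # dotted i does not match dotless I
```

This only works because every segment carries a reliable language tag;
without the tags there is nothing to base the choice of folding rule on.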