[Tutor] lc_ctype and re.LOCALE

Thu Jan 28 15:23:37 EST 2016

Hi,

Out of curiosity, I wrote the throw-away script below to find a character that is classified (--> LC_CTYPE) as a digit in one locale but not in another.
I ran it on 5000 locale combinations in Python 2 without finding one, but the run was cut short (somebody shut down my computer!). I have since modified
the code so that it also runs in Python 3. Is this the correct way to find such locale-dependent regex matches?

albertjan@debian:~/Downloads$ uname -a && python --version && python3 --version
Linux debian 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u3 (2015-08-04) x86_64 GNU/Linux
Python 2.7.9
Python 3.3.4
albertjan@debian:~/Downloads$ cat lc_ctype.py
# -*- coding: utf-8 -*-
"""
Find two locales where a different character classification causes a regex
to match a given character in one locale, but fail in the other.
This is to demonstrate the effect that re.LOCALE (in particular the LC_CTYPE
locale category) might have on locale-aware regexes like \w or \d.
E.g., a character might be classified as a digit in one locale but not in another.
"""

from __future__ import print_function, division
import subprocess
import locale
import itertools
import sys
import re

# Python 2/3 compatibility: Python 3 has no xrange or unichr.
try:
    xrange
except NameError:
    xrange = range
    unichr = chr

# Ask the system for its installed locales; `locale -a` prints one name per line.
proc = subprocess.Popen(["locale", "-a"], stdout=subprocess.PIPE)
output = proc.communicate()[0]
# Drop the empty entry left behind by the trailing newline.
locales = sorted(loc for loc in output.split(b"\n") if loc)  # this is the list: http://pastebin.com/FVxUnrWK
if sys.version_info.major > 2:
    locales = [loc.decode("utf-8") for loc in locales]
regex = re.compile(r"\d+", re.LOCALE)  # is this the correct place?

# Number of unordered pairs, computed without materializing them all.
total = len(locales) * (len(locales) - 1) // 2

for n, (locale1, locale2) in enumerate(itertools.combinations(locales, 2), 1):

    if not locale1 or not locale2:
        continue

    # Report progress on the first pair and on every tenth pair after that.
    if n % 10 == 0 or n == 1:
        sys.stdout.write(" %d (%3.2f%%) ... " % (n, n / total * 100))
        sys.stdout.flush()  # the Python 2 print *function* has no flush parameter

    for i in xrange(sys.maxunicode + 1):  # sys.maxunicode is 1114111 on wide builds and on Python 3.3+
        s = unichr(i)
        try:
            locale.setlocale(locale.LC_CTYPE, locale1)
            m1 = bool(regex.match(s))
            locale.setlocale(locale.LC_CTYPE, locale2)
            m2 = bool(regex.match(s))
            if m1 ^ m2:  # m1 != m2
                msg = ("@@ ordinal: %s | character: %s (%r) | " 
                       " digit in locale '%s': %s | digit in locale '%s': %s ")
                print(msg % (i, unichr(i), unichr(i), locale1, m1, locale2, m2))
                break
        except locale.Error as e:
            #print("Error: %s with %s and/or %s" % (e, locale1, locale2))
            break  # a locale that cannot be set fails for every code point, so skip this pair

print("---Done---")
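
A per-locale variant of the same idea, offered as a minimal sketch rather than a definitive answer. It rests on two points from the `re` documentation: re.LOCALE is described as affecting \w, \W, \b, \B and case-insensitive matching, while \d is described as equivalent to [0-9] for 8-bit (bytes) patterns; and from Python 3.6 on, re.LOCALE is only accepted with bytes patterns. The sketch therefore probes \w over the 256 single-byte values instead of \d over all code points. The helper names `word_bytes` and `differing_bytes` are hypothetical and not part of the script above.

```python
import locale
import re


def word_bytes(locale_name):
    """Return the set of byte values (0-255) matched as word characters
    under the given LC_CTYPE locale. Hypothetical helper, not from the post."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        # Raises locale.Error if locale_name is not installed.
        locale.setlocale(locale.LC_CTYPE, locale_name)
        # A bytes pattern: Python 3.6+ rejects re.LOCALE on str patterns.
        pattern = re.compile(br"\w", re.LOCALE)
        # bytes(bytearray([i])) builds a one-byte string on Python 2 and 3.
        return {i for i in range(256) if pattern.match(bytes(bytearray([i])))}
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # always restore the old locale


def differing_bytes(locale_a, locale_b):
    """Return the sorted byte values that the two locales classify differently."""
    return sorted(word_bytes(locale_a) ^ word_bytes(locale_b))


if __name__ == "__main__":
    # Comparing a locale with itself can never produce a difference.
    print(differing_bytes("C", "C"))
```

Classifying each locale once and comparing the resulting sets keeps setlocale out of the innermost loop: N locales need N classification passes instead of one pass per pair. To try it on real locales, pass two names taken from the `locale -a` output; which pairs differ, if any, depends on the locales installed on the machine.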

Thank you!

Albert-Jan