[Tutor] lc_ctype and re.LOCALE
Albert-Jan Roskam
sjeik_appie at hotmail.com
Thu Jan 28 15:23:37 EST 2016
Hi,
Out of curiosity, I wrote the throw-away script below to find a character that is classified (via LC_CTYPE) as a digit in one locale, but not in another.
I ran it on 5000 locale combinations in Python 2 without finding one (then somebody shut down my computer!). I have just modified the code so that it also
runs in Python 3. Is this the correct way to find such locale-dependent regex matches?
albertjan@debian:~/Downloads$ uname -a && python --version && python3 --version
Linux debian 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u3 (2015-08-04) x86_64 GNU/Linux
Python 2.7.9
Python 3.3.4
albertjan@debian:~/Downloads$ cat lc_ctype.py
# -*- coding: utf-8 -*-
"""
Find two locales where a different character classification causes a regex
to match a given character in one locale, but fail in another.
This is to demonstrate the effect that re.LOCALE (in particular the LC_CTYPE
locale category) might have on locale-aware regexes like \w or \d.
E.g., a character might be classified as digit in one locale but not in another.
"""
from __future__ import print_function, division
import subprocess
import locale
import itertools
import sys
import re
try:
    xrange
except NameError:
    xrange = range
    unichr = chr
if sys.version_info.major > 2:
    unicode = str
proc = subprocess.Popen("locale -a", stdout=subprocess.PIPE, shell=True)
locales = proc.communicate()
locales = sorted(locales[0].split(b"\n")) # this is the list: http://pastebin.com/FVxUnrWK
if sys.version_info.major > 2:
    locales = [loc.decode("utf-8") for loc in locales]
regex = re.compile(r"\d+", re.LOCALE) # is this the correct place?
total = len(list(itertools.combinations(locales, 2)))
for n, (locale1, locale2) in enumerate(itertools.combinations(locales, 2), 1):
    if not locale1 or not locale2:
        continue
    if n % 10 == 0 or n == 1:
        sys.stdout.write(" %d (%3.2f%%) ... " % (n, n / total * 100))
        sys.stdout.flush() # python 2 print *function* does not have flush param
    for i in xrange(sys.maxunicode + 1): # 1114111
        s = unichr(i) #.encode("utf8")
        try:
            locale.setlocale(locale.LC_CTYPE, locale1)
            m1 = bool(regex.match(s))
            locale.setlocale(locale.LC_CTYPE, locale2)
            m2 = bool(regex.match(s))
            if m1 ^ m2: # m1 != m2
                msg = ("@@ ordinal: %s | character: %s (%r) | "
                       " digit in locale '%s': %s | digit in locale '%s': %s ")
                print(msg % (i, unichr(i), unichr(i), locale1, m1, locale2, m2))
                break
        except locale.Error as e:
            #print("Error: %s with %s and/or %s" % (e, locale1, locale2))
            continue
print("---Done---")
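A narrower variant of the same check might look like the sketch below. It rests on a few assumptions that are worth verifying against the re documentation: that re.LOCALE is documented to affect \w, \W, \b, \B and case-insensitive matching rather than \d, that the locale's classification only applies to single-byte values (0-255), and that newer Python 3 releases accept re.LOCALE only with bytes patterns. The sketch is Python 3 only, and the helper names word_bytes and differing_bytes are made up for illustration.

```python
import locale
import re

# Bytes pattern on purpose: newer Python 3 releases reject re.LOCALE
# combined with a str pattern (assumption, see the note above).
WORD = re.compile(br"\w", re.LOCALE)


def word_bytes(loc):
    """Return the set of byte values that \\w matches under LC_CTYPE `loc`."""
    saved = locale.setlocale(locale.LC_CTYPE)  # no second argument: query only
    try:
        locale.setlocale(locale.LC_CTYPE, loc)
        return {i for i in range(256) if WORD.match(bytes([i]))}
    finally:
        # Always restore the previous LC_CTYPE, even if setlocale raised.
        locale.setlocale(locale.LC_CTYPE, saved)


def differing_bytes(loc1, loc2):
    """Return the sorted byte values that \\w matches in exactly one locale."""
    return sorted(word_bytes(loc1) ^ word_bytes(loc2))


if __name__ == "__main__":
    # Comparing a locale with itself must give no differences.
    print(differing_bytes("C", "C"))
```

Calling differing_bytes("C", some_latin1_locale), where some_latin1_locale is whichever single-byte locale `locale -a` happens to list on the machine, would be the interesting case; locale.Error is raised if the name is not installed.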
Thank you!
Albert-Jan