[New-bugs-announce] [issue22410] Locale dependent regexps on different locales

Serhiy Storchaka report at bugs.python.org
Sun Sep 14 18:23:23 CEST 2014


New submission from Serhiy Storchaka:

Locale-specific case-insensitive regular expression matching works correctly only when the pattern was compiled under the same locale that is used for matching. Because compiled patterns are cached, this can produce unexpected results.

The attached script demonstrates this (it requires two locales: ru_RU.koi8-r and ru_RU.cp1251). Its output is:

locale ru_RU.koi8-r
  b'1\xa3' ('1ё') matches b'1\xb3' ('1Ё')
  b'1\xa3' ('1ё') doesn't match b'1\xbc' ('1╪')
locale ru_RU.cp1251
  b'1\xa3' ('1Ј') doesn't match b'1\xb3' ('1і')
  b'1\xa3' ('1Ј') matches b'1\xbc' ('1ј')
locale ru_RU.cp1251
  b'2\xa3' ('2Ј') doesn't match b'2\xb3' ('2і')
  b'2\xa3' ('2Ј') matches b'2\xbc' ('2ј')
locale ru_RU.koi8-r
  b'2\xa3' ('2ё') doesn't match b'2\xb3' ('2Ё')
  b'2\xa3' ('2ё') matches b'2\xbc' ('2╪')

On the KOI8-R locale, b'\xa3' matches b'\xb3' if the pattern was compiled on the KOI8-R locale, but matches b'\xbc' if the pattern was compiled on the CP1251 locale.
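For illustration, a minimal sketch of such a check follows. It is not the attached re_locale_caching.py; the helper name check_cross_locale and its parameters are hypothetical. It compiles a pattern under one locale, matches under another, and returns None when a required locale is not installed. The prefix argument plays the role of the '1' and '2' prefixes in the output above: it makes each pattern distinct so that separate runs do not share a cache entry. The results depend on the Python version and on the locales available.

```python
import locale
import re


def _try_setlocale(name):
    """Switch LC_CTYPE to `name`; return False if the locale is unavailable."""
    try:
        locale.setlocale(locale.LC_CTYPE, name)
    except locale.Error:
        return False
    return True


def check_cross_locale(compile_locale, match_locale, prefix=b'',
                       candidates=(b'\xb3', b'\xbc')):
    """Compile prefix + b'\\xa3' under one locale and match under another.

    Returns a dict mapping each candidate byte to True/False, or None if
    either locale cannot be set. Hypothetical helper, for illustration only.
    """
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        if not _try_setlocale(compile_locale):
            return None
        compiled = re.compile(prefix + b'\xa3', re.LOCALE | re.IGNORECASE)
        if not _try_setlocale(match_locale):
            return None
        return {c: compiled.match(prefix + c) is not None
                for c in candidates}
    finally:
        # Always restore the locale that was active on entry.
        locale.setlocale(locale.LC_CTYPE, saved)


if __name__ == '__main__':
    for comp, match, prefix in [('ru_RU.koi8-r', 'ru_RU.cp1251', b'1'),
                                ('ru_RU.cp1251', 'ru_RU.koi8-r', b'2')]:
        print(comp, '->', match, check_cross_locale(comp, match, prefix))
```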

I see three possible ways to solve this issue:

1. Avoid caching locale-dependent case-insensitive patterns. This will definitely decrease the performance of locale-dependent case-insensitive regexps (unless the user does their own caching) and may slightly decrease the performance of other regexps.

2. Clear the cache of precompiled regexps on every locale change. This may look simpler, but it is vulnerable to locale changes made from extensions.

3. Do not lowercase characters at compile time (in locale-dependent case-insensitive patterns). This requires either introducing a new opcode for case-insensitive matching or at least rewriting the implementation of the current opcodes (less efficient). On the other hand, this is a more correct implementation than the current one. The problem is that it is incompatible with distributions that update only the Python library but not the statically linked binary (e.g. Vim with Python support). There may be some workarounds.
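
Until one of these is implemented, the second option can be approximated at the application level. The following is only a sketch under that assumption: setlocale_and_purge is a hypothetical wrapper, not part of the standard library. It relies on the existing re.purge() function to drop the cache of compiled patterns whenever the application itself changes LC_CTYPE or LC_ALL. Like option 2, it cannot see locale changes made elsewhere, for example from extensions, and it does not affect pattern objects that the caller has already stored.

```python
import locale
import re


def setlocale_and_purge(category, value=None):
    """Call locale.setlocale() and clear the re cache on a real change.

    Hypothetical application-level wrapper. With value=None the call only
    queries the current setting, so the cache is left untouched.
    """
    result = locale.setlocale(category, value)
    if value is not None and category in (locale.LC_CTYPE, locale.LC_ALL):
        # Patterns compiled via re.compile()/re.match() after this point
        # are compiled again instead of being served from the cache.
        re.purge()
    return result
```

Code that goes through this wrapper for every locale change, and that calls the module-level re functions rather than keeping its own compiled patterns, should not receive a pattern from the cache that was compiled under a previous locale.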

----------
components: Extension Modules, Library (Lib), Regular Expressions
files: re_locale_caching.py
messages: 226874
nosy: ezio.melotti, mrabarnett, pitrou, serhiy.storchaka
priority: normal
severity: normal
status: open
title: Locale dependent regexps on different locales
type: behavior
versions: Python 2.7, Python 3.4, Python 3.5
Added file: http://bugs.python.org/file36616/re_locale_caching.py

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue22410>
_______________________________________

