[Python-bugs-list] [Bug #126706] many std modules assume string.letters is [a-zA-Z]

Tue, 09 Jan 2001 06:46:07 -0800

Bug #126706, was updated on 2000-Dec-23 06:19
Here is a current snapshot of the bug.

Project: Python
Category: Python Library
Status: Open
Resolution: None
Bug Group: None
Priority: 5
Submitted by: nobody
Assigned to : nobody
Summary: many std modules assume string.letters is [a-zA-Z]

Details: There are many modules in the standard library that
use string.letters to mean A-Za-z, but that assumption
is incorrect when locales are in use.

Also, the readline library seems to cause the locale to be set according to
the current environment variables,
even if I don't call locale.*:

% python2.0 -c 'import string; print string.letters'
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
% python2.0
Python 2.0 (#3, Oct 19 2000, 01:42:41) 
[GCC 2.95.2 20000220 (Debian GNU/Linux)] on linux2
Type "copyright", "credits" or "license" for more information.
>>> print string.letters
abcdefghijklmnopqrstuvwxyzµßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿABCDEFGHIJKLMNOPQRSTUVWXYZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ
>>> 
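
One way to check whether importing a module changes the locale as a side effect is to query LC_CTYPE before and after the import. This is a minimal sketch in current Python rather than Python 2.0; the helper names current_ctype_locale and locale_around_import are made up for illustration.

```python
import importlib
import locale


def current_ctype_locale():
    """Return the name of the active LC_CTYPE locale without changing it."""
    # Calling setlocale() with only a category queries the current setting.
    return locale.setlocale(locale.LC_CTYPE)


def locale_around_import(module_name):
    """Import module_name and return the LC_CTYPE locale as (before, after)."""
    before = current_ctype_locale()
    importlib.import_module(module_name)
    after = current_ctype_locale()
    return before, after
```

On a system where the readline module is available, calling locale_around_import("readline") in a fresh interpreter and comparing the two values would show whether the import touched the locale. Whether it does depends on the Python version and platform.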

Here's what grep finds in the standard library. Most
of these uses seem incorrect to me:

% grep string.letters **/*.py
Cookie.py:_LegalChars       = string.letters + string.digits + "!#$%&'*+-.^_`|~"
cmd.py:IDENTCHARS = string.letters + string.digits + '_'
dospath.py:    varchars = string.letters + string.digits + '_-'
lib-old/codehack.py:identchars = string.letters + string.digits + '_' # Identifier characters
ntpath.py:    varchars = string.letters + string.digits + '_-'
nturl2path.py:  if len(comp) != 2 or comp[0][-1] not in string.letters:
pipes.py:_safechars = string.letters + string.digits + '!@%_-+=:,./'    # Safe unquoted
pre.py:    alphanum=string.letters+'_'+string.digits
tokenize.py:    namechars, numchars = string.letters + '_', string.digits
urlparse.py:scheme_chars = string.letters + string.digits + '+-.'
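
As an illustration of how one of these checks could avoid string.letters entirely, here is a rough sketch of the nturl2path-style drive-letter test written with explicit range comparisons. It is written for current Python, where str comparison goes by code point and does not depend on the locale. The function names is_ascii_letter and has_drive_letter are hypothetical, and comp is assumed to be a list of str pieces as in the grep hit above.

```python
def is_ascii_letter(ch):
    """Return True only for a single character in a-z or A-Z."""
    # The length check matters: an empty or multi-character string
    # must never count as a letter.
    return len(ch) == 1 and ("a" <= ch <= "z" or "A" <= ch <= "Z")


def has_drive_letter(comp):
    """Return True if comp has two parts and the first ends in an ASCII letter."""
    # comp[0][-1:] yields "" instead of raising IndexError when comp[0] is empty.
    return len(comp) == 2 and is_ascii_letter(comp[0][-1:])
```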

Follow-Ups:

Date: 2001-Jan-09 06:46
By: gvanrossum

Comment:
I agree that the string module should be extended with additional variables
ascii_letters (and ascii_lowercase and ascii_uppercase and
ascii_whitespace).
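
A sketch of what such constants might look like, spelled out as literals so that no locale setting can change them. The exact contents, in particular of ascii_whitespace, are an assumption made for illustration and not a statement of what the string module ships.

```python
# Fixed, locale-independent character sets written out as literals.
ascii_lowercase = "abcdefghijklmnopqrstuvwxyz"
ascii_uppercase = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
ascii_letters = ascii_lowercase + ascii_uppercase

# Assumed contents: space, tab, newline, carriage return,
# vertical tab, and form feed.
ascii_whitespace = " \t\n\r\v\f"
```

Code such as cmd.py's IDENTCHARS could then be built from ascii_letters instead of string.letters.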

-------------------------------------------------------

Date: 2001-Jan-01 10:08
By: lemburg

Comment:
The comment about readline calling setlocale() is unfortunately
true (and causes some very subtle bugs in user code...).

About the addition of more constants: I would rather like
to see a database for these things which uses function calls
much like the Unicode database (unicodedata).

Since locales sometimes matter, I think there should be an option
to the functions which enables locale support (much as
for REs) on request. Default should be no locale support, since
this is what most code expects anyway.
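
A function-based interface with an opt-in locale switch could look roughly like the sketch below. It is written for current Python and borrows the re module's LOCALE flag, which applies to bytes patterns, for the locale-aware branch. The function name is_letter and its use_locale parameter are hypothetical and only illustrate the shape of such an API.

```python
import re

# Default: plain ASCII letters, independent of any locale.
_ASCII_LETTER = re.compile(rb"[A-Za-z]")

# Opt-in: with re.LOCALE, \W follows the current locale, so this class
# matches "word" bytes that are neither digits nor the underscore.
_LOCALE_LETTER = re.compile(rb"[^\W\d_]", re.LOCALE)


def is_letter(byte, use_locale=False):
    """Return True if byte (a bytes object of length 1) is a letter."""
    pattern = _LOCALE_LETTER if use_locale else _ASCII_LETTER
    # fullmatch() rejects empty and multi-byte input.
    return pattern.fullmatch(byte) is not None
```

With use_locale left at its default, a byte such as b"\xe9" is never a letter. With use_locale=True the answer for such bytes depends on the locale that is active when the function is called.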

-------------------------------------------------------

Date: 2000-Dec-30 19:36
By: akuchling

Comment:
The set of all letters will still be commonly used, though maybe we also need
an alphanumeric constant for A-Za-z0-9 + underscore.  I like the
.ascii_letters suggestion.

-------------------------------------------------------

Date: 2000-Dec-30 18:26
By: fdrake

Comment:
Andrew, does it make sense to introduce new constants in string for this? 
It seems that each instance is referring to slightly different
specifications or standards (documented or not), so perhaps the constants
should be defined locally within each of the modules.  This also avoids
unnecessary dependencies.
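
For instance, a module-local definition for URL scheme characters might look like the sketch below. It keeps the character set from urlparse.py's scheme_chars and adds the rule from the URI syntax RFCs that a scheme starts with a letter; that extra rule, and the names _SCHEME_FIRST, _SCHEME_CHARS and is_valid_scheme, are additions for illustration and not taken from urlparse.py.

```python
# Defined locally from the relevant spec instead of relying on string.letters.
_SCHEME_FIRST = frozenset(
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
)
_SCHEME_CHARS = _SCHEME_FIRST | frozenset("0123456789+-.")


def is_valid_scheme(scheme):
    """Return True if scheme is a letter followed by letters, digits, '+', '-' or '.'."""
    return (
        bool(scheme)
        and scheme[0] in _SCHEME_FIRST
        and all(c in _SCHEME_CHARS for c in scheme)
    )
```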
-------------------------------------------------------

Date: 2000-Dec-26 12:18
By: nobody

Comment:
string.ascii_letters etc. is more precise
than alphabet, imho.

  -- erno@iki.fi
-------------------------------------------------------

Date: 2000-Dec-26 08:15
By: akuchling

Comment:
The docs for the string module say that, for example, string.lowercase is
"A string containing all the characters that are considered lowercase
letters."  This implies that the strings are locale-aware; code that uses
string.lowercase to mean only a-z is therefore in error.  (.digits is not
locale-aware.)

Solution: I'd suggest adding new, not locale-aware, constants.
string.alphabet, string.lower_alphabet, string.upper_alphabet, maybe?  Code
should then be changed to use these new constants.

-------------------------------------------------------

For detailed info, follow this link:
http://sourceforge.net/bugs/?func=detailbug&bug_id=126706&group_id=5470