[Python-bugs-list] [Bug #126706] many std modules assume string.letters is [a-zA-Z]

noreply@sourceforge.net noreply@sourceforge.net
Tue, 26 Dec 2000 12:18:57 -0800


Bug #126706, was updated on 2000-Dec-23 06:19
Here is a current snapshot of the bug.

Project: Python
Category: Python Library
Status: Open
Resolution: None
Bug Group: None
Priority: 5
Submitted by: nobody
Assigned to : nobody
Summary: many std modules assume string.letters is [a-zA-Z]

Details: There are many modules in the standard library that
use string.letters to mean A-Za-z, but that assumption
is incorrect once a locale is set, because string.letters then
also contains the locale's accented letters.
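
A minimal sketch of why this matters, written for current Python 3
rather than the 2.0 interpreter in this report. LOCALE_LETTERS is a
hypothetical stand-in for what string.letters can turn into under a
Latin-1 locale, and is_identifier is an illustrative helper, not code
from any of the modules involved:

```python
ASCII_LETTERS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

# Hypothetical: a locale-extended letter set, standing in for what
# string.letters may contain once a Latin-1 locale is active.
LOCALE_LETTERS = ASCII_LETTERS + "\u00e0\u00e9\u00f1\u00fc"  # a-grave, e-acute, n-tilde, u-umlaut


def is_identifier(name, letters):
    """Return True if name is built only from letters, digits and '_',
    and does not start with a digit."""
    first = letters + "_"
    rest = first + "0123456789"
    return bool(name) and name[0] in first and all(c in rest for c in name)


# The same check gives different answers depending on the letter set.
print(is_identifier("caf\u00e9", ASCII_LETTERS))
print(is_identifier("caf\u00e9", LOCALE_LETTERS))
```

A module that builds its character set from a locale-dependent
constant silently changes what it accepts when the locale changes.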

Also, the readline library seems to cause the locale to be set according
to the current environment variables, even if I don't call locale.*:

% python2.0 -c 'import string; print string.letters'
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
% python2.0
Python 2.0 (#3, Oct 19 2000, 01:42:41) 
[GCC 2.95.2 20000220 (Debian GNU/Linux)] on linux2
Type "copyright", "credits" or "license" for more information.
>>> print string.letters
abcdefghijklmnopqrstuvwxyzµßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿABCDEFGHIJKLMNOPQRSTUVWXYZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ
>>> 

here's what grep says on the standard library. most
of these uses seem incorrect to me:

% grep string.letters **/*.py
Cookie.py:_LegalChars       = string.letters + string.digits + "!#$%&'*+-.^_`|~"
cmd.py:IDENTCHARS = string.letters + string.digits + '_'
dospath.py:    varchars = string.letters + string.digits + '_-'
lib-old/codehack.py:identchars = string.letters + string.digits + '_' # Identifier characters
ntpath.py:    varchars = string.letters + string.digits + '_-'
nturl2path.py:  if len(comp) != 2 or comp[0][-1] not in string.letters:
pipes.py:_safechars = string.letters + string.digits + '!@%_-+=:,./'    # Safe unquoted
pre.py:    alphanum=string.letters+'_'+string.digits
tokenize.py:    namechars, numchars = string.letters + '_', string.digits
urlparse.py:scheme_chars = string.letters + string.digits + '+-.'




Follow-Ups:

Date: 2000-Dec-26 12:18
By: nobody

Comment:
Names like string.ascii_letters etc. would be more precise
than string.alphabet, imho.

  -- erno@iki.fi
-------------------------------------------------------

Date: 2000-Dec-26 08:15
By: akuchling

Comment:
The docs for the string module say that, for example, string.lowercase is
"A string containing all the characters that are considered lowercase
letters."  This implies that the strings are locale-aware; code that uses
string.lowercase to mean only a-z is therefore in error.  (.digits is not
locale-aware.)

Solution: I'd suggest adding new, not locale-aware, constants.
string.alphabet, string.lower_alphabet, string.upper_alphabet, maybe?  Code
should then be changed to use these new constants.

-------------------------------------------------------
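
For illustration, here is how one of the uses from the grep output could
look with a locale-independent constant. This is a sketch for current
Python 3, where string.ascii_letters exists; SCHEME_CHARS and
uses_only_scheme_chars are hypothetical names chosen for this example,
not the actual urlparse code:

```python
import string

# string.ascii_letters is always a-zA-Z, whatever the locale is.
SCHEME_CHARS = string.ascii_letters + string.digits + "+-."


def uses_only_scheme_chars(scheme):
    """Hypothetical helper: True if every character of a non-empty
    scheme comes from the fixed ASCII set above."""
    return bool(scheme) and all(c in SCHEME_CHARS for c in scheme)


for candidate in ("http", "svn+ssh", "htt\u00fe"):
    print(candidate, uses_only_scheme_chars(candidate))
```

Spelling out the ASCII set makes the check behave the same whether or
not readline or the application has set a locale.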

For detailed info, follow this link:
http://sourceforge.net/bugs/?func=detailbug&bug_id=126706&group_id=5470