[Python-3000] Four new failing tests

Mon Aug 13 20:57:28 CEST 2007

On 8/11/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> > ======================================================================
> > ERROR: test_char_write (__main__.TestArrayWrites)
> > ----------------------------------------------------------------------
> > Traceback (most recent call last):
> >   File "Lib/test/test_csv.py", line 648, in test_char_write
> >     a = array.array('u', string.letters)
> > ValueError: string length not a multiple of item size

I fixed this by removing the code from _locale.c that changes string.letters.

> I think some decision should be made wrt. string.letters.
>
> Clearly, string.letters cannot reasonably contain *all* letters
> (i.e. all characters of categories Ll, Lu, Lt, Lo). Or can it?
>
> Traditionally, string.letters contained everything that is a letter
> in the current locale. Still, computing this string might be expensive
> assuming you have to go through all Unicode code points and determine
> whether they are letters in the current locale.
>
> So I see the following options:
> 1. remove it entirely. Keep string.ascii_letters instead
> 2. remove string.ascii_letters, and make string.letters to be
>    ASCII only.
> 3. Make string.letters contain all letters in the current locale.
> 4. Make string.letters truly contain everything that is classified
>    as a letter in the Unicode database.
>
> Which one should happen?

First I'd like to rule out 3 and 4. I don't like 3 because in our new
all-unicode world, using the locale for deciding what letters are
makes no sense -- one should use isalpha() etc. I think 4 is not at
all what people who use string.letters expect, and it's too large.
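
To make the contrast concrete, here is a rough sketch (the sample
characters are arbitrary, chosen only for illustration). It compares
membership in string.ascii_letters with str.isalpha(), and counts the
code points in the categories Ll, Lu, Lt and Lo to give a feel for how
large the string in option 4 would be. The exact count depends on the
Unicode database version shipped with the interpreter, so none is
quoted here.

```python
import string
import sys
import unicodedata

# ASCII-only membership test versus the Unicode-aware str method.
for ch in ("a", "é", "я", "1"):
    print(repr(ch), ch in string.ascii_letters, ch.isalpha())

# Count the code points in the general categories Ll, Lu, Lt and Lo,
# i.e. the ones option 4 would have to include.
letter_categories = {"Ll", "Lu", "Lt", "Lo"}
n_letters = sum(
    1
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) in letter_categories
)
print(n_letters)
```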

I think 2 is unnecessarily punishing people who use
string.ascii_letters -- they have already declared they don't care
about Unicode and we shouldn't break their code.

So that leaves 1.

There are (I think) two categories of users who use string.letters:

(a) People who have never encountered a non-English locale and for
whom there is no difference between string.ascii_letters and
string.letters. Their code may or may not work in other locales. We're
doing them a favor by flagging this in their code by removing
string.letters.

(b) People who want locale-specific behavior. Their code will probably
break anyway, since they are apparently processing text using 8-bit
characters encoded in a fixed-width encoding (e.g. the various Latin-N
encodings). They ought to convert their code to Unicode. Once they are
processing Unicode strings, they can just use isalpha() etc. If they
really want to know the set of letters that can be encoded in their
locale's encoding, they can use locale.getpreferredencoding() and
deduce it from there, e.g.:

import locale
enc = locale.getpreferredencoding()
# Decode all 256 byte values and keep the alphabetic characters.
letters = [c for c in bytes(range(256)).decode(enc) if c.isalpha()]

This won't work for multi-byte encodings of course -- but their code
never worked in that case anyway.
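
A slightly more defensive variant of the snippet above, as a sketch
only: some single-byte encodings leave a few byte values undefined, and
decoding all 256 bytes in one call then raises UnicodeDecodeError.
Decoding one byte at a time lets those values be skipped. The helper
name letters_in_encoding is hypothetical, and the approach still only
makes sense for single-byte encodings.

```python
import locale


def letters_in_encoding(enc):
    """Return the letters representable as a single byte in `enc`.

    Hypothetical helper; only meaningful for single-byte encodings.
    """
    letters = []
    for i in range(256):
        try:
            c = bytes([i]).decode(enc)
        except UnicodeDecodeError:
            # This byte value is undefined in `enc`, or is only part
            # of a multi-byte sequence, so skip it.
            continue
        if c.isalpha():
            letters.append(c)
    return letters


enc = locale.getpreferredencoding()
letters = letters_in_encoding(enc)
```

With a multi-byte preferred encoding such as UTF-8, this yields only
the ASCII letters, since no byte above 127 decodes on its own.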

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)