[Python-3000] string module trimming

Thu Apr 19 17:14:15 CEST 2007

On 4/18/07, Josiah Carlson <jcarlson at uci.edu> wrote:
> "Jeffrey Yasskin" <jyasskin at gmail.com> wrote:
> > I missed the beginning of this discussion, so sorry if you've already
> > covered this. Are you saying that in your app, just because I've set
> > the en_US locale, I won't be able to type "????"?  Or that those
> > characters won't be recognized as letters?
>
> If I understand the conversation correctly, the discussion is what will
> be in string.letters, and what will be handled in str.upper(), etc.,
> when a locale is set.

string.letters should go away because I don't know of any correct uses
of it, and as you say 40K letters is too long. Searching a list is the
wrong way to decide whether a character is a letter, and case
transformations don't work a character at a time: consider
"ß".upper(), where "ß" is U+00DF, LATIN SMALL LETTER SHARP S, whose
full uppercase mapping is the two-character string "SS".
http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt defines the
mappings that aren't 1-1. Some of them are locale-specific, but you
can do a pretty good job while ignoring the language, as long as you
allow strings to change length.
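
To make the length change concrete, here is a minimal sketch. It
assumes a Python 3 interpreter (3.3 or later) whose str.upper() applies
the full, language-independent mappings from SpecialCasing.txt; the
variable names are only for illustration.

```python
# Case mappings that are not 1-1 change the length of the string.
sharp_s = "\u00df"      # LATIN SMALL LETTER SHARP S
ligature_fi = "\ufb01"  # LATIN SMALL LIGATURE FI

upper_sharp_s = sharp_s.upper()
upper_fi = ligature_fi.upper()

# One character in, two characters out in both cases.
print(len(sharp_s), len(upper_sharp_s))  # 1 2
print(upper_sharp_s, upper_fi)           # SS FI
```

The locale-specific mappings (Turkish dotted and dotless i, for
example) are not covered by this sketch.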

> > The Unicode character database (http://www.unicode.org/ucd/) seems
> > like the obvious way to handle character properties if you want to get
> > the right answers.
>
> Certainly, but having 40k characters in string.letters seems like a bit
> of overkill, for *any* locale.  It seems as though it only makes sense
> to include the letters for the current locale as string.letters, and to
> handle str.upper(), etc., as determined by the locale.

As far as I understand, "letters for the current locale" is the same
as "letters" in Unicode. Can you point me to a character that is a
letter in one locale but not in another? (The third column of
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt defines the
character's category, and
http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values
says what it means.)
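
As a rough sketch of a locale-independent letter test, the category can
be read through Python's unicodedata module instead of searching a
list. The helper name is_letter is hypothetical, and treating every
category that starts with "L" (Lu, Ll, Lt, Lm, Lo) as a letter is my
assumption about what "letter" should mean here.

```python
import unicodedata

def is_letter(ch):
    """Return True if ch's General_Category is a letter category (L*)."""
    return unicodedata.category(ch).startswith("L")

print(is_letter("a"))       # True  (Ll)
print(is_letter("\u00df"))  # True  (Ll, sharp s)
print(is_letter("\u4e2d"))  # True  (Lo, a CJK ideograph)
print(is_letter("1"))       # False (Nd)
```

The answer comes from a table lookup on the code point, and no locale
is consulted at any point.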

> In terms of sorting, since all (unicode) strings should be comparable to
> one another, using the unicode-specified ordering would seem to make
> sense, unless it is something other than code point values.  If it isn't
> code point values (which seems to be the implication), then we need to
> decide if we want to check a 128kbyte table (for UCS-2 builds) in order
> to sort strings (though cache lookup locality may make this a moot point
> for most comparisons).

If you just need to store strings in an order-based data structure
(which I guess is moot for Python with its hashes), then code point
order is fine. If you intend to show users a sorted list, then you
have to use the real collation algorithm or you'll produce the wrong
answer. I don't understand the algorithm's details, but ICU has an
implementation, and http://icu-project.org/charts/icu4c_footprint.html
claims that the data for all languages fits in 354K.
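
A small illustration of why code point order is the wrong answer for
display, using nothing but Python 3's default string comparison. The
word list is made up for the example.

```python
# Python 3 compares str objects by code point, so sorted() gives
# code point order.
words = ["zebra", "Apple", "\u00e9clair", "apple"]
by_code_point = sorted(words)

# "é" (U+00E9) sorts after "z" (U+007A), and every uppercase ASCII
# letter sorts before every lowercase one.
print(by_code_point)  # ['Apple', 'apple', 'zebra', 'éclair']
```

A reader of English or French would expect "éclair" to come before
"zebra". Getting that order takes a real collation implementation, such
as ICU or the platform's locale support, neither of which is shown
here.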

UCS-2 is an old and broken fixed-width encoding that cannot represent
characters above U+FFFF. Nobody should ever use it. You probably meant
UTF-16.
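
To illustrate the difference, here is a sketch showing how UTF-16
represents a character above U+FFFF as a surrogate pair, which a
fixed-width 16-bit encoding has no way to express. The choice of
U+1D11E as the sample character is arbitrary.

```python
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP

# Big-endian UTF-16 without a byte order mark: two 16-bit code units.
utf16 = clef.encode("utf-16-be")
high = int.from_bytes(utf16[:2], "big")
low = int.from_bytes(utf16[2:], "big")

print(len(utf16))                 # 4
print(f"{high:04X} {low:04X}")    # D834 DD1E
```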

-- 
Namasté,
Jeffrey Yasskin