[Python-3000] string module trimming

Thu Apr 19 19:22:13 CEST 2007

"Jeffrey Yasskin" <jyasskin at gmail.com> wrote:
> On 4/18/07, Josiah Carlson <jcarlson at uci.edu> wrote:
> > "Jeffrey Yasskin" <jyasskin at gmail.com> wrote:
> > > I missed the beginning of this discussion, so sorry if you've already
> > > covered this. Are you saying that in your app, just because I've set
> > > the en_US locale, I won't be able to type "????"?  Or that those
> > > characters won't be recognized as letters?
> >
> > If I understand the conversation correctly, the discussion is what will
> > be in string.letters, and what will be handled in str.upper(), etc.,
> > when a locale is set.
> 
> string.letters should go away because I don't know of any correct uses
> of it, and as you say 40K letters is too long. Searching a list is the
> wrong way to decide whether a character is a letter, and case
> transformations don't work a character at a time (consider what
> happens with "ß".upper() (That is, U+00DF, German Small Sharp S)).
> http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt defines the
> mappings that aren't 1-1. There are some that are locale-specific, but
> you can do a pretty good job ignoring the language, as long as you
> allow strings to change length.

Because unicode strings are immutable, str.upper() always returns a new
string rather than changing one in place, so a change in length isn't an
issue.  I respond below regarding string.letters.
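
To make the length change concrete, here is a minimal sketch.  It
assumes a modern Python 3 interpreter, where str.upper() applies the
full case mappings; it says nothing about how any 2007 build behaved.

```python
# Assumption: a modern Python 3 interpreter, where str.upper() applies
# the full (not 1-to-1) case mappings.
s = "stra\u00dfe"        # "straße", containing U+00DF
upper = s.upper()        # a new string object; s itself is untouched

print(s, len(s))
print(upper, len(upper))
```

The original string keeps its length; only the newly returned string is
one character longer.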

> > > The Unicode character database (http://www.unicode.org/ucd/) seems
> > > like the obvious way to handle character properties if you want to get
> > > the right answers.
> >
> > Certainly, but having 40k characters in string.letters seems like a bit
> > of overkill, for *any* locale.  It seems as though it only makes sense
> > to include the letters for the current locale as string.letters, and to
> > handle str.upper(), etc., as determined by the locale.
> 
> As far as I understand, "letters for the current locale" is the same
> as "letters" in Unicode. Can you point me to a character that is a
> letter in one locale but not in another? (The third column of
> http://www.unicode.org/Public/UNIDATA/UnicodeData.txt defines the
> character's category, and
> http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values
> says what it means.)

Neither I nor, I believe, Python means 'letters' in the general sense,
but rather the 'alphabet' of a particular locale: for example, en_US
compared to sv_SE.
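
The two meanings can be put side by side in a rough sketch.  The
per-locale alphabet strings and the helper names below are illustrative
assumptions typed in by hand, not anything Python or the locale module
provides; only unicodedata.category() is a real library call.

```python
import string
import unicodedata

def is_letter(ch):
    """'Letter' in the Unicode sense: General Category starts with 'L'."""
    return unicodedata.category(ch).startswith("L")

# Hypothetical hand-written alphabets, for illustration only.
ALPHABETS = {
    "en_US": string.ascii_lowercase,
    "sv_SE": string.ascii_lowercase + "\u00e5\u00e4\u00f6",  # plus å ä ö
}

def in_alphabet(ch, locale_name):
    """'Letter' in the narrower sense: member of that locale's alphabet."""
    return ch.lower() in ALPHABETS[locale_name]

for ch in ("a", "\u00e5", "1"):
    print(repr(ch), is_letter(ch),
          in_alphabet(ch, "en_US"), in_alphabet(ch, "sv_SE"))
```

U+00E5 is a letter by category no matter which locale is set, but it
belongs only to the sv_SE alphabet in this sketch.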

> > In terms of sorting, since all (unicode) strings should be comparable to
> > one another, using the unicode-specified ordering would seem to make
> > sense, unless it is something other than code point values.  If it isn't
> > code point values (which seems to be the implication), then we need to
> > decide if we want to check a 128kbyte table (for UCS-2 builds) in order
> > to sort strings (though cache lookup locality may make this a moot point
> > for most comparisons).
> 
> If you just need to store strings in an order-based data structure
> (which I guess is moot for python with its hashes), then codepoint
> order is fine. If you intend to show users a sorted list, then you
> have to use the real collation algorithm or you'll produce the wrong
> answer. I don't understand the algorithm's details, but ICU has an
> implementation, and http://icu-project.org/charts/icu4c_footprint.html
> claims that the data for all languages fits in 354K.

It could probably be reduced below 354K with two tables and a
comparison function that knows how to handle surrogates.
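
For the difference between code point order and an order you would show
to users, here is a rough sketch.  The rough_key() helper is a
hypothetical, crude stand-in (decompose, drop combining marks,
casefold); it is not the Unicode collation algorithm and not what ICU
does, and it will give wrong answers for many languages.

```python
import unicodedata

words = ["zebra", "\u00c4pfel", "apple"]   # middle word is "Äpfel"

# Plain string comparison orders by code point, so U+00C4 sorts after "z".
by_code_point = sorted(words)

def rough_key(s):
    # Crude approximation only, NOT the real collation algorithm:
    # decompose, drop combining marks, then casefold.
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()

by_rough_key = sorted(words, key=rough_key)

print(by_code_point)
print(by_rough_key)
```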

> UCS-2 is an old and broken fixed-width encoding that cannot represent
> characters above U+FFFF. Nobody should ever use it. You probably meant
> UTF-16.

You are more or less right.  Earlier versions of Windows were limited to
UCS-2, and I believe earlier versions of Python on Windows were also
limited to UCS-2.  For narrow builds we use UTF-16, with surrogate pairs
and everything (though a unicode string consisting of a single surrogate
pair will have length 2, not 1 as would be expected).
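
The length-2 behaviour can be reproduced by counting UTF-16 code units
directly.  This is a sketch assuming a modern Python 3 interpreter,
where len() itself counts code points; utf16_units() and
to_surrogates() are hypothetical helper names, not library functions.

```python
def utf16_units(s):
    """Number of 16-bit code units needed to store s as UTF-16."""
    return len(s.encode("utf-16-le")) // 2

def to_surrogates(cp):
    """Split a code point above U+FFFF into its (high, low) surrogates."""
    if cp <= 0xFFFF:
        raise ValueError("code point fits in a single 16-bit unit")
    cp -= 0x10000
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

clef = "\U0001D11E"   # MUSICAL SYMBOL G CLEF, outside the BMP

print(utf16_units("a"), utf16_units(clef))
print([hex(u) for u in to_surrogates(ord(clef))])
```

A character below U+10000 takes one unit; anything above takes the two
units of a surrogate pair, which is the count a narrow build reports.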

 - Josiah