[Python-3000] string module trimming

Josiah Carlson jcarlson at uci.edu
Thu Apr 19 08:50:17 CEST 2007


"Jeffrey Yasskin" <jyasskin at gmail.com> wrote:
> On 4/18/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> > On 4/18/07, Guido van Rossum <guido at python.org> wrote:
> > > On 4/18/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> > > But we already went over this. There are over 40K letters in Unicode.
> > > It simply makes no sense to have a string.letters approaching that
> > > size.
> >
> > Agreed.  But there aren't 40K (alphabetic) letters in any particular
> > locale.  Most individual languages will have fewer than 100.
> 
> I missed the beginning of this discussion, so sorry if you've already
> covered this. Are you saying that in your app, just because I've set
> the en_US locale, I won't be able to type "????"?  Or that those
> characters won't be recognized as letters?

If I understand the conversation correctly, the question is what will
be in string.letters, and how str.upper(), etc., will behave when a
locale is set.


> The Unicode character database (http://www.unicode.org/ucd/) seems
> like the obvious way to handle character properties if you want to get
> the right answers.

Certainly, but having 40k characters in string.letters seems like a bit
of overkill, for *any* locale.  It seems as though it only makes sense
to include the letters for the current locale as string.letters, and to
handle str.upper(), etc., as determined by the locale.
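
To illustrate the difference between a fixed letter table and a property
lookup, here is a minimal sketch using the standard unicodedata module,
which exposes the Unicode character database. The helper names is_letter
and letters_in_range are hypothetical, made up for this example, and the
ASCII range stands in for "the letters of one locale" purely as an
assumption; it is not a proposal for what string.letters should contain.

```python
import unicodedata

def is_letter(ch):
    """Return True if the UCD general category of ch is a letter (L*)."""
    return unicodedata.category(ch).startswith("L")

def letters_in_range(start, stop):
    """Collect the letters among code points start..stop-1 (hypothetical helper)."""
    return [chr(cp) for cp in range(start, stop) if is_letter(chr(cp))]

# A small, fixed table, comparable in spirit to a per-locale string.letters.
ascii_letters = letters_in_range(0, 128)

# A property lookup answers the question for any character without
# materializing a table of every letter in Unicode.
print(len(ascii_letters))   # number of letters in the ASCII range
print(is_letter("\u00e9"))  # LATIN SMALL LETTER E WITH ACUTE
print(is_letter("1"))       # a digit, not a letter
```

The per-character lookup stays cheap no matter how many letters Unicode
defines, while the table only stays small if it is restricted to some
subset such as a single locale.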

In terms of sorting, since all (unicode) strings should be comparable to
one another, using the Unicode-specified ordering would seem to make
sense, unless that ordering is something other than code point order.
If it isn't code point order (which seems to be the implication), then
we need to decide whether we want to consult a 128 kbyte table (for
UCS-2 builds) in order to sort strings (though cache locality may make
this a moot point for most comparisons).
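
The two orderings can be compared with a short sketch. Plain string
comparison sorts by code point; a table-driven ordering needs a key
function. The sketch below uses locale.strxfrm from the standard library
as one possible stand-in for such an ordering. That choice is an
assumption made for illustration: it follows the platform's locale
collation, not the Unicode-specified ordering itself. The function
sort_with_locale is a hypothetical helper.

```python
import locale

words = ["zebra", "Zebra", "apple", "\u00e9clair"]

# Default comparison: ordered by code point values, so every uppercase
# ASCII letter sorts before every lowercase one, and accented letters
# sort after "z".
by_code_point = sorted(words)
print(by_code_point)

def sort_with_locale(items, name=""):
    """Sort items using the collation of the named locale.

    Returns None if the locale is not available on this system.
    An empty name selects the user's default locale.
    """
    try:
        locale.setlocale(locale.LC_COLLATE, name)
    except locale.Error:
        return None
    # strxfrm maps each string to a key whose ordinary comparison
    # matches the locale's collation rules.
    return sorted(items, key=locale.strxfrm)
```

The result of sort_with_locale depends on which locales are installed,
so no output is shown for it here. The code point sort needs no table
at all, which is the trade-off raised above.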

 - Josiah


