[Python-3000] string module trimming

Thu Apr 19 01:16:43 CEST 2007

On 4/18/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> On 4/18/07, Guido van Rossum <guido at python.org> wrote:
> > On 4/18/07, Jim Jewett <jimjjewett at gmail.com> wrote:
>
> > > Today, string.letters works most easily with ASCII supersets, and is
> > > effectively limited to 8-bit encodings.  Once everything is unicode, I
> > > don't think that 8-bit restriction should apply any more.
>
> > But we already went over this. There are over 40K letters in Unicode.
> > It simply makes no sense to have a string.letters approaching that
> > size.
>
> Agreed.  But there aren't 40K (alphabetic) letters in any particular
> locale.  Most individual languages will have fewer than 100.

Isn't that excluding the written language of half the world population
(at least China, Korea and Japan)?

> As a proxy for measuring "local" characters, I'll note that during
> some optimization drives for Pango (e.g.,
> http://primates.ximian.com/~federico/news-2005-11.html#04 ) it turned
> out that there were only two non C-J-K languages that needed more than
> 256 cache positions in their character glyph tables.

But here we're talking features, not optimizations. I really don't
think it's a good idea to propose a feature that can't be used
reasonably for CJK languages.

> > > Unless I missed it (and I may have), unicode itself sort of ducks the
> > > question about how to sort strings.  Python really needs to provide
> > > *an* answer, but I'm not sure it is possible to provide the (single)
> > > correct answer.
>
> > The Unicode standard certainly has a solution, but it is complicated
> > and I don't believe it is currently implemented in core Python.
>
> I guess you're right; I saw too many alternatives the last time I
> looked, and must have stopped reading http://unicode.org/reports/tr10/
> after section 1, where it becomes obvious that there is no
> context-free right answer.
>
> > > string.letters is one workaround, and I don't think we should remove
> > > it until a better solution (or workaround) is available.
>
> > I disagree. The correct solution is to implement the Unicode support
> > for locale-specific sorting.
>
> And set-inclusion.

For set-inclusion we already have isalpha() etc. That should be
enough. I really don't see much of a use case for inquiries of the
type "is this a letter in my locale" -- by the time you are doing
that, you probably are only thinking of one specific locale, and then
you should just reject non-locale characters altogether rather than
treating them as punctuation.
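
A minimal sketch of the two ideas above, assuming Python 3 style
(unicode) strings. The sample characters, the ALLOWED set, and the
helper name reject_foreign_letters are made up for illustration; only
str.isalpha() and the string module constant are real APIs.

```python
import string

# str.isalpha() answers "is this a letter?" for any script,
# with no table of letters to maintain.
samples = ["a", "é", "中", "한", "あ", "1", "!", " "]
results = {ch: ch.isalpha() for ch in samples}

# Hypothetical "one specific locale" alphabet: ASCII letters plus a few
# Swedish ones. A real application would choose its own set.
ALLOWED = set(string.ascii_letters + "åäöÅÄÖ")

def reject_foreign_letters(text, allowed_letters):
    """Return text unchanged, or raise if it has a letter outside the set.

    Non-letters (digits, spaces, punctuation) pass through untouched;
    letters from other scripts are rejected rather than being treated
    as punctuation.
    """
    for ch in text:
        if ch.isalpha() and ch not in allowed_letters:
            raise ValueError("letter outside the expected set: %r" % ch)
    return text
```

Here isalpha() does the set-inclusion test, and the explicit set is only
needed once the program has already committed to a single alphabet.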

> I'm not convinced that waiting for such a heavyweight solution is
> really the best choice, particularly since the spec itself warns
> against using the strictest forms (too inefficient).
>
> > Remember that the locale module supports only a single, global locale
> > at a time. This renders it totally useless in many apps requiring
> > locale support (such as web servers).
>
> Fair enough.
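
To illustrate the single-global-locale point, here is a rough sketch
using the existing locale module. It sticks to the portable "C" locale,
since which other locales are installed varies by system; the word list
is made up.

```python
import locale

# Calling setlocale() without a value only reads the current setting.
before = locale.setlocale(locale.LC_COLLATE)

# Setting it changes collation for the whole process, not just this
# caller: every thread that uses locale.strxfrm() sees the change.
locale.setlocale(locale.LC_COLLATE, "C")
during = locale.setlocale(locale.LC_COLLATE)

# Locale-aware sorting goes through the global setting via strxfrm().
words = ["pear", "Apple", "banana"]
ordered = sorted(words, key=locale.strxfrm)

# Restore the previous setting; code serving two users with different
# locales at once would have to do this dance around every sort.
locale.setlocale(locale.LC_COLLATE, before)
```

The save/set/restore pattern is exactly what makes this awkward in a
threaded server, because another thread can sort while the setting is
temporarily changed.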

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)