[Python-3000] string module trimming
jimjjewett at gmail.com
Thu Apr 19 01:08:59 CEST 2007
On 4/18/07, Guido van Rossum <guido at python.org> wrote:
> On 4/18/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> > Today, string.letters works most easily with ASCII supersets, and is
> > effectively limited to 8-bit encodings. Once everything is unicode, I
> > don't think that 8-bit restriction should apply any more.
> But we already went over this. There are over 40K letters in Unicode.
> It simply makes no sense to have a string.letters approaching that
Agreed. But there aren't 40K (alphabetic) letters in any particular
locale. Most individual languages will have fewer than 100.
As a proxy for measuring "local" characters, I'll note that during
some optimization drives for Pango (e.g.,
http://primates.ximian.com/~federico/news-2005-11.html#04 ) it turned
out that there were only two non-CJK languages that needed more than
256 cache positions in their character glyph tables.
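To make the size difference concrete, here is a rough sketch in
Python 3 syntax. The per-language table (SAMPLE_ALPHABETS) and the
helper names are hypothetical and only for illustration; they are not
an existing or proposed API, and the three alphabets are hand-written
samples rather than locale data. The total count depends on the
Unicode version the interpreter was built with.

```python
import string
import sys


def count_unicode_letters():
    """Count code points that str.isalpha() treats as letters."""
    return sum(1 for cp in range(sys.maxunicode + 1) if chr(cp).isalpha())


# Hypothetical, hand-written samples (an assumption, not real locale data).
SAMPLE_ALPHABETS = {
    "en": frozenset(string.ascii_letters),
    "de": frozenset(string.ascii_letters + "\u00c4\u00d6\u00dc\u00e4\u00f6\u00fc\u00df"),
    "sv": frozenset(string.ascii_letters + "\u00c5\u00c4\u00d6\u00e5\u00e4\u00f6"),
}


def is_letter_in(language, ch):
    """Return True if ch belongs to the sample alphabet for language."""
    return ch in SAMPLE_ALPHABETS[language]


if __name__ == "__main__":
    print("letters in all of Unicode:", count_unicode_letters())
    for language, letters in sorted(SAMPLE_ALPHABETS.items()):
        print(language, len(letters))
```

Each sample set stays well under 100 entries, while the global count
is orders of magnitude larger.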
> > Unless I missed it (and I may have), unicode itself sort of ducks the
> > question about how to sort strings. Python really needs to provide
> > *an* answer, but I'm not sure it is possible to provide the (single)
> > correct answer.
> The Unicode standard certainly has a solution, but it is complicated
> and I don't believe it is currently implemented in core Python.
I guess you're right; I saw too many alternatives the last time I
looked, and must have stopped reading http://unicode.org/reports/tr10/
after section 1, where it becomes obvious that there is no
context-free right answer.
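A small sketch of why there is no context-free answer, in Python 3
syntax. The function naive_sort_key is a hypothetical helper, not the
algorithm from tr10: it only strips combining marks and folds case.
That happens to resemble German dictionary ordering (a-umlaut sorts
with a) but is wrong for Swedish, where a-umlaut sorts after z.

```python
import unicodedata


def naive_sort_key(text):
    """Approximate key: drop combining marks, fold case.

    This is an illustrative shortcut, NOT the Unicode Collation
    Algorithm. The original text is kept as a tie-breaker so the
    ordering stays deterministic.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (stripped.casefold(), text)


words = ["zebra", "\u00c4pfel", "apple"]

# Plain code point order puts U+00C4 after every ASCII letter.
by_code_point = sorted(words)

# The naive key sorts the accented word next to its base letter.
by_naive_key = sorted(words, key=naive_sort_key)

if __name__ == "__main__":
    print(by_code_point)
    print(by_naive_key)
```

Neither ordering is "the" right one; which is acceptable depends on
the language the reader expects.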
> > string.letters is one workaround, and I don't think we should remove
> > it until a better solution (or workaround) is available.
> I disagree. The correct solution is to implement the Unicode support
> for locale-specific sorting.
I'm not convinced that waiting for such a heavyweight solution is
really the best choice, particularly since the spec itself warns
against using the strictest forms (too inefficient).
> Remember that the locale module supports only a single, global locale
> at a time. This renders it totally useless in many apps requiring
> locale support (such as web servers).