[Python-3000] string module trimming

Thu Apr 19 20:52:00 CEST 2007

On 4/19/07, Josiah Carlson <jcarlson at uci.edu> wrote:
>
> "Jeffrey Yasskin" <jyasskin at gmail.com> wrote:
> > On 4/18/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> > > On 4/18/07, Guido van Rossum <guido at python.org> wrote:
> > > > On 4/18/07, Jim Jewett <jimjjewett at gmail.com> wrote:

> > > Agreed.  But there aren't 40K (alphabetic) letters in any particular
> > > locale.  Most individual languages will have less than 100.

> > ... Are you saying that in your app, just because I've set
> > the en_US locale, I won't be able to type "????"?  Or that those
> > characters won't be recognized as letters?

The latter.  Some applications may reject them for that reason; for
example some domain registrars have policies to prevent domain name
spoofing with similar-looking characters.  One way to do that is to
say that a character used in a domain name (under that registrar) is
limited to those letters used by the appropriate national language.
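
A minimal sketch of that kind of allow-list check, in case it helps make the idea concrete. The character set and the function name are assumptions made up for illustration (here, a hypothetical registrar serving German-language names); this is not any real registrar's policy.

```python
import string

# Hypothetical allow-list: ASCII letters, digits, hyphen, and the extra
# letters used in German. A real registrar would publish its own table.
ALLOWED_GERMAN = set(string.ascii_lowercase + string.digits + "-" + "äöüß")

def is_allowed_label(label, allowed=ALLOWED_GERMAN):
    """Return True if every character of the label is in the allow-list."""
    return all(ch in allowed for ch in label.lower())

genuine = "münchen"
lookalike = "p\u0430ypal"   # U+0430 is CYRILLIC SMALL LETTER A, not Latin "a"

print(is_allowed_label(genuine))    # True
print(is_allowed_label(lookalike))  # False
```

The Cyrillic letter is a perfectly good letter in Unicode terms; it is rejected only because it is outside the set chosen for this one application.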

> In terms of sorting, since all (unicode) strings should be comparable to
> one another, using the unicode-specified ordering would seem to make
> sense, unless it is something other than code point values.

It is definitely something other than code-point values.

In particular, see section 1.8 (common misconceptions) of
http://unicode.org/reports/tr10/
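
A small illustration of why raw code-point order is not what most readers would call alphabetical. It uses only the built-in string comparison, which compares code points; the word list is made up.

```python
# Built-in str comparison is by code point, so every uppercase ASCII letter
# sorts before every lowercase one, and accented letters sort after "z".
words = ["apple", "Zebra", "\u00e9clair"]   # "\u00e9" is e with acute accent

by_code_point = sorted(words)
print(by_code_point)   # ['Zebra', 'apple', 'éclair']
```

A dictionary would normally list these as apple, éclair, Zebra, which is the kind of ordering the collation algorithm is meant to produce.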

The sorting isn't fully defined without locale-specific tailoring and
a Unicode Collation Element Table (default 4 bytes/char, though
compressible).  There is a default tailoring and a Default Unicode
Collation Element Table; it looks (but I haven't proven to myself) as
if these defaults are sufficient for most uses, but certainly not for
all of them.

Unicode sorting (even with your own collation table) definitely
requires normalization, which is something Python has been careful not
to promise.  (There were some arguments over whether normalization was
even possible to do in a strictly correct fashion.  I didn't
understand them well enough to remember the summary.)  Unless the
"repertoire of supported character sequences" is (unnaturally)
restricted, normalization is only an intermediate step; a third
representation is constructed for the actual comparison.  This third
form can be built a few characters at a time, but then you have to
rebuild it for the next comparison.
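
To make the normalization step concrete, here is a short sketch using the standard library's unicodedata module. It shows only normalization, not the later construction of a comparison form; the variable names are mine.

```python
import unicodedata

composed = "\u00e9"       # e with acute accent as one code point
decomposed = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

# The two strings render the same but are different code-point sequences.
print(composed == decomposed)                                  # False

# After normalizing both to the same form, they compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)    # True
print(unicodedata.normalize("NFD", composed) == decomposed)    # True
```

Without a step like this, a plain comparison treats the two spellings of the same letter as different strings.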

As best I can tell from the default settings, there are distinct
strings which compare equal, unequal strings which are not ordered,
and strings for which you must compare multiple characters at once
("x"<"y", but "xz">"yz").
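
The multiple-characters case can be sketched with a toy tailoring. In Czech, the digraph "ch" is treated as a single letter that sorts after "h". The code below is a deliberately simplified assumption (lowercase ASCII only, one contraction, a single weight level) and is not the Unicode Collation Algorithm; toy_sort_key and the tables are hypothetical names.

```python
# Simplified alphabet: ASCII letters plus the digraph "ch" placed after "h".
# Real Czech collation also tailors accented letters; those are ignored here.
ALPHABET = list("abcdefgh") + ["ch"] + list("ijklmnopqrstuvwxyz")
WEIGHT = {element: index for index, element in enumerate(ALPHABET)}

def toy_sort_key(text):
    """Map lowercase ASCII text to a tuple of weights, treating "ch" as one unit."""
    key = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in WEIGHT:
            # Contraction: two characters form one collation element.
            key.append(WEIGHT[pair])
            i += 2
        else:
            # Raises KeyError for anything outside the toy alphabet.
            key.append(WEIGHT[text[i]])
            i += 1
    return tuple(key)

print(toy_sort_key("c") < toy_sort_key("d"))     # True
print(toy_sort_key("ch") > toy_sort_key("dh"))   # True
```

Here "c" sorts before "d", yet appending "h" to both reverses the order, because "ch" has to be looked at as a unit rather than one character at a time.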

-jJ