convert Unicode to lower/uppercase?

jallan jallan at smrtytrek.com
Wed Sep 24 03:17:00 CEST 2003


Peter Otten <__peter__ at web.de> wrote in message news:<bkpvml$m67$06$1 at news.t-online.com>...
> jallan wrote:
> 
> > I don't see any particular reason why Python "cannot handle case
> > mappings that increase string lengths".
> 
> Now that's a long post. I think it essentially boils down to the above
> statement.
> 
> Looking into stringobject.c (judging from a first impression,
> unicodeobject.c has essentially the same algorithm, but with a few
> indirections):
> 
> static PyObject *
> string_upper(PyStringObject *self)
> {
>         char *s = PyString_AS_STRING(self), *s_new;
>         int i, n = PyString_GET_SIZE(self);
>         PyObject *new;
> 
>         new = PyString_FromStringAndSize(NULL, n);
>         if (new == NULL)
>                 return NULL;
>         s_new = PyString_AsString(new);
>         for (i = 0; i < n; i++) {
>                 int c = Py_CHARMASK(*s++);
>                 if (islower(c)) {
>                         *s_new = toupper(c);
>                 } else
>                         *s_new = c;
>                 s_new++;
>         }
>         return new;
> }
> 
> The whole routine builds on the assumption that len(s) == len(s.upper()) and
> nothing short of a complete rewrite will fix that. But if you volunteer...

I would love to if I had the time. Sigh!  Maybe in some months.

> Personally, I think it's a long way to go for a little s, sharp as it may be
> :-)

If it were just ß one could throw in a quick conversion of any ß to
ss at the beginning.

But there are over a hundred other characters that expand when
uppercased in http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt,
most of them Greek. Greek is a horror. See
http://www.tlg.uci.edu/~opoudjis/unicode/unicode_adscript.html for the
sad tale.
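
The "quick conversion" idea generalizes to a lookup table consulted before the ordinary one-to-one mapping. Here is a minimal sketch in Python, not a proposal for the actual C routine. The function name and the table are made up for illustration, and the table holds only a handful of unconditional entries copied by hand from SpecialCasing.txt. The context-sensitive and language-sensitive rules in that file are ignored entirely.

```python
# Illustrative excerpt only: a few unconditional uppercase expansions
# hand-copied from SpecialCasing.txt. The real file has many more entries,
# plus conditional rules that this sketch does not attempt to handle.
SPECIAL_UPPER = {
    "\u00df": "SS",                  # LATIN SMALL LETTER SHARP S
    "\u0149": "\u02bcN",             # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
    "\ufb01": "FI",                  # LATIN SMALL LIGATURE FI
    "\ufb02": "FL",                  # LATIN SMALL LIGATURE FL
    "\u0390": "\u0399\u0308\u0301",  # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
}


def upper_with_expansions(text):
    """Uppercase text, letting one character map to several (hypothetical helper)."""
    pieces = []
    for ch in text:
        if ch in SPECIAL_UPPER:
            # One-to-many mapping: the result is longer than the input.
            pieces.append(SPECIAL_UPPER[ch])
        else:
            # Ordinary per-character mapping.
            pieces.append(ch.upper())
    # The output is assembled from pieces, so nothing assumes
    # len(result) == len(text).
    return "".join(pieces)
```

The point is only that the output is collected piecewise instead of being written into a buffer sized from the input, which is the assumption the quoted C routine makes.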

Unfortunately language and orthography are messy and inconsistent and
illogical and sometimes just silly. But handling orthography properly
involves dealing with these complex rules and subrules and exceptions
to rules rather than ignoring them.

Unicode gives us great power, but with great power comes great
responsibility and lots of niggling code. :-(

Fortunately only the Latin, Greek, Coptic, Cyrillic and Armenian
scripts have such a thing as casing and the Unicode people have
provided data files and algorithms that supposedly handle casing for
these scripts acceptably.

From the Conformance requirements for Unicode at
http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf#G29484 (C20):

<< An implementation that purports to support the default casing
operations of case conversion, case detection, and caseless mapping
shall do so in accordance with the definitions and specifications in
Section 3.13, Default Case Operations. >>

This involves even more messy fussing about with context specification
for casing and with what values should be returned from a case
querying function, e.g. "A2" is true as both uppercase and titlecase
but not as lowercase. "3" is true as lowercase, uppercase and
titlecase.
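
As I read Section 3.13, a string counts as being in a given case when converting it to that case leaves it unchanged. A rough Python sketch of that reading follows. The three function names are my own, the normalization step the standard asks for is skipped, and Python's title() is only an approximation of the Unicode titlecase operation, so treat this as an illustration of the definition and not a conforming implementation.

```python
def is_lowercase(s):
    # True when lowercasing changes nothing (uncased characters pass).
    return s.lower() == s


def is_uppercase(s):
    # True when uppercasing changes nothing.
    return s.upper() == s


def is_titlecase(s):
    # Approximation: relies on Python's own notion of title case.
    return s.title() == s


for sample in ("A2", "3"):
    print(sample, is_lowercase(sample), is_uppercase(sample), is_titlecase(sample))
```

Note that this differs from the built-in query methods, which require at least one cased character: "3".islower() is False in Python, while the definition above makes "3" lowercase, uppercase and titlecase all at once.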

Python or any application or language either does or doesn't conform.

I doubt that there is currently any application that can yet honestly
purport to support Unicode default casing operations of case
conversion, case detection and caseless mapping.

Jim Allan



