convert Unicode to lower/uppercase?

jallan jallan at
Wed Sep 24 03:17:00 CEST 2003

Peter Otten <__peter__ at> wrote in message news:<bkpvml$m67$06$1 at>...
> jallan wrote:
> > I don't see any particular reason why Python "cannot handle case
> > mappings that increase string lengths".
> Now that's a long post. I think it essentially boils down to the above
> statement.
> Looking into stringobject.c (judging from a first impression,
> unicodeobject.c has essentially the same algorithm, but with a few
> indirections):
> static PyObject *
> string_upper(PyStringObject *self)
> {
>         char *s = PyString_AS_STRING(self), *s_new;
>         int i, n = PyString_GET_SIZE(self);
>         PyObject *new;
>         new = PyString_FromStringAndSize(NULL, n);
>         if (new == NULL)
>                 return NULL;
>         s_new = PyString_AsString(new);
>         for (i = 0; i < n; i++) {
>                 int c = Py_CHARMASK(*s++);
>                 if (islower(c)) {
>                         *s_new = toupper(c);
>                 } else
>                         *s_new = c;
>                 s_new++;
>         }
>         return new;
> }
> The whole routine builds on the assumption that len(s) == len(s.upper()) and
> nothing short of a complete rewrite will fix that. But if you volunteer...

I would love to if I had the time. Sigh!  Maybe in some months.
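
For what it's worth, a length-changing uppercase routine need not be
complicated if the result is built up piece by piece instead of being
written into a buffer whose size is fixed in advance. Below is a minimal
Python sketch of that idea, not a patch for the C code. The table
SPECIAL_UPPER and the function name full_upper are hypothetical, and the
table holds only three illustrative one-to-many mappings; a real
implementation would load the complete set from the Unicode data files.

```python
# Hypothetical, deliberately tiny table of one-to-many uppercase mappings.
# The three entries are illustrative only; a real implementation would
# load the complete set from the Unicode data files.
SPECIAL_UPPER = {
    "\u00df": "SS",        # LATIN SMALL LETTER SHARP S
    "\ufb01": "FI",        # LATIN SMALL LIGATURE FI
    "\u0149": "\u02bcN",   # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
}


def full_upper(s):
    """Uppercase s without assuming the result has the same length."""
    pieces = []
    for ch in s:
        # Use the one-to-many mapping when there is one, otherwise fall
        # back to the ordinary per-character mapping.
        pieces.append(SPECIAL_UPPER.get(ch, ch.upper()))
    return "".join(pieces)


print(full_upper("stra\u00dfe"))   # STRASSE
```

The point of the design is only that the output is collected in a list
and joined at the end, so nothing depends on len(s) == len(s.upper()).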

> Personally, I think it's a long way to go for a little s, sharp as it may be
> :-)

If it were just ß one could throw in a quick conversion of any ß to
ss at the beginning.
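
That quick conversion could be sketched roughly as follows. The function
name upper_with_sharp_s is hypothetical, and the sketch deals with ß
only, not with any of the other expanding characters.

```python
def upper_with_sharp_s(s):
    """Uppercase s, expanding every sharp s to "ss" beforehand."""
    # Replace first, so the ordinary per-character uppercase mapping
    # never has to deal with the sharp s itself.
    return s.replace("\u00df", "ss").upper()


print(upper_with_sharp_s("Fu\u00dfball"))   # FUSSBALL
```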

But there are over a hundred other characters that expand when
uppercased in,
most of them Greek. Greek is a horror. See for the
sad tale.
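
One rough way to see which characters are affected is to ask an
implementation that does apply the one-to-many mappings. The sketch
below assumes a Python 3 interpreter whose str.upper() applies the full
Unicode case mappings; the function name expanding_uppercase_chars is
hypothetical, and the counts printed depend on the Unicode version the
interpreter was built with.

```python
import sys
import unicodedata


def expanding_uppercase_chars():
    """Return the characters whose uppercase form is longer than one character."""
    found = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if len(ch.upper()) > 1:
            found.append(ch)
    return found


expanding = expanding_uppercase_chars()
# Crude test for "Greek": the character name contains the word GREEK.
greek = [ch for ch in expanding if "GREEK" in unicodedata.name(ch, "")]
print(len(expanding), "characters expand;", len(greek), "have GREEK in their name")
```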

Unfortunately language and orthography are messy and inconsistent and
illogical and sometimes just silly. But handling orthography properly
involves dealing with these complex rules and subrules and exceptions
to rules rather than ignoring them.

Unicode gives us great power, but with great power comes great
responsibility and lots of niggling code. :-(

Fortunately only the Latin, Greek, Coptic, Cyrillic and Armenian
scripts have such a thing as casing and the Unicode people have
provided data files and algorithms that supposedly handle casing for
these languages acceptably.

From the Conformance requirements for Unicode at (C20):

<< An implementation that purports to support the default casing
operations of case conversion, case detection, and caseless mapping
shall do so in accordance with the definitions and specifications in
Section 3.13, Default Case Operations. >>

This involves even more messy fussing about with context specification
for casing and with what values should be returned from a case
querying function, e.g. "A2" is true as both uppercase and titlecase
but not as lowercase. "3" is true as lowercase, uppercase and titlecase.
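
The "A2" and "3" results follow if a string is taken to be in a given
case exactly when converting it to that case leaves it unchanged. Here
is a minimal Python sketch of that reading, ignoring normalization. The
function names is_lower, is_upper and is_title are hypothetical, and
Python's str.title() is used only as a stand-in for the Unicode
titlecase conversion, whose word-boundary rules are not identical.

```python
def is_lower(s):
    """True when lowercasing s leaves it unchanged."""
    return s.lower() == s


def is_upper(s):
    """True when uppercasing s leaves it unchanged."""
    return s.upper() == s


def is_title(s):
    """True when titlecasing s leaves it unchanged (str.title() as a stand-in)."""
    return s.title() == s


for sample in ("A2", "3"):
    print(sample, is_lower(sample), is_upper(sample), is_title(sample))
# A2 False True True
# 3 True True True
```

Python's built-in "3".isupper() and "3".islower() both return False,
because those methods also require at least one cased character.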

Python or any application or language either does or doesn't conform.

I doubt that there is currently any application that can yet honestly
purport to support Unicode default casing operations of case
conversion, case detection and caseless mapping.

Jim Allan
