[Python-Dev] Unicode mapping tables

Tim Peters tim_one@email.msn.com
Wed, 1 Mar 2000 01:50:44 -0500


[M.-A. Lemburg]
> ...
> Currently, mapping tables map characters to Unicode characters
> and vice-versa. Now the .translate method will use a different
> kind of table: mapping integer ordinals to integer ordinals.

You mean that if I want to map u"a" to u"A", I have to set up some sort of
dict mapping ord(u"a") to ord(u"A")?  I simply couldn't follow this.
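
For what it's worth, here is a minimal sketch of what an ordinal-to-ordinal
table looks like, assuming the str.translate of present-day Python 3 (which
postdates this thread and may differ from the proposal being discussed).
The names "table" and "result" are made up for illustration.

```python
# Hypothetical translation table: both keys and values are integer
# ordinals (code points), not one-character strings.
table = {ord("a"): ord("A")}

# Characters whose ordinal is not a key in the table pass through unchanged.
result = "banana".translate(table)
print(result)  # bAnAnA
```

So yes, under that reading the table for u"a" -> u"A" is {97: 65}.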

> Question: What is more efficient: having lots of integers
> in a dictionary or lots of characters?

My bet is "lots of integers", to reduce both space use and comparison time.

> ...
> Something else that changed is the way .capitalize() works. The
> Unicode version uses the Unicode algorithm for it (see TechRep. 13
> on the www.unicode.org site).

#13 is "Unicode Newline Guidelines".  I assume you meant #21 ("Case
Mappings").

> Here's the new doc string:
>
> S.capitalize() -> unicode
>
> Return a capitalized version of S, i.e. words start with title case
> characters, all remaining cased characters have lower case.
>
> Note that *all* characters are touched, not just the first one.
> The change was needed to get it in sync with the .iscapitalized()
> method which is based on the Unicode algorithm too.
>
> Should this change be propagated to the string implementation?

Unicode makes distinctions among "upper case", "lower case" and "title
case", and you're trying to get away with a single "capitalize" function.
Java has separate toLowerCase, toUpperCase and toTitleCase methods, and
that's the way to do it.  Whatever you do, leave .capitalize alone for 8-bit
strings -- there's no reason to break code that currently works.
"capitalize" seems a terrible choice of name for a titlecase method anyway,
because of the baggage it carries from 8-bit strings.  Since this stuff is
complicated, I say it would be much better to use the same names for these
things as the Unicode and Java folk do:  there's excellent documentation
elsewhere for all this stuff, and it's Bad to make users mentally translate
unique Python terminology to make sense of the official docs.

So my vote is:  leave capitalize the hell alone <wink>.  Do not implement
capitalize for Unicode strings.  Introduce a new titlecase method for
Unicode strings.  Add a new titlecase method to 8-bit strings too.  Unicode
strings should also have methods to get at uppercase and lowercase (as
Unicode defines those).
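
To illustrate why the three case forms deserve separate methods, here is a
small sketch using the string methods of present-day Python 3, where the
titlecase operation is spelled str.title() (an assumption about a later
Python, not something settled in this thread).  The variable names are
hypothetical.

```python
text = "hello wORLD"

# capitalize touches only the first character and lowercases the rest.
capitalized = text.capitalize()   # "Hello world"

# title starts every word with a titlecase character.
titled = text.title()             # "Hello World"

# U+01C6 (LATIN SMALL LETTER DZ WITH CARON) has three distinct case forms,
# so "uppercase" and "titlecase" are genuinely different operations.
dz_small = "\u01c6"
dz_upper = dz_small.upper()       # U+01C4, all-capitals digraph
dz_title = dz_small.title()       # U+01C5, capital D with small z
dz_lower = dz_title.lower()       # back to U+01C6

print(capitalized, "|", titled)
print([hex(ord(c)) for c in (dz_small, dz_upper, dz_title, dz_lower)])
```

For plain ASCII the uppercase and titlecase forms coincide, which is why
the difference only shows up with characters like the digraph above.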