[Python-Dev] Unicode mapping tables
Wed, 1 Mar 2000 01:50:44 -0500
> Currently, mapping tables map characters to Unicode characters
> and vice-versa. Now the .translate method will use a different
> kind of table: mapping integer ordinals to integer ordinals.
You mean that if I want to map u"a" to u"A", I have to set up some sort of
dict mapping ord(u"a") to ord(u"A")? I simply couldn't follow this.
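For concreteness, here is a minimal sketch of what an ordinal-to-ordinal table looks like. It assumes the behaviour of str.translate in current Python 3, which may differ from the implementation under discussion in this thread; the names `table` and `upcase_a` are made up for illustration.

```python
# Hypothetical table: both keys and values are integer ordinals,
# not one-character strings.
table = {ord("a"): ord("A")}


def upcase_a(text):
    """Replace every 'a' with 'A' via an ordinal-to-ordinal table."""
    return text.translate(table)


print(upcase_a("banana"))  # characters missing from the table pass through unchanged
```

So yes: under that reading, mapping "a" to "A" means building a dict keyed by ord("a").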
> Question: What is more efficient: having lots of integers
> in a dictionary or lots of characters?
My bet is "lots of integers", to reduce both space use and comparison time.
> Something else that changed is the way .capitalize() works. The
> Unicode version uses the Unicode algorithm for it (see TechRep. 13
> on the www.unicode.org site).
#13 is "Unicode Newline Guidelines". I assume you meant #21 ("Case Mappings").
> Here's the new doc string:
> S.capitalize() -> unicode
> Return a capitalized version of S, i.e. words start with title case
> characters, all remaining cased characters have lower case.
> Note that *all* characters are touched, not just the first one.
> The change was needed to get it in sync with the .iscapitalized()
> method which is based on the Unicode algorithm too.
> Should this change be propagated to the string implementation?
Unicode makes distinctions among "upper case", "lower case" and "title
case", and you're trying to get away with a single "capitalize" function.
Java has separate toLowerCase, toUpperCase and toTitleCase methods, and
that's the way to do it. Whatever you do, leave .capitalize alone for 8-bit
strings -- there's no reason to break code that currently works.
"capitalize" seems a terrible choice of name for a titlecase method anyway,
because of the baggage it carries from 8-bit strings. Since this stuff is
complicated, I say it would be much better to use the same names for these
things as the Unicode and Java folk do: there's excellent documentation
elsewhere for all this stuff, and it's Bad to make users mentally translate
unique Python terminology to make sense of the official docs.
So my vote is: leave capitalize the hell alone <wink>. Do not implement
capitalize for Unicode strings. Introduce a new titlecase method for
Unicode strings. Add a new titlecase method to 8-bit strings too. Unicode
strings should also have methods to get at uppercase and lowercase (as
Unicode defines those).
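
To illustrate why the three case forms deserve separate methods, here is a small sketch using the str methods of current Python 3 (an assumption; none of this existed in that form when this was written). The helper `case_forms` is hypothetical. U+01C6 is a digraph letter whose uppercase form (U+01C4) and titlecase form (U+01C5) are different characters.

```python
def case_forms(text):
    """Return the lower-, upper- and titlecase forms of text."""
    return {
        "lower": text.lower(),
        "upper": text.upper(),
        "title": text.title(),
    }


# U+01C6 LATIN SMALL LETTER DZ WITH CARON: upper and title forms differ.
forms = case_forms("\u01c6")
print([hex(ord(ch)) for ch in forms.values()])

# On ordinary ASCII text, capitalize and title already disagree:
words = "hello wORLD"
print(words.capitalize())  # only the first character is raised
print(words.title())       # every word starts with a titlecase character
```

A single "capitalize" cannot express both behaviours, which is the argument for keeping it as-is and adding a separately named titlecase method.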