[Python-Dev] Unicode mapping tables

M.-A. Lemburg mal@lemburg.com
Wed, 01 Mar 2000 09:38:52 +0100

Tim Peters wrote:
> [M.-A. Lemburg]
> > ...
> > Currently, mapping tables map characters to Unicode characters
> > and vice-versa. Now the .translate method will use a different
> > kind of table: mapping integer ordinals to integer ordinals.
> You mean that if I want to map u"a" to u"A", I have to set up some sort of
> dict mapping ord(u"a") to ord(u"A")?  I simply couldn't follow this.

I meant:

  'a': u'A' vs. ord('a'): ord(u'A')

The latter wins ;-) The reasoning behind the first form was that it
allows character sequences to be handled by the same mapping
algorithm. I decided to leave that technique to some future
implementation, since mapping integers has the nice side effect of
also allowing sequences to be used as mapping tables, which gives
some speedup at the cost of memory consumption.
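
As a rough illustration of the ordinal-to-ordinal idea, here is a
minimal sketch. It assumes the str.translate() method of present-day
Python 3, which may differ in detail from the implementation discussed
in this thread; the variable names are made up for the example.

```python
# Sketch only: uses today's Python 3 str.translate(), which is an
# assumption and not necessarily the draft API discussed in this thread.

# Table keyed by integer ordinals: ord("a") -> ord("A").
dict_table = {ord("a"): ord("A"), ord("b"): ord("B")}
dict_result = "abcab".translate(dict_table)

# Because the keys are integers, a plain sequence indexed by ordinal
# can serve as the table as well. This one is an identity table for
# the first 128 code points with two entries overridden.
list_table = list(range(128))
list_table[ord("a")] = ord("A")
list_table[ord("b")] = ord("B")
list_result = "abcab".translate(list_table)

print(dict_result, list_result)
```

The list costs one slot per code point it covers, while the dict only
stores the entries that actually change, which is the memory versus
speed trade-off mentioned above.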

BTW, there are now three different ways to do char translations:

1. char -> unicode  (char mapping codec's decode)
2. unicode -> char  (char mapping codec's encode)
3. unicode -> unicode (unicode's .translate() method)
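
The three directions listed above can be sketched as follows. This
assumes present-day Python 3, and cp1252 merely stands in for "a char
mapping codec"; the variable names are illustrative only.

```python
# Sketch only: cp1252 is used as an example of a table-driven 8-bit
# codec; the codec choice and all names here are assumptions.

raw = b"\x80 100"  # 0x80 is the euro sign in cp1252

# 1. char -> unicode: the codec's decode step
text = raw.decode("cp1252")

# 2. unicode -> char: the codec's encode step
round_trip = text.encode("cp1252")

# 3. unicode -> unicode: .translate() with an ordinal-to-ordinal table
swapped = text.translate({0x20AC: ord("$")})

print(repr(text), repr(round_trip), repr(swapped))
```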

> > Question: What is more efficient: having lots of integers
> > in a dictionary or lots of characters ?
> My bet is "lots of integers", to reduce both space use and comparison time.

Right. That's what I found too... it's "lots of integers" now :-)

> > ...
> > Something else that changed is the way .capitalize() works. The
> > Unicode version uses the Unicode algorithm for it (see TechRep. 13
> > on the www.unicode.org site).
> #13 is "Unicode Newline Guidelines".  I assume you meant #21 ("Case
> Mappings").

Dang. You're right. Here's the URL in case someone
wants to join in:


> > Here's the new doc string:
> >
> > S.capitalize() -> unicode
> >
> > Return a capitalized version of S, i.e. words start with title case
> > characters, all remaining cased characters have lower case.
> >
> > Note that *all* characters are touched, not just the first one.
> > The change was needed to get it in sync with the .iscapitalized()
> > method which is based on the Unicode algorithm too.
> >
> > Should this change be propagated to the string implementation ?
> Unicode makes distinctions among "upper case", "lower case" and "title
> case", and you're trying to get away with a single "capitalize" function.
> Java has separate toLowerCase, toUpperCase and toTitleCase methods, and
> that's the way to do it.

The Unicode implementation has the corresponding:

.upper(), .lower() and .capitalize()

They work just like .toUpperCase, .toLowerCase and .toTitleCase,
respectively (well, at least they should ;).

> Whatever you do, leave .capitalize alone for 8-bit
> strings -- there's no reason to break code that currently works.
> "capitalize" seems a terrible choice of name for a titlecase method anyway,
> because of its baggage connotations from 8-bit strings.  Since this stuff is
> complicated, I say it would be much better to use the same names for these
> things as the Unicode and Java folk do:  there's excellent documentation
> elsewhere for all this stuff, and it's Bad to make users mentally translate
> unique Python terminology to make sense of the official docs.

Hmm, that's an argument, but it breaks the current method
naming scheme of all-lowercase letters. Perhaps I should simply
provide a new method for .toTitleCase(), e.g. .title(), and
leave the previous definition of .capitalize() intact...

> So my vote is:  leave capitalize the hell alone <wink>.  Do not implement
> capitalize for Unicode strings.  Introduce a new titlecase method for
> Unicode strings.  Add a new titlecase method to 8-bit strings too.  Unicode
> strings should also have methods to get at uppercase and lowercase (as
> Unicode defines those).

...looks like we're more or less on the same wavelength here ;-)

Here's what I'll do:

* implement .capitalize() in the traditional way for Unicode
  objects (simply convert the first char to uppercase)
* implement u.title() to mean the same as Java's toTitleCase()
* don't implement s.title(): the reasoning here is that it would
  confuse the user when she gets different return values for
  the same string (titlecase chars usually live in higher Unicode
  code ranges not reachable in Latin-1)
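
To make the distinction between the methods concrete, here is a small
sketch. It assumes the str methods of present-day Python 3, which may
not match what ends up being implemented here; U+01C6 is picked only
as an example of a character with a separate titlecase form.

```python
# Sketch only: behaviour of today's Python 3 str methods, assumed
# here for illustration.

s = "hello world"
capitalized = s.capitalize()  # first character cased up, rest lowercased
titled = s.title()            # every word starts with a titlecase character

dz = "\u01c6"          # LATIN SMALL LETTER DZ WITH CARON
dz_upper = dz.upper()  # U+01C4, the uppercase form
dz_title = dz.title()  # U+01C5, a distinct titlecase form

# The titlecase character lies outside Latin-1, so an 8-bit string
# could not return it.
try:
    dz_title.encode("latin-1")
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False

print(capitalized, titled, fits_latin1)
```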

Thanks for the feedback,
Marc-Andre Lemburg
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/