Unicode mapping tables

I am just coding the translate method for Unicode objects and have come across a design question that may have some importance with respect to speed and memory allocation size.

Currently, mapping tables map characters to Unicode characters and vice versa. Now the .translate method will use a different kind of table: mapping integer ordinals to integer ordinals.

Question: What is more efficient: having lots of integers in a dictionary or lots of characters?

Another aspect of this question is: the translate method will be able to handle sequences *and* mappings because it looks up integers, which can be interpreted as indexes as well as dictionary keys. The character mapping codec uses characters as keys and thus only allows dictionaries to be used (the reason is that in some future version it should be possible to map single characters to multiple characters or even combinations to new combinations).

BTW, I dropped the deletions argument from the translate method: it is not needed, since a mapping to None will have the same effect. Note that not specifying a mapping for a character causes it to be copied as-is. This has the nice side effect of greatly reducing the mapping table's size.

Note that there will be no .maketrans() method. The same functionality can easily be coded in Python if needed and doesn't fit into the OO-style nature of string and Unicode objects anymore.

--

Something else that changed is the way .capitalize() works. The Unicode version uses the Unicode algorithm for it (see TechRep. 13 on the www.unicode.org site).

Here's the new doc string:

    S.capitalize() -> unicode

    Return a capitalized version of S, i.e. words start with title case characters, all remaining cased characters have lower case.

Note that *all* characters are touched, not just the first one. The change was needed to get it in sync with the .iscapitalized() method, which is based on the Unicode algorithm too.

Should this change be propagated to the string implementation?
-- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
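A minimal sketch of the table format described above, written against Python 3's str.translate, which takes a table keyed by integer ordinals. The helper name make_table is hypothetical: it stands in for the .maketrans() functionality that the message says can be coded in Python, and is not an API from this thread.

```python
def make_table(frm, to, delete=""):
    """Build an ordinal -> ordinal table (hypothetical helper, not from the thread)."""
    if len(frm) != len(to):
        raise ValueError("frm and to must have the same length")
    table = {ord(a): ord(b) for a, b in zip(frm, to)}
    # Mapping an ordinal to None deletes the character, so no separate
    # deletions argument is needed.
    for ch in delete:
        table[ord(ch)] = None
    return table


table = make_table("abc", "xyz", delete="-")
# Ordinals missing from the table ("d" here) are copied through unchanged.
result = "a-b-c-d".translate(table)
print(result)
```

Only the characters that actually change need an entry, which is what keeps the table small.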

"M.-A. Lemburg" wrote:
Turns out that integers are more flexible after some tests... I'll stick with them :-)

Perhaps we could bump the small int optimization limit to 256 (it is currently set to 100) ?! This would be ideal for these tables, since then at least most of the keys would be shared between tables.

-- Marc-Andre Lemburg

[M.-A. Lemburg]
You mean that if I want to map u"a" to u"A", I have to set up some sort of dict mapping ord(u"a") to ord(u"A")? I simply couldn't follow this.
Question: What is more efficient: having lots of integers in a dictionary or lots of characters?
My bet is "lots of integers", to reduce both space use and comparison time.
#13 is "Unicode Newline Guidelines". I assume you meant #21 ("Case Mappings").
Unicode makes distinctions among "upper case", "lower case" and "title case", and you're trying to get away with a single "capitalize" function. Java has separate toLowerCase, toUpperCase and toTitleCase methods, and that's the way to do it.

Whatever you do, leave .capitalize alone for 8-bit strings -- there's no reason to break code that currently works. "capitalize" seems a terrible choice of name for a titlecase method anyway, because of its baggage connotations from 8-bit strings. Since this stuff is complicated, I say it would be much better to use the same names for these things as the Unicode and Java folk do: there's excellent documentation elsewhere for all this stuff, and it's Bad to make users mentally translate unique Python terminology to make sense of the official docs.

So my vote is: leave capitalize the hell alone <wink>.

* Do not implement capitalize for Unicode strings.
* Introduce a new titlecase method for Unicode strings.
* Add a new titlecase method to 8-bit strings too.
* Unicode strings should also have methods to get at uppercase and lowercase (as Unicode defines those).
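To illustrate why three separate case mappings are needed, here is a small sketch using Python 3's str methods (which postdate this thread). The digraph character is an example chosen here, not one taken from the discussion: it is one of the few characters whose title case differs from its upper case.

```python
small_dz = "\u01c6"        # LATIN SMALL LETTER DZ WITH CARON

upper = small_dz.upper()   # all-capitals form of the digraph
title = small_dz.title()   # capital D followed by small z, as one character
lower = upper.lower()      # back to the small form

# Print the code points so the three distinct characters are easy to compare.
for name, value in (("upper", upper), ("title", title), ("lower", lower)):
    print(name, "U+%04X" % ord(value))
```

A single "capitalize" operation cannot express both the upper and the title result for such a character.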

Tim Peters wrote:
I meant:

    'a': u'A'

vs.

    ord('a'): ord(u'A')

The latter wins ;-)

Reasoning for the first was that it allows character sequences to be handled by the same mapping algorithm. I decided to leave those techniques to some future implementation, since mapping integers has the nice side effect of also allowing sequences to be used as mapping tables... resulting in some speedup at the cost of memory consumption.

BTW, there are now three different ways to do char translations:

1. char -> unicode (char mapping codec's decode)
2. unicode -> char (char mapping codec's encode)
3. unicode -> unicode (unicode's .translate() method)
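The point about sequences doubling as mapping tables can be sketched as follows, again assuming Python 3's str.translate. The variable names are illustrative only. A list is indexed by ordinal, so it needs one slot per ordinal up to the highest one covered, while the dictionary stores only the entries that change.

```python
# A list indexed by ordinal: position i holds the replacement ordinal for chr(i).
seq_table = list(range(128))
seq_table[ord("a")] = ord("A")

# The same mapping as a sparse dictionary.
dict_table = {ord("a"): ord("A")}

text = "banana"
from_seq = text.translate(seq_table)
from_dict = text.translate(dict_table)

# Ordinals past the end of the list raise IndexError during lookup, which
# translate treats like a missing key: the character is left untouched.
beyond_table = "caf\xe9".translate(seq_table)

print(from_seq, from_dict, beyond_table)
```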
Right. That's what I found too... it's "lots of integers" now :-)
Dang. You're right. Here's the URL in case someone wants to join in: http://www.unicode.org/unicode/reports/tr21/tr21-2.html
The Unicode implementation has the corresponding: .upper(), .lower() and .capitalize() They work just like .toUpperCase, .toLowerCase, .toTitleCase resp. (well at least they should ;).
Hmm, that's an argument, but it breaks the current method naming scheme of all-lowercase letters. Perhaps I should simply provide a new method for .toTitleCase(), e.g. .title(), and leave the previous definition of .capitalize() intact...
...looks like you're more or less on the same wavelength here ;-) Here's what I'll do:

* implement .capitalize() in the traditional way for Unicode objects (simply convert the first char to uppercase)
* implement u.title() to mean the same as Java's toTitleCase()
* don't implement s.title(): the reasoning here is that it would confuse the user when she gets different return values for the same string (titlecase chars usually live in higher Unicode code ranges not reachable in Latin-1)

Thanks for the feedback,
-- Marc-Andre Lemburg

Huh? For ASCII at least, titlecase seems to map to ASCII; in your current implementation, only two Latin-1 characters (u'\265' and u'\377', I have no easy way to show them in Latin-1) map outside the Latin-1 range.

Anyway, I would suggest adding a title() call to 8-bit strings as well; then we can do away with string.capwords(), which does something similar but different, mostly by accident.

--Guido van Rossum (home page: http://www.python.org/~guido/)
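Both observations can be checked with a short sketch in Python 3, where str is Unicode and the string module still provides capwords(). The sample phrase is made up for illustration, and today's behaviour is not necessarily that of the implementation under discussion.

```python
import string

# The two Latin-1 characters mentioned above, written as hex escapes
# (octal \265 is 0xB5, octal \377 is 0xFF).
micro_title = "\xb5".title()
y_diaeresis_title = "\xff".title()
print("U+%04X" % ord(micro_title), "U+%04X" % ord(y_diaeresis_title))

# title() restarts a word after every non-letter and keeps the spacing;
# capwords() splits on whitespace, capitalizes each piece, and rejoins
# the pieces with single spaces.
phrase = "they're  here"   # two spaces between the words
titled = phrase.title()
capworded = string.capwords(phrase)
print(repr(titled), repr(capworded))
```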

Guido van Rossum wrote:
You're right, sorry for the confusion. I was thinking of other encodings like e.g. cp437 which have corresponding characters in the higher Unicode ranges.
Ok, I'll do it this way then: s.title() will use C's toupper() and tolower() for case mapping and u.title() the Unicode routines.

This will be in sync with the rest of the 8-bit string world (which is locale aware on many platforms AFAIK), even though it might not return the same string as the corresponding u.title() call.

u.capwords() will be disabled in the Unicode implementation... it wasn't even implemented for the string implementation, so there's no breakage ;-)

-- Marc-Andre Lemburg

[M.-A. Lemburg]
Given .title(), is .capitalize() of use for Unicode strings? Or is it just a temptation to do something senseless in the Unicode world? If it doesn't make sense, leave it out (this *seems* like compulsion <wink> to implement all current string methods in *some* way for Unicode, whether or not they make sense).

[Tim]
The intention of this is to make code that does something using strings do exactly the same thing if those strings happen to be Unicode strings with the same values.

The capitalize method returns self[0].upper() + self[1:] -- that may not make sense for e.g. Japanese, but it certainly does for Russian or Greek.

It also does this in JPython.

--Guido van Rossum
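The formula quoted above can be written out as a small function. This is only a sketch: capitalize_first is a hypothetical name, the sample words are chosen here, and it uses s[:1] instead of self[0] so that the empty string does not raise IndexError.

```python
def capitalize_first(s):
    """Uppercase only the first character (hypothetical helper)."""
    # Slicing with s[:1] rather than s[0] keeps the empty string working.
    return s[:1].upper() + s[1:]


russian = capitalize_first("привет")
greek = capitalize_first("αθήνα")
japanese = capitalize_first("東京")   # no case distinction, so unchanged
print(russian, greek, japanese)
```

Note that the built-in str.capitalize() in Python 3 also lowercases the rest of the string, so it is not identical to this formula: "hELLO".capitalize() gives "Hello", while capitalize_first("hELLO") gives "HELLO".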

Tim Peters wrote:
.capitalize() only touches the first char of the string - not sure whether it makes sense in both worlds ;-)

Anyhow, the difference is there but subtle: string.capitalize() will use C's toupper() which is locale dependent, while unicode.capitalize() uses Unicode's toTitleCase() for the first character.

-- Marc-Andre Lemburg
participants (3)

* Guido van Rossum
* M.-A. Lemburg
* Tim Peters