[Python-Dev] Internationalization Toolkit

Wed, 10 Nov 1999 10:49:01 +0100

Tim Peters wrote:
> 
> > Marc-Andre Lemburg has a proposal for work that I'm asking him to do
> > (under pressure from HP who want Python i18n badly and are willing to
> > pay!): http://starship.skyport.net/~lemburg/unicode-proposal.txt
> 
> I can't make time for a close review now.  Just one thing that hit my eye
> early:
> 
>     Python should provide a built-in constructor for Unicode strings
>     which is available through __builtins__:
> 
>     u = unicode(<encoded Python string>[,<encoding name>=
>                                          <default encoding>])
> 
>     u = u'<utf-8 encoded Python string>'
> 
> Two points on the Unicode literals (u'abc'):
> 
> UTF-8 is a very nice encoding scheme, but is very hard for people "to do" by
> hand -- it breaks apart and rearranges bytes at the bit level, and
> everything other than 7-bit ASCII requires solid strings of "high-bit"
> characters.  This is painful for people to enter manually on both counts --
> and no common reference gives the UTF-8 encoding of glyphs directly.  So, as
> discussed earlier, we should follow Java's lead and also introduce a \u
> escape sequence:
> 
>     octet:           hexdigit hexdigit
>     unicodecode:     octet octet
>     unicode_escape:  "\\u" unicodecode
> 
> Inside a u'' string, I guess this should expand to the UTF-8 encoding of the
> Unicode character at the unicodecode code position.  For consistency, then,
> it should probably expand the same way inside "regular strings" too.  Unlike
> Java does, I'd rather not give it a meaning outside string literals.

It would be more in line with the standard to use the Unicode ordinal
(instead of interpreting the number as a UTF-8 encoding), e.g. \u03C0 for
Pi. The codes are easy to look up in the standard's UnicodeData.txt file
or, for that matter, in the Unicode book.
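
To make the distinction concrete, here is a minimal sketch of the ordinal
versus the UTF-8 encoding for \u03C0. It assumes Python 3 string semantics,
where a \u escape in a string literal names a code point; that behaviour is
an assumption for illustration and is not taken from the proposal text.

```python
# Assumption: Python 3 str semantics, where "\u03C0" denotes one code point.
pi = "\u03C0"

# The Unicode ordinal, i.e. the number listed in UnicodeData.txt.
ordinal = ord(pi)

# The UTF-8 encoding of the same character: two "high-bit" bytes.
utf8_bytes = pi.encode("utf-8")

print(hex(ordinal))   # the value one would type after \u
print(utf8_bytes)     # the bytes one would have to enter by hand otherwise
```

The ordinal is what a reader finds in the reference tables; the byte pair is
derived from it and is not something people can be expected to compute by hand.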

> The other point is a nit:  The vast bulk of UTF-8 encodings encode
> characters in UCS-4 space outside of Unicode.  In good Pythonic fashion,
> those must either be explicitly outlawed, or explicitly defined.  I vote for
> outlawed, in the sense of detected error that raises an exception.  That
> leaves our future options open.

See my other post for a discussion of UCS4 vs. UTF16 vs. UCS2.
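
As an illustration of the "outlawed" option, the sketch below checks whether
a strict UTF-8 decoder accepts a byte sequence. It relies on Python 3's
built-in codec, which raises UnicodeDecodeError for sequences beyond the
Unicode range; the helper name is_accepted is hypothetical.

```python
def is_accepted(data):
    """Return True if data decodes as UTF-8 under strict error handling."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# Two-byte encoding of U+03C0, well inside Unicode.
inside_unicode = b"\xcf\x80"

# Five-byte form of the original UTF-8 scheme, which encodes a UCS-4
# value (0x200000) far outside Unicode.
beyond_unicode = b"\xf8\x88\x80\x80\x80"

print(is_accepted(inside_unicode))
print(is_accepted(beyond_unicode))
```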

Perhaps we could add a flag to Unicode objects stating whether all of their
characters fit into the lower 16 bits of UCS4 (UCS4 and UTF16 are the same
in most ranges).

This flag could then be used to choose optimized algorithms for scanning
the strings. Fredrik's implementation currently uses UCS2, BTW.
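
A rough sketch of what such a flag would record, assuming Python 3 strings
as a stand-in for the proposed Unicode objects. The helper names
fits_in_16_bits and utf16_code_units are hypothetical and only serve the
illustration.

```python
def fits_in_16_bits(text):
    """Return True if every character's ordinal fits in the lower 16 bits."""
    return all(ord(ch) <= 0xFFFF for ch in text)

def utf16_code_units(text):
    """Count the 16-bit units needed to store text as UTF-16."""
    return len(text.encode("utf-16-be")) // 2

bmp_only = "\u03C0r\u00B2"      # three characters, all below 0x10000
with_astral = "\U0001D11E"      # one character above 0xFFFF

for sample in (bmp_only, with_astral):
    print(len(sample), utf16_code_units(sample), fits_in_16_bits(sample))
```

When the flag is true, the character count and the number of 16-bit units
agree, so a scanner can index by fixed-width units; otherwise it has to
account for surrogate pairs.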

> BTW, is ord(unicode_char) defined?  And as what?  And does ord have an
> inverse in the Unicode world?  Both seem essential.

Good points.

How about 

  uniord(u[:1]) --> Unicode ordinal number (32-bit)

  unichr(i) --> Unicode object for character i (provided it is 32-bit);
                ValueError otherwise

They are inverses of each other, but note that Unicode also allows
private encodings, which will of course not necessarily survive the trip
across platforms or even from one PC to the next (see Andy Robinson's
interesting case study).
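
The pair could be sketched as follows on top of Python 3's ord() and chr().
This is only an illustration under stated assumptions: the upper bound
0x10FFFF is today's Unicode limit rather than the 32-bit range mentioned
above, and the argument checks are my own choice.

```python
# Assumption: cap ordinals at the current Unicode limit instead of 32 bits.
MAX_ORDINAL = 0x10FFFF

def uniord(u):
    """Return the Unicode ordinal of a one-character string."""
    if len(u) != 1:
        raise ValueError("expected exactly one character")
    return ord(u)

def unichr(i):
    """Return the one-character string for ordinal i, or raise ValueError."""
    if not 0 <= i <= MAX_ORDINAL:
        raise ValueError("ordinal out of range: %r" % (i,))
    return chr(i)

u = "\u03C0 and more"
print(uniord(u[:1]))            # ordinal of the first character
print(unichr(0x3C0) == u[:1])   # unichr inverts uniord
```

Private-use ordinals such as 0xE000 round-trip through both functions just
like any other value; what they mean is a matter of agreement between the
systems involved.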

I've uploaded a new version of the proposal (0.3) to the URL:

	http://starship.skyport.net/~lemburg/unicode-proposal.txt

Thanks,
-- 
Marc-Andre Lemburg