[Python-Dev] Internationalization Toolkit

M.-A. Lemburg mal@lemburg.com
Fri, 12 Nov 1999 12:15:15 +0100


"Da Silva, Mike" wrote:
> 
> Most of the ASCII string functions do indeed work for UTF-8.  I have made
> extensive use of this feature when writing translation logic to harmonize
> ASCII text (an SQL statement) with substitution parameters that must be
> converted from IBM EBCDIC code pages (5035, 1027) into UTF-8.  Since UTF-8 is
> a superset of ASCII, this all works fine.
> 
> Some of the character classification functions etc. can be flaky when used
> with UTF-8 characters outside the ASCII range, but simple string operations
> work fine.

That's why there's the <defencbuf> buffer which holds the UTF-8
encoded value...
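
A minimal sketch of the point about ASCII string functions, written in
present-day Python with bytes objects standing in for C char buffers. The
SQL template and the parameter value are invented for illustration. It
shows that byte-level search and substitution on ASCII delimiters leave
multi-byte UTF-8 sequences intact, while an ASCII-only classification
function gives a misleading answer:

```python
# Hypothetical SQL template and parameter, made up for this example.
template = "SELECT * FROM t WHERE name = '%s'".encode("utf-8")
param = "Müller".encode("utf-8")      # "ü" becomes the two bytes C3 BC

# Byte-oriented operations, as an ASCII-only C routine would perform them.
statement = template.replace(b"%s", param)
quote_pos = statement.find(b"'")

# Every byte of a multi-byte UTF-8 sequence has its high bit set, so an
# ASCII delimiter can never match in the middle of a character.
non_ascii = [b for b in param if b >= 0x80]
decoded = statement.decode("utf-8")

# Classification is where it gets flaky: bytes.isalpha() only knows ASCII.
bytes_say_alpha = param.isalpha()      # False, because of C3 BC
text_says_alpha = "Müller".isalpha()   # True
```

The substitution result decodes cleanly, but the two isalpha() answers
disagree, which is the kind of flakiness described in the quoted text.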
 
> As I see it, the relative pros and cons of UTF-8 versus UTF-16 for use as an
> internal string representation are:
> 
> 1.      UTF-8 allows all characters to be displayed (in some form or other)
> on the user's machine, with or without native fonts installed.  Naturally
> anything outside the ASCII range will be garbage, but it is an immense
> debugging aid when working with character encodings to be able to touch and
> feel something recognizable.  Trying to decode a block of raw UTF-16 is a
> pain.

True.
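
As a rough illustration of the debugging argument (a sketch only; the
sample text is made up, and Latin-1 is assumed as a stand-in for whatever
8-bit view a terminal or hex dump would give):

```python
# Invented sample text with one non-ASCII character.
text = "SELECT name FROM kunden WHERE ort = 'Köln'"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")

# Interpreting the raw bytes as Latin-1 mimics a naive 8-bit dump.
utf8_view = utf8.decode("latin-1")
utf16_view = utf16.decode("latin-1")

# In the UTF-8 view the ASCII part is readable and only "ö" is garbage.
# In the UTF-16 view every character is followed by a NUL byte.
ascii_still_readable = "SELECT name FROM kunden" in utf8_view
utf16_readable = "SELECT" in utf16_view
```

Only the two bytes of the "ö" come out wrong in the UTF-8 view; in the
UTF-16 view not even the keyword SELECT survives as a contiguous string.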

> 2.      UTF-8 works with most existing string manipulation libraries quite
> happily.  It is also portable (a char is always 8 bits, regardless of
> platform; wchar_t varies between 16 and 32 bits depending on the underlying
> operating system, although unsigned short does seem to work across
> platforms, in my experience).

You mean with the compiler applying the needed 16->32 bit extension ?
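
For reference, the widths in question can be checked on a given platform
with the ctypes module (a small sketch; it assumes 8-bit bytes, which
ctypes itself cannot verify):

```python
import ctypes

# ctypes.sizeof() reports sizes in bytes; 8 bits per byte is assumed here.
char_bits = ctypes.sizeof(ctypes.c_char) * 8
ushort_bits = ctypes.sizeof(ctypes.c_ushort) * 8
wchar_bits = ctypes.sizeof(ctypes.c_wchar) * 8   # commonly 16 or 32

print(f"char: {char_bits}, unsigned short: {ushort_bits}, "
      f"wchar_t: {wchar_bits}")
```

On mainstream platforms wchar_t typically comes out as 16 bits on Windows
and 32 bits on Linux, while unsigned short is 16 bits on both.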

> 3.      UTF-16 has some advantages in providing fixed width characters and
> (ignoring surrogate pairs etc.) a modeless encoding space.  This is an
> advantage for fast string operations, especially on CPUs that have
> efficient operations for handling 16-bit data.

Right, and this is a major argument for using 16-bit encodings without
state internally.
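
A small sketch of what "fixed width, ignoring surrogate pairs" means in
practice. The helper utf16_units() is hypothetical and written just for
this example; it splits a string into its 16-bit code units:

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]

# Inside the Basic Multilingual Plane: one code unit per character,
# so character i sits at unit i and indexing is O(1).
bmp = "Grüße"
bmp_units = utf16_units(bmp)

# Outside the BMP: U+1D11E (MUSICAL SYMBOL G CLEF) needs a surrogate pair,
# so the fixed-width assumption no longer holds.
astral = "a\U0001D11Eb"
astral_units = utf16_units(astral)   # 0x61, 0xD834, 0xDD1E, 0x62
```

Three characters become four code units in the second case, which is the
exception the quoted text sets aside.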

> 4.      UTF-16 would directly support a tightly coupled character properties
> engine, which would enable Unicode compliant case folding and character
> decomposition to be performed without an intermediate UTF-8 <----> UTF-16
> translation step.

Could you elaborate on this one ? It is one of the open issues
in the proposal.
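
To make the terms concrete, here is a sketch using Python's present-day
unicodedata module and str methods as a stand-in for a character
properties engine. This is not the engine discussed in the quoted text,
only an illustration of case folding, decomposition and property lookup,
and of why UTF-8 bytes need a decoding step first:

```python
import unicodedata

# Case folding on decoded text: "ß" folds to "ss".
folded = "Straße".casefold()

# Canonical decomposition: "é" becomes "e" plus U+0301 COMBINING ACUTE.
decomposed = unicodedata.normalize("NFD", "é")

# Property lookup: general category of "é" is "Ll" (lowercase letter).
category = unicodedata.category("é")

# On raw UTF-8 bytes the ASCII-only routines do nothing useful for
# non-ASCII characters, so a decode/encode round trip is needed.
raw = "É".encode("utf-8")
lowered_bytes = raw.lower()                                  # unchanged
lowered_text = raw.decode("utf-8").lower().encode("utf-8")   # now "é"
```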

> 5.      UTF-16 requires string operations that do not make assumptions about
> nulls - this means re-implementing most of the C runtime functions to work
> with unsigned shorts.

AFAIK, the RE engines in Python are 8-bit clean...
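
A sketch of the null problem, with a hypothetical c_strlen() helper that
imitates a NUL-terminated C string scan on a Python bytes object. The
last lines use the present-day re module on bytes, which is assumed here
to behave like the 8-bit clean engines mentioned above:

```python
import re

def c_strlen(buf):
    """Imitate C strlen(): count bytes up to the first NUL."""
    return buf.index(b"\x00")

# The same two characters, each followed by its encoding's terminator.
utf8_buf = "AB".encode("utf-8") + b"\x00"
utf16_buf = "AB".encode("utf-16-le") + b"\x00\x00"

utf8_len = c_strlen(utf8_buf)     # 2: the whole string
utf16_len = c_strlen(utf16_buf)   # 1: stops at the high byte of "A"

# A byte-oriented regex engine that is 8-bit clean copes with the
# embedded NULs without trouble.
match = re.search(b"B\x00", utf16_buf)
match_pos = match.start()
```

The byte-oriented scan truncates the UTF-16 buffer after one byte, which
is why routines such as strlen() would need 16-bit counterparts.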

BTW, wouldn't it be possible to take pcre and have it
use Py_Unicode instead of char ? [Of course, there would have to
be some extensions for character classes etc.]

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/