[Python-Dev] Internationalization Toolkit
M.-A. Lemburg
mal@lemburg.com
Fri, 12 Nov 1999 12:15:15 +0100
"Da Silva, Mike" wrote:
>
> Most of the ASCII string functions do indeed work for UTF-8. I have made
> extensive use of this feature when writing translation logic to harmonize
> ASCII text (an SQL statement) with substitution parameters that must be
> converted from IBM EBCDIC code pages (5035, 1027) into UTF-8. Since UTF-8 is
> a superset of ASCII, this all works fine.
>
> Some of the character classification functions, etc., can be flaky when used
> with UTF-8 characters outside the ASCII range, but simple string operations
> work fine.
That's why there's the <defencbuf> buffer which holds the UTF-8
encoded value...
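
To illustrate the point about ASCII-level operations on UTF-8 data, here is
a minimal Python sketch. The SQL fragment and the substituted value are
hypothetical, chosen only for illustration. It shows that byte-oriented
search and replace are safe on UTF-8, while byte-oriented case mapping and
classification only understand the ASCII range.

```python
# Assumption: a hypothetical SQL fragment containing one non-ASCII value.
sql = "SELECT * FROM t WHERE name = 'Müller' AND city = ?"
data = sql.encode("utf-8")

# Byte-oriented replace/split are safe: every byte of a multi-byte UTF-8
# sequence is >= 0x80, so it can never be confused with an ASCII byte.
substituted = data.replace(b"?", "'Köln'".encode("utf-8"))
tokens = data.split(b" ")
roundtrip = substituted.decode("utf-8")

# Byte-oriented case mapping and classification only know about ASCII.
upper_bytes = "müller".encode("utf-8").upper().decode("utf-8")  # "MüLLER"
upper_text = "müller".upper()                                   # "MÜLLER"
is_alpha_bytes = "é".encode("utf-8").isalpha()                  # False
is_alpha_text = "é".isalpha()                                   # True
```

The byte-level results are still valid UTF-8; they are just not what a
Unicode-aware routine would produce for characters outside ASCII.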
> As I see it, the relative pros and cons of UTF-8 versus UTF-16 for use as an
> internal string representation are:
>
> 1. UTF-8 allows all characters to be displayed (in some form or other)
> on the user's machine, with or without native fonts installed. Naturally
> anything outside the ASCII range will be garbage, but it is an immense
> debugging aid when working with character encodings to be able to touch and
> feel something recognizable. Trying to decode a block of raw UTF-16 is a
> pain.
True.
> 2. UTF-8 works with most existing string manipulation libraries quite
> happily. It is also portable: a char is always 8 bits, regardless of
> platform, whereas wchar_t varies between 16 and 32 bits depending on the
> underlying operating system (although unsigned short does seem to work
> across platforms, in my experience).
You mean with the compiler applying the needed 16->32 bit extension ?
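
For reference, the width difference under discussion can be inspected from
Python. This is only a sketch using ctypes; the values it reports depend on
the platform and C compiler the interpreter was built with, so none are
assumed here.

```python
import ctypes
import struct

# Size of the C wchar_t type on this platform (commonly 2 or 4 bytes).
wchar_size = ctypes.sizeof(ctypes.c_wchar)

# Size of the C unsigned short type on this platform.
ushort_size = ctypes.sizeof(ctypes.c_ushort)

# An explicit 16-bit little-endian code unit is 2 bytes everywhere,
# independent of the native C types.
code_unit = struct.pack("<H", 0x20AC)  # U+20AC EURO SIGN

print("wchar_t:", wchar_size, "bytes; unsigned short:", ushort_size, "bytes")
```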
> 3. UTF-16 has some advantages in providing fixed-width characters and
> (ignoring surrogate pairs, etc.) a modeless encoding space. This is an
> advantage for fast string operations, especially on CPUs that have
> efficient operations for handling 16-bit data.
Right, and this is a major argument for using stateless 16-bit
encodings internally.
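
A small Python sketch of the fixed-width argument, with the caveat about
surrogate pairs made visible. The helper names (utf8_bytes, utf16_units,
nth_char_bmp) are made up for this example and are not part of the proposal.

```python
def utf8_bytes(s):
    # Number of bytes in the UTF-8 encoding of s.
    return len(s.encode("utf-8"))

def utf16_units(s):
    # Number of 16-bit code units in the UTF-16 encoding of s (no BOM).
    return len(s.encode("utf-16-le")) // 2

# ASCII, Latin-1, BMP, and a character outside the BMP (U+1D11E).
samples = ["a", "é", "€", "\U0001D11E"]
widths = [(utf8_bytes(c), utf16_units(c)) for c in samples]
# widths == [(1, 1), (2, 1), (3, 1), (4, 2)]

def nth_char_bmp(encoded, n):
    # Constant-time indexing into UTF-16-LE data. Only valid when the text
    # contains no surrogate pairs, i.e. BMP characters only.
    return encoded[2 * n:2 * n + 2].decode("utf-16-le")

text = "naïve café"
encoded = text.encode("utf-16-le")
third = nth_char_bmp(encoded, 2)  # "ï"
```

With UTF-8 the same lookup would require scanning from the start of the
buffer, since the byte offset of the n-th character is not known in advance.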
> 4. UTF-16 would directly support a tightly coupled character properties
> engine, which would enable Unicode compliant case folding and character
> decomposition to be performed without an intermediate UTF-8 <----> UTF-16
> translation step.
Could you elaborate on this one ? It is one of the open issues
in the proposal.
> 5. UTF-16 requires string operations that do not make assumptions about
> nulls - this means re-implementing most of the C runtime functions to work
> with unsigned shorts.
AFAIK, the RE engines in Python are 8-bit clean...
BTW, wouldn't it be possible to take pcre and have it
use Py_Unicode instead of char ? [Of course, there would have to
be some extensions for character classes etc.]
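
The null issue from point 5 can be sketched in Python as follows. The
c_strlen helper is a hypothetical stand-in that mimics how C's strlen()
stops at the first NUL byte; it is not an actual binding to the C runtime.

```python
import re

text = "abc"
utf8 = text.encode("utf-8")       # b'abc'
utf16 = text.encode("utf-16-le")  # b'a\x00b\x00c\x00'

def c_strlen(buf):
    # Mimic C strlen(): count bytes up to (not including) the first NUL.
    idx = buf.find(b"\x00")
    return len(buf) if idx < 0 else idx

utf8_len = c_strlen(utf8)    # 3: the whole string is seen
utf16_len = c_strlen(utf16)  # 1: truncated after the first code unit

# An 8-bit clean matcher has no such problem with embedded NUL bytes.
match = re.search(rb"b\x00c", utf16)
```

Here match.span() covers bytes 2 to 5 of the UTF-16 buffer, so the search
runs past the embedded NULs, whereas the strlen-style scan stops at byte 1.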
--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 49 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/