[Python-Dev] Internationalization Toolkit

Da Silva, Mike Mike.Da.Silva@uk.fid-intl.com
Fri, 12 Nov 1999 11:00:49 -0000


Most of the ASCII string functions do indeed work for UTF-8.  I have made
extensive use of this feature when writing translation logic to harmonize
ASCII text (an SQL statement) with substitution parameters that must be
converted from IBM EBCDIC code pages (5035, 1027) into UTF-8.  Since UTF-8 is
a superset of ASCII, this all works fine.
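
As a rough illustration of the kind of substitution described above, here is
a minimal Python sketch.  The `substitute` helper, the table name, the `?`
placeholder convention and the use of Python's `cp037` codec as a stand-in
EBCDIC code page are all assumptions made for the example; they are not the
code pages or the logic from the actual project.

```python
# Sketch: byte-level substitution into an ASCII SQL statement held as UTF-8.
# Assumptions: cp037 stands in for the EBCDIC code page, and "?" marks the
# substitution point; neither comes from the original message.

def substitute(template: bytes, ebcdic_param: bytes, codepage: str = "cp037") -> bytes:
    """Convert an EBCDIC parameter to UTF-8 and splice it into the template."""
    param_utf8 = ebcdic_param.decode(codepage).encode("utf-8")
    # Plain byte replacement is safe: every byte of a multi-byte UTF-8
    # sequence is >= 0x80, so it can never be mistaken for the ASCII "?".
    return template.replace(b"?", param_utf8, 1)

template = b"SELECT * FROM customers WHERE name = '?'"
ebcdic_param = "Müller".encode("cp037")   # pretend this arrived from the host
statement = substitute(template, ebcdic_param)
print(statement.decode("utf-8"))
```

The template is treated purely as bytes throughout; only the parameter is
ever decoded and re-encoded.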

Some of the character classification functions etc. can be flaky when used
with UTF-8 characters outside the ASCII range, but simple string operations
work fine.
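
To make the distinction concrete, the following Python sketch applies
ASCII-only byte operations to UTF-8 data.  It uses Python's `bytes` methods
as a stand-in for 8-bit C string functions, which is an assumption of the
example rather than the environment described above.

```python
# Sketch: ASCII-only classification and case functions applied to UTF-8 bytes.
text = "café"
utf8 = text.encode("utf-8")          # b'caf\xc3\xa9'

# bytes methods only classify ASCII, so the two bytes of "é" confuse them.
byte_is_alpha = utf8.isalpha()       # 0xC3 and 0xA9 are not ASCII letters
byte_upper = utf8.upper()            # only the ASCII letters change

# The same operations on decoded text use Unicode character properties.
text_is_alpha = text.isalpha()
text_upper = text.upper()

# Simple operations are unaffected by the non-ASCII bytes.
starts_ok = utf8.startswith(b"ca")
joined = b", ".join([utf8, b"tea"])
```

Classification and case mapping give different answers at the byte level and
at the character level, while prefix tests and joins behave the same.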

As I see it, the relative pros and cons of UTF-8 versus UTF-16 for use as an
internal string representation are:

1.	UTF-8 allows all characters to be displayed (in some form or other)
on the user's machine, with or without native fonts installed.  Naturally
anything outside the ASCII range will be garbage, but it is an immense
debugging aid when working with character encodings to be able to touch and
feel something recognizable.  Trying to decode a block of raw UTF-16 is a
pain.
2.	UTF-8 works with most existing string manipulation libraries quite
happily.  It is also portable: a char is 8 bits on essentially every
platform in use, whereas wchar_t varies between 16 and 32 bits depending on
the underlying operating system (although unsigned short does seem to work
across platforms, in my experience).
3.	UTF-16 has some advantages in providing fixed-width characters and
(ignoring surrogate pairs etc.) a modeless encoding space.  This is an
advantage for fast string operations, especially on CPUs that have
efficient operations for handling 16-bit data.
4.	UTF-16 would directly support a tightly coupled character properties
engine, which would enable Unicode-compliant case folding and character
decomposition to be performed without an intermediate UTF-8 <----> UTF-16
translation step.
5.	UTF-16 requires string operations that do not make assumptions about
nulls - this means re-implementing most of the C runtime functions to work
with unsigned shorts.
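
Points 3 and 5 can be seen directly in the encoded bytes.  The Python sketch
below is only an illustration; the sample characters and variable names are
chosen for the example, and `utf-16-le` is used so that no byte-order mark is
added.

```python
# Sketch: byte-level properties of the two encodings.
samples = ["A", "é", "€", "\U0001F600"]   # 1-, 2-, 3- and 4-byte UTF-8 cases

for ch in samples:
    u8 = ch.encode("utf-8")
    u16 = ch.encode("utf-16-le")          # explicit byte order, no BOM
    print(f"U+{ord(ch):04X}  utf-8={u8.hex(' ')}  utf-16={u16.hex(' ')}")

# Point 5: UTF-16 text is full of zero bytes, so NUL-terminated C string
# functions would stop after the first character.
has_nul_utf16 = b"\x00" in "SELECT".encode("utf-16-le")
has_nul_utf8 = b"\x00" in "SELECT".encode("utf-8")

# Point 3: characters outside the Basic Multilingual Plane need a surrogate
# pair (two 16-bit units), so UTF-16 is fixed width only if those are ignored.
units_bmp = len("€".encode("utf-16-le")) // 2
units_astral = len("\U0001F600".encode("utf-16-le")) // 2
```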

Regards,
Mike da Silva

	-----Original Message-----
	From:	Greg Stein [SMTP:gstein@lyra.org]
	Sent:	12 November 1999 10:30
	To:	Tim Peters
	Cc:	python-dev@python.org
	Subject:	RE: [Python-Dev] Internationalization Toolkit

	On Fri, 12 Nov 1999, Tim Peters wrote:
	>...
	> Using UTF-8 internally is also reasonable, and if it's being rejected on the
	> grounds of supposed slowness

	No... my main point was interaction with the underlying OS. I made a SWAG
	(Scientific Wild Ass Guess :-) and stated that UTF-8 is probably slower
	for various types of operations. As always, your infernal meddling has
	dashed that hypothesis, so I must retreat...

	>...
	> I expect either would work well.  It's at least curious that Perl and Tcl
	> both went with UTF-8 -- does anyone think they know *why*?  I don't.  The
	> people here saying UCS-2 is the obviously better choice are all from the
	> Microsoft camp <wink>.  It's not obvious to me, but then neither do I claim
	> that UTF-8 is obviously better.

	Probably for the exact reason that you stated in your messages: many 8-bit
	(7-bit?) functions continue to work quite well when given a UTF-8-encoded
	string. i.e. they didn't have to rewrite the entire Perl/TCL interpreter
	to deal with a new string type.

	I'd guess it is a helluva lot easier for us to add a Python Type than for
	Perl or TCL to whack around with new string types (since they use strings
	so heavily).

	Cheers,
	-g

	--
	Greg Stein, http://www.lyra.org/

