RE: [Python-Dev] Internationalization Toolkit

Most of the ASCII string functions do indeed work for UTF-8. I have made extensive use of this feature when writing translation logic to harmonize ASCII text (an SQL statement) with substitution parameters that must be converted from IBM EBCDIC code pages (5035, 1027) into UTF-8. Since UTF-8 is a superset of ASCII, this all works fine.

Some of the character classification functions etc. can be flaky when used with UTF-8 characters outside the ASCII range, but simple string operations work fine.

As I see it, the relative pros and cons of UTF-8 versus UTF-16 for use as an internal string representation are:

1. UTF-8 allows all characters to be displayed (in some form or other) on the user's machine, with or without native fonts installed. Naturally anything outside the ASCII range will be garbage, but it is an immense debugging aid when working with character encodings to be able to touch and feel something recognizable. Trying to decode a block of raw UTF-16 is a pain.

2. UTF-8 works with most existing string manipulation libraries quite happily. It is also portable: a char is always 8 bits, regardless of platform, whereas wchar_t varies between 16 and 32 bits depending on the underlying operating system (although unsigned short does seem to work across platforms, in my experience).

3. UTF-16 has some advantages in providing fixed-width characters and (ignoring surrogate pairs etc.) a modeless encoding space. This is an advantage for fast string operations, especially on CPUs that have efficient operations for handling 16-bit data.

4. UTF-16 would directly support a tightly coupled character properties engine, which would enable Unicode-compliant case folding and character decomposition to be performed without an intermediate UTF-8 <-> UTF-16 translation step.

5. UTF-16 requires string operations that do not make assumptions about nulls. This means re-implementing most of the C runtime functions to work with unsigned shorts.
Regards,
Mike da Silva

-----Original Message-----
From: Greg Stein [SMTP:gstein@lyra.org]
Sent: 12 November 1999 10:30
To: Tim Peters
Cc: python-dev@python.org
Subject: RE: [Python-Dev] Internationalization Toolkit

On Fri, 12 Nov 1999, Tim Peters wrote:
>...
> Using UTF-8 internally is also reasonable, and if it's being rejected on the
> grounds of supposed slowness

No... my main point was interaction with the underlying OS. I made a SWAG (Scientific Wild Ass Guess :-) and stated that UTF-8 is probably slower for various types of operations. As always, your infernal meddling has dashed that hypothesis, so I must retreat...

>...
> I expect either would work well. It's at least curious that Perl and Tcl
> both went with UTF-8 -- does anyone think they know *why*? I don't. The
> people here saying UCS-2 is the obviously better choice are all from the
> Microsoft camp <wink>. It's not obvious to me, but then neither do I claim
> that UTF-8 is obviously better.

Probably for the exact reason that you stated in your messages: many 8-bit (7-bit?) functions continue to work quite well when given a UTF-8-encoded string. i.e. they didn't have to rewrite the entire Perl/TCL interpreter to deal with a new string type.

I'd guess it is a helluva lot easier for us to add a Python Type than for Perl or TCL to whack around with new string types (since they use strings so heavily).

Cheers,
-g

--
Greg Stein, http://www.lyra.org/

_______________________________________________
Python-Dev maillist - Python-Dev@python.org
http://www.python.org/mailman/listinfo/python-dev
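
To make the UTF-8 versus UTF-16 trade-off above concrete, here is a minimal Python sketch. It is not code from the thread: the sample SQL string and the helper name `widths` are made up for illustration. It shows a byte-oriented split on an ASCII delimiter working unchanged on UTF-8 data, and then compares how many code units each encoding needs per character.

```python
# Illustrative only: the SQL text and the helper name `widths` are assumptions.
sql = "SELECT name FROM t WHERE city = 'Zürich' AND note = '日本語'"
utf8 = sql.encode("utf-8")

# ASCII bytes never appear inside a multi-byte UTF-8 sequence, so a
# byte-oriented split on an ASCII delimiter cannot cut a character in half.
parts = utf8.split(b"'")
city = parts[1].decode("utf-8")
note = parts[3].decode("utf-8")


def widths(ch):
    """Return (UTF-8 bytes, UTF-16 code units) needed for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le")) // 2


# UTF-16 is fixed width only inside the Basic Multilingual Plane; a character
# beyond it (here U+1F600) takes a surrogate pair, i.e. two 16-bit units.
table = {ch: widths(ch) for ch in ("A", "é", "日", "\U0001F600")}
```

The last entry is the "ignoring surrogate pairs" caveat from point 3: for that character a 16-bit representation is no longer one unit per character.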

"Da Silva, Mike" wrote:
That's why there's the <defencbuf> buffer which holds the UTF-8 encoded value...
True.
You mean with the compiler applying the needed 16->32 bit extension ?
Right, and this is a major argument for using 16-bit encodings without state internally.
Could you elaborate on this one ? It is one of the open issues in the proposal.
AFAIK, the RE engines in Python are 8-bit clean...

BTW, wouldn't it be possible to take pcre and have it use Py_Unicode instead of char ? [Of course, there would have to be some extensions for character classes etc.]

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 49 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
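
A small sketch of what "8-bit clean" means in practice, and of why character classes would need extending, using the `re` module in present-day Python 3 rather than the engines under discussion in 1999. The sample text is made up for illustration.

```python
import re

# Illustrative sample text (an assumption, not from the thread).
text = "naïve café"
data = text.encode("utf-8")

# A bytes pattern passes non-ASCII bytes through untouched, so literal
# matching over UTF-8 data works without the engine knowing the encoding.
m = re.search(rb"caf(.+)", data)
tail = m.group(1).decode("utf-8")

# Character classes are a different matter: in a bytes pattern \w covers
# ASCII word characters only, so accented letters break words apart.
byte_words = re.findall(rb"\w+", data)

# A str pattern uses Unicode character properties and keeps words whole.
str_words = re.findall(r"\w+", text)
```

Here `byte_words` comes out as three fragments while `str_words` holds the two intended words, which is the kind of character-class gap the bracketed remark above refers to.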

footnote: the mad scientist has been there and done that: http://www.pythonware.com/madscientist/ (and you can replace "unsigned short" with "whatever's suitable on this platform") </F>

[Da Silva, Mike]
Python strings are already null-friendly, so Python has already recoded everything it needs to get away from the no-null assumption; stropmodule.c is < 1,500 lines of code, and MAL can turn it into C++ template functions in his sleep <wink -- but stuff "like this" really is easier in C++>.
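
A minimal sketch of the null issue, assuming present-day Python 3 `bytes` as a stand-in for the length-counted strings described above. The variable names are illustrative only.

```python
# UTF-16 puts a zero byte next to every ASCII character.
utf16 = "SELECT".encode("utf-16-le")
has_nul = b"\x00" in utf16

# What a NUL-terminated C routine such as strlen() would see: the data
# appears to end at the first zero byte, i.e. after a single character.
c_view = utf16.split(b"\x00", 1)[0]

# Length-counted Python objects carry their size explicitly, so embedded
# zero bytes are ordinary data for len(), find(), slicing and decoding.
full_length = len(utf16)
offset_of_t = utf16.find("T".encode("utf-16-le"))
round_trip = utf16.decode("utf-16-le")
```

The C view stops after `b"S"`, while the Python operations see all twelve bytes and locate the final character at byte offset 10.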
participants (4): Da Silva, Mike; Fredrik Lundh; M.-A. Lemburg; Tim Peters