Internal Format (Re: [Python-Dev] Internationalization Toolkit)
M.-A. Lemburg
mal@lemburg.com
Wed, 10 Nov 1999 11:03:36 +0100
Fredrik Lundh wrote:
>
> Guido van Rossum <guido@CNRI.Reston.VA.US> wrote:
> > http://starship.skyport.net/~lemburg/unicode-proposal.txt
>
> Marc-Andre writes:
>
> The internal format for Unicode objects should either use a Python
> specific fixed cross-platform format <PythonUnicode> (e.g. 2-byte
> little endian byte order) or a compiler provided wchar_t format (if
> available). Using the wchar_t format will ease embedding of Python in
> other Unicode aware applications, but will also make internal format
> dumps platform dependent.
>
> having been there and done that, I strongly suggest
> a third option: a 16-bit unsigned integer, in platform
> specific byte order (PY_UNICODE_T). along all other
> roads lie code bloat and speed penalties...
>
> (besides, this is exactly how it's already done in
> unicode.c and what 'sre' prefers...)
Ok, byte order can cause a speed penalty, so it might be
worthwhile introducing sys.bom (or sys.endianness) for this
reason and sticking to 16-bit integers as you have already done
in unicode.h.
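
To illustrate the byte-order point, here is a minimal Python sketch (not taken from unicode.h) that dumps the same 16-bit code units in native, little-endian and big-endian order. The helper name pack_units is made up for illustration. sys.byteorder is the attribute present-day Python provides; sys.bom and sys.endianness above are only proposed names.

```python
import struct
import sys


def pack_units(units, byte_order="native"):
    """Pack 16-bit code units into bytes in the requested byte order.

    pack_units is a hypothetical helper, not an existing Python API.
    """
    prefix = {"native": "=", "little": "<", "big": ">"}[byte_order]
    return struct.pack(f"{prefix}{len(units)}H", *units)


units = [0x0048, 0x0069, 0x20AC]  # "H", "i", EURO SIGN

native = pack_units(units)            # platform byte order, no swapping needed
little = pack_units(units, "little")  # fixed cross-platform order
big = pack_units(units, "big")

# The native dump is identical to exactly one of the two fixed orders;
# on the other kind of machine every unit would need a byte swap.
same_as_native = little if sys.byteorder == "little" else big
```

A dump in native order is cheap to produce and read back on the same machine, but it is only portable if the byte order is recorded alongside it, which is what a sys-level byte-order indicator would allow.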
What I don't like is using wchar_t if available (and then addressing
it as if it were defined as an unsigned integer). IMO, it's better
to define a Python Unicode representation which then gets converted
to whatever wchar_t represents on the target machine.
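
As a rough sketch of that conversion step, assuming the Python-side representation is a sequence of 16-bit code units restricted to the BMP (i.e. UCS-2): each unit is copied into a slot of whatever width wchar_t has on the machine at hand. The function name units_to_wchar_bytes is hypothetical, and only 2- and 4-byte wchar_t are handled.

```python
import ctypes
import struct

# Width of the platform's wchar_t in bytes (commonly 2 or 4).
WCHAR_SIZE = ctypes.sizeof(ctypes.c_wchar)


def units_to_wchar_bytes(units):
    """Widen (or copy) 16-bit code units into native wchar_t-sized slots.

    Hypothetical helper; assumes BMP-only input, so every code unit
    maps to exactly one wchar_t regardless of its width.
    """
    slot = {2: "H", 4: "I"}[WCHAR_SIZE]
    return struct.pack(f"={len(units)}{slot}", *units)


units = [0x0048, 0x0069, 0x20AC]  # "H", "i", EURO SIGN
wide = units_to_wchar_bytes(units)
```

The internal format stays the same everywhere; only this boundary function has to know how wide wchar_t is.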
Another issue is whether to use UCS-2 (as you have done) or UTF-16
(which is what Unicode 3.0 requires)... see my other post
for a discussion.
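
For reference, the practical difference shows up only above U+FFFF: UCS-2 stores one 16-bit unit per character and cannot represent those code points, while UTF-16 encodes them as a surrogate pair. A small sketch of the UTF-16 rule (utf16_units is an illustrative name, not an existing function):

```python
def utf16_units(text):
    """Return the 16-bit code units that UTF-16 uses for text."""
    units = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP character: one unit, identical to UCS-2.
            units.append(cp)
        else:
            # Supplementary character: split into a surrogate pair.
            cp -= 0x10000
            units.append(0xD800 | (cp >> 10))    # high surrogate
            units.append(0xDC00 | (cp & 0x3FF))  # low surrogate
    return units


bmp = utf16_units("\u20ac")         # EURO SIGN, inside the BMP
astral = utf16_units("\U00010400")  # DESERET CAPITAL LETTER LONG I
```

With UTF-16 the number of 16-bit units no longer equals the number of characters, which is the cost to weigh against UCS-2's fixed width.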
--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 51 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/