[XML-SIG] PyExpat encoding

Andrew M. Kuchling akuchlin@mems-exchange.org
Thu, 1 Jun 2000 17:41:51 -0400


On Thu, Jun 01, 2000 at 02:28:06PM -0700, Greg Stein wrote:
>In other words, I think you're confusing the character set that Expat
>operates with (Unicode) with the encoding of that charset (UTF-8 or
>UTF-16; the latter is used by the Unicode object).

Perhaps; what I'm asking is which Python type the objects passed to
the Expat callbacks will have.

>Expat is characterized by its speed. Throwing conversions in there is not
>going to help.

I thought Paul said Expat could be compiled to return 16-bit Unicode.
Or... damn, does it return UCS-2 and we need UTF-16?  <looks at
xmlparse.h> In Expat 1.1, it looks to me that if you #define
XML_UNICODE, and don't #define XML_UNICODE_WCHAR_T, Expat will return
"UTF-16 encoded as unsigned shorts".  Wouldn't that be just what we
need to return Unicode objects?  
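
To make the UCS-2 versus UTF-16 worry concrete, here is a minimal Python
sketch. It assumes the parser hands back native-endian 16-bit code units,
as the xmlparse.h comment suggests. utf16_units_to_str is a hypothetical
helper, not part of Expat or PyExpat, and it uses a current Python 3 str
in place of the Unicode object under discussion.

```python
import sys
from array import array


def utf16_units_to_str(units):
    """Decode 16-bit code units (native byte order) into a Python str.

    Hypothetical helper for illustration only; not an Expat or PyExpat API.
    """
    buf = array("H", units)  # 'H' is C unsigned short
    if buf.itemsize != 2:
        raise RuntimeError("unsigned short is not 16 bits on this platform")
    codec = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
    return buf.tobytes().decode(codec)


# "A" followed by U+1D11E, which lies outside the 16-bit range and so
# takes two code units (a surrogate pair) in UTF-16.
units = [0x0041, 0xD834, 0xDD1E]
text = utf16_units_to_str(units)
print(len(units), "code units ->", len(text), "characters")
```

A consumer that treated the same buffer as UCS-2 would see three
characters instead of two, which is why the distinction matters once
anything outside the 16-bit range shows up.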

On the other hand, that means you can't use the system's copy of
Expat, since who knows what it was compiled with?  Actually, this
seems like a bug in Expat; if I have an Expat library, I have no way
of figuring out what it'll be outputting: C 'char's containing UTF-8,
unsigned shorts holding UTF-16, or wchar_t's holding UTF-16.  (Argh, my
head explodes every time character encodings come up.)
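
For what it's worth, later Expat releases grew a run-time feature list,
and Python's expat module exposes it as a "features" attribute; nothing
like it exists in the Expat 1.1 header discussed above. The sketch below
assumes such a newer build. expat_char_width is a hypothetical helper,
and the attribute is looked up defensively because its presence is an
assumption about the installed module.

```python
from xml.parsers import expat


def expat_char_width():
    """Return sizeof(XML_Char) as reported by the linked Expat, or None.

    Hypothetical helper; relies on the module-level `features` list of
    (name, value) pairs, which older builds may not provide.
    """
    features = dict(getattr(expat, "features", []))
    return features.get("sizeof(XML_Char)")


width = expat_char_width()
layout = {1: "UTF-8 in chars", 2: "UTF-16 in 16-bit units"}
print(expat.EXPAT_VERSION, layout.get(width, "unknown"))
```

A width of 1 would correspond to the default build, and 2 to a build with
XML_UNICODE defined; anything else is reported as unknown and not guessed.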

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
And if there's a moral there, I don't know what it is, save maybe that we
should take our goodbyes whenever we can.
  -- Barbie, in SANDMAN #37: "I Woke Up and One of Us Was Crying"