[XML-SIG] PyExpat encoding

Greg Stein gstein@lyra.org
Thu, 1 Jun 2000 14:53:32 -0700 (PDT)


On Thu, 1 Jun 2000, Andrew M. Kuchling wrote:
> On Thu, Jun 01, 2000 at 02:28:06PM -0700, Greg Stein wrote:
> >In other words, I think you're confusing the character set that Expat
> >operates with (Unicode) with the encoding of that charset (UTF-8 or
> >UTF-16; the latter is used by the Unicode object).
> 
> Perhaps; I'm asking what's the Python type of the Python objects
> passed to callbacks used by Expat. 

Right. I'm saying that it can be either, depending on how Expat was built.

> >Expat is characterized by its speed. Throwing conversions in there is not
> >going to help.
> 
> I thought Paul said Expat could be compiled to return 16-bit Unicode.
> Or... damn, does it return UCS-2 and we need UTF-16?  <looks at
> xmlparse.h> In Expat 1.1, it looks to me that if you #define
> XML_UNICODE, and don't #define XML_UNICODE_WCHAR_T, Expat will return
> "UTF-16 encoded as unsigned shorts".  Wouldn't that be just what we
> need to return Unicode objects?  

Python is the same: UTF-16 encoded as unsigned shorts.
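
To make the charset-versus-encoding distinction concrete, here is a minimal
sketch in present-day Python. The sample string and the variable names are
illustrative assumptions, not anything taken from Expat or pyexpat. It encodes
the same Unicode characters once as UTF-8 bytes and once as UTF-16 code units
held in unsigned shorts:

```python
import struct

# Illustrative sample: an ASCII letter, a Latin-1 letter, and a character
# outside the Basic Multilingual Plane (U+10000).
text = "a\u00e9\U00010000"

# UTF-8: a variable number of 8-bit bytes per character (1, 2 and 4 here).
utf8_bytes = text.encode("utf-8")

# UTF-16: 16-bit code units; U+10000 needs a surrogate pair.
utf16_bytes = text.encode("utf-16-le")
utf16_units = struct.unpack("<%dH" % (len(utf16_bytes) // 2), utf16_bytes)

print(len(utf8_bytes))
print([hex(unit) for unit in utf16_units])
```

The surrogate pair for the last character is what separates UTF-16 from plain
UCS-2, which has no way to represent characters beyond U+FFFF.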

> On the other hand, that means you can't use the system's copy of
> Expat, since who knows what it was compiled with?

Bingo. My point exactly. By default, Expat is going to be built using
UTF-8 for the output.

> Actually, this
> seems like a bug in Expat; if I have an Expat library, I have no way
> of figuring out what it'll be outputting: C 'char's containing UTF-8,
> unsigned short holding UTF-16, or wchar_t holding UTF-16.  (Argh, my
> head explodes every time character encodings come up.)

Eek. You're right. The output format can be determined at compile time, so
we can Do The Right Thing when building pyexpat. But things will be hosed if
somebody drops in a libexpat.a that was compiled differently.

Bleh. This says we should either simply depend on it being compiled to
output UTF-8, or include a copy of the library. The latter is already
"not recommended" by the BDFL, so we can only assume that Expat will
return UTF-8.

This still doesn't rule out pyexpat having a setting that decodes the UTF-8
text and calls into Python with Unicode objects.
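
Roughly, such a setting could behave like the sketch below. This is only an
illustration under stated assumptions: `decoding_handler` and its `decode`
flag are hypothetical names, not part of pyexpat, and the parser output is
simulated with a hand-written UTF-8 byte string rather than a real Expat call.

```python
def decoding_handler(handler, decode=True):
    """Wrap a callback so UTF-8 byte strings reach it as Unicode strings.

    Hypothetical helper for illustration; not a pyexpat API.
    """
    if not decode:
        return handler  # pass the parser's UTF-8 output through untouched

    def wrapper(*args):
        # Decode only byte-string arguments; leave everything else alone.
        converted = [
            arg.decode("utf-8") if isinstance(arg, bytes) else arg
            for arg in args
        ]
        return handler(*converted)

    return wrapper


# Simulated parser output: the UTF-8 bytes for "caf" plus U+00E9.
received = []
on_text = decoding_handler(received.append)
on_text(b"caf\xc3\xa9")
```

With the flag off, the callback is handed back unwrapped, so the conversion
cost is only paid by callers who ask for Unicode objects.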

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/