[XML-SIG] PyExpat encoding (was: XML support in Python 1.6)

Greg Stein gstein@lyra.org
Thu, 1 Jun 2000 14:28:06 -0700 (PDT)

On Thu, 1 Jun 2000, Andrew M. Kuchling wrote:
> On Thu, Jun 01, 2000 at 12:56:28PM -0700, Greg Stein wrote:
> >IMO, we should have a fixed output format, which is the Expat default:
> >UTF-8.
> I don't know; it seems a bit odd to parse a Unicode string and then
> have to convert from an 8-bit encoding back to Unicode in your
> character data handlers, attributes, etc.  The problem is that it's
> also odd to parse a regular Python string and get back Unicode.  
> OTOH, if Latin1-encoded XML has something like <!ENTITY unichar
> &#1972;> &unichar; in it, Unicode is the only thing it could possibly
> return.

Yes, Unicode is the only thing it can return.

BUT: it can return it as a Unicode object, or as a UTF-8 encoded string.

In other words, I think you're confusing the character set that Expat
operates with (Unicode) with the encoding of that charset (UTF-8 or
UTF-16; the latter is used by the Unicode object).
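
To make the charset/encoding distinction concrete, here is a minimal sketch.
It assumes a present-day Python 3 interpreter (the str/bytes split postdates
this thread) and reuses the code point from the &#1972; example quoted above.
The variable names are illustrative only.

```python
# One Unicode character (code point 1972, i.e. U+07B4), several byte encodings.
ch = chr(1972)

utf8_bytes = ch.encode("utf-8")       # two bytes in UTF-8
utf16_bytes = ch.encode("utf-16-be")  # two bytes in big-endian UTF-16

# Both byte strings decode back to the very same character.
same_char = utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16-be") == ch

# Latin-1 cannot represent this code point at all.
try:
    ch.encode("latin-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False
```

The character is the same in every case; only the byte representation
changes, and an 8-bit encoding such as Latin-1 has no representation for it.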

> Maybe PyExpat could attempt to convert its Unicode output
> into an 8-bit string (but using what encoding?), and only return
> Unicode if it has to.  
> Hmmm... on the third hand, XML is a Unicode based standard, and
> sometimes returning Unicode and sometimes an 8-bit string is also
> strange.  Maybe it's best to just always return Unicode, and leave
> further conversion to the caller.  
> I think I'd go for the third option: always returning Unicode strings.

Expat is characterized by its speed. Throwing conversions in there is not
going to help.

Yes, varying output is wrong. Expat's default is UTF-8. My recommendation
is to use UTF-8.

If somebody is adventurous, they can add a flag to pyexpat that selects
the form used for the callbacks: UTF-8 strings or Unicode objects. But
without that extra work, it "should" be UTF-8.
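
A rough sketch of what such a flag could look like from the caller's side,
assuming a present-day Python 3 where xml.parsers.expat hands str objects to
its handlers. The function name collect_text and its utf8_output parameter
are hypothetical and not part of pyexpat; the conversion here happens in
Python, after Expat has already done its work.

```python
from xml.parsers import expat

def collect_text(data, utf8_output=False):
    """Parse `data` and return its character data.

    utf8_output is a hypothetical flag: when true, the text is returned
    as UTF-8 encoded bytes instead of a str (Unicode) object.
    """
    chunks = []
    parser = expat.ParserCreate()
    # Expat may deliver character data in several pieces, so collect them.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(data, True)
    text = "".join(chunks)
    return text.encode("utf-8") if utf8_output else text

# A Latin-1 encoded document containing a character reference
# that lies outside Latin-1.
doc = ('<?xml version="1.0" encoding="iso-8859-1"?>'
       '<a>caf\xe9 &#1972;</a>').encode("latin-1")

as_unicode = collect_text(doc)
as_utf8 = collect_text(doc, utf8_output=True)
```

Either way the caller sees one consistent output type for every document,
regardless of the input encoding.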


Greg Stein, http://www.lyra.org/