[XML-SIG] PyExpat encoding (was: XML support in Python 1.6)
Thu, 1 Jun 2000 14:28:06 -0700 (PDT)
On Thu, 1 Jun 2000, Andrew M. Kuchling wrote:
> On Thu, Jun 01, 2000 at 12:56:28PM -0700, Greg Stein wrote:
> >IMO, we should have a fixed output format, which is the Expat default:
> I don't know; it seems a bit odd to parse a Unicode string and then
> have to convert from an 8-bit encoding back to Unicode in your
> character data handlers, attributes, etc. The problem is that it's
> also odd to parse a regular Python string and get back Unicode.
> OTOH, if Latin1-encoded XML has something like <!ENTITY unichar
> "&#1972;"> &unichar; in it, Unicode is the only thing it could possibly
Yes, Unicode is the only thing it can return.
BUT: it can return it as a Unicode object, or as a UTF-8 encoded string.
In other words, I think you're confusing the character set that Expat
operates with (Unicode) with the encoding of that charset (UTF-8 or
UTF-16; the latter is used by the Unicode object).
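To make the character set versus encoding distinction concrete, here is a minimal sketch. It assumes a modern Python 3 `xml.parsers.expat` (which always hands `str` objects to callbacks), not the 1.6-era pyexpat under discussion. The document content and the entity value `&#1972;` are assumptions chosen for illustration.

```python
import xml.parsers.expat

# A Latin-1 encoded document whose entity expands to a character
# that Latin-1 itself cannot represent (U+07B4, decimal 1972).
doc = (
    '<?xml version="1.0" encoding="iso-8859-1"?>\n'
    '<!DOCTYPE r [<!ENTITY unichar "&#1972;">]>\n'
    '<r>caf\xe9 &unichar;</r>'
).encode("latin-1")

chunks = []
parser = xml.parsers.expat.ParserCreate()
# Expat may deliver character data in several pieces, so collect them.
parser.CharacterDataHandler = chunks.append
parser.Parse(doc, True)

text = "".join(chunks)               # the characters, as a Unicode object
as_utf8 = text.encode("utf-8")       # the same characters, UTF-8 encoded
as_utf16 = text.encode("utf-16-le")  # the same characters, UTF-16 encoded
```

The three values carry the same Unicode characters; only the byte-level representation differs, which is the choice being argued about here.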
> Maybe PyExpat could attempt to convert its Unicode output
> into an 8-bit string (but using what encoding?), and only return
> Unicode if it has to.
> Hmmm... on the third hand, XML is a Unicode based standard, and
> sometimes returning Unicode and sometimes an 8-bit string is also
> strange. Maybe it's best to just always return Unicode, and leave
> further conversion to the caller.
> I think I'd go for the third option: always returning Unicode strings.
Expat is characterized by its speed. Throwing conversions in there is not
going to help.
Yes, varying output is wrong. Expat's default is UTF-8. My recommendation
is to use UTF-8.
If somebody is adventurous, then they can add a flag to pyexpat that
states what encoding to use for the callbacks: UTF-8 strings or Unicode
objects. But without that extra work, it "should" be UTF-8.
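A rough sketch of what such a flag could look like from the caller's side, assuming a modern Python 3 `xml.parsers.expat`. The names `make_parser`, `collect`, and `utf8_output` are hypothetical and not part of pyexpat. Note that in modern Python the conversion runs the other way round (the module already decodes to `str`, so the UTF-8 branch re-encodes), so this only illustrates the interface, not the performance argument.

```python
import xml.parsers.expat

def make_parser(on_text, utf8_output=True):
    """Hypothetical helper: utf8_output selects what the callback receives."""
    parser = xml.parsers.expat.ParserCreate()
    if utf8_output:
        # Deliver UTF-8 encoded byte strings to the callback.
        parser.CharacterDataHandler = lambda s: on_text(s.encode("utf-8"))
    else:
        # Deliver Unicode objects (str) unchanged.
        parser.CharacterDataHandler = on_text
    return parser

def collect(data, utf8_output):
    """Parse data and return the list of character-data chunks."""
    chunks = []
    parser = make_parser(chunks.append, utf8_output)
    parser.Parse(data, True)
    return chunks

sample = b"<r>caf\xc3\xa9</r>"  # no declaration, so Expat reads it as UTF-8
utf8_chunks = collect(sample, utf8_output=True)
unicode_chunks = collect(sample, utf8_output=False)
```

Either way, a given parser always produces one type for its callbacks, so the output never varies with the input document.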
Greg Stein, http://www.lyra.org/