[XML-SIG] PyExpat encoding

Greg Stein gstein@lyra.org
Thu, 1 Jun 2000 22:36:17 -0700 (PDT)


Expat will accept either encoding (UTF-8 or UTF-16) for the text that it
*consumes*.
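
To make the input side concrete, here is a minimal sketch using the
xml.parsers.expat module as shipped with present-day Python 3 (an
assumption; it is not the pyexpat build under discussion in this thread).
The helper name collect_text is made up for illustration. The same
document is fed in once as UTF-8 bytes and once as UTF-16 bytes with a
Byte Order Mark, and the parser should recover the same text from both:

```python
import xml.parsers.expat

def collect_text(data):
    """Parse raw bytes and return the concatenated character data."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    # Character data may arrive in several pieces, so collect and join.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(data, True)
    return "".join(chunks)

text = "caf\u00e9"                   # contains one non-ASCII character
doc = "<doc>" + text + "</doc>"      # no encoding declaration

utf8_bytes = doc.encode("utf-8")
utf16_bytes = doc.encode("utf-16")   # this codec prepends a BOM

from_utf8 = collect_text(utf8_bytes)
from_utf16 = collect_text(utf16_bytes)
print(from_utf8 == from_utf16 == text)
```

Neither input carries an encoding declaration: the UTF-8 case relies on
the default, and the UTF-16 case relies on the Byte Order Mark.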

The discussion point is about what kind of objects are passed to the
Handlers from the Expat parser. Are those objects UTF-8 strings or Unicode
objects?
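
As a rough way to check the output side, the sketch below records the
Python types that the handlers receive. It assumes the xml.parsers.expat
module from a present-day Python 3, where the handlers get Unicode (str)
objects; a pyexpat linked against an Expat built as described in this
thread may behave differently. The handler names and the "seen" list are
invented for the example:

```python
import xml.parsers.expat

seen = []  # (event, types...) tuples recorded by the handlers

def start_element(name, attrs):
    # attrs is a dict unless ordered_attributes is enabled on the parser
    seen.append(("start", type(name), type(attrs)))

def char_data(data):
    seen.append(("text", type(data)))

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
parser.CharacterDataHandler = char_data
parser.Parse(b'<doc lang="en">hello</doc>', True)

for event in seen:
    print(event)
```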

Cheers,
-g

On Thu, 1 Jun 2000 tpassin@home.com wrote:

> With all the talk about default encodings, compiling for different
> encodings, and passing "unicode" to Python objects, I'm losing track (or my
> grip :)).  Do these considerations affect either pyexpat or other Python
> XML code in their ability to handle the basic required encodings, per the
> XML 1.0 Rec:
> 
> "All XML processors must accept the UTF-8 and UTF-16 encodings of 10646"
> 
> also,
> 
> "In the absence of information provided by an external transport protocol
> (e.g. HTTP or MIME), it is an error for an entity including an encoding
> declaration to be presented to the XML processor in an encoding other than
> that named in the declaration, for an encoding declaration to occur other
> than at the beginning of an external entity, or for an entity which begins
> with neither a Byte Order Mark nor an encoding declaration to use an
> encoding other than UTF-8. Note that since ASCII is a subset of UTF-8,
> ordinary ASCII entities do not strictly need an encoding declaration."
> 
> Since there is a lot of XML out there without encoding declarations, it
> would seem that UTF-8 would HAVE to be used as the default.  I admit, I
> can't find anything in the Rec that says what encoding a processor must use
> to send results to other pieces of code.  And don't these other pieces of
> code also constitute XML processors, so that they should follow the same
> rules?
> 
> I'd appreciate some enlightenment in this area - if expat/pyexpat are compiled
> to "use" encoding X, how does this fact interact with the Rec's
> requirements?  Or doesn't it?
> 
> Tom Passin
> 
>  Greg Stein (and lots of others) wrote:
> 
> > On Thu, 1 Jun 2000, Andrew M. Kuchling wrote:
> > > On Thu, Jun 01, 2000 at 02:28:06PM -0700, Greg Stein wrote:
> > > >In other words, I think you're confusing the character set that Expat
> > > >operates with (Unicode) with the encoding of that charset (UTF-8 or
> > > >UTF-16; the latter is used by the Unicode object).
> > >
> > > Perhaps; I'm asking what's the Python type of the Python objects
> > > passed to callbacks used by Expat.
> >
> > Right. I'm saying that it can be either, depending on how Expat was built.
> >
> > > >Expat is characterized by its speed. Throwing conversions in there is
> > > >not going to help.
> > >
> > > I thought Paul said Expat could be compiled to return 16-bit Unicode.
> > > Or... damn, does it return UCS-2 and we need UTF-16?  <looks at
> > > xmlparse.h> In Expat 1.1, it looks to me that if you #define
> > > XML_UNICODE, and don't #define XML_UNICODE_WCHAR_T, Expat will return
> > > "UTF-16 encoded as unsigned shorts".  Wouldn't that be just what we
> > > need to return Unicode objects?
> >
> > Python is the same: UTF-16 encoded as unsigned shorts.
> >
> > > On the other hand, that means you can't use the system's copy of
> > > Expat, since who knows what it was compiled with?
> >
> > Bingo. My point exactly. By default, Expat is going to be built using
> > UTF-8 for the output.
> >
> > > Actually, this
> > > seems like a bug in Expat; if I have an Expat library, I have no way
> > > of figuring out what it'll be outputting: C 'char's containing UTF-8,
> > > unsigned short holding UTF-16, or wchar_t holding UTF-16.  (Argh, my
> > > head explodes every time character encodings come up.)
> >
> > Eek. You're right. This can be determined at compile-time, so we can Do
> > The Right Thing when building pyexpat. But things will be hosed if
> > somebody drops in a libexpat.a that was compiled differently.
> >
> > Bleh. This says we should simply depend on it being compiled to output
> > UTF-8, or we should include a copy of the library. The latter is already
> > "not recommended" by the BDFL, so we can only assume that Expat will
> > return UTF-8.
> >
> > This still doesn't preclude pyexpat from having a setting to decode the
> > UTF-8 text and call into Python with Unicode objects.
> >

-- 
Greg Stein, http://www.lyra.org/