[XML-SIG] Expat answers

Paul Prescod paul@prescod.net
Fri, 28 Jan 2000 16:04:59 -0600

Answers to questions that people have asked me about Expat:

1. Expat can parse DTDs if we want it to. If you compile DTD support in
then you can turn on or off parameter entity parsing on a per-parser
instance basis.

There are three settings:

 * Don't parse parameter entities or the external subset.

 * Parse parameter entities and the external subset unless standalone was
set to "yes" in the XML declaration.

 * Always parse parameter entities and the external subset.

It doesn't validate, but we could probably build that in Python.
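
To make the three settings concrete, here is a minimal sketch. It assumes the
xml.parsers.expat module bundled with present-day Python, which is not the
PyExpat interface being discussed in this thread. The MODES table and the
make_parser helper are hypothetical names used only for illustration.

```python
from xml.parsers import expat

# Hypothetical lookup table mapping the three settings to the
# constants exposed by Python's xml.parsers.expat module.
MODES = {
    "never": expat.XML_PARAM_ENTITY_PARSING_NEVER,
    "unless_standalone": expat.XML_PARAM_ENTITY_PARSING_UNLESS_STANDALONE,
    "always": expat.XML_PARAM_ENTITY_PARSING_ALWAYS,
}


def make_parser(mode="never"):
    """Create a parser and set parameter entity parsing on that instance only."""
    parser = expat.ParserCreate()
    # The call returns a true value if the setting was accepted. A build of
    # Expat without DTD support may refuse the two parsing modes.
    accepted = parser.SetParamEntityParsing(MODES[mode])
    return parser, bool(accepted)


parser, accepted = make_parser("never")
parser.Parse("<doc>hello</doc>", True)
```

The setting applies to one parser object, so two parsers in the same program
can use different modes.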

2. Expat can be compiled to output either UTF-8 or UTF-16 (which is for
our purposes the same as UCS-2). It is theoretically possible to make a
parser that understands Unicode enough to do proper well-formedness
checking yet leaves characters in their native encoding, but as far as I
know, no such tool exists. I don't believe that sgmlop could ever be
that tool, even when it is rewritten on top of Fredrik's fast Unicode
regexp engine, because that engine would still be UTF-16/UCS-2 specific.

If you need to process shift-JIS information then you need to allow
Expat to convert it to UTF-16 and then convert it back to shift-JIS. I
don't think that there is any XML parser in the world that allows you to
work in any arbitrary native encoding with no conversions. Maybe some
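
The round trip described above can be sketched with Python's own codecs. This
is only an illustration of the conversion steps and does not involve Expat at
all. It assumes a Python with built-in Unicode strings and the "shift_jis"
codec, and round_trip is a hypothetical helper name.

```python
def round_trip(sjis_bytes):
    """Convert Shift-JIS bytes to UTF-16 and back again."""
    # Step 1: decode the native encoding into Unicode text.
    text = sjis_bytes.decode("shift_jis")
    # Step 2: this is the form a UTF-16 parser would hand to the application.
    utf16 = text.encode("utf-16-le")
    # ... any processing would happen here, on the Unicode data ...
    # Step 3: convert back to the native encoding for output.
    return utf16.decode("utf-16-le").encode("shift_jis")


original = "日本語".encode("shift_jis")
restored = round_trip(original)
```

For text that Shift-JIS can represent, the bytes that come out should match
the bytes that went in.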

Handling for non-Unicode character sets is simply not supported. The XML
world decided specifically against this based on two arguments:

 * one cannot argue against Unicode on the basis of character encoding
*efficiency* because we allow any encoding (even those compatible with
the Unicode subset of shift-JIS etc.) to be used.

 * one cannot argue against Unicode on the basis that it does not allow
"private" characters because it does:


3. Expat outputs UTF-16 so it is ready for 20-bit Unicode, wherein we
will find:

"Plane 1 is going to hold ancient and invented scripts and musical
symbols, while Plane 2 (U-0002xxxx) is reserved for additional Han
ideographs, Plane 14 (U-000Exxxx) is going to start with some meta
characters for language tagging and there are two entire bonus
private-use planes."

Python itself will not handle 20-bit characters yet, so the situation
with them will be just like the situation with 16-bit characters in
Python/xmllib today (Python will think that they are two characters).
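
To show why such a character looks like two characters, here is a small sketch
of how UTF-16 represents a code point above U+FFFF as a pair of 16-bit units.
It assumes a Python with built-in Unicode strings. The helper names
to_surrogates and utf16_units are hypothetical, and U+1D11E (a musical symbol
in Plane 1) is just a sample value.

```python
import struct


def to_surrogates(code_point):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000      # 20 bits remain after the subtraction
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def utf16_units(text):
    """Return the 16-bit code units of text when encoded as UTF-16."""
    data = text.encode("utf-16-le")
    return struct.unpack("<%dH" % (len(data) // 2), data)


pair = to_surrogates(0x1D11E)
units = utf16_units("\U0001D11E")
```

Code that treats each 16-bit unit as a character will count two characters
here, which is the situation described above.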

 Paul Prescod

"Andrew M. Kuchling" wrote:
> The XML-SIG's developer's day session went well, and, unlike most DD
> sessions, we actually achieved consensus on something. :) To summarize
> the outcome:
>     * The current PyDOM code will be dropped and replaced with 4DOM.
>       The precise details of how this will work are still to be
>       resolved; will the 4DOM code move into xml.dom, or will xml.dom
>       import from xml.Ft.dom and provide some wrappers?
>     * PyExpat's interface will be changed to be SAX-like, and we'll
>       lobby Guido to add PyExpat to 1.6, along with Expat itself.  It
>       will be renamed, preferably to something with SAX in the name.
>       (expat_sax? pysax?  pyxml? whatever...)  It'll be updated to
>       support all the features in current versions of Expat; Jim
>       Fulton has an updated version of PyExpat inside Zope that will
>       probably be used.
>     * xmllib.py will be left unmodified, though it'll be deprecated in
>       favor of PyExpat.
>     * When 1.6 begins supporting Unicode, we'll fork the development
>       tree into two branches; the branch that works with 1.5 will be
>       maintained, though probably not actively developed.  This will
>       leave the other branch free to use 1.6-specific features without
>       worrying about backward compatibility.
> If I've forgotten something from the session, please let me know.
> --
> A.M. Kuchling                   http://starship.python.net/crew/amk/
> First things first, but not necessarily in that order.
>     -- The Doctor, in John Flanagan and Andrew McCulloch's _Meglos_

 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
The new revolutionaries believe the time has come for an aggressive 
move against our oppressors. We have established a solid beachhead 
on Friday. We now intend to fight vigorously for 'casual Thursdays.'
  -- who says America's revolutionary spirit is dead?