[XML-SIG] Expat answers

Paul Prescod paul@prescod.net
Fri, 28 Jan 2000 16:04:59 -0600


Answers to questions that people have asked me about Expat:

1. Expat can parse DTDs if we want it to. If you compile DTD support in,
then you can turn parameter entity parsing on or off for each parser
instance.

"""
XML_PARAM_ENTITY_PARSING_NEVER 
Don't parse parameter entities or the external subset 


XML_PARAM_ENTITY_PARSING_UNLESS_STANDALONE 
Parse parameter entities and the external subset unless standalone was
set to "yes" in the XML declaration. 


XML_PARAM_ENTITY_PARSING_ALWAYS 
Always parse parameter entities and the external subset 
"""

It doesn't validate, but we could probably build that in Python.
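
As a rough sketch of what the per-parser switch looks like from Python:
this assumes the xml.parsers.expat module shipped with current Python 3
(not the PyExpat of today), which exposes SetParamEntityParsing and the
three constants quoted above. The helper name parse_with_pe and the
sample document are made up for illustration.

```python
import xml.parsers.expat as expat


def parse_with_pe(doc, mode=expat.XML_PARAM_ENTITY_PARSING_UNLESS_STANDALONE):
    """Parse doc with the given parameter-entity mode (hypothetical helper).

    Returns (accepted, names): accepted says whether Expat took the
    setting (it refuses if DTD support was not compiled in), names is
    the list of element names seen, in document order.
    """
    parser = expat.ParserCreate()
    # The mode applies to this parser instance only.
    accepted = bool(parser.SetParamEntityParsing(mode))

    names = []
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    parser.Parse(doc, True)  # True marks the final chunk of input
    return accepted, names


if __name__ == "__main__":
    sample = "<!DOCTYPE r [<!ELEMENT r (c)>]><r><c/></r>"
    print(parse_with_pe(sample))
```

Note that the DTD is only parsed here, not checked: the sample document
would parse just as happily if it broke its own content model.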

2. Expat can be compiled to output either UTF-8 or UTF-16 (which is, for
our purposes, the same as UCS-2). It is theoretically possible to make a
parser that understands Unicode well enough to do proper well-formedness
checking yet leaves characters in their native encoding, but as far as I
know no such tool exists. I don't believe that sgmlop could ever be
that tool, even when it is rewritten on top of Fredrik's fast Unicode
regexp engine, because that engine would still be UTF-16/UCS-2 specific.

If you need to process Shift-JIS data, then you need to let Expat
convert it to UTF-16 and then convert the output back to Shift-JIS
yourself. I don't think that there is any XML parser in the world that
lets you work in an arbitrary native encoding with no conversions. Maybe
some day.
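
A minimal sketch of that round trip, assuming current Python 3 rather
than the C API: Python's own shift_jis codec does the decoding, the
xml.parsers.expat module hands character data back as Unicode strings,
and the caller re-encodes at the end. The function name text_roundtrip
is made up, and the sketch assumes the document carries no conflicting
encoding declaration.

```python
import xml.parsers.expat as expat


def text_roundtrip(raw, encoding="shift_jis"):
    """Return the character data of an XML document, re-encoded.

    raw is the document as bytes in the given native encoding.
    Hypothetical helper: the conversions happen outside the parser.
    """
    text = raw.decode(encoding)          # native encoding -> Unicode

    chunks = []
    parser = expat.ParserCreate()
    # Expat may deliver text in several pieces, so collect and join.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(text, True)

    return "".join(chunks).encode(encoding)  # Unicode -> native encoding


if __name__ == "__main__":
    doc = "<p>\u65e5\u672c\u8a9e</p>".encode("shift_jis")
    print(text_roundtrip(doc))
```

The parser never sees Shift-JIS bytes; both conversions are the
application's job, which is exactly the cost described above.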

Handling for non-Unicode character sets is simply not supported. The XML
world decided specifically against this based on two arguments:

 * One cannot argue against Unicode on the basis of character encoding
*efficiency*, because we allow any encoding (even those compatible with
the Unicode subset of Shift-JIS etc.) to be used.

 * One cannot argue against Unicode on the basis that it does not allow
"private" characters, because it does:

http://www.lists.ic.ac.uk/hypermail/xml-dev/xml-dev-Oct-1998/0366.html
http://www.ascc.net/xml/en/utf-8/faq/faq-xsl.html

3. Expat outputs UTF-16 so it is ready for 20-bit Unicode, wherein we
will find:

"Plane 1 is going to hold ancient and invented scripts and musical
symbols, while Plane 2 (U-0002xxxx) is reserved for additional Han
ideographs, Plane 14 (U-000Exxxx) is going to start with some meta
characters for language tagging and there are two entire bonus
private-use planes."

Python itself will not handle 20-bit characters yet, so the situation
with them will be just like the situation with 16-bit characters in
Python/xmllib today (Python will think that they are two characters).
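
To make the "two characters" point concrete, here is a small sketch
that lists the 16-bit code units UTF-16 uses for a string. It assumes
current Python 3 and its utf-16-be codec; the function name
utf16_code_units and the choice of U+1D11E (a Plane 1 musical symbol)
are just for illustration.

```python
import struct


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")      # big-endian, no byte order mark
    count = len(data) // 2               # every code unit is two bytes
    return list(struct.unpack(">%dH" % count, data))


if __name__ == "__main__":
    # A character inside the 16-bit range is a single code unit.
    print([hex(u) for u in utf16_code_units("A")])
    # A character beyond it becomes a surrogate pair: two code units.
    print([hex(u) for u in utf16_code_units("\U0001D11E")])
```

Code that indexes or counts by 16-bit unit sees the second case as two
separate items, which is the behaviour described above.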

 Paul Prescod


"Andrew M. Kuchling" wrote:
> 
> The XML-SIG's developer's day session went well, and, unlike most DD
> sessions, we actually achieved consensus on something. :) To summarize
> the outcome:
> 
>     * The current PyDOM code will be dropped and replaced with 4DOM.
>       The precise details of how this will work are still to be
>       resolved; will the 4DOM code move into xml.dom, or will xml.dom
>       import from xml.Ft.dom and provide some wrappers?
> 
>     * PyExpat's interface will be changed to be SAX-like, and we'll
>       lobby Guido to add PyExpat to 1.6, along with Expat itself.  It
>       will be renamed, preferably to something with SAX in the name.
>       (expat_sax? pysax?  pyxml? whatever...)  It'll be updated to
>       support all the features in current versions of Expat; Jim
>       Fulton has an updated version of PyExpat inside Zope that will
>       probably be used.
> 
>     * xmllib.py will be left unmodified, though it'll be deprecated in
>       favor of PyExpat.
> 
>     * When 1.6 begins supporting Unicode, we'll fork the development
>       tree into two branches; the branch that works with 1.5 will be
>       maintained, though probably not actively developed.  This will
>       leave the other branch free to use 1.6-specific features without
>       worrying about backward compatibility.
> 
> If I've forgotten something from the session, please let me know.
> 
> --
> A.M. Kuchling                   http://starship.python.net/crew/amk/
> First things first, but not necessarily in that order.
>     -- The Doctor, in John Flanagan and Andrew McCulloch's _Meglos_
> 

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
The new revolutionaries believe the time has come for an aggressive 
move against our oppressors. We have established a solid beachhead 
on Friday. We now intend to fight vigorously for 'casual Thursdays.'
  -- who says America's revolutionary spirit is dead?