[XML-SIG] Expat answers
Paul Prescod
paul@prescod.net
Fri, 28 Jan 2000 16:04:59 -0600
Answers to questions that people have asked me about Expat:
1. Expat can parse DTDs if we want it to. If you compile DTD support in,
then you can turn parameter entity parsing on or off on a per-parser
instance basis:
"""
XML_PARAM_ENTITY_PARSING_NEVER
Don't parse parameter entities or the external subset
XML_PARAM_ENTITY_PARSING_UNLESS_STANDALONE
Parse parameter entities and the external subset unless standalone was
set to "yes" in the XML declaration.
XML_PARAM_ENTITY_PARSING_ALWAYS
Always parse parameter entities and the external subset
"""
It doesn't validate, but we could probably build that in Python.
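As a rough sketch of what the per-parser switch looks like from Python: this assumes the xml.parsers.expat module as it ships with current Python releases (which postdates this message), where the three constants above and SetParamEntityParsing() are exposed. The helper name make_parser is made up for illustration.

```python
from xml.parsers import expat

def make_parser(mode):
    """Create an Expat parser and request a parameter-entity parsing mode.

    make_parser is a hypothetical helper, not part of any library.
    """
    parser = expat.ParserCreate()
    # SetParamEntityParsing returns a true value if the mode was accepted.
    # Modes other than NEVER can be refused if Expat was built without DTD support.
    accepted = bool(parser.SetParamEntityParsing(mode))
    return parser, accepted

seen = []
parser, accepted = make_parser(expat.XML_PARAM_ENTITY_PARSING_NEVER)
parser.StartElementHandler = lambda name, attrs: seen.append(name)
parser.Parse(b'<?xml version="1.0"?><doc><item/></doc>', True)
print(accepted, seen)
```

Each parser instance carries its own setting, so one program can mix parsers that read the external subset with parsers that never do.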
2. Expat can be compiled to output either UTF-8 or UTF-16 (which is for
our purposes the same as UCS-2). It is theoretically possible to make a
parser that understands Unicode enough to do proper well-formedness
checking yet leaves characters in their native encoding but as far as I
know, no such tool exists. I don't believe that sgmlop could ever be
that tool, even when it is rewritten on top of Fredrik's fast Unicode
regexp engine because that engine would still be UTF-16/UCS-2 specific.
If you need to process shift-JIS information then you need to allow
Expat to convert it to UTF-16 and then convert it back to shift-JIS. I
don't think that there is any XML parser in the world that allows you to
work in any arbitrary native encoding with no conversions. Maybe some
day.
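The round trip for shift-JIS data might look roughly like the following. It is only a sketch under some assumptions: it uses the xml.parsers.expat module and the "shift_jis" codec from current Python releases, the conversion into Unicode is done in Python before the text reaches Expat, and the binding hands back Unicode strings rather than raw UTF-16 buffers. The sample document is invented.

```python
from xml.parsers import expat

# An invented sample document, held in its "native" shift-JIS form.
sjis_bytes = "<doc>\u65e5\u672c\u8a9e</doc>".encode("shift_jis")

chunks = []
parser = expat.ParserCreate()
# Character data may arrive in several pieces, so collect and join them.
parser.CharacterDataHandler = chunks.append

# Step 1: convert shift-JIS to Unicode, then let Expat parse it.
parser.Parse(sjis_bytes.decode("shift_jis"), True)

# Step 2: convert the parsed character data back to shift-JIS.
result = "".join(chunks).encode("shift_jis")
print(result)
```

Both conversions are explicit here, which is the cost described above: the data is never processed in its native encoding.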
Handling for non-Unicode character sets is simply not supported. The XML
world decided specifically against this based on two arguments:
* one cannot argue against Unicode on the basis of character encoding
*efficiency* because we allow any encoding (even those compatible with
the Unicode subset of shift-JIS etc.) to be used.
* one cannot argue against Unicode on the basis that it does not allow
"private" characters because it does:
http://www.lists.ic.ac.uk/hypermail/xml-dev/xml-dev-Oct-1998/0366.html
http://www.ascc.net/xml/en/utf-8/faq/faq-xsl.html
3. Expat outputs UTF-16 so it is ready for 20-bit Unicode, wherein we
will find:
"Plane 1 is going to hold ancient and invented scripts and musical
symbols, while Plane 2 (U-0002xxxx) is reserved for additional Han
ideographs, Plane 14 (U-000Exxxx) is going to start with some meta
characters for language tagging and there are two entire bonus
private-use planes."
Python itself will not handle 20-bit characters yet, so the situation
with them will be just like the situation with 16-bit characters in
Python/xmllib today (Python will think that they are two characters).
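To make the "two characters" point concrete, here is a small sketch of how UTF-16 represents a character outside the first 16 bits as a pair of 16-bit code units (a surrogate pair). It assumes a current Python release, where such a character is a single str element; the musical symbol is just an illustrative pick from Plane 1.

```python
import struct

clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, a Plane 1 character

# Encode as big-endian UTF-16 and split the bytes into 16-bit code units.
encoded = clef.encode("utf-16-be")
units = struct.unpack(">%dH" % (len(encoded) // 2), encoded)

# The same pair computed by hand: subtract 0x10000, then split the
# remaining 20 bits into a high and a low 10-bit half.
offset = ord(clef) - 0x10000
high = 0xD800 + (offset >> 10)
low = 0xDC00 + (offset & 0x3FF)

print([hex(u) for u in units], hex(high), hex(low))
```

Code that treats each 16-bit unit as a character will see two items here, which is the situation described above.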
Paul Prescod
"Andrew M. Kuchling" wrote:
>
> The XML-SIG's developer's day session went well, and, unlike most DD
> sessions, we actually achieved consensus on something. :) To summarize
> the outcome:
>
> * The current PyDOM code will be dropped and replaced with 4DOM.
> The precise details of how this will work are still to be
> resolved; will the 4DOM code move into xml.dom, or will xml.dom
> import from xml.Ft.dom and provide some wrappers?
>
> * PyExpat's interface will be changed to be SAX-like, and we'll
> lobby Guido to add PyExpat to 1.6, along with Expat itself. It
> will be renamed, preferably to something with SAX in the name.
> (expat_sax? pysax? pyxml? whatever...) It'll be updated to
> support all the features in current versions of Expat; Jim
> Fulton has an updated version of PyExpat inside Zope that will
> probably be used.
>
> * xmllib.py will be left unmodified, though it'll be deprecated in
> favor of PyExpat.
>
> * When 1.6 begins supporting Unicode, we'll fork the development
> tree into two branches; the branch that works with 1.5 will be
> maintained, though probably not actively developed. This will
> leave the other branch free to use 1.6-specific features without
> worrying about backward compatibility.
>
> If I've forgotten something from the session, please let me know.
>
> --
> A.M. Kuchling http://starship.python.net/crew/amk/
> First things first, but not necessarily in that order.
> -- The Doctor, in John Flanagan and Andrew McCulloch's _Meglos_
--
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
The new revolutionaries believe the time has come for an aggressive
move against our oppressors. We have established a solid beachhead
on Friday. We now intend to fight vigorously for 'casual Thursdays.'
-- who says America's revolutionary spirit is dead?