[XML-SIG] Status of XML 1.1 processing in Python?
ken.beesley at xrce.xerox.com
Tue Aug 30 10:07:10 CEST 2005
Thanks to Uche Ogbuji for the response. I insert some comments below.
Uche Ogbuji wrote:
>On Wed, 2005-08-24 at 13:27 +0200, Ken Beesley wrote:
>>Many thanks to Fredrik Lundh, Fred Drake and Daniel Veillard
>>for information on the status of XML 1.1 processing in Python.
>>I'll do my best to do some testing and report back.
>> Why I need XML 1.1 characters
>>In case anyone is interested, my goal is to facilitate
>>the definition of new Unicode input methods for Mac OS X.
>>Apple already supplies a very human-UNfriendly XML
>>language for defining new input methods. I have defined a
>>new human-friendly XML language and need to convert my
>>human-friendly XML files automatically to Apple's human-unfriendly
>>XML language.
>>The basic idea of input methods is that they
>>intercept incoming key events, or sequences of key events, and
>>map them into Unicode-character outputs that are sent,
>>e.g., to the buffer of a Unicode text editor. Some of these
>>Unicode output characters are control characters that are
>>invalid in XML 1.0 but valid in XML 1.1. (I.e. when you
>>press appropriate "control" keys on your keyboard, the output
>>to the application is naturally a "control character".)
>>If you define a new OS X input method in Apple's current
>>XML format, the XML file contains control characters that
>>are valid only in XML 1.1. The underlying (mystery) Apple
>>parser that processes that XML file does _not_ choke on
>>the control characters,
>>so this processor is assuming the XML 1.1 character set,
>>even if the XML file is overtly marked version="1.0". That's
>>a no-no, of course; if the file is marked version="1.0", then
>>any kosher XML processor should refuse to parse/process
>>the file if it contains control characters not valid in XML 1.0.
>>My human-friendly XML language is defined in Relax NG,
>>and when I specify version="1.1", the files validate as they
>>should using Jing. (If I change the attribute to version="1.0", then
>>Jing properly refuses to validate the files because of the invalid
>>control characters.) So far so good.
>>But then when I try to write a Python script to parse
>>the human-friendly XML language and convert it (very non-trivially)
>>to the human-unfriendly XML language defined by Apple,
>>the Python script (if limited to XML 1.0 processing) chokes
>>as soon as it sees the offending control characters. Sigh.
>>Hence my need for a Python XML parsing/processing module
>>that handles XML 1.1 characters when the file is
>>appropriately marked version="1.1".
>Interesting, and thanks for taking the time to state your case so clearly.
>XML 1.1 does allow more control chars than 1.0, but some are still
>banned, so I don't think you have a comprehensive solution here.
The value 0x0000 (null) is still banned, but in XML 1.1 all characters
from 0x0001 through 0x001F are now legal, as long as they are
expressed inside an XML 1.1 document as character references.
The addition of these characters solves the problem for
Apple OS X input methods. In fact, characters like &#x0008;
(Backspace) already appear (as character references) in
existing Apple OS X input methods (written in the
unfriendly XML format mentioned in my previous message).
If you press the Backspace key on your keyboard, you want
the input method to pass "the backspace character"
through to the application. One might suspect that applications
like this prompted the change in 1.1.
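
To make the character rules above concrete, here is a minimal sketch of the relevant productions as I read them in the XML 1.0 and XML 1.1 Recommendations. The function names are my own and purely illustrative; they are not part of any Python XML module.

```python
def is_xml10_char(cp: int) -> bool:
    """True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml11_char(cp: int) -> bool:
    """True if cp matches the XML 1.1 Char production (U+0000 stays banned)."""
    return (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml11_restricted(cp: int) -> bool:
    """True if cp is an XML 1.1 RestrictedChar, i.e. a character that
    may appear in a 1.1 document only as a character reference."""
    return (
        0x1 <= cp <= 0x8
        or 0xB <= cp <= 0xC
        or 0xE <= cp <= 0x1F
        or 0x7F <= cp <= 0x84
        or 0x86 <= cp <= 0x9F
    )
```

For Backspace (0x0008), is_xml10_char is false while is_xml11_char and is_xml11_restricted are both true, which is exactly the case that matters for the input methods.
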
The existing (hidden) Mac parser that parses XML specifications
of input methods (into a low-level binary format) already
handles &#x0008; and other control characters now legal in XML 1.1.
So this hidden Mac parser is XML 1.1-capable, at least as far as
control characters are concerned.
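
By contrast, the problem on the Python side can be reproduced with a few lines against the Expat binding (xml.parsers.expat). This is only a sketch: the element name "key" and attribute name "output" are made up for the test and are not meant to reproduce Apple's format exactly, and the helper expat_accepts is my own.

```python
import xml.parsers.expat


def expat_accepts(document: bytes) -> bool:
    """Return True if Expat parses the document without raising an error."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(document, True)  # True marks this as the final chunk
    except xml.parsers.expat.ExpatError:
        return False
    return True


# Hypothetical test documents; only the character reference matters here.
plain = b'<?xml version="1.0"?><key output="a"/>'
backspace_10 = b'<?xml version="1.0"?><key output="&#x0008;"/>'
backspace_11 = b'<?xml version="1.1"?><key output="&#x0008;"/>'

for label, doc in [("plain", plain),
                   ("backspace, 1.0", backspace_10),
                   ("backspace, 1.1", backspace_11)]:
    print(label, "->", "parsed" if expat_accepts(doc) else "rejected")
```

Since Expat implements XML 1.0 only, I would expect the Backspace reference to be rejected whatever version the declaration claims.
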
>My suggested solution would be something Mike Brown and I have often
>discussed: mapping the illegal characters to the Unicode private use
>area (PUA), and then back, as needed. Python should make this an easy
>solution. You can also use special elements or processing instructions
>to encode the problem characters.
>I do not suggest relying on XML 1.1 in the way you propose because
>uptake for it is slow not only in the Expat world. It's a pretty
>controversial spec (like every second-gen spec the W3C produces, it
>Even if libxml folks and the Expat folks actually commit to XML 1.1
>support, it will probably be a little while in coming, so I suggest a
>more general workaround, such as I've suggested, within the bounds of XML
>1.0.
Yes, one can obviously cobble together some kind of work-around,
but it's unattractive when XML 1.1 has existed for a year and a
half and would solve the problem (and when references like
&#x0008; are already being
handled in Apple's own XML input-method language). When I
define my own more human-friendly XML, forcing the use of PUA
characters (which would get mapped to 1.1 control character
references in the unfriendly XML output) puts an unattractive
and unintuitive gap between my (hopefully) friendly XML
language and Apple's existing unfriendly XML. Sigh.
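
For the record, the PUA work-around discussed above might look roughly like the following. It is a sketch only: the choice of U+E000 as the base of the stand-in range is my own assumption (it must not collide with any PUA characters the input methods legitimately use), and all names are illustrative.

```python
PUA_BASE = 0xE000  # assumption: U+E001..U+E01F are otherwise unused

# C0 controls that XML 1.0 forbids (tab, LF and CR are already legal).
_CONTROLS = [cp for cp in range(0x01, 0x20) if cp not in (0x09, 0x0A, 0x0D)]
_TO_PUA = {cp: PUA_BASE + cp for cp in _CONTROLS}
_FROM_PUA = {pua: cp for cp, pua in _TO_PUA.items()}


def controls_to_pua(text: str) -> str:
    """Replace XML 1.0-illegal controls with PUA stand-ins."""
    return text.translate(_TO_PUA)


def pua_to_controls(text: str) -> str:
    """Map the PUA stand-ins back to the real control characters."""
    return text.translate(_FROM_PUA)


def pua_to_charrefs(text: str) -> str:
    """Render PUA stand-ins as character references such as &#x0008;.

    The rest of the text is passed through untouched, so it must
    already be escaped for XML output.
    """
    return "".join(
        "&#x%04X;" % _FROM_PUA[ord(ch)] if ord(ch) in _FROM_PUA else ch
        for ch in text
    )
```

The friendly XML file would then carry U+E008 where Backspace is meant, an XML 1.0 parser would accept it, and the conversion script would call pua_to_charrefs when writing Apple's format. That is precisely the unintuitive gap I would prefer to avoid.
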
I see that pxdom claims to be pure Python and claims to
handle XML 1.1.
I'm not very excited about using DOM of any kind, but
perhaps it's a solution.