[XML-SIG] CDATA sections still not handled

Norman Walsh ndw@nwalsh.com
18 Jan 2001 15:08:05 +0700

/ matt <matt@virtualspectator.com> was heard to say:
| On Wed, 17 Jan 2001, Martin v. Loewis wrote:
| > I understand you are not interested in parsing the document; if you
| > build a DOM tree, parsing of the document will happen as a side
| > effect. You cannot avoid this: this is the only way to get a DOM tree
| > from a document. So while you are not interested in the parsing, you
| > should accept that it is done.
| This is where I see the extra step that is necessary, so tell me if
| I am on the right track.

I'm not trying to be pedantic, it just looks that way :-)

| A CDATA section that contains xml will be translated by a parser

A CDATA section cannot contain XML. It contains text, with a
particular form of escaping.

| into a text node that is still valid by virtue of the character
| references that it places in place of characters such as "<"
| ... i.e. &lt;, and that for example if they wrote some naff xml in
| an input, e.g. "&&<name><<", this, if escaped in the original document
| by CDATA, would be translated into a text node with
| "&amp;&amp;&lt;name>&lt;&lt;".

I think about this in a different way. Parsing a document that contains
<![CDATA[&&<name><<]]> produces an XML information set that includes
a text node that contains the Unicode characters:

  "&" "&" "<" "n" "a" "m" "e" ">" "<" "<"

These characters are not escaped in any way.
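
This is easy to check from Python. A minimal sketch, assuming the
standard library's xml.etree.ElementTree as the parser (my choice for
illustration) and a made-up wrapper element named "doc":

```python
import xml.etree.ElementTree as ET

# "doc" is a hypothetical wrapper element; a CDATA section needs a parent.
source = "<doc><![CDATA[&&<name><<]]></doc>"
root = ET.fromstring(source)

# The parser reports plain characters. Nothing in root.text records
# that a CDATA section was used, and nothing in it is escaped.
chars = list(root.text)
print(chars)
```

The element has no children: the "<name>" inside the CDATA section was
never markup, only text.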

If the processor subsequently has reason to serialize the text node
in question, it may use any (or all) of the following mechanisms to
do so:

1. CDATA sections
2. The predefined entities &lt; and &amp;
3. Numeric character references, &#60; and &#38; (in either
   decimal or hex)

If the document is known to have additional entity declarations associated
with it, these entities may also be used. (The predefined &gt; is always
available for ">" as well.)
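
As a rough sketch of those three mechanisms in Python: the function
names here are made up for illustration, and only escape() from
xml.sax.saxutils is a real library call. A real serializer is free to
mix the mechanisms within one text node.

```python
from xml.sax.saxutils import escape

def as_cdata(text):
    # Mechanism 1. Only usable if the text does not itself contain "]]>".
    if "]]>" in text:
        raise ValueError('text containing "]]>" cannot go in one CDATA section')
    return "<![CDATA[" + text + "]]>"

def as_entities(text):
    # Mechanism 2. escape() rewrites "&", "<" and ">" as &amp;, &lt;, &gt;.
    return escape(text)

def as_numeric(text):
    # Mechanism 3. Decimal character references, here for "&" and "<" only.
    return "".join("&#%d;" % ord(c) if c in "&<" else c for c in text)

text = "&&<name><<"
for serialize in (as_cdata, as_entities, as_numeric):
    print(serialize(text))
```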

|  Now if that CDATA was supposed to be
| xml as well, but was necessarily hidden for a while so that
| validation could be performed further along a processing chain, then
| I also need to write a processor to replace the character
| references, in which case I could possibly define <!ENTITY> s for
| such a translation, so that the parser would see < instead of &lt;

There's no easy means to "unescape" these characters in an XML
processor. You can do it with Python, or some other non-XML string
processing language, and you could do it with XSLT using
disable-output-escaping (in some limited circumstances).
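
For the Python route, a minimal sketch might look like the following.
It assumes the standard library's xml.sax.saxutils.unescape for the
string processing and xml.etree.ElementTree for the second parse;
parse_embedded is a hypothetical helper name, and the sample strings
are invented.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import unescape

def parse_embedded(text):
    """Try to parse text as a standalone XML document; None if it isn't one."""
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        return None

# Plain string processing on serialized content: &lt; back to "<", etc.
raw = unescape("&lt;name&gt;matt&lt;/name&gt;")

good = parse_embedded(raw)          # well-formed, so an Element comes back
bad = parse_embedded("&&<name><<")  # not well-formed, so None
```

Note that the second parse can fail on any input, so every consumer
of the "hidden" markup has to carry its own error handling.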

| many people who pick up a document and modify it and put it back.

Assuming I haven't made any typos, the following serializations of a
text node:

  <![CDATA[&&<name><<]]>
  &amp;&amp;&lt;name>&lt;&lt;
  &#38;&#38;&#60;name>&#60;&#60;

are indistinguishable to an XML processor. It *doesn't matter* what
escaping mechanism you use, unless you are including non-XML
processors.  If you're using non-XML processors, you may care about
the escaping, but XML isn't designed to help you with that problem.
(And you may care about other things that XML can't help you with,
like the serialization order of attributes.)
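
One way to see this from Python is to canonicalize a few variants and
compare the results. This is only a sketch: it assumes Python 3.8 or
later (for xml.etree.ElementTree.canonicalize), and the three sample
documents are invented for the purpose.

```python
import xml.etree.ElementTree as ET

# Three made-up documents: same content, different escaping,
# different attribute order and quoting.
variants = [
    '<doc a="1" b="2"><![CDATA[&&<name><<]]></doc>',
    '<doc b="2" a="1">&amp;&amp;&lt;name>&lt;&lt;</doc>',
    "<doc a='1' b='2'>&#38;&#x26;&#60;name>&#x3C;&#60;</doc>",
]

# Canonical XML picks one serialization for each information set, so
# documents that differ only in escaping collapse to the same string.
canonical = {ET.canonicalize(v) for v in variants}
print(len(canonical))
```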

| Just the above, one wants to take the CDATA at some point and treat it as
| either an xml document on its own, or just part of the current xml document. 
| The CDATA simply being used to escape sections that could possibly break
| validation at earlier points, eg on a server, where there may be no chance of
| handling bad xml sections, but that at a later point, eg some client
| application, then an exception can be handled nicely, in which case the CDATA
| section can now be safely interpreted.  This is where I see I need reverse
| translation, and simply cannot directly parse what used to be a CDATA section.

Don't do that. I'm serious. You don't say exactly what problem you're
trying to solve, but the solution you're outlining is ugly and
fragile. (IMHO, naturally.)

                                        Be seeing you,

Norman Walsh <ndw@nwalsh.com> | Life is a great bundle of little
http://nwalsh.com/            | things.--Oliver Wendell Holmes