[XML-SIG] CDATA sections still not handled

matt matt@virtualspectator.com
Thu, 18 Jan 2001 22:27:46 +1300


... comments throughout ...


On Thu, 18 Jan 2001, Norman Walsh wrote:
> / matt <matt@virtualspectator.com> was heard to say:
> | On Wed, 17 Jan 2001, Martin v. Loewis wrote:
> [...]
> | > I understand you are not interested in parsing the document; if you
> | > build a DOM tree, parsing of the document will happen as a side
> | > effect. You cannot avoid this: this is the only way to get a DOM tree
> | > from a document. So while you are not interested in the parsing, you
> | > should accept that it is done.
> | 
> | This is where I see the extra step that is necessary, so tell me if
> | I am on the right track.
> 
> I'm not trying to be pedantic, it just looks that way :-)
> 
> | A CDATA section that contains xml will be translated by a parser
> 
> A CDATA section cannot contain XML. It contains text, with a
> particular form of escaping.

Ok, so now I am being pedantic, but this is good: I'm getting a clearer idea of
xml usage.  My entry to xml has been recent and only from the building side of
documents, but now that I have to process them heavily it's nice to reason
these things out.

From what I am seeing, CDATA can hold anything it wants (other than the
terminating "]]>"), within the constraints of the character encoding.  Say I
formed my own language that happened to use things like "<" very often; then
CDATA gives me an "initial" way to write this in a plain, raw form, without
translating it to entity references first.  This is nice, since the new
language section within the xml document is still human readable.  It won't
matter which way you go from the point of view of the parser, because, for
example, expat will recognize it as character data either by virtue of the
CDATA escaping or by the alternative replacement of all xml markup in that
section by entity references.

There is no way around the fact that CDATA allows you to write xml, programming
code ... whatever you want inside it.  The parser will NOT try to parse it.
For all I care, I could have encoded it with BASE64; I don't need it to be
parsed as part of the document.
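
As a rough illustration of the point above, here is a minimal sketch using
Python's xml.parsers.expat.  The helper name collect_text and the sample
documents are made up for this example.  It only shows that expat hands over
the same character data whether the markup characters were hidden with CDATA
or with entity references.

```python
import xml.parsers.expat


def collect_text(document):
    """Return all character data expat reports for `document` as one string."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    # expat may deliver the text in several pieces, so collect and join them.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(document, True)
    return "".join(chunks)


# The same made-up "language" fragment, escaped in two different ways.
cdata_form = "<msg><![CDATA[if (a < b && b < c) { run(); }]]></msg>"
entity_form = "<msg>if (a &lt; b &amp;&amp; b &lt; c) { run(); }</msg>"

text_from_cdata = collect_text(cdata_form)
text_from_entities = collect_text(entity_form)

print(text_from_cdata)
print(text_from_cdata == text_from_entities)
```

The contents of the CDATA section are never interpreted as markup; they just
arrive as text.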


> 
> | into a text node that is still valid by virtue of the character
> | references that it places in place of characters such as "<"
> | ... i.e. &lt;, and that for example if they wrote some naff xml in
> | an input , eg "&&<name><<" this, if escaped in the original document
> | by CDAT, would be translated into a text node with
> | "&amp;&amp;&lt;name>&lt;&lt;".
> 
> I think about this in a different way. Parsing a document that contains
> <![CDATA[&&<name><<]]> produces an XML information set that includes
> a text node that contains the Unicode characters 
> 
>   "&" "&" "<" "n" "a" "m" "e" ">" "<" "<"
> 
> These characters are not escaped in any way.

Nope, not after they have been parsed, but they certainly were when they were
part of the CDATA section in the original document.  As the specification says,
they are used to ESCAPE blocks of text containing characters which would
otherwise be recognized as markup.  More on this below ....


> 
> If the processor subsequently has reason to serialize the text node
> in question, it may use any (or all) of the following mechanisms to
> do so:
> 
> 1. CDATA sections
> 2. The predefined entities &lt; and &amp;
> 3. Using numeric character references, &#60; and &#38; (in either
>    decimal or hex).
> 
> If the document is known to have additional entity declarations associated
> with it, these entities may also be used (for example, &gt;).
> 
> |  Now if that CDATA was supposed to be
> | xml as well, but was necessarily hidden for a while so that
> | validation could be performed further along a processing chain, then
> | I also need to write a processor to replace the character
> | references, in which case I could possibly define <!ENTITY> s for
> | such a translation, so that the parser would see < instead of &lt;
> 
> There's no easy means to "unescape" these characters in an XML
> processor. You can do it with Python, or some other non-XML string
> processing language, and you could do it with XSLT using
> disable-output-escaping (in some limited circumstances).
> 
> | many people who pick up a document and modify it and put it back.
> 
> Assuming I haven't made any typos, the following serializations of a
> text node:
> 
>   <![CDATA[&&<name><<]]>
>   &amp;&amp;&lt;name>&lt;&lt;
>   &amp;&amp;&lt;n&#97;me>&lt;&lt;
>   <![CDATA[&&]]>&lt;name><![CDATA[<<]]>
> 
> are indistinguishable to an XML processor. 

yes, I realize that.
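
For anyone who wants to check this, a small sketch with Python's
xml.etree.ElementTree follows.  The wrapper element <t> is an assumption
added only so that each serialization becomes a complete document.

```python
import xml.etree.ElementTree as ET

# The four serializations quoted above, copied as written.
serializations = [
    "<![CDATA[&&<name><<]]>",
    "&amp;&amp;&lt;name>&lt;&lt;",
    "&amp;&amp;&lt;n&#97;me>&lt;&lt;",
    "<![CDATA[&&]]>&lt;name><![CDATA[<<]]>",
]

# Wrap each one in a hypothetical <t> element and read back the text node.
texts = [ET.fromstring("<t>" + s + "</t>").text for s in serializations]
distinct = set(texts)

print(distinct)
```

All four parse to the same ten characters, so the set has a single member.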

> It *doesn't matter* what
> escaping mechanism you use, unless you are including non-XML
> processors.  If you're using non-XML processors, you may care about
> the escaping, but XML isn't designed to help you with that problem.
> (And you may care about other things that XML can't help you with,
> like the serialization order of attributes.)
> 
> | Just the above, one wants to take the CDATA at some point and treat it as
> | either an xml document on its own, or just part of the current xml document. 
> | The CDATA simply being used to escape sections that could possibly break
> | validation at earlier points, eg on a server, where there may be no chance of
> | handling bad xml sections, but that at a later point, eg some client
> | application, then an exception can be handled nicely, in which case the CDATA
> | section can now be safely interpreted.  This is where I see I need reverse
> | translation, and simply cannot directly parse what use to be a CDATA section.
> 
> Don't do that. I'm serious. You don't say exactly what problem you're
> trying to solve, but the solution you're outlining is ugly and
> fragile. (IMHO, naturally.)

No it's not.  If I put base64 encoded, gzip compressed versions of the same
"escaped xml fragments" that I want to hide, then that would seem to make you
happy.  These xml documents are a transport, and when a transport is
interpreted, certain tags may mean "do something with the character data of
this node".  All seems pretty normal to me.  For example, say one wants to
transport html.  Now html is usually really ugly in that it is hardly ever
well formed xml.  Escaping it with CDATA is an easy way to hide that, and
giving that data to an html renderer some time later would be fine.  Being in
CDATA, it is never parsed for "well formedness".
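
A sketch of that html case, assuming a made-up transport format with
<transport> and <body> elements.  Python's html.parser stands in for "an html
renderer" here, and the TagLister class is hypothetical; it only lists the
start tags it sees.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Stand-in for an html renderer: records the start tags it encounters."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


# The html payload is not well formed xml (<br> and <b> are never closed),
# but inside CDATA the xml parser never looks at it as markup.
packet = (
    "<transport><body>"
    "<![CDATA[<p>Hello<br>world <b>unclosed</p>]]>"
    "</body></transport>"
)

payload = ET.fromstring(packet).findtext("body")

# Some time later, hand the raw text to something that understands html.
lister = TagLister()
lister.feed(payload)
lister.close()

print(payload)
print(lister.tags)
```

The transport document parses cleanly, and the html comes out exactly as it
went in.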

Of course now I understand that a DOM implementation will remove the CDATA
markers and, when the tree is written back out, replace the character data
between them with entity references where necessary.  If this is then
persisted to disk and later parsed with an xml handler, then the real
characters will come back out again in the character stream for the text node.
So that is fine too, I get back what I put in, and who cares whether it was
xml, or someone's program code.
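
A minimal sketch of that round trip.  Note the assumption: it uses
xml.etree.ElementTree rather than a DOM implementation, because ElementTree
does not keep CDATA sections and so shows the entity reference behaviour
plainly.  Some DOM libraries keep CDATA sections as separate nodes and may
write them back out unchanged.

```python
import xml.etree.ElementTree as ET

original = "<msg><![CDATA[&&<name><<]]></msg>"

# Parse: the CDATA markers disappear and the text is held as raw characters.
root = ET.fromstring(original)

# Persist: the serializer escapes the markup characters with entity references.
persisted = ET.tostring(root, encoding="unicode")

# Parse again: the real characters come back out.
reparsed = ET.fromstring(persisted)

print(persisted)
print(reparsed.text == root.text)
```

The persisted form no longer contains a CDATA section, yet the text that comes
back is the same as what went in.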

So the conclusion is that CDATA is just a useless feature if you are
parsing it into a DOM tree.  All it gives you is a free way of translating
markup to entity references.  That is nice in that sense, but not so nice in
that your previously escaped sections are no longer very human readable.  And
this can be a problem.  If someone complains that, for example, their message,
which was transported via some transport xml, looked weird, and all that you
have is the raw transport packets on your server, then if things are still
wrapped in nice CDATA sections you can easily look through them and find the
improper formatting in the message.  However, if the message has been
translated into entity references, then forget it; you may as well be looking
at binary in a hex editor in some instances.
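
One possible way to keep packets readable, sketched with Python's
xml.dom.minidom under the assumption that the code writing the transport
document controls how the text node is created.  The element name "message" is
made up.  createCDATASection is part of the DOM API; whether other DOM
implementations preserve such nodes through a parse and re-serialize cycle is
not something this sketch shows.

```python
from xml.dom import minidom

doc = minidom.Document()
message = doc.createElement("message")
doc.appendChild(message)

# Ask for a CDATA section explicitly instead of an ordinary text node.
# The data must not contain "]]>", which would end the section early.
message.appendChild(doc.createCDATASection("&&<name><<"))

readable = message.toxml()
print(readable)

# Reading it back gives the same characters.
recovered = minidom.parseString(readable).documentElement.firstChild.data
print(recovered)
```

Written this way, the packet on the wire still shows the raw characters.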

regards
Matt




> 
>                                         Be seeing you,
>                                           norm
> 
> -- 
> Norman Walsh <ndw@nwalsh.com> | Life is a great bundle of little
> http://nwalsh.com/            | things.--Oliver Wendell Holmes