[XML-SIG] CDATA sections still not handled

Wed, 17 Jan 2001 18:47:19 +0100

> A CDATA section that contains xml 

The entire document is xml; you probably mean

"A CDATA section that contains markup delimiters"

here. A CDATA section, by definition, contains only characters. It
never contains markup.

> will be translated by a parser into a text node that is still valid
> by virtue of the character references that it places in place of
> characters such as "<" ... i.e. &lt;, and that for example if they
> wrote some naff xml in an input , eg "&&<name><<" this, if escaped
> in the original document by CDAT, would be translated into a text
> node with "&amp;&amp;&lt;name>&lt;&lt;".

Not exactly. Character references and the predefined entity references
are replaced with the characters they stand for in the DOM tree. The
CDATA section appears in the DOM tree as a text node holding its literal
contents, and text written as "&lt;" in the input becomes "<" when the
DOM tree is created.
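
As a minimal sketch of this, assuming Python's standard
xml.etree.ElementTree as the parser (it builds an element tree rather
than a DOM, but the point is the same) and a made-up element name
"input": the same characters written as a CDATA section and as escaped
text come out of the parser as one and the same string.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents: the same characters written in two ways.
as_cdata = "<input><![CDATA[&&<name><<]]></input>"
as_escaped = "<input>&amp;&amp;&lt;name>&lt;&lt;</input>"

text_from_cdata = ET.fromstring(as_cdata).text
text_from_escaped = ET.fromstring(as_escaped).text

# Both are plain characters in the tree; no "&lt;" survives parsing.
print(text_from_cdata)
print(text_from_cdata == text_from_escaped)
```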

It is the *output* function that does any necessary escaping. So if the
CDATA section contained a literal "<", the pretty printer has the
option, on output, of generating &lt; or &#60; or a CDATA section.
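
A short sketch of that, again assuming Python's
xml.etree.ElementTree (my choice for illustration, not something the
question mentioned): the three spellings parse to the same text, and
the serializer picks one of them when writing the tree back out.

```python
import xml.etree.ElementTree as ET

# Three equivalent ways of writing the same character data.
forms = [
    "<input>a &lt; b</input>",
    "<input>a &#60; b</input>",
    "<input><![CDATA[a < b]]></input>",
]
texts = [ET.fromstring(form).text for form in forms]

# Escaping happens on output; this serializer happens to choose "&lt;".
serialized = ET.tostring(ET.fromstring(forms[2]), encoding="unicode")
print(texts)
print(serialized)
```

Parsing the serialized string again yields the original characters, so
nothing is lost by the choice of escape.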

> Now if that CDATA was supposed to be xml as well, but was
> necessarily hidden for a while so that validation could be performed
> further along a processing chain,

It seems you are trying to use XML in a way not supported by any
standard. If you have a CDATA section, it contains characters by
definition; you can't suppose that these characters are markup.

> then I also need to write a processor to replace the character
> references, in which case I could possibly define <!ENTITY> s for
> such a translation, so that the parser would see < instead of &lt;

No. Each conforming XML parser knows that &lt; represents "<" - you
don't need to supply an entity definition for that. It also knows that
"<" cannot be represented as "<" in text; section 2.4 of the
recommendation clearly says

# The ampersand character (&) and the left angle bracket (<) may
# appear in their literal form only when used as markup delimiters, or
# within a comment, a processing instruction, or a CDATA section. ...
# If they are needed elsewhere, they must be escaped using either
# numeric character references or the strings "&amp;" and "&lt;"
# respectively.

So when generating XML, a conforming processor will only emit "<"
outside a CDATA section to mean the markup delimiter.
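
Both halves can be checked with a small sketch, assuming Python's
xml.etree.ElementTree (which uses expat underneath) and a made-up
element name: "&lt;" resolves without any DTD or <!ENTITY>
declaration, while a literal "<" in text is rejected as not
well-formed.

```python
import xml.etree.ElementTree as ET

# No DTD and no <!ENTITY> declaration: "&lt;" is predefined.
resolved = ET.fromstring("<input>&lt;</input>").text

# A literal "<" in character data is a well-formedness error.
try:
    ET.fromstring("<input>a < b</input>")
    literal_accepted = True
except ET.ParseError:
    literal_accepted = False

print(repr(resolved), literal_accepted)
```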

> Just the above, one wants to take the CDATA at some point and treat
> it as either an xml document on its own, or just part of the current
> xml document.

That is not supported by the XML recommendation. A CDATA section only
contains characters, not markup. So if you treat CDATA sections in any
other way, you violate the XML recommendation.
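
To illustrate, here is a sketch assuming Python's
xml.etree.ElementTree and hypothetical element names "doc" and
"payload": the parser creates no element for the "<name>" inside the
CDATA section; it only hands over the characters.

```python
import xml.etree.ElementTree as ET

# Hypothetical document whose CDATA section happens to look like markup.
source = "<doc><payload><![CDATA[<name>x</name>]]></payload></doc>"
payload = ET.fromstring(source).find("payload")

# The parser created no <name> child; it saw characters only.
child_count = len(payload)
characters = payload.text

print(child_count)
print(characters)
```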

> The CDATA simply being used to escape sections that could possibly
> break validation at earlier points, eg on a server, where there may
> be no chance of handling bad xml sections, but that at a later
> point, eg some client application, then an exception can be handled
> nicely, in which case the CDATA section can now be safely
> interpreted.  This is where I see I need reverse translation, and
> simply cannot directly parse what use to be a CDATA section.

You would need to invent a new markup language for that kind of
processing; XML does not support that kind of interpretation of a
document.

Regards,
Martin