[XML-SIG] CDATA sections still not handled

matt matt@virtualspectator.com
Thu, 18 Jan 2001 10:03:32 +1300


hmm, I'm off track again ....

On Thu, 18 Jan 2001, you wrote:
> > A CDATA section that contains xml 
> 
> The entire document is xml; you probably mean
> 
> "A CDATA section that contains markup delimiters"
> 
> here. A CDATA section, by definition, contains only characters. It
> never contains markup.
> 
> > will be translated by a parser into a text node that is still valid
> > by virtue of the character references that it places in place of
> > characters such as "<" ... i.e. &lt;, and that for example if they
> > wrote some naff xml in an input, eg "&&<name><<" this, if escaped
> > in the original document by CDATA, would be translated into a text
> > node with "&amp;&amp;&lt;name>&lt;&lt;".
> 
> Not exactly. Character entities will be replaced with their true
> characters in the DOM tree, i.e. the CDATA section will appear in the
> DOM tree as a text node with its contents; a text containing "&lt;" in
> the input will be translated to "<" when creating the DOM tree.
> 


This translation obviously happens after validation, since invalid xml-like
data inside a CDATA section is never validated against, which is what I want.
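
For what it's worth, the behaviour described in the quoted paragraph is easy
to check. The following is only a rough sketch using Python's standard
xml.etree.ElementTree module (my choice for illustration, nothing in this
thread depends on it); the element name "msg" and the variable names are made
up. It parses the same input once wrapped in a CDATA section and once escaped
by hand, and compares the text the parser hands back.

```python
import xml.etree.ElementTree as ET

# The same "naff" input, once inside a CDATA section and once escaped by hand.
# The element name "msg" is hypothetical.
cdata_doc = "<msg><![CDATA[&&<name><<]]></msg>"
escaped_doc = "<msg>&amp;&amp;&lt;name>&lt;&lt;</msg>"

# In both cases the parser delivers the real characters, not the references.
cdata_text = ET.fromstring(cdata_doc).text
escaped_text = ET.fromstring(escaped_doc).text

print(repr(cdata_text))
print(cdata_text == escaped_text)
```

Both documents should yield the same string, with literal "&" and "<"
characters in it rather than character references.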

> It is the *output* function that does any necessary escaping. So when
> the CDATA section contained a literal "<", then, on output, the pretty
> printer has the option of generating &lt; or &#60; or a CDATA section.
> 
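
A small sketch of that output-side escaping, again assuming the Python
standard library (xml.sax.saxutils.escape and xml.etree.ElementTree); the
element name "msg" and the variable names are hypothetical. Note that
ElementTree always emits character references here; whether a given pretty
printer emits a CDATA section instead is up to that printer.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

text = "&&<name><<"  # the characters held in the text node

# Escaping by hand: "&", "<" and ">" are replaced with entity references.
escaped = escape(text)

# Letting a serializer do it: the node keeps the raw characters and the
# escaping is applied only when the tree is written out.
elem = ET.Element("msg")
elem.text = text
serialized = ET.tostring(elem, encoding="unicode")

# Parsing the serialized form gives back the original characters.
roundtrip = ET.fromstring(serialized).text

print(escaped)
print(serialized)
print(roundtrip == text)
```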
> > Now if that CDATA was supposed to be xml as well, but was
> > necessarily hidden for a while so that validation could be performed
> > further along a processing chain,
> 
> It seems you are trying to use XML in a way not supported by any
> standard. If you have a CDATA section, it contains characters by
> definition; you can't suppose that these characters are markup.

I don't suppose they are, I know they are.

> 
> > then I also need to write a processor to replace the character
> > references, in which case I could possibly define <!ENTITY> s for
> > such a translation, so that the parser would see < instead of &lt;
> 
> No. Each conforming XML parser knows that &lt; represents "<" - you
> don't need to supply an entity definition for that. It also knows that
> "<" cannot be represented as "<" in text; section 2.4 of the
> recommendation clearly says
> 
> # The ampersand character (&) and the left angle bracket (<) may
> # appear in their literal form only when used as markup delimiters, or
> # within a comment, a processing instruction, or a CDATA section. ...
> # If they are needed elsewhere, they must be escaped using either
> # numeric character references or the strings "&amp;" and "&lt;"
> # respectively.
> 
> So when generating XML, a conforming processor will only emit "<"
> outside a CDATA section to mean the markup delimiter.
> 
> > Just the above, one wants to take the CDATA at some point and treat
> > it as either an xml document on its own, or just part of the current
> > xml document.
> 
> That is not supported by the XML recommendation. A CDATA section only
> contains characters, not markup. So if you treat CDATA sections in any
> other way, you violate the XML recommendation.

ummm, here is another confusing part ... the following is from the xml
specification :

2.7 CDATA Sections

[Definition: CDATA sections may occur anywhere character data may occur;
they are used to escape blocks of text containing characters which would
otherwise be recognized as markup. CDATA sections begin with the
string "<![CDATA[" and end with the string "]]>":]


ummm, so can you be clearer about my apparent violation of CDATA by putting
xml-like data in it?


> 
> > The CDATA simply being used to escape sections that could possibly
> > break validation at earlier points, eg on a server, where there may
> > be no chance of handling bad xml sections, but that at a later
> > point, eg some client application, then an exception can be handled
> > nicely, in which case the CDATA section can now be safely
> > interpreted.  This is where I see I need reverse translation, and
> > simply cannot directly parse what used to be a CDATA section.
> 
> You need to invent a new markup language for that kind of
> processing; XML does not support such a kind of interpretation of a
> document.


No I don't, because it works fine when the CDATA labels are kept, but you are
also saying that a parser can/should translate the character references
such as "&lt;", and looking at expat, it does, so, well, it seems to work
perfectly fine.  But now I am interested in why this is a violation.  A
perfectly acceptable use is that one uses xml to wrap a message, which itself
may be xml, but it is up to the message interpreter later on to figure out
whether it is valid.
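
To make the wrapping idea concrete, here is a minimal sketch, assuming
Python's xml.etree.ElementTree. The "envelope" and "payload" element names and
the open_envelope() helper are invented for illustration. The outer document
is parsed first; only afterwards is the payload text parsed as a document of
its own, so a bad payload can be handled as an exception at that later point.
ElementTree does not validate against a DTD, so this only checks that the
payload is well-formed.

```python
import xml.etree.ElementTree as ET


def open_envelope(envelope_xml):
    """Return (payload_text, payload_tree).

    payload_tree is None when the payload is not well-formed XML.
    The "payload" element name is a hypothetical convention.
    """
    root = ET.fromstring(envelope_xml)  # the wrapper itself must parse
    payload_text = root.findtext("payload", default="")
    try:
        # Second pass: treat the characters from the CDATA section
        # as a separate XML document.
        payload_tree = ET.fromstring(payload_text)
    except ET.ParseError:
        payload_tree = None  # the later stage decides what to do with it
    return payload_text, payload_tree


good = "<envelope><payload><![CDATA[<name>Matt</name>]]></payload></envelope>"
bad = "<envelope><payload><![CDATA[&&<name><<]]></payload></envelope>"

good_text, good_tree = open_envelope(good)
bad_text, bad_tree = open_envelope(bad)

print(good_tree.tag if good_tree is not None else None)
print(bad_tree)
```

Both envelopes parse without complaint; only the second payload fails when it
is interpreted on its own, and its raw text is still available to the caller.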


> 
> Regards,
> Martin
> 
> _______________________________________________
> XML-SIG maillist  -  XML-SIG@python.org
> http://mail.python.org/mailman/listinfo/xml-sig


regards
Matt
-------------------------------------------------------

-- 
Matt Halstead (PhD)
Research and development
VirtualSpectator
http://www.virtualspectator.com
ph 64-9-9136896