Fwd: Re: [XML-SIG] CDATA sections still not handled

matt matt@virtualspectator.com
Thu, 18 Jan 2001 12:11:26 +1300


Now I see where you are coming from.  No, I don't expect anything to suddenly
see xml where CDATA was and interpret it within the same context of the
document containing this node.  All I am saying is that xml document A holds a
node B.  Node B happens to contain some xml, because that is part of a message
format.  A doesn't need to know about the form of B, except in so far as it is
CDATA and therefore A should not try to validate it as xml if it contains xml
markup, but it will validate the character set since, yes, it is character data.

At some point a process picks up A, searches for node B, extracts it, does NOT
assume it is xml, but will look through it for any xml that exists.  If it
finds some then it validates it ... which means that section will be cut out
and passed to an xml parser.
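
A rough sketch of that two-stage process, assuming a current Python 3
standard library (xml.etree.ElementTree).  The element names, the OUTER
sample and both helper functions are made up for illustration; they are not
part of any real message format.

```python
import xml.etree.ElementTree as ET

# Hypothetical document A: node B (<body>) carries text that happens to
# contain an xml fragment, protected by a CDATA section.
OUTER = (
    "<message>"
    "<body><![CDATA[free text <note><to>Bob</to></note> more text]]></body>"
    "</message>"
)


def extract_payload(document, tag="body"):
    """Parse A and return the character data of node B, as plain text."""
    node = ET.fromstring(document).find(tag)
    if node is None:
        raise ValueError("no <%s> element found" % tag)
    return node.text or ""


def try_parse_fragment(text, start_marker, end_marker):
    """Cut out the span between the markers and hand it to a second parser.

    Returns the parsed element, or None if the span is missing or is not
    well-formed xml.  Nothing here assumes the payload is xml.
    """
    start = text.find(start_marker)
    if start == -1:
        return None
    end = text.find(end_marker, start)
    if end == -1:
        return None
    fragment = text[start:end + len(end_marker)]
    try:
        return ET.fromstring(fragment)
    except ET.ParseError:
        return None


payload = extract_payload(OUTER)
inner = try_parse_fragment(payload, "<note>", "</note>")
```

The point is that A is parsed once and B comes out as ordinary characters;
only the second, independent call treats part of B as xml.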

The important thing that I think I understand is the following:
Any xml in the CDATA section doesn't need to look like xml to the human
reader.  A parser, however, when handling a text node may do the following:
a) if the CDATA tags are still there, then call the handlers for the start and
end of the CDATA section, and pass the character data (which may contain markup
explicitly) to the character data handler.
b) if the CDATA tags are not there, then the markup will/needs to be
represented as character references, such as &lt;, and one needs to make sure
that these are translated into the correct characters, either by the parser or
by the process reading it, before being passed to a stream for later
processing and possibly validation.
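
Cases a) and b) can be compared with a small sketch using Python's
xml.parsers.expat, assuming a current Python 3.  The function name parse_text
and the two sample strings are invented for this example.

```python
import xml.parsers.expat


def parse_text(document):
    """Return (character data, list of CDATA events) seen while parsing."""
    text = []
    events = []
    parser = xml.parsers.expat.ParserCreate()
    # Case a): the CDATA markers are reported through their own handlers.
    parser.StartCdataSectionHandler = lambda: events.append("start-cdata")
    parser.EndCdataSectionHandler = lambda: events.append("end-cdata")
    # In both cases the content arrives as plain character data, possibly
    # split over several calls, so the pieces are collected and joined.
    parser.CharacterDataHandler = text.append
    parser.Parse(document, True)
    return "".join(text), events


with_cdata = "<body><![CDATA[a <name> tag]]></body>"
with_refs = "<body>a &lt;name&gt; tag</body>"

cdata_text, cdata_events = parse_text(with_cdata)
refs_text, refs_events = parse_text(with_refs)
```

Both documents deliver the same characters to the character data handler; the
only difference is that the first one also triggers the two CDATA section
handlers.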


On Thu, 18 Jan 2001, you wrote:
> > This translation obviously happens after validation, since invalid xml like
> > data in CDATA will never be validated against.  Which is what I want.
> 
> I'm telling you: the data in CDATA are just character text, not
> markup. So no matter what text you put in there, it is always
> well-formed and valid (unless it violates the document charset).
> 

so what's this then ?

<?xml version='1.0' encoding='ISO-8859-1'?>
<text_20001222_154201>
  <body><![CDATA[some text and possibly some markup <name><<, but we don't
want to validate this yet]]>   </body>
</text_20001222_154201>
                         

looks like markup inside CDATA to me ....  I think you actually mean
"unescaped" character data does not contain markup, eg : &lt; is certainly not
markup.



> > > It seems you are trying to use XML in a way not supported by any
> > > standard. If you have a CDATA section, it contains characters by
> > > definition; you can't suppose that these characters are markup.
> > 
> > I don't suppose they are, I know they are.
> 
> Maybe in your understanding of how your application should work. Not
> in XML.

What would you say to someone who wants to let other people put html
formatting in text node data but, knowing that html is often not written as
valid xml, decides that escaping it is a safe bet?
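
One way that escaping could look, as a sketch only, assuming a current
Python 3 (xml.sax.saxutils.escape and xml.etree.ElementTree).  The helper
wrap_html and the sample html string are hypothetical.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET


def wrap_html(html, tag="body"):
    """Put arbitrary html text into an element as escaped character data."""
    # escape() replaces &, < and > with &amp;, &lt; and &gt;
    return "<%s>%s</%s>" % (tag, escape(html), tag)


# html that is not well-formed xml: <p> and <br> are never closed
html = "<p>line one<br>line two & more"

document = wrap_html(html)
recovered = ET.fromstring(document).text
```

The wrapping document stays well-formed whatever the html looks like, and the
reader gets the original html string back as character data.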



> 
> > 2.7 CDATA Sections
> > 
> > [Definition: CDATA sections may occur anywhere character data may occur;
> > they are used to escape blocks of text containing characters which would
> > otherwise be recognized as markup. CDATA sections begin with the
> > string "<![CDATA[" and end with the string "]]>":]
> > 
> 
> > ummm, so can you be clearer about my apparent violation of CDATA by
> > putting xml like data in it?
> 
> It is completely well-formed to put "xml-like" data into a CDATA
> section. However, an application that suddenly "turns" those data into
> markup by removing the CDATA markers violates XML; it appears that
> your application is supposed to operate in such a way.

Nope, nowhere near what I am trying to do.  A and B are independent (see
above).


> 
> IOW, the data might look like xml. When they are in a CDATA section,
> they are not markup. Trying to see them as markup at some point and
> not as markup at some other point means to read something into the XML
> standard that is not there.


..... makes my html example look wrong, yet it is a common use for CDATA.

> 
> > > You need to invent a new markup language for that kind of
> > > processing; XML does not support such a kind of interpretation of a
> > > document.
> > 
> > 
> > No I don't, because it works fine when the CDATA labels are kept, but you are
> > also saying that a parser can/should translate the character references
> > such as "&lt;", and looking at expat, it does, so, well, it seems to work
> > perfectly fine.  
> 
> To be precise, I'm saying it can. It might choose to generate
> roughly the same, or even more, CDATA sections on output as well.
> 
> >But now I am interested why this is a violation.  A perfectly
> >acceptable use is that one uses xml to wrap a message, which itself
> >may be xml, but it is up to the message interpreter later on to
> >figure out if it is valid.
> 
> It's not a violation to put "xml like" data into a CDATA section, but
> they are just plain character data. I said
> 
> # So if you treat CDATA sections in any other way, you violate the XML
> # recommendation.
> 
> *That* is something you cannot expect to work.
> 


All I originally wanted was for CDATA tags to remain in place so that at some
point, when looking at B, one could actually look for the markup tags.  Now
that I know these are often reverse translated when character data is handled,
that is fine (I know they are with expat).
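
For the case where the CDATA tags do need to stay in place, here is a sketch
assuming a current Python 3 and xml.dom.minidom, which keeps CDATA sections as
their own node type.  The helper cdata_children and the SOURCE string are
invented for this example.

```python
from xml.dom import Node, minidom

SOURCE = "<body><![CDATA[some <name> markup]]></body>"


def cdata_children(document):
    """Return the text of every CDATA section directly under the root."""
    root = minidom.parseString(document).documentElement
    return [
        child.data
        for child in root.childNodes
        if child.nodeType == Node.CDATA_SECTION_NODE
    ]


sections = cdata_children(SOURCE)

# Serialising the tree again writes the CDATA markers back out.
serialised = minidom.parseString(SOURCE).documentElement.toxml()
```

Here the markers survive parsing and serialisation, so a later process can
still tell which part of B was wrapped in CDATA.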


regards
Matt



> Regards,
> Martin
-------------------------------------------------------

-- 
Matt Halstead (PhD)
Research and development
VirtualSpectator
http://www.virtualspectator.com
ph 64-9-9136896