Re: [XML-SIG] CDATA sections still not handled

19 Jan 2001 10:27:02 +0100

* matt@virtualspectator.com
| 
| [...] since one gets CDATA begin and end events while parsing a
| document that contains a CDATA section, then why couldn't the DOM
| document still represent it as a CDATA section internally?

Because it would be a real pain, and would most likely break lots of
applications. If text nodes can suddenly be represented as both text
and cdata nodes, applications that only test for text nodes (and I
assume this is the majority) will be silently losing data.
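
To illustrate the data-loss risk, here is a minimal sketch using
Python's xml.dom.minidom. The tree is built by hand so the result does
not depend on any particular parser's behaviour; the helper names
text_only and all_character_data are made up for this example.

```python
from xml.dom import minidom, Node

# Build <a>x<![CDATA[y]]>z</a> by hand: text, CDATA section, text.
doc = minidom.Document()
root = doc.createElement("a")
doc.appendChild(root)
root.appendChild(doc.createTextNode("x"))
root.appendChild(doc.createCDATASection("y"))
root.appendChild(doc.createTextNode("z"))


def text_only(element):
    # What many applications do: look at text nodes and nothing else.
    return "".join(
        n.data for n in element.childNodes if n.nodeType == Node.TEXT_NODE
    )


def all_character_data(element):
    # What an application must do once CDATA nodes can appear in the tree.
    kinds = (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
    return "".join(n.data for n in element.childNodes if n.nodeType in kinds)


print(repr(text_only(root)))           # the "y" is silently dropped
print(repr(all_character_data(root)))  # complete character data
```

The first function raises no error; it just returns less than the
document contains, which is what makes the failure hard to notice.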

Furthermore, the normalize method, which many applications use to
ensure that there are no adjacent text nodes in the DOM tree, stops
working in the presence of cdata nodes, since these are not
normalized.
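
The normalize problem can be shown the same way. This is a sketch
against Python's xml.dom.minidom, with a hypothetical helper named
build; other DOM implementations may differ in detail.

```python
from xml.dom import minidom


def build(use_cdata):
    # Three adjacent pieces of character data under one element; the
    # middle one is either a text node or a CDATA section node.
    doc = minidom.Document()
    root = doc.createElement("a")
    doc.appendChild(root)
    root.appendChild(doc.createTextNode("x"))
    if use_cdata:
        root.appendChild(doc.createCDATASection("y"))
    else:
        root.appendChild(doc.createTextNode("y"))
    root.appendChild(doc.createTextNode("z"))
    root.normalize()
    return root


plain = build(use_cdata=False)
mixed = build(use_cdata=True)

print(len(plain.childNodes))  # text nodes were merged into one
print(len(mixed.childNodes))  # the CDATA node keeps its neighbours apart
```

An application that calls normalize and then reads firstChild.data
gets the whole string in the first case and only part of it in the
second.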

| Furthermore, a parser such as expat will preserve the original form
| of the characters that have been escaped, and even convert them if
| they happened to be in entity references.  

What are you trying to say here?

| It seems to me that the handling of CDATA sits at the level of its
| base class which is a text node and that the CDATA sections are only
| used to say "don't validate the following, it is ALL character
| data"..

CDATA sections and ordinary 'text'[1] are just two ways to represent
the same thing, and applications should not care which of the two ways
has been used. The distinction between these two ways of representing
character data is information about how the document was put together,
as opposed to information about what is in the document. 

In other words, this issue is really the same as the issues 'white
space in tags is lost', 'I can't tell what character data came from
numeric character references' and so on.
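
A small sketch of this point, using Python's xml.parsers.expat: three
lexically different documents deliver the same character data to the
application. The helper name character_data is made up for the
example.

```python
import xml.parsers.expat


def character_data(document):
    # Collect everything expat reports as character data. It may arrive
    # in several chunks, so the pieces are joined at the end.
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append
    parser.Parse(document, True)
    return "".join(chunks)


variants = [
    "<a>1 &lt; 2</a>",           # predefined entity reference
    "<a>1 &#60; 2</a>",          # numeric character reference
    "<a><![CDATA[1 < 2]]></a>",  # CDATA section
]
results = [character_data(v) for v in variants]
print(results)
```

All three entries come out as the same string; which spelling the
author used is not visible through this interface.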

I think your current way of handling it, to control what is
represented as CDATA in the serializer, is the correct way to do it.
One should consider very carefully before adding information of this
sort to the document tree (or event stream), because there is such an
unbelievably awful lot of it that it needs to be handled with the
greatest of care.
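
As a rough sketch of what controlling this in the serializer could
look like, here the caller names the elements whose content should be
written as CDATA sections. The functions serialize and write_cdata and
the cdata_elements parameter are hypothetical, not part of any
existing library, and attributes, comments and so on are ignored to
keep it short.

```python
from xml.dom import minidom, Node
from xml.sax.saxutils import escape


def write_cdata(text):
    # "]]>" may not occur inside a CDATA section, so split it across
    # two adjacent sections.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def serialize(node, cdata_elements=frozenset(), in_cdata=False):
    # Character data is written the same way whether the tree holds it
    # as a text node or as a CDATA node; only the caller's choice counts.
    if node.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
        return write_cdata(node.data) if in_cdata else escape(node.data)
    if node.nodeType == Node.ELEMENT_NODE:
        use_cdata = node.tagName in cdata_elements
        body = "".join(
            serialize(child, cdata_elements, use_cdata)
            for child in node.childNodes
        )
        return "<%s>%s</%s>" % (node.tagName, body, node.tagName)
    return ""  # other node types are left out of this sketch


doc = minidom.parseString("<doc><code>if a &lt; b: pass</code><p>a &lt; b</p></doc>")
output = serialize(doc.documentElement, cdata_elements={"code"})
print(output)
```

The tree itself carries no CDATA information here; the decision lives
entirely in the argument passed to the serializer.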

I have been thinking lately that it would be an interesting experiment
to make an XML parser with an interface specialized for representing
ALL the lexical information about a document. I guess this could be
done by passing along with every event the list of tokens that made up
that event.
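
A very rough approximation of the idea can be had with expat's default
handler, which receives the raw characters of anything no other
handler has claimed. The sketch below (the name lexical_tokens is made
up) only captures the lexical pieces; it does not attach them to
higher-level events, which is the part that would need a new
interface.

```python
import xml.parsers.expat


def lexical_tokens(document):
    # With no other handlers installed, the default handler is called
    # with the original markup and character data, unexpanded.
    tokens = []
    parser = xml.parsers.expat.ParserCreate()
    parser.DefaultHandler = tokens.append
    parser.Parse(document, True)
    return tokens


source = "<a>1 &lt; 2 <![CDATA[3 < 4]]></a>"
tokens = lexical_tokens(source)
print(tokens)
print("".join(tokens) == source)
```

Joining the pieces gives back the input, so the entity reference and
the CDATA delimiters survive, unlike in the ordinary event stream.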

--Lars M.

[1] Correct terminology is really to call it character data. Text, as
    defined by XML, is both markup and character data.