
Lots of projects embed scripting & other code in XML, typically as CDATA elements. For example, XBL in Mozilla. As far as I know, no one ever bothers to define how one should _encode_ code in a CDATA segment, and it appears that at least in the Mozilla world the 'encoding' used is 'cut & paste', and it's the XBL author's responsibility to make sure that ]]> is nowhere in the JavaScript code. That seems suboptimal to me, and likely to lead to disasters down the line. The only clean solution I can think of is to define a standard encoding/decoding process for storing program code (which may very well contain occurences of ]]> in CDATA, which effectively hides that triplet from the parser. While I'm dreaming, it would be nice if all of the relevant language communities (JS, Python, Perl, etc.) could agree on what that encoding is. I'd love to hear of a recommendation on the topic by the XML folks, but I haven't been able to find any such document. Any thoughts? --david ascher

On Mon, 17 Apr 2000, David Ascher wrote:
The only clean solution I can think of is to define a standard encoding/decoding process for storing program code (which may very well contain occurences of ]]> in CDATA, which effectively hides that triplet from the parser.
Hmm. I think the way everybody does it is to use the language to get around the need for ever saying "]]>". For example, in Python, if that was outside of a string, you could insert some spaces without changing the meaning, or if it was inside a string, you could add two strings together etc. You're right that this seems a bit ugly, but i think it could be even harder to get all the language communities to swallow something like "replace all occurrences of ]]> with some ugly escape string" -- since the above (hackish) method has the advantage that you can just run code directly copied from a piece of CDATA, and now you're asking them all to run the CDATA through some unescaping mechanism beforehand. Although i'm less optimistic about the success of such a standard, i'd certainly be up for it, if we had a good answer to propose. Here is one possible answer (to pick "@@" as a string very unlikely to occur much in most scripting languages): @@ --> @@@ ]]> --> @@> def escape(text): cdata = replace(text, "@@", "@@@") cdata = replace(cdata, "]]>", "@@>") return cdata def unescape(cdata): text = replace(cdata, "@@>", "]]>") text = replace(text, "@@@", "@@") return text The string "@@" occurs nowhere in the Python standard library. Another possible solution: <] --> <]> ]]> --> <][ etc. Generating more solutions is left as an exercise to the reader. :) -- ?!ng

Hmm. I think the way everybody does it is to use the language to get around the need for ever saying "]]>". For example, in Python, if that was outside of a string, you could insert some spaces without changing the meaning, or if it was inside a string, you could add two strings together etc.
You're right that this seems a bit ugly, but i think it could be even harder to get all the language communities to swallow something like "replace all occurrences of ]]> with some ugly escape string" -- since the above (hackish) method has the advantage that you can just run code directly copied from a piece of CDATA, and now you're asking them all to run the CDATA through some unescaping mechanism beforehand.
But it has the bad disadvantages that it's language-specific and modifies code rather than encode it. It has the even worse disadvantage that it requires you to parse the code to encode/decode it, something much more expensive than is really necessary!
Although i'm less optimistic about the success of such a standard, i'd certainly be up for it, if we had a good answer to propose.
I'm thinking that if we had a good answer, we can probably get it into the core libraries for a few good languages, and document it as 'the standard', if we could get key people on board.
Here is one possible answer
Right, that's the sort of thing I was looking for.
def escape(text): cdata = replace(text, "@@", "@@@") cdata = replace(cdata, "]]>", "@@>") return cdata
def unescape(cdata): text = replace(cdata, "@@>", "]]>") text = replace(text, "@@@", "@@") return text
(the above fails on @@>, but that's the general idea I had in mind). --david I know!: "]]>" <==> "Microsoft engineers are puerile weenies!"

On Mon, 17 Apr 2000, David Ascher wrote:
(the above fails on @@>, but that's the general idea I had in mind).
Oh, that's stupid of me. I used the wrong test harness. Okay, well the latter example works (i tested it): <] --> <]> ]]> --> <][ And this also works: @@ --> @@] ]]> --> @@> -- ?!ng

What is wrong with encoding ]]> in the XML way by using an extra CDATA. In other words split up the CDATA section into two in the middle of the ]]> sequence: import string def encode_cdata(str): return '<![CDATA[' + \ string.join(string.split(str, ']]>'), ']]]]><![CDATA[>')) + \ ']]>' On Mon, Apr 17 2000 "David Ascher" wrote:
Lots of projects embed scripting & other code in XML, typically as CDATA elements. For example, XBL in Mozilla. As far as I know, no one ever bothers to define how one should _encode_ code in a CDATA segment, and it appears that at least in the Mozilla world the 'encoding' used is 'cut & paste', and it's the XBL author's responsibility to make sure that ]]> is nowhere in the JavaScript code.
That seems suboptimal to me, and likely to lead to disasters down the line.
The only clean solution I can think of is to define a standard encoding/decoding process for storing program code (which may very well contain occurences of ]]> in CDATA, which effectively hides that triplet from the parser.
While I'm dreaming, it would be nice if all of the relevant language communities (JS, Python, Perl, etc.) could agree on what that encoding is. I'd love to hear of a recommendation on the topic by the XML folks, but I haven't been able to find any such document.
Any thoughts?
--david ascher
-- Sjoerd Mullender <sjoerd.mullender@oratrix.com>

On Wed, 19 Apr 2000, Sjoerd Mullender wrote:
What is wrong with encoding ]]> in the XML way by using an extra CDATA. In other words split up the CDATA section into two in the middle of the ]]> sequence:
Brilliant. Now that i've seen it, this has to be the right answer. -- ?!ng "Je n'aime pas les stupides gar�ons, m�me quand ils sont intelligents." -- Roople Unia

What is wrong with encoding ]]> in the XML way by using an extra CDATA. In other words split up the CDATA section into two in the middle of the ]]> sequence:
import string def encode_cdata(str): return '<![CDATA[' + \ string.join(string.split(str, ']]>'), ']]]]><![CDATA[>')) + \ ']]>'
If I understand what you're proposing, you're splitting a single bit of Python code into N XML elements. This requires smarts not on the decode function (where they should be, IMO), but on the XML parsing stage (several leaves of the tree have to be merged). Seems like the wrong direction to push things. Also, I can imagine cases where the app puts several scripts in consecutive CDATA elements (assuming that's legal XML), and where a merge which inserted extra ]]> would be very surprising. Maybe I'm misunderstanding you, though.... --david ascher

David Ascher wrote:
What is wrong with encoding ]]> in the XML way by using an extra CDATA. In other words split up the CDATA section into two in the middle of the ]]> sequence:
import string def encode_cdata(str): return '<![CDATA[' + \ string.join(string.split(str, ']]>'), ']]]]><![CDATA[>')) + \ ']]>'
If I understand what you're proposing, you're splitting a single bit of Python code into N XML elements.
nope. CDATA sections are used to encode data, they're not elements: XML 1.0, section 2.7: CDATA sections may occur anywhere character data may occur; they are used to escape blocks of text containing characters which would otherwise be recognized as mark- up. you can put each data character in its own CDATA section, if you like. if the parser cannot handle that, it's broken. (if you've used xmllib, think handle_data, not start_cdata). </F>

On Wed, Apr 19 2000 "David Ascher" wrote:
What is wrong with encoding ]]> in the XML way by using an extra CDATA. In other words split up the CDATA section into two in the middle of the ]]> sequence:
import string def encode_cdata(str): return '<![CDATA[' + \ string.join(string.split(str, ']]>'), ']]]]><![CDATA[>')) + \ ']]>'
If I understand what you're proposing, you're splitting a single bit of Python code into N XML elements. This requires smarts not on the decode function (where they should be, IMO), but on the XML parsing stage (several leaves of the tree have to be merged). Seems like the wrong direction to push things. Also, I can imagine cases where the app puts several scripts in consecutive CDATA elements (assuming that's legal XML), and where a merge which inserted extra ]]> would be very surprising.
Maybe I'm misunderstanding you, though....
I think you're not misunderstanding me, but maybe you are misunderstanding XML. :-) [Of course, it is also conceivable that I misunderstand XML. :-] First of all, I don't propose to split up the single bit of Python into multiple XML elements. CDATA sections are not XML elements. The XML standard says this: CDATA sections may occur anywhere character data may occur; they are used to escape blocks of text containing characters which would otherwise be recognized as markup. [http://www.w3.org/TR/REC-xml#sec-cdata-sect] In other words, according to the XML standard wherever you are allowed to put character data (such as in this case Python code), you are allowed to use CDATA sections. Their purpose is to escape blocks of text containing characters that would otherwise be recognized as markup. CDATA sections are not part of the markup, so the XML parser is allowed to coallese the multiple CDATA sections and other character data into one string before it gives it to the application. So, yes, this requires smarts on the XML parsing stage, but I think those smarts need to be there anyway. If an application put several pieces of Python code in one character data section, it is basically on its own. I don't think XML guarantees that those pieces aren't merged into one string by the XML parser before it gets to the application. As I said already, this is my interpretation of XML, and I could be misinterpreting things. -- Sjoerd Mullender <sjoerd.mullender@oratrix.com>

Sjoerd Mullender wrote:
...
CDATA sections are not part of the markup, so the XML parser is allowed to coallese the multiple CDATA sections and other character data into one string before it gives it to the application.
Allowed but not required. Most SAX parsers will not. Some DOM parsers will and some won't. :(
So, yes, this requires smarts on the XML parsing stage, but I think those smarts need to be there anyway.
I don't follow this part. Typically those "smarts" are not there. At the end of one CDATA section you get an event and at the beginning of the next you get a different event. It's the application's job to glue them together. :( Fixing this is one of the goals of EventDOM. -- Paul Prescod - ISOGEN Consulting Engineer speaking for himself Pop stars come and pop stars go, but amid all this change there is one eternal truth: Whenever Bob Dylan writes a song about a guy, the guy is guilty as sin. - http://www.nj.com/page1/ledger/e2efc7.html

David Ascher wrote:
...
If I understand what you're proposing, you're splitting a single bit of Python code into N XML elements.
No, a CDATA section is not an element. But the question of whether boundary placements are meaningful is sepearate. This comes back to the "semantics question". Most tools will not differentiate between two adjacent CDATA sections and one. The XML specification does not say whether they should or should not but in practice tools that consume XML and then throw it away typically do NOT care about CDATA section boundaries and tools that edit XML *do* care. This "break it into to two sections" solution is the typical one but it is god-awful ugly, even in XML editors. Many stream-based XML tools (e.g. mostSAX parsers, xmllib) *will* report two separate CDATA sections as two different character events. Application code must be able to handle this situation. It doesn't only occur with CDATA sections. XML parsers could equally break up long text chunks based on 1024-byte block boundaries or line breaks or whatever they feel like. In my opinion these variances in behvior stem from the myth that XML has no semantics but that's another off-topic topic. -- Paul Prescod - ISOGEN Consulting Engineer speaking for himself Pop stars come and pop stars go, but amid all this change there is one eternal truth: Whenever Bob Dylan writes a song about a guy, the guy is guilty as sin. - http://www.nj.com/page1/ledger/e2efc7.html
participants (5)
-
David Ascher
-
Fredrik Lundh
-
Ka-Ping Yee
-
Paul Prescod
-
Sjoerd Mullender