xml parsing escape characters

"Martin v. Löwis" martin at v.loewis.de
Thu Jan 20 15:54:30 EST 2005


Luis P. Mendes wrote:
> When I access the url via the Firefox browser and look into the source
> code, I also get:
> 
> <?xml version="1.0" encoding="utf-8"?>
> <string xmlns="http................">&lt;DataSet&gt;
> ~  &lt;Order&gt;
> ~    &lt;Customer&gt;439&lt;/Customer&gt;
> ~  &lt;/Order&gt;
> &lt;/DataSet&gt;</string>

Please do try to understand what you are seeing. This is crucial for
understanding what happens.

You may have the understanding that XML can be represented as a tree.
This would be good - if not, please read a book that explains why
XML can be considered as a tree.

In the tree, you have inner nodes, and leaf nodes. For example,
the document

<a>
   <b>Hello</b>
   <c>World</c>
</a>

has 5 nodes (ignoring whitespace content):

Element:a ---- Element:b ---- Text:"Hello"
            |
            \-- Element:c ---- Text:"World"
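
A minimal sketch of that count, using xml.dom.minidom from the standard
library. The helper count_nodes is made up for illustration, and the
document is written on one line so that no whitespace-only text nodes
get in the way:

```python
from xml.dom import minidom


def count_nodes(node):
    """Count this node plus all of its descendants (hypothetical helper)."""
    return 1 + sum(count_nodes(child) for child in node.childNodes)


doc = minidom.parseString("<a><b>Hello</b><c>World</c></a>")
root = doc.documentElement  # Element:a

# Counts a, b, "Hello", c and "World".
total = count_nodes(root)

# For each child element of <a>, check that its only child is a Text node.
leaf_is_text = [
    child.firstChild.nodeType == child.TEXT_NODE for child in root.childNodes
]

print(total, leaf_is_text)
```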

So the leaf nodes are typically Text nodes (unless you
have an empty element). Your document has this structure:

Element:string ---- Text:"""<DataSet>
    <Order>
       <Customer>439</Customer>
   </Order>
</DataSet>"""

So the ***TEXT*** contains the letter "<", just like it contains
the letters "O" and "r". There IS no element Order in your document,
no matter how hard you look.
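
You can check this yourself. The sketch below assumes the markup inside
<string> arrives escaped as &lt; and &gt; (which is what your question
about replacing '&lt' suggests), and it uses a placeholder namespace URI
because the real one is elided above:

```python
from xml.dom import minidom

# Assumption: the inner markup is escaped; the namespace is a placeholder.
escaped_doc = (
    '<string xmlns="urn:example:placeholder">'
    "&lt;DataSet&gt;&lt;Order&gt;&lt;Customer&gt;439&lt;/Customer&gt;"
    "&lt;/Order&gt;&lt;/DataSet&gt;</string>"
)

doc = minidom.parseString(escaped_doc)
root = doc.documentElement

# The parser finds no Order element anywhere in the tree.
order_elements = doc.getElementsByTagName("Order")

# The only child of <string> is a Text node whose data contains "<".
text_node = root.firstChild

print(len(order_elements))
print(text_node.nodeType == text_node.TEXT_NODE)
print(text_node.data)
```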

If you want a DataSet *element* in your document, it should
read

<string xmlns="...">
  <DataSet>
   <Order>
    <Customer>439</Customer>
   </Order>
  </DataSet>
</string>

As this is the document you apparently want to process, complain
to whoever gave you that other document.

> should I take the contents of the string tag that is text and replace
> all '&lt' with '<' and '&gt' with '>' and then read it with xml.minidom?

No. We still don't know what you want to achieve, so it is difficult to
advise you what to do. My best advice is that whoever generates the XML
document should fix it.

> or should I use another parser that accomplishes the task with no need
> to replace the escaped characters?

No. The parser is working correctly.

The document you got can also be interpreted as containing another
XML document as a text. This is evil, but apparently people are doing
it, anyway. If you really want that embedded document, you need
first to extract it.

To see what I mean, do

print DataSetNode.data

The .data attribute gives you the string contents of
a text node. You could use this as an XML document, and
pass it to an XML parser again. This would be ugly,
but might be your only choice if the producer of the
document is unwilling to adjust.
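
A rough sketch of that two-step approach, under the same assumptions
as before: the inner markup is escaped, and the namespace URI is a
placeholder, not the real one. The variable names are made up for
illustration:

```python
from xml.dom import minidom

# Assumption: outer document with the inner XML escaped as text.
outer = minidom.parseString(
    '<string xmlns="urn:example:placeholder">'
    "&lt;DataSet&gt;&lt;Order&gt;&lt;Customer&gt;439&lt;/Customer&gt;"
    "&lt;/Order&gt;&lt;/DataSet&gt;</string>"
)

# Step 1: extract the text content of <string>. Joining all Text children
# guards against the text being delivered in more than one node.
embedded = "".join(
    node.data
    for node in outer.documentElement.childNodes
    if node.nodeType == node.TEXT_NODE
)

# Step 2: parse the extracted string as a document of its own.
inner = minidom.parseString(embedded)
customer = inner.getElementsByTagName("Customer")[0].firstChild.data

print(customer)
```

Note that no manual replacing of '&lt;' is involved: the first parse
already turns the escapes back into "<" and ">" in the text data.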

Regards,
Martin




