[XML-SIG] Re: dom.minidom getting the text content of a node
Rick Hurst
rick.hurst at gmail.com
Fri Dec 10 09:40:45 CET 2004
On Thu, 9 Dec 2004 20:47:41 +0100, Fredrik Lundh <fredrik at pythonware.com> wrote:
> if you add this to the inner loop,
>
> print titleNode.childNodes
> print titleNode.firstChild.wholeText
>
> you get this output (under 2.3.3):
>
> [<DOM Text node "\n">, <DOM CDATASection node "Plone: rem...">]
>
Thanks Fredrik
> > http://sourceforge.net/tracker/?func=detail&atid=105470&aid=549725&group_id=5470
>
> this bug report complains that the DOM represents the CDATA section as
> four text nodes, which is also perfectly valid (see Martin's explanation). Code
> that depends on being able to identify a CDATA section in the source file is
> broken; character data, character references, entities, and CDATA sections
> should all be treated as text.
that makes sense
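
To make that concrete, here is a minimal sketch of a helper that treats Text and CDATASection children alike by joining their data, rather than relying on firstChild. It is written for a current Python with the standard-library xml.dom.minidom; the helper name text_content and the SAMPLE document are made up for illustration and only borrow the element names from this thread.

```python
from xml.dom import Node, minidom


def text_content(node):
    """Join the data of every Text and CDATASection child of node."""
    parts = []
    for child in node.childNodes:
        # CDATA sections and ordinary text are both just character data.
        if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
            parts.append(child.data)
    return "".join(parts)


# Hypothetical sample input: a newline followed by a CDATA section,
# similar to the childNodes listing quoted above.
SAMPLE = """<archive>
<blog id="1"><text><blogtitle>
<![CDATA[Plone: a title]]></blogtitle></text></blog>
</archive>"""

doc = minidom.parseString(SAMPLE)
for title_node in doc.getElementsByTagName("blogtitle"):
    print(text_content(title_node).strip())
```

This way it does not matter how the parser splits the character data into nodes.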
> btw, here's the corresponding ElementTree version:
>
> from elementtree import ElementTree
>
> tree = ElementTree.parse("foo.xml")
>
> for node in tree.findall(".//blog"):
>     print node.get("id")
>     for content_node in node.findall("text"):
>         print content_node.findtext("blogtitle")
>
> or, shorter:
>
> for node in tree.findall(".//blog"):
>     print node.get("id")
>     print node.findtext("text/blogtitle")
>
Wow, that looks like a more concise way to do it. Thanks, I'll take a
look at that.
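
For anyone reading this later: the same approach can be sketched with xml.etree.ElementTree, the version of ElementTree that ships in the standard library of later Python releases. This is only an illustration; the SAMPLE document and the titles dictionary are assumptions of mine, not the real archive file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input using the element names from this thread.
SAMPLE = """<archive>
<blog id="1"><text><blogtitle><![CDATA[Plone: a title]]></blogtitle></text></blog>
<blog id="2"><text><blogtitle>Plain title</blogtitle></text></blog>
</archive>"""

root = ET.fromstring(SAMPLE)

titles = {}
for blog in root.findall(".//blog"):
    # findtext returns the character data of the first matching element;
    # CDATA sections arrive as ordinary text.
    titles[blog.get("id")] = blog.findtext("text/blogtitle")

for blog_id, title in titles.items():
    print(blog_id, title)
```

Both titles come back as plain strings, whether or not the source used a CDATA section.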
FWIW, I had some success using Sax2 last night:
import sys
from xml.dom.ext.reader import Sax2

# create Reader object
reader = Sax2.Reader()

# parse the document
dom1 = reader.fromStream('200406archive010.xml')

for node in dom1.getElementsByTagName("blog"):
    id = node.getAttribute("id")
    print int(id)
    for contentNode in node.getElementsByTagName("text"):
        for titleNode in contentNode.getElementsByTagName("blogtitle"):
            print titleNode.firstChild.data
        for titleNode in contentNode.getElementsByTagName("blogbody"):
            print titleNode.firstChild.data
--
Rick Hurst
http://hypothecate.co.uk