Re: [lxml-dev] Trouble parsing large XML document with ElementTree

24 May 2008

Dear Stefan,

I've tried the method you've suggested below, but it isn't quite working for
me. It may be that I've misunderstood your suggestion. I'll explain what
I've tried. Here is my Python program, extract_links_dmoz.py:

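from lxml import etree

# Binary mode: lxml decodes the bytes itself using the XML encoding declaration.
infile = open("content.example.xml", "rb")
infile.seek(0)  # redundant: a freshly opened file already starts at offset 0
outfile = open("output_test001.txt", "w")

class EchoTarget(object):
    """Parser target that writes the "about" attribute of each ExternalPage."""
    def start(self, tag, attrib):
        # Tags may carry a "{namespace}" prefix, so match on the suffix.
        if tag.endswith("xternalPage"):
            line = attrib["about"]
            if line != "":
                outfile.write(line + "\n")
            print(line)
    def close(self):
        return "closed!"

parser = etree.XMLParser(target=EchoTarget())
# infile.read() loads the whole document into memory before parsing starts.
result = etree.XML(infile.read(), parser)
outfile.close()
infile.close()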

This uses the short, example RDF file at
http://rdf.dmoz.org/rdf/content.example.txt (which I have renamed
content.example.xml), and works fine. When I view the output_test001.txt
file, it contains one URL per line, which is exactly what I want for now.

However, if I change the program to read content.rdf.u8.xml (i.e. the
full-length DMOZ links file from http://rdf.dmoz.org/rdf/content.rdf.u8.gz)
instead of content.example.xml, then when I run the program I get the
following error:

Traceback (most recent call last):
  File "extract_links_dmoz.py", line 26, in <module>
    result = etree.XML(infile.read(), parser)
MemoryError
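
The failing line both reads the whole file into memory with infile.read()
and hands it to the parser in one piece. One possible way around that,
sketched here under assumptions and not tested against the full DMOZ dump,
is to pass the file to the parser in pieces through its feed()/close()
interface. The names UrlTarget, extract_urls and chunk_size are illustrative
only, as are the inline sample document and its namespace; the import falls
back to the standard library's xml.etree.ElementTree, which offers the same
target and feed interface, when lxml is not installed.

```python
import io

try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib parser is an acceptable stand-in for this sketch.
    import xml.etree.ElementTree as etree


class UrlTarget(object):
    """Hypothetical target: writes each ExternalPage "about" URL to `out`."""

    def __init__(self, out):
        self.out = out
        self.count = 0

    def start(self, tag, attrib):
        # Tags may arrive as "{namespace}ExternalPage", so match on the suffix.
        if tag.endswith("ExternalPage"):
            url = attrib.get("about", "")
            if url:
                self.out.write(url + "\n")
                self.count += 1

    def close(self):
        # Whatever close() returns is what parser.close() hands back.
        return self.count


def extract_urls(stream, out, chunk_size=64 * 1024):
    """Feed `stream` to the parser chunk by chunk; no tree is built."""
    parser = etree.XMLParser(target=UrlTarget(out))
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
    return parser.close()


# Tiny stand-in for the real file; the namespace is made up for illustration.
sample = io.BytesIO(
    b'<RDF xmlns="http://example.com/ns">'
    b'<ExternalPage about="http://example.com/a"/>'
    b'<ExternalPage about="http://example.com/b"/>'
    b'</RDF>'
)
out = io.StringIO()
count = extract_urls(sample, out, chunk_size=32)
print(count)
print(out.getvalue())
```

With the real data, `stream` would be the input file opened in binary mode
and `out` the output text file, so only one chunk is held in memory at a time.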

Any help you (or others) can offer would be greatly appreciated.

Many thanks,

Sam

2008/5/22 Stefan Behnel <stefan_ml@behnel.de>:
...
Hi,
Sam Kuper wrote:
...
Gosh, this is turning into a really fragmented post; apologies. I meant to
add to the first post that once parsed, my intention was to run a fairly
simple XSL transform on the document, to extract a copy of each of the URLs
it contains. Probably something like this:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/">
        <html>
            <body>
                <h2>ODP URLs</h2>
                <xsl:for-each select="Topic/link">
                    <p><xsl:value-of select="@r:resource"/></p>
                </xsl:for-each>
            </body>
        </html>
    </xsl:template>
</xsl:stylesheet>
That is a problem that can be solved with extremely little memory. Take a look
at the (SAX-like) target parser interface, which will not build a tree and
instead just receives callbacks while parsing:
http://codespeak.net/lxml/dev/parsing.html#the-target-parser-interface
Write a parser target class that keeps track of being inside or outside the
"Topic" tag (start/end), and whenever you find a "link" tag while inside a
"Topic" tag, look for a "{whatever-namespace}resource" attribute in the attrib
dictionary and write it into a hand-generated HTML stream like the one you
used above.
Stefan
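
The target class described above could be sketched roughly as follows. This
is only an illustration under assumptions, not code from the thread: the
class name TopicLinkTarget, the helper local_name, the sample document and
its namespace URI are all made up, and the import falls back to the standard
library's xml.etree.ElementTree (same target interface) if lxml is missing.

```python
import io
from xml.sax.saxutils import escape

try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib parser is an acceptable stand-in for this sketch.
    import xml.etree.ElementTree as etree


def local_name(qualified):
    """Strip a leading "{namespace}" from a tag or attribute name."""
    return qualified.split("}")[-1]


class TopicLinkTarget(object):
    """Hypothetical target: emits one <p> per link found inside a Topic."""

    def __init__(self, out):
        self.out = out
        self.in_topic = False
        self.out.write("<html><body><h2>ODP URLs</h2>\n")

    def start(self, tag, attrib):
        name = local_name(tag)
        if name == "Topic":
            self.in_topic = True
        elif name == "link" and self.in_topic:
            # Accept "{whatever-namespace}resource" without naming the namespace.
            for key, value in attrib.items():
                if local_name(key) == "resource":
                    self.out.write("<p>%s</p>\n" % escape(value))

    def end(self, tag):
        if local_name(tag) == "Topic":
            self.in_topic = False

    def close(self):
        self.out.write("</body></html>\n")
        return "closed!"


# Made-up sample; the second link sits outside any Topic and is skipped.
sample_xml = (
    b'<RDF xmlns:r="http://example.com/rdf#">'
    b'<Topic r:id="Top/Arts">'
    b'<link r:resource="http://example.com/a?x=1&amp;y=2"/>'
    b'</Topic>'
    b'<link r:resource="http://example.com/ignored"/>'
    b'</RDF>'
)
html_out = io.StringIO()
status = etree.XML(sample_xml, etree.XMLParser(target=TopicLinkTarget(html_out)))
html_text = html_out.getvalue()
print(html_text)
```

Only a boolean flag and the current attribute dictionary are held at any
time, which is why this approach needs so little memory.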