Dear Stefan,

I've tried the method you've suggested below, but it isn't quite working for me. It may be that I've misunderstood your suggestion. I'll explain what I've tried.

Here is my python program, extract_links_dmoz.py:

    from lxml import etree

    infile = open("content.example.xml", "r")
    infile.seek(0)
    outfile = open("output_test001.txt", "w")

    class EchoTarget():
        def start(self, tag, attrib):
            if tag.endswith("xternalPage"):
                line = attrib["about"]
                if line != "":
                    outfile.write(line+"\n")
                    print line
        def close(self):
            return "closed!"

    parser = etree.XMLParser(target = EchoTarget())
    result = etree.XML(infile.read(), parser)

This uses the short, example RDF file at http://rdf.dmoz.org/rdf/content.example.txt (which I have renamed content.example.xml), and works fine. When I view the output_test001.txt file, it contains one URL per line, which is exactly what I want for now.

However, if I change the program to read content.rdf.u8.xml (i.e. the full-length DMOZ links file from http://rdf.dmoz.org/rdf/content.rdf.u8.gz) instead of content.example.xml, then when I run the program I get the following error:

    Traceback (most recent call last):
      File "extract_links_dmoz.py", line 26, in <module>
        result = etree.XML(infile.read(), parser)
    MemoryError

Any help you (or others) can offer would be greatly appreciated.

Many thanks,
Sam

2008/5/22 Stefan Behnel <stefan_ml@behnel.de>:
Hi,
Sam Kuper wrote:
Gosh, this is turning into a really fragmented post; apologies. I meant to add to the first post that once parsed, my intention was to run a fairly simple XSL transform on the document, to extract a copy of each of the URLs it contains. Probably something like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="/">
        <html>
          <body>
            <h2>ODP URLs</h2>
            <xsl:for-each select="Topic/link">
              <p><xsl:value-of select="@r:resource"/></p>
            </xsl:for-each>
          </body>
        </html>
      </xsl:template>
    </xsl:stylesheet>

That is a problem that can be solved with extremely little memory. Take a look at the (SAX-like) target parser interface, which will not build a tree and instead just receive callbacks while parsing:
http://codespeak.net/lxml/dev/parsing.html#the-target-parser-interface
Write a parser target class that keeps track of whether it is inside or outside a "Topic" tag (via start/end). Whenever you find a "link" tag while inside a "Topic" tag, look for a "{whatever-namespace}resource" attribute in the attrib dictionary and write it into a hand-generated HTML stream like the one you used above.
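
A minimal sketch of such a target class might look like the following. The names LinkCollector, extract_links, local_name and chunk_size are made up for illustration, and the sketch assumes it is acceptable to match "Topic", "link" and "resource" by local name, ignoring whatever namespace they carry. It also assumes the parser is fed in fixed-size chunks through feed()/close() rather than from one large string, so that the whole file never has to sit in memory at once. The standard-library fallback is only there so the sketch runs without lxml installed.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib parser offers the same target/feed/close calls.
    import xml.etree.ElementTree as etree


def local_name(qname):
    """Strip a leading '{namespace}' from a tag or attribute name."""
    return qname.rsplit("}", 1)[-1]


class LinkCollector(object):
    """Hypothetical parser target: reports link 'resource' values in Topics."""

    def __init__(self, write):
        self.write = write      # callable that receives each URL
        self.topic_depth = 0    # greater than 0 while inside a Topic element

    def start(self, tag, attrib):
        name = local_name(tag)
        if name == "Topic":
            self.topic_depth += 1
        elif name == "link" and self.topic_depth > 0:
            for key, value in attrib.items():
                if local_name(key) == "resource" and value:
                    self.write(value)

    def end(self, tag):
        if local_name(tag) == "Topic":
            self.topic_depth -= 1

    def close(self):
        return "closed!"


def extract_links(infile, write, chunk_size=64 * 1024):
    """Feed infile to the parser piece by piece; no tree is built."""
    parser = etree.XMLParser(target=LinkCollector(write))
    while True:
        chunk = infile.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
    return parser.close()
```

It could then be called with the input opened in binary mode and a small function that writes one line (or one <p> element) per URL to the output file, for example extract_links(open("content.rdf.u8.xml", "rb"), write_line), where write_line is whatever output callback you choose.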
Stefan