Re: [lxml-dev] Trouble parsing large XML document with ElementTree
To add to the message below, I've just tried running a much simpler program that doesn't call lxml to see if the memory error is a Python/environment one rather than being due to lxml. It turns out to be:
    infile = open("content.rdf.u8.xml", "r")
    print infile.read()

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    MemoryError
OK, so clearly Python isn't happy to read content.rdf.u8.xml in one go. The normal workaround for processing large text files piece by piece seems to be either to set a byte limit on how much is read at once, or to read the file line by line. However, neither of those will work in this case because they won't produce well-formed XML that the target parser interface can handle (correct me if I'm wrong).

I'm sure there must be a fairly easy solution to this, but it's eluding me. All assistance greatly appreciated!

Sam

2008/5/24 Sam Kuper <sam.kuper@uclmail.net>:
Dear Stefan,
I've tried the method you suggested below, but it isn't quite working for me. It may be that I've misunderstood your suggestion, so I'll explain what I've tried. Here is my Python program, extract_links_dmoz.py:
    from lxml import etree

    infile = open("content.example.xml", "r")
    infile.seek(0)
    outfile = open("output_test001.txt", "w")

    class EchoTarget():
        def start(self, tag, attrib):
            if tag.endswith("xternalPage"):
                line = attrib["about"]
                if line != "":
                    outfile.write(line+"\n")
                    print line
        def close(self):
            return "closed!"

    parser = etree.XMLParser(target = EchoTarget())
    result = etree.XML(infile.read(), parser)
This uses the short, example RDF file at http://rdf.dmoz.org/rdf/content.example.txt (which I have renamed content.example.xml), and works fine. When I view the output_test001.txt file, it contains one URL per line, which is exactly what I want for now.
However, if I change the program to read content.rdf.u8.xml (i.e. the full-length DMOZ links file from http://rdf.dmoz.org/rdf/content.rdf.u8.gz) instead of content.example.xml, then when I run the program I get the following error:
    Traceback (most recent call last):
      File "extract_links_dmoz.py", line 26, in <module>
        result = etree.XML(infile.read(), parser)
    MemoryError
Any help you (or others) can offer would be greatly appreciated.
Many thanks,
Sam
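The traceback above points at infile.read(), which loads the whole file into memory before the parser sees any of it. A minimal sketch of an alternative follows, assuming the parser's feed() method is used instead: the chunks passed to feed() do not each have to be well-formed, only the stream as a whole does. The sketch uses Python 3 and the standard library's xml.etree.ElementTree as a stand-in so that it runs without extra packages; lxml.etree accepts the same target and feed()/close() calls. The names AboutCollector and extract_urls, the chunk size, and the tiny sample document are illustrative assumptions, not part of the original program.

```python
import io
import xml.etree.ElementTree as ET  # stdlib stand-in; lxml.etree offers the same feed()/target calls


class AboutCollector:
    """Parser target that records the 'about' attribute of ExternalPage elements."""

    def __init__(self, sink):
        self.sink = sink  # any object with a write() method

    def start(self, tag, attrib):
        # endswith() ignores a possible "{namespace}" prefix on the tag name
        if tag.endswith("ExternalPage"):
            url = attrib.get("about", "")
            if url:
                self.sink.write(url + "\n")

    def end(self, tag):
        pass

    def data(self, data):
        pass

    def close(self):
        return "closed!"


def extract_urls(infile, sink, chunk_size=64 * 1024):
    """Feed infile to the parser chunk by chunk; chunk_size is an arbitrary choice."""
    parser = ET.XMLParser(target=AboutCollector(sink))
    while True:
        chunk = infile.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
    # close() signals the end of input and returns the target's close() value
    return parser.close()


# Hypothetical stand-in for content.rdf.u8.xml, small enough to run anywhere.
sample = (
    b'<RDF>'
    b'<ExternalPage about="http://example.org/a"/>'
    b'<ExternalPage about="http://example.org/b"/>'
    b'</RDF>'
)
out = io.StringIO()
# A deliberately tiny chunk size, so that tags are split across feed() calls.
result = extract_urls(io.BytesIO(sample), out, chunk_size=7)
print(out.getvalue())
```

For the real file, infile would be something like open("content.rdf.u8.xml", "rb") and sink an output file opened for writing; memory use should then depend on the chunk size rather than on the size of the document.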
2008/5/22 Stefan Behnel <stefan_ml@behnel.de>:
Hi,
Sam Kuper wrote:
Gosh, this is turning into a really fragmented post; apologies. I meant to add to the first post that once parsed, my intention was to run a fairly simple XSL transform on the document, to extract a copy of each of the URLs it contains. Probably something like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="/">
        <html>
          <body>
            <h2>ODP URLs</h2>
            <xsl:for-each select="Topic/link">
              <p><xsl:value-of select="@r:resource"/></p>
            </xsl:for-each>
          </body>
        </html>
      </xsl:template>
    </xsl:stylesheet>
That is a problem that can be solved with extremely little memory. Take a look at the (SAX-like) target parser interface, which will not build a tree and instead just receive callbacks while parsing:
http://codespeak.net/lxml/dev/parsing.html#the-target-parser-interface
Write a parser target class that keeps track of being inside or outside the "Topic" tag (start/end), and whenever you find a "link" tag while inside a "Topic" tag, look for a "{whatever-namespace}resource" attribute in the attrib dictionary and write it into a hand-generated HTML stream like the one you used above.
Stefan
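The target class described in Stefan's reply could be sketched roughly as follows. This is an illustration under assumptions, not Stefan's own code: the class name TopicLinkTarget, the local_name helper, and the example.org namespaces in the sample are all made up for the sketch, and it uses Python 3 with the standard library's xml.etree.ElementTree, whose XMLParser takes a target in the same way as lxml's.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree
from xml.sax.saxutils import escape


class TopicLinkTarget:
    """Collects 'resource' attributes of link elements found inside Topic elements."""

    def __init__(self):
        self.topic_depth = 0  # greater than 0 while inside a Topic element
        self.parts = ["<html><body><h2>ODP URLs</h2>"]

    @staticmethod
    def local_name(name):
        # "{namespace}link" -> "link"; names without a namespace pass through
        return name.rsplit("}", 1)[-1]

    def start(self, tag, attrib):
        name = self.local_name(tag)
        if name == "Topic":
            self.topic_depth += 1
        elif name == "link" and self.topic_depth > 0:
            for key, value in attrib.items():
                if self.local_name(key) == "resource":
                    # escape() keeps characters such as "&" valid in the HTML output
                    self.parts.append("<p>%s</p>" % escape(value))

    def end(self, tag):
        if self.local_name(tag) == "Topic":
            self.topic_depth -= 1

    def data(self, data):
        pass

    def close(self):
        self.parts.append("</body></html>")
        return "\n".join(self.parts)


# Hypothetical sample; the namespace URIs are placeholders, not the real DMOZ ones.
sample = (
    '<RDF xmlns="http://example.org/dmoz" xmlns:r="http://example.org/rdf">'
    '<Topic r:id="Top/Arts"><link r:resource="http://example.org/a"/></Topic>'
    '<link r:resource="http://example.org/ignored"/>'
    '</RDF>'
)
parser = ET.XMLParser(target=TopicLinkTarget())
parser.feed(sample)
html = parser.close()
print(html)
```

For a document the size of the full DMOZ dump, it would probably make more sense to write each line to an output file as it is found, rather than accumulate the lines in a list as this sketch does.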
I'm still trying to find a good way to process big xml files (i.e. xml files around as large as the available RAM, or larger). My last post to the mailing list asked if there might be a way to do this by processing fragments of the document at a time. I now realise that lxml's feed parser is intended for this sort of task (correct me if I'm wrong). So I'm trying to learn to use it, but I find its behaviour a little odd. Not sure if this is a bug or not, so I'm posting to the list for advice.

When I run:

    from lxml import etree

    class EchoTarget:
        def start(self, tag, attrib):
            print "start", tag, attrib
        def end(self, tag):
            print "end", tag
        def data(self, data):
            print "data", repr(data)
        def close(self):
            print "close"
            return "closed!"

    parser = etree.XMLParser(target = EchoTarget())
    parser.feed("<somethin")
    parser.feed("g>foo</something>")

I get the result:

    start something {}
    data u'foo'
    end something

But when I run:

    from lxml import etree

    class EchoTarget:
        def start(self, tag, attrib):
            print "start", tag, attrib
        def end(self, tag):
            print "end", tag
        def data(self, data):
            print "data", repr(data)
        def close(self):
            print "close"
            return "closed!"

    parser = etree.XMLParser(target = EchoTarget())
    parser.feed("<something>foo</something>")

nothing gets sent to stdout. Isn't that weird? I think so. I would have expected it to give the same result as the program above.

I'd be very thankful if anyone can shed some light on the matter for me.

Many thanks,
Sam

2008/5/24 Sam Kuper <sam.kuper@uclmail.net>:
To add to the message below, I've just tried running a much simpler program that doesn't call lxml to see if the memory error is a Python/environment one rather than being due to lxml. It turns out to be:
    infile = open("content.rdf.u8.xml", "r")
    print infile.read()

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    MemoryError
Ok, so clearly Python isn't happy to read content.rdf.u8.xml in one go. The normal workaround for processing large text files piece by piece seems to be either to set a byte limit on how much is read at once, or to read the file line by line. However, neither of those will work in this case because they won't produce well-formed XML that the target parser interface can handle (correct me if I'm wrong).
I'm sure there must be a fairly easy solution to this, but it's eluding me. All assistance greatly appreciated!
Sam
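Regarding the two feed() programs earlier in this thread: neither of them calls parser.close(), and the feed interface is only finished once close() has been called. One plausible explanation for the silent single-feed case, offered as an assumption and not checked against lxml's source, is that the parser was still holding the input in its buffer when the program ended. The sketch below shows the complete feed-then-close sequence. It uses Python 3 and the standard library's xml.etree.ElementTree as a stand-in, so its buffering behaviour may differ from lxml's; RecordingTarget is a hypothetical name.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree


class RecordingTarget:
    """Parser target that records every callback it receives."""

    def __init__(self):
        self.events = []

    def start(self, tag, attrib):
        self.events.append(("start", tag))

    def end(self, tag):
        self.events.append(("end", tag))

    def data(self, data):
        self.events.append(("data", data))

    def close(self):
        self.events.append(("close",))
        return "closed!"


target = RecordingTarget()
parser = ET.XMLParser(target=target)
parser.feed("<something>foo</something>")
# close() tells the parser that no more input is coming, so anything still
# buffered is processed; it returns whatever the target's close() returns.
result = parser.close()

for event in target.events:
    print(event)
print(result)
```

Whatever the parser does internally between feed() calls, every callback should have been delivered by the time close() returns, so results are best read only after that point.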