[lxml-dev] Trouble parsing large XML document with ElementTree
Dear lovely lxmlves,

Yesterday I tried to parse a large file, the Open Directory Project's links document, available here <http://rdf.dmoz.org/rdf/content.rdf.u8.gz>. The process went like this:

1) Unzipped the file using 7-zip. No errors reported.
2) Renamed the file by adding a .xml extension, mainly so Windows (see my spec below) would recognise it as an XML file.
3) Had a look at the file in Oxygen's large document viewer. It took a few minutes to load, but everything looked shipshape.
4) Opened a command prompt, navigated to the directory containing the file, and started Python.
5) Entered: from lxml import etree
6) Entered: doc = open ('content.rdf.u8.xml', 'r')
7) Entered: docParsed = etree.parse(doc)

Steps 4, 5 and 6 all went smoothly, but after step 7, the RAM usage went up to around 96% (fair enough, it's a big document) and the Windows UI became sluggish. It didn't crash, and the RAM usage stabilised around that level, with Windows Task Manager showing well under 10% CPU load from Python. Still, I figured it might take a while to parse, so I left it overnight. In the morning, I found the following error message immediately underneath the command I'd entered in step 7:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 2520, in lxml.etree.parse
  File "parser.pxi", line 1331, in lxml.etree._parseDocument
  File "parser.pxi", line 1361, in lxml.etree._parseFilelikeDocument
  File "parser.pxi", line 1254, in lxml.etree._parseDocFromFilelike
  File "parser.pxi", line 850, in lxml.etree._BaseParser._parseDocFromFilelike
  File "parser.pxi", line 452, in lxml.etree._ParserContext._handleParseResultDoc
  File "parser.pxi", line 536, in lxml.etree._handleParseResult
  File "parser.pxi", line 478, in lxml.etree._raiseParseError
lxml.etree.XMLSyntaxError: Memory allocation failed : building node

I hope that's meaningful to someone, and that perhaps I might be able to get some suggestions about how to parse the file on my PC.
Also, I was thinking of trying to parse the file on a virtual server that only has 64M of RAM. I don't mind if the VPS takes a day or two, as long as the code to make it work is fairly straightforward. So any suggestions about that option would be helpful too.

Many thanks,

Sam

---
MacBook 2.13GHz with 2GB RAM
Windows Vista Home Premium via Leopard Boot Camp
ActivePython 2.5.1
lxml installed via lxml-2.0.3-py2.5-win32.egg (this was the most up-to-date egg that was available last time I checked, which was about a week or two ago)
Hmm, 64M might be unfeasibly low. Let's say 128M. Anyway, if I did go with this option, it would probably be on one of the cheaper of these machines <http://www.vpsville.ca/plans> (or something similar somewhere else), which seem like a potentially inexpensive resource for doing offline data-munging.
Gosh, this is turning into a really fragmented post; apologies. I meant to add to the first post that once parsed, my intention was to run a fairly simple XSL transform on the document, to extract a copy of each of the URLs it contains. Probably something like this:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <html>
      <body>
        <h2>ODP URLs</h2>
        <xsl:for-each select="Topic/link">
          <p><xsl:value-of select="@r:resource"/></p>
        </xsl:for-each>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>

Thanks for your patience; I'm still relatively new at this stuff,

Sam
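(Editorial note: a minimal sketch of how a stylesheet along these lines can be applied with lxml's XSLT support. The `r` namespace URI and the tiny stand-in document are assumptions for illustration; the real ODP dump declares its own namespaces, which the stylesheet would need to bind.)

```python
from lxml import etree

# A stylesheet like the one above; note that the `r` prefix must be
# declared in the stylesheet itself for @r:resource to match
# (the namespace URI here is an assumption).
xslt_root = etree.XML("""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:r="http://www.w3.org/TR/RDF/">
  <xsl:template match="/">
    <html><body>
      <xsl:for-each select="//Topic/link">
        <p><xsl:value-of select="@r:resource"/></p>
      </xsl:for-each>
    </body></html>
  </xsl:template>
</xsl:stylesheet>""")
transform = etree.XSLT(xslt_root)

# Tiny stand-in document, for illustration only
doc = etree.XML('<RDF xmlns:r="http://www.w3.org/TR/RDF/">'
                '<Topic><link r:resource="http://example.com/"/></Topic>'
                '</RDF>')
print(str(transform(doc)))
```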
Hi,

Sam Kuper wrote:
Gosh, this is turning into a really fragmented post; apologies. I meant to add to the first post that once parsed, my intention was to run a fairly simple XSL transform on the document, to extract a copy of each of the URLs it contains. Probably something like this: [XSLT stylesheet snipped; quoted in full above]
That is a problem that can be solved with extremely little memory. Take a look at the (SAX-like) target parser interface, which will not build a tree and instead just receives callbacks while parsing:

http://codespeak.net/lxml/dev/parsing.html#the-target-parser-interface

Write a parser target class that keeps track of being inside or outside the "Topic" tag (start/end), and whenever you find a "link" tag while inside a "Topic" tag, look for a "{whatever-namespace}resource" attribute in the attrib dictionary and write it into a hand-generated HTML stream like the one you used above.

Stefan
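(Editorial note: a sketch of such a parser target, assuming the ODP dump binds the `r` prefix to `http://www.w3.org/TR/RDF/`; the tag and attribute names would need checking against the real document.)

```python
from lxml import etree

class LinkCollector(object):
    """Parser target: writes out the r:resource attribute of each
    <link> element found inside a <Topic> element, without ever
    building a tree."""
    # Assumed namespace URI -- check the document's own xmlns:r declaration.
    RESOURCE = '{http://www.w3.org/TR/RDF/}resource'

    def __init__(self, out):
        self.out = out
        self.in_topic = False

    def start(self, tag, attrib):
        if tag.endswith('Topic'):
            self.in_topic = True
        elif self.in_topic and tag.endswith('link'):
            url = attrib.get(self.RESOURCE)
            if url:
                self.out.write(url + '\n')

    def end(self, tag):
        if tag.endswith('Topic'):
            self.in_topic = False

    def close(self):
        return 'done'

# Usage sketch (file names are illustrative):
# parser = etree.XMLParser(target=LinkCollector(open('links.txt', 'w')))
# etree.parse('content.rdf.u8.xml', parser)
```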
Dear Stefan,

I've tried the method you've suggested below, but it isn't quite working for me. It may be that I've misunderstood your suggestion. I'll explain what I've tried. Here is my Python program, extract_links_dmoz.py:

from lxml import etree

infile = open("content.example.xml", "r")
infile.seek(0)
outfile = open("output_test001.txt", "w")

class EchoTarget():
    def start(self, tag, attrib):
        if tag.endswith("xternalPage"):
            line = attrib["about"]
            if line != "":
                outfile.write(line+"\n")
                print line
    def close(self):
        return "closed!"

parser = etree.XMLParser(target = EchoTarget())
result = etree.XML(infile.read(), parser)

This uses the short, example RDF file at http://rdf.dmoz.org/rdf/content.example.txt (which I have renamed content.example.xml), and works fine. When I view the output_test001.txt file, it contains one URL per line, which is exactly what I want for now.

However, if I change the program to read content.rdf.u8.xml (i.e. the full-length DMOZ links file from http://rdf.dmoz.org/rdf/content.rdf.u8.gz) instead of content.example.xml, then when I run the program I get the following error:

Traceback (most recent call last):
  File "extract_links_dmoz.py", line 26, in <module>
    result = etree.XML(infile.read(), parser)
MemoryError

Any help you (or others) can offer would be greatly appreciated.

Many thanks,

Sam

2008/5/22 Stefan Behnel <stefan_ml@behnel.de>:
Hi,
Sam Kuper wrote:
Gosh, this is turning into a really fragmented post; apologies. I meant to add to the first post that once parsed, my intention was to run a fairly simple XSL transform on the document, to extract a copy of each of the URLs it contains. Probably something like this: [XSLT stylesheet snipped; quoted in full above]
That is a problem that can be solved with extremely little memory. Take a look at the (SAX-like) target parser interface, which will not build a tree and instead just receive callbacks while parsing:
http://codespeak.net/lxml/dev/parsing.html#the-target-parser-interface
Write a parser target class that keeps track of being inside or outside the "Topic" tag (start/end), and whenever you find a "link" tag while inside a "Topic" tag, look for a "{whatever-namespace}resource" attribute in the attrib dictionary and write it into a hand-generated HTML stream like the one you used above.
Stefan
Dear Stefan,

I did read your other post, but using the file name directly when calling the parser didn't work for me. Here is what I tried:

from lxml import etree

outfile = open("output_test001.txt", "w")

class EchoTarget():
    def start(self, tag, attrib):
        if tag.endswith("xternalPage"):
            line = attrib["about"]
            if line != "":
                outfile.write(line+"\n")
                print line
    def close(self):
        return "closed!"

parser = etree.XMLParser(target = EchoTarget())
result = etree.XML("content.example.xml", parser)

This gives the following error:

Traceback (most recent call last):
  File "extract_links_dmoz005.py", line 15, in <module>
    result = etree.XML("content.example.xml", parser)
  File "lxml.etree.pyx", line 2358, in lxml.etree.XML
  File "parser.pxi", line 1354, in lxml.etree._parseMemoryDocument
  File "parser.pxi", line 1243, in lxml.etree._parseDoc
  File "parser.pxi", line 795, in lxml.etree._BaseParser._parseDoc
  File "parsertarget.pxi", line 130, in lxml.etree._TargetParserContext._handleParseResultDoc
  File "parser.pxi", line 478, in lxml.etree._raiseParseError
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

I have been reading the docs, but I'm new to processing XML in Python, so I don't find them all that easy to understand. I think I'm improving, though :)

Thanks for your patience.

Best,

Sam

2008/5/24 Stefan Behnel <stefan_ml@behnel.de>:
Hi,
did you read my other post?
Sam Kuper wrote:
result = etree.XML(infile.read(), parser)
make that
result = etree.parse("thefile.xml", parser)
and consider reading the parser docs on the web page.
Stefan
--
http://five.sentenc.es | http://tinyurl.com/3x9se4
--
Mr Sam Pablo Kuper BSc MRI
Research Assistant
Darwin Correspondence Project
Cambridge University Library
West Road
Cambridge CB3 9DR
spk30@cam.ac.uk
Office: +44 (0)1223 333008
Mobile: +44 (0)7971858176
www.darwinproject.ac.uk
Hi, RMP! :)

Sam Kuper wrote:
result = etree.XML("content.example.xml", parser)
2008/5/24 Stefan Behnel:
result = etree.parse("thefile.xml", parser)
See the difference?

Please read
http://codespeak.net/lxml/tutorial.html#parsing-from-strings-and-files
and
http://codespeak.net/lxml/parsing.html

Stefan
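(Editorial note: the distinction Stefan is pointing at, sketched in a couple of lines.)

```python
from lxml import etree

# etree.XML() parses a string *containing* XML markup:
root = etree.XML('<doc><a/></doc>')
print(root.tag)  # -> doc

# etree.parse() takes a filename, URL, or file-like object instead.
# Passing a filename to etree.XML() hands the parser the literal
# string "content.example.xml", which starts with 'c', not '<' --
# hence "Start tag expected, '<' not found".
# tree = etree.parse('content.example.xml')  # file-based equivalent
```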
Hi,

Sam Kuper wrote:
Dear lovely lxmlves, Yesterday I tried to parse a large file, the Open Directory Project's links document, available here <http://rdf.dmoz.org/rdf/content.rdf.u8.gz>. The process went like this:
1) Unzipped the file using 7-zip. No errors reported. 2) Renamed the file by adding a .xml extension, mainly so Windows (see my spec below) would recognise it as an XML file. 3) Had a look at the file in Oxygen's large document viewer. It took a few minutes to load, but everything looked shipshape. 4) Opened a command prompt, navigated to the directory containing the file, and started Python. 5) Entered: from lxml import etree 6) Entered: doc = open ('content.rdf.u8.xml', 'r') 7) Entered: docParsed = etree.parse(doc)
lxml can parse from a gzipped XML file; no need to do steps 1) and 6), just do

docParsed = etree.parse('content.rdf.u8.xml.gz')

or even

docParsed = etree.parse('http://rdf.dmoz.org/rdf/content.rdf.u8.gz')

BTW, if you do 6) it should read

doc = open ('content.rdf.u8.xml', 'rb')

mind the 'rb' at the end.
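(Editorial note: when passing an already-open file object rather than a filename, the decompression can be done explicitly with the standard gzip module; a sketch, demonstrated on an in-memory gzip stream since the real dump file isn't available here.)

```python
import gzip
import io

from lxml import etree

def parse_gzipped(fileobj):
    """Parse gzip-compressed XML from an open binary file object."""
    return etree.parse(gzip.GzipFile(fileobj=fileobj, mode='rb'))

# Round-trip demonstration on an in-memory gzip stream
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as g:
    g.write(b'<doc><a/></doc>')
buf.seek(0)
print(parse_gzipped(buf).getroot().tag)  # -> doc
```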
Steps 4, 5 and 6 all went smoothly, but after step 7, the RAM usage went up to around 96% (fair enough, it's a big document) and the Windows UI became sluggish. It didn't crash, and the RAM usage stabilised around that amount, with Windows Task Manager showing well under 10% CPU load from Python.
That means your machine was heavily swapping. The in-memory tree of libxml2 is much larger than the serialised document itself, so if it doesn't fit into RAM, parsing the tree into memory will not make you happy, especially not with 64/128MB...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 2520, in lxml.etree.parse
  File "parser.pxi", line 1331, in lxml.etree._parseDocument
  File "parser.pxi", line 1361, in lxml.etree._parseFilelikeDocument
  File "parser.pxi", line 1254, in lxml.etree._parseDocFromFilelike
  File "parser.pxi", line 850, in lxml.etree._BaseParser._parseDocFromFilelike
  File "parser.pxi", line 452, in lxml.etree._ParserContext._handleParseResultDoc
  File "parser.pxi", line 536, in lxml.etree._handleParseResult
  File "parser.pxi", line 478, in lxml.etree._raiseParseError
lxml.etree.XMLSyntaxError: Memory allocation failed : building node
Your operating system stopped allowing Python to allocate more memory, and it didn't even crash; it just gave you an exception. Isn't that cool? :) (Although I wouldn't generally rely on that...)

Stefan
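(Editorial note: besides the target-parser interface Stefan suggested earlier in the thread, lxml's `iterparse` is another way to keep memory flat; a sketch, with the ODP namespace URI assumed.)

```python
from lxml import etree

def extract_resources(source):
    """Stream-parse `source`, yielding the r:resource attribute of each
    <link> element and clearing each element once it has been handled,
    so the in-memory tree never grows with the document."""
    resource = '{http://www.w3.org/TR/RDF/}resource'  # assumed namespace
    for event, elem in etree.iterparse(source, events=('end',)):
        if elem.tag.endswith('link'):
            url = elem.get(resource)
            if url:
                yield url
        elem.clear()  # drop the subtree we no longer need

# Usage sketch (filename is illustrative):
# for url in extract_resources('content.rdf.u8.xml'):
#     print(url)
```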
participants (2)
- Sam Kuper
- Stefan Behnel