[lxml-dev] Trouble parsing large XML document with ElementTree
Dear lovely lxmlves,

Yesterday I tried to parse a large file, the Open Directory Project's links document, available here <http://rdf.dmoz.org/rdf/content.rdf.u8.gz>. The process went like this:

1) Unzipped the file using 7-zip. No errors reported.
2) Renamed the file by adding a .xml extension, mainly so Windows (see my spec below) would recognise it as an XML file.
3) Had a look at the file in Oxygen's large document viewer. It took a few minutes to load, but everything looked shipshape.
4) Opened a command prompt, navigated to the directory containing the file, and started Python.
5) Entered: from lxml import etree
6) Entered: doc = open ('content.rdf.u8.xml', 'r')
7) Entered: docParsed = etree.parse(doc)

Steps 4, 5 and 6 all went smoothly, but after step 7, the RAM usage went up to around 96% (fair enough, it's a big document) and the Windows UI became sluggish. It didn't crash, and the RAM usage stabilised around that level, with Windows Task Manager showing well under 10% CPU load from Python. Still, I figured it might take a while to parse, so I left it overnight. In the morning, I found the following error message immediately underneath the command I'd entered in step 7:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 2520, in lxml.etree.parse
  File "parser.pxi", line 1331, in lxml.etree._parseDocument
  File "parser.pxi", line 1361, in lxml.etree._parseFilelikeDocument
  File "parser.pxi", line 1254, in lxml.etree._parseDocFromFilelike
  File "parser.pxi", line 850, in lxml.etree._BaseParser._parseDocFromFilelike
  File "parser.pxi", line 452, in lxml.etree._ParserContext._handleParseResultDoc
  File "parser.pxi", line 536, in lxml.etree._handleParseResult
  File "parser.pxi", line 478, in lxml.etree._raiseParseError
lxml.etree.XMLSyntaxError: Memory allocation failed : building node

I hope that's meaningful to someone, and that perhaps I might be able to get some suggestions about how to parse the file on my PC.
Also, I was thinking of trying to parse the file on a virtual server that only has 64M of RAM. I don't mind if the VPS takes a day or two, as long as the code to make it work is fairly straightforward. So any suggestions about that option would be helpful too.

Many thanks,

Sam

---
MacBook 2.13GHz with 2GB RAM
Windows Vista Home Premium via Leopard Boot Camp
ActivePython 2.5.1
lxml installed via lxml-2.0.3-py2.5-win32.egg (this was the most up-to-date egg that was available last time I checked, which was about a week or two ago)
Hmm, 64M might be unfeasibly low. Let's say 128M. Anyway, if I did go with this option, it would probably be on one of the cheaper of these machines <http://www.vpsville.ca/plans> (or something similar somewhere else), which seem like a potentially inexpensive resource for doing offline data-munging.
Gosh, this is turning into a really fragmented post; apologies. I meant to add to the first post that once parsed, my intention was to run a fairly simple XSL transform on the document, to extract a copy of each of the URLs it contains. Probably something like this:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <html>
      <body>
        <h2>ODP URLs</h2>
        <xsl:for-each select="Topic/link">
          <p><xsl:value-of select="@r:resource"/></p>
        </xsl:for-each>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>

Thanks for your patience; I'm still relatively new at this stuff,

Sam
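(Editorial note: a minimal sketch of how a stylesheet along these lines can be applied with lxml's XSLT support. The `r` namespace URI and the tiny stand-in document are assumptions for illustration; the real ODP dump declares its own namespaces, which the stylesheet would need to bind.)

```python
from lxml import etree

# A stylesheet like the one above; note that the `r` prefix must be
# declared in the stylesheet itself for @r:resource to match
# (the namespace URI here is an assumption).
xslt_root = etree.XML("""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:r="http://www.w3.org/TR/RDF/">
  <xsl:template match="/">
    <html><body>
      <xsl:for-each select="//Topic/link">
        <p><xsl:value-of select="@r:resource"/></p>
      </xsl:for-each>
    </body></html>
  </xsl:template>
</xsl:stylesheet>""")
transform = etree.XSLT(xslt_root)

# Tiny stand-in document, for illustration only
doc = etree.XML('<RDF xmlns:r="http://www.w3.org/TR/RDF/">'
                '<Topic><link r:resource="http://example.com/"/></Topic>'
                '</RDF>')
print(str(transform(doc)))
```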
Hi,

Sam Kuper wrote:
Gosh, this is turning into a really fragmented post; apologies. I meant to add to the first post that once parsed, my intention was to run a fairly simple XSL transform on the document, to extract a copy of each of the URLs it contains. Probably something like this: [XSLT stylesheet snipped; quoted in full above]
That is a problem that can be solved with extremely little memory. Take a look at the (SAX-like) target parser interface, which will not build a tree and instead just receives callbacks while parsing:

http://codespeak.net/lxml/dev/parsing.html#the-target-parser-interface

Write a parser target class that keeps track of being inside or outside the "Topic" tag (start/end), and whenever you find a "link" tag while inside a "Topic" tag, look for a "{whatever-namespace}resource" attribute in the attrib dictionary and write it into a hand-generated HTML stream like the one you used above.

Stefan
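(Editorial note: a sketch of such a parser target, assuming the ODP dump binds the `r` prefix to `http://www.w3.org/TR/RDF/`; the tag and attribute names would need checking against the real document.)

```python
from lxml import etree

class LinkCollector(object):
    """Parser target: writes out the r:resource attribute of each
    <link> element found inside a <Topic> element, without ever
    building a tree."""
    # Assumed namespace URI -- check the document's own xmlns:r declaration.
    RESOURCE = '{http://www.w3.org/TR/RDF/}resource'

    def __init__(self, out):
        self.out = out
        self.in_topic = False

    def start(self, tag, attrib):
        if tag.endswith('Topic'):
            self.in_topic = True
        elif self.in_topic and tag.endswith('link'):
            url = attrib.get(self.RESOURCE)
            if url:
                self.out.write(url + '\n')

    def end(self, tag):
        if tag.endswith('Topic'):
            self.in_topic = False

    def close(self):
        return 'done'

# Usage sketch (file names are illustrative):
# parser = etree.XMLParser(target=LinkCollector(open('links.txt', 'w')))
# etree.parse('content.rdf.u8.xml', parser)
```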
Dear Stefan,

I've tried the method you've suggested below, but it isn't quite working for me. It may be that I've misunderstood your suggestion. I'll explain what I've tried. Here is my Python program, extract_links_dmoz.py:

from lxml import etree

infile = open("content.example.xml", "r")
infile.seek(0)
outfile = open("output_test001.txt", "w")

class EchoTarget():
    def start(self, tag, attrib):
        if tag.endswith("xternalPage"):
            line = attrib["about"]
            if line != "":
                outfile.write(line+"\n")
                print line
    def close(self):
        return "closed!"

parser = etree.XMLParser(target = EchoTarget())
result = etree.XML(infile.read(), parser)

This uses the short, example RDF file at http://rdf.dmoz.org/rdf/content.example.txt (which I have renamed content.example.xml), and works fine. When I view the output_test001.txt file, it contains one URL per line, which is exactly what I want for now.

However, if I change the program to read content.rdf.u8.xml (i.e. the full-length DMOZ links file from http://rdf.dmoz.org/rdf/content.rdf.u8.gz) instead of content.example.xml, then when I run the program I get the following error:

Traceback (most recent call last):
  File "extract_links_dmoz.py", line 26, in <module>
    result = etree.XML(infile.read(), parser)
MemoryError

Any help you (or others) can offer would be greatly appreciated.

Many thanks,

Sam

2008/5/22 Stefan Behnel <stefan_ml@behnel.de>:
Hi,
Sam Kuper wrote:
Gosh, this is turning into a really fragmented post; apologies. I meant to add to the first post that once parsed, my intention was to run a fairly simple XSL transform on the document, to extract a copy of each of the URLs it contains. Probably something like this: [XSLT stylesheet snipped; quoted in full above]
That is a problem that can be solved with extremely little memory. Take a look at the (SAX-like) target parser interface, which will not build a tree and instead just receive callbacks while parsing:
http://codespeak.net/lxml/dev/parsing.html#the-target-parser-interface
Write a parser target class that keeps track of being inside or outside the "Topic" tag (start/end), and whenever you find a "link" tag while inside a "Topic" tag, look for a "{whatever-namespace}resource" attribute in the attrib dictionary and write it into a hand-generated HTML stream like the one you used above.
Stefan
Dear Stefan,

I did read your other post, but using the file name directly when calling the parser didn't work for me. Here is what I tried:

from lxml import etree

outfile = open("output_test001.txt", "w")

class EchoTarget():
    def start(self, tag, attrib):
        if tag.endswith("xternalPage"):
            line = attrib["about"]
            if line != "":
                outfile.write(line+"\n")
                print line
    def close(self):
        return "closed!"

parser = etree.XMLParser(target = EchoTarget())
result = etree.XML("content.example.xml", parser)

This gives the following error:

Traceback (most recent call last):
  File "extract_links_dmoz005.py", line 15, in <module>
    result = etree.XML("content.example.xml", parser)
  File "lxml.etree.pyx", line 2358, in lxml.etree.XML
  File "parser.pxi", line 1354, in lxml.etree._parseMemoryDocument
  File "parser.pxi", line 1243, in lxml.etree._parseDoc
  File "parser.pxi", line 795, in lxml.etree._BaseParser._parseDoc
  File "parsertarget.pxi", line 130, in lxml.etree._TargetParserContext._handleParseResultDoc
  File "parser.pxi", line 478, in lxml.etree._raiseParseError
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

I have been reading the docs, but I'm new to processing XML in Python, so I don't find them all that easy to understand. I think I'm improving, though :)

Thanks for your patience.

Best,

Sam

2008/5/24 Stefan Behnel <stefan_ml@behnel.de>:
Hi,
did you read my other post?
Sam Kuper wrote:
result = etree.XML(infile.read(), parser)
make that
result = etree.parse("thefile.xml", parser)
and consider reading the parser docs on the web page.
Stefan
--
http://five.sentenc.es | http://tinyurl.com/3x9se4
--
Mr Sam Pablo Kuper BSc MRI
Research Assistant
Darwin Correspondence Project
Cambridge University Library
West Road
Cambridge CB3 9DR
spk30@cam.ac.uk
Office: +44 (0)1223 333008
Mobile: +44 (0)7971858176
www.darwinproject.ac.uk
Hi, RMP! :)

Sam Kuper wrote:
result = etree.XML("content.example.xml", parser)
2008/5/24 Stefan Behnel:
result = etree.parse("thefile.xml", parser)
See the difference?

Please read
http://codespeak.net/lxml/tutorial.html#parsing-from-strings-and-files
and
http://codespeak.net/lxml/parsing.html

Stefan
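(Editorial note: the distinction Stefan is pointing at, sketched in a couple of lines.)

```python
from lxml import etree

# etree.XML() parses a string *containing* XML markup:
root = etree.XML('<doc><a/></doc>')
print(root.tag)  # -> doc

# etree.parse() takes a filename, URL, or file-like object instead.
# Passing a filename to etree.XML() hands the parser the literal
# string "content.example.xml", which starts with 'c', not '<' --
# hence "Start tag expected, '<' not found".
# tree = etree.parse('content.example.xml')  # file-based equivalent
```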
Hi,

Sam Kuper wrote:
Dear lovely lxmlves, Yesterday I tried to parse a large file, the Open Directory Project's links document, available here <http://rdf.dmoz.org/rdf/content.rdf.u8.gz>. The process went like this:
1) Unzipped the file using 7-zip. No errors reported. 2) Renamed the file by adding a .xml extension, mainly so Windows (see my spec below) would recognise it as an XML file. 3) Had a look at the file in Oxygen's large document viewer. It took a few minutes to load, but everything looked shipshape. 4) Opened a command prompt, navigated to the directory containing the file, and started Python. 5) Entered: from lxml import etree 6) Entered: doc = open ('content.rdf.u8.xml', 'r') 7) Entered: docParsed = etree.parse(doc)
lxml can parse from a gzipped XML file; no need to do steps 1) and 6), just do

docParsed = etree.parse('content.rdf.u8.xml.gz')

or even

docParsed = etree.parse('http://rdf.dmoz.org/rdf/content.rdf.u8.gz')

BTW, if you do 6) it should read

doc = open ('content.rdf.u8.xml', 'rb')

mind the 'rb' at the end.
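(Editorial note: when passing an already-open file object rather than a filename, the decompression can be done explicitly with the standard gzip module; a sketch, demonstrated on an in-memory gzip stream since the real dump file isn't available here.)

```python
import gzip
import io

from lxml import etree

def parse_gzipped(fileobj):
    """Parse gzip-compressed XML from an open binary file object."""
    return etree.parse(gzip.GzipFile(fileobj=fileobj, mode='rb'))

# Round-trip demonstration on an in-memory gzip stream
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as g:
    g.write(b'<doc><a/></doc>')
buf.seek(0)
print(parse_gzipped(buf).getroot().tag)  # -> doc
```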
Steps 4, 5 and 6 all went smoothly, but after step 7, the RAM usage went up to around 96% (fair enough, it's a big document) and the Windows UI became sluggish. It didn't crash, and the RAM usage stabilised around that amount, with Windows Task Manager showing well under 10% CPU load from Python.
That means your machine was heavily swapping. The in-memory tree of libxml2 is much larger than the serialised document itself, so if it doesn't fit into RAM, parsing the tree into memory will not make you happy, especially not with 64/128MB...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 2520, in lxml.etree.parse
  File "parser.pxi", line 1331, in lxml.etree._parseDocument
  File "parser.pxi", line 1361, in lxml.etree._parseFilelikeDocument
  File "parser.pxi", line 1254, in lxml.etree._parseDocFromFilelike
  File "parser.pxi", line 850, in lxml.etree._BaseParser._parseDocFromFilelike
  File "parser.pxi", line 452, in lxml.etree._ParserContext._handleParseResultDoc
  File "parser.pxi", line 536, in lxml.etree._handleParseResult
  File "parser.pxi", line 478, in lxml.etree._raiseParseError
lxml.etree.XMLSyntaxError: Memory allocation failed : building node
Your operating system stopped allowing Python to allocate more memory, and it didn't even crash; it just gave you an exception. Isn't that cool? :) (Although I wouldn't generally rely on that...)

Stefan
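(Editorial note: besides the target-parser interface Stefan suggested earlier in the thread, lxml's `iterparse` is another way to keep memory flat; a sketch, with the ODP namespace URI assumed.)

```python
from lxml import etree

def extract_resources(source):
    """Stream-parse `source`, yielding the r:resource attribute of each
    <link> element and clearing each element once it has been handled,
    so the in-memory tree never grows with the document."""
    resource = '{http://www.w3.org/TR/RDF/}resource'  # assumed namespace
    for event, elem in etree.iterparse(source, events=('end',)):
        if elem.tag.endswith('link'):
            url = elem.get(resource)
            if url:
                yield url
        elem.clear()  # drop the subtree we no longer need

# Usage sketch (filename is illustrative):
# for url in extract_resources('content.rdf.u8.xml'):
#     print(url)
```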
participants (2)
- Sam Kuper
- Stefan Behnel