Hello,
lxml.etree.parse is able to load gzipped XML files directly, but
lxml.etree.iterparse is not. See below for an interactive session
demonstrating the problem on Debian stable. Is this the expected
behavior, or is it a bug?
The documentation does not address this point; it says only:
> lxml can parse from a local file, an HTTP URL or an FTP URL. It
> also auto-detects and reads gzip-compressed XML files (.gz).
Context: I'm handling hundreds of gigabyte-sized files. It would be nice
to store them gzipped and have lxml decompress them on the fly, without
any extra Python code.
Thanks!
% python
Python 2.5.2 (r252:60911, Jan 4 2009, 21:59:32)
[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import gzip, sys
>>> from lxml import etree
>>> print etree.__version__
2.1.1
Let's create a gzipped XML file:
>>> gzip.open('test.xml.gz', 'wb').write('<a><b /></a>')
etree.parse is able to load it:
>>> tree = etree.parse('test.xml.gz')
>>> tree.write(sys.stdout); print
<a><b/></a>
etree.iterparse fails on the same file:
>>> ctx = etree.iterparse('test.xml.gz')
>>> list(ctx)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "iterparse.pxi", line 498, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:73245)
  File "parser.pxi", line 564, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:53770)
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1
etree.iterparse works when given a file object opened with gzip.open:
>>> ctx = etree.iterparse(gzip.open('test.xml.gz', 'rb'))
>>> list(ctx)
[(u'end', <Element b at 7f742b265310>), (u'end', <Element a at 7f742b2652b8>)]
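
For reference, the gzip.open workaround shown in the session could be wrapped
in a small helper so that calling code does not need to care whether a file is
compressed. This is only a sketch under a few assumptions: the helper name
iterparse_maybe_gzipped is made up for illustration, compression is detected
from the '.gz' suffix alone, and the fallback to the standard library's
ElementTree is there only so the snippet runs where lxml is not installed.

```python
import gzip

try:
    from lxml import etree  # the library discussed in this message
except ImportError:
    # Assumption: fall back to the standard library, whose iterparse
    # offers a compatible basic interface for this use.
    import xml.etree.ElementTree as etree


def iterparse_maybe_gzipped(path, events=("end",)):
    """Yield (event, element) pairs from a plain or gzipped XML file.

    Hypothetical helper: gzip is detected from the '.gz' suffix only,
    not from the file contents.
    """
    opener = gzip.open if path.endswith(".gz") else open
    # Pass a file object rather than a file name, as in the working
    # iterparse call from the session.
    with opener(path, "rb") as stream:
        for event, elem in etree.iterparse(stream, events=events):
            yield event, elem
```

A caller would then loop over iterparse_maybe_gzipped('test.xml.gz') exactly as
over a regular iterparse context. For very large files it is common to call
elem.clear() on each element once it has been processed, so that the tree does
not grow without bound.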
--
Aymeric Augustin.