[lxml] SyntaxError when parsing UTF8-BOM encoded XML file

Jan. 28, 2014

Hello,

I ran into an issue when trying to parse a UTF-8 file with a BOM (Python 2.7).
It worked fine up to version 3.2.5, but it has been failing since version
3.3.0-beta1.
This is the error I get when calling etree.iterparse(path, tag='item'):

  File "iterparse.pxi", line 166, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:116372)
XMLSyntaxError: Document is empty, line 1, column 1

I had a look at tests/test_elementtree.py and saw that it differs from
what it used to be years ago:

    def test_encoding_utf8_bom(self):
        utext = _str('Søk på nettet')
        uxml = (_str('<?xml version="1.0" encoding="UTF-8"?>') +
                _str('<p>%s</p>') % utext)
        bom = _bytes('\\xEF\\xBB\\xBF').decode("unicode_escape").encode("latin1")
        xml = bom + uxml.encode("utf-8")
        tree = etree.XML(xml)
        self.assertEqual(utext, tree.text)

In the mailing list archive, the only thread I managed to find is this one:
http://article.gmane.org/gmane.comp.python.lxml.devel/2967/match=bom
but it dates from 2007, so it is not relevant.
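
In case it is useful as a stopgap, here is a minimal sketch of a possible
workaround. It assumes the failure is triggered only by the three BOM bytes at
the start of the file, which I have not confirmed. The helper name
open_skipping_bom is my own invention, and the demo uses the standard
library's xml.etree.ElementTree as a stand-in so that it runs without lxml.

```python
import codecs
import io
import os
import tempfile
import xml.etree.ElementTree as ET  # stdlib stand-in so the sketch runs without lxml


def open_skipping_bom(path):
    """Open `path` in binary mode, positioned just past a UTF-8 BOM if present.

    Hypothetical helper, not part of lxml.
    """
    f = io.open(path, "rb")
    head = f.read(len(codecs.BOM_UTF8))
    if head != codecs.BOM_UTF8:
        f.seek(0)  # no BOM: rewind so the parser sees the whole file
    return f


def demo():
    # Build a small BOM-prefixed file, similar to what
    # test_encoding_utf8_bom builds in memory.
    body = (u'<?xml version="1.0" encoding="UTF-8"?>'
            u'<root><item>S\xf8k</item><item>p\xe5</item></root>')
    fd, path = tempfile.mkstemp(suffix=".xml")
    try:
        with os.fdopen(fd, "wb") as out:
            out.write(codecs.BOM_UTF8 + body.encode("utf-8"))
        with open_skipping_bom(path) as f:
            # The stdlib iterparse has no tag= keyword, so filter by hand.
            return [el.text for _, el in ET.iterparse(f) if el.tag == "item"]
    finally:
        os.remove(path)
```

With lxml, the file object returned by the helper would be passed as
etree.iterparse(f, tag='item'), since iterparse accepts a file-like object as
well as a path.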

That said, lxml is amazing :)

Thank you,
Stefano

Stefano Fontana