SyntaxError when parsing UTF8-BOM encoded XML file

Hello,

I ran into an issue when trying to parse a UTF8-BOM file (Python 2.7). It worked fine up to version 3.2.5, but it no longer works starting with version 3.3.0-beta1.

This is the error I get when calling etree.iterparse(path, tag='item'):

    File "iterparse.pxi", line 166, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:116372)
    XMLSyntaxError: Document is empty, line 1, column 1

I had a look at tests/test_elementtree.py and saw that it's different from what it used to be years ago:

    def test_encoding_utf8_bom(self):
        utext = _str('Søk på nettet')
        uxml = (_str('<?xml version="1.0" encoding="UTF-8"?>') +
                _str('<p>%s</p>') % utext)
        bom = _bytes('\\xEF\\xBB\\xBF').decode("unicode_escape").encode("latin1")
        xml = bom + uxml.encode("utf-8")
        tree = etree.XML(xml)
        self.assertEqual(utext, tree.text)

On the mailing list I only managed to find this thread:
http://article.gmane.org/gmane.comp.python.lxml.devel/2967/match=bom
but it's not relevant because it's from 2007.

That said, lxml is amazing :)

Thank you,
Stefano

Stefano Fontana, 28.01.2014 16:24:
Yep, works for me when parsing from memory and files, but not with incremental parsing. I faintly recall that being an issue with libxml2's push parser (which is used for incremental parsing), but that doesn't mean there is nothing lxml could do about it. Patches welcome.

Stefan
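
Until incremental parsing handles the BOM, one possible caller-side workaround is to skip the three BOM bytes before handing the file to the parser. The following is only a minimal sketch under that assumption: open_without_bom is a hypothetical helper, not part of lxml, and it relies only on the standard library. It has not been checked against the affected lxml versions.

```python
import codecs


def open_without_bom(path):
    """Open `path` in binary mode, positioned just past a leading UTF-8 BOM.

    Hypothetical helper for illustration; if the file does not start with
    a BOM, the file object is rewound to the beginning.
    """
    f = open(path, "rb")
    head = f.read(len(codecs.BOM_UTF8))
    if head != codecs.BOM_UTF8:
        # No BOM present: rewind so the parser sees the whole file.
        f.seek(0)
    return f
```

Since etree.iterparse() accepts a file-like object as well as a path, the returned object could be passed in place of the path, e.g. etree.iterparse(open_without_bom(path), tag='item'). The caller is then responsible for closing the file.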

Stefan Behnel, 28.01.2014 17:18:
Bug report is here: https://bugs.launchpad.net/lxml/+bug/1274118

Stefan

Stefan Behnel, 29.01.2014 18:24:
I've uploaded a source distro for testing here: http://lxml.de/files/lxml-3.3.1pre.tar.gz

Stefan

Hi,

when the XPath expression is a byte string, we need to make sure it is interpreted as UTF-8. Python's encode and decode work in opposite directions, so the call needed here is decode. I got an XPath expression containing "é" working like this:

    stringxpath = '//div[@id="México"]'
    hparser = lxml.html.HTMLParser(encoding=pcoding, remove_comments=True)
    html_document = lxml.html.fromstring(content, parser=hparser)
    html_document.xpath(stringxpath.decode('utf-8'))

On Tue, 2014-01-28 at 16:24 +0100, Stefano Fontana wrote:
-- Sérgio M. B.

On Wed, 2014-01-29 at 08:08 +0100, Stefan Behnel wrote:

    In [1]: stringxpath = '//div[@id="México"]'
    In [3]: stringxpath.decode('utf-8')
    Out[3]: u'//div[@id="M\xe9xico"]'

So what I pass to xpath() is not a byte string, or maybe I don't understand. But I'd love to know how to do this the right way (Python 2.7), since this is how I use lxml :)

Many thanks,
-- Sérgio M. B.

Sérgio Basto, 29.01.2014 08:21:
And it does, I just checked.
Sorry, my fault. I misread the "decode()" as "encode()", because I didn't see why you would *decode* an obvious Unicode string.

The right way to do this is to say

    stringxpath = u'//div[@id="México"]'

i.e. with a "u" prefix to make it a Unicode string in Py2.x.

In any case, passing Unicode strings (at least for anything that's not plain ASCII text in Py2.x) is totally the right thing to do. Sorry for the confusion.

Stefan
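
To make the byte string versus Unicode string distinction from this exchange concrete, here is a small sketch using only built-in string operations. The variable names are made up for illustration, and the escapes are spelled out so the snippet does not depend on the source file's encoding. It shows that decoding the UTF-8 bytes gives the same text as writing the expression with a "u" prefix.

```python
# The XPath expression as UTF-8 encoded bytes: "é" takes two bytes (C3 A9).
xpath_bytes = b'//div[@id="M\xc3\xa9xico"]'

# The same expression as a Unicode string: "é" is one code point (U+00E9).
xpath_text = u'//div[@id="M\u00e9xico"]'

# Decoding the bytes as UTF-8 yields the Unicode string.
decoded = xpath_bytes.decode('utf-8')
assert decoded == xpath_text
```

Either form of the Unicode string can then be passed to xpath(); writing the literal with the "u" prefix simply avoids the extra decode step.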
