SyntaxError when parsing UTF8-BOM encoded XML file
Hello,

I ran into some issues when trying to parse a UTF-8 file with a BOM (Python 2.7). It worked fine up to version 3.2.5, but it has been failing since version 3.3.0-beta1.

This is the error I get when calling etree.iterparse(path, tag='item'):

    File "iterparse.pxi", line 166, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:116372)
    XMLSyntaxError: Document is empty, line 1, column 1

I had a look at tests/test_elementtree.py and saw that it's different from what it used to be years ago:

    def test_encoding_utf8_bom(self):
        utext = _str('Søk på nettet')
        uxml = (_str('<?xml version="1.0" encoding="UTF-8"?>') +
                _str('<p>%s</p>') % utext)
        bom = _bytes('\\xEF\\xBB\\xBF').decode("unicode_escape").encode("latin1")
        xml = bom + uxml.encode("utf-8")
        tree = etree.XML(xml)
        self.assertEqual(utext, tree.text)

On the mailing list I only managed to find this thread:
http://article.gmane.org/gmane.comp.python.lxml.devel/2967/match=bom
but it's not relevant because it's from 2007.

That said, lxml is amazing :)

Thank you,
Stefano
Stefano Fontana, 28.01.2014 16:24:
I ran into some issues when trying to parse a UTF-8 file with a BOM (Python 2.7). It worked fine up to version 3.2.5, but it has been failing since version 3.3.0-beta1. This is the error I get when calling etree.iterparse(path, tag='item'):

    File "iterparse.pxi", line 166, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:116372)
    XMLSyntaxError: Document is empty, line 1, column 1

I had a look at tests/test_elementtree.py and saw that it's different from what it used to be years ago:

    def test_encoding_utf8_bom(self):
        utext = _str('Søk på nettet')
        uxml = (_str('<?xml version="1.0" encoding="UTF-8"?>') +
                _str('<p>%s</p>') % utext)
        bom = _bytes('\\xEF\\xBB\\xBF').decode("unicode_escape").encode("latin1")
        xml = bom + uxml.encode("utf-8")
        tree = etree.XML(xml)
        self.assertEqual(utext, tree.text)
Yep, works for me when parsing from memory and files, but not with incremental parsing. I faintly recall that being an issue with libxml2's push parser (which is used for incremental parsing), but that doesn't mean there is nothing lxml could do about it. Patches welcome.

Stefan
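For readers hitting the same error, one possible workaround is to consume the BOM before the incremental parser sees the data. The following is only a minimal sketch, not the fix that went into lxml: skip_utf8_bom is a hypothetical helper, the sample document is made up, and the stdlib xml.etree.ElementTree.iterparse stands in for lxml.etree.iterparse (which also accepts a file-like object). It assumes the affected versions choke only on the BOM bytes themselves.

```python
# -*- coding: utf-8 -*-
import codecs
import io
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree in this sketch


def skip_utf8_bom(stream):
    """Advance a seekable binary stream past a leading UTF-8 BOM, if present."""
    start = stream.tell()
    if stream.read(len(codecs.BOM_UTF8)) != codecs.BOM_UTF8:
        stream.seek(start)  # no BOM: rewind so the parser sees every byte
    return stream


# Hypothetical sample document; a real file would be opened with open(path, "rb").
document = codecs.BOM_UTF8 + (
    u'<?xml version="1.0" encoding="UTF-8"?>'
    u'<root><item>Søk på nettet</item></root>'
).encode("utf-8")

stream = skip_utf8_bom(io.BytesIO(document))
# Incremental parsing now starts at the XML declaration instead of the BOM.
items = [elem.text for _event, elem in ET.iterparse(stream) if elem.tag == "item"]
```

The helper leaves streams without a BOM untouched, so it can be applied unconditionally to any seekable binary file object before handing it to the parser.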
Stefan Behnel, 28.01.2014 17:18:
Yep, works for me when parsing from memory and files, but not with incremental parsing. I faintly recall that being an issue with libxml2's push parser (which is used for incremental parsing), but that doesn't mean there is nothing lxml could do about it. Patches welcome.
Bug report is here: https://bugs.launchpad.net/lxml/+bug/1274118

Stefan
Stefan Behnel, 29.01.2014 18:24:
Bug report is here:
I've uploaded a source distro for testing here: http://lxml.de/files/lxml-3.3.1pre.tar.gz

Stefan
Hi,

when an XPath expression is held in a plain (byte) string, we need to be sure it is decoded from UTF-8. Python treats encode and decode as opposite operations, and the one needed here is decode. I got an XPath containing "é" working like this:

    stringxpath = '//div[@id="México"]'
    hparser = lxml.html.HTMLParser(encoding=pcoding, remove_comments=True)
    html_document = lxml.html.fromstring(content, parser=hparser)
    html_document.xpath(stringxpath.decode('utf-8'))
_________________________________________________________________ Mailing list for the lxml Python XML toolkit - http://lxml.de/ lxml@lxml.de https://mailman-mail5.webfaction.com/listinfo/lxml
-- Sérgio M. B.
Sérgio Basto, 28.01.2014 18:43:
when a stringxpath is a string we need be sure that encodes in utf-8 , python see enconde and decode in opposite way, so command is decode, I got stringxpath working with "é" like this:
    stringxpath = '//div[@id="México"]'
    hparser = lxml.html.HTMLParser(encoding=pcoding, remove_comments=True)
    html_document = lxml.html.fromstring(content, parser=hparser)
    html_document.xpath(stringxpath.decode('utf-8'))
1) this has nothing to do with the topic of this thread.
2) this is completely the wrong way to do this.

Actually, lxml should reject the XPath expression as invalid byte string input, so, thanks for bringing this up.

Stefan
On Wed, 2014-01-29 at 08:08 +0100, Stefan Behnel wrote:
1) this has nothing to do with the topic of this thread.
2) this is completely the wrong way to do this.
Actually, lxml should reject the XPath expression as invalid byte string input, so, thanks for bringing this up.
    In [1]: stringxpath = '//div[@id="México"]'
    In [3]: stringxpath.decode('utf-8')
    Out[3]: u'//div[@id="M\xe9xico"]'

That is not byte string input, or maybe I don't understand. But I'd love to know how to do this the right way (Python 2.7), since that is how I use lxml :)

Many thanks,
-- Sérgio M. B.
Sérgio Basto, 29.01.2014 08:21:
On Wed, 2014-01-29 at 08:08 +0100, Stefan Behnel wrote:
1) this has nothing to do with the topic of this thread.
2) this is completely the wrong way to do this.
Actually, lxml should reject the XPath expression as invalid byte string input
And it does, I just checked.
    In [1]: stringxpath = '//div[@id="México"]'
    In [3]: stringxpath.decode('utf-8')
    Out[3]: u'//div[@id="M\xe9xico"]'
is not a byte string input, or maybe I don't understand.
Sorry, my fault. I misread the "decode()" as "encode()", because I didn't see why you would *decode* an obvious Unicode string.

The right way to do this is to say

    stringxpath = u'//div[@id="México"]'

I.e. with a "u" prefix to make it a Unicode string in Py2.x.

In any case, passing Unicode strings (at least for anything that's not plain ASCII text in Py2.x) is totally the right thing to do. Sorry for the confusion.

Stefan
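When the expression cannot be written as a literal with a "u" prefix (for example because it arrives as bytes from a file or a socket), decoding it with the encoding it was written in gives the same Unicode string. The following is a small sketch of that idea, assuming the bytes are UTF-8. The names raw_xpath and xpath_text and the sample markup are made up for illustration, and the stdlib ElementTree findall() (which wants a leading "." for a relative path) stands in for lxml's xpath().

```python
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET  # stdlib stand-in; lxml's xpath() is not used here

# Bytes as they might be read from a UTF-8 encoded file (assumption).
raw_xpath = u'.//div[@id="México"]'.encode("utf-8")

# decode() turns the byte string into a text (Unicode) string.
xpath_text = raw_xpath.decode("utf-8")

# Hypothetical sample document containing the non-ASCII attribute value.
root = ET.fromstring(
    u'<html><body><div id="México">hola</div></body></html>'.encode("utf-8")
)
matches = root.findall(xpath_text)
```

The decoded value compares equal to the u'...' literal, so either form can be passed on once it is a text string.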
On Wed, 2014-01-29 at 08:37 +0100, Stefan Behnel wrote:
Sorry, my fault. I misread the "decode()" as "encode()", because I didn't see why you would *decode* an obvious Unicode string.
The right way to do this is to say
stringxpath = u'//div[@id="México"]'
Hmm, thanks. By the way, with Python 2.7, do you know how I can convert '//div[@id="México"]' to u'//div[@id="México"]'? Thanks for your reply.
I.e. with a "u" prefix to make it a Unicode string in Py2.x.
In any case, passing Unicode strings (at least for anything that's not plain ASCII text in Py2.x), is totally the right thing to do. Sorry for the confusion.
Stefan
-- Sérgio M. B.
participants (3): Stefan Behnel, Stefano Fontana, Sérgio Basto