Mailman 3 [lxml-dev] parser bug in lxml 1.0 - lxml - The Python XML Toolkit

June 7, 2006

Hi there,

After a hint from Guido Wesdorp, I tried the following with lxml 1.0:

utf.xml:

<?xml version="1.0"?>
<foo>
This is some UTF-8 content: ë
</foo>

and this script (tryparse.py):

from lxml import etree

f = open('utf.xml', 'r')
etree.parse(f)
f.close()

Running it gives the following traceback:

Traceback (most recent call last):
   File "tryparse.py", line 4, in ?
     etree.parse(f)
   File "etree.pyx", line 1468, in etree.parse
   File "parser.pxi", line 671, in etree._parseDocument
   File "parser.pxi", line 697, in etree._parseFilelikeDocument
   File "parser.pxi", line 622, in etree._parseDocFromFilelike
   File "parser.pxi", line 379, in etree._BaseParser._parseDocFromFilelike
   File "parser.pxi", line 418, in etree._handleParseResult
   File "etree.pyx", line 151, in etree._ExceptionContext._raise_if_stored
   File "parser.pxi", line 159, in etree.copyToBuffer
   File "apihelpers.pxi", line 319, in etree._utf8
AssertionError: All strings must be Unicode or ASCII

This is of course wrong: lxml should definitely be able to parse UTF-8
encoded XML files, and previous versions of lxml did. It also looks like
the parser is now going through an in-memory string parser. As I recall,
this wasn't necessary in earlier versions of lxml: the file object was
inspected, its filename was extracted, and that filename was passed
into libxml2 directly.
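
For comparison, here is a minimal sketch of the two ways of handing the
same document to etree.parse: by filename, and as an open file object.
It is written for a current Python 3, not the interpreter from the
traceback, so the file is opened in binary mode to give the parser raw
bytes. The helper names (write_sample, parse_by_name, parse_binary_file)
are made up for illustration, and the ElementTree fallback is only there
so the sketch runs without lxml installed. Whether the filename route
avoids the assertion in lxml 1.0 itself is an assumption I have not
verified.

```python
import os
import tempfile

try:
    from lxml import etree  # the library under discussion
except ImportError:
    # Fallback so the sketch still runs without lxml installed;
    # the standard library module offers the same parse() call.
    import xml.etree.ElementTree as etree

# The utf.xml sample from this report, with the e-diaeresis spelled out
# as its two UTF-8 bytes.
XML_BYTES = (
    b'<?xml version="1.0"?>\n'
    b'<foo>\n'
    b'This is some UTF-8 content: \xc3\xab\n'
    b'</foo>\n'
)


def write_sample(directory):
    """Write the sample document into `directory` and return its path."""
    path = os.path.join(directory, 'utf.xml')
    with open(path, 'wb') as f:
        f.write(XML_BYTES)
    return path


def parse_by_name(path):
    """Hand the filename to parse() instead of an open file object."""
    return etree.parse(path).getroot()


def parse_binary_file(path):
    """Pass a file object opened in binary mode, so the parser sees raw bytes."""
    with open(path, 'rb') as f:
        return etree.parse(f).getroot()


with tempfile.TemporaryDirectory() as tmp:
    sample = write_sample(tmp)
    root_by_name = parse_by_name(sample)
    root_from_file = parse_binary_file(sample)
```

Both calls should yield the same <foo> element with the non-ASCII
character intact in its text.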

Regards,

Martijn
