[lxml-dev] parser bug in lxml 1.0

Hi there, After a hint from Guido Wesdorp, I tried the following with lxml 1.0: utf.xml: <?xml version="1.0"?> <foo> This is some UTF-8 content: ë </foo> and this script (tryparse.py): from lxml import etree f = open('utf.xml', 'r') etree.parse(f) f.close() running it gives the following traceback: Traceback (most recent call last): File "tryparse.py", line 4, in ? etree.parse(f) File "etree.pyx", line 1468, in etree.parse File "parser.pxi", line 671, in etree._parseDocument File "parser.pxi", line 697, in etree._parseFilelikeDocument File "parser.pxi", line 622, in etree._parseDocFromFilelike File "parser.pxi", line 379, in etree._BaseParser._parseDocFromFilelike File "parser.pxi", line 418, in etree._handleParseResult File "etree.pyx", line 151, in etree._ExceptionContext._raise_if_stored File "parser.pxi", line 159, in etree.copyToBuffer File "apihelpers.pxi", line 319, in etree._utf8 AssertionError: All strings must be Unicode or ASCII This is of course wrong. lxml should definitely be able to parse UTF-8 encoded XML files. This did work in previous versions of lxml too. It also looks like it is going into an in-memory string parser. I recall in earlier versions of lxml this wasn't necessary - the file object was inspected and the filename was extracted, passing it into libxml2 directly. Regards, Martijn

Hi Martijn, Martijn Faassen wrote:
By filename, sure. When you pass through Python, however, that's different.
True. However, what do you do with code like this: f = open('embedded-xml.txt', 'r') f.seek(non_xml_header_length) etree.parse(f) f.close() I think, if you pass a file-like object, then lxml should assume that there must be a reason for you to do that. Otherwise you'd just pass the plain file name in the first place, right? There is an obvious semantic difference between etree.parse("file.xml") and f = open("file.xml", "r") etree.parse(f) f.close() We could use the same trick as for StringIO, ask the file object if it reads from the beginning (f.tell() == 0) and then special case that to read from the file name. But that would destroy the semantics of the second call: "read from the file object I passed". You'd also have hidden tempfile issues and the like. Both calls are not equivalent. I guess we should not encode what comes in from a file-like object, so I think the bug is rather at that place. However, then we'd have no way to deal with file-like objects returning unicode strings - we just don't know what they return before we start reading from them... Maybe it's acceptable to just raise an exception for file-like objects that return unicode strings or read(). Stefan

Hi Fredrik, Fredrik Lundh wrote:
Well, there is one, though. The first call to parse() says: "Here is a filename, look it up in the file system and parse the data contained in the file it points to." The second one says: "Here is an object that will give you XML data when you call it's read() method. Do that and parse what it returns." *That* is standard Python behaviour. It would be wrong to change the second into: "Here is an object that will give you XML data when you call it's read() method. Try to figure out which file in the file system it originally came from, then open that file again and read the data you get from it." One of the differences is that the file position will be magically unchanged after reading "from the file object". Another difference is that the file will be read from the beginning as I pointed out. Both are totally unexpected behaviour. I'm not questioning the general bug in lxml. I'm just saying that it's better to read the file-like object as such than to read from the file system where the user wanted us to read from a file(-like) object. Stefan

Stefan Behnel wrote: [snip]
Well, figuring out the filename is necessary anyway if we want to make things like, say, XSLT includes, work correctly right? That said of course lxml should behave correctly for Python file objects, also in the case of seek(). But, the current behavior attempting at correctness breaks far more than the hack I coded in reading the filename directly - so you could call my old approach "worse is better". :) Regards, Martijn

Fredrik Lundh wrote:
I agree that this should semantically be exactly the same. Stefan goes on to explain that in the case of the file object a seek can happen first, which is not the case for referring a filename. It would be nice if we could support seek(), and of course if you pass in a file object you can mess with the file object in advance, something you cannot do when you pass in a filename. I assume Stefan refers to that. Then again I think that *this* case, where no seek occurs, should behave the same way as the direct opening of file, and hopefully follow the same codepath in lxml, ideally a fast one that doesn't need the reading in as the file as a string. I think Fredrik had a point in previous discussions that involving unicode in both the parsing and serialization side of XML makes things less clear. We should be very careful to make sure lxml works correctly for both parsing and serialization of encoded text (in strings or files, referenced through fileobject or filename). That's the *main* usecase and this usecase should ideally be reflected in the source code. Anything to do with dealing with Python unicode strings at that level is a bonus. Regards, Martijn

Stefan Behnel wrote:
lxml supported the reading of UTF-8 encoded content when you pass in a file object. I'm pretty sure ElementTree does. So, this is a bug.
Sure, if there's no way to make this work better, then we should go the string route. But it still shouldn't bail out if you pass in a file that contains UTF-8 data (or *any* encoding as long as there's an encoding declaration).
The file-like object (StringIO) that returns unicode case is not so important to me - more important is not to break backwards compatibility, not to break compatibility with ElementTree, and the ability to actually parse XML files that are passed in through a file object.
Maybe it's acceptable to just raise an exception for file-like objects that return unicode strings or read().
Yes, I agree I'd rather have that be broken than the current situation. Regards, Martijn

Hi Martijn, Martijn Faassen wrote:
This is fixed. The new behaviour is: Parsing a file object or file-like object reads the data in chunks from the object and checks each chunk if it's a plain string. If not, it raises a TypeError. Otherwise, it passes the bytes directly into libxml2. This means that parsing unicode strings from file-like objects is no longer supported, mainly due to the encoding sensing bug in libxml2. Parsing file-objects from the file system rather than the Python object could be made to work as in previous versions of lxml, but I really don't see why. If you want plain speed, pass a file name. If you want control over the data and the way it is read, pass a file(-like) object. Stefan

Stefan Behnel wrote: [snip bug]
This is fixed. The new behaviour is:
Great! We need to do a 1.0.1 release soon then. :)
Okay, too bad but not a big loss.
Okay, understood. How does this deal with XSLT parsing where includes need to be resolved? Do you somehow pass the filepath along? Regards, Martijn

Hi Martijn, Martijn Faassen wrote:
I had planned that anyway. But there is still the mysterious crash that Petr encountered, and I don't want to release 1.0.1 until we made sure there won't necessarily be a 1.0.2 a week later.
I think so, too.
Yes. It uses mainly the same file-name-figuring-out code that you wrote. However, it's good you asked. You just made me aware of a second bug: loading external entities during parsing. This should also work *now*. It never worked for things like GZipFiles etc. but it stopped working in 1.0 for plain file objects. Oh, well... Stefan
participants (3)
-
Fredrik Lundh
-
Martijn Faassen
-
Stefan Behnel