[lxml-dev] Python 3 changes in lxml 2.1
Hi,

as it currently seems, lxml 2.1 will support Python 2.6 and Python 3 out of the box. While fixing up lxml 2.1beta to make this work, I found a couple of things that I needed to change. Here's an (incomplete) list, so that people can start shouting at me for breaking their code. ;)

One major thing that changed is that, in Py3, the API will now always return unicode strings for non-byte-stream data (.text, .tag, namespaces, ...), whereas it continues to return a byte string for plain ASCII data in Py2.

Two things have become a bit quirky now.

We currently return a subclass of ElementTree from XSLT, and you can call str(tree) on it to get the result. Returning a byte string here raises an exception in Py3, so str(result) now behaves as unicode(result) did before, i.e. it returns a Python unicode string. To get the expected result as a byte string, people will have to use the new buffer protocol instead (memoryview & friends). This also means that bytes(xslt_result) will work as expected. Sadly, this means that there isn't a way to get the result that is portable across Py2 and Py3. I'm thinking about adding a .tobytes() method, but I'm not sure this is really helpful.

The second quirk is serialisation to a unicode string. Instead of tostring(root, encoding=unicode) you now have to write tostring(root, encoding=str) in Py3, so this requires source adaptation. Then again, this is (hopefully) a rare usage and most Python code will require Py3 changes anyway. I haven't checked, but the 2to3 tool should normally take care of this.

The ugliest problem I found so far is with doctests. There just isn't a way to write a Py2/Py3 portable doctest that expects exactly a byte string or a unicode string as output, as both are displayed differently in Py2 and Py3. Also, exception names are now fully qualified, so that tracebacks look different. Tons of failing tests for nothing...

Stefan
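For the tostring() quirk described above, one way to avoid touching every call site is to pick the text type once and pass that alias as the encoding. This is only a sketch: the names text_type and serialise_to_text are made up for illustration, and it assumes lxml is installed when the helper is actually called.

```python
# Pick the native text type once: `unicode` on Py2, `str` on Py3.
try:
    text_type = unicode  # Py2: the builtin exists
except NameError:
    text_type = str      # Py3: `unicode` is gone, `str` is the text type


def serialise_to_text(root):
    """Serialise an lxml element to a unicode string on Py2 and Py3.

    Hypothetical helper. The import is local so that the alias above
    stays usable even where lxml is not installed.
    """
    from lxml import etree
    return etree.tostring(root, encoding=text_type)
```

With this in place, calling code would write serialise_to_text(root) instead of spelling out encoding=unicode or encoding=str itself.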
Hi there,

Stefan Behnel wrote:
[snip]
> The ugliest problem I found so far is with doctests. There just isn't a way to write a Py2/Py3 portable doctest that accepts exactly a byte string or unicode strings as output, as both look different in Py2 and Py3. Also, exception names are now fully qualified, so that tracebacks look different. Tons of failing tests for nothing...
I wonder whether doctest can be extended/adapted so it can normalize strings.

By the way, I think it might be valuable to bring up this issue on the Py3K mailing list. The doctest module is after all in the standard library, and perhaps people can think up a way to break less.

Regards,

Martijn
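The normalisation idea above can be sketched with the standard doctest.OutputChecker hook. The class and helper names below are invented for this example, and the regular expression is a deliberately simple assumption about how string prefixes show up in test output; it is not something doctest ships.

```python
import doctest
import re

# Matches a u/b string prefix that directly precedes a quote and is not
# itself part of a longer word or a quoted string ending (e.g. 'club').
_PREFIX = re.compile(r"(?<![\w'\"])[uUbB](?=['\"])")


class PrefixInsensitiveChecker(doctest.OutputChecker):
    """Hypothetical checker that treats u'x', b'x' and 'x' as equal output."""

    def check_output(self, want, got, optionflags):
        # Strip the prefixes from both sides, then defer to the normal rules.
        want = _PREFIX.sub("", want)
        got = _PREFIX.sub("", got)
        return doctest.OutputChecker.check_output(self, want, got, optionflags)


def run_doctest_text(text, name="example"):
    """Run the doctest examples in `text` with the lenient checker."""
    test = doctest.DocTestParser().get_doctest(text, {}, name, None, 0)
    runner = doctest.DocTestRunner(
        checker=PrefixInsensitiveChecker(), verbose=False)
    return runner.run(test)  # TestResults(failed, attempted)
```

The price is precision: a test can no longer tell a byte string from a unicode string, and text such as 'a b' (a lone b before a closing quote) is altered on both sides of the comparison, so this only suits tests where the string type is not the point.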
Martijn Faassen wrote:
> I wonder whether doctest can be extended/adapted so it can normalize strings.
You can use lib2to3 to convert doctests before running them. That works well in most cases, but it isn't trivial to set up either, and you may still have to change your tests to enable an automated conversion. A straight 2to3 option in the doctest module would be nice anyway.

What I did for lxml was to change most doctests to the more explicit Py3 syntax (or actually a bit of a mix of both worlds) and to use a couple of regular expressions to fix them up before passing them to doctest. Not ideal, but it works well enough for now.

Stefan
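A regex fix-up of the kind mentioned above could look roughly like the following. The two patterns and the function names are assumptions made for illustration; they are not the expressions lxml actually uses, and they only cover the simplest cases.

```python
import re
import sys

# u'...' or u"..." at line start or after whitespace/brackets/comma/equals:
# keep what came before, drop the u.
_U_PREFIX = re.compile(r"(^|[\s(\[{,=])u(?=['\"])", re.MULTILINE)

# "except SomeError, e:" -> "except SomeError as e:"
# (single exception class only; tuples of exceptions are not handled).
_EXCEPT_COMMA = re.compile(r"(except\s+[\w.]+)\s*,\s*(\w+\s*:)")


def fix_doctest_for_py3(text):
    """Rewrite Py2-style doctest text into a form Py3 can run and match."""
    text = _U_PREFIX.sub(r"\1", text)
    text = _EXCEPT_COMMA.sub(r"\1 as \2", text)
    return text


def load_doctest_text(text):
    """Apply the fix-ups only when running under Python 3."""
    if sys.version_info[0] >= 3:
        return fix_doctest_for_py3(text)
    return text
```

The converted text can then be handed to doctest.DocTestParser as usual. Because the rewriting is purely textual, a u that happens to precede a quote inside a string literal after a space would be dropped as well, so the tests have to be written with the fix-up in mind.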
Participants (2): Martijn Faassen, Stefan Behnel