[lxml-dev] first lessons learned while porting lxml to Py3
Hi,

since we had a lengthy discussion on whether or not non-prefixed byte strings should automatically mutate into unicode strings when compiled for Py3, here are some initial lessons from my first attempt to port lxml.

My first approach was (obviously) to import unicode_literals from __future__. This failed miserably, and even showed a couple of further bugs in Cython. :) I then chose the route of explicitly prefixing unicode strings with 'u', as I wanted to keep my source compilable with older Cython versions that do not support the 'b' prefix. So far, I have changed about 700 lines this way in a quick walk-through, and now I'm searching for the places where this was the wrong thing to do. :)

Most important evidence found: it's definitely non-trivial in a lot of places to decide what has to be unicode and what doesn't. It's non-trivial for me, and definitely not easier for Cython.

One important place where I ended up with a lot of trivial changes is docstrings. Here, I would give an almost 100% chance that the user meant a unicode string if it's not prefixed. The remaining cases, e.g. where some external tool may require binary data for some kind of configuration or analysis, are rare enough to just ignore them. For exactly this reason (I think), the doctest module in Py3 ignores docstrings that are not unicode. This might be a place where an automatic conversion makes sense (although, if it's the only place, that would be some funny string semantics...).

Another important place is exception messages. Here, I'd give a real 100% for string literals, as their only purpose is to be human readable.

A field where I really had to take care is working with byte sequences. For example, lxml has a couple of places where strings are converted into UTF-8 and then passed into re.findall() or re.sub(). When substituting, the replacement string obviously has to be a byte string, too.
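To illustrate the point about re.findall() and re.sub() on UTF-8 data, here is a minimal sketch. It is not taken from lxml; the sample text, the pattern, and the variable names are assumptions chosen only for illustration. It shows the subject, the pattern, and the replacement all being byte strings, and Py3 rejecting a unicode pattern on byte data.

```python
import re

# A unicode string as it might arrive at the API level (illustrative data).
text = u"caf\u00e9 <b>au</b> lait"

# Encode once; from here on everything is a byte string.
data = text.encode("utf-8")

# Pattern and replacement are both byte strings, matching the subject.
tags = re.findall(br"<[^>]+>", data)
stripped = re.sub(br"<[^>]+>", b"", data)

# Under Py3, a unicode pattern applied to a byte string raises TypeError.
try:
    re.findall(u"<[^>]+>", data)
except TypeError:
    mixed_types_rejected = True
else:
    mixed_types_rejected = False

# Decode only at the boundary where text is handed back to the caller.
result = stripped.decode("utf-8")
```

Keeping the encode and decode steps at the edges means every literal in between can carry an explicit b prefix, which is the kind of decision that cannot be automated.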
I also found a bug in the Py3 re module when working with byte strings in one specific case.

There are actually quite a number of places where strings are built as byte strings by combining and formatting literals, and then converted to a char*. That is another place where automatic conversion must not happen.

So, while still on the way, my first real-world impression matches my original opinion. There are definitely a lot of unprefixed strings in my own code that are meant to be unicode strings. Simply switching their type in Py3 will fix a lot of them, but at the same time break many others. The things that it fixes are the trivial parts: docstrings and exceptions. Almost everything else really was a byte string, and some were non-trivial things that need real work.

If I can choose, I opt for going through this once and then having code that correctly distinguishes between byte strings and unicode strings in *both* Py2 and Py3, instead of additionally having to deal with changing string semantics for identical code in different environments.

We might think about a way to simplify the transition from unprefixed docstrings and exception messages to unicode strings. As it currently stands, everything else is definitely out of scope for any automatism.

Stefan
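A rough sketch of what code with explicit prefixes might look like: the human-readable parts (docstring, exception message) are unicode, while the data that would end up as a char* is assembled from byte literals. The function build_declaration and its parameters are hypothetical and not part of lxml. The pieces are joined instead of %-formatted, since %-formatting of byte strings is not available in every Py3 version.

```python
def build_declaration(version, encoding):
    u"""Return an XML declaration as a byte string (hypothetical helper)."""
    if not version:
        # Exception messages exist to be read by humans: unicode.
        raise ValueError(u"version must not be empty")
    # Everything below is binary data: byte literals plus explicitly
    # encoded arguments, combined without any implicit conversion.
    parts = [
        b'<?xml version="',
        version.encode("ascii"),
        b'" encoding="',
        encoding.encode("ascii"),
        b'"?>',
    ]
    return b"".join(parts)


decl = build_declaration(u"1.0", u"UTF-8")
```

Because every literal states its own type, the function should behave the same under Py2.6+ and Py3, independent of what an unprefixed literal would mean there.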
Sorry, wrong list. This was supposed to go to the Cython list...

[but yes, there will be lxml for Python 3, and pretty soon]

Stefan Behnel wrote:
since we had a lengthy discussion on whether or not non-prefixed byte strings should automatically mutate into unicode strings when compiled for Py3, here are some initial lessons from my first attempt to port lxml. [...]
Stefan Behnel wrote:
Sorry, wrong list. This was supposed to go to the Cython list...
[but yes, there will be lxml for Python 3, and pretty soon]
It's interesting to hear about here anyway.

I'm glad to hear that you'll try being more clear with how unicode works for the Python 2.x version of lxml too. I think this is similar to the way the migration path enables this for plain Python code in Python 2.6.

I do hope that lxml for Python 2.x can be maintained and extended for the foreseeable future though; I'm sitting on a vast mountain of codebases that aren't going to Python 3.x in a hurry.

Regards,

Martijn
Hi,

Martijn Faassen wrote:
I'm glad to hear that you'll try being more clear with how unicode works for the Python 2.x version of lxml too. I think this is similar to the way the migration path enables this for plain Python code in Python 2.6.
There will be few changes when running in Py2. It will still accept byte strings at the API level and return them for plain ASCII values. This only changes under Py3, where you will always get a unicode string back for .tag, .text, etc. I'm not even planning to block passing byte strings as tag names, although that will become really rare for Python code running in Py3.
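The Py3 side of this behaviour could be sketched roughly as follows. SketchElement is a hypothetical stand-in written only for illustration, not lxml's Element class, and it assumes the tag name is kept as UTF-8 bytes internally. It accepts either string type as input and always hands back a unicode string.

```python
class SketchElement(object):
    u"""Hypothetical stand-in for an element with a .tag property."""

    def __init__(self, tag):
        # Accept byte strings and unicode strings alike at the API level.
        if isinstance(tag, bytes):
            self._tag = tag
        else:
            self._tag = tag.encode("utf-8")

    @property
    def tag(self):
        # Py3 behaviour as described above: always return unicode.
        return self._tag.decode("utf-8")


from_bytes = SketchElement(b"root").tag
from_text = SketchElement(u"r\u00f6t").tag
```

Under Py2 the described behaviour differs, since plain ASCII values would come back as byte strings; the sketch does not model that case.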
I do hope that lxml for Python 2.x can be maintained and extended for the foreseeable future though; I'm sitting on a vast mountain of codebases that aren't going to Python 3.x in a hurry.
There will only be a single code base. We ported the code that Cython generates in a way that makes it compile from Py2.3 to Py3.0 without changes, and I'm planning to continue the support for 2.3 as long as possible. We are planning a new release of Cython shortly after the release of 3.0/2.6 beta1.

Stefan
Stefan Behnel wrote:
Hi,
Martijn Faassen wrote:
I'm glad to hear that you'll try being more clear with how unicode works for the Python 2.x version of lxml too. I think this is similar to the way the migration path enables this for plain Python code in Python 2.6.
There will be few changes when running in Py2. It will still accept byte strings at the API level and return them for plain ASCII values. This only changes under Py3, where you will always get a unicode string back for .tag, .text, etc.
I'm not even planning to block passing byte strings as tag name, although that will become really rare for Python code running in Py3.
I do hope that lxml for Python 2.x can be maintained and extended for the foreseeable future though; I'm sitting on a vast mountain of codebases that aren't going to Python 3.x in a hurry.
There will only be a single code base. We ported the code that Cython generates in a way that makes it compile from Py2.3 to Py3.0 without changes, and I'm planning to continue the support for 2.3 as long as possible.
Thank you! /me heaves a huge sigh of relief.

Tres.
participants (3)
- Martijn Faassen
- Stefan Behnel
- Tres Seaver