Since we had a lengthy discussion on whether non-prefixed byte strings
should automatically turn into unicode strings when compiled for Py3, here
are some initial lessons from my first attempt to port lxml.
My first approach was (obviously) to import unicode_literals from __future__.
This failed miserably, and even showed a couple of further bugs in Cython. :)
I then chose the route of explicitly prefixing unicode strings with 'u', as I
wanted to keep my source compilable with older Cython versions that do not
support the 'b' prefix. So far, I have changed about 700 lines this way in
a quick walk-through, and now I'm searching for the places where this was the
wrong thing to do. :)
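
To make the explicit-prefix route concrete, here is a minimal sketch in plain
Python rather than Cython. It is not taken from lxml: the two constants and
the describe() helper are hypothetical names, and it assumes an interpreter
that accepts both prefixes (Python 2.6+, or Python 3.3+ for the 'u' prefix).

```python
# Illustrative sketch only; the names below are hypothetical, not lxml code.

# Text meant for humans: a unicode string in both Py2 and Py3.
NAMESPACE_LABEL = u"XML namespace"

# Raw data meant for a parser or C library: a byte string in both.
XML_DECLARATION = b"<?xml version='1.0'?>"


def describe(value):
    """Return 'bytes' for byte strings and 'text' for anything else."""
    if isinstance(value, bytes):
        return "bytes"
    return "text"
```

With both prefixes spelled out, the type of each literal no longer depends on
which Python version the code is compiled for.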
The most important finding: in a lot of places it is definitely non-trivial
to decide what has to be unicode and what doesn't. It's non-trivial for me,
and definitely not easier for Cython.
One important place where I ended up with a lot of trivial changes is
docstrings. Here, I would give an almost 100% chance that the user meant a
unicode string if it's not prefixed. The remaining cases, e.g. where some
external tool may require binary data for some kind of configuration or
analysis, are rare enough to just ignore them. For exactly this reason (I
think), the doctest module in Py3 ignores docstrings that are not unicode.
This is a place where an automatic conversion might make sense
(although, if it's the only place, that would be some funny string semantics...)
Another important place is exception messages. Here, I'd give a real 100% for
string literals, as their only purpose is to be human readable.
One area where I really had to take care is working with byte sequences.
For example, lxml has a couple of places where strings are converted into
UTF-8 and then passed into re.findall() or re.sub(). When substituting, the
replacement string obviously has to be a byte string, too. I also found a bug
in the Py3 re module when working with byte strings in one specific case.
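
As a rough sketch of the kind of code I mean (plain Python 3, not the actual
lxml source; both function names are made up for illustration), the pattern
and the replacement have to be byte strings once the input has been encoded:

```python
import re


def collapse_whitespace(text):
    """Encode text to UTF-8, then collapse whitespace runs at the byte level."""
    data = text.encode("utf-8")
    # Pattern and replacement are both byte strings, matching the input.
    return re.sub(b"[ \t\r\n]+", b" ", data)


def mixing_fails(pattern, data):
    """Return True if re refuses to apply this pattern to this data."""
    try:
        re.findall(pattern, data)
    except TypeError:
        return True
    return False
```

In Py3, mixing_fails(u"a+", b"aaa") reports True, because a unicode pattern
cannot be applied to byte data. A literal that silently changed its type
would therefore break exactly this kind of call.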
There are actually quite a number of places where strings are built as byte
strings by combining and formatting literals, and then converted to a char*.
This is another place where automatic conversion must not happen.
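
A small sketch of that pattern, again in plain Python and with a made-up
helper name (build_indexed_path is not an lxml function). It assumes Python 2
or Python 3.5+, since %-formatting on byte strings is not available in Python
3.0 to 3.4. In Cython, the resulting byte string is what would then be
assigned to a char*.

```python
def build_indexed_path(tag, index):
    """Build a byte string such as b'//item[3]' from byte string literals."""
    if not isinstance(tag, bytes):
        # A unicode tag would have to be encoded explicitly by the caller.
        raise TypeError("tag must be a byte string")
    # Every literal here is a byte string, so the result is one as well.
    return b"//" + tag + (b"[%d]" % index)
```

If the literals in such a function turned into unicode strings under Py3, the
concatenation with the byte string argument would fail.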
So, while I'm still on the way, my first real-world impression matches my
original opinion. There are definitely a lot of unprefixed strings in my own
code that are meant to be unicode strings. Simply switching their type in Py3
will fix a lot of them, but at the same time break many others. The things
that it fixes are the trivial parts: docstrings and exceptions. Almost
everything else really was a byte string, and some were non-trivial things
that need real work.
Given the choice, I opt for going through this once and then having code that
correctly distinguishes between byte strings and unicode strings in *both* Py2
and Py3, instead of additionally having to deal with changing string semantics
for identical code in different environments. We might think about a way to
simplify the transition from unprefixed docstrings and exception messages to
unicode strings. As it currently stands, everything else is definitely out of
scope for any automatic conversion.