[Python-Dev] email package status in 3.X

Guido van Rossum guido at python.org
Sun Jun 20 19:57:05 CEST 2010


On Sun, Jun 20, 2010 at 5:26 AM, Giampaolo Rodolà <g.rodola at gmail.com> wrote:
> 2010/6/20 Steven D'Aprano <steve at pearwood.info>:
>> Python 2.x introduced Unicode strings. Python 3.x merely makes them the
>> default.
>
> "Merely"? To me this looks as the main reason why a lot of projects
> haven't been ported to Python 3 yet.
> I attempted to port pyftpdlib to python 3 several times and the
> biggest show stopper has always been the bytes / string difference
> introduced by Python 3 which forces you to *know* and *use* Unicode
> every time you deal with some text

Ah, but this is the crux of the difference between Python 2 and 3. The
distinction between text and bytes is crucial, and Python 2 tried to
paper over the differences in a way that led to endless pain. Many
clumsy and shaky hacks have been invented to alleviate the pain but it
never goes away. Python 3 takes a much clearer stance on the
difference -- your code *must* be aware of the distinction and it
*must* deal with it.
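
As a minimal illustration of that stance (a sketch assuming a Python 3
interpreter; the variable names are only for illustration), mixing the
two types fails immediately instead of being silently coerced:

```python
# Python 3: str holds text, bytes holds binary data; they never mix implicitly.
text = "caf\u00e9"             # str: four Unicode code points
data = text.encode("utf-8")    # bytes: five bytes, b'caf\xc3\xa9'

try:
    combined = data + text     # Python 2 would have tried an implicit ASCII coercion
except TypeError as exc:
    mixing_error = exc         # Python 3 refuses to guess an encoding
else:
    mixing_error = None

# Conversion has to be spelled out, with the encoding named explicitly.
round_tripped = data.decode("utf-8")
```

The conversion is the same work Python 2 attempted behind the scenes;
the difference is that the encoding is now stated by the programmer.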

The problem comes exactly where you find it: when *porting* existing
code that uses aforementioned ways to alleviate the pain, you find
that the hacks no longer work and a properly layered design is needed
that clearly distinguishes between which variables contain bytes and
which text.
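
One way such a layered design could look, sketched under assumptions:
the function names below are hypothetical and are not taken from
pyftpdlib or any other library, and UTF-8 is assumed as the wire
encoding. Bytes are decoded once at the input boundary, the core logic
handles only text, and text is encoded once on the way out:

```python
def read_command(raw_line: bytes, encoding: str = "utf-8") -> str:
    """Input boundary: bytes from the wire become text."""
    return raw_line.rstrip(b"\r\n").decode(encoding)


def handle_command(command: str) -> str:
    """Core logic: text in, text out, no bytes anywhere."""
    verb, _, argument = command.partition(" ")
    return "200 {} ok: {}".format(verb.upper(), argument)


def write_reply(reply: str, encoding: str = "utf-8") -> bytes:
    """Output boundary: text becomes bytes for the wire."""
    return reply.encode(encoding) + b"\r\n"


# b"r\xc3\xa9sum\xc3\xa9" is the UTF-8 encoding of "résumé".
wire_reply = write_reply(handle_command(read_command(b"cwd r\xc3\xa9sum\xc3\xa9\r\n")))
```

With this shape, every variable has exactly one type, and the encoding
decision lives in two small functions rather than throughout the code.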

> and 2to3 is completely useless here.

Alas, this is true, because it is not a matter of changing some simple
things. The old ways are no longer supported.

> I can only imagine how difficult can it be to do such a conversion in
> a project like Twisted or Django where the I/O plays a fundamental
> role.

Django actually took one of the most principled stances towards this
issue and has already been ported (although the port is not maintained
by the core Django developers yet). I can't speak for Twisted but I
know they have some funding towards a port.

The problem is often worse for smaller libraries (like I presume
pyftpdlib is) which don't have a clear stance about bytes vs. text.

Another problem is some internet protocols (of which FTP I believe is
one) which use antiquated models for dealing with binary vs. text
data, often focusing entirely on encodings (usually and mistakenly
called "character sets") rather than on proper Unicode support.

> The choice of forcing the user to use Unicode and "think in Unicode"
> was a very brave one, and I'm sure it's for the better, but not
> everyone wants to deal with that because Unicode is hard to swallow.

Education is needed. When you search Google (or Bing, for that matter
:-) for "python unicode" the first hit is
http://www.amk.ca/python/howto/unicode, which is highly detailed but
probably too much information for the typical person faced with a
UnicodeError exception traceback (that page is also focused on Python
2). What we need is a cookbook on how to deal with various common
situations.
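
One entry in such a cookbook might cover bytes of uncertain encoding.
A minimal sketch, assuming UTF-8 should be tried first and Latin-1 is
an acceptable fallback; decode_with_fallback is a hypothetical helper
name, not a standard library function:

```python
def decode_with_fallback(data: bytes, encodings=("utf-8", "latin-1")) -> str:
    """Try each candidate encoding in order and return the first clean decode."""
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue  # this candidate cannot represent the bytes; try the next
    # Last resort: keep what decodes and mark the rest with U+FFFD.
    return data.decode("utf-8", errors="replace")


from_utf8 = decode_with_fallback(b"caf\xc3\xa9")    # valid UTF-8
from_latin1 = decode_with_fallback(b"caf\xe9")      # invalid UTF-8, valid Latin-1
```

Latin-1 maps every byte to a code point, so it never raises and belongs
last in the list. It can still produce the wrong characters, which is
why knowing the real encoding is always better than guessing.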

> The majority of people prefer to stay with bytes and eventually learn
> and introduce Unicode only when that is actually needed.

This is exactly what we tried to do in Python 2 and it was a flagrant
disaster. It's just that the work-arounds people have created to deal
with it don't port cleanly -- which is by design.

This is why I've always said that I assumed that the Python 3
transition would take 5 years.

On the #python issue, I expect that IRC is much less influential than
some here fear (and than some fervent IRC users believe). I don't see
a reason for panic or heavy-handed interference. OTOH engaging the
channel operators more in python-dev sounds like a useful approach.

-- 
--Guido van Rossum (python.org/~guido)

