[Python-Dev] Low-Level Encoding Behavior on Python 3

Armin Ronacher armin.ronacher at active-4.com
Wed Mar 16 08:29:42 CET 2011


Hi everybody,

Carl Meyer and I did some experimentation with encoding behavior 
on Python 3.  Carl had been hacking on getting virtualenv running on 
Python 3, and it turned out that his version of virtualenv did not work 
on Python 3 on my server either.  In fact, none of the virtualenv 
installations worked there, though they all seemed to work for some 
people.

Looking closer, the problem is that virtualenv was assuming that 
'open(filename).read()' works.  However, on my particular system the 
default encoding in Python 3 for files was 'ASCII'.  That encoding was 
picked up because of three things: a) Python 3's default encoding for 
opening files is picked up from the system locale, b) the ssh server 
accepts the client's encoding for everything (including filenames) and 
c) the OS X default installation for many people does not initialize 
locales properly, which forces the server to fall back to 'POSIX', which 
applications (including Python) then interpret as ASCII.
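
To make the above concrete, here is a minimal sketch of how one could
inspect what Python picked up from the environment, and how a caller
can avoid depending on it.  The helper names `describe_encodings` and
`read_text_utf8` are made up for illustration and are not part of
virtualenv; the reported values depend on the Python version and on
the locale of the process.

```python
import locale
import sys


def describe_encodings():
    """Return the encodings Python derived from the current environment."""
    return {
        # Encoding open() uses in text mode when none is passed explicitly.
        "preferred": locale.getpreferredencoding(False),
        # Encoding used to encode and decode filenames.
        "filesystem": sys.getfilesystemencoding(),
    }


def read_text_utf8(filename):
    """Read a text file as UTF-8, independent of the locale (hypothetical helper)."""
    with open(filename, encoding="utf-8") as f:
        return f.read()


if __name__ == "__main__":
    # With a 'POSIX' or 'C' locale this may report an ASCII-based encoding.
    print(describe_encodings())
```

Passing `encoding=` explicitly sidesteps the locale entirely, at the
cost of hard-coding an assumption about the file's contents.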

Now this showcases a couple of problems on different levels:

-   Developers assume that the default for encodings is UTF-8 because
     that is the encoding on their local machine.  Falling back to
     the platform-dependent encoding is documented but does not make a
     lot of sense.  The limiting platform is probably Windows, which
     historically has problems with UTF-8 in the Notepad editor.

     As a compromise I recommend UTF-8 for POSIX and UTF-8-sig for
     Windows, as the Windows editor feels happier with this encoding.
     As the latter reads every file of the former, that should not cause
     that many problems in practice.

-   Seeing that SSH happily overrides the filesystem encoding, I would
     like to forward this issue to some of the Linux maintainers.  Having
     the SSH client override your filesystem encoding sounds like a
     terrible decision.  Apparently Python guesses the filesystem
     encoding from LC_CTYPE, which however is overridden by connecting
     SSH clients.  Seeing how Ubuntu and a bunch of other distributions
     are using GNOME, which uses UTF-8 for filesystems as a somewhat
     established default, I would argue that Python should just assume
     UTF-8 as the default encoding on a Linux environment.

-   Inform Apple about the fact that some Snow Leopard machines are
     by default setting the LC_CTYPE (and all other locale) variables
     to something that is not even a valid locale.  I am not yet sure why
     it does not happen on all machines, but it happens on more than one
     at PyCon alone.  On top of that, I know that issue because it broke
     the Python "Babel" package for a while, which is why I added a
     workaround for that particular problem.

     Either way, I will file a bug report with Apple about what the SSH
     client is doing in mixed-locale environments.
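
Regarding the UTF-8 / UTF-8-sig suggestion in the first point, the
following small sketch shows the asymmetry between the two codecs:
'utf-8-sig' decodes data with or without a BOM, while plain 'utf-8'
leaves a BOM in the decoded text.  The function name
`decode_utf8_sig` and the sample string are only illustrative.

```python
import codecs


def decode_utf8_sig(data):
    """Decode bytes as UTF-8, dropping a leading BOM if one is present."""
    return data.decode("utf-8-sig")


text = "caf\u00e9"
plain = text.encode("utf-8")         # no BOM
with_bom = text.encode("utf-8-sig")  # BOM followed by the UTF-8 bytes

# The utf-8-sig encoder prepends the UTF-8 BOM.
assert with_bom == codecs.BOM_UTF8 + plain

# The utf-8-sig decoder accepts both variants.
assert decode_utf8_sig(plain) == text
assert decode_utf8_sig(with_bom) == text

# The plain utf-8 decoder keeps the BOM as a U+FEFF character.
assert with_bom.decode("utf-8") == "\ufeff" + text
```

So reading with 'utf-8-sig' is safe for both kinds of files, but files
written with 'utf-8-sig' carry a BOM that a plain 'utf-8' reader will
see as a leading U+FEFF.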


Are we missing anything?  Any suggestions?


Regards,
Armin

