[Python-Dev] Low-Level Encoding Behavior on Python 3
Armin Ronacher
armin.ronacher at active-4.com
Wed Mar 16 08:29:42 CET 2011
Hi everybody,
Carl Meyer and I did some experimentation with encoding behavior
on Python 3. Carl did some hacking on getting virtualenv running on
Python 3, and it turned out that his version of virtualenv did not work
on Python 3 on my server either. In fact none of the virtualenv
installations worked there, though they all seemed to work for some
people.
Looking closer, the problem is that virtualenv was assuming that
'open(filename).read()' works. However, on my particular system the
default encoding in Python 3 for files was 'ASCII'. That encoding was
picked up because of three things: a) Python 3's default encoding for
opening files is picked up from the system locale, b) the ssh server
accepts the client's encoding for everything (including filenames) and
c) the OS X default installation for many people does not initialize
locales properly, which forces the server to fall back to 'POSIX',
which applications (including Python) then interpret as ASCII.
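To make the failure concrete, here is a minimal sketch. The helper name
read_text and the temporary file are assumptions for illustration only,
not virtualenv code. It writes UTF-8 bytes, shows that decoding them as
ASCII (what a POSIX locale effectively selects) fails, and that passing
an explicit encoding works regardless of the locale.

```python
import locale
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Hypothetical helper: read a file with an explicit encoding
    instead of relying on the locale-derived default."""
    with open(path, encoding=encoding) as f:
        return f.read()


# The encoding the locale reports; open() normally uses it for text
# files when no encoding argument is given.
default_encoding = locale.getpreferredencoding(False)

# Write a small file containing a non-ASCII character as UTF-8 bytes.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write("caf\u00e9\n".encode("utf-8"))

try:
    # Simulate what open(path).read() does when the default is ASCII.
    try:
        read_text(path, encoding="ascii")
        ascii_failed = False
    except UnicodeDecodeError:
        ascii_failed = True

    # An explicit encoding is independent of the locale.
    text = read_text(path)
finally:
    os.remove(path)

print("locale default:", default_encoding)
print("ASCII decode failed:", ascii_failed)
```

Code that always passes an encoding to open() is not affected by
whatever locale the ssh session happens to carry.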
Now this showcases a couple of problems on different levels:
- Developers assume that the default for encodings is UTF-8 because
that is the encoding on their local machine. Falling back to the
platform-dependent encoding is documented but does not make a lot
of sense. The limiting platform is probably Windows, which
historically has problems with UTF-8 in the Notepad editor.
As a compromise I recommend UTF-8 for POSIX and UTF-8-sig for
Windows, as the Windows editor feels happier with this encoding.
As the latter reads every file of the former, that should not cause
that many problems in practice.
- Seeing that SSH happily overrides the filesystem encoding, I would
like to forward this issue to some of the Linux maintainers. Having
the SSH client override your filesystem encoding sounds like a
terrible decision. Apparently Python guesses the filesystem
encoding from LC_CTYPE, which however is overridden by connecting
SSH clients. Seeing how Ubuntu and a bunch of other distributions
are using Gnome, which uses UTF-8 for filesystems as a somewhat
established default, I would argue that Python should just assume
UTF-8 as the default encoding on a Linux environment.
- Inform Apple about the fact that some Snow Leopard machines are
by default setting the LC_CTYPE (and all other locale) variables
to something that is not even a valid locale. I am not yet sure why
it does not happen on all machines, but it happens on more than one
at PyCon alone. On top of that, I know this issue because it broke
the Python "Babel" package for a while, which is why I added a
workaround for that particular problem.
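The points above can be checked with a short sketch. The name
encoding_report is hypothetical and only for illustration. The first
part shows that the utf-8-sig decoder reads files with or without a
byte order mark; the second part collects the locale variables and the
encodings Python derived from them, which is useful when comparing a
local shell with an ssh session.

```python
import codecs
import locale
import os
import sys

sample = "caf\u00e9"
plain = sample.encode("utf-8")         # no byte order mark
with_bom = sample.encode("utf-8-sig")  # prefixed with the UTF-8 BOM

# The utf-8-sig decoder accepts both variants and strips the BOM
# if one is present.
decoded_plain = plain.decode("utf-8-sig")
decoded_bom = with_bom.decode("utf-8-sig")

# The plain utf-8 decoder keeps the BOM as a leading U+FEFF character.
decoded_bom_as_utf8 = with_bom.decode("utf-8")


def encoding_report(environ=None):
    """Hypothetical helper: gather the settings that influence
    Python's default encodings on this system."""
    if environ is None:
        environ = os.environ
    return {
        # POSIX looks at LC_ALL first, then LC_CTYPE, then LANG.
        "LC_ALL": environ.get("LC_ALL"),
        "LC_CTYPE": environ.get("LC_CTYPE"),
        "LANG": environ.get("LANG"),
        "filesystem_encoding": sys.getfilesystemencoding(),
        "preferred_encoding": locale.getpreferredencoding(False),
    }


report = encoding_report()
for name, value in report.items():
    print("%s: %r" % (name, value))
```

Running this once locally and once over ssh should make it visible
whether the client's locale variables end up on the server side.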
Either way, I will file a bug report with Apple about what the SSH
client is doing in mixed locale environments.
Are we missing anything? Any suggestions?
Regards,
Armin