[Python-Dev] Import and unicode: part two

Thu Jan 20 03:07:25 CET 2011

On Thu, Jan 20, 2011 at 01:26:01AM +0100, Victor Stinner wrote:
> Le mercredi 19 janvier 2011 à 15:44 -0800, Toshio Kuratomi a écrit : 
> > Additionally, many unix filesystem don't specify a filesystem encoding for
> > filenames; they deal in legal and illegal bytes which could lead to
> > troubles.  This problem of which encoding to use is a problem that can be
> > seen on UNIX systems even now.
> 
> If the system is not correctly configured, it is not a bug in Python,
> but a bug in the system config. Python relies on the locale to choose
> the filesystem encoding (sys.getfilesystemencoding()). Python uses this
> encoding to decode and encode all filenames.
> 
Saying that multiple encodings on a single system is a misconfiguration
every time it comes up does not make it true.  There's been multiple
examples of how you can end up with multiple encodings of filenames on
a single system listed in past threads: multiple users with different
encodings for their locales, mounting remote filesystems, downloading
a file.... To the existing list I'd add getting a package from pypi --
neither tar nor zip files contain encoding information about the filenames.
Therefore if I create an sdist of a python module using non-ascii filenames
using a locale of latin1 and then upload to pypi, people downloading that
on a utf-8 using locale will end up not being able to use the module.

> > * Specify an encoding per platform and stick to that.
> 
> It doesn't work: on UNIX/BSD, the user chooses its own encoding and all
> programs will use it.
> 
The proposal is that you ignore that when talking about loading and creating
(I mentioned distutils because my thought was that distutils could grow the
ability to translate from the system locale to a chosen neutral encoding
when running setup.py any of the dist commands but that doesn't address the
issue when testing a module that you've just written so perhaps that's not
necessary.) python modules.  Python modules would have a set of defined
filesystem encodings per system.  This prevents getting a mixture of
encodings of modules and having things work in one location but fail when
used somewhere else.  Instead, you get an upfront failure until you correct
the encoding.

> Anyway, I don't see why it is a problem to have different encodings on
> different systems. Each system can use its own encoding. The bug that
> I'm trying to solve is a Python bug, not an OS bug.
> 
There is no OS bug here.  There is perhaps an OS design flaw but it's not
a flaw that will be going away soon (in part, because the present OS
designers do not see it as an OS flaw... to them it's a bug in code that
attempts to build a simpler interface on top of it.)

> > * Change import semantics to allow specifying the encoding of the module on
> >   the filesystem (seems really icky).
> 
> This is a very bad idea. I introduced PYTHONFSENCODING environment
> variable in Python 3.2, but then quickly removed it, because it
> introduced a lot of inconsistencies.
> 
Thanks for getting rid of that, PYTHONFSENCODING is a bad idea because it
doesn't solve the underlying issues.  However, when I say specifying the
encoding of the module on the filesystem, I don't mean something global like
PYTHONFSENCODING -- I mean something at the python code level::

   import café encoded_as('latin1')

After thinking about this one, though, I don't think it will work either.
This takes care of importing modules where the fs encoding of the module is
known but it doesn't where the fs encoding may be translated between
platforms.  I believe that this could arise when untarring a module on
windows using winzip or similar that gives you the option of translating
from utf-8 bytes into bytes that have meaning as characters on that
platform, for instance.

Do you have a solution to the problem?  I haven't looked at your patch so
perhaps you have an ingenous method of translating from the unicode
representation of the module in the import statement to the bytes in
arbitrary encodings on the filesystem that I haven't thought of.  If you
don't, however, then really - ASCII-only seems like the sanest of the three
solutions I can think of.

-Toshio
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-dev/attachments/20110119/39794033/attachment-0001.pgp>