[Python-Dev] Import and unicode: part two

Wed Jan 26 06:33:56 CET 2011

On Wed, Jan 26, 2011 at 11:24:54AM +0900, Stephen J. Turnbull wrote:
> Toshio Kuratomi writes:
> 
>  > On Linux there's no defined encoding that will work; file names are just
>  > bytes to the Linux kernel.  So, based on people's argument that the
>  > convention is and should be that filenames are utf-8 and anything else is
>  > a misconfigured system, python should mandate that its module filenames on
>  > Linux are utf-8 rather than using the user's locale settings.
> 
> This isn't going to work where I live (Tsukuba).  At the national
> university alone there are hundreds of pre-existing *nix systems whose
> filesystems were often configured a decade or more ago.  Even if the
> hardware and OS have been upgraded, the filesystems are usually
> migrated as-is, with OS configuration tweaks to accommodate them.  Many
> of them use EUC-JP (and servers often use Shift JIS).  That means that you
> won't be able to read module names with ls, and that will make Python
> unacceptable for this purpose.  I imagine that in Russia the same is
> true for the various Cyrillic encodings.
> 
Sure ... but with these systems, neither read-modules-as-locale nor
read-modules-as-utf-8 is a workable solution, correct?  Especially if
the OS does get upgraded but the filesystems with user data (and user
created modules) are migrated as-is, you'll run into situations where system
installed modules are in utf-8 and user created modules are shift-jis, so
something will always be broken.
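
As a rough illustration of why a single assumed encoding breaks on such a
system, here's a minimal Python 3 sketch.  The module name is a made-up
example and the try_decode() helper is hypothetical; this is not how the
import machinery reads filenames.  It only shows that the same name stored
under three encodings is three different byte strings, and that a reader
assuming utf-8 can recover only one of them.

```python
# Hypothetical module name, used only for illustration.
name = "日本語"

# The same name as it would be stored on disk under three encodings.
on_disk = {
    "utf-8": name.encode("utf-8"),
    "euc_jp": name.encode("euc_jp"),
    "shift_jis": name.encode("shift_jis"),
}


def try_decode(raw, encoding):
    """Return the decoded name, or None if raw is not valid in encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# A reader that assumes utf-8 for every filename it sees.
as_utf8 = {label: try_decode(raw, "utf-8") for label, raw in on_disk.items()}

for label, decoded in as_utf8.items():
    print(label, repr(on_disk[label]), "->", repr(decoded))
```

Swapping the assumed encoding to euc_jp or shift_jis just moves the failure
to a different subset of the files; no single choice reads all three.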

The only way to make sure that modules work is to restrict them to ASCII-only
on the filesystem.  But because unicode module names are seen as
a necessary feature, the question is which way forward is going to lead to
the least brokenness.  That could be locale... but judging from the python2
locale-related bugs that I get to look at, I doubt it.

> I really don't think there is anything that can be done here except to
> warn people that "Kids, these stunts are performed by highly-trained
> professionals.  Don't try this at home!"  Of course they will anyway,
> but at least they will have been warned in sufficiently strong terms
> that they might pay attention and be able to recover when they run
> into bizarre import exceptions.
> 
So on the subject of warnings... I think one reason it's better to pick an
encoding for the platform/filesystem rather than to use locale is that
people will get an error or a warning at the appropriate time if that's the
case -- the first time they attempt to create and import a module with
a filename that's not encoded in the correct encoding for the platform.
It's all very well to say: "We wrote in the documentation on
http://docs.python.org/distutils/introduction.html#Choosing-a-name that only
ASCII names should be used when distributing python modules" but if the
interpreter doesn't complain when people use a non-ASCII filename, we all
know that they aren't going to look in the documentation; they'll try it and,
if it works, they'll learn that habit.
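
To sketch what "complain at the appropriate time" could look like, here's
a small Python 3 example.  Everything in it is an assumption for
illustration: the PLATFORM_MODULE_ENCODING constant, the
check_module_filename() helper, and the choice of ImportError and
UserWarning are all hypothetical, and nothing here is what the interpreter
actually does.

```python
import warnings

# Assumption: the platform mandates one encoding for module filenames.
PLATFORM_MODULE_ENCODING = "utf-8"


def check_module_filename(raw_name, encoding=PLATFORM_MODULE_ENCODING):
    """Decode a module filename (bytes) or fail loudly.

    Raises ImportError if the bytes are not valid in the mandated
    encoding, and warns if the decoded name is not pure ASCII.
    """
    try:
        name = raw_name.decode(encoding)
    except UnicodeDecodeError as exc:
        raise ImportError(
            "module filename %r is not valid %s" % (raw_name, encoding)
        ) from exc
    if any(ord(ch) > 127 for ch in name):
        warnings.warn(
            "module filename %r is not ASCII; it may not be importable "
            "on other systems" % (name,),
            UserWarning,
        )
    return name
```

With a check like this, a module saved with a shift-jis or euc-jp filename
fails on the first import attempt rather than working on one machine and
breaking on another.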

-Toshio