[Python-Dev] Import and unicode: part two

Tue Jan 25 04:26:09 CET 2011

On Thu, Jan 20, 2011 at 03:27:08PM -0500, Glyph Lefkowitz wrote:
> 
> On Jan 20, 2011, at 11:46 AM, Guido van Rossum wrote:
>     Same here. *Most* code will never be shared, or will only be shared
>     between users in the same community. When it goes wrong it's also a
>     learning opportunity. :-)
> 
> 
> Despite my usual proclivity for being contrarian, I find myself in agreement
> here.  Linux users with locales that don't specify UTF-8 frankly _should_ have
> to deal with all kinds of nastiness until they can transcode their filesystems.
>  MacOS and Windows both have a "right" answer here and your third-party tools
> shouldn't create mojibake in your filenames.
> 
However, if this is the consensus, it makes a lot more sense to pick utf-8
as *the* encoding for python module filenames on Linux.

Why UTF-8:

* UTF-8 can cover the whole range of unicode whereas most (all?) other
  locale friendly encodings cannot.
* UTF-8 is becoming a standard for Linux distributions whether or not Linux
  users are adopting it.
* Third party tools are gaining support for UTF-8 even when they aren't
  gaining support for generic encodings (If I read the spec on zip
  correctly, this is actually what's happening there).

Why not locale:
* Relying on locale is simply not portable.  If nothing prevents people from
  distributing a unicode filename then they will go ahead and do so.  If
  the result works (say, because it's utf-8 and 80% of the Linux userbase is
  using utf-8) then it will get packaged and distributed and people won't
  know that it's a problem until someone with a non-utf-8 locale decids to
  use it.
* Mixing of modules from different locales won't work.  Suppose that the
  system python installs the previous module.  The local site has other
  modules that it has installed using a different filename encoding.
  The users at the site will find that either one or hte other of the two
  modules won't work.
* Because of the portability problems you have no choice but to tell people
  not to distribute python modules with non-ASCII names.  This makes the use
  of unicode names second class indefintely (until the kernel devs decide
  that they're wrong to not enforce a filesystem encoding or Linux becomes
  irrelevant as a platform).
* If you can pick a set of encodings that are valid (utf-8 for Linux and
  MacOS, wide unicode for windows [I get the feeling from other parts of the
  conversation that Windows won't be so lucky, though]) tools to convert
  python names become easier to write.  If you restrict it far enough, you
  could even write tools/importers that automatically do the detection.

PS: Sorry for not replying immediately, the team I'm on is dealing with an
issue at my work and I'm also preparing for a conference later this week.

-Toshio
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-dev/attachments/20110124/f76b1630/attachment.pgp>