[Python-Dev] Import and unicode: part two

Thu Jan 27 01:47:08 CET 2011

On Wed, Jan 26, 2011 at 11:12:02AM +0100, "Martin v. Löwis" wrote:
> On 26.01.2011 10:40, Victor Stinner wrote:
> > On Monday, 24 January 2011 at 19:26 -0800, Toshio Kuratomi wrote:
> >> Why not locale:
> >> * Relying on locale is simply not portable. (...)
> >> * Mixing of modules from different locales won't work. (...)
> > 
> > I don't understand what you are talking about.
> 
> I think by "portability", he means "moving files from one computer to
> another". He argues that if Python mandated UTF-8 for all file
> names on Unix, moving files in such a way would support portability,
> whereas using the locale's file name encoding might not (if the locale
> uses a different charset on the target system).
> 
> While this is technically true, I don't think it's a helpful way of
> thinking: by mandating that file names are UTF-8 when accessed from
> Python, we make the actual files inaccessible on both the source and
> the target system.
> 
> > I don't understand the relation between the local filesystem encoding
> > and portability. I suppose that you are talking about the
> > distribution of a module to other computers. Here the question is how
> > the filenames are stored during the transfer. The user is free to use
> > any tool, and to try to find a tool that handles Unicode correctly :-) But
> > that is no longer Python's problem.
> 
> There are cases where there is no real "transfer", in the sense in which
> you are using the word. For example, with NFS, you can access the very
> same file simultaneously on two systems, with no file name conversion
> (unless you are using NFSv4, and unless your NFSv4 implementations
> support the UTF-8 mandate in NFS well).
> 
> Also, if two users of the same machine have different locale settings,
> the same file name might be interpreted differently.
> 
Thanks Martin, I think that you understand my view even if you don't share
it.

There's one further case that worries me, and it also involves no real
"transfer".  Since people here seem to think that unicode module names are
the future (for instance, the comments about redefining the C locale to
include utf-8 and the comments about archiving tools needing to support
encoding bits), there are eventually going to be unicode modules that become
dependencies of other modules and programs.  These will need to be installed
on systems.  Linux distributions that ship them will need to choose
a filesystem encoding for their filenames.  The sensible choice is probably
utf-8, since every distribution I can think of defaults to
utf-8.  But, as Stephen and Victor have pointed out, users change their
locale settings to things that aren't utf-8 and save their modules using
filenames in that encoding.  When they update their OS to a version that has
utf-8 python module names, they will find that they have to make a choice.
They can either change their locale settings to a utf-8 encoding and have
the system-installed modules work, or they can keep their non-utf-8
encoding and have the modules that they've created on-site work.

This is not a good position to put users of these systems in.
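
To make the conflict concrete, here is a minimal sketch.  It is only a
model, not what the import machinery actually does: it assumes that a module
is found when encoding its name with the locale's encoding reproduces the
bytes stored on disk.  The module name and the choice of ISO-8859-1 as the
user's non-utf-8 locale are made up for illustration.

```python
# Hypothetical module name, chosen only for illustration.
MODULE_NAME = "caf\u00e9"


def filename_bytes(module_name, fs_encoding):
    """Return the bytes stored on disk for module_name + '.py'."""
    return (module_name + ".py").encode(fs_encoding)


def can_find(module_name, on_disk, fs_encoding):
    """True if encoding the name with fs_encoding matches the on-disk bytes.

    Simplified model of a lookup driven by the locale encoding (assumption).
    """
    try:
        return filename_bytes(module_name, fs_encoding) == on_disk
    except UnicodeEncodeError:
        # The name cannot even be expressed in this encoding.
        return False


# Shipped by the distribution with a utf-8 filename.
system_file = filename_bytes(MODULE_NAME, "utf-8")
# Saved on-site by a user whose locale is ISO-8859-1.
local_file = filename_bytes(MODULE_NAME, "iso-8859-1")

for encoding in ("utf-8", "iso-8859-1"):
    print(encoding,
          "system module found:", can_find(MODULE_NAME, system_file, encoding),
          "local module found:", can_find(MODULE_NAME, local_file, encoding))
```

Under this model neither locale setting finds both files: each encoding
matches exactly one of the two byte strings.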

-Toshio