[Python-Dev] Import and unicode: part two

Toshio Kuratomi a.badger at gmail.com
Thu Jan 20 05:39:01 CET 2011

On Thu, Jan 20, 2011 at 03:51:05AM +0100, Victor Stinner wrote:
> For a lesson at school, it is nice to write examples in the
> mother language, instead of using "raw" english with ASCII identifiers
> and filenames.

Then use this::
   import cafe as café

When you do things this way you do not have to translate from unknown
encodings into unicode.  Everything is within python source where you have
a defined encoding.
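
A rough sketch of that approach follows.  The module cafe.py, its MENU
attribute and the load_cafe helper are all made up for illustration.  The
file on disk has an ASCII name; the non-ASCII name exists only inside the
source file, whose encoding is known:

```python
import importlib
import pathlib
import sys
import tempfile


def load_cafe(directory):
    """Create an ASCII-named module on disk and import it (hypothetical helper)."""
    # The filename is pure ASCII, so the filesystem encoding never matters.
    pathlib.Path(directory, "cafe.py").write_text(
        "MENU = ['espresso', 'noisette']\n", encoding="ascii"
    )
    sys.path.insert(0, directory)
    try:
        importlib.invalidate_caches()
        # The non-ASCII name is only an alias inside this source file.
        import cafe as café
        return café
    finally:
        sys.path.remove(directory)


with tempfile.TemporaryDirectory() as tmp:
    café = load_cafe(tmp)
    print(café.__name__, café.MENU)
```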

Teaching students to write non-portable code (code that relies on the
filesystem encoding, where your solution is "don't upload anything with
non-ascii filenames to pypi") seems like the exact opposite of how you'd want
to shape a young student's understanding of good programming practices.

> In a school, you can use the same configuration
> (encoding) on all computers.
In a school computer lab perhaps.  But not on all the students' and
professors' machines.  How many professors will be cursing python when they
discover that the example code that they wrote on their Linux workstation
doesn't work when the students try to use it in their Windows computer lab?
How many students will be upset when the code they turn in runs on their
professor's test machine if the lab computers were booted into the Linux
partition but not if they were booted into Windows?

> > > > * Specify an encoding per platform and stick to that.
> > > 
> > > It doesn't work: on UNIX/BSD, the user chooses its own encoding and all
> > > programs will use it.
> > > 
> > (...) This prevents getting a mixture of encodings of modules (...)
> If you have an issue with encodings, you have to fix it when you create
> a module (on disk), not when you load a module (it is too late).
It's not too late to raise a clear error explaining what's wrong.
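
Something along these lines would do, as a rough sketch.  It assumes the
encoding specified for the platform is UTF-8; the function name and the
error wording are invented for illustration and are not taken from any
patch:

```python
def decode_module_filename(raw_name, expected_encoding="utf-8"):
    """Decode a module filename, failing loudly if it uses another encoding.

    Hypothetical helper: expected_encoding stands in for whatever single
    encoding the platform would specify.
    """
    try:
        return raw_name.decode(expected_encoding)
    except UnicodeDecodeError as exc:
        raise ImportError(
            "module file %r is not valid %s; rename it using the "
            "platform's module filename encoding" % (raw_name, expected_encoding)
        ) from exc


# UTF-8 bytes decode cleanly.
print(ascii(decode_module_filename(b"caf\xc3\xa9.py")))

# Latin-1 bytes are rejected with a message naming the offending file.
try:
    decode_module_filename(b"caf\xe9.py")
except ImportError as error:
    print(error)
```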

> > I haven't looked at your patch so
> > perhaps you have an ingenious method of translating from the unicode
> > representation of the module in the import statement to the bytes in
> > arbitrary encodings on the filesystem that I haven't thought of.
> On Windows, my patch tries to avoid any conversion: it uses unicode
> everywhere.
> On other OSes, it uses the Python filesystem encoding to encode a module
> name (as it is done for any other operation on the filesystem with a
> unicode filename).
The other interfaces are somewhat of a red herring here.  As I wrote in
another email, importing modules has ramifications that open(), for
instance, does not.  Additionally, those other filesystem operations have
been growing the ability to take byte values and encoding parameters because
unicode translation via a single filesystem encoding is a good default but
not a complete solution.
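
For reference, a small sketch of the bytes-in, bytes-out behaviour those
interfaces offer.  The file name plain.txt and the helper list_raw_names are
made up for the example; os.fsencode(), os.listdir() and open() are the
standard library calls:

```python
import os
import tempfile


def list_raw_names(directory):
    """Return directory entries as bytes, with no filename decoding applied."""
    # Passing a bytes path makes os.listdir() return bytes names.
    return sorted(os.listdir(os.fsencode(directory)))


with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "plain.txt"), "wb") as handle:
        handle.write(b"hello\n")

    raw_names = list_raw_names(tmp)
    print(raw_names)

    # open() accepts a bytes path too, so the name is never decoded.
    raw_path = os.path.join(os.fsencode(tmp), raw_names[0])
    with open(raw_path, "rb") as handle:
        content = handle.read()
    print(content)
```

An import statement has no equivalent escape hatch: the module name is
always text, so there is nowhere to pass the raw bytes or an encoding.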

I think that this problem demands a complete solution, however, and it seems
to me that limiting the scope of the problem is the most pleasant method to
accomplish this.  Your solution creates modules which aren't portable.  One
of my proposals creates python code which isn't portable.  The other one
suffers some of the same disadvantages as your solution in portability but
allows for tools that could automatically correct modules.

> --
> Python 3 supports bytes filenames to be able to read/copy/delete
> undecodable filenames, filenames stored in an encoding different from the
> system encoding, broken filenames. It is also possible to access these
> files using PEP 383 (with surrogate characters). This is useful to use
> Python on an old system.
> > If you don't, however, then really - ASCII-only seems like the sanest 
> > of the three solutions I can think of.
> But a (Python 3) module is not supposed to have a broken filename. If it
> is the case, you had better fix its name, instead of trying to fix
> the problem later (in Python).
We agree that there should not be broken module names.  However it seems we
very hotly disagree about the definition of that.  You think that if
a module is named appropriately on one system but is not portable to another
system, that's fine.  I think that portability between systems is very
important and sacrificing that so that someone can locally use a module with
non-ASCII characters doesn't have a justifiable reward.

> With UTF-8 filesystem encoding (eg. on Mac OS X, and most Linux setups),
> it is already possible to use non-ASCII module names.
Tangent: This is not true about Linux.  The filesystem stores bytes; reading
them as UTF-8 is an interpretation that the user chooses by setting their
system locale.  Setting the system locale to ASCII for use in system-wide
scripts is quite common, as is changing locale settings in other parts of
the world (as I can tell you from the bug reports colleagues CC me on to fix
for the problems with unicode support in their python2 programs).  Allowing
module names incompatible with ascii without specifying an encoding will
just lead to bug reports down the line.
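
To make the problem concrete, here is a small illustration using nothing
but str.encode() and bytes.decode().  The name "café" and the helper
representable are only examples:

```python
name = "caf\u00e9"  # the text "café"

# The same module name becomes different bytes under different locales.
utf8_bytes = name.encode("utf-8")      # b'caf\xc3\xa9'
latin1_bytes = name.encode("latin-1")  # b'caf\xe9'


def representable(text, encoding):
    """Return True if text can be encoded in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Under an ASCII locale the name cannot be turned into a filename at all.
print(representable(name, "ascii"))

# A file created under a UTF-8 locale is misread under a Latin-1 locale.
misread = utf8_bytes.decode("latin-1")
print(ascii(misread))
```

A module saved as café.py under one of these settings is simply not found,
or is found under a different name, by an interpreter running under another.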

Relatively few programmers understand the difference between the python
unicode abstraction and the byte representations possible for those strings.
Allowing non-ascii characters in module filenames without specifying an
encoding sets a trap for these programmers to fall into when they move
beyond their studies to programming for customers, pypi downloaders, etc. who
don't have the same environment as they do.
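
A check for the trap is cheap to write.  As a sketch only (the function name
is hypothetical and this is not an existing tool), something like this could
flag offending files before they are distributed:

```python
def non_ascii_module_names(filenames):
    """Return the .py filenames that contain non-ASCII characters."""
    flagged = []
    for filename in filenames:
        if not filename.endswith(".py"):
            continue
        try:
            # ASCII-only names encode identically under every common locale.
            filename.encode("ascii")
        except UnicodeEncodeError:
            flagged.append(filename)
    return flagged


flagged = non_ascii_module_names(["cafe.py", "caf\u00e9.py", "notes.txt"])
print([ascii(item) for item in flagged])
```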
