[Python-Dev] Import and unicode: part two

Fri Jan 21 09:45:44 CET 2011

Nick Coghlan writes:
 > On Fri, Jan 21, 2011 at 3:44 PM, Atsuo Ishimoto <ishimoto at gembook.org> wrote:

 > > I don't want Python to encourage people to use non-ascii module names.

I don't think anybody is *encouraging* it.  The argument is for
*permitting* it, partly for consistency with other identifiers, and
partly because of Python's usual "consenting adults" standard for
permitting "dangerous" practices.

I realize this is a somewhat problematic distinction in Japan, for
several reasons, but it's really not one that can be avoided in
computing in any case.  The sooner novice programmers learn it, the
better.

 > > Today, seeing UnicodeEncodingError is one of popular reasons for
 > > newbies to abandon learning Python in Japan. Non-ascii module name is
 > > an another source of confusion for newbies.
 > >
 > > Experienced Japanese programmers may not use non-ascii module names to
 > > avoid encoding issues.
 > >
 > > But novice programmers or non-programmers willing to learn programming
 > > with Python will wish to use Japanese module names. Their programs
 > > will stop working if they copy them to another environment. Sooner or
 > > later, they will see storange ImportError and will start complaining
 > > "Python sucks! Python doesn't support Japanese!" on Twitter.

So ask them, "What language *does* 'support Japanese'?" ;-)

Seriously, "support Japanese" is an impossibly hard standard in the
current environment.  Not only does Japan have 5 more or less standard
encodings still in daily use (EUC-JP, ISO-2022-JP, Shift JIS, UTF-8,
and UTF-16LE), but many major IT companies have their own variants of
the JIS standard character repertoire (all of the variant ideographs
I've seen in the wild are in Unicode, but many corporate repertoires
add extra symbols that are not), and of course some Microsoft
utilities insist on using the deprecated UTF-8 signature with UTF-8.

That said, I really don't see module names as a particular problem.
By the time your novice is using her own modules (as opposed to
importing stdlib and PyPI add-on modules, all with ASCII-only names),
she'll be doing file I/O which has all the same problems, AFAICS.
True, file names will be strings rather than identifiers, but I don't
see why that matters.

 > > Copying files with non-ascii file name over platform is not easy as it
 > > sounds.

Agreed, it's not trivial.  But it's not that hard, either[1], and web
hosts and others *could* help by providing checkers for languages that
they support.

 > > What happen if I copy such files from OSX to my web hosting
 > > server ? Results might differ depending on tools I use to copy and
 > > platforms.

I don't see why this problem is specific to Python modules, as opposed
to any file name.

 > These all sound like good reasons to continue to *advise* against
 > using non-ASCII module names.

+1

 > But aside from that, they sound exactly like a lot of the arguments
 > we heard when Py3k started enforcing the bytes/text distinction
 > more rigorously: "you're going to break stuff!".

Well, not exactly.  Enforcing the bytes/text distinction was a change
in the definition of Python; breakage was our fault.  The change was
made because in the (not so) long run it would reduce new breakage.

Here, Python is fine (or at least we have some pretty good ideas how
to fix it), it's the world that's broken.  *Especially* Japan, with
its five standard encodings in daily use and scads of private variant
repertoires masquerading as standard encodings on top of that.  But
the whole world is broken because of the NFD/NFC thing.  AFAIK, the
only file system that tries to enforce an NF is Mac OS X HFS+, and
(unfortunately for portability *from* Mac OS X *to* other systems)
they chose NFD.  Proper NFD support is arguably better for a number of
reasons (for one, people regularly invent new composition sequences
that will not have precomposed glyphs in any font), but NFC has the
advantage that existing fonts support precomposed standard characters
while many display engines do not support composition properly yet.
And it's likely to stay broken for a while: the move to conformant
display engines is going to take more time.

I still don't see this as a reason to give up on non-ASCII module
names.  Just have the documentation warn that many non-ASCII names
will be non-portable, so use on multiple systems will require care
(maybe gloss that with "probably more care than you want to take").

Footnotes: 
[1]  I actually find copying file names with spaces to be a bigger
problem, because it's so hard to get shell quoting right.