[Python-Dev] Import and unicode: part two
v+python at g.nevcal.com
Thu Jan 27 03:43:11 CET 2011
On 1/26/2011 4:47 PM, Toshio Kuratomi wrote:
> There's one further case that I am worried about that has no real
> "transfer". Since people here seem to think that unicode module names are
> the future (for instance, the comments about redefining the C locale to
> include utf-8 and the comments about archiving tools needing to support
> encoding bits), there are eventually going to be unicode modules that become
> dependencies of other modules and programs. These will need to be installed
> on systems. Linux distributions that ship these will need to choose
> a filesystem encoding for the filenames of these. Likely the sensible thing
> for them to do is to use utf-8 since all the ones I can think of default to
> utf-8. But, as Stephen and Victor have pointed out, users change their
> locale settings to things that aren't utf-8 and save their modules using
> filenames in that encoding. When they update their OS to a version that has
> utf-8 python module names, they will find that they have to make a choice.
> They can either change their locale settings to a utf-8 encoding and have
> the system installed modules work or they can leave their encoding on their
> non-utf-8 encoding and have the modules that they've created on-site work.
> This is not a good position to put users of these systems in.
The way this case should work is that programs that install files
(installation is a form of transfer) should transform their names from
the encoding used in the transfer medium to the encoding of the
filesystem on which they are installed.
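A minimal sketch of that install-time transcoding, under the assumption that the installer knows the archive's name encoding (here taken to be UTF-8; `install_name` is a hypothetical helper, not part of any real installer):

```python
import os

# Hypothetical installer helper: instead of copying raw name bytes
# through, decode them from the transfer medium's (assumed) encoding,
# then re-encode for the local filesystem via os.fsencode().
def install_name(archived: bytes, archive_encoding: str = 'utf-8') -> bytes:
    text = archived.decode(archive_encoding)   # transfer medium -> str
    return os.fsencode(text)                   # str -> local filesystem bytes
```

On a system whose filesystem encoding matches the archive's, this is the identity; on one that differs, the name is rewritten rather than left as mojibake.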
Python3 should access the files, transforming the names from the
encoding of the filesystem on which they are installed to Unicode for
use by the program.
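At the Python level, that transformation is exposed as `os.fsdecode()` and `os.fsencode()` (added in Python 3.2), which use the filesystem encoding plus the surrogateescape error handler so that even undecodable bytes round-trip:

```python
import os

# Python 3 decodes byte filenames with sys.getfilesystemencoding()
# and the surrogateescape error handler, so arbitrary bytes survive
# a round trip through str even when they are not valid in the
# locale's encoding.
name_bytes = b'caf\xc3\xa9.py'          # UTF-8 bytes for "café.py"
name_str = os.fsdecode(name_bytes)      # bytes -> str, for program use
round_trip = os.fsencode(name_str)      # str -> bytes, for the filesystem

assert round_trip == name_bytes         # lossless either way
```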
I think Python3 is trying to do its part, and Victor is trying to make
that more robust on more platforms, specifically Windows.
The programs that install files (which may include programs that install
Python files; I don't know) may or may not be doing their part, but
clearly there are cases where they do not.
Systems that have different encodings for names on the same or different
file systems need to have a way to obtain the encoding for the file
names, so they can be properly decoded. If they don't have such a way,
they are broken.
The rest of this is an attempt to describe the problem of Linux and
other systems that use byte strings instead of character strings as
file names. That is no problem as long as programs allow byte strings
as file names; Python3 does not, for the import statement, so the
problem is relevant to the ongoing discussion here.
Since file names are defined to be byte strings, there is no way to
obtain the encoding for file names, so they cannot always be decoded,
and sometimes not properly decoded, because no one knows which encoding
was used to create them, _if any_.
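This is easy to demonstrate on POSIX: the kernel stores names as raw bytes and records no encoding, so listing a directory with a bytes argument returns exactly the bytes that were used at creation time (a minimal sketch, run on Linux):

```python
import os
import tempfile

# On POSIX, filenames are byte strings with no recorded encoding.
# Passing bytes to os.listdir() returns the raw bytes, undecoded.
d = tempfile.mkdtemp().encode()
raw = b'm\xf6dul.py'                 # Latin-1 bytes; not valid UTF-8
open(os.path.join(d, raw), 'w').close()
names = os.listdir(d)                # bytes in, bytes out: no decoding
assert raw in names
```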
Hence, Linux programs that use character strings as file names
internally and expect them to match the byte strings in the file system
are promoting a fiction: that there is a transformation (encoding) from
character strings to byte strings that will match.
When using ASCII character strings, they can be transformed to bytes
using a simple transformation: identity... but that isn't necessarily
correct, if the files were created using EBCDIC (unlikely on Linux
systems, but not impossible, since Linux files are byte strings).
When using non-ASCII character strings, the fiction promoted is even
bigger, and the transformation even harder. Any 8-bit character
encoding can pretend that identity is the correct transformation, but
the result is mojibake if it isn't. Unicode and other multi-byte encodings
have an even harder job, because there can be 8-bit sequences that are
not legal for some transformations, but are legal for others. This is
when the fiction is exposed!
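Concretely: any single-byte encoding will "successfully" decode any byte string, but a multi-byte encoding such as UTF-8 rejects sequences that are not legal for it:

```python
# Any 8-bit encoding can pretend the identity mapping is correct;
# multi-byte encodings cannot, because some byte sequences are illegal.
raw = b'm\xf6dul'                          # Latin-1 bytes for "mödul"

assert raw.decode('latin-1') == 'm\xf6dul' # every byte "decodes"
try:
    raw.decode('utf-8')                    # 0xf6 lacks continuation bytes
    assert False, 'unreachable'
except UnicodeDecodeError:
    pass                                   # the fiction is exposed
```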
As the recent description of glib points out, when the file names are
read as bytes, and shown to the user for selection, possibly using some
mojibake-generating transformation to characters, the user has a
fighting chance to pick the right file, with less chance if the
transformation is lossy ('?' substitutions, etc.) and/or several names
become indistinguishable in the characters that do survive.
However, when the specification of the name is in characters (such as
for Python import, or file names specified as character constants in any
application system that provides/permits such), and there are large
numbers of transformations that could be used to convert characters to
bytes, the problem is harder, and error-prone... programs that want to
promote the fiction of using characters for filenames must work harder.
It seems that Python on Linux is such a program.
One technique is to have conventions agreed on by applications and users
to limit the number of encodings used on a particular system to one
(optimal) or a few, the latter requires understanding that files created
in one encoding may not be accessible by systems that use a different
one... until they are renamed. Subsets of applications and users can
then happily share files with others of their encoding, and with the
subset of files that can be decoded successfully by their encoding, even
though it is not correct (often ASCII, or a few mojibake characters
learned for cross-subset usage). When multiple encodings are used
without such conventions, chaos results.
Another technique that would be amusing is to use Base64 (as Oleg
suggested), URL-encoding, or some other mapping that transforms
non-ASCII names to ASCII character sequences and the identity mapping to
obtain bytes, and then Python could ship such files to any system, as
long as it always included that mapping among the encodings it tries
when looking files up. This would probably be the most powerful solution,
but would only need to be applied to those systems that do not use
characters for filenames. It could, in fact, be applied on any system
that uses a subset of characters for filenames, and hence transcends the
need for Unicode support in a file system to use Unicode names in
Python3 import statements. It would likely be problematical for use
with 3rd-party libraries, however.
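Such a mapping might look like the following sketch. This is a hypothetical scheme, not any real Python feature; the `b64_` prefix and the helper names are invented for illustration:

```python
import base64

# Hypothetical scheme: map a non-ASCII module name to a pure-ASCII
# filename via URL-safe Base64, so the identity mapping to bytes is
# valid on any filesystem. ASCII names pass through untouched.
def to_ascii_filename(module_name: str) -> str:
    if module_name.isascii():
        return module_name + '.py'
    encoded = base64.urlsafe_b64encode(module_name.encode('utf-8'))
    return 'b64_' + encoded.decode('ascii').rstrip('=') + '.py'

def from_ascii_filename(filename: str) -> str:
    stem = filename[:-3]                   # strip '.py'
    if not stem.startswith('b64_'):
        return stem
    data = stem[4:]
    data += '=' * (-len(data) % 4)         # restore Base64 padding
    return base64.urlsafe_b64decode(data).decode('utf-8')
```

An importer that also probed for the `b64_` form of a name could then find such modules on any ASCII-capable filesystem.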
Another technique would be to try each possible encoding in turn, in
some defined order, and the filesystem searched for that byte string as
a file name, possibly matching files that shouldn't have been matched.
To limit that search, such programs could allow configuration of a
smaller ordered list of encodings to try, and a specific one to be used
for the creation of new files; this opens up the possibility of not
trying the "right" encoding for some rogue file name.
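A sketch of that ordered probing (an assumed strategy, not CPython's real import machinery; `CANDIDATES` and `find_file` are invented names):

```python
import os
import tempfile

# Try each configured encoding in order: encode the character-string
# name and probe the directory for those bytes. The first match wins,
# which may be the wrong file; a name in an untried encoding is missed.
CANDIDATES = ['utf-8', 'latin-1', 'koi8-r']    # smaller ordered list

def find_file(dirpath: bytes, name: str):
    entries = set(os.listdir(dirpath))
    for enc in CANDIDATES:
        try:
            candidate = name.encode(enc)
        except UnicodeEncodeError:
            continue                   # name not representable in enc
        if candidate in entries:
            return candidate, enc
    return None
```

For example, a file created under a Latin-1 locale is found on the second probe, after the UTF-8 spelling of the name fails to match.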
This would be an issue and implementation for Linux systems, but would
not need to be used on systems such as MacOS (which defines a particular
encoding) or Windows (which defines a particular encoding) etc. When
mounting filesystems that use byte string file names on systems with a
defined encoding, it should be the responsibility of the mounting system
to do such transformations, and possibly have such configurations, and
possibly have mappings or renaming facilities, and possibly prohibit
access to files whose names cannot be transformed (of course, one can
always punt by configuring latin-1 or other encodings that can match any
byte string, but that produces mojibake, and then there is no surety
that particular files will appear to have the name that programs expect).
Of course, Victor's patch addresses Windows issues; Windows has defined
encodings, and it is just a matter of using the proper APIs to see
them, so the patch should be accepted.
It sounds like the current situation on Linux is that Python can access
the subset of files that match the locale encoding for which it is run.
It sounds like it would be inappropriate for Python to begin shipping
files with non-ASCII names as part of its Linux distribution, unless
facilities are created or tools used to remap non-ASCII names to the
local locale encoding. Locales that are not ASCII supersets (in
character repertoire, not encoding) could not be supported. Locales
that do not support all the characters used in files shipped with Python
could not be supported. Since locales vary wildly in their available
non-ASCII names, that limits Python eithr to shipping ASCII names only,
or restricting the locales that are supported to those that support the
I suppose that Victor's patch would point out most or all the places
where such transformations would have to be implemented, if it is
important to support systems having byte string file names whose users
cannot agree to use a single encoding for transforming to/from characters.