Re: [Python-Dev] Adding the 'path' module (was Re: Some RFE for review)

15 Jul 2005

Martin v. Löwis wrote:
...
Guido van Rossum wrote:
...
Ah, sigh. I didn't know that os.listdir() behaves differently when the
argument is Unicode. Does os.listdir(".") really behave differently
than os.listdir(u".")? Bah! I don't think that's a very good design
(although I see where it comes from). Promoting only those entries
that need it seems the right solution.
Unfortunately, this solution is hard to implement (I don't know whether
it can be implemented correctly at all; at least on Windows, I see no
way to implement it efficiently).
Here are a number of problems/questions:
- On Windows, should listdir use the narrow or the wide API? Obviously
  the wide API, since it is not Python which returns the question marks,
  but the Windows API.
Right.
...
- But then, the wide API gives all results as Unicode. If you want to
  promote only those entries that need it, it really means that you
  only want to "demote" those that don't need it. But how can you tell
  whether an entry needs it? There is no API to find out.
  You could declare that anything with characters >128 needs it,
  but that would be an incompatible change: If a character >128 in
  the system code page is in a file name, listdir currently returns
  it in the system code page. It then would return a Unicode string.
  Applications relying on the old behaviour would break.
We will need a Python C API that returns:

* a string if the Unicode value is representable in the
  default encoding (usually ASCII)

* Unicode if it is not

The file system encoding should be hidden in the OS
layer (e.g. posixmodule). Python should only return
strings with the default encoding and Unicode
otherwise.
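
A minimal sketch of that rule, written for Python 3, where bytes
stands in for the 2.x string type and str for the 2.x Unicode type.
The helper name demote_if_possible is hypothetical and is not an
existing Python or C API; it only illustrates the "string if
representable, Unicode otherwise" decision:

```python
def demote_if_possible(name, default_encoding="ascii"):
    """Return a byte string if `name` fits the default encoding, else text.

    Hypothetical helper: Python 3 `bytes` plays the role of the 2.x `str`
    type, and Python 3 `str` plays the role of the 2.x `unicode` type.
    """
    try:
        return name.encode(default_encoding)
    except UnicodeEncodeError:
        # Not representable in the default encoding: keep it as Unicode.
        return name


names = ["README.txt", "L\u00f6wis.txt"]
results = [demote_if_possible(n) for n in names]
```

Callers of such an API would have to cope with a mixed list, which
is the compatibility concern raised above.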

See my suggestion to Neil about making the transition to
this new strategy less painful.
...
- On Unix, all file names come out as byte strings. Again, how do
  you know which ones to promote, and using what encoding? Python
  currently guesses an encoding, but that may or may not be the one
  intended for the file name.
This is a tough one: AFAIK the file system encoding in Unix
was never really specified, in fact most file systems just
stored the names as-is without any encoding information attached
to it.

Things are moving in the direction of using UTF-8 for
filenames, though.

To solve this issue, various applications have come up with
ways around the problem, e.g. GTK uses the following strategy
to find the encoding (in the given order and adjustable using
an environment variable):

1. locale based encoding, if given (UTF-8 on most modern Unixes)
2. UTF-8
3. Latin-1
4. CP1252 (Windows Latin-1 version)

Perhaps we should add similar support to Python?
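
As a rough illustration of what such support might look like, here
is a Python 3 sketch of the fallback order listed above. The names
candidate_encodings and decode_filename are hypothetical, and this
is not how GTK implements it; it only tries each encoding in turn:

```python
import locale

# Fallback order from the list above, after the locale-based encoding.
FALLBACKS = ("utf-8", "latin-1", "cp1252")


def candidate_encodings(locale_encoding=None):
    """Return the encodings to try, locale encoding first (hypothetical)."""
    if locale_encoding is None:
        locale_encoding = locale.getpreferredencoding(False)
    ordered = []
    for enc in (locale_encoding,) + FALLBACKS:
        # De-duplicate by literal name only; aliases are not normalised.
        if enc and enc not in ordered:
            ordered.append(enc)
    return ordered


def decode_filename(raw, candidates):
    """Decode `raw` with the first candidate encoding that accepts it."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue
    raise ValueError("no candidate encoding could decode %r" % (raw,))


utf8_name = decode_filename(b"caf\xc3\xa9", candidate_encodings("utf-8"))
latin1_name = decode_filename(b"caf\xe9", candidate_encodings("utf-8"))
```

Note that Latin-1 accepts every byte sequence, so in a plain
try-in-order scheme like this one the CP1252 step is never reached.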

We should probably use a file system encoding default
of Latin-1 on Unix if no other information can be found.

That way we will ensure that things don't change on
Unix unless explicitly set up by the user (Latin-1 is
round-trip safe when converting it to Unicode and back).
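
The round-trip property is easy to check. A small Python 3 sketch
(bytes here stands in for the 2.x plain string; the variable names
are only illustrative):

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Latin-1 maps each byte to the code point with the same number...
as_text = all_bytes.decode("latin-1")
code_points_match = [ord(ch) for ch in as_text] == list(range(256))

# ...so encoding back yields exactly the original bytes.
round_tripped = as_text.encode("latin-1")
```

No byte sequence can fail to decode under Latin-1, which is why an
arbitrary file name survives the conversion unchanged.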

os.listdir() would then continue to return plain strings
and file() will open them just as it does now.

-- 
Marc-Andre Lemburg
eGenix.com

