Re: [Python-Dev] Unicode strings as filenames
Recently, "M.-A. Lemburg"
Jack Jansen wrote:
Off on a slight tangent: On Mac OS X the default 8-bit encoding is UTF8. os.listdir() handles this fine and so does open(). The OS does all the hard work for you [...] But in Python (unix-Python we're talking here, not MacPython), unicode(filename) fails, because site.encoding is "ascii".
Would it be safe to set site.encoding to utf8 on Mac OS X by default?
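For illustration, here is a minimal sketch of the failure Jack describes. It is written for current Python 3, where `bytes.decode` plays the role that `unicode()` played at the time, and the filename is a made-up example rather than one from the thread.

```python
# Hypothetical filename as the OS might hand it back: UTF-8 encoded bytes.
raw_name = "r\u00e9sum\u00e9.txt".encode("utf-8")


def decode_name(raw, encoding):
    """Decode raw filename bytes; return None if they do not fit the encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# With an ASCII default the conversion fails, because the name contains
# bytes above 0x7F.
ascii_result = decode_name(raw_name, "ascii")

# With UTF-8 the same bytes decode cleanly.
utf8_result = decode_name(raw_name, "utf-8")

print(ascii_result, utf8_result)
```

The helper `decode_name` is an assumption made for this sketch; it only shows that the same bytes are rejected by one codec and accepted by the other.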
I'd rather suggest to use UTF-8 as default encoding in the subsystem layer I was talking about.
Uhm... Do you mean Py_FileSystemDefaultEncoding? Otherwise: what do you mean? And, if you do mean Py_FSDE, would that also work for listdir()? No, I guess it can't because listdir() returns simple strings, so by the time I pass them to unicode() all knowledge that they came from listdir is gone... Hmm, shouldn't StringObjects themselves carry an encoding field (defaulting to sys.encoding)? That would solve quite a few issues. read() from a binary file would return the special encoding "binary", for instance, and then the "u" and "u#" formats could make a distinction between character strings (which would be converted to unicode using the encoding they carry) and binary strings (which would be interpreted as 16-bit chars). But interning may be a showstopper, now that I think of it...
Making UTF-8 the default Python system encoding would have many other consequences -- and you'd probably lose a great deal of portability since UTF-8 conversion (nearly) always will succeed while ASCII can easily fail on other systems which use e.g. Latin-1 as native encoding.
What are your reasons for asserting this? If I read this correctly this would make Python compatible to the least common denominator of all platforms, while I think I would prefer it to allow access to all the niceties a platform gives. On Unix you really don't have a good guess for the encoding, but on MacOS and Windows you do...
--
Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen@oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.cwi.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm
Hmm, shouldn't StringObjects themselves carry an encoding field (defaulting to sys.encoding)?
That approach has been discussed during the design phase of the Unicode API; Bill Janssen was the first to propose this in response to my talk (http://www.python.org/workshops/1997-10/proceedings/loewis.html). During the Unicode design, this idea came up from time to time, but it always turned out that the proposers could not give coherent semantics to such tags. Just explain what happens if you add two strings that have different encodings.
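A small sketch of the concatenation problem, in current Python 3 with byte strings standing in for the 8-bit strings under discussion. The sample words and the helper `try_decode` are assumptions made for the example.

```python
# Two byte strings that would carry different encoding tags.
latin1_part = "caf\u00e9".encode("latin-1")   # b'caf\xe9'
utf8_part = "na\u00efve".encode("utf-8")      # b'na\xc3\xafve'

# Concatenation works at the byte level, but which tag should the result get?
mixed = latin1_part + b" " + utf8_part


def try_decode(data, encoding):
    """Decode data; return None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Tagging the result as UTF-8 makes it undecodable: 0xE9 followed by a
# space is not a valid UTF-8 sequence.
as_utf8 = try_decode(mixed, "utf-8")

# Tagging it as Latin-1 decodes without error but garbles the second word.
as_latin1 = try_decode(mixed, "latin-1")

print(as_utf8, as_latin1)
```

Neither tag recovers the original text, so a tagged string type would have to either transcode on every concatenation or refuse it; the sketch does not pick between those options.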
That would solve quite a few issues.
And introduce many new ones.
Making UTF-8 the default Python system encoding would have many other consequences -- and you'd probably lose a great deal of portability since UTF-8 conversion (nearly) always will succeed while ASCII can easily fail on other systems which use e.g. Latin-1 as native encoding.
What are your reasons for asserting this?
If I understand this claim correctly, he means: "Currently, if auto-conversion (to ASCII) succeeds, the result is likely correct. If the default encoding were UTF-8, conversion would succeed for all Unicode objects, but give incorrect results for many users, e.g. if they use Latin-1 on their terminal." This is actually a frequent problem since the introduction of UTF-8: Some applications display the bytes that make up a UTF-8 string as if it were a Latin-1 string, rendering it completely unreadable (although I can already recognize my name if I run into such an application). This problem may go unnoticed during testing, whereas an exception is likely noticed.
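The difference between a loud failure and a silent wrong result can be sketched as follows, again in current Python 3. The name used and the helper `try_encode` are assumptions for the example; the Latin-1 decode stands in for a terminal or application that interprets the bytes as Latin-1.

```python
name = "L\u00f6wis"


def try_encode(text, encoding):
    """Encode text; return None if a character does not fit the encoding."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


# ASCII default: the conversion fails, which is hard to overlook.
ascii_bytes = try_encode(name, "ascii")

# UTF-8 default: the conversion always succeeds...
utf8_bytes = try_encode(name, "utf-8")

# ...but a consumer that assumes Latin-1 shows two wrong characters in
# place of the single non-ASCII one, and no error is ever raised.
shown_as_latin1 = utf8_bytes.decode("latin-1")

print(ascii_bytes, utf8_bytes, shown_as_latin1)
```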
If I read this correctly this would make Python compatible to the least common denominator of all platforms, while I think I would prefer it to allow access to all the niceties a platform gives.
It does no such thing. The application has full control over all conversions, if it initiates them explicitly. Explicit is better than implicit.
Regards, Martin
participants (2): Jack Jansen, Martin v. Loewis