Given the stated rationale of PEP 383, I was wondering what Windows actually does. So, I created some ISO8859-15 and ISO8859-8 encoded file names on a device, plugged them into my Windows Vista machine, and fired up Python 3.0.<br>
<br>First, os.listdir("f:") returns a list of strings for those file names... but those Unicode strings are malformed.<br><br>You can't even print them without getting an error from Python. In fact, you also can't print strings containing the half-surrogate encodings that PEP 383 proposes: in both cases, the output encoder rejects them with a UnicodeEncodeError. (If not even Python, with its generally lenient attitude, can print those things, some other libraries probably will fail, too.)<br>
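To make the failure concrete, here's a minimal sketch (assuming Python 3 with a strict UTF-8 output codec; the file name and the escaped byte are made up for illustration):<br>

```python
# A lone half-surrogate -- the kind PEP 383's proposed error handler
# would produce for an undecodable byte such as 0xE9 -- is rejected by
# the strict UTF-8 codec, which is effectively what print() invokes on
# a UTF-8 terminal.
s = "caf\udce9"  # hypothetical file name containing an escaped byte
try:
    s.encode("utf-8")
except UnicodeEncodeError as e:
    print("output encoder rejected it:", e.reason)
```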
<br>What about round-tripping? If you take a malformed file name from an external device (say, one actually encoded in ISO8859-15 or an East Asian encoding) and write it to an NTFS directory, Windows appears to write malformed UTF-16 file names. In essence, Windows doesn't really use Unicode; it implements raw 16-bit character strings, just as UNIX historically implements raw 8-bit character strings.<br>
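For comparison, the byte-level round trip that PEP 383 aims for can be sketched like this (assuming a Python that has the proposed surrogateescape error handler, which 3.0 itself does not; the byte string is a made-up ISO8859-15-style example):<br>

```python
# Bytes that are not valid UTF-8, as an ISO8859-15 file name might be.
raw = b"r\xe9sum\xe9"

# Undecodable bytes are smuggled into the string as half-surrogates...
name = raw.decode("utf-8", "surrogateescape")

# ...and encoding with the same handler restores the original bytes.
assert name.encode("utf-8", "surrogateescape") == raw
```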
<br>Then I tried the same thing on my Ubuntu 9.04 machine. It turns out that, unlike Windows, Linux seems to be moving to consistent use of valid UTF-8. If you plug in an external device and nothing else is known about it, it gets mounted with the utf8 option, and the kernel actually seems to enforce UTF-8 encoding. I think this calls into question the rationale behind PEP 383, and we should first look into what the roadmap for UNIX/Linux and UTF-8 actually is. UNIX may have consistent Unicode support (via UTF-8) before Windows does.<br>
<br>As I was saying, I think PEP 383 needs a lot more thought and research...<br><br>Tom<br><br>