[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Paul Moore p.f.moore at gmail.com
Fri Apr 24 14:00:40 CEST 2009


2009/4/24 Simon Cross <hodgestar+pythondev at gmail.com>:
> On Fri, Apr 24, 2009 at 12:04 PM, Glenn Linderman <glenn at nevcal.com> wrote:
>> The goal of Unicode users everywhere is to use Unicode for everything, no?
>>  After all, all "real" files should have Unicode based names, and the only
>> proper byte sequences that should exist are UTF-8 encoding Unicode bytes.
>>  (Cheek to tongue: Get out of here!)
>
> Humour aside :), the expectation that filenames are Unicode data
> simply doesn't agree with the reality of POSIX file systems.

However, it *does* agree with the reality of Windows file systems. The
fundamental problem here is that there is a strong OS disparity: on
Windows, the OS uses Unicode; on POSIX, the OS uses bytes.
Traditionally, Python has been happy to expose OS differences and let
application code address platform portability issues. But this is such
a fundamental area that doing so is problematic: it could easily
result in *more* code being OS-specific (in subtle ways that only
affect users of non-Latin alphabets) rather than less.

That is why it makes sense to have *some* means of normalising things
in a way that does the best it can. The raw bytes interfaces should be
available for POSIX users writing low-level code that *must* handle
all possible nightmare scenarios[1], but Martin's proposal is designed
to handle "the majority of cases" in a platform-independent way. To
that end, a string-based interface makes sense, as frankly that's how
"normal" users think of filenames. The rest of Martin's proposal seems
to follow the same sort of practical approach.
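
As a rough illustration of the string-based approach, here is a minimal
sketch of the round trip that PEP 383 relies on. It uses the
"surrogateescape" error handler as it later shipped in Python 3.1; the
example filename and the choice of UTF-8 as the filesystem encoding are
assumptions made purely for illustration.

```python
# A POSIX filename as raw bytes. 0xE9 is Latin-1 for "e acute" and is
# not valid UTF-8 here, so a strict decode would fail.
raw_name = b"caf\xe9.txt"

# With surrogateescape, each undecodable byte is mapped to a lone
# surrogate in the range U+DC80..U+DCFF instead of raising an error.
text_name = raw_name.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly,
# so the name can be handed back to the OS unchanged.
round_tripped = text_name.encode("utf-8", errors="surrogateescape")

# A strict encode still refuses the lone surrogate, which keeps the
# "odd" name from leaking silently into other text interfaces.
try:
    text_name.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True
```

Code that only needs to list, display and reopen files can stay with
str throughout; the bytes interface remains available for the cases
where the exact on-disk name matters.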

Paul.

[1] Maybe there's a need for a Unicode interface on Windows that
doesn't do *any* encoding, even in the face of garbled Unicode - I
don't know low-level details well enough to be sure here. But the same
principle applies, that "get the raw data, regardless" is a low-level
OS-specific operation, and should not be the one used in day-to-day
programming.

