[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

"Martin v. Löwis" martin at v.loewis.de
Wed Apr 29 09:29:05 CEST 2009

>>>>> C. File on disk with the invalid surrogate code, accessed via the str
>>>>> interface, no decoding happens, matches in memory the file on disk with
>>>>> the byte that translates to the same surrogate, accessed via the bytes
>>>>> interface.  Ambiguity.
>>>> Is that an alternative to A and B?
>>> I guess it is an adjunct to case B, the current PEP.
>>> It is what happens when using the PEP on a system that provides both
>>> bytes and str interfaces, and both get used.
>> Your formulation is a bit too stenographic to me, but please trust me
>> that there is *no* ambiguity in the case you construct.
> No Martin, the point of reviewing the PEP is to _not_ trust you, even
> though you are generally very knowledgeable and very trustworthy.  It is
> much easier to find problems before something is released, or even
> coded, than it is afterwards.

Sure. However, that requires you to provide meaningful, reproducible
counter-examples, rather than a stenographic formulation that might
hint at some problem you apparently see (which I believe is just not
there).

> You assumed, and maybe I wasn't clear in my statement.
> By "accessed via the str interface" I mean that (on Windows) the wide
> string interface would be used to obtain a file name.

What does that mean? Which specific interface for obtaining file names
are you referring to? Most of the time, file names are obtained by the
user entering them on the keyboard. GUI applications are completely
out of the scope of the PEP.

> Now, suppose that
> the file name returned contains "abc" followed by the half-surrogate
> U+DC10 -- four 16-bit codes.

Ok, so perhaps you are talking about os.listdir here. Communication
would be much easier if I did not have to guess what you mean.

Also, why is U+DC10 four 16-bit codes?
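
For readers following the counting question, here is a small sketch of how
UTF-16 code units can be counted in a later Python 3. The helper name
utf16_code_units is hypothetical, and reading "four 16-bit codes" as the
length of the whole name is only one possible interpretation of the quoted
sentence.

```python
def utf16_code_units(text):
    """Count 16-bit code units; 'surrogatepass' lets a lone surrogate through."""
    return len(text.encode("utf-16-le", "surrogatepass")) // 2

name = "abc\udc10"                      # the example name from the quoted message
units_name = utf16_code_units(name)     # code units in the whole name
units_surrogate = utf16_code_units("\udc10")  # code units in U+DC10 alone
print(units_name, units_surrogate)
```

Under this reading, the count refers to "abc" plus the half-surrogate taken
together, while U+DC10 on its own occupies a single code unit.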

> Then, ask for the same filename via the bytes interface, using UTF-8
> encoding.

How do you do that on Windows? You cannot just pick an encoding, such
as UTF-8, and pass that to the byte interface, and expect it to work.
If you use the byte interface, you need to encode in the file system
encoding, of course.

Also, what do you mean by "ask for"?????? WHAT INTERFACE ARE YOU
USING???? Please use specific Python code.
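
As an illustration of the kind of specific code being asked for, the sketch
below lists one directory through both the str and the bytes interface of
os.listdir. It assumes a later Python 3 (os.fsencode did not exist when this
message was written); the helper name list_both_ways and the file name
abc.txt are made up for the example.

```python
import os
import sys
import tempfile

def list_both_ways(directory):
    """Return (str names, bytes names) for the same directory."""
    str_names = sorted(os.listdir(directory))                 # str interface
    bytes_names = sorted(os.listdir(os.fsencode(directory)))  # bytes interface
    return str_names, bytes_names

with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "abc.txt"), "w").close()
    str_names, bytes_names = list_both_ways(tmp)
    # The bytes interface speaks the file system encoding, not an
    # arbitrarily chosen one, so that is what the names are compared in.
    same = sorted(os.fsencode(n) for n in str_names) == bytes_names
    print(sys.getfilesystemencoding(), str_names, bytes_names, same)
```

The comparison goes through os.fsencode rather than a hard-coded "utf-8",
which mirrors the point above that the bytes interface needs the file system
encoding.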

> The PEP says that the above name would get translated to
> "abc" followed by 3 half-surrogates, corresponding to the 3 UTF-8 bytes
> used to represent the half-surrogate that is actually in the file name,
> specifically U+DCED U+DCB0 U+DC90.  This means that one name on disk can
> be seen as two different names in memory.

You are relying on false assumptions here, namely that the UTF-8
encoding would play any role.

What would happen instead is that the "mbcs" encoding would be used. The
"mbcs" encoding, by design from Microsoft, will never report an error,
so the error handler will not be invoked at all.
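
For comparison, the translation described in the quoted paragraph can be
reproduced only where the bytes really are decoded as UTF-8 with the PEP's
error handler. The sketch below assumes a later Python 3, where that handler
is available as "surrogateescape"; it says nothing about the "mbcs" path on
Windows, and the helper name escape_decode is hypothetical.

```python
def escape_decode(raw):
    """Decode as UTF-8, mapping each undecodable byte to U+DC00 + byte."""
    return raw.decode("utf-8", "surrogateescape")

lone = "abc\udc10"
# Force the lone surrogate into its three-byte UTF-8-style form.
raw = lone.encode("utf-8", "surrogatepass")
escaped = escape_decode(raw)
# Encoding with the same handler restores the original bytes.
round_trip = escaped.encode("utf-8", "surrogateescape")
print(raw, ascii(escaped), round_trip == raw)
```

Here the three bytes ED B0 90 come back as U+DCED U+DCB0 U+DC90, and encoding
with the same handler returns the original bytes. That scenario depends on
UTF-8 being the encoding in use, which is exactly the assumption disputed
above.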

> Now posit another file which, when accessed via the str interface, has
> the name "abc" followed by U+DCED U+DCB0 U+DC90.
> Looks ambiguous to me.  Now if you have a scheme for handling this case,
> fine, but I don't understand it from what is written in the PEP.

You were just making false assumptions in your reasoning, assumptions
that are way beyond the scope of the PEP.

