[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Glenn Linderman
v+python at g.nevcal.com
Tue Apr 28 03:15:17 CEST 2009
On approximately 4/27/2009 2:14 PM, came the following characters from
the keyboard of Cameron Simpson:
> On 27Apr2009 00:07, Glenn Linderman <v+python at g.nevcal.com> wrote:
>
>> On approximately 4/25/2009 5:22 AM, came the following characters from
>> the keyboard of Martin v. Löwis:
>>
>>>> The problem with this, and other preceding schemes that have been
>>>> discussed here, is that there is no means of ascertaining whether a
>>>> particular file name str was obtained from a str API, or was funny-
>>>> decoded from a bytes API... and thus, there is no means of reliably
>>>> ascertaining whether a particular filename str should be passed to a
>>>> str API, or funny-encoded back to bytes.
>>>>
>>> Why is it necessary that you are able to make this distinction?
>>>
>> It is necessary that programs (not me) can make the distinction, so that
>> they know whether or not to do the funny-encoding.
>>
>
> I would say this isn't so. It's important that programs know if they're
> dealing with strings-for-filenames, but not that they be able to figure
> that out "a priori" if handed a bare string (especially since they
> can't:-)
>
So you agree they can't... that there are data puns. (OK, you may not
have thought that through)
>> If a name is
>> funny-decoded when the name is accessed by a directory listing, it needs
>> to be funny-encoded in order to open the file.
>>
>
> Hmm. I had thought that legitimate unicode strings already get transcoded
> to bytes via the mapping specified by sys.getfilesystemencoding()
> (the user's locale). That already happens I believe, and Martin's
> scheme doesn't change this. He's just funny-encoding non-decodable byte
> sequences, not the decoded stuff that surrounds them.
>
So assume a non-decodable sequence in a name. That puts us into
Martin's funny-decode scheme. His funny-decode scheme produces a bare
string, indistinguishable from a bare string that would be produced by a
str API that happens to contain that same sequence. Data puns.
So when open is handed the string, should it open the file with the name
that matches the string, or the file with the name that funny-decodes to
the same string? It can't know, unless it knows that the string is a
funny-decoded string or not.
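To make the pun concrete, here is a minimal Python sketch. It does not reproduce Martin's scheme: `naive_decode`, the `naive_pua` handler name, and the choice of U+E000 plus the byte value as the replacement character are all assumptions made up for illustration. It only shows what can happen when undecodable bytes are mapped to characters that may also appear legitimately, with no escaping.

```python
import codecs


def _naive_handler(exc):
    """Hypothetical handler: map one undecodable byte to U+E000 + byte value."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad_byte = exc.object[exc.start]
    # Replace only the first offending byte, then resume with the next byte.
    return (chr(0xE000 + bad_byte), exc.start + 1)


codecs.register_error("naive_pua", _naive_handler)


def naive_decode(raw: bytes) -> str:
    """Decode a byte file name, substituting unescaped PUA characters."""
    return raw.decode("utf-8", errors="naive_pua")


# A name with a lone Latin-1 byte, which is not valid UTF-8.
name_a = b"caf\xe9"
# A different, perfectly valid UTF-8 name that really contains U+E0E9.
name_b = "caf\ue0e9".encode("utf-8")

# Two distinct names on disk, one str in the program: a data pun.
assert name_a != name_b
assert naive_decode(name_a) == naive_decode(name_b)
```

Given only the resulting str, nothing tells open() which of the two byte names was meant.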
> So it is already the case that strings get decoded to bytes by
> calls like open(). Martin isn't changing that.
>
I thought the process of converting strings to bytes is called
encoding. You seem to be calling it decoding?
> I suppose if your program carefully constructs a unicode string riddled
> with half-surrogates etc and imagines something specific should happen
> to them on the way to being POSIX bytes then you might have a problem...
>
Right. Or someone else's program does that. I only want to use Unicode
file names. But if those other file names exist, I want to be able to
access them, and not accidentally get a different file.
> I think the advantage to Martin's choice of encoding-for-undecodable-bytes
> is that it _doesn't_ use normal characters for the special bits. This
> means that _all_ normal characters are left unmangled in both "bare"
> and "funny-encoded" strings.
>
Whether the characters used for funny decoding are normal or abnormal,
unless they are prevented from also appearing in filenames when they are
obtained from or passed to other APIs, there is the possibility that a
file whose actual name is the funny-decoded string also exists in the
filesystem... a data pun on the name.
Whether the characters used for funny decoding are normal or abnormal,
if they are not prevented from also appearing in filenames when they are
obtained from or passed to other APIs, then in order to prevent data
puns, *all* names must be passed through the decoder, and the decoder
must perform a 1-to-1 reversible mapping. Martin's funny-decode process
does not perform a 1-to-1 reversible mapping (unless he's changed it
from the version of the PEP I found to read).
This is why some people have suggested using the null character for the
decoding, because it and / can't appear in POSIX file names, but
everything else can. But that makes it really hard to display the
funny-decoded characters.
> Because of that, I now think I'm -1 on your "use printable characters
> for the encoding". I think presentation of the special characters
> _should_ look bogus in an app (eg little rectangles or whatever in a
> GUI); it's a fine flashing red light to the user.
>
The reason I picked an ASCII printable character is just to make it
easier for humans to see the encoding. The scheme would also work with
a non-ASCII non-printable character... but I fail to see how that would
help a human compare the strings on a display of file names. Having a
bunch of abnormal characters in a row, displayed using a single
replacement glyph, just makes an annoying mess in the file open dialog.
> Also, by avoiding reuse of legitimate characters in the encoding we can
> avoid your issue with losing track of where a string came from;
> legitimate characters are currently untouched by Martin's scheme, except
> for the normal "bytes<->string via the user's locale" translation that
> must already happen, and there you're aided by bytes and strings being
> different types.
>
There are abnormal characters, but there are no illegal characters.
NTFS permits any 16-bit "character" code, including abnormal ones,
including half-surrogates, and including full surrogate sequences that
decode to PUA characters. POSIX permits all byte sequences, including
things that look like UTF-8, things that don't look like UTF-8, things
that look like half-surrogates, and things that look like full surrogate
sequences that decode to PUA characters.
So whether the decoding/encoding scheme uses common characters, or
uncommon characters, you still have the issue of data puns, unless you
use a 1-to-1 transformation, that is reversible. With ASCII strings, I
think no one questions that you need to escape the escape characters. C
uses \ as an escape character... Everyone understands that if you want
to use a \ in a C string, you have to use \\ instead... and that scheme
has escaped the boundaries of C to other use cases. But it seems that
you think that if we could just find one more character that no one else
uses, that we wouldn't have to escape it.... and that could be true, but
there aren't any characters that no one else uses. So whatever
character (and a range makes it worse) you pick, someone else uses it.
So in order for the scheme to work, you have to escape the escape
character(s), even in names that wouldn't otherwise need to be
funny-decoded.
>> I'm certainly not experienced enough in Python development processes or
>> internals to attempt such, as yet. But somewhere in 25 years of
>> programming, I picked up the knowledge that if you want to have a 1-to-1
>> reversible mapping, you have to avoid data puns, mappings of two
>> different data values into a single data value. Your PEP, as first
>> written, didn't seem to do that... since there are two interfaces from
>> which to obtain data values, one performing a mapping from bytes to
>> "funny invalid" Unicode, and the other performing no mapping, but
>> accepting any sort of Unicode, possibly including "funny invalid"
>> Unicode, the possibility of data puns seems to exist. I may be
>> misunderstanding something about the use cases that prevent these two
>> sources of "funny invalid" Unicode from ever coexisting, but if so,
>> perhaps you could point it out, or clarify the PEP.
>>
>
> Please elucidate the "second source" of strings. I'm presuming you mean
> strings generated from scratch rather than obtained by something like
> listdir().
>
POSIX has byte APIs for strings, that's one source, that is most under
discussion. Windows has both bytes and 16-bit APIs for strings... the
16-bit APIs are generally mapped directly to UTF-16, but are not checked
for UTF-16 validity, so all of Martin's funny-decoded files could be
used for Windows file names on the 16-bit APIs. And yes, strings can be
generated from scratch.
> Given such a string with "funny invalid" stuff in it, and _absent_
> Martin's scheme, what do you expect the source of the strings to _expect_
> to happen to them if passed to open()? They still have to be converted
> to bytes at the POSIX layer anyway.
There is a fine encoding scheme that can take any str and encode to
bytes: UTF-8.
The problem is that UTF-8 doesn't work to take any byte sequence and
decode to str, and that means that special handling has to happen when
such byte sequences are encountered. But there is no str that such
special handling can generate that can't also be generated in other ways,
in which case it would properly be encoded to a different byte sequence.
Hence there are data puns, no
1-to-1 mapping. Hence it seems obvious to me that the only complete
solution is to have an escape character, and ensure that all strings are
decoded and encoded. As soon as you have an escape character, then you
can decode anything into displayable, standard, Unicode, and you can
create the reverse encoding unambiguously.
Without an escape character, you just have a heuristic that will work
sometimes, and break sometimes. If you believe non-UTF-8-decodable byte
sequences are rare, you can ignore them. That's what we do now, but
people squawk. If you believe that you can invent an encoding that has
data puns, and that because the character or characters involved are
rare, the problems that result can be ignored, fine... but people
will squawk when they hit the problem... I'm just trying to squawk now,
to point out that this is complexity for complexity's sake; it adds
complexity to trade one problem for a different problem, under the
belief that the other problem is somehow rarer than the first. And
maybe it is, today. I'd much rather have a solution that actually
solves the problem.
If you don't like ? as the escape character, then pick U+10F01, and
anytime a U+10F01 is encountered in a file name, double it. And anytime
there is an undecodable byte sequence, emit U+10F01, followed by the
character in the range U+0080 through U+00FF that matches the first byte
of the undecodable sequence, and restart the decoder with the next byte.
That'll work too. But use of rare, abnormal characters to take the
place of undecodable bytes can never work, because of data puns, and
valid use of the rare, abnormal characters.
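A rough Python sketch of the escape-character variant just described, assuming the byte names are meant to be UTF-8. The function names `funny_decode` and `funny_encode` are hypothetical, and raising ValueError on a malformed escape is an assumption of this sketch, not something specified above.

```python
ESC = "\U00010F01"  # the escape character suggested above


def funny_decode(raw: bytes) -> str:
    """Decode any byte name to str; every name goes through this function."""
    out = []
    i = 0
    while i < len(raw):
        try:
            # Everything left is valid UTF-8: double any literal ESC and stop.
            out.append(raw[i:].decode("utf-8").replace(ESC, ESC + ESC))
            break
        except UnicodeDecodeError as exc:
            # exc.start is relative to raw[i:]; the prefix before it is valid.
            good = raw[i:i + exc.start].decode("utf-8")
            out.append(good.replace(ESC, ESC + ESC))
            # Escape only the first undecodable byte, then restart after it.
            out.append(ESC + chr(raw[i + exc.start]))
            i += exc.start + 1
    return "".join(out)


def funny_encode(name: str) -> bytes:
    """Exact inverse of funny_decode."""
    out = bytearray()
    i = 0
    while i < len(name):
        ch = name[i]
        if ch != ESC:
            out += ch.encode("utf-8")
            i += 1
            continue
        if i + 1 >= len(name):
            raise ValueError("dangling escape character")
        nxt = name[i + 1]
        if nxt == ESC:
            out += ESC.encode("utf-8")      # doubled ESC: one literal ESC
        elif 0x80 <= ord(nxt) <= 0xFF:
            out.append(ord(nxt))            # ESC + U+0080..U+00FF: one raw byte
        else:
            raise ValueError("malformed escape sequence")
        i += 2
    return bytes(out)


undecodable = b"caf\xe9.txt"
lookalike = ("caf" + ESC + "\xe9.txt").encode("utf-8")

# The two names no longer collide, and both round-trip exactly.
assert funny_decode(undecodable) != funny_decode(lookalike)
assert funny_encode(funny_decode(undecodable)) == undecodable
assert funny_encode(funny_decode(lookalike)) == lookalike
```

The cost is the one already noted: names that contain the escape character get changed by decoding even when they are otherwise valid, which is what makes the mapping reversible.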
Someone suggested treating the byte sequences of the rare, abnormal
characters as undecodable bytes, and decoding them using the same
substitution rules. That would work too, if applied consistently,
because then the rare, abnormal characters would each be escaped. But
having 128 escape characters seems more complex than necessary, also.
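For comparison, a short sketch of that suggestion using the `surrogateescape` error handler available in Python 3.1 and later, where undecodable bytes become lone surrogates U+DC80 through U+DCFF. With UTF-8 on the byte APIs, the byte sequence that would spell such a surrogate is itself rejected by the decoder, so each of its bytes is escaped in turn. The helper names and sample file names are made up for illustration.

```python
def to_str(raw: bytes) -> str:
    """Decode a byte name, escaping undecodable bytes as lone surrogates."""
    return raw.decode("utf-8", errors="surrogateescape")


def to_bytes(name: str) -> bytes:
    """Encode back, turning escaped surrogates into the original bytes."""
    return name.encode("utf-8", errors="surrogateescape")


undecodable = b"caf\xe9"        # lone Latin-1 byte, invalid as UTF-8
lookalike = b"caf\xed\xb3\xa9"  # bytes that would spell U+DCE9 if surrogates were allowed

# The lookalike is escaped byte by byte, so the two names stay distinct.
assert to_str(undecodable) == "caf\udce9"
assert to_str(lookalike) == "caf\udced\udcb3\udca9"

# Both names round-trip to exactly the bytes they started as.
assert to_bytes(to_str(undecodable)) == undecodable
assert to_bytes(to_str(lookalike)) == lookalike
```

This only covers names that arrive through the byte APIs; a str built from scratch or obtained from a 16-bit API can still contain the same lone surrogates.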
--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking