[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Tue Apr 28 11:20:44 CEST 2009

2009/4/28 Glenn Linderman <v+python at g.nevcal.com>:
> So assume a non-decodable sequence in a name.  That puts us into Martin's
> funny-decode scheme.  His funny-decode scheme produces a bare string,
> indistinguishable from a bare string that would be produced by a str API
> that happens to contain that same sequence.  Data puns.
>
> So when open is handed the string, should it open the file with the name
> that matches the string, or the file with the name that funny-decodes to the
> same string?  It can't know, unless it knows that the string is a
> funny-decoded string or not.

Sorry for picking on Glenn's comment - it's only one of many in this
thread. But it seems to me that there is an assumption that problems
will arise when code gets a potentially funny-decoded string and
doesn't know where it came from.

Is that a real concern? How many programs really don't know where
their data came from? Maybe a general-purpose library routine *might*
just need to document explicitly how it handles funny-encoded data (I
can't actually imagine anything that would, but I'll concede it may be
possible) but that's just a matter of documenting your assumptions -
no better or worse than many other cases.

This all sounds similar to the idea of "tainted" data in security - if
you lose track of untrusted data from the environment, you expose
yourself to potential security issues. So the same techniques should
be relevant here (including ignoring it if your application isn't such
that it's s concern!)

I've yet to hear anyone claim that they would have an actual problem
with a specific piece of code they have written. (NB, if such a claim
has been made, feel free to point me to it - I admit I've been
skimming this thread at times).

Paul.