[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Sun Apr 26 15:47:44 CEST 2009

Paul Moore writes:
 > 2009/4/24 Stephen J. Turnbull <stephen at xemacs.org>:
 > > Paul Moore writes:
 > >
 > >  > The pros for Martin's proposal are a uniform cross-platform interface,
 > >  > and a user-friendly API for the common case.
 > >
 > > A more accurate phrasing would be "... a user-friendly API for those
 > > who feel very lucky today."  Which is the common case, of course, but
 > > spins a little differently.
 > 
 > Sorry, but I think you're misrepresenting things. I'd have probably
 > let you off if you'd missed out the "very" - but I do think that it's
 > the common case. Consider:

If you need reliability, then you can't get it this way.  The reason
"very" is (somewhat) justified is that this kind of issue is a little
like unemployment.  You hardly ever meet someone who's 7.2%
unemployed, but you probably know several who are 100% unemployed.  If
you see a broken encoding once, you're likely to see it a million times
(spammers have the most broken software) or maybe have it raise an
unhandled Exception a dozen times (in rate of using busted software,
the spammers are closely followed by bosses---which would be very bad,
eh, if 2/3 of the mail from your boss ends up in an undeliverables
queue due to encoding errors that are unhandled by some filter in
your mail pipeline).

 > - Windows systems where broken Unicode (lone surrogates or whatever)
 > isn't involved
 > - Unix systems where the user's stated filesystem encoding is correct

 > Can you honestly say that this isn't the vast majority of real-world
 > environments?

Again, that's not the point.  The point is that six-sigma reliability
world-wide is not going to be very comforting to the poor souls who
happen to have broken software in their environment sending broken
encodings regularly, because they're going to be dealing with one or
two sigmas, and that's just not good enough in a production
environment.

 > > If you didn't start with a valid string in a known encoding, you
 > > shouldn't treat it as characters because it's not.
 > 
 > Again, that's the purist argument. If you have a string (of bytes, I
 > guess) and a 99% certain guess as to the correct encoding, then I'd
 > argue that, as long as (a) it's not mission-critical (lives or backups
 > depend on it)

Assurance that you can even determine (a) is not provided by the PEP.
There is no way to contain a problem if it should occur, because it's
"just a string" and could go anywhere, and get converted back or
otherwise manipulated in a context that doesn't know how to handle it
(which might not even be Python if a C-level extension is involved).
Given that Python has no internal mechanism for saying "in this area
only valid Unicode will be accepted", it seems likely that mission
critical software *will* interact with this feature, if only
indirectly (or perhaps only in software originally intended for use in
the U.S. only, but then it gets exported, etc).
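
To make the containment problem concrete, here is a minimal sketch. It
uses the `surrogateescape` error handler, which is the form in which
PEP 383 is exposed in Python 3; the file name and the variable names
are made up for illustration. The decoded value is an ordinary str, so
nothing stops it from reaching code that encodes strictly:

```python
# Hypothetical file name: Latin-1 bytes that are not valid UTF-8.
raw = b"caf\xe9.txt"

# Decoding with the PEP 383 handler never fails; the undecodable byte
# 0xE9 is smuggled through as the lone surrogate U+DCE9.
name = raw.decode("utf-8", errors="surrogateescape")

# Code that knows about the convention can recover the original bytes.
roundtrip = name.encode("utf-8", errors="surrogateescape")

# Code that does not know about it (a logger, a database driver, a
# C extension) sees "just a string" and fails far from the source.
try:
    name.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

The point of the sketch is only that the failure surfaces wherever the
string is next encoded, not where it was decoded.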

 > and (b) you have a means of failing relatively
 > gracefully, you have every reason to make the assumption about
 > encoding.

(b) is not provided in the PEP, either.  We have no idea what the
failure mode will be.

 > After all, what's the alternative?

The alternative is to refuse to provide a simple standard way to
decode unreliably, and in that way make the user responsible for an
explicit choice about what level and kinds of unreliability they will
accept.
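
As a rough sketch of what "an explicit choice" could look like, assume
a hypothetical helper, `decode_explicit`, that has no default error
policy, so every caller has to state which kind of unreliability it
accepts. Neither the helper nor its parameter names come from the PEP:

```python
# Hypothetical helper, not part of PEP 383 or the standard library.
ALLOWED_POLICIES = {"strict", "replace", "surrogateescape"}

def decode_explicit(raw: bytes, encoding: str, on_error: str) -> str:
    """Decode raw bytes; the caller must name the error policy."""
    if on_error not in ALLOWED_POLICIES:
        raise ValueError(f"unsupported error policy: {on_error!r}")
    return raw.decode(encoding, errors=on_error)

raw = b"caf\xe9"  # Latin-1 bytes, invalid as UTF-8

# Lossy but always valid Unicode: the bad byte becomes U+FFFD.
lossy = decode_explicit(raw, "utf-8", "replace")

# Lossless but not valid Unicode: the bad byte becomes U+DCE9.
escaped = decode_explicit(raw, "utf-8", "surrogateescape")

# Fail immediately, at the point where the bad data enters.
try:
    decode_explicit(raw, "utf-8", "strict")
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True
```

The three calls differ only in the last argument, which is the
decision the caller is being asked to own.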

I realize that's unpalatable to most people who use Python to develop
software, and so I'm unwilling to go even -0 on the PEP.  However, to
give one example, I've been following Mailman development for about 10
years, and it is a dismal story despite a group of developers very
sympathetic to encoding and multicultural issues.  As recently as
Mailman 2.10 (IIRC) there were *still* bugs in encoding handling that
could stop the show (ie, not only did the buggy post not get
processed, but the exception propagated high enough to cause
everything behind it in the queue to fail, too).  I think it would be
sad if ten years from now there was software using this technique and
failing occasionally.
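
For illustration only, the show-stopping failure mode described above
and one way to contain it can be sketched as follows. This is not
Mailman's code; `process_queue` and `strict_handler` are hypothetical
names, and the sketch assumes each queue item is a bytes payload:

```python
# Hypothetical queue runner: one bad item must not block the rest.
def process_queue(items, handler):
    done, shunted = [], []
    for item in items:
        try:
            done.append(handler(item))
        except UnicodeError as exc:
            # Set the item aside with its error instead of letting the
            # exception propagate and stall everything behind it.
            shunted.append((item, exc))
    return done, shunted

def strict_handler(raw: bytes) -> str:
    # Stand-in for a filter that decodes strictly.
    return raw.decode("utf-8").upper()

done, shunted = process_queue([b"ok", b"bad\xff", b"later"], strict_handler)
```

Here the item queued behind the broken one is still processed, and the
broken one is kept for inspection rather than silently dropped.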