[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Mon Apr 27 17:47:05 CEST 2009

Antoine Pitrou writes:

 > I'm not sure how mail being stuck in a pipeline has anything to do
 > with Martin's proposal (which deals with file paths, not with
 > SMTP...).

I hate to break it to you, but most stages of mail processing have
very little to do with SMTP.  In particular, processing MIME
attachments often requires dealing with file names.  Would practical
problems arise?  I expect they would.  Can I tell you what they are?
No; if I could I'd write a better PEP.  I'm just saying that my
experience is that Murphy's Law applies more to encoding processing
than any other area of software I've worked in (admittedly, I don't do
threads ;-).

 > Besides, I don't care about spammers and their broken software.

That's precisely my point.  The PEP's "solution" will be very
appealing to people who just don't care as long as it works for them,
in the subset of corner cases they happen to encounter.  A lot of
software, including low-level components, will be written using these
APIs, and they will result in escapes of uninterpreted bytes (encoded
as Unicode) into the textual world.

 > So you're arguing that whatever solution which isn't 100% perfect
 > but only 99.999% perfect shouldn't be implemented at all, and leave
 > the status quo at 98%?

No, I'm not talking about "whatever solution".  I'm only arguing about
PEP 383.  The point is that Martin's proposal is not just a solution
to the problem he posed.  It's also going to be the one obvious way to
make the usual mistakes, i.e., the return values will escape into code
paths they're not intended for.  And the APIs won't be killable until
Python 4000.  If we find a better way (which I think Python 3's move
to "text is Unicode" is likely to inspire!), we'll have to wait 10-15
years or more before it becomes the OOWTDI.  The only real hope about
that is that Unicode will become universal before that, and only
archaeologists will ever encounter malformed text.

I believe there are solutions that don't have that problem.
Specifically, if the return values were bytes, or (better for 2.x,
where bytes are strings as far as most programmers are concerned) as a
new data type, to indicate that they're not text until the client
acknowledges them as such.  EIBTI.

Unfortunately, Martin clearly doesn't intend to make such a change to
the PEP.  I don't have the time or the Python expertise to generate an
alternative PEP. :-(  I do have long experience with the pain of
dealing with encoding issues caused by APIs that are intended to DTRT,
conveniently.  Martin's is better than most, but I just don't think
convenience and robustness can be combined in this area.

 > This sounds disturbing to me.

BTW, I'm on record as +0 on the PEP.  I don't think the better
proposals have a chance, because most people *want* the non-solution
that they can just use as a habit, allowing Python to make decisions
that should be made by the application, and not have to do
"unnecessary" conversions and the like.  It's not obvious to me that
it should not be given to them, but I don't much like it.