[Python-Dev] Patch making the current email package (mostly) support bytes

Stephen J. Turnbull stephen at xemacs.org
Fri Oct 8 05:37:38 CEST 2010


R. David Murray writes:

 > > The MIME-charset = UNKNOWN dodge might be a better way of handling
 > > this.
 > 
 > That is a very interesting idea.  It is the *right* thing to do, since it
 > would mean that a message parsed as bytes could be generated via Generator
 > and passed to, say, smtplib without losing any information.  However,
 > it's not exactly trivial to implement, since issues of runs of characters
 > and line re-wrapping need to be dealt with.  Perhaps Header can be
 > made to handle bytes in order to do this; I'll have to look into
 > it.

Ouch.  RFC 822 line wrapping is a bytes->bytes transformation, and the
client shouldn't see it at all unless it inspects the wire format.
MIME-encoding is a text->bytes transformation, again an internal
matter.  The constraints on the wire format mean that the MIME-
encoder needs to be careful about encoded-word length.  ISTM that all you
need to know, assuming that this is a method on a Header, and it's
normally invoked just before conversion to bytes, is the codec and the
CTE, and both can be optional (default to 'utf-8' and a value
depending on the proportion of encodable characters).

You take the header, encode according to the codec, then start
MIME-encoding according to the CTE.  The maximum size of encoded words
is chosen to fit on a line within 78 bytes.  The number of bytes
encoded in each word depends only on the size of metadata associated
with the word.  (Sure you could make it prettier for those reading it
with an "MUA" like less, but I don't think that's really worth
anybody's time.)
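
A rough sketch of that procedure, not anything that exists in the email
package.  The function name and the max_word parameter are made up for
illustration, base64 ("b") is assumed as the CTE, and the codec is
assumed to be stateless with no BOM (utf-8 qualifies).  The 75 is RFC
2047's cap on a single encoded-word, which leaves room for folding
whitespace inside a 78-byte line.  Words are cut on character
boundaries so that each one decodes on its own.

```python
import base64


def encode_header_words(text, codec="utf-8", max_word=75):
    """Split text into base64 encoded-words of at most max_word characters."""
    prefix = "=?%s?b?" % codec
    suffix = "?="
    # Characters left for the base64 payload once the metadata is paid for.
    budget = max_word - len(prefix) - len(suffix)
    # base64 turns every 3 input bytes into 4 output characters.
    max_bytes = budget // 4 * 3

    def make_word(chunk):
        return prefix + base64.b64encode(chunk).decode("ascii") + suffix

    words = []
    chunk = b""
    for ch in text:
        raw = ch.encode(codec)
        # Start a new word rather than split a multi-byte character.
        if chunk and len(chunk) + len(raw) > max_bytes:
            words.append(make_word(chunk))
            chunk = b""
        chunk += raw
    if chunk:
        words.append(make_word(chunk))
    return words
```

Joining the words with "\n " would then give folded continuation lines;
accounting for the field name on the first line is left out here.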

*If* you have an 8-bit value of unknown encoding on input, this will
appear in the Header's value as a surrogate.  Hm, OK, I see the
problem ... as usual, it's that the only efficient thing to do is
encode using surrogate-escape which loses the information that these
are invalid bytes.  Would it really be that bad to add an O(length)
component where you examine the string for surrogates (and too-long
words, for that matter), and chop off those pieces for MIME encoding?
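
The O(length) scan could look something like the sketch below.  It
assumes the parser decoded the header with the surrogateescape error
handler, so undecodable bytes 0x80..0xFF show up as lone surrogates
U+DC80..U+DCFF.  The function name is hypothetical.  The raw pieces it
returns are the ones that would get the "unknown" charset label when
MIME-encoded (RFC 1428 defines unknown-8bit for that purpose; whether
that is the right label here is an assumption).

```python
import re

# Lone surrogates produced by surrogateescape for bytes 0x80..0xFF.
_ESCAPED = re.compile("([\udc80-\udcff]+)")


def split_escaped_runs(value):
    """Return (is_raw, piece) pairs in order.

    Raw pieces are the original undecodable bytes; the others are str.
    """
    pieces = []
    # With a capturing group, re.split puts the matched runs at odd indices.
    for i, part in enumerate(_ESCAPED.split(value)):
        if not part:
            continue
        if i % 2:
            # Round-trip the surrogates back to the bytes they stand for.
            pieces.append((True, part.encode("ascii", "surrogateescape")))
        else:
            pieces.append((False, part))
    return pieces
```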

 > >  > Presumably you are suggesting that email5 be smart enough to turn my
 > >  > example into properly UTF-8/CTE encoded text.
 > > 
 > > No, in general that's undecidable without asking the originator,
 > > although humans can often make a good guess.
 > 
 > I was talking about unicode input, though, where you do know (modulo
 > the language differences that unicode hasn't yet sorted out).

I don't understand why this is difficult.  As far as what Unicode has
and hasn't sorted out, that's not your job AFAICS.  If clients want a
specific codec or other language-based style, they'd better specify it
themselves.  Else, you just stuff the Unicode into a UTF-8-encoded
bytes, and go from there.  This is *why* Unicode was designed, so that
software could do something standard and sane with text which needs to
be readable but not exquisitely crafted literary works.  No?  If you
want beauty, then use a markup language.

 > Right, but I was talking about my python3 example, where I was using
 > the email5 parser to (unsuccessfully) parse unicode.  *That's* the thing
 > email5 can't really handle, but email6 will be able to.

For email5 it would be an extension, yes, but I don't see why it would
be hard to handle Unicode input, assuming it's *really* Unicode,
unless you want to cater to "legacy" systems that might not understand
Unicode (or at least would prefer an alternative encoding).  Since
it's an extension, I don't think that's your problem, and the people
who would really like this extension (eg, the Japanese) are used to
dealing with mojibake issues.  (Of course, as an extension, you don't
need to do it at all.  This is just speculation.)

The problem would be with careless clients of email5 that find a way
to hand it bogus Unicode (eg, by inappropriately using the latin-1
codec to get a binary representation of their bytes in Unicode), but I'm
not sure how big a problem that would be.
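
To make the failure mode concrete, here is a small illustration (the
helper name is invented).  Decoding with latin-1 never raises, so the
result looks like a perfectly good str, and nothing downstream can tell
that each "character" is really one byte of some other encoding.

```python
def smuggle_as_latin1(raw):
    """Turn arbitrary bytes into a str, one code point per byte."""
    return raw.decode("latin-1")


# Three Japanese characters, nine bytes once encoded as UTF-8.
original = "\u65e5\u672c\u8a9e".encode("utf-8")

# What a careless client hands over: nine code points, none of them CJK.
bogus = smuggle_as_latin1(original)

# What a UTF-8 default then puts on the wire: each high byte is encoded
# again as two bytes, so the recipient sees mojibake.
wire = bogus.encode("utf-8")
```

Here wire differs from original, and only a recipient who knows to
reverse the latin-1 step can get the intended text back.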

 > Thank you very much for this piece of perspective.  I hadn't thought
 > about it that clearly before, but what you say makes perfect sense to me,
 > and is in fact the implicit perspective I've been working from when
 > working on the email6 stuff.

You're welcome, of course, and it makes me feel much better about
email6.  (Not that I had any real worries, but here we are about
halfway up a 100m cliff, and the trail just widened from 20cm to
2m. :-)


