[Python-3000] email libraries: use byte or unicode strings?

Glenn Linderman v+python at g.nevcal.com
Thu Nov 6 20:25:55 CET 2008


On approximately 11/6/2008 3:59 AM, came the following characters from 
the keyboard of Stephen J. Turnbull:
> Glenn Linderman writes:
> 
>  > There is no reference to the word emacs or types in any of the messages 
>  > you've posted in this thread, maybe you are referring to another thread 
>  > somewhere?  Sorry, I'm new to this party, but I have read the whole 
>  > thread... unless my mail reader has missed part of it.
> 
> I'm sorry, you are right; the relevant message was never sent.  Here
> it is; I've looked it over briefly and it seems intelligible, but from
> your point of view it may seem out of context now.


Stuff happens.  Apology accepted.  The goal here isn't to make points or
play one-up; the goal is to figure out if making a more complex
interface (having both bytes and Unicode interfaces) is beneficial to 
life.  I'm certain that I don't see all the issues yet; but if the 
issues can be stated clearly, and the alternative solutions outlined, 
then I would get educated, which is good for me, but perhaps annoying 
for you.  Progress gets made faster if we stay out of the flame-fanning.

I've read the other responses received to date, but choose to compose my 
response to this message, as it is the most meaty.  The others discuss 
only particular (interesting) details.

Summary of issues is at the end.  Skip directly to the summary before 
reading the interspersed comments, if you wish.  Search for "summarize".


Comment on general data handling.  It is good to follow the rules, of 
course, but not everyone does.  When they don't, it is not clear a 
program can cure the problem by itself.

1) If the data is already corrupted by using the wrong encoding, 
potentially it could be reversed if the proper encoding could be intuited.

1a) If it is returned as bytes, then once the proper encoding is 
intuited, the data can be decoded properly into Unicode.

1b) If it is returned as Latin-1 decoded Unicode, then once the proper 
encoding is intuited, the Unicode data can be reencoded as bytes using 
Latin-1 (this is a fully reversible, no data loss reencoding), and then 
decoded properly into Unicode.

The hard part here is intuiting the proper encoding; 1b is less 
efficient than 1a, but no less possible.  Intuiting the proper encoding 
is most likely done by human choice (iterating over: try this encoding, 
does it look better?)
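
To make 1b concrete, here is a minimal sketch (plain Python 3, nothing
email-specific; the KOI8-R guess is just an illustrative assumption):

    # Raw bytes that were really KOI8-R, but arrived unlabeled.
    raw = 'Привет'.encode('koi8-r')

    # 1b: punt to Latin-1.  Every byte maps to a codepoint, so this
    # cannot fail, and it is fully reversible.
    punted = raw.decode('latin-1')

    # Later, a human intuits the real encoding: reverse the punt...
    recovered = punted.encode('latin-1')
    assert recovered == raw

    # ...and decode properly.
    assert recovered.decode('koi8-r') == 'Привет'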

2) If the data is already corrupted by using multiple encodings when 
only one is claimed, then again it could be reversed if the proper 
encodings, as well as the boundaries between them, could be intuited.

The same parts a) and b) apply as in #1, but greatly complicated by the 
boundary selections.  Again it seems that human choice is required: 
select a range of text, and try displaying it in a different encoding 
to see if it makes more sense.

For both 1 & 2, the user interaction is much more time consuming than 
the 3-stage decoding, encoding, and redecoding process, I would expect.

More below.


> Glenn Linderman writes:
> 
>  > This is where you use the Latin-1 conversion.  Don't throw an error
>  > when it doesn't conform, but don't go to heroic efforts to provide
>  > bytes alternatives... just convert the bytes to Unicode, and the
>  > way the mail RFCs are written, and the types of encodings used, it
>  > is mostly readable.  And if it isn't encoded, it is even more
>  > readable.
> 
> This is what XEmacs/Mule does.  It's a PITA for everybody (except the
> Mule implementers, whose life is dramatically simplified by punting
> this way).  For one thing, what's readable to a human being may be
> death to a subprogram that expects valid MIME.  GNU Emacs is even
> worse; it does provide both a bytes-like type and a unicode-like type,
> but then it turns around and provides a way to "cast" unicodes to
> bytes and vice-versa, thus exposing implementation in an unclean (and
> often buggy) way.
> 
>  > And so how much is it a problem?  What are the effects of the problem?
> 
> In Emacs, the problem is that strings that are punted get concatenated
> with strings that are properly decoded, and when reencoding is
> attempted, you get garbage or a coding error.  


Uh-huh.  Garbage (wrongly decoded, then re-encoded), I would expect. 
Coding errors, I would not, since text decoded via Latin-1 is certainly 
re-encodable (creating legal-looking garbage out of originally illegal 
garbage).  Can you give me an example of a coding error, or is this 
just FUD?


> Since Mule discarded
> the type (punt vs. decode) information, the app loses.  


This is precisely the problem that was faced for "fake unicode file 
handling" that was the topic of a thread a few weeks ago.  While the 
Latin-1 transform (or UTF-8b, or others mentioned there), can provide a 
round-trip decode/encode, it is only useful and usable if the knowledge 
that the transform was performed is retained.  The choice there was to 
have a binary interface, and build a Unicode interface on top of it that
can't see the binaries that do not conform to UTF-8.  The problem
there is that existing programs expect to be able to manipulate file 
names as text, but existing operating systems provide bytes interfaces.


> There's no way to recover.  


Not automatically.  Point 2) above addresses this.  It would require 
human intelligence to attempt to recover, and even the human would find 
it extremely painstaking to assist in the recovery process.


> The apps most at risk are things like MUAs (which Emacs
> does well) and web browsers (which it doesn't), and even AUCTeX (a
> mode for handling LaTeX documents---TeX is not Unicode-aware so its
> error messages are frequently truncated in the middle of a UTF-8
> character) and they go to great lengths to keep track of what is valid
> and what is not in the app.  They don't always succeed.  I think Emacs
> should be doing this for them, somehow (and I'm an XEmacs implementer,
> not an MUA implementer!)


So your belief that Emacs should somehow be doing this for them is nice; 
perhaps it should.  However, it doesn't sound like you have a solution 
for Emacs...  How should it keep track?  How would that help?  If TeX is 
not Unicode-aware, what is it doing dealing with UTF-8 data?  Or is it 
dealing with Latin-1-transformed UTF-8 garbage?


> The situation in Python will be strongly analogous, I believe.


So are you proposing that a bytes interface to the data, rather than a 
Unicode interface to the Latin-1-transformed data, will be more usable 
to a Python solution modeled on an Emacs solution that hasn't been 
figured out yet?

Once the boundaries and encoding have been lost by the original buggy MUA 
that has injected the data into the email message, only human 
intelligence has a chance of recreating the original message in all 
cases, and even then it may take more than one human to achieve it.

There may be cases where heuristics can be applied, when human 
intelligence figures out the type of bugs in the original MUA, and can 
recognize patterns that allow it to rediscover the boundaries.  This is 
unlikely to work in all cases, but could perhaps work in some cases.

Even in the cases where it can work with some measurable success, I 
claim that the heuristics could be coded against the Latin-1-transformed 
Unicode just as effectively as against the bytes.
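
For instance, the classic "UTF-8 misread as Latin-1" pattern is just as 
detectable in the punted string as in the raw bytes.  A minimal sketch 
(the helper name is mine):

    def looks_like_misread_utf8(s):
        """Heuristic: does this Latin-1-punted string re-decode as UTF-8?

        (ASCII-only text passes trivially; a real heuristic would check
        for non-ASCII characters first.)
        """
        try:
            s.encode('latin-1').decode('utf-8')
            return True
        except (UnicodeEncodeError, UnicodeDecodeError):
            return False

    garbled = 'Grüße'.encode('utf-8').decode('latin-1')  # mojibake
    assert looks_like_misread_utf8(garbled)
    assert garbled.encode('latin-1').decode('utf-8') == 'Grüße'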


>  > I'm not suggesting making it worse than what it already is, in
>  > bytes form; just to translate the bytes to Unicode codepoints so
>  > that they can be returned on a Unicode interface.
> 
> Which *does* make it worse, unless you enforce a type difference so
> that punted strings can't be mixed with decoded strings without
> effort.  That type difference may as well be bytes vs. Unicode as some
> subclass of Unicode vs. Unicode.


138 is still 138 whether it is a byte or a Unicode codepoint.  Yes, 
concatenating transformed data with properly decoded data would be 
stupid.  Enforcing a type difference is purely an application thing, 
though.

Each piece of data retrieved would have a consistent decoding provided: 
either the proper decoding as specified in the message, or the Latin-1 
or current-code-page decoding if no encoding is specified.  Either is 
reversible if the application doesn't like the results and wants to try 
a different encoding.  The APIs could have optional parameters and 
results that specify the encoding to use, or the encoding that was 
used, to decode the results.

If the app wishes to keep that data separate, and convert it to a 
different type to help it stay separate, that is the app's privilege. 
If the app wishes to concatenate it with other data, that is the app's 
choice.  Having the interface define a bunch of different types for 
different decodings wouldn't really help: the ignorant app would simply 
convert the different types back to strings and then concatenate, and 
the smart app can do its own type encapsulations if it thinks that 
would help.


> "Why would you mix strings?"  Well, for one example there are multiple
> address headers which get collected into an addressee list for purpose
> of constructing a reply.  If one of the headers is broken and another
> is not, you get mixed mode.  


Sure.  Now you have mixed mode.  Try to send the reply message... if the 
email address part is OK, then it gets sent, with a gibberish name.  If 
the email address part is not OK, that destination bounces.

Now what?  Seriously, what else could be done?  You could try a bunch of 
different encodings to attempt to resolve the broken email address or 
name... requires human intelligence to decide which is correct... when 
the bounce message comes, the human will get involved.  If the bounce 
message doesn't come, then all is well (problem only affected the name 
part, not the email address part).


> The same thing can happen for
> multilingual message bodies: they get split into a multipart with
> different charsets for different parts, and if one is broken but
> another is not, you get mixed mode.


First, if the message bodies are known to be multilingual when they are 
encoded, and are placed in different MIME parts, what are the chances 
that an application smart enough to keep the multilingual parts 
separate is dumb enough to encode one correctly and one incorrectly? 
Is this a real scenario?  What software/version does this?

If it is a real scenario, it still requires human intelligence to 
resolve... to choose different encodings, and decide which one "looks 
right".  Since it is in separate parts, the boundaries are not lost, so 
this is case 1 above.

If the boundaries are lost, the human can direct the program to go back 
to the original message, which still has its boundaries, and start over 
from there, with different encodings.  If the app wants to be smart 
enough to provide such features.  You might write such an app just for 
fun; I might or might not, depending on whether someone pays me, or I have 
other incentive.

Given boundaries, it is case 1) above.  If the boundaries are lost, it 
is case 2).  How is it easier if the bytes are preserved, vs translated 
via Latin-1 to a Unicode string?



>  > So they'll use the Unicode API for text, and the bytes APIs for binary 
>  > attachments, because that is what is natural.
> 
> Well, as I see it there won't be bytes APIs for text.  The APIs will
> return Unicode text if they succeed, and raise an error if not.  If
> the error is caught, the offending object will be available as bytes.


Sure; I'd proposed a way to get a whole message as bytes for archiving, 
logging, message store, etc.  I'd proposed a way to get a particular 
MIME part as bytes for binary parts.

You seem to be proposing a way to get text MIME parts as binary if they 
fail to decode.  I have no particular problem with the API providing 
that ability.

I have a specific question here: what encodings, when the attempt is 
made to decode to Unicode, will ever fail?

For Latin-1, the answer is none: you may get gibberish, but never a 
failure, because Latin-1 assigns a character to every byte value, and 
Unicode contains all of those characters.  (Some other 8-bit encodings, 
such as cp1252, leave a few byte values unassigned, so a strict decoder 
can still fail on those.)
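
The Latin-1 half of that claim is easy to verify:

    # Every possible byte value decodes under Latin-1, and the round
    # trip back to bytes is exact.
    every_byte = bytes(range(256))
    assert every_byte.decode('latin-1').encode('latin-1') == every_byte

    # cp1252, by contrast, leaves a few byte values unassigned:
    try:
        b'\x81'.decode('cp1252')
    except UnicodeDecodeError:
        print('0x81 is not assigned in cp1252')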

So you've mentioned Asian encodings, and certainly these could fail to 
convert to Unicode if the decoder finds inappropriate sequences.  I 
don't know enough about all the multi-byte encodings to say whether all 
of them can fail, or whether applying a particular decoding might 
produce gibberish but never fail.  The ones I know about use a 
particular range of byte values to represent the "first byte" of a 
pair, but what I don't know is whether any byte can follow the first 
byte, or only certain bytes.  I do know that for some multi-byte 
encodings the first byte can be followed by second bytes in the ASCII 
range; I don't know whether it is illegal for it to be followed by 
another byte in the "first byte" range.  Certainly there could be 
2-byte pairs that don't have an associated character, although I don't 
know whether that exists for any particular encoding.

Can you cite a particular multi-byte encoding that has byte sequences 
that are illegal, and can be used to detect failure?  Or can failure 
only be detected by the human determining that it is gibberish?
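
For UTF-8, at least, the answer is clearly yes, and a quick experiment 
with Python's codecs suggests the same for Shift JIS (the specific bad 
sequences below are just examples):

    for enc, bad in [('utf-8', b'\xc3('),          # lead byte, bad trail
                     ('shift_jis', b'\x81\x01')]:  # 0x01 can't follow 0x81
        try:
            bad.decode(enc)
        except UnicodeDecodeError as err:
            print(enc, 'rejects', bad, '--', err.reason)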


>  > If improperly encoded messages are received, and appropriate 
>  > transliterations are made so that the bytes get converted (default code 
>  > page) or passed through (Latin-1 transformation), then the data may be 
>  > somewhat garbled for characters in the non-ASCII subset.  But that is 
>  > not different than the handling done by any 8-bit email client, nor, I 
>  > suspect (a little uncertainty here) different than the handling done by 
>  > Python < 3.0 mail libraries.
> 
> Which is exactly how we got to this point.  Experience with GNU
> Mailman and other such applications indicate that the implementation
> in the existing Python email module needs work, and Barry Warsaw and
> others who have tried to work on it say that it's not that easy, and
> that the API may need to change to accommodate needed changes in the
> implementation.


So let me try to summarize.  I may have raised some inappropriate 
issues or reached wrong conclusions, and I'm willing to be corrected. 
But I'd much prefer to be corrected with specific cases that can be 
detected and corrected via a bytes interface but not via a 
bytes-transliterated-to-Unicode interface, complete with the specific 
encodings that are used, properly or improperly, to arrive at the case, 
and the specific APIs that must be changed to achieve the goal.

A) An attempt to decode text to Unicode may fail.
A1) doesn't apply to Latin-1, and rarely to other 8-bit encodings.
A2) doesn't apply to some multi-byte encodings.
A3) applies to UTF-8.
A4) may apply to some other multi-byte encodings.

B) User sees gibberish because of decoding problems.  What can be done? 
Can the app provide features to help?  Do any of the features depend on 
API features?  Let's assume that the app wants to help, and provides 
features.  User must also get involved, because the app/API can't tell 
the difference between gibberish and valid text.

B1) User can see a map of the components of the email, and their 
encodings, and whether they were provided by the email message, or were 
the default for the app.  User chooses a different decoding for a 
component, and the app reprocesses that component.  API requirement: a 
way for the user/app to specify an override to the decoding for a component.

B2) User chooses binary for a particular component.  App reprocesses the 
component, and asks what file to store the binary in.  API requirement: 
a way for the user/app to specify an override to the decoding for a 
component.
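
Both requirements amount to the same hypothetical call; a sketch 
(reinterpret_part and its charset=None-means-binary convention are my 
inventions, not existing email-module API):

    def reinterpret_part(part, charset):
        """Re-decode one MIME part with a user-chosen override.

        part    -- an email.message.Message leaf
        charset -- an encoding name, or None for raw bytes (case B2)
        """
        payload = part.get_payload(decode=True)  # undo base64/QP: bytes
        if charset is None:
            return payload                       # B2: hand back binary
        return payload.decode(charset)           # B1: user's override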


I've now looked briefly at the email module APIs.  They seem quite 
flexible to me.  I don't know what happens under the covers.  It seems 
that the API is already set up flexibly enough to handle both bytes and 
Unicode!!!  Perhaps it is just the implementation that should be 
adjusted.  (That might still be too big a job for 3.0; I haven't read 
the code.)

It seems that get_/set_payload might want to be able to return/accept 
either string or bytes, depending on the other parameters involved.
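
As best I can tell from the docs, get_payload already leans this way: 
with no arguments it returns a string, while decode=True undoes the 
transfer encoding and returns the raw bytes.

    from email.mime.text import MIMEText

    msg = MIMEText('Grüße aus München', 'plain', 'utf-8')
    print(msg.get_payload())             # base64 text of the body
    print(msg.get_payload(decode=True))  # the UTF-8 bytes themselves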

Let's talk again about creation of messages first.

If a string is supplied, it is Unicode.  The encoding parameter 
describes what encoding should be applied to convert the message to 
wire-protocol bytes.  The data should be saved as Unicode until the 
request is made to convert it to wire protocol, so that set_charset can 
be called a few dozen times if desired (not clear why that would be 
done, though) to change the encoding.  Perhaps it is appropriate to 
verify that the encoding can happen without using the substitution 
character, or perhaps that should be the user's responsibility.  This 
choice should be documented.
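
A sketch of that str-in path using today's names (whether the 
implementation really defers the wire encoding this long is exactly 
what I can't tell without reading the code):

    from email.message import Message

    msg = Message()
    msg['Subject'] = 'test'
    msg.set_payload('Grüße aus München', charset='utf-8')

    # Only at flattening time does the text need to become wire form.
    print(msg.as_string())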

If bytes are supplied, an encoding must also be supplied.  The data 
should be saved in this encoding until the request is made to convert it 
to wire-protocol.  This encoding should be used if possible, otherwise 
converted to an encoding that is acceptable to the wire protocol. 
Perhaps it is appropriate to verify that the translation, if necessary, 
can happen without using the substitution character, or perhaps that 
should be the user's responsibility.  This choice should be documented.

It seems that charset None implies ASCII, for historical reasons; 
perhaps that can be overloaded to alternately mean binary, as the 
handling would be roughly the same, but perhaps a new 'binary' charset 
should be created to make it clear that charset changes don't make 
sense, and to reject attempts to convert binary data to character data.
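
A hypothetical sketch of the bytes-in path (set_text_from_bytes is my 
invented name, not an email-module function):

    def set_text_from_bytes(msg, data, charset):
        """Store text supplied as bytes, remembering its charset.

        Decoding up front may raise (see point A in the summary above);
        the charset is remembered so the wire form can prefer it.
        """
        text = data.decode(charset)
        msg.set_payload(text, charset=charset)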

For an incoming message, the wire-protocol format should be used as the 
primary data store.  Cached pointers and lengths to various MIME parts 
and subparts (individual headers, body, preamble, epilogue) would be 
appropriate.  get_ operations would find the data, and interpret it 
according to the current (defaults to message content, overridden by 
set_ operations) charset and encoding.  Requesting a Unicode charset 
would imply decoding the part to Unicode from the current charset and 
would return a string; requesting other character sets would imply 
converting from the message charset to the specified charset and 
returning bytes; requesting binary (or possibly 'None', see above) would 
return the wire-protocol bytes unchanged.  Then the application could do 
what it wants to attempt to decode that data to text using other 
encodings (i.e. not starting the conversion from the encoding declared 
explicitly or implicitly in the message part).
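
With today's names, the fallback behavior I'm describing looks roughly 
like this (message_from_bytes stands in for whatever the bytes-accepting 
parser ends up being called):

    import email

    def part_as_text(part, override=None):
        """Decode one leaf part, preferring the declared charset."""
        payload = part.get_payload(decode=True)           # wire bytes
        charset = override or part.get_content_charset()  # from message
        if charset is None:
            charset = 'latin-1'                           # the punt
        return payload.decode(charset)

    # msg = email.message_from_bytes(raw)   # assumed parser entry point
    # for part in msg.walk():
    #     if part.get_content_maintype() == 'text':
    #         print(part_as_text(part))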

The as_string() method becomes a misnomer in Python 3.0; since it is 
Python 3.0, that can be changed, no?  It should become as_wire_protocol, 
and would default to returning bytes of binary data, which is what the 
wire-protocol APIs need.  A variety that returns the bytes as Unicode 
codepoints could be implemented, for the purpose of "View source" type 
operations on the wire-protocol form... but that would and should only 
be a direct Latin-1 transliteration to Unicode.
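
The "View source" variety is then a one-liner on top of the bytes form 
(as_wire_protocol is, again, my proposed name):

    def view_source(wire_bytes):
        """Show wire bytes as text: a direct, fully reversible
        Latin-1 transliteration of bytes to codepoints."""
        return wire_bytes.decode('latin-1')

    # wire = msg.as_wire_protocol()   # proposed API, not existing
    # print(view_source(wire))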

Now that I've looked at the API, I don't see why it should be changed 
significantly for Python 3.0.  I have no clue how much of the guts would 
have to be changed to achieve the equivalent of what I described above. 
I do believe that what I outlined above would use the present API to 
achieve both the "I want Unicode only" philosophy that you ascribe to 
me, and the "I want to do bit-flipping" (whatever that means) philosophy 
that you claimed for yourself.

Setting headers to Unicode values via the msg['Subject'] syntax is no 
problem.  Just make sure that they get properly encoded to ASCII at the 
end.  msg['Subject'] and msg[b'Subject'] could be made 
equivalent, but I'd never use the latter, it has an annoying b character 
to distract from the meaning.  The syntax should permit the use of 
Unicode, in other words, but:

* to encode non-ASCII data with full control over what parts get encoded 
and how, the Header API is still appropriate
* as an alternative, the API could be extended to include a default 
header encoding
* Strings supplied via the msg['Subject'] = 'some string' interface are 
handled as follows: if 'some string' is in the ASCII subset, no problem. 
If not, and if the default header encoding has not been set, then an 
exception is raised.  Otherwise, the default header encoding is used to 
encode the Unicode string as necessary.
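
The first bullet, with the Header API as it stands today:

    from email.header import Header
    from email.message import Message

    msg = Message()
    msg['Subject'] = Header('Grüße aus München', 'iso-8859-1')

    # encode() yields the RFC 2047 encoded-word form for the wire.
    print(msg['Subject'].encode())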


So I see the API as quite robust, although its current implementation 
may not be as described above, and I can't scope the effort to achieve 
the above.

I'd like to see a "headers_as_wire_protocol" API added for generating 
bounce messages.  It is easy enough to extract from as_wire_protocol, 
but common enough to be useful, methinks, and avoids allocating space 
for a huge message just to get its headers.
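
It really is just a split at the first blank line; a sketch over the 
already-flattened form (a real implementation would stop generating at 
that point instead of flattening the whole message first):

    def headers_as_wire_protocol(wire):
        """Return only the header block of a wire-format message.

        wire -- the complete message as bytes, CRLF line endings
        """
        head, _sep, _body = wire.partition(b'\r\n\r\n')
        return head + b'\r\n'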


What specific problems are perceived, that the present API can't handle?

Are there areas in which it behaves differently than I outline above?

If so, is my outline an improvement, or confusing, and why?

Are there other issues?

Barry said:
 > Yes, Python 2.x's email package handles broken messages, and email-ng
 > must too.  "Handling it" means:
 >
 > 1) never throw an exception
 > 2) record defects in a usable way for upstream consumers of the message
 > to handle
 >
 > it currently also means
 >
 > 3) ignore idempotency for defective messages.

I'm not sure what "ignore idempotency" means in this context...


If the above outline is perceived as a useful set of semantics for the 
3.0 email library, I might be able to find a little time (don't tell my 
wife) to help work on them, assuming that they are mostly implemented in 
Python and/or C.  But I'd need a bit of hand-holding to get started, 
since I haven't yet figured out how to compile my own Python.


-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

