On Sat, 31 Aug 2013 20:37:30 +1000, Steven D'Aprano email@example.com wrote:
On 31/08/13 15:21, R. David Murray wrote:
If you've read my blog (e.g. on Planet Python), you will be aware that I dedicated August to full-time email package development.
The API looks really nice! Thank you for putting this together.
A question comes to mind though:
All input strings are unicode, and the library takes care of doing whatever encoding is required. When you pull data out of a parsed message, you get unicode, without having to worry about how to decode it yourself.
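For illustration, a rough sketch of that unicode-in, unicode-out round trip. It assumes the email package as it ships in current Python 3 (EmailMessage, policy.default, get_content), which may not match the provisional API under discussion here; the addresses and text are made up.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# Build a message purely from str values; no manual encoding anywhere.
msg = EmailMessage()
msg["From"] = "sender@example.com"
msg["Subject"] = "Café réservation"
msg.set_content("Grüße aus Köln\n")

# Serialising picks header and body encodings for us.
wire = msg.as_bytes()

# Parsing the bytes back yields str again, already decoded.
parsed = message_from_bytes(wire, policy=policy.default)
subject = str(parsed["Subject"])
body = parsed.get_content()
print(subject)
print(body, end="")
```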
How well does your library cope with emails where the encoding is declared wrongly? Or no encoding declared at all?
It copes as best it can :) The bad bytes are preserved (unless you modify a part) but are returned as the "unknown character" in a string context. You can get the original bytes out by using the bytes access interface. (There are probably some places where how to do that isn't clear in the current API, but basically either you use BytesGenerator or you drop down to a lower level API.)
An attempt is made to interpret "bad bytes" as utf-8, before giving up and replacing them with the 'unknown character' character. I'm not 100% sure that is a good idea.
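A minimal sketch of the behaviour described above, assuming the email package as found in current Python 3 (message_from_bytes, policy.default, get_content and as_bytes are today's stdlib names and may differ from the provisional API in this thread). The sample message is invented: it declares us-ascii but carries a single Latin-1 byte, which is not valid UTF-8 either.

```python
from email import message_from_bytes, policy

# Hypothetical message: charset says us-ascii, body contains byte 0xE9.
raw = (
    b"From: sender@example.com\n"
    b"Subject: mislabelled charset\n"
    b"Content-Type: text/plain; charset=us-ascii\n"
    b"\n"
    b"caf\xe9 au lait\n"
)

msg = message_from_bytes(raw, policy=policy.default)

# String view: the undecodable byte shows up as U+FFFD.
text = msg.get_content()

# Bytes view: serialising the unmodified message keeps the original byte.
round_tripped = msg.as_bytes()

print(repr(text))
print(b"caf\xe9" in round_tripped)
```

The bytes only survive because the part is left unmodified; replacing the content would re-encode it.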
Conveniently, your email is an example of this. Although it contains non-ASCII characters, it is declared as us-ascii:
Oh, yeah, my MUA is a little quirky and I forgot the step that would have made that correct. Wanting to rewrite it is one of the reasons I embarked on this whole email thing a few years ago :)