On Tue, 14 Apr 2009 03:15:20 am Stephen J. Turnbull wrote:
> People see email as (rich-)text.
It's not clear what you actually mean by "(rich-)text". In the context of email, I understand it to mean HTML in the body, web-bugs, security exploits, 36pt hot-pink bold text on a lime-green background, and all the other wonderful things modern mail clients let you put in your email. But as far as I know, no mail client tries to render HTML tags inside mail headers, so you're probably not talking about HTML rich-text. I guess you mean Unicode characters. Am I right?
Now, correct me if I'm wrong, but I don't think mail headers can actually be anything but bytes. I see that my mail client, at least, sends bytes in the Subject header. If I try to send characters, e.g. the subject header "Testing-β-" (without the quotes), what actually gets sent is the bytes "=?utf-8?q?Testing-=CE=B2-?=" (again without the quotation marks). This seems to be covered by RFC 2047:
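To make the example concrete, here is a minimal sketch that decodes that exact Subject value using the standard library's `email.header.decode_header`. The variable names are mine and purely illustrative; this only shows what an RFC 2047 "encoded-word" looks like once it is unpacked, not how any particular mail client does it.

```python
from email.header import decode_header

# The ASCII-only value that goes out on the wire in the Subject header.
wire = "=?utf-8?q?Testing-=CE=B2-?="

# decode_header() returns a list of (chunk, charset) pairs. Chunks that
# came from an encoded-word are bytes; charset is None for plain chunks.
parts = decode_header(wire)

subject = "".join(
    chunk.decode(charset or "ascii") if isinstance(chunk, bytes) else chunk
    for chunk, charset in parts
)

print(parts)
print(subject)
```

Note that the decoding is a two-step affair: the header is first unpacked into bytes plus a charset label, and only then turned into characters.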
If you're proposing converting those bytes into characters, that's all very well and good, but what's your strategy for dealing with the inevitable wrongly-formatted headers? If the header can't be correctly decoded into text, there still needs to be a way to get to the raw bytes. Apart from (e.g.) mail processing apps like SpamBayes which will want to inspect the raw bytes, mail readers will need to deal with badly formatted mail. The RFC states:
"However, a mail reader MUST NOT prevent the display or handling of a message because an 'encoded-word' is incorrectly formed."
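One way to honour that requirement, sketched under my own assumptions: always hand back the raw bytes, and treat the decoded text as optional. The helper name `decode_or_keep_raw` is hypothetical, not part of the email package or of any proposal in this thread.

```python
from email.errors import HeaderParseError
from email.header import decode_header


def decode_or_keep_raw(raw: bytes):
    """Return (text, raw); text is None if the header cannot be decoded.

    Hypothetical helper: the raw bytes are always returned untouched, so
    a caller such as a spam filter can still inspect them.
    """
    try:
        parts = decode_header(raw.decode("ascii"))
        text = "".join(
            chunk.decode(charset or "ascii") if isinstance(chunk, bytes) else chunk
            for chunk, charset in parts
        )
    except (HeaderParseError, LookupError, UnicodeError):
        # Malformed encoded-word, unknown charset, or bytes that are not
        # valid in the declared charset: fall back to the raw form only.
        text = None
    return text, raw


good = decode_or_keep_raw(b"=?utf-8?q?Testing-=CE=B2-?=")
bad_charset = decode_or_keep_raw(b"=?no-such-charset?q?abc?=")
bad_bytes = decode_or_keep_raw(b"=?utf-8?q?=FF?=")
```

A mail reader could display the text when it is available and some escaped rendering of the raw bytes when it is not, so a badly formed header never blocks the message.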
> Then MTAs see email as a string of octets. So guess what:
>
>     bytes(message['Subject'])
>
> gives wire format. Yow! I think I'm just joking. Right?
Er, I'm not sure. Are you joking? I hope not, because it is important to be able to get to the raw, unmodified bytes that the MTA sees, without all the fancy processing you suggest.
> Otherwise, you should have a unicode, and you simply look at the range of the string. If it fits in ASCII, Bob's your uncle. If not, Bob's your aunt (and you use UTF-8).
Again, correct me if I'm wrong, but all valid mail headers must fit in ASCII. RFC 5335 defines an experimental approach to allowing full Unicode in mail headers, but surely it's going to be a while before that's common, let alone standard.
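For what it's worth, the "look at the range of the string" rule can be reconciled with ASCII-only headers by wrapping anything outside ASCII as an RFC 2047 encoded-word. A rough sketch, assuming UTF-8 as the fallback charset; the function name `header_wire_form` is made up for illustration:

```python
from email.header import Header, decode_header


def header_wire_form(value: str) -> str:
    """Return an ASCII-only form of a header value (hypothetical helper)."""
    try:
        value.encode("ascii")
    except UnicodeEncodeError:
        # Does not fit in ASCII: emit an RFC 2047 encoded-word in UTF-8.
        return Header(value, charset="utf-8").encode()
    # Fits in ASCII: send it unchanged.
    return value


plain = header_wire_form("Testing")
encoded = header_wire_form("Testing-β-")

# Round trip: unpack the encoded-word back into characters.
round_trip = "".join(
    chunk.decode(charset or "ascii") if isinstance(chunk, bytes) else chunk
    for chunk, charset in decode_header(encoded)
)
```

Either way, what reaches the MTA is ASCII bytes, which is the point I am making above.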
-- Steven D'Aprano