
On Mon, 2006-04-24 at 17:12 +0900, Stephen J. Turnbull wrote:
"Tokio" == Tokio Kikuchi <tkikuchi@is.kochi-u.ac.jp> writes:
Tokio> Consider the case where Mailman gets spam from a foreign
Tokio> country and it causes an error. Mailman may complain of a
Tokio> UnicodeDecodeError and spew an excerpt containing a string
Tokio> in an unknown charset.
This really should not happen. Mailman should trap *all* UnicodeDecodeErrors at a very low level. (You simply cannot yet count on malformed message == SPAM in all contexts. E.g., just last week the Mac users here started flaming the Windows-using administration for distributing mojibake.)
The general approach should be that /everything/ gets converted to Unicode at the boundaries of the system. In Mailman 2.1, all the Unicode and i18n stuff was bolted on afterward, which is why we've had so much pain throughout, dealing with Unicode conversions. Ideally, we'd get rid of all that for 2.2 and deal only with Unicode internally.
We may have to modify the email package, but I'm not sure. It should probably always return Unicode for everything.
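To make the "convert at the boundaries" idea concrete, here is a minimal sketch in present-day Python 3 (not Mailman 2.1 code). The helper name `to_unicode` and its `fallback` parameter are made up for illustration. It tries the declared charset first and never lets a UnicodeDecodeError (or an unknown-charset LookupError) escape into the rest of the system:

```python
def to_unicode(raw, charset=None, fallback="ascii"):
    """Decode bytes at the system boundary without ever raising.

    `to_unicode` and `fallback` are hypothetical names for this sketch.
    """
    for candidate in (charset, fallback):
        if not candidate:
            continue
        try:
            return raw.decode(candidate)
        except (UnicodeDecodeError, LookupError):
            # Undecodable bytes or a charset Python does not know about:
            # move on to the next candidate.
            continue
    # Last resort: keep what is decodable and mark the rest with U+FFFD.
    return raw.decode(fallback, errors="replace")


print(to_unicode(b"caf\xc3\xa9", "utf-8"))
print(to_unicode(b"abc\xff", "x-no-such-charset"))
```

Everything downstream of such a function would then only ever see text, which is the point of doing the conversion once, at the edge.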
Then it should wash the message to make it safe. RFC 2047-encode any 8-bit headers, and use a base64 Content-Transfer-Encoding for any 8-bit message bodies or body parts that don't have a known, approved charset specified. Bonus points for checking that 8-bit body parts with a specified charset actually conform to it.
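A rough sketch of that washing step, again in present-day Python 3 and under assumptions of my own: the function names, the `APPROVED_CHARSETS` set, and the choice to re-encode washed headers as UTF-8 are all hypothetical, not anything Mailman does today. Only `email.header.Header` and `base64.encodebytes` are real library calls.

```python
import base64
from email.header import Header

# Hypothetical policy list; which charsets count as "approved" is a site decision.
APPROVED_CHARSETS = {"us-ascii", "utf-8", "iso-8859-1"}


def wash_header(raw, charset="utf-8"):
    """Return an ASCII-only header value, RFC 2047-encoding 8-bit input."""
    try:
        return raw.decode("ascii")
    except UnicodeDecodeError:
        # Assumption: undecodable bytes are replaced rather than preserved.
        text = raw.decode(charset, errors="replace")
        return Header(text, "utf-8").encode()


def wash_body(raw, charset=None):
    """Return (content_transfer_encoding, payload_bytes) for one body part."""
    if raw.isascii():
        return "7bit", raw
    if charset and charset.lower() in APPROVED_CHARSETS:
        try:
            raw.decode(charset)  # the "bonus points" conformance check
            return "8bit", raw
        except UnicodeDecodeError:
            pass
    # Unknown, unapproved, or non-conforming charset: make it 7-bit safe.
    return "base64", base64.encodebytes(raw)
```

Note that a single-byte charset such as iso-8859-1 accepts any byte sequence, so the conformance check is only meaningful for charsets like UTF-8 that can actually reject input.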
Finally, reraise some kind of exception that can be handled at the filtering policy level.
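One way the trap-and-reraise step might look, as a sketch only: `MalformedMessageError`, `decode_or_reject`, and `filter_policy` are invented names, and the "hold"/"accept" return values stand in for whatever the filtering policy would really do. The low-level code catches the decode failure, builds an ASCII-safe excerpt, and raises a single exception type that the policy layer knows how to handle.

```python
class MalformedMessageError(Exception):
    """Hypothetical policy-level error; not an existing Mailman exception."""

    def __init__(self, reason, excerpt):
        super().__init__(reason)
        self.excerpt = excerpt  # ASCII-only, safe to put in a log or notice


def decode_or_reject(raw, charset):
    """Decode at a low level; translate failures into a policy-level error."""
    try:
        return raw.decode(charset)
    except (UnicodeDecodeError, LookupError) as exc:
        # Escape 8-bit bytes so the excerpt itself can never be mojibake.
        safe = raw[:40].decode("ascii", errors="backslashreplace")
        raise MalformedMessageError(
            f"cannot decode as {charset!r}", safe
        ) from exc


def filter_policy(raw, charset):
    """Stand-in for the filtering policy: hold anything malformed."""
    try:
        decode_or_reject(raw, charset)
    except MalformedMessageError:
        return "hold"
    return "accept"
```

Chaining with `from exc` keeps the original UnicodeDecodeError available for debugging while the policy code only has to know about one exception class.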
That sounds about right. Probably the email package should convert everything to Unicode internally and place Defects on the message objects that have illegal encodings.
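For reference, the email package already records structural problems as defects on the message object, so encoding problems could plausibly be reported the same way. The sketch below (Python 3) only demonstrates the existing mechanism with a structural defect, a multipart message that declares no boundary. Whether and how encoding errors would become defects is the open question here, and `collect_defects` is a made-up helper name.

```python
import email
from email import errors


def collect_defects(msg):
    """Gather the defects recorded on a message and all of its subparts."""
    found = []
    for part in msg.walk():
        found.extend(part.defects)
    return found


# Claims to be multipart but declares no boundary parameter.
raw = "Content-Type: multipart/mixed\n\nno boundary was declared\n"
msg = email.message_from_string(raw)

for defect in collect_defects(msg):
    print(type(defect).__name__)

has_boundary_defect = any(
    isinstance(d, errors.NoBoundaryInMultipartDefect)
    for d in collect_defects(msg)
)
```

A filtering policy could then inspect the collected defects instead of catching exceptions, which keeps the parser itself from ever failing on bad input.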
-Barry