"Martin" == Martin von Loewis firstname.lastname@example.org writes:
Martin> To be really precise, HTML 4.01's "document character set" is the "Universal Character Set" (as defined in ISO 10646), see
Yes, many thanks.
Martin> Python does not currently include codecs for iso-2022-jp, gb2312, big5, euc-jp, shift-jis. Since mailman leaves all strings as-is, and never mixes encodings, it can let them pass through unmodified. There are a number of pitfalls, though:
I have been working actively on these problems. Hopefully we can ship these codecs with Mailman 2.1.
Martin> This is probably an archive problem only; however, if mailman adds a footer, it will produce garbage if the footer encoding differs from the message body encoding.
Martin> - To analyse the subject, Mailman needs to strip off the subject_prefix from the incoming message.
The subject and footer issue is a good one, and needs some work.
We basically need a map of charset -> localized "Re:" prefixes; with the new email module's i18n support, it's trivial to decode headers and make sure we don't add a [PREFIX] to a message with Re: [PREFIX] in the local language.
I know German uses AW: -- does anyone have a list of commonly-used response prefixes in other languages? Japanese uses Re:, as far as I know.
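Here is a rough sketch of the prefix check described above. It assumes the present-day standard-library `email.header.decode_header` as a stand-in for the email module's i18n support; the table contents, the charset keys, and the function names are all hypothetical (only "AW:" and "Re:" come from this thread).

```python
from email.header import decode_header

# Hypothetical table: charset -> reply prefixes commonly seen with it.
# Only "AW:" and "Re:" come from the discussion; the keys are guesses.
REPLY_PREFIXES = {
    "us-ascii": ["Re:"],
    "iso-8859-1": ["Re:", "AW:"],
    "iso-2022-jp": ["Re:"],
}


def decode_subject(raw):
    """Return (text, charset) for a possibly RFC 2047-encoded Subject."""
    pieces = []
    charset = None
    for chunk, cs in decode_header(raw):
        if isinstance(chunk, bytes):
            chunk = chunk.decode(cs or "us-ascii", "replace")
        pieces.append(chunk)
        if cs and charset is None:
            charset = cs.lower()  # remember the first declared charset
    return "".join(pieces), charset or "us-ascii"


def has_list_prefix(text, list_prefix, charset):
    """True if text starts with list_prefix once reply prefixes are skipped."""
    candidates = REPLY_PREFIXES.get(charset, ["Re:"])
    rest = text.lstrip()
    stripped = True
    while stripped:  # skip stacked prefixes such as "Re: AW: ..."
        stripped = False
        for reply in candidates:
            if rest.lower().startswith(reply.lower()):
                rest = rest[len(reply):].lstrip()
                stripped = True
    return rest.startswith(list_prefix)


def add_list_prefix(raw_subject, list_prefix):
    """Prepend list_prefix unless the decoded subject already carries it."""
    text, charset = decode_subject(raw_subject)
    if has_list_prefix(text, list_prefix, charset):
        return text
    return list_prefix + " " + text
```

Note that this returns decoded text; re-encoding the result into a proper header is left out. A subject with no encoded words falls back to us-ascii, so a plain-ASCII "AW: [PREFIX] ..." would not be recognised with this table, which may be an argument for keying the table on the list's language rather than on the charset.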
As for the footer.. hm. Needs more thought. I doubt anyone wants to add an attachment for the footer; I think the best thing to do would be to look up the body's charset in a table and attach a properly localized footer if it's found. If it's not found, no footer is attached. If the charset is not specified, assume us-ascii.
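To make the footer idea concrete, here is a minimal sketch of the table lookup. The `FOOTERS` table, its footer texts, and the `add_footer` name are made up for illustration; a real table would come from the list configuration.

```python
# Hypothetical table: body charset -> footer text for that charset.
FOOTERS = {
    "us-ascii": "-- \nSent via the example list\n",
    "iso-8859-1": "-- \nGesendet \u00fcber die Beispielliste\n",
}


def add_footer(body, charset=None, footers=FOOTERS):
    """Append a footer encoded in the body's charset, or leave body alone.

    body is the raw message body as bytes; charset is the declared
    charset, with None meaning "not specified" (treated as us-ascii).
    """
    charset = (charset or "us-ascii").lower()
    footer = footers.get(charset)
    if footer is None:
        return body  # unknown charset: attach no footer
    try:
        encoded = footer.encode(charset)
    except (LookupError, UnicodeEncodeError):
        return body  # footer cannot be expressed in this charset
    if body and not body.endswith(b"\n"):
        body += b"\n"
    return body + encoded
```

The footer is encoded separately and appended as bytes, so the body itself is never decoded or touched.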
What do you think?
Martin> - To produce HTML pages, mailman needs to quote markup characters. For some encodings (e.g. iso-2022-jp), HTML markup characters such as '<' may also occur as part of the multi-byte encoding. For these encodings, mailman currently performs no quoting at all. This is incorrect if an iso-2022-jp message contains a true '<' character, which would need to be converted to '&lt;'.
I have written a Python module that deals with this problem directly for iso-2022-jp; the same result could also be had by converting to Unicode, doing the HTML escaping there, then converting back to the output encoding.
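The Unicode route can be sketched in a few lines. This is not the module mentioned above; it assumes a Python whose standard library already ships an iso-2022-jp codec (current versions do), and `escape_for_html` is a hypothetical name.

```python
import html


def escape_for_html(raw, charset):
    """Escape HTML markup in encoded text without corrupting multi-byte data.

    Decoding first means only real '<', '>' and '&' characters are
    escaped, never bytes that merely look like them inside a
    multi-byte sequence.
    """
    text = raw.decode(charset)
    escaped = html.escape(text, quote=False)
    return escaped.encode(charset)
```

For example:

```python
raw = "\u65e5\u672c\u8a9e <b> & more".encode("iso-2022-jp")
page_bytes = escape_for_html(raw, "iso-2022-jp")
```

The price is a full decode and re-encode of every message, and it only works for charsets that have a codec installed.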
Martin> I'd encourage you to have a look at the iconv codec also. If the system iconv is powerful enough (e.g. on Linux glibc), all encodings of the world would be supported with that single codec.
Ah, if only all systems had such an iconv codec.
I'm surprised iconv is so powerful on Linux glibc, yet gettext does not support iso-2022-jp directly.