Ben Gertzfield <che@debian.org> writes:
Actually, to be precise, HTML 4.01's native encoding is Unicode, which Latin-1 happens to be a (very small) subset of.
To be really precise, HTML 4.01's "document character set" is the "Universal Character Set" (as defined in ISO 10646); see http://www.w3.org/TR/html4/charset.html. Which character encoding is used is a different matter (Unicode is not a character encoding); that is transmitted as part of the HTTP response. As the document above points out, the default encoding, if none is specified, is Latin-1 (they also point out that it is bad to rely on that).
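To illustrate where the encoding actually comes from, here is a minimal sketch that reads the charset parameter from a Content-Type header value and falls back to Latin-1 when none is given. The function name document_charset is made up for this example, and it assumes a current Python 3 standard library.

```python
from email.message import Message


def document_charset(content_type: str) -> str:
    """Return the charset named in a Content-Type value (hypothetical helper).

    Falls back to iso-8859-1 when no charset parameter is present,
    mirroring the default described above.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lower-cases the value and returns the
    # fallback object when the parameter is missing.
    return msg.get_content_charset("iso-8859-1")


print(document_charset("text/html; charset=UTF-8"))
print(document_charset("text/html"))
```

Relying on the fallback is exactly what the W3C text warns against, so a real server should always send the charset explicitly.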
Unfortunately, as much as I'd like, we can't make *everything* Unicode, because a lot of older browsers still don't support it.
That is completely irrelevant; Unicode is *not* a character encoding. In this context, it is a Python internal datatype. When producing an HTML document, strings of that type need to be encoded in the target document encoding (which definitely will *not* be Unicode, but perhaps a Unicode encoding, such as UTF-8, or some other encoding).
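As a rough sketch of that distinction: the text below is a Unicode string inside Python, and it only becomes bytes in some encoding at the moment the page is written out. This assumes Python 3, where str is the Unicode type; the helper name encode_for_document is invented for the example.

```python
def encode_for_document(text: str, charset: str) -> bytes:
    """Encode a Unicode string in the target document encoding.

    Characters the target encoding cannot represent are written as
    numeric character references, which stay valid in HTML because the
    document character set is ISO 10646 regardless of the encoding.
    """
    return text.encode(charset, errors="xmlcharrefreplace")


# "Caf\u00e9" plus three CJK characters (U+65E5 U+672C U+8A9E).
title = "Caf\u00e9 \u65e5\u672c\u8a9e"

utf8_bytes = encode_for_document(title, "utf-8")
latin1_bytes = encode_for_document(title, "iso-8859-1")

print(utf8_bytes)
print(latin1_bytes)  # b'Caf\xe9 &#26085;&#26412;&#35486;'
```

The same string yields different byte sequences depending on the document encoding, which is why "supporting Unicode" in the browser is not the question.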
Which East Asian ones are missing? Mailman CVS works beautifully for me with Japanese, and the screenshot I sent earlier today shows Chinese (both simplified and traditional) working in email.
Python does not currently include codecs for iso-2022-jp, gb2312, big5, euc-jp, or shift-jis. Since mailman leaves all strings as-is, and never mixes encodings, it can let them pass through unmodified. There are a number of pitfalls, though:

- On mailing lists, people may use different encodings; some of the common combinations might be:

    European languages: ISO-8859-1, ISO-8859-15 (for the Euro), UTF-8
    Japanese: ISO-2022-JP, eucJP, shift-jis, UTF-8
    Chinese: gb2312, big5

  This is probably an archive problem only; however, if mailman adds a footer, it will produce garbage if the footer encoding differs from the message body encoding.

- To analyse the subject, mailman needs to strip off the subject_prefix from the incoming message. If the message uses a MIME-encoded header, it may be that the subject prefix is base64 encoded. Currently, mailman fails to strip the prefix in this case. There is a patch on SF that tries to decode the subject. If the encoding is not known to Python, this will still fail.

- To produce HTML pages, mailman needs to quote markup characters. For some encodings (e.g. iso-2022-jp), the bytes of HTML markup characters such as '<' may also occur as part of a multi-byte sequence. For these encodings, mailman currently performs no quoting at all. This is incorrect if an iso-2022-jp message contains a true '<' character, which would need to be converted to '&lt;'.
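The last pitfall can be sketched as follows: decode to Unicode first, quote there, then encode back. This is only an illustration, not what mailman does, and it assumes a Python 3 whose standard library ships an iso-2022-jp codec (which is not the case for the Python discussed here). The function name quote_for_html is invented for the example.

```python
import html


def quote_for_html(raw: bytes, charset: str) -> bytes:
    """Quote markup characters in text that arrives as encoded bytes.

    Quoting happens on the decoded Unicode string, so bytes that merely
    look like '<' inside a multi-byte sequence are left alone.
    """
    text = raw.decode(charset)
    escaped = html.escape(text, quote=False)
    return escaped.encode(charset, errors="xmlcharrefreplace")


# U+305C (hiragana ZE) is encoded in iso-2022-jp with a second byte of
# 0x3C, the same byte value as ASCII '<'.
raw = "\u305c < 1".encode("iso-2022-jp")

# Byte-level replacement sees two '<' bytes, although only one is markup.
print(raw.count(b"<"))

quoted = quote_for_html(raw, "iso-2022-jp")
print(quoted.decode("iso-2022-jp"))
```

A byte-level replace of '<' would damage the Japanese character, and doing no quoting at all lets the real '<' through; going via Unicode avoids both, but only if a codec for the encoding is available.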
The Japanese codec is in a good state and will be easy enough to ship; the Chinese ones are only available in CVS that I know of, so we will need to make a proper distribution.
I'd encourage you to have a look at the iconv codec also. If the system iconv is powerful enough (e.g. on Linux glibc), all encodings of the world would be supported with that single codec.

Regards,
Martin