Ben Gertzfield firstname.lastname@example.org writes:
Actually, to be precise, HTML 4.01's native encoding is Unicode, which Latin-1 happens to be a (very small) subset of.
To be really precise, HTML 4.01's "document character set" is the "Universal Character Set" (as defined in ISO 10646), see
Which character encoding is used is a different matter (Unicode is not a character encoding); it is transmitted as part of the HTTP response. As the document above points out, the default encoding, if none is specified, is Latin-1 (it also points out that relying on that default is a bad idea).
Unfortunately, as much as I'd like, we can't make *everything* Unicode, because a lot of older browsers still don't support it.
That is completely irrelevant; Unicode is *not* a character encoding. In this context, it is a Python internal datatype. When producing an HTML document, strings of that type need to be encoded in the target document encoding (which definitely will *not* be Unicode, but perhaps a Unicode encoding, such as UTF-8, or some other encoding).
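To illustrate the distinction, here is a minimal sketch in present-day Python 3 syntax (an assumption; it is not the Python version discussed in this thread). The helper name render_page and the page template are hypothetical and not part of Mailman. The point is only that a Unicode string has no bytes until a target encoding is chosen, and that characters the target encoding cannot represent can still be sent as numeric character references.

```python
# Unicode text is an abstract sequence of characters; bytes exist only
# once a concrete encoding has been chosen.
greeting = "Gr\u00fc\u00dfe"          # "Gruesse" written with u-umlaut and sharp s

utf8_bytes = greeting.encode("utf-8")          # two bytes per non-ASCII char
latin1_bytes = greeting.encode("iso-8859-1")   # one byte per char


def render_page(body: str, charset: str) -> bytes:
    """Hypothetical helper: serialize an HTML page in the given charset.

    Characters that the charset cannot represent are emitted as numeric
    character references (&#NNNN;), which refer to the document character
    set (ISO 10646) regardless of the encoding used on the wire.
    """
    page = (
        "<html><head>"
        '<meta http-equiv="Content-Type" '
        'content="text/html; charset=%s">'
        "</head><body>%s</body></html>" % (charset, body)
    )
    return page.encode(charset, errors="xmlcharrefreplace")


if __name__ == "__main__":
    print(utf8_bytes)
    print(latin1_bytes)
    # U+65E5 is not in Latin-1, so it becomes a character reference.
    print(render_page("\u65e5", "iso-8859-1"))
```

Whether an older browser can display such a page then depends on which encodings and character references it understands, not on "Unicode support" as such.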
Which East Asian ones are missing? Mailman CVS works beautifully for me with Japanese, and the screenshot I sent earlier today shows Chinese (both simplified and traditional) working in email.
Python does not currently include codecs for iso-2022-jp, gb2312, big5, euc-jp, shift-jis. Since mailman leaves all strings as-is, and never mixes encodings, it can let them pass through unmodified. There are a number of pitfalls, though:
- On mailing lists, people may use different encodings; some of the common combinations might be:
    European languages: ISO-8859-1, ISO-8859-15 (for the Euro), UTF-8
    Japanese: ISO-2022-JP, eucJP, shift-jis, UTF-8
    Chinese: gb2312, big5
This is probably an archive problem only; however, if mailman adds a footer, it will produce garbage if the footer encoding differs from the message body encoding.
- To analyse the subject, Mailman needs to strip off the subject_prefix from the incoming message. If the message uses a MIME-encoded header, it may be that the subject prefix is base64 encoded. Currently, mailman fails to strip the prefix in this case. There is a patch on SF that tries to decode the subject. If the encoding is not known to Python, this will still fail.
- To produce HTML pages, mailman needs to quote markup characters. For some encodings (e.g. iso-2022-jp), HTML markup characters such as '<' may also occur as part of the multi-byte encoding. For these encodings, mailman currently performs no quoting at all. This is incorrect if an iso-2022-jp message contains a true '<' character, which would need to be converted to '&lt;'.
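The last two pitfalls can be sketched as follows. This is an illustration only, written for present-day Python 3 (an assumption; its standard library includes an iso-2022-jp codec, unlike the Python discussed here). The function names strip_subject_prefix and quote_for_html are hypothetical and are neither Mailman's code nor the patch on SF. The idea in both cases is the same: decode to Unicode first, then operate on characters rather than on bytes.

```python
import html
from email.header import decode_header, make_header


def strip_subject_prefix(raw_subject: str, prefix: str) -> str:
    """Decode a possibly MIME-encoded Subject, then drop the list prefix.

    Decoding fails (LookupError) when Python has no codec for the
    charset named in the header, which is the limitation noted above.
    """
    subject = str(make_header(decode_header(raw_subject)))
    if subject.startswith(prefix):
        subject = subject[len(prefix):].lstrip()
    return subject


def quote_for_html(body: bytes, charset: str) -> str:
    """Quote markup characters after decoding, not before.

    Working on decoded text means a 0x3C byte inside a multi-byte
    sequence is never mistaken for a real '<' character.
    """
    return html.escape(body.decode(charset), quote=False)


if __name__ == "__main__":
    # U+305C (hiragana ZE) followed by genuine markup characters.
    body = "\u305c <b>".encode("iso-2022-jp")
    print(body.count(b"<"))          # '<' bytes present in the raw message
    print(quote_for_html(body, "iso-2022-jp"))
```

The result of quote_for_html is a Unicode string, so it still has to be encoded in the page's target encoding before it is written out.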
The Japanese codec is in a good state and will be easy enough to ship; the Chinese ones are only available in CVS that I know of, so we will need to make a proper distribution.
I'd encourage you to have a look at the iconv codec also. If the system iconv is powerful enough (e.g. on Linux glibc), all encodings of the world would be supported with that single codec.