[Mailman-i18n] HTML entities (&eacute;) in es, it, no translations
Martin von Loewis
31 Jan 2002 14:24:24 +0100
Ben Gertzfield <firstname.lastname@example.org> writes:
> Actually, to be precise, HTML 4.01's native encoding is Unicode,
> which Latin-1 happens to be a (very small) subset of.
To be really precise, HTML 4.01's "document character set" is the
"Universal Character Set" (as defined in ISO 10646), see
Which character encoding is used is a different matter (Unicode is not a
character encoding); that is transmitted as part of the HTTP
response. As the document above points out, the default encoding, if
none is specified, is Latin-1 (they also point out that it is bad to
rely on that).
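
To illustrate why the declared encoding matters, here is a minimal
sketch in present-day Python 3 (not the Python of this thread). The
helper name content_type_header is hypothetical; it only formats the
header line that carries the charset alongside the HTTP response.

```python
def content_type_header(charset: str) -> str:
    """Build the HTTP header line that declares the page's encoding."""
    return f"Content-Type: text/html; charset={charset}"


# The same abstract character (U+00E9, e with acute) as UTF-8 bytes.
page_bytes = "\u00e9".encode("utf-8")

# Read with the declared encoding, the character survives.
right = page_bytes.decode("utf-8")

# Read with a Latin-1 default instead, the two bytes become two characters.
wrong = page_bytes.decode("iso-8859-1")

print(content_type_header("utf-8"))
```

The bytes are identical in both cases; only the encoding the receiver
assumes differs, which is why relying on the default is fragile.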
> Unfortunately, as much as I'd like, we can't make *everything*
> Unicode, because a lot of older browsers still don't support it.
That is completely irrelevant; Unicode is *not* a character
encoding. In this context, it is a Python internal datatype. When
producing an HTML document, strings of that type need to be encoded in
the target document encoding (which definitely will *not* be Unicode,
but perhaps a Unicode encoding, such as UTF-8, or some other encoding).
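
As a rough sketch of that encoding step, assuming present-day Python 3
where str plays the role of the Unicode type: the helper encode_page is
hypothetical, and falling back to numeric character references for
characters the target encoding lacks is one possible choice, not
something Mailman is known to do.

```python
def encode_page(text: str, charset: str) -> bytes:
    """Encode abstract text into the bytes of the target document encoding.

    Characters the charset cannot represent become numeric character
    references, which any HTML 4.01 user agent should understand.
    """
    return text.encode(charset, errors="xmlcharrefreplace")


text = "caf\u00e9 \u20ac"  # "cafe" with e-acute, then the Euro sign

utf8_page = encode_page(text, "utf-8")         # both characters fit
latin1_page = encode_page(text, "iso-8859-1")  # the Euro sign does not fit
```

The same string yields different byte sequences per target encoding;
"Unicode" itself never appears on the wire.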
> Which East Asian ones are missing? Mailman CVS works beautifully
> for me with Japanese, and the screenshot I sent earlier today shows
> Chinese (both simplified and traditional) working in email.
Python does not currently include codecs for iso-2022-jp, gb2312,
big5, euc-jp, shift-jis. Since mailman leaves all strings as-is, and
never mixes encodings, it can let them pass through unmodified. There
are a number of pitfalls, though:
- On mailing lists, people may use different encodings; some of the
common combinations might be:
European languages: ISO-8859-1, ISO-8859-15 (for the Euro), UTF-8
Japanese: ISO-2022-JP, eucJP, shift-jis, UTF-8
Chinese: gb2312, big5
This is probably an archive problem only; however, if mailman adds a
footer, it will produce garbage if the footer encoding differs from
the message body encoding.
- To analyse the subject, Mailman needs to strip off the
subject_prefix from the incoming message. If the message uses a
MIME-encoded header, it may be that the subject prefix is base64
encoded. Currently, mailman fails to strip the prefix in this
case. There is a patch on SF that tries to decode the subject. If
the encoding is not known to Python, this will still fail.
- To produce HTML pages, mailman needs to quote markup characters. For
some encodings (e.g. iso-2022-jp), HTML markup characters such as '<'
may also occur as part of the multi-byte encoding. For these
encodings, mailman currently performs no quoting at all. This is
incorrect if an iso-2022-jp message contains a true '<' character,
which would need to be converted to '&lt;'.
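
The three pitfalls above share one remedy: decode to text, operate on
text, then encode again. The following is only a sketch under stated
assumptions, not Mailman's code and not the SF patch. It assumes a
present-day Python 3, where the East Asian codecs ship with the
standard library (they did not when this was written). The function
names append_footer, strip_subject_prefix and quote_markup are
hypothetical.

```python
import base64
import html
from email.header import decode_header


def append_footer(body: bytes, charset: str, footer: str) -> bytes:
    """Append a footer via text, so body and footer share one encoding."""
    return (body.decode(charset) + footer).encode(charset)


def strip_subject_prefix(raw_subject: str, prefix: str):
    """Decode a possibly MIME-encoded subject, then drop a leading prefix.

    Returns None when a chunk cannot be decoded, e.g. because its
    charset is not known to Python.
    """
    parts = []
    for chunk, charset in decode_header(raw_subject):
        if isinstance(chunk, bytes):
            try:
                chunk = chunk.decode(charset or "ascii")
            except (LookupError, UnicodeDecodeError):
                return None
        parts.append(chunk)
    subject = "".join(parts)
    if subject.startswith(prefix):
        subject = subject[len(prefix):]
    return subject


def quote_markup(body: bytes, charset: str) -> bytes:
    """Escape markup on decoded text, never on the raw multi-byte stream."""
    return html.escape(body.decode(charset), quote=False).encode(charset)


# Usage: a base64-encoded ISO-2022-JP subject that hides the list prefix.
encoded_word = base64.b64encode(
    "[Test] こんにちは".encode("iso-2022-jp")
).decode("ascii")
raw_subject = "=?iso-2022-jp?B?" + encoded_word + "?="
stripped = strip_subject_prefix(raw_subject, "[Test] ")

# Usage: a true '<' inside an ISO-2022-JP body gets quoted.
quoted = quote_markup("1 < 2 です".encode("iso-2022-jp"), "iso-2022-jp")
```

Working on decoded text means a 0x3C byte that is half of a two-byte
character is never mistaken for '<', while a real '<' is still quoted.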
> The Japanese codec is in a good state and will be easy enough to
> ship; the Chinese ones are only available in CVS that I know of, so
> we will need to make a proper distribution.
I'd encourage you to have a look at the iconv codec also. If the
system iconv is powerful enough (e.g. on Linux glibc), all encodings
of the world would be supported with that single codec.