[Mailman-i18n] HTML entities (&eacute;) in es, it, no translations

Martin von Loewis loewis@informatik.hu-berlin.de
31 Jan 2002 14:24:24 +0100

Ben Gertzfield <che@debian.org> writes:

> Actually, to be precise, HTML 4.01's native encoding is Unicode,
> which Latin-1 happens to be a (very small) subset of.

To be really precise, HTML 4.01's "document character set" is the
"Universal Character Set" (as defined in ISO 10646), see


What the character encoding is is a different matter (Unicode is not a
character encoding); that is transmitted as part of the HTTP
response. As the document above points out, the default encoding, if
none is specified, is Latin-1 (they also point out that it is bad to
rely on that).

> Unfortunately, as much as I'd like, we can't make *everything* 
> Unicode, because a lot of older browsers still don't support it.

That is completely irrelevant; Unicode is *not* a character
encoding. In this context, it is a Python internal datatype. When
producing an HTML document, strings of that type need to be encoded in
the target document encoding (which definitely will *not* be Unicode,
but perhaps a Unicode encoding, such as UTF-8, or some other encoding).
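
To illustrate the distinction, here is a minimal sketch, assuming a
present-day Python 3 in which `str` is the Unicode type. The function
name `render_page` is hypothetical and not part of Mailman. Characters
the target encoding cannot represent are written as numeric character
references, which HTML resolves against the document character set.

```python
def render_page(text: str, charset: str) -> bytes:
    """Encode an already-built page body in the target document encoding."""
    # xmlcharrefreplace turns unencodable characters into &#NNNN;
    # references instead of raising UnicodeEncodeError.
    return text.encode(charset, errors="xmlcharrefreplace")


page = "caf\u00e9 costs 3 \u20ac"          # e-acute and the Euro sign
print(render_page(page, "utf-8"))          # both characters encoded directly
print(render_page(page, "iso-8859-1"))     # Euro sign becomes a reference
```

The same Unicode string yields different byte sequences depending on
the encoding chosen, and that choice is what has to be announced in the
HTTP response.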

> Which East Asian ones are missing?  Mailman CVS works beautifully
> for me with Japanese, and the screenshot I sent earlier today shows
> Chinese (both simplified and traditional) working in email.

Python does not currently include codecs for iso-2022-jp, gb2312,
big5, euc-jp, shift-jis. Since mailman leaves all strings as-is, and
never mixes encodings, it can let them pass through unmodified. There
are a number of pitfalls, though:

- On mailing lists, people may use different encodings; some of the
  common combinations might be:
  European languages: ISO-8859-1, ISO-8859-15 (for the Euro), UTF-8
  Japanese: ISO-2022-JP, eucJP, shift-jis, UTF-8
  Chinese: gb2312, big5

  This is probably an archive problem only; however, if mailman adds a
  footer, it will produce garbage if the footer encoding differs from
  the message body encoding.

- To analyse the subject, Mailman needs to strip off the
  subject_prefix from the incoming message. If the message uses a
  MIME-encoded header, it may be that the subject prefix is base64
  encoded. Currently, mailman fails to strip the prefix in this
  case. There is a patch on SF that tries to decode the subject. If
  the encoding is not known to Python, this will still fail.

- To produce HTML pages, mailman needs to quote markup characters. For
  some encodings (e.g. iso-2022-jp), HTML markup characters such as '<'
  may also occur as part of the multi-byte encoding. For these
  encodings, mailman currently performs no quoting at all. This is
  incorrect if an iso-2022-jp message contains a true '<' character,
  which would need to be converted to '&lt;'.
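
The three pitfalls above share one remedy: decode to Unicode, operate
on text, then encode once. A rough sketch follows, assuming a
present-day Python 3 whose standard library ships an iso-2022-jp codec
(unlike the Python discussed here). The helper names `append_footer`,
`strip_prefix` and `quote_markup` are hypothetical and are not
Mailman's own functions.

```python
import base64
import html
from email.header import decode_header


def append_footer(body: bytes, body_charset: str, footer: str) -> bytes:
    """Append a text footer without mixing encodings in one byte string."""
    text = body.decode(body_charset) + footer
    # Footer characters the body charset lacks are replaced, not mixed in.
    return text.encode(body_charset, errors="replace")


def strip_prefix(raw_subject: str, prefix: str) -> str:
    """Decode a possibly MIME-encoded Subject and drop the list prefix."""
    parts = []
    for chunk, charset in decode_header(raw_subject):
        if isinstance(chunk, bytes):
            # Raises LookupError if Python has no codec for this charset,
            # which is the remaining failure case described above.
            chunk = chunk.decode(charset or "ascii", errors="replace")
        parts.append(chunk)
    return "".join(parts).replace(prefix, "", 1).strip()


def quote_markup(raw: bytes, charset: str) -> bytes:
    """Quote '<', '>' and '&' on decoded text, then re-encode."""
    return html.escape(raw.decode(charset), quote=False).encode(charset)


# A base64-encoded subject whose prefix is hidden inside the encoded word.
encoded = base64.b64encode("[Test] Gr\u00fc\u00dfe".encode("utf-8")).decode("ascii")
print(strip_prefix("=?utf-8?b?" + encoded + "?=", "[Test]"))

# U+305C (hiragana ZE) contains the byte 0x3C ('<') in iso-2022-jp, so a
# byte-level replace would corrupt it; quoting the decoded text does not.
message = "\u305c < 1".encode("iso-2022-jp")
print(quote_markup(message, "iso-2022-jp").decode("iso-2022-jp"))
```

Working on decoded text means a '<' byte inside a multi-byte sequence
is never mistaken for markup, while a true '<' is still quoted.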

> The Japanese codec is in a good state and will be easy enough to
> ship; the Chinese ones are only available in CVS that I know of, so
> we will need to make a proper distribution.

I'd encourage you to have a look at the iconv codec also. If the
system iconv is powerful enough (e.g. on Linux glibc), all encodings
of the world would be supported with that single codec.
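
Whichever codec package ends up installed, it is easy to check which
of the encodings named above a given Python can actually handle. This
is only a sketch using the standard `codecs.lookup` call; the function
name `available_codecs` is made up for illustration, and it says
nothing about how an iconv-based codec would register itself.

```python
import codecs


def available_codecs(names):
    """Map each encoding name to whether this Python can find a codec for it."""
    result = {}
    for name in names:
        try:
            codecs.lookup(name)
            result[name] = True
        except LookupError:
            # No codec registered under this name.
            result[name] = False
    return result


print(available_codecs(["iso-2022-jp", "gb2312", "big5", "euc-jp", "shift-jis"]))
```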