[Mailman-i18n] HTML entities (é) in es, it, no translations

Ben Gertzfield che@debian.org
Thu, 31 Jan 2002 23:08:34 +0900


>>>>> "Martin" == Martin von Loewis <loewis@informatik.hu-berlin.de> writes:

    Martin> To be really precise, HTML 4.01's "document character set"
    Martin> is the "Universal Character Set" (as defined in ISO
    Martin> 10646), see

Yes, many thanks.

    Martin> Python does not currently include codecs for iso-2022-jp,
    Martin> gb2312, big5, euc-jp, shift-jis. Since mailman leaves all
    Martin> strings as-is, and never mixes encodings, it can let them
    Martin> pass through unmodified. There are a number of pitfalls,
    Martin> though:

I have been working actively on these problems.  Hopefully we
can ship these codecs with Mailman 2.1.

    Martin>   This is probably an archive problem only; however, if
    Martin> mailman adds a footer, it will produce garbage if the
    Martin> footer encoding differs from the message body encoding.

    Martin> - To analyse the subject, Mailman needs to strip off the
    Martin> subject_prefix from the incoming message.

The subject and footer issue is a good one, and needs some work.

We basically need a map of charset -> localized "Re:" prefixes.  With
the new email module's i18n support, it's trivial to decode headers
and make sure we don't add a [PREFIX] to a message whose subject
already has Re: [PREFIX] in the local language.

I know German uses AW: -- does anyone have a list of commonly-used
response prefixes in other languages?  Japanese uses Re:, as far as I
know.
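
A rough sketch of that check, using the standard library's
email.header functions.  The REPLY_PREFIXES tuple and the function
names are made up for illustration; only "Re:" and "AW:" come from
this thread, and a real table would be keyed by charset or language.

```python
from email.header import decode_header, make_header

# Hypothetical prefix table; only "Re:" and "AW:" are mentioned in this thread.
REPLY_PREFIXES = ("Re:", "AW:")


def strip_reply_prefixes(subject, prefixes=REPLY_PREFIXES):
    """Remove any leading reply prefixes (case-insensitively), repeatedly."""
    subject = subject.strip()
    changed = True
    while changed:
        changed = False
        for prefix in prefixes:
            if subject.lower().startswith(prefix.lower()):
                subject = subject[len(prefix):].lstrip()
                changed = True
    return subject


def needs_prefix(raw_subject, list_prefix):
    """Return True if list_prefix should still be added to this subject.

    raw_subject may contain RFC 2047 encoded words; it is decoded to a
    str before the comparison.
    """
    subject = str(make_header(decode_header(raw_subject)))
    return not strip_reply_prefixes(subject).startswith(list_prefix)
```

So needs_prefix("Re: [Mailman-i18n] foo", "[Mailman-i18n]") says no
prefix is needed, and the same holds for an encoded "AW: [Mailman-i18n]"
subject once it has been decoded.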

As for the footer.. hm.  Needs more thought.  I doubt anyone wants to
add an attachment for the footer; I think the best thing to do would
be to look up the body's charset in a table and attach a properly
localized footer if it's found.  If it's not found, no footer is
attached.  If the charset is not specified, assume us-ascii.
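
Something like the following is what I have in mind.  This is only a
sketch: the FOOTERS table, its contents and the footer_for() name are
placeholders, not anything that exists in Mailman.

```python
from email import message_from_string

# Hypothetical table: charset name (lowercase) -> footer localized for it.
FOOTERS = {
    "us-ascii": "--\nTo unsubscribe, visit the list info page.\n",
    "iso-8859-1": "--\nZum Abbestellen bitte die Listenseite besuchen.\n",
}


def footer_for(msg):
    """Return the footer for msg's body charset, or None if unknown.

    A message that does not specify a charset is treated as us-ascii.
    """
    charset = msg.get_content_charset("us-ascii")
    return FOOTERS.get(charset)


def body_with_footer(msg):
    """Return the text body with a footer appended when one is known."""
    body = msg.get_payload()
    footer = footer_for(msg)
    if footer is None:
        return body  # unknown charset: attach nothing
    return body + "\n" + footer
```

Keying the table on charset rather than language keeps the footer and
the body in the same encoding, which is the whole point.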

What do you think?

    Martin> - To produce HTML pages, mailman needs to quote markup
    Martin> characters. For some encodings (e.g. iso-2022-jp), HTML
    Martin> markup characters such as '<' may also occur as part of the
    Martin> multi-byte encoding. For these encodings, mailman
    Martin> currently performs no quoting at all. This is incorrect if
    Martin> an iso-2022-jp message contains a true '<' character,
    Martin> which would need to be converted to '&lt;'.

I have written a Python module that deals with this problem directly
for iso-2022-jp.  The same result could be had by converting to
Unicode, doing the HTML escaping there, then converting back to the
output encoding.

http://nausicaa.interq.or.jp/mailman/JisEscape.py
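
The Unicode route would look roughly like this.  It is not what
JisEscape.py does; it assumes a Python with an iso-2022-jp codec
available, and escape_encoded_text() is just a name picked for the
example.

```python
import html


def escape_encoded_text(data, charset):
    """HTML-escape encoded bytes without touching multi-byte sequences.

    The bytes are decoded to Unicode first, so a '<' byte that is only
    part of a multi-byte character is never mistaken for markup.
    """
    text = data.decode(charset)
    escaped = html.escape(text, quote=False)  # escapes &, < and >
    return escaped.encode(charset)
```

For example, escape_encoded_text(raw_body, "iso-2022-jp") returns
iso-2022-jp bytes again, with only the true markup characters
replaced by entities.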

    Martin> I'd encourage you to have a look at the iconv codec
    Martin> also. If the system iconv is powerful enough (e.g. on
    Martin> Linux glibc), all encodings of the world would be
    Martin> supported with that single codec.

Ah, if only all systems had such an iconv codec.

I'm surprised iconv is so powerful on Linux glibc, yet gettext
does not support iso-2022-jp directly.

Ben

-- 
Brought to you by the letters M and W and the number 4.
"Ohhhh, Mentos Boy!"
Debian GNU/Linux maintainer of Gimp and Nethack -- http://www.debian.org/