[Mailman-i18n] HTML entities (&eacute;) in es, it, no
Thu, 31 Jan 2002 23:08:34 +0900
>>>>> "Martin" == Martin von Loewis <email@example.com> writes:
Martin> To be really precise, HTML 4.01's "document character set"
Martin> is the "Universal Character Set" (as defined in ISO
Martin> 10646), see
Yes, many thanks.
Martin> Python does not currently include codecs for iso-2022-jp,
Martin> gb2312, big5, euc-jp, shift-jis. Since mailman leaves all
Martin> strings as-is, and never mixes encodings, it can let them
Martin> pass through unmodified. There are a number of pitfalls,
I have been working actively on these problems. Hopefully we
can ship these codecs with Mailman 2.1.
Martin> This is probably an archive problem only; however, if
Martin> mailman adds a footer, it will produce garbage if the
Martin> footer encoding differs from the message body encoding.
Martin> - To analyse the subject, Mailman needs to strip off the
Martin> subject_prefix from the incoming message.
The subject and footer issue is a good one, and needs some work.
We basically need a map of charset -> localized "Re:" prefixes; with
the new email module's i18n support, it's trivial to decode headers
and make sure we don't add a [PREFIX] to a message with Re: [PREFIX]
in the local language.
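A rough sketch of that check, assuming a Python with the standard
email package. The REPLY_PREFIXES table and the function names are
made up for illustration, and the table entries are placeholders, not
a vetted list of what each language actually uses:

```python
from email.header import decode_header, make_header

# Hypothetical table: charset -> reply prefixes seen in that locale.
# The entries are illustrative only.
REPLY_PREFIXES = {
    "us-ascii": ["Re:"],
    "iso-8859-1": ["Re:", "AW:"],
    "iso-2022-jp": ["Re:"],
}


def decode_subject(raw_subject):
    """Decode RFC 2047 encoded words in a Subject: header to text."""
    return str(make_header(decode_header(raw_subject)))


def needs_prefix(subject, list_prefix, charset="us-ascii"):
    """True if list_prefix is absent once reply prefixes are stripped."""
    prefixes = REPLY_PREFIXES.get(charset.lower(), ["Re:"])
    rest = subject.strip()
    stripped = True
    while stripped:  # handles "Re: AW: Re: ..." chains
        stripped = False
        for p in prefixes:
            if rest.lower().startswith(p.lower()):
                rest = rest[len(p):].lstrip()
                stripped = True
    return not rest.startswith(list_prefix)


def prefixed_subject(raw_subject, list_prefix, charset="us-ascii"):
    """Return the decoded subject, adding list_prefix only if missing."""
    subject = decode_subject(raw_subject)
    if needs_prefix(subject, list_prefix, charset):
        return list_prefix + " " + subject
    return subject
```

With this, a reply whose subject is "AW: [Mailman-i18n] Hallo" is left
alone when the charset is iso-8859-1, while a fresh "Hello" gets the
prefix added.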
I know German uses AW: -- does anyone have a list of commonly-used
response prefixes in other languages? Japanese uses Re:, as far as I know.
As for the footer.. hm. Needs more thought. I doubt anyone wants to
add an attachment for the footer; I think the best thing to do would
be to look up the body's charset in a table and attach a properly
localized footer if it's found. If it's not found, no footer is
attached. If the charset is not specified, assume us-ascii.
What do you think?
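To make the footer proposal concrete, here is a minimal sketch. The
FOOTERS table, its contents, and append_footer are all hypothetical
names for illustration; it also assumes the body has already been
transfer-decoded to raw bytes in its declared charset, and that the
running Python has a codec for that charset:

```python
from typing import Optional

# Hypothetical table: charset -> footer text localized for that charset.
FOOTERS = {
    "us-ascii": "_______________\nMailman-i18n mailing list\n",
    "iso-8859-1": "_______________\nMailman-i18n Mailingliste\n",
    "iso-2022-jp": "_______________\nMailman-i18n メーリングリスト\n",
}


def append_footer(body: bytes, charset: Optional[str]) -> bytes:
    """Append a footer encoded in the body's own charset.

    An unspecified charset is treated as us-ascii; a charset with no
    entry in the table gets no footer at all.
    """
    key = (charset or "us-ascii").lower()
    footer = FOOTERS.get(key)
    if footer is None:
        return body
    return body + footer.encode(key)
```

Because the footer is encoded with the same charset as the body, the
two never mix encodings; a body in a charset missing from the table
passes through untouched.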
Martin> - To produce HTML pages, mailman needs to quote markup
Martin> characters. For some encodings (e.g. iso-2022-jp), HTML
Martin> markup characters such as '<' may also occur as part of the
Martin> multi-byte encoding. For these encodings, mailman
Martin> currently performs no quoting at all. This is incorrect if
Martin> an iso-2022-jp message contains a true '<' character,
Martin> which would need to be converted to '&lt;'.
I have written a Python module that deals with this problem directly
for iso-2022-jp; it could also be solved by converting to Unicode,
doing the HTML escaping there, then converting back to the output encoding.
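The Unicode round trip might look roughly like this. It is only a
sketch and not the module mentioned above: it assumes a Python whose
standard library has an iso-2022-jp codec and the html module, and
escape_for_html is a name invented for the example:

```python
import html


def escape_for_html(raw: bytes, charset: str = "iso-2022-jp") -> bytes:
    """Escape HTML markup in an encoded body without touching the
    bytes that belong to multi-byte characters."""
    text = raw.decode(charset)                 # bytes -> Unicode
    escaped = html.escape(text, quote=False)   # escapes &, <, > only
    return escaped.encode(charset)             # Unicode -> output bytes


# HIRAGANA ZE is encoded as the byte pair "$<" in iso-2022-jp, so the
# raw bytes contain a '<' that is not markup at all.
ze = "\u305c"
assert b"<" in ze.encode("iso-2022-jp")

raw = (ze + " < 1").encode("iso-2022-jp")
out = escape_for_html(raw)
assert out.decode("iso-2022-jp") == ze + " &lt; 1"
```

Escaping at the byte level would either skip the true '<' or corrupt
the two-byte character; doing it on the decoded text avoids both.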
Martin> I'd encourage you to have a look at the iconv codec
Martin> also. If the system iconv is powerful enough (e.g. on
Martin> Linux glibc), all encodings of the world would be
Martin> supported with that single codec.
Ah, if only all systems had such an iconv codec.
I'm surprised iconv is so powerful on Linux glibc, yet gettext
does not support iso-2022-jp directly.
Brought to you by the letters M and W and the number 4.
"Ohhhh, Mentos Boy!"
Debian GNU/Linux maintainer of Gimp and Nethack -- http://www.debian.org/