[Mailman-i18n] Subject lines in Archives

Martin v. Löwis loewis@informatik.hu-berlin.de
02 Apr 2002 10:59:57 +0200


barry@zope.com (Barry A. Warsaw) writes:

> Thanks, this patch applies cleanly to MM2.1 cvs, so I would like to
> get some feedback from you folks as to whether I should commit this.
> I'm currently in the process of running these changes over a capture
> of the python-list mbox file, but if anybody's got a better (read:
> smaller :) sample mbox -- with lots of funky charset combinations -- I
> could test this on, I'd appreciate it.

I have revised the patch on SF to fix the problems Stefan found: it
now catches lookup errors, produces proper prev/next subjects, and
produces a proper <title>.
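
To illustrate what "catching lookup errors" means here, the following
is a minimal sketch, not the code from the patch. It assumes a modern
Python and the standard email package; decode_subject is a
hypothetical helper name. A header whose charset has no installed
codec is returned unchanged instead of raising.

```python
from email.errors import HeaderParseError
from email.header import decode_header


def decode_subject(raw):
    """Decode an RFC 2047 header, or return it unchanged on failure.

    Hypothetical helper for illustration only.
    """
    try:
        parts = []
        for data, charset in decode_header(raw):
            # decode_header yields bytes for encoded words and may
            # yield str when the header has no encoded words at all.
            if isinstance(data, bytes):
                data = data.decode(charset or "us-ascii")
            parts.append(data)
        return "".join(parts)
    except (LookupError, UnicodeError, HeaderParseError):
        # LookupError: no codec is registered for this charset name.
        # The raw MIME text is kept rather than aborting the archive run.
        return raw
```

With this, a subject in iso-8859-1 or utf-8 decodes normally, while
one in an unknown charset such as x-mvl stays in its raw MIME form.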

I have also collected messages with funny charsets from various
archives, and combined them to a small mailbox at

http://www.informatik.hu-berlin.de/~loewis/test.mbox

With this, you should be able to observe the following effects:

- when reading the mailbox in current Mailman, the index will be in
  windows-1257, and a lot of MIME-encoded text will show up as garbage

- with my patch applied, the utf-8 and iso-8859-1 parts of it will
  become readable. Japanese and Korean text (in the names of two
  message authors) will remain unreadable.

- when the Japanese MIME charset names are made available, the Japanese
  name will become readable (to those who can read Japanese, that is)

- when adding the Korean codecs, the Korean name will also become
  readable

- in all cases, the subject encoded as x-mvl will remain MIME garbage.

I've changed the Date: fields of all the messages so that they appear
in a single month. Adding messages to the archive in Jan 2001 might
shift the balance of encodings, so that windows-1257 loses its
majority. That should have no effect on the rendering of the index.
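
The remark about windows-1257 "losing majority" suggests that the
index charset is chosen as the most frequent charset among the
archived messages. The sketch below only illustrates that idea; it is
an assumption inferred from the paragraph above, not Mailman's actual
code, and majority_charset is a hypothetical name.

```python
from collections import Counter


def majority_charset(charsets, default="us-ascii"):
    """Return the most frequent charset name, ignoring missing ones.

    Hypothetical helper; the majority rule itself is an assumption.
    """
    counts = Counter(c.lower() for c in charsets if c)
    if not counts:
        return default
    return counts.most_common(1)[0][0]


month = ["windows-1257", "windows-1257", "utf-8"]
print(majority_charset(month))

# Adding more messages in another charset shifts the balance.
print(majority_charset(month + ["utf-8", "utf-8"]))
```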

I don't have permission from any of the message authors, so please
ignore the actual content of their messages :-)

> I'm sure Tamito KAJIYAMA would be open to suggestions.  Otherwise, let
> me know what I'd need to add to MM's copy of the Japanese codecs
> package.

I've talked to Tamito, and he said he'll change it, although it is
not yet clear how. It seems clear that explicit action will
be needed (unless .pth files in pythonlib are considered from site.py,
which I doubt).

Alternatively, and independently, please consider the patch at

http://sourceforge.net/tracker/?func=detail&aid=538185&group_id=103&atid=300103

It registers the common aliases for the Japanese encodings, and maps
them to the japanese package. This code could go anywhere you like,
provided that importing HyperArch triggers its execution. Note that
this will override any existing codecs with these names (cp932,
iso-2022-jp, etc). For Mailman, I'd consider this a good thing, since
it will provide better reproducibility of results.
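
As a rough illustration of registering alias names for a codec, here
is a sketch using codecs.register from the standard library. It is not
the patch itself: it assumes a modern Python, where shift_jis is built
in, and the alias "x-demo-sjis" and the names ALIASES and
search_alias are made up for the example. It only adds a name that no
other codec claims, and does not show the overriding behaviour
described above.

```python
import codecs

# Hypothetical alias table: alias name -> name of an installed codec.
ALIASES = {"x_demo_sjis": "shift_jis"}


def search_alias(name):
    """Codec search function: resolve known aliases, else return None."""
    # Depending on the Python version, the name may arrive with
    # hyphens or with underscores, so normalize before the lookup.
    key = name.lower().replace("-", "_")
    target = ALIASES.get(key)
    if target is None:
        return None
    return codecs.lookup(target)


# Must run at import time of some module that is always loaded.
codecs.register(search_alias)

data = "\u65e5\u672c".encode("shift_jis")
print(data.decode("x-demo-sjis"))
```

Returning None for unknown names lets the lookup continue with any
other registered search functions.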

Regards,
Martin