Bugs item #449677, was opened at 2001-08-10 03:25
Message generated for change (Comment added) made by cfaerber
You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=100103&aid=449677&...

Category: Pipermail
Group: 2.0.x
Status: Open
Resolution: None
Priority: 1
Submitted By: Ben Gertzfield (che_fox)
Assigned to: Nobody/Anonymous (nobody)
Summary: HyperArch.py assumes charsets are in \w+
Initial Comment: Using Mailman 2.0.6, I noticed that Japanese messages in charset iso-2022-jp are not archived correctly; their subject lines stay in MIME-encoded format, like
Subject: =?ISO-2022-JP?B?WxskQiQqTD5BMBsoQi5jb21dLmNvbS8uag==?=
etc.
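For reference, a minimal sketch of how such an encoded subject can be inspected with the standard library of a modern Python 3 (not the Python that Mailman 2.0.x ran on, and not the code path HyperArch.py uses). It only shows that the encoded word carries its own charset label, which contains hyphens:

```python
from email.header import decode_header

# Encoded subject copied verbatim from the report above.
subject = "=?ISO-2022-JP?B?WxskQiQqTD5BMBsoQi5jb21dLmNvbS8uag==?="

# decode_header returns a list of (payload, charset) pairs.
payload, charset = decode_header(subject)[0]

print(charset)  # the charset label taken from the encoded word
# Decode the raw bytes using that label; errors="replace" keeps the
# sketch from raising if the payload were malformed.
print(payload.decode(charset, errors="replace"))
```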
I tracked this down to the following line in HyperArch.py:158
# content-type charset
rx_charset = re.compile('charset="(\w+)"')
This is incorrect. According to the de facto list of charsets at http://www.iana.org/assignments/character-sets, charset names can contain characters outside [a-zA-Z0-9_], such as - ( ) : and . So we must accept anything between the quotes with a non-greedy .+? match instead of requiring \w+. Patch attached, against Mailman 2.0.6.
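A small sketch of the difference, written for a modern Python 3. The first pattern is the one quoted from HyperArch.py; the second is an assumption based on the description above, since the text of the attached patch is not shown in this thread. The names rx_word, rx_fuzzy and header are hypothetical:

```python
import re

# Pattern quoted from HyperArch.py in Mailman 2.0.6.
rx_word = re.compile(r'charset="(\w+)"')
# Non-greedy variant as described in the report (assumed, not the actual patch).
rx_fuzzy = re.compile(r'charset="(.+?)"')

header = 'text/plain; charset="iso-2022-jp"'

# \w does not match '-', so the original pattern finds nothing here.
print(rx_word.search(header))
# The non-greedy pattern captures everything up to the closing quote.
print(rx_fuzzy.search(header).group(1))
```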
----------------------------------------------------------------------
Comment By: Claus Färber (cfaerber) Date: 2004-02-03 01:16
Message: Logged In: YES user_id=126984
This is still incorrect: The quotes are not required. There can be whitespace and folding around the "=" sign. A string matching 'charset="(\w+)"' can occur within other parameters.
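One way to cover these cases is to parse the header parameters instead of pattern-matching the raw string. The following is only a sketch using the email package of a modern Python 3, which is not what Mailman 2.0.x shipped with; charset_of and the sample headers are hypothetical names made up for illustration:

```python
from email import message_from_string

def charset_of(raw_header):
    """Return the charset parameter of a Content-Type header line, or None."""
    msg = message_from_string(raw_header + "\n\n")
    return msg.get_content_charset()

# No quotes around the value.
unquoted = "Content-Type: text/plain; charset=ISO-2022-JP"
# Folded header with whitespace around the "=" sign.
folded = 'Content-Type: text/plain;\n charset = "iso-2022-jp"'
# Another parameter whose text also matches 'charset="..."'.
decoy = 'Content-Type: text/plain; x-charset="bogus"; charset=iso-2022-jp'

for sample in (unquoted, folded, decoy):
    print(charset_of(sample))
```

A plain regular-expression search for charset="..." would miss the first sample and pick up the x-charset value in the third, whereas the parameter parser looks up the parameter by its exact name.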
----------------------------------------------------------------------
Comment By: Ben Gertzfield (che_fox) Date: 2001-08-10 04:53
Message: Logged In: YES user_id=89313
Note: this is the solution to the problem in Bug #431511.
----------------------------------------------------------------------