[Mailman-Users] A scrubber issue

Tokio Kikuchi tkikuchi at is.kochi-u.ac.jp
Sun Dec 10 02:48:16 CET 2006


I'm OK with changing the recomposing part in Scrubber.py:

from

    if not part or part.is_multipart():

to

    if part.is_multipart():

It looks like the email package is more robust than it was when the bug 
report was issued and the Scrubber code was patched.
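
A minimal sketch of how the two conditions can differ, assuming the part
in question is an email Message object that carries no headers at all.
Message defines __len__ as its header count, so such a part is falsy.
The names old_skips and new_skips are hypothetical and only illustrate
the test; this is not the Scrubber code itself.

```python
from email.message import Message

# Hypothetical part with a body but no headers (e.g. no Content-Type).
part = Message()
part.set_payload("plain body text")

# Message.__len__ returns the number of headers, so a header-less
# part evaluates as false in a boolean context.
header_count = len(part)

# Old test: true for the header-less part, even though it has a body.
old_skips = (not part) or part.is_multipart()

# New test: true only for multipart containers.
new_skips = part.is_multipart()

print(header_count, old_skips, new_skips)
```

With the first condition a header-less part is treated the same way as a
multipart container; with the second it is not.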

As for the default charset being 'us-ascii': if we merge such a part 
with the other parts, text in some languages (such as Japanese) becomes 
irreversibly unreadable.  It is safer to keep it in a separate file if 
you can't archive the whole message as multipart, as in Pipermail.
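
To illustrate the irreversibility, here is a small sketch. It assumes a
part whose bytes are really EUC-JP but which carries no charset
parameter, and a decoder that falls back to us-ascii with the 'replace'
error handler. The sample text and variable names are hypothetical.

```python
# Hypothetical Japanese text as it might arrive in a part
# with no charset parameter.
text = "日本語"
raw = text.encode("euc-jp")

# Treating the bytes as us-ascii and replacing undecodable bytes
# turns every byte into U+FFFD; the original bytes are gone.
assumed = raw.decode("us-ascii", "replace")
roundtrip = assumed.encode("us-ascii", "replace")

# Decoding with the real charset still works on the untouched bytes.
recovered = raw.decode("euc-jp")

print(repr(assumed))
print(roundtrip == raw)
print(recovered == text)
```

Keeping the raw bytes in a separate file leaves the last step possible
for a reader who knows or can guess the real charset.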

Additionally, the diff file which was said to be lost in the first post 
is in:
I believe the folks at gnupg.org can fix the reference in the pipermail 
archive by fixing PUBLIC_ARCHIVE_URL in mm_cfg.py and regenerating 
the archive with the bin/arch --wipe command.

Mark Sapiro wrote:
> Todd Zullinger wrote:
>> Related to the second part of Werner's message being scrubbed with the
>> message:
>>    An embedded and charset-unspecified text was scrubbed...
>> Poking in the email package (on python 2.4.4) shows:
>>    def get_content_charset(self, failobj=None):
>>        """Return the charset parameter of the Content-Type header.
>>        The returned string is always coerced to lower case.  If there is no
>>        Content-Type header, or if that header has no charset parameter,
>>        failobj is returned.
>>        """
>> This seems to violate section 5.2 of RFC 2045 which says parts lacking
>> a Content-type header should be assumed to be text/plain with a
>> charset of us-ascii.  The get_content_type method in email.Message
>> does mention RFC 2045 and uses text/plain if the content-type is
>> invalid.
> It does seem inconsistent, but I don't think we can call it a violation
> of the RFC yet, it depends on what the caller does with it.
>> Would it be appropriate to set failobj="us-ascii" when
>> calling this method in Scrubber.py?
> It might be, but I'd like to hear from Tokio first.
> Clearly this was considered at one point as a specific case and a message
> exists for it, where it would have been simpler to just assume it is
> us-ascii. Thus, I think there must be messages in the wild with parts
> with unspecified character sets that aren't us-ascii.
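
The behaviour discussed in the quoted text can be checked with a short
sketch like the following. It uses the current standard-library email
package rather than the Python 2.4.4 one mentioned above, and the sample
messages are made up for illustration.

```python
from email import message_from_string

# Hypothetical message with no Content-Type header at all.
bare = message_from_string("Subject: test\n\nbody text\n")

# Hypothetical message that does declare a charset.
tagged = message_from_string(
    "Content-Type: text/plain; charset=ISO-2022-JP\n\nbody text\n"
)

bare_type = bare.get_content_type()                   # RFC 2045 default type
bare_charset = bare.get_content_charset()             # failobj defaults to None
bare_fallback = bare.get_content_charset("us-ascii")  # caller-supplied failobj
tagged_charset = tagged.get_content_charset()         # coerced to lower case

print(bare_type, bare_charset, bare_fallback, tagged_charset)
```

Passing failobj only changes what the caller sees for parts without a
charset parameter; whether 'us-ascii' is the right guess for those parts
is the open question in this thread.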

Tokio Kikuchi, tkikuchi at is.kochi-u.ac.jp
