unicode / archive problem revisited
Howdy. I am currently running 2.1b5 of Mailman and am trying to sort out an issue with archiving that has crept up.
The problem has been mentioned previously from what I can tell but no resolution seems to have been mentioned.
The problem is that the list archives (for reasons I won't bore you with) contain a number of SPAM messages with all sorts of random encoding types and other mangled garbage. When the archiver gets to the point of writing the archive, the encoding type test raises an error and the whole archiving process grinds to a halt. These are busy lists, and the mbox archive takes a very long time to parse. There just is not enough time in the day to search for the offending message, chop it out, and wait another 45 minutes or more for the archives to regenerate, only to hit the next garbled header. This will also continue to be a problem if any future SPAM messages sneak in via forged headers.
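One way to avoid the edit-and-regenerate loop might be to scan the mbox once and list every suspect message up front. The following is only a rough sketch in modern Python 3, not anything Mailman ships: the helper name `find_suspect_messages` is made up for illustration, and it assumes the trouble shows up as a missing or unrecognised `charset` parameter on the message's Content-Type header, which may not match how Pipermail actually derives `article.charset`.

```python
import codecs
import mailbox


def find_suspect_messages(mbox_path):
    """Yield (index, subject, charset) for messages whose declared
    charset is missing or is not a codec name Python recognises.

    Hypothetical helper: the criteria here are an assumption about
    what trips the archiver, not Mailman's own logic.
    """
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for index, msg in enumerate(box):
            # None when the Content-Type header declares no charset.
            charset = msg.get_content_charset()
            try:
                if charset is None:
                    raise LookupError("no charset declared")
                codecs.lookup(charset)  # LookupError for unknown names
            except (LookupError, ValueError):
                yield index, str(msg.get("Subject", "")), charset
    finally:
        box.close()


if __name__ == "__main__":
    import sys

    for index, subject, charset in find_suspect_messages(sys.argv[1]):
        print(index, repr(charset), subject)
```

Plain-text messages that simply omit the charset are perfectly legal, so this will probably over-report; it is meant as a starting list to inspect, not a list of messages to delete.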
The issue appears to be with:
/usr/local/mailman/Mailman/Archiver/HyperArch.py
Traceback (most recent call last):
  File "./bin/arch", line 173, in ?
    main()
  File "./bin/arch", line 163, in main
    archiver.close()
  File "/usr/local/mailman/Mailman/Archiver/pipermail.py", line 303, in close
    self.update_dirty_archives()
  File "/usr/local/mailman/Mailman/Archiver/pipermail.py", line 517, in update_dirty_archives
    self.update_archive(i)
  File "/usr/local/mailman/Mailman/Archiver/HyperArch.py", line 1058, in update_archive
    self.__super_update_archive(archive)
  File "/usr/local/mailman/Mailman/Archiver/pipermail.py", line 423, in update_archive
    self._update_simple_index(hdr, archive, arcdir)
  File "/usr/local/mailman/Mailman/Archiver/pipermail.py", line 444, in _update_simple_index
    self.write_index_entry(article)
  File "/usr/local/mailman/Mailman/Archiver/HyperArch.py", line 980, in write_index_entry
    subject = self.get_header("subject", article)
  File "/usr/local/mailman/Mailman/Archiver/HyperArch.py", line 1007, in get_header
    return unicode(result, article.charset)
TypeError: unicode() argument 2 must be string, not None
What I want is for the archiver to fall back to a default (English) charset if it cannot figure out the encoding, so that at least the archiver will not die.
So two questions:
What is a valid encoding type to pass as default to the unicode call?
Secondly, is there any danger in changing the fallback option to always use a
specific charset? I'd rather have gibberish than a process that dies.
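For what it's worth, the first question can be explored from a Python prompt. This is a small sketch in Python 3 syntax (the original code is Python 2), and the helper `can_decode` is a made-up name for illustration. It checks that a few candidate names are known codecs and compares how each copes with arbitrary bytes. The relevant property is that `iso-8859-1` maps every possible byte value to a character, so decoding with it cannot fail, whereas `us-ascii` and `utf-8` reject some byte sequences unless an error handler is given.

```python
import codecs

# Candidate fallback names; codecs.lookup() raises LookupError for
# any name Python does not recognise.
for name in ("us-ascii", "iso-8859-1", "utf-8"):
    print(name, "->", codecs.lookup(name).name)


def can_decode(data, charset):
    """Return True if `data` decodes under `charset` with strict errors."""
    try:
        data.decode(charset)
        return True
    except UnicodeDecodeError:
        return False


# Every possible byte value, as a stand-in for mangled header bytes.
all_bytes = bytes(range(256))

for name in ("us-ascii", "iso-8859-1", "utf-8"):
    print(name, "handles arbitrary bytes:", can_decode(all_bytes, name))
```

The trade-off with a fixed fallback is the one already described: headers that were really in some other charset come out as gibberish, but nothing raises.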
Basically, around line 1007 in "/usr/local/mailman/Mailman/Archiver/HyperArch.py" I want to change:
    if isinstance(result, types.UnicodeType):
        return result
    try:
        return unicode(result, article.charset)
to
    if isinstance(result, types.UnicodeType):
        return result
    try:
        return unicode(result, "some string") # never fail!
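A slightly gentler variant of that change would keep the message's own charset when it is usable and only fall back when it is missing or unknown. The sketch below is written in Python 3 and is not the Mailman code: `safe_decode` and its `fallback` default are names invented here for illustration. In Python 2, which Mailman 2.1 runs on, the corresponding call would be `unicode(result, charset, 'replace')`.

```python
def safe_decode(raw, charset, fallback="iso-8859-1"):
    """Decode header bytes without ever raising.

    Hypothetical helper, not part of Mailman. `charset` may be None,
    empty, or a name Python does not know.
    """
    if isinstance(raw, str):
        # Already text; nothing to decode.
        return raw
    try:
        # "replace" substitutes U+FFFD for undecodable bytes instead
        # of raising UnicodeDecodeError.
        return raw.decode(charset or fallback, "replace")
    except (LookupError, ValueError):
        # Unknown or malformed codec name: use the fallback charset.
        return raw.decode(fallback, "replace")


if __name__ == "__main__":
    print(safe_decode(b"caf\xe9", None))         # no charset declared
    print(safe_decode(b"hello", "x-made-up"))    # unknown charset name
    print(safe_decode(b"caf\xc3\xa9", "utf-8"))  # charset is usable
```

Hard-coding one charset for every message, as in the change above, is simpler but would also garble headers whose declared charset was actually correct.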
Thanks for any suggestions.
Cheers
"RB" == Ron Brogden <rb@islandnet.com> writes:
RB> Howdy. I am currently running 2.1b5 of Mailman and am trying
RB> to sort out an issue with archiving that has crept up.
RB> The problem has been mentioned previously from what I can tell
RB> but no resolution seems to have been mentioned.
RB> What the problem is that list archives (for reasons I won't
RB> bore you with) have a number of SPAM message in them with all
RB> sorts of random encoding types and other mangled garbage.
RB> What happens is that when the archiver gets to the point of
RB> writing the archive, the encoding type test generates an error
RB> and the whole archiving process grinds to a crashing halt.
RB> These are busy lists and the mbox archive takes a very long
RB> time to parse and there just is not enough time in the day to
RB> search for the offending message, chop it out and wait another
RB> 45 minutes or more until the archives are regenerated to hit
RB> the next garbled header, etc. This will also continue to be a
RB> problem if any future SPAM messages sneak in via forged
RB> headers, etc.
Could you send me a sample of an offending message, as an attachment? Better yet, file a bug report on SourceForge and upload (don't paste!) that same message to the bug report.
Thanks, -Barry
Participants (2): barry@python.org, Ron Brogden