[Mailman-Users] Dealing with multiple charsets (list messages and web archive)

Sun May 11 18:55:57 CEST 2008

Stefan Förster wrote:
>
>So, now that I have a temporary fix - how can I recompose a message
>and sort the attachments? I don't mind if I have to code this myself,
>I would just appreciate a hint on where to start. For now, a single
>algorithm like (pseudo code):
>
>,----[ resort message parts ]
>| init list_of_text_parts = empty;
>| init list_of_NON_text_parts = empty;
>| init new_message = empty;
>|
>| for part in msg.walk();
>|     if part.get_content_type() <> 'text/plain'
>|         list_of_NON_text_parts.add(part);
>|     else
>|          list_of_text_parts.add(part);
>|
>| for part in list_of_text_parts.walk();
>|     message.append(part);
>|
>| for part in list_of_NON_text_parts.walk()
>|     message.append(part);
>`----
>
>would absolutely be sufficient. I'm just not familiar enough with
>Mailman yet to know exactly where to add this code (I don't speak a
>single line of Python yet, either, but what I wanna do is not really
>rocket science). Anyways, any help you could give me on that subject
>would be greatly appreciated.

What you have above is too simple, even as pseudo code.

What you want is more like the attached flatten.py.txt file (.txt added
for content filtering). Note that this is far from production quality
and probably doesn't even work on some messages.

Problems I am aware of include:

- no i18n for canned text strings

- signatures will get broken

- with multipart/alternative, the text/plain part will be aggregated
with the other text/plain parts and the text/html or other
alternatives will be separately attached.

- text/plain parts without a specified charset will not be aggregated
but will be separately attached. This is a difficult issue because
many mainstream MUAs will attach an arbitrary .txt attachment without
specifying a charset. If you then assume it is, say, iso-8859-1 and
convert it to unicode and in fact it was euc-jp or koi8-r or even
utf-8, you can garble it irreversibly.
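
To illustrate the charset problem, here is a minimal sketch in Python 3
(not the Python 2 that Mailman 2.1 runs on). The sample text and the
variable names are made up for the illustration; only the standard
str/bytes codecs are used.

```python
# Hypothetical example text; any non-ASCII string shows the same effect.
original = "Förster"

# Bytes of a .txt attachment that arrived without a charset parameter.
raw = original.encode("utf-8")

# Wrong guess. iso-8859-1 maps every byte to some character, so this
# never raises and nothing signals the mistake.
guessed = raw.decode("iso-8859-1")

# The aggregated part is then re-encoded and labeled utf-8, which bakes
# the wrong characters into the message.
flattened = guessed.encode("utf-8")

# guessed is now "FÃ¶rster", and flattened decodes cleanly as utf-8 to
# that same wrong text.
```

A reader's MUA sees a valid utf-8 part and has nothing to tell it that
the text was decoded with the wrong charset along the way.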

flatten.py is written so that it could be installed as is in Mailman as
a custom Handler. See
<http://www.python.org/cgi-bin/faqw-mm.py?req=show&file=faq04.067.htp>.

Note that this will not address separate attachment of headers and
footers. If the resultant 'flattened' message is multipart for any
reason, msg_header and msg_footer will still be attached as separate
MIME parts.

The basic flow of the process is:

If this is not a multipart message, do nothing.

Walk through the message, making two lists of elemental parts:
  plain_parts are those text/plain parts with a known character set;
  other_parts are the rest.

If there were no plain_parts, make a Unicode text part that says so,
otherwise convert all the plain_parts to Unicode and string them
together with a separator to make a text part.

If there were no other_parts, make the message a single-part text/plain
message with the text payload utf-8 encoded; otherwise make a
multipart/mixed message with a text/plain part with the text payload
utf-8 encoded, followed by all the other_parts.
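
The flow above can be sketched roughly as follows. This is not the
attached flatten.py; it is a minimal illustration written for Python 3
and the standard email package, so it would not drop into Mailman 2.1
as is. The names flatten, SEPARATOR and NO_TEXT_NOTICE are invented for
the sketch, and it returns a new message rather than changing the
original in place.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Hypothetical canned strings (no i18n, as noted in the problem list).
SEPARATOR = "\n" + "-" * 40 + "\n"
NO_TEXT_NOTICE = "[No text/plain part with a known charset was found.]\n"

# Headers that describe the old body and must not be copied over.
BODY_HEADERS = ("content-type", "content-transfer-encoding", "mime-version")


def flatten(msg):
    """Return a flattened copy of msg, or msg itself if not multipart."""
    if not msg.is_multipart():
        return msg

    plain_parts = []   # decoded text of text/plain parts with a charset
    other_parts = []   # every other leaf part, kept as a message object
    for part in msg.walk():
        if part.is_multipart():
            continue   # containers are skipped; walk() visits their children
        charset = part.get_content_charset()
        if part.get_content_type() == "text/plain" and charset:
            try:
                text = part.get_payload(decode=True).decode(charset)
            except (LookupError, UnicodeError):
                # Unknown or wrong charset label: attach rather than guess.
                other_parts.append(part)
            else:
                plain_parts.append(text)
        else:
            other_parts.append(part)

    body = SEPARATOR.join(plain_parts) if plain_parts else NO_TEXT_NOTICE
    text_part = MIMEText(body, "plain", "utf-8")

    if other_parts:
        new_msg = MIMEMultipart("mixed")
        new_msg.attach(text_part)
        for part in other_parts:
            new_msg.attach(part)
    else:
        new_msg = text_part

    # Carry over the original headers except the body-describing ones.
    for name, value in msg.items():
        if name.lower() not in BODY_HEADERS:
            new_msg[name] = value
    return new_msg
```

Note that walk() also descends into attached message/rfc822 parts, so
this sketch would pull their text into the aggregate as well; whether
that is wanted is a separate decision.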

-- 
Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan