[Mailman-Users] Dealing with multiple charsets (list messages and web archive)
Mark Sapiro
mark at msapiro.net
Sun May 11 19:19:27 CEST 2008
Mark Sapiro wrote:
>
>What you want is more like the attached flatten.py.txt file (.txt added
>for content filtering).
Sorry, I forgot the attachment. Here it is.
>Note that this is far from production quality
>and probably doesn't even work on some messages.
>
>Problems I am aware of are things like
>
>- no i18n for canned text strings
>
>- signatures will get broken
>
>- with multipart/alternative, the text/plain part will be aggregated
>with the other text/plain parts and the text/html or other
>alternatives will be separately attached.
>
>- text/plain parts without a specified charset will not be aggregated
>but will be separately attached. This is a difficult issue because
>many mainstream MUAs will attach an arbitrary .txt attachment without
>specifying a charset. If you then assume it is, say, iso-8859-1 and
>convert it to Unicode when in fact it was euc-jp or koi8-r or even
>utf-8, you can garble it irreversibly.
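
To illustrate the charset-guessing problem, here is a minimal sketch in modern Python 3 (not part of flatten.py; the helper name decode_with_guess and the sample string are made up for the example). It shows that decoding with a wrong guess of iso-8859-1 never raises an error, so nothing warns you that the text has been garbled.

```python
def decode_with_guess(raw, guessed_charset):
    """Decode bytes using a guessed charset (hypothetical helper)."""
    return raw.decode(guessed_charset)


original = "こんにちは"               # sample text, really encoded as euc-jp
raw = original.encode("euc-jp")      # what an MUA might attach with no charset

# iso-8859-1 maps every byte value to a character, so a wrong guess
# "succeeds" silently and produces nonsense instead of failing.
garbled = decode_with_guess(raw, "iso-8859-1")
print(garbled == original)           # False

# A stricter codec at least reports that the guess was wrong.
try:
    decode_with_guess(raw, "utf-8")
    guess_rejected = False
except UnicodeDecodeError:
    guess_rejected = True
print(guess_rejected)                # True
```

This is why leaving such parts as separate attachments is the conservative choice: the bytes reach the recipient untouched.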
>
>flatten.py is written so that it could be installed as is in Mailman as
>a custom Handler. See
><http://www.python.org/cgi-bin/faqw-mm.py?req=show&file=faq04.067.htp>.
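
For readers who have not seen a custom Handler, the skeleton below is a rough sketch of the shape such a module takes, assuming the Mailman 2.1 convention of a module-level process(mlist, msg, msgdata) function. It is not the attached flatten.py, and the msgdata key used here is hypothetical, added only so the sketch does something observable.

```python
# Sketch of a Mailman 2.1 style handler module (assumed convention:
# the pipeline calls process(mlist, msg, msgdata) for each message).

def process(mlist, msg, msgdata):
    # Non-multipart messages are left alone.
    if not msg.is_multipart():
        return
    # A real handler would rewrite msg in place at this point.
    # Hypothetical marker, for illustration only:
    msgdata["flatten_sketch_saw_multipart"] = True
```

The handler is then enabled by adding its module name to the pipeline as described in the FAQ entry above.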
>
>Note that this will not address separate attachment of headers and
>footers. If the resultant 'flattened' message is multipart for any
>reason, msg_header and msg_footer will still be attached as separate
>MIME parts.
>
>The basic flow in the process is
>
>If this is not a multipart message do nothing.
>
>Walk through the message making two lists of elemental parts
> plain_parts are those text/plain parts with known character set
> other_parts are the rest.
>
>If there were no plain_parts, make a Unicode text part that says so,
>otherwise convert all the plain_parts to Unicode and string them
>together with a separator to make a text part.
>
>If there were no other_parts make the message a single part text/plain
>message with the text payload utf-8 encoded, else make a
>multipart/mixed message with a text/plain part with the text payload
>utf-8 encoded followed by all the other_parts.
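
The flow above can be sketched roughly as follows. This is an illustration written against the Python 3 standard library email package, not the attached flatten.py (which targets Mailman's own Python). The names flatten, known_charset, SEPARATOR and NO_TEXT_NOTICE are assumptions made for the example, and unlike a real Handler it returns a new message rather than modifying the original in place.

```python
import codecs
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

SEPARATOR = "\n" + "-" * 40 + "\n"   # hypothetical separator text
NO_TEXT_NOTICE = "[No text/plain parts with a known charset were found.]"
MIME_HEADERS = ("content-type", "content-transfer-encoding", "mime-version")


def known_charset(part):
    """Return the part's declared charset if Python knows it, else None."""
    charset = part.get_content_charset()
    if not charset:
        return None
    try:
        codecs.lookup(charset)
    except LookupError:
        return None
    return charset


def flatten(msg):
    # If this is not a multipart message, do nothing.
    if not msg.is_multipart():
        return msg

    # Walk the message, sorting the elemental (leaf) parts into two lists.
    plain_parts, other_parts = [], []
    for part in msg.walk():
        if part.is_multipart():
            continue
        if part.get_content_type() == "text/plain" and known_charset(part):
            plain_parts.append(part)
        else:
            other_parts.append(part)

    # Convert the plain parts to Unicode and join them, or say there were none.
    if plain_parts:
        texts = [
            p.get_payload(decode=True).decode(known_charset(p), errors="replace")
            for p in plain_parts
        ]
        text = SEPARATOR.join(texts)
    else:
        text = NO_TEXT_NOTICE
    text_part = MIMEText(text, "plain", "utf-8")

    # Single text/plain message, or multipart/mixed with the other parts.
    if other_parts:
        new = MIMEMultipart("mixed")
        new.attach(text_part)
        for part in other_parts:
            new.attach(part)
    else:
        new = text_part

    # Carry over the non-MIME headers (Subject, From, and so on).
    for name, value in msg.items():
        if name.lower() not in MIME_HEADERS:
            new[name] = value
    return new
```

As in the description, a text/plain part with no declared charset ends up in other_parts and is attached separately rather than decoded on a guess.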
--
Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: flatten.py.txt
URL: <http://mail.python.org/pipermail/mailman-users/attachments/20080511/3b5f8ce4/attachment.txt>