phd at phd.pp.ru
Fri Oct 12 16:31:10 CEST 2001
WHAT IS IT
Mail users, especially in non-English-speaking countries, often find that mail
messages arrive in different formats, with different content types, in
different encodings and charsets. Usually this is very good because it allows
us to use appropriate formats/encodings/whatever. Sometimes, though, some
unification is desirable. For example, one may want to put mail messages into
an archive, make HTML indices, run a search indexer, etc. In such situations
converting messages to text in one character set and skipping some binary
attachments is highly desirable.
Here is the solution - mimedecode.py.
This is a program to decode MIME messages. The program expects one input
file (either on the command line or on stdin), which is treated as an RFC822
message and decoded to stdout. If the file is not an RFC822 message, it is
just piped to stdout one-to-one. If it is a simple RFC822 message, it is
decoded as one part. If it is a MIME message with multiple parts
("attachments"), all parts are decoded. Decoding can be controlled by
command-line options.
WHERE TO GET
Master site: http://phd.pp.ru/Software/Python/#mimedecode
Faster mirror: http://phd.by.ru/Software/Python/#mimedecode
Requires: Python 2.0+, configured mailcap database.
Documentation (also included in the package):
Oleg Broytmann <phd at phd.pp.ru>
Copyright (C) 2001 PhiloSoft Design
mimedecode.py - decode MIME message.
mimedecode.py [-h|--help] [-V|--version] [-cCfFsS] [-beit mask] [filename]
First, the Subject and Content-Disposition headers are examined. If either
of them exists, it is decoded according to RFC2047. The Content-Disposition
header itself is not decoded, only its "filename" parameter. Encoding header
parameters this way is in violation of the RFC, but widely deployed anyway,
especially in the M$ Ophice GUI (often referred to as "Windoze") world, where
programmers are usually ignorant lamers who have never even heard about RFCs.
The correct parameter encoding is specified by RFC2231. This program decodes
RFC2231-encoded parameters; continuation parameters (header*1, header*2,
etc.) are not yet supported.
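
The header step above can be sketched with the email package of a current
Python 3. This is only an illustration, not the code of mimedecode.py (which
targets Python 2.0+); the function names decode_rfc2047 and decode_filename
are hypothetical. Note that the modern standard library also joins RFC2231
continuations, which this program does not do yet.

```python
from email.header import decode_header
from email.message import Message
from email.utils import collapse_rfc2231_value


def decode_rfc2047(value, fallback="ascii"):
    """Decode an RFC 2047 header value such as '=?koi8-r?B?...?=' to str."""
    chunks = []
    for data, charset in decode_header(value):
        # decode_header yields bytes for encoded words and for mixed headers,
        # but a plain str when the header has no encoded words at all.
        if isinstance(data, bytes):
            data = data.decode(charset or fallback, "replace")
        chunks.append(data)
    return "".join(chunks)


def decode_filename(msg):
    """Return the decoded "filename" of Content-Disposition, or None."""
    raw = msg.get_param("filename", header="content-disposition")
    if raw is None:
        return None
    # RFC 2231 values come back as a (charset, language, value) tuple;
    # collapse_rfc2231_value turns either form into a plain string.
    name = collapse_rfc2231_value(raw)
    # Non-standard but widespread: the parameter is RFC 2047 encoded instead.
    return decode_rfc2047(name)
```

A Subject header would be passed through decode_rfc2047 directly, while
decode_filename touches only the "filename" parameter and leaves the rest of
Content-Disposition alone.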
Then the body of the message (or of the current part) is decoded. Decoding
starts by looking at the Content-Transfer-Encoding header. If the header
specifies a non-8bit encoding (usually base64 or quoted-printable), the body
is converted to 8bit. Then, if the content type is multipart
(multipart/related or multipart/mixed, e.g.), every part is decoded
recursively. If it is not multipart, the mailcap database is consulted to
find a way to convert the body to plain text. (I have no idea how mailcap
could be configured on said M$ Ophice GUI, please don't ask me; real OS users
can consult my example at http://phd.pp.ru/Software/dotfiles/mailcap.html).
The decoding process uses the first copiousoutput filter it can find. If
there is no filter, the body is just passed as is.
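
A rough sketch of the transfer-decoding and recursion steps, written for a
current Python 3 and not taken from mimedecode.py. The names decode_body,
decode_parts and decode_message are made up for this illustration, and the
mailcap filter step is left out entirely.

```python
import base64
import quopri
from email import message_from_bytes


def decode_body(raw, encoding):
    """Undo a Content-Transfer-Encoding and return the body as 8-bit bytes."""
    encoding = (encoding or "7bit").lower()
    if encoding == "base64":
        return base64.b64decode(raw)
    if encoding == "quoted-printable":
        return quopri.decodestring(raw)
    return raw  # 7bit, 8bit and binary need no conversion


def decode_parts(msg, depth=0):
    """Yield (depth, content type, decoded bytes) for every leaf part."""
    if msg.is_multipart():
        # Recurse into every sub-part of a multipart container.
        for part in msg.get_payload():
            yield from decode_parts(part, depth + 1)
    else:
        # The parser keeps raw 8-bit bytes as surrogate escapes; undo that.
        raw = msg.get_payload().encode("ascii", "surrogateescape")
        encoding = msg.get("Content-Transfer-Encoding")
        yield depth, msg.get_content_type(), decode_body(raw, encoding)


def decode_message(data):
    """Parse raw message bytes and return the decoded leaf parts as a list."""
    return list(decode_parts(message_from_bytes(data)))
```

Each yielded body is what would then be handed to a copiousoutput filter, or
passed as is when no filter exists.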
Then the Content-Type header is consulted for the charset. If it is not
equal to the current default charset, the body text is recoded using Unicode
codecs. Finally, the message headers and body are flushed to stdout.
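
The recoding step amounts to decoding the body from its declared charset and
encoding it again in the default one. A minimal sketch for a current Python 3
follows; the function name recode is hypothetical, and the "replace" error
handling is an assumption, not necessarily what mimedecode.py does.

```python
import codecs


def recode(body, charset, default_charset):
    """Recode body bytes from charset to default_charset if they differ."""
    if not charset:
        return body  # no charset declared: leave the body alone
    # codecs.lookup normalizes aliases, so "latin1" equals "ISO-8859-1".
    if codecs.lookup(charset).name == codecs.lookup(default_charset).name:
        return body
    text = body.decode(charset, "replace")
    return text.encode(default_charset, "replace")
```

For example, recode(body, "koi8-r", "utf-8") turns a KOI8-R body into UTF-8.
An unknown charset name makes codecs.lookup raise LookupError, which a caller
would have to handle.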
-h, --help
    Print brief usage help and exit.
-V, --version
    Print version and exit.
-c
    Recode different character sets to the current default charset; this is
    the default.
-C
    Do not recode character sets.
-f
    Decode the "filename" parameter of the Content-Disposition header; this
    is the default.
-F
    Do not decode filenames.
-s
    Decode the Subject header; this is the default.
-S
    Do not decode Subject.
-b mask
    Append mask to the list of binary content types; if the message to
    decode has a part of this type, the program will pass the part as is,
    without any additional processing.
-e mask
    Append mask to the list of error content types; if the message to
    decode has a part of this type, the program will raise ValueError.
-i mask
    Append mask to the list of content types to ignore; if the message to
    decode has a part of this type, the program will not pass it; instead
    a line "\nMessage body of type `%s' skipped.\n" will be issued.
-t mask
    Append mask to the list of content types to convert to text; if the
    message to decode has a part of this type, the program will consult the
    mailcap database, find the first copiousoutput filter and convert the
    part to text.
The last 4 options (-beit) require more explanation. They allow a user
to control body decoding with great flexibility. Think about said mail
archive; for example, its maintainer wants to put only texts there, convert
PostScript/PDF to text, pass HTML and images as is, and ignore everything
else:
mimedecode.py -t application/postscript -t application/pdf \
-b text/html -b 'image/*' -i '*/*'
When the program decodes a message (or a part of it), it consults the
Content-Type header. The content type is searched for in all 4 lists, in the
order "text-binary-ignore-error". If it is found, the appropriate action is
performed. If not, the program searches the same lists for the "type/*" mask
(the type of "text/html" is just "text"). If that is not found either, the
program searches the same lists for the "*/*" mask. If nothing matches, the
program uses the default action, which is to decode everything to text (if
mailcap specifies filters).
Initially all 4 lists are empty, so without any additional parameters
the program always uses the default decoding.
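
The lookup order described above can be sketched as a small function. This is
an illustration only; the name find_action, the dictionary layout and the
"default" return value are assumptions and are not taken from mimedecode.py.

```python
def find_action(content_type, masks, default="default"):
    """Return the action for content_type: exact, then type/*, then */*."""
    content_type = content_type.lower()
    main_type = content_type.split("/", 1)[0]
    order = ("text", "binary", "ignore", "error")
    # Try the most specific mask first and the catch-all mask last.
    for candidate in (content_type, main_type + "/*", "*/*"):
        for action in order:
            if candidate in masks.get(action, ()):
                return action
    return default  # all lists empty or nothing matched


# The lists built by the example command line shown earlier.
masks = {
    "text": ["application/postscript", "application/pdf"],
    "binary": ["text/html", "image/*"],
    "ignore": ["*/*"],
}
```

With these lists, "application/pdf" maps to "text", "image/png" to "binary"
and "audio/mpeg" to "ignore"; with empty lists every type gets the default
action.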
The current locale settings are usually used to determine the current
default charset.
The program may produce an incorrect MIME message. The purpose of the
program is to decode whatever is possible to decode, not to produce
absolutely correct MIME output. The incorrect parts are obvious: decoded
Subject headers and filenames.
Decoding of mail header parameters is incomplete: continuations in
RFC2231-encoded parameters (header*1, header*2, etc.) are not parsed yet.
This program is distributed in the hope that it will be
useful, but WITHOUT ANY WARRANTY; without even the implied
warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
PURPOSE. See the GNU General Public License for more details.
Oleg Broytmann http://phd.pp.ru/ phd at phd.pp.ru
Programmers don't die, they just GOSUB without RETURN.