mimedecode.py version 1.1.2

Oleg Broytmann phd at phd.pp.ru
Wed Jan 9 10:58:30 EST 2002


Hello!

                                mimedecode.py

WHAT IS IT

   Mail users, especially in non-English countries, often find that mail
messages arrive in different formats, with different content types, in
different encodings and charsets. Usually this is very good because it allows
us to use appropriate formats/encodings/whatever. Sometimes, though, some
unification is desirable. For example, one may want to put mail messages into
an archive, make HTML indices, run a search indexer, etc. In such situations
converting messages to text in one character set and skipping some binary
attachments is very desirable.

   Here is the solution - mimedecode.py.

   This is a program to decode MIME messages. The program expects one input
file (either on the command line or on stdin), which is treated as an RFC822
message and decoded to stdout. If the file is not an RFC822 message, it is
just piped to stdout one-to-one. If it is a simple RFC822 message, it is just
decoded as one part. If it is a MIME message with multiple parts
("attachments"), all parts are decoded. Decoding can be controlled by
command-line options.


WHAT'S NEW in version 1.1.2
   Fixed a major bug in binary attachments handling. Fixed a minor wart that
was found by PyChecker.


WHERE TO GET
   Master site: http://phd.pp.ru/Software/Python/#mimedecode

   Faster mirror: http://phd.by.ru/Software/Python/#mimedecode

   Requires: Python 2.0+ (actually tested with 2.1.1),
      configured mailcap database.

   Documentation (also included in the package):
      http://phd.pp.ru/Software/Python/mimedecode.txt
      http://phd.by.ru/Software/Python/mimedecode.txt


NAME
   mimedecode.py - decode MIME message


SYNOPSIS
   mimedecode.py [-h|--help] [-V|--version] [-cCDP] [-f charset] [-d header] [-p header:param] [-beit mask] [filename]


DESCRIPTION
   Mail users, especially in non-English countries, often find that mail
messages arrive in different formats, with different content types, in
different encodings and charsets. Usually this is very good because it allows
us to use appropriate formats/encodings/whatever. Sometimes, though, some
unification is desirable. For example, one may want to put mail messages into
an archive, make HTML indices, run a search indexer, etc. In such situations
converting messages to text in one character set and skipping some binary
attachments is very desirable.

   Here is the solution - mimedecode.py!

   This is a program to decode MIME messages. The program expects one input
file (either on the command line or on stdin), which is treated as an RFC822
message and decoded to stdout. If the file is not an RFC822 message, it is
just piped to stdout one-to-one. If it is a simple RFC822 message, it is just
decoded as one part. If it is a MIME message with multiple parts
("attachments"), all parts are decoded. Decoding can be controlled by
command-line options.

   First, Subject and Content-Disposition headers are examined. If any of
those exist, they are decoded according to RFC2047. The Content-Disposition
header itself is not decoded - only its "filename" parameter. Encoding a
header's parameters this way is in violation of the RFC, but widely deployed
anyway, especially in the M$ Ophice GUI (often referred to as "Windoze")
world, where programmers are usually ignorant lamers who never even heard
about RFCs. Correct parameter encoding is specified by RFC2231. This program
decodes RFC2231-encoded parameters.
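
   The two kinds of decoding described above can be sketched as follows.
This is only an illustration, not the code of mimedecode.py: it assumes the
"email" package of a current Python 3 (which did not exist in Python 2.0),
and the helper name decode_rfc2047 and the sample header values are made up
for the example.

```python
from email.header import decode_header
from email.message import Message
from email.utils import collapse_rfc2231_value


def decode_rfc2047(value):
    """Decode an RFC 2047 header value into a plain string (hypothetical helper)."""
    chunks = []
    for data, charset in decode_header(value):
        if isinstance(data, bytes):
            # Encoded words carry their own charset; plain runs are ASCII.
            chunks.append(data.decode(charset or "ascii", "replace"))
        else:
            chunks.append(data)
    return "".join(chunks)


# RFC 2047: an encoded word in a header value (sample data).
subject = decode_rfc2047("=?koi8-r?B?8NLJ18XU?=")

# RFC 2231: an encoded "filename" parameter of Content-Disposition (sample data).
msg = Message()
msg["Content-Disposition"] = (
    "attachment; filename*=utf-8''%D1%82%D0%B5%D1%81%D1%82.txt"
)
raw_param = msg.get_param("filename", header="content-disposition")
filename = collapse_rfc2231_value(raw_param)

print(subject)
print(filename)
```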

   Then the body of the message (or current part) is decoded. Decoding starts
with looking at the Content-Transfer-Encoding header. If the header specifies
a non-8bit encoding (usually base64 or quoted-printable), the body is
converted to 8bit. Then, if its content type is multipart (multipart/related
or multipart/mixed, e.g.) every part is recursively decoded. If it is not
multipart, the mailcap database is consulted to find a way to convert the
body to plain text. (I have no idea how mailcap could be configured on said
M$ Ophice GUI, please don't ask me; real OS users can consult my example at
http://phd.pp.ru/Software/dotfiles/mailcap.html). The decoding process uses
the first copiousoutput filter it can find. If there is no filter the body
is just passed as is.
   Then the Content-Type header is consulted for the charset. If it is not
equal to the current default charset the body text is recoded using Unicode
codecs. Finally message headers and body are flushed to stdout.
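
   A rough sketch of the two non-mailcap steps above (undoing the transfer
encoding, then recoding the charset) might look like this. The function name
decode_body and its parameters are hypothetical, the mailcap filter step is
left out on purpose, and this is not how mimedecode.py itself is written.

```python
import base64
import quopri


def decode_body(raw, transfer_encoding, charset, target_charset):
    """Undo Content-Transfer-Encoding, then recode bytes to target_charset."""
    encoding = transfer_encoding.lower()
    if encoding == "base64":
        data = base64.b64decode(raw)
    elif encoding == "quoted-printable":
        data = quopri.decodestring(raw)
    else:
        # 7bit, 8bit and binary bodies need no transfer decoding.
        data = raw
    if charset.lower() != target_charset.lower():
        # Recode through Unicode: source charset -> str -> target charset.
        data = data.decode(charset).encode(target_charset)
    return data


# Sample data: a base64 body in koi8-r, recoded to utf-8.
recoded = decode_body(b"8NLJ18XU", "base64", "koi8-r", "utf-8")
print(recoded.decode("utf-8"))
```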


OPTIONS
   -h
   --help
      Print brief usage help and exit.

   -V
   --version
      Print version and exit.

   -c
      Recode different character sets in message body to current default
      charset; this is the default.

   -C
      Do not recode character sets in message body.

   -f charset
      Force this charset to be the current default charset instead of
      sys.getdefaultencoding().

   -d header
      Add the header to a list of headers to decode; initially the list
      contains headers "From" and "Subject".

   -D
      Clear the list of headers to decode (make it empty).

   -p header:param
      Add the (header, param) pair to a list of headers' parameters to
      decode; initially the list contains header "Content-Disposition",
      parameter "filename".

   -P
      Clear the list of headers' parameters to decode (make it empty).

   -b mask
      Append mask to the list of binary content types; if the message to
      decode has a part of this type the program will pass the part as is,
      without any additional processing.

   -e mask
      Append mask to the list of error content types; if the message to
      decode has a part of this type the program will raise ValueError.

   -i mask
      Append mask to the list of content types to ignore; if the message to
      decode has a part of this type the program will not pass it, instead
      a line "\nMessage body of type `%s' skipped.\n" will be issued.

   -t mask
      Append mask to the list of content types to convert to text; if the
      message to decode has a part of this type the program will consult
      the mailcap database, find the first copiousoutput filter and convert
      the part.

   The last 4 options (-beit) require more explanation. They allow a user
to control body decoding with great flexibility. Think about said mail
archive; for example, its maintainer wants to put only texts there, convert
Postscript/PDF to text, pass HTML and images as is, and ignore everything
else. Easy:

   mimedecode.py -t application/postscript -t application/pdf \
       -b text/html -b 'image/*' -i '*/*'

   When the program decodes a message (or its part), it consults the
Content-Type header. The content type is searched for in all 4 lists, in the
order "text-binary-ignore-error". If found, the appropriate action is
performed. If not found, the program searches the same lists for the
"type/*" mask (the type of "text/html" is just "text"). If found, the
appropriate action is performed. If not found, the program searches the same
lists for the "*/*" mask. If found, the appropriate action is performed. If
not found, the program uses the default action, which is to decode
everything to text (if mailcap specifies filters).
   Initially all 4 lists are empty, so without any additional parameters
the program always uses the default decoding.
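
   The lookup order described above can be sketched like this. The names
find_action and masks, and the action labels, are invented for the
illustration; the real program may organize its lists differently.

```python
def find_action(content_type, masks, default="decode"):
    """Pick an action by exact type, then "type/*", then "*/*" (a sketch)."""
    content_type = content_type.lower()
    main_type = content_type.split("/", 1)[0]
    for candidate in (content_type, main_type + "/*", "*/*"):
        # Lists are searched in the order text-binary-ignore-error.
        for action in ("text", "binary", "ignore", "error"):
            if candidate in masks[action]:
                return action
    return default


# The lists built by the example command line:
#   -t application/postscript -t application/pdf
#   -b text/html -b 'image/*' -i '*/*'
masks = {
    "text": ["application/postscript", "application/pdf"],
    "binary": ["text/html", "image/*"],
    "ignore": ["*/*"],
    "error": [],
}

for ctype in ("application/pdf", "image/png", "text/html", "audio/mpeg"):
    print(ctype, "->", find_action(ctype, masks))
```

   With all 4 lists empty the function falls through to the default, which
matches the behaviour without -beit options.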


ENVIRONMENT
   LANG
   LC_ALL
   LC_CTYPE
      Define current locale settings. Used to determine current default
      charset (if your Python is properly installed and configured).
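
      As an illustration of how these variables name a charset, here is a
      small sketch. It assumes the usual POSIX precedence (LC_ALL, then
      LC_CTYPE, then LANG); the function charset_from_environment is
      hypothetical and only parses the variable text, whereas the program
      itself relies on Python's own locale handling.

```python
import os


def charset_from_environment(environ=os.environ):
    """Return the charset named by LC_ALL, LC_CTYPE or LANG, or None."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = environ.get(name)
        if value:
            # Locale names look like language_TERRITORY.charset@modifier.
            locale_name = value.split("@", 1)[0]
            if "." in locale_name:
                return locale_name.split(".", 1)[1]
            # The first variable that is set wins, even without a charset.
            return None
    return None


print(charset_from_environment({"LANG": "ru_RU.KOI8-R"}))
```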


BUGS
   The program may produce an incorrect MIME message. The purpose of the
program is to decode whatever is possible to decode, not to produce
absolutely correct MIME output. The incorrect parts are obvious - decoded
Subject headers and filenames. Other than that the output is a correct MIME
message (tested with the mutt mail reader).
   The program does not try to guess whether the headers are correct. For
example, if a message header states that the charset is iso8859-5, but the
body is actually in koi8-r - the program will recode the message to the
wrong charset.
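
   The effect of such a mislabelled message is easy to reproduce with
Python's codecs. The sample text is made up; the snippet only shows why
trusting a wrong charset header yields garbage.

```python
text = "Привет"

# The body is really koi8-r...
body = text.encode("koi8-r")

# ...but the header claims iso8859-5, so the bytes are decoded wrongly.
wrong = body.decode("iso8859-5")
right = body.decode("koi8-r")

print(repr(wrong))
print(repr(right))
```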


AUTHOR
   Oleg Broytmann <phd at phd.pp.ru>


COPYRIGHT
   Copyright (C) 2001 PhiloSoft Design


LICENSE
   GNU GPL


NO WARRANTIES
       This  program  is  distributed in the hope that it will be
       useful, but WITHOUT ANY WARRANTY; without even the implied
       warranty  of  MERCHANTABILITY  or FITNESS FOR A PARTICULAR
       PURPOSE.  See the GNU  General  Public  License  for  more
       details.


SEE ALSO
   mimedecode.py home page: http://phd.pp.ru/Software/Python/#mimedecode

Oleg.
-- 
     Oleg Broytmann            http://phd.pp.ru/            phd at phd.pp.ru
           Programmers don't die, they just GOSUB without RETURN.



