MIME decode

Fri Oct 12 10:31:10 EDT 2001

Hello!

                                 MIME decode

WHAT IS IT

   Mail users, especially in non-English-speaking countries, often find that
mail messages arrive in different formats, with different content types, in
different encodings and charsets. Usually this is very good because it allows
us to use appropriate formats/encodings/whatever. Sometimes, though, some
unification is desirable. For example, one may want to put mail messages into
an archive, make HTML indices, run a search indexer, etc. In such situations
converting messages to text in one character set and skipping some binary
attachments is highly desirable.

   Here is the solution - mimedecode.py.

   This is a program to decode MIME messages. The program expects one input
file (either on the command line or on stdin), which is treated as an RFC822
message and decoded to stdout. If the file is not an RFC822 message, it is
just piped to stdout unchanged. If it is a simple RFC822 message, it is
decoded as one part. If it is a MIME message with multiple parts
("attachments"), all parts are decoded. Decoding can be controlled by
command-line options.

WHERE TO GET
   Master site: http://phd.pp.ru/Software/Python/#mimedecode

   Faster mirror: http://phd.by.ru/Software/Python/#mimedecode

   Requires: Python 2.0+, a configured mailcap database.

   Documentation (also included in the package):
      http://phd.pp.ru/Software/Python/mimedecode.txt
      http://phd.by.ru/Software/Python/mimedecode.txt

AUTHOR
   Oleg Broytmann <phd at phd.pp.ru>

COPYRIGHT
   Copyright (C) 2001 PhiloSoft Design

LICENSE
   GPL

Detailed manual

NAME
   mimedecode.py - decode MIME message.

SYNOPSIS
   mimedecode.py [-h|--help] [-V|--version] [-cCfFsS] [-beit mask] [filename]

DESCRIPTION
   First, the Subject and Content-Disposition headers are examined. If either
of them exists, it is decoded according to RFC2047. The Content-Disposition
header itself is not decoded - only its "filename" parameter. Encoding header
parameters this way violates the RFC, but it is widely deployed anyway,
especially in the M$ Ophice GUI (often referred to as "Windoze") world, where
programmers are usually ignorant lamers who have never even heard of RFCs.
The correct parameter encoding is specified by RFC2231. This program decodes
RFC2231-encoded parameters; continuation parameters (header*1, header*2,
etc.) are not yet supported.
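
   The header step can be illustrated with a minimal sketch. It is not the
code of mimedecode.py: it assumes a modern Python 3 and its standard email
package (the program itself targets Python 2.0+), and the helper name
decode_rfc2047 and the sample message are made up for this illustration.

```python
import email
from email.header import decode_header


def decode_rfc2047(value, fallback="ascii"):
    """Decode RFC 2047 encoded words in a header value into one string."""
    pieces = []
    for data, charset in decode_header(value):
        if isinstance(data, bytes):
            # Encoded words come back as bytes plus their declared charset.
            pieces.append(data.decode(charset or fallback, errors="replace"))
        else:
            # A header without encoded words is returned as plain str.
            pieces.append(data)
    return "".join(pieces)


# Hypothetical sample message, invented for this example.
SAMPLE = (
    "Subject: =?iso-8859-1?Q?caf=E9?=\n"
    "Content-Disposition: attachment; filename*=utf-8''na%C3%AFve.txt\n"
    "\n"
    "body\n"
)

msg = email.message_from_string(SAMPLE)
subject = decode_rfc2047(msg["Subject"])
# Message.get_filename() understands RFC 2231 encoded parameters.
filename = msg.get_filename()
```

   Here the Subject goes through the RFC2047 path, while the "filename"
parameter is written the RFC2231 way and is left to the library.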

   Then the body of the message (or of the current part) is decoded. Decoding
starts by looking at the Content-Transfer-Encoding header. If the header
specifies a non-8bit encoding (usually base64 or quoted-printable), the body
is converted to 8bit. Then, if the content type is multipart (e.g.
multipart/related or multipart/mixed), every part is decoded recursively. If
it is not multipart, the mailcap database is consulted to find a way to
convert the body to plain text. (I have no idea how mailcap could be
configured on said M$ Ophice GUI, please don't ask me; real OS users can
consult my example at http://phd.pp.ru/Software/dotfiles/mailcap.html). The
decoding process uses the first copiousoutput filter it can find. If there is
no filter, the body is passed through as is.
   Then the Content-Type header is consulted for the charset. If it differs
from the current default charset, the body text is recoded using Unicode
codecs. Finally, the message headers and body are flushed to stdout.
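
   A rough sketch of the body step, again assuming Python 3's standard email
package rather than the program's real code. The function decode_part and
the sample message are hypothetical, and the mailcap filter step is left out
entirely: only transfer decoding, recursion into multipart messages and
charset handling are shown.

```python
import email


def decode_part(msg, default_charset="utf-8"):
    """Return the decoded text bodies of msg, recursing into multiparts."""
    if msg.is_multipart():
        texts = []
        for part in msg.get_payload():
            texts.extend(decode_part(part, default_charset))
        return texts
    # decode=True undoes base64 / quoted-printable and returns raw bytes.
    raw = msg.get_payload(decode=True)
    # Use the charset named in Content-Type, else the assumed default.
    charset = msg.get_content_charset() or default_charset
    return [raw.decode(charset, errors="replace")]


# Hypothetical two-part message, invented for this example.
RAW = (
    "MIME-Version: 1.0\n"
    "Content-Type: multipart/mixed; boundary=XX\n"
    "\n"
    "--XX\n"
    "Content-Type: text/plain; charset=koi8-r\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "8NLJ18XU\n"
    "--XX\n"
    "Content-Type: text/plain; charset=iso-8859-1\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "caf=E9\n"
    "--XX--\n"
)

bodies = decode_part(email.message_from_string(RAW))
```

   The sketch returns Python strings; encoding them into the default charset
on output would correspond to the recoding described above. An unknown
charset name would raise LookupError here, which a real tool has to handle.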

OPTIONS
   -h
   --help
      Print brief usage help and exit.

   -V
   --version
      Print version and exit.

   -c
      Recode different character sets to current default charset; this is
      the default.

   -C
      Do not recode character sets.

   -f
      Decode "filename" parameter of Content-Disposition header; this is
      the default.

   -F
      Do not decode filenames.

   -s
      Decode Subject header; this is the default.

   -S
      Do not decode Subject.

   -b mask
      Append mask to the list of binary content types; if the message to
      decode has a part of this type the program will pass the part as is,
      without any additional processing.

   -e mask
      Append mask to the list of error content types; if the message to
      decode has a part of this type the program will raise ValueError.

   -i mask
      Append mask to the list of content types to ignore; if the message to
      decode has a part of this type the program will not pass it; instead,
      the line "\nMessage body of type `%s' skipped.\n" will be issued.

   -t mask
      Append mask to the list of content types to convert to text; if the
      message to decode has a part of this type the program will consult
      the mailcap database, find the first copiousoutput filter and convert
      the part.

   The last 4 options (-beit) require more explanation. They allow a user
to control body decoding with great flexibility. Think about the mail
archive mentioned above; suppose its maintainer wants to store only text:
convert PostScript/PDF to text, pass HTML and images as is, and ignore
everything else. Easy:

   mimedecode.py -t application/postscript -t application/pdf \
       -b text/html -b 'image/*' -i '*/*'

   When the program decodes a message (or one of its parts), it consults the
Content-Type header. The content type is searched for in all 4 lists, in the
order "text-binary-ignore-error". If it is found, the appropriate action is
performed. If not, the program searches the same lists for the "type/*" mask
(the type of "text/html" is just "text"). If that is found, the appropriate
action is performed. If not, the program searches the same lists for the
"*/*" mask. If that is found, the appropriate action is performed. If not,
the program uses the default action, which is to decode everything to text
(if mailcap specifies filters).
   Initially all 4 lists are empty, so without any additional parameters
the program always uses the default decoding.
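
   The lookup order described above can be sketched as follows. This is an
illustration only, assuming Python 3; the names ORDER, find_action and masks
are hypothetical and do not come from mimedecode.py. The masks correspond to
the command-line example given earlier.

```python
# The four lists are searched in this order for every candidate mask.
ORDER = ("text", "binary", "ignore", "error")


def find_action(content_type, masks, default="text"):
    """Pick an action: exact type first, then "type/*", then "*/*"."""
    content_type = content_type.lower()
    main_type = content_type.split("/", 1)[0]
    for candidate in (content_type, main_type + "/*", "*/*"):
        for action in ORDER:
            if candidate in masks.get(action, ()):
                return action
    # Nothing matched: fall back to the default (decode to text).
    return default


# Equivalent of: -t application/postscript -t application/pdf
#                -b text/html -b 'image/*' -i '*/*'
masks = {
    "text": ["application/postscript", "application/pdf"],
    "binary": ["text/html", "image/*"],
    "ignore": ["*/*"],
    "error": [],
}

print(find_action("application/pdf", masks))  # text
print(find_action("image/png", masks))        # binary
print(find_action("audio/mpeg", masks))       # ignore
```

   With all four lists empty, find_action always returns the default, which
matches the behaviour without any -beit options.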

ENVIRONMENT
   LANG
   LC_ALL
   LC_CTYPE
      Define the current locale settings. Usually used to determine the
      current default charset.
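
   One way to derive a default charset from these variables, shown as a
sketch that assumes Python 3's standard locale module; the function name
default_charset is hypothetical, and how mimedecode.py itself does this is
not stated here.

```python
import locale


def default_charset():
    """Return the charset implied by LC_ALL / LC_CTYPE / LANG."""
    try:
        # Adopt the locale requested by the environment variables.
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # An unknown locale name in the environment: keep the current one.
        pass
    return locale.getpreferredencoding(False)


print(default_charset())
```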

BUGS
   The program may output an incorrect MIME message. The purpose of the
program is to decode whatever can be decoded, not to produce absolutely
correct MIME output. The incorrect parts are obvious - the decoded Subject
headers and filenames.
   Decoding of mail header parameters is incomplete - continuations in
RFC2231-encoded parameters (header*1, header*2, etc.) are not parsed yet.

NO WARRANTIES
       This  program  is  distributed in the hope that it will be
       useful, but WITHOUT ANY WARRANTY; without even the implied
       warranty  of  MERCHANTABILITY  or FITNESS FOR A PARTICULAR
       PURPOSE.  See the GNU  General  Public  License  for  more
       details.
Oleg.
---- 
     Oleg Broytmann            http://phd.pp.ru/            phd at phd.pp.ru
           Programmers don't die, they just GOSUB without RETURN.