[Tutor] Extracting body of all email messages from an mbox file on computer

Kent Johnson kent37 at tds.net
Sun Sep 14 14:32:03 CEST 2008


On Thu, Sep 11, 2008 at 4:22 AM, grishma govani <grishma20 at gmail.com> wrote:

> I have the e-mails from gmail in a file on my computer. I have used the code
> below to extract all the headers. As you can see, for now I am using text
> stored in a document as my body. I just want to extract the plain text and
> leave out all the html, duplicates of plain text and all the other
> information like content type, from etc. Can anyone help me out?

Here is a program that shows the contents of an mbox file. For each
message it prints the subject, then the content-type and an excerpt
(the first 200 characters) of each part of the message body. It works
with both single-part and multipart messages.

import mailbox

def showMbox(mboxPath):
    box = mailbox.mbox(mboxPath)
    for msg in box:
        print msg['Subject']
        showPayload(msg)

        print
        print '**********************************'
        print


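def showPayload(msg):
    # For a multipart message get_payload() returns a list of sub-messages;
    # otherwise it returns the body as a string.
    payload = msg.get_payload()

    if msg.is_multipart():
        div = ''
        for subMsg in payload:
            # Print a divider between parts (empty before the first one)
            print div
            # Recurse, since a part can itself be multipart
            showPayload(subMsg)
            div = '------------------------------'
    else:
        print msg.get_content_type()
        # Show only an excerpt of the body
        print payload[:200]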


if __name__ == '__main__':
    showMbox('/path/to/mbox')
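
To keep only the plain text, as the original question asks, one option is
to filter the parts on their content-type. The following is a rough sketch
rather than a tested solution. It assumes Python 3 syntax (unlike the
program above), the names plain_text_parts and show_plain_text are made up
for illustration, and the utf-8 fallback for a missing or unknown charset
is a guess. It does not try to detect duplicated text.

```python
import mailbox


def plain_text_parts(msg):
    """Return the decoded text/plain parts of a message as a list of strings."""
    texts = []
    # walk() visits the message itself and every nested part
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no text of their own
        if part.get_content_type() != 'text/plain':
            continue  # skip text/html, attachments and so on
        # decode=True undoes base64 / quoted-printable and returns bytes
        raw = part.get_payload(decode=True)
        if raw is None:
            continue
        # Assumption: fall back to utf-8 when no charset is declared
        charset = part.get_content_charset() or 'utf-8'
        try:
            texts.append(raw.decode(charset, errors='replace'))
        except LookupError:
            # The declared charset name is not known to Python
            texts.append(raw.decode('utf-8', errors='replace'))
    return texts


def show_plain_text(mbox_path):
    """Print the subject and plain-text body parts of every message."""
    for msg in mailbox.mbox(mbox_path):
        print(msg['Subject'])
        for text in plain_text_parts(msg):
            print(text)
        print('**********************************')
```

Calling show_plain_text('/path/to/mbox') would then print each subject
followed by the text/plain parts only, with the html alternatives skipped.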

Kent

