Python:Email and Header Parsing: Some Help

David M. Cooke cookedm+news at physics.mcmaster.ca
Thu Feb 26 22:40:55 CET 2004


At some point, dont bother <dontbotherworld at yahoo.com> wrote:

> Hi,
> I have written this small piece of code. I am a brand
> new  player of Python. I had asked some people for
> help, unfortunately not many helped.
> Here is the code I have:
>
> import email
> import os
> import sys
> fread = open('email_message', 'r')
> msg=email.message_from_file(fread)
> print msg
> #fwrite = open('output','w')
> #fwrite.write(msg)
>
> This way I am able to print the entire email message
> on the stdout. The program generates an error If I try
> to write the output to a file-- It says the argument
> (here msg) should be a string but not as an instance
> like here. How to write the message to another file
> then?

msg here isn't a string; it's an email.Message object. The print
statement works because print calls str() on the objects passed to it,
whereas a file's write() method insists on being given a string.

You want:

fwrite = open('output', 'w')
fwrite.write(msg.as_string())
fwrite.close()

I didn't use str(msg) here, as that is equivalent to
msg.as_string(unixfrom=True). Which one to use depends on whether or
not you want the 'From <whoosit>' envelope line at the top (which you
do if you're writing an mbox).
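
To make the difference concrete, here is a minimal sketch rather than a
definitive recipe. It assumes a current Python 3 interpreter (under 2.3
you would skip the with statement and close the file yourself). The
sample message, its addresses, and the temporary output path are all
made up for illustration.

```python
import email
import os
import tempfile

# Hypothetical sample message, standing in for the contents of 'email_message'.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: test\n"
    "\n"
    "Hello.\n"
)

msg = email.message_from_string(raw)

# Flatten the Message object to text, without and with the envelope line.
plain = msg.as_string()
mboxed = msg.as_string(unixfrom=True)

# Write the plain form to a file in a scratch directory (assumed location).
out_path = os.path.join(tempfile.mkdtemp(), 'output')
with open(out_path, 'w') as fwrite:
    fwrite.write(plain)
```

Here plain begins with the 'From:' header, while mboxed has an extra
'From ' envelope line in front of it.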

> 2. I have so many headers in the email message
>
> To:
> From:
> X Received:
> X Priority:
> Subject:
> etc etc.
> I want to parse the headers separtely and message
> separately. Does anyone has an example code to deal
> with Parser?

I'm not sure what you want -- email.message_from_file produces a Message
object, which already splits out the headers from the body. The body is
available through msg.get_payload(), and you can iterate over the
headers. For example, to strip out the optional headers (those starting
with 'X-'):

for hdr in msg.keys():
    if hdr.startswith('X-'):
        del msg[hdr]
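
A slightly fuller sketch of the same idea, again assuming Python 3 and a
made-up sample message. Header names are case-insensitive, so this
variant lower-cases each name before comparing; that is my addition, not
something the loop above does.

```python
import email

# Hypothetical sample message with two extension headers.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "X-Priority: 3\n"
    "x-mailer: ExampleMailer\n"
    "Subject: test\n"
    "\n"
    "Body text.\n"
)
msg = email.message_from_string(raw)

# Headers and body are already separate on the Message object.
for name, value in msg.items():
    print("%s: %s" % (name, value))

# Iterate over a copy of the names, since headers are deleted in the loop.
for hdr in list(msg.keys()):
    if hdr.lower().startswith('x-'):
        del msg[hdr]

remaining = msg.keys()      # header names left after the cleanup
body = msg.get_payload()    # a string for a non-multipart message
```

For a multipart message get_payload() returns a list of sub-messages
instead of a string, so you'd have to walk those parts separately.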

> Also I want to remove the redundant words and all html
> tags. Any advise on that?
> I saw some examples using HTMLGen But I dont have
> HTMLGen with python on my machine. I have Python
> 2.3.3. on my machine.

HTMLGen won't work, as that generates HTML (hence the name...). To
strip out the HTML tags, a regular expression is probably sufficient
for simple cases. Otherwise, have a look at HTMLParser (in the standard
library).
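
Both approaches can be sketched roughly as follows. This assumes
Python 3, where the module is html.parser (in 2.3 it is named
HTMLParser). TagStripper, strip_tags and the sample string are
hypothetical names made up for this example.

```python
import re
from html.parser import HTMLParser  # named HTMLParser in Python 2


class TagStripper(HTMLParser):
    """Hypothetical helper: keeps text content and discards the tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        # Called for the text between tags.
        self.chunks.append(data)

    def text(self):
        return ''.join(self.chunks)


def strip_tags(html):
    """Return html with its tags removed, using the parser above."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = "<p>Hello, <b>world</b>!</p>"

# Quick-and-dirty version: drop anything that looks like <...>.
stripped_re = re.sub(r'<[^>]+>', '', sample)

# Parser-based version.
stripped_parser = strip_tags(sample)
```

Both give 'Hello, world!' for this sample. The regular expression is
easily fooled by a '>' inside an attribute value or a comment, and
neither version drops the contents of script or style elements.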

-- 
|>|\/|<
/--------------------------------------------------------------------------\
|David M. Cooke
|cookedm(at)physics(dot)mcmaster(dot)ca


