[Tutor] Converting Microsoft .msg files to text

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Wed Oct 13 08:17:00 CEST 2004



On Tue, 12 Oct 2004, James R. Marcus wrote:

> I'm starting my second script ever. I need to get the bad email
> addresses out of the Microsoft .msg files that were sent to me.  I'm
> assuming I should start by converting the .msg files into text format.

Hi James,


I did a search to find out what kind of file format .MSG is: it appears to
be the format that Outlook uses to store email messages.  It's a binary
format, but you might be able to get away with filtering for just the
plain text characters.

One way to do this is to go through each character of the file and keep it
only if its ordinal value is less than 2**7, that is, if it falls in the
ASCII range.  That might work.


For example, say that we have something like:

###
>>> message = "THIS is a MESSAGE with MIXED case."
###

and say that we want to drop out the lowercased characters.  It turns out
that each lowercased letter in a string has an ordinal (ASCII) value
between:

###
>>> ord('a'), ord('z')
(97, 122)
###

and uppercased letters go between:

###
>>> ord('A'), ord('Z')
(65, 90)
###


One way to just keep the uppercase letters from 'message' is to filter for
them:

###
>>> def isGoodCharacter(ch):
...     return ord('A') <= ord(ch) <= ord('Z')
...
>>> ''.join(filter(isGoodCharacter, message))
'THISMESSAGEMIXED'
###


Similarly, you should be able to set up a filter for the readable,
"printable" text part of your .msg files.
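
Here is a rough sketch of what such a filter might look like when applied
to raw bytes instead of a string.  The names 'is_good_byte' and
'keep_readable' are made up for illustration, and treating "readable" as
the ASCII range from space to tilde is my assumption; the sample bytes are
invented and do not come from a real .msg file.

```python
def is_good_byte(byte):
    # Assumption: "readable" means the ASCII range from ' ' (32) to '~' (126).
    return ord(' ') <= byte <= ord('~')

def keep_readable(data):
    # bytearray yields integers when iterated, so each value can be
    # compared against the ordinal range directly.
    kept = bytearray(b for b in bytearray(data) if is_good_byte(b))
    return kept.decode('ascii')

# Invented sample: readable text mixed with binary junk.
sample = b"\x00\x01Hello\xff, \x07world!\x80"
print(keep_readable(sample))
```

For a real file you would get 'data' by opening the file in binary mode
('rb') and calling read() on it.  Note that this range also throws away
newlines and tabs, so you may want to let those through as well.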

An alternative way, besides doing something with explicit ASCII values, is
to check each character and see if it's 'in' some collection of printable
characters.

For example:

###
>>> def isVowel(ch):
...     return ch.lower() in 'aeiou'
...
###

And there's a variable in the 'string' module called 'string.printable'
that might come in handy.
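
As a minimal sketch of that idea, the same 'in' test can be done against
'string.printable'.  The helper names 'is_printable' and 'printable_text'
are hypothetical, and decoding the bytes as latin-1 is just my assumption
for turning every byte into exactly one character without errors.

```python
import string

def is_printable(ch):
    # string.printable holds digits, letters, punctuation and whitespace.
    return ch in string.printable

def printable_text(raw):
    # Assumption: latin-1 maps each byte to one character and never fails.
    text = raw.decode('latin-1')
    return ''.join(ch for ch in text if is_printable(ch))

# Invented sample: an address surrounded by binary junk.
sample = b"\x00bad\x01@example.com\xfe\n"
print(printable_text(sample))
```

Unlike a plain space-to-tilde range check, this keeps whitespace such as
newlines and tabs, since they are part of 'string.printable'.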


I'm not sure if this is the best way to do this, but I can't find a nice
documented page for the .MSG file format, so this will have to do for
now... *grin*


Good luck to you!


