email, unicode, HTML, and removal thereof

Thu Oct 31 04:43:12 EST 2002

Short version:
   What should I do to strip the markup out of an emailed HTML
   document so I can get just the text?  (Yeah, it won't always
   get only the text.)  I'm having problems with how to handle
   the charset.

Solutions in pure Python, or ones that call a common (under Unix)
external program, are fine.

Long version:

I'm experimenting with the Lucene text indexer and for my
first project I'm writing a simple search of my inbox.

I want to index various types of email and attachments,
including HTML mail, PDF, PS, etc.  For the last two I
use "pdftotext" and "ps2ascii" but for the first I figured
I would use straight Python.

I'm having problems extracting the non-markup part of
an HTML email.  I figured I could do

(NOTE: this is not a direct copy and paste -- I culled
things from a few functions and put them into one for
brevity)

import cStringIO
import formatter
import htmllib
import sgmllib

# Map from various charsets (culled from inbox) to Python's codec names
charset_table = {
     "window-1252": "cp1252",
     "windows-1252": "cp1252",
     "nil": "Latin-1",
     "default_charset": "Latin-1",
     "x-unknown": "Latin-1",
}

# Pass in an email.Message
def get_HTML_text(msg):
    # Decode any base64/quoted-printable transfer encoding
    s = msg.get_payload(decode=1)
    charset = msg.get_param("charset")
    if charset is None:
        charset = "Latin-1"
    charset = charset.lower()
    charset = charset_table.get(charset, charset)
    try:
        s = s.decode(charset, "replace")
    except LookupError:
        # Unknown charset; leave s as an undecoded byte string
        pass

    # Render the parsed HTML as plain text into an in-memory file
    file = cStringIO.StringIO()
    form = formatter.AbstractFormatter(formatter.DumbWriter(file))
    p = htmllib.HTMLParser(form)
    try:
        p.feed(s)
    except sgmllib.SGMLParseError, err:
        return None
    p.close()

    return file.getvalue()

This didn't work because I got complaints about characters
with ordinal value > 127.  I needed to replace "cStringIO" with
the following MyStringIO, and htmllib.HTMLParser with
MyHTMLParser.

from StringIO import StringIO

class MyStringIO(StringIO):
    def write(self, s):
        # Promote byte strings to unicode, assuming Latin-1
        if type(s) == type(""):
            s = unicode(s, "latin-1")
        StringIO.write(self, s)

class MyHTMLParser(htmllib.HTMLParser):
    def handle_data(self, data):
        # Same promotion for text the parser hands to the formatter
        if type(data) == type(""):
            data = unicode(data, "latin-1")
        htmllib.HTMLParser.handle_data(self, data)
I interpret the problem as HTMLParser reading a hex
escape and converting it to a (byte) string.  Done the way
I do things above, that can create characters > 127.  Then when
that string gets converted to unicode, it throws the exception.

My workaround solves this by forcing the string to be
interpreted in latin-1 context.
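
For what it's worth, the failure can be reproduced in isolation.
Python 2 implicitly decodes a byte string as ASCII whenever it gets
mixed with a unicode string.  The sketch below spells that step out
explicitly (it is written for Python 3, where the decode has to be
explicit) and then applies the Latin-1 workaround.  The sample bytes
are an assumption chosen purely for illustration.

```python
# A byte above 127, such as the parser might produce from an escape.
raw = b"caf\xe9"

# What Python 2 does implicitly when mixing str and unicode:
# decode the byte string as ASCII, which fails for bytes > 127.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# The workaround: interpret the bytes as Latin-1, which maps every
# byte value 0-255 to the code point with the same number.
text = raw.decode("latin-1")
```

Latin-1 never fails because every byte value is valid in it, which is
also why it can silently give the wrong characters.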

This solution doesn't feel correct.  For example, I assume
Latin-1 but the text could be in Windows' cp1252, so I'm not
handling the charset correctly.
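
To make the concern concrete: Latin-1 and cp1252 disagree on bytes
0x80-0x9F, which is where cp1252 keeps its curly quotes.  The sketch
below (Python 3) shows the difference, together with one possible way
to map a declared charset to a codec Python knows.  The names
normalize_charset and ALIASES are hypothetical, and the alias entries
just mirror charset_table above.

```python
import codecs

# Aliases Python does not know itself (same idea as charset_table).
ALIASES = {"window-1252": "cp1252", "nil": "latin-1", "x-unknown": "latin-1"}

def normalize_charset(name, default="latin-1"):
    """Return a codec name Python accepts, or `default` if unknown."""
    if not name:
        return default
    name = ALIASES.get(name.lower(), name.lower())
    try:
        return codecs.lookup(name).name
    except LookupError:
        return default

curly = b"\x93quoted\x94"
as_latin1 = curly.decode("latin-1")                         # C1 control characters
as_cp1252 = curly.decode(normalize_charset("Window-1252"))  # curly quotes
```

Falling back to Latin-1 for an unknown charset is a design choice, not
a rule; it always decodes, but it may decode to the wrong characters.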

I can see two solutions:
   - Pass the charset in to MyHTMLParser and MyStringIO (I
     may not actually need the latter) so that they use the proper
     charset instead of always "latin-1".  This feels right.

   - Strip out the HTML markup from the input string (as a
     series of bytes), then convert the string to unicode using
     the specified charset.  At first I thought this was correct,
     until I remembered that some charsets might encode
     '<>&' differently, which should be handled by the parser.
     I think.
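
A rough sketch of how the charset-aware approach might look, under
some assumptions: it is written for Python 3, where htmllib no longer
exists and html.parser takes its place, and it decodes the payload
with the declared charset before the parser sees it rather than
passing the charset into the parser.  TextExtractor and
html_bytes_to_text are hypothetical names, not part of any library.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content and drop the tags (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns &eacute; and &#8220; into text
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def html_bytes_to_text(payload, charset="latin-1"):
    # Decode first, using the charset declared in the message ...
    try:
        markup = payload.decode(charset, "replace")
    except LookupError:
        # Unknown charset: fall back to Latin-1 (an assumption)
        markup = payload.decode("latin-1", "replace")
    # ... so the parser only ever sees Unicode text.
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)

# In UTF-16 the markup characters are not single bytes, which is
# why stripping tags at the byte level is risky.
sample = "<p>caf&eacute; &#8220;ok&#8221;</p>"
utf16_text = html_bytes_to_text(sample.encode("utf-16"), "utf-16")
```

This keeps the text of every element, including script and style
blocks, so it is only a starting point for real mail.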

This is my first use of unicode, and the email package, and
Lucene, and third use of Jython, so I'm kinda wandering in
the dark here.  Any help will be much appreciated.

Thanks!

					Andrew
					dalke at dalkescientific.com