HTMLParser: getting one extra space around &gt;/&lt;/^M tags

Les Schaffer godzilla at netmeg.net
Fri Jun 2 11:39:33 EDT 2000


I am helping someone convert html archives from a mailing list back to
mbox format for forwarding to mail-archive.com.

I am using HTMLParser to pull out the body of each message as plain
text, together with AbstractFormatter/DumbWriter ...

The message bodies are all inside <PRE> tags, so nofill is set to 1,
which means self.formatter.add_literal_data(data) is used to add data
to the output. This is what I want for preformatted messages.
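
For comparison, here is a minimal sketch of the same extraction step
written against Python 3's html.parser module rather than the
htmllib/formatter pair described above (current Python 3 releases no
longer ship those two modules). The class and function names
(PreTextExtractor, extract_pre_text) are made up for illustration.
Text inside <PRE> is collected exactly as it arrives, with no filling
or reflowing.

```python
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Collect the literal text found inside <pre> elements."""

    def __init__(self):
        # convert_charrefs=True decodes &gt; and &lt; before handle_data runs.
        super().__init__(convert_charrefs=True)
        self._pre_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # Tag names are lowercased by the parser, so <PRE> matches too.
        if tag == "pre":
            self._pre_depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self._pre_depth:
            self._pre_depth -= 1

    def handle_data(self, data):
        # Keep data only while inside <pre>; nothing is added or stripped.
        if self._pre_depth:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def extract_pre_text(html_source):
    """Return the concatenated text of all <pre> blocks in html_source."""
    parser = PreTextExtractor()
    parser.feed(html_source)
    parser.close()
    return parser.text()
```

Because the chunks are joined with an empty string, any extra space in
the result would have to be present in the input itself.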

There are lots of &gt; and &lt; entities in the body of the messages, as
mhonarc/hypermail translate the '>' of quoted replies and the '<>'
around embedded email addresses into these HTML entities. (The bodies
are relatively HTML-free except for these.)

In addition, the end-of-line characters __in the body of the mail
message__ use '^M'.
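
Here ^M is the carriage return character ("\r"). As a small sketch of
one way to take it out of the picture, the line endings could be
normalized before the text reaches the parser or formatter. The helper
name normalize_newlines is an assumption made for this example, and it
is not claimed that this removes the extra spaces.

```python
def normalize_newlines(text):
    """Turn CRLF and bare CR line endings into plain LF."""
    # Replace CRLF first so a CRLF pair does not become two newlines.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```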

The problem is this: the formatter is adding one extra space around
these three characters (&gt;, &lt; and ^M).

So a mail message converted back to plain text looks like this:

=====
 Hello, why are you even reading this email.
 I have nothing important to say. 
 Stop it.
=====

Note the extra space on the left-hand side, which I assume comes from
the translation of ^M.

or this:

====
 and so then you said:

  >  i dont like you

 and i say, so what, and you said:

  >  thats what!
====

Again, there is an extra single space at the beginning of each line
and on either side of the '>'.

I cannot for the life of me (after 2 hours) find where in htmllib or
formatter these extra spaces are being inserted.
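
One possible way to narrow this down is to record every chunk the
parser hands over and inspect it with repr(), so that chunk boundaries
and any "\r" characters become visible. The sketch below uses Python
3's html.parser with convert_charrefs=False, so entity references are
reported through their own callback instead of being merged into the
surrounding text. It is not htmllib, and the names ChunkLogger and
log_chunks are hypothetical. If the stray spaces line up with the
chunk boundaries, the code that joins the chunks is the place to look.

```python
from html.parser import HTMLParser


class ChunkLogger(HTMLParser):
    """Record each parser callback so chunk boundaries can be inspected."""

    def __init__(self):
        # With convert_charrefs=False, &gt; and &lt; are reported via
        # handle_entityref instead of being folded into handle_data.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))


def log_chunks(html_source):
    """Return the list of (kind, value) events produced for html_source."""
    parser = ChunkLogger()
    parser.feed(html_source)
    parser.close()
    return parser.events


if __name__ == "__main__":
    for kind, value in log_chunks("and so &gt; i dont like you\r\nbye"):
        # repr() makes spaces and carriage returns explicit.
        print(kind, repr(value))
```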

Does anyone know?

Many, many thanks

les schaffer


