Re: [Mailman-Users] problem with accented characters, converting HTML to plain text

In a message of Mon, 20 Jul 2015 11:04:08 -0700, Mark Sapiro writes:
On 7/19/15 1:13 PM, Dominique Asselineau wrote:
Hello,
When an e-mail with a text/html content type is converted to plain text, the accented characters are not handled correctly.
There are potential issues with this. Mailman gets the content of the text/html part and calls HTML_TO_PLAIN_TEXT_COMMAND (lynx -dump in the default case) to convert the HTML to a plain text rendering. It then replaces the content of the part with that rendering and changes the Content-Type: to text/plain while maintaining any charset= parameter.
Lynx normally does not recode any characters, so the output of lynx -dump should be in the same charset as the input and it should be OK.
Problems arise if the input has characters represented as HTML entities such as &aacute; or &egrave;. In this case, lynx outputs the characters encoded in a charset which may not match the message's encoding.
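To make the mismatch concrete, here is a minimal sketch, assuming iconv and od are installed. hex_bytes is a hypothetical helper name, not something from Mailman or lynx. It prints the bytes of one character, e-acute (what the entity &eacute; renders to), in UTF-8 and in ISO-8859-1. If the converter writes one form while the part's charset= parameter names the other, the reader sees garbage.

```shell
# Hypothetical helper: show stdin as a compact hex string.
hex_bytes() { od -An -tx1 | tr -d ' \n'; echo; }

# e-acute encoded as UTF-8 (two bytes).
printf '\303\251' | hex_bytes

# The same character recoded to ISO-8859-1 (one byte).
printf '\303\251' | iconv -f UTF-8 -t ISO-8859-1 | hex_bytes
```

The first command prints c3a9 and the second prints e9, so neither form can be read correctly under the other charset label.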
In order to say more, I would need to see a raw message as sent to the list with all headers intact and the resultant raw message from the list with all headers intact.
--
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
I had enough trouble with lynx over this -- it used to be how I converted all html mail my mail reader saw, but such characters are not rare in the mail I receive -- that I gave up on lynx.
My new rule in my mailer for how to display html text is:
w3m -dump -o display_link_number=1 -cols 78 -T text/html -I "$(echo %a | sed -r 's/.*charset="?([-a-zA-Z0-9_]*).*/\1/')" -O utf-8 | less
which is one heck of a mouthful, but hasn't caused me any problems since.
Just in case somebody else wants to ditch lynx ...
Laura
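
The sed expression in the rule above can be checked in isolation. This is only a sketch: extract_charset is a hypothetical wrapper, and the Content-Type strings are made-up samples standing in for whatever the mailer substitutes for %a. It assumes a sed that accepts -r, as the original command does.

```shell
# Hypothetical wrapper around the sed expression from the rule above.
extract_charset() {
    printf '%s\n' "$1" | sed -r 's/.*charset="?([-a-zA-Z0-9_]*).*/\1/'
}

extract_charset 'text/html; charset="iso-8859-1"'          # prints iso-8859-1
extract_charset 'text/html; charset=UTF-8; format=flowed'  # prints UTF-8
extract_charset 'text/html'                                # no match: prints text/html
```

Note the last case: when there is no charset parameter, the substitution does not match and sed passes the input through unchanged, so -I would receive the whole string rather than a charset name.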

On 7/21/15 12:42 AM, Laura Creighton wrote:
My new rule in my mailer for how to display html text is:
w3m -dump -o display_link_number=1 -cols 78 -T text/html -I "$(echo %a | sed -r 's/.*charset="?([-a-zA-Z0-9_]*).*/\1/')" -O utf-8 | less
which is one heck of a mouthful, but hasn't caused me any problems since.
Just in case somebody else wants to ditch lynx ...
While I'm sure Laura's command above works well as an HTML viewer for an MUA such as might be specified in a mutt mailcap file, there are issues with trying to use this as a Mailman HTML_TO_PLAIN_TEXT_COMMAND because it gets the input charset from the message's Content-Type: header and none of the message's headers are passed to HTML_TO_PLAIN_TEXT_COMMAND. Also, it specifies the output charset as utf-8, but Mailman will not change the charset parameter in the converted MIME part. It only changes the MIME type from text/html to text/plain so if the original HTML charset is not utf-8, creating utf-8 output would be wrong.
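A sketch of the second issue, assuming iconv is installed; read_as_latin1 is a hypothetical name. It takes the UTF-8 bytes that -O utf-8 would produce for "café" and decodes them as ISO-8859-1, which is roughly what a mail client does when the converted part still declares charset=iso-8859-1.

```shell
# Hypothetical helper: interpret stdin as ISO-8859-1 and re-emit it as UTF-8
# so the misreading is visible on a UTF-8 terminal.
read_as_latin1() { iconv -f ISO-8859-1 -t UTF-8; }

# UTF-8 bytes for "café", read under the wrong charset label.
printf 'caf\303\251\n' | read_as_latin1   # shows cafÃ© on a UTF-8 terminal
```

The two bytes of the UTF-8 é are each read as a separate Latin-1 character, which is the familiar doubled-character garbage.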
While one could use some w3m command in HTML_TO_PLAIN_TEXT_COMMAND, the appropriate command might be something like
w3m -dump -o display_link_number=1 -cols 78 -T text/html %(filename)s
without the -I and -O options, and this could wind up with the same charset issues as lynx.
--
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan