[Mailman-Users] problem with accented characters, converting HTML to plain text

Tue Jul 21 09:42:18 CEST 2015

In a message of Mon, 20 Jul 2015 11:04:08 -0700, Mark Sapiro writes:
>On 7/19/15 1:13 PM, Dominique Asselineau wrote:
>> Hello,
>> 
>> When an e-mail with a text/html content-type is converted to plain text,
>> the accented characters are not treated correctly.
>
>
>There are potential issues with this. Mailman gets the content of the
>text/html part and calls HTML_TO_PLAINTEXT_COMMAND (lynx -dump in the
>default case) to convert the HTML to a plain text rendering and replaces
>the content of the part with that and changes the Content-Type: to
>text/plain while maintaining any charset= parameter.
>
>Lynx normally does not recode any characters, so the output of lynx
>-dump should be in the same charset as the input and it should be OK.
>
>Problems arise if the input has characters represented as HTML entities
>such as &aacute; or &egrave;. In this case, lynx outputs the characters
>encoded in a charset which may not match the message's encoding.
>
>In order to say more, I would need to see a raw message as sent to the
>list with all headers intact and the resultant raw message from the list
>with all headers intact.
>
>-- 
>Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
>San Francisco Bay Area, California    better use your sense - B. Dylan
>------------------------------------------------------

I had enough trouble with lynx over this -- it used to be how I
converted all html mail my mail reader saw, but such characters
are not rare in the mail I receive -- that I gave up on lynx.

My new rule in my mailer for how to display html text is:

w3m -dump -o display_link_number=1 -cols 78 -T text/html -I "$(echo %a | sed -r 's/.*charset="?([-a-zA-Z0-9_]*).*/\1/')" -O utf-8 | less

which is one heck of a mouthful, but hasn't caused me any problems since.
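
The sed part of that rule only pulls the charset name out of whatever
the mailer substitutes for %a (presumably the Content-Type parameters),
so that w3m can be told the input charset with -I. A rough sketch of
that one step as a standalone shell function follows. The function name
extract_charset and the us-ascii fallback are assumptions made for
illustration and are not part of the rule above; the sketch also uses
sed -E instead of -r, which should behave the same on GNU sed.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original rule): print the charset
# named in a Content-Type parameter string such as
#   charset="iso-8859-1"   or   text/html; charset=UTF-8
# $1 = parameter string, $2 = optional fallback (assumed default: us-ascii)
extract_charset() {
    ec_params=$1
    ec_default=${2:-us-ascii}
    # -n plus the p flag prints only when the substitution matched, so a
    # string without charset= yields nothing instead of passing through.
    ec_found=$(printf '%s\n' "$ec_params" |
        sed -n -E 's/.*[Cc][Hh][Aa][Rr][Ss][Ee][Tt]="?([-a-zA-Z0-9_]*).*/\1/p')
    printf '%s\n' "${ec_found:-$ec_default}"
}

# Possible use, with the same w3m options as the rule above:
#   w3m -dump -T text/html -I "$(extract_charset "$params")" -O utf-8
```

The one behavioural difference from the one-liner is the fallback: when
no charset= is present, plain sed without -n echoes its input unchanged,
so the whole parameter string would be handed to -I.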

Just in case somebody else wants to ditch lynx ...

Laura