Handling some isolated iso-8859-1 characters

Daniel Mahoney dan at catfolks.net
Tue Jun 3 20:38:09 CEST 2008


I'm working on an app that's processing Usenet messages. I'm making a
connection to my NNTP feed and grabbing the headers for the groups I'm
interested in, saving the info to disk, and doing some post-processing.
I'm finding a few bizarre characters and I'm not sure how to handle them
pythonically.

One of the lines I'm finding this problem with contains:
137050  Cleo and I have an anouncement!   "Mlle. =?iso-8859-1?Q?Ana=EFs?="
<not at aol.com>  Sun, 21 Nov 2004 16:21:50 -0500
<lmzdkqmqt2fj.54wmpv3zmvvx.dlg at 40tude.net>              4478    69 Xref:
sn-us rec.pets.cats.community:137050

The interesting part is the string that reads "=?iso-8859-1?Q?Ana=EFs?=".
An HTML rendering of what this string should look like would be "Anaïs".

What I'm doing now is a brute-force substitution from the version in the
file to the HTML version. That's ugly. What's a better way to translate
that string? Or is my problem that I'm grabbing the headers from the NNTP
server incorrectly?
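
For reference, that string looks like a MIME "encoded word" (RFC 2047), so
the headers are probably being fetched correctly and only need decoding.
Below is a minimal sketch of one way to do that, assuming Python 3 and the
standard library's email.header module. The helper name decode_mime_words is
made up for illustration, and falling back to ASCII when no charset is given
is an assumption, not something the NNTP feed guarantees.

```python
from email.header import decode_header


def decode_mime_words(raw):
    """Decode RFC 2047 encoded words in a header value (hypothetical helper)."""
    pieces = []
    # decode_header() returns a list of (text, charset) pairs; text is
    # bytes for decoded parts and may be str when nothing was encoded.
    for text, charset in decode_header(raw):
        if isinstance(text, bytes):
            # Assumption: treat parts without a declared charset as ASCII,
            # and replace undecodable bytes instead of raising.
            text = text.decode(charset or "ascii", errors="replace")
        pieces.append(text)
    return "".join(pieces)
```

Calling decode_mime_words("=?iso-8859-1?Q?Ana=EFs?=") should give the string
"Anaïs". A header that names a charset Python does not know would still raise
LookupError in this sketch, so real code may want to catch that.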





