Handling some isolated iso-8859-1 characters
gagsl-py2 at yahoo.com.ar
Wed Jun 4 04:12:00 CEST 2008
On Tue, 03 Jun 2008 15:38:09 -0300, Daniel Mahoney <dan at catfolks.net> wrote:
> I'm working on an app that's processing Usenet messages. I'm making a
> connection to my NNTP feed and grabbing the headers for the groups I'm
> interested in, saving the info to disk, and doing some post-processing.
> I'm finding a few bizarre characters and I'm not sure how to handle them.
> One of the lines I'm finding this problem with contains:
> 137050 Cleo and I have an anouncement! "Mlle.
> <not at aol.com> Sun, 21 Nov 2004 16:21:50 -0500
> <lmzdkqmqt2fj.54wmpv3zmvvx.dlg at 40tude.net> 4478 69 Xref:
> sn-us rec.pets.cats.community:137050
> The interesting patch is the string that reads
> An HTML rendering of what this string should look like would be "Anaïs".
> What I'm doing now is a brute-force substitution from the version in the
> file to the HTML version. That's ugly. What's a better way to translate
> that string? Or is my problem that I'm grabbing the headers from the NNTP
> server incorrectly?
No, it's not you; those headers are encoded following RFC 2047.
Python already supports that format: use the email.header module, whose
decode_header function parses such encoded words.
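
A minimal sketch of that approach, assuming Python 3. The encoded word below is
a made-up sample that spells "Anaïs" in ISO-8859-1, not necessarily the exact
string from the headers in question, and decode_rfc2047 is just an illustrative
helper name:

```python
from email.header import decode_header, make_header


def decode_rfc2047(value):
    """Return a header value with any RFC 2047 encoded words decoded to text."""
    # decode_header splits the value into (bytes_or_str, charset) pairs;
    # make_header joins them back into one Header, and str() renders it
    # as ordinary Unicode text.
    return str(make_header(decode_header(value)))


# Hypothetical sample input: a Q-encoded ISO-8859-1 word.
sample = "=?ISO-8859-1?Q?Ana=EFs?="
print(decode_rfc2047(sample))
```

Once the header is plain Unicode text, it can be escaped for HTML like any
other string, so no per-character substitution table is needed.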