Handling some isolated iso-8859-1 characters
justin.mailinglists at gmail.com
Wed Jun 4 04:10:38 CEST 2008
On Jun 4, 2:38 am, Daniel Mahoney <d... at catfolks.net> wrote:
> I'm working on an app that's processing Usenet messages. I'm making a
> connection to my NNTP feed and grabbing the headers for the groups I'm
> interested in, saving the info to disk, and doing some post-processing.
> I'm finding a few bizarre characters and I'm not sure how to handle them.
> One of the lines I'm finding this problem with contains:
> 137050 Cleo and I have an anouncement! "Mlle. =?iso-8859-1?Q?Ana=EFs?="
> <n... at aol.com> Sun, 21 Nov 2004 16:21:50 -0500
> <lmzdkqmqt2fj.54wmpv3zmvvx.... at 40tude.net> 4478 69 Xref:
> sn-us rec.pets.cats.community:137050
> The interesting part is the string that reads "=?iso-8859-1?Q?Ana=EFs?=".
> An HTML rendering of what this string should look like would be "Anaïs".
> What I'm doing now is a brute-force substitution from the version in the
> file to the HTML version. That's ugly. What's a better way to translate
> that string? Or is my problem that I'm grabbing the headers from the NNTP
> server incorrectly?
>>> from email.Header import decode_header
>>> (s, e), = decode_header("=?iso-8859-1?Q?Ana=EFs?=")
>>> import unicodedata
>>> import htmlentitydefs
>>> for c in s.decode(e):
...     print ord(c), unicodedata.name(c)
...
65 LATIN CAPITAL LETTER A
110 LATIN SMALL LETTER N
97 LATIN SMALL LETTER A
239 LATIN SMALL LETTER I WITH DIAERESIS
115 LATIN SMALL LETTER S
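
The session above is Python 2 and imports htmlentitydefs without using it. As a rough sketch only, here is how the same decoding, plus one possible HTML-entity step, might look on Python 3, where the modules are named email.header and html.entities. The helper names decode_mime_words and to_html are made up for this illustration and are not part of any library; the sketch also assumes the charset named in the header is one Python knows about.

```python
from email.header import decode_header
from html.entities import codepoint2name


def decode_mime_words(raw):
    """Decode RFC 2047 encoded words in a header value to a str.

    Hypothetical helper for illustration.
    """
    parts = []
    for chunk, charset in decode_header(raw):
        if isinstance(chunk, bytes):
            # Encoded words come back as bytes plus their charset;
            # fall back to ASCII when no charset is given.
            chunk = chunk.decode(charset or "ascii", errors="replace")
        parts.append(chunk)
    return "".join(parts)


def to_html(text):
    """Replace non-ASCII characters with named or numeric HTML entities.

    Hypothetical helper; ASCII characters (including & and <) are
    passed through unchanged, so this is not a full HTML escaper.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)
        elif cp in codepoint2name:
            out.append("&%s;" % codepoint2name[cp])
        else:
            out.append("&#%d;" % cp)
    return "".join(out)


name = decode_mime_words("=?iso-8859-1?Q?Ana=EFs?=")
print(name)           # Anaïs
print(to_html(name))  # Ana&iuml;s
```

This way the header is decoded once into a real Unicode string, and the HTML form is produced only at output time, rather than substituting strings in the saved file.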