Decode email subjects into unicode

John Machin sjmachin at lexicon.net
Tue Mar 18 19:57:40 CET 2008


On Mar 18, 9:09 pm, Laszlo Nagy <gand... at shopzeus.com> wrote:
> Sorry, meanwhile I found that "email.Header.decode_header" can be used
> to convert the subject into unicode:
>
> > def decode_header(self,headervalue):
> >     val,encoding = decode_header(headervalue)[0]
> >     if encoding:
> >         return val.decode(encoding)
> >     else:
> >         return val
>
> However, there are malformed emails and I have to put them into the
> database. What should I do with this:
>
> Return-Path: <imit... at exalumnos.com>
> X-Original-To: i... at designasign.biz
> Delivered-To: dapi... at localhost.com
> Received: from 195.228.74.135 (unknown [122.46.173.89])
> by shopzeus.com (Postfix) with SMTP id F1C071DD438;
> Tue, 18 Mar 2008 05:43:27 -0400 (EDT)
> Date: Tue, 18 Mar 2008 12:43:45 +0200
> Message-ID: <60285728.00719565 at optometrist.com>
> From: "Euro Dice Casino" <imit... at exalumnos.com>
> To: tho... at designasign.biz
> Subject: With 2'500 Euro of Welcome Bonus you can't miss the chance!
> MIME-Version: 1.0
> Content-Type: text/html; charset=iso-8859-1
> Content-Transfer-Encoding: 7bit
>
> There is no encoding given in the subject but it contains 0x92. When I
> try to insert this into the database, I get:
>
> ProgrammingError: invalid byte sequence for encoding "UTF8": 0x92
>
> All right, this probably was a spam email and I should simply discard
> it. Probably the spammer used this special character in order to prevent
> mail filters detecting "can't" and "2500". But I guess there will be
> other important (ham) emails with bad encodings. How should I handle this?

Maybe with some heuristics about the types of mistakes made by
do-it-yourself e-mail header constructors. For example, 'iso-8859-1'
often should be construed as 'cp1252'. In iso-8859-1, 0x92 is an
unnamed C1 control character; in cp1252 it is a printable quote mark:

>>> import unicodedata as ucd
>>> ucd.name('\x92'.decode('iso-8859-1'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
>>> ucd.name('\x92'.decode('cp1252'))
'RIGHT SINGLE QUOTATION MARK'
>>>
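
A minimal sketch of that heuristic, written for Python 3 (the session
above is Python 2, where str is a byte string). The names decode_lenient
and LIKELY_MEANT are made up for illustration, and the table is only a
guess at common mislabellings, not a complete list:

```python
import unicodedata

# Hypothetical table: charset the header declares -> charset probably meant.
LIKELY_MEANT = {
    "iso-8859-1": "cp1252",
    "latin-1": "cp1252",
    "us-ascii": "cp1252",
    "ascii": "cp1252",
}

def decode_lenient(raw, declared=None):
    """Decode header bytes to str, preferring cp1252 for suspect labels."""
    name = (declared or "us-ascii").lower()
    # Try the (possibly substituted) declared charset first, then cp1252.
    for codec in (LIKELY_MEANT.get(name, name), "cp1252"):
        try:
            return raw.decode(codec)
        except (UnicodeDecodeError, LookupError):
            # Bad bytes for this codec, or an unknown charset name.
            continue
    # iso-8859-1 maps every byte to a code point, so this cannot fail.
    return raw.decode("iso-8859-1")

text = decode_lenient(b"can\x92t", "iso-8859-1")
print(repr(text), unicodedata.name(text[3]))
```

The last fallback matters because Python's cp1252 codec leaves a few
bytes (0x81, for one) undefined and raises on them. A result decoded
this way is ordinary unicode, so it can be encoded as UTF-8 for the
database without the "invalid byte sequence" error.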


