Decode email subjects into unicode

Laszlo Nagy gandalf at shopzeus.com
Tue Mar 18 11:09:32 CET 2008


Sorry, meanwhile I found that "email.Header.decode_header" can be used 
to convert the subject into unicode:

> def decode_header(self,headervalue):
>     val,encoding = decode_header(headervalue)[0]
>     if encoding:
>         return val.decode(encoding)
>     else:
>         return val
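
A header value can be made of several fragments, each with its own charset, so reading only element [0] may drop part of the subject. Below is a minimal sketch that joins every fragment. It assumes Python 3's email.header module (the snippet above is Python 2), and the name decode_subject is made up for illustration:

```python
from email.header import decode_header


def decode_subject(headervalue):
    """Decode every fragment of a header value and join them into one str."""
    parts = []
    for val, encoding in decode_header(headervalue):
        if isinstance(val, bytes):
            try:
                # No declared charset: assume ASCII; bad bytes become U+FFFD.
                parts.append(val.decode(encoding or "ascii", errors="replace"))
            except LookupError:
                # The header named a charset Python does not know.
                parts.append(val.decode("ascii", errors="replace"))
        else:
            # decode_header hands back a str when nothing was encoded.
            parts.append(val)
    return "".join(parts)
```

Using errors="replace" means a malformed fragment loses the offending bytes instead of raising, which may or may not be acceptable for your data.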

However, there are malformed emails and I have to put them into the 
database. What should I do with this:


Return-Path: <imitate at exalumnos.com>
X-Original-To: info at designasign.biz
Delivered-To: dapinfo at localhost.com
Received: from 195.228.74.135 (unknown [122.46.173.89])
    by shopzeus.com (Postfix) with SMTP id F1C071DD438;
    Tue, 18 Mar 2008 05:43:27 -0400 (EDT)
Date: Tue, 18 Mar 2008 12:43:45 +0200
Message-ID: <60285728.00719565 at optometrist.com>
From: "Euro Dice Casino" <imitate at exalumnos.com>
To: thomas at designasign.biz
Subject: With 2’500 Euro of Welcome Bonus you can’t miss the chance!
MIME-Version: 1.0
Content-Type: text/html; charset=iso-8859-1
Content-Transfer-Encoding: 7bit



There is no encoding given in the subject but it contains 0x92. When I 
try to insert this into the database, I get:

ProgrammingError: invalid byte sequence for encoding "UTF8": 0x92

All right, this was probably a spam email and I should simply discard 
it. The spammer probably used this special character to prevent mail 
filters from detecting "can't" and "2500". But I guess there will be 
other important (ham) emails with bad encodings. How should I handle this?
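
One possible approach, sketched here under assumptions rather than offered as the definitive answer: when a header contains raw bytes with no declared charset, try strict UTF-8 first, then Windows-1252 (where 0x92 is the right single quotation mark U+2019, which fits the subject above), and finally Latin-1, which accepts any byte. The function name to_unicode and the codec order are illustrative guesses, and the code is written for Python 3:

```python
def to_unicode(raw, fallbacks=("utf-8", "cp1252")):
    """Decode raw header bytes, trying each codec in turn."""
    for codec in fallbacks:
        try:
            return raw.decode(codec)
        except UnicodeDecodeError:
            continue
    # Latin-1 maps every byte to a code point, so this cannot fail.
    return raw.decode("latin-1")


subject = to_unicode(b"you can\x92t miss the chance!")
```

The result is a proper unicode string, so it can be encoded as valid UTF-8 before it goes into the database. The guess can still be wrong for mail written in other 8-bit charsets, so it may be worth storing the original bytes alongside it.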

Thanks,

Laszlo



