Unicode in headers
I've been trying to fix the outstanding problems with "funny" characters in real names in Mailman[*], and along the way I ran into a situation that I /think/ needs to be addressed in the email package. I'm not sure this is a good fix, let alone the right fix, so I wanted to get some feedback from these two mailing lists.

Say I create a Header instance like so:

    from email.Header import Header
    h = Header(u'[P\xf6stal]', 'us-ascii')
    s = str(h)

What would you expect the value of s to be? It's a bit of a trick question, because in the current version str(h) raises a UnicodeError: h.encode() returns a unicode string containing non-ascii characters, and that string cannot be converted to an 8-bit string.

But I think this may not be the right thing to do. For one thing, we're saying we want the header to be in the us-ascii character set. For another, the RFCs state that headers need to be ascii characters and that we should encode them if necessary. OTOH, what we're doing /is/ a bit bogus, since the value is clearly not in the requested character set. But OTOOH, I don't think we should have to check the value and do a bunch of coercion before we create the Header instance.

My proposal is to do a type check in Header.__str__(), so that if self.encode() returns a unicode string, we coerce it to an 8-bit string like so:

    def __str__(self):
        """A synonym for self.encode().

        Guarantees that the return value contains only ASCII characters.
        """
        s = self.encode()
        if isinstance(s, type(u'')):
            return s.encode(str(self._charset), 'replace')
        return s

Here's a new test case that fails without this change, but succeeds with it (with no regressions):

    def test_unicode_value(self):
        eq = self.assertEqual
        v = u'[P\xf6stal]'
        h = Header(v, 'us-ascii')
        eq(str(h), '[P?stal]')

In the interest of doing what's most useful, I'd like to make this change, but I still don't trust my judgement about things unicode, so I'd like to get some other opinions.

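For anyone who wants to see the trade-off in isolation, here is a minimal sketch written for current Python 3 (where all text is already Unicode) rather than the Python 2 of this message. It assumes the modern email.header module; coerce_header_text is a hypothetical helper name used only for illustration, not part of the email package. It contrasts the lossy 'replace' coercion proposed above with RFC 2047 encoding in a charset that can actually hold the character.

```python
from email.header import Header, decode_header, make_header


def coerce_header_text(text, charset="us-ascii"):
    """Hypothetical helper: force text into charset, replacing
    every character the charset cannot represent with '?'."""
    return text.encode(charset, "replace")


value = "[P\xf6stal]"

# Strict encoding is what fails today: o-umlaut is not an ASCII character.
try:
    value.encode("us-ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# The proposed coercion never raises, but it loses the original character.
coerced = coerce_header_text(value)

# By contrast, declaring a charset that covers the character lets the
# header be RFC 2047 encoded into pure ASCII without losing anything.
encoded = Header(value, "iso-8859-1").encode()
round_trip = str(make_header(decode_header(encoded)))

print(strict_failed)
print(coerced)
print(encoded)
print(round_trip == value)
```

The 'replace' route keeps the header printable at the cost of the original character; the encoded-word route preserves it, but only if the caller names a charset that covers the value.
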
If we don't do this, then we'll probably have to add some defense in Generator._write_headers(), which wants to do

    text = '%s: %s' % (h, v)

That'll raise the UnicodeError in this situation, and because the failure can be far removed from what might be considered the real error, it's difficult to debug.

-Barry

[*] BTW, Martin, Ben, Tokio and others have been very helpful here. Thanks! And I hope to have fixes in place soon.