Unicode in headers
I've been trying to fix the outstanding problems with "funny" characters in real names in Mailman[*] and along the way I ran into a situation that I /think/ needs to be addressed in the email package. I'm not sure this is a good fix, let alone the right fix so I wanted to get some feedback from these two mailing lists.

Say I create a Header instance like so:

    from email.Header import Header
    h = Header(u'[P\xf6stal]', 'us-ascii')
    s = str(h)

what would you expect the value of s to be?

It's a bit of a trick question because in the current version, the str(h) will raise a UnicodeError since the h.encode() will be a unicode string containing non-ascii characters.

But I think this may not be the right thing to do. For one thing, we're saying we want the header to be in the us-ascii character set. For another, the RFCs state that headers need to be ascii characters and we should encode them if necessary.

OTOH, what we're doing /is/ a bit bogus since the value is clearly not in the requested character set. But OTOOH, I don't think we should have to check the value and do a bunch of coercion before we create the Header instance.

My proposal is to do a type check in Header.__str__() so that if the value of self.encode() returns a unicode string, we will coerce it to an 8-bit string like so:

    def __str__(self):
        """A synonym for self.encode().

        Guarantees that the return value contains only ASCII characters.
        """
        s = self.encode()
        if isinstance(s, type(u'')):
            return s.encode(str(self._charset), 'replace')
        return s

Here's a new test case that fails without this change, but succeeds with it (with no regressions).

    def test_unicode_value(self):
        eq = self.assertEqual
        v = u'[P\xf6stal]'
        h = Header(v, 'us-ascii')
        eq(str(h), '[P?stal]')

In the view of doing what's most useful, I'd like to make this change, but I still don't trust my judgement about things unicode, so I'd like to get some other opinions.

If we don't do this, then we'll probably have to add some defense in Generator._write_headers(), which wants to do

    text = '%s: %s' % (h, v)

That'll raise the UnicodeError in this situation, and because this can be fairly widely removed from what might be considered the real error, it's difficult to debug.

-Barry

[*] BTW, Martin, Ben, Tokio and others have been very helpful here. Thanks! And I hope to have fixes in place soon.
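For readers who want to see the proposed coercion in isolation, here is a minimal sketch written for a current Python 3 interpreter rather than the 2002-era email package. The helper name coerce_to_charset is hypothetical and is not part of the email package; only str.encode and its 'replace' error handler are real.

```python
def coerce_to_charset(value, charset='us-ascii'):
    """Hypothetical helper mirroring the proposed __str__ fallback.

    Characters the charset cannot represent are replaced with '?',
    so the result is always encodable, at the cost of losing data.
    """
    return value.encode(charset, 'replace').decode(charset)


lossy = coerce_to_charset('[P\xf6stal]')
print(lossy)  # the o-umlaut has been replaced by '?'
```

The result is pure ASCII, but the original character cannot be recovered from it.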
On Sat, Sep 21, 2002 at 01:00:43PM -0400, Barry A. Warsaw wrote:
>     from email.Header import Header
>     h = Header(u'[P\xf6stal]', 'us-ascii')
>     s = str(h)
>
> what would you expect the value of s to be?
Something like =?utf-8?Q?P=C3=B6stal?=

According to the RFC we should find the simplest encoding, which is UTF-8. In Czech we use ISO-8859-2: we check whether there are only ASCII characters, in which case we use ASCII, and if there are other characters we use ISO-8859-2.

So the way can be:

- are there only ASCII characters = OK let it be
- are there only characters from locale preferred encoding = use locale encoding
- in other cases, use UTF.

cheers
dan

-- 
-----------------------------------------------------------
/ Dan Ohnesorg                          Dan@ohnesorg.cz    \
< Jinočanská 7                    252 19 Rudná u Prahy      >
\ tel: +420 311 679679 +420 311 679976 fax: +420 311 679311 /
-----------------------------------------------------------
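The three-step rule above can be sketched as follows. This is an illustration in Python 3, not code from Mailman or the email package: the function name pick_charset and its preferred parameter are hypothetical, and the ISO-8859-2 default is only an assumption taken from the Czech example.

```python
from email.header import Header


def pick_charset(text, preferred='iso-8859-2'):
    """Return the first charset in the chain that encodes text exactly.

    The chain is: us-ascii, then the (assumed) locale-preferred
    charset, then utf-8 as the catch-all.
    """
    for charset in ('us-ascii', preferred, 'utf-8'):
        try:
            text.encode(charset)
        except UnicodeEncodeError:
            continue
        return charset
    # Only reachable for text that even UTF-8 rejects (lone surrogates).
    raise ValueError('text cannot be encoded')


name = '[P\xf6stal]'
charset = pick_charset(name)
# Build an RFC 2047 encoded header value in the chosen charset.
print(charset, Header(name, charset).encode())
```

Putting us-ascii first keeps plain headers untouched, and putting utf-8 last means the chain only fails on text that no charset could carry.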
Dan Ohnesorg wrote:
> On Sat, Sep 21, 2002 at 01:00:43PM -0400, Barry A. Warsaw wrote:
>>     from email.Header import Header
>>     h = Header(u'[P\xf6stal]', 'us-ascii')
>>     s = str(h)
>>
>> what would you expect the value of s to be?
>
> - are there only ASCII characters = OK let it be
> - are there only characters from locale preferred encoding = use locale encoding

I like this idea, but how do you define email's preferred language? In Mailman, it will be mm_cfg.DEFAULT_SERVER_LANGUAGE but ...

> - in other cases, use UTF.

I think UTF-8 is OK. Older MUAs won't break.

-- 
Tokio Kikuchi, tkikuchi@ is.kochi-u.ac.jp
http://weather.is.kochi-u.ac.jp/
On Sun, Sep 22, 2002 at 02:52:08PM +0900, Tokio Kikuchi wrote:
>> - are there only ASCII characters = OK let it be
>> - are there only characters from locale preferred encoding = use locale encoding
>
> I like this idea, but how do you define email's preferred language? In Mailman, it will be mm_cfg.DEFAULT_SERVER_LANGUAGE but ...

This is solved very well in mutt, in the file sendlib.c, line 800 and above. I am also sending the comments from that file. In mutt I have a variable which holds a list of encodings in order of preference:

    /*
     * Find the best charset conversion of the file from fromcode into one
     * of the tocodes. If successful, set *tocode and CONTENT *info and
     * return the number of characters converted inexactly. If no
     * conversion was possible, return -1.
     *
     * We convert via UTF-8 in order to avoid the condition -1(EINVAL),
     * which would otherwise prevent us from knowing the number of inexact
     * conversions. Where the candidate target charset is UTF-8 we avoid
     * doing the second conversion because iconv_open("UTF-8", "UTF-8")
     * fails with some libraries.
     *
     * We assume that the output from iconv is never more than 4 times as
     * long as the input for any pair of charsets we might be interested
     * in.
     */

    /*
     * Find the first of the fromcodes that gives a valid conversion and
     * the best charset conversion of the file into one of the tocodes. If
     * successful, set *fromcode and *tocode to dynamically allocated
     * strings, set CONTENT *info, and return the number of characters
     * converted inexactly. If no conversion was possible, return -1.
     *
     * Both fromcodes and tocodes may be colon-separated lists of charsets.
     * However, if fromcode is zero then fromcodes is assumed to be the
     * name of a single charset even if it contains a colon.
     */

cheers
dan
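The idea in those comments, choosing among an ordered list of target charsets by counting inexact conversions, might look roughly like this in Python 3. This is a loose sketch and not a port of mutt's sendlib.c: the name best_charset and its return convention are hypothetical, and Python's 'replace' error handler stands in for iconv's inexact conversion.

```python
def best_charset(text, tocodes):
    """Pick the candidate charset that converts text with the fewest losses.

    Returns (charset, inexact_count), or (None, -1) if tocodes is empty.
    Ties go to the earlier entry, so list order expresses preference.
    """
    best, best_bad = None, -1
    for charset in tocodes:
        # Round-trip through the charset; characters it cannot
        # represent come back as '?'.
        roundtrip = text.encode(charset, 'replace').decode(charset)
        bad = sum(1 for a, b in zip(text, roundtrip) if a != b)
        if best is None or bad < best_bad:
            best, best_bad = charset, bad
        if bad == 0:
            break  # an exact conversion cannot be improved on
    return best, best_bad


print(best_charset('[P\xf6stal]', ['us-ascii', 'iso-8859-2', 'utf-8']))
```

Unlike a plain try-in-order chain, this still returns the least lossy candidate when none of the listed charsets can represent the text exactly.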
>     from email.Header import Header
>     h = Header(u'[P\xf6stal]', 'us-ascii')
>     s = str(h)
[...]
> But I think this may not be the right thing to do. For one thing, we're saying we want the header to be in the us-ascii character set.

I think you are confusing issues here: You are *not* saying that you want the header to be in us-ascii. Instead, (to quote the docstring)

    Specify both s's character set, and the default character set by
    setting the charset argument to a Charset object

You need this argument to specify the encoding of the string *you are passing*, not (primarily) of the resulting Header. Since the argument is a Unicode string and not a byte string, the encoding argument is superfluous.

Now, the documentation also says that it uses the argument as the "default character set". By that, it does *not* mean that the entire header is going to be encoded in that encoding. Instead, it means that this value is used if later append calls do not declare an encoding.

> My proposal is to do a type check in Header.__str__() so that if the value of self.encode() returns a unicode string, we will coerce it to an 8-bit string like so:

This is evil. You are losing data without any need.

Instead, I propose the following procedure:

- if a Unicode argument is passed to Header.__init__ or Header.append, take the encoding only as a hint. As an argument to __init__, also record it as the default for later .append calls.
- when encoding the header, encode all Unicode strings with the hint. If that fails, encode them as UTF-8.

Regards,
Martin
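The procedure above can be sketched like this in Python 3. It is only an illustration of the bookkeeping, not the email package's implementation: encode_with_hint and HintedHeader are hypothetical names, and the chunk list holds raw bytes rather than RFC 2047 encoded words.

```python
def encode_with_hint(text, hint='us-ascii'):
    """Encode text with the hinted charset, falling back to UTF-8.

    Returns (encoded_bytes, charset_actually_used), so no data is lost.
    """
    try:
        return text.encode(hint), hint
    except UnicodeEncodeError:
        return text.encode('utf-8'), 'utf-8'


class HintedHeader:
    """Hypothetical stand-in for Header, showing only the hint handling."""

    def __init__(self, s=None, charset='us-ascii'):
        # The constructor's charset doubles as the default for append().
        self._default = charset
        self._chunks = []
        if s is not None:
            self.append(s)

    def append(self, s, charset=None):
        self._chunks.append(encode_with_hint(s, charset or self._default))


h = HintedHeader('Re:', 'us-ascii')
h.append('[P\xf6stal]')   # not ASCII, so this chunk falls back to UTF-8
print(h._chunks)
```

Each chunk remembers the charset that was actually used, which is what a later encoding step would need in order to label it correctly.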
"MvL" == Martin von Loewis <loewis@informatik.hu-berlin.de> writes:

    MvL> You need this argument to specify the encoding of the string
    MvL> *you are passing*, not (primarily) of the resulting
    MvL> Header. Since the argument is a Unicode string and not a byte
    MvL> string, the encoding argument is superfluous.

D'oh, of course you're right Martin.

    >> My proposal is to do a type check in Header.__str__() so that
    >> if the value of self.encode() returns a unicode string, we will
    >> coerce it to an 8-bit string like so:

    MvL> This is evil. You are losing data without any need.

    MvL> Instead, I propose the following procedure: - if a Unicode
    MvL> argument is passed to Header.__init__ or Header.append,
    MvL> take the encoding only as a hint. As an argument to
    MvL> __init__, also record it as the default for later .append
    MvL> calls.

    MvL> - when encoding the header, encode all Unicode strings with
    MvL> the hint. If that fails, encode them as UTF-8.

Alternatively, we could try to provoke a UnicodeError early, at the __init__ or .append call by doing something like:

    def append(self, s, charset=None):
        # ...
        # Encoding check. Better to know now whether we'll have an encoding
        # error than when we try to str'ify the header. Let UnicodeErrors
        # percolate to the caller.
        if _isunicode(s):
            s.encode(str(charset))
        else:
            unicode(s, str(charset))
        self._chunks.append((s, charset))

In other words, the caller is claiming that the string being passed in is encoded with the given character set (or the default if None is used). Fine, let's check that here since it will be easier to debug if the UnicodeError is raised now, rather than when the Generator tries to print the message header.

I think I could live with that, and will work out a different algorithm in Mailman.

-Barry
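For comparison, the early-check alternative can be tried out in Python 3 with a small self-contained class. CheckedHeader is a hypothetical stand-in and not the email package's Header; it assumes str for text and bytes for already-encoded input, in place of the unicode and 8-bit string types discussed in this thread.

```python
class CheckedHeader:
    """Hypothetical minimal header that validates each chunk on append."""

    def __init__(self, default_charset='us-ascii'):
        self._default = default_charset
        self._chunks = []

    def append(self, s, charset=None):
        charset = charset or self._default
        # Fail now rather than when the message is finally written out.
        if isinstance(s, str):
            s.encode(charset)    # may raise UnicodeEncodeError
        else:
            s.decode(charset)    # may raise UnicodeDecodeError
        self._chunks.append((s, charset))


h = CheckedHeader()
h.append('plain')
try:
    h.append('[P\xf6stal]')      # not representable in us-ascii
except UnicodeError as exc:
    print('rejected at append time:', exc)
```

The rejected chunk is never stored, so the error surfaces at the call that supplied the bad data rather than later, when the message is generated.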
barry@zope.com (Barry A. Warsaw) writes:
> Alternatively, we could try to provoke a UnicodeError early, at the __init__ or .append call by doing something like:

I see no reason to provoke a UnicodeError at all. An exception should only be raised if the library cannot correctly process the data being passed, or if the requested processing is ambiguous. Here, neither applies: there is a perfectly correct and meaningful processing of the data. If you raise an exception, the application would need to deal with it in just the same way as I propose.

> I think I could live with that, and will work out a different algorithm in Mailman.

I think users of the email package will find it more acceptable if no exception is raised.

Regards,
Martin