
Thanks to http://bugs.python.org/issue7077 I've noticed that the socket-based logging handlers - SocketHandler, DatagramHandler and SysLogHandler - aren't Unicode-aware and can break in the presence of Unicode messages. I'd like to fix this by giving these handlers an optional (encoding=None) parameter in their __init__, and then using this to encode on output. If no encoding is specified, is it best to use locale.getpreferredencoding(), sys.getdefaultencoding(), sys.getfilesystemencoding(), 'utf-8' or something else? On my system:
which suggests to me that the locale.getpreferredencoding() should be the default. However, as I'm not a Unicode maven, any suggestions would be welcome. Regards, Vinay Sajip
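For anyone wanting to compare the candidate defaults on their own machine, here is a minimal sketch that collects the three stdlib values named above. The helper name `candidate_encodings` is made up for illustration, and the values it returns vary by platform and locale settings.

```python
import locale
import sys


def candidate_encodings():
    """Return the three stdlib encoding candidates discussed above.

    Hypothetical helper: the name is invented for this example.
    """
    return {
        "locale.getpreferredencoding()": locale.getpreferredencoding(),
        "sys.getdefaultencoding()": sys.getdefaultencoding(),
        "sys.getfilesystemencoding()": sys.getfilesystemencoding(),
    }


if __name__ == "__main__":
    # Print each candidate so the values can be compared side by side.
    for name, value in candidate_encodings().items():
        print(f"{name}: {value}")
```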

I can't understand what the problem with SocketHandler/DatagramHandler is. As they use pickle, they should surely be able to send records with Unicode strings in them, no? OTOH, why is SMTPHandler not in your list?
For syslog, I don't think that's appropriate. I presume this is meant to follow RFC 5424? If so, it SHOULD send the data in UTF-8, in which case it MUST include a BOM also. A.8 then says that if you are not certain that it is UTF-8 (which you wouldn't be if the application passes a byte string), you MAY omit the BOM. Regards, Martin

Martin v. Löwis <martin <at> v.loewis.de> writes:
Of course you are right. When I posted that it was a knee-jerk reaction to the issue that was raised for SysLogHandler configured to use UDP. I did realise a bit later that the issue didn't apply to the other two handlers but I was hoping nobody would notice ;-)
> OTOH, why is SMTPHandler not in your list?
I assumed smtp.sendmail() would deal with it, as it deals with the wire protocol, but perhaps I was wrong to do so. I noticed that Issue 521270 (SMTP does not handle Unicode) was closed, but I didn't look at it closely. I now see it was perhaps only a partial solution. I did a bit of searching and found this post by Marius Gedminas: http://mg.pov.lt/blog/unicode-emails-in-python.html Now if that's the right approach, shouldn't it be catered for in a more general part of the stdlib than logging - perhaps in smtplib itself? Or, seeing that Marius' post is five years old, is there a better way of doing it using the stdlib as it is now?
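As a rough illustration of the kind of handling in question, here is a sketch that builds a Unicode-safe message with the stdlib `email` package before it is handed to smtplib. It is one possible approach, not necessarily the one in the linked post, and the function name `build_unicode_email` and the addresses in the usage note are hypothetical.

```python
from email.header import Header
from email.mime.text import MIMEText


def build_unicode_email(sender, recipient, subject, body):
    """Build a message whose body and subject are both encoded as UTF-8.

    Hypothetical helper, written for Python 3.
    """
    # Passing a charset makes MIMEText encode the body and set the
    # Content-Type charset parameter accordingly.
    msg = MIMEText(body, "plain", "utf-8")
    msg["From"] = sender
    msg["To"] = recipient
    # Non-ASCII header values need RFC 2047 encoding; Header handles that.
    msg["Subject"] = Header(subject, "utf-8")
    return msg
```

The resulting `msg.as_string()` is pure ASCII, so it could then be passed to something like `smtplib.SMTP(host).sendmail("a@example.com", ["b@example.com"], msg.as_string())`.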
So ISTM that the right thing to do on 2.x would be: if str to be sent, send as is; if unicode to be sent, encode using utf-8 and send with a BOM. For 3.x, just encode using utf-8 and send with a BOM. Does that seem right? Thanks and regards, Vinay Sajip
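The rule proposed above can be sketched as a small function. This is only an illustration written for Python 3, where `bytes` plays the role of the 2.x `str` case; the name `encode_syslog_msg` is hypothetical and is not part of the logging module.

```python
import codecs


def encode_syslog_msg(msg):
    """Encode a syslog message body following the rule proposed above.

    Hypothetical helper, not an existing logging API.
    """
    if isinstance(msg, bytes):
        # Encoding unknown (the 2.x str case): send as is, with no BOM.
        return msg
    # Text (the 2.x unicode case, and every str on 3.x):
    # encode as UTF-8 and prefix the UTF-8 BOM.
    return codecs.BOM_UTF8 + msg.encode("utf-8")
```

For example, `encode_syslog_msg("h\u00e9llo")` yields the three BOM bytes followed by the UTF-8 encoding of the text, while `encode_syslog_msg(b"raw")` is returned unchanged.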

participants (3): "Martin v. Löwis", MRAB, Vinay Sajip