[Mailman-i18n] "Funny" characters in real names?
Martin v. Löwis
19 Sep 2002 09:37:38 +0200
email@example.com (Barry A. Warsaw) writes:
> Martin, sometimes this Unicode stuff makes my head hurt. ;)
In an application that deals with multiple charsets on a regular basis
(such as mailman), I recommend not mixing byte strings and Unicode
strings. This can be achieved by
- converting all byte strings that represent text data to Unicode
at the earliest possible point in processing,
- converting all Unicode strings back to byte strings just before
output.
If most data is likely ASCII, it is tempting to use byte strings for
pure-ASCII, and Unicode for everything else. Try to resist this
temptation.
If you follow this strategy, you find that processing becomes much
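The "decode early, encode late" strategy described above might look roughly like this. It is only a sketch, written for Python 3, where str corresponds to the Unicode strings and bytes to the byte strings discussed in this post. The function names read_text, process and write_text are made up for illustration and are not part of Mailman.

```python
# Python 3 sketch: bytes = "byte strings", str = "Unicode strings".
# All function names here are hypothetical.

def read_text(raw, charset):
    """Decode incoming bytes to str at the earliest possible point."""
    return raw.decode(charset)

def process(text):
    """Internal processing only ever sees str, never bytes."""
    return " ".join(text.split())   # e.g. normalize whitespace

def write_text(text, charset):
    """Encode str back to bytes only at the very end."""
    return text.encode(charset)

raw_in = b"  Martin v.  L\xf6wis "          # Latin-1 bytes from the outside world
text = read_text(raw_in, "iso-8859-1")      # str from here on
raw_out = write_text(process(text), "utf-8")  # bytes again, just before leaving
```

With this layout the charset of the input and the charset of the output are independent of each other, and nothing in between has to care about either.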
> So it seems like name.encode('us-ascii') is my only choice. What am I
If you are following the above strategy, you will know whether name is
a Unicode string or a byte string. If it is Unicode, .encode is fine. If
it is a byte string, unicode(name,'ascii') will work.
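For illustration, here are the two directions in Python 3 spelling, where unicode(name,'ascii') becomes name.decode('ascii') on a bytes object. The sample values are invented. The last part shows that a non-ASCII name makes the conversion fail loudly instead of silently producing wrong data.

```python
# Case 1: name is a str (the "Unicode string" case).
name = "Barry A. Warsaw"
as_bytes = name.encode("ascii")       # works, the name is pure ASCII

# Case 2: name is bytes (the "byte string" case).
raw = b"Barry A. Warsaw"
as_text = raw.decode("ascii")         # Python 3 spelling of unicode(raw, 'ascii')

# A non-ASCII name cannot be encoded as ASCII and raises an error.
try:
    "L\u00f6wis".encode("ascii")
except UnicodeEncodeError:
    failed = True
else:
    failed = False
```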
I admit that the strategy has two problems:
1. In some cases, it might be impossible to generate a Unicode string
for text data. In MIME, the encoding may not be specified, or it
may be unknown to mailman, or the data may fail to convert.
In these cases, it may be acceptable to "force" the data to
Unicode: If there is no encoding, guess latin-1. If the string
fails to convert, convert it with "replace". If the encoding is
unknown, replace all non-printable characters with question marks.
Whether this is acceptable depends on how frequently the problem
occurs and whose fault it is (e.g. an unknown encoding should be
added to Mailman).
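The three fallbacks above could be combined into one helper along these lines. This is a Python 3 sketch under a few assumptions: the name force_to_text is hypothetical, "non-printable" is taken to mean anything outside printable ASCII (tab, CR and LF are kept), and an encoding counts as unknown when Python itself has no codec for it.

```python
def force_to_text(data, charset=None):
    """Best-effort conversion of bytes to str (hypothetical helper)."""
    if charset is None:
        # No encoding given: guess latin-1, which maps every byte
        # to some character and therefore never fails.
        return data.decode("latin-1")
    try:
        # Known encoding: undecodable bytes become U+FFFD
        # instead of raising an exception.
        return data.decode(charset, "replace")
    except LookupError:
        # Encoding unknown to Python: keep printable ASCII and
        # basic whitespace, replace everything else with '?'.
        return "".join(
            chr(b) if b in (9, 10, 13) or 32 <= b < 127 else "?"
            for b in data
        )

guessed = force_to_text(b"L\xf6wis")                          # no charset
replaced = force_to_text(b"L\xf6wis", "ascii")                # fails to convert
masked = force_to_text(b"L\xf6wis\n", "x-no-such-charset")    # unknown charset
```

Whatever the input, the result is always a str, so the rest of the application never has to deal with a mix of types.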
2. When converting an application that used to be byte-oriented to
Unicode, adding conversions at all required places might be too
much effort, or breakage because of incorrect data might be
In these cases, I recommend adding type tests at strategic places,
and papering over any incorrect data.
E.g. in this case, you could write a function
if type(text) is types.UnicodeType:
raise DebugError, "string not unicode:"+repr(text)
If you expect name to be a byte string, the function would be
bytes_are_ascii, of course.
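A pair of such type-testing helpers might look as follows in Python 3. This is a guess at the general shape, not the function from this post: the name text_is_ascii, the decision to return the converted value, and the DebugError class defined here are all assumptions made for the sake of a runnable example.

```python
class DebugError(Exception):
    """Stand-in for the DebugError mentioned above (assumed to be an exception)."""

def text_is_ascii(text):
    """Expect a str holding only ASCII; return it encoded as bytes."""
    if isinstance(text, str):
        return text.encode("ascii")
    raise DebugError("string not unicode: " + repr(text))

def bytes_are_ascii(data):
    """Expect bytes holding only ASCII; return them decoded as str."""
    if isinstance(data, bytes):
        return data.decode("ascii")
    raise DebugError("string not bytes: " + repr(data))

try:
    text_is_ascii(b"Barry")        # wrong type on purpose
except DebugError as exc:
    message = str(exc)
```

Placed at a few strategic call sites, helpers like these report the first spot where the wrong string type sneaks in, rather than letting it fail somewhere far downstream.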