[Python-Dev] unicode hell/mixing str and unicode as dictionary keys

David Hopwood david.nospam.hopwood at blueyonder.co.uk
Mon Aug 7 16:57:15 CEST 2006

Michael Foord wrote:
> Martin v. Löwis wrote:
>>Expanding this view to Unicode should mean that a unicode
>>string U equals a byte string B if
>>U.encode(system_encoding) == B or B.decode(system_encoding) == U,
>>and that they don't equal otherwise (e.g. if the conversion
>>fails with a "not convertible" exception).

I disagree. Unicode strings should always be considered distinct from
non-ASCII byte strings. Implicitly encoding or decoding in order to
perform a comparison is a bad idea; it is expensive and will often do
the wrong thing.

The programmer should explicitly encode the Unicode string or decode
the byte string before comparison (which one of these is correct is
application-dependent).
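
As a rough illustration of what "explicit" means here, the sketch below
is written for Python 3, where str is the Unicode type and bytes is the
byte-string type. The helper names equal_as_text and equal_as_bytes are
hypothetical and exist only for this example; the point is that the
caller names the encoding and picks the direction of conversion.

```python
def equal_as_text(u: str, b: bytes, encoding: str) -> bool:
    """Decode the byte string with a caller-chosen encoding, then compare as text."""
    try:
        return b.decode(encoding) == u
    except UnicodeDecodeError:
        # The bytes are not valid in this encoding, so they cannot match.
        return False


def equal_as_bytes(u: str, b: bytes, encoding: str) -> bool:
    """Encode the Unicode string with a caller-chosen encoding, then compare as bytes."""
    try:
        return u.encode(encoding) == b
    except UnicodeEncodeError:
        # The text cannot be represented in this encoding, so it cannot match.
        return False


text = "caf\u00e9"            # 'cafe' with a precomposed e-acute
latin1_bytes = b"caf\xe9"     # the same text encoded as Latin-1
utf8_bytes = b"caf\xc3\xa9"   # the same text encoded as UTF-8

print(equal_as_text(text, latin1_bytes, "latin-1"))   # True
print(equal_as_bytes(text, utf8_bytes, "utf-8"))      # True
print(equal_as_text(text, utf8_bytes, "latin-1"))     # False: wrong encoding assumed
print(equal_as_text(text, latin1_bytes, "utf-8"))     # False: invalid UTF-8
```

The same bytes compare equal or unequal depending on the encoding the
caller assumes, which is why guessing one implicitly can give the wrong
answer.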

>>Which of the two conversions is selected is arbitrary; [...]

It would not be arbitrary. In the common case where the byte encoding
uses "precomposed" characters, using "U.encode(system_encoding) == B"
will tend to succeed in more cases than "B.decode(system_encoding) == U",
because alternative representations of the same abstract character in
Unicode will be mapped to the same precomposed character.

(Whether these are cases in which the comparison *should* succeed is,
as I said above, application-dependent.)
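
The asymmetry between the two directions can be sketched in Python 3 as
follows. Two assumptions are made for the example: Latin-1 stands in for
the system encoding, and normalizing_encode is a hypothetical encoder
that composes characters (NFC) before encoding. Python's built-in codecs
do not normalize, so the composing step is written out by hand here.

```python
import unicodedata

ENCODING = "latin-1"  # assumed stand-in for the system encoding


def normalizing_encode(u: str, encoding: str = ENCODING) -> bytes:
    """Hypothetical encoder: compose to NFC first, then encode."""
    return unicodedata.normalize("NFC", u).encode(encoding)


precomposed = "\u00e9"   # e-acute as a single code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
b = b"\xe9"              # e-acute in Latin-1

# The two Unicode strings represent the same abstract character
# but are different code point sequences.
assert precomposed != decomposed

# Encode-then-compare: both representations map to the same byte.
encode_matches = [normalizing_encode(u) == b for u in (precomposed, decomposed)]

# Decode-then-compare: decoding yields only the precomposed form.
decode_matches = [b.decode(ENCODING) == u for u in (precomposed, decomposed)]

print(encode_matches)  # [True, True]
print(decode_matches)  # [True, False]
```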

Treating US-ASCII byte strings as equal to the corresponding Unicode
strings is a special case that is more reasonable than doing the same
for a general byte encoding, because:

 - it can be done with no (or only a trivial) conversion,
 - US-ASCII has no precomposed characters or combining marks, so it
   does not have multiple encodings for the same abstract character,
 - Unicode has a US-ASCII subset that uses exactly the same encoding
   model as US-ASCII (whereas in general, a byte encoding might use
   an arbitrarily different encoding model to Unicode, as for example
   is the case for ISCII).
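
The first two points in the list above can be checked with a short
Python 3 sketch. The function name ascii_is_trivial is hypothetical; it
tests that each byte value equals the corresponding code point and that
no Unicode normalization form changes the text.

```python
import unicodedata


def ascii_is_trivial(b: bytes) -> bool:
    """Check that ASCII bytes map one-to-one onto stable Unicode code points."""
    text = b.decode("ascii")  # raises UnicodeDecodeError for non-ASCII input

    # Each byte value is numerically the code point of the decoded character.
    same_values = list(b) == [ord(ch) for ch in text]

    # No normalization form alters pure ASCII text, so there is only
    # one Unicode representation of it.
    stable = all(
        unicodedata.normalize(form, text) == text
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    )

    # Encoding back gives the original bytes.
    round_trip = text.encode("ascii") == b

    return same_values and stable and round_trip


print(ascii_is_trivial(b"hello, world"))  # True
```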

>>we should, of course, continue to use the one we always used (for
>>"ascii", there is no difference between the two).
> +1
> This seems the most (only ?) logical solution.

No; always considering Unicode and non-ASCII byte strings to be distinct
is just as logical.

David Hopwood <david.nospam.hopwood at blueyonder.co.uk>
