[Python-Dev] unicode hell/mixing str and unicode as dictionary keys

David Hopwood david.nospam.hopwood at blueyonder.co.uk
Mon Aug 7 16:57:15 CEST 2006


Michael Foord wrote:
> Martin v. Löwis wrote:
> 
>>[snip..]
>>Expanding this view to Unicode should mean that a unicode
>>string U equals a byte string B if
>>U.encode(system_encoding) == B or B.decode(system_encoding) == U,
>>and that they don't equal otherwise (e.g. if the conversion
>>fails with a "not convertible" exception).

I disagree. Unicode strings should always be considered distinct from
non-ASCII byte strings. Implicitly encoding or decoding in order to
perform a comparison is a bad idea; it is expensive and will often do
the wrong thing.
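
For concreteness, the rule being proposed amounts to roughly the
following sketch (purely illustrative, not actual Python behaviour;
mixed_eq is a made-up name, and which direction of conversion to try
is, as noted below, arbitrary):

    import sys

    def mixed_eq(u, b, encoding=sys.getdefaultencoding()):
        # Hypothetical: compare unicode u with byte string b by
        # converting one to the other; unequal if conversion fails.
        try:
            return u.encode(encoding) == b   # or: b.decode(encoding) == u
        except UnicodeError:
            return False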

The programmer should explicitly encode the Unicode string or decode
the byte string before comparison (which one of these is correct is
application-dependent).
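
For example (Python 2; UTF-8 is chosen here purely for illustration,
and the right codec to use depends on the application):

    u = u'caf\u00e9'
    b = 'caf\xc3\xa9'              # the same text as UTF-8 bytes

    # Encode the unicode side, or decode the byte side, explicitly:
    print u.encode('utf-8') == b   # True
    print b.decode('utf-8') == u   # True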

>>Which of the two conversions is selected is arbitrary; [...]

It would not be arbitrary. In the common case where the byte encoding
uses "precomposed" characters, using "U.encode(system_encoding) == B"
will tend to succeed in more cases than "B.decode(system_encoding) == U",
because alternative representations of the same abstract character in
Unicode will be mapped to the same precomposed character.

(Whether these are cases in which the comparison *should* succeed is,
as I said above, application-dependent.)
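
To make the multiple-representation point concrete (a small Python 2
example; 'latin-1' merely stands in for whatever the system encoding
happens to be):

    import unicodedata

    precomposed = u'\u00e9'    # LATIN SMALL LETTER E WITH ACUTE
    decomposed = u'e\u0301'    # 'e' followed by COMBINING ACUTE ACCENT

    # The same abstract character, as different code point sequences:
    print precomposed == decomposed                                # False
    print unicodedata.normalize('NFC', decomposed) == precomposed  # True

    # Only the precomposed form encodes to Latin-1 directly; the
    # decomposed form has to be normalized first:
    nfc = unicodedata.normalize('NFC', decomposed)
    print repr(precomposed.encode('latin-1'))   # '\xe9'
    print repr(nfc.encode('latin-1'))           # '\xe9'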

The special case of considering US-ASCII byte strings to compare equal
to the corresponding Unicode strings is more reasonable than this would
be for a general byte encoding, because:

 - it can be done with no (or only a trivial) conversion,
 - US-ASCII has no precomposed characters or combining marks, so it
   does not have multiple encodings for the same abstract character,
 - Unicode has a US-ASCII subset that uses exactly the same encoding
   model as US-ASCII (whereas in general, a byte encoding might use
   an arbitrarily different encoding model from Unicode's, as for example
   is the case for ISCII).
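
This is also what makes the dictionary-key case in the subject line
behave tolerably (Python 2; how the non-ASCII comparison fails depends
on the 2.x version):

    # ASCII str and unicode keys hash and compare equal, so they collide:
    d = {'spam': 1}
    d[u'spam'] = 2
    print len(d), d        # 1 {'spam': 2}

    # A non-ASCII pair does not compare equal; depending on the version,
    # this raises UnicodeDecodeError or emits a UnicodeWarning and
    # evaluates to False:
    print u'\xe9' == '\xe9'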

>>we should, of course, continue to use the one we always used (for
>>"ascii", there is no difference between the two).
> 
> +1
> 
> This seems the most (only ?) logical solution.

No; always considering Unicode and non-ASCII byte strings to be distinct
is just as logical.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>




