[Python-Dev] RE: \ud800 crashes interpreter (PR#384)

M.-A. Lemburg mal@lemburg.com
Wed, 05 Jul 2000 10:37:23 +0200

Bill Tutt wrote:
> > MAL wrote:
> >> Bill wrote:
> >> u'\ud800' causes the interpreter to crash
> >> example:
> >> print u'\ud800'
> >> What happens:
> >> The code fails to compile because, while adding the constant, the
> >> unicode_hash function is called, which for some reason requires the
> >> UTF-8 string format.
> > The reasoning at the time was that dictionaries should accept
> > Unicode objects as keys which match their string equivalents
> > as the same key, e.g. 'abc' works just as well as u'abc'.
> > UTF-8 was the default encoding back then. I'm not sure how
> > to fix the hash value given the new strategy w/r to the
> > default encoding...
> > According to the docs, objects comparing equal should have the
> > same hash value, yet this would require the hash value to be
> > calculated using the default encoding and that
> > would not only cause huge performance problems, but could
> > effectively render Unicode useless, because not all default
> > encodings are lossless (ok, one could work around this by
> > falling back to some other way of calculating the hash
> > value in case the conversion fails).
> Yeah, yeah, yeah. I know all that, just never liked it. :)
> The current problem is that the UTF-8 codec can't round-trip surrogate
> characters atm.
> This is easy to fix, so I'm doing a patch to fix this oversight, unless
> you beat me to it.
> Anything else is slightly more annoying to fix.
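
The invariant quoted above (objects comparing equal must have the same
hash value) can be sketched as follows. This is a minimal illustration in
present-day Python 3, not the 2000-era interpreter under discussion. The
Key class is hypothetical: it stands in for a Unicode object that compares
equal to its plain-string equivalent, and it is not how unicode_hash works.

```python
class Key:
    """Hypothetical text wrapper that compares equal to a plain str."""

    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        if isinstance(other, Key):
            return self.text == other.text
        if isinstance(other, str):
            return self.text == other
        return NotImplemented

    def __hash__(self):
        # Must agree with hash(str) for equal values, or dict lookups
        # with a Key would miss entries stored under the plain string.
        return hash(self.text)


table = {"abc": 1}
found = table.get(Key("abc"))  # finds the entry stored under "abc"
same_hash = hash(Key("abc")) == hash("abc")
```

If __hash__ returned something unrelated to hash(self.text), the two keys
would still compare equal but the dictionary lookup would generally fail,
which is the situation the quoted paragraph is worried about.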

I left out surrogates in the UTF-8 codec on purpose: the Python
implementation currently doesn't have support for these,
e.g. slicing doesn't care about UTF-16 surrogates, so I made
sure that people using these get errors ;-)
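
For readers on a present-day Python 3, the behaviour described above can
be sketched roughly as follows. This assumes Python 3.3 or later, where
the strict UTF-8 codec still rejects lone surrogates and the
"surrogatepass" error handler opts in to round-tripping them; it does not
reproduce the 2000-era interpreter. The variable names are only
illustrative.

```python
lone = "\ud800"  # an unpaired high surrogate

# The strict UTF-8 codec refuses to encode a lone surrogate.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Opting in with "surrogatepass" lets the value round-trip.
passed = lone.encode("utf-8", "surrogatepass")
round_trip = passed.decode("utf-8", "surrogatepass") == lone

# A character outside the BMP takes two UTF-16 code units (a surrogate pair).
astral = "\U00010000"
units = astral.encode("utf-16-be")  # high surrogate D800, low surrogate DC00

# Cutting between the two code units leaves half a pair, which no longer
# decodes. This is the kind of damage surrogate-unaware slicing can cause.
try:
    units[:2].decode("utf-16-be")
    half_ok = True
except UnicodeDecodeError:
    half_ok = False
```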

Marc-Andre Lemburg
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/