[issue11489] json.dumps not parsable by json.loads (on Linux only)

Mon Mar 14 17:19:07 CET 2011

Alexander Belopolsky <belopolsky at users.sourceforge.net> added the comment:

> It appears this is an invalid unicode character.
> Shouldn't this be caught by decode("utf8")

It should, and in Python 3.x it is:

>>> b'\xed\xa8\x80'.decode("utf8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid continuation byte

The Python 2.7 behavior seems to be a bug:

>>> '\xed\xa8\x80'.decode("utf8")
u'\uda00'

Note also the following difference:

In 3.x:

>>> b'\xed\xa8\x80'.decode("utf8", 'replace')
'��'

In 2.7:

>>> '\xed\xa8\x80'.decode("utf8", 'replace')
u'\uda00'

I am not sure this should be fixed in 2.x. Lone surrogates seem to round-trip just fine in 2.x, and there is likely to be existing code that relies on this.
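For code that depends on the 2.x round-trip, a minimal sketch of the equivalent in 3.x, assuming Python 3.1 or later where the 'surrogatepass' error handler is available. The variable names are only illustrative:

```python
# Bytes that the 2.7 decoder accepts as the lone surrogate U+DA00.
raw = b'\xed\xa8\x80'

# Opt in to the permissive behavior explicitly with 'surrogatepass'.
decoded = raw.decode('utf-8', 'surrogatepass')
round_tripped = decoded.encode('utf-8', 'surrogatepass')

# The strict default still rejects the same bytes.
try:
    raw.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
```

With this, decoded is the one-character string '\uda00' and round_tripped equals raw, while the strict decode fails as shown earlier.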

>  Shouldn't anything generated by json.dumps be parsed by json.loads?

This, on the other hand, should probably be fixed by rejecting lone surrogates in json.dumps, accepting them in json.loads, or both. The last alternative would be consistent with the common wisdom of being conservative in what you produce but liberal in what you accept.
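The "reject in json.dumps" option could be sketched roughly as follows. This is only an illustration, not a proposed patch: contains_surrogate and strict_dumps_str are hypothetical helper names, the code assumes Python 3.3 or later (where a str is a sequence of code points, so any surrogate in it is unpaired), and it handles a single string rather than nested containers.

```python
import json


def contains_surrogate(text):
    """Return True if text holds any code point in U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


def strict_dumps_str(text):
    """Hypothetical wrapper: refuse to serialize a str with lone surrogates."""
    if contains_surrogate(text):
        raise ValueError("string contains a lone surrogate")
    return json.dumps(text)
```

Calling strict_dumps_str('\uda00') raises ValueError, while ordinary text is passed through to json.dumps unchanged.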

----------
nosy: +belopolsky, haypo
versions: +Python 2.7

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue11489>
_______________________________________