[pypy-issue] Issue #2389: Different behavior of bytes.decode('utf8', 'custom_replace') (pypy/pypy)

Konstantin Lopuhin issues-reply at bitbucket.org
Thu Sep 1 05:02:01 EDT 2016


New issue 2389: Different behavior of bytes.decode('utf8', 'custom_replace')
https://bitbucket.org/pypy/pypy/issues/2389/different-behavior-of-bytesdecode-utf8

Konstantin Lopuhin:

The following program:
```
import codecs

codecs.register_error('custom_replace', lambda exc: (u'\ufffd', exc.start+1))

s1 = b"WORD\xe3\xab"
print(repr(s1.decode('utf8', 'custom_replace')))
print(repr(s1.decode('utf8', 'replace')))

s2 = b"\xef\xbb\xbfWORD\xe3\xabWORD2"
print(repr(s2.decode('utf8', 'custom_replace')))
print(repr(s2.decode('utf8', 'replace')))
```
produces different results on CPython 2.7 (I tried 2.7.6 and 2.7.12) and on PyPy 5.4.0:

```
$ pypy test.py 
u'WORD\ufffd'
u'WORD\ufffd'
u'\ufeffWORD\ufffd\ufffdWORD2'
u'\ufeffWORD\ufffdWORD2'
$ python test.py 
u'WORD\ufffd\ufffd'
u'WORD\ufffd'
u'\ufeffWORD\ufffd\ufffdWORD2'
u'\ufeffWORD\ufffdWORD2'
```

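I think CPython is more consistent here: with a custom replace function, it replaces each invalid byte with the given symbol, whereas PyPy in some cases does something different (in the first example above, where the invalid bytes are at the end of the input, it emits only one replacement character).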
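To narrow down where the two interpreters diverge, it may help to log what the decoder passes to the error handler. The following is only a sketch: the handler name `logging_replace` and the `calls` list are made up for illustration, and the replacement policy is the same as `custom_replace` above. No expected output is shown, because the reported ranges are exactly what may differ between implementations.

```python
import codecs

# One (start, end, offending bytes) tuple per handler invocation.
calls = []

def logging_replace(exc):
    # Hypothetical handler: same policy as 'custom_replace' above
    # (emit U+FFFD, resume one byte after exc.start), plus logging.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    calls.append((exc.start, exc.end, exc.object[exc.start:exc.end]))
    return (u'\ufffd', exc.start + 1)

codecs.register_error('logging_replace', logging_replace)

s1 = b"WORD\xe3\xab"
result = s1.decode('utf8', 'logging_replace')
print(repr(result))
for start, end, chunk in calls:
    print("start=%d end=%d bytes=%r" % (start, end, chunk))
```

Running this under both interpreters and comparing the logged ranges should show whether the handler is called a different number of times, or whether the returned resume position is treated differently.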

The context: this code is used in w3lib (https://github.com/scrapy/w3lib/blob/v1.14.2/w3lib/encoding.py#L176; the CPython bug reference there might be slightly misleading) to emulate browser behavior when handling invalid UTF-8, and CPython with custom_replace agrees with the browser behavior.




