[Python-3000] PEP: Python3 and UnicodeDecodeError

Thu Oct 2 14:35:48 CEST 2008

On Thursday 02 October 2008 14:07:50 M.-A. Lemburg, you wrote:
> On 2008-10-02 13:50, Victor Stinner wrote:
> > This is a PEP (...)
>
> The PEP doesn't appear to address any potential changes. Wouldn't
> it be better to add such information to the Python3 documentation
> itself ?!

I don't know the right name for this kind of document. Yes, it could move to
Doc/ in the Python3 source tree.

> > Example of an invalid bytes sequence: ::
> >     >>> str(b'\xff', 'utf8')
> >     UnicodeDecodeError
> >
> >     >>> str(b'\xff', 'iso-8859-1')
> >     'ÿ'
>
> You have left out all the options you have by using a different
> error handling mechanism (using a third parameter to str()), e.g.
> 'replace', 'ignore', etc.

Yes, I can explain why replace and ignore can *not* be used in this case. If
you use ignore or replace, filenames will be valid unicode strings, but the
decoding is lossy: the original bytes can no longer be recovered, so you
will be unable to open / copy / remove your file.
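
A minimal sketch of the problem, assuming a made-up filename stored as
ISO-8859-1 bytes on a system where Python decodes filenames as UTF-8 (the
name raw_name and its value are only for illustration):

```python
# Hypothetical filename: "caf\xe9.txt" is valid ISO-8859-1 but invalid UTF-8.
raw_name = b'caf\xe9.txt'

# Decode with the two lossy error handlers.
replaced = raw_name.decode('utf-8', 'replace')  # invalid byte becomes U+FFFD
ignored = raw_name.decode('utf-8', 'ignore')    # invalid byte is dropped

# Encoding the result back does not give the bytes stored on disk,
# so the string can no longer be used to find the file.
roundtrip_replace = replaced.encode('utf-8')
roundtrip_ignore = ignored.encode('utf-8')

print(roundtrip_replace == raw_name)
print(roundtrip_ignore == raw_name)
```

Both comparisons are false: the name Python hands back no longer matches
the name the file system knows.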

> > Default encoding
> > ================
> >
> > Python uses "UTF-8" as the default Unicode encoding. You can read the
> > default charset using sys.getdefaultencoding(). The "default encoding" is
> > used by PyUnicode_FromStringAndSize().
>
> Not only there: the C API makes various assumptions on the default
> encoding as well. We should probably drop the term "default encoding"
> altogether and replace it with "utf-8".

The concept of "default encoding" is unclear in Python. Yes, we might remove 
sys.getdefaultencoding() and write that PyUnicode_FromStringAndSize() uses 
the UTF-8 charset.
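
A small sketch of what the "default encoding" looks like from Python code,
assuming a Python 3 interpreter (the variable names are only for
illustration; this shows the Python-level behaviour, not the C API):

```python
import sys

# On Python 3 this is expected to be 'utf-8'.
default = sys.getdefaultencoding()
print(default)

# str.encode() and bytes.decode() without an explicit codec use UTF-8.
text = '\xff'                      # the character 'ÿ'
encoded_default = text.encode()
encoded_utf8 = text.encode('utf-8')
decoded_default = encoded_utf8.decode()
```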

> sys.setdefaultencoding() should probably be dropped altogether from
> Python3.

Yes.

-- 
Victor Stinner aka haypo
http://www.haypocalc.com/blog/