[Python-3000] PEP: Python3 and UnicodeDecodeError

Thu Oct 2 14:07:50 CEST 2008

On 2008-10-02 13:50, Victor Stinner wrote:
> This is a PEP describing the behaviour of Python3 on UnicodeDecodeError. 

The PEP doesn't appear to address any potential changes. Wouldn't
it be better to add this information to the Python3 documentation
itself?

> It's
> a *draft*, so don't hesitate to comment on it. This document supposes that my
> patch to allow bytes filenames is accepted, which is not the case today.
> 
> While I was writing this document I found potential problems in Python3. So
> here is a TODO list (things to be checked):
> 
> FIXME: PyUnicode_DecodeFSDefaultAndSize(): errors="replace"!
> FIXME: import.c uses ASCII if the default file system encoding is unknown,
>        whereas other functions use UTF-8
> FIXME: Write a function in Python3 to convert a bytes filename to a nice
>        string
> FIXME: When is bytearray accepted and when is it not?
> FIXME: Allow bytes/str mix for shutil.copy*()? Will the ignore callback get
>        bytes or unicode?
> FIXME: Use a shorter title for this PEP :-)
> 
> Can anyone write a section about encoding bytes in Unicode using escape
> sequences?
> 
> What is the best tool to work on a PEP? I hate email threads, and I would 
> prefer SVN / Mercurial / anything else.
> ---
> 
> Title: Python3 and UnicodeDecodeError for the command line, 
>        environment variables and filenames
> 
> Introduction
> ============
> 
> Python3 does its best to give you text as valid Unicode character
> strings. When it hits an invalid byte sequence (according to the charset in
> use), it has two choices: drop the value or raise a UnicodeDecodeError.
> This document presents the behaviour of Python3 for the command line,
> environment variables and filenames.
> 
> Example of an invalid byte sequence: ::
> 
>     >>> str(b'\xff', 'utf8')
>     UnicodeDecodeError: 'utf8' codec can't decode byte 0xff (...)
> 
> whereas the same byte sequence is valid in another charset like ISO-8859-1: ::
> 
>     >>> str(b'\xff', 'iso-8859-1')
>     'ÿ'

You have left out the options available through a different error
handler (passed as the third parameter to str()), e.g. 'replace',
'ignore', etc.
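
To illustrate what I mean, here is a minimal sketch of the same kind of
invalid UTF-8 input decoded with three of the standard error handlers.
The sample bytes and the helper name decode_with() are made up for this
example; only str() and the handler names are real.

```python
# Decode one invalid UTF-8 byte sequence with different error handlers.
data = b'abc\xffdef'  # sample input, chosen for illustration


def decode_with(handler):
    """Hypothetical helper: decode `data` as UTF-8 with the given handler."""
    return str(data, 'utf-8', handler)


# 'strict' is the default handler: it raises on the first invalid byte.
try:
    strict_result = decode_with('strict')
except UnicodeDecodeError:
    strict_result = 'UnicodeDecodeError'

# 'replace' substitutes U+FFFD (REPLACEMENT CHARACTER) for the bad byte.
replaced = decode_with('replace')

# 'ignore' silently drops the bad byte.
ignored = decode_with('ignore')

print(strict_result)
print(ascii(replaced))
print(ignored)
```

Note that both 'replace' and 'ignore' lose information: the original
byte cannot be recovered from the resulting string.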

> Default encoding
> ================
> 
> Python uses "UTF-8" as the default Unicode encoding. You can read the default
> charset using sys.getdefaultencoding(). The "default encoding" is used by
> PyUnicode_FromStringAndSize().
> 
> A function sys.setdefaultencoding() exists, but it raises a ValueError for
> any charset other than UTF-8, since the charset is hardcoded in
> PyUnicode_FromStringAndSize().

Not only there: the C API makes various assumptions about the default
encoding as well. We should probably drop the term "default encoding"
altogether and replace it with "utf-8".

sys.setdefaultencoding() should probably be dropped altogether from
Python3.
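
A quick way to check what a given interpreter actually exposes is
sketched below. It only inspects the sys module and assumes nothing
beyond sys.getdefaultencoding() being available; whether the setter is
present depends on the Python version you run it with, so I make no
claim about that output here.

```python
import sys

# The default encoding reported by the interpreter.
default_encoding = sys.getdefaultencoding()

# Check whether the setter is exposed at all, without calling it.
has_setter = hasattr(sys, 'setdefaultencoding')

print('default encoding:', default_encoding)
print('sys.setdefaultencoding available:', has_setter)
```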

-- 
Marc-Andre Lemburg
eGenix.com
