[docs] [issue18713] Clearly document the use of PYTHONIOENCODING to set surrogateescape

Fri Aug 23 06:08:19 CEST 2013

Nick Coghlan added the comment:

Note: I created issue 18814 to cover some additional tools for working with surrogate escaped strings.

For this issue, we currently have http://docs.python.org/3/howto/unicode.html, which aims to be a more comprehensive guide to understanding Unicode issues.

I'm thinking we may want a "Debugging Unicode Errors" document, which defers to the existing howto guide for those that really want to understand Unicode, and instead focuses on quick fixes for resolving various problems that may present themselves.

Application developers will likely want to read the longer guide, while the debugging document would be aimed at getting script writers past their immediate hurdle, without necessarily gaining a full understanding of Unicode.

The aim would be for this page to become the top hit for "python surrogates not allowed", rather than the current top hit, which is a rejected bug report about it (http://bugs.python.org/issue13717).

For example:

================================
What is the meaning of "UnicodeEncodeError: surrogates not allowed"?
--------------------------------------------------------------------

Operating system metadata on POSIX-based systems like Linux and Mac OS X may include improperly encoded text values. To cope with this, Python uses the "surrogateescape" error handler to store those arbitrary bytes inside a Unicode object. When converted back to bytes using the same encoding and error handler, the original byte sequence is reproduced exactly. This allows operations like opening a file based on a directory listing to work correctly, even when the metadata is not properly encoded according to the system settings.
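
As a rough illustration of that round trip, the following sketch decodes a byte string that is not valid UTF-8 and then encodes it back again. The file name is made up for the example, and the encoding is hard-coded to UTF-8 rather than taken from the system settings:

```python
# Hypothetical file name: 0xE9 is "é" in Latin-1, but is not valid UTF-8 here.
raw = b"caf\xe9.txt"

# Decoding with surrogateescape maps the undecodable byte 0xE9 to the
# lone surrogate U+DCE9 instead of raising UnicodeDecodeError.
name = raw.decode("utf-8", errors="surrogateescape")
print(ascii(name))  # ascii() avoids encoding the surrogate on output

# Encoding with the same codec and error handler restores the original bytes.
restored = name.encode("utf-8", errors="surrogateescape")
print(restored == raw)
```

``os.fsdecode()`` and ``os.fsencode()`` perform the same conversion using the filesystem encoding and its configured error handler.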

The "surrogates not allowed" error appears when a string from one of these operating system interfaces contains an embedded arbitrary byte sequence, but an attempt is made to encode it using the default "strict" error handler rather than the "surrogateescape" handler. This commonly occurs when printing improperly encoded operating system data to the console, or writing it to a file, database or other serialised interface.
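
The error can be reproduced without touching the filesystem. This minimal sketch builds an escaped string by hand (the file name is hypothetical) and then encodes it with the default "strict" handler:

```python
# Build a string containing an escaped byte, as an OS interface might return.
name = b"caf\xe9.txt".decode("utf-8", errors="surrogateescape")

try:
    # The default error handler is "strict", which rejects lone surrogates.
    name.encode("utf-8")
except UnicodeEncodeError as exc:
    reason = exc.reason
    print(exc)
```

Calling ``print(name)`` on a stream that uses the strict handler fails in the same way, since the stream performs the same encoding step internally.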

The ``PYTHONIOENCODING`` environment variable can be used to ensure operating system metadata can always be read via ``sys.stdin`` and written via ``sys.stdout``. The following command will display the encoding Python will use by default to interact with the operating system::

    $ python3 -c "import sys; print(sys.getfilesystemencoding())"
    utf-8

This can then be used to specify an appropriate setting for ``PYTHONIOENCODING``::

    $ export PYTHONIOENCODING=utf-8:surrogateescape
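
One way to check that the setting has taken effect is to start a child interpreter with the variable set and ask it which error handler ``sys.stdout`` is using. This is only a sketch; it assumes UTF-8 is the appropriate encoding, as in the command above:

```python
import os
import subprocess
import sys

# Copy the current environment and add the setting shown above.
env = dict(os.environ, PYTHONIOENCODING="utf-8:surrogateescape")

# Ask a child interpreter to report how its stdout is configured.
result = subprocess.run(
    [sys.executable, "-c",
     "import sys; print(sys.stdout.encoding, sys.stdout.errors)"],
    env=env,
    capture_output=True,
    text=True,
    check=True,
).stdout.strip()
print(result)
```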

For other interfaces, there is no such general solution. If allowing the invalid byte sequence to propagate further is acceptable, then enabling the surrogateescape handler may be appropriate. Alternatively, it may be better to track these corrupted strings back to their point of origin, and either fix the underlying metadata, or else filter them out early on.
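
Both approaches might look something like the following sketch. The file names and the ``has_escaped_bytes`` helper are hypothetical, introduced only for this example:

```python
import os
import tempfile

raw = b"caf\xe9.txt"
name = raw.decode("utf-8", errors="surrogateescape")


def has_escaped_bytes(text):
    """Return True if text contains bytes smuggled in by surrogateescape."""
    # surrogateescape maps bytes 0x80-0xFF to U+DC80-U+DCFF.
    return any("\udc80" <= ch <= "\udcff" for ch in text)


# Option 1: let the original bytes propagate by enabling the handler
# on the output file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "listing.txt")
    with open(path, "w", encoding="utf-8", errors="surrogateescape") as f:
        f.write(name)
    with open(path, "rb") as f:
        written = f.read()

# Option 2: filter the affected strings out before they travel any further.
clean = [n for n in [name, "ok.txt"] if not has_escaped_bytes(n)]
print(written == raw, clean)
```

Note that the first option writes bytes that are not valid UTF-8, so any consumer of that file needs to be prepared to handle them.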
================================

If issue 18814 is implemented, then it could point to those tools. Similarly, issue 15216 could be referenced if that is implemented.

----------
assignee:  -> docs at python
components: +Documentation
nosy: +docs at python
title: Enable surrogateescape on stdin and stdout when appropriate -> Clearly document the use of PYTHONIOENCODING to set surrogateescape

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue18713>
_______________________________________