[docs] [issue34484] Unicode HOWTO incorrectly refers to Private Use Area for surrogateescape

Thu Aug 23 17:14:39 EDT 2018

New submission from Mark Dickinson <dickinsm at gmail.com>:

The Unicode HOWTO currently contains this text in the "Files in an Unknown Encoding" section [1]:

> The surrogateescape error handler will decode any non-ASCII bytes as code
> points in the Unicode Private Use Area ranging from U+DC80 to U+DCFF. These
> private code points will then be turned back into the same bytes when the
> surrogateescape error handler is used when encoding the data and writing it
> back out.

Unless I'm missing something, the subrange U+DC80 to U+DCFF of the low surrogates is *not* a Private Use Area. There *is* a kinda-sorta PUA in the high surrogates from U+DB80 to U+DBFF (because the only valid codepoints that use these surrogates in their UTF-16 encoding are the codepoints in planes 15 and 16, which are almost entirely PUA codepoints), but that's not what the surrogateescape handler is using.
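
A minimal sketch for checking this from the interpreter, assuming the "ascii" codec as the one in use (the HOWTO text does not name a codec, and all variable names here are illustrative only). It decodes every non-ASCII byte value with surrogateescape, records the range and general category of the resulting code points, and round-trips them back to bytes:

```python
import unicodedata

# Every byte value that is not valid ASCII.
raw = bytes(range(0x80, 0x100))

# Undecodable bytes become lone surrogates under surrogateescape.
decoded = raw.decode("ascii", errors="surrogateescape")

code_points = [ord(ch) for ch in decoded]
lowest, highest = min(code_points), max(code_points)

# General categories of the escaped characters: "Cs" is Surrogate,
# whereas a genuine Private Use character is "Co".
categories = {unicodedata.category(ch) for ch in decoded}
pua_category = unicodedata.category("\ue000")

# Encoding with the same handler restores the original bytes.
round_tripped = decoded.encode("ascii", errors="surrogateescape")

print(f"U+{lowest:04X}..U+{highest:04X}", categories, pua_category)
```

The escaped code points should all report category "Cs" (Surrogate), while a real Private Use Area character such as U+E000 reports "Co", which is the distinction the HOWTO wording blurs.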

[1] https://docs.python.org/3/howto/unicode.html#files-in-an-unknown-encoding

----------
assignee: docs at python
components: Documentation
messages: 323976
nosy: docs at python, mark.dickinson
priority: normal
severity: normal
status: open
title: Unicode HOWTO incorrectly refers to Private Use Area for surrogateescape
versions: Python 3.6, Python 3.7

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue34484>
_______________________________________