[docs] 'codecs' module docs improvements (issue 19548)

Mon Dec 22 16:22:05 CET 2014

I definitely like the proposed changes overall, especially splitting
CodecInfo out into a separately documented class.

Lots more detailed feedback inline, including some genuien questions
where I don't know the answer off the top of my head myself, and it
isn't a quick check at the interactive prompt to confirm the current
behaviour.

http://bugs.python.org/review/19548/diff/13481/Doc/library/codecs.rst
File Doc/library/codecs.rst (right):

http://bugs.python.org/review/19548/diff/13481/Doc/library/codecs.rst#newcode24
Doc/library/codecs.rst:24: bytes to bytes.
I like the idea of expanding on this paragraph. I suggest we make "text
encoding" a defined term in the glossary, and adjust this addition to
be:

Most standard codecs are :term:`text encodings <text encoding>`, which
encode text to bytes, but there are also codecs that encode text to
text, and bytes to bytes. Custom codecs beyond those provided in the
standard library may convert between arbitrary types."

http://bugs.python.org/review/19548/diff/13481/Doc/library/codecs.rst#newcode54
Doc/library/codecs.rst:54: is not reversible, which doesn't play well
with module reloading.
I suggest pulling this caveat out into a "note". As far as phrasing
goes, something a little more formal may be good. Perhaps:

"Search function registration is not currently reversible, which may
cause problems in some cases, such as unit testing or module reloading"

http://bugs.python.org/review/19548/diff/13481/Doc/library/codecs.rst#newcode198
Doc/library/codecs.rst:198: suitable replacement marker such as ``b'?'``
in or ``'\ufffd'``.
With the new defined term, this error handling function should gain a
sentence to indicate that it's only applicable to text encodings. (There
may actually be cases where it's usable for bytes->bytes or text->text,
but if that's actually the case, we can figure out how to fix the issue
later. For the time being, I think it's more important to clearly
document the text encoding case)

The substitution can then be clarified to say that b'?' is produced when
encoding, and '\ufffd' when decoding (we can also embed the actual
encoded code point in the documentation here, in addition to the \u
escape).

http://bugs.python.org/review/19548/diff/13481/Doc/library/codecs.rst#newcode210
Doc/library/codecs.rst:210: unencodable character is replaced by an
appropriate XML character reference.
Also text encoding specific.

http://bugs.python.org/review/19548/diff/13481/Doc/library/codecs.rst#newcode216
Doc/library/codecs.rst:216: unencodable character is replaced by a
backslashed escape sequence.
Also text encoding specific.

http://bugs.python.org/review/19548/diff/13481/Doc/library/codecs.rst#newcode221
Doc/library/codecs.rst:221: unencodable character is replaced by a
``\N{...}`` escape sequence.
Also text encoding specific.

http://bugs.python.org/review/19548/diff/13481/Doc/library/codecs.rst#newcode238
Doc/library/codecs.rst:238: No automatic conversion of ``'\n'`` is done
on reading and writing.
We should also explicitly note here that *this* open uses the
StreamReader and StreamWriter classes provided by the CodecInfo module,
*not* the general purpose IO classes provided by the io module.

This means that 'encoding' here can be any arbitrary registered codec,
rather than necessarily being a text encoding as it is for the io module
(and hence the builtin open function)

http://bugs.python.org/review/19548/diff/13481/Doc/library/codecs.rst#newcode241
Doc/library/codecs.rst:241: Only the codecs whose encoded form is bytes
are supported.
It's worth explicitly noting here that the types supported by the
file-like object will depend on the codec used.

http://bugs.python.org/review/19548/diff/13481/Doc/library/codecs.rst#newcode261
Doc/library/codecs.rst:261: a text stream.
I'm going to have to look at the implementation to confirm whether this
more specific version is accurate, as I suspect it isn't quite right in
the case where data_encoding == file_encoding.

For the case where they're different, we may want to use the word
"transcoding".

http://bugs.python.org/review/19548/diff/13481/Doc/library/codecs.rst#newcode267
Doc/library/codecs.rst:267: error occurs.
Since this factory function produces a StreamRecoder instance, a cross
reference may be useful.

http://bugs.python.org/review/19548/diff/13481/Doc/library/codecs.rst#newcode329
Doc/library/codecs.rst:329: To simplify and standardize error handling,
I suggest giving this section on error handlers its own subheading and
label, so it can be referenced directly.

http://bugs.python.org/review/19548/diff/13481/Doc/library/codecs.rst#newcode332
Doc/library/codecs.rst:332: and implemented by all standard Python
codecs:
It may be worthwhile adding cross references back to the functions
implementing each error handler.

For example, finishing with: "(Implemented in :func:`strict_errors`)"

http://bugs.python.org/review/19548/diff/13481/Doc/library/codecs.rst#newcode337
Doc/library/codecs.rst:337: | Value                   | Meaning         
                             |
I believe this table needs to be split in two, with replace through to
surrogateescape being restricted in applicability to text encodings.

http://bugs.python.org/review/19548/diff/13481/Doc/library/codecs.rst#newcode368
Doc/library/codecs.rst:368: In addition, the following error handlers
are specific to Unicode encoding
With the above table split in two, the phrasing in this introductory
paragraph may need tweaking.

http://bugs.python.org/review/19548/diff/13481/Doc/library/codecs.rst#newcode469
Doc/library/codecs.rst:469: by providing the *errors* keyword argument.
These parameters are predefined:
Given the new link target for the error handler tables, I'd replace this
bulleted list with a cross-reference.

http://bugs.python.org/review/19548/diff/13481/Doc/library/codecs.rst#newcode489
Doc/library/codecs.rst:489: :func:`register_error`.
This paragraph can move to the shared section on error handlers.

http://bugs.python.org/review/19548/diff/13481/Doc/library/codecs.rst#newcode540
Doc/library/codecs.rst:540: by providing the *errors* keyword argument.
These parameters are predefined:
As above - it's likely worth replacing the repeated list with a
cross-reference.

http://bugs.python.org/review/19548/diff/13481/Doc/library/codecs.rst#newcode554
Doc/library/codecs.rst:554: :func:`register_error`.
With this paragraph moved to the common section, it can be dropped here.

http://bugs.python.org/review/19548/diff/13481/Doc/library/codecs.rst#newcode617
Doc/library/codecs.rst:617: *stream* must be a file-like object open for
writing.
I suggest finishing this sentence as "... writing text or binary data,
as appropriate for the specific codec".

http://bugs.python.org/review/19548/diff/13481/Doc/library/codecs.rst#newcode620
Doc/library/codecs.rst:620: providing the *errors* keyword argument.
These parameters are predefined:
As with the Incremental* APIs, I'm a fan of consolidating the error
handler documentation, and linking to it where appropriate.

http://bugs.python.org/review/19548/diff/13481/Doc/library/codecs.rst#newcode650
Doc/library/codecs.rst:650: the :meth:`write` method). Does not work on
byte encoders.
The new caveat needs to be clarified: what does "Does not work" mean?
What counts as a "byte encoder" in this context?

If the issue is "May not be present for codecs other than text
encodings", then that's what we should say.

http://bugs.python.org/review/19548/diff/13481/Doc/library/codecs.rst#newcode684
Doc/library/codecs.rst:684: *stream* must be a file-like object open for
reading.
See the StreamWriter comments for suggested phrasing here.

http://bugs.python.org/review/19548/diff/13481/Doc/library/codecs.rst#newcode708
Doc/library/codecs.rst:708: characters or bytes to return. The
:func:`read` method will
characters -> code points (due to complexities in the definition of what
actually constitutes a Unicode characters, str's are code point arrays
rather than character arrays)

http://bugs.python.org/review/19548/diff/13481/Doc/library/codecs.rst#newcode713
Doc/library/codecs.rst:713: number of encoded bytes or characters to
read
characters -> code points again.

http://bugs.python.org/review/19548/diff/13481/Doc/library/codecs.rst#newcode760
Doc/library/codecs.rst:760: In addition to the above methods, the
:class:`StreamReader` must also inherit
Not a new issue in the patch, but "inherit" doesn't strike me as being
the right word here (I'm also not sure the statement is even true any
more).

http://bugs.python.org/review/19548/diff/13481/Doc/library/codecs.rst#newcode1242
Doc/library/codecs.rst:1242: |                    |         | decoding
``b''``          |
This should be clear on what b'' gets decoded to.

http://bugs.python.org/review/19548/diff/13481/Doc/library/codecs.rst#newcode1250
Doc/library/codecs.rst:1250: |                    |         | Latin-1
source code.      |
This description should be caveated with the fact that Python 3 uses
UTF-8 by default, and relying on that is preferable to unicode_escape in
most cases where Python 2 source compatibility isn't needed and a utf-8
encoding cookie can't be used for some reason.

http://bugs.python.org/review/19548/diff/13481/Doc/library/codecs.rst#newcode1256
Doc/library/codecs.rst:1256: |                    |         | ..
deprecated:: 3.3       |
May be worth cross referencing PEP 393 here to explain this deprecation.

http://bugs.python.org/review/19548/diff/13481/Doc/library/codecs.rst#newcode1265
Doc/library/codecs.rst:1265: to :class:`bytes` mappings.  They are not
supported by :meth:`bytes.decode`.
Perhaps append "(which is expected to produce :class:`str` output)" to
the new sentence?

http://bugs.python.org/review/19548/diff/13481/Doc/library/codecs.rst#newcode1320
Doc/library/codecs.rst:1320: mapping.  It is not supported by
:meth:`str.encode`.
Perhaps append "(which is expected to produce :class:`bytes` output)"?

http://bugs.python.org/review/19548/