[docs] [issue21667] Clarify status of O(1) indexing semantics of str objects

Wed Jun 11 01:20:46 CEST 2014

Jim Jewett added the comment:

Even my rewrite showed path dependency; a slight further improvement is to reorder it so that encoding is introduced ahead of bytes.  I also added a paragraph that I hope addresses the speed issue.

Proposal:

A string is a sequence of Unicode code points.  Strings can include any sequence of code points, including some that are semantically meaningless or explicitly undefined.
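
As a rough illustration of the paragraph above (not part of the proposed wording), the following sketch stores a lone surrogate and a noncharacter in ordinary strings. The two code points are picked only as examples, and the behaviour shown assumes Python 3.3 or later.

```python
import unicodedata

# A lone surrogate is meaningless on its own, but it is still a valid
# element of a str.
lone_surrogate = "\ud800"
surrogate_category = unicodedata.category(lone_surrogate)  # "Cs"

# U+FFFF is a Unicode noncharacter; a str can hold it as well.
noncharacter = "\uffff"
noncharacter_category = unicodedata.category(noncharacter)  # "Cn"

# Each one counts as exactly one code point.
lengths = (len(lone_surrogate), len(noncharacter))

# Not every such string can be encoded: strict UTF-8 rejects lone surrogates.
try:
    lone_surrogate.encode("utf-8")
    surrogate_encodable = True
except UnicodeEncodeError:
    surrogate_encodable = False
```

The string type accepts both values; it is the codec, not ``str``, that refuses the surrogate.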

Python doesn't have a :c:type:`char` type; a single code point is represented as a string of length ``1``.  The built-in function :func:`chr` translates an integer in the range ``0`` to ``0x10FFFF`` (code points ``U+0000`` through ``U+10FFFF``) to the corresponding length ``1`` string object, and :func:`ord` does the reverse.
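
A minimal sketch of the :func:`chr` / :func:`ord` round trip described above, assuming Python 3.3 or later (where code points outside the Basic Multilingual Plane also have length ``1``); the sample code points are arbitrary.

```python
# chr() maps an integer code point to a length-1 string; ord() inverts it.
snowman = chr(0x2603)
snowman_round_trip = ord(snowman)

# A code point beyond the Basic Multilingual Plane is still one element.
clef = chr(0x1D11E)
clef_length = len(clef)
clef_round_trip = ord(clef)

# Integers outside 0..0x10FFFF are rejected.
try:
    chr(0x110000)
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True
```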

:meth:`str.encode` provides a concrete representation (in the given text encoding) as a :class:`bytes` object suitable for transport and communication with non-Python utilities.  :meth:`bytes.decode` decodes such byte sequences into text strings.
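
For illustration only, the sketch below encodes one string with two encodings and decodes it back. The sample text and the choice of UTF-8 and UTF-16-LE are assumptions made for the example.

```python
text = "na\u00efve \u2603"  # 7 code points: n, a, ï, v, e, space, snowman

# Encoding yields a bytes object whose length depends on the encoding.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

code_point_count = len(text)
utf8_length = len(utf8_bytes)    # ï takes 2 bytes, the snowman takes 3
utf16_length = len(utf16_bytes)  # every code point here is one 2-byte unit

# Decoding with the matching encoding recovers the original string.
round_trip = utf8_bytes.decode("utf-8")
```

The length of the string stays the same whichever encoding is later used to transport it.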

.. impl-detail::  There are no methods exposing the internal representation of code points within a string.  While the C-API imposes some additional constraints on CPython, other implementations are free to use any representation that treats code points (as opposed to either code units or some normalized form of characters) as the unit of measure.
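
A small sketch of what "code points as the unit of measure" means in practice, contrasted with UTF-16 code units and with normalized characters. It assumes Python 3.3 or later and uses arbitrary sample strings; it makes no claim about how any implementation stores strings internally.

```python
import unicodedata

# Indexing and len() count code points, not UTF-16 code units.
s = "a\U0001F600b"
code_points = len(s)                            # 3
utf16_units = len(s.encode("utf-16-le")) // 2   # 4: the emoji needs a surrogate pair
middle = s[1]                                   # the whole emoji, not half of it

# Strings are not normalized implicitly: composed and decomposed forms
# of the same character are different sequences of code points.
composed = "\u00e9"      # é as one code point
decomposed = "e\u0301"   # e followed by a combining acute accent
same_string = composed == decomposed
normalized_equal = unicodedata.normalize("NFC", decomposed) == composed
```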

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue21667>
_______________________________________