[Python-ideas] Exploring the 'strview' concept further

Thu Dec 8 17:13:53 CET 2011

On Wed, Dec 7, 2011 at 6:51 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:

> For stateless
> encodings, views make sense - it's really just a memory view with a
> particular way of interpreting the individual characters and providing
> the string API rather than the bytes one. For the multitude of
> ASCII-compatible single byte codings and the various fixed-width
> encodings, that could be very useful. With some fiddling, you could
> support BOM and signature encodings, too (just by offsetting your view
> a bit and adjusting your interpretation of the individual code
> points).

I really like PEP 393, and it has gotten much better even since the
initial proposal, but this one objection has been bugging me the whole
time -- I just can't find a good way to explain it.

But with the concrete code, I will take a stab now...

I want the ability to use a more efficient string representation when
I know one exists -- such as when I could be using a single-byte
charset other than Latin-1, or when the underlying data is bytes, but
I want to treat it as text temporarily without copying the whole
buffer.

PyUnicode_Kind already supports the special case of
PyUnicode_WCHAR_KIND (also known as "legacy string, not ready" --
http://hg.python.org/cpython/file/174fbbed8747/Include/unicodeobject.h
around line 247).  I would like to see another option for "custom
subtype", and to accept that strings might stay in this state longer.

A custom type need not allow direct access to the buffer as an array,
so it would have to provide its own access functions.  I accept that
using these subtype-specific functions might be slower, but I think
the downside for "normal" strings can be limited to an extra case
statement in places like PyUnicode_WRITE (at
http://hg.python.org/cpython/file/174fbbed8747/Include/unicodeobject.h
lines 487-508; currently the default case asserts
PyUnicode_4BYTE_KIND).

Looking at Barry's example:
   >>> s = ':1/123'
   >>> s[:1] == ':'
   True

If this were modelled as bytes with a unicode view on top, it would
work fine (so long as you sliced the view, rather than the original
bytes object), and creating that string view wouldn't require copying
the buffer.  (Of course, the subtype's implementation of
PyUnicode_Substring might well copy parts of the buffer.)
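
To make that a bit more concrete, here is a rough pure-Python sketch of
what I mean by a view.  The class name Latin1View is made up for
illustration; nothing like it exists in the stdlib, and a real version
would live at the C level as a PyUnicode subtype.  It only assumes that
the underlying object supports the buffer protocol with one byte per
item.

```python
class Latin1View:
    """Hypothetical read-only text view over a bytes-like object."""

    def __init__(self, buf):
        # memoryview() wraps the existing buffer; it does not copy it.
        self._mv = memoryview(buf)

    def __len__(self):
        return len(self._mv)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Slicing a memoryview is also copy-free, so a slice of the
            # view is just another view.
            return Latin1View(self._mv[index])
        # Latin-1 maps byte value N directly to code point N.
        return chr(self._mv[index])

    def __str__(self):
        # Only here do we actually copy and build a real str.
        return self._mv.tobytes().decode("latin-1")

    def __eq__(self, other):
        if isinstance(other, Latin1View):
            other = str(other)
        if isinstance(other, str):
            return str(self) == other
        return NotImplemented


data = b':1/123'
s = Latin1View(data)   # no copy of data
print(s[:1] == ':')    # True
```

Note that the comparison in this sketch does copy (via str()); a
serious implementation would compare in place.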

I would expect bytes in particular to grow an
as_string(encoding="Latin-1") method, which could be used to deprecate
the various string-related methods.

A type for an alternate one-byte encoding could be as simple as using
the 1Byte variants to create a string of the same type when large
strings are called for, and a translation function when individual
characters are requested.
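
As a rough illustration only (again in pure Python, with a made-up
class name, SingleByteView), the "translation function" could be a
256-entry table built once per encoding, with the codec machinery used
for bulk conversion.  I am assuming a stateless single-byte codec;
undefined byte values are mapped to U+FFFD here just to keep the table
at exactly 256 entries.

```python
class SingleByteView:
    """Hypothetical text view over bytes in a one-byte encoding."""

    def __init__(self, buf, encoding):
        self._mv = memoryview(buf)   # no copy of the underlying buffer
        self._encoding = encoding
        # Translation table: index = byte value, item = character.
        self._table = bytes(range(256)).decode(encoding, errors="replace")
        if len(self._table) != 256:
            raise ValueError("%r is not a single-byte encoding" % encoding)

    def __len__(self):
        return len(self._mv)

    def __getitem__(self, index):
        # Individual characters go through the translation table.
        return self._table[self._mv[index]]

    def __str__(self):
        # Whole strings are produced by the ordinary codec in one pass.
        return self._mv.tobytes().decode(self._encoding, errors="replace")


raw = "Привет".encode("koi8_r")      # 6 bytes, one per character
view = SingleByteView(raw, "koi8_r")
print(len(view), view[0], str(view))
```

Slicing is left out to keep this short; it would return another view
over the sliced memoryview that shares the same table.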

-jJ