[Python-ideas] Exploring the 'strview' concept further

Wed Dec 7 16:39:58 CET 2011

On Thu, 8 Dec 2011 00:53:44 +1000
Nick Coghlan <ncoghlan at gmail.com> wrote:
> 
> 3. There are issues with memoryview itself that should be accounted
> for if pursuing this idea [5]

These issues are related to complex buffer types (strided,
multi-dimensional, etc.). They wouldn't apply to a hypothetical
"linear unicode buffer".

> 1. The basic construction would be "strview(object, encoding,
> errors)". For convenience, actual str objects would just be returned
> unmodified (alternatively: a factory function could be provided with
> that behaviour)

The factory function is a better idea than silent pass-through, IMO.
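
To make the distinction concrete, here is a rough sketch. The names
`StrView` and `as_text` are hypothetical (nothing like them exists in the
stdlib), and the class is only a stub that decodes eagerly. The point is
just that the constructor always returns the view type, while the
pass-through of str objects lives in a separate, explicit function.

```python
class StrView:
    """Hypothetical stub: wraps a bytes-like object plus its encoding."""

    def __init__(self, obj, encoding, errors="strict"):
        # The constructor never returns a plain str, so it rejects str
        # input instead of silently passing it through.
        if isinstance(obj, str):
            raise TypeError("StrView() needs a bytes-like object, not str")
        self._buf = memoryview(obj)
        self.encoding = encoding
        self.errors = errors

    def __str__(self):
        # Stub behaviour: decode the whole buffer in one go.
        return self._buf.tobytes().decode(self.encoding, self.errors)


def as_text(obj, encoding, errors="strict"):
    """Hypothetical factory: str is returned as-is, anything else is wrapped."""
    if isinstance(obj, str):
        return obj
    return StrView(obj, encoding, errors)
```

With this split, `type(StrView(x, enc))` is predictable, and code that
wants "give me something string-like" calls `as_text()` instead.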

> 5. If asked to index, slice or iterate over the underlying string, the
> strview would use the incremental decoder for the relevant codec to
> build an efficient mapping from code point indices to byte indices and
> then return real strings (various strategies for doing this have been
> posted to this list in the past).

Be careful: the incremental decoders use a layer of pure Python
wrappers. You want to call them on big blocks (at least 4 or 8KB, as
TextIOWrapper does) if you don't want to lose a lot of speed. So
building a mapping may not be easy.
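
For illustration, feeding an incremental decoder in large blocks might
look like the sketch below. `decode_in_blocks` and its `block_size`
default are my own assumptions for the example, not an existing API; only
`codecs.getincrementaldecoder()` and the decoder's `decode()` method are
real.

```python
import codecs


def decode_in_blocks(data, encoding, block_size=8192):
    """Yield decoded str pieces, feeding the decoder block_size bytes at a time."""
    decoder = codecs.getincrementaldecoder(encoding)()
    for start in range(0, len(data), block_size):
        # One call through the Python wrapper per block, not per character.
        piece = decoder.decode(data[start:start + block_size])
        if piece:
            yield piece
    # Flush; raises UnicodeDecodeError if the input ends mid-sequence.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail
```

A block boundary can fall in the middle of a multi-byte sequence. The
decoder buffers the incomplete bytes, so the number of code points you get
back per block is not fixed, which is part of what makes the mapping
awkward.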

Even bypassing the Python layer would still incur the overhead of
repeatedly calling a standalone function, instead of having a tight loop
such as the following:
http://hg.python.org/cpython/file/e49220f4c31f/Objects/unicodeobject.c#l4228

And of course, your mapping must be space-efficient enough that it's
much smaller than the full decoded string.
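
One way to keep the mapping small is to store a checkpoint per block
rather than an entry per code point. The sketch below does that for UTF-8
only. `SparseIndex` is a hypothetical name, and the sketch assumes the
decoder's `getstate()` returns the not-yet-decoded bytes as its first
item, which is how the buffered stdlib decoders behave.

```python
import bisect
import codecs


class SparseIndex:
    """Checkpoint table from code point indices to byte offsets (UTF-8 only)."""

    def __init__(self, data, block_size=8192):
        self.data = data
        self.char_pos = [0]  # code point index at each checkpoint
        self.byte_pos = [0]  # matching byte offset
        decoder = codecs.getincrementaldecoder("utf-8")()
        chars = 0
        for start in range(0, len(data), block_size):
            block = data[start:start + block_size]
            chars += len(decoder.decode(block))
            # Bytes of an incomplete trailing sequence are still buffered.
            pending = len(decoder.getstate()[0])
            consumed = start + len(block) - pending
            if consumed > self.byte_pos[-1]:
                self.char_pos.append(chars)
                self.byte_pos.append(consumed)
        decoder.decode(b"", final=True)  # raises on truncated input
        self.length = chars

    def __getitem__(self, index):
        if not 0 <= index < self.length:
            raise IndexError(index)
        # Last checkpoint at or before the requested code point.
        k = bisect.bisect_right(self.char_pos, index) - 1
        chunk = self.data[self.byte_pos[k]:self.byte_pos[k + 1]]
        return chunk.decode("utf-8")[index - self.char_pos[k]]
```

The table costs two integers per block instead of one entry per code
point, but every lookup has to decode up to one block again, so the block
size is a trade-off between memory and access time.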

I think that for small strings (< 1024 bytes?), decoding and storing
the decoded string are not a big deal. Decoding once is *much* faster
(especially for optimized encodings such as latin-1 or utf-8, and only
those will be left in a few years) than trying to do it piecewise.
strview would only be a win for rather large strings. Which makes it
useless for URL parsing ;)

> Alternatively, if codecs were
> classified to explicitly indicate when they implemented stateless
> fixed width encodings, then strview could simply be restricted to only
> working with that subset of possible encodings.

From a usability POV this seems undesirable. On the other hand, if
complete decoding is required, calling str() is just as cheap.
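
For reference, the fixed-width case really is trivial, since code point
indices map to byte offsets by a multiplication. In the sketch below,
`FIXED_WIDTH` and `slice_fixed_width` are hypothetical names, the table is
hand-written (codecs expose no such classification today), and only
non-negative indices are handled.

```python
# Hypothetical table: bytes per code point for a few stateless
# fixed-width codecs. The explicit-endian UTF-32 variants carry no BOM.
FIXED_WIDTH = {"ascii": 1, "latin-1": 1, "utf-32-le": 4, "utf-32-be": 4}


def slice_fixed_width(data, encoding, start, stop):
    """Decode code points [start:stop] without touching the rest of data."""
    width = FIXED_WIDTH[encoding]  # KeyError for unsupported codecs
    return data[start * width:stop * width].decode(encoding)
```

Note that utf-8 is not in the table, which is exactly the usability
problem: the most common encoding would be excluded.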

> 7. The new type would similarly support the full string API, returning
> actual string objects rather than any kind of view.

Even for slicing?