[Python-ideas] Exploring the 'strview' concept further

Fri Dec 9 05:41:41 CET 2011

Jim Jewett writes:

 > I want the ability to use a more efficient string representation when
 > I know one exists -- such as when I could be using a single-byte
 > charset other than Latin-1,

For most people all of the time, and for almost all people most of the
time, this is a YAGNI, and gets more so every year.  As a facility of
the language, it is an attractive nuisance for developers, many of whom
will undoubtedly go searching for truffles and end up consuming
Knuth's root of all error, and it will attract lots of one-off RFEs to
deal with specific use cases that break with a minimal implementation.
N.B. Emacs has just given up on a 15-year experiment with such a
minimal facility (the execrable "string-as-unibyte" toggle).

Use of multiple internal text encodings is really a can of worms, as
the Emacs experience demonstrates (they were unable to even write
Latin-1 files properly, with repeated regressions of the so-called
"\201 bug" that I know of from 1995 to 2008, mostly because of misuse of
string-as-unibyte).  XEmacs, with a proper character type, eliminated
the "\201 bug" *before* its multilingual version stopped crashing in
the codecs.  But even there, because the internal character type is
based on ISO-2022, it sucks, and we consistently have issues with bogus
decoding and the like that are hard to get around at the app level
because there's way too much generality at the underlying level that
we try to handle "transparently".

That's where you're going; maybe you can do better than XEmacs levels
of suckiness :-), but (for a general facility) it won't be easy.
Better to do that at the application level, which can decide for
itself what safeguards are needed.

 > or when the underlying data is bytes, but I want to treat it as
 > text temporarily without copying the whole buffer.

That ship has sailed AFAICS.  If the "copy-the-whole-buffer" style of
polymorphism isn't good enough, you have special knowledge of the data
and/or the application, and it's a layering violation to ask Python to
manage that data for you because Python's model of text is str.  It
will result in unexpected UnicodeErrors.
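
To make the failure mode concrete, here is a minimal sketch.  The
LazyTextView class is hypothetical (it is not a proposed or existing
API); it only assumes that a "view" defers decoding until text is
requested.  Construction succeeds, and the error surfaces later, far
from where the bytes entered the program:

```python
class LazyTextView:
    """Hypothetical wrapper: treats bytes as text, decoding on demand."""

    def __init__(self, data, encoding="utf-8"):
        self._data = memoryview(data)   # no copy, and no validation
        self._encoding = encoding

    def text(self, start, stop):
        # Only the requested slice is copied and decoded.
        return bytes(self._data[start:stop]).decode(self._encoding)


data = b"Subject: ok\n" + b"\xff\xfe garbage"
view = LazyTextView(data)        # succeeds: nothing has been checked yet
first = view.text(0, 11)         # the well-formed part decodes fine

try:
    view.text(12, len(data))     # the bad bytes are only noticed here
    late_error = None
except UnicodeDecodeError as exc:
    late_error = exc
```

With an eager data.decode("utf-8") the same error would be raised once,
at the I/O boundary, where the application can decide what to do.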

 > A custom type need not allow direct access to the buffer as an array,
 > so it would have to provide its own access functions.  I accept that
 > using these subtype-specific functions might be slower, but I think
 > the downside for "normal" strings can be limited to an extra case
 > statement in places like PyUnicode_WRITE

I expect that library code that must be robust against UnicodeError
(eg, email) will need to be prepared for gratuitous errors from custom
types.  Since that's at least a desideratum for all stdlib code, this
could be rather more expensive than you suggest.

 > A type for an alternate one-byte encoding could be as simple as using
 > the 1Byte variants to create a string of the same type when large
 > strings are called for, and a translation function when individual
 > characters are requested.

In some cases that I would expect to be common, this would require
substantial analysis to determine whether it is a pessimization.  I
suppose that in many use cases, you will be implicitly creating many
strings, and the space overhead of the implicit strings may be greater
than the size of the single string.
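
A rough way to see the overhead, assuming CPython (sys.getsizeof
reports implementation details that vary by version, and the data here
is made up for illustration):

```python
import sys

# One buffer holding 1000 short single-byte-encoded "lines".
data = b"k=v\n" * 1000

# The implicit strings an application might create while working
# through the buffer: one str object per line.
pieces = [line.decode("latin-1") for line in data.splitlines()]

buffer_size = sys.getsizeof(data)
piece_total = sum(sys.getsizeof(p) for p in pieces)

print("buffer:", buffer_size, "bytes")
print("pieces:", piece_total, "bytes in", len(pieces), "str objects")
```

Each small str carries a fixed per-object header, so the pieces
together can take far more space than the buffer they came from.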

In cases where the analysis is simple (eg, parsing an RFC 822 message
header out of the middle of a huge mbox file), the analysis that shows
that this could be done efficiently with a custom type can easily and
efficiently be converted to an implementation based on converting only
the bytes needed.
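
As a sketch of what "converting only the bytes needed" might look
like: the helper name decode_header_block and the sample data are
invented for illustration, and the code assumes the header block ends
at the first empty line and is plain ASCII.

```python
def decode_header_block(buf, start, encoding="ascii"):
    """Decode only the header block that begins at offset `start`.

    `buf` is any bytes-like object supporting find() and slicing.
    """
    end = buf.find(b"\n\n", start)      # headers end at the first blank line
    if end == -1:
        end = len(buf)
    return buf[start:end].decode(encoding)   # copies just this slice


# Stand-in for a huge mbox file; note the undecodable bytes in a body.
mbox = (
    b"From alice@example.org Fri Dec  9 05:41:41 2011\n"
    b"Subject: first\n"
    b"\n"
    b"body \xff\xfe with junk bytes\n"
    b"\n"
    b"From bob@example.org Fri Dec  9 06:00:00 2011\n"
    b"Subject: second\n"
    b"To: list@example.org\n"
    b"\n"
    b"another body\n"
)

start = mbox.find(b"\nFrom bob") + 1    # offset of the second message
headers = decode_header_block(mbox, start)
print(headers)
```

The junk bytes in the first body are never touched, so they cannot
raise an error here.  The same find-and-slice approach should also work
on an mmap object, which avoids reading the whole file into memory.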

I understand the attraction of such facilities for simplifying user
code, but given my somewhat extensive experience with maintaining
them, I recommend that Python core Just Say No.  It's just too hard to
maintain "text invariants" when you might be processing a few million
bytes from /dev/urandom.  If one (as an application programmer) knows
better, and of course she does, then shouldn't she DTRT at the
application code level?