[Python-ideas] Re: Explicitly defining a string buffer object (aka StringIO += operator)

March 30, 2020

Hello,

On Mon, 30 Mar 2020 09:58:32 -0700
Brett Cannon <brett@python.org> wrote:
...
On Sun, Mar 29, 2020 at 10:58 AM Paul Sokolovsky <pmiscml@gmail.com>
wrote:
...
[SNIP]
1. Succumb to applying the same mis-optimization for string type as
CPython3. (With the understanding that for speed-optimized projects,
implementing mis-optimizations will eat into performance budget, and
for memory-optimized projects, it likely will lead to noticeable
memory bloat.)
[SNIP]
1. The biggest "criticism" I see is a response a-la "there's no
problem with CPython3, so there's nothing to fix". This is related
to a bigger question, "whether a life outside CPython exists", or
put more formally, where's the border between Python-the-language
and CPython-the-implementation. To address this point, I tried to
collect performance stats for a pretty wide array of Python
implementations.
I don't think characterizing this as a "mis-optimization" is fair.
There is use of in-place add with strings in the wild and CPython
happens to be able to optimize for it.
Certainly not everyone has to agree with that characterization. Nor is
there a strong need to be offended that it's "unfair". After all,
it's just somebody's opinion. Roughly speaking, the need to be upset by
the "mis-" prefix is about the same as the need to be upset by "bad" in
some random blog post, e.g. https://snarky.ca/my-impressions-of-elm/

I'm also sure that people familiar with the implementation details
understand the reason for that "mis-" prefix, but let me be explicit
anyway: a string is one of the fundamental types in many languages,
including Python, and trying to make it too many things at once has its
overhead. Roughly speaking, to support efficient appending, one needs to
be ready to over-allocate string storage, and maintain bookkeeping for
this. Another known optimization CPython does is for stuff like "s =
s[off:]", which requires maintaining another "offset" pointer. Even
with this simplistic consideration, the internal structure of "str" would
be about the same as that of "io.StringIO" (which also needs to
over-allocate and maintain a "current offset" pointer). But why, if
there's io.StringIO in the first place?
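
To make the bookkeeping argument concrete, here is a toy sketch of an
append-only buffer that over-allocates its storage. The class name
AppendBuffer and its methods are made up for illustration; this is not
how CPython's "str" or "io.StringIO" is actually implemented, only an
illustration of the two extra fields (capacity and used length) that
efficient appending requires.

```python
class AppendBuffer:
    """Toy append-only text buffer with explicit over-allocation (illustrative only)."""

    def __init__(self, capacity=16):
        self._data = bytearray(capacity)  # allocated storage; its length is the capacity
        self._len = 0                     # number of bytes actually in use

    def append(self, text):
        raw = text.encode("utf-8")
        needed = self._len + len(raw)
        if needed > len(self._data):
            # Grow geometrically so that repeated appends are amortized O(1).
            new_cap = max(needed, 2 * len(self._data))
            self._data.extend(bytes(new_cap - len(self._data)))
        # Same-length slice assignment: copies into the spare room, no resize.
        self._data[self._len:needed] = raw
        self._len = needed

    def capacity(self):
        return len(self._data)

    def getvalue(self):
        return self._data[:self._len].decode("utf-8")


buf = AppendBuffer(capacity=4)
buf.append("ab")
buf.append("cdef")
print(buf.getvalue(), buf.capacity())
```

Every object of such a type pays for the capacity and length fields, and
for the unused tail of the storage, whether or not it is ever appended to.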
...
Someone was motivated to do
the optimization so we took it without hurting performance for other
things. There are plenty of other things that I see people regularly do
that I don't personally think are best practice, but that doesn't mean
we should automatically ignore them and not help make their code more
performant if possible without sacrificing best practice performance.
Nowhere did I argue against applying that optimization in CPython.
Surely, in general, the more optimizations, the better. I just stated
the fact that of 8 (well, 11, 11!) Python'ish implementations surveyed,
only 1 implemented it.

And what was implied is that even under ideal conditions, where other
implementations say "we have the resources to implement and maintain that
optimization" (we're still talking about the "str +=" optimization), it
would be against the interests of at least some projects. E.g.
MicroPython, Pycopy, and Snek optimize for memory usage, TinyPy for
simplicity of implementation. "Too-complex basic types" are also a
known problem for JITs (which become less performant due to the need to
handle multiple cases of the same primitive type, and much harder to
develop and debug).

At the same time, the ergonomics of "str +=" are very good (heck, that's
why people use it). So, I was looking for the simplest possible
change which would bring the largest part of those ergonomics to an
object type more suitable for content accumulation *across* different
Python'ish implementations.
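
As a rough illustration of the idea in the subject line (a "+="
operator for StringIO), the behaviour can be approximated today with a
small subclass. The name StringBuilder is hypothetical, and this is only
a sketch of the ergonomics, not the text of the actual proposal:

```python
import io


class StringBuilder(io.StringIO):
    """Hypothetical io.StringIO subclass where `buf += s` means `buf.write(s)`."""

    def __iadd__(self, text):
        self.write(text)
        return self  # keep the name bound to the same buffer object


buf = StringBuilder()
for word in ("spam", " ", "eggs"):
    buf += word
print(buf.getvalue())
```

The loop body reads exactly like the "str +=" version, while the
accumulation happens in an object that any implementation providing
io.StringIO already has.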

I have to admit that I was inspired to write down this RFC by PEP 616,
"String methods to remove prefixes and suffixes". Who'd think that
after so many years, there's still something useful to be added to
string methods (and that it doesn't have to be as complex as one
can devise at full throttle, but much simpler than that)?
...
And I'm not sure if you're trying to insinuate that CPython represents
Python the language
That's an old and painful (to some) topic.
...
and thus needs to not optimize for something other
implementations have/can not optimize for, which if you are
As I clarified, I'm not saying that CPython shouldn't optimize for things.
I just tried to argue that there's no clearly defined abstraction (*)
for an accumulating string buffer, and that one could be easily
"established".

(*) Instead, there are various practical hacks to implement it, as
both the 2006 thread and this one show.
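
For reference, two accumulation idioms commonly seen in Python code are
sketched below. Whether these are exactly the ones brought up in the two
threads is an assumption on my side; the function names are made up for
illustration.

```python
import io


def build_with_join(parts):
    # Idiom 1: collect the pieces in a list, join them once at the end.
    chunks = []
    for part in parts:
        chunks.append(part)
    return "".join(chunks)


def build_with_stringio(parts):
    # Idiom 2: write the pieces into an in-memory text stream.
    buf = io.StringIO()
    for part in parts:
        buf.write(part)
    return buf.getvalue()


pieces = ["a", "b", "c"]
print(build_with_join(pieces), build_with_stringio(pieces))
```

Both produce the same result as a "str +=" loop, but neither reads as
directly as "s += part".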
...
suggesting that then I have an uncomfortable conversation I need to
have with PyPy 😉.
Or if you're saying CPython and Python should be
considered separate, then why can't CPython optimize for something it
happens to be positioned to optimize for that other implementations
can't/haven't?
Yes, I personally think that CPython and Python should be
considered separate. E.g. the topic of this RFC shouldn't be
considered just from CPython's point of view, but rather from the angle
of "Python doesn't seem to define a useful abstraction of an (ergonomic)
string builder, here's how different Python implementations can acquire
it almost for free".

-- 
Best regards,
 Paul                          mailto:pmiscml@gmail.com