Explicitly defining a string buffer object (aka StringIO += operator)

Hello,

1. Intro
--------

It is a well-known anti-pattern to use a string as a string buffer, to
construct a long (perhaps very long) string piece-wise. A running example is:

    buf = ""
    for i in range(50000):
        buf += "foo"
    print(buf)

An alternative is to use a buffer-like object explicitly designed for
incremental updates, which for Python is io.StringIO:

    buf = io.StringIO()
    for i in range(50000):
        buf.write("foo")
    print(buf.getvalue())

As can be seen, this requires changing the way the buffer is constructed
(usually in one place) and the way the buffer value is taken (usually in one
place). More importantly, it requires changing each line which adds content
to the buffer, and there can be many of those for more complex algorithms.
This leads to code less clear than the original, requires noise-like
changes, and complicates updates to 3rd-party code which needs optimization.

To address this, this RFC proposes to add an __iadd__ method (i.e. an
implementation of the "+=" operator) to io.StringIO and io.BytesIO objects,
making it an exact alias of the .write() method. This will allow for code
very parallel to the original str-using code:

    buf = io.StringIO()
    for i in range(50000):
        buf += "foo"
    print(buf.getvalue())

This will still require updates for buffer construction/getting the value,
but that's usually 2 lines. It will leave the rest of the code intact, and
not obfuscate the original content-construction algorithm.

2. Performance Discussion
-------------------------

The motivation for this change (promoting usage of io.StringIO by making it
look & feel more like str) is performance. But is it really a problem? It
turns out to be such a pervasive anti-pattern that recent versions of
CPython3 have a special optimization for it.
Let's use the following script for testing:

---------
import timeit
import io

def string():
    sb = u""
    for i in range(50000):
        sb += u"a"

def strio():
    sb = io.StringIO()
    for i in range(50000):
        sb.write(u"a")

print(timeit.timeit(string, number=10))
print(timeit.timeit(strio, number=10))
---------

With CPython3.6 the result is:

$ python3.6 str_iadd-vs-StringIO_write.py
0.03350826998939738
0.033480543992482126

In other words, there's no difference between usage of str vs StringIO. But
it wasn't always like that; with CPython2.7.17:

$ python2.7 str_iadd-vs-StringIO_write.py
2.10510993004
0.0399420261383

But Python2 is dead, right? Ok, let's see how Jython3 and IronPython3 fare.
To my surprise, there are no (public releases of) such. Both projects sit
firmly in Python2 territory. So, let's try them:

$ java -jar jython-standalone-2.7.2.jar str_iadd-vs-StringIO_write.py
10.8869998455
1.74700021744

Sadly, I wasn't able to get IronPython.2.7.9.zip to run on my Linux system,
so I used the online version at https://tio.run/#python2-iron (after
discovering that https://ironpython.net/try/ is dead).

2.7.9 (IronPython 2.7.9 (2.7.9.0) on Mono 4.0.30319.42000 (64-bit))
26.2704391479
1.55628967285

So, it seems that rumors of Python2 being dead are somewhat exaggerated.
Let's try a project which tries to provide the "missing migration path"
between Python2 and Python3 - https://github.com/naftaliharris/tauthon

Tauthon 2.8.1+ (heads/master:7da5b76f5b, Mar 29 2020, 18:05:05)
$ tauthon str_iadd-vs-StringIO_write.py
0.792158126831
0.0467159748077

Whoa, Tauthon seems to be faithful to its promise of being half-way between
CPython2 and CPython3. Anyway, let's get back to Python3.
Fortunately, there's PyPy3, so let's try that:

$ ./pypy3.6-v7.3.0-linux64/bin/pypy3 str_iadd-vs-StringIO_write.py
0.5423258490045555
0.01754526497097686

Let's not forget the little Python brothers and continue with MicroPython
1.12 (https://github.com/micropython/micropython):

$ micropython str_iadd-vs-StringIO_write.py
41.63419413566589
0.08073711395263672

Pycopy 3.0.6 (https://github.com/pfalcon/pycopy):

$ pycopy str_iadd-vs-StringIO_write.py
25.03198313713074
0.0713810920715332

I also wanted to include TinyPy (http://tinypy.org/) and Snek
(https://github.com/keith-packard/snek) in the shootout, but both (seem to)
lack a StringIO object.

These results can be summarized as follows: of more than half a dozen
Python implementations, CPython3 is the only one which optimizes for the
dubious usage of an immutable string type as an accumulating character
buffer. For all other implementations, unintended usage of str incurs an
overhead of about one order of magnitude, or 2 orders of magnitude for
implementations optimized for particular usecases (this includes PyPy,
optimized for speed, vs MicroPython/Pycopy, optimized for small code size
and memory usage).

Consequently, other implementations have 2 choices:

1. Succumb to applying the same mis-optimization for the string type as
CPython3. (With the understanding that for speed-optimized projects,
implementing mis-optimizations will eat into the performance budget, and
for memory-optimized projects, it will likely lead to noticeable memory
bloat.)

2. Struggle against inefficient-by-concept usage, and promote usage of the
correct object types for incremental construction of string content. This
would require improving the ergonomics of the existing string buffer
object, to make its usage less painful both when writing new code and when
refactoring existing code.

As you may imagine, the purpose of this RFC is to raise awareness and try
to make headway with choice 2.

3. Scope Creep, aka "Possible Future Work"
------------------------------------------

The purpose of this RFC is specifically to propose a *single* simple,
obvious change: .__iadd__ is just an alias for .write, period. However, for
completeness, it makes sense to consider the alternatives and where the
path of adding "str-like functionality" may lead us.

1. One alternative to patching StringIO would be to introduce a completely
different type, e.g. StringBuf. But that would largely be "creating more
entities without necessity", given that StringIO already offers the needed
buffering functionality, and just needs a little touch of interface polish.
If 2 classes like StringIO and StringBuf existed, it would be an extra quiz
to explain the difference between them and why they both exist.

2. On the other hand, this RFC fixates on output buffering. But just
imagine how much fun can be had re: input buffers! E.g., we could define
"buf[0]" to have the semantics of "tmp = buf.tell(); res = buf.read(1);
buf.seek(tmp); return res". Ditto for .startswith(), etc., etc. So, well...
This RFC is about making .__iadd__ an alias for .write, covering the output
buffering usecase. Whoever has an interest in input buffer shortcuts would
need to provide a separate RFC, with separate usecases and argumentation.

4. (Self-)Criticism and Risks
-----------------------------

1. The biggest "criticism" I see is a response a la "there's no problem
with CPython3, so there's nothing to fix". This is related to the bigger
question of "whether a life outside CPython exists", or put more formally,
where the border lies between Python-the-language and
CPython-the-implementation. To address this point, I tried to collect
performance stats for a pretty wide array of Python implementations.

2. Another potential criticism is that this may open scope creep to add
more str-like functionality to classes which classically expose a stream
interface.
Paragraph 3.2 is dedicated specifically to addressing this point, by
invoking what is hopefully the best-practice approach: a request to focus
on the currently proposed feature, which requires very little change for an
arguably noticeable improvement. (At the same time, should this change be
found positive, it may influence further proposals from interested
parties.)

3. Implementing the required functionality is pretty easy with a user
subclass:

    class MyStringBuf(io.StringIO):
        def __iadd__(self, s):
            self.write(s)
            return self

Voila. The problem is performance. Calling such an .__iadd__() method
implemented in Python is 3 times slower than calling .write() directly
(with CPython3.6). But the paradigmatic problem is even bigger: this RFC
seeks to establish the best practice of using a type explicitly designed
for the purpose, with an ergonomic interface. Saying "we lack such a
clearly designated type out of the box, but if you figure out that it's a
problem (if you do figure that out), you can easily resolve it on your
side, albeit with a performance hit compared to it being provided out of
the box" - that's not really welcoming or ergonomic.

5. Prior Art
------------

As with many things related to Python, the idea is not new. I found a
thread from 2006 dedicated to it:
https://mail.python.org/pipermail/python-list/2006-January/403480.html
(strangely, there's some mixup in the archives, and "thread view" shows
another message as the thread starter, though it's not:
https://mail.python.org/pipermail/python-list/2006-January/357453.html).
The discussion there seemed to end without clear resolution and was swamped
by discussion of unicode handling complexities in Python2 and
implementation details of the "StringIO" vs "cStringIO" modules (both of
which were deprecated in favor of "io").

--
Best regards,
 Paul                          mailto:pmiscml@gmail.com
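The 3x overhead figure from item 3 above can be rechecked with a small timeit sketch (a rough benchmark only; loop sizes are arbitrary and exact ratios vary by machine and Python version):

```python
import io
import timeit

class MyStringBuf(io.StringIO):
    def __iadd__(self, s):
        self.write(s)
        return self      # += rebinds the name to the returned object

def with_iadd():
    buf = MyStringBuf()
    for _ in range(10000):
        buf += "a"       # dispatches through the Python-level __iadd__
    return buf.getvalue()

def with_write():
    buf = io.StringIO()
    for _ in range(10000):
        buf.write("a")   # calls the C-implemented method directly
    return buf.getvalue()

print("__iadd__:", timeit.timeit(with_iadd, number=20))
print(".write():", timeit.timeit(with_write, number=20))
```

Both functions build the same string; only the dispatch path differs.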

It’s usually an even better alternative to just put the strings into a list of strings (or to write a generator that yields them), and then pass that to the join method. This is recommended in the official Python FAQ. It’s usually about 40% faster than using StringIO or relying on the string-concat optimization in CPython, it’s efficient across all implementations of Python, and it’s obvious _why_ it’s efficient. It can sometimes take more memory, but the tradeoff is usually worth it.

This has been well known in the Python community for decades. People coming from C++ look for something like stringstream and find StringIO; people coming from Java look for something like StringBuilder and build their own version around StringIO; people who are comfortable with Python use str.join. So third-party libraries that don’t do that are likely either (a) not expecting large amounts of data (and therefore probably suboptimal in other areas), or (b) written by someone who doesn’t really get Python.

So what is StringIO for? For being a file object, but in memory rather than representing a file. Its API is exactly the same as every other file object’s, because that’s the whole point of it.
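For concreteness, the join idiom in both of its forms, sized to match the OP's running example (a sketch):

```python
# Accumulate the pieces in a list, then join once at the end.
parts = []
for _ in range(50000):
    parts.append("foo")
result = "".join(parts)

# Equivalent generator form, handy when the pieces are produced lazily.
result2 = "".join("foo" for _ in range(50000))

print(len(result), result == result2)
```

Either way, each piece is stored once and the final string is allocated exactly once, which is why this stays linear on every implementation.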
So your goal is to allow people to use badly-written third-party libs designed around the string-concat antipattern, without fixing those libs, by feeding them StringIO objects when they expected str objects? This seems like a solution to a theoretical problem that might work for some instances of that problem. But do you have any actual examples of third-party libs that have this problem, and that (obviously) break if you give them StringIO objects, but would not break when passed a StringIO with __iadd__?
Yes. Not as in “nobody will ever run it again”, but definitely as in “no new feature you add to Python will be backported”. Python 2.7 the language and CPython 2.7 the implementation have been feature-frozen for years now, and now they’re not even supported by the Python organization at all. So, trying to improve the behavior of Python 2.7 code by making a proposal for Python won’t get you anywhere. Adding StringIO.__iadd__ to Python 3.10 will not help anyone using Python 2.7.

In fact, even if you somehow convinced everyone to make the extraordinary decision to re-open Python 2.7 and make a new 2.7.18 release with this feature backported, it still wouldn’t help the vast majority of people using Python 2.7, because most people using Python 2.7 are using stable systems with stable versions that they don’t update for years. That’s why they’re still using 2.7 in the first place: because 2.7.16 is what comes with the Linux LTS they’ve settled on for deployment, or it’s what comes with the macOS version they use for their dev boxes, or Jython doesn’t have a 3.x version yet, or whatever. So a new feature in 2.7.18 wouldn’t get to them for years, if ever.

It’s also worth noting that the io module is very slow in most Python 2.x implementations. There’s a separate (and older) StringIO module, and for CPython an accelerated cStringIO, and you almost certainly want to use those, not io, here. (Except, of course, that what you really want to use is join anyway.)
The last IronPython release, 2.7.9, was in 2018. As the release notes for that version say, “With this release, we will shift the majority of work to IronPython3.” Of course IronPython3 isn’t ready for prime time yet, but it’s not because they’re still firmly in Python2 territory and still making major improvements to their 2.7 branch, it’s because it’s taking a long time to finish their 3.x branch (in part because they no longer have Microsoft and Unity throwing resources at the project). They’re not adding new features to 2.7 any more than CPython is. (They are working on a 2.7.10; but it’s just 2.7.9 with support for more .NET runtimes plus porting some security fixes from the last CPython 2.7 stdlib.) I don’t know the situation with Jython as well, but I believe it’s similar.
3. Recognize that Python and CPython have been promoting str.join for this problem for decades, and most performance-critical code is already doing that, so make sure that solution is efficient; and recognize that poorly-written code is uncommon but does exist, and may take a bit more work than a 1-line change to optimize, but that's acceptable - and not the responsibility of any alternate Python implementation to help with.

I completely agree with Andrew Barnert. I just want to add a little comment about overriding the `+=` (and `+`) operator for StringIO. Since StringIO is a stream - not a string - I think `StringIO` should continue to use the common interface for streams in Python. `write()` and `read()` are fine for streams (and files), and you can find similar `write` and `read` functions in other languages. I cannot see any advantage in departing from this convention.

I agree with the arguments the OP brings forward. Maybe it should be a case for having a `StringIO` and `BytesIO` subclass? Or better yet, just a class that wraps those and hides away the other file-like methods and behaviors? That would keep the new class semantically a string, and it could implement all of the str/bytes methods and attributes so as to be a drop-in replacement - _and_ add a proper `__setitem__` so that one could have a proper "mutable string". It would just use StringIO/BytesIO as its "engine". Such code would take like 100 lines (most of them just to forward/reimplement some of the legacy str methods), be an effective drop-in replacement, require no change to Python - it could even be put on PyPI now - and maybe even reach Python 3.9 in time, because, as I said, I agree with your points.

On Mon, 30 Mar 2020 at 12:06, <jdveiga@gmail.com> wrote:
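The wrapper idea sketched above could start out something like the following. Everything here is hypothetical: the `MutStr` name and API are invented for illustration, and `__setitem__` relies on io.StringIO treating stream positions as plain code-point offsets, which works in CPython but is not guaranteed by the docs:

```python
import io

class MutStr:
    # Hypothetical sketch of a StringIO-backed "mutable string" wrapper.
    def __init__(self, s=""):
        self._buf = io.StringIO(s)
        self._buf.seek(0, io.SEEK_END)   # position at end, ready to append

    def __iadd__(self, s):
        self._buf.write(s)               # append at the current (end) position
        return self

    def __setitem__(self, i, ch):
        # Assumes positions are code-point indices (true in CPython).
        pos = self._buf.tell()
        self._buf.seek(i)
        self._buf.write(ch)              # write() overwrites in place
        self._buf.seek(pos)

    def __str__(self):
        return self._buf.getvalue()

m = MutStr("hello")
m += " world"
m[0] = "H"
print(str(m))
```

A real version would still need the hundred-odd forwarded str methods, and (as discussed later in the thread) would still not be accepted by C code that type checks for str.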

On Mar 30, 2020, at 08:29, Joao S. O. Bueno <jsbueno@python.org.br> wrote:
Why? What’s the benefit of building a mutable string around a virtual file object wrapped around a buffer (with all the extra complexities and performance costs that involves, like incremental Unicode encoding and decoding) instead of just building it around a buffer directly?

Also, how can you implement an efficient randomly-accessible mutable string object on top of a text file object? Text files don’t do constant-time random-access seek to character positions; they can only seek to the opaque tokens returned by tell. (This should be obvious if you think about how you could seek to the 137th character in a UTF-8 file without reading all of the first 137 characters.)

(In fact, recent versions of CPython optimize StringIO so it only fakes being a TextIOWrapper around a BytesIO and actually uses a Py_UCS4* buffer for storage, but that’s CPython-specific, not guaranteed, and not accessible from Python even in CPython.)

And, even if that were a good idea for implementation reasons, why should the user care? If they need a mutable string, why do they care whether you give them one that inherits from or delegates to a StringIO instead of a list or an array.array of int32 or the CPython string buffer API (whether accessed via a C extension or ctypes.pythonapi) or a pure C library with its own implementation and optimizations?

More generally, a StringIO is neither the obvious way nor the fastest way nor the recommended way to build strings on the fly in Python, so why do you agree with the OP that we need to make it better for that purpose? Just to benefit people who want to write C++ instead of Python? If the goal is to cater to people who won’t read the docs to learn the right way, the obvious solution is to mandate the non-quadratic string concatenation of CPython for all implementations, not to give them yet another way of doing it and hope they’ll guess or look up that one even though they didn’t guess or look up the long-standing existing one.
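The opaqueness of text-mode positions is easy to see with a real file (a small sketch using a temporary UTF-8 file; the file name and contents are arbitrary):

```python
import tempfile

with tempfile.NamedTemporaryFile(mode="w+", encoding="utf-8") as f:
    f.write("héllo wörld")
    f.seek(0)
    head = f.read(3)      # "hél" -- three characters...
    cookie = f.tell()     # ...but the cookie is byte-based (4 here, not 3),
                          # usable only for seeking back, not as an index
    f.seek(cookie)        # round-tripping a cookie is the supported use
    tail = f.read()

print(head, tail)
```

There is no way to ask such a stream for "character 137" directly; you can only replay from a cookie.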
Sadly, this isn’t possible. Large amounts of C code - including builtins and stdlib - won’t let you duck type as a string: it will do a type check and expect an actual str (and if you subclass str, it will ignore your methods and use the PyUnicode APIs to get your base class’s storage directly as a buffer instead). So, no type, either C or Python, can really be a drop-in replacement for str. At best you can have something that you have to call str() on half the time. That’s why there’s no MutableStr on PyPI, and no UTF8Str, no EncodedStr that can act as both a bytes and a str by remembering its encoding (Nick Coghlan’s motivating example for changing this back in the early 3.x days), etc.

Fixing this cleanly would probably require splitting the string C API into abstract and concrete versions a la sequence and then changing a ton of code to respect abstract strings (to only optimize for concrete ones rather than requiring them, again like sequences). Fixing it slightly less cleanly with a hookable API might be more feasible (I’m pretty sure Nick Coghlan looked into it before the 3.3 string redesign; I don’t know if anyone has since), but it’s still probably a major change.
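The "C code ignores your methods" point can be demonstrated in a few lines (`LoudStr` is a made-up name for illustration):

```python
class LoudStr(str):
    # Made-up subclass: tries to change how the string presents itself.
    def __str__(self):
        return self.upper()

x = LoudStr("abc")
print(str(x))          # Python-level protocol sees the override: "ABC"
print("".join([x]))    # C code reads the base str storage directly: "abc"
print("<" + x + ">")   # concatenation also bypasses the subclass: "<abc>"
```

Anywhere the interpreter or a C extension reaches for the underlying PyUnicode buffer, the subclass behavior silently disappears.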

Hi Andrew - I made my previous post before reading your first answer. So, anyway, what we have is that for a "mutable string like object" one is free to build one's own wrapper - StringIO based or not - put it on PyPI, and remember to call `str()` on it before it leaves your code. Thank you for the lengthy reply anyway.

That said, can anyone tell about small, efficient, well maintained "mutable string" classes on PyPI?

On Mon, 30 Mar 2020 at 14:07, Andrew Barnert <abarnert@yahoo.com> wrote:

On Tue, Mar 31, 2020 at 4:20 AM Joao S. O. Bueno <jsbueno@python.org.br> wrote:
There's a vast difference between "mutable string" and "string builder".
The OP was talking about this kind of thing:

buf = ""
for i in range(50000):
    buf += "foo"
print(buf)

And then suggested using a StringIO for that purpose. But if you're going
to change your API, just use a list:

buf = []
for i in range(50000):
    buf.append("foo")
buf = "".join(buf)
print(buf)

So if you really want a drop-in replacement, don't build it around
StringIO, build it around list.

class StringBuilder:
    def __init__(self):
        self.data = []
    def __iadd__(self, s):
        self.data.append(s)
    def __str__(self):
        return "".join(self.data)

This is going to outperform anything based on StringIO fairly easily, plus
it's way WAY simpler. But this is *not* a mutable string. It's a string
builder. If you want a mutable string, first figure out exactly what
mutations you need, and what performance you are willing to accept.

ChrisA

On Tue, Mar 31, 2020 at 5:10 AM Serhiy Storchaka <storchaka@gmail.com> wrote:
And that's what I get for quickly whipping something up and not testing it. Good catch. But you get the idea - a simple wrapper around a *list* is going to be way better than a wrapper around StringIO. ChrisA
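For reference, the list-based builder with the `return self` that the quickly-whipped-up draft missed looks like this (a sketch):

```python
class StringBuilder:
    def __init__(self):
        self.data = []

    def __iadd__(self, s):
        self.data.append(s)
        return self      # without this, `buf += s` rebinds buf to None

    def __str__(self):
        return "".join(self.data)

buf = StringBuilder()
for _ in range(5):
    buf += "foo"
print(str(buf))          # "foofoofoofoofoo"
```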

Hello, On Tue, 31 Mar 2020 04:27:04 +1100 Chris Angelico <rosuav@gmail.com> wrote: []
I appreciate expressing it all concisely and clearly. Then let me respond
here instead of to the very first '"".join() rules!' reply I got.

The issue with "".join() is very obvious:

------
import io
import sys

def strio():
    sb = io.StringIO()
    for i in range(50000):
        sb.write(u"==%d==" % i)
    print(sys.getsizeof(sb) + sys.getsizeof(sb.getvalue()))

def listjoin():
    sb = []
    sz = 0
    for i in range(50000):
        v = u"==%d==" % i
        # All individual strings will be kept in the list and
        # can't be GCed before the final join.
        sz += sys.getsizeof(v)
        sb.append(v)
    s = "".join(sb)
    sz += sys.getsizeof(sb)
    sz += sys.getsizeof(s)
    print(sz)

strio()
listjoin()
------

$ python3.6 memuse.py
439083
3734325

So, it's obvious, but let's formulate it clearly for the avoidance of
doubt: there's absolutely no reason why performing the trivial operation of
accumulating string content should take about an order of magnitude more
memory than actually needed for that content. Don't get me wrong - if you
want to spend that much of your memory, then sure, you can. But jumping in
with that as *the only right solution* whenever somebody mentions "string
concatenation" is a bit... umm, cavalier.
> This is going to outperform anything based on StringIO fairly easily,
Since when is raw speed the only criterion for performance? If you say "forever", I'll trust that only if you proceed to show the assembly code with SSE and AVX which you wrote to get those last cycles out. Otherwise, being able to complete operations in a reasonable amount of memory, not OOMing and not being DoSed by trivial means, and finally, serving 8 times more requests in the same amount of memory - these are all quite valid criteria too.

What's interesting is that, so far, the discussion almost 1-to-1 parallels the discussion in the 2006 thread I linked from the original mail.
But of course! And what's most important, nowhere did I say what should be inside this class. My whole concern is along 2 lines:

1. This StringBuilder class *could* be the existing io.StringIO.
2. By just adding the __iadd__ operator to it.

That's it, nothing else. What's inside the StringIO class is up to you (dear various Python implementations, their maintainers, and contributors). For example, fans of "".join() surely can have it inside. Actually, it's a known fact that Python2's "StringIO" module (the original home of the StringIO class) was implemented exactly like that, so you can go straight back to the future.

And again, the need for anything like that might be unclear to CPython-only users. Such users can write a StringBuilder class like the above, or repeat the beautiful "".join() trick over and over again. The need for a nice string builder class may occur only from considering that Python-as-a-language lacks a clear and nice abstraction for it, and from thinking how to add such an abstraction in a performant way (of which the criteria differ) to as many implementations as possible, in as easy a way as possible. (At least that's my path to it; I'm not sure if a different thought process might lead to it too.)

--
Best regards,
 Paul                          mailto:pmiscml@gmail.com

On Tue, Mar 31, 2020 at 7:04 AM Paul Sokolovsky <pmiscml@gmail.com> wrote:
> ... about order of magnitude more memory ...
I suspect you may be multiply-counting some of your usage here. Rather than
this, it would be more reliable to use the resident set size (on platforms
where you can query that).

if "strio" in sys.argv:
    strio()
else:
    listjoin()
print("Max RSS:", resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

Based on that, I find that it's at worst a 4:1 difference. Plus, I couldn't
see any material difference - the numbers were within half a percent,
basically just noise - until I upped your loop counter to 400,000, nearly
ten times as much as you were doing. (At that point it became a 2:1
difference. The 4:1 didn't show up until a lot later.) So you have to be
working with a *ridiculous* number of strings before there's anything to
even consider.

And even then, it's only notable if the individual strings are short AND
all unique. Increasing the length of the strings basically made it a wash.
Consider:

for i in range(1000000):
    sb.write(u"==%d==" % i + "*"*1024)
Max RSS: 2028060

for i in range(1000000):
    v = u"==%d==" % i + "*"*1024
Max RSS: 2104204

So at this point, the string join is slightly faster and takes slightly
more memory - within 20% on the time and within 5% on the memory.

ChrisA
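A self-contained version of that measurement approach, for anyone who wants to reproduce it (the `resource` module is POSIX-only, and the absolute numbers are platform dependent):

```python
import resource   # POSIX-only

def max_rss():
    # ru_maxrss is in kilobytes on Linux, bytes on macOS.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

def listjoin(n):
    # Build n short unique pieces, then join once.
    parts = ["==%d==" % i for i in range(n)]
    return "".join(parts)

before = max_rss()
s = listjoin(100_000)
after = max_rss()
print("result chars:", len(s), "max RSS growth:", after - before)
```

Since ru_maxrss is a high-water mark for the whole process, run each variant in a separate process (as the sys.argv dispatch above does) rather than back to back.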

Hello, On Tue, 31 Mar 2020 07:40:01 +1100 Chris Angelico <rosuav@gmail.com> wrote:
I may humbly suggest a different process too: get any hardware board with MicroPython and see how much data you can collect in a StringIO vs in a list of strings. Well, you don't actually need dedicated hardware; just get a Linux or Windows version and run it with a specific heap size using the -X heapsize= switch, e.g. -X heapsize=100K. Please don't stop there - we're talking multiple implementations - try it on CPython too. There must be a similar option there (because how else could you perform any memory-related testing!), I just forgot which. The results should be very apparent, and only a forgotten option may obscure them.

[]

--
Best regards,
 Paul                          mailto:pmiscml@gmail.com

As others have pointed out, the OP started in a bit of an oblique way, but it maybe comes down to this: there are some use-cases for a mutable string type. And one could certainly write one. Presto, here is one: https://github.com/Daniil-Kost/mutable_strings

Which looks to me to be more a toy than anything, but maybe the author is seriously using it... (it does look like it has an indexing bug if there are non-ASCII characters). And yet, as far as I know, there has never been one that was carefully written and optimized, which would be a bit of a trick, because of how Python strings handle Unicode. (It would have been a lot easier with Python2 :-))

So why not?

1) As pointed out, high-performance strings are key to a lot of coding, so Python's str is very baked-in to a LOT of code, and can't be duck-typed. I know that pretty much the only time I ever type check (as opposed to simple duck typing / EAFP) is for str. So if one were to make a mutable string type, you'd have to convert it to a string a lot in order to use most other libraries. That being said, one could write a mutable string that mirrored the CPython string type as much as possible, and it could be pretty efficient, even for making regular strings out of it.

2) Maybe it's really not that useful. Other than building up a long string with a bunch of small ones (which can be done fine with .join()), I'm not sure I've had much of a use case - it would buy you a tiny bit of performance for, say, altering strings in ways that don't change their length, but I doubt there are many (if any) applications that would see any meaningful benefit from that.

So I'd say it hasn't been done because (1) it's a lot of work and (2) it would be a bit of a pain to use, and not gain much at all.

A kind-of-related anecdote: numpy arrays are mutable, but you cannot change their length in place.
So, similarly with strings: if you want to build up an array from a lot of little pieces, the best way is to put all the pieces in a list, and then make an array out of it when you are done. I had a need to do that fairly often (reading data from files of unknown size), so I actually took the time to write an array that could be extended. Turns out that:

1) it really wasn't much faster (than using a list) in the usual use-cases anyway :-)

2) it did save memory - which only mattered for monster arrays, and I'd likely need to do something smarter anyway in those cases.

I even took some time to write a Cython-optimized version, which only helped a little. I offered it up to the numpy community. But in the end, no one expressed much interest, and I haven't used it myself for anything in a long while.

Moral of the story: not much point in a special class to do something that can already be done almost as well with the builtins.

-CHB

On Mon, Mar 30, 2020 at 2:06 PM Paul Sokolovsky <pmiscml@gmail.com> wrote:
-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython

On Mon, Mar 30, 2020 at 04:25:07PM -0700, Christopher Barker wrote:
With respect, Christopher, this is a gross misrepresentation of what Paul has asked for. He is not asking for a mutable string type. If that isn't clear from the subject line of this thread, it ought to be clear from Paul's well-written and detailed post, which carefully explains what he wants.

--
Steven

Steven D'Aprano writes:
> [I]t ought to be clear from Paul's well-written and detailed post, which carefully explains what he wants.
Whose value to Python I still don't understand, because AFAICS it's something that on the one hand violates TOOWTDI and has no parallels elsewhere in the io module, and on the other hand is trivial to implement for any programmer who really thirsts for StringIO.__iadd__. Unless there are reasons why a derived class won't do?

I agree there seem to be possible space performance issues with str.join that are especially painful for embedded applications (as came out later in the thread I believe), but if those are solved by StringIO, they're solved by StringIO. So the whole thing seems to be a cosmetic need for niche applications[1] for a niche platform[2] that is addressed by a 4-line class definition[3] for users who want the syntactic sugar. Me, I'm perfectly happy with StringIO.write because that's what I expect from the io module. FWIW YMMV of course.

Footnotes:

[1] I don't even use strings at all in any of my adafruit applications!

[2] OK, that's going too far, sorry. Embedded matters, their needs are real needs, and they face tight constraints most of us rarely need to worry about. It's still at present a minority platform, I believe, and the rest of the sentence applies AFAIK.

[3] Paul's "exact alias of .write() method", which can be done in 1 line, fails because .write() doesn't return self. Thanks, Serhiy. In the stdlib we might even want a check for "at end of buffer" (.write() can overwrite Unicode scalars anywhere in the buffer). That's definitely overengineering for a user, but in the stdlib, dunno.

Hello, On Tue, 31 Mar 2020 18:09:59 +0900 "Stephen J. Turnbull" <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote: []
[3] Paul's "exact alias of .write() method", which can be done in 1 line, fails because .write() doesn't return self. Thanks, Serhiy. In
I stand corrected, gentlemen, thanks for catching that. It's a poor man's PEP, not a real PEP after all. A real PEP wouldn't use ambiguous phrases like that, but something like "an __iadd__ method, with the same semantics as the existing .write() method, but returning self". In terms of C implementation, that's a one-line difference, in pseudocode:

- return write_method(self, ...)
+ write_method(self, ...)
+ return self

In terms of machine code, that would be +1 instruction. I guess such a minor difference made me discount it and use the ambiguous term "alias".

For reference, the implementation for Pycopy:
https://github.com/pfalcon/pycopy/commit/4b149fb8a4fb18e954ba7113d1495ccf822...
(such a big patch because, expectedly, Pycopy optimizes operators vs general methods, and as there were no operators defined for StringIO before, it takes a whole 14 lines of boilerplate to add them).
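The difference that the missing `return self` makes is easy to see at the Python level (a small sketch; the class names are invented):

```python
import io

class BrokenBuf(io.StringIO):
    def __iadd__(self, s):
        self.write(s)    # falls off the end: __iadd__ returns None

class FixedBuf(io.StringIO):
    def __iadd__(self, s):
        self.write(s)
        return self

b = BrokenBuf()
b += "x"
print(b)                 # None: += rebound the name to the return value

f = FixedBuf()
f += "x"
print(f.getvalue())      # "x"
```

Augmented assignment always rebinds the target to whatever `__iadd__` returns, which is why "alias for .write()" alone is not quite enough.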
Per my idea, __iadd__ would be the exact equivalent of .write in behavior (a complexity-busting measure), but specific implementations can of course add extra checks if they just can't do otherwise.

(Reminds me of the PEP 616 discussion, where there was mention of raising ValueError on an empty prefix, even though it all started as being an equivalent of

if s.startswith(prefix):
    s = s[len(prefix):]

And str.startswith() doesn't throw ValueError on either str.startswith("") or str.startswith(("foo", "")). It seems that we just can't pass up a chance to add another corner case to explain away from the already existing behavior, all with the good intention of policing our users, for they can't handle it themselves.)

--
Best regards,
 Paul                          mailto:pmiscml@gmail.com

Hello, On Mon, 30 Mar 2020 16:25:07 -0700 Christopher Barker <pythonchb@gmail.com> wrote:
For avoidance of doubt: nothing in my RFC has anything to do with, or implies, "a mutable string type". A well-known pattern of a string builder, yes. Piggybacking on the existing StringIO/BytesIO classes, yes. Anything else, no. To not leave it cut and dry: IMHO, we need more const'ness in Python, not less. In my dreams I already do stuff like:

    from __future__ import const

    class Foo:
        pass

    # This is an alias for "Foo"
    Bar: const = Foo

    # This is a variable which can store a reference to Foo or any other class
    Baz = Foo

[This is not a new RFC! Please start a new thread if you'd like to pick it up ;-)] -- Best regards, Paul mailto:pmiscml@gmail.com

Paul Sokolovsky wrote:
If I understand correctly, you are proposing a change from StringIO's `write` method to a `+=` operator. Is that right? I cannot see any advantage in this proposal, since there is no real change in the implementation of StringIO. Or are you proposing some change in the underlying implementation that I have missed? In that case, I disagree with you: StringIO is a stream, and I think that it is wrong to make it "look & feel" like a string. That is my opinion. Sorry if I misunderstand you.

On Tue, Mar 31, 2020 at 07:32:11PM -0000, jdveiga@gmail.com wrote:
If I understand you are proposing a change from StringIO `write` method to `+=` operator. Is it right?
No, that is not correct. The StringIO.write() method will not be changed or removed. The proposal is to extend the class with the `+=` operator, which will act as an equivalent to calling write().
This proposal isn't about enhancing StringIO's functionality. This proposal is targeted at people who are using string concatenation instead of assembling a list then calling join. It is about leveraging StringIO's ability to behave as a string builder, to give people a minimally invasive change from the string concatenation anti-pattern:

    buf = ''
    # repeated many times
    buf += 'substring'

to something which can be efficient on all Python interpreters:

    buf = StringIO()
    buf += 'substring'
    buf = buf.getvalue()
Paul has not suggested making StringIO look and feel like a string. Nobody is going to add 45+ string methods to StringIO. This is a minimal extension to the StringIO class which will allow people to improve their string building code with a minimal change. -- Steven

On Wed, 1 Apr 2020 at 02:07, Steven D'Aprano <steve@pearwood.info> wrote:
Thanks for paring the proposal down to its bare bones; there are a lot of side questions being discussed here that are confusing things for me. With this in mind, and looking at the bare proposal, my immediate thought is: who's going to use this new approach?

    buf = StringIO()
    buf += 'substring'
    buf = buf.getvalue()

I hope this isn't going to trigger another digression, but it seems to me that the answer is "nobody, unless they are taught about it, or work it out for themselves[1]". My reasons for saying this are that it adds no value over the current idiom of building a list then using join(), so people who already write efficient code won't need to change. The people who *might* change to this are people currently writing

    buf = ''
    # repeated many times
    buf += 'substring'

Those people have presumably not yet learned about the (language-independent) performance implication of repeated concatenation of immutable strings[2]. Ignoring CPython's optimisation for += on strings, as all that will do is allow them to survive longer without hitting the issues with this pattern, when they *do* find there's an issue, they will be looking for a better approach. At the moment, the message is relatively clear: "build a list and join it" (it's very rare that anyone suggests StringIO currently). This proposal is presumably intended to make "use StringIO and +=" a more attractive alternative (because it avoids the need to rewrite all those += lines). So we now find ourselves in the position of having *two* "recommended approaches" to addressing the performance issue with string concatenation. I'd contend that there's a benefit in having a single well-known idiom for fixing this issue when beginners hit it. Clarity of teaching, and less confusion for people who are learning that they need to address an issue that they weren't previously aware of.
I further suggest that the benefits of the += syntax on StringIO (less change to existing code) are not sufficient to outweigh the benefits of having a single well-known "best practice" solution. So I'm -0.5 on this change (only 0.5, because it's a pretty trivial change, and not worth getting too worked up about). Paul [1] Or they have a vested interest in using the "string builder" pattern in Python, rather than using Python's native idioms. That's not an uncommon situation, but I don't think "helping people write <language X> in Python" is a good criterion for assessing language changes, in general. [2] Or they have, and know that it doesn't affect them, in which case they don't need to change anything.

Hello, On Wed, 1 Apr 2020 10:01:06 +0100 Paul Moore <p.f.moore@gmail.com> wrote:
[]
Roughly speaking, the answer would be about the same in idea as answers to the following questions: * Who'd be using assignment expressions? (2nd way to do assignment, whoa!) * Who'd be using f-strings? (3rd (or more) way to do string formatting, bhoa!) * Who'd be writing s = s.removeprefix("foo") instead of "if s.startswith("foo"): s = s[3:]" (PEP616)? * Who'd be using binary operator @ ? * Who'd be using using unary operator + ?
Ok, so we found the answers to all those questions - people who might have a need to use, would use it. You definitely may argue of how many people (in absolute and relative figures) would use it. Let the binary operator @ and unary operator + be your aides in this task.
I don't know how much you mix with other Pythonistas, but word "clear" is an exaggeration. From those who don't like it, the usual word is "ugly", though I've seen more vivid epithets, like "repulsive": https://mail.python.org/pipermail/python-list/2006-January/403480.html More cool-headed guys like me just call it "complete wastage of memory".
Aye.
The scholasticism of "there's only one way to do it" is getting old for this language. Have you already finished explaining to everyone why we needed assignment expressions, and why Python originally had % as a formatting operator, and some people swear to keep not needing anything else? What's worse is that "there's only one way to do it" gets routinely misinterpreted as "One True Way (tm)". And where Python is deficient compared to other languages, there's rising small-scale exceptionalism along the lines of "we don't have it, and - we don't need it!". The issue is that some (many) Python programmers use a lot of different languages, and treat Python first of all as a generic programming language, not as a bag of tricks of a particular implementation. And of course, there never will be agreement between the one-true-way-tm vs nice-generic-languages factions of the community.
Another acute and beaten topic in the community. Python is a melting pot for diverse masses - beginners, greybeards, data scientists, scripting kiddies, PhDs, web programmers, etc. That's one of the greatest achievements of Python, but also one of its pain points. I wonder how many people escaped from Python just to not be haunted by that "beginners" chanting. Python is a beginner-friendly language, period, can't change that. Please don't bend it to be beginner-only. Please let people learn computer science inside Python, not learn a bag of tricks to then escape in awe and make up haikus along the lines of:

    A language, originally for kids,
    Now for grown-up noobs.

(Actual haiku seen on Reddit, sorry, can't find a link now, reproduced from memory, the original might have sounded better). [] -- Best regards, Paul mailto:pmiscml@gmail.com

Paul Sokolovsky wrote:
I would say the difference between this proposal so far and the ones listed is that they emphasized concrete, real-world examples from existing code, either in the stdlib or "out in the wild", showing clear before-and-after benefits of the proposed syntax. It may not seem necessary to the person proposing the feature, and it does take some time to research, but it creates a drastically stronger argument for the new feature. The code examples I've seen so far in the proposal have been mostly abstract or simple toy examples. To get a general idea, I'd recommend looking over the examples in their respective PEPs, and then trying to do something similar in your own arguments.
While I agree that it's sometimes okay to go outside the strict bounds of "only one way to do it", there needs to be adequate justification for doing so which provides a demonstrable benefit in real-world code. So the default should be just having one way, unless we have a very strong reason to consider adding an alternative. This was the case for the features you mentioned above.
Considering the current widespread usage of Python in the software development industry and others, characterizing it as a language for "grown-up noobs" seems rather disingenuous (even if partially in jest). We emphasize readability and beginner-friendliness, but Python is very far from beginner-only, and I don't think it's even reasonable to say that it's going in that direction. In some ways, it simplifies operations that would otherwise be more complicated, but that's largely the point of a high-level language: abstracting the complex and low-level parts to focus more on the core business logic. Also, while I can see that blindly relying on "str += part" can be sidestepping the underlying computer science to some degree, I find that appending the parts to a list and joining the elements is very conceptually similar to using a string buffer/builder, even if the syntax differs significantly from how other languages do it.

Regarding the proposal in general though, I actually like the main idea of having "StringBuffer/StringBuilder"-like behavior, *assuming* it provides substantial benefits to alternative Python implementations compared to ``"".join()``. As someone who regularly uses other languages with something similar, I find the syntax to be appealing, but not strong enough on its own to justify a stdlib version (mainly since a wrapper would be very trivial to implement). But I'm against the idea of adding this to the existing StringIO class, largely for the reasons cited above, of it being outside of the scope of its intended use case. There's also a significant discoverability factor to consider. Based on the name and its use case in existing versions of Python, I don't think a substantial number of users will even consider using it for the purpose of building strings.
As it stands, the only people who could end up benefiting from it would be the alternative implementations and their users, assuming they spend time *actively searching* for a way to build strings with reduced memory usage. So I would greatly prefer to see it as a separate class with a more informative name, even if it ends up being effectively implemented as a subset of StringIO with much of the same logic. For example:

    buf = StringBuilder()  # feel free to bikeshed over the name
    for part in parts:
        buf += part  # __iadd__ would presumably call something like buf.append() or buf.write()
    return str(buf)

This would be highly similar to existing string building classes in other popular languages, such as Java and C#. Also, on the point of memory usage: I'd very much like to see some real side-by-side comparisons of the ``''.join(parts)`` memory usage across Python implementations compared to ``StringIO.write()``. I saw some earlier in the thread, but the results were inaccurate since they relied entirely on ``sys.getsizeof()``, as mentioned earlier. IMO, having accurate memory benchmarks is critical to this proposal. As Chris Angelico mentioned, this can be observed through monitoring the before and after RSS (or equivalent on platforms without it). On Linux, I typically use something like this:

```
import os

def show_rss():
    # Linux-specific: print this process's resident set size
    os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
```

With the above in mind, I'm currently +0 on the proposal. It seems like it might be a reasonable overall idea, but the arguments of its benefits need to be much more concrete before I'm convinced. On Wed, Apr 1, 2020 at 5:45 PM Paul Sokolovsky <pmiscml@gmail.com> wrote:

On Wed, Apr 01, 2020 at 09:25:46PM -0400, Kyle Stanley wrote:
While I agree that it's sometimes okay to go outside the strict bounds of "only one way to do it"
The Zen of Python was invented as a joke, not holy writ, and as a series of koans intended to guide thought, not shut it down. Unfortunately, and with the greatest respect to Tim Peters, in practice that's not how it is used, *particularly* the "One Way" koan, which is almost invariably used as a thought-terminating cliche.

1. The Zen doesn't mandate *only one way*; that is a total canard about Python invented by the Perl community as a criticism.

2. Even if it did say "only one way", even a moment's glance at the language would show that it is not true. And moreover it *cannot* be true in any programming language. Given any task but the most basic, there will always be multiple possible implementations or algorithms, usually an *infinite* number of ways to do most things. (Not all of which will be efficient or sensible.)

3. Of all the koans in the Zen, the "One Way" koan is probably intended the most to be an ironic joke, not taken too seriously. Instead the Python community treats it as the most serious of all. In Tim Peters' own words:

    In writing a line about "only one way to do it", I used a device
    (em dash) for which at least two ways to do it (with spaces, without
    spaces) are commonly used, neither of which is obvious -- and
    deliberately picked a third way just to rub it in.

    https://bugs.python.org/issue3364

Let's look at what the koan actually says:

    There should be one-- and preferably only one --obvious way to do it.

Adding emphasis: "There SHOULD BE ONE OBVIOUS WAY to do it", with only a *preference* for one way, not a hard rule. And given that Tim wrote it as a joke, having the koan intentionally go against its own advice, I think we should treat that preference as pretty low.

So... what is "it", and what counts as "obvious"? This is where the koan is supposed to open our minds to new ideas, not shoot them down. In this case, "it" can be:

1. I want to build a string as efficiently as possible.

2. I want to build a string in as easy and obvious a way as possible.

(There may be other "its", but those are the two that stand out.)

For option 1, there is one recommended way (which may or may not be the most efficient way -- that's a quality of implementation detail): use list plus join. But it's not "obvious" until you have been immersed in Python culture for a long time. For option 1, Paul's proposal changes nothing. If list+join is the fastest and most efficient method (I shall grant this for the sake of the argument) then nothing need change. Keep doing what you are doing. The koan isn't satisfied in this case; there is One Way but it isn't Obvious. But Paul's proposal is not about fixing that.

-----

For option 2, "it" cares more about readable, self-documenting code which is clear and obvious to more than just Pythonistas who have been immersed in the language for years. The beauty of Python is that it ought to be readable by everyone, including scientists and hobbyists who use the language from time to time, students, sys admins, and coders from other languages. Ask a beginner, or someone who has immigrated from another language, what the obvious way to build a string is, and very few of them will say "build a list, then call a string method to join the list". Some of them might guess that they need to build a list, then call a *list* method to build a string: `list.join('')`. Why Python doesn't do that is even a FAQ. Beginners will probably say "add the strings together". People coming from other OOP languages will probably say "use a String Builder", and possibly even stumble across StringIO as the closest thing to a builder. It's a bit odd that you have to call "write", but it builds a string out of substrings. (Later, in another post, I will give evidence that StringIO is already used as a string builder, and has been for a long time.)
A significant sector of the community know the list+join idiom, but dislike it so strongly that they are willing to give up some efficiency to avoid it. Whatever the cause, there is a significant segment of the Python community who either don't know, don't care about, or actively dislike, the list+join idiom. For them, it is not Obvious and never will be, the Obvious Way is to concatenate strings into a String Builder or a bare string. This segment, the people who use string concatenation and either don't know better, don't care to change, or actively refuse to change, is the focus of this proposal. For this segment, the One Obvious Way is to concatenate strings using `+=`, and they aren't going to change for the sake of other interpreters. And that's a problem for other interpreters. Hence Paul's RFC. [...]
Surely the fact that the wrapper is "trivial" should count as a point in its favour, not against it? The greater the burden of an enhancement request, the greater the benefit it must give to justify it. If your enhancement requires a complete overhaul of the entire language and interpreter and will obsolete vast swathes of documentation, the benefit has to be very compelling to justify it. But if your enhancement requires an extra dozen lines of C code, one or two tests, and an extra couple of lines of documentation, the benefit can be correspondingly smaller in order for the cost:benefit ratio to come up in its favour. The cost here is tiny. This thread alone has probably exceeded by a factor of 100 the cost of implementing the change. The benefit to CPython is probably small, but to the broader Python ecosystem (Paul mentioned nine other interpreters; I can think of at least two actively maintained ones that he missed) it is rather larger.
As I mentioned above, in another post to follow I will demonstrate that people already do know and use StringIO for concatenation. Nevertheless, you do make a good point. It may be that StringIO is not the right place for this. That can be debated without dismissing the entire idea. -- Steven

On Thu, 2 Apr 2020 at 04:59, Steven D'Aprano <steve@pearwood.info> wrote:
As the person whose comment triggered this sub-thread, can I just point out that I was *not* making a "One True Way" argument, as Paul, and the people following up to him, seem to have thought. I was rather saying that when explaining to beginners why repeated string concatenation is bad, it's much easier to follow up with "... and the way you address this in Python is to build a list then join it at the end", than "... and the way you address this is either to build a list and join it at the end, or use StringIO ..." and then digress into a discussion of the relative merits of the two approaches. (And in reality, people often don't learn from teachers explaining clearly, but from Google, Reddit and Stack Exchange, where the second variant would probably end up in raging debates and flamewars ;-)) I was suggesting that having a common *idiom* is better than having lots of variations with subtle trade-offs, all being presented as the best *general* solution. Maybe that's Steven's "option 2", and I'm expecting more confusion than would actually be the case. Like everyone, I'm only guessing (based on experience, I hope, but even so). I remain -0.5, and repeat that this just means "mild preference against". I'm clarifying above because I don't want to be misinterpreted, not to try harder to persuade people of my previous arguments. Paul

Here's a very readable way to concatenate strings using "+=" on lists:
Probably not exactly what Paul had in mind, but it's beginner-friendly. The string version, with O(N²) timing, reads like this:
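[The code examples referenced above did not survive in this archive. A plausible reconstruction of the two versions, a sketch rather than the original code, would be:]

```python
# List version: "+=" on a list appends in place, so each step is
# amortized O(1), and the final join is a single O(N) pass.
chunks = []
for _ in range(50000):
    chunks += ["foo"]
fast = "".join(chunks)

# String version: each "+=" on a plain str may copy the whole buffer,
# giving O(N^2) total work on implementations without CPython's
# in-place concatenation optimization.
slow = ""
for _ in range(50000):
    slow += "foo"

assert fast == slow
```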
BTW: I think the above is a great example to teach beginners that choosing the right algorithm is rather important when it comes to dealing with large data sets. It's also a good example of why O(N²) isn't necessarily bad for small data sets. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Mar 22 2020)
Python Projects, Coaching and Support ... https://www.egenix.com/ Python Product Development ... https://consulting.egenix.com/
::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 https://www.egenix.com/company/contact/ https://www.malemburg.com/

Steven D'Aprano writes:
You've done this rant before. It wasn't persuasive then. It's not persuasive now.
3. Of all the koans in the Zen, the "One Way" koan is probably intended the most to be an ironic joke, not taken too seriously.
Conceded that there's a certain amount of irony there, and the Zen as a whole can indeed be taken too seriously. But I rather doubt Tim was joking. Parsimony is always a value to be considered, and various aspects of "obvious" are clearly important to Pythonic language design. The problem is that the core argument, here as always, is
The cost here is tiny. This thread alone has probably exceeded by a factor of 100 the cost of implementing the change.
The first sentence is false. What is true is that the cost of implementing the change is only a tiny part of the cost of the change. The largest part of the cost is the language bloat incurred by lowering the bar to the level of trivial changes that benefit someone enough to write the code, document it, commit it, and push it. Those costs are incurred by all the reviewers reviewing trivial changes and prodding implementers to make them readable and PEP 8 conformant, and document them properly, by everyone reading bloated documentation, and by all the maintainers maintaining bloated code, as well as the other implementations that now need to ensure that *two* idioms (str.join and StringIO.__iadd__) are implemented efficiently. (This isn't hard, if StringIO.__iadd__ is efficient, but what if it isn't?) You may have a point that the bar should be lower than it is now, but the bar you propose here (and have proposed repeatedly in the past) is naive and dangerous. That way lies Emacs (which I love to death, but Emacs Lisp is the historic antithesis of a disciplined, well-designed language).
Instead the Python community treats [TOOWTDI] as the most serious of all.
C'mon, Steve, this is obviously not true. I would not be surprised if it's the most cited koan on Python-Ideas, but that's because of the large number of trivial (and often bad) ideas proposed for addition to Python, then defended in (moderately) long threads on the grounds that "it's only three lines". Trivial stuff that clears the existing bar gets added despite the koan, and not just because of your long-running campaign against it. Other koans are taken much more seriously. We joke about "bikeshedding", but it's recognized as necessary because "readability counts". Features that sound attractive in the abstract (tuple arguments to removeprefix is a recent example) are withdrawn because "simple is better than complex", and the feature couldn't adduce enough use cases to invoke "although practicality beats purity". Some PEPs languish for literally years because of "In the face of ambiguity, refuse the temptation to guess" or "Although never is often better than *right* now". And both the str.join idiom and all comprehensions are (more or less superficially, I admit) instances of "Flat is better than nested."
I think you underestimate just how recursively deep Tim's sense of humor is.
The koan isn't satisfied [by str.join], there is One Way but it isn't Obvious.
You forgot that the project was led by Dutchmen, and it was obvious enough to them: Although that way may not be obvious at first unless you're Dutch. Obvious does not only mean "easily discoverable"; it may also mean "unforgettable once seen". I would not have discovered str.join by myself for years, but it was unforgettably correct as soon as I saw it, despite the odd syntax. Ex ante, not obvious to me. Ex post, forehead-flattening levels of "how did I miss that?!" obvious.
Surely the fact that the wrapper is "trivial" should count as a point in its favour, not against it?
Not obvious, as you concede (taken out of context, but I don't think unfairly):
If you want the builder to be at all discoverable, it needs to be attached to str, or at least to something more obvious than io, maybe string. But the *trivial* wrapper obviously belongs to io. Making it discoverable is complicated. If you have to import io into string, or worse builtins, it's not trivial any more IMO YMMV. The costs thus mount. Not all that much, but this is just the implementer's side, which is rarely all that expensive for well- defined features. TOOWTDI should bring up a grin when you notice the usage of the dashes, but it's not just a joke. Regards, Steve

On Apr 1, 2020, at 20:59, Steven D'Aprano <steve@pearwood.info> wrote:
I think it’s worth separating the two. There is a legitimate desire for a "better" way to build strings than str.join. And a string builder type is one obvious solution, as used in many other languages, like Java. To be worth adding, this string builder type should be more discoverable than str.join, at least as readable, and close to as efficient in both time and space. Ideally, it should be more readable to people with a visceral dislike of str.join, be significantly better in space (by not retaining all of the input strings for the entire length of the building process) on all Python implementations, and there should also be an easy-to-use backport.

Almost none of this is true for StringIO.__iadd__, but all of it could easily be true for a new string.StringBuilder (using the name I think Stephen suggested). It’s an obvious name that will probably be the first thing anyone finds in any reasonable search for how to do a string builder in Python. Once found, the meaning is pretty clear. And its help or docs will be useful, concise, and to the point, instead of being all about irrelevant and confusing file stuff with += buried somewhere in the middle as a synonym for write. It can’t be easily misused (e.g., even people who know StringIO well enough to suggest using it as a string builder mistakenly think `sb = StringIO('abc'); sb.write('def')` will append rather than overwrite). It can be documented to be roughly as fast as str.join but without retaining all of the input strings, on all implementations, which isn’t true for StringIO (which saves no space on some implementations, like CPython, and saves space at the cost of wasting a lot of time on others). It can have a better API—take an initial string on construction, give a useful repr for debugging, etc.
Once there are pure-Python and C implementations, backporting them should be trivial—and, because it’s a new class rather than a change to an existing one, using the backport will be trivial too:

    try:
        from string import StringBuilder
    except ImportError:
        from stringbuilder import StringBuilder  # pip install this

    sb = StringBuilder()
    for thing in things:
        sb += make_a_string(thing)
    s = str(sb)

And of course people who only need to support 3.10+ could just import directly without the backport.

This would still violate TOOWTDI. And that koan really isn’t just a joke; it’s a serious guideline, just not an absolute one. Adding a second way to do something faces an extra hurdle that has to be overcome, beyond the usual hurdle of conservativeness, but plenty of proposals have overcome that hurdle—str.format, for example, had obvious huge costs from two ways to do something so basic, but its benefits (especially in extensibility) were even huger. Here, the benefits aren’t as large, but the cost isn’t either. So it’s at least arguable that it’s worth doing—not because we should ignore TOOWTDI, but because the benefits in discoverability, readability, and maybe space efficiency outweigh the cost of two ways to do it. If someone wants to propose a string builder that meets that burden (and write the Python and C implementations), I’d be +0.5. But StringIO.__iadd__ does not, which is why I’m -1 on it. (For a third-party library, I’m not sure I’d bother with a C implementation—it can just check if CPython and if so use a str as its buffer… but for a builtin member of a stdlib module, that’s probably not acceptable.)
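[The overwrite pitfall mentioned above is easy to demonstrate with the real io.StringIO API:]

```python
import io

# Pitfall: StringIO(initial) sets the stream position to 0, so a
# subsequent write() OVERWRITES the initial value rather than appending.
s = io.StringIO("abc")
s.write("XY")
assert s.getvalue() == "XYc"   # not "abcXY"

# To append to a pre-filled StringIO, seek to the end first
# (or start empty and only ever write):
s2 = io.StringIO("abc")
s2.seek(0, io.SEEK_END)
s2.write("def")
assert s2.getvalue() == "abcdef"
```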

Hello, On Wed, 1 Apr 2020 21:25:46 -0400 Kyle Stanley <aeros167@gmail.com> wrote:
Well, but those are "done" changes which were backed by official PEPs (except for unary +, which hopefully was there forever). While I kinda tried to flex my arms at what it would take to write a PEP-like text, it's certainly nowhere near there after me spending a couple of hours on it, and collecting more evidence would take more time.
existing code either in the stdlib or "out in the wild", showing
I would hardly target the CPython stdlib at this stage, given the feedback that "".join() is the fastest, and the CPython implementation clearly optimizes for "speed where it can be gotten with whatever we have on our hands (which isn't much due to lack of JIT), even if those are tricks". I might be able to show "out in the wild" code (which happens to be stdlib for another Python implementation); it just needs to be properly refactored from the .write() approach I partially succumbed to earlier. []
I do hope that you and other readers do trust me that I picked up that "haiku" somewhere and not made it up here on the spot. I otherwise do spend a lot of time studying criticism of Python, and keep an eye on other languages too. Because I do see a clear pattern of people abandoning advanced Python projects (compilers, JITs, etc.), and moving to other languages. And I always have that back feeling that maybe I'm wasting my time either and should just jump into those goes, julias, rusts, haskells, etc. But so far I keep seeing Python as the best - not the best language, but the best-compromise language. []
Don't get me wrong - I love the l.append/"".join(l) pattern. To me, it looks like a twisted mirror of LISP's CONS function. But that was a language where CONS was the only way to be a container! And Python even lacks linked list/cons in the first place. Bottom line: I see myself using l.append/"".join(l) about as frequent as I use cons (which is rare). []
But, I'm against the idea of adding this to the existing StringIO class,
That's quite expected feedback; I foresaw it and mentioned it in the "Further Ideas (aka Scope Creep)" section of the original RFC. For a compiled language, that would be a natural choice (you don't use it - you don't get it in your binary), but interpreted languages have the, surprising for some, implication that adding more stuff burdens everyone. Where I come from (implementing a language - a small subset of Python), adding more and more stuff is definitely an anti-pattern. So my interest lies in finding ways of extending already available functionality in a *natural way* (subject to debate) to cover more interesting use cases. To not raise any worry, let me give an example of what I consider a "natural" and an "unnatural" way. In a language which already has an OrderedDict type, I would never-ever "extend" the dict type, corresponding to the Computer Science type of an unordered hashtable, to be ordered as well (as that's already handled by OrderedDict). []
I would still find that too crude an approach. If it would come to that, I would prefer to actually study internal implementation(s) in detail, and patch up sys.getsizeof() to provide actual information. As you may imagine, that's time consuming, and would be "too early" (if it all), given that the discussion oscillates between vertexes of a triangle of: 1. "Not needed" ("".join() to rule them all). 2. "+= isn't suitable for StringIO". 3. "We can do much more" (mutable string/+= for all streams/separate class). [] -- Best regards, Paul mailto:pmiscml@gmail.com

On Fri, Apr 3, 2020 at 8:26 AM Paul Sokolovsky <pmiscml@gmail.com> wrote:
But it's actually accurate. With getsizeof, you're trying to gauge how much memory something consumes, but it can't acknowledge certain types of memory savings, nor can it recognize certain types of memory consumption. By asking your operating system how much memory you're using, you're guaranteed to see the actual figure. OTOH, this only works for fairly large allocations, but then again, if the memory cost doesn't actually impact the RSS, is it really a cost? ChrisA

Hello, On Fri, 3 Apr 2020 08:44:23 +1100 Chris Angelico <rosuav@gmail.com> wrote:
But not exactly. Let me humbly explain what really is a cost. It's looking at PyObject_HEAD https://swenson.github.io/python-xr/Include/object.h.html#line-78 (damn, that's Python2 source, stupid google), and seeing that it's at least:

    Py_ssize_t ob_refcnt; \
    struct _typeobject *ob_type;

That's 2 word-sized fields, 16 bytes on a 64-bit machine. You can dig further and further, and understand how much memory it takes to store so-and-so kind of structure (and how it could be done differently). Now a couple of words about RSS. That "R" is there for a reason, and you should wonder what happens if it's not "R". Modern OSes are very modern and nobody knows what they do with virtual memory, or at least they can't fix bugs - for decades - when something should be "R" but is actually "V": https://bugzilla.kernel.org/show_bug.cgi?id=12309 (damn, now self-isolated from spam). I hope the idea is clear: RSS is largely outside of your control, but the bytes you allocate in your source are (or should be). [] -- Best regards, Paul mailto:pmiscml@gmail.com
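A quick sketch of the per-object overhead being discussed, via sys.getsizeof(). The exact byte counts vary by CPython version and build, so the code only prints them rather than asserting specific numbers:

```python
import sys

# Every CPython object starts with the PyObject header (refcount +
# type pointer): two pointer-sized fields, i.e. 16 bytes on a 64-bit
# build, before any payload at all.
header_floor = sys.getsizeof(object())   # a bare object: essentially just the header
empty = sys.getsizeof("")                # str adds its own bookkeeping on top
one = sys.getsizeof("a")                 # ...plus roughly one byte of payload
print(header_floor, empty, one)
```

This is the "bytes you allocate in your source" view; the RSS view discussed below measures something quite different.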

On Fri, Apr 3, 2020 at 9:20 AM Paul Sokolovsky <pmiscml@gmail.com> wrote:
That's fair, but the PyObject* header isn't the only cost. The actual data for a Python string isn't stored in the structure. How do you know how much memory is being consumed by that? Are you 100% certain that sys.getsizeof() is measuring that? It appears from the source code that it *probably* is (str.__sizeof__ is defined in unicodeobject.c), but it counts, for instance, the length of the UTF-8 representation (if present) plus one for null termination, and that's quite possibly not the actual allocated size, due to overhead (and possible alignment) in PyObject_REALLOC. So you have to either try to delve into the source and find every single byte of overhead or wastage.... or you just allocate a huge bunch of strings and then ask your OS how much space you're consuming. Yes, the OS is going to have very coarse granularity, but when you're trying to figure out the RAM requirements of large-string concatenations, you're looking for a large difference anyway.
Technically yes, it's under your control. In practice, I'm not so sure. ChrisA

Hello, On Fri, 3 Apr 2020 09:34:44 +1100 Chris Angelico <rosuav@gmail.com> wrote:
As I mentioned in another reply, I would look into (temporarily) patching sys.getsizeof() to return the true allocation size. As we speak of it, there's another issue. Consider an object consisting of two allocations, say a 16-byte header containing a pointer to another allocation of a mere 4 bytes. It would seem that the object takes 20 bytes, but the fact is that there's "allocation granularity". If it's 16 bytes, then the true memory cost is 32 bytes. In a blissful distant future I would add a flag to sys.getsizeof() requesting either the "internal" object size (as (apparently) returned now) or the "external" one as described. Anyway, this gets offtopic (or is premature discussion). ;-) [] -- Best regards, Paul mailto:pmiscml@gmail.com

On 02/04/2020 22:24, Paul Sokolovsky wrote:
I think you may misunderstand. Before they were "done" changes, the changes you listed (aside from unary +) were justified by reference to concrete, real-world examples that would be improved by them. That's been lacking here, to the extent that it sounds like what you're really saying is "I want to program $LANGUAGE in Python." Since that's almost always a mistake, you've actually driven me from +0 to -0.5 on the subject! -- Rhodri James *-* Kynesim Ltd

On Apr 1, 2020, at 14:47, Paul Sokolovsky <pmiscml@gmail.com> wrote:
Then aren’t you going to be disappointed when you’re told you can now use StringIO instead in 3.10, and you start using it everywhere as soon as you can drop 3.9 support, and then you find that in many Python implementations, including CPython, it takes a little more memory than building a list and joining it, not less? The fact that it could theoretically take less memory, and might even do so on some other implementation you aren’t using, is probably not going to be much consolation. Or maybe it is, and you switch to one of those other implementations, because that’s the final straw for CPython for you—and then you discover that, unlike CPython, their StringIO does save memory over str.join, but also unlike CPython it’s also a lot slower than str.join, mainly because its StringIO is encoding every string to UTF-8 when you write and then decoding back at the end. At which point you’re probably ready to give up on Python altogether.

Hello, On Wed, 1 Apr 2020 20:05:33 -0700 Andrew Barnert <abarnert@yahoo.com> wrote:
I accept(ed) your argument. If my proposal were to come forward, I would perform a (more) detailed analysis of the internal implementations of StringIO in CPython and other implementation(s). In the meantime, to clarify, the memory usage figures (potential 8x space usage of list vs StringIO) do hold for MicroPython/Pycopy. I don't jump into analyzing the CPython impl because IMHO the biggest current blocker is the claim that "StringIO wasn't intended to be used like that". So, unless resistance to that is softened, deep memory-digging is premature. So, I'd rather keep discussing that point, for as long as there's a desire to keep this discussion.
Thanks for pinpointing another implied matter. Yes, I kinda assume that io.StringIO is actually an optimized implementation, not just TextIOWrapper(BytesIO). We can discuss this point further if useful.
At which point you’re probably ready to give up on Python altogether.
Given that nobody wants to push people away from using "".join() (only invite people to an alternative with cookies^W better interface, and gradually not-worse performance re: speed/memory), I don't think that's a plausible scenario for breaking people's faith in Python. There're certainly bigger risks ;-). -- Best regards, Paul mailto:pmiscml@gmail.com

I'm assuming here that the goal is to make string building easier, better, and/or more discoverable, and that the io.StringIO discussion is just one way to achieve this. For example, I don't think (but maybe I'm wrong) that "must be a file-like object" is a goal here. If that's not the goal, then we should decide on the goal first. On 4/1/2020 5:43 PM, Paul Sokolovsky wrote:
I think some of those are bad examples, and not in the same league as this change. For example, the walrus operator is not just another way to do assignment (as it existed pre 3.9). But be that as it may, it seems to me that if you want to make string building easier, better, and/or more discoverable, then add a StringBuilder class. It's definitely going to be way more discoverable than "use io.StringIO" or "look up the FAQ and use ''.join()". And it could conceivably be more performant than either of those, since it would have fewer constraints. You can start with:

class StringBuilder:
    def __init__(self):
        self.strio = io.StringIO()
    def __iadd__(self, s):
        self.strio.write(s)
        return self
    def __str__(self):
        return self.strio.getvalue()

And you can later write it in C, make it use ''.join() instead of io.StringIO, or whatever else makes sense. I don't think changing the io.StringIO interface is a good decision: it's supposed to look file-like. I suppose if we could wave a magic wand and have all existing file-like objects support __iadd__ I might feel differently, but we can't. Whether such a StringBuilder is suitable for the stdlib, and if so where it goes, are questions to be answered if this approach works out. Eric

On Tue, Mar 31, 2020 at 12:21 PM Paul Sokolovsky <pmiscml@gmail.com> wrote:
I said "there are some use cases for a mutable string type" I did not say that's what was asked for in this thread. So why did I say that? because:
A well-known pattern of a string builder, yes.
As I read this suggestion, it started with something like: * Lots of people use a "pattern of string building", using str += another_string to build up strings. * That is not an efficient pattern, and is considered an anti-pattern, even in CPython, where it has been cleverly optimized. I think everyone on this thread would agree with the above. * The "official recommended solution" is another pattern: build up a list, and then join it. You are suggesting that it would be nice if there were an efficient implementation of string building that followed the original anti-pattern's syntax. After all, if folks want to make a string, then using familiar string syntax would be nice and natural. You've pointed out that StringIO already provides an efficient implementation of string building (which could be made even more efficient, if one wanted to write that code). And that if it grew an __iadd__ method, it would then match the pattern that you want it to match, and allow folks to improve their code with less change than going to the list.append-then-join method. All good. But what struck me is that in the end, this is perhaps more friendly than the list-based method, but it's still a real shift in paradigm: I think people use str += str not because they are thinking "I need a string builder", but because they are thinking: I need a "string". 
That is your choice of variable names:

buf = ""
for i in range(50000):
    buf += "foo"
print(buf)

is not what most folks would use, because they aren't thinking "I need a buffer in which to put a bunch of strings", they are thinking "I need to make this big string", so would more likely write:

message = "The start of the message"
for i in something:
    message += "some more message"
do_something_with_the_message(message)

which, yes, is almost exactly the same as your example, but with a different intent -- I start with a string and make it bigger, not "I make a buffer in which to build a string, then put things in it, then get the resulting string out of the buffer". I teach a lot of beginners, so yes, I do see this code pattern a fair bit. The difference in intent means that folks are not likely to go looking for a "buffer" or "string builder" anyway. So that suggested to me that a mutable string type would completely satisfy your use case, but be more natural to folks used to strings:

message = MutableString("The start of the message")
for i in something:
    message += "some more message"
do_something_with_the_message(message)

And you could do other nifty things with it, like all the string methods, without a lot of wasteful reallocating, particularly for methods that don't change the length of the string. (Though Unicode does make this a challenge!) (And yes, I know that the "wasteful reallocating" is probably hardly ever, if ever, a bottleneck.) In short: a mutable string would satisfy the requirements of a "string builder", and more. Anyway, as I said in my previous message, the fact that a mutable string hasn't gained any traction tells us something: it really isn't that important. And I mentioned a similar effort I made to make a growable numpy array, and, well, it turned out not to be worth it either. 
However if we're all wrong, and there would be demand for such a "string builder", then why not write one (it could be a wrapper around StringIO if you want), put it on PyPI, or even just in your own lib, and see how it catches on. Have you done that for your own code and found you like it enough to really want to push this forward? BTW: I timed += vs StringIO vs list + join, and found (like you did) that they are all about the same speed under CPython 3.7. But I had a thought -- might string interning be affecting the performance? Particularly for the list method:

In [43]: def list_join():
    ...:     buf = []
    ...:     for i in range(10000):
    ...:         buf.append("foo")
    ...:     return "".join(buf)

Note that that is only making one string "foo", and reusing it for all items in the list. In the common case, you wouldn't get that help. OK, tested it: no, it doesn't really make a difference. If you replace "foo" (which gets interned) with "foo "[:3] (which doesn't), they all take longer, but still all about the same. -CHB -- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
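A self-contained version of the three-way comparison described above, using the non-interned "foo "[:3] trick so that each approach allocates its pieces honestly. Timings will of course vary by machine and Python version:

```python
import io
import timeit

N = 10_000
# A sliced string so CPython does not intern/reuse a single object.
PIECE = "foo "[:3]

def concat():
    buf = ""
    for _ in range(N):
        buf += PIECE          # the "anti-pattern", optimized in CPython3
    return buf

def strio():
    sb = io.StringIO()
    for _ in range(N):
        sb.write(PIECE)       # stream-style building
    return sb.getvalue()

def list_join():
    parts = []
    for _ in range(N):
        parts.append(PIECE)   # the FAQ-recommended pattern
    return "".join(parts)

for fn in (concat, strio, list_join):
    print(fn.__name__, timeit.timeit(fn, number=100))
```

On recent CPython3 all three typically land in the same ballpark, matching the figures quoted earlier in the thread.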

Hello, On Tue, 31 Mar 2020 17:01:19 -0700 Christopher Barker <pythonchb@gmail.com> wrote: []
Thanks for the detailed explanation. For me, a canonical example of a feature of a "mutable string" would be:

s = MutStr("foo")
s[0] = "b"

This parallels the difference between an immutable byte string (bytes) and a mutable byte string (bytearray). As you mention, it would go further than that, e.g.:

def foo(s):
    s.replace_inplace("foo", "bar")

mys = MutStr("foofoo")
foo(mys)  # expected: barbar
print(mys)

But with all that, I don't see why such a "mutable string" would be more suitable for the "string builder" pattern.
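For concreteness, here is a minimal sketch of the hypothetical MutStr discussed above. The class name and the replace_inplace() method are illustrative inventions from this thread, not an existing API; a list of characters stands in for whatever internal representation a real implementation would use:

```python
# Hypothetical mutable string, matching the semantics sketched above.
class MutStr:
    def __init__(self, s=""):
        self._chars = list(s)          # naive internal representation

    def __setitem__(self, i, ch):
        self._chars[i] = ch            # s[0] = "b" mutates in place

    def replace_inplace(self, old, new):
        # In-place counterpart of str.replace(): callers see the change.
        self._chars = list("".join(self._chars).replace(old, new))

    def __str__(self):
        return "".join(self._chars)

s = MutStr("foo")
s[0] = "b"
print(s)            # boo

mys = MutStr("foofoo")
mys.replace_inplace("foo", "bar")
print(mys)          # barbar
```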
And I mentioned a similar effort I made to make a growable numpy array, and, well, it turned out not to be worth it either.
And I'm not surprised at all. That's because "mutability" and "dynamic size change" are actually orthogonal features; one doesn't imply the other. That's why I said that I don't see how a mutable string would be more suitable for my case of a string builder. I'm not familiar enough with Numpy to comment with a high degree of confidence, but I may imagine that one of the ideas is to keep its internal representation simple (ahem, given that there're already multiple dimensions and stuff). And that pays off - while the overall JIT story for Python leaves much to be desired, there's a whole bunch of "numeric accelerators" which burn thru those numpy arrays with a simple and regular internal structure and keep Python competitive for "scientific computing". (My favorite is https://github.com/sdiehl/numpile , (a sample of) a number-crunching JIT with type inference in 1000 lines. The guy who wrote it went missing in Haskell. And I wonder how many people left Python for reasons similar to: anything they say, the answer is "in Python, you have to concat strings by putting them in an array". Oh btw, in Haskell, that's probably very true ;-) ). -- Best regards, Paul mailto:pmiscml@gmail.com

On Wed, Apr 1, 2020 at 1:31 PM Paul Sokolovsky <pmiscml@gmail.com> wrote:
It would in one small way, which is that it would be usable directly in many (not all, by any means) contexts where strings are used, so there would be less need for the .getvalue() call. I'm thinking that it would be more natural for most folks, who are, after all, familiar with strings, but not so much StringIO.
Sure: numpy arrays are mutable, and they are not re-sizable.
That's why I said that I don't see how a mutable string would be more suitable for my case of a string builder.
nope -- but just as suitable.
I may imagine that one of the ideas is to keep its internal representation simple
yes and no: yes: numpy arrays were designed explicitly to be a wrapper around a regular old C array (pointer :-) ) no: there are also important reasons (views, performance) to not change the actual memory location of the data in the same array: that is, not resize on the fly like python lists or C++ vectors do.
"numeric accelerators" which burn thru those numpy arrays with a simple
and regular internal structure and keep Python competitive for "scientific computing".
and, well, numpy itself :-)
(My favorite is https://github.com/sdiehl/numpile ,
I hadn't seen that -- pretty cool. But also really a toy version of numba, which I'm pretty sure was started first. The funny thing is, in this thread, while I don't really see the need for adding += to StringIO to make a string builder, I kind of like the idea of adding += to the File protocol -- for all file-like objects. I like the compactness of:

with open(filename, 'w') as outfile:
    a_loop_of_somesort:
        outfile += something

And hey, then you'd get your string builder! One other note: if one really wanted a string builder class, a few other things would be nice, like maybe __str__ producing the built-up string. - CHB -- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
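Without changing any stdlib class, the "+= for all file-like objects" idea can be prototyped as a thin adapter today. The WritePlus name is an invention for this sketch; it just grafts __iadd__ onto whatever writable stream you hand it:

```python
import io

# Hypothetical adapter: add "+=" support to any writable file-like
# object by delegating __iadd__ to .write() and everything else to
# the wrapped stream.
class WritePlus:
    def __init__(self, stream):
        self._stream = stream

    def __iadd__(self, data):
        self._stream.write(data)
        return self                  # keep the same wrapper bound

    def __getattr__(self, name):
        return getattr(self._stream, name)   # delegate the rest

out = WritePlus(io.StringIO())
for word in ("spam", "eggs"):
    out += word
print(out.getvalue())   # spameggs
```

The same wrapper would work around an open file object, giving the `outfile += something` compactness shown above.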

On Wed, Apr 01, 2020 at 07:00:00PM -0700, Christopher Barker wrote:
I don't think that this is likely to be the case. I think it was Andrew (apologies for misattributing this if it was someone else) who pointed out that the CPython internals make a lot of assumptions about strings that would make it very difficult to retrofit a mutable string class to be usable in its place. -- Steven

Christopher Barker writes:
Yah, but you also get

outfile.seek(some_random_place)
outfile += something

for free. This seems like what mathematicians call "abuse of notation." While normally the rule is "consenting adults", I can see folks with severe space constraints "reusing the buffer" with something like

outfile.seek(0)
a_loop_of_somesort:
    outfile += something
size = outfile.tell()
outfile.seek(0)
built_string = outfile.read(size)

I'm not a fan. Steve

On Thu, Apr 2, 2020 at 6:07 AM Stephen J. Turnbull < turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
outfile.seek(some_random_place) outfile += something
Is that not exactly the same as what you can do with .write() now?
I can see
is this any different than:

outfile.seek(0)
a_loop_of_somesort:
    outfile.write(something)
size = outfile.tell()
outfile.seek(0)
built_string = outfile.read(size)

I can't see how the += notation makes that any more likely to happen. But anyway, it's an argument against the idea of using StringIO as a "buffer" for strings. -CHB -- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython

Christopher Barker writes:
On Thu, Apr 2, 2020 at 6:07 AM Stephen J. Turnbull < turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Yes. But it flies in the face of the normal semantics of str '+=' (concatenate, not overwrite) while it's exactly how seekable writable streams have worked since the '70s. I just don't ever want to see that, but I know I will. YMMV, I think it's too high a price to pay for syntactic sugar. I wouldn't have the same objection to a standalone StringBuilder class (I have other objections to that, see Paul Moore's post on multiple TOOWTDI).
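The semantic clash being objected to is easy to demonstrate with plain .write(): after a seek, a stream write overwrites in place, whereas string += always concatenates. A stream-backed += would inherit the former behavior:

```python
import io

# Stream write semantics: writing after a seek *overwrites* rather
# than concatenating -- exactly what str "+=" never does.
buf = io.StringIO()
buf.write("Hello")
buf.seek(0)
buf.write("J")          # clobbers the 'H', does not prepend
print(buf.getvalue())   # Jello
```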

Hello, On Wed, 1 Apr 2020 19:00:00 -0700 Christopher Barker <pythonchb@gmail.com> wrote: []
That's again similar to the feedback received in the 2006 thread: https://mail.python.org/pipermail/python-list/2006-January/396021.html , and just as the original author there, my response is "that's not what I'm talking about". So, why do I think it's a good idea to add "+=" to BytesIO/StringIO, but not to other streams (like file objects) - are BytesIO/StringIO somehow special? My answer is yes, they are special. I implied that idea in the original RFC, but didn't want to go into detail with it for fear of scaring reviewers with a far-fetched idea. I don't think there's anything left to lose now, so let me sketch it here (maybe next time, someone will quote not just 2006's thread, but 2020's too). So, the stream and buffer protocols are very important, powerful, and deep notions in Python. There's however the question that sometimes it's useful to "shape-shift" from one to the other. To have an object which is a cross between a buffer and a stream, or more formally, an "adapter" from one to the other. It's my idea that BytesIO/StringIO is the closest thing Python has to this buffer/stream "cross-object", and actually, it already does enough to be *the* cross-object. (Well, BytesIO is a *buffer*/stream cross, with StringIO being a "natural" extension of that idea, where the level of "naturalness" is subject to implementation concerns, as the discussion with Andrew Barnert showed.) So, just think about it - BytesIO allows constructing data using the stream API, and then getting that data as a buffer (using an extension method outside the stream API). Sure, BytesIO can also do other things - you don't have to use .getvalue(). But the fact that BytesIO can do different things is exactly the motivation for proposing to add another operator, +=, which isn't going to change the status quo of "does different things" that much. And the operator += being added isn't random - it's the buffer's append method, added to BytesIO to make it more of a "cross" between buffer and stream. 
The last step is just a "natural" extension of the construction above to StringIO (with some implementation-level concerns, which definitely can be addressed as long as the interface part is clear). And surely, this path of making a better buffer/stream adapter can be followed further, but not en masse "because we can" - on a case-by-case basis. (I for one stayed clear of that within the bounds of this RFC, but noted that adding += might definitely be a precedent to drive possible future discussions in that direction.) [] -- Best regards, Paul mailto:pmiscml@gmail.com
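The buffer/stream "cross-object" claim is already partly visible in the stdlib: BytesIO is filled through the stream API, and then exposes its contents through the buffer protocol via getbuffer() (a writable memoryview, no copy), not just via getvalue(). A small illustration:

```python
import io

# BytesIO straddles the stream and buffer worlds: build the data with
# stream writes, then touch it through the buffer protocol.
bio = io.BytesIO()
bio.write(b"abc")
bio.write(b"def")

view = bio.getbuffer()      # memoryview into the internal buffer
print(bytes(view))          # b'abcdef'
view[0] = ord("A")          # buffer-style in-place mutation
view.release()              # release before resuming stream use
print(bio.getvalue())       # b'Abcdef'
```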

Sorry, this has been sitting in my drafts for a while, and maybe this conversation is over. But since I wrote it, I figured I might as well post it. On Fri, Apr 3, 2020 at 4:24 AM Paul Sokolovsky <pmiscml@gmail.com> wrote:
I know. this is python-ideas -- we're not ONLY talking about your proposal :-)
<snip>
So, stream and buffer protocols are very important, powerful, and deep notion in Python.
streams, yes (though even though the docs and CS-educated folks use the word Stream, in the broader community we usually think more in terms of "file-like object" -- at least those of us that have been around for a long time). As for "buffer", if you search the docs, you see that the word is used to describe the Buffer Protocol, which is a whole different concept. It also shows up in various other places, describing internal behavior (like readline, or pickle, or ...). In the context of streams, it's used to describe the difference between BinaryIO and RawIO, but again, mostly as an implementation detail for streams. All that is a long way of saying that most folks are not thinking in terms of buffers, which is why most folks aren't going to think "I need to build up a big string from little parts -- I wonder if the io module has what I'm looking for?" -- nor search for "stream" or "buffer" to find what they want. It's my idea that BytesIO/StringIO is the closest what Python has to
this buffer/stream "cross-object", and actually, it already does enough to be *the* cross-object. []
sure -- I'll agree with that. So, just think about it - BytesIO allows to construct data using stream
In fact, the entire reason it exists is to be a file-like object (i.e. the stream API). But the fact that BytesIO can do different things is exactly the
well, no. It's Sequence's extend() method
added to BytesIO to make it more of a "cross" between buffer and stream.
The thing is: Python is all about "duck typing" or "protocols" or "ABCs", whatever you want to call them. And there is not, in fact, a standard "buffer" (as you are using the term here) protocol to follow: there are Sequences, and strings, and there are streams. And StringIO is already an implementation of the stream protocol (that's its whole point). So IIUC your idea here, you think it would be good to have an efficient way of building strings that follows the string protocol; actually, + and += in this context is really the sequence protocol:

In [11]: lst = [1,2,3]
In [12]: lst += [4,5,6]
In [13]: lst
Out[13]: [1, 2, 3, 4, 5, 6]

In [14]: tup = (1,2,3)
In [15]: tup += (4,5,6)
In [16]: tup
Out[16]: (1, 2, 3, 4, 5, 6)

In [17]: strin = "123"
In [18]: strin += "456"
In [19]: strin
Out[19]: '123456'

Which is why I suggested that the way to get what you want would be a mutable string, rather than a single, out-of-place addition to StringIO. And a StringBuilder class would be another way. Either way, I think you'd want it to behave as much like a string as it could, rather than like a stream with one added feature. However: as it happens, strings are unique in Python: I *think* they are the only built-in "ABC" with only one implementation (that is, the only type with no ABC :-) ): they are not duck-typed at all (disregarding the days of Python2 with str and unicode). And as far as I know, they are not in any commonly used third-party library either. This is not the case even with numbers: we have integers and floats, and other numbers in third-party libs, such as numpy (and the __index__ dunder was added to better support those). So there is a lot of code, at least in CPython, that is expecting a str, and exactly a str, in various places, right down to the binary representation. And CPython implementation aside, thanks to strings being a sequence of themselves, they are often type-checked in user code as well (to distinguish between a string and, e.g. 
a list of strings). I know in my code, checking for strings is the ONLY type checking I ever do. So that means a StringBuilder may be the way to go: focused use case, and no other type could really be used in place of strings very much anyway. This all made me think about *why* I type-check strings, and virtually nothing else. And it's because in most other places, if I want a given type, I can just try to make it from the input:

x = float(x)
i = int(i)
arr = np.asarray(arr)

but:

st = str(st)

doesn't work, because ANY object can be made into a string. Makes me wonder if we could have an "asstr()" that acted like numpy's asarray: if it's an array already, it just passes it through; if it's not, then it tries to build an array out of it. Of course, there are currently essentially no objects that duck-type strings, so, well, no use case. SideNote: I just noticed that Python2 actually HAD a MutableString: https://docs.python.org/2.0/lib/module-UserString.html which was clearly (from the docs) not actually designed to be used :-) -CHB -- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
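The asstr() idea can be sketched in a few lines. Both the function name and the __asstr__ opt-in hook are inventions for illustration; nothing like them exists in the stdlib:

```python
# Hypothetical asstr(), modeled on numpy.asarray(): pass a real str
# through untouched, reject everything else unless it explicitly
# opts in via a (made-up) __asstr__ hook.
def asstr(obj):
    if isinstance(obj, str):
        return obj                      # pass-through, like asarray
    hook = getattr(obj, "__asstr__", None)
    if hook is not None:
        return hook()                   # opt-in duck typing
    raise TypeError(f"cannot treat {type(obj).__name__} as str")

print(asstr("abc"))       # abc
try:
    asstr(123)
except TypeError as e:
    print(e)
```

Unlike plain str(), this validates rather than stringifies, which is exactly the property str() lacks.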

On Tue, Apr 7, 2020 at 2:42 AM Christopher Barker <pythonchb@gmail.com> wrote:
Not sure I understand your point here. Calling float() or int() will give you a float or int for many values that aren't floats or ints, just as calling str() will give you a string. The only real difference (as far as I can tell) is that, as you say, any object can be stringified. But all of those operations are potentially lossy - int(1.5) and float(1e100000) both throw away information - and they don't prove that something already was a string/int/float, just that it _now_ is. And to that end, str is no worse than the others. ChrisA

On Mon, Apr 6, 2020 at 9:49 AM Chris Angelico <rosuav@gmail.com> wrote:
yes, but the operation will only work if the value CAN be turned into a float or int (or numpy array, for asarray), and you will get a ValueError for most arbitrary objects. It doesn't have to be lossless to be useful -- you are most often going to lose SOMETHING when you do a type cast. This was particularly useful in Py2 without future division:

def fun(x):
    x = float(x)
    ....

meant that you were assured that x was indeed a float now, and whatever type it was before, it was a type (and value) that could be cast to a float, even if the cast were lossy. And now, I still use that when receiving a string (or something :-) ) that should be convertible to a float; if someone passes in some other type, or a string that can't be interpreted as a float, I get a TypeError or ValueError. But with str(), virtually any type (absolutely any?) will lead to a valid string object, regardless of its type or value. In practice, the only example I can think of is maybe Path objects, which you may want to accept in place of a str in a function that really does need a str. But we have the __fspath__ protocol now, so that's been obsoleted. And as pointed out in this thread, there is a lot of code that requires an actual str object, not just something duck-typed like one. This is a lot like numpy, and why asarray is used a lot. But no, asstring() would not be useful -- mostly because there are essentially no other types out there that are "string-like objects". -CHB -- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
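The asymmetry being described is easy to demonstrate: float() validates its input and raises for most arbitrary objects, while str() accepts anything, so `x = str(x)` asserts nothing about x:

```python
# float() as a validating cast vs str() as a universal one.
def can_float(value):
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False

print(can_float("1.5"))      # True  -- a castable string
print(can_float("spam"))     # False -- ValueError
print(can_float(object()))   # False -- TypeError

print(str(object())[:7])     # str() succeeds even for a bare object
```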

Hello, On Mon, 6 Apr 2020 09:41:46 -0700 Christopher Barker <pythonchb@gmail.com> wrote:
I appreciate it, and it's actually my desired intention to not leave this thread abruptly cut (like 2006's thread), but share further "news", discussion, and argumentation. In this regard, I got enough useful suggestions, but realizing them definitely takes time and effort, hence delays with replies. []
I would humbly disagree. "Stream" is used in broader community, and "file-like object" - in narrow Python community.
I see, it's a whole different concept for you. But as I mentioned, they're the same concept for me - both stream and buffer *are* protocols. And that's based on my desire to define Python as a generic programming language, based on a few consistent and powerful concepts. []
All that is a long way of saying that most folks at not thinking in terms of buffers
That's exactly what I seek to change - for people to stop thinking in terms of ad-hoc notions and patterns, and start thinking in terms of generic concepts (which are still efficient implementation-wise).
I appreciate that. We'll never agree on other things, and that's the motivation for my replies - to pinpoint the roots of the disagreement, to see where our thinking diverges. (Of course, I do that to find people with similar, not dissimilar, ideas.)
As I said, no problem - we'll redefine the reason for its existence to be a cross-implementation of both the stream and buffer APIs.
Nice note about the subtyping relation. So, indeed, a "buffer" is a "sequence of bytes" (*). So, you'll keep thinking of it as a superclass method, and I as a subclass method, and we'll never agree, but fortunately, we're thinking about absolutely the same thing. (*) Or of other real-word types like int32, uint64, etc. []
As argued, it's very much an in-place addition.
And a StringBuilder class would be another way.
StringBuilder would be just a subset of the functionality of StringIO, and would hardly survive Occam's razor. (Mine is sharp enough to cut it off right away.)
And that's absolutely great! They thus allow for efficient implementation, be it interpreted, or compiled AOT or JIT. That's why you'd never hear a proposal for mutable strings from me. (And when I need mutable thingy, it's already there - mutable byte buffer aka bytearray. But that's not the type optimized for appending, BytesIO by definition is. So, that's what gets "sequence" append method.) [] -- Best regards, Paul mailto:pmiscml@gmail.com
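The bytearray-vs-BytesIO contrast mentioned above can be shown side by side: both are stdlib ways to build bytes incrementally without quadratic bytes-concatenation, one through the buffer-style append, one through the stream API:

```python
import io

# Mutable buffer (bytearray) vs in-memory stream (BytesIO), both
# appending the same pieces.
ba = bytearray()
bio = io.BytesIO()
for _ in range(1000):
    ba += b"ab"          # buffer-style append (sequence +=)
    bio.write(b"ab")     # stream-style append

print(bytes(ba) == bio.getvalue())   # True
print(len(ba))                       # 2000
```

Which one is faster for heavy appending is an implementation question (the point being argued here); the interfaces, not the speeds, are what differ by design.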

I think _this_ is actually the root of the disagreement. A StringBuilder that does one thing and does it well survives Occam’s razor in lots of other languages, like Java. Why? That one thing could be done by a mutable string object, or by a string stream object, so why not just pile it into one of those instead? Because piling it into one of those means you run into conflicting requirements, which force you to make hard tradeoffs, and possibly tradeoffs that are bad for other code, and possibly that break assumptions that existing other code has relied on for years. Python’s StringIO is readable as well as writable. (If I have a library that wants a file object, and I have the data in memory, I just wrap it in a StringIO and now I have that file object. People use it for this all the time.) It also has a current position pointer, and can seek back to previously marked locations. It has optional newlines conversion. It has all the behavior that a file object has to have, and code relies on that fact, and that forces design decisions on you that may not be optimal for a StringBuilder. It sounds like you already know the issues with mutable strings, so I won’t go over them here. A stand-alone StringBuilder doesn’t have to do those things; it just has to append characters or strings to the end, and be able to give you a string when you’re done. So it can be optimal and at the same time dead simple. It can be nothing more than a dynamically-expanding array (or realloc buffer) of UCS4 characters. Or, if you want to (usually) trade a bit of time for a lot of space savings, it can be a union of a dynamically-expanding array of UCS1/2/4 characters (that has to reallocate and copy the first time you append an out-of-range character), but that’s still a whole lot simpler in a StringBuilder than in something that has to meet the str and PyUnicode APIs, or the file object APIs. Or you could design something more complicated if that turns out to work better. 
If any of these makes it hard to implement persistent seek positions that work even after you’ve reallocated, wastes overflow space when you’re using it just to read from an immutable input, etc., that would be completely irrelevant, because, unlike StringIO, nobody can ask a StringBuilder to do any of those things, so your design doesn’t have to support them. Plus, looking beyond CPython, a new class can have whatever cross-implementation requirements we write into it. You can document that a StringBuilder doesn’t retain all of its input strings, but is at minimum roughly as efficient as making a list of strings and joining them anyway, and every Python implementation will do that (or just not implement the class at all, if they can’t, and document that fact, the reason why, and the recommended porting alternative very high up in a “differences from CPython” chapter), and any backport will too. You can’t document that about StringIO, because it would just be a lie for most existing implementations (including CPython 2.6-3.9, PyPy, etc.).
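A minimal version of the stand-alone StringBuilder described above could look like this (the class name and API are hypothetical, sketched for illustration, not an existing stdlib type):

```python
class StringBuilder:
    """Hypothetical string builder: append-only, no file API, no seeking."""

    def __init__(self):
        self._parts = []              # internally just a list of fragments

    def __iadd__(self, s):
        self._parts.append(s)         # append in amortized O(1)
        return self                   # += rebinds the name to self

    def build(self):
        return "".join(self._parts)   # one O(n) concatenation at the end

sb = StringBuilder()
for word in ("spam", "eggs"):
    sb += word
assert sb.build() == "spameggs"
```

Note that because it promises nothing but appending and building, the internals are free to be a list of fragments, a realloc'd UCS buffer, or anything else.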
Sure, buffers and streams are protocols, but they’re not the same protocol. A buffer is all about random access; a stream is not. And file is a protocol too. There are even ABCs for it. It’s also not the same protocol as the simpler thing you’re thinking of as stream, of course, but it’s certainly a protocol.

And Python already is a generic language in your sense; most code is written around protocols like file and buffer and iterable and mapping and even number. Pythonic code, whenever possible, doesn’t care if I feed it a shelve instead of a dict, or a np.array of float64 instead of a float, or a StringIO instead of a TextIOWrapper around a FileIO. And people rely on that fact all the time. And you usually don’t even have to do anything special to make that true for your libraries.

Your real problem seems to be just that you wish Python were designed around a simpler stream protocol instead of the big and messy file protocol. Maybe that would be better. File could be a subtype or wrapper, or maybe even a collection of them that could be composed as needed (you don’t always need seekability just because you need newline conversion, or vice versa). Java’s granular streams design is actually pretty handy at times (and I think it’s completely orthogonal to their horrible and verbose API around getting, building, and using streams). Then maybe OutputStringStream would just obviously be usable as a builder (which is almost, but not quite, true for C++). And there might be other benefits too. (We could also definitely have a cleaner API for things like socket.makefile, which today looks like a file but raises on many operations.)

But that’s not the language we have. And it still won’t be the language we have if you add an __iadd__ method to StringIO. Making StringIO not be a fully-featured and optimal-for-file-like-usage file object isn’t an option, because you can’t break all the code that depends on it.
The only way to get there from here would be to design a complete new stream system and get the vast majority of the Python ecosystem to switch over to using it. Which is a pretty huge ask. (And it still won’t let you just add __iadd__ to StringIO; it’ll only let you add __iadd__ to that new OutputStringStream.)

Hello, On Tue, 31 Mar 2020 17:01:19 -0700 Christopher Barker <pythonchb@gmail.com> wrote: []
Yes, that was the progression: I started by optimizing (my own) code naively written with the str += approach, and found that conversion to StringIO.write looks "ugly", so I never finished it. Thinking about how to make the optimization not affect code quality and clarity, I found that += on StringIO is just it.

The issue is that my target implementation is Pycopy (https://github.com/pfalcon/pycopy), which is a Python subset. I.e., the normal direction is *removing* CPython functionality, not adding to it. Given that it seemed both a very obvious addition to Python in general, and went against the normal direction for Pycopy, I decided to share this RFC. As I didn't see any points relevant to the Pycopy use case, I indeed decided to proceed. So, BytesIO.__iadd__()/StringIO.__iadd__() were added in Pycopy release 3.0.7 (https://github.com/pfalcon/pycopy/releases/tag/v3.0.7) and are documented in the docs: https://pycopy.readthedocs.io/en/latest/library/uio.html#uio.BytesIO.__iadd_...

I definitely care about both backwards and forward compatibility between Pycopy and CPython (and other Python implementations). The solution to both problems is very simple: all (well, most) of Pycopy modules are namespaced. So, __iadd__ gets added to uio.BytesIO and uio.StringIO. And there's a backport of the "uio" module to CPython: https://github.com/pfalcon/pycopy-lib/blob/master/cpython-uio/uio.py And yes, it's on PyPI: $ python3 -m pip install pycopy-cpython-uio ... $ python3 ...
-- Best regards, Paul mailto:pmiscml@gmail.com

On Mar 30, 2020, at 13:06, Paul Sokolovsky <pmiscml@gmail.com> wrote:
I appreciate you expressing it all concisely and clearly. Then let me respond here instead of to the very first '"".join() rules!' reply I got.
Ignoring replies doesn’t actually answer them.
This doesn’t tell you anything useful. As the help for getsizeof makes clear, “Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to”. So this gives you some fixed value like 152, no matter how big the buffer and other internal objects may be.

If you’re using CPython with the C accelerator, none of those things are available to you from the API, but a quick scan of the C source shows what’s there, and it’s generally actually more storage than the list version. Oversimplifying a bit: While you’re building, it keeps a _PyAccu structure, which is basically a wrapper around that same list of strings. When you call getvalue() it then builds a Py_UCS4* representation that’s in this case 4x the size of the final string (since your string is pure ASCII and will be stored in UCS1, not UCS4). And then there’s the final string.

So, if this memory issue makes join unacceptable, it makes your optimization even more unacceptable. And thinking about portable code makes it even worse. Your code might be run under CPython and take even more memory, or it might be run under a different Python implementation where StringIO is not accelerated (where it’s just a TextIOWrapper around a BytesIO) and therefore be a whole lot slower instead. So it has to be able to deal with both of those possibilities, not just one; code that uses the usual idiom, on the other hand, behaves pretty similarly on all implementations.
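The getsizeof caveat quoted above is easy to demonstrate on any container (exact byte counts vary by CPython version; the point is only the ratio):

```python
import sys

# ~1 MB of actual string data held by a 100-element list
big = ["x" * 10_000 for _ in range(100)]

shallow = sys.getsizeof(big)   # counts only the list's own pointer array
deep = shallow + sum(sys.getsizeof(s) for s in big)

# The shallow size is a tiny fraction of the real footprint
assert shallow < deep / 100
print(shallow, deep)
```

The same blindness applies to a StringIO object: getsizeof reports the object header, not the buffers it references internally.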
And making a wild guess about how things might be implemented and offering an optimization based on that guess that actually makes things worse and refusing to even reply when people point out the problems isn’t even more cavalier?
No, it really couldn’t. The semantics are wrong (unless you want, say, universal newline handling in your string builder?), it’s optimized for a different use case than string building, and both the pure-Python and CPython accelerator implementations are less efficient in speed and/or memory.
That's it, nothing else. What's inside StringIO class is up to you (dear various Python implementations, their maintainers, and contributors).
Sure, but what’s inside has to actually perform the job it was designed to do and is documented to do: to simulate a file object in memory. Which is not the same thing as being a string builder.
Python2’s StringIO module is for bytes, not Unicode strings. If you want a mutable bytes-like type, bytearray already exists; there’s no need to wrap the sequence up in a file-like API just to rewrap that in a sequence-like API again; just use the sequence directly. What StringIO is there for is when you _need_ the file API, just as in Python 3’s io.BytesIO. It’s not a more efficient bytearray or one better suited for string building; it’s less efficient and less well suited for string building but it adds different features.
The problem isn’t your start, it’s jumping to the assumption that StringIO must be an answer, and then not checking the docs and the code to see if there are problems, and then ignoring the problems when they’re pointed out. Why do you think a virtual file object must be the optimal way to implement a string builder in the first place?

On Mon, Mar 30, 2020 at 01:59:42PM -0700, Andrew Barnert via Python-ideas wrote: [...]
You seem to be talking about a transient spike in memory usage, as the UCS4 string is built then disposed of. Paul seems to be talking about holding on to large numbers of substrings for long periods of time, possibly minutes or hours or even days in the case of a long running process.

If StringIO.getvalue() builds an unnecessary UCS4 string, that's an obvious opportunity for optimization. Regardless of whether people use StringIO by calling the write() method or Paul's proposed `+=`, this optimization might still be useful. In any case, throw one emoji into your buffer, just one, and the whole point becomes moot. Whether you are using StringIO or list.append plus join, you still end up with a UCS4 string at the end.

I don't understand the CPython implementation very well, I barely know any C at all, but your argument seems a bit dubious to me. Regardless of the implementation, if you accumulate N code points, it takes a minimum of N times the width of a code point to store that buffer. With a StringIO buffer, there is at least the opportunity to keep them all in a single buffer with minimal overhead:

    buf --> [CCCC]  # four code points, each of 4 bytes in UCS4

With a list, you have significantly more overhead. For the sake of discussion, let's say you build it from four one-character strings.

    lst --> [PPPP]  # four pointers to str objects

Each pointer will take eight bytes on modern 64-bit systems, so that's already double the size of buf. Then there is the object overhead of the four strings, which is *particularly* acute for single ASCII chars: 50 bytes for a one byte ASCII char. So in the worst case, every char you add to your buffer takes 58 bytes in a list versus 4 for a StringIO that uses UCS4 internally. Whether StringIO takes advantage of that opportunity *right now* or not is, in a sense, irrelevant. It's an opportunity that lists don't have.
Any (potential) inefficiency in StringIO could be improved, but it's baked into the design of lists that it *must* keep each string as a separate object. Of course there are only 128 unique ASCII characters, and interning reduces some of that overhead. But even in the best case where you are appending large strings, there's always going to be more memory overhead in a list that a buffer has the opportunity to avoid.

And if some specific implementation happens to have a particularly inefficient StringIO, that's a matter of quality of implementation and something for the users of that specific interpreter to take up with its maintainers. It's not a reason for us to reject Paul's proposal.
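The per-character arithmetic above can be checked directly with sys.getsizeof (the exact numbers are CPython-specific, and interning means identical one-char strings actually share storage):

```python
import sys

# A one-character ASCII str object carries roughly 50 bytes of overhead
# in CPython (object header + compact-ASCII layout + the byte itself).
one_char = sys.getsizeof("a")

# A list adds one pointer per element: 8 bytes on 64-bit builds.
n = 1000
lst = ["a"] * n
per_element = (sys.getsizeof(lst) - sys.getsizeof([])) / n

print(one_char, per_element)
```

Because all thousand elements here are the same interned "a", the object overhead is paid only once; the worst case in the email assumes distinct string objects.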
So wait, let me see if I understand your argument:

1. CPython's string concatenation is absolutely fine, even though it is demonstrably slower on 11 out of the 12 interpreters that Paul tested.

2. The mere possibility of even a single hypothetical Python interpreter that has a slow and unoptimized StringIO buffer is enough to count against Paul's proposal.

Is that correct, or have I missed some nuance to your defence of string concatenation and rejection of Paul's proposal?
The "usual idiom" being discussed here is repeated string concatenation, which certainly does not behave similarly on all implementations. Unless, of course, you're referring to it performing *really poorly* on all implementations except CPython.
Ah, now *that* is a good point.
it’s optimized for a different use case than string building,
It is? That's odd. The whole purpose of StringIO is to build strings. What use-case do you believe it is optimized for? -- Steven

On Mon, Mar 30, 2020 at 10:00 PM Steven D'Aprano <steve@pearwood.info> wrote:
Let me tell you, since I was there. StringIO was created in order to fit code designed to write to a file, where all you want to do is capture its output and process it further, in the same process. (Or vice versa for the reading case, of course.) IOW its *primary* feature is that it is a duck type for a file, and that is what it's optimized for.

Also note that it only applies to use cases where the data does, indeed, fit in the process's memory somewhat easily -- else you should probably use a temporary file. If the filesystem were fast enough and temporary files were easier to use, we wouldn't have needed it.

-- 
--Guido van Rossum (python.org/~guido)
*Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>

On Mon, Mar 30, 2020 at 10:08:06PM -0700, Guido van Rossum wrote:
I misspoke: it is not the *whole* purpose. See below.
But it does that by *building a string*, does it not? That's what the getvalue() method is for. Perhaps we're talking past each other. I'm aware that the purpose of StringIO is to offer a file-like API with an in-memory object that doesn't require external storage on the file system. (Hence my retraction above about "whole purpose".) But it still has to do this by returning a string. (In the case of writing to a StringIO object, obviously.) -- Steven

Steven D'Aprano writes:
On Mon, Mar 30, 2020 at 10:08:06PM -0700, Guido van Rossum wrote:
But it does that by *building a string*, does it not?
Not all two-pass processes on external streams build strings internally. At least, earlier you insisted that StringIO is not a string.
That's what the getvalue() method is for.
True, but there's no guarantee a given process will ever invoke it. For example, I might read a file encoded as ISO-2022 into a StringIO, then read that StringIO normalizing it to another StringIO as NFD, then encode it to a file as UTF-8. Look Ma, no .getvalue! Steve

Hello, On Tue, 31 Mar 2020 16:28:22 +1100 Steven D'Aprano <steve@pearwood.info> wrote:
Steven, I appreciate you keeping up discussion from the same angle as I see it. (I read the thread in chronological order, and see that I could save on some replies by just +1'ing yours.)

For the last sentence above, I could offer an explanation "from the other side". Just imagine that we took io.StringIO and:

1. Limit its constructor to StringIO(). (No parameters, no universal newlines!)
2. Leave only the methods .write() and .getvalue(), and remove all other methods.

Now, the *whole* purpose of such a class would be string building; it literally can't do anything else. Suppose we now indeed split off such a class, and named it StringBuilder. We'd now have a bit of an issue explaining why all that was done. "StringBuilder implements a subset of the functionality of StringIO". D'oh.

Looking at the implementation side, this StringBuilder class would also have the machinery already present in StringIO, like the backing store for accumulated content, and the size/offset into this backing store. Yes, perhaps it could have code a bit more specialized for just writing and no reading, but is it worth the code duplication?

So, the core of the criticism seems to be that StringIO was designed for "another purpose" and that it "does much more". It's a bit strange argumentation that a more generic/featureful object can't (just can't) be used for a subset of its functionality, that you should always be concerned that it can do much more. It's logic akin to "lists are intended to store arbitrary objects, so if you store integers in them, you're doing something wrong".

Let's get back to our StringBuilder class as constructed above. Now that it doesn't have "IO" at the end, we perhaps can give up and add an __iadd__ method to it, having the same semantics as .write(), but returning self (thanks for the correction!). But if we can do that, then with the arguments above re: entity and code duplication, perhaps we can do the same on the original StringIO.
About the only thing that can preclude that is an irrational belief that, given that it was originally intended for a more general use case, we can't, just can't, nudge it to be more comfortable for a subset of its usage.

-- 
Best regards,
Paul  mailto:pmiscml@gmail.com
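The stripped-down class from this thought experiment can be prototyped today as a subclass, without any change to CPython (the name StringBuilder follows the message above; this user-level sketch is purely illustrative, not the proposal itself):

```python
import io

class StringBuilder(io.StringIO):
    """StringIO restricted (by convention) to building, plus the proposed +=."""

    def __iadd__(self, s):
        self.write(s)
        return self      # += must return the object to rebind the name

buf = StringBuilder()
for s in ("foo", "bar"):
    buf += s
assert buf.getvalue() == "foobar"
```

The proposal amounts to putting this two-line __iadd__ directly on io.StringIO (and io.BytesIO), so no subclass would be needed.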

On Mar 30, 2020, at 22:03, Steven D'Aprano <steve@pearwood.info> wrote:
But StringIO has the same long-term cost as the list, _plus_ a transient spike. There’s no way that can be better than just the same long-term cost. You can try to argue that it’s not that much worse, or that it isn’t worse in some cases, or that it could be optimized to not be as much worse; I’ll snip out all of those arguments because even if you’re right, it’s still not better. So this proposal amounts to changing Python, so that we can then get everyone to stop using the idiom they’ve been using for decades and use a different one, just to get maybe at best the same performance they already have. Why does that sound reasonable to you?
The reason StringIO keeps a list (well, a C struct that’s almost the same thing as a list) is because it’s fast. It’s not the simplest implementation; it’s something that people put a lot of work into optimizing. Is it possible that someone could come up with something that’s even better for the main uses of StringIO (simulating a file), and that also happens to be good for use as a string builder? Sure, I suppose it’s possible. But do you really think we should make a change just so we can encourage people to switch to using something that’s slower and takes more memory (and doesn’t work in older versions of Python) just because it’s not impossible that one day someone will come up with a new optimization that makes it better instead of worse?
But if every implementation of StringIO, in every interpreter, is actually worse than joining lists, isn’t that a reason for us to reject the proposal?
No. This is no part of my argument. The recommended way to handle building large strings out of lots of little strings is, and always has been, to join a list. It’s in the FAQ. It’s even baked into the code of CPython (see the error message from calling sum on strings). People should not be concatenating strings, but we don’t need to offer them a better solution because they already have a better solution.
No, the fact of every real life Python interpreter having a StringIO that’s at least a little worse than string join, and in some cases a lot worse, is enough to rule out the proposal. (The facts that StringIO also has the wrong semantics is less obvious for the purpose, and isn’t a decades-long established idiom are additional problems with the proposal. And the biggest problem is that the proposal is trying to fix a problem that doesn’t exist in the first place.)
Is that correct, or have I missed some nuance to your defence of string concatenation and rejection of Paul's proposal?
You haven’t missed any nuance, you’ve missed the entire point. I am not defending string concatenation, I’m defending the established idiom of join. I am not arguing to reject Paul’s proposal because it might theoretically be inefficient on some implementation, but because it definitely is inefficient on every existing implementation. And because it’s wrong to boot, and because it doesn’t solve any actual problem.
No it isn’t. The usual idiom is join. It’s true that there are some people who never read the docs, never search StackOverflow or Python-list, never talk to other developers, etc., and abuse string concatenation. But giving them a second idiom isn’t going to change that—they’re still not going to read the docs, etc. We could give them 30 better ways to do it, and that won’t be any better than giving them 1 way.
Guido already answered this; but let me ask a followup question: Why would you think a class that’s in the io module, that implements the text file ABC (and doesn’t implement a string-builder API, hence Paul’s proposal), and that’s documented as a way to be “an in-memory stream for text I/O” would be optimized for use as a string builder instead of for use as an in-memory file object?

Hello, On Mon, 30 Mar 2020 13:59:42 -0700 Andrew Barnert <abarnert@yahoo.com> wrote:
I'm happy to discuss various points, but it would be nice to keep the discussion focused, given that the change proposed is pretty simple. I'm not sure if it's my fault for having tried to structure the original RFC as a poor-man's PEP (so it's somewhat long'ish), but I definitely would like to avoid discussing extended topics along the lines of "there're some mundane languages which offer those string builder classes, but Python is so, SO, special, that it doesn't need it, and whoever thinks otherwise just doesn't get it" or "building a string from pieces by putting pointers to pieces into an array, and then concatenating them together is the PEAK achievement of computer science, and whoever didn't get that just... just... didn't read the CPython (yes, CPython!) FAQ".
Yeah, I tried to account for that with "sys.getsizeof(sb) + sys.getsizeof(sb.getvalue())", thanks for noticing that.
Thanks very much for this intro into the CPython io.StringIO implementation, much appreciated. Please let me return the favor and explain how StringIO is implemented in Pycopy, which I happen to maintain, and in MicroPython (as the original implementation there was written by me).

So, there's an array of bytes. Both implementations use utf-8 to store strings. So, StringIO stores as many bytes as there are in the actual (utf-8) string data. Of course, there's some over-allocation policy to avoid severe quadratic behavior on growing. Overall, storing N bytes of string data requires N + a small % of N bytes. No additional array of pointers is needed. The original constituent strings (each over-allocated, of course) can be GCed in the meantime.

The moral is known, and was stated in the original RFC: for as long as somebody's attention is fixated on CPython, the likely reply from them would be: "there's no problem with CPython3, so there's nothing to fix". It takes stepping up and thinking about *multiple* implementations and the *interface* they *can* offer.
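The scheme described above, a single growing utf-8 byte array, can be sketched in pure Python (the class name is illustrative; the real Pycopy/MicroPython implementations are in C, and bytearray's geometric over-allocation stands in for their over-allocation policy):

```python
class Utf8StringIO:
    """Sketch of the Pycopy/MicroPython approach: one utf-8 byte buffer."""

    def __init__(self):
        self._buf = bytearray()     # bytearray over-allocates geometrically

    def write(self, s):
        self._buf += s.encode("utf-8")   # append the utf-8 bytes in place

    def getvalue(self):
        return self._buf.decode("utf-8") # decode once, at the end

buf = Utf8StringIO()
for s in ("héllo", "wörld"):
    buf.write(s)
assert buf.getvalue() == "héllowörld"
```

Storage is proportional to the utf-8 length of the accumulated text, with no per-fragment pointer array.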
Indeed, it absolutely and guaranteedly wastes a lot of memory. (It's also the fastest, no worries.)
The point I tried to show is that StringIO is never worse than str += regarding performance (stats for 8 implementations were demonstrated). What went implied is that it can also be very memory-efficient, but thanks to your thorough attention, that has now been made explicit, with a (very simple and obvious!) implementation for achieving it described.

I'm sorry to hear about deficiencies in the StringIO implementation of your favorite Python implementation. On the positive side, now that they're identified, they can be fixed (if there's a need to care about them for that particular implementation). Likewise, I'm sorry for not showing the fullest possible extent of appreciation of your joining the discussion of the "StringIO vs str +=" matter with claims like "str.join is the fastest!!", and for instead repeatedly calling to stay on the topic of improving the interface for string building to be as simple and obvious as "str +=". I still tried to answer why str.join can't be a universal solution for all cases; I'm sorry if I failed to do that.
Less efficient than what? I start with simple and obvious "str +=", but vividly inefficient across different Python implementations. I proceed with proposing how with a very simple change, simplicity and obviousness of "str +=" can be retained, while runtime efficiency can be dramatically improved (without any special implied memory use deficiencies). You keep pushing that "there's a faster way to do it". Yes, you're right - there's. But my proposal was never about "fastest string concat in the west", or it would have been about rewriting some code in assembler.
Once somebody would try to implement a dedicated "string builder", they would find that it's some 80% similar to "simulate a file object in memory". On average. I'm sorry to hear about outlier implementations where (per your words), similarity is less than that.
It just occurred to me: maybe I chose the wrong class for running discussion, maybe that should have been BytesIO, and you'd be half won over by now? ;-)
I humbly disagree. And the motivation is exactly parallel to that of str vs io.StringIO. For a (binary) string-builder, you constantly need to grow its internal buffer. You also need to do the same for "simulating a file in memory". Then, once you have an object which does that (hopefully efficiently; again, "ah" to those which don't), you don't need to complicate the implementation of other objects to optimize for the "growing" case. Just use an object suitable for a particular use case: bytearray for in-place updates, and BytesIO for growing construction. I'm sorry in advance if the FAQ for your Python implementation doesn't provide such suggestions. FAQs for other Python implementations very well may.
Wrong claim. I just suggest that it *can* be an answer.
Wrong claim: I don't say "optimal" (after all, you suggested that there's a faster way, and in some cases that can be "optimal"). I would say a "good compromise". -- Best regards, Paul mailto:pmiscml@gmail.com

On Mar 31, 2020, at 12:06, Paul Sokolovsky <pmiscml@gmail.com> wrote:

I don’t know why you think being snarky helps make your case. If you make a mistake and it’s pointed out and you give a sarcastically over-enthusiastic thanks, that doesn’t change the fact that it’s wrong, and if your rationale depends on things that aren’t true, your proposal doesn’t stand.

For example, your demonstration that str.join takes 10x the memory of StringIO in CPython is wrong because you didn’t actually include the cost of the list of buffers inside the StringIO, and once you do, StringIO is actually larger rather than 1/10th the size. It doesn’t matter how much you try to belittle that point or the way it was made by exaggeratedly apologizing; it’s still true that changing everyone’s code to do things your way instead of the way they’ve always done things would increase, not decrease, their memory usage, not just in theoretical possible implementations of Python but in multiple real life implementations, including CPython.

On Mar 30, 2020, at 10:18, Joao S. O. Bueno <jsbueno@python.org.br> wrote:
That said, anyone could tell about small, efficient, well maintained "mutable string" classes on Pypi?
I don’t know of one. But what do you actually want it for? In most cases where you want “mutable strings”, what you really want is either a string builder (just wrap up a list of strings and join), or something that (unlike a list, array.array, etc.) provides insert and delete of substrings in better than linear time, like a gap buffer or rope or tree-indexed thing or similar (and there are good PyPI libraries for some of those things). But if you actually have a use for a simple mutable string that had the str API plus the MutableSequence API and performs roughly like array.array('Q') but with substrings instead of their codepoint int values, I don’t think anyone’s built that. If you want to build it yourself, I doubt it’s possible to make a pure-Python version that’s efficient enough for real use in CPython; you’d probably need a C accelerator just to avoid the cost of boxing and unboxing between ints and single-char strings for most operations. However, you probably could build a minimal “ucs4array” class with a C accelerator and then build most of the str API on top of that in pure Python. (Or, if you want the space efficiency of CPython strings, you need ucs1/ucs2/ucs4array types and a str that switches between them.)
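The ucs4array idea mentioned above can be sketched with the stdlib array module (the class name is hypothetical, and a real version would want a C accelerator to avoid per-operation boxing):

```python
import array

class UCS4String:
    """Mutable fixed-width codepoint sequence, sketching the ucs4array idea."""

    def __init__(self, s=""):
        # 'I' is an unsigned int, 4 bytes on common platforms: one UCS4
        # codepoint per slot
        self._a = array.array("I", (ord(c) for c in s))

    def __setitem__(self, i, ch):
        self._a[i] = ord(ch)          # in-place mutation, O(1)

    def append(self, ch):
        self._a.append(ch if isinstance(ch, int) else ord(ch))

    def __str__(self):
        return "".join(map(chr, self._a))

m = UCS4String("cat")
m[0] = "b"
m.append("!")
assert str(m) == "bat!"
```

A full mutable-str type would then build the rest of the str API on top of a class like this, in pure Python.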

30.03.20 20:07, Andrew Barnert via Python-ideas пише:
Sadly, this isn’t possible. Large amounts of C code—including builtins and stdlib—won’t let you duck type as a string; as it will do a type check and expect an actual str (and if you subclass str, it will ignore your methods and use the PyUnicode APIs to get your base class’s storage directly as a buffer instead). So, no type, either C or Python, can really be a drop-in replacement for str. At best you can have something that you have to call str() on half the time.
I agree with this. It is not possible with the current PyUnicode implementation and the current C API. And even if we can make it possible for most cases, it will significantly complicate the code and the benefit will likely be not worth the cost.
That’s why there’s no MutableStr on PyPI, and no UTF8Str, no EncodedStr that can act as both a bytes and a str by remembering its encoding (Nick Coghlan’s motivating example for changing this back in the early 3.x days), etc.
It is not so hard to implement EncodedStr (but it will look not like you expect). I was going to add it and did some preparations which make it possible. You have just to add the __bytes__ method to string subclass to make bytes(encoded_str) working (it might be enough for my purposes). Or add support of the buffer protocol if you want larger compatibility with bytes, but you can not do this in pure Python. I abandoned this idea because the need (compatibility with some Python 2 pickles) was not large.
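The point from earlier in this exchange, that C-level code bypasses str-subclass overrides and reads the base storage directly, can be illustrated with a small sketch (class name hypothetical):

```python
class ShoutyStr(str):
    """A str subclass whose overridden method C code will never call."""
    def upper(self):
        return "OVERRIDDEN"

s = ShoutyStr("abc")

# A Python-level call honors the override:
assert s.upper() == "OVERRIDDEN"

# But C-level operations like str.join never call methods on the
# elements; they read the underlying PyUnicode buffer directly:
assert "-".join([s, "x"]) == "abc-x"
```

No duck-typed override can change what join produces here, which is why a drop-in str replacement cannot be written as a subclass.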

On Mon, Mar 30, 2020 at 10:07:30AM -0700, Andrew Barnert via Python-ideas wrote:
The quote about adding another abstraction layer solving every problem except the problem of having too many abstraction layers comes to mind.

But let's please not hijack this proposal by making it about a full-blown mutable string object. Paul's proposal is simple: add `+=` as an alias to `.write` to StringIO and BytesIO.

We have the str concat optimization to cater for people who want to concatenate strings using `buf += str`. You are absolutely right that the correct cross-platform way of doing it is to accumulate a list then join it, but that's an idiom that doesn't come easily to many people. Hence even people who know better sometimes prefer the `buf += str` idiom, and hence the repeated arguments about making join a list method. (But you must accumulate the list with append, not with list concatenation, or you are back to quadratic behaviour.)

It seems to me that the least invasive change to write efficient, good-looking code is Paul's suggestion to use StringIO or BytesIO with the proposed `+=` operator. Side by side:

    # best read using a fixed-width font
    buf = ''                buf = []                  buf = io.StringIO()
    for s in strings:       for s in strings:         for s in strings:
        buf += s                buf.append(s)             buf += s
                            buf = ''.join(buf)        buf = buf.getvalue()

Clearly the first is prettiest, which is why people use it. (It goes without saying that *pretty* is a matter of opinion.) It needs no extra conversion at the end, which is nice. But it's not cross-platform, and even in CPython it's a bit risky. The middle is the most correct, but honestly, it's not that pretty. Many people *really* hate the fact that join is a string method and would rather write `buf.join('')`. The third is, in my opinion, quite nice. With the status quo `buf.write(s)`, it's much less nice.

Paul's point about refactoring should be treated more seriously. If you have code that currently has a bunch of `buf += s` scattered around in many places, changing to the middle idiom is difficult:

1. you have to change the buffer initialisation;
2. you have to add an extra conversion to the end;
3. and you have to change every single `buf += s` to `buf.append(s)`.

With Paul's proposal, 1 and 2 still apply, but that's just two lines. Three if you include the `import io`. But step 3 is gone. You don't have to change any of the buffer concatenations to appends. Now that's not such a big deal when all of the concatenations are right there in one little loop, but if they are scattered around dozens of methods or functions it can be a significant refactoring step.
More generally, a StringIO is neither the obvious way
If I were new to Python, and wanted to build a string, and knew that repeated concatenation was slow, I'd probably look for some sort of String Builder or String IO class before thinking of *list append*. Especially if I came from a Java background.
nor the fastest way
It's pretty close though. On my test, accumulating 500,000 strings into a list versus a StringIO buffer, then building a string, took 27.5 versus 31.6 ms. Using a string took 36.4 ms. So it's faster than the optimized string concat, and within arm's reach of list+join. Replacing buf.write with `+=` might, theoretically, shave off a bit of the overhead of attribute lookup. That would close the distance a fraction. And maybe there are other future optimizations that could follow. Or maybe not.
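Steven's three-way timing can be reproduced with a small script (a sketch: the function names are illustrative, and absolute numbers are machine-dependent, so they will differ from the 27.5/31.6/36.4 ms figures above):

```python
import io
import timeit

strings = ['abc'] * 500000

def via_list():
    parts = []
    for s in strings:
        parts.append(s)
    return ''.join(parts)

def via_stringio():
    buf = io.StringIO()
    for s in strings:
        buf.write(s)
    return buf.getvalue()

def via_str():
    out = ''
    for s in strings:
        out += s        # relies on CPython's concat optimization
    return out

# Sanity check that all three build the same string, then time them.
assert via_list() == via_stringio() == via_str()
for fn in (via_list, via_stringio, via_str):
    print(fn.__name__, timeit.timeit(fn, number=5))
```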
If writing `buf += s` is writing C++ instead of Python, then you have spent much of this thread defending the optimization added in version 2.4 to allow people to write C++ instead of Python. So why are you suddenly against it now when the underlying buffer changes from str to StringIO? When I was younger and still smarting from being on the losing side of the Pascal vs C holy wars, I really hated the idea of adding `+=` to Python because it would encourage people to write C instead of Python. I got over it :-) -- Steven

On Tue, Mar 31, 2020 at 03:01:51PM +1100, Steven D'Aprano wrote:
I re-ran the test with a single non-ASCII character added to the very end, '\U0001D400'. Both the list and the StringIO versions slowed down by about the same amount of time (approx 4ms) so the difference between them remained the same in absolute terms but shrank marginally in relative terms. YMMV. -- Steven

On Sun, Mar 29, 2020 at 10:58 AM Paul Sokolovsky <pmiscml@gmail.com> wrote:
I don't think characterizing this as a "mis-optimization" is fair. There is use of in-place add with strings in the wild and CPython happens to be able to optimize for it. Someone was motivated to do the optimization so we took it without hurting performance for other things. There are plenty of other things that I see people do regularly that I don't personally think are best practices, but that doesn't mean we should automatically ignore them and not help make their code more performant if possible without sacrificing best-practice performance. And I'm not sure if you're trying to insinuate that CPython represents Python the language and thus needs to not optimize for something other implementations haven't or can't optimized for; if you are suggesting that, then I have an uncomfortable conversation I need to have with PyPy 😉. Or if you're saying CPython and Python should be considered separate, then why can't CPython optimize for something it happens to be positioned to optimize for that other implementations can't/haven't?

On Mar 30, 2020, at 10:01, Brett Cannon <brett@python.org> wrote:
I don't think characterizing this as a "mis-optimization" is fair. There is use of in-place add with strings in the wild and CPython happens to be able to optimize for it. Someone was motivated to do the optimization so we took it without hurting performance for other things. There are plenty of other things that I see people do regularly that I don't personally think are best practices, but that doesn't mean we should automatically ignore them and not help make their code more performant if possible without sacrificing best-practice performance.
Yes. A big part of the reason there’s so much use in the wild is that for small cases that aren’t in the middle of a bottleneck, it’s perfectly reasonable for people to add two or three strings and not care about performance. (Who cares about N**2 when N<=15 and it happens at most 4 times per run of your program?) So people do it, and it’s fine. When they really do need to optimize, a quick search of the FAQ or StackOverflow or whatever will tell them the right way to do it, and they do it, but most of the time it doesn’t matter. So when CPython at some point optimized str concatenation and made a bunch of scripts 1% faster, most people didn’t notice, and of course they wouldn’t have complained if they had. Maybe the OP could argue that this was a bad decision by finding examples of code that actually relies on that optimization despite being intended to be portable to other implementations. It’s worth comparing the case of calling sum on strings—which is potentially abused more often than used harmlessly, so instead of optimizing it, CPython made it an error. But without any such known examples, it’s hard not to call the string concatenation optimization a win.
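The sum-on-strings guard Andrew mentions is easy to demonstrate; CPython rejects a `str` start value outright rather than optimizing it:

```python
parts = ["foo", "bar"]

# CPython refuses a str start value precisely to block the quadratic antipattern:
try:
    sum(parts, "")
except TypeError as exc:
    print("sum() rejected:", exc)

# The recommended idiom instead:
print("".join(parts))  # foobar
```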

On Mon, Mar 30, 2020 at 10:24:02AM -0700, Andrew Barnert via Python-ideas wrote:
On Mar 30, 2020, at 10:01, Brett Cannon <brett@python.org> wrote:
[talking about string concatenation]
When you're talking about N that small (2 or 4, say), it is quite possible that the overhead of constructing a list then looking up and calling a method may be greater than that of string concatenation, even without the optimization. I wouldn't want to bet either way without benchmarks, and I wouldn't trust the benchmarks from one machine to apply to another.
Ah, but that's the rub. How often do they know they need to do that "quick search"? Unless they get bitten by poor performance, and spend the time to profile their script and discover the cause of the slow down, how would they know what the cause was? If people already know about the string concatenation trap, they don't need a quick search, and they're probably not writing repeated concatenation for arbitrary N in the first place. Although I have come across a few people who are completely dismissive of the idea of using cross-platform best practices. Even actively hostile to the idea that they should avoid idioms that will perform badly on other interpreters. On the third hand, if they don't know about the trap, then it won't be a quick search because they don't know what to search for (unless it's "why is Python so slow?" which won't be helpful). Disclaimer: intellectually, I like the CPython string concatenation optimization. It's clever, a Neat Hack, I really admire it. But I can't help feeling that, *just maybe*, it's a misplaced optimization, and if it were proposed today when we are more concerned about alternative interpreters, we might not have accepted it. Perhaps if CPython didn't dominate the ecosystem so completely, and more people wrote cross-platform code that was run across multiple interpreters, we wouldn't be quite so keen on an optimization that encourages quadratic behaviour half the time. So even though I don't *quite* agree with Paul, I can see that from the perspective of people using alternate interpreters, this CPython optimization could easily be characterized as a mis-optimization. "Why is CPython encouraging people to use an idiom that is all but guaranteed to be hideously slow on everyone else's interpreter?" 
Since Brett brought up the notion of fairness, one might even be forgiven for considering that such an optimization in the reference interpreter, knowing that most of the other interpreters cannot match it, is an unfair, aggressive, anti-competitive action. Personally I wouldn't go quite so far. But I can see why people who are passionate about alternate interpreters might feel that this optimization is both harmful and unfair on the greater Python ecosystem.

Apart from cross-platform issues, another risk with the concat optimization is that it's quite fragile and sensitive to the exact form of your code. A small, seemingly insignificant change to your code can have enormous consequences:

    In [1]: strings = ['abc']*500000

    In [2]: %%timeit
       ...: s = ''
       ...: for x in strings:
       ...:     s = s+x
       ...:
    36.4 ms ± 313 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

    In [3]: %%timeit
       ...: s = ''
       ...: for x in strings:
       ...:     s = t = s+x
       ...:
    59.7 s ± 799 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

That's more than a thousand times slower. And I think people often underestimate how painful it can be to debug performance problems caused by this. If you haven't been burned by it before, it may not be obvious just how risky repeated concatenation can be. Here is an example from real life.
In 2009, about four years after the in-place string concatenation optimization was added to CPython, Chris Withers asked for help debugging a problem where Python httplib was literally hundreds of times slower than other tools, like wget and Internet Explorer: https://mail.python.org/pipermail/python-dev/2009-August/091125.html A few weeks later, Simon Cross realised the problem was probably the quadratic behaviour of repeated string addition: https://mail.python.org/pipermail/python-dev/2009-September/091582.html leading to this quote from Antoine Pitrou: "Given differences between platforms in realloc() performance, it might be the reason why it goes unnoticed under Linux but degenerates under Windows." https://mail.python.org/pipermail/python-dev/2009-September/091583.html and Guido's comment: "Also agreed that this is an embarrassment." https://mail.python.org/pipermail/python-dev/2009-September/091592.html So even in CPython, it isn't inconceivable that the concat optimization may fail and you will have hideously slow code. At this point, I think that CPython is stuck with this optimization, for good or ill. Removing it will just hurt a bunch of code that currently performs adequately. But I can't help but feel that, knowing what we know now, there's a good chance that if that optimization were proposed now rather than in 2.4, we might not accept it.
Does the CPython standard library count? See above. -- Steven

Does anyone know if any linters find and warn about the `string += word` in a loop pattern? It feels like a linter would be the place to do that. I don't think we could possibly make it an actual interpreter warning given borderline OK uses (or possibly even preferred ones). But a little nagging in tooling could draw attention. -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
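As a sketch of what such a lint rule might look like (hypothetical; no claim that any existing linter implements exactly this), the `ast` module makes the basic detection easy, though a real tool would need type inference to restrict the warning to strings:

```python
import ast

CODE = '''
buf = ""
for i in range(10):
    buf += "x"
'''

class ConcatInLoopChecker(ast.NodeVisitor):
    """Hypothetical lint rule: flag `name += expr` inside a for/while loop.
    Without type inference this also flags numeric accumulators, so a real
    linter would need to narrow it to strings."""
    def __init__(self):
        self.warnings = []
        self._loop_depth = 0

    def visit_For(self, node):
        self._loop_depth += 1
        self.generic_visit(node)
        self._loop_depth -= 1

    visit_While = visit_For  # same handling for while loops

    def visit_AugAssign(self, node):
        if self._loop_depth and isinstance(node.op, ast.Add):
            self.warnings.append(
                "line %d: augmented concatenation inside a loop" % node.lineno)
        self.generic_visit(node)

checker = ConcatInLoopChecker()
checker.visit(ast.parse(CODE))
print(checker.warnings)
```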

I have myself been "guilty" of using the problem style for N < 10. In fact, I had forgotten about the optimization even, since my uses are negligible time. For stuff like this, it's fast no matter what:

    for clause in query_clauses:
        sql += clause

Maybe I have a WHERE or two. Maybe an ORDER BY. Etc. But if I'm sure there won't be more than 6 such clauses to the query I'm building, so what? Or probably likewise with bits of a file path, or a URL with optional parameters, and a few other things.

On Mon, Mar 30, 2020 at 11:15 PM David Mertz <mertz@gnosis.cx> wrote:

Hello, On Mon, 30 Mar 2020 23:20:28 -0400 David Mertz <mertz@gnosis.cx> wrote:
I personally don't think it's a "problem style" per se. I'm using it all the time. I'm going to keep using it. We all love Python for being a language which you can easily prototype in, and don't have to concern yourself with implementation details unless you want. My concern is for Python to be a language which can be progressively and easily optimized when you need it. Going over the code and changing many lines along the lines of this diff:

    - buf += piece
    + buf.append(piece)

isn't my idea of "seamless" optimization, especially since I know it is guaranteed to grow my memory usage. Sadly, I came to the conclusion that even

    - buf += piece
    + buf.write(piece)

patching hurts my aesthetic feelings. At least, that's the only explanation I have for why in one of my modules https://github.com/pfalcon/pycopy-lib/blob/master/utokenize/utokenize.py#L44 I converted one loop from "str +=" to "StringIO.write()", but not another. I then tried to think what could help with that, and having += on StringIO seemed to do the trick.
[] -- Best regards, Paul mailto:pmiscml@gmail.com

Hello, On Mon, 30 Mar 2020 09:58:32 -0700 Brett Cannon <brett@python.org> wrote:
Everyone definitely doesn't have to agree with that characterization. Nor is there a strong need to be offended that it's "unfair". After all, it's just somebody's opinion. Roughly speaking, the need to be upset by the "mis-" prefix is about the same as the need to be upset by "bad" in some random blog post, e.g. https://snarky.ca/my-impressions-of-elm/

I'm also sure that people familiar with implementation details would understand why that "mis-" prefix, but let me be explicit otherwise: a string is one of the fundamental types in many languages, including Python. And trying to make it too many things at once has its overheads. Roughly speaking, to support efficient appending, one needs to be ready to over-allocate string storage, and maintain bookkeeping for this. Another known optimization CPython does is for stuff like "s = s[off:]", which requires maintaining another "offset" pointer. Even with this simplistic consideration, the internal structure of "str" would be about the same as that of "io.StringIO" (which also needs to over-allocate and maintain a "current offset" pointer). But why, if there's io.StringIO in the first place?
Nowhere did I argue against applying that optimization in CPython. Surely, in general, the more optimizations, the better. I just stated the fact that of 8 (well, 11, 11!) Python'ish implementations surveyed, only 1 implemented it. And what went implied is that even under the ideal conditions where other implementations say "we have resources to implement and maintain that optimization" (we're still talking about the "str +=" optimization), then at least for some projects, it would be against their interests. E.g. MicroPython, Pycopy, Snek optimize for memory usage, TinyPy for simplicity of implementation. "Too-complex basic types" are also a known problem for JITs (which become less performant due to the need to handle multiple cases of the same primitive type, and much harder to develop and debug).

At the same time, the ergonomics of "str +=" is very good (heck, that's why people use it). So, I was looking for the simplest possible change which would allow for the largest part of that ergonomics in an object type more suitable for content accumulation *across* different Python'ish implementations.

I have to admit that I was inspired to write down this RFC by PEP 616 "String methods to remove prefixes and suffixes". Who'd think that after so many years, there's still something useful to be added to string methods (and then, that it doesn't have to be as complex as one can devise at full throttle, but much simpler than that).
And I'm not sure if you're trying to insinuate that CPython represents Python the language
That's an old and painful (to some) topic.
and thus needs to not optimize for something other implementations have/can not optimize for, which if you are
As I clarified, I don't say that CPython shouldn't optimize for things. I just tried to argue that there's no clearly defined abstraction (*) for an accumulating string buffer, and argued that it could be easily "established".

(*) Instead, there are various practical hacks to implement it, as both the 2006 thread and this one show.
Yes, I personally think that CPython and Python should be considered separate. E.g. the topic of this RFC shouldn't be considered just from CPython's point of view, but rather from the angle of "Python doesn't seem to define a useful abstraction of (ergonomic) string builder, here's how different Python implementations can acquire it almost for free". -- Best regards, Paul mailto:pmiscml@gmail.com

On Mar 30, 2020, at 12:00, Paul Sokolovsky <pmiscml@gmail.com> wrote:
Because io.StringIO does _not_ need to do that. It’s documented to act like a TextIOWrapper around a BytesIO. And the pure-Python implementation (as used by some non-CPython implementations of Python) is actually implemented that way: https://github.com/python/cpython/blob/3.8/Lib/_pyio.py#L2637. Every read and write to a StringIO passes through the incremental newline processor and the incremental UTF-8 codec to get passed on to a BytesIO. That’s not remotely optimal. And it doesn’t allow you to do random-access seeks to arbitrary character positions. It’s true that the C accelerator for io.StringIO used by CPython uses a dynamic overallocated array of UCS4 instead, but you can’t rely on that portably any more than you can rely on CPython’s str.__iadd__ optimization. Plus, it’s optimized for typical file-like usage, not for typical string-like usage, so the resize rules aren’t the same; there’s no attempt to optimize storage for all-Latin or all-BMP text; and so on. Plus, it still has to deal with file-ish things like universal newline support, which you not only don’t need, but explicitly want to not be there.
(*) Instead, there are various practical hacks to implement it, as both the 2006 thread and this one show.
No, there is one idiomatic way to do it: create a list of strings and join them. That’s not a “hack” any more than using a string builder class or a string stream/file class is a “hack”. The fact that the standard Python idiom, the standard Java idiom, and the standard C++ idiom for building strings are all different is not a defect in any of those three languages; they’re all perfectly reasonable. And changing Python to have two standard idioms instead of one (with the new one less efficient and more complicated) would not be an improvement.

Hello, On Mon, 30 Mar 2020 12:37:48 -0700 Andrew Barnert <abarnert@yahoo.com> wrote:
You miss the point of my RFC - it says it *can* do that, for free. And it *can* be documented as a class to perform very reasonable string construction across various Python implementations. And any Python implementation providing StringIO can pick it up very easily. I hear you, you say "no need". Noted, thanks for detailed feedback. (It's p.4.1 in the RFC, "there's no problem with CPython3, so there's nothing to fix"). [] -- Best regards, Paul mailto:pmiscml@gmail.com

On Mon, Mar 30, 2020 at 12:37:48PM -0700, Andrew Barnert via Python-ideas wrote:
The same comment can be made that str does not need to implement the in-place concat optimization either. And yet it does, in CPython if not any other interpreter. It seems to me that Paul makes a good case that, unlike the string concat optimization, just about every interpreter could add this to StringIO without difficulty or great cost. Perhaps they could even get together and agree to all do so. But unless CPython does so too, it won't do them much good, because hardly anyone will take advantage of it. When one platform dominates 90% of the ecosystem, one can sensibly write code that depends on that platform's specific optimizations, but going the other way, not so much. The question that comes to my mind is not whether StringIO *needs* to do this, but whether there is any significant cost to doing this? Of course there is *some* cost: somebody has to do the work, and it won't be me. But once done, is there any significant maintenance cost beyond what there would be without it? Is there any downside? [...]
And it doesn’t allow you to do random-access seeks to arbitrary character positions.
Sorry, I don't see why random access to arbitrary positions is relevant to a discussion about concatenation. What am I missing? -- Steven


It’s usually an even better alternative to just put the strings into a list of strings (or to write a generator that yields them), and then pass that to the join method. This is recommended in the official Python FAQ. It’s usually about 40% faster than using StringIO or relying on the string-concat optimization in CPython, it’s efficient across all implementations of Python, and it’s obvious _why_ it’s efficient. It can sometimes take more memory, but the tradeoff is usually worth it. This has been well known in the Python community for decades. People coming from C++ look for something like stringstream and find StringIO; people coming from Java look for something like StringBuilder and build their own version around StringIO; people who are comfortable with Python use str.join. So third-party libraries that don’t do that are likely either (a) not expecting large amounts of data (and therefore probably suboptimal in other areas), or (b) written by someone who doesn’t really get Python. So what is StringIO for? For being a file object, but in memory rather than representing a file. Its API is exactly the same as every other file object’s, because that’s the whole point of it.
So your goal is to allow people to use badly-written third-party libs designed around the string-concat antipattern, without fixing those libs, by feeding them StringIO objects when they expected str objects? This seems like a solution to a theoretical problem that might work for some instances of that problem. But do you have any actual examples of third-party libs that have this problem, and that (obviously) break if you give them StringIO objects, but would not break when passed a StringIO with __iadd__?
Yes. Not as in “nobody will ever run it again”, but definitely as in “no new feature you add to Python will be backported”. Python 2.7 the language and CPython 2.7 the implementation have been feature-frozen for years now, and now they’re not even supported by the Python organization at all. So, trying to improve the behavior of Python 2.7 code by making a proposal for Python won’t get you anywhere. Adding StringIO.__iadd__ to Python 3.10 will not help anyone using Python 2.7. In fact, even if you somehow convinced everyone to make the extraordinary decision to re-open Python 2.7 and make a new 2.7.18 release with this feature backported, it still wouldn’t help the vast majority of people using Python 2.7, because most people using Python 2.7 are using stable systems with stable versions that they don’t update for years. That’s why they’re still using 2.7 in the first place: because 2.7.16 is what comes with the Linux LTS they’ve settled on for deployment, or it’s what comes with the macOS version they use for their dev boxes, or Jython doesn’t have a 3.x version yet, or whatever. So a new feature in 2.7.18 wouldn’t get to them for years, if ever. It’s also worth noting that the io module is very slow in most Python 2.x implementations. There’s a separate (and older) StringIO module, and for CPython an accelerated cStringIO, and you almost certainly want to use those, not io, here. (Except, of course, that what you really want to use is join anyway.)
The last IronPython release, 2.7.9, was in 2018. As the release notes for that version say, “With this release, we will shift the majority of work to IronPython3.” Of course IronPython3 isn’t ready for prime time yet, but it’s not because they’re still firmly in Python2 territory and still making major improvements to their 2.7 branch, it’s because it’s taking a long time to finish their 3.x branch (in part because they no longer have Microsoft and Unity throwing resources at the project). They’re not adding new features to 2.7 any more than CPython is. (They are working on a 2.7.10; but it’s just 2.7.9 with support for more .NET runtimes plus porting some security fixes from the last CPython 2.7 stdlib.) I don’t know the situation with Jython as well, but I believe it’s similar.
3. Recognize that Python and CPython have been promoting str.join for this problem for decades, and most performance-critical code is already doing that, and make sure that solution is efficient; and recognize that poorly-written code is uncommon but does exist, and may take a bit more work than a 1-line change to optimize, but that’s acceptable, and not the responsibility of any alternate Python implementation to help with.

I completely agree with Andrew Barnert. I just want to add a little comment about overriding the `+=` (and `+`) operator for StringIO. Since StringIO is a stream --not a string--, I think `StringIO` should continue to use the common interface for streams in Python. `write()` and `read()` are fine for streams (and files), and you can find similar `write` and `read` functions in other languages. I cannot see any advantage in departing from this convention.

I agree with the arguments the OP brings forward. Maybe it should be a case of having a `StringIO` and `BytesIO` subclass? Or better yet, just a class that wraps those, and hides away the other file-like methods and behaviors? That would keep the new class semantically a string, and it could implement all of the str/bytes methods and attributes so as to be a drop-in replacement - _and_ add a proper `__setitem__` so that one could have a proper "mutable string". It just would use StringIO/BytesIO as its "engine". Such code would take like, 100 lines (most of them just to forward/reimplement some of the legacy str methods), be an effective drop-in replacement, require no change to Python - it could even be put on PyPI now - and, maybe, even reach Python 3.9 in time, because, as I said, I agree with your points. On Mon, 30 Mar 2020 at 12:06, <jdveiga@gmail.com> wrote:

On Mar 30, 2020, at 08:29, Joao S. O. Bueno <jsbueno@python.org.br> wrote:
Why? What’s the benefit of building a mutable string around a virtual file object wrapped around a buffer (with all the extra complexities and performance costs that involves, like incremental Unicode encoding and decoding) instead of just building it around a buffer directly? Also, how can you implement an efficient randomly-accessible mutable string object on top of a text file object? Text files don’t do constant-time random-access seek to character positions; they can only seek to the opaque tokens returned by tell. (This should be obvious if you think about how you could seek to the 137th character in a UTF-8 file without reading all of the first 137 characters.) (In fact, recent versions of CPython optimize StringIO so it only fakes being a TextIOWrapper around a BytesIO and actually uses a Py_UCS4* buffer for storage, but that’s CPython-specific, not guaranteed, and not accessible from Python even in CPython.) And, even if that were a good idea for implementation reasons, why should the user care? If they need a mutable string, why do they care whether you give them one that inherits from or delegates to a StringIO instead of a list or an array.array of int32 or the CPython string buffer API (whether accessed via a C extension or ctypes.pythonapi) or a pure C library with its own implementation and optimizations? More generally, a StringIO is neither the obvious way nor the fastest way nor the recommended way to build strings on the fly in Python, so why do you agree with the OP that we need to make it better for that purpose? Just to benefit people who want to write C++ instead of Python? If the goal is to cater to people who won’t read the docs to learn the right way, the obvious solution is to mandate the non-quadratic string concatenation of CPython for all implementations, not to give them yet another way of doing it and hope they’ll guess or look up that one even though they didn’t guess or look up the long-standing existing one.
Sadly, this isn’t possible. Large amounts of C code—including builtins and stdlib—won’t let you duck type as a string; as it will do a type check and expect an actual str (and if you subclass str, it will ignore your methods and use the PyUnicode APIs to get your base class’s storage directly as a buffer instead). So, no type, either C or Python, can really be a drop-in replacement for str. At best you can have something that you have to call str() on half the time. That’s why there’s no MutableStr on PyPI, and no UTF8Str, no EncodedStr that can act as both a bytes and a str by remembering its encoding (Nick Coghlan’s motivating example for changing this back in the early 3.x days), etc. Fixing this cleanly would probably require splitting the string C API into abstract and concrete versions a la sequence and then changing a ton of code to respect abstract strings (to only optimize for concrete ones rather than requiring them, again like sequences). Fixing it slightly less cleanly with a hookable API might be more feasible (I’m pretty sure Nick Coghlan looked into it before the 3.3 string redesign; I don’t know if anyone has since), but it’s still probably a major change.
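Andrew’s point that C-level code type-checks for real `str` (and reads subclasses through the base buffer, ignoring overrides) is easy to observe:

```python
class FakeStr:
    """Duck-types as a string via __str__, but is not a str."""
    def __str__(self):
        return "abc"

# C-level consumers do a type check, not duck typing:
try:
    "abcdef".startswith(FakeStr())   # startswith wants str (or a tuple of str)
except TypeError:
    print("non-str rejected by str.startswith")

class MyStr(str):
    pass

m = MyStr("abc")
print("abcdef".startswith(m))  # True: subclasses are fine, via the base buffer
# But C code returns plain str, discarding the subclass:
print(type("x" + m))           # <class 'str'>
```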

Hi Andrew - I made my previous post before reading your first answer. So, anyway, what we have is that for a "mutable string like object" one is free to build one's own wrapper - StringIO based or not - put it on PyPI, and remember to call `str()` on it before having it leave your code. Thank you for the lengthy reply anyway. That said, can anyone tell me about small, efficient, well maintained "mutable string" classes on PyPI? On Mon, 30 Mar 2020 at 14:07, Andrew Barnert <abarnert@yahoo.com> wrote:

On Tue, Mar 31, 2020 at 4:20 AM Joao S. O. Bueno <jsbueno@python.org.br> wrote:
There's a vast difference between "mutable string" and "string builder". The OP was talking about this kind of thing:

    buf = ""
    for i in range(50000):
        buf += "foo"
    print(buf)

And then suggested using a StringIO for that purpose. But if you're going to change your API, just use a list:

    buf = []
    for i in range(50000):
        buf.append("foo")
    buf = "".join(buf)
    print(buf)

So if you really want a drop-in replacement, don't build it around StringIO, build it around list.

    class StringBuilder:
        def __init__(self):
            self.data = []
        def __iadd__(self, s):
            self.data.append(s)
        def __str__(self):
            return "".join(self.data)

This is going to outperform anything based on StringIO fairly easily, plus it's way WAY simpler. But this is *not* a mutable string. It's a string builder. If you want a mutable string, first figure out exactly what mutations you need, and what performance you are willing to accept.

ChrisA

On Tue, Mar 31, 2020 at 5:10 AM Serhiy Storchaka <storchaka@gmail.com> wrote:
And that's what I get for quickly whipping something up and not testing it. Good catch. But you get the idea - a simple wrapper around a *list* is going to be way better than a wrapper around StringIO. ChrisA
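For the record, the bug Serhiy evidently caught is that `__iadd__` must return `self`; without the return, `buf += s` rebinds `buf` to None. A corrected version of Chris's sketch:

```python
class StringBuilder:
    def __init__(self):
        self.data = []

    def __iadd__(self, s):
        self.data.append(s)
        return self   # essential: augmented assignment rebinds the name

    def __str__(self):
        return "".join(self.data)

buf = StringBuilder()
for s in ("foo", "bar"):
    buf += s
print(str(buf))  # foobar
```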

Hello, On Tue, 31 Mar 2020 04:27:04 +1100 Chris Angelico <rosuav@gmail.com> wrote: []
I appreciate expressing it all concisely and clearly. Then let me respond here instead of the very first '"".join() rules!' reply I got. The issue with "".join() is very obvious:

------
import io
import sys

def strio():
    sb = io.StringIO()
    for i in range(50000):
        sb.write(u"==%d==" % i)
    print(sys.getsizeof(sb) + sys.getsizeof(sb.getvalue()))

def listjoin():
    sb = []
    sz = 0
    for i in range(50000):
        v = u"==%d==" % i
        # All individual strings will be kept in the list and
        # can't be GCed before the final join.
        sz += sys.getsizeof(v)
        sb.append(v)
    s = "".join(sb)
    sz += sys.getsizeof(sb)
    sz += sys.getsizeof(s)
    print(sz)

strio()
listjoin()
------

$ python3.6 memuse.py
439083
3734325

So, it's obvious, but let's formulate it clearly for avoidance of doubt: there's absolutely no reason why performing the trivial operation of accumulating string content should take about an order of magnitude more memory than actually needed for that string content. Don't get me wrong - if you want to spend that much of your memory, then sure, you can. But jumping in with that as *the only right solution* whenever somebody mentions "string concatenation" is a bit ... umm, cavalier.
This is going to outperform anything based on StringIO fairly easily,
Since when is raw speed the only criterion for performance? If you say "forever", I'll trust that only if you proceed with showing assembly code with SSE and AVX which you wrote to get those last cycles out. Otherwise, being able to complete operations in a reasonable amount of memory, not OOMing and not being DoSed by trivial means, and finally, serving 8 times more requests in the same amount of memory - those are all quite valid criteria too.

What's interesting is that, so far, the discussion almost 1-to-1 parallels the discussion in the 2006 thread I linked from the original mail.
But of course! And what's most important, nowhere did I say what should be inside this class. My whole concern is along 2 lines:

1. This StringBuilder class *could* be an existing io.StringIO.
2. By just adding an __iadd__ operator to it.

That's it, nothing else. What's inside the StringIO class is up to you (dear various Python implementations, their maintainers, and contributors). For example, fans of "".join() surely can have it inside. Actually, it's a known fact that Python2's "StringIO" module (the original home of the StringIO class) was implemented exactly like that, so you can go straight back to the future. And again, the need for anything like that might be unclear to CPython-only users. Such users can write a StringBuilder class like the above, or repeat the beautiful "".join() trick over and over again. The need for a nice string builder class may arise only from considering that Python-as-a-language lacks a clear and nice abstraction for it, and from thinking how to add such an abstraction in a performant way (for which the criteria differ) to as many implementations as possible, in as easy a way as possible. (At least that's my path to it; I'm not sure if a different thought process might lead to it too.) -- Best regards, Paul mailto:pmiscml@gmail.com

On Tue, Mar 31, 2020 at 7:04 AM Paul Sokolovsky <pmiscml@gmail.com> wrote:
... about order of magnitude more memory ...
I suspect you may be multiply-counting some of your usage here. Rather than this, it would be more reliable to use the resident set size (on platforms where you can query that):

if "strio" in sys.argv:
    strio()
else:
    listjoin()
print("Max RSS:", resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

Based on that, I find that it's at worst a 4:1 difference. Plus, I couldn't see any material difference - the numbers were within half a percent, basically just noise - until I upped your loop counter to 400,000, nearly ten times as much as you were doing. (At that point it became a 2:1 difference. The 4:1 didn't show up until a lot later.) So you have to be working with a *ridiculous* number of strings before there's anything to even consider. And even then, it's only notable if the individual strings are short AND all unique. Increasing the length of the strings basically made it a wash. Consider:

for i in range(1000000):
    sb.write(u"==%d==" % i + "*"*1024)

Max RSS: 2028060

for i in range(1000000):
    v = u"==%d==" % i + "*"*1024

Max RSS: 2104204

So at this point, the string join is slightly faster and takes slightly more memory - within 20% on the time and within 5% on the memory. ChrisA
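Spelled out as a complete script (a sketch; Unix-only, since it uses the stdlib resource module, and the loop count is illustrative), the RSS-based measurement Chris describes might look like:

```python
import io
import resource
import sys

def strio(n):
    sb = io.StringIO()
    for i in range(n):
        sb.write(u"==%d==" % i)
    return sb.getvalue()

def listjoin(n):
    sb = []
    for i in range(n):
        sb.append(u"==%d==" % i)
    return "".join(sb)

if __name__ == "__main__":
    # Run one variant per process so each peak RSS is measured in isolation:
    #   python3 memrss.py          -> list + join
    #   python3 memrss.py strio    -> io.StringIO
    n = 400000
    result = strio(n) if "strio" in sys.argv else listjoin(n)
    print(len(result))
    # Note: ru_maxrss is reported in kilobytes on Linux, bytes on macOS.
    print("Max RSS:", resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
```

Because `ru_maxrss` is a per-process peak, comparing the two builders in a single process would let the first run inflate the second run's number; hence one variant per invocation.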

Hello, On Tue, 31 Mar 2020 07:40:01 +1100 Chris Angelico <rosuav@gmail.com> wrote:
I may humbly suggest a different process too: get any hardware board with MicroPython and see how much data you can collect in a StringIO and in a list of strings. Well, you actually don't need a dedicated hardware, just get a Linux or Windows version and run it with a specific heap size using a -X heapsize= switch, e.g. -X heapsize=100K. Please don't stop there, we talk multiple implementations, try it on CPython too. There must be a similar option there (because how otherwise you can perform any memory-related testing!), I just forgot which. The results should be very apparent, and only forgotten option may obfuscate it. [] -- Best regards, Paul mailto:pmiscml@gmail.com

As others have pointed out, the OP started in a bit of an oblique way, but it may come down to this: there are some use-cases for a mutable string type. And one could certainly write one. Presto: here is one: https://github.com/Daniil-Kost/mutable_strings Which looks to me to be more a toy than anything, but maybe the author is seriously using it... (it does look like it has an indexing bug if there are non-ASCII characters). And yet, as far as I know, there has never been one that was carefully written and optimized, which would be a bit of a trick, because of how Python strings handle Unicode. (it would have been a lot easier with Python2 :-) ) So why not?

1) As pointed out, high-performance strings are key to a lot of coding, so Python's str is very baked-in to a LOT of code, and can't be duck-typed. I know that pretty much the only time I ever type check (as opposed to simple duck typing / EAFP) is for str. So if one were to make a mutable string type, you'd have to convert it to a string a lot in order to use most other libraries. That being said, one could write a mutable string that mirrored the CPython string type as much as possible, and it could be pretty efficient, even for making regular strings out of it.

2) Maybe it's really not that useful. Other than building up a long string with a bunch of small ones (which can be done fine with .join()), I'm not sure I've had much of a use case -- it would buy you a tiny bit of performance for, say, altering strings in ways that don't change their length, but I doubt there's many (if any) applications that would see any meaningful benefit from that.

So I'd say it hasn't been done because (1) it's a lot of work and (2) it would be a bit of a pain to use, and not gain much at all. A kind-of-related anecdote: numpy arrays are mutable, but you cannot change their length in place.
So, as with strings, if you want to build up an array from a lot of little pieces, the best way is to put all the pieces in a list, and then make an array out of it when you are done. I had a need to do that fairly often (reading data from files of unknown size) so I actually took the time to write an array that could be extended. Turns out that: 1) it really wasn't much faster (than using a list) in the usual use-cases anyway :-) 2) it did save memory -- which only mattered for monster arrays, and I'd likely need to do something smarter anyway in those cases. I even took some time to write a Cython-optimized version, which only helped a little. I offered it up to the numpy community. But in the end: no one expressed much interest. And I haven't used it myself for anything in a long while. Moral of the story: not much point in a special class to do something that can already be done almost as well with the builtins. -CHB On Mon, Mar 30, 2020 at 2:06 PM Paul Sokolovsky <pmiscml@gmail.com> wrote:
-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython

On Mon, Mar 30, 2020 at 04:25:07PM -0700, Christopher Barker wrote:
With respect Christopher, this is a gross misrepresentation of what Paul has asked for. He is not asking for a mutable string type. If that isn't clear from the subject line of this thread, it ought to be clear from Paul's well-written and detailed post, which carefully explains what he wants. -- Steven

Steven D'Aprano writes:
[I]t ought to be clear from Paul's well-written and detailed post, which carefully explains what he wants.
Whose value to Python I still don't understand, because AFAICS it's something that on the one hand violates TOOWTDI and has no parallels elsewhere in the io module, and on the other hand is trivial to implement for any programmer who really thirsts for StringIO.__iadd__. Unless there are reasons why a derived class won't do? I agree there seem to be possible space performance issues with str.join that are especially painful for embedded applications (as came out later in the thread I believe), but if those are solved by StringIO, they're solved by StringIO. So the whole thing seems to be a cosmetic need for niche applications[1] for a niche platform[2] that is addressed by a 4-line class definition[3] for users who want the syntactic sugar. Me, I'm perfectly happy with StringIO.write because that's what I expect from the io module. FWIW YMMV of course. Footnotes: [1] I don't even use strings at all in any of my adafruit applications! [2] OK, that's going too far, sorry. Embedded matters, their needs are real needs, and they face tight constraints most of us rarely need to worry about. It's still at present a minority platform, I believe, and the rest of the sentence applies AFAIK. [3] Paul's "exact alias of .write() method", which can be done in 1 line, fails because .write() doesn't return self. Thanks, Serhiy. In the stdlib we might even want a check for "at end of buffer" (.write() can overwrite Unicode scalars anywhere in the buffer). That's definitely overengineering for a user, but in the stdlib, dunno.
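For concreteness, the derived class alluded to in footnote [3] might look like this (a sketch; the class name is illustrative):

```python
import io

# Sketch of the trivial subclass: __iadd__ delegates to .write() and
# then returns self. A bare one-line alias for .write() won't do,
# because .write() returns the number of characters written, not the
# buffer object, and __iadd__ must return the object to rebind.
class StringBuilder(io.StringIO):
    def __iadd__(self, s):
        self.write(s)
        return self

buf = StringBuilder()
buf += "foo"
buf += "bar"
print(buf.getvalue())  # foobar
```

Since the subclass only adds an operator, everything else (getvalue(), seek(), the stream protocol) behaves exactly as io.StringIO already does.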

Hello, On Tue, 31 Mar 2020 18:09:59 +0900 "Stephen J. Turnbull" <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote: []
[3] Paul's "exact alias of .write() method", which can be done in 1 line, fails because .write() doesn't return self. Thanks, Serhiy. In
I stand corrected, gentlemen, thanks for catching that. It's a poor man's pep, not a real pep after all. A real pep wouldn't use ambiguous phrases like that, but something like "an __iadd__ method, with the same semantics as the existing .write() method, but returning self". In terms of the C implementation, that's a one-line difference, in pseudocode:

- return write_method(self, ...)
+ write_method(self, ...)
+ return self

In terms of machine code, that would be +1 instruction. I guess such a minor difference made me discount it and use the ambiguous term "alias". For reference, the implementation for Pycopy: https://github.com/pfalcon/pycopy/commit/4b149fb8a4fb18e954ba7113d1495ccf822... (such a big patch because, as expected, Pycopy optimizes operators vs general methods, and as there were no operators defined for StringIO before, it takes a whole 14 lines of boilerplate to add).
Per my idea, __iadd__ would be an exact equivalent of .write in behavior (a complexity-busting measure), but specific implementations of course can add extra checks if they just can't do otherwise. (Reminds me of the PEP 616 discussion, where there was mention of raising ValueError on an empty prefix, even though it all started as being an equivalent of

if s.startswith(prefix): s = s[len(prefix):]

And str.startswith() doesn't throw ValueError on either str.startswith("") or str.startswith(("foo", "")). It seems that we just can't pass by a chance to add another corner case to explain away from the already existing behavior, all with the good intention of policing our users, for they can't handle it themselves). -- Best regards, Paul mailto:pmiscml@gmail.com

Hello, On Mon, 30 Mar 2020 16:25:07 -0700 Christopher Barker <pythonchb@gmail.com> wrote:
For avoidance of doubt: nothing in my RFC has anything to do, or implies, "a mutable string type". A well-know pattern of string builder, yes. Piggybacking on existing StringIO/BytesIO classes, yes. Anything else, no. To not leave it cut and dry: IMHO, we need more const'ness in Python, not less. I my dreams I already do stuff like: from __future__ import const class Foo: pass # This is an alias for "Foo" Bar: const = Foo # This is a variable which can store a reference to Foo or any other class Baz = Foo [This is not a new RFC! Please start a new thread if you'd like to pick it up ;-)] -- Best regards, Paul mailto:pmiscml@gmail.com

Paul Sokolovsky wrote:
If I understand correctly, you are proposing a change from StringIO's `write` method to a `+=` operator. Is that right? I cannot see any advantage in this proposal since there is no real change in the implementation of StringIO. Or are you proposing some change in the underlying implementation that I have missed? In that case, I disagree with you: StringIO is a stream and I think that it is wrong to make it "look & feel" like a string. That is my opinion. Sorry if I have misunderstood you.

On Tue, Mar 31, 2020 at 07:32:11PM -0000, jdveiga@gmail.com wrote:
If I understand correctly, you are proposing a change from StringIO's `write` method to a `+=` operator. Is that right?
No, that is not correct. The StringIO.write() method will not be changed or removed. The proposal is to extend the class with the `+=` operator, which will act as an equivalent to calling write().
This proposal isn't about enhancing StringIO's functionality. This proposal is targeted at people who are using string concatenation instead of assembling a list then calling join. It is about leveraging StringIO's ability to behave as a string builder to give people a minimally invasive change from the string concatenation anti-pattern:

buf = ''
# repeated many times
buf += 'substring'

to something which can be efficient on all Python interpreters:

buf = StringIO()
buf += 'substring'
buf = buf.getvalue()
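The migration Steven describes can be sketched end-to-end (a sketch; note that the `+=` spelling on StringIO is the *proposal*, so today's version must call .write() explicitly):

```python
import io

# Anti-pattern: repeated concatenation onto an immutable str.
# Quadratic on interpreters without CPython's in-place += optimization.
buf = ""
for i in range(1000):
    buf += "substring"

# Migrated shape: only the construction and extraction lines change;
# the loop body stays structurally the same.
buf2 = io.StringIO()
for i in range(1000):
    buf2.write("substring")  # would read `buf2 += "substring"` under the proposal
buf2 = buf2.getvalue()

assert buf == buf2
```

The point of the proposal is that the middle line of the second version would become character-for-character identical to the first version's loop body.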
Paul has not suggested making StringIO look and feel like a string. Nobody is going to add 45+ string methods to StringIO. This is a minimal extension to the StringIO class which will allow people to improve their string building code with a minimal change. -- Steven

On Wed, 1 Apr 2020 at 02:07, Steven D'Aprano <steve@pearwood.info> wrote:
Thanks for paring the proposal down to its bare bones; there's a lot of side questions being discussed here that are confusing things for me. With this in mind, and looking at the bare proposal, my immediate thought is: who's going to use this new approach?

buf = StringIO()
buf += 'substring'
buf = buf.getvalue()

I hope this isn't going to trigger another digression, but it seems to me that the answer is "nobody, unless they are taught about it, or work it out for themselves[1]". My reasons for saying this are that it adds no value over the current idiom of building a list then using join(), so people who already write efficient code won't need to change. The people who *might* change to this are people currently writing

buf = ''
# repeated many times
buf += 'substring'

Those people have presumably not yet learned about the (language independent) performance implication of repeated concatenation of immutable strings[2]. Ignoring CPython's optimisation for += on strings, as all that will do is allow them to survive longer without hitting the issues with this pattern, when they *do* find there's an issue, they will be looking for a better approach. At the moment, the message is relatively clear - "build a list and join it" (it's very rare that anyone suggests StringIO currently). This proposal is presumably intended to make "use StringIO and +=" a more attractive alternative (because it avoids the need to rewrite all those += lines). So we now find ourselves in the position of having *two* "recommended approaches" to addressing the performance issue with string concatenation. I'd contend that there's a benefit in having a single well-known idiom for fixing this issue when beginners hit it. Clarity of teaching, and less confusion for people who are learning that they need to address an issue that they weren't previously aware of.
I further suggest that the benefits of the += syntax on StringIO (less change to existing code) are not sufficient to outweigh the benefits of having a single well-known "best practice" solution. So I'm -0.5 on this change (only 0.5, because it's a pretty trivial change, and not worth getting too worked up about). Paul [1] Or they have a vested interest in using the "string builder" pattern in Python, rather than using Python's native idioms. That's not an uncommon situation, but I don't think "helping people write <language X> in Python" is a good criterion for assessing language changes, in general. [2] Or they have, and know that it doesn't affect them, in which case they don't need to change anything.

Hello, On Wed, 1 Apr 2020 10:01:06 +0100 Paul Moore <p.f.moore@gmail.com> wrote:
[]
Roughly speaking, the answer would be about the same in idea as the answers to the following questions:

* Who'd be using assignment expressions? (a 2nd way to do assignment, whoa!)
* Who'd be using f-strings? (a 3rd (or more) way to do string formatting, bhoa!)
* Who'd be writing s = s.removeprefix("foo") instead of if s.startswith("foo"): s = s[3:] (PEP 616)?
* Who'd be using binary operator @?
* Who'd be using unary operator +?
Ok, so we found the answers to all those questions - people who might have a need to use, would use it. You definitely may argue of how many people (in absolute and relative figures) would use it. Let the binary operator @ and unary operator + be your aides in this task.
I don't know how much you mix with other Pythonistas, but word "clear" is an exaggeration. From those who don't like it, the usual word is "ugly", though I've seen more vivid epithets, like "repulsive": https://mail.python.org/pipermail/python-list/2006-January/403480.html More cool-headed guys like me just call it "complete wastage of memory".
Aye.
The scholasticism of "there's only one way to do it" is getting old for this language. Have you already finished explaining to everyone why we needed assignment expressions, and why Python originally had % as a formatting operator, with some people swearing they'll keep not needing anything else? What's worse is that "there's only one way to do it" gets routinely misinterpreted as "One True Way (tm)". And where Python is deficient compared to other languages, there's a rising small-scale exceptionalism along the lines of "we don't have it, and - we don't need it!". The issue is that some (many) Python programmers use a lot of different languages, and treat Python first of all as a generic programming language, not as a bag of tricks of a particular implementation. And of course, there never will be agreement between the one-true-way-tm and nice-generic-languages factions of the community.
Another acute and beaten topic in the community. Python is a melting pot for diverse masses - beginners, greybeards, data scientists, scripting kiddies, PhD, web programmers, etc. That's one of the greatest achievements of Python, but also one of the pain points. I wonder how many people escaped from Python to just not be haunted by that "beginners" chanting. Python is beginners-friendly language, period, can't change that. Please don't bend it to be beginner-only. Please let people learn computer science inside Python, not learn bag of tricks to then escape in awe and make up haikus along the lines of: A language, originally for kids, Now for grown-up noobs. (Actual haiku seen on Reddit, sorry, can't find a link now, reproduced from memory, the original might have sounded better). [] -- Best regards, Paul mailto:pmiscml@gmail.com

Paul Sokolovsky wrote:
I would say the difference between this proposal so far and the ones listed are that they emphasized concrete, real-world examples from existing code either in the stdlib or "out in the wild", showing clear before and after benefits of the proposed syntax. It may not seem necessary to the person proposing the feature and it does take some time to research, but it creates a drastically stronger argument for the new feature. The code examples I've seen so far in the proposal have been mostly abstract or simple toy examples. To get a general idea, I'd recommend looking over the examples in their respective PEPs, and then try to do something similar in your own arguments.
While I agree that it's sometimes okay to go outside the strict bounds of "only one way to do it", there needs to be adequate justification for doing so which provides a demonstrable benefit in real-world code. So the default should be just having one way, unless we have a very strong reason to consider adding an alternative. This was the case for the features you mentioned above.
Considering the current widespread usage of Python in the software development industry and others, characterizing it as a language for "grown-up noobs" seems rather disingenuous (even if partially in jest). We emphasize readability and beginner-friendliness, but Python is very far from beginner-only and I don't think it's even reasonable to say that it's going in that direction. In some ways, it simplifies operations that would otherwise be more complicated, but that's largely the point of a high-level language: abstracting the complex and low-level parts to focus more on the core business logic. Also, while I can see that blindly relying on "str += part" can be sidestepping the underlying computer science to some degree, I find that appending the parts to a list and joining the elements is very conceptually similar to using a string buffer/builder; even if the syntax differs significantly from how other languages do it. Regarding the proposal in general though, I actually like the main idea of having "StringBuffer/StringBuilder"-like behavior, *assuming* it provides substantial benefits to alternative Python implementations compared to ``""join()``. As someone who regularly uses other languages with something similar, I find the syntax to be appealing, but not strong enough on its own to justify a stdlib version (mainly since a wrapper would be very trivial to implement). But, I'm against the idea of adding this to the existing StringIO class, largely for the reasons cited above, of it being outside of the scope of its intended use case. There's also a significant discoverability factor to consider. Based on the name and its use case in existing versions of Python, I don't think a substantial number of users will even consider using it for the purpose of building strings. 
As it stands, the only people who could end up benefiting from it would be the alternative implementations and their users, assuming they spend time *actively searching* for a way to build strings with reduced memory usage. So I would greatly prefer to see it as a separate class with a more informative name, even if it ends up being effectively implemented as a subset of StringIO with much of the same logic. For example:

buf = StringBuilder()  # feel free to bikeshed over the name
for part in parts:
    # in __iadd__, it would presumably call something like buf.append() or buf.write()
    buf += part
return str(buf)

This would be highly similar to existing string building classes in other popular languages, such as Java and C#. Also, on the point of memory usage: I'd very much like to see some real side-by-side comparisons of the ``''.join(parts)`` memory usage across Python implementations compared to ``StringIO.write()``. I saw some earlier in the thread, but the results were inaccurate since they relied entirely on ``sys.getsizeof()``, as mentioned earlier. IMO, having accurate memory benchmarks is critical to this proposal. As Chris Angelico mentioned, this can be observed through monitoring the before and after RSS (or equivalent on platforms without it). On Linux, I typically use something like this:

```
import os

def show_rss():
    os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
```

With the above in mind, I'm currently +0 on the proposal. It seems like it might be a reasonable overall idea, but the arguments for its benefits need to be much more concrete before I'm convinced. On Wed, Apr 1, 2020 at 5:45 PM Paul Sokolovsky <pmiscml@gmail.com> wrote:

On Wed, Apr 01, 2020 at 09:25:46PM -0400, Kyle Stanley wrote:
While I agree that it's sometimes okay to go outside the strict bounds of "only one way to do it"
The Zen of Python was invented as a joke, not holy writ, and as a series of koans intended to guide thought, not shut it down. Unfortunately, and with the greatest respect to Tim Peters, in practice that's not how it is used, *particularly* the "One Way" koan, which is almost invariably used as a thought-terminating cliche.

1. The Zen doesn't mandate *only one way*; that is a total canard about Python invented by the Perl community as a criticism.

2. Even if it did say "only one way", even a moment's glance at the language would show that it is not true. And moreover it *cannot* be true in any programming language. Given any task but the most basic, there will always be multiple possible implementations or algorithms, usually an *infinite* number of ways to do most things. (Not all of which will be efficient or sensible.)

3. Of all the koans in the Zen, the "One Way" koan is probably the most intended to be an ironic joke, not taken too seriously. Instead the Python community treats it as the most serious of all. In Tim Peters' own words:

In writing a line about "only one way to do it", I used a device (em dash) for which at least two ways to do it (with spaces, without spaces) are commonly used, neither of which is obvious -- and deliberately picked a third way just to rub it in.

https://bugs.python.org/issue3364

Let's look at what the koan actually says:

There should be one-- and preferably only one --obvious way to do it.

Adding emphasis: "There SHOULD BE ONE OBVIOUS WAY to do it." with only a *preference* for one way, not a hard rule. And given that Tim wrote it as a joke, having the koan intentionally go against its own advice, I think we should treat that preference as pretty low. So... what is "it", and what counts as "obvious"? This is where the koan is supposed to open our minds to new ideas, not shoot them down. In this case, "it" can be:

1. I want to build a string as efficiently as possible.
2.
I want to build a string in as easy and obvious a way as possible. (There may be other "its", but those are the two that stand out.) For option 1, there is one recommended way (which may or may not be the most efficient way -- that's a quality of implementation detail): use list plus join. But it's not "obvious" until you have been immersed in Python culture for a long time. For option 1, Paul's proposal changes nothing. If list+join is the fastest and most efficient method (I shall grant this for the sake of the argument) then nothing need change. Keep doing what you are doing. The koan isn't satisfied in this case, there is One Way but it isn't Obvious. But Paul's proposal is not about fixing that.

-----

For option 2, "it" cares more about readable, self-documenting code which is clear and obvious to more than just Pythonistas who have been immersed in the language for years. The beauty of Python is that it ought to be readable by everyone, including scientists and hobbyists who use the language from time to time, students, sys admins, and coders from other languages. Ask a beginner, or someone who has immigrated from another language, what the obvious way to build a string is, and very few of them will say "build a list, then call a string method to join the list". Some of them might guess that they need to build a list, then call a *list* method to build a string: `list.join('')`. Why Python doesn't do that is even a FAQ. Beginners will probably say "add the strings together". People coming from other OOP languages will probably say "Use a String Builder", and possibly even stumble across StringIO as the closest thing to a builder. It's a bit odd that you have to call "write", but it builds a string out of substrings. (Later, in another post, I will give evidence that StringIO is already used as a string builder, and has been for a long time.)
A significant sector of the community know the list+join idiom, but dislike it so strongly that they are willing to give up some efficiency to avoid it. Whatever the cause, there is a significant segment of the Python community who either don't know, don't care about, or actively dislike, the list+join idiom. For them, it is not Obvious and never will be, the Obvious Way is to concatenate strings into a String Builder or a bare string. This segment, the people who use string concatenation and either don't know better, don't care to change, or actively refuse to change, is the focus of this proposal. For this segment, the One Obvious Way is to concatenate strings using `+=`, and they aren't going to change for the sake of other interpreters. And that's a problem for other interpreters. Hence Paul's RFC. [...]
Surely the fact that the wrapper is "trivial" should count as a point in its favour, not against it? The greater the burden of an enhancement request, the greater the benefit it must give to justify it. If your enhancement requires a complete overhaul of the entire language and interpreter and will obsolete vast swathes of documentation, the benefit has to be very compelling to justify it. But if your enhancement requires an extra dozen lines of C code, one or two tests, and an extra couple of lines of documentation, the benefit can be correspondingly smaller in order for the cost:benefit ratio to come up in its favour. The cost here is tiny. This thread alone has probably exceeded by a factor of 100 the cost of implementing the change. The benefit to CPython is probably small, but to the broader Python ecosystem (Paul mentioned nine other interpreters, I can think of at least two actively maintained ones that he missed) it is rather larger.
As I mentioned above, in another post to follow I will demonstrate that people already do know and use StringIO for concatenation. Nevertheless, you do make a good point. It may be that StringIO is not the right place for this. That can be debated without dismissing the entire idea. -- Steven
participants (19)
- Andrew Barnert
- Brett Cannon
- C. Titus Brown
- Chris Angelico
- Christopher Barker
- David Mertz
- Eric V. Smith
- gstindianews.info@gmail.com
- Guido van Rossum
- jdveiga@gmail.com
- Joao S. O. Bueno
- Kyle Stanley
- M.-A. Lemburg
- Paul Moore
- Paul Sokolovsky
- Rhodri James
- Serhiy Storchaka
- Stephen J. Turnbull
- Steven D'Aprano