Re: [Python-ideas] Create a StringBuilder class and use it everywhere

The whole point is that people don't use the u''.join idiom simply because they don't know that += is slow. And when they see a StringBuilder, they can ask themselves "why is he using that?". I don't mind using u''.join, but it just doesn't make people think about speed at all. As far as I can see, the most common way people currently discover that += is slow is by running their code on PyPy (which doesn't have a hack like CPython's; CPython is still slow even with it) and asking "why is my PyPy code sooooo slow?". With StringBuilder used widely, that would not be the case.

On Thu, 2011-08-25 at 18:28 +0300, k_bx wrote:
I think a FAQ on "How can I make my Python program faster?", with suggestions such as building a list and calling ''.join on it for large strings instead of using +=, would be better. There probably already is one some place... Yep... http://wiki.python.org/moin/PythonSpeed/PerformanceTips This in my opinion is more about fitting the code to the problem than it is about speeding up general Python code. I once wrote a text comparison engine that solved cryptograms by comparing to a text source. A large text source was read into a dictionary of words to be compared to. At first it was quite slow, but presorting the data and putting it into smaller dictionaries sped up the program by several orders of magnitude. Cheers, Ron
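For readers who have not seen the two idioms side by side, here is a minimal sketch of what is being compared. The function names are made up for illustration, and plain str is used where the thread writes u''. Both functions return the same string; only the way the intermediate pieces are handled differs.

```python
def build_with_concat(pieces):
    """Repeated +=: each step may copy everything accumulated so far."""
    result = ""
    for piece in pieces:
        result += piece
    return result


def build_with_join(pieces):
    """Collect the pieces in a list, then join them in a single pass."""
    collected = []
    for piece in pieces:
        collected.append(piece)
    return "".join(collected)


words = ["spam", " ", "eggs", " ", "ham"]
print(build_with_concat(words) == build_with_join(words))
```

How much the difference matters depends on the interpreter and on how many pieces there are, which is the point being argued in the replies.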

On Thu, 25 Aug 2011 18:28:34 +0300 k_bx <k.bx@ya.ru> wrote:
I don't mind using u''.join, but it just doesn't make people think about speed at all.
Realistically, not many workloads have performance issues with string concatenation in the first place. So not caring is the right thing to do in most cases.
Different implementations having different performance characteristics is not totally unexpected, is it? (and I'm sure the PyPy developers wouldn't mind adding another hack) Regards Antoine.

On 2011-08-25, at 18:40 , Antoine Pitrou wrote:
Since Pypy does not use refcounting, it can't do that as a rule (it might be possible to handle it for a limited number of cases via escape analysis, proving there can be only one reference to the string, but I'd say there are more interesting things to use escape analysis for). Also, http://twitter.com/#!/alex_gaynor/status/104326041920749569
Wish CPython didn't contains hacks which make str += str faster, sometimes, depending on refcounting details :(

Le jeudi 25 août 2011 à 18:50 +0200, Masklinn a écrit :
Ah, you're right. However, PyPy has another (and much broader) set of optimizations available: http://codespeak.net/pypy/dist/pypy/doc/interpreter-optimizations.html#strin... Besides:
The CPython optimization itself works in a limited number of cases, because having a refcount of 1 essentially means it's a local variable, and because of the result having to be stored back immediately in the same local variable (otherwise you can't recycle the original object's storage). Regards Antoine.
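A rough sketch of the condition described above, with hypothetical function names. It only illustrates which code shapes could qualify for the in-place trick; whether a given CPython version actually applies it is an implementation detail and is not guaranteed. Both functions return the same result.

```python
def concat_single_reference(pieces):
    # 's' is a local variable and the only reference to the string, and the
    # result of += is stored straight back into 's'. This is the shape that
    # the CPython optimization can (but need not) handle in place.
    s = ""
    for piece in pieces:
        s += piece
    return s


def concat_with_alias(pieces):
    # 'alias' keeps a second reference to the old string alive, so its
    # storage cannot be recycled and each += has to build a new string.
    s = ""
    for piece in pieces:
        alias = s
        s += piece
    return s


data = ["x"] * 1000
print(concat_single_reference(data) == concat_with_alias(data))
```

On an implementation without reference counting, such as PyPy, neither shape can rely on this trick.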

Antoine Pitrou, 25.08.2011 18:58:
And its JIT could potentially just enable its string-join optimisation automatically when it sees that a variable holds a string, and is never being assigned to inside of a loop or sequence of operations except for the += operator. Any other operation on the string would then just turn it back into a normal string by joining it first. But this is seriously getting off-topic now. Stefan

k_bx, 25.08.2011 17:28:
I don't mind using u''.join, but it just doesn't make people think about speed at all.
When I see something like a StringBuilder, I guess the first thing I'd wonder about is why the programmer didn't just use StringIO() or even just ''.join(). That makes the code appear much more magic than it eventually turns out to be when looking closer. Plus, it's doomed to be slower, simply because it goes through more indirections. You may be right that using StringIO won't make people think about speed. Somebody who doesn't know it would likely go: "oh, that's nice - writing to a string as if it were a file - I get that". So it tells you what it does, instead of distracting you into thinking about performance implications. That's the beauty of it. Optimisations are just that: optimisations. They are orthogonal to what the code does - or at least they should be. Even string concatenation can perform just fine in many cases.
Sounds like yet another reason not to do it then. Seriously, there are hardly any language runtimes out there where continued string concatenation is efficient, let alone guaranteed to be so. You just shouldn't expect that it is. The optimisation in CPython was simply done because it *can* be done, so that simple cases (and stupid benchmarks) can continue to use simple concatenation and still be efficient. Well, in some cases at least. Stefan
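A small sketch of the StringIO approach mentioned above, assuming Python 3's io.StringIO (on Python 2 the class lived in the StringIO module). The function name render_settings and its input format are invented for this example.

```python
import io


def render_settings(rows):
    """Write key=value lines to an in-memory buffer as if it were a file."""
    buf = io.StringIO()
    for name, value in rows:
        buf.write(name)
        buf.write("=")
        buf.write(str(value))
        buf.write("\n")
    # getvalue() returns everything written so far as one string.
    return buf.getvalue()


print(render_settings([("host", "localhost"), ("port", 8080)]))
```

The code reads as "write to a file-like object", which is the readability argument made above; it says nothing by itself about speed.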

Wandering about, looking up statistics info for a program I was writing, I found a recommendation to add various useful 'special functions' to C's math library: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1069.pdf The arguments in that paper make a lot of sense to me, and apply well to Python. They came up with a good list, IMnsHO. I'd recommend implementing this list in some form as library functions in Python. Blindly copying wouldn't end up particularly 'Pythonic'; tweaking the API is required. Some of the selection choices, such as returning real only, ought to be reevaluated, for example. Obviously, any of the decisions to keep things C-like rather than object-oriented ought to shift, as well. Function names are only important as far as they are clear. I suggest naming per <general category><specific case> e.g. distribution_t(), or dist_F(), and include modification for algebraic order, as well, so gamma() and log_gamma(). That said, anything clear is fine. Thoughts on the matter? I noticed that the math library in 2.7+ added the gamma and log(gamma) functions, already, which was nice. Obviously, most, if not all, are already present in extension modules such as NumPy, but there is value in having these things built into the language. "Batteries included," and all that. By the by, if that is far too much for one suggestion, then please just treat this as a suggestion to add just the incomplete beta function. (P-values for binomial, F, and t are all nice, too, though with inc. beta, they aren't terrible to generate. I really think they should be included in the standard library.) -Nate
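To make the "with inc. beta, they aren't terrible to generate" remark concrete, here is a rough pure-Python sketch, not a proposed API. It assumes the textbook continued-fraction expansion of the regularized incomplete beta function (evaluated with the modified Lentz method) on top of math.lgamma; the names betainc, t_two_sided_p and binom_upper_tail are hypothetical, and nothing here has been tuned for extreme arguments.

```python
import math


def _betacf(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    if abs(d) < tiny:
        d = tiny
    d = 1.0 / d
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        even = m * (b - m) * x / ((qam + m2) * (a + m2))
        odd = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        for coeff in (even, odd):
            d = 1.0 + coeff * d
            if abs(d) < tiny:
                d = tiny
            c = 1.0 + coeff / c
            if abs(c) < tiny:
                c = tiny
            d = 1.0 / d
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def betainc(a, b, x):
    """Regularized incomplete beta function I_x(a, b) for a, b > 0."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log1p(-x))
    front = math.exp(log_front)
    # Use the symmetry I_x(a, b) = 1 - I_{1-x}(b, a) where it converges faster.
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b


def t_two_sided_p(t, df):
    """Two-sided p-value for Student's t with df degrees of freedom."""
    return betainc(df / 2.0, 0.5, df / (df + t * t))


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    return betainc(k, n - k + 1, p)


print(t_two_sided_p(2.0, 2), binom_upper_tail(1, 2, 0.5))
```

The F distribution can be expressed through the same betainc call in a similar way, which is the sense in which one special function serves as a building block for several p-values.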

You didn't get any responses AFAICT. That doesn't mean nobody is interested -- perhaps your proposal is simply too general? Do you feel up to making some more specific recommendations about the exact list of functions to add? It's easier to criticize a concrete proposal. Do you feel up to producing a patch that just adds the incomplete beta function? --Guido On Tue, Aug 30, 2011 at 8:46 AM, Spectral One <ghostwriter402@gmail.com> wrote:
-- --Guido van Rossum (python.org/~guido)

On 8/31/11 2:05 PM, Guido van Rossum wrote:
It shows up deeply mis-threaded under "Create a StringBuilder class and use it everywhere" in my client. Perhaps Spectral One should try reposting it so that it shows up as a new thread. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco

On Wed, Aug 31, 2011 at 3:19 PM, Robert Kern <robert.kern@gmail.com> wrote:
What client is that? In my GMail (for once) it shows up as a new thread with subject "Re: [Python-ideas] Expanding statistical functions in Python's std. lib." I guess your client got confused by some of these headers: References: <549901314286114@web119.yandex.ru> <4E56E859.3090504@canterbury.ac.nz> In-Reply-To: <4E56E859.3090504@canterbury.ac.nz> -- --Guido van Rossum (python.org/~guido)

On 8/31/11 5:23 PM, Guido van Rossum wrote:
Thunderbird, via GMane, which may or may not be adding more confusion to the mix.
And indeed, it shows up threaded under Greg Ewing's Aug 25 post to the StringBuilder thread. Email threading is something of an art, but I'm not sure it's right to say that my client is getting "confused" by taking the In-Reply-To header at its word. ;-) Anyways, that's why I suspect he's not getting many responses. As to the substance of the proposal, I'm -0 on having the full complement of statistical distribution functions and +0 on adding just the incomplete beta function. Personally, I will never use any of them since I can get them from scipy. I am at least going to be using numpy to generate any of the test statistics that I would pass through these functions. I don't see anything particularly compelling about having them in the math module as opposed to a third party module (be it scipy or something lighter-weight). That said, having a good complement of common special functions that can be used to build up a variety of less-common functions is a good thing to have in a standard library. I think you could defend adding the incomplete beta function on that principle, if nothing else. You could make a similar argument for the Bessel functions j0(), j1() and jn(). -- Robert Kern

On 8/31/11 11:39 PM, David Townshend wrote:
This has probably been thought about before, but why not include numpy in stdlib?
Well, it's not particularly germane to this thread since most of the requested functions exist in scipy but not numpy. In any case, numpy is too large, too C, and too actively developed to be a part of the standard library at this time. -- Robert Kern

I think this is very Pythonic. On blogs describing Python, one of the features they list is "batteries included". Statistical functions would be great. I give it +1.

On Tue, Aug 30, 2011 at 11:46 AM, Spectral One <ghostwriter402@gmail.com> wrote:
I'm not sure that many people who could get tons of use out of statistical functions don't already have cause to be using numpy/scipy. It would certainly be unfortunate if having a little more statistics functionality in the stdlib discouraged people who should be using numpy from doing so. "Batteries included" has always been a bit of an oversell, and as a Python user I don't have any expectation of being able to do fairly-specialized work without third-party modules, nor do I think it's necessarily a net gain for Python if I could. -0 Mike

participants (14)
- Amaury Forgeot d'Arc
- Antoine Pitrou
- Christopher King
- David Townshend
- Ethan Furman
- Greg Ewing
- Guido van Rossum
- k_bx
- Masklinn
- Mike Graham
- Robert Kern
- ron3200
- Spectral One
- Stefan Behnel