Mailman 3 Create a StringBuilder class and use it everywhere - Python-ideas

Create a StringBuilder class and use it everywhere

k_bx

Aug. 25, 2011

9:28 a.m.

Hi!

There's a certain problem right now in Python: when people need to build a string from pieces, they very often do something like this::

    def main_pure():
        b = u"initial value"
        for i in xrange(30000):
            b += u"more data"
        return b

The bad thing about it is that a new string is created every time you do +=, so it performs badly on CPython (and horribly on PyPy). If people used, for example, a list of strings, performance would be much better::

    def main_list_append():
        b = [u"initial value"]
        for i in xrange(3000000):
            b.append(u"more data")
        return u"".join(b)

The results are::

    kost@kost-laptop:~/tmp$ time python string_bucket_pure.py
    real 0m7.194s
    user 0m3.590s
    sys 0m3.580s
    kost@kost-laptop:~/tmp$ time python string_bucket_append.py
    real 0m0.417s
    user 0m0.330s
    sys 0m0.080s

Fantastic, isn't it?

Also, now let's forget about speed and think about semantics a little. Your task is to "build a string from its pieces", or in other words "build a string from a list of pieces", so from this point of view you can say that using [] and u"".join is semantically better.

Java has had its StringBuilder class for a long time (I'm not really into Java, I've just been told about that), and I think that Python should have its own StringBuilder::

    class StringBuilder(object):
        """Use it instead of doing += for building unicode strings from pieces"""
        def __init__(self, val=u""):
            self.val = val
            self.appended = []

        def __iadd__(self, other):
            self.appended.append(other)
            return self

        def __unicode__(self):
            self.val = u"".join((self.val, u"".join(self.appended)))
            self.appended = []
            return self.val

Why a StringBuilder class, and not just [] + u"".join? Well, I have two reasons for that:

1. It has caching.
2. You can document it. When a programmer looks at the [] + u"" method he doesn't see _WHY_ it is done that way, while when he sees the StringBuilder class he can go ahead and read its help().

Performance of StringBuilder is ok compared to [] + u"" (I've increased the number of += from 30000 to 30000000)::

    def main_bucket():
        b = StringBuilder(u"initial value ")
        for i in xrange(30000000):
            b += u"more data"
        return unicode(b)

For CPython::

    kost@kost-laptop:~/tmp$ time python string_bucket_bucket.py
    real 0m12.944s
    user 0m11.670s
    sys 0m1.260s
    kost@kost-laptop:~/tmp$ time python string_bucket_append.py
    real 0m3.540s
    user 0m2.830s
    sys 0m0.690s

For PyPy 1.6::

    (pypy)kost@kost-laptop:~/tmp$ time python string_bucket_bucket.py
    real 0m18.593s
    user 0m12.930s
    sys 0m5.600s
    (pypy)kost@kost-laptop:~/tmp$ time python string_bucket_append.py
    real 0m16.214s
    user 0m11.750s
    sys 0m4.280s

Of course, a C implementation could be done to make things faster for CPython, I guess, but really, in comparison to the += method it doesn't matter now. It's done to be explicit.

p.s.: also, why not use cStringIO?

1. It's not semantically right to create a file-like string just to join multiple string pieces into one.
2. If you talk about using it in your code right away: you can see that no one uses it, because people want += (while with StringBuilder you give them +=).
3. It's somehow slow on PyPy right now :-)

Thanks.
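For readers on Python 3, the two idioms in this post can be compared with a small timeit sketch. The function names `concat_plus_equals` and `concat_join` and the piece count `N` are illustrative assumptions, not the scripts behind the timings quoted in the post, and the numbers will vary by interpreter, version and platform.

```python
import timeit

PIECE = "more data"
N = 30000  # hypothetical piece count, chosen only to keep the run short


def concat_plus_equals(n=N):
    """Build the string with repeated += (can degrade to quadratic time)."""
    b = "initial value"
    for _ in range(n):
        b += PIECE
    return b


def concat_join(n=N):
    """Collect the pieces in a list and join them once at the end."""
    pieces = ["initial value"]
    for _ in range(n):
        pieces.append(PIECE)
    return "".join(pieces)


if __name__ == "__main__":
    for func in (concat_plus_equals, concat_join):
        seconds = timeit.timeit(func, number=10)
        print("%s: %.2f ms per call" % (func.__name__, seconds / 10 * 1e3))
```

Both functions return the same string, so only the timing differs between them.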


M.-A. Lemburg

August 2011

9:45 a.m.

k_bx wrote:

...

I think you should use cStringIO in your class implementation. The list + join idiom is nice, but it has the disadvantage of creating and keeping alive many small string objects (with all the memory overhead and fragmentation that goes along with it). AFAIR, the most efficient approach is using arrays:

...

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 25 2011)

...

2011-10-04: PyCon DE 2011, Leipzig, Germany 40 days to go ::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/

k_bx

9:57 a.m.

25.08.2011, 12:45, "M.-A. Lemburg" <mal@egenix.com>:

...

I think you should use cStringIO in your class implementation. The list + join idiom is nice, but it has the disadvantage of creating and keeping alive many small string objects (with all the memory overhead and fragmentation that goes along with it).

AFAIR, the most efficient approach is using arrays:

    >>> import array
    >>> t = array.array('u')
    >>> t.extend(u'äöü')
    >>> t
    array('u', u'\xe4\xf6\xfc')
    >>> t.tounicode()
    u'\xe4\xf6\xfc'

I'm perfectly ok with a different implementation of StringBuilder, but the main idea and proposal here is to get it into the standard library somehow and to force (and promote) use of it everywhere, maybe write some FAQ. So that when you see some new += code, all you need is to go and fix it without worrying about complaints :-D

M.-A. Lemburg

10:19 a.m.

k_bx wrote:

...

I guess adding something like this to string.py would be worth exploring. It's a very common use case and the list idiom doesn't read well in practice.

-- Marc-Andre Lemburg

...

Larry Hastings

10:34 a.m.

On 08/25/2011 03:19 AM, M.-A. Lemburg wrote:

...

I think the right place to do this is inside Python itself. I proposed something to do that several years ago, and have been meaning to revive it.

http://bugs.python.org/issue1569040

/larry/

Dirkjan Ochtman

8:35 a.m.

On Thu, Aug 25, 2011 at 11:45, M.-A. Lemburg <mal@egenix.com> wrote:

...

AFAIK using cStringIO just for string building is much slower than using list.append() + join(). IIRC we tested some micro-benchmarks on this for Mercurial output (where it was a significant part of the profile for some commands). That was on Python 2, of course, it may be better in io.StringIO and/or Python 3. Cheers, Dirkjan

M.-A. Lemburg

9:27 a.m.

Dirkjan Ochtman wrote:

...

Turns out you're right (list.append must have gotten a lot faster since I last tested this years ago, or I simply misremembered the results).

...

Here's the Python2 code:

"""
TIMEIT_N = 10
N = 1000000
SIZES = (2, 10, 23, 30, 33, 22, 15, 16, 27)
N_STRINGS = len(SIZES)
STRINGS = ['x' * SIZES[i] for i in range(N_STRINGS)]
REFERENCE = ''.join(STRINGS[i % N_STRINGS] for i in xrange(N))

def cstringio():
    import cStringIO
    s = cStringIO.StringIO()
    write = s.write
    for i in xrange(N):
        write(STRINGS[i % N_STRINGS])
    result = s.getvalue()
    assert result == REFERENCE

def array():
    import array
    s = array.array('c')
    write = s.fromstring
    for i in xrange(N):
        write(STRINGS[i % N_STRINGS])
    result = s.tostring()
    assert result == REFERENCE

def listappend():
    l = []
    append = l.append
    for i in xrange(N):
        append(STRINGS[i % N_STRINGS])
    result = ''.join(l)
    assert result == REFERENCE

if __name__ == '__main__':
    import sys, timeit
    for test in sys.argv[1:]:
        print 'Running test %s ...' % test
        t = timeit.timeit('%s()' % test,
                          'from __main__ import %s' % test,
                          number=TIMEIT_N)
        print ' %.2f ms' % (t / TIMEIT_N * 1e3)
"""

Aside: For some reason cStringIO and array got slower in Python 2.7.

-- Marc-Andre Lemburg

...

Masklinn

9:44 a.m.

On 2011-08-29, at 11:27 , M.-A. Lemburg wrote:

...

Converting your code straight to bytes (so array still works) yields this on Python 3.2.1:

    > python3.2 timetest.py io array listappend
    Running test io ...
     334.03 ms
    Running test array ...
     776.66 ms
    Running test listappend ...
     314.90 ms

For string (excluding array):

    > python3.2 timetest.py io listappend
    Running test io ...
     451.45 ms
    Running test listappend ...
     356.39 ms
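A Python 3 port of the str variant of this benchmark might look roughly like the sketch below. It is an assumption about the shape of such a port, not the script that produced the numbers in this message; `N` is reduced to keep it quick and the test name `stringio` is made up here.

```python
import io
import timeit

TIMEIT_N = 10
N = 100000  # smaller than the 1000000 of the Python 2 script, to keep runs short
SIZES = (2, 10, 23, 30, 33, 22, 15, 16, 27)
STRINGS = ['x' * size for size in SIZES]
N_STRINGS = len(STRINGS)
REFERENCE = ''.join(STRINGS[i % N_STRINGS] for i in range(N))


def stringio():
    """Write every piece into an in-memory text buffer."""
    s = io.StringIO()
    write = s.write
    for i in range(N):
        write(STRINGS[i % N_STRINGS])
    return s.getvalue()


def listappend():
    """Append every piece to a list and join once."""
    pieces = []
    append = pieces.append
    for i in range(N):
        append(STRINGS[i % N_STRINGS])
    return ''.join(pieces)


if __name__ == '__main__':
    for test in (stringio, listappend):
        t = timeit.timeit(test, number=TIMEIT_N)
        print('Running test %s ...' % test.__name__)
        print(' %.2f ms' % (t / TIMEIT_N * 1e3))
```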

M.-A. Lemburg

10:25 a.m.

Masklinn wrote:

...

Unicode works with the array module as well. Just use 'u' as the array code and replace fromstring/tostring with fromunicode/tounicode.

In any case, the array module approach appears to be the slowest of all three tests.

-- Marc-Andre Lemburg

...

Antoine Pitrou

12:40 p.m.

On Mon, 29 Aug 2011 11:27:23 +0200 "M.-A. Lemburg" <mal@egenix.com> wrote:

...

The join() idiom only does one big copy at the end, while the StringIO/BytesIO idiom copies at every resize (unless the memory allocator is very smart). Both are O(N), but the join() version does fewer copies and (re)allocations. (There are also the list resizings, but that object is much smaller.)

Regards

Antoine.

k.bx＠ya.ru

4:04 p.m.

29.08.11, 15:43, "Antoine Pitrou" <solipsis@pitrou.net>:

...

Ok, so I think the best approach would be to implement it via join + [], but flush every 1000 operations, since that can save memory.

As for the whole idea: I still think that creating something like this and adding it to the stdlib (with an __iadd__ and .append() API, so that refactoring existing code means changing only one line, such as wrapping the initial value in StringBuilder(u"Foo")) and documenting it would be super cool. So who has the final say on this?
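One way to read the "flush every 1000 ops" idea is sketched below for Python 3. The class name `FlushingStringBuilder` and the `flush_every` parameter are hypothetical, introduced only for illustration; nothing like this exists in the standard library.

```python
class FlushingStringBuilder:
    """Hypothetical builder: list + join, collapsing the list periodically."""

    def __init__(self, val="", flush_every=1000):
        self._pieces = [val]
        self._flush_every = flush_every  # assumed threshold, not a tuned value

    def append(self, piece):
        self._pieces.append(piece)
        # Once the list grows past the threshold, collapse it into a single
        # string so that many tiny string objects are not kept alive.
        if len(self._pieces) > self._flush_every:
            self._pieces = ["".join(self._pieces)]
        return self

    def __iadd__(self, piece):
        # Lets existing "b += piece" code keep working unchanged.
        return self.append(piece)

    def __str__(self):
        val = "".join(self._pieces)
        self._pieces = [val]  # cache the joined result
        return val


b = FlushingStringBuilder("Foo")
b += "Bar"
b.append("Baz")
print(b)  # FooBarBaz
```

Each flush copies everything accumulated so far, so this design trades some extra copying for a shorter list of live objects.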

Antoine Pitrou

4:10 p.m.

On Monday 29 August 2011 at 19:04 +0300, k.bx@ya.ru wrote:

...

Ok, so I think the best approach would be to implement via join + [], but do flush every 1000 ops, since it can save memory.

That approach (or a similar one) could actually be integrated into StringIO and BytesIO. As long as you only write() at the end of the in-memory object, there's no need to actually concatenate. And it would be much easier (and less impacting on C extension code) to implement that approach in the StringIO and BytesIO objects, than in the bytes and str types as Larry did. Regards Antoine.

Carl Matthew Johnson

9:53 a.m.

Interesting semantics… What version of Python were you using? The current documentation has this to say:

    CPython implementation detail: If s and t are both strings, some Python implementations such as CPython can usually perform an in-place optimization for assignments of the form s = s + t or s += t. When applicable, this optimization makes quadratic run-time much less likely. This optimization is both version and implementation dependent. For performance sensitive code, it is preferable to use the str.join() method which assures consistent linear concatenation performance across versions and implementations.

    Changed in version 2.4: Formerly, string concatenation never occurred in-place.

<http://docs.python.org/library/stdtypes.html>

It's my understanding that the naïve approach should now have performance comparable to the "proper" list append technique as long as you use CPython >2.4.

-- Carl Johnson

k_bx

9:56 a.m.

25.08.2011, 12:53, "Carl Matthew Johnson" <cmjohnson.mailinglist@gmail.com>:

...

I use cpython 2.7 that comes with Ubuntu Natty with latest updates.

Masklinn

10:01 a.m.

On 2011-08-25, at 11:53 , Carl Matthew Johnson wrote:

...

Steven D'Aprano

1:57 p.m.

Carl Matthew Johnson wrote:

...

Relying on that is a bad idea. It is not portable from CPython to any other Python (none of IronPython, Jython or PyPy can include that optimization), it also depends on details of the memory manager used by your operating system (what is fast on one computer can be slow on another), and it doesn't even work under all circumstances (it relies on the string having exactly one reference as well as the exact form of the concatenation).

Here's a real-world example of how the idiom of repeated string concatenation goes bad:

http://www.mail-archive.com/pypy-dev@python.org/msg00682.html

Here's another example, from a few years back, where part of the standard library using string concatenation was *extremely* slow under Windows. Linux users saw no slowdown and it was very hard to diagnose the problem:

http://www.mail-archive.com/python-dev@python.org/msg40692.html

-- Steven
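The "exactly one reference" condition can be probed with a sketch like the following for Python 3. The function names are hypothetical, and whether any timing difference shows up at all depends on the interpreter and version, so no expected timings are given.

```python
import timeit


def single_reference(n):
    s = ""
    for _ in range(n):
        s += "x"  # s is the only reference, so CPython may resize in place
    return s


def extra_reference(n):
    s = ""
    for _ in range(n):
        alias = s  # a second reference to the same string object
        s += "x"   # with the alias alive, the in-place shortcut should not apply
    return s


if __name__ == "__main__":
    for func in (single_reference, extra_reference):
        seconds = timeit.timeit(lambda: func(10000), number=3)
        print("%s: %.2f ms per call" % (func.__name__, seconds / 3 * 1e3))
```

Both functions build the same string; the second only adds a harmless-looking alias, which is the kind of change that can silently take code off the fast path.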

k_bx

10:38 a.m.

25.08.2011, 12:28, "k_bx" <k.bx@ya.ru>:

...

Oh, and also, I really like how Python has had its MutableString class since forever, but it is deprecated in Python 3.

Georg Brandl

10:50 a.m.

On 25.08.2011 12:38, k_bx wrote:

...

Oh, and also, I really like how Python has had its MutableString class since forever, but it is deprecated in Python 3.

You do realize that MutableString's __iadd__ just performs += on str operands? Georg

k_bx

10:55 a.m.

25.08.2011, 13:50, "Georg Brandl" <g.brandl@gmx.net>:

...

Oh, I'm sorry, I thought it uses cStringIO internally. Let's forget about MutableString then.

Terry Reedy

3:41 p.m.

On 8/25/2011 6:38 AM, k_bx wrote:

...

Oh, and also, I really like how Python has had its MutableString class since forever, but it is deprecated in Python 3.

(removed, I presume you mean...) and added bytearray. I have no idea if += on such is any better than O(n*n).

-- Terry Jan Reedy

Antoine Pitrou

4:35 p.m.

On Thu, 25 Aug 2011 11:41:11 -0400 Terry Reedy <tjreedy@udel.edu> wrote:

...

On bytearray? Yes, it is. It's a similar algorithm as lists, and therefore O(total length) amortized. Regards Antoine.
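As a minimal Python 3 sketch of what that looks like in practice, bytes can be accumulated in a bytearray with += and converted once at the end. The helper name `build_bytes` is an assumption made for illustration.

```python
def build_bytes(pieces):
    """Accumulate bytes-like pieces in a mutable buffer, then freeze it."""
    buf = bytearray()
    for piece in pieces:
        buf += piece  # extends the buffer in place rather than rebuilding it
    return bytes(buf)


print(build_bytes([b"initial value", b"more data", b"more data"]))
```

Note that this applies to bytes only; str has no comparable mutable counterpart among the built-in types.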

Antoine Pitrou

11:36 a.m.

On Thu, 25 Aug 2011 12:28:14 +0300 k_bx <k.bx@ya.ru> wrote:

...

And Python has io.StringIO. I don't think we need to reinvent the wheel under another name. http://docs.python.org/library/io.html#io.StringIO By the way, when prototyping snippets for the purpose of demonstrating new features, you should really use Python 3, because Python 2 is in bugfix-only mode. (same applies to benchmark results, actually) Regards Antoine.

Nick Coghlan

11:47 a.m.

If the join idiom really bothers you...

    import io

    def build_str(iterable):
        # Essentially ''.join, just with str() coercion
        # and less memory fragmentation
        target = io.StringIO()
        for item in iterable:
            target.write(str(item))
        return target.getvalue()

    # Caution: decorator abuse ahead
    # I'd prefer this to a StringBuilder class, though :)
    def gen_str(g):
        return build_str(g())

...

Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Mike Graham

12:31 p.m.

On Thu, Aug 25, 2011 at 5:28 AM, k_bx <k.bx@ya.ru> wrote:

...

This doesn't seem nicer to read and write to me than the list form. I also do not see any reason to believe it will stop people from doing it the quadratic way if the ubiquitous make-a-list-then-join idiom does not. Mike

Steven D'Aprano

2 p.m.

Mike Graham wrote:

...

Agreed. Just because the Java idiom is StringBuilder doesn't mean Python should ape it. Python already has a "build strings efficiently" idiom:

    ''.join(iterable_of_strings)

If people can't, or won't, learn this idiom, why would they learn to use StringBuilder instead?

-- Steven

Stefan Behnel

3:15 p.m.

Steven D'Aprano, 25.08.2011 16:00:

...

Plus, StringBuilder is only a special case. Joining a string around other delimiters is straightforward once you've learned about ''.join(). Doing the same with StringBuilder is non-trivial (as the Java example nicely shows).

Stefan
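To illustrate the point about delimiters, here is a short Python 3 sketch; the field names are made up for the example.

```python
fields = ["name", "email", "created"]  # hypothetical data

# The same idiom covers any delimiter, not just the empty string.
header = ", ".join(fields)
lines = "\n".join("%s=%d" % (name, i) for i, name in enumerate(fields))

print(header)  # name, email, created
print(lines)
```

With a += style builder, the caller has to handle the "no delimiter before the first item" case by hand.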

Arnaud Delobelle

2:02 p.m.

On 25 August 2011 13:31, Mike Graham <mikegraham@gmail.com> wrote:

...

+1 -- Arnaud

Terry Reedy

3:24 p.m.

On 8/25/2011 5:28 AM, k_bx wrote:

...

I do not see the need to keep the initial piece separate and do the double join. For Py3

    class StringBuilder(object):
        """Use it instead of doing += for building unicode strings from pieces"""
        def __init__(self, val=""):
            self.pieces = [val]
        def __iadd__(self, item):
            self.pieces.append(item)
            return self
        def __str__(self):
            val = "".join(self.pieces)
            self.pieces = [val]
            return val

    s = StringBuilder('a')
    s += 'b'
    s += 'c'
    print(s)
    s += 'd'
    print(s)

    >>>
    abc
    abcd

I am personally happy enough with [].append, but I can see the attraction of += if doing many separate lines rather than .append within a loop. -- Terry Jan Reedy

Antoine Pitrou

12:45 a.m.

New subject: Performance of the "".join() idiom

For the record, the "".join() idiom also has its downsides. If you build a list of many tiny strings, memory consumption can grow beyond what is reasonable (in one case, building a 600MB JSON string outgrew the RAM of an 8GB machine). One solution is to regularly accumulate the primary list into a secondary accumulation list, as done in http://hg.python.org/cpython/rev/47176e8d7060

Regards

Antoine.

On Thu, 25 Aug 2011 12:28:14 +0300 k_bx <k.bx@ya.ru> wrote:

...
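The two-level accumulation described in this message can be sketched generically as follows (Python 3). This is an illustration under assumptions, not the code from the linked changeset; the function name `join_in_chunks` and the `chunk_size` default are hypothetical.

```python
def join_in_chunks(pieces, chunk_size=1000):
    """Join many small strings while keeping the list of live objects short."""
    primary = []    # small pieces collected since the last compaction
    secondary = []  # larger strings, one per compacted chunk
    for piece in pieces:
        primary.append(piece)
        if len(primary) >= chunk_size:
            # Fold the small pieces into one string and start a fresh list,
            # so the tiny objects can be released early.
            secondary.append("".join(primary))
            primary = []
    secondary.append("".join(primary))
    return "".join(secondary)


print(join_in_chunks((str(i) for i in range(12)), chunk_size=5))  # 01234567891011
```

Each character is copied twice (once into a chunk, once into the result), so the total work stays linear.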

M.-A. Lemburg

August 2011

9:45 a.m.

New subject: Create a StringBuilder class and use it everywhere

k_bx wrote:

...

Hi!

There's a certain problem right now in python that when people need to build string from pieces they really often do something like this::

def main_pure(): b = u"initial value" for i in xrange(30000): b += u"more data" return b

The bad thing about it is that new string is created every time you do +=, so it performs bad on CPython (and horrible on PyPy). If people would use, for example, list of strings it would be much better (performance)::

def main_list_append(): b = [u"initial value"] for i in xrange(3000000): b.append(u"more data") return u"".join(b)

The results are::

kost@kost-laptop:~/tmp$ time python string_bucket_pure.py

real 0m7.194s user 0m3.590s sys 0m3.580s kost@kost-laptop:~/tmp$ time python string_bucket_append.py

real 0m0.417s user 0m0.330s sys 0m0.080s

Fantastic, isn't it?

Also, now let's forget about speed and think about semantics a little: your task is: "build a string from it's pieces", or in other words "build a string from list of pieces", so from this point of view you can say that using [] and u"".join is better in semantic way.

Java has it's StringBuilder class for a long time (I'm not really into java, I've just been told about that), and what I think is that python should have it's own StringBuilder::

class StringBuilder(object): """Use it instead of doing += for building unicode strings from pieces""" def __init__(self, val=u""): self.val = val self.appended = []

def __iadd__(self, other): self.appended.append(other) return self

def __unicode__(self): self.val = u"".join((self.val, u"".join(self.appended))) self.appended = [] return self.val

Why StringBuilder class, not just use [] + u''.join ? Well, I have two reasons for that:

1. It has caching 2. You can document it, because when programmer looks at [] + u"" method he doesn't see _WHY_ is it done so, while when he sees StringBuilder class he can go ahead and read it's help().

Performance of StringBuilder is ok compared to [] + u"" (I've increased number of += from 30000 to 30000000):

def main_bucket(): b = StringBuilder(u"initial value ") for i in xrange(30000000): b += u"more data" return unicode(b)

For CPython::

kost@kost-laptop:~/tmp$ time python string_bucket_bucket.py

real 0m12.944s user 0m11.670s sys 0m1.260s

kost@kost-laptop:~/tmp$ time python string_bucket_append.py

real 0m3.540s user 0m2.830s sys 0m0.690s

For PyPy 1.6::

    (pypy)kost@kost-laptop:~/tmp$ time python string_bucket_bucket.py

    real 0m18.593s
    user 0m12.930s
    sys 0m5.600s

    (pypy)kost@kost-laptop:~/tmp$ time python string_bucket_append.py

    real 0m16.214s
    user 0m11.750s
    sys 0m4.280s

Of course, a C implementation could be done to make things faster on CPython, I guess, but really, in comparison to the += method it doesn't matter now. It's done to be explicit.

p.s.: also, why not use cStringIO?

1. It's not semantically right to create a file-like string just to join multiple string pieces into one.
2. If you talk about using it in your code right away: you can see that no one uses it, because people want += (while StringBuilder gives them +=).
3. It's somehow slow on PyPy right now :-)
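For comparison, the file-like approach looks roughly like the sketch below. It uses `io.StringIO`, which exists in both Python 2.6+ and Python 3, instead of the Python 2-only `cStringIO`; that substitution and the function name `main_stringio` are assumptions of this sketch.

```python
import io


def main_stringio(n):
    # Write the pieces into an in-memory text buffer, then read it back once.
    buf = io.StringIO()
    buf.write(u"initial value")
    for _ in range(n):
        buf.write(u"more data")
    return buf.getvalue()


print(main_stringio(2))  # initial valuemore datamore data
```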

...

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 25 2011)

...

k_bx

9:57 a.m.

New subject: Create a StringBuilder class and use it everywhere

25.08.2011, 12:45, "M.-A. Lemburg" <mal@egenix.com>:

...

k_bx wrote:

...

I think you should use cStringIO in your class implementation. The list + join idiom is nice, but it has the disadvantage of creating and keeping alive many small string objects (with all the memory overhead and fragmentation that goes along with it).

AFAIR, the most efficient approach is using arrays:

>>> import array
>>> t = array.array('u')
>>> t.extend(u'äöü')
>>> t
array('u', u'\xe4\xf6\xfc')
>>> t.tounicode()
u'\xe4\xf6\xfc'

-- Marc-Andre Lemburg eGenix.com


M.-A. Lemburg

10:19 a.m.

New subject: Create a StringBuilder class and use it everywhere

k_bx wrote:

...

I'm perfectly OK with a different implementation of StringBuilder, but the main idea and proposal here is to get it into the standard library somehow and to force (and promote) its use everywhere, maybe write some FAQ. Then, when you see some new += code, all you need is to go and fix it without worrying about complaints :-D

...

Larry Hastings

10:34 a.m.

New subject: Create a StringBuilder class and use it everywhere

On 08/25/2011 03:19 AM, M.-A. Lemburg wrote:

...

I think the right place to do this is inside Python itself. I proposed something to do that several years ago and have been meaning to revive it: http://bugs.python.org/issue1569040

/larry/

Dirkjan Ochtman

8:35 a.m.

New subject: Create a StringBuilder class and use it everywhere

On Thu, Aug 25, 2011 at 11:45, M.-A. Lemburg <mal@egenix.com> wrote:

...

M.-A. Lemburg

9:27 a.m.

New subject: Create a StringBuilder class and use it everywhere

Dirkjan Ochtman wrote:

...

Turns out you're right (list.append must have gotten a lot faster since I last tested this years ago, or I simply misremembered the results).

...

Masklinn

August 2011

9:44 a.m.

New subject: Create a StringBuilder class and use it everywhere

On 2011-08-29, at 11:27 , M.-A. Lemburg wrote:

...

M.-A. Lemburg

10:25 a.m.

New subject: Create a StringBuilder class and use it everywhere

Masklinn wrote:

...

Antoine Pitrou

12:40 p.m.

New subject: Create a StringBuilder class and use it everywhere

On Mon, 29 Aug 2011 11:27:23 +0200 "M.-A. Lemburg" <mal@egenix.com> wrote:

...

k.bx＠ya.ru

4:04 p.m.

New subject: Create a StringBuilder class and use it everywhere

29.08.11, 15:43, "Antoine Pitrou" <solipsis@pitrou.net>:

...

Antoine Pitrou

4:10 p.m.

On Monday, 29 August 2011 at 19:04 +0300, k.bx@ya.ru wrote:

...

OK, so I think the best approach would be to implement it via join + [], but flush every 1000 ops, since that can save memory.
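A minimal sketch of that idea, written for Python 3, could look as follows. The class name `FlushingStringBuilder`, the `flush` method and the `flush_every` parameter are hypothetical names introduced here, not an agreed design.

```python
class FlushingStringBuilder(object):
    """List + join builder that folds pending pieces into one string periodically."""

    def __init__(self, val=u"", flush_every=1000):
        self.val = val                  # text joined so far
        self.appended = []              # pieces not yet joined
        self.flush_every = flush_every  # hypothetical parameter name

    def flush(self):
        # Join the pending pieces into self.val and drop the small objects.
        if self.appended:
            self.val = u"".join([self.val] + self.appended)
            self.appended = []

    def __iadd__(self, other):
        self.appended.append(other)
        if len(self.appended) >= self.flush_every:
            self.flush()
        return self

    def __str__(self):
        self.flush()
        return self.val


b = FlushingStringBuilder(u"initial value ", flush_every=1000)
for i in range(2500):
    b += u"x"
print(len(b.appended))  # 500
print(len(str(b)))      # 2514
```

Each flush copies the text built so far, so a smaller `flush_every` keeps fewer small strings alive at the cost of more copying.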

Carl Matthew Johnson

9:53 a.m.

New subject: Create a StringBuilder class and use it everywhere

k_bx

August 2011

9:56 a.m.

New subject: Create a StringBuilder class and use it everywhere

25.08.2011, 12:53, "Carl Matthew Johnson" <cmjohnson.mailinglist@gmail.com>:

...

I use the CPython 2.7 that comes with Ubuntu Natty, with the latest updates.

Masklinn

10:01 a.m.

New subject: Create a StringBuilder class and use it everywhere

On 2011-08-25, at 11:53 , Carl Matthew Johnson wrote:

...

Steven D'Aprano

1:57 p.m.

New subject: Create a StringBuilder class and use it everywhere

Carl Matthew Johnson wrote:

...

k_bx

10:38 a.m.

New subject: Create a StringBuilder class and use it everywhere

25.08.2011, 12:28, "k_bx" <k.bx@ya.ru>:

...


Thanks.

Oh, and also, I really like how Python has had its MutableString class since forever, but it was deprecated in Python 3.

Georg Brandl

10:50 a.m.

New subject: Create a StringBuilder class and use it everywhere

On 25.08.2011 12:38, k_bx wrote:

...

Oh, and also, I really like how Python has had its MutableString class since forever, but it was deprecated in Python 3.

You do realize that MutableString's __iadd__ just performs += on str operands?

Georg

k_bx

10:55 a.m.

New subject: Create a StringBuilder class and use it everywhere

25.08.2011, 13:50, "Georg Brandl" <g.brandl@gmx.net>:

...

Oh, I'm sorry, I thought it uses cStringIO internally. Let's forget about MutableString then.

Terry Reedy

August 2011

3:41 p.m.

New subject: Create a StringBuilder class and use it everywhere

On 8/25/2011 6:38 AM, k_bx wrote:

...

Oh, and also, I really like how Python has had its MutableString class since forever, but it was deprecated in Python 3.

(Removed, I presume you mean...) and bytearray was added. I have no idea if += on such is any better than O(n*n).

-- Terry Jan Reedy

Antoine Pitrou

4:35 p.m.

New subject: Create a StringBuilder class and use it everywhere

On Thu, 25 Aug 2011 11:41:11 -0400 Terry Reedy <tjreedy@udel.edu> wrote:

...

On bytearray? Yes, it is. It uses a similar algorithm to lists, and is therefore O(total length) amortized.

Regards

Antoine.
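A minimal sketch of building data with `+=` on a bytearray follows. A bytearray holds bytes rather than text, so the pieces here are ASCII byte strings that are decoded at the end; the function name `main_bytearray` and the choice of encoding are assumptions of this sketch.

```python
def main_bytearray(n):
    buf = bytearray(b"initial value")
    for _ in range(n):
        # += extends the same bytearray object in place.
        buf += b"more data"
    return buf.decode("ascii")


print(main_bytearray(2))  # initial valuemore datamore data
```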

Antoine Pitrou

11:36 a.m.

New subject: Create a StringBuilder class and use it everywhere

On Thu, 25 Aug 2011 12:28:14 +0300 k_bx <k.bx@ya.ru> wrote:

...

Nick Coghlan

11:47 a.m.

New subject: Create a StringBuilder class and use it everywhere

...

Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Mike Graham

12:31 p.m.

New subject: Create a StringBuilder class and use it everywhere

On Thu, Aug 25, 2011 at 5:28 AM, k_bx <k.bx@ya.ru> wrote:

...

Steven D'Aprano

2 p.m.

New subject: Create a StringBuilder class and use it everywhere

Mike Graham wrote:

...

Stefan Behnel

August 2011

10:15 a.m.

New subject: Create a StringBuilder class and use it everywhere

Steven D'Aprano, 25.08.2011 16:00:

...

Arnaud Delobelle

9:02 a.m.

New subject: Create a StringBuilder class and use it everywhere

On 25 August 2011 13:31, Mike Graham <mikegraham@gmail.com> wrote:

...

+1 -- Arnaud

Terry Reedy

10:24 a.m.

New subject: Create a StringBuilder class and use it everywhere

On 8/25/2011 5:28 AM, k_bx wrote:

...

...
...
abc abcd

I am personally happy enough with [].append, but I can see the attraction of += if doing many separate lines rather than .append within a loop. -- Terry Jan Reedy

Antoine Pitrou

7:45 p.m.

New subject: Performance of the "".join() idiom

...
