[Python-ideas] Create a StringBuilder class and use it everywhere

k_bx k.bx at ya.ru
Thu Aug 25 11:57:22 CEST 2011


25.08.2011, 12:45, "M.-A. Lemburg" <mal at egenix.com>:
> k_bx wrote:
>
>>  Hi!
>>
>>  There's a common problem in Python right now: when people need to build a string from pieces, they very often do something like this::
>>
>>      def main_pure():
>>          b = u"initial value"
>>          for i in xrange(30000):
>>              b += u"more data"
>>          return b
>>
>>  The bad thing about it is that a new string is created every time you do +=, so it performs badly on CPython (and horribly on PyPy). If people used, for example, a list of strings, performance would be much better::
>>
>>      def main_list_append():
>>          b = [u"initial value"]
>>          for i in xrange(3000000):
>>              b.append(u"more data")
>>          return u"".join(b)
>>
>>  The results are::
>>
>>      kost at kost-laptop:~/tmp$ time python string_bucket_pure.py
>>
>>      real 0m7.194s
>>      user 0m3.590s
>>      sys 0m3.580s
>>      kost at kost-laptop:~/tmp$ time python string_bucket_append.py
>>
>>      real 0m0.417s
>>      user 0m0.330s
>>      sys 0m0.080s
>>
>>  Fantastic, isn't it?
>>
>>  Also, let's forget about speed for a moment and think about semantics: your task is to "build a string from its pieces", or in other words "build a string from a list of pieces", so from this point of view using [] and u"".join is also the better choice semantically.
>>
>>  Java has had its StringBuilder class for a long time (I'm not really into Java, I've just been told about that), and I think Python should have its own StringBuilder::
>>
>>      class StringBuilder(object):
>>          """Use it instead of doing += for building unicode strings from pieces"""
>>          def __init__(self, val=u""):
>>              self.val = val
>>              self.appended = []
>>
>>          def __iadd__(self, other):
>>              self.appended.append(other)
>>              return self
>>
>>          def __unicode__(self):
>>              self.val = u"".join((self.val, u"".join(self.appended)))
>>              self.appended = []
>>              return self.val
>>
>>  Why a StringBuilder class rather than just [] + u"".join? Well, I have two reasons for that:
>>
>>  1. It has caching.
>>  2. You can document it: when a programmer looks at the [] + u"".join idiom he doesn't see _WHY_ it is done that way, while when he sees a StringBuilder class he can go ahead and read its help().
>>
>>  Performance of StringBuilder is OK compared to [] + u"".join (I've increased the number of += operations from 30000 to 30000000)::
>>
>>      def main_bucket():
>>          b = StringBuilder(u"initial value ")
>>          for i in xrange(30000000):
>>              b += u"more data"
>>          return unicode(b)
>>
>>  For CPython::
>>
>>          kost at kost-laptop:~/tmp$ time python string_bucket_bucket.py
>>
>>          real 0m12.944s
>>          user 0m11.670s
>>          sys 0m1.260s
>>
>>          kost at kost-laptop:~/tmp$ time python string_bucket_append.py
>>
>>          real 0m3.540s
>>          user 0m2.830s
>>          sys 0m0.690s
>>
>>  For PyPy 1.6::
>>
>>          (pypy)kost at kost-laptop:~/tmp$ time python string_bucket_bucket.py
>>
>>          real 0m18.593s
>>          user 0m12.930s
>>          sys 0m5.600s
>>
>>          (pypy)kost at kost-laptop:~/tmp$ time python string_bucket_append.py
>>
>>          real 0m16.214s
>>          user 0m11.750s
>>          sys 0m4.280s
>>
>>  Of course, a C implementation could be written to make things faster on CPython, I guess, but really, in comparison to the += method it doesn't matter now. It's done to be explicit.
>>
>>  p.s.: also, why not use cStringIO?
>>  1. It's not semantically right to create a file-like object just to join multiple string pieces into one.
>>  2. If you're talking about using it in your code right away: you can see that no one uses it yet, because people want += (while StringBuilder gives them +=).
>>  3. It's somewhat slow on PyPy right now :-)
>
> I think you should use cStringIO in your class implementation.
> The list + join idiom is nice, but it has the disadvantage of
> creating and keeping alive many small string objects (with all
> the memory overhead and fragmentation that goes along with it).
>
> AFAIR, the most efficient approach is using arrays:
>
> >>> import array
> >>> t = array.array('u')
> >>> t.extend(u'äöü')
> >>> t
> array('u', u'\xe4\xf6\xfc')
> >>> t.tounicode()
> u'\xe4\xf6\xfc'
>
> --
> Marc-Andre Lemburg
> eGenix.com

I'm perfectly OK with a different implementation of StringBuilder, but the main idea of the proposal is to get it into the standard library somehow and to promote (and enforce) its use everywhere, maybe with a FAQ entry. Then, when you see new += code, all you need to do is go and fix it without worrying about complaints :-D
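
For illustration, here is a minimal sketch of what a buffer-backed StringBuilder along the lines suggested above might look like. It assumes Python 3, so io.StringIO stands in for cStringIO and str() for unicode(); the build() helper is hypothetical and exists only for the example. It is not a proposed final implementation.

```python
import io


class StringBuilder:
    """Accumulate text pieces with += and render them with str()."""

    def __init__(self, val=""):
        # Write the initial value instead of passing it to StringIO():
        # StringIO(initial) leaves the position at 0, so later writes
        # would overwrite the initial text rather than append to it.
        self._buf = io.StringIO()
        self._buf.write(val)

    def __iadd__(self, other):
        # Append into the single growing buffer; no list of small
        # string objects is kept alive.
        self._buf.write(other)
        return self

    def __str__(self):
        return self._buf.getvalue()


def build(n):
    """Hypothetical helper: append the same piece n times via +=."""
    b = StringBuilder("initial value ")
    for _ in range(n):
        b += "more data"
    return str(b)
```

The calling code keeps the += spelling people already use, and the backing store (list + join, StringIO, array) can be swapped without touching callers.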


