StringChain -- a data structure for managing large sequences of chunks of bytes

Zooko O'Whielacronx zookog at gmail.com
Mon Mar 22 01:09:46 EDT 2010


Folks:

I failed to make something sufficiently clear in my original message
about StringChain. The use case that I am talking about is not simply
that you need to accumulate a sequence of incoming chunks of data,
concatenate them together, and then process the entire result. If that
is all you need to do then indeed you can accumulate the incoming
strings in a list and then run ''.join(thelist) when you have received
all of them and are ready to process them.

But the use case that I am talking about is where you need to
accumulate new incoming strings into your buffer while alternately
processing leading prefixes of the buffer. The typical example is that
sequences of bytes are arriving on your TCP socket and you are
processing the stream incrementally. You need to process the leading
prefixes of the stream as soon as you can (e.g. as soon as a line
followed by a newline has accumulated, or as soon as another complete
element in your data format has accumulated, etc.); you can't simply
wait until the bytes stop coming and then concatenate the entire
collection together at once.

This is exactly the current situation in foolscap [1] which is causing
a performance problem in Tahoe-LAFS [2].

To illustrate what I mean I cooked up some benchmarks showing the task
of "accumulate a bunch of things then consume them in a single gulp"
versus the task of "alternate between accumulating and consuming"
(with accumulation going faster than consumption until the input
stream runs dry).

I implemented a few data structures for benchmarking: StringChain, the
naive "accumulatorstr += newstr" approach (named "Stringy" in the
benchmarks), one based on cStringIO (named "StringIOy"), one based on
accumulating the new strings into a list and then doing
''.join(accumulatorlist) whenever you need to consume a leading prefix
(called "SimplerStringChain").

Below are the abbreviated results of the benchmark. You can easily run
this benchmark yourself by following the README.txt in the StringChain
source package [3]. These results show that for the load imposed by
this benchmark StringChain is the only one of the four that scales and
that it is also nice and fast.

The left-hand column is the total number of bytes that were run
through it. The results are shown in nanoseconds per byte in
exponential notation. ("e+01" means times 10, "e+02" means times 100,
and "e+03" means times 1000, etc.) Each result is the best of 10
tries, or of 5 tries for the tasks which were taking too long to run
it 10 times.

Regards,

Zooko

[1] http://foolscap.lothar.com/trac/ticket/149
[2] http://allmydata.org/pipermail/tahoe-dev/2010-March/004181.html
[3] http://tahoe-lafs.org/trac/stringchain

impl:  StringChain
task:  _accumulate_then_one_gulp
  10000 best: 2.694e+00
  50000 best: 2.742e+00
 100000 best: 2.310e+00
 500000 best: 2.040e+00
1000000 best: 1.988e+00
5000000 best: 2.193e+00

task:  _alternate_str
  10000 best: 6.509e+00
  50000 best: 4.559e+00
 100000 best: 4.308e+00
 500000 best: 4.070e+00
1000000 best: 3.991e+00
5000000 best: 4.000e+00

impl:  SimplerStringChain
task:  _accumulate_then_one_gulp
  10000 best: 1.407e+00
  50000 best: 2.317e+00
 100000 best: 2.012e+00
 500000 best: 1.902e+00
1000000 best: 1.897e+00
5000000 best: 2.104e+00

task:  _alternate_str
  10000 best: 4.888e+00
  50000 best: 5.198e+00
 100000 best: 1.750e+01
 500000 best: 6.233e+01
1000000 best: 1.134e+02
5000000 best: 7.599e+02

impl:  StringIOy
task:  _accumulate_then_one_gulp
  10000 best: 4.196e+00
  50000 best: 5.522e+00
 100000 best: 4.499e+00
 500000 best: 3.756e+00
1000000 best: 4.176e+00
5000000 best: 5.414e+00

task:  _alternate_str
  10000 best: 5.484e+00
  50000 best: 7.863e+00
 100000 best: 2.126e+01
 500000 best: 6.972e+01
1000000 best: 1.219e+02
5000000 best: 9.463e+02

task:  _accumulate_then_one_gulp
  10000 best: 1.502e+00
  50000 best: 1.420e+01
 100000 best: 2.245e+01
 500000 best: 8.577e+01
1000000 best: 2.295e+02
5000000 best: 1.326e+03

task:  _alternate_str
  10000 best: 3.290e+00
  50000 best: 4.220e+00
 100000 best: 1.665e+01
 500000 best: 6.281e+01
1000000 best: 1.127e+02
5000000 best: 7.626e+02



More information about the Python-list mailing list