StringChain -- a data structure for managing large sequences of chunks of bytes

Zooko O'Whielacronx zookog at gmail.com
Mon Mar 22 17:21:53 EDT 2010


On Mon, Mar 22, 2010 at 2:07 AM, Steven D'Aprano
<steven at remove.this.cybersource.com.au> wrote:
>
> Perhaps you should have said that it was a wrapper around deque giving
> richer functionality, rather than giving the impression that it was a
> brand new data structure invented by you. People are naturally going to
> be more skeptical about a newly invented data structure than one based on
> a reliable, component like deque.

The fact that StringChain uses deque to hold the queue of strings
isn't that important. I just benchmarked it by swapping in the deque
for a list and using the list costs about one third of a nanosecond
per byte at the scales that the benchmark exercises (namely 10,000,000
bytes in about 10,000 strings). A third of a nanosecond per byte is
about 4% of the runtime.

I also implemented and benchmarked a simpler deque-based scheme which
puts all the actual bytes from the strings into a deque with
self.d.extend(newstr). As you would expect, this shows good asymptotic
performance but the constants are relatively bad so that at all of the
actual loads measured it was three orders of magnitude worse than
StringChain and than String-Chain-with-a-list-instead-of-a-deque.
Moral: the constants matter!

Those benchmarks are appended. You can run the benchmarks yourself per
the README.txt.

But anyway, I take your point and I updated the StringChain README to
explain that it is a pretty simple data structure that holds a list
(actually a deque) of strings and isn't anything too clever or novel.

By the way, we could further micro-optimize this kind of thing if
''.join() would accept either strings or buffers instead of requiring
strings:

>>> ''.join([buffer("abc"), "def"])
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
TypeError: sequence item 0: expected string, buffer found

Then whenever StringChain needs to slice a string into leading and
trailing parts, it could construct a buffer() viewing each part
instead of making a copy of each part.


> it. Maybe you should consider linking to it on PyPI and seeing if there
> is any interest from others?

http://pypi.python.org/pypi/stringchain

Regards,

Zooko

impl:  StringChain
task:  _accumulate_then_one_gulp
  10000 best: 5.698e+00,   3th-best: 7.486e+00, mean: 7.758e+00,
 100000 best: 4.640e+00,   3th-best: 4.690e+00, mean: 7.674e+00,
 1000000 best: 3.789e+00,   3th-best: 3.806e+00, mean: 3.995e+00,
10000000 best: 4.095e+00,   1th-best: 4.095e+00, mean: 4.095e+00,
task:  _alternate_str
  10000 best: 1.378e+01,   3th-best: 1.390e+01, mean: 1.500e+01,
 100000 best: 9.198e+00,   3th-best: 9.248e+00, mean: 9.385e+00,
 1000000 best: 8.715e+00,   3th-best: 8.755e+00, mean: 8.808e+00,
10000000 best: 8.738e+00,   1th-best: 8.738e+00, mean: 8.738e+00,
impl:  StringChainWithList
task:  _accumulate_then_one_gulp
  10000 best: 3.600e+00,   3th-best: 3.695e+00, mean: 4.129e+00,
 100000 best: 4.070e+00,   3th-best: 4.079e+00, mean: 4.162e+00,
 1000000 best: 3.662e+00,   3th-best: 3.688e+00, mean: 3.721e+00,
10000000 best: 3.902e+00,   1th-best: 3.902e+00, mean: 3.902e+00,
1th-worst: 3.902e+00, worst: 3.902e+00 (of      1)
task:  _alternate_str
  10000 best: 1.369e+01,   3th-best: 1.380e+01, mean: 1.442e+01,
 100000 best: 9.251e+00,   3th-best: 9.289e+00, mean: 9.416e+00,
 1000000 best: 8.809e+00,   3th-best: 8.876e+00, mean: 8.943e+00,
10000000 best: 9.095e+00,   1th-best: 9.095e+00, mean: 9.095e+00,
impl:  Dequey
task:  _accumulate_then_one_gulp
  10000 best: 2.772e+02,   3th-best: 2.785e+02, mean: 2.911e+02,
 100000 best: 2.314e+02,   3th-best: 2.334e+02, mean: 2.422e+02,
 1000000 best: 2.282e+02,   3th-best: 2.288e+02, mean: 2.370e+02,
10000000 best: 2.587e+02,   1th-best: 2.587e+02, mean: 2.587e+02,
task:  _alternate_str
  10000 best: 1.576e+03,   3th-best: 1.580e+03, mean: 1.608e+03,
 100000 best: 1.301e+03,   3th-best: 1.303e+03, mean: 1.306e+03,
 1000000 best: 1.275e+03,   3th-best: 1.276e+03, mean: 1.278e+03,
10000000 best: 1.280e+03,   1th-best: 1.280e+03, mean: 1.280e+03,



More information about the Python-list mailing list