StringChain -- a data structure for managing large sequences of chunks of bytes

Zooko O'Whielacronx zookog at gmail.com
Fri Mar 12 08:11:37 CET 2010


Folks:

Every couple of years I run into a problem where some Python code that
worked well at small scales starts burning up my CPU at larger scales,
and the underlying issue turns out to be the idiom of accumulating
data by string concatenation. It just happened again
(http://foolscap.lothar.com/trac/ticket/149 ), and as usual it is hard
to make the data accumulator efficient without introducing a bunch of
bugs into the surrounding code. So this time around I decided to try
encapsulating my preferred more efficient idiom into a reusable class.

So I present to you StringChain, which is an efficient way to
accumulate and process data in many chunks:

http://tahoe-lafs.org/trac/stringchain

Here are some benchmarks generated by running python -OOu -c 'from
stringchain.bench import bench; bench.quick_bench()' as instructed by
the README.txt file.

The N: in the left-hand column is how many bytes were in the test
dataset. The ave rate: number in the right-hand column is how many
bytes per second were processed. "naive" means the string-based idiom
sketched above and "strch" means using the StringChain class.

_buildup init_naive
N:   65536, time: best:    0.00,   2th-best:    0.00, ave:    0.00,
2th-worst:    0.00, worst:    0.00 (of      5), reps/s:    890, ave
rate: 58350579
N:  131072, time: best:    0.00,   2th-best:    0.00, ave:    0.00,
2th-worst:    0.00, worst:    0.00 (of      5), reps/s:    265, ave
rate: 34800398
N:  262144, time: best:    0.01,   2th-best:    0.01, ave:    0.01,
2th-worst:    0.01, worst:    0.01 (of      5), reps/s:     79, ave
rate: 20745346
N:  524288, time: best:    0.05,   2th-best:    0.05, ave:    0.05,
2th-worst:    0.05, worst:    0.05 (of      5), reps/s:     20, ave
rate: 10823850
_buildup init_strch
N:   65536, time: best:    0.00,   2th-best:    0.00, ave:    0.00,
2th-worst:    0.00, worst:    0.00 (of      5), reps/s:  25543, ave
rate: 1674043282
N:  131072, time: best:    0.00,   2th-best:    0.00, ave:    0.00,
2th-worst:    0.00, worst:    0.00 (of      5), reps/s:  14179, ave
rate: 1858538925
N:  262144, time: best:    0.00,   2th-best:    0.00, ave:    0.00,
2th-worst:    0.00, worst:    0.00 (of      5), reps/s:   8016, ave
rate: 2101513050
N:  524288, time: best:    0.00,   2th-best:    0.00, ave:    0.00,
2th-worst:    0.00, worst:    0.00 (of      5), reps/s:   4108, ave
rate: 2154215572
_consume init_naive_loaded
N:   65536, time: best:    0.00,   2th-best:    0.00, ave:    0.00,
2th-worst:    0.00, worst:    0.00 (of      5), reps/s:    931, ave
rate: 61037862
N:  131072, time: best:    0.00,   2th-best:    0.00, ave:    0.00,
2th-worst:    0.00, worst:    0.00 (of      5), reps/s:    270, ave
rate: 35454393
N:  262144, time: best:    0.01,   2th-best:    0.01, ave:    0.01,
2th-worst:    0.01, worst:    0.01 (of      5), reps/s:     74, ave
rate: 19471963
N:  524288, time: best:    0.05,   2th-best:    0.05, ave:    0.05,
2th-worst:    0.05, worst:    0.06 (of      5), reps/s:     19, ave
rate: 10146747
_consume init_strch_loaded
N:   65536, time: best:    0.00,   2th-best:    0.00, ave:    0.00,
2th-worst:    0.00, worst:    0.00 (of      5), reps/s:   4309, ave
rate: 282447500
N:  131072, time: best:    0.00,   2th-best:    0.00, ave:    0.00,
2th-worst:    0.00, worst:    0.00 (of      5), reps/s:   2313, ave
rate: 303263357
N:  262144, time: best:    0.00,   2th-best:    0.00, ave:    0.00,
2th-worst:    0.00, worst:    0.00 (of      5), reps/s:   1186, ave
rate: 311159052
N:  524288, time: best:    0.00,   2th-best:    0.00, ave:    0.00,
2th-worst:    0.00, worst:    0.00 (of      5), reps/s:    606, ave
rate: 317814669
_randomy init_naive
N:   65536, time: best:    0.00,   2th-best:    0.00, ave:    0.00,
2th-worst:    0.00, worst:    0.00 (of      5), reps/s:    479, ave
rate: 31450561
N:  131072, time: best:    0.01,   2th-best:    0.01, ave:    0.01,
2th-worst:    0.01, worst:    0.01 (of      5), reps/s:    140, ave
rate: 18461191
N:  262144, time: best:    0.02,   2th-best:    0.02, ave:    0.02,
2th-worst:    0.03, worst:    0.03 (of      5), reps/s:     42, ave
rate: 11127714
N:  524288, time: best:    0.06,   2th-best:    0.07, ave:    0.08,
2th-worst:    0.08, worst:    0.09 (of      5), reps/s:     13, ave
rate:  6906341
_randomy init_strch
N:   65536, time: best:    0.00,   2th-best:    0.00, ave:    0.00,
2th-worst:    0.00, worst:    0.00 (of      5), reps/s:    973, ave
rate: 63827127
N:  131072, time: best:    0.00,   2th-best:    0.00, ave:    0.00,
2th-worst:    0.00, worst:    0.00 (of      5), reps/s:    495, ave
rate: 64970669
N:  262144, time: best:    0.00,   2th-best:    0.00, ave:    0.00,
2th-worst:    0.00, worst:    0.00 (of      5), reps/s:    239, ave
rate: 62913360
N:  524288, time: best:    0.01,   2th-best:    0.01, ave:    0.01,
2th-worst:    0.01, worst:    0.01 (of      5), reps/s:    121, ave
rate: 63811569

The naive approach is slower than the StringChain class, and the
bigger the dataset the slower it goes. The StringChain class is fast
and also it is scalable (with regard to these benchmarks at least...).

Thanks!

Regards,

Zooko



More information about the Python-list mailing list