gather information from various files efficiently
Steven Bethard
steven.bethard at gmail.com
Tue Dec 14 13:14:29 EST 2004
Klaus Neuner wrote:
> The straightforward way to solve this problem is to create a
> dictionary. Like so:
>
>
> [...]
>
> a, b = get_information(line)
> if a in dict.keys():
> dict[a].append(b)
> else:
> dict[a] = [b]
>
So I timed the three suggestions with a few different datasets:
> cat builddict.py
def askpermission(d, k, v):
if k in d:
d[k].append(v)
else:
d[k] = [v]
def askforgiveness(d, k, v):
try:
d[k].append(v)
except KeyError:
d[k] = [v]
def default(d, k, v):
d.setdefault(k, []).append(v)
def test(items, func):
d = {}
for k, v in items:
func(d, k, v)
Dataset where every value causes a collision:
> python -m timeit -s "import builddict as bd" "bd.test([(100, i) for i
in range(1000)], bd.askpermission)"
1000 loops, best of 3: 1.62 msec per loop
> python -m timeit -s "import builddict as bd" "bd.test([(100, i) for i
in range(1000)], bd.askforgiveness)"
1000 loops, best of 3: 1.58 msec per loop
> python -m timeit -s "import builddict as bd" "bd.test([(100, i) for i
in range(1000)], bd.default)"
100 loops, best of 3: 2.03 msec per loop
Dataset where no value causes a collision:
> python -m timeit -s "import builddict as bd" "bd.test([(i, i) for i
in range(1000)], bd.askpermission)"
100 loops, best of 3: 2.29 msec per loop
> python -m timeit -s "import builddict as bd" "bd.test([(i, i) for i
in range(1000)], bd.askforgiveness)"
100 loops, best of 3: 9.96 msec per loop
> python -m timeit -s "import builddict as bd" "bd.test([(i, i) for i
in range(1000)], bd.default)"
100 loops, best of 3: 2.98 msec per loop
Datset where one of every 5 values causes a collision:
> python -m timeit -s "import builddict as bd" "bd.test([(i % 5, i) for
i in range(1000)], bd.askpermission)"
1000 loops, best of 3: 1.82 msec per loop
> python -m timeit -s "import builddict as bd" "bd.test([(i % 5, i) for
i in range(1000)], bd.askforgiveness)"
1000 loops, best of 3: 1.79 msec per loop
> python -m timeit -s "import builddict as bd" "bd.test([(i % 5, i) for
i in range(1000)], bd.default)"
100 loops, best of 3: 2.2 msec per loop
So, when there are lots of collisions, you may get a small benefit from
the try/except solution. If there are very few collisions, you probably
would rather the if/else solution. The setdefault solution patterns
about like the if/else solution, but is somewhat slower.
I will probably continue to use setdefault, because I think it's
prettier =) but if you're running into a speed bottleneck in this sort
of situation, you might consider one of the other solutions.
Steve
More information about the Python-list
mailing list