[New-bugs-announce] [issue26904] Difflib quick_ratio() could use Counter()

Michael Cuthbert report at bugs.python.org
Mon May 2 00:29:13 EDT 2016


New submission from Michael Cuthbert:

The implementation used in difflib.SequenceMatcher().quick_ratio() counts how often each member of the sequence (character, list entry, etc.) appears in order to calculate its lower bound.

Counting how often an entry appears in an iterable has been sped up in cPython w/ the _count_elements() c function in _collections.

This patch uses the collections uses collections.Counter() as a way of getting this speedup (and any future speedups in cPython or other implementations); code is somewhat simplified also.

Performance:
There is a slight overhead to creating two collections.Counter() objects rather than simple dicts.  On two Mac systems (Python 3.5 on stock Macbook Air, and Py 3.6a0 latest on recent Mac Pro) the new implementation passes the speed of the previous when the length of the iterable is around 60 items.  As the number of items increases, the performance gains increase significantly. By 400 items, the new implementation's speed is about 3x the old, and seems to approach 3.6x asymptotically.

Below 60 items, the older implementation is faster; reaching a max of 8x the speed of the new when comparing a string of one element against a string of one element.  (The degenerate case of comparing one or two empty iterables is checked for in this implementation and is thus faster than the old implementation). I believe that users using quick_ratio() are more likely to be looking for speed improvements on larger 

Like the previous implementation, the count of the items in seq2 (self.b) is cached; if run again, it is about 41% faster (compared to 47% faster before).

This is my first patch submission, so any suggestions would be appreciated.

----------
components: Library (Lib)
files: difflib_quick_ratio.patch
keywords: patch
messages: 264618
nosy: Michael Cuthbert
priority: normal
severity: normal
status: open
title: Difflib quick_ratio() could use Counter()
type: performance
versions: Python 3.6
Added file: http://bugs.python.org/file42676/difflib_quick_ratio.patch

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue26904>
_______________________________________


More information about the New-bugs-announce mailing list