[issue27575] dict viewkeys intersection slow for large dicts
David Su
report at bugs.python.org
Sat Jul 23 14:05:29 EDT 2016
David Su added the comment:
I am trying to implement the following algorithm:
http://blog.faroo.com/2012/06/07/improved-edit-distance-based-spelling-correction/
This algorithm is used for fast nearest neighbor queries for spell correction.
Given a corpus of strings, for each string s, I generate all subsequences (Note: subsequence, not substring) of that string length (len(n) - k), call that subsequence t, where k is a tunable parameter. Then, I create a dict mapping the subsequences t to the original string s.
This process creates a very large dict from a relatively small corpus. For example, my corpus of ~500k strings blew up to a dict of ~10M keys, with k=3.
Then, for a query string q, you generate all subsequences of (len(q) - k), and perform an intersection with the keys of the dict. Then, the values of the dict corresponding to the key intersection will be possible misspelling of the query string q.
In pseudo-python:
def get_subseq(s: str, k: int):
returns all subsequences of s of length len(s) - k
def build_dict(corpus: List[str], k: int):
d = defaultdict(list)
for orig_str in corpus:
for subseq in get_subseq(orig_str, k):
d[subseq].append(orig_str)
return d
def query(d: dict, q: str, k:int):
q_subseq = set(get_subseq(q, k))
intersection = d.viewkeys() & q_subseq # this is slow when len(d) is large
return [orig_str for k in intersection for orig_str in d[k]]
By exposing an intersection interface, a certain degree of performance is expected by the user. It took me a considerable amount of debugging to realize that the dict.viewkeys() intersection was the performance bottleneck. I finally decided to dig into the CPython source code before discovering the root cause of the issue. While I am familiar with C and found the issue relatively quickly, for those that aren't, it should not be necessary to dig into the source code for a mainstream programming language like Python.
I am currently looking at how to potentially fix this in the CPython repository. I may submit a patch in the near future if I find a good solution.
----------
_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue27575>
_______________________________________
More information about the Python-bugs-list
mailing list