Efficient String Lookup?
bokr at oz.net
Sun Oct 17 09:55:41 CEST 2004
On Sun, 17 Oct 2004 07:18:07 GMT, "Chris S." <chrisks at NOSPAM.udel.edu> wrote:
>Bengt Richter wrote:
>> On Sun, 17 Oct 2004 05:50:49 GMT, "Chris S." <chrisks at NOSPAM.udel.edu> wrote:
>>>Sorry for the ambiguity. My case is actually pretty simple. '#'
>>>represents any single character, so it's essentially the same as re's
>>>'.'. The match must be exact, so the string and pattern must be of equal
>>>lengths. Each wildcard is independent of other wildcards. For example,
>>>suppose we restricted the possible characters to 1 and 0, then the
>>>pattern '##' would only match '00', '01', '10', and '11'. This pattern
>>>would not match '0', '111', etc. I feel that a trie would work well, but
>>>I'm concerned that for large patterns, the overhead in the Python
>>>implementation would be too inefficient.
>> So is the set of patterns static and you want to find which pattern(s!)
>> match dynamic input? How many patterns vs inputs strings? What max
>> length patterns, input strings? Total volume?
>Patterns and inputs are dynamic, input more so than patterns. The
>number, length, and volume of patterns and strings should be arbitrary.
Strategies for performance will vary according to volume (small => anything is ok) and the relative
sizes of the respective sets of strings, assuming enough work to make you notice a difference:
patterns>>inputs => walk tables built from the patterns using the inputs, vs patterns<<inputs =>
walk tables built from the input sets via strategic walks through the patterns.
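A minimal sketch of the "walk tables built from the patterns" idea, using a plain
dict-based trie where '#' is a wildcard edge (the names add_pattern/match and the
sentinel are my own invention, not anything from this thread):

```python
# Dict-based trie of patterns; '#' matches any single character.
# Hypothetical sketch, assuming inputs never contain '#'.
_END = object()  # sentinel key marking the end of a complete pattern

def add_pattern(trie, pattern):
    """Insert one pattern (which may contain '#') into the trie."""
    node = trie
    for ch in pattern:
        node = node.setdefault(ch, {})
    node[_END] = pattern  # remember which pattern ends at this node

def match(trie, s, i=0):
    """Return the set of stored patterns that exactly match s."""
    if i == len(s):
        return {trie[_END]} if _END in trie else set()
    found = set()
    for edge in (s[i], '#'):          # try the literal edge, then the wildcard edge
        child = trie.get(edge)
        if child is not None:
            found |= match(child, s, i + 1)
    return found

trie = {}
for p in ('0#', '#1', '00'):
    add_pattern(trie, p)

print(sorted(match(trie, '01')))  # ['#1', '0#'] -- both wildcard patterns match
print(sorted(match(trie, '00')))  # ['0#', '00']
```

One walk over the input visits every matching pattern at once, so the per-lookup
cost depends on how much the patterns overlap, not on how many there are.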
Max lengths of strings and patterns matter too: some things are worthwhile if the strings are small,
dumb if they are thousands of characters. Who can tell what performance tricks you may need
or not? Why don't you just try (^pattern$|^pat2$|...) -- i.e., rescan the whole
input for each pattern -- and we'll worry about performance later ;-)
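That throwaway suggestion could look something like this with the re module
(compile_patterns and the sample patterns are invented for illustration; each '#'
is translated to '.' and everything else is escaped):

```python
import re

def compile_patterns(patterns):
    """Combine '#'-wildcard patterns into a single alternation regex.
    Each literal character is escaped; '#' becomes '.' (any one char)."""
    alts = ('(?:' + ''.join('.' if c == '#' else re.escape(c) for c in p) + ')'
            for p in patterns)
    return re.compile('|'.join(alts))

rx = compile_patterns(['0#', '#1', '00'])
print(bool(rx.fullmatch('01')))   # True: '0#' (and '#1') match
print(bool(rx.fullmatch('111')))  # False: wrong length for every pattern
```

fullmatch anchors both ends, which enforces the equal-length requirement. If you
need to know *which* patterns matched rather than just whether any did, you'd
match each compiled pattern separately (or use named groups) instead of one big
alternation.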