[Tutor] How can I find a group of characters in a list of strings?

Matt Ruffalo matt.ruffalo at gmail.com
Wed Jul 25 20:42:58 EDT 2018

On 2018-07-25 20:23, Mats Wichmann wrote:
> On 07/25/2018 05:50 PM, Jim wrote:
>> Linux mint 18 and python 3.6
>> I have a list of strings that contains slightly more than a million
>> items. Each item is a string of 8 capital letters like so:
>> ['MIBMMCCO', 'YOWHHOY', ...]
>> I need to check and see if the letters 'OFHCMLIP' are one of the items
>> in the list but there is no way to tell in what order the letters will
>> appear. So I can't just search for the string 'OFHCMLIP'. I just need to
>> locate any strings that are made up of those letters no matter their order.
>> I suppose I could loop over the list and loop over each item using a
>> bunch of if statements exiting the inner loop as soon as I find a letter
>> is not in the string, but there must be a better way.
>> I'd appreciate hearing about a better way to attack this.
> It's possible that the size of the biglist and the length of the key has
> enough performance impacts that a quicky (untested because I don't have
> your data) solution is unworkable for performance reasons.  But a quicky
> might be to take these two steps:
> 1. generate a list of the permutations of the target
> 2. check if any member of the target-permutation-list is in the biglist.
> Python sets are a nice way to check membership.
> from itertools import permutations
> permlist = [''.join(p) for p in permutations('MIBMMCCO', 8)]
> if not set(permlist).isdisjoint(biglist):
>     print("Found a permutation of MIBMMCCO")

I would *strongly* recommend against keeping a list of all permutations
of the query string; though there are only 8! = 40320 permutations of 8
characters, suggesting anything with factorial runtime should be done
only as a last resort.

This could pretty effectively be solved by considering each string in
the list as a set of characters for query purposes, and keeping a set of
those, making membership testing constant-time. Note that the inner sets
will have to be frozensets because normal sets aren't hashable.

For example:

In [1]: strings = ['MIBMMCCO', 'YOWHHOY']

In [2]: query = 'OFHCMLIP'

In [3]: search_db = {frozenset(s) for s in strings}

In [4]: frozenset(query) in search_db
Out[4]: False

In [5]: frozenset('MMCOCBIM') in search_db # permutation of first string
Out[5]: True


More information about the Tutor mailing list