[Tutor] How can I find a group of characters in a list of strings?
Martin A. Brown
martin at linux-ip.net
Wed Jul 25 20:29:27 EDT 2018
> I have a list of strings that contains slightly more than a
> million items. Each item is a string of 8 capital letters like so:
>
> ['MIBMMCCO', 'YOWHHOY', ...]
>
> I need to check and see if the letters 'OFHCMLIP' are one of the items in the
> list but there is no way to tell in what order the letters will appear. So I
> can't just search for the string 'OFHCMLIP'. I just need to locate any strings
> that are made up of those letters no matter their order.
>
> I suppose I could loop over the list and loop over each item using a bunch of
> if statements exiting the inner loop as soon as I find a letter is not in the
> string, but there must be a better way.
>
> I'd appreciate hearing about a better way to attack this.
>
> thanks, Jim
If I only had to do this once, over only a million items (given
today's CPU power), I'd probably do something like the below
using sets. I couldn't tell from your text whether you wanted to
see all of the letters in 'OFHCMLIP' in each entry or if you wanted
to see only that some subset were present. So, here's a script that
reports both a partial match and an exact match.
Note that I added a 9-character string, too, because you had a
7-character string as your second sample. This is mostly to point
out that the 9-character string satisfies the exact match (every
letter of the needle is present) although it sports an extra
character.
farm = ['MIBMMCCO', 'YOWHHOY', 'OFHCMLIP', 'OFHCMLIPZ', 'FHCMLIP', 'NEGBQJKR']
needle = set('OFHCMLIP')
for haystack in farm:
    # letters of the needle that also occur in this entry
    partial = needle.intersection(haystack)
    # True when every needle letter occurs in this entry
    exact = partial == needle
    print(haystack, exact, ''.join(sorted(partial)))
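If the extra-character case should not count as a match, one possible
variation (a sketch only, not part of the script above) is to compare
whole sets instead of intersections. The names samples, letters and
results are made up for this illustration.

```python
needle = set('OFHCMLIP')
samples = ['OFHCMLIP', 'OFHCMLIPZ', 'FHCMLIP']

results = {}
for haystack in samples:
    letters = set(haystack)
    # needle <= letters: every needle letter occurs in the entry
    # needle == letters: the entry uses those letters and no others
    results[haystack] = (needle <= letters, needle == letters)
    print(haystack, *results[haystack])
```

Sets drop repeated letters, so set equality alone would still accept an
entry that repeats one of the eight letters. Comparing sorted letters,
or also checking the length, avoids that.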
On the other hand, there are probably lots of papers on how to do
this much more efficiently.
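As one hedged example of a cheaper approach for repeated lookups,
assuming a match means an anagram (same letters, same counts): index
every entry once under its sorted letters, then each query is a single
dictionary lookup. The names index and find_anagrams are hypothetical,
and 'PILMCHFO' is an invented entry added only to show a reordered
match.

```python
from collections import defaultdict

# 'PILMCHFO' is invented sample data, not from the original list
farm = ['MIBMMCCO', 'YOWHHOY', 'OFHCMLIP', 'OFHCMLIPZ', 'FHCMLIP',
        'NEGBQJKR', 'PILMCHFO']

# Build the index once: sorted letters -> entries with those letters
index = defaultdict(list)
for item in farm:
    index[''.join(sorted(item))].append(item)

def find_anagrams(letters):
    """Return the entries that are anagrams of letters."""
    return index.get(''.join(sorted(letters)), [])

print(find_anagrams('OFHCMLIP'))
```

Building the index costs one pass over the list. For a single query,
the loop over the list is probably simpler.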
-Martin
MIBMMCCO False CIMO
YOWHHOY False HO
OFHCMLIP True CFHILMOP
OFHCMLIPZ True CFHILMOP
FHCMLIP False CFHILMP
NEGBQJKR False
--
Martin A. Brown
http://linux-ip.net/