Regular Expressions: large amount of or's

Tue Mar 1 17:33:11 EST 2005

----- Original Message ----- 
From: "André Søreng" <wsoereng at tiscali.no>
Newsgroups: comp.lang.python
To: <python-list at python.org>
Sent: Tuesday, March 01, 2005 8:46 PM
Subject: Regular Expressions: large amount of or's

> 
> Hi!
> 
> Given a string, I want to find all ocurrences of
> certain predefined words in that string. Problem is, the list of
> words that should be detected can be in the order of thousands.
> 
> With the re module, this can be solved something like this:
> 
> import re
> 
> r = re.compile("word1|word2|word3|.......|wordN")
> r.findall(some_string)
> 
> Unfortunately, when having more than about 10 000 words in
> the regexp, I get a regular expression runtime error when
> trying to execute the findall function (compile works fine, but slow).
> 
> I don't know if using the re module is the right solution here, any
> suggestions on alternative solutions or data structures which could
> be used to solve the problem?
> 
> André
> 

Here's a function. You can copy and run it. See if that meets your requirements. It's an algorithm I once used for a search-and-replace program. That's why I collect the string indexes. It can easily be adapted. Suppose you only want to know which words occur, not where they occur. You then don't need the string indexes and don't need to return the same word repeatedly for each occurrences. If your list is 10,000 words as you say, some optimizing would be indicated, such as sorting it alphabetically and match only the subset with the initial matching the character at the current string offset.

Salut 

Frederic

############################################

def search_words (s = None, wtf = None):

   if s == None:

      # Unless you provide your own string, it'll use this one

      s = 'Personal firewall software may warn about the connection IDLE\n' \
          '    makes to its subprocess using this computer\'s internal loopback.\n'

   if wtf == None:

      # unless you provide your own list, it'll use this one

      words_to_find = (
         'fire',
         'wall',
         'process',
         'subprocess',
         'war',
         'ware',
         'international',
         'inter',
         'internal',
         'internals',
         'internal combustion engine',    # You can search any substring, not just words
         ' ',         
         '   ',
         '\n',
         '\t',
      )

   else:

      words_to_find = wtf

   WORD     = 0
   POSITION = 1

   words = []      # Put matching words and string offsets into this list to return

   length_of_string = len (s)
   read_index       = 0

   while read_index < length_of_string:

      # Try to match all words in your list against the string at offset read_index

      words_list_index = 0              # Start matching at index 0 ('fire')

      all_matches = []                  # Collector for all matches at current string offset

      for word in words_to_find:        # Go through your entire list of words one by one

         if s [read_index:].startswith (word):   # If current word matches,

            all_matches.append ((len (pattern), words_list_index))      # collect its length and index in your list 

         words_list_index += 1

      if all_matches > []:          # If one or more matches occurred

         all_matches.sort ()        # Put them in order
         all_matches.reverse ()     # and the longest match on top

         words.append ((words_to_find [all_matches [0][POSITION]], read_index))   # Collect the longest matching word. Other strategies 
                                                                                  # are possible. Search and replace requires this one.

         read_index += all_matches [0][WORD]                                      # Continue next read just past the matched section,
                                                                                  # unless you allow overlapping matches. re doesn't.
                                                                                  # Search and replace doesn't.

      else:                    # No word matched at current offset

         read_index += 1       # Try one further down next

   return words                # [(' ', 8), ('wall', 13), (' ', 17), ('ware', 22), (' ', 26), etc.

############################################
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-list/attachments/20050301/4865a8ef/attachment.html>