Regular Expressions: large amount of or's
Anthra Norell
anthra.norell at tiscalinet.ch
Tue Mar 1 17:33:11 EST 2005
----- Original Message -----
From: "André Søreng" <wsoereng at tiscali.no>
Newsgroups: comp.lang.python
To: <python-list at python.org>
Sent: Tuesday, March 01, 2005 8:46 PM
Subject: Regular Expressions: large amount of or's
>
> Hi!
>
> Given a string, I want to find all ocurrences of
> certain predefined words in that string. Problem is, the list of
> words that should be detected can be in the order of thousands.
>
> With the re module, this can be solved something like this:
>
> import re
>
> r = re.compile("word1|word2|word3|.......|wordN")
> r.findall(some_string)
>
> Unfortunately, when having more than about 10 000 words in
> the regexp, I get a regular expression runtime error when
> trying to execute the findall function (compile works fine, but slow).
>
> I don't know if using the re module is the right solution here, any
> suggestions on alternative solutions or data structures which could
> be used to solve the problem?
>
> André
>
Here's a function. You can copy and run it. See if that meets your requirements. It's an algorithm I once used for a search-and-replace program. That's why I collect the string indexes. It can easily be adapted. Suppose you only want to know which words occur, not where they occur. You then don't need the string indexes and don't need to return the same word repeatedly for each occurrences. If your list is 10,000 words as you say, some optimizing would be indicated, such as sorting it alphabetically and match only the subset with the initial matching the character at the current string offset.
Salut
Frederic
############################################
def search_words (s = None, wtf = None):
if s == None:
# Unless you provide your own string, it'll use this one
s = 'Personal firewall software may warn about the connection IDLE\n' \
' makes to its subprocess using this computer\'s internal loopback.\n'
if wtf == None:
# unless you provide your own list, it'll use this one
words_to_find = (
'fire',
'wall',
'process',
'subprocess',
'war',
'ware',
'international',
'inter',
'internal',
'internals',
'internal combustion engine', # You can search any substring, not just words
' ',
' ',
'\n',
'\t',
)
else:
words_to_find = wtf
WORD = 0
POSITION = 1
words = [] # Put matching words and string offsets into this list to return
length_of_string = len (s)
read_index = 0
while read_index < length_of_string:
# Try to match all words in your list against the string at offset read_index
words_list_index = 0 # Start matching at index 0 ('fire')
all_matches = [] # Collector for all matches at current string offset
for word in words_to_find: # Go through your entire list of words one by one
if s [read_index:].startswith (word): # If current word matches,
all_matches.append ((len (pattern), words_list_index)) # collect its length and index in your list
words_list_index += 1
if all_matches > []: # If one or more matches occurred
all_matches.sort () # Put them in order
all_matches.reverse () # and the longest match on top
words.append ((words_to_find [all_matches [0][POSITION]], read_index)) # Collect the longest matching word. Other strategies
# are possible. Search and replace requires this one.
read_index += all_matches [0][WORD] # Continue next read just past the matched section,
# unless you allow overlapping matches. re doesn't.
# Search and replace doesn't.
else: # No word matched at current offset
read_index += 1 # Try one further down next
return words # [(' ', 8), ('wall', 13), (' ', 17), ('ware', 22), (' ', 26), etc.
############################################
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-list/attachments/20050301/4865a8ef/attachment.html>
More information about the Python-list
mailing list