Looking for a regexp generator based on a set of known string representative of a string set
Paul McGuire
ptmcg at austin.rr._bogus_.com
Fri Sep 15 19:03:24 EDT 2006
"Paul McGuire" <ptmcg at austin.rr._bogus_.com> wrote in message
news:OTgMg.15795$o42.8868 at tornado.texas.rr.com...
> "Andy Dingley" <dingbat at codesmiths.com> wrote in message
> news:1157730937.621029.259320 at i3g2000cwc.googlegroups.com...
>>
>> vbfoobar at gmail.com wrote:
>>
>>> I am looking for python code that takes as input a list of strings
>>> [...] and outputs the python regular expression
>>
>> (s1|s2|s3|s4|s5)
>> for strings of "s1" etc.
>>
>> Regex compilers are themselves quite good at optimising beyond this
>>
>
> It turns out this problem is a little trickier, especially when one of
> your strings is a leading subset of another. For instance, let's say we
> are looking for comparison operators, one of <, >, =, >=, <=, or !=.
> Simply concatenating with intervening '|' characters gives this regexp:
>
> "<|>|=|<=|>=|!="
>
> However, the leading '<' and '>' alternatives mask the later '<=', '<>',
> and '>=' alternatives, and so the regexp never matches the longer options
> (I was not able to find a greediness switch that would override this). So
> when searching "a >= b" we get this:
>
>>>> re.findall("<|>|=|<=|>=|!=", "a >= b")
> ['>', '=']
>
> By moving the longer option to the front of the regexp, the longer option
> is no longer masked by the shorter:
>
>>>> re.findall(">=|<|>|=|<=|!=", "a >= b")
> ['>=']
>
>
> You also can't just concatenate input strings, since it is very likely
> they will contain one of the magic re symbols ()[]?*./\+, etc. So
> re.escape needs to be called to add the necessary '\'s.
>
> Here is an extract from pyparsing's oneOf function that does something
> similar, that handles the leading substring masking problem, and escapes
> the input strings, before concatenating them to a valid regexp. Of
> course, the simplest approach would be to just sort the symbols by
> descending length, but you may have some a priori knowledge that 'X' is a
> very common match, and want that option tested as early as possible. So
> this method only reorders strings if there is a masking problem.
>
>
> def createREfrom( symbols ): #symbols is a list of strings
> isequal = ( lambda a,b: a == b )
> masks = ( lambda a,b: b.startswith(a) )
> i = 0
> while i < len(symbols)-1:
> cur = symbols[i]
> for j,other in enumerate(symbols[i+1:]):
> if ( isequal(other, cur) ):
> del symbols[i+j+1]
> break
> elif ( masks(cur, other) ):
> del symbols[i+j+1]
> symbols.insert(i,other)
> cur = other
> break
> else:
> i += 1
> return "|".join( [ re.escape(sym) for sym in symbols] )
>
>>>> print createREfrom(["ABC","ABCDEF","ABCGHI"])
> ABCDEF|ABCGHI|ABC
>>>> print createREfrom("> < = <= >= != << >> <<< >>>".split())
> \>\=|\>\>\>|\>\>|\>|\<\=|\<\<\<|\<\<|\<|\=|\!\=
>>>> re.findall( createREfrom("> < = <= >= != << >> <<< >>>".split()), "a <=
>>>> b")
> ['<=']
>
>
> Note, this does not do any optimization, such as collapsing
> "ABCDEF|ABCGHI" to "ABC(DEF|GHI)". I think there are some recipes in the
> Python cookbook for such optimization.
>
> -- Paul
>
>
More information about the Python-list
mailing list