Looking for a regexp generator based on a set of known string representative of a string set

Fri Sep 8 12:29:34 EDT 2006

"Andy Dingley" <dingbat at codesmiths.com> wrote in message 
news:1157730937.621029.259320 at i3g2000cwc.googlegroups.com...
>
> vbfoobar at gmail.com wrote:
>
>> I am looking for python code that takes as input a list of strings
>> [...] and outputs the python regular expression
>
>    (s1|s2|s3|s4|s5)
> for strings of "s1" etc.
>
> Regex compilers are themselves quite good at optimising beyond this
>

It turns out this problem is a little trickier, especially when one of your 
strings is a leading subset of another.  For instance, let's say we are 
looking for comparison operators, one of <, >, =, >=, <=, or !=.  Simply 
concatenating with intervening '|' characters gives this regexp:

"<|>|=|<=|>=|!="

However, the leading '<' and '>' alternatives mask the later '<=', '<>', and 
'>=' alternatives, and so the regexp never matches the longer options (I was 
not able to find a greediness switch that would override this).  So when 
searching "a >= b" we get this:

>>> re.findall("<|>|=|<=|>=|!=", "a >= b")
['>', '=']

By moving the longer option to the front of the regexp, the longer option is 
no longer masked by the shorter:

>>> re.findall(">=|<|>|=|<=|!=", "a >= b")
['>=']

You also can't just concatenate input strings, since it is very likely they 
will contain one of the magic re symbols ()[]?*./\+, etc.  So re.escape 
needs to be called to add the necessary '\'s.

Here is an extract from pyparsing's oneOf function that does something 
similar, that handles the leading substring masking problem, and escapes the 
input strings, before concatenating them to a valid regexp.  Of course, the 
simplest approach would be to just sort the symbols by descending length, 
but you may have some a priori knowledge that 'X' is a very common match, 
and want that option tested as early as possible.  So this method only 
reorders strings if there is a masking problem.

def createREfrom( symbols ):  #symbols is a list of strings
    isequal = ( lambda a,b: a == b )
    masks = ( lambda a,b: b.startswith(a) )
    i = 0
    while i < len(symbols)-1:
        cur = symbols[i]
        for j,other in enumerate(symbols[i+1:]):
            if ( isequal(other, cur) ):
                del symbols[i+j+1]
                break
            elif ( masks(cur, other) ):
                del symbols[i+j+1]
                symbols.insert(i,other)
                cur = other
                break
        else:
            i += 1
    return "|".join( [ re.escape(sym) for sym in symbols] )

>>> print createREfrom(["ABC","ABCDEF","ABCGHI"])
ABCDEF|ABCGHI|ABC
>>> print createREfrom("> < = <= >= != << >> <<< >>>".split())
\>\=|\>\>\>|\>\>|\>|\<\=|\<\<\<|\<\<|\<|\=|\!\=
>>> re.findall( createREfrom("> < = <= >= != << >> <<< >>>".split()), "a <= 
>>> b")
['<=']

Note, this does not do any optimization, such as collapsing "ABCDEF|ABCGHI" 
to "ABC(DEF|GHI)".  I think there are some recipes in the Python cookbook 
for such optimization.

-- Paul