Looking for a regexp generator based on a set of known string representative of a string set

Anthra Norell anthra.norell at tiscalinet.ch
Sat Sep 9 10:30:07 CEST 2006

Diez B. Roggisch wrote:
> vbfoobar at gmail.com schrieb:
>> Hello
>> I am looking for python code that takes as input a list of strings
>> (most similar,
>> but not necessarily, and rather short: say not longer than 50 chars)
>> and that computes and outputs the python regular expression that
>> matches
>> these string values (not necessarily strictly, perhaps the code is able
>> to determine
>> patterns, i.e. families of strings...).
> For matching the exact set, of course a concatenation can be used.
> However, this is of limited use, because then simple string find will
> suffice.
> In general, this can't be done. It might be possible that the underlying
> structure of the language isn't regular at all - for example, the simple
> language of palindromes isn't.
> And even if it was - the search space is just to large to explore. You
> could generate a bazillion matching expressions, but .* will always
> match - so how do you prevent the generator from converging towards that?
> If all you have are those strings, you are better off trying to infer
> some structure yourself.
> Diez

Here is how regular expressions raise confusion. One starts with a relatively simple matching problem and before long finds oneself
in a "search space too large to explore", having to formulate the right expression out of "a bazillion" permutations with a syntax
replete with potential traps: backslashes, special characters, precedence order, etc. The fact emerges (once more) that a regular
expression does not spare the effort of acquiring a crystal clear conception of what needs to be done as an absolute prerequisite of
writing code that in the end can be relied upon to do what it is supposed to do. In order to acquire that crystal clear conception a
"space too large to explore" is hardly a working environment one should be eager to get into. Yet regular expressions seem to exert
a strange attraction and are often sought for problems that don't really require them. I suspect this is because the regular
expression parsers are made by highly qualified experts and so they may recommend themselves to the programmer with a less than
perfect understanding of a task as a sort of magic bullet that "does it right" all by itself. Regex eding as a substitute for
problem analysis so to speak ...
      Most OPs are rather sketchy and leave it up to their readers to guess purpose, requirements, incidentals, environmentals,
restrictions, etc. Accordingly, threads often veer off topic very quickly into lofty realms of academic interest, the OP never to be
heard from again. So I would suggest that the OP explain what he intends to do with his regular expression. A lot depends on it.
      In the interim I'd like to resize the search space to a manageable dimension with the proven hypothesis that the number of
useless permutations of the kind being discussed is a bazillion minus one, because, fortunately, there is only one that makes sense.
It follows these two rules: 1. First come first serve. And 2. Of two contending targets the longer prevails. That is not to say
other rules never make sense. It is to say that in ninety-nine percent of the cases this rule applies and that the other one percent
can also be understood in terms of these two rules plus exception rules.

Chapter 3.1. from the SE doc visualizes the point by replacing a set of overlapping targets with their upper-cased selves:

   >>> overlapping_targets = 'be=BE being=BEING been=BEEN bee=BEE belong=BELONG long=LONG longer=LONGER'
   >>> high_strung_bee_story = "There was a bee belonging to hive nine longing to be a beetle and thinking that being a bee was
okay, but she had been a bee long enough and wouldn't be one much longer."
   >>> SE.SE (overlapping_targets)(high_strung_bee_story)
"There was a BEE BELONGing to hive nine LONGing to BE a BEEtle and thinking that BEING a BEE was okay, but she had BEEN a BEE LONG
enough and wouldn't BE one much LONGER."

The matches appear correct, though BELONGing, LONGing and BEEtle exemplify unintended matches. These could be avoided by refinement
of the target set. The partially successful result points out the direction the refinement has to take. Precedence being unrelated
to the order of the definitions, editing the set is carefree in that it does not impose precedence considerations. The benefit in
terms of modularity is substantial, as module sets can be combined.

Doing the example with a regular expression:

>>> r = re.compile ('be|being|been|bee|belong|long|longer')
>>> r.findall (high_strung_bee_story)
['be', 'be', 'long', 'long', 'be', 'be', 'be', 'be', 'be', 'be', 'long', 'be', 'long']

The regular expression also applies rule one: first come first serve, but applies a different rule two, defining precedence in the
order in which the targets line up in the expression, left to right. In order to implement the SE rule of sensibility, the patterns
should be lined up  reverse-sorted, which places the longer ones to the left.

>>> r = re.compile ('longer|long|belong|being|been|bee|be')
>>> r.findall (high_strung_bee_story)
['bee', 'belong', 'long', 'be', 'bee', 'being', 'bee', 'been', 'bee', 'long', 'be', 'longer']

Which of the two is correct? Without knowing the purpose of the search it is impossible to know. I expect it to be the second one.
But only the OP knows and could help his helpers help by letting them know what he is up to.


More information about the Python-list mailing list