filtering out "bad" regular expressions from user input

Andrew Dalke dalke at acm.org
Fri Sep 29 12:47:31 EDT 2000


Skip Montanaro wrote:
>I'm looking for a better approach to filtering out bad regular expressions.

Here's one approach:

Parse the regular expression (get /F's sre_parse from 2.0 - it's very
nice).

Simplify things by not allowing lookahead/behind assertions.  Since
you're using MySQL's regexp engine, you would also check for any
other construct it doesn't support.

Convert the tree into "real" characters, so that "category_word"
is "_ABCDEF...Zabc...z", etc.

The time slowdown occurs for backtracking, especially if there are
multiple levels of backtracking.  Backtracking can only occur if
there are is a branch where either route accepts the same character.
The only ones to be concerned about are +, * and {}s, so look for
those nodes and make sure the next node in the expression doesn't
contain the same character.

The other option is to write a regex engine which uses a DFA instead
of an NFA, but that's not a good solution for your problem.

                    Andrew
                    dalke at acm.org







More information about the Python-list mailing list