filtering out "bad" regular expressions from user input

Skip Montanaro skip at mojam.com
Fri Sep 29 10:59:24 EDT 2000


We allow users to input regular expressions to our search engine.  Problem
is, when (and they inevitably do, so it's not "if") they input a bad regular
expression, it sends the backend search engine into a tizzy (backend is
MySQL, frontend is Python).  For example, "^" will cause it to match
everything, as will "^.*$", "" and ".*".

Currently, I'm filtering out a small set of known bad inputs using simple
string comparison:

    '^'  '$'  '.'  '|'  ''  '[]'  '[.]'  '.*'  '.+'  '^.+$'  '^.*$'

Of course, that leaves out '^.*.*$', '^.* .*$', '.+.*' and '.*|.*', to name
but a few.

I'm looking for a better approach to filtering out bad regular expressions.
I realize determining if a regular expression will run for a long time is
similar (or equivalent) to the halting problem and largely depends on the
context in which it is applied, but there must be a better way to reduce the
odds of allowing bad input through to a suitably small value than simply
looking for a small set of easy-to-guess cases.  I could simply disallow any
searches that contain '.*' or '.+', but there's always '[A-Z0-9a-z ]*' and
more, and I don't want to simply eliminate patterns containing "*" or "+",
because that would pretty much eviscerate the capability.

Any thoughts?

-- 
Skip Montanaro (skip at mojam.com)
http://www.mojam.com/
http://www.musi-cal.com/




More information about the Python-list mailing list