[Tutor] search/match file position q

Danny Yoo dyoo at hashcollision.org
Tue Oct 7 20:13:32 CEST 2014


> So, what makes regex wrong for this job? question still remains: does the
> search start at the beginning of the line each time or does it step forward
> from the last search? I will check out beautiful soup as suggested in a
> subsequent mail; I'd still like to finish this process:<}}


Mathematically, regular expressions can describe only a certain class
of languages, called the "regular languages".  Roughly speaking, these
are the languages that can be recognized using a fixed, finite amount
of memory.  As a concrete example of the limitation: you can't write a
pattern that properly matches arbitrarily nested parentheses with a
regular expression alone.
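
To make that limitation concrete, here is a minimal sketch.  The names
is_balanced and TWO_LEVELS are made up for illustration.  The counter
version handles any nesting depth, while the fixed pattern only copes
with the depth that was written into it by hand (two levels here).

```python
import re

def is_balanced(text):
    """Return True if every '(' in text has a matching ')'."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A ')' showed up before its '('.
                return False
    return depth == 0

# A hand-written pattern that accepts nesting up to two levels deep only.
# Supporting a third level means rewriting the pattern, and so on forever.
TWO_LEVELS = re.compile(r"(?:\((?:\(\))*\))*")

print(is_balanced("((()))"))                  # depth 3 is fine for the counter
print(TWO_LEVELS.fullmatch("(())") is not None)
print(TWO_LEVELS.fullmatch("((()))") is not None)
```

The counter is exactly the piece of unbounded memory that a plain
regular expression doesn't have.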

This isn't a challenge to your machismo: it's a matter of mathematics!
For the precise details of the impossibility proof, you'd need to
take a CS theory class and, in particular, learn about the "pumping
lemma for regular languages".  Sipser's "Introduction to the Theory
of Computation" has a good presentation.  This is one reason why CS
theory matters: it can tell you when some approach is not a good idea.
:P

HTML is not a regular language: it has nested substructure.  Matching
start and end tags is essentially the same problem as matching
balanced parentheses.
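
As a rough illustration of why tag matching needs more than a pattern,
here is a sketch that checks nesting with a stack, built on the standard
library's html.parser module.  The names TagBalanceChecker and
tags_balanced are hypothetical, and the check is deliberately
simplified: it treats every start tag as needing an end tag, so real
pages with void elements such as <br> would be reported as unbalanced.

```python
from html.parser import HTMLParser

class TagBalanceChecker(HTMLParser):
    """Track open tags on a stack and note any mismatched end tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of start tags not yet closed
        self.ok = True

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # An end tag must close the most recently opened tag.
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.ok = False

def tags_balanced(markup):
    checker = TagBalanceChecker()
    checker.feed(markup)
    checker.close()
    return checker.ok and not checker.open_tags

print(tags_balanced("<div><p>hi</p></div>"))   # properly nested
print(tags_balanced("<div><p>hi</div></p>"))   # end tags in the wrong order
```

The stack can grow as deep as the document nests, which is the part a
regular expression has no way to express.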

So those are the objections from the purely mathematical point of view.
This is not to say that regular expressions are useless: they work
well for breaking down HTML into a sequence of tokens.  If you only
care about processing individual tokens at a time, regexes might be
appropriate.  They're just not the best tool for everything.  From a
practical point of view: HTML parsing libraries such as Beautiful Soup
are nicer to work with than plain regular expressions.
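
For the token-at-a-time case, a small sketch of what that might look
like follows.  The TOKEN pattern and the tokenize helper are invented
for this example and are intentionally naive: they ignore comments,
scripts, and attribute values that contain ">".

```python
import re

# Either one whole tag, or a run of text up to the next "<".
TOKEN = re.compile(r"<[^>]+>|[^<]+")

def tokenize(markup):
    """Split markup into a flat list of tag tokens and text tokens."""
    return TOKEN.findall(markup)

for token in tokenize("<p>Hello <b>world</b></p>"):
    print(repr(token))
```

Note that findall() scans the string from left to right, and each
search picks up where the previous match ended rather than going back
to the start.  What the regex gives you is a flat list of tokens;
deciding which end tag belongs to which start tag is still a separate
job.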

