ask for a RE pattern to match TABLE in html

David C. Ullrich dullrich at sprynet.com
Mon Jun 30 21:43:45 CEST 2008


In article 
<50285840-7601-41be-aa3d-865f46fe85c6 at 56g2000hsm.googlegroups.com>,
 Dan <thermostat at gmail.com> wrote:

> On Jun 27, 1:32 pm, "David C. Ullrich" <dullr... at sprynet.com> wrote:
> > In article
> > <62f752f3-d840-42de-a414-0d56d15d7... at w4g2000prd.googlegroups.com>,
> >  Jonathan Gardner <jgard... at jonathangardner.net> wrote:
> >
> > > On Jun 26, 3:22 pm, MRAB <goo... at mrabarnett.plus.com> wrote:
> > > > Try something like:
> >
> > > > re.compile(r'<table\b.*?>.*?</table>', re.DOTALL)
> >
> > > So you would pick up strings like "<table><tr><td><table><tr><td>foo</
> > > td></tr></table>"? I doubt that is what oyster wants.
> >
> > I asked a question recently - nobody answered, I think
> > because they assumed it was just a rhetorical question:
> >
> > (i) It's true, isn't it, that it's impossible for the
> > formal CS notion of "regular expression" to correctly
> > parse nested open/close delimiters?
> 
> Yes. For the proof, you want to look at the pumping lemma found in
> your favorite Theory of Computation textbook.

Ah, thanks. Don't have a favorite text, not having any at all.
But wikipedia works - what I found at 

http://en.wikipedia.org/wiki/Pumping_lemma_for_regular_languages

was pretty clear. (Yes, it's exactly that \1, \2 stuff that
convinced me I really don't understand what one can do with
a Python regex.)

> >
> > (ii) The regexes in languages like Python and Perl include
> > features that are not part of the formal CS notion of
> > "regular expression". Do they include something that
> > does allow parsing nested delimiters properly?
> 
> So, I think most of the extensions fall into syntactic sugar
> (certainly all the character classes \b \s \w, etc). The ability to
> look at input without consuming it is more than syntactic sugar, but
> my intuition is that it could be pretty easily modeled by a
> nondeterministic finite state machine, which is of equivalent power to
> REs. The only thing I can really think of that is completely non-
> regular is the \1 \2, etc syntax to match previously match strings
> exactly. But since you can't to an arbitrary number of them, I don't
> think its actually context free. (I'm not prepared to give a proof
> either way). Needless to say that even if you could, it would be
> highly impractical to match parentheses using those.
> 
> So, yeah, to match arbitrary nested delimiters, you need a real
> context free parser.
> 
> >
> > --
> > David C. Ullrich
> 
> 
> -Dan

-- 
David C. Ullrich



More information about the Python-list mailing list