need help with re module

Sat Jun 23 00:12:17 EDT 2007

Gabriel Genellina wrote:
> On Wed, 20 Jun 2007 17:56:30 -0300, David Wahler <dwahler at gmail.com>
> wrote:
> 
>> On 6/20/07, Gabriel Genellina <gagsl-py2 at yahoo.com.ar> wrote:
>>
[snip]
>> I agree that BeautifulSoup is probably the best tool for the job, but
>> this doesn't sound right to me. Since the OP doesn't care about tags
>> being properly nested, I don't see why a regex (albeit a tricky one)
>> wouldn't work. For example:
>>
[snip]
>>
>> Granted, this misses out a few things (e.g. DOCTYPE declarations), but
>> those should be straightforward to handle.
> 
> It doesn't handle a lot of things. For this input (nothing special,
> just a few simple mistakes):
> 
> <html>
> <a href="http://foo.com/baz.html>click here</a>
> <p>What if price<100? You lose.
> <p>What if HitPoints<-10? You are dead.
> <p>Assignment: target <-- any_expression
> Just a few last words.
> </html>
> 
> the BeautifulSoup version gives:
> 
> click here
> What if price<100? You lose.
> What if HitPoints<-10? You are dead.
> Assignment: target <-- any_expression
> Just a few last words.
> 
> and the regular expression version gives:
> 
> <a href="http://foo.com/baz.html>click here
> What if priceWhat if HitPointsAssignment: target
> 
> Clearly the BeautifulSoup version gives the "right" result, or the
> "expected" one.
> It's hard to get that with only a regular expression; you need more
> power, and BeautifulSoup fills the gap.

Speak for yourself.  If I'm writing an HTML syntax checker, I think I'll 
skip BeautifulSoup and use something that gives me the results that I 
expect, not the results that you expect.
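
For anyone who wants to reproduce the kind of truncation quoted above, here
is a minimal sketch. The regex posted earlier in the thread was snipped, so
the pattern below is only a hypothetical naive stand-in (anything from "<" up
to the next ">" counts as a tag), and the names SAMPLE, NAIVE_TAG and
strip_tags_naive are made up for illustration. It uses nothing beyond the
standard re module.

```python
import re

# Sample input quoted in the thread (unclosed attribute quote, stray "<").
SAMPLE = """<html>
<a href="http://foo.com/baz.html>click here</a>
<p>What if price<100? You lose.
<p>What if HitPoints<-10? You are dead.
<p>Assignment: target <-- any_expression
Just a few last words.
</html>
"""

# Hypothetical naive pattern, NOT the one snipped from the thread:
# everything from "<" up to the next ">" is treated as a tag.
# [^>] also matches newlines, so a stray "<" can swallow several lines.
NAIVE_TAG = re.compile(r"<[^>]*>")


def strip_tags_naive(html):
    """Remove everything the naive pattern considers a tag."""
    return NAIVE_TAG.sub("", html)


stripped = strip_tags_naive(SAMPLE)
print(stripped)
```

With this pattern the "<" in "price<100" is taken as the start of a tag, so
the text up to the next ">" disappears, and the last two lines of text are
swallowed by the "<" in "<--". Whether that counts as wrong depends on what
you expect the tool to do with malformed input, which is the point under
discussion.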