a regular expression question

Sat Mar 22 09:05:30 EST 2003

lrl at cox.net (Luke) wrote:
> >>> re1 = re.compile("<a .*?>([0-9]+?)</a>(.*?)")

First off, let me echo the sentiments of the other posters that regex, 
while a wonderful tool, is not a universal tool.  Python includes 
dedicated HTML/SGML parsers, and they are the right tools to use for 
this.
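
For what it's worth, here is a rough sketch of what the parser route 
might look like.  I'm assuming a current Python, where the module is 
html.parser, and I'm guessing that what you want is the text of every 
link whose text is all digits.  The names NumericLinkParser and 
numeric_links are made up for this example.

```python
from html.parser import HTMLParser


class NumericLinkParser(HTMLParser):
    """Collect the text of <a> elements whose text is all digits.

    Hypothetical helper for illustration; not part of any library.
    """

    def __init__(self):
        super().__init__()
        self._in_anchor = False   # True while we are inside an <a> element
        self._text = []           # pieces of text seen inside the current <a>
        self.numbers = []         # results, in document order

    def handle_starttag(self, tag, attrs):
        # Fires for <a>, <a >, and <a href="foo"> alike.
        if tag == "a":
            self._in_anchor = True
            self._text = []

    def handle_data(self, data):
        if self._in_anchor:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            text = "".join(self._text).strip()
            if text.isdigit():
                self.numbers.append(text)
            self._in_anchor = False


def numeric_links(html):
    """Return the digit-only link texts found in an HTML string."""
    parser = NumericLinkParser()
    parser.feed(html)
    parser.close()
    return parser.numbers
```

Calling numeric_links() on a page's HTML gives you a list of strings; 
the parser never has to care whether an anchor tag carries attributes.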

Second, as somewhat of a meta comment, I think it's a testament to the 
complexity of {HT,SG,X}ML parsers everywhere that people are always 
looking for ways to avoid using them.  Regex is no picnic, yet people 
seem to prefer wrestling with it over doing it "the right way".

I'm not picking on Python's implementations here; it happens in all 
environments. I'm not picking on Luke either; I see lots of people do 
this.  I've seen it in my own company.  I've seen it in my own 
development group.  I've even seen it with some of the people I pair 
with.  For fear of possible self-incrimination, I'll stop there :-)

There was an interesting item on Slashdot a few days ago on this very 
subject:  http://www.tbray.org/ongoing/When/200x/2003/03/16/XML-Prog

Finally, I really do have an on-topic comment about your regex.  Why do 
you have the space after the 'a' in '<a .*?>'?  I gather the '.*?' part 
is trying to make it optional to have anything after the 'a', so you can 
recognize '<a>' as well as '<a href="foo">'.  You require the space, so 
with your regex, '<a >' will get recognized, but not '<a>'.  I suspect 
this is not what you intended.  And, of course, that just brings us back 
to my original statement that if you want to parse HTML, use tools 
designed to parse HTML.  Somebody else has already worried about these 
kinds of details.
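
If you want to see the space problem for yourself, something like the 
following should show it.  The first pattern is yours, unchanged.  The 
second is only my guess at what you meant (attributes, and the space in 
front of them, made optional); the names original, relaxed and samples 
are just for this example.

```python
import re

# Luke's pattern as posted: the space after 'a' is mandatory.
original = re.compile("<a .*?>([0-9]+?)</a>(.*?)")

# Hypothetical variant: the whitespace and any attributes are optional,
# so a bare <a> is accepted, while a tag such as <abbr> still is not.
relaxed = re.compile(r"<a(?:\s[^>]*)?>([0-9]+?)</a>")

samples = ['<a>12</a>', '<a >12</a>', '<a href="foo">12</a>']

# One True/False per sample, in the same order as `samples`.
original_hits = [bool(original.search(s)) for s in samples]
relaxed_hits = [bool(relaxed.search(s)) for s in samples]

for sample, first, second in zip(samples, original_hits, relaxed_hits):
    print(f"{sample!r:26} original={first!s:5} relaxed={second}")
```

The original pattern misses the bare '<a>' sample and finds the other 
two; the relaxed one finds all three.  Even so, the relaxed pattern 
still knows nothing about uppercase tags, markup nested inside the 
link, or a '>' inside an attribute value, which is the point about 
letting a parser worry about it.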