a regular expression question
Roy Smith
roy at panix.com
Sat Mar 22 09:05:30 EST 2003
lrl at cox.net (Luke) wrote:
> >>> re1 = re.compile("<a .*?>([0-9]+?)</a>(.*?)")
First off, let me echo the sentiments of the other posters that regex,
while a wonderful tool, is not a universal tool. Python includes
dedicated HTML/SGML parsers, and they are the right tools to use for
this.
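To make the parser suggestion concrete, here is a minimal sketch of pulling the numeric link text out with a parser instead of a regex. It assumes a current Python 3, where the standard-library parser lives in html.parser. The class name NumberedLinkParser, the helper numbered_links, and the sample markup are all made up for illustration and are not from Luke's post.

```python
from html.parser import HTMLParser


class NumberedLinkParser(HTMLParser):
    """Collect the text of <a> elements whose text is all digits.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.in_anchor = False   # True while we are inside <a> ... </a>
        self.buffer = []         # text fragments seen inside the current <a>
        self.numbers = []        # collected all-digit link texts

    def handle_starttag(self, tag, attrs):
        # Fires for both <a> and <a href="foo">; attributes are optional.
        if tag == "a":
            self.in_anchor = True
            self.buffer = []

    def handle_data(self, data):
        if self.in_anchor:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.in_anchor:
            text = "".join(self.buffer).strip()
            if text.isdigit():
                self.numbers.append(text)
            self.in_anchor = False


def numbered_links(html):
    """Return the all-digit link texts found in an HTML string."""
    parser = NumberedLinkParser()
    parser.feed(html)
    parser.close()
    return parser.numbers


# Made-up sample input, not taken from the original question.
sample = '<a href="foo">123</a> bar <a>45</a> <a href="x">nope</a>'
print(numbered_links(sample))
```

The parser treats a bare '<a>' and an '<a>' with attributes the same way, so the whitespace question never comes up.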
Second, as somewhat of a meta comment, I think it's a testament to the
complexity of {HT,SG,X}ML parsers everywhere that people are always
looking for ways to avoid using them. Regex is no picnic, yet people
seem to prefer wrestling with it over doing it "the right way".
I'm not picking on Python's implementations here; it happens in all
environments. I'm not picking on Luke either; I see lots of people do
this. I've seen it in my own company. I've seen it in my own
development group. I've even seen it with some of the people I pair
with. For fear of possible self-incrimination, I'll stop there :-)
There was an interesting item on Slashdot a few days ago on this very
subject: http://www.tbray.org/ongoing/When/200x/2003/03/16/XML-Prog
Finally, I really do have an on-topic comment about your regex. Why do
you have the space after the 'a' in '<a .*?>'? I gather the '.*?' part
is trying to make it optional to have anything after the 'a', so you can
recognize '<a>' as well as '<a href="foo">'. You require the space, so
with your regex, '<a >' will get recognized, but not '<a>'. I suspect
this is not what you intended. And, of course, that just brings us back
to my original statement that if you want to parse HTML, use tools
designed to parse HTML. Somebody else has already worried about these
kinds of details.
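The space issue is easy to check at the interpreter. The sketch below compares the opening-tag part of the original pattern against a variant where the whitespace and attributes are optional as a unit. The names required_space and optional_attrs and the second pattern are my own illustration, not something from Luke's code, and the variant is still no substitute for a real parser.

```python
import re

# Opening-tag portion of the pattern from the question: the space is mandatory.
required_space = re.compile(r"<a .*?>")

# Illustrative variant (an assumption, not from the original post):
# the whitespace plus attributes are optional as a group.
optional_attrs = re.compile(r"<a(\s[^>]*)?>")

samples = ["<a>", "<a >", '<a href="foo">', "<abbr>"]

for tag in samples:
    print(
        tag,
        bool(required_space.fullmatch(tag)),
        bool(optional_attrs.fullmatch(tag)),
    )
```

The first pattern rejects '<a>' but accepts '<a >'. The variant accepts both, and still rejects '<abbr>' because anything after the 'a' has to start with whitespace.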