Regular expression to structure HTML

Brian D briandenzer at gmail.com
Fri Oct 2 16:35:04 CEST 2009


Yes, John, that's correct. I'm trying to match and discard the <tr> and
<td> tags, re-formatting the cell data with pipes so that I can more
readily import it into a database. The tags are, of course, initially
useful for pattern discovery. But there are other approaches -- I
could just strip the tags and capture the data as an array.
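
A minimal sketch of that second approach, assuming every row looks like
the sample in the quoted message below. The helper name rows_to_lists
and the three pattern names are made up for illustration:

```python
import re

# Sample rows copied from the quoted post, joined onto one line.
html = (
    "<tr><td valign=top>8583</td><td valign=top>"
    "<a href=lic_details.asp?lic_number=8583>"
    "New Horizon Technical Academy, Inc #4</a></td>"
    "<td valign=top>Jefferson</td><td valign=top>70114</td></tr>"
    "<tr><td valign=top>9371</td><td valign=top>"
    "<a href=lic_details.asp?lic_number=9371>"
    "Career Learning Center</a></td>"
    "<td valign=top>Jefferson</td><td valign=top>70113</td></tr>"
)

ROW = re.compile(r"<tr>(.*?)</tr>", re.S)         # one table row
CELL = re.compile(r"<td[^>]*>(.*?)</td>", re.S)   # one cell inside a row
TAG = re.compile(r"<[^>]+>")                      # any leftover tag, e.g. <a ...>

def rows_to_lists(text):
    """Return one list of cell strings per <tr> row, with inner tags removed."""
    rows = []
    for row in ROW.findall(text):
        cells = [TAG.sub("", cell).strip() for cell in CELL.findall(row)]
        rows.append(cells)
    return rows

for fields in rows_to_lists(html):
    print("|".join(fields))
```

Each row comes back as a plain list, so the pipe-delimited line is just a
join, and the same lists could be handed to a database insert instead.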

I'm well aware of the problems with using regular expressions for HTML
parsing. This isn't merely a question of knowing when to use the right
tool. It's a question about how to become a better developer using
regular expressions.

I'm trying to figure out where the regular expression fails. The
structure of the page I'm scraping is uniform in the production of
tags -- it's an old ASP page that pulls data from a database.

What's different in the first <tr> row is the appearance of a comma, a
pound sign (#), and a number (", Inc #4"). I'm making the assumption
that's what's throwing off the remainder of the regular expression --
because (despite the snark by others above) the expression is working
for every other data row. But I could be wrong. Of course, if I could
identify the problem, I wouldn't be asking. That's why I posted the
question for other eyes to review.

I discovered that I may actually be properly parsing the data from the
tags when I tried this test in a Python interpreter:

>>> import re
>>> s = "New Horizon Technical Academy, Inc #4</a>"
>>> p = re.compile(r'([\s\S\WA-Za-z0-9]*)(</.*?>)')
>>> m = p.match(s)
>>> m.group(0)
'New Horizon Technical Academy, Inc #4</a>'
>>> m.group(1)
'New Horizon Technical Academy, Inc #4'
>>> m.group(2)
'</a>'

I found it curious that I was capturing the groups as sequences, but I
didn't understand how to use this knowledge in named groups -- or
maybe I am merely mis-identifying the source of the regular expression
problem.
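
The interpreter test above stops at the first closing tag, so it cannot
show what happens when more markup follows. A character class that
contains both \s and \S accepts every character, "<" and ">" included,
and "*" and "+" are greedy, so that may be part of why the name group
runs on into the next cells. The sketch below compares the greedy class
with [^<]+, using a named group in both cases. The sample string is
shortened from the quoted post, and the variable names are only
illustrative:

```python
import re

# Shortened sample: the name cell followed by one more cell.
s = "New Horizon Technical Academy, Inc #4</a></td><td valign=top>Jefferson</td>"

# Greedy: the class accepts every character, so the group only backs off
# far enough to let the LAST closing tag match.
greedy = re.compile(r"(?P<zname>[\s\S\WA-Za-z0-9]*)(</.*?>)")

# Bounded: [^<] cannot cross a tag boundary, so the group ends at "</a>".
bounded = re.compile(r"(?P<zname>[^<]+)(</.*?>)")

greedy_name = greedy.match(s).group("zname")
bounded_name = bounded.match(s).group("zname")

print(greedy_name)
print(bounded_name)
```

If that is what is happening, the comma and the "#" in the first row
would not be the cause. The class matches them, along with everything
after them.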

It's a puzzle. I'm hoping someone will want to share the wisdom of
their experience, not criticize for the attempt to learn. Maybe one
shouldn't learn how to use a hammer on a screw, but I wouldn't say
that I have never hammered a screw into a piece of wood just because I
only had a hammer.

Thanks,
Brian


On Oct 2, 8:38 am, John <jmg3... at gmail.com> wrote:
> On Oct 2, 1:10 am, "504cr... at gmail.com" <504cr... at gmail.com> wrote:
>
>
>
> > I'm kind of new to regular expressions, and I've spent hours trying to
> > finesse a regular expression to build a substitution.
>
> > What I'd like to do is extract data elements from HTML and structure
> > them so that they can more readily be imported into a database.
>
> > No -- sorry -- I don't want to use BeautifulSoup (though I have for
> > other projects). Humor me, please -- I'd really like to see if this
> > can be done with just regular expressions.
>
> > Note that the output is referenced using named groups.
>
> > My challenge is successfully matching the HTML tags in between the
> > first table row, and the second table row.
>
> > I'd appreciate any suggestions to improve the approach.
>
> > import re
>
> > rText = ("<tr><td valign=top>8583</td><td valign=top>"
> >          "<a href=lic_details.asp?lic_number=8583>"
> >          "New Horizon Technical Academy, Inc #4</a></td>"
> >          "<td valign=top>Jefferson</td><td valign=top>70114</td></tr>"
> >          "<tr><td valign=top>9371</td><td valign=top>"
> >          "<a href=lic_details.asp?lic_number=9371>"
> >          "Career Learning Center</a></td>"
> >          "<td valign=top>Jefferson</td><td valign=top>70113</td></tr>")
>
> > rText = re.compile(
> >     r'(<tr><td valign=top>)(?P<zlicense>\d+)(</td>)(<td valign=top>)'
> >     r'(<a href=lic_details.asp)(\?lic_number=\d+)(>)'
> >     r'(?P<zname>[A-Za-z0-9#\s\S\W]+)(</.*?>).+$'
> > ).sub(r'LICENSE:\g<zlicense>|NAME:\g<zname>\n', rText)
>
> > print(rText)
>
> > LICENSE:8583|NAME:New Horizon Technical Academy, Inc #4</a></td><td
> > valign=top>Jefferson</td><td valign=top>70114</td></tr><tr><td
> > valign=top>9371</td><td valign=top><a href=lic_details.asp?
> > lic_number=9371>Career Learning Center|PARISH:Jefferson|ZIP:70113
>
> Some suggestions to start off with:
>
>   * triple-quote your multiline strings
>   * consider using the re.X, re.M, and re.S options for re.compile()
>   * save your re object after you compile it
>   * note that re.sub() returns a new string
>
> Also, it sounds like you want to replace the first 2 <td> elements for
> each <tr> element with their content separated by a pipe (throwing
> away the <td> tags themselves), correct?
>
> ---John
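
John's suggestions above might be combined along these lines. This is
only a sketch, not the pattern from the original post: it is a reworked
version that uses [^<]+ for the name and re.X for layout, and the names
ROW, html, and result are illustrative.

```python
import re

# Triple-quoted so the sample can span lines without string concatenation.
html = """<tr><td valign=top>8583</td><td valign=top><a href=lic_details.asp?lic_number=8583>New Horizon Technical Academy, Inc #4</a></td><td valign=top>Jefferson</td><td valign=top>70114</td></tr>
<tr><td valign=top>9371</td><td valign=top><a href=lic_details.asp?lic_number=9371>Career Learning Center</a></td><td valign=top>Jefferson</td><td valign=top>70113</td></tr>"""

# Compiled once and saved. With re.X, whitespace in the pattern is ignored,
# so literal spaces are written as \s+ and comments are allowed inline.
ROW = re.compile(r"""
    <tr>
    <td\s+valign=top>(?P<zlicense>\d+)</td>     # license number cell
    <td\s+valign=top>
        <a\s+href=lic_details\.asp\?lic_number=\d+>
        (?P<zname>[^<]+)                        # name: everything up to the next tag
        </a>
    </td>
    .*?</tr>                                    # skip the remaining cells of this row
    """, re.X | re.S)

# sub() returns a new string; html itself is left unchanged.
result = ROW.sub(r"LICENSE:\g<zlicense>|NAME:\g<zname>", html)
print(result)
```

Because each match ends at its own </tr>, the substitution runs once per
row, and the newline between the two rows in the sample is preserved.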



