[Tutor] best way to scrape html

Peter Kim peateyk at gmail.com
Wed Feb 16 06:45:54 CET 2005


What is the best and most Pythonic way to scrape text data while
keeping only minimal formatting?

I'm trying to read a large HTML file and strip out most of the markup
while leaving simple formatting tags like <p>, <b>, and <i> in place.  For example:

<p class="BodyText" style="MARGIN: 0in 0in 12pt"><font face="Times New
Roman"><b style="font-weight: normal"><span lang="EN-GB"
style="FONT-SIZE: 12pt">Trigger:</span></b><span lang="EN-GB"
style="FONT-SIZE: 12pt"><span style="spacerun: yes">&#160;</span>
Debate on budget in Feb-Mar. New moves to cut medical costs by better
technology.</span></font></p>

I want to change the above to:

<p><b>Trigger:</b> Debate on budget in Feb-Mar.  New moves to
cut medical costs by better technology.</p>

Since I wanted some practice with regular expressions, I started with something like this:

import re

# Meant to grab the tag name, skip attributes, capture the text, and match the closing tag.
pattern = r"(?:<)(.+?)(?: ?.*?>)(.*?)(</\1>)"
result = re.compile(pattern,
                    re.IGNORECASE | re.VERBOSE | re.DOTALL).findall(html)

But it's getting messy really fast, and somehow the non-greedy parts
don't seem to work as intended.  I also realized that the HTML file is
going to be 10,000+ lines, so I wonder whether regular expressions are
practical on strings that large.
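To show the kind of thing I'm fumbling toward, here is a cruder
two-pass version.  It's untested, strip_markup and KEEP are just names
I made up, and I'm sure it still trips over things like entities and
malformed markup:

import re

KEEP = "p|b|i"

# Pass 1: drop the attributes from the tags I want to keep,
# e.g. <p class="BodyText" style="..."> becomes plain <p>.
keep_tag = re.compile(r"<(%s)\b[^>]*>" % KEEP, re.IGNORECASE)

# Pass 2: delete every other opening or closing tag outright.
other_tag = re.compile(r"</?(?!(?:%s)\b)[a-zA-Z][^>]*>" % KEEP, re.IGNORECASE)

def strip_markup(html):
    html = keep_tag.sub(r"<\1>", html)
    html = other_tag.sub("", html)
    return html

Even if that works on the sample above, it leaves the &#160; entity
behind, which is partly why I suspect a real parser is the better route.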

So I'm thinking of using sgmllib.py (as in the Dive into Python
example).  Or is this the kind of job where I should be using
libxml2.py instead?  As you can tell, this is my first foray into both
parsing and regular expressions, so any advice on best practice would
be very helpful.
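
For what it's worth, the sgmllib version I have in mind would look
roughly like this.  It's untested, and MarkupStripper, pieces and
output() are just my own names, loosely modelled on the
BaseHTMLProcessor example from Dive into Python:

import sgmllib

class MarkupStripper(sgmllib.SGMLParser):
    """Keep <p>, <b> and <i> (minus attributes) and the text; drop every other tag."""

    keep = ("p", "b", "i")

    def reset(self):
        sgmllib.SGMLParser.reset(self)
        self.pieces = []

    def unknown_starttag(self, tag, attrs):
        # attrs (a list of (name, value) pairs) is deliberately thrown away
        if tag in self.keep:
            self.pieces.append("<%s>" % tag)

    def unknown_endtag(self, tag):
        if tag in self.keep:
            self.pieces.append("</%s>" % tag)

    def handle_data(self, text):
        self.pieces.append(text)

    def handle_entityref(self, ref):
        self.pieces.append("&%s;" % ref)

    def handle_charref(self, ref):
        self.pieces.append("&#%s;" % ref)

    def output(self):
        return "".join(self.pieces)

parser = MarkupStripper()
parser.feed(html)       # html holds the whole file contents, read earlier
parser.close()
clean = parser.output()

The appeal over my regex attempt is that nesting, attributes and
entities become the parser's problem rather than mine; I only decide
what to keep in unknown_starttag and unknown_endtag.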

Thanks,
Peter Kim

