[Tutor] best way to scrape html

Wed Feb 16 12:20:22 CET 2005

You might find these threads on comp.lang.python interesting:
http://tinyurl.com/5zmpn
http://tinyurl.com/6mxmb
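
For the task in the question quoted below, a parser is usually easier to get right than a regex. What follows is only a minimal sketch, not a definitive implementation. It assumes Python 3, where sgmllib no longer exists and the standard-library html.parser module fills the same role. The names KEEP, MarkupStripper and strip_markup are made up for illustration, and collapsing runs of whitespace is an assumption about the desired output.

```python
from html.parser import HTMLParser

# Tags to preserve (assumption: only these three, always without attributes).
KEEP = {"p", "b", "i"}


class MarkupStripper(HTMLParser):
    """Drop every tag not in KEEP and discard all attributes."""

    def __init__(self):
        # convert_charrefs=True turns references such as &#160; into text.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names; attrs are ignored on purpose.
        if tag in KEEP:
            self.parts.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in KEEP:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        self.parts.append(data)

    def result(self):
        # Collapse whitespace runs, including non-breaking spaces and newlines.
        return " ".join("".join(self.parts).split())


def strip_markup(html):
    """Return html with only the KEEP tags left in place."""
    parser = MarkupStripper()
    parser.feed(html)
    parser.close()  # flush any text the parser is still buffering
    return parser.result()
```

Calling strip_markup() on the <p class="BodyText" ...> fragment from the question should give the one-line <p><b>Trigger:</b> ... </p> form that was asked for. The parser reads the input in a single pass, so a file of 10,000+ lines should not be a problem in itself.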

Peter Kim wrote:
> Which method is best and most pythonic to scrape text data with
> minimal formatting?
> 
> I'm trying to read a large html file and strip out most of the markup,
> but leaving the simple formatting like <p>, <b>, and <i>.  For example:
> 
> <p class="BodyText" style="MARGIN: 0in 0in 12pt"><font face="Times New
> Roman"><b style="font-weight: normal"><span lang="EN-GB"
> style="FONT-SIZE: 12pt">Trigger:</span></b><span lang="EN-GB"
> style="FONT-SIZE: 12pt"><span style="spacerun: yes">&#160;</span>
> Debate on budget in Feb-Mar. New moves to cut medical costs by better
> technology.</span></font></p>
> 
> I want to change the above to:
> 
> <p><b>Trigger:</b> Debate on budget in Feb-Mar.  New moves to
> cut medical costs by better technology.</p>
> 
> Since I wanted some practice in regex, I started with something like this:
> 
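> import re
> # Raw string, so that \1 is a backreference and not the character \x01.
> pattern = r"(?:<)(.+?)(?: ?.*?>)(.*?)(</\1>)"
> # re.VERBOSE is left out: it would discard the literal space in the pattern.
> result = re.compile(pattern, re.IGNORECASE | re.DOTALL).findall(html)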
> 
> But it's getting messy real fast and somehow the non-greedy parts
> don't seem to work as intended.  Also I realized that the html file is
> going to be 10,000+ lines, so I wonder if regex can be used for large
> strings.
> 
> So I'm thinking of using sgmllib.py (as in the Dive Into Python
> example).  Is this where I should be using libxml2.py?  As you can
> tell this is my first foray into both parsing and regex so advice in
> terms of best practice would be very helpful.
> 
> Thanks,
> Peter Kim