[Tutor] best way to scrape html
Kent Johnson
kent37 at tds.net
Wed Feb 16 12:20:22 CET 2005
You might find these threads on comp.lang.python interesting:
http://tinyurl.com/5zmpn
http://tinyurl.com/6mxmb
Peter Kim wrote:
> Which method is best and most pythonic to scrape text data with
> minimal formatting?
>
> I'm trying to read a large html file and strip out most of the markup,
> but leaving the simple formatting like <p>, <b>, and <i>. For example:
>
> <p class="BodyText" style="MARGIN: 0in 0in 12pt"><font face="Times New
> Roman"><b style="font-weight: normal"><span lang="EN-GB"
> style="FONT-SIZE: 12pt">Trigger:</span></b><span lang="EN-GB"
> style="FONT-SIZE: 12pt"><span style="spacerun: yes"> </span>
> Debate on budget in Feb-Mar. New moves to cut medical costs by better
> technology.</span></font></p>
>
> I want to change the above to:
>
> <p><b>Trigger:</b> Debate on budget in Feb-Mar. New moves to
> cut medical costs by better technology.</p>
>
> Since I wanted some practice in regex, I started with something like this:
>
> pattern = "(?:<)(.+?)(?: ?.*?>)(.*?)(</\1>)"
> result = re.compile(pattern, re.IGNORECASE | re.VERBOSE |
> re.DOTALL).findall(html)
>
> But it's getting messy real fast and somehow the non-greedy parts
> don't seem to work as intended. Also I realized that the html file is
> going to be 10,000+ lines, so I wonder if regex can be used for large
> strings.
>
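On the regex attempt: two details of the snippet above are worth checking. In an ordinary (non-raw) string literal, "\1" is the character chr(1) rather than a backreference, and with re.VERBOSE unescaped whitespace in the pattern is ignored, so the literal space is not matched as written. The size of the file is not an obstacle in itself; the re module copes with strings much larger than 10,000 lines. Below is a minimal sketch of a regex-only approach. It assumes the markup is reasonably well-formed and that no attribute value contains a ">" character. The names ALLOWED, TAG and simplify are made up for illustration.

```python
import re

# Tags to keep (without their attributes); every other tag is dropped.
ALLOWED = {"p", "b", "i"}

# Raw string, so backslashes reach the regex engine untouched.
# [^>]* also matches newlines, so tags split across lines are handled.
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def simplify(html):
    """Strip attributes from allowed tags and remove all other tags."""
    def replace_tag(match):
        slash, name = match.group(1), match.group(2).lower()
        if name in ALLOWED:
            return "<%s%s>" % (slash, name)
        return ""
    return TAG.sub(replace_tag, html)


sample = ('<p class="BodyText"><font face="Times">'
          '<b style="font-weight: normal"><span lang="EN-GB">Trigger:</span></b>'
          ' Debate on budget.</font></p>')
print(simplify(sample))
```

This rewrites each tag on its own instead of trying to match an opening tag together with its closing tag, which avoids the nested-tag problem that backreferences run into.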
> So I'm thinking of using sgmllib.py (as in the Dive into Python
> example). Is this where I should be using libxml2.py? As you can
> tell this is my first foray into both parsing and regex so advice in
> terms of best practice would be very helpful.
>
> Thanks,
> Peter Kim
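For the parser route, here is a minimal sketch of the same clean-up using a real HTML parser. It uses html.parser from the current standard library as a stand-in for sgmllib, which is no longer included in Python 3; the callback style is similar. The class name KeepSimpleTags and the helper strip_markup are made up for illustration, and the whitespace collapsing at the end is an assumption that would not suit <pre> blocks.

```python
from html import escape
from html.parser import HTMLParser


class KeepSimpleTags(HTMLParser):
    """Keep <p>, <b> and <i> without attributes; drop every other tag."""

    KEEP = {"p", "b", "i"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names; attributes are discarded.
        if tag in self.KEEP:
            self.parts.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in self.KEEP:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        # Text arrives with entities decoded, so re-escape &, < and >.
        self.parts.append(escape(data, quote=False))

    def result(self):
        # Collapse runs of whitespace (including newlines) to one space.
        return " ".join("".join(self.parts).split())


def strip_markup(html):
    parser = KeepSimpleTags()
    parser.feed(html)
    parser.close()
    return parser.result()


sample = '''<p class="BodyText" style="MARGIN: 0in 0in 12pt"><font face="Times New
Roman"><b style="font-weight: normal"><span lang="EN-GB"
style="FONT-SIZE: 12pt">Trigger:</span></b><span lang="EN-GB"
style="FONT-SIZE: 12pt"><span style="spacerun: yes"> </span>
Debate on budget in Feb-Mar. New moves to cut medical costs by better
technology.</span></font></p>'''
print(strip_markup(sample))
```

For a large file, feed() can be called repeatedly with successive chunks, so the whole document does not have to be processed in one call.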