[Tutor] BeautifulSoup confusion

Kent Johnson kent37 at tds.net
Fri Apr 10 03:11:07 CEST 2009


On Thu, Apr 9, 2009 at 7:27 PM, Steve Lyskawa <steve.mckmps at gmail.com> wrote:
> I'm having a
> problem with Beautiful Soup.  I can get it to scrape all the href links
> off a web page, but I am having problems selecting specific URIs from the
> output supplied by Beautiful Soup.
> What exactly is it returning to me and what command would I use to find that
> out?

Generally it gives you Tag and NavigableString objects, or lists of
same. To find out what something is, print its type:
print type(x)
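
For example, assuming soup is a parsed page and it contains at least
one <a> tag, a quick check like this (just a sketch) shows what you
are working with:

first_link = soup.find('a')      # a Tag object (None if the page has no <a>)
print type(first_link)
print type(first_link.string)    # often a NavigableString (the text inside the tag)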

> Do I have to take each line it gives me and put it into a list before I
> can, for example, get only certain URIs containing a certain string, or use
> the results to get the web page that the URI is referring to?
> The pseudo code for what I am trying to do:
> Get all URIs from the web page that contain the string "env.html"
> Open the web page each one refers to.
> Scrape selected information off that page.

If you want to get all the URIs at once, that implies building a
list. You could also process the URIs one at a time.
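
For instance, both of these should work (just a sketch, assuming soup
is an already-parsed page and pattern is a compiled regular
expression):

# all at once: build a list of the href values
hrefs = [a['href'] for a in soup.findAll('a', href=pattern)]

# one at a time: handle each link as it is found
for a in soup.findAll('a', href=pattern):
    handle(a['href'])    # handle() is a placeholder for your own code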

> I'm having a problem with step #1.  I can get all the URIs, but I can't seem
> to get re.compile to work right.  If I could get it to give me just the URI,
> without tags or the link description, that would be ideal.

Something like this should get you started (here html stands for the
text of the web page you have already fetched):

import re
from BeautifulSoup import BeautifulSoup   # BeautifulSoup 3
soup = BeautifulSoup(html)
for anchor in soup.findAll('a', href=re.compile(r'env\.html')):
    print anchor['href']

That says: find all <a> tags whose 'href' attribute contains a match
for the regex 'env\.html'. Each matching Tag is assigned to the anchor
variable in turn, and the value of its 'href' attribute is printed.
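
To cover steps 2 and 3 of your pseudo code (fetching each linked page
and scraping it), something along these lines may work. This is only a
sketch: it uses urllib2 and urlparse from the standard library,
base_url is a made-up starting address, and printing the <title> tag
stands in for whatever you actually want to scrape:

import re
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup   # BeautifulSoup 3

base_url = 'http://www.example.com/index.html'   # assumed starting page
soup = BeautifulSoup(urllib2.urlopen(base_url).read())

for anchor in soup.findAll('a', href=re.compile(r'env\.html')):
    # step 2: resolve the link against the page it came from and fetch it
    page_url = urlparse.urljoin(base_url, anchor['href'])
    page = BeautifulSoup(urllib2.urlopen(page_url).read())
    # step 3: scrape whatever you need from that page
    print page.find('title')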

I find it very helpful with BS to experiment at the command line. It
often takes a few tries to understand what it is giving you and how to
get exactly what you want.

Kent

