[Tutor] extract uri from beautiful soup string
eryksun
eryksun at gmail.com
Tue Oct 16 07:40:04 CEST 2012
On Mon, Oct 15, 2012 at 1:17 PM, Norman Khine <norman at khine.net> wrote:
>
> i made an update: https://gist.github.com/3891927 which works based on
> some of the recommendations posted here.
>
> any suggestions for improvement?
I can't make any specific recommendations about using BeautifulSoup
since I haven't used it in a long time. Here are a few general
suggestions.
PEP 8 recommends using 4-space indents instead of tabs, if possible.
Move constant definitions out of loops. For example, there's no point
in rebuilding the same dict literal and assigning it to "headers" on
every pass through the page_no loop. Also, "headers" is never actually
used; to send custom headers you need to make a Request object:
http://docs.python.org/library/urllib2#urllib2.Request
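For instance, something along these lines (the header value is just a
made-up example):

```python
try:
    import urllib2  # Python 2, as in the original script
except ImportError:
    import urllib.request as urllib2  # same Request API in Python 3

# Build once, outside the loop; pass the dict to Request so the
# headers are actually sent.
headers = {'User-Agent': 'Mozilla/5.0'}
url = 'http://typo3.nimes.fr/index.php?id=annuaire_assos&num_page=0'
req = urllib2.Request(url, headers=headers)
# doc = urllib2.urlopen(req)  # network call, not run here
```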
Don't put unnecessary parentheses (round brackets) around assigned
expressions. For example, replace
    req = ('%s%s' % (origin_site, page_no))
    ...
    assoc_link = (tag.string)

with

    req = '%s%s' % (origin_site, page_no)
    ...
    assoc_link = tag.string
I'd also move the "%s" into origin_site:
    origin_site = ('http://typo3.nimes.fr/index.php?'
                   'id=annuaire_assos&theme=0&rech=&num_page=%s')
    req = origin_site % page_no
The parentheses around the string literal are necessary because it's
split over two lines. (I use Gmail's webmail interface, which line
wraps at 69 columns -- sometimes; it tries to be clever.)
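As a quick check of both points, adjacent string literals inside
parentheses are joined into one string, and % fills in the page number
(3 here is just an example value):

```python
# Adjacent literals concatenate; % substitutes the page number.
origin_site = ('http://typo3.nimes.fr/index.php?'
               'id=annuaire_assos&theme=0&rech=&num_page=%s')
req = origin_site % 3
```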
Use "continue" to avoid unnecessary nesting:
    for page_no in pages:
        ...
        try:
            doc = urllib2.urlopen(req)
        except urllib2.URLError, e:
            continue
        soup = BeautifulSoup.BeautifulSoup(doc)
        for row in soup.findAll('tr', {'class': 'menu2'}):
            assoc_data = []
            ...
            for i in soup.findAll('a', {'href': '#'}):
                if 'associations' not in i.attrMap['onclick']:
                    continue
                req = i.attrMap['onclick'].split("'")[1]
                try:
                    doc = urllib2.urlopen(req)
                except urllib2.URLError, e:
                    continue
                soup = BeautifulSoup.BeautifulSoup(doc)
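The same early-continue shape, in a toy loop you can run without the
network (the odd/even test just stands in for "the fetch failed"):

```python
# Skip the bad cases up front so the real work stays at one
# indentation level instead of sitting inside an else branch.
results = []
for n in range(6):
    if n % 2:          # stand-in for "the fetch failed"
        continue
    results.append(n)  # the main work, not nested any deeper
```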
Don't test "if emails != []". Use "if emails" instead: a non-empty
list is always true in a boolean context, and an empty list is always
false.
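Concretely:

```python
# An empty list is false in a boolean context; a non-empty one is
# true, so "if emails:" already covers the "!= []" check.
emails = []
was_empty = bool(emails)
emails.append('someone at example.com')  # made-up address
```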
Finally, in several places you create an empty list and populate it
with a for loop. For simple cases like these, using comprehensions and
generator expressions executes faster, yet it's still easy to code and
easy to understand. For example, you can replace the following:
    assoc_data = []
    for assoc_theme in soup.findAll('u'):
        assoc_data.append(assoc_theme.renderContents())
    for assoc_name in soup.findAll('td', {'width': '70%'}):
        assoc_data.append(assoc_name.renderContents())
with something like this:
    assoc_data = [assoc_theme.renderContents()
                  for assoc_theme in soup.findAll('u')]
    assoc_data.extend(assoc_name.renderContents()
                      for assoc_name in soup.findAll('td', {'width': '70%'}))
At least to me this is just as readable as the original, and the
generated code is more efficient. If, however, BeautifulSoup is the
limiting factor here, the efficiency gain will be chump change in the
big picture. Still, in a simple case like this it's hard to argue
against using a comprehension or generator expression. In cases that
use complex expressions for the value and conditions, I think it makes
more sense to use a for loop, which is easier to read and debug.
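To see the equivalence without BeautifulSoup in the way, here's the
same loop-vs-comprehension rewrite on plain lists (the data is made
up, and upper() stands in for renderContents()):

```python
# Stand-ins for the two soup.findAll() result sets:
themes = ['sport', 'culture']
names = ['Club A', 'Club B']

# Loop version: build up an empty list with append().
loop_data = []
for t in themes:
    loop_data.append(t.upper())
for n in names:
    loop_data.append(n.upper())

# Comprehension plus generator-expression version.
comp_data = [t.upper() for t in themes]
comp_data.extend(n.upper() for n in names)
```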