[Tutor] extract uri from beautiful soup string
Norman Khine
norman at khine.net
Tue Oct 16 13:52:32 CEST 2012
On Tue, Oct 16, 2012 at 6:40 AM, eryksun <eryksun at gmail.com> wrote:
> On Mon, Oct 15, 2012 at 1:17 PM, Norman Khine <norman at khine.net> wrote:
>>
>> I made an update: https://gist.github.com/3891927 which works, based
>> on some of the recommendations posted here.
>>
>> any suggestions for improvement?
>
> I can't make any specific recommendations about using BeautifulSoup
> since I haven't used it in a long time. Here are a few general
> suggestions.
>
> PEP 8 recommends using 4-space indents instead of tabs, if possible.
>
> Move constant definitions out of loops. For example, there's no point
> to repeatedly building the dict literal and assigning it to "headers"
> for each page_no. Also, "headers" isn't used. You need to make a
> Request object:
>
> http://docs.python.org/library/urllib2#urllib2.Request
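> A minimal sketch of the idea (shown with Python 3's urllib.request,
> the successor to urllib2 -- with urllib2 only the import line
> differs; the User-Agent string here is just a placeholder). No
> request is actually sent, so this runs without network access:

```python
from urllib.request import Request  # urllib2.Request in Python 2

# Build the constant headers dict once, outside any loop.
HEADERS = {'User-Agent': 'Mozilla/5.0 (compatible; assoc-scraper)'}

url = 'http://typo3.nimes.fr/index.php?id=annuaire_assos'
req = Request(url, headers=HEADERS)

# The headers now travel with the request object, so a later
# urlopen(req) would send them with the HTTP request.
print(req.get_header('User-agent'))
```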
>
>
> Don't put unnecessary parentheses (round brackets) around assigned
> expressions. For example, replace
>
>     req = ('%s%s' % (origin_site, page_no))
>     ...
>     assoc_link = (tag.string)
>
> with
>
>     req = '%s%s' % (origin_site, page_no)
>     ...
>     assoc_link = tag.string
>
> I'd also move the "%s" into origin_site:
>
>     origin_site = ('http://typo3.nimes.fr/index.php?'
>                    'id=annuaire_assos&theme=0&rech=&num_page=%s')
>
>     req = origin_site % page_no
>
> The parentheses around the string literal are necessary because it's
> split over two lines. (I use Gmail's webmail interface, which line
> wraps at 69 columns -- sometimes; it tries to be clever.)
>
> Use "continue" to avoid unnecessary nesting:
>
>     for page_no in pages:
>         ...
>         try:
>             doc = urllib2.urlopen(req)
>         except urllib2.URLError, e:
>             continue
>         soup = BeautifulSoup.BeautifulSoup(doc)
>         for row in soup.findAll('tr', {'class': 'menu2'}):
>             assoc_data = []
>             ...
>             for i in soup.findAll('a', {'href': '#'}):
>                 if 'associations' not in i.attrMap['onclick']:
>                     continue
>                 req = i.attrMap['onclick'].split("'")[1]
>                 try:
>                     doc = urllib2.urlopen(req)
>                 except urllib2.URLError, e:
>                     continue
>                 soup = BeautifulSoup.BeautifulSoup(doc)
>
>
> Don't test "if emails != []". Use "if emails" instead. A non-empty
> list is true in a boolean context, and an empty list is false.
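> A quick self-contained illustration (the address is just example
> data):

```python
emails = []

# An empty list is false in a boolean context, so
# "if not emails" catches the no-results case directly.
if not emails:
    print('no addresses found')

emails.append('norman@khine.net')

# A non-empty list is true, so the comparison against []
# in "if emails != []" is redundant.
if emails:
    print('found %d address(es)' % len(emails))
```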
>
> Finally, in several places you create an empty list and populate it
> with a for loop. For simple cases like these, using comprehensions and
> generator expressions executes faster, yet it's still easy to code and
> easy to understand. For example, you can replace the following:
>
>     assoc_data = []
>     for assoc_theme in soup.findAll('u'):
>         assoc_data.append(assoc_theme.renderContents())
>     for assoc_name in soup.findAll('td', {'width': '70%'}):
>         assoc_data.append(assoc_name.renderContents())
>
> with something like this:
>
>     assoc_data = [assoc_theme.renderContents()
>                   for assoc_theme in soup.findAll('u')]
>     assoc_data.extend(assoc_name.renderContents()
>                       for assoc_name in soup.findAll('td', {'width': '70%'}))
>
> At least to me this is just as readable as the original, and the
> generated code is more efficient. If, however, BeautifulSoup is the
> limiting factor here, the efficiency gain will be chump change in the
> big picture. Still, in a simple case like this it's hard to argue
> against using a comprehension or generator expression. In cases that
> use complex expressions for the value and conditions, I think it makes
> more sense to use a for loop, which is easier to read and debug.
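> A self-contained way to check the pattern -- the two made-up lists
> below stand in for the soup.findAll() result sets, which aren't
> available outside the scraper:

```python
# Stand-ins for the two soup.findAll(...) result sets.
themes = ['sport', 'culture']
names = ['  Club A\n', ' Club B ', 'Club C']

# Loop version: build the list with repeated append calls.
assoc_data = []
for theme in themes:
    assoc_data.append(theme.upper())
for name in names:
    assoc_data.append(name.strip())

# Comprehension plus generator-expression version.
assoc_data2 = [theme.upper() for theme in themes]
assoc_data2.extend(name.strip() for name in names)

print(assoc_data == assoc_data2)  # the two produce identical lists
```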
Thanks, I made the changes: https://gist.github.com/3891927
--
%>>> "".join( [ {'*':'@','^':'.'}.get(c,None) or
chr(97+(ord(c)-83)%26) for c in ",adym,*)&uzq^zqf" ] )