Web Scraping - Output File

Thu Apr 26 14:14:24 EDT 2012

On 26/04/2012 18:54, SMac2347 at comcast.net wrote:
> Hello,
>
> I am having some difficulty generating the output I want from web
> scraping. Specifically, the script I wrote, while it runs without any
> errors, is not writing to the output file correctly. It runs, and
> creates the output .txt file; however, the file is blank (ideally it
> should be populated with a list of names).
>
> I took the base of a program that I had before for a different data
> gathering task, which worked beautifully, and edited it for my
> purposes here. Any insight as to what I might be doing wrong would be
> highly appreciated. Code is included below. Thanks!
>
> import os
> import re
> import urllib2
>
> outfile = open("Skadden.txt","w")
>
> A = 1
> Z = 26
>
> for letter in range(A,Z):
>
>      for line in urllib2.urlopen("http://www.skadden.com/Index.cfm?contentID=44&alphaSearch="+str(letter)):
>
>              x = line
>              if '"><B>' in line:
>                      start=x.find('"><B>"')
>                      end= x.find('</B></A></nobr></td>',start)
>                      name=x[start:end]
>                      outfile.write(name+"\n")
>                      print name

Firstly, 'letter' goes from 1 (inclusive) to 26 (exclusive), so the
URLs are:

     http://www.skadden.com/Index.cfm?contentID=44&alphaSearch=1
     http://www.skadden.com/Index.cfm?contentID=44&alphaSearch=2
     ...
     http://www.skadden.com/Index.cfm?contentID=44&alphaSearch=25

What you need is:

     http://www.skadden.com/Index.cfm?contentID=44&alphaSearch=A
     http://www.skadden.com/Index.cfm?contentID=44&alphaSearch=B
     ...
     http://www.skadden.com/Index.cfm?contentID=44&alphaSearch=Z
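
One way to get those URLs is to loop over the letters themselves rather than over numbers. This is only a sketch: the names BASE_URL and build_urls are mine, not from your script, and it just builds the URLs without fetching anything. string.ascii_uppercase exists in both Python 2 and Python 3.

```python
import string

# Base URL copied from the original script; the letter is appended to it.
BASE_URL = "http://www.skadden.com/Index.cfm?contentID=44&alphaSearch="


def build_urls(base=BASE_URL):
    """Return one URL per uppercase letter, A through Z inclusive.

    build_urls is a hypothetical helper name used for illustration.
    """
    return [base + letter for letter in string.ascii_uppercase]


urls = build_urls()
print(urls[0])    # first URL, ending in alphaSearch=A
print(urls[-1])   # last URL, ending in alphaSearch=Z
print(len(urls))  # 26
```

Note that range(A, Z) with A = 1 and Z = 26 stops at 25, so even with numbers you would have needed range(A, Z + 1) to cover 26 values.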

Secondly, the names in the HTML source aren't enclosed by '"><B>' and
'</B></A></nobr></td>'.
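
You will need to look at the page source to see what actually surrounds each name. Once you know that, something like the sketch below may help. The helper name extract_between, the sample line, and the lowercase markers are all made up for illustration; they are not what the real page uses. The points it shows are that str.find returns -1 when the marker is missing, and that the start index has to be moved past the opening marker, otherwise the marker ends up in the result.

```python
def extract_between(line, start_marker, end_marker):
    """Return the text between the two markers, or None if either is missing.

    Hypothetical helper; the markers must be taken from the real page source.
    """
    start = line.find(start_marker)
    if start == -1:
        # find() returns -1 when the marker is absent
        return None
    start += len(start_marker)  # skip past the opening marker itself
    end = line.find(end_marker, start)
    if end == -1:
        return None
    return line[start:end]


# Made-up HTML line, only for demonstration.
sample = '<td><a href="bio.cfm?id=1"><b>Jane Doe</b></a></td>'

print(extract_between(sample, '"><b>', '</b>'))  # Jane Doe
print(extract_between(sample, '"><B>', '</B>'))  # None (markers not present)
```

Also note that your script tests for '"><B>' but then searches for '"><B>"' (with an extra trailing double quote), so the two strings are not the same.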