[Tutor] Can't loop thru file and don't see the problem
Christian Witts
cwitts at compuscan.co.za
Thu Dec 3 12:46:43 CET 2009
Roy Hinkelman wrote:
>
> Your list is great. I've been lurking for the past two weeks while I
> learned the basics. Thanks.
>
> I am trying to loop thru 2 files and scrape some data, and the loops
> are not working.
>
> The script is not getting past the first URL from state_list, as the
> test print shows.
>
> If someone could point me in the right direction, I'd appreciate it.
>
> I would also like to know the difference between open() and
> csv.reader(). I had similar issues with csv.reader() when opening
> these files.
>
> Any help greatly appreciated.
>
> Roy
>
> Code: Select all
> # DOWNLOAD USGS MISSING FILES
>
> import mechanize
> import BeautifulSoup as B_S
> import re
> # import urllib
> import csv
>
> # OPEN FILES
> # LOOKING FOR THESE SKUs
> _missing = open('C:\\Documents and Settings\\rhinkelman\\Desktop\\working DB files\\missing_topo_list.csv', 'r')
> # IN THESE STATES
> _states = open('C:\\Documents and Settings\\rhinkelman\\Desktop\\working DB files\\state_list.csv', 'r')
> # IF NOT FOUND, LIST THEM HERE
> _missing_files = []
> # APPEND THIS FILE WITH META
> _topo_meta = open('C:\\Documents and Settings\\rhinkelman\\Desktop\\working DB files\\topo_meta.csv', 'a')
>
> # OPEN PAGE
> for each_state in _states:
>     each_state = each_state.replace("\n", "")
>     print each_state
>     html = mechanize.urlopen(each_state)
>     _soup = B_S.BeautifulSoup(html)
>
>     # SEARCH THRU PAGE AND FIND ROW CONTAINING META MATCHING SKU
>     _table = _soup.find("table", "tabledata")
>     print _table  # test: this is returning 'None'
>
If you take a look at the web page you open, you will notice there are
no tables on it. Are you certain you are using the correct URLs for this?
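Either way, it is worth guarding against pages with no matching table:
BeautifulSoup's find() returns None when nothing matches, and calling
.find() on None is exactly the AttributeError in your traceback. A
minimal sketch of the guard (the FakeSoup class is a hypothetical
stand-in for a parsed page, not part of your script; it only mimics
find()'s return-None behaviour):

```python
class FakeSoup(object):
    """Hypothetical stub for a parsed page; only mimics BeautifulSoup's
    habit of returning None from find() when no tag matches."""
    def __init__(self, tables):
        self._tables = tables

    def find(self, name, css_class=None):
        # BeautifulSoup returns None, not an empty result, on no match
        return self._tables.get(css_class)


def data_table_or_none(soup):
    """Return the data table, or None so the caller can skip this URL."""
    table = soup.find("table", "tabledata")
    if table is None:
        # skip this state rather than crash on table.find(...)
        return None
    return table


good_page = FakeSoup({"tabledata": "<table>rows...</table>"})
empty_page = FakeSoup({})
```

With that check in place, a state whose page lacks the table is skipped
instead of killing the whole run.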
> for each_sku in _missing:
The inner loop `for each_sku in _missing:` will only run in full on the
first pass: a file object is an iterator, so once it has been read to
the end it yields nothing for the remaining states. You can either
pre-read it into a list / dictionary / set (whichever you prefer) or
re-open the file on every pass:

_missing_filename = 'C:\\Documents and Settings\\rhinkelman\\Desktop\\working DB files\\missing_topo_list.csv'
for each_sku in open(_missing_filename):
    # carry on here
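To see the exhaustion in isolation, here is a small sketch using
io.StringIO as a stand-in for your _missing file handle (the two SKUs
are taken from your sample data):

```python
import io

# io.StringIO behaves like an open file handle for this purpose.
fake_file = io.StringIO("33087c2\n34087b2\n")

first_pass = [line.strip() for line in fake_file]
second_pass = [line.strip() for line in fake_file]  # already exhausted: empty

# Pre-reading into a set fixes it: the set can be scanned once per state,
# and membership tests on it are cheap.
fake_file.seek(0)  # rewind the fake file; a real script would just re-open
missing_skus = set(line.strip() for line in fake_file)
```

The first pass sees both SKUs, the second sees nothing, and the set can
be iterated as many times as you have state URLs.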
>         each_sku = each_sku.replace("\n", "")
>         print each_sku  # test
>         try:
>             _row = _table.find('tr', text=re.compile(each_sku))
>         except (IOError, AttributeError):
>             _missing_files.append(each_sku)
>             continue
>         else:
>             _row = _row.previous
>             _row = _row.parent
>             _fields = _row.findAll('td')
>             _name = _fields[1].string
>             _state = _fields[2].string
>             _lat = _fields[4].string
>             _long = _fields[5].string
>             _sku = _fields[7].string
>
>             _topo_meta.write(_name + "|" + _state + "|" + _lat + "|" + _long + "|" + _sku + "||")
>
>             print x + ': ' + _name
>
> print "Missing Files:"
> print _missing_files
> _topo_meta.close()
> _missing.close()
> _states.close()
>
>
> The message I am getting is:
>
> Code:
> >>>
> http://libremap.org/data/state/Colorado/drg/
> None
> 33087c2
> Traceback (most recent call last):
>   File "//Dc1/Data/SharedDocs/Roy/_Coding Vault/Python code samples/usgs_missing_file_META.py", line 34, in <module>
>     _row = _table.find('tr', text=re.compile(each_sku))
> AttributeError: 'NoneType' object has no attribute 'find'
>
>
> And the files look like:
>
> Code:
> state_list
> http://libremap.org/data/state/Colorado/drg/
> http://libremap.org/data/state/Connecticut/drg/
> http://libremap.org/data/state/Pennsylvania/drg/
> http://libremap.org/data/state/South_Dakota/drg/
>
> missing_topo_list
> 33087c2
> 34087b2
> 33086b7
> 34086c2
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Tutor maillist - Tutor at python.org
> To unsubscribe or change subscription options:
> http://mail.python.org/mailman/listinfo/tutor
>
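As for your open() vs csv.reader() question: open() gives you a file
object that yields raw lines, trailing newline included, while
csv.reader() wraps an open file (or any iterable of lines) and yields a
list of field strings per line, with the line ending already stripped.
For one-column files like yours either works; csv.reader just saves the
manual .replace("\n", "") step. A quick sketch, again with io.StringIO
standing in for a real file:

```python
import csv
import io

# Plain iteration over a file-like object yields raw lines,
# trailing newline and all...
raw_lines = list(io.StringIO("33087c2\n34087b2\n"))

# ...while csv.reader yields a list of fields per line, newline stripped.
# Each row here has a single field, hence row[0].
source = io.StringIO("33087c2\n34087b2\n")
rows = [row[0] for row in csv.reader(source)]
```

For your multi-column output file, csv.writer would likewise handle the
delimiters and line endings for you.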
Hope the comments above help in your endeavours.
--
Kind Regards,
Christian Witts