[Tutor] Can't loop thru file and don't see the problem
Roy Hinkelman
royhink at gmail.com
Fri Dec 4 01:01:28 CET 2009
Thank you very much!
I had forgotten that URLs on Unix servers are case-sensitive.
Also, I changed my 'for' statements to your suggestion, tweaked the
exception code a little, and it's working.
So, there are obviously several ways to open files. Do you have a standard
practice, or does it depend on the file format?
I will eventually be working with Excel and possibly mssql tables.
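[Archive note: a short sketch of the open() vs. csv.reader() difference asked about above, using an in-memory string as a stand-in for a real file so the snippet is self-contained. Written in Python 3 syntax, unlike the 2009-era code below.]

```python
import csv
import io

# open() yields raw lines (trailing newline included); csv.reader wraps an
# open file-like object and splits each line into a list of fields for you.
data = "name,state\nDenver,CO\nHartford,CT\n"  # stand-in for a real .csv file

raw_lines = io.StringIO(data).readlines()
parsed_rows = list(csv.reader(io.StringIO(data)))

print(raw_lines[1])    # 'Denver,CO\n' -- one string per line
print(parsed_rows[1])  # ['Denver', 'CO'] -- one list of fields per line
```

In practice the choice follows the format: csv.reader for delimited text, plain open() for line-oriented files like the URL list in this thread.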
Thanks again for your help.
Roy
On Thu, Dec 3, 2009 at 3:46 AM, Christian Witts <cwitts at compuscan.co.za> wrote:
> Roy Hinkelman wrote:
>
>>
>> Your list is great. I've been lurking for the past two weeks while I
>> learned the basics. Thanks.
>>
>> I am trying to loop through two files and scrape some data, and the loops
>> are not working.
>>
>> The script is not getting past the first URL from state_list, as the test
>> print shows.
>>
>> If someone could point me in the right direction, I'd appreciate it.
>>
>> I would also like to know the difference between open() and csv.reader().
>> I had similar issues with csv.reader() when opening these files.
>>
>> Any help greatly appreciated.
>>
>> Roy
>>
>> Code:
>> # DOWNLOAD USGS MISSING FILES
>>
>> import mechanize
>> import BeautifulSoup as B_S
>> import re
>> # import urllib
>> import csv
>>
>> # OPEN FILES
>> # LOOKING FOR THESE SKUs
>> _missing = open('C:\\Documents and Settings\\rhinkelman\\Desktop\\working DB files\\missing_topo_list.csv', 'r')
>> # IN THESE STATES
>> _states = open('C:\\Documents and Settings\\rhinkelman\\Desktop\\working DB files\\state_list.csv', 'r')
>> # IF NOT FOUND, LIST THEM HERE
>> _missing_files = []
>> # APPEND THIS FILE WITH META
>> _topo_meta = open('C:\\Documents and Settings\\rhinkelman\\Desktop\\working DB files\\topo_meta.csv', 'a')
>>
>> # OPEN PAGE
>> for each_state in _states:
>>     each_state = each_state.replace("\n", "")
>>     print each_state
>>     html = mechanize.urlopen(each_state)
>>     _soup = B_S.BeautifulSoup(html)
>>     # SEARCH THRU PAGE AND FIND ROW CONTAINING META MATCHING SKU
>>     _table = _soup.find("table", "tabledata")
>>     print _table  # test: this is returning 'None'
>>
> If you take a look at the webpage you open up, you will notice there are
> no tables. Are you certain you are using the correct URLs for this?
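[Archive note: this is why the traceback below ends in AttributeError -- BeautifulSoup's find() returns None when nothing matches, and calling a method on None fails. A minimal sketch of the guard pattern, using a plain dict as a stand-in for the soup object so the snippet runs without BeautifulSoup:]

```python
def lookup_row(table, sku):
    # table is None when the page had no matching <table class="tabledata">;
    # bail out early instead of calling a method on None.
    if table is None:
        return None
    return table.get(sku)  # stand-in for table.find('tr', text=...)

print(lookup_row(None, "33087c2"))                     # None, no crash
print(lookup_row({"33087c2": "row-data"}, "33087c2"))  # 'row-data'
```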
>
>> for each_sku in _missing:
>>
> The for loop `for each_sku in _missing:` will only iterate once, because a
> file object is exhausted after the first pass over it. You can either
> pre-read it into a list / dictionary / set (whichever you prefer) or
> change it to
>
> _missing_filename = 'C:\\Documents and Settings\\rhinkelman\\Desktop\\working DB files\\missing_topo_list.csv'
> for each_sku in open(_missing_filename):
>     # carry on here
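[Archive note: the pre-read approach suggested above can be sketched like this, in Python 3 syntax, with a literal list standing in for the open file so the snippet is self-contained:]

```python
def load_skus(lines):
    # Strip trailing newlines and skip blanks; a set gives O(1) membership
    # tests and, unlike a file object, can be iterated over many times.
    return {line.strip() for line in lines if line.strip()}

# In the real script: with open(_missing_filename) as f: skus = load_skus(f)
skus = load_skus(["33087c2\n", "34087b2\n", "33086b7\n", "\n"])

for state in ["Colorado", "Connecticut"]:  # outer loop stand-in
    for sku in skus:                       # works on every pass, unlike
        pass                               # re-iterating an exhausted file

print(sorted(skus))  # ['33086b7', '33087c2', '34087b2']
```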
>
>>         each_sku = each_sku.replace("\n", "")
>>         print each_sku  # test
>>         try:
>>             _row = _table.find('tr', text=re.compile(each_sku))
>>         except (IOError, AttributeError):
>>             _missing_files.append(each_sku)
>>             continue
>>         else:
>>             _row = _row.previous
>>             _row = _row.parent
>>             _fields = _row.findAll('td')
>>             _name = _fields[1].string
>>             _state = _fields[2].string
>>             _lat = _fields[4].string
>>             _long = _fields[5].string
>>             _sku = _fields[7].string
>>
>>             _topo_meta.write(_name + "|" + _state + "|" + _lat + "|" + _long + "|" + _sku + "||")
>>             print each_sku + ': ' + _name
>>
>> print "Missing Files:"
>> print _missing_files
>> _topo_meta.close()
>> _missing.close()
>> _states.close()
>>
>>
>> The message I am getting is:
>>
>> Code:
>> >>>
>> http://libremap.org/data/state/Colorado/drg/
>> None
>> 33087c2
>> Traceback (most recent call last):
>> File "//Dc1/Data/SharedDocs/Roy/_Coding Vault/Python code samples/usgs_missing_file_META.py", line 34, in <module>
>> _row = _table.find('tr', text=re.compile(each_sku))
>> AttributeError: 'NoneType' object has no attribute 'find'
>>
>>
>> And the files look like:
>>
>> Code:
>> state_list
>> http://libremap.org/data/state/Colorado/drg/
>> http://libremap.org/data/state/Connecticut/drg/
>> http://libremap.org/data/state/Pennsylvania/drg/
>> http://libremap.org/data/state/South_Dakota/drg/
>>
>> missing_topo_list
>> 33087c2
>> 34087b2
>> 33086b7
>> 34086c2
>>
>>
>>
>> _______________________________________________
>> Tutor maillist - Tutor at python.org
>> To unsubscribe or change subscription options:
>> http://mail.python.org/mailman/listinfo/tutor
>>
>>
> Hope the comments above help in your endeavours.
>
> --
> Kind Regards,
> Christian Witts
>
>
>