[Tutor] Can't loop thru file and don't see the problem
Christian Witts
cwitts at compuscan.co.za
Fri Dec 4 09:00:44 CET 2009
Roy Hinkelman wrote:
> Thank you very much!
>
> I had forgotten that Unix URLs are case-sensitive.
>
> Also, I changed my 'For' statements to your suggestion, tweaked the
> exception code a little, and it's working.
>
> So, there are obviously several ways to open files. Do you have a
> standard practice, or does it depend on the file format?
>
> I will eventually be working with Excel and possibly MS SQL tables.
>
> Thanks again for your help.
>
> Roy
>
>
>
> On Thu, Dec 3, 2009 at 3:46 AM, Christian Witts
> <cwitts at compuscan.co.za> wrote:
>
> Roy Hinkelman wrote:
>
>
> Your list is great. I've been lurking for the past two weeks
> while I learned the basics. Thanks.
>
> I am trying to loop through two files and scrape some data, and the
> loops are not working.
>
> The script is not getting past the first URL from state_list,
> as the test print shows.
>
> If someone could point me in the right direction, I'd
> appreciate it.
>
> I would also like to know the difference between open() and
> csv.reader(). I had similar issues with csv.reader() when
> opening these files.
>
> Any help greatly appreciated.
>
> Roy
>
> Code:
> # DOWNLOAD USGS MISSING FILES
>
> import mechanize
> import BeautifulSoup as B_S
> import re
> # import urllib
> import csv
>
> # OPEN FILES
> # LOOKING FOR THESE SKUs
> _missing = open('C:\\Documents and Settings\\rhinkelman\\Desktop\\working DB files\\missing_topo_list.csv', 'r')
> # IN THESE STATES
> _states = open('C:\\Documents and Settings\\rhinkelman\\Desktop\\working DB files\\state_list.csv', 'r')
> # IF NOT FOUND, LIST THEM HERE
> _missing_files = []
> # APPEND THIS FILE WITH META
> _topo_meta = open('C:\\Documents and Settings\\rhinkelman\\Desktop\\working DB files\\topo_meta.csv', 'a')
>
> # OPEN PAGE
> for each_state in _states:
>     each_state = each_state.replace("\n", "")
>     print each_state
>     html = mechanize.urlopen(each_state)
>     _soup = B_S.BeautifulSoup(html)
>     # SEARCH THRU PAGE AND FIND ROW CONTAINING META MATCHING SKU
>     _table = _soup.find("table", "tabledata")
>     print _table  # test: this is returning 'None'
>
> If you take a look at the web page you are opening, you will notice
> there are no tables on it, which is why _soup.find("table",
> "tabledata") returns None. Are you certain you are using the correct
> URLs for this?
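>
> (A minimal sketch of a guard for that situation, not from the
> original script: BeautifulSoup's find() returns None when nothing
> matches, so you can skip such pages inside the state loop.)
>
>     _table = _soup.find("table", "tabledata")
>     if _table is None:
>         # no matching table on this page; move on to the next state
>         print "no 'tabledata' table at", each_state
>         continue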
>
>     for each_sku in _missing:
>
> The for loop `for each_sku in _missing:` will only iterate once: a
> file object is an iterator, so after the first pass through your
> outer loop its read position sits at end-of-file and every later
> pass sees nothing. You can either pre-read it into a list /
> dictionary / set (whichever you prefer) or re-open the file on each
> pass:
> _missing_filename = 'C:\\Documents and Settings\\rhinkelman\\Desktop\\working DB files\\missing_topo_list.csv'
> for each_sku in open(_missing_filename):
>     # carry on here
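>
> (A sketch of the pre-read approach; the name missing_skus is just
> illustrative. Build the set once, before the state loop:)
>
> missing_skus = set(line.strip() for line in open(_missing_filename))
> # then, inside the state loop:
> for each_sku in missing_skus:
>     # carry on as before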
>
>         each_sku = each_sku.replace("\n", "")
>         print each_sku  # test
>         try:
>             _row = _table.find('tr', text=re.compile(each_sku))
>         except (IOError, AttributeError):
>             _missing_files.append(each_sku)
>             continue
>         else:
>             _row = _row.previous
>             _row = _row.parent
>             _fields = _row.findAll('td')
>             _name = _fields[1].string
>             _state = _fields[2].string
>             _lat = _fields[4].string
>             _long = _fields[5].string
>             _sku = _fields[7].string
>
>             _topo_meta.write(_name + "|" + _state + "|" + _lat + "|" + _long + "|" + _sku + "||")
>             print each_sku + ': ' + _name
>
> print "Missing Files:"
> print _missing_files
> _topo_meta.close()
> _missing.close()
> _states.close()
>
>
> The message I am getting is:
>
> Code:
> >>>
> http://libremap.org/data/state/Colorado/drg/
> None
> 33087c2
> Traceback (most recent call last):
>   File "//Dc1/Data/SharedDocs/Roy/_Coding Vault/Python code samples/usgs_missing_file_META.py", line 34, in <module>
>     _row = _table.find('tr', text=re.compile(each_sku))
> AttributeError: 'NoneType' object has no attribute 'find'
>
>
> And the files look like:
>
> Code:
> state_list
> http://libremap.org/data/state/Colorado/drg/
> http://libremap.org/data/state/Connecticut/drg/
> http://libremap.org/data/state/Pennsylvania/drg/
> http://libremap.org/data/state/South_Dakota/drg/
>
> missing_topo_list
> 33087c2
> 34087b2
> 33086b7
> 34086c2
>
>
>
> Hope the comments above help in your endeavours.
>
> --
> Kind Regards,
> Christian Witts
>
>
>
Generally I just open files in read or read-binary mode, depending on
the data in them. The only time I open the file directly in the for
loop, as in your case, is when I need to iterate over the file a lot
(although if the file is small enough I generally prefer loading it
into a dictionary or set, since that is faster: you build it once and
never have to read it off the disk again, because it is in memory).
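
A quick sketch of that caching idea (the file name is just your
missing-SKU list; adjust the path), plus the open() / csv.reader()
difference you asked about: open() gives you a file object that
yields raw lines, while csv.reader() wraps an already-open file and
splits each line into a list of fields.

    import csv

    # Read the file once; later lookups stay in memory.
    skus = set()
    f = open('missing_topo_list.csv', 'r')
    for line in f:
        skus.add(line.strip())
    f.close()

    print ('33087c2' in skus)   # fast membership test

    # csv.reader parses fields for you instead of giving raw lines
    # (open the file in binary mode for the csv module in Python 2):
    for fields in csv.reader(open('missing_topo_list.csv', 'rb')):
        print fields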
For the Excel work you want to do later, take a look at
http://www.python-excel.org/. xlrd is the one I still use (it works
in the UNIX environment) and I haven't had a need to change to
anything else.
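
Reading a sheet with xlrd looks roughly like this (the file name and
sheet index are placeholders):

    import xlrd

    book = xlrd.open_workbook('topo_data.xls')  # hypothetical file
    sheet = book.sheet_by_index(0)              # first worksheet
    for row_index in range(sheet.nrows):
        # row_values() returns the row as a list of cell values
        print sheet.row_values(row_index)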
For MS SQL you can look at http://pymssql.sourceforge.net/, which is
also supported under UNIX.
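
pymssql follows the standard Python DB-API, so usage is roughly as
below (the connection details and table are placeholders, and the
connect() keyword names have varied between pymssql versions):

    import pymssql

    conn = pymssql.connect(host='dbserver', user='me',
                           password='secret', database='topo')
    cur = conn.cursor()
    cur.execute('SELECT name, state FROM topo_meta')  # hypothetical table
    for name, state in cur.fetchall():
        print name, state
    conn.close()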
--
Kind Regards,
Christian Witts