[Tutor] Passing a config file to Python
Irina I
irina_khanoom at yahoo.com
Thu Mar 14 19:22:31 CET 2013
Hi all,
I'm new to Python and am trying to pass a config file to my Python script. The config file is very simple and contains only two URLs.
The code should take that configuration file as input and generate a single file in HTML format as output.
The program must retrieve each web page in the list and extract all the <a> tag links from each page. It is only necessary to extract the <a> tag links from the landing page of the URLs that you have placed in your configuration file.
The program will output an HTML file containing a list of clickable links from the source webpages, grouped by webpage. This is what I came up with so far; can someone please tell me if it's good?
Thanks in advance.
[CODE]
- - - - - - - - config.txt - - - - - - - -
http://www.blahblah.bla
http://www.etcetc.etc
- - - - - - - - - - - - - - - - - - - - - -
- - - - - - - - linkscraper.py - - - - - - - -
import urllib
def get_seed_links():
...."""return dict with seed links, from the config file, as keys -- {seed_link: None, ... }"""
....with open("config.txt", "r") as f:
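........# skip blank lines (e.g. a trailing newline) so no empty URL is fetched
........seed_links = [line.strip() for line in f if line.strip()]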
....return dict([(s_link, None) for s_link in seed_links])
def get_all_links(seed_link):
...."""return list of links from seed_link page"""
....all_links = []
....source_page = urllib.urlopen(seed_link).read()
....start = 0
....while True:
........start = source_page.find("<a", start)
........if start == -1:
............return all_links
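........start = source_page.find("href=", start)
........if start == -1:
............return all_links
........start += len("href=")
........# the value ends at the first whitespace or at the end of the tag
........end = start
........while end < len(source_page) and source_page[end] not in " \t\r\n>":
............end += 1
........link = source_page[start:end].strip("\"'")  # drop surrounding quotes
........all_links.append(link)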
def build_output_file(data):
...."""build and save output file from data. data -- {seed_link:[link, ...], ...}"""
....result = ""
....for seed_link in data:
........result += "<h2>%s</h2>\n<br />" % seed_link
........for link in data[seed_link]:
............result += '<a href="%s">%s</a>\n' % (link, link.replace("http://", ""))
........result += "<br /><br />"
....with open("result.htm", "w") as f:
........f.write(result)
def main():
....seed_link_data = get_seed_links()
....for seed_link in seed_link_data:
........seed_link_data[seed_link] = get_all_links(seed_link)
....build_output_file(seed_link_data)
if __name__ == "__main__":
....main()
[/CODE]
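For comparison, here is a rough sketch of the same steps (read the config lines, pull the href out of every <a> tag, group the links by source page) using the standard library's HTML parser instead of str.find. It is only an illustration under a few assumptions: it targets Python 3, where the parser lives in html.parser and pages would be fetched with urllib.request.urlopen rather than urllib.urlopen, and the names LinkCollector, read_seed_links, extract_links and build_html are made up for this example. Fetching is left out so the sketch runs without network access.

```python
from html import escape
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs. Tags such as <abbr> are ignored.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def read_seed_links(lines):
    """Return the non-blank lines of a config file, stripped of whitespace."""
    return [line.strip() for line in lines if line.strip()]


def extract_links(page_source):
    """Return all <a href> values found in an HTML string."""
    collector = LinkCollector()
    collector.feed(page_source)
    return collector.links


def build_html(data):
    """Render {seed_link: [link, ...]} as an HTML page grouped by seed link."""
    parts = ["<html><body>"]
    for seed_link, links in data.items():
        parts.append("<h2>%s</h2>" % escape(seed_link))
        parts.append("<ul>")
        for link in links:
            # Escape the link so characters like & or " cannot break the markup.
            parts.append('<li><a href="%s">%s</a></li>'
                         % (escape(link, quote=True), escape(link)))
        parts.append("</ul>")
    parts.append("</body></html>")
    return "\n".join(parts)


if __name__ == "__main__":
    sample = '<p><a href="http://www.blahblah.bla/one">One</a> <abbr>x</abbr></p>'
    print(build_html({"http://www.blahblah.bla": extract_links(sample)}))
```

Letting a parser find the tags sidesteps two problems with searching for "<a" by hand: it does not mistake tags like <abbr> for links, and it copes with single-quoted, double-quoted or unquoted href values.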