[Tutor] Passing a config file to Python

Irina I irina_khanoom at yahoo.com
Thu Mar 14 19:22:31 CET 2013


Hi all,

I'm new to Python and am trying to pass a config file to my Python script. The config file is very simple and contains only two URLs.

The code should take that configuration file as input and generate a single file in HTML format as output.
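
One way a script could take the configuration file as input is to read its path from the command line. This is only a minimal sketch, assuming Python 3; the helper names `read_seed_links` and `config_path_from_args` are hypothetical and are not part of linkscraper.py:

```python
def read_seed_links(config_path):
    """Return the non-blank lines of the config file, in order."""
    with open(config_path, "r") as f:
        # strip() drops the newline; blank lines are skipped
        return [line.strip() for line in f if line.strip()]


def config_path_from_args(argv, default="config.txt"):
    """Use the first command-line argument as the config path, if given."""
    return argv[1] if len(argv) > 1 else default
```

With these helpers the script could be started as `python linkscraper.py config.txt`, passing `sys.argv` to `config_path_from_args` and falling back to `config.txt` when no argument is given.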

The program must retrieve each web page in the list and extract all the <a> tag links from each page. It is only necessary to extract the <a> tag links from the landing page of the URLs that you have placed in your configuration file.
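
Extracting the <a> tag links can also be done with the standard library's HTML parser instead of string searching. The following is a rough sketch, assuming Python 3 and its `html.parser` module; the class name `LinkCollector`, the function `extract_links`, and the sample markup are made up for illustration:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased;
        # attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return the href values of all <a> tags in html_text."""
    collector = LinkCollector()
    collector.feed(html_text)
    return collector.links


sample = ('<p><a href="http://www.blahblah.bla/one">one</a> <abbr>x</abbr> '
          '<A HREF=two.html>two</A> <a name="top">no href</a></p>')
print(extract_links(sample))
```

The parser ignores tags such as <abbr> that merely start with "a", and it copes with upper-case tags and unquoted attribute values.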

The program will output an HTML file containing a list of clickable links from the source webpages, grouped by webpage. This is what I have come up with so far. Can someone please tell me if it's good?

Thanks in advance.

[CODE]

- - - - - - - - config.txt - - - - - - - -
http://www.blahblah.bla
http://www.etcetc.etc
- - - - - - - - - - - - - - - - - - - - - -

- - - - - - - - linkscraper.py - - - - - - - -
from urllib.request import urlopen  # Python 3 location of urlopen

def get_seed_links():
    """return dict with seed links, from the config file, as keys -- {seed_link: None, ... }"""
    with open("config.txt", "r") as f:
        # skip blank lines, e.g. the one left by a trailing newline
        seed_links = [line.strip() for line in f if line.strip()]
    return dict.fromkeys(seed_links)

def get_all_links(seed_link):
    """return list of links from seed_link page"""
    all_links = []
    # urlopen returns bytes; decode so the string searches below work
    source_page = urlopen(seed_link).read().decode("utf-8", "replace")
    start = 0
    while True:
        # "<a " (with the space) avoids matching tags such as <abbr>
        start = source_page.find("<a ", start)
        if start == -1:
            return all_links
        tag_end = source_page.find(">", start)
        if tag_end == -1:
            return all_links
        tag = source_page[start:tag_end]
        start = tag_end  # continue searching after this tag
        href = tag.find("href=")
        if href == -1:
            continue  # anchor without an href attribute
        value = tag[href + len("href="):]
        if value[:1] in ('"', "'"):
            # quoted value: take everything up to the closing quote
            link = value[1:].split(value[0])[0]
        else:
            # unquoted value: take everything up to the first whitespace
            words = value.split()
            link = words[0] if words else ""
        if link:
            all_links.append(link)

def build_output_file(data):
    """build and save output file from data. data -- {seed_link:[link, ...], ...}"""
    result = "<html>\n<body>\n"
    for seed_link in data:
        result += "<h2>%s</h2>\n" % seed_link
        for link in data[seed_link]:
            # <br /> puts each link on its own line
            result += '<a href="%s">%s</a><br />\n' % (link, link.replace("http://", ""))
    result += "</body>\n</html>\n"
    with open("result.htm", "w", encoding="utf-8") as f:
        f.write(result)

def main():
    seed_link_data = get_seed_links()
    for seed_link in seed_link_data:
        seed_link_data[seed_link] = get_all_links(seed_link)
    build_output_file(seed_link_data)

if __name__ == "__main__":
    main()

[/CODE]
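
The grouped output described in the post could also be built as a string first and written to disk afterwards, which makes it easier to check. This is just a sketch, assuming Python 3; `render_links_page` is a hypothetical name, and the use of `html.escape` and a <ul> list per page are choices made for this example, not something the post asks for:

```python
from html import escape


def render_links_page(data):
    """Return an HTML document listing links grouped by source page.

    data -- {seed_link: [link, ...], ...}
    """
    parts = ["<html>", "<body>"]
    for seed_link, links in data.items():
        # One heading per source page, followed by its list of links
        parts.append("<h2>%s</h2>" % escape(seed_link))
        parts.append("<ul>")
        for link in links:
            # escape() keeps characters such as & and " from breaking the markup
            parts.append('<li><a href="%s">%s</a></li>' % (escape(link), escape(link)))
        parts.append("</ul>")
    parts.extend(["</body>", "</html>"])
    return "\n".join(parts)
```

The returned string can then be written with `open("result.htm", "w", encoding="utf-8")`, the output file name used in the post.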


