[Tutor] remove blank list items
sacharook at hotmail.co.uk
Fri Sep 14 15:56:51 CEST 2007
I was expanding my program to parse URLs from an HTML page and write them to a file, so I chose www.icq.com to extract the URLs from.
When I wrote these out to a file and then read the file back, I noticed a list of URLs, then some blank lines, then more URLs, then more blank lines. Does this mean that one of the functions called has for some reason added whitespace to some of the list items before I wrote them out to disk?
I also noticed that duplicate hosts/URLs have been written to the file.
So my two questions are:
1. How and where do I stop the whitespace from being written out to disk?
2. How do I check for duplicate entries in a list before writing them out to disk?
My code is below:
from BeautifulSoup import BeautifulSoup
import urllib2
import urlparse

file = urllib2.urlopen("http://www.icq.com")
soup = BeautifulSoup(''.join(file))
alist = soup.findAll('a')

output = open("fqdns.txt", "w")
for a in alist:
    href = a['href']
    # urlparse() returns a tuple, not a string; index [1] is the host (netloc)
    output.write(urlparse.urlparse(href)[1] + "\n")
output.close()  # flush buffered writes before reading the file back

input = open("fqdns.txt", "r")
for j in input:
    print j,
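One way to tackle both questions before anything reaches the disk: relative links and fragment links parse to an empty host, which is where the blank lines come from, so skip empty hosts and track the ones already seen in a set. A minimal sketch in modern Python 3 (`urllib.parse` is the successor of the old `urlparse` module); the sample hrefs below are made up for illustration:

```python
import urllib.parse

# Hypothetical hrefs as BeautifulSoup's findAll('a') might yield them;
# relative links and fragments have no network location at all.
hrefs = [
    "http://www.icq.com/people/",
    "/download/",                   # relative -> empty netloc
    "http://www.icq.com/people/",   # duplicate host
    "#top",                         # fragment -> empty netloc
    "http://chat.icq.com/",
]

seen = set()
hosts = []
for href in hrefs:
    host = urllib.parse.urlparse(href).netloc
    if not host:        # skip relative/fragment links (fixes the blank lines)
        continue
    if host in seen:    # skip hosts already collected (fixes the duplicates)
        continue
    seen.add(host)
    hosts.append(host)

print(hosts)  # ['www.icq.com', 'chat.icq.com']
```

Filtering and deduplicating in the loop, rather than cleaning the file afterwards, means the file on disk is correct the moment it is written.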
The chopped output is here:
Celeb spotting – Play CelebMashup and win cool prizes