[Tutor] remove blank list items

sacha rook sacharook at hotmail.co.uk
Fri Sep 14 15:56:51 CEST 2007


Hi
 
I was expanding my program to parse URLs from an HTML page and write them to a file, so I chose www.icq.com to extract the URLs from.
 
When I wrote these out to a file and then read the file back, I noticed a list of URLs, then some blank lines, then some more URLs, then some more blank lines. Does this mean that one of the functions called has, for some reason, added some whitespace into some of the list items, which I then wrote out to disk?
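I wonder whether the blank lines come from relative links. As far as I can tell, urlparse only fills in the host field for absolute URLs, so an href like "/download" or "#top" (made-up examples, I haven't checked the actual page) would give an empty string and end up as a bare newline in the file. A quick check of that idea:

import urlparse

# Absolute URL: index 1 of the parse result is the host name.
print urlparse.urlparse("http://chat.icq.com/people")[1]   # chat.icq.com

# Relative URL or fragment: the host field is an empty string, so
# writing it followed by "\n" produces a blank line in the file.
print repr(urlparse.urlparse("/download")[1])              # ''
print repr(urlparse.urlparse("#top")[1])                   # ''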
 
I also noticed that duplicate hosts/URLs have been written to the file.
 
So my two questions are:
1. How and where do I prevent the whitespace from being written out to disk?
 
2. How do I check for duplicate entries in a list before writing them out to disk? (My rough attempt at both is sketched below.)
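Here is that sketch, just an idea and not tested against the real page: skip hrefs whose parsed host is empty, and keep a set of hosts that have already been written so each one only goes out once.

from BeautifulSoup import BeautifulSoup
import urllib2
import urlparse

page = urllib2.urlopen("http://www.icq.com")
soup = BeautifulSoup(''.join(page))

seen = set()                           # hosts that have already been written
output = open("fqdns.txt", "w")
for a in soup.findAll('a'):
    host = urlparse.urlparse(a['href'])[1]
    if host and host not in seen:      # skip empty hosts and duplicates
        seen.add(host)
        output.write(host + "\n")
output.close()

I used a set rather than a list for the membership test because "host not in seen" stays fast as the number of hosts grows, although for a page this size a plain list would probably work too.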
 
My code is below:
from BeautifulSoup import BeautifulSoup
import urllib2
import urlparse

file = urllib2.urlopen("http://www.icq.com")
soup = BeautifulSoup(''.join(file))
alist = soup.findAll('a')
output = open("fqdns.txt", "w")
for a in alist:
    href = a['href']
    output.write(urlparse.urlparse(href)[1] + "\n")
output.close()
input = open("fqdns.txt", "r")
for j in input:
    print j,
input.close()
The chopped output is here:
 
chat.icq.com
chat.icq.com
chat.icq.com
chat.icq.com
chat.icq.com


labs.icq.com
download.icq.com
greetings.icq.com
greetings.icq.com
greetings.icq.com
games.icq.com
games.icq.com