[Tutor] extract hosts from html write to file

Eric Brunson brunson at brunson.com
Tue Sep 11 19:46:00 CEST 2007

sacha rook wrote:
> Hi, I wonder if anyone can help with the following.
> I am trying to read an HTML page, extract only the fully qualified
> hostnames from the page, and output these hostnames to a file on disk
> to be used later as input to another program.
> I have this so far:
> import urllib2
> f=open("c:/tmp/newfile.txt", "w")
> for line in urllib2.urlopen("http://www.somedomain.uk"):
>     if "href" in line and "http://" in line:
>         print line
>         f.write(line)
> f.close()
> fu=open("c:/tmp/newfile.txt", "r")
> for line in fu.readlines():
>     print line
> fu.close()
> So I have opened a file to write to, fetched a page of HTML, and
> printed and written to the file those lines that contain href and
> http:// references. I then closed the file, reopened it, read all the
> lines from it and printed them out.
> Can someone point me in the right direction, please, on the flow of
> this program and the best way to extract just the hostnames and print
> these to a file on disk?

I would start with a regular expression that matches the text of the 
URL.  The match gives you exactly the URL text, so you can extract just 
that part instead of the whole line.  You can probably even find a 
suitable pattern with a web search.  Read up on regular expressions 
first: they're extremely powerful, but there is a bit of a learning 
curve.  Google "regular expression tutorial" or search the list archive 
for a reference.
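
As a rough illustration of the regular-expression approach, here is a 
minimal sketch written for Python 3, where the pieces of urllib2 now 
live in urllib.request and urllib.parse.  The names HREF_RE, 
extract_hosts and write_hosts, and the sample HTML, are made up for this 
example and are not from the original post.  The pattern is deliberately 
simple and assumes quoted href values that start with http:// or 
https://; it is not a full HTML parser.

```python
import re
from urllib.parse import urlsplit

# Hypothetical pattern: captures the value of a quoted href attribute
# that starts with http:// or https://. Unquoted attributes are ignored.
HREF_RE = re.compile(r'href\s*=\s*["\'](https?://[^"\'\s>]+)', re.IGNORECASE)


def extract_hosts(html):
    """Return the sorted, de-duplicated hostnames found in absolute href links."""
    hosts = set()
    for url in HREF_RE.findall(html):
        # .hostname is lower-cased and excludes any port or credentials
        host = urlsplit(url).hostname
        if host:
            hosts.add(host)
    return sorted(hosts)


def write_hosts(hosts, path):
    """Write one hostname per line to the file at path."""
    with open(path, "w") as out:
        for host in hosts:
            out.write(host + "\n")


# Made-up sample input so the sketch runs without network access.
sample = '''
<a href="http://www.example.com/index.html">home</a>
<a HREF='https://mail.example.org:8443/inbox'>mail</a>
<a href="/relative/link.html">relative</a>
<a href="http://www.example.com/other">again</a>
'''
print(extract_hosts(sample))
```

To use it on a real page, you would pass the decoded text of the page 
to extract_hosts and then call write_hosts with the result and an output 
path such as the c:/tmp/newfile.txt used in the question.  Collecting 
the hosts in a set means each hostname is written only once, even if 
the page links to it many times.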

> As you can see I am newish to this
> Thanks in advance for any help given!
> s
