[Tutor] extract hosts from html write to file
Eric Brunson
brunson at brunson.com
Tue Sep 11 19:46:00 CEST 2007
sacha rook wrote:
> Hi I wonder if anyone can help with the following
>
> I am trying to read a html page extract only fully qualified hostnames
> from the page and output these hostnames to a file on disk to be used
> later as input to another program.
>
> I have this so far
>
> import urllib2
> f=open("c:/tmp/newfile.txt", "w")
> for line in urllib2.urlopen("http://www.somedomain.uk"):
>     if "href" in line and "http://" in line:
>         print line
>         f.write(line)
> f.close()
> fu=open("c:/tmp/newfile.txt", "r")
>
> for line in fu.readlines():
>     print line
>
> so i have opened a file to write to, got a page of html, printed and
> written those to file that contain href & http:// references.
> closed file opened file read all the lines from file and printed out
>
> Can someone point me in right direction please on the flow of this
> program, the best way to just extract the hostnames and print these to
> file on disk?
I would start with a regular expression that matches the text of a URL.
Because it matches exactly that text, you can extract the URL and, from
it, the hostname. You can probably even find a suitable pattern with a
web search. Read up on regular expressions first: they're extremely
powerful, but they have a bit of a learning curve. Google "regular
expression tutorial" or search the list archive for a reference.
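
As a rough illustration of the regular expression idea, here is a minimal
sketch. It is only one possible approach, not a definitive one. The names
HOST_RE, extract_hosts and write_hosts are made up for this example, the
pattern is a simplified assumption that only accepts dotted hostnames
after http:// or https://, and the code uses Python 3 syntax rather than
the Python 2 style of the quoted script.

```python
import re

# Assumed, simplified pattern: "http://" or "https://" followed by a dotted
# hostname. Group 1 captures only the hostname, so paths and ports are dropped.
HOST_RE = re.compile(r'https?://([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)',
                     re.IGNORECASE)


def extract_hosts(html):
    """Return the unique hostnames found in html, in order of first appearance."""
    hosts = []
    for host in HOST_RE.findall(html):
        host = host.lower()          # hostnames are case-insensitive
        if host not in hosts:        # skip duplicates
            hosts.append(host)
    return hosts


def write_hosts(hosts, path):
    """Write one hostname per line to the file at path."""
    with open(path, "w") as out:
        for host in hosts:
            out.write(host + "\n")
```

You would pass the page text to extract_hosts() and then hand the result
to write_hosts() along with the output file name. Keeping the matching
separate from the file writing makes it easy to try the pattern on a
small string before pointing it at a real page.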
>
> As you can see I am newish to this
>
> Thanks in advance for any help given!
>
> s