fetching webpage
yookyung
ykjo at cs.cornell.edu
Thu Dec 29 21:20:04 EST 2005
I am trying to crawl webpages in the CiteSeer domain (a collection of research
papers, mostly in computer science).
I have used the following code snippet.
#####
import urllib
# Fetch the front page and split it into lines.
sock = urllib.urlopen("http://citeseer.ist.psu.edu")
webcontent = sock.read().split('\n')
sock.close()
print webcontent
########
Instead of the page, the script prints the server's error page:
['<!--#set var="TITLE" value="Server error!"', '--><!--#include
virtual="include/top.html" -->', '', ' <!--#if
expr="$REDIRECT_ERROR_NOTES" -->', '', ' The server encountered an
internal error and was ', ' unable to complete your request.', '', '
<!--#include virtual="include/spacer.html" -->', '', ' Error message:', '
<br /><!--#echo encoding="none" var="REDIRECT_ERROR_NOTES" -->', '', '
<!--#else -->', '', ' The server encountered an internal error and was ',
' unable to complete your request. Either the server is', ' overloaded
or there was an error in a CGI script.', '', ' <!--#endif -->', '',
'<!--#include virtual="include/bottom.html" -->', '']
However, the URL is valid: it works fine if I open it in my web browser.
And if I use a different URL (http://www.google.com instead of
http://citeseer.ist.psu.edu), the script works too.
What is wrong?
Could it be that the citeseer webserver inspects the HTTP request, sees
something it doesn't like, and rejects the request?
What should I do?
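One way I could test this hypothesis is to send the GET request by hand over a socket with a browser-like User-Agent, since urllib identifies itself as "Python-urllib" by default and some servers filter on that header. This is only a sketch; the agent string and the fetch helper are my own guesses, not anything CiteSeer documents:

```python
import socket

def build_request(host, path="/",
                  user_agent="Mozilla/5.0 (compatible; test-crawler)"):
    # Build a minimal HTTP/1.0 GET request with an explicit User-Agent,
    # instead of the default "Python-urllib/x.y" that urllib would send.
    return ("GET %s HTTP/1.0\r\n"
            "Host: %s\r\n"
            "User-Agent: %s\r\n"
            "\r\n" % (path, host, user_agent))

def fetch(host, path="/"):
    # Send the hand-built request over a plain socket and return the
    # raw response (status line, headers, and body) as bytes.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect((host, 80))
    s.sendall(build_request(host, path).encode("ascii"))
    chunks = []
    while True:
        data = s.recv(4096)
        if not data:
            break
        chunks.append(data)
    s.close()
    return b"".join(chunks)
```

If a request like `fetch("citeseer.ist.psu.edu")` with the browser-like agent succeeds where urllib fails, the server is presumably filtering on the request headers; in that case urllib2.Request, which accepts a headers dict, would let the crawler set its own User-Agent.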
Thank you.
Best regards,
Yookyung