[wwwsearch-general] (no subject)
bruce
bedouglas at earthlink.net
Fri Aug 29 16:43:36 EDT 2008
Hi John,
Thanks for your reply. I tried your suggestion of using RobustFactory, and
I still get badly malformed HTML back. The HTML is listed below. I would
have thought that mechanize would have interpreted the
http-equiv="refresh" tag. Unfortunately, mechanize apparently isn't able to
handle a <meta http-equiv="refresh" content="0;url=/foo/..."> tag when it's
inside the <body> of the HTML.
test.html
------------------------------------------------------------------
<html>
<head>
<TITLE></TITLE>
</head>
<BODY BGCOLOR="#FFFFFF">
<TD NOWRAP WIDTH="45" VALIGN="top"><A
HREF="javascript:openAWindow('http://www.registrar.psu.edu/faculty_staff/enroll_services/clsrooms.html#C','Intent',625,425,1)"><FONT
FACE="Arial, Helvetica, sans-serif" SIZE="2"><strong>Tech Type</strong></FONT></A></TD>
<META HTTP-EQUIV="Refresh" CONTENT="0;url=/soc/fall/Alloz/a-c/acctg.html#">
---------------------------------------------------------------------------
As you can see, there are no closing </body> and </html> tags.
Thanks.
Stripped-down test code:
----------------------------------------
import sys

import mechanize

br = mechanize.Browser()
br.set_handle_redirect(True)   # follow HTTP redirects
br.set_handle_referer(True)    # send a Referer header
br.set_handle_robots(False)    # do not fetch or obey robots.txt
br.set_handle_refresh(True)    # follow refresh directives
br.addheaders = [('User-Agent', 'Firefox')]
# Adjacent string literals are concatenated, so the long URL can span lines.
url = ("http://schedule.psu.edu/act_main_search.cfm?Semester=FALL%202008%20%20%"
       "20%20&CrseLoc=OZ%3A%3AAbington%20Campus&CECrseLoc=AllOZ%3A%3AAbington%20Camp"
       "us&CourseAbbrev=ACCTG&CourseNum=&CrseAlpha=")
br.open(url)
res = br.response()  # this is a copy of response
s = res.read()
print "slen=", len(s)
print s
sys.exit()
----------------------------------
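One possible workaround, if the refresh tag in the <body> really is not being followed, is to pull the target out of the raw HTML by hand and open it explicitly. The following is only a rough sketch under that assumption: find_meta_refresh and the two regular expressions are hypothetical helpers written for this example, not part of mechanize, and a regex is a crude stand-in for a real HTML parser.

```python
import re

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

# Hypothetical helper patterns: match a <meta http-equiv="refresh" ...> tag
# anywhere in the document, then the delay and URL inside its content attribute.
META_REFRESH_RE = re.compile(
    r'<meta[^>]*http-equiv\s*=\s*["\']?refresh["\']?[^>]*>', re.IGNORECASE)
CONTENT_RE = re.compile(
    r'content\s*=\s*["\']\s*([\d.]+)\s*;\s*url\s*=\s*([^"\']+)["\']',
    re.IGNORECASE)


def find_meta_refresh(html, base_url):
    """Return the absolute refresh target found in html, or None."""
    tag = META_REFRESH_RE.search(html)
    if tag is None:
        return None
    content = CONTENT_RE.search(tag.group(0))
    if content is None:
        return None
    # Resolve a relative target such as /soc/... against the page URL.
    return urljoin(base_url, content.group(2).strip())
```

With the test script above, this could be used as target = find_meta_refresh(s, url), followed by br.open(target) when target is not None.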
-----Original Message-----
From: python-list-bounces+bedouglas=earthlink.net at python.org
[mailto:python-list-bounces+bedouglas=earthlink.net at python.org]On Behalf
Of John J Lee
Sent: Friday, August 29, 2008 12:34 PM
To: wwwsearch-general at lists.sourceforge.net
Cc: python-list at python.org
Subject: Re: [wwwsearch-general] (no subject)
On Fri, 29 Aug 2008, bruce wrote:
[...]
> does the page (test.html) need to be completely valid html?
No, but there are certainly (poorly-defined) limitations.
I haven't tried to understand your script or the HTML, but did you try
this:
br = mechanize.Browser(mechanize.RobustFactory())
...
John
--
http://mail.python.org/mailman/listinfo/python-list