[wwwsearch-general] (no subject)

bruce bedouglas at earthlink.net
Fri Aug 29 22:43:36 CEST 2008


Hi John,

Thanks for your reply. I tried your suggestion of using RobustFactory, and
I still get badly malformed HTML back. The HTML is listed below. I would
have thought that mechanize would have interpreted the
http-equiv="refresh". Unfortunately, mechanize apparently isn't able to
handle a <meta http-equiv="refresh" content="0;url=/foo/..."> when it's
inside the <body> of the HTML...

test.html
------------------------------------------------------------------
<html>
<head>
<TITLE></TITLE>
</head>

<BODY BGCOLOR="#FFFFFF">

                        <TD NOWRAP WIDTH="45" VALIGN="top"><A
HREF="javascript:openAWindow('http://www.registrar.psu.edu/faculty_staff/enroll_services/clsrooms.html#C','Intent',625,425,1)"><FONT
FACE="Arial, Helvetica, sans-serif" SIZE="2"><strong>Tech Type</strong></FONT></A></TD>

<META HTTP-EQUIV="Refresh" CONTENT="0;url=/soc/fall/Alloz/a-c/acctg.html#">

---------------------------------------------------------------------------

As you can see, there are no closing </body> or </html> tags, and the
<META> refresh sits in the body rather than in the <head>.

thanks


stripped down, test code...
----------------------------------------
import sys

from mechanize import Browser

br = Browser()

# Follow HTTP redirects and refresh directives; ignore robots.txt.
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(True)
br.addheaders = [('User-Agent', 'Firefox')]

# Adjacent string literals are joined by Python, so the long URL stays intact.
url = ("http://schedule.psu.edu/act_main_search.cfm?Semester=FALL%202008%20%20%"
       "20%20&CrseLoc=OZ%3A%3AAbington%20Campus&CECrseLoc=AllOZ%3A%3AAbington%20Camp"
       "us&CourseAbbrev=ACCTG&CourseNum=&CrseAlpha=")

br.open(url)
res = br.response()  # this is a copy of response
s = res.read()
print("slen= %d" % len(s))
print(s)

sys.exit()
----------------------------------
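
One possible workaround, if the refresh in the <body> is not followed
automatically, is to pull the target out of the returned HTML by hand. The
following is only a minimal sketch under that assumption: the names
META_REFRESH and find_meta_refresh are made up for illustration and are not
part of mechanize, the regular expression only covers the simple
CONTENT="N;url=..." form seen in test.html, and a trailing empty "#" fragment
is dropped before the URL is resolved.

```python
import re

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

# Hypothetical helper pattern: matches a meta refresh tag anywhere in the
# text, not only inside <head>. Only the CONTENT="N;url=..." form is handled.
META_REFRESH = re.compile(
    r'<meta\s+http-equiv\s*=\s*["\']?refresh["\']?\s+'
    r'content\s*=\s*["\']\s*(\d+)\s*;\s*url\s*=\s*([^"\'>]*)["\']',
    re.IGNORECASE)


def find_meta_refresh(html, base_url):
    """Return (delay_seconds, absolute_url), or None if no refresh tag is found."""
    match = META_REFRESH.search(html)
    if match is None:
        return None
    delay = int(match.group(1))
    # Drop a trailing empty fragment ("#") before resolving against the base URL.
    target = match.group(2).strip().rstrip('#')
    return delay, urljoin(base_url, target)


# Fragment taken from test.html above.
sample = ('<BODY BGCOLOR="#FFFFFF">\n'
          '<META HTTP-EQUIV="Refresh" '
          'CONTENT="0;url=/soc/fall/Alloz/a-c/acctg.html#">\n')
result = find_meta_refresh(sample, "http://schedule.psu.edu/act_main_search.cfm")
print(result)
```

In the test code above, the string s and the original url could be passed to
such a helper, and the returned absolute URL opened with a second br.open()
call.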


-----Original Message-----
From: python-list-bounces+bedouglas=earthlink.net at python.org
[mailto:python-list-bounces+bedouglas=earthlink.net at python.org]On Behalf
Of John J Lee
Sent: Friday, August 29, 2008 12:34 PM
To: wwwsearch-general at lists.sourceforge.net
Cc: python-list at python.org
Subject: Re: [wwwsearch-general] (no subject)


On Fri, 29 Aug 2008, bruce wrote:
[...]
> does the page (test.html) need to be completely valid html?

No, but there are certainly (poorly-defined) limitations.

I haven't tried to understand your script or the HTML, but did you try
this:

br = mechanize.Browser(mechanize.RobustFactory())
...


John

--
http://mail.python.org/mailman/listinfo/python-list



