[Tutor] Fwd: Re: Parsing/Crawling test College Class Site.

Alan Gauld alan.gauld at btinternet.com
Tue Jun 2 09:27:28 CEST 2015


Forwarding to list.
Always use ReplyAll (or reply List if you have that option) to include 
the list.


-------- Forwarded Message --------
Subject: 	Re: [Tutor] Parsing/Crawling test College Class Site.
Date: 	Mon, 1 Jun 2015 20:42:48 -0400
From: 	bruce <badouglas at gmail.com>
To: 	Alan Gauld <alan.gauld at btinternet.com>



Seriously embarrassed!!

The issue is that the process doesn't generate the page
with the class list!!

I forgot to mention why I was posting this...


On Mon, Jun 1, 2015 at 8:40 PM, bruce <badouglas at gmail.com> wrote:
> Hi Alan.
>
> Thanks. So, here goes!
>
> The target site is:
> https://my.unlv.nevada.edu/psc/lvporprd/EMPLOYEE/HRMS/c/COMMUNITY_ACCESS.CLASS_SEARCH.GBL
>
> The following is a sample of the test code, as well as the url/posts
> of the pages as produced by the Firefox/Firebug process.
>
> Basically, a user accesses the initial url, and then selects a couple
> of items on the page, followed by the "Search" btn at the bottom of
> the page.
>
> The items to be input/selected by the user, in order, are:
> -subject (insert ACC, for accounting)
> -uncheck "Show Open Classes Only"
> -select the "Additional Search Criteria" expansion (bottom of the page)
>   --In the "Days of Week" dropdown, select the "include any of these days"
>   --select all days except Sat/Sun
>
> finally, select the "Search" btn, which generates the actual class
> list for the ACC dept.
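A small sketch of how those browser steps could be written down as an ordered plan of posts. The ICAction names are the ones that show up in the Firebug capture further down (with %24 decoded to $), and the order follows that capture. The labels, the STEPS list and plan_posts are made-up names for illustration, and counting ICStateNum up by one per request is only what the capture shows, not something the server is known to require.

```python
# (label, ICAction) pairs; ICAction values come from the Firebug capture.
STEPS = [
    ("expand Additional Search Criteria", "DERIVED_CLSRCH_SSR_EXPAND_COLLAPS$149$$1"),
    ("enter the subject", "SSR_CLSRCH_WRK_SUBJECT$0"),
    ("press Search", "CLASS_SRCH_WRK2_SSR_PB_CLASS_SRCH"),
]


def plan_posts(steps, first_state=1):
    """Pair each step with the ICStateNum it would be sent with."""
    plan = []
    for offset, (label, action) in enumerate(steps):
        plan.append({
            "label": label,
            "ICStateNum": first_state + offset,  # assumption: +1 per request
            "ICAction": action,
        })
    return plan


for step in plan_posts(STEPS):
    print("%(ICStateNum)d  %(ICAction)s  (%(label)s)" % step)
```

Keeping the plan in one list makes it easier to check the script's posts against the browser's, one step at a time.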
>
> During each action, the app might generate ajax which
> updates/interfaces with the backend. All of this can be seen/tracked
> (I think) if you have the Firebug plugin for firefox running, where
> you can then track the cookies/post actions.  The same data can be
> generated running LiveHttpHeaders (or some other network app).
>
> The process is running on CentOS, using Python 2.6.6.
>
> The test app is a mix of standard py, and liberal use of the system
> curl cmd. In order to generate one of the post vars, XPath is used to
> extract the value from the initial generated file/content.
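For anyone without libxml2dom, here is a rough standard-library sketch of the same extraction step: pull the value attribute out of the input whose id is ICSID. The HTML fragment and the names InputValueFinder and find_input_value are invented for illustration; the real page is much larger and may differ.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2


class InputValueFinder(HTMLParser):
    """Record the value attribute of the first <input> with a given id."""

    def __init__(self, wanted_id):
        HTMLParser.__init__(self)
        self.wanted_id = wanted_id
        self.value = None

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <input ... />.
        if tag != "input" or self.value is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("id") == self.wanted_id:
            self.value = attr_map.get("value")


def find_input_value(html, wanted_id):
    """Return the value of the matching input, or None if it is absent."""
    finder = InputValueFinder(wanted_id)
    finder.feed(html)
    return finder.value


# Made-up fragment standing in for the page curl fetched.
SAMPLE_PAGE = (
    "<form><input type='hidden' name='ICSID' id='ICSID' "
    "value='NwBLGklapJeRFylfen15jatQIwoGcJoQa+aO5AyhcwU=' /></form>"
)

icsid = find_input_value(SAMPLE_PAGE, "ICSID")
print(icsid)
```

A None result means the input was not found, which is worth checking before building any post.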
>
>
> #!/usr/bin/python
> #-------------------------------------------------------------
> #
> #    FileName:
> #        unlvClassTest.py
> #
> #    Creation date:
> #        jun/1/15
> #
> #    Modification/update:
> #
> #
> #    Purpose:
> #        test generating of the psoft dept data
> #
> #    Usage:
> #        cmdline unlvClassTest.py
> #
> #    App Logic:
> #
> #
> #
> #
> #
> #
> #-------------------------------------------------------------
>
> #test python script
> import subprocess
> import re
> import libxml2dom
> import urllib
> import urllib2
> import sys, string
> import time
> import os
> import os.path
> from hashlib import sha1
> from libxml2dom import Node
> from libxml2dom import NodeList
> import hashlib
> import pycurl
> import StringIO
> import uuid
> import simplejson
> from string import ascii_uppercase
>
> #=======================================
>
>
> execfile('/apps/parseapp2/ascii_strip.py')
> execfile('dir_defs_inc.py')
> appDir="/apps/parseapp2/"
>
> # data output filename
> datafile="unlvDept.dat"
>
>
> # global var for the parent/child list json
> plist={}
>
>
> cname="unlv.lwp"
>
> #----------------------------------------
>
> if __name__ == "__main__":
> # main app
>
>   #
>   # get the input struct, parse it, determine the level
>   #
>
>   cmd="echo '' > "+datafile
>   proc=subprocess.Popen(cmd, shell=True,stdout=subprocess.PIPE)
>   res=proc.communicate()[0].strip()
>
>   cmd="echo '' > "+cname
>   proc=subprocess.Popen(cmd, shell=True,stdout=subprocess.PIPE)
>   res=proc.communicate()[0].strip()
>
>   cmd='curl -vvv  '
>   cmd=cmd+'-A  "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.11) Gecko/2009061118 Fedora/3.0.11-1.fc9 Firefox/3.0.11"'
>   cmd=cmd+'   --cookie-jar '+cname+' --cookie '+cname+'    '
>   cmd=cmd+'-L "http://www.lonestar.edu/class-search.htm"'
>   #proc=subprocess.Popen(cmd, shell=True,stdout=subprocess.PIPE)
>   #res=proc.communicate()[0].strip()
>   #print res
>
>   cmd='curl -vvv  '
>   cmd=cmd+'-A  "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.11) Gecko/2009061118 Fedora/3.0.11-1.fc9 Firefox/3.0.11"'
>   cmd=cmd+'   --cookie-jar '+cname+' --cookie '+cname+'    '
>   cmd=cmd+'-L "https://campus.lonestar.edu/classsearch.htm"'
>   #proc=subprocess.Popen(cmd, shell=True,stdout=subprocess.PIPE)
>   #res1=proc.communicate()[0].strip()
>   #print res1
>
>
>    #initial page
>   cmd='curl -vvv  '
>   cmd=cmd+'-A  "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.11) Gecko/2009061118 Fedora/3.0.11-1.fc9 Firefox/3.0.11"'
>   cmd=cmd+'   --cookie-jar '+cname+' --cookie '+cname+'    '
>   cmd=cmd+'-L "https://my.unlv.nevada.edu/psc/lvporprd/EMPLOYEE/HRMS/c/COMMUNITY_ACCESS.CLASS_SEARCH.GBL"'
>   proc=subprocess.Popen(cmd, shell=True,stdout=subprocess.PIPE)
>   res2=proc.communicate()[0].strip()
>   #print cmd+"\n\n"
>
>   print res2
>
>   #sys.exit()   # debug stop: while active, none of the posts below are ever sent
>
>
>
>   # s contains HTML not XML text
>   d = libxml2dom.parseString(res2, html=1)
>
>   #-----------Form------------
>
>   selpath="//input[@id='ICSID']//attribute::value"
>
>   sel_ = d.xpath(selpath)
>
>
>   if (len(sel_) == 0):
>     #--print svpath
>     #--print "llllll"
>     #--print " select error"
>     sys.exit()
>
>   val=""
>   ndx=0
>   for a in sel_:
>
>     val=a.textContent.strip()
>
>   print val
>   #sys.exit()
>
>   if(val==""):
>     sys.exit()
>
>
>   #build the 1st post
>
>   ddd=1
>
>   post=""
>   post="ICAJAX=1"
>   post=post+"&ICAPPCLSDATA="
>   post=post+"&ICAction=DERIVED_CLSRCH_SSR_EXPAND_COLLAPS%24149%24%241"
>   post=post+"&ICActionPrompt=false"
>   post=post+"&ICAddCount="
>   post=post+"&ICAutoSave=0"
>   post=post+"&ICBcDomData=undefined"
>   post=post+"&ICChanged=-1"
>   post=post+"&ICElementNum=0"
>   post=post+"&ICFind="
>   post=post+"&ICFocus="
>   post=post+"&ICNAVTYPEDROPDOWN=0"
>   post=post+"&ICResubmit=0"
>   post=post+"&ICSID="+urllib.quote(val)
>   post=post+"&ICSaveWarningFilter=0"
>   post=post+"&ICStateNum="+str(ddd)
>   post=post+"&ICType=Panel"
>   post=post+"&ICXPos=0"
>   post=post+"&ICYPos=114"
>   post=post+"&ResponsetoDiffFrame=-1"
>   post=post+"&SSR_CLSRCH_WRK_SSR_OPEN_ONLY$chk$3=N"
>   post=post+"&SSR_CLSRCH_WRK_SUBJECT$0=ACC"
>   post=post+"&TargetFrameName=None"
>
>   cmd='curl -vvv  '
>   cmd=cmd+'-A  "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.11) Gecko/2009061118 Fedora/3.0.11-1.fc9 Firefox/3.0.11"'
>   cmd=cmd+'   --cookie-jar '+cname+' --cookie '+cname+'    '
>   cmd=cmd+'-e "https://my.unlv.nevada.edu/psc/lvporprd/EMPLOYEE/HRMS/c/COMMUNITY_ACCESS.CLASS_SEARCH.GBL?&"   '
>   # single quotes: stop the shell expanding $chk, $0, $3 inside the post data
>   cmd=cmd+"-d '"+post+"'   "
>   cmd=cmd+'-L "https://my.unlv.nevada.edu/psc/lvporprd/EMPLOYEE/HRMS/c/COMMUNITY_ACCESS.CLASS_SEARCH.GBL"'
>   proc=subprocess.Popen(cmd, shell=True,stdout=subprocess.PIPE)
>   res3=proc.communicate()[0].strip()
>   print cmd+"\n"
>   print res3
>
>
>
>   ##2nd post
>   ddd=ddd+1
>
>   post=""
>   post="ICAJAX=1"
>   post=post+"&ICAPPCLSDATA="
>   post=post+"&ICNAVTYPEDROPDOWN=0"
>   post=post+"&ICType=Panel"
>   post=post+"&ICElementNum=0"
>   post=post+"&ICStateNum="+str(ddd)
>   post=post+"&ICAction=SSR_CLSRCH_WRK_SUBJECT%240"
>   post=post+"&ICXPos=0"
>   post=post+"&ICYPos=501"
>   post=post+"&ResponsetoDiffFrame=-1"
>   post=post+"&TargetFrameName=None"
>   post=post+"&FacetPath=None"
>   post=post+"&ICSaveWarningFilter=0"
>   post=post+"&ICChanged=-1"
>   post=post+"&ICAutoSave=0"
>   post=post+"&ICResubmit=0"
>   post=post+"&ICSID="+urllib.quote(val)
>   post=post+"&ICActionPrompt=false"
>   post=post+"&ICBcDomData=undefined"
>   post=post+"&ICFind="
>   post=post+"&ICAddCount="
>   post=post+"&ICFocus=SSR_CLSRCH_WRK_INCLUDE_CLASS_DAYS%246"
>   post=post+"&SSR_CLSRCH_WRK_SUBJECT$0=ACC"
>   post=post+"&SSR_CLSRCH_WRK_SSR_OPEN_ONLY$chk$3=N"
>
>
>   cmd='curl -vvv  '
>   cmd=cmd+'-A  "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.11) Gecko/2009061118 Fedora/3.0.11-1.fc9 Firefox/3.0.11"'
>   cmd=cmd+'   --cookie-jar '+cname+' --cookie '+cname+'    '
>   cmd=cmd+'-e "https://my.unlv.nevada.edu/psc/lvporprd/EMPLOYEE/HRMS/c/COMMUNITY_ACCESS.CLASS_SEARCH.GBL?&"   '
>   # single quotes: stop the shell expanding $chk, $0, $3 inside the post data
>   cmd=cmd+"-d '"+post+"'   "
>   cmd=cmd+'-L "https://my.unlv.nevada.edu/psc/lvporprd/EMPLOYEE/HRMS/c/COMMUNITY_ACCESS.CLASS_SEARCH.GBL"'
>   proc=subprocess.Popen(cmd, shell=True,stdout=subprocess.PIPE)
>   res3=proc.communicate()[0].strip()
>   print cmd+"\n"
>   print res3+"\n\n\n\n\n"
>   print post
>
>
>
>   ##sys.exit()
>
>   ##3rd post
>   ddd=ddd+1
>
>   post=""
>   post="ICAJAX=1"
>   post=post+"&ICNAVTYPEDROPDOWN=0"
>   post=post+"&ICType=Panel"
>   post=post+"&ICElementNum=0"
>   post=post+"&ICStateNum="+str(ddd)
>   post=post+"&ICAction=CLASS_SRCH_WRK2_SSR_PB_CLASS_SRCH"
>   post=post+"&ICXPos=0"
>   post=post+"&ICYPos=501"
>   post=post+"&ResponsetoDiffFrame=-1"
>   post=post+"&TargetFrameName=None"
>   post=post+"&FacetPath=None"
>   post=post+"&ICFocus="
>   post=post+"&ICSaveWarningFilter=0"
>   post=post+"&ICChanged=-1"
>   post=post+"&ICAutoSave=0"
>   post=post+"&ICResubmit=0"
>   post=post+"&ICSID="+urllib.quote(val)
>   post=post+"&ICActionPrompt=false"
>   post=post+"&ICBcDomData=undefined"
>   post=post+"&ICFind="
>   post=post+"&ICAddCount="
>   post=post+"&ICAPPCLSDATA="
>   post=post+"&SSR_CLSRCH_WRK_INCLUDE_CLASS_DAYS$6=J"
>   post=post+"&SSR_CLSRCH_WRK_MON$chk$6=Y"
>   post=post+"&SSR_CLSRCH_WRK_MON$6=Y"
>   post=post+"&SSR_CLSRCH_WRK_TUES$chk$6=Y"
>   post=post+"&SSR_CLSRCH_WRK_TUES$6=Y"
>   post=post+"&SSR_CLSRCH_WRK_WED$chk$6=Y"
>   post=post+"&SSR_CLSRCH_WRK_WED$6=Y"
>   post=post+"&SSR_CLSRCH_WRK_THURS$chk$6=Y"
>   post=post+"&SSR_CLSRCH_WRK_THURS$6=Y"
>   post=post+"&SSR_CLSRCH_WRK_FRI$chk$6=Y"
>   post=post+"&SSR_CLSRCH_WRK_FRI$6=Y"
>
>
>   cmd='curl -vvv  '
>   cmd=cmd+'-A  "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.11) Gecko/2009061118 Fedora/3.0.11-1.fc9 Firefox/3.0.11"'
>   cmd=cmd+'   --cookie-jar '+cname+' --cookie '+cname+'    '
>   cmd=cmd+'-e "https://my.unlv.nevada.edu/psc/lvporprd/EMPLOYEE/HRMS/c/COMMUNITY_ACCESS.CLASS_SEARCH.GBL?&"   '
>   # single quotes: stop the shell expanding $chk, $6 inside the post data
>   cmd=cmd+"-d '"+post+"'   "
>   cmd=cmd+'-L "https://my.unlv.nevada.edu/psc/lvporprd/EMPLOYEE/HRMS/c/COMMUNITY_ACCESS.CLASS_SEARCH.GBL"'
>   proc=subprocess.Popen(cmd, shell=True,stdout=subprocess.PIPE)
>   res3=proc.communicate()[0].strip()
>   print cmd+"\n"
>   print res3+"\n\n\n\n\n"
>   print post
>
>
>
>   sys.exit()
>
>
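One thing to watch with command strings run through shell=True: inside double quotes the shell treats $name as a variable, so field names such as SSR_CLSRCH_WRK_SUBJECT$0 or ...$chk$3 can reach curl altered. A possible way around it, sketched here as an assumption rather than a tested fix for this site, is to hand subprocess a list of arguments so no shell is involved. build_curl_args and echo_via_child are hypothetical helper names; the curl options are the same ones the script already uses, plus -s. The child process below only echoes its argument back to show that the text arrives unchanged; curl itself is not run.

```python
import subprocess
import sys

URL = ("https://my.unlv.nevada.edu/psc/lvporprd/EMPLOYEE/HRMS/c/"
       "COMMUNITY_ACCESS.CLASS_SEARCH.GBL")


def build_curl_args(post_body, cookie_file="unlv.lwp", url=URL):
    """Return curl's argument list; no shell ever parses these strings."""
    return [
        "curl", "-s", "-L",
        "--cookie-jar", cookie_file,
        "--cookie", cookie_file,
        "-e", url + "?&",   # referer, as in the script
        "-d", post_body,
        url,
    ]


def echo_via_child(arg):
    """Pass one argument to a child process without a shell and read it back."""
    child = [sys.executable, "-c",
             "import sys; sys.stdout.write(sys.argv[1])", arg]
    proc = subprocess.Popen(child, stdout=subprocess.PIPE)
    return proc.communicate()[0].decode("ascii")


body = "SSR_CLSRCH_WRK_SSR_OPEN_ONLY$chk$3=N&SSR_CLSRCH_WRK_SUBJECT$0=ACC"
args = build_curl_args(body)
received = echo_via_child(body)
print(received == body)
```

To actually send the post, the list would go to subprocess.Popen(args, stdout=subprocess.PIPE) with no shell=True.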
> -------------------------------------------------------------------------------------
> The Firefox Actions:
> -The initial url
> https://my.unlv.nevada.edu/psc/lvporprd/EMPLOYEE/HRMS/c/COMMUNITY_ACCESS.CLASS_SEARCH.GBL
>  (performs a get)
>
>
> --Select the "Additional Search Criteria"
> ---generates the backend ajax, -- seen by the post action
>     ------https://my.unlv.nevada.edu/psc/lvporprd/EMPLOYEE/HRMS/c/COMMUNITY_ACCESS.CLASS_SEARCH.GBL
> post [ICAJAX=1&ICNAVTYPEDROPDOWN=0&ICType=Panel&ICElementNum=0&ICStateNum=1&ICAction=DERIVED_CLSRCH_SSR_EXPAND_COLLAPS%24149%24%241&ICXPos=0&ICYPos=191&ResponsetoDiffFrame=-1&TargetFrameName=None&FacetPath=None&ICFocus=&ICSaveWarningFilter=0&ICChanged=-1&ICAutoSave=0&ICResubmit=0&ICSID=NwBLGklapJeRFylfen15jatQIwoGcJoQa%2BaO5AyhcwU%3D&ICActionPrompt=false&ICBcDomData=undefined&ICFind=&ICAddCount=&ICAPPCLSDATA=&SSR_CLSRCH_WRK_SSR_OPEN_ONLY$chk$3=N]
>
> --selecting ACC as the dept
> --post url --- https://my.unlv.nevada.edu/psc/lvporprd/EMPLOYEE/HRMS/c/COMMUNITY_ACCESS.CLASS_SEARCH.GBL
>
> post[ICAJAX=1&ICNAVTYPEDROPDOWN=0&ICType=Panel&ICElementNum=0&ICStateNum=2&ICAction=SSR_CLSRCH_WRK_SUBJECT%240&ICXPos=0&ICYPos=362&ResponsetoDiffFrame=-1&TargetFrameName=None&FacetPath=None&ICFocus=SSR_CLSRCH_WRK_INCLUDE_CLASS_DAYS%246&ICSaveWarningFilter=0&ICChanged=-1&ICAutoSave=0&ICResubmit=0&ICSID=NwBLGklapJeRFylfen15jatQIwoGcJoQa%2BaO5AyhcwU%3D&ICActionPrompt=false&ICBcDomData=undefined&ICFind=&ICAddCount=&ICAPPCLSDATA=&SSR_CLSRCH_WRK_SUBJECT$0=ACC]
>
>
> -selecting the "SearchBTN"
> --post URL
> https://my.unlv.nevada.edu/psc/lvporprd/EMPLOYEE/HRMS/c/COMMUNITY_ACCESS.CLASS_SEARCH.GBL
>
> post [ICAJAX=1&ICNAVTYPEDROPDOWN=0&ICType=Panel&ICElementNum=0&ICStateNum=3&ICAction=CLASS_SRCH_WRK2_SSR_PB_CLASS_SRCH&ICXPos=0&ICYPos=633&ResponsetoDiffFrame=-1&TargetFrameName=None&FacetPath=None&ICFocus=&ICSaveWarningFilter=0&ICChanged=-1&ICAutoSave=0&ICResubmit=0&ICSID=NwBLGklapJeRFylfen15jatQIwoGcJoQa%2BaO5AyhcwU%3D&ICActionPrompt=false&ICBcDomData=undefined&ICFind=&ICAddCount=&ICAPPCLSDATA=&SSR_CLSRCH_WRK_INCLUDE_CLASS_DAYS$6=J&SSR_CLSRCH_WRK_MON$chk$6=Y&SSR_CLSRCH_WRK_MON$6=Y&SSR_CLSRCH_WRK_TUES$chk$6=Y&SSR_CLSRCH_WRK_TUES$6=Y&SSR_CLSRCH_WRK_WED$chk$6=Y&SSR_CLSRCH_WRK_WED$6=Y&SSR_CLSRCH_WRK_THURS$chk$6=Y&SSR_CLSRCH_WRK_THURS$6=Y&SSR_CLSRCH_WRK_FRI$chk$6=Y&SSR_CLSRCH_WRK_FRI$6=Y]
>
>
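Since the aim is to match what the browser sends, it may help to compare a captured post with the one the script builds, field by field. The sketch below does that with the standard library. The two strings are shortened versions of the first captured post and the first scripted post from this message, and post_fields and diff_posts are made-up helper names.

```python
try:
    from urllib.parse import parse_qsl  # Python 3
except ImportError:
    from urlparse import parse_qsl      # Python 2.6+


def post_fields(body):
    """Split a POST body into a dict, keeping fields with empty values."""
    return dict(parse_qsl(body, keep_blank_values=True))


def diff_posts(captured, generated):
    """Return (missing, extra, changed) field names, each sorted."""
    cap = post_fields(captured)
    gen = post_fields(generated)
    missing = sorted(set(cap) - set(gen))
    extra = sorted(set(gen) - set(cap))
    changed = sorted(k for k in set(cap) & set(gen) if cap[k] != gen[k])
    return missing, extra, changed


# Shortened for the example; the full strings work the same way.
captured = "ICAJAX=1&ICStateNum=1&ICYPos=191&FacetPath=None&ICFind="
generated = "ICAJAX=1&ICStateNum=1&ICYPos=114&ICFind=&SSR_CLSRCH_WRK_SUBJECT$0=ACC"

missing, extra, changed = diff_posts(captured, generated)
print("missing from script: %s" % missing)
print("extra in script: %s" % extra)
print("different values: %s" % changed)
```

Whether any given difference matters to the server is not known from the capture alone; the diff only shows where the two requests part ways.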
>
> On Mon, Jun 1, 2015 at 7:48 PM, Alan Gauld <alan.gauld at btinternet.com> wrote:
>> On 02/06/15 00:06, bruce wrote:
>>>
>>> Hi. I'm creating a test py app to do a quick crawl of a couple of
>>> pages of a psoft class schedule site. Before I start asking
>>> questions/pasting/posting code... I wanted to know if this is the kind
>>> of thing that can/should be here..
>>>
>>
>> Probably. We are targeted at beginners to Python and focus
>> on core language and standard library. If you are using
>> the standard library modules to build your app then certainly.
>>
>> If you are using a third party module then we may/may not
>> be able to help depending on who, if anyone, within the
>> group is familiar with it. In that case you may be better
>> on the <whichever toolset you are using> forum.
>>
>>> The real issues I'm facing aren't so much pythonic as much as probably
>>> dealing with getting the cookies/post attributes correct. There's
>>> ongoing jscript on the site, but I'm hopeful/confident :) that if the
>>> cookies/post is correct, then the target page can be fetched..
>>
>>
>> Post sample code, any errors you get and as specific a
>> description of the issue as you can.
>> Include OS and Python versions.
>> Use plain text not HTML to preserve code formatting.
>>
>>
>> If it turns out to be way off topic we'll tell you (politely)
>> where you should go for help.
>>
>> --
>> Alan G
>> Author of the Learn to Program web site
>> http://www.alan-g.me.uk/
>> http://www.amazon.com/author/alan_gauld
>> Follow my photo-blog on Flickr at:
>> http://www.flickr.com/photos/alangauldphotos
>>
>>




