Newbie..Needs Help

Graham Feeley grahamjfeeley at optusnet.com.au
Sat Jul 29 23:55:13 EDT 2006


Well, well, well, Anthra, you are a clever person, aren't you!
I nearly fell over when I read your post.
Would it help if we used another website to gather data?
As you stated, the tables are not all that well structured.
Well, I will give this one a go first, and if there is anything I can do for
you, just ask and I will try my best.
I really appreciate what you have done.
Of course I will try to follow your code and see if any of it sinks in...
LOL
Regards
Graham

"Anthra Norell" <anthra.norell at tiscalinet.ch> wrote in message 
news:mailman.8704.1154205950.27775.python-list at python.org...
>
> ----- Original Message -----
> From: "Graham Feeley" <grahamjfeeley at optusnet.com.au>
> Newsgroups: comp.lang.python
> To: <python-list at python.org>
> Sent: Friday, July 28, 2006 5:11 PM
> Subject: Re: Newbie..Needs Help
>
>
>> Thanks Nick for the reply
>> Of course my first post was a general posting to see if someone would be
>> able to help
>> here is the website which holds the data I require
>> http://www.aapracingandsports.com.au/racing/raceresultsonly.asp?storydate=27/07/2006&meetings=bdgo
>>
>> The fields required are as follows
>>  NSW Tab
>> #      Win      Place
>>  2    $4.60   $2.40
>>  5                $2.70
>>  1                $1.30
>>  Quin    $23.00
>>  Tri  $120.70
>> Field names are
>> Date   ( not important )
>> Track................= Bendigo
>> RaceNo............on web page
>> Res1st...............2
>> Res2nd..............5
>> Res3rd..............1
>> Div1..................$4.60
>> DivPlc...............$2.40
>> Div2..................$2.70
>> Div3..................$1.30
>> DivQuin.............$23.00
>> DivTrif...............$120.70
>> As you can see there are a total of 6 meetings involved and I would need 
>> to
>> put in this parameter ( =bdgo) or (=gosf) these are the meeting tracks
>>
>> Hope this more enlightening
>> Regards
>> graham
>>
>
> Graham,
>
> Only a few days ago I gave someone a push who had a very similar problem. 
> I handed him code ready to run. I am doing it again for
> you.
>      The site you use is much harder to interpret than the other one was, 
> and so I took the opportunity to experimentally stretch
> the envelope of a new brainchild of mine: a stream editor called SE. It 
> is new, so I am also taking the opportunity to demo it.
>      One correspondent in the previous exchange was Paul McGuire, the 
> author of 'pyparsing'. He made a good case for using 'pyparsing'
> in situations like yours. Unlike a stream editor, a parser reads structure 
> in addition to data and can relate the data to its
> context.
>      Analyzing the tables I noticed that they are poorly structured: the 
> first column contains both data and ids. Some records are
> shorter than others, so column ids have to be guessed and hard-coded. 
> Missing data is sometimes a dash, sometimes nothing. The
> inconsistencies seem to be consistent, though, across the eight tables of 
> the page, so they can be formalized with some confidence
> that they are systematic. If Paul could spend some time on this, I'd be 
> much interested to see how he would handle the relative
> disorder.
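[Editor's note: the inconsistent missing-data convention described above — sometimes a dash, sometimes nothing — can be normalized before any parsing. A minimal sketch, independent of SE, using made-up sample cells modeled on that description:]

```python
def normalize_cell(cell):
    """Map every flavour of 'missing' -- empty string, whitespace-only,
    or a bare dash -- onto a single '-' marker."""
    cell = cell.strip()
    return cell if cell and cell != '-' else '-'

# Hypothetical cell values, modeled on the page's inconsistencies
cells = ['$4.60', '-', '', '  ', '$2.40']
print([normalize_cell(c) for c in cells])  # ['$4.60', '-', '-', '-', '$2.40']
```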
>      Another thought: the time one invests in developing a program should 
> not exceed the time it can save overall (not talking
> about recreational programming). Web pages justify an extra measure of 
> caution, because they may change at any time, and when they do,
> the reader stops working and a revision becomes an unscheduled priority.
>
> So, here is your program. I write it so you can copy the whole thing to a 
> file. Next copy SE from the Cheese Shop. Unzip it and put
> both SE.PY and SEL.PY where your Python programs are. Then 'execfile' the 
> code in an IDLE window, call display_horse_race_data
> ('Bendigo', '27/07/2006') and see what happens. You'll have to wait ten 
> seconds or so.
>
> Regards
>
> Frederic
>
> ######################################################################################
>
> TRACKS = { 'New Zealand' : '',
>           'Bendigo'     : 'bdgo',
>           'Gosford'     : 'gosf',
>           'Northam'     : 'nthm',
>           'Port Augusta': 'pta',
>           'Townsville'  : 'town',
>         }
>
>
> # This function does it all once all functions are loaded. If nothing
> # shows, the page has no data.
>
> def display_horse_race_data (track, date, clip_summary = 100):
>
>   """
>      tracks: e.g. 'Bendigo' or 'bdgo'
>      date: e.g. '27/07/2006'
>      clip_summary: each table has a long summary header.
>        the argument says how much of it to show.
>   """
>
>   if track [0].isupper ():
>      if TRACKS.has_key (track):
>         track = TRACKS [track]
>      else:
>         print 'No such track %s' % track
>         return
>   open ()
>   header, records = get_horse_race_data (track, date)
>   show_records (header, records, clip_summary)
>
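[Editor's note: the full-name-to-code lookup in display_horse_race_data above can be isolated and exercised on its own. A sketch of the same logic, with the TRACKS table abbreviated and unknown names yielding None instead of printing a message:]

```python
TRACKS = {'Bendigo': 'bdgo', 'Gosford': 'gosf'}  # abbreviated copy

def track_code(track):
    # Names starting with an upper-case letter are full track names
    # and go through the lookup table; short codes pass through as-is.
    if track[0].isupper():
        return TRACKS.get(track)  # None for unknown tracks
    return track

print(track_code('Bendigo'))  # bdgo
print(track_code('gosf'))     # gosf
```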
>
>
> ######################################################################################
>
>
> import SE, urllib
>
> _is_open = 0
>
> def open ():
>
>   global _is_open
>
>   if not _is_open:   # Skip repeat calls
>
>      global Data_Filter, Null_Data_Marker, Tag_Stripper, Space_Deflator, CSV_Maker
>
>      # Making the following Editors is a step-by-step process, adding one 
> element at a time and
>      # looking at what it does and what should be done next.
>      # Get pertinent data segments
>      header            = ' "~(?i)Today\'s Results - .+?<div 
> style="padding-top:5px;">~==*END*OF*HEADER*" '
>      race_summary      = ' "~(?i)Race [1-9].*?</font><br>~==" '
>      data_segment      = ' "~(?i)<table border=0 width=100% cellpadding=0 
> cellspacing=0>(.|\n)*?</table>~==*END*OF*SEGMENT*" '
>      Data_Filter = SE.SE (' <EAT> ' + header + race_summary + 
> data_segment)
>
>      # Some data items are empty. Fill them with a dash.
>      mark_null_data = ' "~(?i)>\s* \s*</td>~=>-" '
>      Null_Data_Marker = SE.SE (mark_null_data + ' " = " ')
>
>      # Dump the tags
>      eat_tags     = ' "~<(.|\n)*?>~=" '
>      eat_comments = ' "~<!--(.|\n)*?-->~=" '
>      Tag_Stripper = SE.SE (eat_tags + eat_comments + ' (13)= ')
>
>      # Visual inspection is easier without all those tabs and empty lines
>      Space_Deflator = SE.SE ('"~\n[\t ]+~=(10)" "~[\t ]+\n=(10)" | 
> "~\n+~=(10)"')
>
>      # Translating line breaks to tabs will make a tab-delimited CSV
>      CSV_Maker = SE.SE ( '(10)=(9)' )
>
>      _is_open = 1   # Block repeat calls
>
>
>
> def close ():
>
>   """Call close () if you want to free up memory"""
>
>   # The module names need a 'global' declaration as well: without it,
>   # the 'del' statements below make them local to the function, and
>   # urllib.urlcleanup () raises UnboundLocalError.
>   global Data_Filter, Null_Data_Marker, Tag_Stripper, Space_Deflator, CSV_Maker
>   global SE, urllib
>   del Data_Filter, Null_Data_Marker, Tag_Stripper, Space_Deflator, CSV_Maker
>   urllib.urlcleanup ()
>   del urllib, SE
>
>
>
> def get_horse_race_data (track, date):
>
>   """tracks: 'bdgo', 'gosf', etc. -- the short codes in TRACKS
>      date: e.g. '27/07/2006'
>      The website shows partial data or none at all, probably depending on
>      race schedules. The relevance of the date in the url is unclear.
>   """
>
>   def make_url (track, date):
>      return 'http://www.aapracingandsports.com.au/racing/raceresultsonly.asp?storydate=%s&meetings=%s' % (date, track)
>
>   page = urllib.urlopen (make_url (track, date))
>   p = page.read ()
>   page.close ()
>   # When developing the program, don't get the file from the internet on
>   # each call. Download it and read it from the hard disk.
>
>   raw_data = Data_Filter (p)
>   raw_data_marked = Null_Data_Marker (raw_data)
>   raw_data_no_tags = Tag_Stripper (raw_data_marked)
>   raw_data_compact = Space_Deflator (raw_data_no_tags)
>   data = CSV_Maker (raw_data_compact)
>   header, tables = data.split ('*END*OF*HEADER*', 1)
>   records = tables.split ('*END*OF*SEGMENT*')
>   return header, records [:-1]
>
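[Editor's note: for readers who do not have SE installed, the tag-stripping and whitespace-deflating stages of the pipeline above can be approximated with the standard re module. A rough sketch, with regexes as crude as the SE expressions and the same caveats:]

```python
import re

def strip_tags(html):
    # Comments first, then any remaining tags (assumes no '>' inside
    # attribute values, which holds for the pages discussed here).
    html = re.sub(r'<!--(?:.|\n)*?-->', '', html)
    return re.sub(r'<(?:.|\n)*?>', '', html)

def deflate_space(text):
    # Drop tabs/spaces around line breaks, then squeeze newline runs.
    text = re.sub(r'\n[\t ]+', '\n', text)
    text = re.sub(r'[\t ]+\n', '\n', text)
    return re.sub(r'\n+', '\n', text)

# Hypothetical fragment in the shape of the site's result tables
sample = '<table>\n\t<td>$4.60</td>\n\n\t<td>-</td>\n</table>'
print(deflate_space(strip_tags(sample)))
```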
>
>
> def show_record (record, clip_summary = 100):
>
>   """clip_summary: None will display it all"""
>
>   # The records all have 55 fields.
>   # These are the relevant indexes:
>   SUMMARY                   =  0
>   FIRST                     =  8
>   FIRST_NSWTAB_WIN          =  9
>   FIRST_NSWTAB_PLACE        = 10
>   FIRST_TABCORP_WIN         = 11
>   FIRST_TABCORP_PLACE       = 12
>   FIRST_UNITAB_WIN          = 13
>   FIRST_UNITAB_PLACE        = 14
>   SECOND                    = 15
>   SECOND_NSWTAB_PLACE       = 17
>   SECOND_TABCORP_PLACE      = 19
>   SECOND_UNITAB_PLACE       = 21
>   THIRD                     = 22
>   THIRD_NSWTAB_PLACE        = 23
>   THIRD_TABCORP_PLACE       = 24
>   THIRD_UNITAB_PLACE        = 25
>   QUIN_NSWTAB_PLACE         = 28
>   QUIN_TABCORP_PLACE        = 30
>   QUIN_UNITAB_PLACE         = 32
>   EXACTA_NSWTAB_PLACE       = 35
>   EXACTA_TABCORP_PLACE      = 37
>   EXACTA_UNITAB_PLACE       = 39
>   TRI_NSWTAB_PLACE          = 41
>   TRI_TABCORP_PLACE         = 42
>   TRI_UNITAB_PLACE          = 43
>   DDOUBLE_NSWTAB_PLACE      = 46
>   DDOUBLE_TABCORP_PLACE     = 48
>   DDOUBLE_UNITAB_PLACE      = 50
>   SUB_SCR_NSW               = 52
>   SUB_SCR_TABCORP           = 53
>   SUB_SCR_UNITAB            = 54
>
>   if clip_summary is None:
>      print record [SUMMARY]
>   else:
>      print record [SUMMARY] [:clip_summary] + '...'
>      print
>
>   # Your specification:
>   # Date   ( not important )          -> In url and summary of first record
>   # Track................= Bendigo    -> In url and summary of first record
>   # RaceNo............on web page     -> In summary (index of record + 1?)
>   # Res1st...............2
>   # Res2nd..............5
>   # Res3rd..............1
>   # Div1..................$4.60
>   # DivPlc...............$2.40
>   # Div2..................$2.70
>   # Div3..................$1.30
>   # DivQuin.............$23.00
>   # DivTrif...............$120.70
>
>   print 'Res1st  > %s' % record [FIRST]
>   print 'Res2nd  > %s' % record [SECOND]
>   print 'Res3rd  > %s' % record [THIRD]
>   print 'Div1    > %s' % record [FIRST_NSWTAB_WIN]
>   print 'DivPlc  > %s' % record [FIRST_NSWTAB_PLACE]
>   print 'Div2    > %s' % record [SECOND_NSWTAB_PLACE]
>   print 'Div3    > %s' % record [THIRD_NSWTAB_PLACE]
>   print 'DivQuin > %s' % record [QUIN_NSWTAB_PLACE]
>   print 'DivTrif > %s' % record [TRI_NSWTAB_PLACE]
>
>   # Add others as you like from the list of index names above
>
>
>
> def show_records (header, records, clip_summary = 100):
>
>   print '\n%s\n' % header
>   for record in records:
>      show_record (record.split ('\t'), clip_summary)
>      print '\n'
>
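[Editor's note: since some records are shorter than others, as noted earlier in the thread, the hard-coded indexes in show_record can raise IndexError on a short record. A defensive accessor, sketched as one possible refinement:]

```python
def field(record, index, default='-'):
    # Return the field at 'index', or a placeholder when the record
    # is too short (some race tables omit trailing columns).
    try:
        return record[index]
    except IndexError:
        return default

record = ['Race 1 summary', '2', '$4.60']   # hypothetical short record
print(field(record, 2))   # $4.60
print(field(record, 9))   # -
```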
>
> ##########################################################################
> #
> # show_records (records, 74) displays:
> #
> # Today's Results - 27/07/2006 BENDIGO
> #
> # Race 1 results:Carlsruhe Roadhouse Mdn Plate $11,000 2yo Maiden 1400m 
> Appr...
> #
> # Res1st  > 2
> # Res2nd  > 5
> # Res3rd  > 1
> # Div1    > $4.60
> # DivPlc  > $2.40
> # Div2    > $2.70
> # Div3    > $1.30
> # DivQuin > $23.00
> # DivTrif > $120.70
> #
> #
> # Race 2 results:Gerard K. House P/L Mdn Plate $11,000 3yo Maiden 1400m 
> Appr...
> #
> # Res1st  > 6
> # Res2nd  > 7
> # Res3rd  > 5
> # Div1    > $3.50
> # DivPlc  > $1.60
> # Div2    > $2.60
> # Div3    > $1.40
> # DivQuin > $18.60
> # DivTrif > $75.80
> #
> #
> # Race 3 results:Richard Cambridge Printers Mdn $11,000 3yo Maiden 1400m 
> Appr...
> #
> # Res1st  > 11
> # Res2nd  > 12
> # Res3rd  > 1
> # Div1 ...
> #
> # ... etc
> #
>
>
> 





More information about the Python-list mailing list