Newbie..Needs Help
Graham Feeley
grahamjfeeley at optusnet.com.au
Sat Jul 29 23:55:13 EDT 2006
Well, well, well, Anthra, you are a clever person, aren't you!
I nearly fell over when I read your post.
Would it help if we used another web site to gather data?
As you stated, the tables are not all that well structured.
Well, I will give this one a go first, and if there is anything I can do for
you, just ask and I will try my best.
I really appreciate what you have done.
Of course I will try to follow your code to see if any will fall on
me....LOL
Regards
Graham
"Anthra Norell" <anthra.norell at tiscalinet.ch> wrote in message
news:mailman.8704.1154205950.27775.python-list at python.org...
>
> ----- Original Message -----
> From: "Graham Feeley" <grahamjfeeley at optusnet.com.au>
> Newsgroups: comp.lang.python
> To: <python-list at python.org>
> Sent: Friday, July 28, 2006 5:11 PM
> Subject: Re: Newbie..Needs Help
>
>
>> Thanks Nick for the reply
>> Of course my first post was a general posting to see if someone would be
>> able to help
>> here is the website which holds the data I require
>> http://www.aapracingandsports.com.au/racing/raceresultsonly.asp?storydate=27/07/2006&meetings=bdgo
>>
>> The fields required are as follows
>> NSW Tab
>> # Win Place
>> 2 $4.60 $2.40
>> 5 $2.70
>> 1 $1.30
>> Quin $23.00
>> Tri $120.70
>> Field names are
>> Date ( not important )
>> Track................= Bendigo
>> RaceNo............on web page
>> Res1st...............2
>> Res2nd..............5
>> Res3rd..............1
>> Div1..................$4.60
>> DivPlc...............$2.40
>> Div2..................$2.70
>> Div3..................$1.30
>> DivQuin.............$23.00
>> DivTrif...............$120.70
>> As you can see there are a total of 6 meetings involved and I would need
>> to put in this parameter (=bdgo) or (=gosf); these are the meeting tracks.
>>
>> Hope this is more enlightening
>> Regards
>> graham
>>
>
> Graham,
>
> Only a few days ago I gave someone a push who had a very similar problem.
> I handed him code ready to run. I am doing it again for
> you.
> The site you use is much harder to interpret than the other one was
> and so I took the opportunity to experimentally stretch
> the envelope of a new brain child of mine: a stream editor called SE. It
> is new and so I also take the opportunity to demo it.
> One correspondent in the previous exchange was Paul McGuire, the
> author of 'pyparsing'. He made a good case for using 'pyparsing'
> in situations like yours. Unlike a stream editor, a parser reads structure
> in addition to data and can relate the data to its
> context.
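To illustrate that distinction (this is not Paul's library, just the HTMLParser that ships with Python; the class and the sample row are my own invention): a parser fires an event per tag, so a cell's text stays tied to its position in the table instead of floating past in a flat stream. A minimal sketch, using the modern `html.parser` import (in the Python of this era the module was called `HTMLParser`):

```python
from html.parser import HTMLParser

class CellCollector (HTMLParser):
    """Collects the text of each <td>, so data stays tied to its
    place in the table -- structure plus data, not data alone."""

    def __init__ (self):
        super ().__init__ ()
        self.in_cell = False
        self.cells = []

    def handle_starttag (self, tag, attrs):
        if tag == 'td':                 # a new cell opens
            self.in_cell = True
            self.cells.append ('')

    def handle_endtag (self, tag):
        if tag == 'td':                 # the cell closes
            self.in_cell = False

    def handle_data (self, data):
        if self.in_cell:                # text between <td> and </td>
            self.cells [-1] += data.strip ()

collector = CellCollector ()
collector.feed ('<table><tr><td>2</td><td>$4.60</td><td>$2.40</td></tr></table>')
# collector.cells is now ['2', '$4.60', '$2.40']
```

The payoff is that each value arrives already labeled by its column position, which is exactly the context a stream editor has to reconstruct by counting fields afterwards.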
> Analyzing the tables I noticed that they are poorly structured: The
> first column contains both data and ids. Some records are
> shorter than others, so column ids have to be guessed and hard coded.
> Missing data sometimes is a dash, sometimes nothing. The
> inconsistencies seem to be consistent, though, down the eight tables of
> the page. So they can be formalized with some confidence
> that they are systematic. If Paul could spend some time on this, I'd be
> much interested to see how he would handle the relative
> disorder.
> Another thought: the time one invests in developing a program should
> not exceed the time it can save overall (not talking
> about recreational programming). Web pages justify an extra measure of
> caution, because they may change at any time, and every time they do,
> the reader stops working and needs a revision, which becomes an
> unscheduled priority.
>
> So, here is your program. I write it so you can copy the whole thing to a
> file. Next copy SE from the Cheese Shop. Unzip it and put
> both SE.PY and SEL.PY where your Python programs are. Then 'execfile' the
> code in an IDLE window, call 'display_horse_race_data
> ('Bendigo', '27/07/2006')' and see what happens. You'll have to wait ten
> seconds or so.
>
> Regards
>
> Frederic
>
> ######################################################################################
>
> TRACKS = { 'New Zealand' : '',
> 'Bendigo' : 'bdgo',
> 'Gosford' : 'gosf',
> 'Northam' : 'nthm',
> 'Port Augusta': 'pta',
> 'Townsville' : 'town',
> }
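A side note on the lookup this table serves: display_horse_race_data below tests track[0].isupper() to decide whether it got a full name or a short code. The same normalization can be sketched with dict.get, which folds the membership test and the fetch into one call (the helper name is mine, and the table is abridged):

```python
TRACKS = { 'Bendigo': 'bdgo', 'Gosford': 'gosf' }   # abridged copy for the sketch

def track_code (track):
    # Short codes ('bdgo') pass through unchanged; full names
    # ('Bendigo') are translated via the table; None signals an
    # unknown track, which the caller can report.
    if not track [0].isupper ():
        return track
    return TRACKS.get (track)
```

So track_code ('Bendigo') and track_code ('bdgo') both yield 'bdgo', while an unlisted name yields None instead of raising a KeyError.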
>
>
> # This function does it all once all functions are loaded. If nothing
> # shows, the page has no data.
>
> def display_horse_race_data (track, date, clip_summary = 100):
>
>     """
>     track: e.g. 'Bendigo' or 'bdgo'
>     date: e.g. '27/07/2006'
>     clip_summary: each table has a long summary header.
>         the argument says how much of it to show.
>     """
>
>     if track [0].isupper ():
>         if track in TRACKS:
>             track = TRACKS [track]
>         else:
>             print 'No such track %s' % track
>             return
>     open ()
>     header, records = get_horse_race_data (track, date)
>     show_records (header, records, clip_summary)
>
>
>
> ######################################################################################
>
>
> import SE, urllib
>
> _is_open = 0
>
> def open ():
>
>     global _is_open
>
>     if not _is_open:    # Skip repeat calls
>
>         global Data_Filter, Null_Data_Marker, Tag_Stripper, Space_Deflator, CSV_Maker
>
>         # Making the following Editors is a step-by-step process, adding
>         # one element at a time and looking at what it does and what
>         # should be done next.
>         # Get pertinent data segments
>         header = ' "~(?i)Today\'s Results - .+?<div style="padding-top:5px;">~==*END*OF*HEADER*" '
>         race_summary = ' "~(?i)Race [1-9].*?</font><br>~==" '
>         data_segment = ' "~(?i)<table border=0 width=100% cellpadding=0 cellspacing=0>(.|\n)*?</table>~==*END*OF*SEGMENT*" '
>         Data_Filter = SE.SE (' <EAT> ' + header + race_summary + data_segment)
>
>         # Some data items are empty. Fill them with a dash.
>         mark_null_data = ' "~(?i)>\s*&nbsp;\s*</td>~=>-" '
>         Null_Data_Marker = SE.SE (mark_null_data + ' "&nbsp;= " ')
>
>         # Dump the tags
>         eat_tags = ' "~<(.|\n)*?>~=" '
>         eat_comments = ' "~<!--(.|\n)*?-->~=" '
>         Tag_Stripper = SE.SE (eat_tags + eat_comments + ' (13)= ')
>
>         # Visual inspection is easier without all those tabs and empty lines
>         Space_Deflator = SE.SE ('"~\n[\t ]+~=(10)" "~[\t ]+\n=(10)" | "~\n+~=(10)"')
>
>         # Translating line breaks to tabs will make a tab-delimited CSV
>         CSV_Maker = SE.SE ( '(10)=(9)' )
>
>         _is_open = 1    # Block repeat calls
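For readers who want to follow along without installing SE: the Tag_Stripper and Space_Deflator steps correspond roughly to chained re.sub calls on the same patterns. This is a sketch, not a drop-in replacement (SE applies its substitutions in a single pass, and the data-segment filtering above is not reproduced here):

```python
import re

def strip_tags (html):
    # Comments first, then tags -- the eat_comments / eat_tags expressions
    html = re.sub (r'<!--(.|\n)*?-->', '', html)
    return re.sub (r'<(.|\n)*?>', '', html)

def deflate_space (text):
    # Drop tabs and blanks around line breaks, then collapse runs
    # of newlines -- the three Space_Deflator expressions
    text = re.sub (r'\n[\t ]+', '\n', text)
    text = re.sub (r'[\t ]+\n', '\n', text)
    return re.sub (r'\n+', '\n', text)
```

For example, strip_tags ('<!-- x --><td>2</td>') yields '2', and deflate_space squeezes a ragged block like '2\n\t $4.60 \t\n\n$2.40' down to one value per line.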
>
>
>
> def close ():
>
>     """Call close () if you want to free up memory"""
>
>     global Data_Filter, Null_Data_Marker, Tag_Stripper, Space_Deflator, CSV_Maker
>     global urllib, SE
>     del Data_Filter, Null_Data_Marker, Tag_Stripper, Space_Deflator, CSV_Maker
>     urllib.urlcleanup ()
>     del urllib
>     del SE
>
>
>
> def get_horse_race_data (track, date):
>
>     """track: e.g. 'bdgo' or 'gosf'
>     date: e.g. '27/07/2006'
>     The website shows partial data or none at all, probably depending on
>     race schedules. The relevance of the date in the url is unclear.
>     """
>
>     def make_url (track, date):
>         return 'http://www.aapracingandsports.com.au/racing/raceresultsonly.asp?storydate=%s&meetings=%s' % (date, track)
>
>     page = urllib.urlopen (make_url (track, date))
>     p = page.read ()
>     page.close ()
>     # When developing the program, don't get the file from the internet on
>     # each call. Download it and read it from the hard disk.
>
>     raw_data = Data_Filter (p)
>     raw_data_marked = Null_Data_Marker (raw_data)
>     raw_data_no_tags = Tag_Stripper (raw_data_marked)
>     raw_data_compact = Space_Deflator (raw_data_no_tags)
>     data = CSV_Maker (raw_data_compact)
>     header, tables = data.split ('*END*OF*HEADER*', 1)
>     records = tables.split ('*END*OF*SEGMENT*')
>     return header, records [:-1]
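The two split calls at the end carry the whole record separation: the filter planted *END*OF*HEADER* and *END*OF*SEGMENT* markers in the text, and splitting on them yields the header plus one string per race, with the [:-1] dropping the empty piece left after the final marker. A toy run on made-up data:

```python
data = 'HEADER*END*OF*HEADER*race 1 fields*END*OF*SEGMENT*race 2 fields*END*OF*SEGMENT*'

header, tables = data.split ('*END*OF*HEADER*', 1)   # maxsplit=1: header vs rest
records = tables.split ('*END*OF*SEGMENT*')[:-1]     # drop trailing empty string
# header is 'HEADER'; records is ['race 1 fields', 'race 2 fields']
```

The maxsplit of 1 matters: without it, a stray *END*OF*HEADER* inside the body would shear off data.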
>
>
>
> def show_record (record, clip_summary = 100):
>
>     """clip_summary: None will display it all"""
>
>     # The records all have 55 fields.
>     # These are the relevant indexes:
>     SUMMARY = 0
>     FIRST = 8
>     FIRST_NSWTAB_WIN = 9
>     FIRST_NSWTAB_PLACE = 10
>     FIRST_TABCORP_WIN = 11
>     FIRST_TABCORP_PLACE = 12
>     FIRST_UNITAB_WIN = 13
>     FIRST_UNITAB_PLACE = 14
>     SECOND = 15
>     SECOND_NSWTAB_PLACE = 17
>     SECOND_TABCORP_PLACE = 19
>     SECOND_UNITAB_PLACE = 21
>     THIRD = 22
>     THIRD_NSWTAB_PLACE = 23
>     THIRD_TABCORP_PLACE = 24
>     THIRD_UNITAB_PLACE = 25
>     QUIN_NSWTAB_PLACE = 28
>     QUIN_TABCORP_PLACE = 30
>     QUIN_UNITAB_PLACE = 32
>     EXACTA_NSWTAB_PLACE = 35
>     EXACTA_TABCORP_PLACE = 37
>     EXACTA_UNITAB_PLACE = 39
>     TRI_NSWTAB_PLACE = 41
>     TRI_TABCORP_PLACE = 42
>     TRI_UNITAB_PLACE = 43
>     DDOUBLE_NSWTAB_PLACE = 46
>     DDOUBLE_TABCORP_PLACE = 48
>     DDOUBLE_UNITAB_PLACE = 50
>     SUB_SCR_NSW = 52
>     SUB_SCR_TABCORP = 53
>     SUB_SCR_UNITAB = 54
>
>     if clip_summary is None:
>         print record [SUMMARY]
>     else:
>         print record [SUMMARY] [:clip_summary] + '...'
>     print
>
>     # Your specification:
>     # Date ( not important )          -> In url and summary of first record
>     # Track................= Bendigo -> In url and summary of first record
>     # RaceNo............on web page  -> In summary (index of record + 1?)
>     # Res1st...............2
>     # Res2nd..............5
>     # Res3rd..............1
>     # Div1..................$4.60
>     # DivPlc...............$2.40
>     # Div2..................$2.70
>     # Div3..................$1.30
>     # DivQuin.............$23.00
>     # DivTrif...............$120.70
>
>     print 'Res1st > %s' % record [FIRST]
>     print 'Res2nd > %s' % record [SECOND]
>     print 'Res3rd > %s' % record [THIRD]
>     print 'Div1 > %s' % record [FIRST_NSWTAB_WIN]
>     print 'DivPlc > %s' % record [FIRST_NSWTAB_PLACE]
>     print 'Div2 > %s' % record [SECOND_NSWTAB_PLACE]
>     print 'Div3 > %s' % record [THIRD_NSWTAB_PLACE]
>     print 'DivQuin > %s' % record [QUIN_NSWTAB_PLACE]
>     print 'DivTrif > %s' % record [TRI_NSWTAB_PLACE]
>
>     # Add others as you like from the list of index names above
>
>
>
> def show_records (header, records, clip_summary = 100):
>
>     print '\n%s\n' % header
>     for record in records:
>         show_record (record.split ('\t'), clip_summary)
>         print '\n'
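Each record reaches show_record as one tab-delimited string; the split ('\t') above turns it into the 55-element list that the index constants address. A shortened sketch with made-up filler values (real records have the full 55 fields):

```python
FIRST = 8              # winner's saddle number
FIRST_NSWTAB_WIN = 9   # NSW Tab win dividend

# Eight filler fields stand in for the summary and leading columns,
# so the winner lands at index 8 and its dividend at index 9
record = '\t'.join (['x'] * 8 + ['2', '$4.60']).split ('\t')
# record [FIRST] is '2'; record [FIRST_NSWTAB_WIN] is '$4.60'
```

This is why the index constants are fragile: if the site adds or drops a column, every number from that point on shifts.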
>
>
> ##########################################################################
> #
> # show_records (header, records, 74) displays:
> #
> # Today's Results - 27/07/2006 BENDIGO
> #
> # Race 1 results:Carlsruhe Roadhouse Mdn Plate $11,000 2yo Maiden 1400m Appr...
> #
> # Res1st > 2
> # Res2nd > 5
> # Res3rd > 1
> # Div1 > $4.60
> # DivPlc > $2.40
> # Div2 > $2.70
> # Div3 > $1.30
> # DivQuin > $23.00
> # DivTrif > $120.70
> #
> #
> # Race 2 results:Gerard K. House P/L Mdn Plate $11,000 3yo Maiden 1400m Appr...
> #
> # Res1st > 6
> # Res2nd > 7
> # Res3rd > 5
> # Div1 > $3.50
> # DivPlc > $1.60
> # Div2 > $2.60
> # Div3 > $1.40
> # DivQuin > $18.60
> # DivTrif > $75.80
> #
> #
> # Race 3 results:Richard Cambridge Printers Mdn $11,000 3yo Maiden 1400m Appr...
> #
> # Res1st > 11
> # Res2nd > 12
> # Res3rd > 1
> # Div1 ...
> #
> # ... etc
> #
>
>
>