Parsing Baseball Stats
Paul McGuire
ptmcg at austin.rr._bogus_.com
Wed Jul 26 15:00:53 EDT 2006
"Ankit" <ankitdesai at gmail.com> wrote in message
news:1153917672.557715.90890 at h48g2000cwc.googlegroups.com...
> Frederic,
>
> Thanks for posting the solution. I used the original solution you
> posted and it worked beautifully.
>
> Paul,
>
> I understand your concern for the site's TOS. Although, this may not
> mean anything, the reason I wanted this "parser" was because I wanted
> to get the Advanced, and Translated Stats for personal use. I don't
> have any commercial motives but play with baseball stats is my hobby.
> The site does allow one to download stuff for personal use, which I
> abide by. Also, I am only looking to get the aforementioned stats for
> some players. The site has player pages for over 16,000 players. I
> think it would be unfair to the site owners if I went to download all
> 16,000 players using the script. In the end, they might just move the
> stats in to their premium package (not free) and then I would be really
> screwed.
>
> So, I understand your concerns and thank you for posting them.
>
> Ankit
>
Frederic and Ankit -
I guess you may have caught me in a more-than-curmudgeon-ly mood. Thanks
for giving me the benefit of the doubt.
I guess I should put more faith in our "consenting adults" environment - if
someone wants to use posted code to create a bot or virus or TOS-violating
web page scraper, that is their business, not mine. I've noticed that the
esteemed C. Titus Brown in his twill intro gives an example violating
Google's TOS, but at least he gives a suitable admonition in the code to the
effect of "this is just an example, but don't do it."
So in that spirit, for EDUCATION AND PERSONAL USE PURPOSES ONLY, here is a
pyparsing rendition that processes the HTML of the previously cited web
site. Ankit, you already know the suitable URLs to use for this, so I
don't need to post them again (in a weak attempt to shield that web site
from casual slamming).
At first glance, this is *way* more complicated than Frederic's SE-based
solution. The catch is that the pattern we are keying off of has a lot of
HTML junk in it. Frederic just dumps it on the floor, and really this
program doesn't do much more with it. Note that we suppress almost all of
the parsed HTML tags, which is just pyparsing's way of saying "don't need
this...", but the tag expression still needs to be included in the pattern
we are scanning for.
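
As a minimal sketch of what suppressing a tag looks like (the <b> tag, the sample string, and the "text" results name below are made up for illustration and are not taken from the stats page), the tags stay in the pattern but only the enclosed text comes back in the results:

```python
from pyparsing import SkipTo, makeHTMLTags

# makeHTMLTags returns (opening tag, closing tag) expressions that tolerate
# attributes, whitespace and upper/lower case.
bStart, bEnd = makeHTMLTags("b")

# suppress() keeps each tag in the pattern but drops it from the results.
bold = bStart.suppress() + SkipTo(bEnd).setResultsName("text") + bEnd.suppress()

# Hypothetical sample text, not the real page.
sample = 'ERA: <B class="hi">2.28</B> and W: <b>94</b>'
found = [match.text for match in bold.searchString(sample)]
print(found)
```

Without the two suppress() calls, the parsed tag tokens would be mixed in with the text we actually want.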
There are a couple of beyond-beginner pyparsing techniques in this example:
- Using a parse action to reject text that matches syntax, but not
semantics. In this case, we reject <h3> tags that don't have the right
section name. From a parsing standpoint, all <h3>'s match the h3Start
expression, so we attach a parse action to perform the additional filtering.
- Using Dict is always kind of magic. At parse time, the Dict class
instructs the parser to build a dict-style result, use the first token in
each matched group as a key, and the remainder as the value. This gives us
a keyed lookup by age to the yearly stats values.
- We have to stop reading stats at the line break, so we first check that we
are not at the end of the line before accepting the next number. That is why
the expression reads "OneOrMore(~lineEnd + number)" to parse in the actual
statistics values.
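
The three techniques above can be sketched together on a toy input. This is only an illustration under assumptions: the section titles "Batting" and "Pitching", the sample text, and the reduced row layout (no team name or link tags) are invented here, and the parse action uses the three-argument (s, loc, tokens) form rather than the one-argument form in the full program.

```python
from pyparsing import (Combine, Dict, Group, OneOrMore, Optional,
                       ParseException, SkipTo, Word, lineEnd,
                       makeHTMLTags, nums)

h3Start, h3End = makeHTMLTags("h3")
integer = Combine(Optional("-") + Word(nums))
real = Combine(Optional("-") + Optional(Word(nums)) + "." + Word(nums))
number = real | integer

def onlyAcceptSectionNamed(name):
    # Every <h3> matches syntactically; the parse action rejects the ones
    # whose title is not the one we want.
    def parseAction(s, loc, tokens):
        if tokens.section.strip() != name:
            raise ParseException(s, loc, "wrong section name")
    return parseAction

heading = (h3Start.suppress() + SkipTo(h3End).setResultsName("section")
           + h3End.suppress())
pitchingHeading = heading.copy().setParseAction(onlyAcceptSectionNamed("Pitching"))

# One row: age (used as the Dict key), year, then numbers up to the line end.
row = Group(integer + integer + OneOrMore(~lineEnd + number))
table = pitchingHeading + Dict(OneOrMore(row)).setResultsName("statsByAge")

# Hypothetical sample text, not the real page.
sample = """<h3>Batting</h3>
19 1914 .200 0
<h3>Pitching</h3>
19 1914 2 1 3.91
20 1915 18 8 2.44
"""
result = table.searchString(sample)[0]
print(result.section)
print(list(result.statsByAge["20"]))
```

The "Batting" heading is skipped by the parse action, each row stops at its own line end, and the age "20" works as a key into the rows.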
Once the parsing is done, I go through a little extra work showing different
ways to get at the parsed results. pyparsing does much more than just
return nested lists of strings. In this case, we are associating field
names with some content, and also dynamically generating dict-style access
to statistics by age. Finally, there is also the output to CSV format,
which was the original intent.
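
A small sketch of those access styles, assuming a made-up five-column table instead of the real page (the column names, the sample text, and the simplified "value" expression are illustrative only):

```python
from pyparsing import Dict, Group, OneOrMore, Word, alphas, lineEnd, nums

name = Word(alphas.upper())
value = Word(nums + ".")   # simplified stand-in for the number expression
header = OneOrMore(~lineEnd + name).setResultsName("statsNames")
row = Group(value + OneOrMore(~lineEnd + value))
table = header + Dict(OneOrMore(row)).setResultsName("statsByAge")

# Hypothetical sample text, not the real page.
sample = """AGE YEAR W L ERA
19 1914 2 1 3.91
20 1915 18 8 2.44
"""
res = table.parseString(sample)

# Attribute-style access to a named field.
names = list(res.statsNames)
# Dict-style access keyed by age; the value leaves out the age itself,
# so skip the first column name when pairing names with values.
age20 = list(zip(res.statsNames[1:], res.statsByAge["20"]))
# Iterating the same results row by row gives CSV-style lines.
csv_lines = [", ".join(r) for r in res.statsByAge]

print(names)
print(age20)
print("\n".join(csv_lines))
```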
I think that as HTML-scraping apps go, this is fairly typical for a
pyparsing approach. The feedback I get is that people take an hour or two
getting their programs just the way they want them, but then the resulting
code is pretty robust over time, as minor page changes or enhancements
require simple, if any, updates to the scraper. For instance, if new stat
columns were added to this page, there would be *no* change to the parser.
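
To illustrate the point about added columns, here is a hedged sketch in which the same row expression reads rows of two different widths. Both sample strings are invented, and the extra columns in the second one are purely hypothetical:

```python
from pyparsing import Combine, Group, OneOrMore, Optional, Word, lineEnd, nums

integer = Combine(Optional("-") + Word(nums))
real = Combine(Optional("-") + Optional(Word(nums)) + "." + Word(nums))
number = real | integer

# A row is an age followed by however many numbers fit before the line end.
row = Group(integer + OneOrMore(~lineEnd + number))
rows = OneOrMore(row)

narrow = "19 2 1 3.91\n20 18 8 2.44\n"
wide = "19 2 1 3.91 23.0 .7\n20 18 8 2.44 217.7 1.1\n"  # two hypothetical extra columns

widths = [[len(r) for r in rows.parseString(text)] for text in (narrow, wide)]
print(widths)
```

Because the row never names or counts its columns, the wider input parses with no grammar change; only code that pairs values with column names would see the new fields.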
Anyway, here is the pyparsing datapoint for your comparison.
-- Paul
(... and what was Babe Ruth doing between the ages of 26 and 35? Did he
retire for 9 years and then come back?)
from pyparsing import *
import urllib
playerURL = "http://rest_of_URL_goes_here"
# define start/end HTML tags for key items
# makeHTMLTags takes care of unexpected attributes, whitespace, case, etc.
h3Start,h3End = makeHTMLTags("h3")
aStart,aEnd = makeHTMLTags("a")
preStart,preEnd = makeHTMLTags("pre")
aStart = aStart.suppress()
aEnd = aEnd.suppress()
preStart = preStart.suppress()
preEnd = preEnd.suppress()
# spell out some of the specific HTML patterns we are looking for
sectionStart = (h3Start + aStart + SkipTo(aEnd).setResultsName("section") +
                aEnd + h3End ) | \
               (h3Start + SkipTo(h3End).setResultsName("section") + h3End )
sectionHeading = OneOrMore(aStart + SkipTo(aEnd) +
                           aEnd).setResultsName("statsNames")
sectionHeading2 = OneOrMore(~lineEnd +
                            Word(alphanums.upper()+"/")).setResultsName("statsNames")
integer = Combine(Optional("-") + Word(nums))
real = Combine(Optional("-") + Optional(Word(nums)) + "." + Word(nums))
number = real | integer
teamName = Word(alphas.upper() + "_-")
# create parse action that will filter for sections of a particular name
wrongSectionName = ParseException("",0,"")
def onlyAcceptSectionNamed(sec):
    def parseAction(tokens):
        if tokens.section != sec:
            raise wrongSectionName
    return parseAction
import pprint
def getStatistics(url):
    htm_page = urllib.urlopen(url)
    htm_lines = htm_page.read()
    htm_page.close()
    actualPitchingStats = \
        sectionStart.copy().setParseAction(
            onlyAcceptSectionNamed("Actual Pitching Statistics ")) + \
        preStart + \
        sectionHeading + \
        Dict( OneOrMore( Group(integer + aStart.suppress() + integer +
                               teamName + aEnd.suppress() +
                               OneOrMore(~lineEnd + number).setResultsName("stats") )
            )).setResultsName("statsByAge") + \
        Group( OneOrMore(number) ).setResultsName("careerStats") + preEnd
    aps = actualPitchingStats.searchString(htm_lines)[0]
    translatedPitchingStats = \
        sectionStart.copy().setParseAction(
            onlyAcceptSectionNamed("Translated Pitching Statistics")) + \
        preStart + lineEnd + \
        sectionHeading2 + \
        Dict( OneOrMore( Group(integer + aStart.suppress() + integer +
                               teamName + aEnd.suppress() +
                               OneOrMore(~lineEnd + number).setResultsName("stats") )
            )).setResultsName("statsByAge") + \
        Suppress("Career") + \
        Group( OneOrMore(number) ).setResultsName("careerStats") + preEnd
    tps = translatedPitchingStats.searchString(htm_lines)[0]
    # examples of accessing data fields in returned parse results
    for res in (aps,tps):
        print res.section
        print '-'*len(res.section.rstrip())
        for k in res.keys():
            print "- %s: %s" % (k,res[k])
        # career stats don't have age, year, or team name, so skip over
        # those stats names
        pprint.pprint( zip(res.statsNames[3:],res.careerStats) )
        print
        # print stats for year at age 24
        # by-age stats don't include age, so skip over first stats name
        pprint.pprint( zip(res.statsNames[1:],res.statsByAge["24"]) )
        print
        # output CSV-style data, for each year and then for career
        for yearlyStats in res.statsByAge:
            print ", ".join(yearlyStats)
        print " , , ,",", ".join(res.careerStats)
        print
getStatistics(playerURL)
Gives this output:
Actual Pitching Statistics
--------------------------
- endH3: </h3>
- statsByAge: [['19', '1914', 'BOS-A', '2', '1', '0', '3.91', '4', '3',
'96', '23.0', '21', '12', '10', '1', '7', '3', '0', '0', '0', '0', '1',
'0'], ['20', '1915', 'BOS-A', '18', '8', '0', '2.44', '32', '28', '874',
'217.7', '166', '80', '59', '3', '85', '112', '6', '0', '9', '1', '16',
'1'], ['21', '1916', 'BOS-A', '23', '12', '1', '1.75', '44', '41', '1272',
'323.7', '230', '83', '63', '0', '118', '170', '8', '0', '3', '1', '23',
'9'], ['22', '1917', 'BOS-A', '24', '13', '2', '2.01', '41', '38', '1277',
'326.3', '244', '93', '73', '2', '108', '128', '11', '0', '5', '0', '35',
'6'], ['23', '1918', 'BOS-A', '13', '7', '0', '2.22', '20', '19', '660',
'166.3', '125', '51', '41', '1', '49', '40', '2', '0', '3', '1', '18', '1'],
['24', '1919', 'BOS-A', '9', '5', '1', '2.97', '17', '15', '570', '133.3',
'148', '59', '44', '2', '58', '30', '2', '0', '5', '1', '12', '0'], ['25',
'1920', 'NY_-A', '1', '0', '0', '4.50', '1', '1', '17', '4.0', '3', '4',
'2', '0', '2', '0', '0', '0', '0', '0', '0', '0'], ['26', '1921', 'NY_-A',
'2', '0', '0', '9.00', '2', '1', '49', '9.0', '14', '10', '9', '1', '9',
'2', '0', '0', '0', '0', '0', '0'], ['35', '1930', 'NY_-A', '1', '0', '0',
'3.00', '1', '1', '39', '9.0', '11', '3', '3', '0', '2', '3', '0', '0', '0',
'0', '1', '0'], ['38', '1933', 'NY_-A', '1', '0', '0', '5.00', '1', '1',
'42', '9.0', '12', '5', '5', '0', '3', '0', '0', '0', '0', '0', '1', '0']]
- startH3: ['h3', ['class', 'cardsect'], False]
- section: Actual Pitching Statistics
- statsNames: ['AGE', 'YEAR', 'TEAM', 'W', 'L', 'SV', 'ERA', 'G', 'GS',
'TBF', 'IP', 'H', 'R', 'ER', 'HR', 'BB', 'SO', 'HBP', 'IBB', 'WP', 'BK',
'CG', 'SHO']
- careerStats: ['94', '46', '4', '2.28', '163', '148', '4896', '1221.3',
'974', '400', '309', '10', '441', '488', '29', '0', '25', '4', '107', '17']
- class: cardsect
- empty: False
[('W', '94'),
('L', '46'),
('SV', '4'),
('ERA', '2.28'),
('G', '163'),
('GS', '148'),
('TBF', '4896'),
('IP', '1221.3'),
('H', '974'),
('R', '400'),
('ER', '309'),
('HR', '10'),
('BB', '441'),
('SO', '488'),
('HBP', '29'),
('IBB', '0'),
('WP', '25'),
('BK', '4'),
('CG', '107'),
('SHO', '17')]
[('YEAR', '1919'),
('TEAM', 'BOS-A'),
('W', '9'),
('L', '5'),
('SV', '1'),
('ERA', '2.97'),
('G', '17'),
('GS', '15'),
('TBF', '570'),
('IP', '133.3'),
('H', '148'),
('R', '59'),
('ER', '44'),
('HR', '2'),
('BB', '58'),
('SO', '30'),
('HBP', '2'),
('IBB', '0'),
('WP', '5'),
('BK', '1'),
('CG', '12'),
('SHO', '0')]
19, 1914, BOS-A, 2, 1, 0, 3.91, 4, 3, 96, 23.0, 21, 12, 10, 1, 7, 3, 0, 0, 0, 0, 1, 0
20, 1915, BOS-A, 18, 8, 0, 2.44, 32, 28, 874, 217.7, 166, 80, 59, 3, 85, 112, 6, 0, 9, 1, 16, 1
21, 1916, BOS-A, 23, 12, 1, 1.75, 44, 41, 1272, 323.7, 230, 83, 63, 0, 118, 170, 8, 0, 3, 1, 23, 9
22, 1917, BOS-A, 24, 13, 2, 2.01, 41, 38, 1277, 326.3, 244, 93, 73, 2, 108, 128, 11, 0, 5, 0, 35, 6
23, 1918, BOS-A, 13, 7, 0, 2.22, 20, 19, 660, 166.3, 125, 51, 41, 1, 49, 40, 2, 0, 3, 1, 18, 1
24, 1919, BOS-A, 9, 5, 1, 2.97, 17, 15, 570, 133.3, 148, 59, 44, 2, 58, 30, 2, 0, 5, 1, 12, 0
25, 1920, NY_-A, 1, 0, 0, 4.50, 1, 1, 17, 4.0, 3, 4, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0
26, 1921, NY_-A, 2, 0, 0, 9.00, 2, 1, 49, 9.0, 14, 10, 9, 1, 9, 2, 0, 0, 0, 0, 0, 0
35, 1930, NY_-A, 1, 0, 0, 3.00, 1, 1, 39, 9.0, 11, 3, 3, 0, 2, 3, 0, 0, 0, 0, 1, 0
38, 1933, NY_-A, 1, 0, 0, 5.00, 1, 1, 42, 9.0, 12, 5, 5, 0, 3, 0, 0, 0, 0, 0, 1, 0
, , , 94, 46, 4, 2.28, 163, 148, 4896, 1221.3, 974, 400, 309, 10, 441, 488, 29, 0, 25, 4, 107, 17
Translated Pitching Statistics
------------------------------
- endH3: </h3>
- statsByAge: [['19', '1914', 'BOS-A', '20.0', '19', '15', '5', '6', '0',
'4', '6.75', '1', '1', '0', '8.6', '2.2', '2.7', '1.8'], ['20', '1915',
'BOS-A', '191.3', '163', '87', '24', '74', '6', '134', '4.09', '13', '9',
'0', '7.7', '1.1', '3.5', '6.3'], ['21', '1916', 'BOS-A', '274.0', '212',
'82', '21', '101', '9', '212', '2.69', '22', '8', '1', '7.0', '.7', '3.3',
'7.0'], ['22', '1917', 'BOS-A', '277.3', '239', '107', '29', '98', '13',
'178', '3.47', '20', '11', '2', '7.8', '.9', '3.2', '5.8'], ['23', '1918',
'BOS-A', '149.0', '128', '69', '19', '51', '3', '65', '4.17', '9', '8', '0',
'7.7', '1.1', '3.1', '3.9'], ['24', '1919', 'BOS-A', '123.3', '147', '65',
'14', '59', '3', '47', '4.74', '7', '6', '1', '10.7', '1.0', '4.3', '3.4'],
['25', '1920', 'NY_-A', '3.3', '3', '4', '0', '2', '0', '0', '10.80', '0',
'1', '0', '8.1', '.0', '5.4', '.0'], ['26', '1921', 'NY_-A', '7.7', '10',
'9', '2', '9', '0', '3', '10.57', '0', '1', '0', '11.7', '2.3', '10.6',
'3.5'], ['35', '1930', 'NY_-A', '8.7', '11', '3', '0', '2', '0', '4',
'3.12', '1', '0', '0', '11.4', '.0', '2.1', '4.2'], ['38', '1933', 'NY_-A',
'8.7', '15', '6', '0', '3', '0', '1', '6.23', '0', '1', '0', '15.6', '.0',
'3.1', '1.0']]
- startH3: ['h3', ['class', 'cardsect'], False]
- section: Translated Pitching Statistics
- statsNames: ['AGE', 'YEAR', 'TEAM', 'IP', 'H', 'ER', 'HR', 'BB', 'HBP',
'SO', 'ERA', 'W', 'L', 'SV', 'H/9', 'HR/9', 'BB/9', 'SO/9']
- careerStats: ['1063.3', '947', '447', '114', '405', '34', '648', '3.78',
'73', '46', '6', '8.0', '1.0', '3.4', '5.5']
- class: cardsect
- empty: False
[('IP', '1063.3'),
('H', '947'),
('ER', '447'),
('HR', '114'),
('BB', '405'),
('HBP', '34'),
('SO', '648'),
('ERA', '3.78'),
('W', '73'),
('L', '46'),
('SV', '6'),
('H/9', '8.0'),
('HR/9', '1.0'),
('BB/9', '3.4'),
('SO/9', '5.5')]
[('YEAR', '1919'),
('TEAM', 'BOS-A'),
('IP', '123.3'),
('H', '147'),
('ER', '65'),
('HR', '14'),
('BB', '59'),
('HBP', '3'),
('SO', '47'),
('ERA', '4.74'),
('W', '7'),
('L', '6'),
('SV', '1'),
('H/9', '10.7'),
('HR/9', '1.0'),
('BB/9', '4.3'),
('SO/9', '3.4')]
19, 1914, BOS-A, 20.0, 19, 15, 5, 6, 0, 4, 6.75, 1, 1, 0, 8.6, 2.2, 2.7, 1.8
20, 1915, BOS-A, 191.3, 163, 87, 24, 74, 6, 134, 4.09, 13, 9, 0, 7.7, 1.1, 3.5, 6.3
21, 1916, BOS-A, 274.0, 212, 82, 21, 101, 9, 212, 2.69, 22, 8, 1, 7.0, .7, 3.3, 7.0
22, 1917, BOS-A, 277.3, 239, 107, 29, 98, 13, 178, 3.47, 20, 11, 2, 7.8, .9, 3.2, 5.8
23, 1918, BOS-A, 149.0, 128, 69, 19, 51, 3, 65, 4.17, 9, 8, 0, 7.7, 1.1, 3.1, 3.9
24, 1919, BOS-A, 123.3, 147, 65, 14, 59, 3, 47, 4.74, 7, 6, 1, 10.7, 1.0, 4.3, 3.4
25, 1920, NY_-A, 3.3, 3, 4, 0, 2, 0, 0, 10.80, 0, 1, 0, 8.1, .0, 5.4, .0
26, 1921, NY_-A, 7.7, 10, 9, 2, 9, 0, 3, 10.57, 0, 1, 0, 11.7, 2.3, 10.6, 3.5
35, 1930, NY_-A, 8.7, 11, 3, 0, 2, 0, 4, 3.12, 1, 0, 0, 11.4, .0, 2.1, 4.2
38, 1933, NY_-A, 8.7, 15, 6, 0, 3, 0, 1, 6.23, 0, 1, 0, 15.6, .0, 3.1, 1.0
, , , 1063.3, 947, 447, 114, 405, 34, 648, 3.78, 73, 46, 6, 8.0, 1.0, 3.4, 5.5