Parsing Baseball Stats

Wed Jul 26 06:20:58 EDT 2006

----- Original Message -----
From: "Paul McGuire" <ptmcg at austin.rr._bogus_.com>
Newsgroups: comp.lang.python
To: <python-list at python.org>
Sent: Wednesday, July 26, 2006 1:01 AM
Subject: Re: Parsing Baseball Stats

> "Anthra Norell" <anthra.norell at tiscalinet.ch> wrote in message
> news:mailman.8551.1153861590.27775.python-list at python.org...
> >
          snip
> >
> Frederic -
>
> HTML parsing is one of those slippery slopes - or perhaps "tar babies" might
> be a better metaphor - that starts out as a simple problem, but then one
> exception after the next drags the solution out for daaaays.  Probably once
> or twice a week, there is a posting here from someone trying to extract data
> from a website, usually something like trying to pull the href's out of some

          snip

> So what started out as a little joke (microscopic, even) has eventually
> touched a nerve, so thanks and apologies to those who have read this whole
> mess.  Frederic, SE looks like a killer - may it become the next regexp!
>
> -- Paul
>

Paul,

A year ago or so someone posted a call for ideas on encoding passwords for his own private use. I suggested a solution using
python's random number generator and was immediately reminded by several knowledgeable people, quite sharply by some, that the
random number generator was not to be used for cryptographic applications, since the doc specifically said so. I was also given good
advice on what to read.
      I thought that my solution was good, if not by the catechism, then by the requirements of the OP's problem which I considered
to be the issue. I hoped the OP would come back with his opinion, but he didn't.
      Not then and there. He did some time later, off list, telling me privately that he had incorporated my solution with some
adaptations and that it was exactly what he had been looking for.

So let me pursue this on two lines: A) your response and B) the issue.

A) I thank you for the considerable time you must have taken to explain pyparsing in such detail. I didn't know you're the author.
Congratulations! It certainly looks very professional. I have no doubt that it is an excellent and powerful tool.
      Thanks also for your explanation of the TOS concept. It isn't alien to me and I have no problem with it. But I don't believe
it means that one should voluntarily argue against one's own freedom, barking at oneself with the voice of the legal watchdogs out
there that would restrict our freedom preemptively, getting a tug on the leash for excessive zeal but a pat on the head nonetheless.
We have little cause to assume that the OP is setting up a baseball information service and have much cause to assume that he is
not. So let us reserve the benefit of the doubt because this is what the others do. And work by plausible assumption--necessarily,
because the realm of certainty is too small an action base.
      SE is not a parser. It is a stream editor. I believe it fills a gap, handling a certain kind of problem very gracefully while
being particularly easy to use. Your spontaneous reaction of horror was the consequence of a misinterpretation. The Tag_Stripper's
argument ('"~<.*?>~= " "~<[^>]*~=" "~[^<]*>~=") is not the frightful incarnation of a novel, yet more arcane regular expression
syntax. It is simply a string consisting of three very simple expressions: '<.*?>', '<[^>]*' and '[^<]*>'. They could also be
written as or-ed alternatives: '<.*?>|<[^>]*|[^<]*>'. The tildes brace the regex to identify it as such. The equal sign says replace
what precedes with what follows. Nothing happens to follow, which means replace it with nothing, which means delete it (tags).
That's all. SE allows--encourages--breaking down a complex search into any number of simple components.
      (Having just said 'easy to use' I notice a mistake. I correct it below in section C.)
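
For readers more at home with Python's re module, the or-ed form described above can be tried directly. This is only a sketch of the idea, not SE itself: TAG_PATTERN and strip_tags are names made up for the illustration, and matches are replaced with the empty string, as the explanation above describes.

```python
import re

# The three simple expressions, written as or-ed alternatives.
TAG_PATTERN = re.compile(r'<.*?>|<[^>]*|[^<]*>')

def strip_tags(line):
    """Replace every match with nothing, i.e. delete it."""
    return TAG_PATTERN.sub('', line)

if __name__ == '__main__':
    print(strip_tags('<td align="right">25.3</td>'))
```

The second and third alternatives only come into play when a line holds an unclosed '<' or an unopened '>'.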

B) I would welcome the OP's opinion.

Regards

Frederic

C) Correction: The second and third expressions were meant to catch tags spanning lines. There weren't any such tags and so the
expressions were useless--and not inoffensive either: the second one, as a matter of fact, could also delete text. The Tag Stripper should
be defined like this:

Tag_Stripper = SE.SE ('"~<(.|\n)*?>~=" "~<!--(.|\n)*?-->~="')

It now deletes tags even if they span lines and it incorporates a second definition that deletes comments which, as you made me
aware, may contain tags. I now have to run the whole file through this before I look at the lines.
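
A rough equivalent using only the standard re module, for comparison. It is a sketch, not SE: STRIP_PATTERN and strip_markup are hypothetical names, re.DOTALL stands in for the (.|\n) construct, and the comment alternative is listed first because re tries alternatives from left to right. How SE itself ranks competing definitions is not assumed here.

```python
import re

# Comments first, so that a tag inside a comment disappears with the comment.
# re.DOTALL lets '.' match newlines, so tags spanning lines are caught too.
STRIP_PATTERN = re.compile(r'<!--.*?-->|<.*?>', re.DOTALL)

def strip_markup(html):
    """Delete comments and tags from a whole page held in one string."""
    return STRIP_PATTERN.sub('', html)

if __name__ == '__main__':
    sample = '<table\n border="1"><!-- <b>hidden</b> --><tr><td>2.28</td></tr></table>'
    print(strip_markup(sample))
```

As with the SE version, this only works when the whole page is passed in as one string rather than line by line.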

import urllib      # Python 2 standard library
import StringIO

def get_statistics (name_of_player):

   # Tag_Stripper is defined above; CSV_Maker is not defined in this message.
   statistics = {
     'Actual Pitching Statistics'   : [],
     'Advanced Pitching Statistics' : [],
   }

   url = 'http://www.baseballprospectus.com/dt/%s.shtml' % name_of_player
   htm_page = urllib.urlopen (url)
   # Strip the whole page at once so that tags spanning lines are caught.
   lines = StringIO.StringIO (Tag_Stripper (htm_page.read ()))
   htm_page.close ()
   current_list = None   # list of the section being read; None outside the wanted sections
   for line in lines:
      line = line.strip ()
      if line == '':
         continue
      if 'Statistics' in line:  # These are the section headings.
         if line in statistics:
            current_list = statistics [line]
            current_list.append (line)
         else:
            current_list = None
      else:
         if current_list is not None:
            current_list.append (CSV_Maker (line))

   return statistics

show_statistics (statistics) displays this tab-delimited CSV:

Advanced Pitching Statistics
AGE YEAR TEAM XIP RA DH DR DW NRA RAA PRAA PRAR DERA NRA RAA PRAA PRAR DERA STF
19 1914 BOS-A 25.3 4.70 -2 3 1 5.75 -4 -5 -2 6.15 6.19 -5 -5 -2 6.36 -25
20 1915 BOS-A 225.3 3.31 -12 3 2 4.01 12 4 45 4.33 4.25 6 1 42 4.44 12
21 1916 BOS-A 318.2 2.31 -32 -8 0 3.19 46 41 101 3.35 3.30 43 39 99 3.41 24
22 1917 BOS-A 336.5 2.56 -20 -7 1 3.49 38 23 83 3.88 3.72 29 20 80 3.96 13
23 1918 BOS-A 171.6 2.76 -16 5 0 3.80 13 6 34 4.20 4.16 6 3 31 4.36 3
24 1919 BOS-A 129.4 3.98 4 -16 2 4.63 -2 -2 19 4.61 4.79 -4 -3 17 4.70 -6
25 1920 NY_-A 6.4 9.00 -1 3 1 8.64 -3 -3 -3 8.96 8.95 -3 -3 -3 9.14 -35
26 1921 NY_-A 13.2 10.00 2 0 1 9.16 -7 -7 -5 9.36 9.61 -8 -8 -5 9.65 -41
35 1930 NY_-A 8.8 3.00 1 -2 0 2.84 2 2 4 2.57 3.07 1 2 3 2.66 13
38 1933 NY_-A 8.8 5.00 1 -1 0 5.01 -1 0 0 4.59 5.27 -1 0 0 4.73 -22
1243.5 2.95 -76 -22 8 3.78 96 59 275 4.07 3.95 65 45 262 4.17 10

Actual Pitching Statistics
AGE YEAR TEAM W L SV ERA G GS TBF IP H R ER HR BB SO HBP IBB WP BK CG SHO
19 1914 BOS-A 2 1 0 3.91 4 3 96 23.0 21 12 10 1 7 3 0 0 0 0 1 0
20 1915 BOS-A 18 8 0 2.44 32 28 874 217.7 166 80 59 3 85 112 6 0 9 1 16 1
21 1916 BOS-A 23 12 1 1.75 44 41 1272 323.7 230 83 63 0 118 170 8 0 3 1 23 9
22 1917 BOS-A 24 13 2 2.01 41 38 1277 326.3 244 93 73 2 108 128 11 0 5 0 35 6
23 1918 BOS-A 13 7 0 2.22 20 19 660 166.3 125 51 41 1 49 40 2 0 3 1 18 1
24 1919 BOS-A 9 5 1 2.97 17 15 570 133.3 148 59 44 2 58 30 2 0 5 1 12 0
25 1920 NY_-A 1 0 0 4.50 1 1 17 4.0 3 4 2 0 2 0 0 0 0 0 0 0
26 1921 NY_-A 2 0 0 9.00 2 1 49 9.0 14 10 9 1 9 2 0 0 0 0 0 0
35 1930 NY_-A 1 0 0 3.00 1 1 39 9.0 11 3 3 0 2 3 0 0 0 0 1 0
38 1933 NY_-A 1 0 0 5.00 1 1 42 9.0 12 5 5 0 3 0 0 0 0 0 1 0
94 46 4 2.28 163 148 4896 1221.3 974 400 309 10 441 488 29 0 25 4 107 17

(The last line remains to be shifted three columns to the right.)
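
The remaining shift could be done along these lines. A minimal sketch, assuming each row is available as a list of fields; shift_right is a hypothetical helper, and the header and totals are cut down to a few columns for the example.

```python
def shift_right(fields, columns=3):
    """Prepend empty fields so the totals line up under the stat columns."""
    return [''] * columns + list(fields)

if __name__ == '__main__':
    header = 'AGE YEAR TEAM W L SV ERA'.split()
    totals = '94 46 4 2.28'.split()
    row = shift_right(totals)
    assert len(row) == len(header)
    print('\t'.join(row))
```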