help!! *extra* tricky web page to extract data from...

John Nagle nagle at animats.com
Tue Mar 13 19:20:15 EDT 2007


seberino at spawar.navy.mil wrote:
> How do I extract the visible numerical data from this Microsoft financial
> web site?
> 
> http://tinyurl.com/yw2w4h
> 
> If you simply download the HTML file you'll see the data is *not*
> embedded in it but loaded from some other file.
> 
> Surely if I can see the data in my browser I can grab it somehow right
> in a Python script?
> 
> Any help greatly appreciated.

    Been there, done that, years ago.  Try this:

http://www.downside.com/cgi/testfinancialsextract.cgi?url=http://www.sec.gov/Archives/edgar/data/886158/0001104659-06-034196.txt

That will get you the data you're looking for.
If you want to try other companies, start at the query box on 
"http://www.downside.com".

The data is actually coming from the United States Securities and Exchange
Commission's EDGAR web site, where companies are required to file their
financial statements.  The filings are intended to be read by humans, but
it's possible to parse many filings mechanically.  They're supposed to be
in HTML 3.2, but this isn't enforced.
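
As a rough illustration of what "parsing mechanically" can look like, here
is a minimal sketch using only the standard library's html.parser. It is
not the parser behind downside.com. The names TableExtractor, extract_rows,
and SAMPLE are hypothetical, and the sample markup and its numbers are made
up. It assumes the figures sit in ordinary table rows and that closing tags
may be missing, which is common in filings.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def _close_cell(self):
        # Store the current cell with runs of whitespace collapsed.
        if self._cell is not None and self._row is not None:
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()   # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th"):
            self._close_cell()  # tolerate a missing </td>
            if self._row is None:
                self._row = []
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def extract_rows(html_text):
    """Return all table rows found in html_text as lists of strings."""
    parser = TableExtractor()
    parser.feed(html_text)
    parser.close()
    parser._close_row()  # flush a row left open at end of input
    return parser.rows


# Made-up fragment with the kind of unclosed tags real filings contain.
SAMPLE = """<table>
<tr><td>Revenue<td>1,000</tr>
<tr><td>Net income</td><td>250
</table>"""

for row in extract_rows(SAMPLE):
    print(row)
```

Fetching the filing text is left out here. The hard part in practice is
deciding which rows are the financial statements and what each label means,
and this sketch does none of that.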

There are many EDGAR parsers, some better than ours.  To do a really good one,
you have to license a patent from Price Waterhouse.  Try 
"http://www.10kwizard.com/", which has an API for retrieving this info.
It's not free.

				John Nagle


