[Tutor] PDF Scraping

Tue Nov 24 13:36:54 EST 2015

Hi,

I am looking for the best way to scrape the following PDFs:

(1) http://minerals.usgs.gov/minerals/pubs/commodity/gold/mcs-2015-gold.pdf
(table on page 1)

(2) http://minerals.usgs.gov/minerals/pubs/commodity/gold/myb1-2013-gold.pdf
(table 1)

I have done a lot of research and have read that pdftables 0.0.4 is an
excellent way to scrape tabular data from PDFs (see
https://blog.scraperwiki.com/2013/07/pdftables-a-python-library-for-getting-tables-out-of-pdf-files/).

I downloaded pdftables 0.0.4 (see https://pypi.python.org/pypi/pdftables).

I am new to Python and am having trouble finding good documentation on how
to use this library.

Has anybody here used pdftables and could help me get started, or point me
to the best library for scraping the PDFs linked above? I have read that
different PDF libraries are used depending on the format of the PDF. Which
library would be best for the PDFs above? Knowing this will help me get
started; then I can write some code and ask further questions if needed.

Thanks in advance for your help!

~Chris