[Tutor] PDF Scraping

Laura Creighton lac at openend.se
Wed Nov 25 13:29:26 EST 2015


In a message of Wed, 25 Nov 2015 12:43:51 -0500, Francois Dion writes:
>This is well beyond the scope of Tutor, but let me mention the following:
>
>The code to pdftables disappeared from GitHub some time back. What is on
>SourceForge is old, same with PyPI. I wouldn't create a project using
>pdftables based on that...
>
>As far as what you are trying to do, it looks like they might have the data
>in Excel spreadsheets. That is totally trivial to load in pandas. If you
>have any choice at all, avoid PDF at all costs to get data. See some detail
>of the complexity here:
>http://ieg.ifs.tuwien.ac.at/pub/yildiz_iicai_2005.pdf
>
>For your two documents, if you cannot find the data in the Excel sheets, I
>think the tabula (Ruby-based application) approach is the best bet.
>
>Francois

What he said.  Double.  However ...

You can also look into poppler.  It has a nice pdftohtml utility.
Once you have your data as HTML, and if you are lucky and the table
information didn't get destroyed in the process, you can then send
your data to pandas, which will happily read HTML tables.  Once you
have pandas reading it, you are pretty much home free and can do
whatever you like with the data.
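
To make that last step concrete, here is a minimal sketch of handing
HTML to pandas.  The HTML below is a made-up stand-in for a converted
file (the column names and numbers are invented for illustration), and
pandas.read_html needs an HTML parser such as lxml installed.  Real
pdftohtml output often will not contain clean <table> markup, which is
the "if you are lucky" part.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for what a converted PDF might contain.
SAMPLE_HTML = """
<html><body>
<table>
  <tr><th>Year</th><th>Amount</th></tr>
  <tr><td>2013</td><td>120</td></tr>
  <tr><td>2014</td><td>135</td></tr>
</table>
</body></html>
"""


def first_table(html_text):
    """Return the first <table> found in html_text as a DataFrame."""
    # read_html returns a list with one DataFrame per table found,
    # and raises ValueError if the document has no tables at all.
    tables = pd.read_html(StringIO(html_text))
    return tables[0]


df = first_table(SAMPLE_HTML)
print(df)
```

For a file on disk you can pass the file name to pd.read_html instead
of wrapping a string in StringIO.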

If you happen to be on Ubuntu, then getting poppler and pdftohtml
is easy.  http://www.ubuntugeek.com/howto-convert-pdf-files-to-html-files.html

It seems to be harder on Windows, but there are Stack Overflow questions
outlining how to do it ...
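
Whichever platform you are on, once pdftohtml is on your PATH you can
drive it from Python.  This is only a sketch: the file names are
placeholders, the choice of flags is my assumption about what is
useful here, and the exact name of the output file can vary between
poppler versions, so check what actually got written.

```python
import shutil
import subprocess


def pdftohtml_command(pdf_path, html_path):
    """Build the argument list for poppler's pdftohtml."""
    # -s: one HTML document for all pages, -i: ignore images,
    # -noframes: do not generate a frame structure.
    return ["pdftohtml", "-s", "-i", "-noframes", pdf_path, html_path]


def convert(pdf_path, html_path):
    """Run pdftohtml if it is installed; return True on success."""
    if shutil.which("pdftohtml") is None:
        return False  # poppler's utilities are not on PATH
    result = subprocess.run(pdftohtml_command(pdf_path, html_path))
    return result.returncode == 0


# "report.pdf" and "report.html" are placeholder names.
print(pdftohtml_command("report.pdf", "report.html"))
```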

Laura

