[CentralOH] Parsing PDFs

Nick Albright nick.albright at gmail.com
Tue Oct 5 20:16:41 CEST 2010


I've done some work with PDFs. I believe I've used pyPdf to parse
information out of them before, and I love the ReportLab Toolkit for
generating PDFs.
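
If the plain-text output really does come out as one list of values per
column on each page, as described in the quoted message below, then a rough
sketch of stitching those lists back into rows and counting repeats might
look like this. The names rows_on_page and count_rows and the sample data
are made up for illustration. It assumes the extracted text has already
been split into per-column lists, and it does not call PDFMiner or pyPdf
at all.

```python
from collections import Counter


def rows_on_page(columns):
    """Pair up the per-column value lists of one page into row tuples.

    `columns` is a list of lists, one inner list per table column.
    Raises ValueError if the columns differ in length, because then the
    "first value of A matches first value of B" assumption is unsafe.
    """
    lengths = {len(col) for col in columns}
    if len(lengths) > 1:
        raise ValueError("columns differ in length: %s" % sorted(lengths))
    return list(zip(*columns))


def count_rows(pages):
    """Count how often each complete row occurs across all pages."""
    counts = Counter()
    for columns in pages:
        counts.update(rows_on_page(columns))
    return counts


if __name__ == "__main__":
    # Hypothetical data: two pages, two columns, last page only part full.
    pages = [
        [["a", "b", "a"], ["1", "2", "1"]],
        [["a"], ["1"]],
    ]
    for row, n in count_rows(pages).most_common():
        print(row, n)
```

A half-filled last page is fine here as long as every column on that page
is equally short. The length check is there for the nastier case where one
column has a blank cell and the lists silently drift out of step.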

I hope that helps!

On Tue, Oct 5, 2010 at 2:51 AM, Michael S. Yanovich <yanovich.1 at osu.edu> wrote:

> I have a very large PDF (roughly 500 pages!) that contains a long table of
> information. I'd like to be able to parse the PDF and compute some
> statistics, such as how many times something in a row occurs throughout
> the document, and more.
> I've looked into PDFMiner, which is a great tool. However, I don't just
> want to output the PDF as plain text, HTML, or XML. The HTML and XML
> output is very ugly for this PDF, and the plain text seems manageable,
> but it would be very time-consuming to write the code I want.
> The way PDFMiner organizes my PDF into plain text is that it makes lists
> of the values for each column and then moves on to the next page. So I
> could, in theory, go through and assume that the first value for column A
> will always match the first value for column B, hoping everything lines
> up. But this could get tricky towards the end, since the last page is
> only half filled.
> I'm basically wondering if there exists something *like* BeautifulSoup
> for PDFs. I'm looking for something that can take a PDF and create a
> Pythonic object, so that I can go through each page, break the elements
> down further, and examine them, preferably in a more user-friendly way
> than PDFMiner.
> Any ideas?
> Michael S. Yanovich
