[CentralOH] Parsing PDF's
nick.albright at gmail.com
Tue Oct 5 20:16:41 CEST 2010
I've done some work with PDFs. I think I've used pyPdf to extract
information from them before, and I love the ReportLab Toolkit for
generating PDFs.
I hope that helps!
On Tue, Oct 5, 2010 at 2:51 AM, Michael S. Yanovich <yanovich.1 at osu.edu> wrote:
> I have a very large PDF (roughly 500 pages!) that contains a long table
> of information. I'd like to be able to parse the PDF and compute some
> statistics, such as how many times a given value in a row occurs
> throughout the document, and more.
> I've looked into PDFMiner, which is a great tool. However, I don't just
> want to output the PDF to plain text, HTML, or XML. The HTML and XML
> output is very ugly for this PDF, and the plain text seems manageable,
> but it would be very time-consuming to write the code I want.
> When PDFMiner converts my PDF to plain text, it lists the values for
> each column and then moves on to the next page. So I could, in theory,
> hoping everything matches up, go through and assume that the first
> value for column A will always match the first value for column B. But
> this could get tricky toward the end, since the last page is only half
> filled.
> I'm basically wondering if there exists something *like* BeautifulSoup
> for PDFs? I'm looking for something that can take a PDF and create a
> Pythonic object, so that I can go through each page, break the elements
> down further, and examine them. Preferably in a more user-friendly way
> than PDFMiner.
> Any ideas?
> Michael S. Yanovich
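On the column-matching worry in the quoted message, here is a minimal
sketch of one way it might be handled. It assumes the plain-text output
has already been split into one list of values per column for each page;
the `pages` structure, the `rows_from_pages` name, and the sample values
are all made up for illustration and do not come from PDFMiner or pyPdf.
Rather than silently assuming the columns line up, it checks each page and
refuses to pair them if they don't.

```python
from collections import Counter


def rows_from_pages(pages):
    """Yield one tuple per table row.

    `pages` is a hypothetical structure: a list of pages, where each page
    is a list of columns, and each column is a list of cell strings in
    top-to-bottom order.
    """
    for page_number, columns in enumerate(pages, start=1):
        lengths = {len(column) for column in columns}
        if len(lengths) > 1:
            # Columns of different lengths cannot be paired safely.
            raise ValueError(
                "page %d: column lengths differ: %s"
                % (page_number, sorted(lengths))
            )
        # zip(*columns) pairs the i-th value of every column into one row.
        for row in zip(*columns):
            yield row


# Made-up sample data: two columns, with a short last page.
pages = [
    [["apple", "pear", "apple"], ["red", "green", "green"]],
    [["apple"], ["red"]],
]

rows = list(rows_from_pages(pages))

# How often each complete row occurs in the whole document.
row_counts = Counter(rows)

# How often each value occurs in the first column.
column_a_counts = Counter(row[0] for row in rows)
```

A half-filled last page is harmless in this sketch as long as every column
on that page has the same number of values. What would break the pairing is
a cell that goes missing during extraction, and the length check is there
to surface that case instead of guessing.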