Fw: PDF library for reading PDF files
Andreas Lobinger
andreas.lobinger at netsurf.de
Tue Jan 20 04:20:35 EST 2004
Aloha,
Peter Galfi wrote:
> Thanks. I am studying the PDF spec, it just does not seem to be that easy
> having to implement all the decompressions, etc. The "information" I am
> trying to extract from the PDF file is the text, specifically in a way to
> keep the original paragraphs of the text. I have seen so far one shareware
> standalone tool that extracts the text (and a lot of other formatting
> garbage) into an RTF document keeping the paragraphs as well. I would need
> only the text.
As others have written here, the simplest solution is to use an external
pdf-to-text program and postprocess its output. Read comp.text.pdf.
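A minimal sketch of that approach, assuming the pdftotext command-line tool
(from xpdf or poppler) is installed and on the PATH. The function names
pdf_to_text and split_paragraphs are made up for this illustration, and
treating blank lines as paragraph breaks is a heuristic that will not hold
for every document:

```python
import re
import subprocess


def pdf_to_text(pdf_path):
    """Run the external pdftotext tool and return its output as a string."""
    result = subprocess.run(
        # "-" as the output file makes pdftotext write to stdout.
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def split_paragraphs(text):
    """Split extracted text into paragraphs at blank lines (a heuristic)."""
    # pdftotext separates pages with a form feed; treat it as a break.
    text = text.replace("\f", "\n\n")
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):
        # Re-join the hard-wrapped lines of one block into a single line.
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if lines:
            paragraphs.append(" ".join(lines))
    return paragraphs
```

Typical use would be split_paragraphs(pdf_to_text("some.pdf")), where
"some.pdf" is a placeholder. How well the paragraphs survive depends on how
the text was set in the file.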
There is no simple and consistent way to extract text from a .pdf
because there are many ways to set text. The visual appearance
of a paragraph need not correspond to a matching command structure
in the .pdf.
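To illustrate why this needs heuristics: if some extractor hands you text
fragments with page coordinates, in whatever order they appear in the file,
paragraphs have to be guessed from the geometry. The sketch below is only
that, a guess. The (x, y, text) tuples, the function name and the default
line_height and para_gap values are assumptions for this example, not the
output of any particular tool:

```python
def group_into_paragraphs(fragments, line_height=12.0, para_gap=1.5):
    """Group (x, y, text) fragments into paragraphs by vertical position.

    y grows upward, as in PDF user space, so reading order runs from
    high y to low y. A vertical jump larger than para_gap * line_height
    is taken as a paragraph break.
    """
    # Sort top-to-bottom, then left-to-right, ignoring the file order.
    ordered = sorted(fragments, key=lambda f: (-f[1], f[0]))
    paragraphs, current, last_y = [], [], None
    for x, y, text in ordered:
        if last_y is not None and (last_y - y) > para_gap * line_height:
            paragraphs.append(" ".join(current))
            current = []
        current.append(text)
        last_y = y
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


# The fragments are deliberately out of reading order.
fragments = [
    (72, 660, "New paragraph."),
    (110, 700, "world,"),
    (72, 688, "same paragraph."),
    (72, 700, "Hello"),
]
paragraphs = group_into_paragraphs(fragments)
```

Multi-column layouts, rotated text or varying font sizes will break this
simple version.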
Adobe recognized the difficulties for document reuse and introduced
tagged PDF in PDF 1.4. With tagged PDF it is possible to embed
structural information in the .pdf. If you are interested in
using this, contact me.
Wishing a happy day
LOBI