Analyse of PDF (or EPS?)
davidb at mcs.st-and.ac.uk
Fri Nov 21 03:04:16 CET 2003
bokr at oz.net (Bengt Richter) wrote in message news:<bpj320$qui$0 at 220.127.116.11>...
> On Thu, 20 Nov 2003 14:48:52 +0100, Johan Holst Nielsen <johan at weknowthewayout.com> wrote:
> >Are there any Python packages to analyse or get some information out of
> >a PDF document...
> >Like where the text is placed, what text is placed, fonts, embedded
> >PDFs/fonts/images etc.
It depends on the type of images (bitmap vs. vector).
> IIRC you can get the full specs of pdf and eps at the adobe site.
The full PDF specification is not exactly short, but it's fairly readable.
> Some stuff is easy to get at, some may be compressed and/or encrypted,
> and not so easy.
Although FlateDecode streams are straightforward to decompress with existing
libraries (it is the same deflate format that zlib handles), some of the other
compression filters may be less accessible.
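
As a rough illustration of the FlateDecode case, here is a minimal sketch that
uses only Python's standard zlib module. The function name inflate_streams and
the regular expression are made up for this example. It naively scans for the
"stream"/"endstream" keywords rather than parsing each stream dictionary and
honouring its /Length entry, so treat it as a quick way to peek inside simple,
unencrypted files and not as a real parser.

```python
import re
import zlib

# Naive assumption: a stream body is whatever sits between the "stream"
# keyword (followed by an end-of-line) and the next "endstream" keyword.
# A real parser would read the /Length entry of the stream dictionary.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Return the inflated bodies of all streams that zlib can decode."""
    results = []
    for match in STREAM_RE.finditer(pdf_bytes):
        # decompressobj() tolerates the end-of-line bytes that trail the
        # compressed data before "endstream".
        inflater = zlib.decompressobj()
        try:
            results.append(inflater.decompress(match.group(1)))
        except zlib.error:
            # Uncompressed, encrypted, or encoded with another filter.
            continue
    return results
```

Something like inflate_streams(open("some.pdf", "rb").read()) would then give
the raw page content operators for the streams that happen to use FlateDecode
alone. Streams with predictors, filter chains or encryption need more work.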
> Conforming docs are supposed to be structured so that it is relatively easy
> to grab chunks of document and do the kinds of things printing business s/w does,
> like rotating and scaling and reordering pages, etc.
I have a Python library which is able to identify a lot of the structure in simple
documents, including basic text extraction, but I've become pretty disillusioned
with it because so much work is required to extract more complex information.
Maybe it's time to stick a license on it and upload it somewhere.