Analyse of PDF (or EPS?)
Johan Holst Nielsen
johan at weknowthewayout.com
Fri Nov 21 13:10:41 CET 2003
David Boddie wrote:
>>>Are there any Python packages to analyse or get some information out of
>>>a PDF document...
>>>Like where the text is placed - what text is placed - fonts, embedded
> It depends on the type of images (bitmap vs. vector).
Yes, I know - but the vector-based images should be extracted just as they
are, and the bitmaps as self-contained files :=)
>>IIRC you can get the full specs of pdf and eps at the adobe site.
> The full PDF specification is not exactly short, but it's fairly readable.
Yep... I tried it... but there is no reason to do exactly the same work if
other people have already done it. And time is an issue too ;)
>>Some stuff is easy to get at, some may be compressed and/or encrypted,
>>and not so easy.
> Although the FlateDecode compression format is straightforward with existing
> libraries, some of the other compression techniques may be less accessible.
Well, no problem with the compression/encryption. It is for an internal
application - so people will just HAVE to leave the documents unencrypted
and unsecured.
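
For what it's worth, the FlateDecode case mentioned above can be handled
with the standard zlib module. This is only a rough sketch, not a real PDF
parser: the function name inflate_streams is made up for illustration, and
it assumes unencrypted files where a naive scan for "stream ... endstream"
is good enough (a proper reader would follow the cross-reference table and
the /Length and /Filter entries instead).

```python
import re
import zlib

# Naive pattern for "stream <EOL> data endstream"; an assumption that holds
# only for simple files, since binary data could in theory contain the keyword.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Return the decompressed bodies of every stream that zlib can inflate."""
    results = []
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj() tolerates the end-of-line bytes that sit
            # between the compressed data and the "endstream" keyword.
            results.append(zlib.decompressobj().decompress(raw))
        except zlib.error:
            # Not Flate data (or some other filter); skip it in this sketch.
            continue
    return results
```

Calling inflate_streams(open("some.pdf", "rb").read()) would then give a
list of byte strings, typically page content streams and embedded fonts or
images, for anything stored with plain FlateDecode.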
>>Conforming docs are supposed to be structured so that it is relatively easy
>>to grab chunks of document and do the kinds of things printing business s/w does,
>>like rotating and scaling and reordering pages, etc.
> I have a Python library which is able to identify a lot of the structure in simple
> documents, including basic text extraction, but I've become pretty disillusioned
> with it because so much work is required to extract more complex information.
> Maybe it's time to stick a license on it and upload it somewhere.
Well, let me know ;) Maybe I could get a demo or something? That would
be nice :)
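
To show what "basic text extraction" might mean for simple documents, here
is a small sketch of my own (not David's library, which I have not seen).
It assumes you already have an uncompressed page content stream as bytes,
and it only looks at literal strings shown with the Tj operator; hex
strings, the TJ array form, font encodings and positioning are all
ignored. The name literal_strings is invented for this example.

```python
import re

# A literal string "( ... )" followed by the Tj (show text) operator.
# Backslash escapes such as \( and \) are allowed inside the string;
# unescaped nested parentheses are NOT handled in this sketch.
TJ_RE = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")


def literal_strings(content_stream):
    """Return the raw (still escaped) strings passed to Tj, in order."""
    return TJ_RE.findall(content_stream)
```

For a stream like b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET" this gives back
the single string b"Hello"; anything beyond such simple cases needs a real
tokenizer.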