[Pythonmac-SIG] PDF reading

Mon Jan 26 04:11:36 CET 2009

It's been a while since I did any text extraction from PDF.
Last time I did, I used some tools that are part of itools:

http://www.hforge.org/itools

which, I seem to remember, also does elementary decryption.
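
For the "text content plus page number, in XML" part of the question,
here is a minimal sketch. It assumes the stock (unpatched) pdftotext
command from xpdf is on your PATH, and relies on pdftotext ending each
page with a form feed. It does not give text style or block info; for
that you would need something like the patched xpdf described in the
quoted message. The function names and the <document>/<page> element
names are just my own choices for illustration.

```python
import subprocess
import xml.etree.ElementTree as ET


def pdf_to_text(pdf_path):
    """Run pdftotext on pdf_path and return the extracted text.

    Assumes the xpdf (or poppler) pdftotext binary is installed.
    The trailing "-" asks pdftotext to write to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", "-layout", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def pages_to_xml(text):
    """Wrap form-feed-separated page text in a small XML document."""
    pages = text.split("\f")
    # pdftotext ends every page with a form feed, so the last
    # chunk after splitting is normally empty; drop it.
    if pages and not pages[-1].strip():
        pages.pop()
    root = ET.Element("document")
    for number, body in enumerate(pages, start=1):
        page = ET.SubElement(root, "page", number=str(number))
        page.text = body.strip("\n")
    return ET.tostring(root, encoding="unicode")
```

Usage would be something like print(pages_to_xml(pdf_to_text("some.pdf"))),
where "some.pdf" is a placeholder file name.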

David.

On 26/01/2009, at 7:02 AM, Bill Janssen wrote:

> Paul Brown <appworld at mac.com> wrote:
>
>> anyone have any pointers on reading a pdf file.
>>
>> i need to extract the text content, page number, text style, block,
>> ... all in XML if poss
>>
>> Paul
>
> Hi, Paul.
>
> I use a patched version of xpdf to get this stuff, which works pretty
> well.  Extracts the text and wordbox info (page, word rectangle, font,
> bold/italic, etc.) for each word in the PDF.  You can download the
> patch to xpdf from
> http://downloads.sourceforge.net/doceng-toolkit/doceng-package-sources.zip .
> You'll have to unpack the zip file and look for it in there, then apply
> it to the xpdf sources and build xpdf.  I've sent the patch to the xpdf
> maintainer, but haven't heard more about it from him.  See the (patched)
> xpdf man page for details of the output format (ASCII text, one word
> record per line).
>
> This is also included in the UpLib release at http://uplib.parc.com/;
> you'll have to register an account on the blog in order to get the
> download link for that.  If you download and install one of the binary
> builds of UpLib, the patched xpdf is included.
>
> Bill

________________________________________________
David Worrall.
- Sonic Communications Research Group:	creative.canberra.edu.au/scrg
- Experimental Polymedia:	www.avatar.com.au
- Education for Financial Independence: www.mindthemarkets.com.au