[Pythonmac-SIG] PDF reading
DavidW
vip at avatar.com.au
Mon Jan 26 04:11:36 CET 2009
It's been a while since I did any PDF to text extraction.
Last time I did, I used some of the tools that are part of
http://www.hforge.org/itools
which, I seem to remember, also does elementary decryption.
David.
On 26/01/2009, at 7:02 AM, Bill Janssen wrote:
> Paul Brown <appworld at mac.com> wrote:
>
>> anyone have any pointers on reading a pdf file.
>>
>> i need to extract the text content, page number, text style, block,
>> ... all in XML if poss
>>
>> Paul
>
> Hi, Paul.
>
> I use a patched version of xpdf to get this stuff, which works pretty
> well. Extracts the text and wordbox info (page, word rectangle, font,
> bold/italic, etc.) for each word in the PDF. You can download the
> patch to xpdf from
> http://downloads.sourceforge.net/doceng-toolkit/doceng-package-sources.zip
> You'll have to unpack the zip file and look for it in there, then apply
> it to the xpdf sources and build xpdf. I've sent the patch to the xpdf
> maintainer, but haven't heard more about it from him. See the (patched)
> xpdf man page for details of the output format (ASCII text, one word
> record per line).
>
> This is also included in the UpLib release at http://uplib.parc.com/;
> you'll have to register an account on the blog in order to get the
> download link for that. If you download and install one of the binary
> builds of UpLib, the patched xpdf is included.
>
> Bill
>
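For the "all in XML" part of Paul's question, here is a minimal sketch of
turning one-word-per-line records like the ones Bill describes into XML.
The record layout below (page, four rectangle coordinates, font name, bold
flag, italic flag, then the word, separated by whitespace) is only an
assumption for illustration; the real layout is whatever the patched xpdf
man page says. The names SAMPLE and words_to_xml are hypothetical too.

```python
import xml.etree.ElementTree as ET

# ASSUMED record layout, NOT the real patched-xpdf format (see its man page):
#   page x0 y0 x1 y1 font bold italic word
SAMPLE = """\
1 72.0 700.0 98.5 712.0 Times-Roman 0 0 Hello
1 102.0 700.0 140.2 712.0 Times-Bold 1 0 world
2 72.0 700.0 110.0 712.0 Times-Italic 0 1 again
"""


def words_to_xml(text):
    """Group word records by page and return the root XML element."""
    root = ET.Element("document")
    pages = {}  # page number (as a string) -> <page> element
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split into at most 9 fields so the last field keeps the whole word.
        page, x0, y0, x1, y1, font, bold, italic, word = line.split(None, 8)
        if page not in pages:
            pages[page] = ET.SubElement(root, "page", number=page)
        elem = ET.SubElement(
            pages[page], "word",
            x0=x0, y0=y0, x1=x1, y1=y1,
            font=font, bold=bold, italic=italic,
        )
        elem.text = word
    return root


if __name__ == "__main__":
    print(ET.tostring(words_to_xml(SAMPLE), encoding="unicode"))
```

Once the real field order is known, only the unpacking line should need
to change; the grouping by page stays the same.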
________________________________________________
David Worrall.
- Sonic Communications Research Group: creative.canberra.edu.au/scrg
- Experimental Polymedia: www.avatar.com.au
- Education for Financial Independence: www.mindthemarkets.com.au