[Pythonmac-SIG] PDF reading

Sun Jan 25 21:02:10 CET 2009

Paul Brown <appworld at mac.com> wrote:

> Anyone have any pointers on reading a PDF file?
> 
> I need to extract the text content, page number, text style, block,
> ... all in XML if possible.
> 
> Paul

Hi, Paul.

I use a patched version of xpdf to get this stuff, and it works pretty
well.  It extracts the text and wordbox info (page, word rectangle,
font, bold/italic, etc.) for each word in the PDF.  You can download the
patch to xpdf from
http://downloads.sourceforge.net/doceng-toolkit/doceng-package-sources.zip.
You'll have to unpack the zip file and look for the patch in there, then
apply it to the xpdf sources and build xpdf.  I've sent the patch to the
xpdf maintainer, but haven't heard back from him.  See the (patched)
xpdf man page for details of the output format (ASCII text, one word
record per line).
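
Since you want XML, here is a rough Python sketch of how the word
records might be converted.  Note that the column layout used here
(page, bounding box, font, bold and italic flags, then the word) is
purely an assumption for illustration, and the function name
wordboxes_to_xml is made up; check the patched xpdf man page for the
real field order and adjust FIELDS to match.

```python
import xml.etree.ElementTree as ET

# HYPOTHETICAL column layout, for illustration only.  The real layout is
# documented in the patched xpdf man page; edit this tuple to match it.
FIELDS = ("page", "left", "top", "right", "bottom", "font", "bold", "italic")


def wordboxes_to_xml(lines):
    """Build a <document> tree with one <page> per page and one <word> per record."""
    root = ET.Element("document")
    pages = {}
    for line in lines:
        parts = line.split()
        # Expect the assumed fields followed by the word itself;
        # skip blank lines and anything that does not fit.
        if len(parts) != len(FIELDS) + 1:
            continue
        attrs = dict(zip(FIELDS, parts[:-1]))
        page_no = attrs.pop("page")
        page = pages.get(page_no)
        if page is None:
            page = ET.SubElement(root, "page", number=page_no)
            pages[page_no] = page
        word = ET.SubElement(page, "word", attrs)
        word.text = parts[-1]
    return root
```

Feed it the lines of the wordbox output file, then serialize the result
with ET.tostring(root, encoding="unicode") or write it out with
ET.ElementTree(root).write(path).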

The patched xpdf is also included in the UpLib release at
http://uplib.parc.com/; you'll have to register an account on the blog
to get the download link for that.  If you download and install one of
the binary builds of UpLib, the patched xpdf comes with it.

Bill