highlight words by regex in pdf files using python
pmaupin at gmail.com
Wed Mar 17 05:12:16 CET 2010
On Mar 4, 6:57 pm, Peng Yu <pengyu... at gmail.com> wrote:
> I don't find a general pdf library in python that can do any
> operations on pdfs.
> I want to automatically highlight certain words (using regex) in a
> pdf. Could somebody let me know if there is a tool to do so in python?
The problem with PDFs is that they can be quite complicated. There is
the outer container structure, which isn't too bad (unless the
document author applied encryption or fancy multi-object compression),
but then inside the page content streams, text could be stored as
regular ASCII, or as fancy indexes into font-specific tables. It's not
rocket science, but the only industrial-strength solution for this is
probably reportlab's pagecatcher.
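
To make the "regular ASCII" case concrete, here is a minimal sketch that
runs a regex over an already-decompressed content stream. It is an
illustration only, not what pagecatcher or any other library does: the
names SHOW_TEXT, find_words and sample are made up for this example, and
it assumes the text sits in a plain literal string shown with the Tj
operator. Hex strings, TJ arrays, nested parentheses and font-specific
encodings are ignored, which is exactly where the hard part starts.

```python
import re

# A PDF literal string followed by the Tj (show text) operator,
# e.g. "(Hello) Tj". Backslash escapes are tolerated; nested unescaped
# parentheses, hex strings and TJ arrays are deliberately not handled.
SHOW_TEXT = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")


def find_words(content_stream, pattern):
    """Return (shown_text, matched_word) pairs for every regex hit.

    content_stream is assumed to be the raw, uncompressed bytes of a
    page content stream whose text is stored as plain ASCII/Latin-1.
    """
    word_re = re.compile(pattern)
    hits = []
    for shown in SHOW_TEXT.finditer(content_stream):
        text = shown.group(1).decode("latin-1")
        for word in word_re.finditer(text):
            hits.append((text, word.group(0)))
    return hits


# Hypothetical stream: the first text object is plain ASCII, the second
# uses a hex string (an index into a font table) and is skipped.
sample = (
    b"BT /F1 12 Tf 72 720 Td (Hello PDF world) Tj ET\n"
    b"BT /F1 12 Tf 72 700 Td <0048> Tj ET"
)
print(find_words(sample, r"PDF|world"))
```

Finding the word is only half of highlighting it. You would still need
the text position and the font metrics to know where to draw the
highlight, and this sketch does not attempt that.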
I have a library which works (primarily with the outer container) for
reading and writing, called pdfrw. I also maintain a list of other
PDF tools at http://code.google.com/p/pdfrw/wiki/OtherLibraries . It
may be that pdfminer (link on that page) will do what you want -- it
is certainly trying to be complete as a PDF reader. But I've never
personally used pdfminer.
One of my pdfrw examples at http://code.google.com/p/pdfrw/wiki/ExampleTools
will read in preexisting PDFs and write them out to a reportlab
canvas. This works quite well on a few very simple ASCII PDFs, but
the font handling needs a lot of work and probably won't work at all
right now on Unicode text. (But if you wanted to improve it, I
certainly would accept patches or give you commit rights!)
That pdfrw example does graphics reasonably well. I was actually
going down that path for getting better vector graphics into rst2pdf
(both uniconvertor and svglib were broken for my purposes), but then I
realized that the PDF spec allows you to include a page from another
PDF quite easily (the spec calls it a form XObject), so you don't
actually need to parse down into the graphics stream for that. So,
right now, the best way to do vector graphics with rst2pdf is either
to give it a preexisting PDF (which it passes off to pdfrw for
conversion into a form XObject), or to give it an .svg file and invoke
it with -e inkscape, and then it will use inkscape to convert the SVG
to a PDF and then go through the same path.
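
For anyone curious what a form XObject looks like at the byte level,
here is a rough sketch. It is not pdfrw's or rst2pdf's code; the
functions form_xobject and paint are hypothetical helpers written for
this message. It assumes you already have the source page's content
stream, its bounding box and its serialized resource dictionary. The
point is that the content bytes are wrapped unchanged, so nothing in the
graphics stream has to be parsed.

```python
def form_xobject(content, bbox, resources=b"<< >>"):
    """Wrap a page content stream as the body of a form XObject.

    content   -- raw bytes of the source page's content stream
    bbox      -- (llx, lly, urx, ury), typically the page's media box
    resources -- serialized resource dictionary of the source page
                 (an empty dictionary is assumed by default)
    """
    header = (
        b"<< /Type /XObject /Subtype /Form /BBox [%d %d %d %d] "
        b"/Resources %s /Length %d >>"
        % (*bbox, resources, len(content))
    )
    # The content is embedded as-is between stream and endstream.
    return header + b"\nstream\n" + content + b"\nendstream"


def paint(name):
    """Content-stream operators that draw the XObject called `name`."""
    # q/Q save and restore the graphics state around the Do operator.
    return b"q /%s Do Q" % name


# Hypothetical one-rectangle "page" on a US Letter media box.
obj = form_xobject(b"0 0 10 10 re f", (0, 0, 612, 792))
print(obj.decode("latin-1"))
print(paint(b"Fm0").decode("latin-1"))
```

In a real file the stream also has to be written as an indirect object
and registered under /XObject in the resources of the page that paints
it. That bookkeeping is left out here.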