highlight words by regex in pdf files using python
wingusr at gmail.com
Thu Mar 18 20:36:20 CET 2010
On Wed, Mar 17, 2010 at 7:53 AM, Peng Yu <pengyu.ut at gmail.com> wrote:
> On Tue, Mar 16, 2010 at 11:12 PM, Patrick Maupin <pmaupin at gmail.com> wrote:
>> On Mar 4, 6:57 pm, Peng Yu <pengyu... at gmail.com> wrote:
>>> I don't find a general pdf library in python that can do any
>>> operations on pdfs.
>>> I want to automatically highlight certain words (using regex) in a
>>> pdf. Could somebody let me know if there is a tool to do so in python?
>> The problem with PDFs is that they can be quite complicated. There is
>> the outer container structure, which isn't too bad (unless the
>> document author applied encryption or fancy multi-object compression),
>> but then inside the graphics elements, things could be stored as
>> regular ASCII, or as fancy indexes into font-specific tables. Not
>> rocket science, but the only industrial-strength solution for this is
>> probably reportlab's pagecatcher.
>> I have a library which works (primarily with the outer container) for
>> reading and writing, called pdfrw. I also maintain a list of other
>> PDF tools at http://code.google.com/p/pdfrw/wiki/OtherLibraries It
>> may be that pdfminer (link on that page) will do what you want -- it
>> is certainly trying to be complete as a PDF reader. But I've never
>> personally used pdfminer.
>> One of my pdfrw examples at http://code.google.com/p/pdfrw/wiki/ExampleTools
>> will read in preexisting PDFs and write them out to a reportlab
>> canvas. This works quite well on a few very simple ASCII PDFs, but
>> the font handling needs a lot of work and probably won't work at all
>> right now on unicode. (But if you wanted to improve it, I certainly
>> would accept patches or give you commit rights!)
>> That pdfrw example does graphics reasonably well. I was actually
>> going down that path for getting better vector graphics into rst2pdf
>> (both uniconvertor and svglib were broken for my purposes), but then I
>> realized that the PDF spec allows you to include a page from another
>> PDF quite easily (the spec calls it a form xObject), so you don't
>> actually need to parse down into the graphics stream for that. So,
>> right now, the best way to do vector graphics with rst2pdf is either
>> to give it a preexisting PDF (which it passes off to pdfrw for
>> conversion into a form xObject), or to give it a .svg file and invoke
>> it with -e inkscape, and then it will use inkscape to convert the svg
>> to a pdf and then go through the same path.
> Thank you for your long reply! But I'm not sure if you get my question or not.
> Acrobat can highlight certain words in pdfs. I could add notes to the
> highlighted words as well. However, I find that I frequently end up
> highlighting words that can be expressed by a regular expression.
> To improve my productivity, I don't want to do this manually in Acrobat
> but rather do it in an automatic way, if there is such a tool
> available. People in reportlab mailing list said this is not possible
> with reportlab. And I don't see PyPDF can do this. If you know there
> is an API for this purpose, please let me know. Thank you!
Take a look at the Acrobat SDK
(http://www.adobe.com/devnet/acrobat/?view=downloads). In particular
see the Acrobat Interapplication Communication information at
"Spell-checking a document" shows how to spell check a PDF using
visual basic at
"Working with annotations" shows how to add an annotation with visual
basic at http://livedocs.adobe.com/acrobat_sdk/9.1/Acrobat9_1_HTMLHelp/wwhelp/wwhimpl/common/html/wwhelp.htm?context=Acrobat9_HTMLHelp&file=IAC_DevApp_OLE_Support.100.16.html.
Presumably combining the two examples with Python's win32com should
allow you to do what you want.
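
A rough, untested sketch of how the pieces might fit together from
Python. The helper names (matching_word_indices, page_words_via_acrobat)
are made up for illustration. The COM part assumes Windows, pywin32, and
a full Acrobat install (not just Reader), and it assumes the JSObject
exposes getPageNumWords and getPageNthWord as described in the Acrobat
JavaScript reference. Only the regex part is plain Python; adding the
highlight annotation itself is not shown and would follow the "Working
with annotations" example (the JavaScript reference documents
getPageNthWordQuads and addAnnot for that).

```python
import re


def matching_word_indices(words, pattern):
    """Return the indices of the words that the regex matches in full."""
    regex = re.compile(pattern)
    return [i for i, word in enumerate(words) if regex.fullmatch(word)]


def page_words_via_acrobat(pdf_path, page=0):
    """Read the words of one page through Acrobat's COM interface.

    Assumption: Windows + pywin32 + full Acrobat. Untested sketch.
    """
    import win32com.client  # imported lazily, only available on Windows

    pddoc = win32com.client.Dispatch("AcroExch.PDDoc")
    if not pddoc.Open(pdf_path):
        raise IOError("Acrobat could not open %r" % pdf_path)
    try:
        # The JSObject bridges to Acrobat's JavaScript Doc object.
        jso = pddoc.GetJSObject()
        count = int(jso.getPageNumWords(page))
        return [jso.getPageNthWord(page, i) for i in range(count)]
    finally:
        pddoc.Close()


if __name__ == "__main__":
    # Pure-Python part, using a hard-coded word list instead of a real PDF.
    words = ["The", "PDF", "spec", "PDFs", "pdfrw"]
    print(matching_word_indices(words, r"(?i)pdf\w*"))
```

The indices returned by matching_word_indices are the word numbers you
would then hand back to Acrobat to locate and annotate each match.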