[Tutor] How to Scrape Text from PDFs

Malcolm Herbert mjch at mjch.net
Tue Jun 18 20:37:42 EDT 2019


This isn't a Python-related response, sorry; I'm still learning Python myself. Instead, I have some questions about the nature of the PDF, and some thoughts on where I would start looking if the problem were mine.

The URLs that you are intending to match - are they themselves clickable when you open the PDF in another reader?  If so, then you might have better luck looking for the PDF element that provides that capability rather than trying to text-scrape to recover them.

Although it is unlikely to matter inside a URL, text in a PDF can be laid out on the page in a completely arbitrary order, so to do PDF-to-text conversion properly you may need to track the position on the page of each glyph as well as the font's character mapping - a glyph of an 'A', for instance, might not actually be mapped to the ASCII/Unicode code for 'A' ... all of which can make this a complete nightmare for the unwary.
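
To make the two problems above concrete, here is a toy sketch in Python. It does not read a real PDF: the (x, y, code) glyph records, the to_unicode dictionary and the glyphs_to_text name are all made up for illustration, and it assumes a single font and simple horizontal text. It only shows why both position and font mapping are needed to get readable text back.

```python
from collections import defaultdict


def glyphs_to_text(glyphs, to_unicode, line_tolerance=2.0):
    """Rebuild lines of text from hypothetical (x, y, code) glyph records.

    glyphs: iterable of (x, y, code) tuples in page coordinates, where y
        grows upward as it does in PDF user space.
    to_unicode: dict mapping the font's character codes to Unicode strings.
    line_tolerance: glyphs whose y values fall in the same bucket of this
        size are treated as one line (a crude assumption).
    """
    lines = defaultdict(list)
    for x, y, code in glyphs:
        # Bucket glyphs by approximate baseline so each bucket is one line.
        lines[round(y / line_tolerance)].append((x, code))

    result = []
    for key in sorted(lines, reverse=True):  # highest y first = top of page
        row = sorted(lines[key])  # left to right within the line
        # Codes with no mapping become U+FFFD rather than a wrong letter.
        result.append("".join(to_unicode.get(code, "\ufffd") for _, code in row))
    return result


# Glyphs arrive in drawing order, not reading order, and the codes are
# arbitrary font-internal numbers rather than ASCII.
glyphs = [(30, 700, 3), (10, 700, 1), (20, 700, 2), (10, 680, 2)]
to_unicode = {1: "U", 2: "R", 3: "L"}
print(glyphs_to_text(glyphs, to_unicode))
```

A real converter also has to cope with several fonts, rotated text and inferring spaces from gaps between glyphs, none of which this sketch attempts.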

So - when I last looked at generating a PDF with a live link element, this was implemented as blue underlined text (to make it look like a link) with an invisible box placed over the top which contained the PDF magic to make that do what I wanted when the user clicked on it.

I would suspect that what you might want would be a Python library that can pull apart a PDF into its structural elements and then hunt through there for the appropriate "URL box" or whatever it's called ...
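
For what it's worth, in PDF terms the "URL box" is a link annotation, and the target address sits in a /URI entry of the action attached to it. As a rough first check before reaching for a proper library, the sketch below scans the raw bytes of a file for those entries using only the standard library. The find_link_uris name is mine, and the approach rests on a big assumption: it only sees annotations stored uncompressed, so it will find nothing in PDFs that pack their objects into compressed object streams, and it ignores hex-encoded strings. A real PDF parsing library would be needed for those cases.

```python
import re

# Matches a literal-string /URI entry such as: /URI (https://example.org/)
# Backslash-escaped characters inside the string are allowed; unescaped
# nested parentheses and hex strings are deliberately not handled.
URI_PATTERN = re.compile(rb"/URI\s*\(((?:\\.|[^\\()])*)\)")


def find_link_uris(pdf_bytes):
    """Return the URI strings found in uncompressed link actions."""
    uris = []
    for match in URI_PATTERN.finditer(pdf_bytes):
        raw = match.group(1)
        # Undo the PDF escapes for parentheses and backslashes.
        raw = re.sub(rb"\\([()\\])", rb"\1", raw)
        uris.append(raw.decode("latin-1"))
    return uris


# A hand-written fragment shaped like a link annotation, for illustration.
sample = (
    b"<< /Type /Annot /Subtype /Link /Rect [72 700 200 712] "
    b"/A << /S /URI /URI (https://example.org/a\\(1\\)) >> >>"
)
print(find_link_uris(sample))
```

To try it on a real file, read the file in binary mode, for example open("document.pdf", "rb").read() with your own file name in place of the placeholder, and pass the result to find_link_uris. An empty result does not prove there are no links, only that none were stored in the plain form this pattern looks for.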

Hope that helps,
Malcolm

-- 
Malcolm Herbert
mjch at mjch.net

