[Tutor] How to Scrape Text from PDFs

William Ray Wing wrw at mac.com
Mon Jun 17 16:33:48 EDT 2019



> On Jun 17, 2019, at 1:30 AM, Cem Vardar <cemv96 at hotmail.com> wrote:
> 
> Hello,
> 
> I have been working on an assignment that was described to me as “fairly trivial” for a couple of days now. I have some PDF files that contain links to some websites, and I need to extract these links from the files using Python. I would be very glad if someone could point me in the direction of some resources that would give me the essential skills specific to this task.
> 

Unfortunately, a PDF can contain anything from almost-PostScript to a bitmap.  But let’s assume your PDFs are of the almost-PostScript flavor.  In that case you can simply read them as text, and then use Python’s standard string searching for http:// or https://.  Each time you find one, stop and parse the URL (again with string handling), looking for one of the typical terminators (e.g. .com, .net, .org, etc.).
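
As a rough sketch of that idea (not a complete solution), something like the following might do. It assumes the links are stored uncompressed in the file; if the PDF keeps its content in compressed streams, a plain search like this will find nothing and a PDF library would be needed instead. The names find_urls and find_urls_in_file are made up for illustration, and rather than looking for .com or .net, the pattern simply stops at whitespace or at a delimiter such as the closing parenthesis that ends a PDF string like /URI (https://...).

```python
import re

# Match http:// or https:// and then everything up to whitespace or a
# delimiter such as ( ) < > [ ] or a quote character.
URL_RE = re.compile(rb"https?://[^\s()<>\[\]\"']+")


def find_urls(data):
    """Return the unique URLs found in raw bytes, in order of first appearance."""
    seen = []
    for match in URL_RE.finditer(data):
        # latin-1 maps every byte to a character, so decoding cannot fail.
        url = match.group().decode("latin-1").rstrip(".,;")
        if url not in seen:
            seen.append(url)
    return seen


def find_urls_in_file(path):
    """Read a file as bytes and return the URLs found in it."""
    with open(path, "rb") as f:
        return find_urls(f.read())
```

You would call it as find_urls_in_file("some.pdf"), where the file name is a placeholder for one of your own PDFs. Reading in binary mode avoids decoding errors on the non-text parts of the file.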

It might help to cheat a bit: open one of the PDFs in a standard text editor, search for http://, and see what turns up.  I’ll bet it will be fairly clear.
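
If a text editor chokes on the binary parts of the file, the same inspection can be done from Python. This is only a sketch: show_context is a hypothetical helper, and the default width of 40 bytes is an arbitrary choice. It returns the bytes around each occurrence of "http", with unprintable characters replaced by dots, so you can see what surrounds each link.

```python
def show_context(data, needle=b"http", width=40):
    """Return printable snippets of the bytes surrounding each occurrence of needle."""
    snippets = []
    start = data.find(needle)
    while start != -1:
        lo = max(0, start - width)
        hi = min(len(data), start + len(needle) + width)
        chunk = data[lo:hi].decode("latin-1")
        # Replace control and other unprintable characters with dots.
        snippets.append("".join(c if c.isprintable() else "." for c in chunk))
        start = data.find(needle, start + 1)
    return snippets
```

For example, after reading a file with open("some.pdf", "rb") (again a placeholder name), printing each item of show_context(data) should make it clear how the links are delimited in your particular files.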

Bill

> Sincerely,
> Cem
> _______________________________________________
> Tutor maillist  -  Tutor at python.org
> To unsubscribe or change subscription options:
> https://mail.python.org/mailman/listinfo/tutor


