[Tutor] extracting information (images and text) from a PDF and creating a database from it
Alan Gauld
alan.gauld at btinternet.com
Tue Dec 29 10:29:16 CET 2009
"Shashwat Anand" <anand.shashwat at gmail.com> wrote
> I need to make a database from some PDFs. I need to extract logos as well
> as the information (i.e. name, address) beneath each logo and store it in
> the database. A logo can be text as well as a picture, as shown in two
> screenshots from one of the sample PDF files:
> http://imagebin.org/77378
> http://imagebin.org/77379
You could try PDFMiner to extract the content directly from the PDF using Python.
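As a rough illustration of the PDFMiner route, here is a minimal sketch. It assumes the maintained pdfminer.six fork and its high-level extract_pages() API, which may differ from the PDFMiner release you have installed. The function names, the table name "items" and its columns are made up for the example; they are not part of any library. The idea is to record every text block and image together with its bounding box, so that the text sitting beneath a logo can be matched to it later by comparing coordinates.

```python
import sqlite3


def iter_layout_items(pdf_path):
    """Yield (page_no, kind, bbox, payload) for text blocks and images.

    Assumes pdfminer.six is installed. It is imported lazily so the
    database helpers below can be used without it.
    """
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTFigure, LTImage, LTTextContainer

    def walk(element):
        if isinstance(element, LTTextContainer):
            yield "text", element.bbox, element.get_text().strip()
        elif isinstance(element, LTImage):
            # Only the image's internal name is recorded here; saving the
            # pixel data is a separate step not covered by this sketch.
            yield "image", element.bbox, element.name
        elif isinstance(element, LTFigure):
            # Images are usually nested inside figure containers.
            for child in element:
                yield from walk(child)

    for page_no, page in enumerate(extract_pages(pdf_path), start=1):
        for element in page:
            for kind, bbox, payload in walk(element):
                yield page_no, kind, bbox, payload


def create_db(path=":memory:"):
    """Open (or create) a SQLite database with a hypothetical 'items' table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        "page INTEGER, kind TEXT, "
        "x0 REAL, y0 REAL, x1 REAL, y1 REAL, payload TEXT)"
    )
    return conn


def store_items(conn, items):
    """Insert (page_no, kind, bbox, payload) tuples; return the row count."""
    rows = [(page, kind, *bbox, payload) for page, kind, bbox, payload in items]
    conn.executemany("INSERT INTO items VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

With a real file you would call something like store_items(create_db("logos.db"), iter_layout_items("sample.pdf")), where both file names are placeholders. Logos that are drawn as text will show up as "text" rows rather than "image" rows, so deciding which text block is a logo and which is the address beneath it is still up to you.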
> Would converting to HTML be a good option? Later on I need to apply some
> image processing too. What would be the ideal way to approach it?
Converting to HTML (assuming you have a tool to do that!) may be better
since there is a wider choice of tools and more experience to help you.
Or there are various commercial tools for converting PDF into Word etc.
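If you do go the HTML route, the standard library can already pull out the image references and the visible text. The following is only a sketch: it assumes the converter writes logos as ordinary img tags with a src attribute, which depends entirely on the tool you use, and the class and function names are invented for the example.

```python
from html.parser import HTMLParser


class LogoPageParser(HTMLParser):
    """Collect image sources and non-empty text chunks in document order."""

    def __init__(self):
        super().__init__()
        self.images = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text.append(stripped)


def parse_converted_html(html):
    """Return (image_sources, text_chunks) for one converted page."""
    parser = LogoPageParser()
    parser.feed(html)
    parser.close()
    return parser.images, parser.text
```

Note that this also collects the contents of script and style elements as text, so real converter output may need some filtering before it goes into the database.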
I've never personally had to extract data from a PDF; I've always had
access to the source documents, so I can't comment on how effective each
approach is...
--
Alan Gauld
Author of the Learn to Program web site
http://www.alan-g.me.uk/