[Tutor] extracting information (images and text) from a PDF and creating a database from it

Wed Dec 30 07:05:55 CET 2009

On Tue, Dec 29, 2009 at 3:21 PM, Shashwat Anand
<anand.shashwat at gmail.com> wrote:
> I used PDFMiner and was pretty satisfied with the text portions. I
> retrieved all the text and was able to manipulate it as I wished.
> However, I failed on the image part. So technically my question reduces to:
> if there is a PDF document in which every page has one image followed by
> some verbose text, how can I retrieve both the images and the text
> without loss?

You can use `pdftohtml' [http://pdftohtml.sf.net]. It is available on Ubuntu.
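
If you want to drive it from Python, something like the following might work as a starting point. It is only a sketch under assumptions: it assumes the `-xml` option is available in your build, that the result is written to `<out_base>.xml`, and that the XML has `<page>` elements with `<image src="...">` and `<text>` children. The helper names (`build_command`, `convert`, `pages_from_xml`) are made up for illustration, so check the output of your own pdftohtml before relying on it.

```python
import shutil
import subprocess
import xml.etree.ElementTree as ET


def build_command(pdf_path, out_base):
    """Return the argument list for an XML-mode pdftohtml run."""
    # -xml asks for XML output instead of HTML (assumed to be supported).
    return ["pdftohtml", "-xml", pdf_path, out_base]


def convert(pdf_path, out_base):
    """Run pdftohtml and return the path of the XML file it should write."""
    if shutil.which("pdftohtml") is None:
        raise RuntimeError("pdftohtml not found on PATH")
    # check=True raises CalledProcessError if the tool exits non-zero.
    subprocess.run(build_command(pdf_path, out_base), check=True)
    # Assumption: the tool names its output <out_base>.xml.
    return out_base + ".xml"


def pages_from_xml(xml_text):
    """Collect image file names and text fragments for each page.

    Assumes <page> elements holding <image src="..."> and <text> children.
    """
    root = ET.fromstring(xml_text)
    pages = []
    for page in root.iter("page"):
        images = [img.get("src") for img in page.iter("image")]
        # itertext() also picks up text nested in <b>/<i> markup.
        text = ["".join(t.itertext()).strip() for t in page.iter("text")]
        pages.append({
            "number": page.get("number"),
            "images": images,
            "text": text,
        })
    return pages
```

With one image and some text per page, each entry returned by `pages_from_xml` would pair the image file name with the text fragments of that page, which you could then store in whatever database you prefer.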

Regards,
Didar