[BangPypers] extracting unicode text from pdfs
Eknath Venkataramani
eknath.iyer at gmail.com
Mon May 24 17:15:45 CEST 2010
I tried it, but it didn't work out well enough. The output is the same as
what I get out of xpdf.
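
For comparing extractors on all 45 files, a rough check of whether the output is real Unicode Hindi could look like the sketch below. It only counts how many letters and combining marks fall in the Devanagari block (U+0900 to U+097F). The function names and the 0.5 threshold are my own assumptions, and extract_with_pdfminer assumes the pdfminer.six fork is installed, which is not necessarily the pdfminer release suggested in this thread.

```python
import unicodedata


def devanagari_ratio(text):
    """Return the fraction of letters and combining marks in `text`
    that fall inside the Devanagari block (U+0900..U+097F)."""
    # Only letters (L*) and marks (M*) count; spaces, digits and
    # punctuation are ignored so they do not skew the ratio.
    relevant = [ch for ch in text if unicodedata.category(ch)[0] in ("L", "M")]
    if not relevant:
        return 0.0
    hits = sum(1 for ch in relevant if "\u0900" <= ch <= "\u097F")
    return hits / len(relevant)


def looks_like_unicode_hindi(text, threshold=0.5):
    """Heuristic: True when at least `threshold` of the letters are Devanagari.
    The default threshold is an arbitrary assumption, not a tuned value."""
    return devanagari_ratio(text) >= threshold


def extract_with_pdfminer(path):
    """Extract text from the PDF at `path`.

    Assumption: pdfminer.six is installed; it provides
    pdfminer.high_level.extract_text. The import is local so the
    helpers above still work without it."""
    from pdfminer.high_level import extract_text
    return extract_text(path)
```

Running looks_like_unicode_hindi(extract_with_pdfminer(path)) over each file would at least show which PDFs come out as Devanagari and which come out as symbols. It says nothing about why a given file fails.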
On Mon, May 24, 2010 at 7:51 PM, Dhananjay Nene <dhananjay.nene at gmail.com> wrote:
> You may want to try out pdfminer. It's very similar to xpdf in structure and
> should give you the parsed data as Unicode directly.
>
> On Mon, May 24, 2010 at 7:13 PM, Eknath Venkataramani <eknath.iyer at gmail.com> wrote:
>
> > I have around 45 PDFs containing text in _HINDI_ to convert into raw text.
> > When I use the xpdf package, the generated text is very weird, so I'd like
> > to write a program which would convert the PDF text into Unicode text as it
> > is.
> >
> > The fonts used in the pdfs:
> > name                                 type              emb sub uni object ID
> > ------------------------------------ ----------------- --- --- --- ---------
> > APKAPP+Usha-Bold                     Type 1C           yes yes yes     72  0
> > APKBBB+Agenda-Light                  Type 1C           yes yes yes     77  0
> > APKBGF+Usha                          Type 1C           yes yes yes     41  0
> > APKBKJ+Agenda-Medium                 Type 1C           yes yes yes     46  0
> > APKBON+Agenda-Bold                   Type 1C           yes yes yes     49  0
> >
> > For example, the PDF contains: आदमी मुसाफिर है
> > When I use pdftotext, I get some very weird symbols: '... .......'
> > while I'd like 'आदमी मुसाफिर है' to be the output.
> >
> >
> > --
> > Eknath Venkataramani
> > _______________________________________________
> > BangPypers mailing list
> > BangPypers at python.org
> > http://mail.python.org/mailman/listinfo/bangpypers
> >
>
>
>
> --
> --------------------------------------------------------
> blog: http://blog.dhananjaynene.com
> twitter: http://twitter.com/dnene
--
Eknath Venkataramani