[BangPypers] extracting unicode text from pdfs

Mon May 24 16:21:57 CEST 2010

You may want to try out pdfminer. It's very similar to xpdf in structure and
should give you the parsed text as Unicode directly.
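
A minimal sketch of what that could look like, with some assumptions that are
not from this thread: it uses the pdfminer.six fork and its extract_text()
helper, Python 3, and a hypothetical folder of PDFs passed on the command
line. Even when the fonts carry Unicode maps, legacy Hindi fonts may map glyphs
to non-Devanagari codepoints, so the sketch also reports how much of the
output actually falls in the Devanagari block.

```python
import sys
from pathlib import Path


def devanagari_ratio(text):
    """Fraction of non-whitespace characters in the Devanagari block (U+0900-U+097F)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if "\u0900" <= c <= "\u097f")
    return hits / len(chars)


def pdf_to_text(pdf_path):
    """Extract the text of one PDF as a Python str (Unicode)."""
    # Assumption: pdfminer.six is installed. Imported lazily so the
    # helper above can be used without it.
    from pdfminer.high_level import extract_text
    return extract_text(str(pdf_path))


def convert_folder(folder):
    """Write a UTF-8 .txt file next to every .pdf in `folder` (hypothetical layout)."""
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        text = pdf_to_text(pdf_path)
        pdf_path.with_suffix(".txt").write_text(text, encoding="utf-8")
        # A low percentage suggests the font maps do not point at Devanagari.
        print(f"{pdf_path.name}: {devanagari_ratio(text):.0%} Devanagari")


if __name__ == "__main__":
    if len(sys.argv) > 1:
        convert_folder(sys.argv[1])
```

If the reported percentage is close to zero, the extracted text is probably
still in the font's private encoding and would need a separate mapping step.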

On Mon, May 24, 2010 at 7:13 PM, Eknath Venkataramani
<eknath.iyer at gmail.com> wrote:

> I have around 45 pdfs to convert into raw text containing text in _HINDI_.
> When I use the xpdf package, the generated text is very weird, so I'd like
> to write a program which would convert the pdf text into Unicode text as it
> is.
>
> The fonts used in the pdfs:
> name                                 type              emb sub uni object ID
> ------------------------------------ ----------------- --- --- --- ---------
> APKAPP+Usha-Bold                     Type 1C           yes yes yes     72  0
> APKBBB+Agenda-Light                  Type 1C           yes yes yes     77  0
> APKBGF+Usha                          Type 1C           yes yes yes     41  0
> APKBKJ+Agenda-Medium                 Type 1C           yes yes yes     46  0
> APKBON+Agenda-Bold                   Type 1C           yes yes yes     49  0
>
> For example, the pdf contains: आदमी मुसाफिर है
> When I use pdftotext, I get some very weird symbols: '... .......'
> while I'd like 'आदमी मुसाफिर है' to be the output.
>
>
> --
> Eknath Venkataramani

-- 
--------------------------------------------------------
blog: http://blog.dhananjaynene.com
twitter: http://twitter.com/dnene