[BangPypers] extracting unicode text from pdfs

Eknath Venkataramani eknath.iyer at gmail.com
Mon May 24 15:43:26 CEST 2010

I have around 45 PDFs containing text in _HINDI_ that I need to convert
into raw text. When I use the xpdf package, the generated text is garbled,
so I'd like to write a program that converts the PDF text into Unicode text
as it appears in the PDF.

The fonts used in the pdfs:
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
APKAPP+Usha-Bold                     Type 1C           yes yes yes     72  0
APKBBB+Agenda-Light                  Type 1C           yes yes yes     77  0
APKBGF+Usha                          Type 1C           yes yes yes     41  0
APKBKJ+Agenda-Medium                 Type 1C           yes yes yes     46  0
APKBON+Agenda-Bold                   Type 1C           yes yes yes     49  0

For example, the PDF contains: आदमी मुसाफिर है
When I use pdftotext, I get some very weird symbols: '...
I would like the output to be 'आदमी मुसाफिर है'.
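
To compare extraction tools across all 45 files, it may help to measure how much of each tool's output actually falls in the Unicode Devanagari block (U+0900 to U+097F). The following is only a minimal sketch of such a check. The function names and the 0.5 threshold are assumptions for illustration, not part of xpdf or any other library, and the check does not repair garbled text; it only flags it.

```python
def devanagari_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the Devanagari block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    # U+0900..U+097F is the Unicode Devanagari block used for Hindi.
    hits = sum(1 for c in chars if 0x0900 <= ord(c) <= 0x097F)
    return hits / len(chars)


def looks_like_hindi(text: str, threshold: float = 0.5) -> bool:
    """Hypothetical helper: True if most of the text is Devanagari.

    The default threshold of 0.5 is an arbitrary assumption.
    """
    return devanagari_ratio(text) >= threshold
```

Running `looks_like_hindi` on the text each tool produces for a page known to be Hindi would show which tool, if any, is emitting real Unicode rather than font-specific glyph codes.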

Eknath Venkataramani
