Extracting Unicode text from PDFs

Eknath Venkataramani eknath.iyer at gmail.com
Mon May 24 09:43:26 EDT 2010


I have around 45 PDFs containing text in _HINDI_ that I need to convert into
raw text. When I use the xpdf package, the generated text is garbled, so I'd
like to write a program that converts the PDF text into Unicode text exactly
as it appears in the document.

The fonts used in the PDFs:
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
APKAPP+Usha-Bold                     Type 1C           yes yes yes     72  0
APKBBB+Agenda-Light                  Type 1C           yes yes yes     77  0
APKBGF+Usha                          Type 1C           yes yes yes     41  0
APKBKJ+Agenda-Medium                 Type 1C           yes yes yes     46  0
APKBON+Agenda-Bold                   Type 1C           yes yes yes     49  0
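
To work with a font listing like the one above from a program, it can be parsed into records. The following is only a rough sketch: the function name parse_pdffonts and the dictionary keys are made up for illustration, and it assumes the seven-column layout shown here (name, type, emb, sub, uni, object number, generation) and font names without spaces. Other versions of the tool may print extra columns.

```python
def parse_pdffonts(listing):
    """Parse a font listing laid out like the table above into dicts.

    Assumption: each data row ends with three yes/no flags followed by
    two integers (object number and generation).
    """
    fonts = []
    for line in listing.splitlines():
        # Split from the right, because the type column ("Type 1C")
        # can itself contain a space.
        parts = line.rsplit(None, 5)
        if len(parts) != 6:
            continue  # blank line or a wrapped header fragment
        head, emb, sub, uni, obj, gen = parts
        if not (obj.isdigit() and gen.isdigit()):
            continue  # header row or dashed rule
        head_parts = head.split(None, 1)
        if len(head_parts) != 2:
            continue  # no separate type column, not a data row
        name, font_type = head_parts
        fonts.append({
            "name": name,
            # A subset font carries a tag such as "APKAPP+"; drop it.
            "base_name": name.split("+", 1)[-1],
            "type": font_type,
            "embedded": emb == "yes",
            "subset": sub == "yes",
            "to_unicode": uni == "yes",
            "object": (int(obj), int(gen)),
        })
    return fonts


SAMPLE = """\
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
APKAPP+Usha-Bold                     Type 1C           yes yes yes     72  0
APKBGF+Usha                          Type 1C           yes yes yes     41  0
"""

for font in parse_pdffonts(SAMPLE):
    print(font["base_name"], font["type"], font["to_unicode"])
```

All five fonts report "yes" in the uni column, so the garbled output may not be caused by a missing Unicode map. That is a guess based on the listing, not something verified against the files.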

For example, the PDF contains: आदमी मुसाफिर है
When I run pdftotext on it, I get some very weird symbols: '... .......'
whereas I'd like the output to be 'आदमी मुसाफिर है'.
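
One way to check the result across all the files would be to run pdftotext with UTF-8 output and measure how much of the extracted text is actually Devanagari. This is a minimal sketch, not a fix for the problem: the function names pdf_to_text and devanagari_ratio are hypothetical, and it assumes the pdftotext binary is on the PATH. A ratio near zero for a Hindi page would indicate that the extracted characters are not Devanagari code points.

```python
import subprocess


def pdf_to_text(pdf_path):
    """Run pdftotext on pdf_path and return its output as a str.

    "-enc UTF-8" asks for UTF-8 output and "-" sends it to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        check=True,
        stdout=subprocess.PIPE,
    )
    return result.stdout.decode("utf-8", errors="replace")


def devanagari_ratio(text):
    """Return the fraction of non-whitespace characters that fall in
    the Devanagari block (U+0900 to U+097F)."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for ch in chars if "\u0900" <= ch <= "\u097f")
    return hits / len(chars)


# Usage (the file name is a placeholder):
#   text = pdf_to_text("some_file.pdf")
#   print(devanagari_ratio(text))
```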


-- 
Eknath Venkataramani

More information about the Python-list mailing list