extracting unicode text from pdfs
Eknath Venkataramani
eknath.iyer at gmail.com
Mon May 24 09:43:26 EDT 2010
I have around 45 PDFs containing text in _HINDI_ that I need to convert
into raw text. When I use the xpdf package, the generated text is garbled,
so I'd like to write a program that converts the PDF text into Unicode text
as it is.
The fonts used in the PDFs:
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
APKAPP+Usha-Bold                     Type 1C           yes yes yes     72  0
APKBBB+Agenda-Light                  Type 1C           yes yes yes     77  0
APKBGF+Usha                          Type 1C           yes yes yes     41  0
APKBKJ+Agenda-Medium                 Type 1C           yes yes yes     46  0
APKBON+Agenda-Bold                   Type 1C           yes yes yes     49  0
For example, the PDF contains: आदमी मुसाफिर है
When I use pdftotext, I get some very weird symbols: '... .......'
whereas I'd like the output to be 'आदमी मुसाफिर है'.
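A first diagnostic step could be sketched as follows. This is only a rough sketch under stated assumptions: it assumes pdftotext is on the PATH, the helper names (devanagari_ratio, pdf_to_text, report) and the "pdfs" directory are hypothetical, and it only measures how much of the extracted text falls in the Devanagari block (U+0900 to U+097F). If the ratio stays near zero even with UTF-8 output, the fonts' ToUnicode maps probably do not point at Devanagari code points, and a font-specific glyph-to-Unicode table would be needed, which is not shown here.

```python
import subprocess
from pathlib import Path


def devanagari_ratio(text):
    """Fraction of non-whitespace characters in the Devanagari block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if "\u0900" <= c <= "\u097f")
    return hits / len(chars)


def pdf_to_text(pdf_path):
    """Run pdftotext with UTF-8 output; '-' sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", str(pdf_path), "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def report(pdf_dir):
    """Print the Devanagari ratio for every PDF in pdf_dir (hypothetical path)."""
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        text = pdf_to_text(pdf)
        print(f"{pdf.name}: {devanagari_ratio(text):.0%} Devanagari")


# Example call, assuming the PDFs live in a directory named "pdfs":
# report("pdfs")
```

A file whose text extracts correctly should score close to 100%, while output made of Latin letters or symbols scores close to 0%.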
--
Eknath Venkataramani