[BangPypers] extracting unicode text from pdfs
Dhananjay Nene
dhananjay.nene at gmail.com
Mon May 24 10:21:57 EDT 2010
You may want to try out pdfminer. It's very similar to xpdf in structure and
should give you the parsed text as Unicode directly.
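
A minimal sketch of how that could look, assuming the maintained pdfminer.six
fork and its `pdfminer.high_level.extract_text` function (the original
pdfminer shipped a `pdf2txt.py` script instead). The helper names
`devanagari_ratio`, `pdf_to_unicode`, and `main`, and the `.txt` output naming,
are hypothetical and only for illustration. The ratio check is there because
fonts with a legacy, non-Unicode encoding may still come out as Latin-looking
junk, and a low ratio is a quick way to spot that.

```python
import sys
import unicodedata


def devanagari_ratio(text):
    """Fraction of letters and combining marks that lie in the Devanagari block."""
    relevant = [ch for ch in text if unicodedata.category(ch)[0] in ("L", "M")]
    if not relevant:
        return 0.0
    hits = sum(1 for ch in relevant if "\u0900" <= ch <= "\u097F")
    return hits / len(relevant)


def pdf_to_unicode(pdf_path):
    """Extract the text of one PDF as a Python (Unicode) string."""
    # Imported lazily so devanagari_ratio() works without pdfminer installed.
    # Assumption: pdfminer.six is installed (pip install pdfminer.six).
    from pdfminer.high_level import extract_text
    return extract_text(pdf_path)


def main(paths):
    for path in paths:
        text = pdf_to_unicode(path)
        # Hypothetical naming scheme: write UTF-8 text next to the PDF.
        with open(path + ".txt", "w", encoding="utf-8") as out:
            out.write(text)
        print("%s: Devanagari ratio %.2f" % (path, devanagari_ratio(text)))


if __name__ == "__main__":
    main(sys.argv[1:])
```

Run it as `python script.py file1.pdf file2.pdf` (the script name is up to you).
If the ratio is close to zero for a page you know is Hindi, the PDF most likely
maps its glyphs to non-Devanagari code points, and the extracted text would
need a font-specific conversion afterwards.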
On Mon, May 24, 2010 at 7:13 PM, Eknath Venkataramani <eknath.iyer at gmail.com> wrote:
> I have around 45 PDFs containing text in _HINDI_ to convert into raw text.
> When I use the xpdf package, the generated text is garbled, so I'd like
> to write a program which would convert the PDF text into Unicode text as it
> is.
>
> The fonts used in the pdfs:
> name                                 type              emb sub uni object ID
> ------------------------------------ ----------------- --- --- --- ---------
> APKAPP+Usha-Bold                     Type 1C           yes yes yes     72  0
> APKBBB+Agenda-Light                  Type 1C           yes yes yes     77  0
> APKBGF+Usha                          Type 1C           yes yes yes     41  0
> APKBKJ+Agenda-Medium                 Type 1C           yes yes yes     46  0
> APKBON+Agenda-Bold                   Type 1C           yes yes yes     49  0
>
> For example, in the PDF: आदमी मुसाफिर है
> when I use pdftotext, I get some very weird symbols: '...
> .......'
> while I'd like 'आदमी मुसाफिर है' to be the output.
>
>
> --
> Eknath Venkataramani
--
--------------------------------------------------------
blog: http://blog.dhananjaynene.com
twitter: http://twitter.com/dnene