[BangPypers] PyPDF to read hindi

AADITYA SRIRAM aadisriram at gmail.com
Wed Jun 2 18:27:52 CEST 2010


Hi Raj,

   Your reply makes me believe I can finally move ahead with the project.
Here is the code I am using. The PDF file I am using for testing is
http://pib.nic.in/archieve/railbudget/rbudget2010/RBspeechHin.pdf. How do I
find out the encoding used in the file?

  My code is:

    import pyPdf
    from BeautifulSoup import BeautifulSoup

    f = open('conv.txt', 'w')
    pdf = pyPdf.PdfFileReader(open("RBspeechHin.pdf", "rb"))
    # for page in pdf.pages:
    c = pdf.getPage(1).extractText()   # text of the second page (index 1)
    soup = BeautifulSoup(c)
    print soup.originalEncoding        # encoding BeautifulSoup guessed, if any
    print soup.prettify()
    f.write(soup.prettify())           # write the text, not the soup object
    f.close()

I was working on an HTML program before this and used BeautifulSoup to handle
the encoding, so I tried my luck here too, but it didn't work :( If you can
help me with just the PDF mentioned above, that will suffice; I will try to
learn from that :)

Cheers,
Aaditya
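
On the question of finding out the encoding: one possible starting point is to
look at which fonts a page declares and whether each carries a /ToUnicode map.
The sketch below is only an illustration, written for Python 3. The function
name describe_fonts and the sample dictionary are hypothetical, and it assumes
the page resources behave like a plain nested dictionary with PDF-style keys.

```python
def describe_fonts(resources):
    """Summarise the fonts declared in a page's resource dictionary.

    `resources` is assumed to be dict-like and shaped like a PDF page's
    /Resources entry: {'/Font': {'/F1': {'/BaseFont': ..., ...}}}.
    """
    summary = []
    fonts = resources['/Font'] if '/Font' in resources else {}
    for name in sorted(fonts.keys()):
        font = fonts[name]
        base = str(font['/BaseFont']) if '/BaseFont' in font else None
        # /Encoding may be a name (e.g. /WinAnsiEncoding) or a dictionary
        # carrying a /Differences array; only report which kind it is.
        if '/Encoding' not in font:
            encoding = None
        elif isinstance(font['/Encoding'], dict):
            encoding = 'custom dictionary'
        else:
            encoding = str(font['/Encoding'])
        # A /ToUnicode entry is what lets a reader map glyph codes back
        # to Unicode characters, so record whether the font has one.
        summary.append((name, base, encoding, '/ToUnicode' in font))
    return summary


# Hypothetical resource dictionary, for illustration only.
sample = {
    '/Font': {
        '/F1': {'/BaseFont': '/Helvetica', '/Encoding': '/WinAnsiEncoding'},
        '/F2': {'/BaseFont': '/ABCDEF+SomeHindiFont',
                '/Encoding': {'/Differences': []},
                '/ToUnicode': object()},
    }
}
for row in describe_fonts(sample):
    print(row)
```

With pyPdf, the argument might come from something like
pdf.getPage(1)['/Resources'], assuming the page object allows dictionary-style
access and resolves indirect references on lookup; that part is untested here.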

On Wed, Jun 2, 2010 at 3:28 PM, Amal <raj.amal at gmail.com> wrote:

> Hi Aaditya,
>  Actually, reading Hindi text is not as simple as reading English text. Most
> Hindi PDFs don't use a standard encoding.
>
> Here, an encoding is the code value assigned to each character, and each
> encoding corresponds to a particular font's glyphs.
> So a PDF takes the code, maps it to a glyph through a font map, and then
> renders that glyph. It does not know which character it is.
> So to read most Hindi PDFs, we have to know the encoding-to-character
> mapping.
>
> I worked at my previous company with Dainik Bhaskar and other Hindi
> newspaper PDFs and faced the same problem.
> So a generic Hindi PDF-to-text converter is not possible.
>
> But if you know the specific encoding, then you might be able to write a
> converter for that specific kind of Hindi PDF.
>
> Amal.
>
> On Wed, Jun 2, 2010 at 2:50 AM, Srinivas Reddy Thatiparthy <
> srinivas_thatiparthy at akebonosoft.com> wrote:
>
> > Hindi is Unicode text, so your input data should be treated as Unicode
> > instead of ASCII. Last but not least, the encoding in your editor should
> > be set to a Unicode encoding such as UTF-8; otherwise you will see
> > garbled text.
> >
> >
> > This is my guess; I have never worked with Unicode in Python. If I am
> > wrong, please correct me.
> >
> > Thanks&Regards,
> > Srinivas Reddy Thatiparthy,
> > Mobile:9393099772,
> >
> >
> >
> > -----Original Message-----
> > From: bangpypers-bounces+srinivas_thatiparthy=akebonosoft.com at python.org on behalf of AADITYA SRIRAM
> > Sent: Wed 6/2/2010 2:22 PM
> > To: bangpypers at python.org
> > Subject: [BangPypers] PyPDF to read hindi
> >
> > Hi guys, I am writing a small program to convert PDF files to text files (I
> > know it's easy and lame, but I need to start somewhere!!). Anyway, I am not
> > able to rip the Hindi text in readable form :( Can anyone please help? It's
> > working fine with English text.
> > _______________________________________________
> > BangPypers mailing list
> > BangPypers at python.org
> > http://mail.python.org/mailman/listinfo/bangpypers
>
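
The encoding-to-character mapping Amal describes above can be sketched as a
plain lookup table. This is only an illustration for Python 3: the table
LEGACY_TO_UNICODE and the function legacy_to_unicode are hypothetical names,
and the four entries are made up for one imaginary legacy font. A real table
would have to be built by hand for each font.

```python
# Hypothetical mapping for ONE legacy font (illustration only): each code
# the PDF stores for a glyph is paired with the Unicode character it shows.
LEGACY_TO_UNICODE = {
    'd': '\u0915',   # DEVANAGARI LETTER KA
    'x': '\u0917',   # DEVANAGARI LETTER GA
    'e': '\u092e',   # DEVANAGARI LETTER MA
    'k': '\u093e',   # DEVANAGARI VOWEL SIGN AA
}


def legacy_to_unicode(text, table=LEGACY_TO_UNICODE):
    """Translate font-specific codes to Unicode, leaving unknown codes as-is."""
    return ''.join(table.get(ch, ch) for ch in text)


converted = legacy_to_unicode('dke')
# ascii() escapes non-ASCII characters, so this prints safely on any console.
print(ascii(converted))
```

A one-to-one table like this is the simplest case. A real converter for a
legacy Hindi font would typically also need to reorder some vowel signs and
handle glyphs that stand for conjuncts, which this sketch does not attempt.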

