HTML character entity conversion

Sun Jul 30 14:43:09 EDT 2006

Claudio Grondi wrote:
> pak.andrei at gmail.com wrote:
> > Here is my script:
> >
> > from mechanize import *
> > from BeautifulSoup import *
> > import StringIO
> > b = Browser()
> > f = b.open("http://www.translate.ru/text.asp?lang=ru")
> > b.select_form(nr=0)
> > b["source"] = "hello python"
> > html = b.submit().get_data()
> > soup = BeautifulSoup(html)
> > print  soup.find("span", id = "r_text").string
> >
> > OUTPUT:
> > &#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;
> > ----------
> > In Russian it looks like:
> > "привет питон"
> >
> > How can I translate this using standard Python libraries??
> >
> > --
> > Pak Andrei, http://paxoblog.blogspot.com, icq://97449800
> >
> Translate to what and with what purpose?
>
> Assuming your intention is to get a Python Unicode string, what about:
>
> strHTML = '&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;'
> strUnicodeHexCode = strHTML.replace('&#','\u').replace(';','')
> strUnicode = eval("u'%s'"%strUnicodeHexCode)
>
> ?
>
> I am sure there is a more elegant and direct solution, but I just wanted
> to provide a quick response here.
>
> Claudio Grondi

Thank you, Claudio.
Really interesting solution, but it doesn't work...

In [19]: strHTML = '&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;'

In [20]: strUnicodeHexCode = strHTML.replace('&#','\u').replace(';','')

In [21]: strUnicode = eval("u'%s'"%strUnicodeHexCode)

In [22]: print strUnicode
---------------------------------------------------------------------------
exceptions.UnicodeEncodeError                        Traceback (most recent call last)

C:\Documents and Settings\dron\<ipython console>

C:\usr\lib\encodings\cp866.py in encode(self, input, errors)
     16     def encode(self,input,errors='strict'):
     17
---> 18         return codecs.charmap_encode(input,errors,encoding_map)
     19
     20     def decode(self,input,errors='strict'):

UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-5: character maps to <undefined>
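
A note on the traceback above: cp866 is a Cyrillic code page, so genuine Russian letters should encode without trouble. The failure therefore suggests that the string does not hold Cyrillic at all. The sketch below assumes Python 3 (the session here is Python 2), and the variable names are only illustrative. It compares the letter "п" (decimal 1087, U+043F) with U+1087, which is what the escape \u1087 actually denotes:

```python
# Python 3 sketch; the session above ran under Python 2.
cyrillic = "\u043f"   # U+043F CYRILLIC SMALL LETTER PE, decimal 1087
mistaken = "\u1087"   # U+1087, what the escape \u1087 really means (hex 1087)

# cp866 covers Cyrillic, so the real letter encodes to a single byte.
encoded = cyrillic.encode("cp866")

# U+1087 has no slot in cp866, which reproduces the 'charmap' failure.
try:
    mistaken.encode("cp866")
    failed = False
except UnicodeEncodeError:
    failed = True
```

If this reading is right, the console is not at fault: the data reaching print is already wrong.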

In [23]: print strUnicode.encode("utf-8")
сВЗсВИсВАсБ┤сБ╖сВР сВЗсВАсВРсВЖсВЕ
<-- it's not my string "привет питон"

In [24]: strUnicode.encode("utf-8")
Out[24]:
'\xe1\x82\x87\xe1\x82\x88\xe1\x82\x80\xe1\x81\xb4\xe1\x81\xb7\xe1\x82\x90 \xe1\x82\x87\xe1\x82\x80\xe1\x82\x90\xe1\x82\x86\xe1\x82\x85' <-- and too many chars
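
The likely cause is a decimal/hex mix-up. The numbers in references such as &#1087; are decimal, while the \u escape reads its four digits as hexadecimal. So \u1087 becomes U+1087 instead of U+043F, and each character takes three bytes in UTF-8 (\xe1\x82\x87) rather than two. A minimal sketch of one way around it, assuming Python 3 rather than the Python 2 used above. The helper name decode_numeric_refs is made up for illustration and handles only decimal references:

```python
import html
import re

def decode_numeric_refs(text):
    """Replace decimal references like &#1087; with the character they name."""
    # int(..., 10) is implied: the digits are decimal, not hex.
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), text)

raw = ("&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; "
       "&#1087;&#1080;&#1090;&#1086;&#1085;")

decoded = decode_numeric_refs(raw)   # decoded == "привет питон"

# The standard library (Python 3.4+) gives the same result and also
# understands hex (&#x43f;) and named (&amp;) references.
stdlib_decoded = html.unescape(raw)
```

This avoids eval altogether, which also matters when the HTML comes from a remote site.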