Html character entity conversion
pak.andrei at gmail.com
Sun Jul 30 14:43:09 EDT 2006
Claudio Grondi wrote:
> pak.andrei at gmail.com wrote:
> > Here is my script:
> >
> > from mechanize import *
> > from BeautifulSoup import *
> > import StringIO
> > b = Browser()
> > f = b.open("http://www.translate.ru/text.asp?lang=ru")
> > b.select_form(nr=0)
> > b["source"] = "hello python"
> > html = b.submit().get_data()
> > soup = BeautifulSoup(html)
> > print soup.find("span", id = "r_text").string
> >
> > OUTPUT:
> > &#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;
> > ----------
> > In russian it looks like:
> > "привет питон"
> >
> > How can I translate this using standard Python libraries??
> >
> > --
> > Pak Andrei, http://paxoblog.blogspot.com, icq://97449800
> >
> Translate to what and with what purpose?
>
> Assuming your intention is to get a Python Unicode string, what about:
>
> strHTML = '&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;'
> strUnicodeHexCode = strHTML.replace('&#','\u').replace(';','')
> strUnicode = eval("u'%s'"%strUnicodeHexCode)
>
> ?
>
> I am sure there is a more elegant and direct solution, but I just wanted
> to provide a quick response here.
>
> Claudio Grondi
Thank you, Claudio.
Really interesting solution, but it doesn't work...
In [19]: strHTML = '&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;'
In [20]: strUnicodeHexCode = strHTML.replace('&#','\u').replace(';','')
In [21]: strUnicode = eval("u'%s'"%strUnicodeHexCode)
In [22]: print strUnicode
---------------------------------------------------------------------------
exceptions.UnicodeEncodeError Traceback (most recent call last)
C:\Documents and Settings\dron\<ipython console>
C:\usr\lib\encodings\cp866.py in encode(self, input, errors)
16 def encode(self,input,errors='strict'):
17
---> 18 return codecs.charmap_encode(input,errors,encoding_map)
19
20 def decode(self,input,errors='strict'):
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-5: character maps to <undefined>
In [23]: print strUnicode.encode("utf-8")
сВЗсВИсВАсБ┤сБ╖сВР сВЗсВАсВРсВЖсВЕ
<-- it's not my string "привет питон"
In [24]: strUnicode.encode("utf-8")
Out[24]:
'\xe1\x82\x87\xe1\x82\x88\xe1\x82\x80\xe1\x81\xb4\xe1\x81\xb7\xe1\x82\x90 \xe1\x82\x87\xe1\x82\x80\xe1\x82\x90\xe1\x82\x86\xe1\x82\x85' <-- and too many chars
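
The byte count hints at the likely cause. Each letter comes out as three UTF-8 bytes (\xe1\x82\x87 is U+1087), while a Cyrillic letter needs only two. The numbers in the page are decimal character references (&#1087; is "п", U+043F), but a \uXXXX escape reads its digits as hexadecimal, so the replace/eval trick produces U+1087 instead of U+043F. Below is a minimal sketch of a decimal-aware conversion. It assumes Python 3 (the session above is Python 2, where unichr would take the place of chr and html.unescape is not available), and decode_numeric_refs is a hypothetical helper name, not something from the thread.

```python
import html
import re

# The translated text as numeric character references (decimal code points).
entities = ("&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; "
            "&#1087;&#1080;&#1090;&#1086;&#1085;")


def decode_numeric_refs(text):
    """Replace decimal (&#1087;) and hex (&#x43f;) references with characters."""
    def repl(match):
        digits = match.group(1)
        if digits[0] in "xX":
            return chr(int(digits[1:], 16))  # hexadecimal form
        return chr(int(digits, 10))          # decimal form, as in this thread
    return re.sub(r"&#([xX][0-9a-fA-F]+|[0-9]+);", repl, text)


decoded = decode_numeric_refs(entities)
print(decoded == "привет питон")

# The standard library gives the same result without a hand-written regex.
print(html.unescape(entities) == decoded)

# Why the replace/eval approach misfires: 1087 read as hex is another character.
wrong = chr(0x1087)  # what u'\u1087' means
right = chr(1087)    # what &#1087; means (U+043F)
print(len(wrong.encode("utf-8")), len(right.encode("utf-8")))
```

Unlike the eval approach, this never executes text taken from a web page as Python code. Whether the decoded string then prints correctly still depends on the console encoding (cp866 in the session above).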