Once again a unicode question

Nicolas Evrard nicoe at nutellux.ath.cx
Sat Mar 26 17:17:57 EST 2005


Hello,

I'm puzzled by this test I made while trying to convert an HTML page
to plain text. I cannot pass a unicode string to feed(), nor a str,
so how can I do this?

.nicoe at smarties:~$ python2.4
.Python 2.4.1c2 (#2, Mar 19 2005, 01:04:19) 
.[GCC 3.3.5 (Debian 1:3.3.5-12)] on linux2
.Type "help", "copyright", "credits" or "license" for more information.
.>>> import formatter
.>>> import htmllib
.>>> html2txt = htmllib.HTMLParser(formatter.AbstractFormatter(formatter.DumbWriter()))
.>>> html2txt.feed(u'D\xe9but')
.Traceback (most recent call last):
.  File "<stdin>", line 1, in ?
.  File "/usr/lib/python2.4/sgmllib.py", line 95, in feed
.    self.goahead(0)
.  File "/usr/lib/python2.4/sgmllib.py", line 120, in goahead
.    self.handle_data(rawdata[i:j])
.  File "/usr/lib/python2.4/htmllib.py", line 65, in handle_data
.    self.formatter.add_flowing_data(data)
.  File "/usr/lib/python2.4/formatter.py", line 197, in add_flowing_data
.    self.writer.send_flowing_data(data)
.  File "/usr/lib/python2.4/formatter.py", line 421, in send_flowing_data
.    write(word)
.UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
.>>> html2txt.feed(u'D\xe9but'.encode('latin1'))
.Traceback (most recent call last):
.  File "<stdin>", line 1, in ?
.  File "/usr/lib/python2.4/sgmllib.py", line 94, in feed
.    self.rawdata = self.rawdata + data
.UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 1: ordinal not in range(128)
.>>> html2txt.feed('Début')
.Traceback (most recent call last):
.  File "<stdin>", line 1, in ?
.  File "/usr/lib/python2.4/sgmllib.py", line 94, in feed
.    self.rawdata = self.rawdata + data
.UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
.>>>
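
For comparison, here is a minimal sketch of the same HTML-to-text conversion. It assumes a current Python 3 interpreter, where htmllib and formatter are not available, so html.parser.HTMLParser is swapped in for them. The names TextExtractor and html_to_text are hypothetical, chosen only for illustration. The idea is to decode bytes once at the input boundary so the parser only ever sees text, and to use a fresh parser per document, since a parser whose feed() has already raised may still hold partially consumed input:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collect character data and ignore the tags."""

    def __init__(self):
        # convert_charrefs=True turns references such as &eacute; into text.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called with runs of text found between tags.
        self.chunks.append(data)


def html_to_text(raw, encoding="utf-8"):
    """Return the text content of an HTML fragment given as bytes or str."""
    # Decode exactly once, at the boundary; never mix bytes and text later.
    if isinstance(raw, bytes):
        raw = raw.decode(encoding)
    parser = TextExtractor()  # fresh parser: no state left from earlier calls
    parser.feed(raw)
    parser.close()            # flush any text still buffered by the parser
    return "".join(parser.chunks)
```

Calling html_to_text(data, "latin1") on Latin-1 bytes, or html_to_text(text) on an already decoded string, goes through the same code path. Encoding the result again is left to whatever writes the output.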

-- 
(°>  Nicolas Évrard
/ )  Liège - Belgique
^^


