Some <head> clauses cause BeautifulSoup to choke?
usenet at solar-empire.de
Tue Nov 20 20:40:05 CET 2007
Frank Stutzman <stutzman at skywagon.kjsl.com> wrote:
> Some kind person replied:
>> You have the same URL as both your good and bad example.
> Oops, dang emacs cut buffer (yeah, that's what did it). A working
> example URL would be (again, mind the wrap):
> Marc Christiansen <usenet at solar-empire.de> wrote:
>> The problem is this line:
>> <META http-equiv="Content-Type" content="text/html; charset=UTF-16">
>> Which is wrong: the content is not UTF-16 encoded. The line after that
>> declares the charset as UTF-8, which is correct, although ASCII would be
>> fine too.
> Ah, er, hmmm. Take a look at the 'good' URL I mentioned above. You will
> notice that it has the same utf-16/utf-8 declarations that the 'bad' one
> has. And BeautifulSoup works great on it.
> I'm still scratchin' ma head...
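The mismatch the earlier reply describes can be reproduced with the plain
stdlib, no BeautifulSoup required: bytes that are really ASCII/UTF-8 either
blow up or turn to garbage when decoded as UTF-16. A minimal sketch (the
short byte strings stand in for the much larger page content):

```python
# Stdlib-only sketch of why a bogus charset=UTF-16 declaration misleads
# a decoder when the bytes are really ASCII/UTF-8.
odd = b"<html><head>\n"   # 13 bytes: an odd byte count cannot be valid UTF-16
even = b"<html><head>"    # 12 bytes: even, so a UTF-16 decode "succeeds"

text = odd.decode("utf-8")            # the true encoding decodes cleanly
print(repr(text))

try:
    odd.decode("utf-16")
except UnicodeDecodeError as e:       # "truncated data", as in the traceback
    print("utf-16 decode failed:", e.reason)

garbage = even.decode("utf-16")       # no error, but the result is nonsense
print(garbage != even.decode("utf-8"))  # True: silently wrong text
```

Whether the bogus UTF-16 decode raises or silently succeeds depends only on
details like the total byte count, which would explain why two pages with
identical META tags behave differently.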
>>> s = bad.decode("utf-16")
>>> s = good.decode("utf-16")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/encodings/utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 41176: truncated data
Here bad contains the content of the 'bad' URL and good the content of the
'good' URL. Because of the UnicodeDecodeError, BeautifulSoup gives up on
UTF-16 for the 'good' page and tries either the next encoding or the next
detection step from the URL below, ending up with the right one. The 'bad'
page, by contrast, decodes as UTF-16 without raising, so BeautifulSoup
presumably sticks with the bogus encoding and chokes on the result.
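That try-the-declared-encoding-then-fall-back behaviour can be sketched in a
few lines. decode_with_fallback is an invented name for illustration;
BeautifulSoup's UnicodeDammit does considerably more than this:

```python
# Hypothetical helper mirroring the fallback behaviour described above:
# try each candidate encoding in turn and keep the first that decodes
# cleanly. (UnicodeDammit's real detection logic is more involved.)
def decode_with_fallback(data, candidates):
    for enc in candidates:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue                 # truncated/invalid data: try the next one
    raise ValueError("no candidate encoding decoded the data")

# The 'good' page behaves like this: UTF-16 raises, so UTF-8 gets used.
text, used = decode_with_fallback(b"<html><head>\n", ["utf-16", "utf-8"])
print(used)   # utf-8
```

With even-length data the UTF-16 attempt never raises, so the helper stops
at the wrong encoding, which is the 'bad' page's failure mode.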
>> <http://www.crummy.com/software/BeautifulSoup/documentation.html#Beautiful Soup Gives You Unicode, Dammit>)
> Much appreciated, all the comments so far.
More information about the Python-list mailing list