Some <head> clauses cause BeautifulSoup to choke?

Frank Stutzman stutzman at
Tue Nov 20 17:07:40 CET 2007

Some kind person replied:
> You have the same URL as both your good and bad example.

Oops, dang emacs cut buffer (yeah, that's what did it).  A working
example URL would be (again, mind the wrap):

Marc Christiansen <usenet at> wrote:

> The problem is this line:
> <META http-equiv="Content-Type" content="text/html; charset=UTF-16">
> Which is wrong. The content is not utf-16 encoded. The line after that
> declares the charset as utf-8, which is correct, although ascii would be
> ok too.

Ah, er, hmmm.  Take a look at the 'good' URL I mentioned above.  You will
notice that it has the same utf-16/utf-8 declarations that the 'bad' one
has.  And BeautifulSoup works great on it.

I'm still scratchin' ma head...

> If I save the search result and remove this line, everything works. So,
> you could:
> - ignore problematic pages

Not an option for my application.
> - save and edit them, then reparse them (not always practical)

That's what I'm doing at the moment during development.  Sure
seems inelegant.
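The hand edit can at least be automated: a minimal stdlib sketch (the helper name and regex are mine, not anything from BeautifulSoup) that drops any META line declaring charset=UTF-16 before the page is reparsed, so the correct utf-8 declaration is the first one left:

```python
import re

# Match a META tag whose content declares charset=UTF-16 (any case),
# plus any trailing whitespace/newline, so the line disappears cleanly.
META_UTF16 = re.compile(r'<META[^>]*charset=UTF-16[^>]*>\s*', re.IGNORECASE)

def strip_bogus_meta(html):
    """Hypothetical helper: remove the mis-declared UTF-16 META tag."""
    return META_UTF16.sub('', html)

page = ('<head>\n'
        '<META http-equiv="Content-Type" content="text/html; charset=UTF-16">\n'
        '<META http-equiv="Content-Type" content="text/html; charset=utf-8">\n'
        '</head>')

cleaned = strip_bogus_meta(page)
print('UTF-16' in cleaned)  # False: the bad declaration is gone
print('utf-8' in cleaned)   # True: the good one survives
```

Feeding `cleaned` (instead of the raw file) to BeautifulSoup is the same fix as the manual edit, just without the editor.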

> - use the fromEncoding argument:
> soup=BeautifulSoup.BeautifulSoup(ifile, fromEncoding="utf-8")
> (or 'ascii'). Of course this only works if you guess/predict the
> encoding correctly ;) Which can be difficult. Since BeautifulSoup uses
> "an encoding discovered in the document itself" (quote from
> <Beautiful Soup Gives You Unicode, Dammit>)

I'll try that.  For what I'm doing it ought to be safe enough.  
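For anyone following along, here is a toy sketch of why the override helps. This is not BeautifulSoup's actual detection code (its real logic lives in "Unicode, Dammit" and is more forgiving); the function and regex are mine. It just shows that when detection takes the first META charset it finds, the wrong utf-16 declaration wins, while an explicit encoding (the role fromEncoding plays) short-circuits detection entirely:

```python
import re

CHARSET = re.compile(r'charset=([\w-]+)', re.IGNORECASE)

def sniff_encoding(html, override=None):
    """Crude stand-in for document-based detection: an explicit
    encoding wins; otherwise the FIRST declared charset does,
    even when it is wrong."""
    if override:
        return override
    m = CHARSET.search(html)
    return m.group(1).lower() if m else 'ascii'

head = ('<META http-equiv="Content-Type" content="text/html; charset=UTF-16">\n'
        '<META http-equiv="Content-Type" content="text/html; charset=utf-8">')

print(sniff_encoding(head))                    # utf-16: first declaration wins
print(sniff_encoding(head, override='utf-8'))  # utf-8: explicit guess wins
```

Which is why the override is only as safe as the guess: it works here because the pages really are utf-8 (or plain ascii) despite what the first META claims.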

Much appreciate all the comments so far.

Frank Stutzman
Bonanza N494B     "Hula Girl"
Boise, ID

More information about the Python-list mailing list