Some <head> clauses cause BeautifulSoup to choke?
Frank Stutzman
stutzman at skywagon.kjsl.com
Tue Nov 20 11:07:40 EST 2007
Some kind person replied:
> You have the same URL as both your good and bad example.
Oops, dang emacs cut buffer (yeah, that's what did it). A working
example URL would be (again, mind the wrap):
http://www.naco.faa.gov/digital_tpp_search.asp?fldIdent=ksfo&fld_ident_type=ICAO&ver=0711&bnSubmit=Complete+Search
Marc Christiansen <usenet at solar-empire.de> wrote:
> The problem is this line:
> <META http-equiv="Content-Type" content="text/html; charset=UTF-16">
>
> Which is wrong. The content is not utf-16 encoded. The line after that
> declares the charset as utf-8, which is correct, although ascii would be
> ok too.
Ah, er, hmmm. Take a look at the 'good' URL I mentioned above. You will
notice that it has the same UTF-16/UTF-8 declarations that the 'bad' one
has. And BeautifulSoup works great on it.
I'm still scratchin' ma head...
> If I save the search result and remove this line, everything works. So,
> you could:
> - ignore problematic pages
Not an option for my application.
> - save and edit them, then reparse them (not always practical)
That's what I'm doing at the moment during my development. Sure
seems inelegant.
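If it helps, that hand-editing step could be automated: strip the bogus
UTF-16 META tag from the page text before handing it to BeautifulSoup. This
is only a sketch; the function name and regex are mine, not anything from
BeautifulSoup itself.

```python
import re

# Hypothetical helper: remove any META tag that (wrongly) declares
# charset=UTF-16, leaving the correct utf-8 declaration in place.
def strip_bad_meta(html):
    # [^>]* cannot cross a '>', so the match stays inside one tag.
    return re.sub(r'<META[^>]*charset=UTF-16[^>]*>\s*', '', html,
                  flags=re.IGNORECASE)

page = ('<html><head>'
        '<META http-equiv="Content-Type" content="text/html; charset=UTF-16">'
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        '</head><body>...</body></html>')
cleaned = strip_bad_meta(page)
assert 'UTF-16' not in cleaned
assert 'utf-8' in cleaned
```

Less elegant than fixing BeautifulSoup's sniffing, but it keeps the fetch
and parse in one script instead of a save/edit/reparse cycle.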
> - use the fromEncoding argument:
> soup=BeautifulSoup.BeautifulSoup(ifile, fromEncoding="utf-8")
> (or 'ascii'). Of course this only works if you guess/predict the
> encoding correctly ;) Which can be difficult. Since BeautifulSoup uses
> "an encoding discovered in the document itself" (quote from
> <http://www.crummy.com/software/BeautifulSoup/documentation.html#Beautiful Soup Gives You Unicode, Dammit>)
I'll try that. For what I'm doing it ought to be safe enough.
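For anyone following along, here is a small stdlib-only sketch of why that
first META line causes trouble. The page bytes are really UTF-8 (plain
ASCII here), but the tag claims UTF-16; a sniffer that trusts the first
declaration fuses every pair of ASCII bytes into one bogus UTF-16 code
unit, while forcing the right encoding (what fromEncoding does) is fine.
The HTML snippet is a made-up stand-in for the FAA page.

```python
# Bytes that are really ASCII/UTF-8, with a lying charset declaration first.
html_bytes = (b'<html><head>'
              b'<META http-equiv="Content-Type" content="text/html; charset=UTF-16">'
              b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
              b'</head><body>KSFO</body></html>')

# Trusting the first declaration: each pair of ASCII bytes becomes one
# garbage code point, so the real content is unrecognizable.
mojibake = html_bytes.decode('utf-16-le')
assert 'KSFO' not in mojibake

# Overriding with the correct encoding, as fromEncoding="utf-8" would:
text = html_bytes.decode('utf-8')
assert 'KSFO' in text
```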
Much appreciate all the comments so far.
--
Frank Stutzman
Bonanza N494B "Hula Girl"
Boise, ID