Some <head> clauses cause BeautifulSoup to choke?

Marc Christiansen usenet at solar-empire.de
Mon Nov 19 22:29:51 CET 2007


Frank Stutzman <stutzman at skywagon.kjsl.com> wrote:
> I've got a simple script that looks like (watch the wrap):
> ---------------------------------------------------
> import BeautifulSoup,urllib
> 
> ifile = urllib.urlopen("http://www.naco.faa.gov/digital_tpp_search.asp?fldId
> ent=klax&fld_ident_type=ICAO&ver=0711&bnSubmit=Complete+Search").read()
> 
> soup=BeautifulSoup.BeautifulSoup(ifile)
> print soup.prettify()
> ----------------------------------------------------
> 
> and all I get out of it is garbage.

Same for me.

> I did some poking and prodding and it seems that there is something in the
> <head> clause that is causing the problem.  Heck if I can see what it is.

The problem is this line:
 <META http-equiv="Content-Type" content="text/html; charset=UTF-16">

That declaration is wrong: the content is not utf-16 encoded. The line
after it declares the charset as utf-8, which is correct, although ascii
would be ok too.
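
To illustrate why a false utf-16 declaration produces garbage, here is a
small sketch. The sample bytes are made up (they only stand in for the
real response), and picking "utf-16-le" as the byte order is my
assumption; a parser that trusts the META line does something similar:

```python
# Hypothetical sample: plain ASCII markup standing in for the real page.
html = b"<html><head><title>Search results</title></head><body>KLAX</body></html>"

# Decoded with the correct charset, the markup is readable.
as_utf8 = html.decode("utf-8")

# Decoded as UTF-16, every pair of ASCII bytes is fused into one
# unrelated character, so no tag survives for the parser to find.
as_utf16 = html.decode("utf-16-le", errors="replace")

print(repr(as_utf8[:20]))
print(repr(as_utf16[:10]))
```

Since ASCII text contains no NUL bytes, none of the fused characters can
fall back into the ASCII range, which is why the output looks like random
CJK text.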

If I save the search result and remove this line, everything works. So,
you could:
- ignore problematic pages
- save and edit them, then reparse them (not always practical)
- use the fromEncoding argument:
 soup=BeautifulSoup.BeautifulSoup(ifile, fromEncoding="utf-8")
(or 'ascii'). Of course this only works if you guess/predict the
encoding correctly ;) Which can be difficult. Since BeautifulSoup uses
"an encoding discovered in the document itself" (quote from
<http://www.crummy.com/software/BeautifulSoup/documentation.html#Beautiful Soup Gives You Unicode, Dammit>)
when the encoding you supply does not work, using fromEncoding="ascii"
should not hurt too much. But this being usenet, I'm sure someone will
tell me that I'm wrong and there is some weird 7-bit encoding in use
somewhere on the web...
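
The "save and edit" option can also be done in memory. What follows is
only a rough sketch under my own assumptions: the helper name
drop_bogus_utf16_meta is made up, and "no BOM and no NUL bytes" is just a
heuristic for "this is clearly not utf-16", not something BeautifulSoup
does for you:

```python
import re

# Matches a <meta ...> tag whose attributes mention charset=utf-16 (any case).
_UTF16_META = re.compile(
    br'<meta\b[^>]*charset\s*=\s*["\']?utf-16[^>]*>', re.IGNORECASE
)


def drop_bogus_utf16_meta(data):
    """Remove utf-16 META declarations from bytes that are clearly not utf-16.

    Hypothetical helper: real utf-16 text normally starts with a BOM or at
    least contains NUL bytes, so such input is returned unchanged.
    """
    has_bom = data.startswith((b"\xff\xfe", b"\xfe\xff"))
    if has_bom or b"\x00" in data:
        return data
    return _UTF16_META.sub(b"", data)


sample = (
    b'<head>\n'
    b'<META http-equiv="Content-Type" content="text/html; charset=UTF-16">\n'
    b'<META http-equiv="Content-Type" content="text/html; charset=utf-8">\n'
    b'</head>'
)
cleaned = drop_bogus_utf16_meta(sample)
print(cleaned.decode("ascii"))
```

The cleaned bytes could then be handed to BeautifulSoup in place of the
raw response, so only the correct utf-8 declaration is left for it to
find.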

> I'm new to BeautifulSoup (heck, I'm new to python).  If I'm doing something
> dumb, you don't need to be gentle.

No, you did nothing dumb. The server sent you broken content. 

Ciao
  Marc


