[Baypiggies] HTML code sets

Chris Clark Chris.Clark at ingres.com
Fri Oct 26 20:37:12 CEST 2007


Max Slimmer wrote:
> I am reading some raw HTML that contains things like: 
>
> "enforcing the nation\xe2\x80\x99s laws" 
>
> and I need to know what incantation to apply to translate the xe2,x80,x99
> into some kind of apostrophe char. I can initialize this string as str or
> unicode.
>
> The headers are:
> '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html
> xmlns="http://www.w3.org/1999/xhtml">\n<head>\n<meta
> http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />\n
>   

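The headers are lying to you: the content is UTF-8, not ISO-8859-1. The
bytes \xe2\x80\x99 are the UTF-8 encoding of U+2019, the right single
quotation mark.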

    x="enforcing the nation\xe2\x80\x99s laws"
    print x.decode('utf8')
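
The snippet above is Python 2. As a rough sketch of the same idea under
Python 3 (an assumption on my part, since str and bytes are separate types
there), something like the following should work. The names raw, text,
mangled and repaired are illustrative only, and the bytes literal stands in
for whatever was actually read from the page. The second half assumes the
data had already been decoded using the declared ISO-8859-1 charset.

```python
# Python 3 sketch: the bytes literal stands in for the raw page data.
raw = b"enforcing the nation\xe2\x80\x99s laws"

# Decode the bytes as UTF-8, ignoring the charset the page declares.
text = raw.decode("utf-8")
print(ascii(text))  # 'enforcing the nation\u2019s laws'

# If the bytes were already decoded as ISO-8859-1 (as the header claims),
# the mistake can be reversed: ISO-8859-1 maps every byte to exactly one
# code point, so encoding back recovers the original bytes unchanged.
mangled = raw.decode("iso-8859-1")
repaired = mangled.encode("iso-8859-1").decode("utf-8")
print(repaired == text)  # True
```

The round trip only works this cleanly because the wrong decoding was
ISO-8859-1; other mis-decodings may lose bytes and not be reversible.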

Try using http://www.crummy.com/software/BeautifulSoup/ instead of
parsing it by hand; it _should_ protect you from problems like this.


Chris


