[OT] does the charset lie?

Skip Montanaro skip at pobox.com
Sun May 2 13:25:50 EDT 2004


    >> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
    ...
    >> &#8217;
    >> is the charset correct or should it have been utf-8?

    David> The charset is correct.  "&" "#" "8" etc. are all in iso-8859-1.

I realized that about five minutes after posting.  The Content-Type header
is just for the purposes of HTTP: it describes how the bytes are encoded,
not which characters an entity may refer to.  OTOH, this means if I need
the raw content of the page (after expanding any entities), I need to do
something like (assuming the raw bytes are already in data):

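    # Python 2: decode the raw bytes, then re-encode them as UTF-8
    data = unicode(data, "iso-8859-1").encode("utf-8")
    # map_entities_to_utf_8 is a helper that is not defined here
    data = map_entities_to_utf_8(data)
    # decode the UTF-8 bytes into a unicode object
    data = unicode(data, "utf-8")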
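
For comparison, here is a minimal sketch of the same idea in Python 3,
where the round trip through UTF-8 is no longer needed.  It assumes the
charset really is iso-8859-1 and that expanding every character reference
in the text is what you want (it does not parse the HTML).  The name
decode_page is made up for illustration; html.unescape is from the
standard library.

```python
import html


def decode_page(data: bytes, charset: str = "iso-8859-1") -> str:
    """Decode raw page bytes, then expand character references.

    decode_page is a hypothetical helper, not part of any library.
    """
    # Step 1: turn the bytes into text using the declared charset.
    text = data.decode(charset)
    # Step 2: expand references such as &#8217; or &amp; into the
    # Unicode characters they name, independent of the charset.
    return html.unescape(text)


raw = b"it&#8217;s"
result = decode_page(raw)
```

Here result is a str containing U+2019 in place of the numeric reference,
even though that character has no encoding in iso-8859-1.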

Skip



