[Tutor] unicode and character sets

tpc247 at gmail.com
Thu Aug 16 11:56:54 CEST 2007


Dear fellow Python enthusiasts,

I recently wrote a script that grabs a file containing a list of ISO-defined
countries and creates an HTML select element.  That's all well and good, and
everything seems to work fine, except for one little nagging problem:

http://en.wikipedia.org/wiki/Aland_Islands

I use the Wikipedia URL because I'm conscious of the fact that people
reading this email might not be able to see the character I am having
trouble displaying correctly, the LATIN CAPITAL LETTER A WITH RING ABOVE
character (U+00C5).  After reading the following article:

http://www.joelonsoftware.com/articles/Unicode.html

I realize the following: It does not make sense to have a string without
knowing what encoding it uses.  There is no such thing as plain text.

Ok.  Fine.  In Mozilla, by clicking on View, Character Encoding, I find out
that the text in the file I grab from:

http://www.iso.ch/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/index.html

is encoded in ISO-8859-1.  So I go about changing Python's default encoding
according to:

http://www.diveintopython.org/xml_processing/unicode.html

and voila:

>>> import sys
>>> sys.getdefaultencoding()
'iso-8859-1'
>>>

BUT the LATIN CAPITAL LETTER A WITH RING ABOVE character still displays in
IDLE as \xc5!  I can get the character to display correctly if I type:

print "\xc5"

which is fine if I am simply going to copy and paste the select element into
my HTML file.  However, I want to be able to dynamically generate the HTML
form page and have the character in question display correctly in the web
browser.  In case you're wondering, I've already done my due diligence to
ensure the character set is ISO-8859-1 in my web server as well as in the
HTML file:

- in my html file, I put in:
    <head><meta http-equiv="content-type" content="text/html; charset=ISO-8859-1"></head>
- I restarted apache after changing httpd.conf to add the line:
    AddDefaultCharset ISO-8859-1

The problem, of course, is that if I run my script that creates the select
element in IDLE I continue to see the output:

<option value='AX'>\xc5land Islands</option>

Am I doing something wrong?
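
For reference, a minimal sketch of what the \xc5 escape stands for. It
assumes Python 3 syntax (the script in this message targets Python 2), and
the byte string `raw` is a hypothetical stand-in for one name as it arrives
from the ISO-8859-1 file. An interactive shell echoes a value using repr(),
which shows a non-ASCII byte as an escape, so seeing \xc5 there does not by
itself mean the data is wrong.

```python
# Hypothetical sample: one country name as ISO-8859-1 bytes.
raw = b"\xc5land Islands"

# repr() is what an interactive prompt uses to echo a value; it writes
# the non-ASCII byte 0xC5 as the escape sequence \xc5.
shown_by_shell = repr(raw)

# Decoding with the known encoding turns the bytes into text, where the
# first character is U+00C5 (LATIN CAPITAL LETTER A WITH RING ABOVE).
text = raw.decode("iso-8859-1")

print(shown_by_shell)        # b'\xc5land Islands'
print(hex(ord(text[0])))     # 0xc5
```

The same byte is present in both views; only the way it is displayed
differs.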

def create_bidirectional_dicts_from_latest_ISO_countries():
    import urllib
    ISO3166_FILE_URL = "http://www.iso.ch/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/list-en1-semic.txt"
    a2_to_name = {}
    name_to_a2 = {}
    file_obj = urllib.urlopen(ISO3166_FILE_URL)
    for line in file_obj:
        if line.startswith("This list") or line.isspace():
            pass
        else:
            a_list = line.split(';')
            ISO_name = a_list[0].title()
            ISO_a2 = a_list[1].strip()
            a2_to_name[ISO_a2] = ISO_name
            name_to_a2[ISO_name] = ISO_a2
    file_obj.close()
    return a2_to_name, name_to_a2

def create_select_element_from_dict(name, a_dict, default_value=None):
    parent_wrapper = "<select name='%s'>%s</select>"
    child_wrapper = "\t<option value=''>Please select one</option>\n%s"
    element_template = "\t<option value='%s'>%s</option>\n"
    default_element = "\t<option value='%s' selected='yes'>%s</option>\n"
    a_str = ""
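    for key in sorted(a_dict.keys()):
        if default_value and a_dict[key] == default_value:
            a_str = a_str + default_element % (default_value, key)
        else:
            # emit the plain option only when this is not the default,
            # otherwise the default country would be listed twice
            a_str = a_str + element_template % (a_dict[key], key)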
    c_w_instance = child_wrapper % a_str
    return parent_wrapper % (name, c_w_instance)

a2_to_name, name_to_a2 = create_bidirectional_dicts_from_latest_ISO_countries()
a_select = create_select_element_from_dict("country", name_to_a2)
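
One way to keep the encoding explicit from input to output, sketched under
some assumptions: Python 3 syntax rather than the Python 2 of the script
above, a hypothetical `sample` list standing in for the downloaded file, and
two illustrative helpers (`parse_country_lines`, `build_select`) that are not
part of the original script. The bytes are decoded once on the way in and
encoded once on the way out, using the same charset the page declares.

```python
import html

def parse_country_lines(raw_lines, encoding="iso-8859-1"):
    """Map country name -> alpha-2 code from 'NAME;A2' byte lines."""
    name_to_a2 = {}
    for raw in raw_lines:
        line = raw.decode(encoding)      # bytes -> text, encoding stated
        if ";" not in line:              # skips the header and blank lines
            continue
        name, a2 = line.split(";", 1)
        name_to_a2[name.strip().title()] = a2.strip()
    return name_to_a2

def build_select(name, name_to_a2):
    """Build the <select> markup as text (not yet encoded)."""
    options = ["\t<option value=''>Please select one</option>"]
    for country in sorted(name_to_a2):
        options.append("\t<option value='%s'>%s</option>" % (
            html.escape(name_to_a2[country], quote=True),
            html.escape(country)))
    return "<select name='%s'>\n%s\n</select>" % (name, "\n".join(options))

# Hypothetical sample data in place of the real download.
sample = [
    b"This list states the country names\r\n",
    b"\r\n",
    b"\xc5LAND ISLANDS;AX\r\n",
    b"ALBANIA;AL\r\n",
]

select_text = build_select("country", parse_country_lines(sample))

# text -> bytes, using the charset declared in the page's meta tag.
page_bytes = select_text.encode("iso-8859-1")
```

With this arrangement the byte 0xC5 appears in `page_bytes` exactly once per
occurrence of the character, which is what a browser told to expect
ISO-8859-1 would render as the intended letter.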