Encoding and Norwegian (non-ASCII) characters.
__peter__ at web.de
Sat Oct 7 23:59:01 CEST 2006
joakim.hove at gmail.com wrote:
> I am having great problems writing norwegian characters æøå to file
> from a python application. My (simplified) scenario is as follows:
> 1. I have a web form where the user can enter his name.
> 2. I use the cgi module to get the input from the user:
> name = form["name"].value
> 3. The name is stored in a file
> fileH = open(namefile , "a")
> fileH.write("name:%s \n" % name)
> Now, this works very well indeed as long as the users have 'ascii' names,
> however when someone enters a name with one of the Norwegian characters
> æøå - it breaks at the write() statement.
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x8f in position
> Now - I understand that the ascii codec can't be used to decode the
> particular characters, however my attempts of specifying an alternative
> encoding have all failed.
> I have tried variants along the line:
> fileH = codecs.open(namefile , "a" , "latin-1") / fileH =
> open(namefile , "a")
> fileH.write(name) / fileH.write(name.encode("latin-1"))
> It seems *whatever* I do the Python interpreter fails to see my plea
> for an alternative encoding, and fails with the dreaded
> UnicodeDecodeError.
> Any tips on this would be *highly* appreciated.
The approach with codecs.open() should succeed
>>> import codecs
>>> out = codecs.open("tmp.txt", "a", "latin1")
>>> out.write("\xc3\xa6")  # non-ASCII byte str; first byte 0xc3 as in the traceback
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/local/lib/python2.4/codecs.py", line 501, in write
  File "/usr/local/lib/python2.4/codecs.py", line 178, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0:
ordinal not in range(128)
provided that you write only unicode strings with characters in the range
unichr(0)...unichr(255) and normal strs in the range chr(0)...chr(127).
You have to decode non-ASCII strs with the appropriate encoding (which only
you know) before feeding them to write():
>>> out.write(unicode("\xe6\xf8\xe5", "latin1"))
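
To make the decode-then-write step concrete, here is a minimal sketch of the
scenario from the question. It is written in syntax that also runs on
Python 3, so it uses io.open() in place of codecs.open(); raw_name is a
hypothetical stand-in for form["name"].value, and the assumption that the
browser sent Latin-1 bytes is mine, not something the original post states.

```python
import io
import os
import tempfile

# Hypothetical form input: "æøå" as Latin-1 bytes (assumed encoding).
raw_name = b"\xe6\xf8\xe5"

# This mimics what Python 2 does implicitly when a byte str is handed to
# an encoding writer: it first tries to decode the bytes as ASCII.
try:
    raw_name.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = str(exc)

# The fix: decode explicitly with the encoding the bytes really use.
name = raw_name.decode("latin-1")

# Write the text through a file object that encodes on the way out.
# newline="\n" keeps the line ending identical on every platform.
path = os.path.join(tempfile.mkdtemp(), "names.txt")
with io.open(path, "a", encoding="latin-1", newline="\n") as out:
    out.write(u"name:%s \n" % name)

# Read the raw bytes back to see what actually landed in the file.
with io.open(path, "rb") as f:
    written = f.read()
```

If the form data really arrives as UTF-8, only the argument to decode()
changes; the file encoding can be chosen independently of the input encoding.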
If there are unicode code points beyond unichr(255) you have to change the
encoding in codecs.open(), typically to UTF-8.
# raises UnicodeEncodeError: u"\u1234" has no latin1 representation
codecs.open("tmp.txt", "a", "latin1").write(u"\u1234")
# works
codecs.open("tmp.txt", "a", "utf8").write(u"\u1234")
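
If you are unsure whether a given string fits the file encoding, you can
check before writing. The following is only a sketch; fits() is a helper
name I made up for illustration, and the code is written so that it also
runs on Python 3.

```python
def fits(text, encoding):
    """Return True if every character of text can be encoded with encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

norwegian = u"\xe6\xf8\xe5"    # æøå, all below unichr(256)
beyond_latin1 = u"\u1234"      # a code point above unichr(255)

norwegian_in_latin1 = fits(norwegian, "latin-1")
beyond_in_latin1 = fits(beyond_latin1, "latin-1")
beyond_in_utf8 = fits(beyond_latin1, "utf-8")

# The UTF-8 encoding of the character that latin1 cannot represent.
utf8_bytes = beyond_latin1.encode("utf-8")
```

A check like this lets you fall back to UTF-8 only when it is needed, though
simply opening the file as UTF-8 from the start is usually less trouble.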