Encoding conundrum

Ian Kelly ian.g.kelly at gmail.com
Tue Nov 20 18:03:14 EST 2012


On Tue, Nov 20, 2012 at 2:49 PM, Daniel Klein <danielkleinad at gmail.com> wrote:
> With the assistance of this group I am understanding unicode encoding issues
> much better; especially when handling special characters that are outside of
> the ASCII range. I've got my application working perfectly now :-)
>
> However, I am still confused as to why I can only use one specific encoding.
>
> I've done some research and it appears that I should be able to use any of
> the following codecs with codepoints '\xfc' (chr(252)) '\xfd' (chr(253)) and
> '\xfe' (chr(254)) :

These refer to the characters with *Unicode* codepoints 252, 253, and 254:

>>> import unicodedata
>>> unicodedata.name('\xfc')
'LATIN SMALL LETTER U WITH DIAERESIS'
>>> unicodedata.name('\xfd')
'LATIN SMALL LETTER Y WITH ACUTE'
>>> unicodedata.name('\xfe')
'LATIN SMALL LETTER THORN'

> ISO-8859-1   [ note that I'm using this codec on my Linux box ]

For ISO 8859-1, these characters happen to exist and even correspond
to the same ordinals: 252, 253, and 254 (this is by design); so there
is no problem encoding them, and the resulting bytes even happen to
match the codepoints of the characters.
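
If you want to check that for yourself, here is a minimal sketch (the
variable names are just mine) that compares each character's code point
with the single byte ISO 8859-1 produces for it:

```python
# Compare each character's Unicode code point with the byte value it
# gets when encoded as ISO 8859-1 (Latin-1).
chars = '\xfc\xfd\xfe'
pairs = [(ord(c), c.encode('iso-8859-1')[0]) for c in chars]

for codepoint, byte in pairs:
    # The third column is True when the byte equals the code point.
    print(codepoint, byte, codepoint == byte)
```

This identity only holds for code points 0 through 255, and only for
this one codec.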

> cp1252

cp1252 is modeled on ISO 8859-1 and maps those same three bytes to the
same three characters:

>>> for char in b'\xfc\xfd\xfe'.decode('cp1252'):
...     print(unicodedata.name(char))
...
LATIN SMALL LETTER U WITH DIAERESIS
LATIN SMALL LETTER Y WITH ACUTE
LATIN SMALL LETTER THORN

> latin1

Latin-1 is just another name for ISO 8859-1.

> utf-8

UTF-8 is a *multi-byte* encoding.  It can encode any Unicode
character, so you can represent those three characters in UTF-8, but
with a different (and longer) byte sequence:

>>> print('\xfc\xfd\xfe'.encode('utf8'))
b'\xc3\xbc\xc3\xbd\xc3\xbe'
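
To make the length difference concrete, a small sketch (variable names
are mine) that encodes the same three characters both ways and checks
that UTF-8 round-trips:

```python
text = '\xfc\xfd\xfe'

latin1_bytes = text.encode('latin-1')   # one byte per character
utf8_bytes = text.encode('utf-8')       # two bytes per character here

print(len(text), len(latin1_bytes), len(utf8_bytes))

# Decoding with the codec that produced the bytes gives the text back.
round_trip = utf8_bytes.decode('utf-8')
print(round_trip == text)
```

Decoding the UTF-8 bytes with a different codec would not fail loudly
in most 8-bit codecs; it would just give you the wrong characters.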

> cp437

cp437 is another 8-bit encoding, but it maps entirely different
characters to those three bytes:

>>> for char in b'\xfc\xfd\xfe'.decode('cp437'):
...     print(unicodedata.name(char))
...
SUPERSCRIPT LATIN SMALL LETTER N
SUPERSCRIPT TWO
BLACK SQUARE

As it happens, the character at codepoint 252 (that's LATIN SMALL
LETTER U WITH DIAERESIS) does exist in cp437.  It maps to the byte
0x81:

>>> '\xfc'.encode('cp437')
b'\x81'

The other two Unicode characters, at codepoints 253 and 254, do not
exist at all in cp437 and cannot be encoded.
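
Here is a rough sketch of what that looks like in practice.  The
try_encode helper is something I made up for illustration, not part of
any library; the errors='replace' argument is the standard str.encode
option and is lossy:

```python
def try_encode(text, codec):
    """Return the encoded bytes, or None if a character has no mapping."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None

# U WITH DIAERESIS has a slot in cp437; Y WITH ACUTE does not.
print(try_encode('\xfc', 'cp437'))
print(try_encode('\xfd', 'cp437'))

# Lossy fallback: unencodable characters are replaced with '?'.
lossy = '\xfc\xfd\xfe'.encode('cp437', errors='replace')
print(lossy)
```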

> If I'm not mistaken, all of these codecs can handle the complete 8bit
> character set.

There is no "complete 8bit character set".  cp1252, Latin1, and cp437
are all 8-bit character sets, but they're *different* 8-bit character
sets with only partial overlap.
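
One way to see the partial overlap is to collect the characters each
codec assigns to the bytes 0x80 through 0xFF and intersect the sets.
This is only a sketch; repertoire is a hypothetical helper of mine, and
it silently skips bytes a codec leaves undefined:

```python
def repertoire(codec):
    """Characters the codec assigns to bytes 0x80-0xFF.

    Bytes the codec leaves undefined are skipped via errors='ignore'.
    """
    high_bytes = bytes(range(0x80, 0x100))
    return set(high_bytes.decode(codec, errors='ignore'))

latin1 = repertoire('latin-1')
cp1252 = repertoire('cp1252')
cp437 = repertoire('cp437')

# Sizes of the pairwise overlaps between the three repertoires.
print(len(latin1 & cp1252), len(latin1 & cp437), len(cp1252 & cp437))

# Which of the three characters in question does each codec have?
for name, chars in [('latin-1', latin1), ('cp1252', cp1252), ('cp437', cp437)]:
    print(name, [c in chars for c in '\xfc\xfd\xfe'])
```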

> However, on Windows 7, I am only able to use 'cp437' to display (print) data
> with those characters in Python. If I use any other encoding, Windows laughs
> at me with this error message:
>
>   File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
>     return codecs.charmap_encode(input,self.errors,encoding_map)[0]
> UnicodeEncodeError: 'charmap' codec can't encode character '\xfd' in
> position 3: character maps to <undefined>

It would be helpful to see the code you're running that causes this error.

> Furthermore I get this from IDLE:
>
> >>> import locale
> >>> locale.getdefaultlocale()
> ('en_US', 'cp1252')
>
> I also get 'cp1252' when running the same script from a Windows command
> prompt.
>
> So there is a contradiction between the error message and the default
> encoding.

If you're printing to stdout, it's going to use the encoding
associated with stdout, which does not necessarily have anything to do
with the default locale.  Use this to determine what character set you
need to be working in if you want your data to be printable:

>>> import sys
>>> sys.stdout.encoding
'cp437'
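
Building on that, here is a sketch of how you might test a string
against stdout's encoding before printing it.  Both function names are
hypothetical, and falling back to 'ascii' when stdout reports no
encoding is my own assumption:

```python
import sys

def stdout_encoding():
    """Encoding used by sys.stdout, or 'ascii' if it does not report one."""
    return getattr(sys.stdout, 'encoding', None) or 'ascii'

def printable_on_stdout(text):
    """True if every character in text can be encoded for sys.stdout."""
    try:
        text.encode(stdout_encoding())
    except UnicodeEncodeError:
        return False
    return True

def safe_print(text):
    """Print text, showing unencodable characters as backslash escapes."""
    encoding = stdout_encoding()
    data = text.encode(encoding, errors='backslashreplace')
    print(data.decode(encoding))

print(printable_on_stdout('\xfc\xfd\xfe'))
safe_print('\xfc\xfd\xfe')
```

On a cp437 console the first call should report False, and safe_print
should show the two missing characters as \xfd and \xfe rather than
raising.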

> Why am I restricted from using just that one codec? Is this a Windows or
> Python restriction? Please enlighten me.

On Linux, your terminal encoding is probably either UTF-8 or Latin-1,
and either way it has no problems encoding that data for output.  In a
Windows cmd terminal, the default encoding is the OEM code page, which
on a US-English system is cp437, and cp437 can't represent two of the
three characters you mentioned above.  So the restriction comes from
the console's encoding, not from Python itself.


