Python 3.1.1 bytes decode with replace bug
Mark Tolonen
metolone+gmane at gmail.com
Mon Oct 26 00:54:13 EDT 2009
"Dave Angel" <davea at ieee.org> wrote in message
news:4AE43150.9010901 at ieee.org...
> Joe wrote:
>>> For the reason BK explained, the important difference is that I ran in
>>> the IDLE shell, which handles screen printing of unicode better ;-)
>>>
>>
>> Something still does not seem right here to me.
>>
>> In the example above the bytes were decoded to 'UTF-8' with the
>>
> *nope* you're decoding FROM utf-8 to unicode.
>> replace option so any characters that were not UTF-8 were replaced and
>> the resulting string is '\ufffdabc' as BK explained. I understand
>> that the replace worked.
>>
>> Now consider this:
>>
>> Python 3.1.1 (r311:74483, Aug 17 2009, 16:45:59) [MSC v.1500 64 bit (AMD64)] on win32
>> Type "help", "copyright", "credits" or "license" for more information.
>>
>> >>> s = '\ufffdabc'
>> >>> print(s)
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>>   File "p:\SW64\Python.3.1.1\lib\encodings\cp437.py", line 19, in encode
>>     return codecs.charmap_encode(input,self.errors,encoding_map)[0]
>> UnicodeEncodeError: 'charmap' codec can't encode character '\ufffd' in position 0: character maps to <undefined>
>>
>> >>> import sys
>> >>> sys.getdefaultencoding()
>> 'utf-8'
>>
>> This too fails for the exact same reason (and doesn't involve decode).
>>
>> In the original example I decoded to UTF-8 and in this example the
>> default encoding is UTF-8 so why is cp437 being used?
>>
>> Thanks in advance for your assistance!
>>
>>
>>
> Benjamin had it right, but you still don't understand what he said.
>
> The problem in your original example, and in the current one, is not in
> decode(), but in encode(), which is implicitly called by print(), when
> needed to convert from Unicode to some byte format of the console. Take
> your original example:
>
> >>> b'\x80abc'.decode('utf-8', 'replace')
>
>
> The decode() is explicit, and converts *FROM* a utf-8 byte string to a
> unicode one. But since you're running at the interactive prompt, there's
> an implicit print, which converts the unicode result into whatever your
> console encoding is. That calls encode() (or one of its variants,
> charmap_encode()) on the unicode string. There is no relationship between
> the two steps.
>
> In your current example, you're explicitly doing the print(), but still
> have the same implicit encoding to cp437, which gets the equivalent error.
> That's the encoding that your Python 3.x is choosing for the stdout
> console, based on country-specific Windows settings. On US-English Windows
> the console code page is normally cp437. I don't know how to override it generically,
> but I know it's possible to replace stdout with a wrapper that does your
> preferred encoding. You probably want to keep cp437, but change the error
> handling to ignore. Or if this is a one-time problem, I suspect you could
> do the encoding manually, to a byte array, then print that.
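The wrapper and the manual encoding Dave mentions could look roughly like
this. It is only a sketch: it assumes the console encoding is cp437, as in
Joe's traceback, and it writes to an in-memory io.BytesIO standing in for the
console so the resulting bytes can be inspected. The name lenient_writer is
made up for illustration; it is not part of the standard library.

```python
import io

def lenient_writer(raw, encoding="cp437", errors="replace"):
    """Hypothetical helper: a text layer over a byte stream that
    substitutes unencodable characters instead of raising."""
    return io.TextIOWrapper(raw, encoding=encoding, errors=errors)

# Step 1: explicit decode, bytes -> str. The bad byte becomes U+FFFD.
s = b'\x80abc'.decode('utf-8', 'replace')

# Manual route: do the encode step yourself and pick the error handler.
manual = s.encode('cp437', 'replace')

# Wrapper route: same encoding, but the stream never raises on output.
fake_console = io.BytesIO()          # stand-in for the real console
out = lenient_writer(fake_console)
print(s, end='', file=out)
out.flush()
wrapped = fake_console.getvalue()

# For the real console the same idea would be (assumption: stdout has
# a .buffer attribute, which it does for a normal console session):
#   sys.stdout = lenient_writer(sys.stdout.buffer)
```

Both routes keep cp437 and only change what happens to characters it cannot
represent; with 'replace' the encoder substitutes '?', and with 'ignore' the
character would be dropped instead.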
You can also replace the Unicode replacement character U+FFFD with a valid
cp437 character before displaying it:
>>> b'\x80abc'.decode('utf8','replace').replace('\ufffd','?')
'?abc'
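If other characters outside cp437 might turn up as well, the same idea can be
generalized by round-tripping the string through the target encoding. This is
a rough sketch rather than a recommended recipe; console_safe is a
hypothetical helper name, and falling back to 'ascii' when stdout reports no
encoding is my own assumption.

```python
import sys

def console_safe(text, encoding=None):
    """Hypothetical helper: return text with every character the target
    encoding cannot represent replaced by '?'."""
    enc = encoding or getattr(sys.stdout, 'encoding', None) or 'ascii'
    # Encode with 'replace' so nothing raises, then decode back to str.
    return text.encode(enc, 'replace').decode(enc)

s = b'\x80abc'.decode('utf8', 'replace')
print(console_safe(s, 'cp437'))
```

Note that print() uses sys.stdout.encoding, not sys.getdefaultencoding(),
which is why the 'utf-8' default in Joe's session had no effect on the error.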
-Mark