UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to <undefined>

Thu Jan 29 16:58:38 EST 2009

On Thu, Jan 29, 2009 at 4:19 PM, John Machin <sjmachin at lexicon.net> wrote:

> Benjamin Kaplan <bsk16 <at> case.edu> writes:
>
> > On Thu, Jan 29, 2009 at 12:09 PM, Anjanesh Lekshminarayanan
> > <mail <at> anjanesh.net> wrote:
> > > It does auto-detect it as cp1252 - look at the files in the traceback and
> > > you'll see lib\encodings\cp1252.py. Since cp1252 seems to be the wrong
> > > encoding, try opening it as utf-8 or latin1 and see if that fixes it.
>
> Benjamin, "auto-detect" has strong connotations of the open() call (with
> mode including text and encoding not specified) reading some/all of the
> file and trying to guess what the encoding might be -- a futile pursuit and
> not what the docs say:
>
> """encoding is the name of the encoding used to decode or encode the file.
> This should only be used in text mode. The default encoding is platform
> dependent, but any encoding supported by Python can be passed. See the
> codecs module for the list of supported encodings"""
>
> On my machine [Windows XP SP3] sys.getdefaultencoding() returns 'utf-8'. It
> would be interesting to know
> (1) what is produced on Anjanesh's machine
> (2) how the default encoding is derived (I would have thought I was a prime
> candidate for 'cp1252')
> (3) whether the 'default encoding' of open() is actually the same as the
> 'default encoding' of sys.getdefaultencoding() -- one would hope so but the
> docs don't say so.

First of all, you're right, that might be confusing. I was thinking of
auto-detect as in "check the platform and locale and guess what they usually
use". I wasn't thinking of it in the sense web browsers use it.

I think it uses locale.getpreferredencoding(). On my machine, I get
sys.getdefaultencoding() == 'utf-8' and locale.getpreferredencoding() ==
'cp1252'. When I open a file without specifying the encoding, it's cp1252.
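
To make the comparison concrete, here is a rough sketch that reports all
three values side by side. The helper name report_encodings() is made up for
illustration, and the values it returns depend on the platform, locale, and
Python version, so treat the output as something to check on your own machine
rather than a fixed answer.

```python
import locale
import os
import sys
import tempfile


def report_encodings():
    """Return (sys default, locale preferred, what open() actually picked)."""
    # Encoding Python uses for implicit str <-> bytes conversions.
    default = sys.getdefaultencoding()
    # Encoding the platform/locale prefers for text data.
    preferred = locale.getpreferredencoding(False)
    # Encoding open() chooses when no encoding argument is given,
    # observed on a throwaway temporary file.
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path) as f:
            opened = f.encoding
    finally:
        os.remove(path)
    return default, preferred, opened


default, preferred, opened = report_encodings()
print("sys.getdefaultencoding():      ", default)
print("locale.getpreferredencoding(): ", preferred)
print("open() without encoding:       ", opened)
```

If the last two lines agree and the first one differs, that would support the
guess that open() follows the locale rather than sys.getdefaultencoding().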

>
> > Thanks a lot ! utf-8 and latin1 were accepted !
>
> Benjamin and Anjanesh, Please understand that
> any_random_rubbish.decode('latin1') will be "accepted". This is *not* useful
> information to be greeted with thanks and exclamation marks. It is merely a
> by-product of the fact that *any* single-byte character set like latin1 that
> uses all 256 possible bytes cannot fail, by definition; no character "maps
> to <undefined>".

If you check my response to Anjanesh's comment, I mentioned that he should
either find out which encoding it is in particular or he should open the
file in binary mode. I suggested utf-8 and latin1 because those are the most
likely candidates for his file since cp1252 was already excluded. Looking at
a character map, 0x9d is a control character in latin1, so the page is
probably UTF-8 encoded. Thinking about it now, it could also be MacRoman but
that isn't as common as UTF-8.
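
As a quick check of that reasoning, the sketch below decodes the single byte
0x9d with each candidate codec. The helper try_decode() is just an
illustrative name, and the curly-quote example is my own guess at the kind of
character that could put a 0x9d byte into a UTF-8 file; I haven't seen the
actual file.

```python
import unicodedata

SUSPECT = b"\x9d"  # the byte named in the traceback


def try_decode(data, encoding):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# cp1252 leaves 0x9d unassigned, which is what the traceback complains about.
as_cp1252 = try_decode(SUSPECT, "cp1252")

# latin1 maps every byte to a code point, so it never fails;
# for 0x9d the result is a C1 control character, not printable text.
as_latin1 = try_decode(SUSPECT, "latin1")
latin1_category = unicodedata.category(as_latin1)

# On its own, 0x9d is not valid UTF-8 (it is a continuation byte) ...
as_utf8_alone = try_decode(SUSPECT, "utf-8")
# ... but it does occur inside multi-byte sequences, for example as the
# last byte of a right double quotation mark (U+201D).
curly_quote_bytes = "\u201d".encode("utf-8")

print("cp1252:", as_cp1252)
print("latin1:", ascii(as_latin1), "category", latin1_category)
print("utf-8 (byte alone):", as_utf8_alone)
print("U+201D in utf-8:", curly_quote_bytes)
```

So "accepted by latin1" only tells you the codec has no holes, while a 0x9d
that follows bytes like 0xe2 0x80 would point fairly strongly at UTF-8.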

>
> > If you want to read the file as text, find out which encoding it actually
> > is. In one of those encodings, you'll probably see some nonsense
> > characters. If you are just looking at the file as a sequence of bytes,
> > open the file in binary mode rather than text. That way, you'll avoid this
> > issue altogether (just make sure you use byte strings instead of unicode
> > strings).
>
> In fact, inspection of Anjanesh's report:
> """UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position
> 10442: character maps to <undefined>
> The string at position 10442 is something like this :
> "query":"0 1»Ý \u2021 0\u201a0 \u2021»Ý"," """
>
> draws two observations:
> (1) there is nothing in the reported string that can be unambiguously
> identified as corresponding to "0x9d"
> (2) it looks like a small snippet from a Python source file!
>
> Anjanesh, Is it a .py file? If so, is there something like "# encoding:
> cp1252" or "# encoding: utf-8" near the start of the file? *Please* tell us
> what sys.getdefaultencoding() returns on your machine.
>
> Instead of "something like", please report exactly what is there:
>
> print(ascii(open('the_file', 'rb').read()[10442-20:10442+21]))
>
> Cheers,
> John
>
>
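
For anyone following along, here is a small sketch of the byte-level
inspection John is asking for, written so it can be tried without the real
file. The function name bytes_around() and the sample data are invented for
illustration; with the actual file you would read it in binary mode and pass
position 10442 from the traceback.

```python
def bytes_around(data, position, before=20, after=21):
    """Return the bytes surrounding `position`, clamped at the start."""
    start = max(position - before, 0)
    return data[start:position + after]


# Stand-in for open('the_file', 'rb').read(): binary mode means no decoding
# happens, so no UnicodeDecodeError can be raised while reading.
sample = b"a" * 30 + b"\x9d" + b"b" * 30

window = bytes_around(sample, 30)

# ascii() shows non-ASCII bytes as escapes, so nothing gets mangled by the
# terminal or the mail client on the way to the list.
print(ascii(window))
```

With the default arguments the offending byte sits at index 20 of the
returned window, which matches the [10442-20:10442+21] slice in John's
one-liner.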