Unicode/ascii encoding nightmare

Mon Nov 6 17:54:38 EST 2006

Ok, I've cleaned up my code a bit and it seems as if I've
encoded/decoded myself into a corner ;-). My understanding of unicode
has room for improvement, that's for sure. I got some pointers, and the
initial code cleanup seems to have removed some of the strange results I
got, which several of you also pointed out.

Anyway, thanks for all your replies. I think I can get this thing up
and running with a bit more code tinkering. And I'll read up on some
unicode-docs as well. :-) Thanks again.

Thomas

John Machin wrote:
> Thomas W wrote:
> > I'm getting really annoyed with Python with regard to
> > unicode/ascii-encoding problems.
> >
> > The string below is the encoding of the Norwegian word "fødselsdag".
> >
> > >>> s = 'f\xc3\x83\xc2\xb8dselsdag'
>
> There is no such thing as "*the* encoding" of any given string.
>
> >
> > I stored the string as "fødselsdag" but somewhere in my code it got
> > translated into the mess above and I cannot get the original string
> > back.
>
> Somewhere in your code??? Can't you track through your code to see
> where it is being changed? Failing that, can't you show us your code so
> that we can help you?
>
> I have guessed *what* you got, but *how* you got it boggles the mind:
>
> The effect is the same as (decode from latin1 to Unicode, encode as
> utf8) *TWICE*. That's how you change one byte in the original to *FOUR*
> bytes in the "mess":
>
> | >>> orig = 'f\xf8dselsdag'
> | >>> orig.decode('latin1').encode('utf8')
> | 'f\xc3\xb8dselsdag'
> | >>> orig.decode('latin1').encode('utf8').decode('latin1').encode('utf8')
> | 'f\xc3\x83\xc2\xb8dselsdag'
> | >>>
>
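
For anyone landing here with the same kind of "mess", here is a minimal
sketch of reversing the double conversion John describes. It assumes
Python 3 (where bytes and str are separate types, unlike the 2.x session
quoted above) and it assumes the data really went through latin1-to-utf8
exactly twice. The helper name undo_double_utf8 is made up for this
example.

```python
# Python 3 sketch: every decode/encode step is explicit.
mess = b'f\xc3\x83\xc2\xb8dselsdag'  # the doubly encoded bytes from the thread


def undo_double_utf8(data: bytes) -> str:
    """Reverse (decode latin1, encode utf8) applied twice.

    Hypothetical helper; only valid if the input was mangled this exact way.
    """
    # Peel off the second round: utf8 bytes -> text -> latin1 bytes.
    once = data.decode('utf-8').encode('latin-1')
    # What remains is ordinary utf8, so decode it to get the real text.
    return once.decode('utf-8')


recovered = undo_double_utf8(mess)
# ascii() keeps the output printable on any console, whatever its encoding.
print(ascii(recovered))
print(recovered.encode('latin-1'))
```

This only repairs data that is already damaged. The real fix is still to
find the place in the code where the second conversion happens.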
> > It cannot be printed in the console or written to a plain text file.
>
> Incorrect. *Any* string can be printed on the console or written to a
> file. What you mean is that when you look at the output, it is not what
> you want.
>
> > I've tried to convert it using
> >
> > >>> s.encode('iso-8859-1')
> > Traceback (most recent call last):
> >   File "<interactive input>", line 1, in ?
> > UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
> > ordinal not in range(128)
>
> encode is an attribute of unicode objects. If applied to a str object,
> the str object is converted to unicode first using the default codec
> (typically ascii).
>
> s.encode('iso-8859-1') is effectively
> s.decode('ascii').encode('iso-8859-1'), and s.decode('ascii') fails for
> the (obvious(?)) reason given.
>
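
The hidden step John describes can be written out by hand. The sketch
below assumes Python 3, where a bytes object has no .encode() at all, so
the implicit ascii decode of 2.x has to be spelled out. The variable
names failed_codec and failed_at are only for illustration.

```python
# Python 3 sketch of the decode that Python 2 performs implicitly
# when .encode() is called on a byte string.
s = b'f\xc3\x83\xc2\xb8dselsdag'

try:
    s.decode('ascii')  # the step hidden inside s.encode(...) on 2.x
except UnicodeDecodeError as exc:
    failed_codec = exc.encoding  # name of the codec that gave up
    failed_at = exc.start        # index of the first offending byte
    print(exc)
```

The codec named in the error is ascii, not iso-8859-1 or utf-8, because
the failure happens in the decode step before the requested encode ever
runs.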
> >
> > >>> s.encode('utf-8')
> > Traceback (most recent call last):
> >   File "<interactive input>", line 1, in ?
> > UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
> > ordinal not in range(128)
>
> Same story as for 'iso-8859-1'
>
> >
> > And nothing helps. I cannot remember having these problems in earlier
> > versions of Python
>
> I would be very surprised if you couldn't reproduce your problem on any
> 2.n version of Python.
>
> > and it's really annoying, even if it's my own fault
> > somehow, handling of normal characters like this shouldn't cause this
> > much hassle. Searching google for "codec can't decode byte" and
> > UnicodeDecodeError etc. produces a bunch of hits so it's obvious I'm
> > not alone.
> >
> > Any hints?
>
> 1. Read the Unicode howto: http://www.amk.ca/python/howto/unicode
> 2. Read the Python documentation on .decode() and .encode() carefully.
> 3. Show us your code so that we can help you avoid the double
> conversion to utf8. Tell us what IDE you are using.
> 4. Tell us what you are trying to achieve. Note that if all you are
> trying to do is read and write text in Norwegian (or any other language
> that's representable in iso-8859-1 aka latin1), then you don't have to
> do anything special at all in your code-- this is the good old "legacy"
> way of doing things universally in vogue before Unicode was invented!
> 
> HTH,
> John