[Tutor] questions on encoding

Thu Jul 21 03:36:05 CEST 2011

Albert-Jan Roskam wrote:
> Hi,
> 
> I am looking for test data with accented and multibyte characters. I have found a good resource that I could use to cobble something together (http://www.inter-locale.com/whitepaper/learn/learn-to-test.html) but I was hoping somebody knows some ready resource.
> 
> I also have some questions about encoding. In the code below, is there a difference between unicode() and .decode?

Not in functionality. unicode() may be slightly faster for small 
strings, on account of being a function rather than a method. In Python 
3, unicode is gone because it is no longer needed: str is already 
Unicode text, and bytes are converted to it with the decode method.
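
For anyone reading this on Python 3, here is a minimal sketch of the same 
comparison. It assumes the bytes really are UTF-8 (they are built from 
escapes here, so no terminal is involved), and the names raw, x and y are 
only illustrative. str(raw, "utf-8") stands in for unicode(s, "utf-8"):

```python
# Python 3 sketch: unicode() no longer exists, bytes are decoded to str.
# The text is written with escapes so the result does not depend on any
# terminal or source-file encoding.
raw = "\u00a7\u00c7\u00bc\u00cd".encode("utf-8")   # known UTF-8 bytes

x = str(raw, "utf-8")      # constructor form, in place of unicode(s, "utf-8")
y = raw.decode("utf-8")    # method form, same spelling as in Python 2

print(x == y)
```

Both spellings go through the same codec, so which one to use is a 
matter of taste.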

> s = "§ÇÇ¼ÍÍ"
> x = unicode(s, "utf-8")
> y = s.decode("utf-8")
> x == y # returns True

The fact that this works at all is a fluke, dependent on the settings of 
your terminal.  If I copy and paste the line

s = "§ÇÇ¼ÍÍ"

into my terminal, with an arbitrarily chosen encoding, I get this:

 >>> unicode(s, 'utf-8')
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa7 in position 0: unexpected code byte

(I get a hint that things are not as they should be, because the 
characters of s look different too.)

Without knowing what encoding your terminal is set to, it is impossible 
to tell what bytes s *actually* includes. But whatever they are, whether 
they are valid UTF-8 is a matter of chance.
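
To make that concrete, here is a small Python 3 sketch. It assumes, purely 
for illustration, that one terminal sends Latin-1 and another sends UTF-8; 
I don't know what either of our terminals actually uses. The same six 
characters turn into different bytes, and only one set decodes as UTF-8:

```python
# The six characters as they appear in the email, written with escapes.
text = "\u00a7\u00c7\u00c7\u00bc\u00cd\u00cd"

# Assumption for illustration: two terminals with different encodings.
latin1_bytes = text.encode("latin-1")   # one byte per character
utf8_bytes = text.encode("utf-8")       # two bytes per character

# The UTF-8 bytes round-trip cleanly.
roundtrip_ok = utf8_bytes.decode("utf-8") == text

# The Latin-1 bytes are not valid UTF-8; record where decoding fails.
try:
    latin1_bytes.decode("utf-8")
    failed_at = None
except UnicodeDecodeError as exc:
    failed_at = (exc.start, hex(latin1_bytes[exc.start]))

print(roundtrip_ok)
print(failed_at)
```

Under that assumption the failure is at byte 0xa7 in position 0, the same 
place as in the traceback above.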

> Also, is it, at least theoretically, possible to mix different encodings in byte strings? I'd say no, unless there are multiple BOMs or so. Not that I'd like to try this, but it'd improve my understanding of this sort of obscure topic.

Of course it is! But doing so gives you a broken file, like taking a file 
containing a jpeg and appending it to a file containing an mp3. The 
resultant file is neither a well-formed mp3 nor a well-formed jpeg. 
Unless you have some way of telling where one part ends and the other 
starts, you've just broken your file.
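
As a rough Python 3 sketch of what "some way of telling" could look like: 
if you keep the offset where the first part ends, each part can still be 
decoded with its own codec. The sample strings and the names part1, part2 
and boundary are made up for this example:

```python
# Two pieces of text, each encoded with a different codec.
part1 = "caf\u00e9".encode("utf-8")
part2 = "na\u00efve".encode("utf-16-le")

mixed = part1 + part2      # one byte string holding two encodings
boundary = len(part1)      # out-of-band knowledge of where part1 ends

# With the boundary known, each slice decodes with its own codec.
first = mixed[:boundary].decode("utf-8")
second = mixed[boundary:].decode("utf-16-le")

print(first, second)
```

Without boundary there is nothing in mixed itself that says where the 
encoding changes.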

Going back to a terminal with the default encoding (whatever that is!), 
I can do this:

 >>> s = "§ÇÇ¼ÍÍ"  # copied and pasted from your email
 >>> # note the chars look different in my terminal and email client!
...
 >>> a = unicode(s, 'utf-8')  # treat it as UTF-8 bytes
 >>> b = unicode(s, 'utf-16')  # treat it as UTF-16 bytes
 >>> t = a.encode('utf-8') + b.encode('utf-16')  # mix them together
 >>> t
'\xc2\xa7\xc3\x87\xc3\x87\xc2\xbc\xc3\x8d\xc3\x8d\xff\xfe\xc2\xa7\xc3\x87\xc3\x87\xc2\xbc\xc3\x8d\xc3\x8d'
 >>> t.decode('utf-8')
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/usr/lib/python2.5/encodings/utf_8.py", line 16, in decode
     return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 12: unexpected code byte
 >>> t.decode('utf-16')
u'\ua7c2\u87c3\u87c3\ubcc2\u8dc3\u8dc3\ufeff\ua7c2\u87c3\u87c3\ubcc2\u8dc3\u8dc3'

So the mixed byte string t is *not* valid UTF-8, but it happens to be 
valid UTF-16. That's an accident of the particular bytes that happened 
to be in the string s. There's no guarantee that it will always work, 
and even when it does, you rarely get a sensible string of characters.
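
The session above can be reproduced in Python 3 without pasting anything 
into a terminal. This is only a sketch: it starts from the twelve bytes 
shown in the output of t, and it spells the byte order out as 
"utf-16-le" with a hand-written BOM, which is my assumption about what 
plain "utf-16" did on the machine used above:

```python
# The twelve bytes that s held in the session above.
s = b"\xc2\xa7\xc3\x87\xc3\x87\xc2\xbc\xc3\x8d\xc3\x8d"

a = s.decode("utf-8")       # treat the bytes as UTF-8: six characters
b = s.decode("utf-16-le")   # treat the same bytes as UTF-16: six others

# Mix the two encodings, with an explicit little-endian BOM in between.
t = a.encode("utf-8") + b"\xff\xfe" + b.encode("utf-16-le")

# As UTF-8 the mix is invalid; note where decoding stops.
try:
    t.decode("utf-8")
    utf8_error_at = None
except UnicodeDecodeError as exc:
    utf8_error_at = exc.start

# As UTF-16 it decodes, but to 13 unrelated code points.
as_utf16 = t.decode("utf-16-le")

print(utf8_error_at)
print(len(as_utf16))
```

The UTF-8 decoder gives up at position 12, the first byte of the BOM, 
while the UTF-16 decoder accepts everything and returns nonsense.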

In the same way, a random chunk of bytes from an mp3 file might, by 
chance, happen to make up a valid jpeg file -- but it almost certainly 
won't make a nice picture, rather just a blob of random pixels.

-- 
Steven