[Python-Dev] Python3 "complexity"

Fri Jan 10 03:23:43 CET 2014

On Thu, Jan 09, 2014 at 02:08:57PM -0800, Ethan Furman wrote:

> If latin1 is used to convert binary to text, how convoluted is it to then 
> take chunks of that text and convert to int, or some other variety of 
> unicode?
> 
> For example:  b'\x01\x00\xd1\x80\xd1\x83\xd0\x80'
> 
> If that were decoded using latin1 how would I then get the first two bytes 
> to the integer 256 and the last six bytes to their Cyrillic meaning?  
> (Apologies for not testing myself, short on time.)

Not terribly convoluted, but it does mean handling the data twice. When 
you know up-front that some data is non-text, you shouldn't convert it 
to text in the first place:

py> import struct
py> b = b'\x01\x00\xd1\x80\xd1\x83\xd0\x80'
py> s = b.decode('latin1')
py> num, = struct.unpack('>h', s[:2].encode('latin1'))
py> assert num == 0x100

Better to just go straight from bytes to the struct, if you can:

py> struct.unpack('>h', b[:2])
(256,)
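
To see why the Latin-1 detour works at all and yet buys nothing, here is 
a small sketch. It relies on the fact that Latin-1 maps each byte value 
0-255 to the code point with the same number, so encoding back gives 
exactly the original bytes. The variable names are mine, purely for 
illustration:

```python
import struct

data = b'\x01\x00\xd1\x80\xd1\x83\xd0\x80'

# Latin-1 maps byte N to code point N, so the round trip is lossless.
text = data.decode('latin1')
assert text.encode('latin1') == data
assert [ord(c) for c in text] == list(data)

# Route 1: through text and back to bytes (the double-handling).
via_text, = struct.unpack('>h', text[:2].encode('latin1'))

# Route 2: straight from the bytes.
direct, = struct.unpack('>h', data[:2])

print(via_text, direct)  # 256 256
```

Both routes give the same integer; the first just does an extra decode 
and encode to get there.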

As for the last six bytes and "their Cyrillic meaning", which Cyrillic 
meaning did you have in mind?

py> s = b'\x01\x00\xd1\x80\xd1\x83\xd0\x80'.decode('latin1')
py> for encoding in "cp1251 ibm866 iso-8859-5 koi8-r koi8-u mac_cyrillic".split():
...     print(s[-6:].encode('latin1').decode(encoding))
...
СЂСѓРЂ
╤А╤Г╨А
бба
я─я┐п─
я─я┐п─
—А—Г–А

I understand that Cyrillic is an especially poor choice, since there 
are many incompatible Cyrillic code-pages. On the other hand, it's also 
an especially good example of how you need to know the encoding before 
you can make sense of the data.

Again, note that if you know the encoding you are intending to use is 
not Latin-1, decoding to Latin-1 first just ends up double-handling. If 
you can, it is best to split your data into fields up front, and then 
decode each piece once only.
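
As a rough sketch of what "split into fields up front" might look like 
for the example bytes: the function name parse_record and the record 
layout (a two-byte big-endian integer followed by text) are assumptions 
of mine, and so is the choice of UTF-8 for the text field, since the 
question never said which encoding was intended.

```python
import struct

def parse_record(data, text_encoding):
    """Split a record into (integer, text), decoding each field once.

    Hypothetical layout: the first two bytes are a signed big-endian
    16-bit integer, the remaining bytes are text in `text_encoding`.
    """
    number, = struct.unpack('>h', data[:2])  # bytes -> int, no text step
    text = data[2:].decode(text_encoding)    # bytes -> str, exactly once
    return number, text

record = b'\x01\x00\xd1\x80\xd1\x83\xd0\x80'

# UTF-8 is only a guess at the intended encoding.
number, text = parse_record(record, 'utf-8')
print(number, len(text))  # 256 3
```

Under the UTF-8 assumption the six trailing bytes decode to three code 
points (U+0440, U+0443, U+0400). Pass a different encoding and the same 
bytes give different text, which is the whole point.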

-- 
Steven