[Python-Dev] Python3 "complexity"
Steven D'Aprano
steve at pearwood.info
Fri Jan 10 03:23:43 CET 2014
On Thu, Jan 09, 2014 at 02:08:57PM -0800, Ethan Furman wrote:
> If latin1 is used to convert binary to text, how convoluted is it to then
> take chunks of that text and convert to int, or some other variety of
> unicode?
>
> For example: b'\x01\x00\xd1\x80\xd1\83\xd0\x80'
>
> If that were decoded using latin1 how would I then get the first two bytes
> to the integer 256 and the last six bytes to their Cyrillic meaning?
> (Apologies for not testing myself, short on time.)
Not terribly convoluted, but there is some double-processing. When you
know up-front that some data is non-text, you shouldn't convert it to
text, otherwise you have to encode it straight back to bytes before you
can use it:
py> import struct
py> b = b'\x01\x00\xd1\x80\xd1\x83\xd0\x80'
py> s = b.decode('latin1')
py> num, = struct.unpack('>h', s[:2].encode('latin1'))
py> assert num == 0x100
Better to just go straight from bytes to the struct, if you can:
py> struct.unpack('>h', b[:2])
(256,)
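The Latin-1 detour above only works because Latin-1 maps each byte value
0..255 to the code point with the same number, so decoding and then
re-encoding gives back exactly the bytes you started with. A minimal
sketch to check that (the variable names are just for illustration):

```python
# Build a bytes object containing every possible byte value once.
all_bytes = bytes(range(256))

# Decode to text: byte N becomes the character with code point N.
text = all_bytes.decode('latin1')

# Encode back: the round trip should be lossless.
round_tripped = text.encode('latin1')

assert [ord(c) for c in text] == list(range(256))
assert round_tripped == all_bytes
```

The round trip is lossless, but the intermediate string carries no
meaning of its own; it is only a container for the bytes.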
As for the last six bytes and "their Cyrillic meaning", which Cyrillic
meaning did you have in mind?
py> s = b'\x01\x00\xd1\x80\xd1\x83\xd0\x80'.decode('latin1')
py> for encoding in "cp1251 ibm866 iso-8859-5 koi8-r koi8-u mac_cyrillic".split():
...     print(s[-6:].encode('latin1').decode(encoding))
...
СЂСѓРЂ
╤А╤Г╨А
бба
я─я┐п─
я─я┐п─
—А—Г–А
I understand that Cyrillic is an especially poor choice, since there
are many incompatible Cyrillic code-pages. On the other hand, it's also
an especially good example of how you need to know the encoding before
you can make sense of the data.
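To make the point concrete, here is a small sketch comparing two
candidate readings of the same six bytes. UTF-8 is not one of the
code-pages listed above; it is assumed here purely as one more
candidate, not as the encoding the data actually uses:

```python
tail = b'\xd1\x80\xd1\x83\xd0\x80'

# Assumption for illustration: treat the bytes as UTF-8.
# Each pair of bytes becomes one character, giving three in total.
as_utf8 = tail.decode('utf-8')

# The same bytes read as cp1251: one character per byte, six in total.
as_cp1251 = tail.decode('cp1251')

assert len(as_utf8) == 3
assert len(as_cp1251) == 6
assert as_utf8 != as_cp1251
```

Both calls succeed without error, so nothing in the bytes themselves
tells you which reading was intended.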
Again, note that if you know the encoding you are intending to use is
not Latin-1, decoding to Latin-1 first just ends up double-handling. If
you can, it is best to split your data into fields up front, and then
decode each piece once only.
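A minimal sketch of that approach, assuming a hypothetical record layout
of a 2-byte big-endian integer followed by text, and assuming (only for
the sake of the example) that the text part is UTF-8:

```python
import struct

data = b'\x01\x00\xd1\x80\xd1\x83\xd0\x80'

# Hypothetical layout, assumed for illustration: a 2-byte big-endian
# signed integer header, then text in a known encoding.
HEADER = struct.Struct('>h')
TEXT_ENCODING = 'utf-8'  # assumption; use whatever your data really is

# Split into fields while the data is still bytes.
(num,) = HEADER.unpack_from(data, 0)
text_bytes = data[HEADER.size:]

# Decode the text field exactly once, with the encoding you know.
text = text_bytes.decode(TEXT_ENCODING)

assert num == 256
assert len(text) == 3
```

The integer never passes through a string, and the text is decoded once
rather than via a Latin-1 intermediate.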
--
Steven