[Tutor] how to struct.pack a unicode string?

Albert-Jan Roskam fomcl at yahoo.com
Fri Nov 30 17:43:08 CET 2012


Hi,

How can I pack a unicode string using the struct module? If I simply use packed = struct.pack(fmt, hello) in the code below (where 'hello' is a unicode string), I get this: "error: argument for 's' must be a string". I keep reading that I have to encode it to a UTF-8 byte string first, but this does not work (it yields mojibake and tofu output for some of the languages). It is annoying to have to know the encoding in which each individual language should be represented. I was hoping "unicode-internal" was the way to do it, but this does not reproduce the original string when I unpack it. :-(



# Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)] on win32

import sys
import struct

greetings = \
        [['Arabic', [1575, 1604, 1587, 1604, 1575, 1605, 32, 1593, 1604, 1610, 1603,
                     1605], 'cp1256'], # 'cp864' 'iso8859_6'
         ['Assamese', [2472, 2478, 2488, 2509, 2453, 2494, 2544], 'utf-8'],
         ['Bengali', [2438, 2488, 2488, 2494, 2482, 2494, 2478, 2497, 32, 2438,
                      2482, 2494, 2439, 2453, 2497, 2478], 'utf-8'],
         ['Georgian', [4306, 4304, 4315, 4304, 4320, 4335, 4317, 4305, 4304], 'utf-8'],
         ['Kazakh', [1057, 1241, 1083, 1077, 1084, 1077, 1090, 1089, 1110, 1079, 32,
                     1073, 1077], 'utf-8'],
         ['Russian', [1047, 1076, 1088, 1072, 1074, 1089, 1090, 1074, 1091, 1081,
                      1090, 1077], 'utf-8'],
         ['Spanish', [161, 72, 111, 108, 97, 33], 'cp1252'],
         ['Swiss German', [71, 114, 252, 101, 122, 105], 'cp1252'],
         ['Thai', [3626, 3623, 3633, 3626, 3604, 3637], 'cp874'],
         ['Walloon', [66, 111, 110, 100, 106, 111, 251], 'cp1252']]     
for greet in greetings:
    language, chars, encoding = greet
    hello = "".join([unichr(i) for i in chars])
    #print language, hello, encoding  # prints everything as it should look
    endianness = "<" if sys.byteorder == "little" else ">"
    fmt = endianness + str(len(hello)) + "s"
    #https://code.activestate.com/lists/python-list/301601/
    #http://bytes.com/topic/python/answers/546519-unicode-strings-struct-files
    #packed = struct.pack(fmt, hello.encode('utf_32_le'))
    #packed = struct.pack(fmt, hello.encode(encoding))
    #packed = struct.pack(fmt, hello.encode('utf_8'))
    packed = struct.pack(fmt, hello.encode("unicode-internal"))
    # Raises UnicodeDecodeError: 'unicode_internal' codec can't decode byte 0x00
    # in position 12: truncated input
    print struct.unpack(fmt, packed)[0].decode("unicode-internal")
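
A possible explanation, offered as a hedged sketch and not a confirmed diagnosis: the format string above sizes the 's' field with len(hello), which counts characters, whereas struct counts bytes and truncates (or null-pads) the value to fit. Any encoding that uses more than one byte per character would then be cut short, which would match the "truncated input" error. The sketch below sizes the field from the encoded byte string and stores that size in a length prefix. The helper names pack_text and unpack_text, the 4-byte "<I" prefix, and the choice of UTF-8 are all assumptions made for illustration, not part of the original code. It uses only escapes and calls that should behave the same on Python 2.6+ and Python 3.3+.

```python
import struct


def pack_text(text, encoding="utf-8"):
    """Hypothetical helper: encode text, then pack it behind a length prefix."""
    data = text.encode(encoding)
    # The count in front of 's' is a number of BYTES, so use len(data),
    # not len(text). The "<I" prefix (an assumption) records that size.
    return struct.pack("<I%ds" % len(data), len(data), data)


def unpack_text(packed, encoding="utf-8"):
    """Hypothetical helper: inverse of pack_text."""
    (size,) = struct.unpack_from("<I", packed, 0)
    (data,) = struct.unpack_from("<%ds" % size, packed, 4)
    return data.decode(encoding)


# The Thai greeting from the list above: 3626, 3623, 3633, 3626, 3604, 3637
hello = u"\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35"

packed = pack_text(hello)
restored = unpack_text(packed)
print(restored == hello)
```

With this layout a single encoding such as UTF-8 can represent every language in the list, so the per-language code pages would only matter when displaying the text on a console that cannot render it.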


Thank you in advance!

 
Regards,
Albert-Jan


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a 
fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

