[Tutor] how to struct.pack a unicode string?

Steven D'Aprano steve at pearwood.info
Sat Dec 1 00:29:53 CET 2012


On 01/12/12 03:43, Albert-Jan Roskam wrote:
> Hi,
>
> How can I pack a unicode string using the struct module? If I
> simply use packed = struct.pack(fmt, hello) in the code below
> (and 'hello' is a unicode string), I get this:
> "error: argument for 's' must be a string".

To be precise, it must be a *byte* string, not a Unicode string.


> I keep reading that I have to encode it to a utf-8 bytestring,

To be precise, you can use any encoding you like, with the
following provisos:

* not all encodings are capable of representing every character
   (e.g. the ASCII encoding only represents 128 characters);

* some encodings may not quite round-trip exactly, that is, they
   may lose some information;

* some encodings are more compact than others (e.g. Latin-1 uses
   one byte per character, while UTF-32 uses four bytes per
   character).
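
A small sketch of the first and third points, using one of the
greetings from further down. The variable names are just for
illustration, and the "-le" codec variants are chosen here only so
that no byte order mark is added to the counts:

```python
# Compare how many bytes the same six-character string needs.
text = u'Gr\xfcezi'  # six characters, one of them non-ASCII

sizes = {}
for enc in ('latin-1', 'utf-8', 'utf-16-le', 'utf-32-le'):
    sizes[enc] = len(text.encode(enc))

# latin-1: 6 bytes, utf-8: 7, utf-16-le: 12, utf-32-le: 24
print(sizes)

# ASCII has no code for u'\xfc', so encoding fails outright
# instead of silently losing data.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```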


> but this does not work (it yields mojibake and tofu output for
> some of the languages).

It would be useful to see an example of this.

But if you do your encoding/decoding correctly, using the right
codecs, you should never get mojibake. You only get that when
you have a mismatch between the encoding you think you have and
the encoding you actually have.
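
To illustrate that kind of mismatch, here is a minimal sketch. It
assumes the bytes were written as UTF-8 and then wrongly read back as
cp1252, which is only one of many ways to get it wrong:

```python
original = u'Gr\xfcezi'

# Encode with one codec...
data = original.encode('utf-8')

# ...and decode with a different one: the two UTF-8 bytes for
# u'\xfc' are misread as two separate cp1252 characters.
wrong = data.decode('cp1252')

# Decoding with the codec that was actually used is lossless.
right = data.decode('utf-8')

print(repr(wrong))
print(right == original)
```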


> It's annoying if one needs to know the encoding in which each
> individual language should be represented. I was hoping
> "unicode-internal" was the way to do it, but this does not
> reproduce the original string when I unpack it.. :-(

Yes, encodings are annoying. The sooner that all encodings other
than UTF-8 and UTF-32 disappear the better :)

The beauty of using UTF-8 instead of one of the many legacy
encodings is that UTF-8 can represent any character, so you don't
need to care about the individual language, and it is compact (at
least for Western European languages).

Why are you using struct for this? If you want to convert Unicode
strings into a sequence of bytes, that's exactly what the encode
method does. There's no need for struct.



greetings = [
         ('Arabic', u'\u0627\u0644\u0633\u0644\u0627\u0645\u0020\u0639\u0644\u064a\u0643\u0645', 'cp1256'),
         ('Assamese', u'\u09a8\u09ae\u09b8\u09cd\u0995\u09be\u09f0', 'utf-8'),
         ('Bengali', u'\u0986\u09b8\u09b8\u09be\u09b2\u09be\u09ae\u09c1 \u0986\u09b2\u09be\u0987\u0995\u09c1\u09ae', 'utf-8'),
         ('English', u'Greetings and salutations', 'ascii'),
         ('Georgian', u'\u10d2\u10d0\u10db\u10d0\u10e0\u10ef\u10dd\u10d1\u10d0', 'utf-8'),
         ('Kazakh', u'\u0421\u04d9\u043b\u0435\u043c\u0435\u0442\u0441\u0456\u0437 \u0431\u0435', 'utf-8'),
         ('Russian', u'\u0417\u0434\u0440\u0430\u0432\u0441\u0442\u0432\u0443\u0439\u0442\u0435', 'utf-8'),
         ('Spanish', u'\xa1Hola!', 'cp1252'),
         ('Swiss German', u'Gr\xfcezi', 'cp1252'),
         ('Thai', u'\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35', 'cp874'),
         ('Walloon', u'Bondjo\xfb', 'cp1252'),
         ]
for language, greet, encoding in greetings:
    print u"Hello in %s: %s" % (language, greet)
    # Try three Unicode encodings plus the language's legacy codec.
    for enc in ('utf-8', 'utf-16', 'utf-32', encoding):
        bytestring = greet.encode(enc)
        print "encoded as %s gives %r" % (enc, bytestring)
        # Decoding with the same codec must give back the original.
        if bytestring.decode(enc) != greet:
            print "*** round-trip encoding/decoding failed ***"


Any of the byte strings can then be written directly to a file
opened in binary mode:

f.write(bytestring)

or embedded into a struct. Because the encoded length differs from
string to string, you need a variable-length struct, of course.
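
One possible way to get a variable-length struct is to store the byte
count in front of the encoded bytes. This is only a sketch: the helper
names pack_text and unpack_text, the 4-byte little-endian length
prefix, and the choice of UTF-8 are all assumptions made for the
example, not something your file format necessarily requires.

```python
import struct

def pack_text(text, encoding='utf-8'):
    """Hypothetical helper: length-prefixed encoded string."""
    data = text.encode(encoding)           # struct wants bytes
    # '<I' is a 4-byte unsigned length, '%ds' the raw bytes.
    return struct.pack('<I%ds' % len(data), len(data), data)

def unpack_text(buf, encoding='utf-8'):
    """Hypothetical helper: inverse of pack_text."""
    (length,) = struct.unpack_from('<I', buf, 0)
    (data,) = struct.unpack_from('<%ds' % length, buf, 4)
    return data.decode(encoding)

greet = u'\u0417\u0434\u0440\u0430\u0432\u0441\u0442\u0432\u0443\u0439\u0442\u0435'
packed = pack_text(greet)
print(len(packed))                  # 4-byte prefix + 24 bytes of UTF-8
print(unpack_text(packed) == greet)
```

Note that the length stored is the number of bytes, not the number of
characters; the two differ for anything outside ASCII.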

My advice: stick to Python unicode strings internally, and always write
them to files as UTF-8.



-- 
Steven

