[Tutor] how to struct.pack a unicode string?

Albert-Jan Roskam fomcl at yahoo.com
Sun Dec 2 14:00:05 CET 2012


>> How can I pack a unicode string using the struct module? If I simply use
>> packed = struct.pack(fmt, hello) in the code below (and 'hello' is a
>> unicode string), I get this: "error: argument for 's' must be a string". I
>> keep reading that I have to encode it to a utf-8 bytestring, but this does
>> not work (it yields mojibake and tofu output for some of the languages).
>
>You keep reading it because it is the right approach. You will not get 
>mojibake if you decode the "packed" data before using it. 
>
>Your code basically becomes
>
>for greet in greetings:
>    language, chars, encoding = greet
>    hello = "".join([unichr(i) for i in chars])
>    packed = hello.encode("utf-8")
>    unpacked = packed.decode("utf-8")
>    print unpacked
>
>I don't know why you mess with byte order, perhaps you can tell a bit about 
>your actual use-case.


Hi Peter,

Thanks for helping me. I am writing binary files and I wanted to create test data for this.
--This has been a good test case: it exposed a defect (a) in my program and (b) in my knowledge. I realize how cp1252-ish I am; for instance, I wrongly tend to assume that len(someUnicodeString) == nbytes_of_that_unicode_string.
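
To make the len() versus byte-count point concrete, here is a minimal sketch. It assumes Python 3 (where str is the unicode type) rather than the Python 2 used elsewhere in this thread, and the helper names pack_text and unpack_text are made up for illustration. The 's' field is sized from the encoded bytes, not from the number of characters:

```python
import struct


def pack_text(text, encoding="utf-8"):
    """Encode text to bytes, then pack exactly that many bytes (hypothetical helper)."""
    data = text.encode(encoding)
    # Size the 's' field by the byte count, not the character count.
    return struct.pack("%ds" % len(data), data)


def unpack_text(packed, encoding="utf-8"):
    """Reverse of pack_text: pull the bytes out and decode them (hypothetical helper)."""
    (data,) = struct.unpack("%ds" % len(packed), packed)
    return data.decode(encoding)


hello = "h\u00e9llo"                 # 5 characters, one of them non-ASCII
packed = pack_text(hello)

print(len(hello))                    # number of characters
print(len(packed))                   # number of bytes after UTF-8 encoding
print(unpack_text(packed) == hello)  # round trip restores the original text
```

Sizing the field with len(hello) instead would silently truncate the encoded bytes, which is one way to end up with mojibake after decoding.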

--re: messing with byte order: I read in M. Summerfield's "Programming in Python 3" that it's advisable to always specify the byte order, for portability of the data. But, now that you mention it, the way I did it, I might as well omit it. Or, given that the binary format I am writing contains information about the byte order, I might hard-code the byte order (e.g. always write LE). That would follow Mark Summerfield's advice, if I understand it correctly.
--(Aside from your advice to use utf-8) Given that sys.maxunicode == 65535 on my system (i.e., that many unicode points can be represented in my compilation of Python) I'd expect that I not only could write u'blaah'.encode("unicode-internal"), but also u'blaah'.encode("ucs-2"). Instead I get:
Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    u'blaah'.encode("ucs-2")
LookupError: unknown encoding: ucs-2
Why is the label "unicode-internal" used to indicate both ucs-2 and ucs-4? And why does the same Python version on my Linux computer use 1114111 code points? Can we conclude that Linux users are better equipped to write a letter in Burmese or Aleut? ;-)
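
A rough sketch of the last two points taken together, again assuming Python 3. The record layout (a 2-byte length prefix followed by the text) and the names pack_record and unpack_record are invented for illustration and are not the binary format mentioned above. Since there is no codec named "ucs-2", the sketch uses "utf-16-le" as a stand-in: for code points up to 65535 it produces the same 2-byte units, and it uses a surrogate pair (4 bytes) beyond that. The byte order is hard-coded to little-endian both in the struct format ("<") and in the codec name:

```python
import struct


def pack_record(text):
    """Pack text as a 2-byte little-endian byte count plus UTF-16-LE bytes.

    Hypothetical layout; the count must fit in an unsigned 16-bit field.
    """
    data = text.encode("utf-16-le")  # explicit byte order, no BOM written
    return struct.pack("<H%ds" % len(data), len(data), data)


def unpack_record(packed):
    """Read the byte count first, then exactly that many bytes of text."""
    (nbytes,) = struct.unpack_from("<H", packed, 0)
    (data,) = struct.unpack_from("<%ds" % nbytes, packed, 2)
    return data.decode("utf-16-le")


bmp = "blaah"          # all code points fit in 16 bits
astral = "\U0001F600"  # one code point above 65535

print(len(bmp.encode("utf-16-le")))     # 2 bytes per character
print(len(astral.encode("utf-16-le")))  # surrogate pair: 4 bytes for 1 character
print(unpack_record(pack_record(bmp)) == bmp)
print(unpack_record(pack_record(astral)) == astral)
```

The length prefix counts bytes, not characters, for the same reason as before. On Python 3.3 and later, sys.maxunicode is always 1114111, so the difference between the two machines described above does not arise there.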

Thanks again!

Regards,
Albert-Jan


