Unicode strings, struct, and files

John Machin sjmachin at lexicon.net
Mon Oct 9 02:06:40 EDT 2006


Tom Plunket wrote:
> I am building a file with the help of the struct module.
>
> I would like to be able to put Unicode strings into this file, but I'm
> not sure how to do it.
>
> The format I'm trying to write is basically this C structure:
>
> struct MyFile
> {
>    int magic;
>    int flags;
>    short otherFlags;
>    char pad[22];
>
>    wchar_t line1[32];
>    wchar_t line2[32];
>
>    // ... other data which is easy.  :)
> };
>
> (I'm writing data on a PC to be read on a big-endian machine.)
>
> So I can write the four leading members with the output of
> struct.pack('>IIH22x', magic, flags, otherFlags).  Unfortunately I
> can't figure out how to write the unicode strings, since:
>
> message = unicode('Hello, world')
> myFile.write(message)
>
> results in 'message' being converted back to a string before being
> written.  Is the way to do this to do something hideous like this:
>
> for c in message:
>    myFile.write(struct.pack('>H', ord(unicode(c))))
>
> ?

I'd suggest encoding each unicode object to a byte string, using the
UTF encoding that matches whatever wchar_t means on the target machine.
For example, assuming big-endian and sizeof(wchar_t) == 2:

utf_line1 = unicode_line1.encode('utf_16_be')
etc
struct.pack(">.........64s64s", ......, utf_line1, utf_line2)

This presumes that (1) you have already checked that you don't have more
than 32 characters in each "line" (the 's' format silently truncates
anything longer than 64 bytes), and (2) padding with unichr(0) is
acceptable (struct pads short strings with null bytes).
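
For what it's worth, here is a minimal sketch of the whole header,
assuming the layout from the question and a 2-byte wchar_t on the
target. The names HEADER_FORMAT, MAX_UNITS, encode_line and pack_header
are made up for illustration, and the magic/flags values are
placeholders. It is written for Python 3, where str is already Unicode;
under the Python 2 of this thread you would pass unicode objects instead.

```python
import struct

# Layout assumed from the C struct in the question:
#   int magic; int flags; short otherFlags; char pad[22];
#   wchar_t line1[32]; wchar_t line2[32];   (sizeof(wchar_t) == 2 assumed)
HEADER_FORMAT = ">IIH22x64s64s"
MAX_UNITS = 32  # wchar_t slots available per line


def encode_line(text, max_units=MAX_UNITS):
    """Encode text as big-endian UTF-16 and check that it fits the field."""
    data = text.encode("utf_16_be")
    # Two bytes per UTF-16 code unit; characters outside the BMP take two units.
    if len(data) > 2 * max_units:
        raise ValueError("line needs %d code units, field holds %d"
                         % (len(data) // 2, max_units))
    return data


def pack_header(magic, flags, other_flags, line1, line2):
    """Pack the fixed-size header; short 's' fields are padded with NUL bytes."""
    return struct.pack(HEADER_FORMAT, magic, flags, other_flags,
                       encode_line(line1), encode_line(line2))


# Placeholder values, not taken from any real file format.
header = pack_header(0x12345678, 1, 2, "Hello, world", "")
```

Checking the length explicitly matters because struct would otherwise
truncate an over-long line without complaint. The result should be
written to a file opened in binary mode ('wb'), otherwise newline
translation on a PC can alter the bytes.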

HTH,
John



