[Tutor] Struct and UTF-16

Kent Johnson kent37 at tds.net
Mon Oct 3 12:38:47 CEST 2005


Liam Clarke wrote:
> I have a base object, which reads the unicode string as bytes like so,
> this ignores all but important bits.
> 
> class Mhod:
>     def __init__(self, f):
>         self.payload = struct.unpack("36s", f.read(36))
> 
> Which in turn, is utilised in a Song object, which works like this -
> 
> class Song:
>     def __init__(self, mhod):
>         self.location = unicode(mhod.payload, "UTF-16")
>         self.mhod = mhod
>     def gLoc(self):
>         return self.location
>     def sLoc(self, value):
>         #Need to coerce data into UTF-16 here
>         self.mhod.payload = value.encode("UTF-16")

I'm confused about what sLoc is supposed to do. Shouldn't it be setting self.location? ISTM sLoc should parallel what __init__ does. What is value here? If it is an Mhod then you should just do
  self.location = unicode(mhod.payload, "UTF-16")
again.

OTOH if you are trying to modify Mhod.payload then you should make a method in Mhod and (assuming value is ASCII text) it should be something like
  self.payload = unicode(value, 'ascii').encode('UTF-16')
(though see my previous reply about utf-16 vs utf-16be and utf-16le)
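
A minimal sketch of that second option, written in Python 3 syntax rather than the Python 2 used in this thread (in Python 3, str is the unicode type and bytes is the 8-bit string type). The method name set_location, the io.BytesIO stand-in for the real file, the shortened 18-character path, and the choice of 'utf-16-le' (no byte-order mark) are all assumptions for illustration, not part of the original code. The [0] is there because struct.unpack always returns a tuple.

```python
import io
import struct


class Mhod:
    def __init__(self, f):
        # struct.unpack returns a tuple even for one field, so take element 0.
        self.payload = struct.unpack("36s", f.read(36))[0]

    def set_location(self, text):
        # Hypothetical method: encode goes *away* from unicode text to bytes.
        # 'utf-16-le' writes no byte-order mark.
        self.payload = text.encode("utf-16-le")


class Song:
    def __init__(self, mhod):
        self.mhod = mhod

    @property
    def location(self):
        # decode goes *towards* unicode text.
        return self.mhod.payload.decode("utf-16-le")

    @location.setter
    def location(self, value):
        self.mhod.set_location(value)


# 18 characters encode to exactly 36 bytes in UTF-16LE.
raw = ":iPod_Control:F44:".encode("utf-16-le")
song = Song(Mhod(io.BytesIO(raw)))
print(song.location)

song.location = ":iPod_Control:F01:"
print(song.mhod.payload)
```

Keeping the bytes in Mhod and decoding on demand in Song means there is only one copy of the data, so the text and the payload cannot drift apart.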

> 
>     location = property(gLoc, sLoc)
> 
> If I were to do a
> 
> 
> >>> x = Mhod(open("test", "rb"))
> >>> y = Song(x)
> 
> 
> I get
> 
> 
> >>> x.payload
> 
> ':\x00i\x00P\x00o\x00d\x00_\x00C\x00o\x00n\x00t\x00r\x00o\x00l
> \x00:\x00M\x00u\x00s\x00i\x00c\x00:\x00F\x004\x004\x00:\x00L
> \x00W\x00B\x00R\x00.\x00m\x00p\x003\x00' #Line breaks added.

This is utf-16le
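
The giveaway is that each ASCII character is followed by a zero byte, so the low byte comes first. A short check, again in Python 3 syntax (hence the b prefix on the literal) and using only the first few bytes of the payload above:

```python
payload = b":\x00i\x00P\x00o\x00d\x00"

# Low byte first, zero high byte second: little-endian.
text = payload.decode("utf-16-le")
print(text)

# The big-endian encoding of the same text puts the zero byte first.
print(text.encode("utf-16-be"))

# Plain "utf-16" prepends a byte-order mark when encoding,
# so the result is two bytes longer than the BOM-less forms.
with_bom = text.encode("utf-16")
print(len(payload), len(with_bom))
```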
> 
> 
> >>> y.location
> 
> u':iPod_Control:Music:F44:LWBR.mp3'
> 
> Which is what I'm after. What I'm struggling with is coercing the
> string that's being passed to sLoc() into UTF-16, and actually
> creating any form of unicode string at all without using
> 
> 
> >>> foo = u'Monkies!'
> 
> 
> Which I'm sure is going to be in UTF-8, just to spite me.

No, it will be a unicode string, and what's wrong with that as a way to create a unicode string anyway?
> 
> So far, the best I've come up with is -
> 
> 
> >>> foo = unicode("Hi Bob!".encode("UTF-16"), "UTF-16")

You are still confused about when to use encode vs decode:
encode goes *away* from unicode
decode goes *towards* unicode

So any of these will work:
  foo = u'Hi Bob!'
  foo = 'Hi Bob!'.decode('ascii')
  foo = unicode('Hi Bob!', 'ascii')
and, assuming sys.getdefaultencoding() returns 'ascii', the last can be written
  foo = unicode('Hi Bob!')
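
The same two directions as a rough sketch in Python 3 terms. There the text type is str (Python 2's unicode) and the 8-bit type is bytes (Python 2's str), so the names differ from the lines above but the rule is the same.

```python
raw = b"Hi Bob!"                 # 8-bit data, here ASCII-encoded

# decode: bytes -> text (towards unicode)
text = raw.decode("ascii")

# encode: text -> bytes (away from unicode)
as_utf16le = text.encode("utf-16-le")
as_ascii = text.encode("ascii")

print(repr(text))
print(as_utf16le)
print(as_ascii == raw)
```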

> Which, as you mention above, is likely to cause me errors. And
> apparently "Hi Bob!" is an 8 bit string encoded in UTF-16...

No, what gives you that idea? It is an 8-bit string encoded in ASCII.

>  *sigh* I suppose I could go the XP route and expect any further users
> to just deal with it and pass in a UTF-16 string, but there's got to
> be a simple way to handle it, and I'm not having too much luck with
> this.
> 
> I've been working from the below document, if anyone can recommend
> something further, I'd much appreciate it.
> 
> http://www.amk.ca/python/howto/unicode

The references are good too, particularly these (quoting the howto):
Roman Czyborra wrote another explanation of Unicode's basic principles; it's at <http://czyborra.com/unicode/characters.html>. Czyborra has written a number of other Unicode-related documentation, available from <http://www.cyzborra.com>.

Two other good introductory articles were written by Joel Spolsky <http://www.joelonsoftware.com/articles/Unicode.html> and Jason Orendorff <http://www.jorendorff.com/articles/unicode/>. If this introduction didn't make things clear to you, you should try reading one of these alternate articles before continuing.

And my own essay has more references at the end:
http://personalpages.tds.net/~kent37/blog/stories/14.html

Keep trying, eventually the mists will clear...this is confusing stuff.

Kent


