[Tutor] Struct and UTF-16
Kent Johnson
kent37 at tds.net
Mon Oct 3 12:38:47 CEST 2005
Liam Clarke wrote:
> I have a base object, which reads the unicode string as bytes like so,
> this ignores all but important bits.
>
> class Mhod:
>     def __init__(self, f):
>         self.payload = struct.unpack("36s", f.read(36))
>
> Which in turn, is utilised in a Song object, which works like this -
>
> class Song:
>     def __init__(self, mhod):
>         self.location = unicode(mhod.payload, "UTF-16")
>         self.mhod = mhod
>     def gLoc(self):
>         return self.location
>     def sLoc(self, value):
>         #Need to coerce data into UTF-16 here
>         self.mhod.payload = value.encode("UTF-16")
I'm confused about what sLoc is supposed to do. Shouldn't it be setting self.location? ISTM sLoc should parallel what __init__ does. What is value here? If it is an Mhod then you should just do
self.location = unicode(mhod.payload, "UTF-16")
again.
OTOH if you are trying to modify Mhod.payload then you should make a method in Mhod and (assuming value is ASCII text) it should be something like
self.payload = unicode(value, 'ascii').encode('UTF-16')
(though see my previous reply about utf-16 vs utf-16be and utf-16le)
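A rough sketch of that second option follows. It assumes the file stores UTF-16LE without a byte order mark, as the payload quoted further down suggests. The method name set_payload and the simplified constructor are made up for illustration, and the literals are written so the code also runs on newer Pythons, where text is already unicode:

```python
class Mhod(object):
    def __init__(self, payload):
        # Hypothetical simplified constructor: takes the raw bytes directly
        # instead of reading them from a file.
        self.payload = payload

    def set_payload(self, value):
        # Accept either an ASCII byte string or a unicode string.
        if isinstance(value, bytes):
            value = value.decode('ascii')   # decode: towards unicode
        # encode: away from unicode. 'utf-16-le' writes no byte order mark;
        # plain 'utf-16' would prepend one.
        self.payload = value.encode('utf-16-le')


m = Mhod(b'')
m.set_payload(u'iPod')
print(repr(m.payload))
```

Song.sLoc could then call such a method instead of assigning to mhod.payload directly.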
>
>     location = property(gLoc, sLoc)
>
> If I were to do a
>
>
> >>> x = Mhod(open("test", "rb"))
> >>> y = Song(x)
>
>
> I get
>
>
> >>> x.payload
>
> ':\x00i\x00P\x00o\x00d\x00_\x00C\x00o\x00n\x00t\x00r\x00o\x00l
> \x00:\x00M\x00u\x00s\x00i\x00c\x00:\x00F\x004\x004\x00:\x00L
> \x00W\x00B\x00R\x00.\x00m\x00p\x003\x00' #Line breaks added.
This is UTF-16LE: little-endian, with no byte order mark at the start.
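You can check this by decoding a few of those bytes with the explicit little-endian codec. The short byte string here is just the start of your payload, retyped by hand for illustration:

```python
data = b':\x00i\x00P\x00o\x00d\x00'

# Each character is two bytes, low byte first, so 'utf-16-le' is the right codec.
text = data.decode('utf-16-le')
print(repr(text))

# Reading the same bytes as big-endian swaps each pair and gives
# entirely different characters.
wrong = data.decode('utf-16-be')
print(text == wrong)
```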
>
>
> >>> y.location
>
> u':iPod_Control:Music:F44:LWBR.mp3'
>
> Which is what I'm after. What I'm struggling with is coercing the
> string that's being passed to sLoc() into UTF-16, and actually
> creating any form of unicode string at all without using
>
>
> >>> foo = u'Monkies!'
>
>
> Which I'm sure is going to be in UTF-8, just to spite me.
No, it will be a unicode string, and what's wrong with that as a way to create a unicode string anyway?
>
> So far, the best I've come up with is -
>
>
> >>> foo = unicode("Hi Bob!".encode("UTF-16"), "UTF-16")
You are still confused about when to use encode vs decode:
encode goes *away* from unicode (unicode string -> byte string)
decode goes *towards* unicode (byte string -> unicode string)
So any of these will work:
foo = u'Hi Bob!'
foo = 'Hi Bob!'.decode('ascii')
foo = unicode('Hi Bob!', 'ascii')
and, assuming sys.getdefaultencoding() returns 'ascii', the last can be written
foo = unicode('Hi Bob!')
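To see both directions in one place, here is a small round trip. It is only a sketch: the variable names are made up, and it uses the .decode()/.encode() methods with b'' and u'' literals so that it behaves the same on newer Pythons:

```python
raw = b'Hi Bob!'                  # 8-bit string holding ASCII bytes

foo = raw.decode('ascii')         # decode: byte string -> unicode
utf16 = foo.encode('utf-16-le')   # encode: unicode -> byte string
back = utf16.decode('utf-16-le')  # decode again to get the unicode back

print(len(raw))     # one byte per character
print(len(utf16))   # two bytes per character
print(back == foo)
```

Note that the unicode string in the middle has no encoding of its own; an encoding only comes into it when you convert to or from bytes.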
> Which, as you mention above, is likely to cause me errors. And
> apparently "Hi Bob!" is an 8 bit string encoded in UTF-16...
No, what gives you that idea? It is an 8-bit string encoded in ASCII.
> *sigh* I suppose I could go the XP route and expect any further users
> to just deal with it and pass in a UTF-16 string, but there's got to
> be a simple way to handle it, and I'm not having too much luck with
> this.
>
> I've been working from the below document, if anyone can recommend
> something further, I'd much appreciate it.
>
> http://www.amk.ca/python/howto/unicode
The references at the end of that document are good too, particularly:
Roman Czyborra wrote another explanation of Unicode's basic principles; it's at <http://czyborra.com/unicode/characters.html>. Czyborra has written a number of other Unicode-related documentation, available from <http://www.cyzborra.com>.
Two other good introductory articles were written by Joel Spolsky <http://www.joelonsoftware.com/articles/Unicode.html> and Jason Orendorff <http://www.jorendorff.com/articles/unicode/>. If this introduction didn't make things clear to you, you should try reading one of these alternate articles before continuing.
And my own essay has more references at the end:
http://personalpages.tds.net/~kent37/blog/stories/14.html
Keep trying, eventually the mists will clear...this is confusing stuff.
Kent