[Tutor] Struct and UTF-16
Liam Clarke
ml.cyresse at gmail.com
Mon Oct 3 12:11:04 CEST 2005
OK, one last kick.
So, using
val = unicode(value)
self._slaveMap[attr].setPayload(val.encode("UTF-16"))
I can stick normal strings in happily. Of course, as you mentioned,
Kent, this leaves me vulnerable if the string's encoding differs from
sys.getdefaultencoding().
Other than directly from the user, the most likely source of data will
be from pyid3lib, which for the time being assumes all strings are
ISO-8859-1.
http://pyid3lib.sourceforge.net
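
One way to avoid leaning on sys.getdefaultencoding() is to name the source
encoding explicitly whenever a byte string comes in. A minimal sketch, in
Python 3 spelling (bytes/str where this thread says str/unicode): to_utf16 is
a hypothetical helper, the ISO-8859-1 default is only the pyid3lib assumption
mentioned above, and "utf-16-le" is an assumption based on the BOM-less
payload shown in the quoted message below.

```python
def to_utf16(value, source_encoding="iso-8859-1"):
    """Return value as UTF-16 bytes (little-endian, no byte-order mark).

    Hypothetical helper: byte strings are decoded with an explicitly named
    encoding instead of the interpreter's default encoding.
    """
    if isinstance(value, bytes):
        value = value.decode(source_encoding)
    return value.encode("utf-16-le")


print(to_utf16("Hi"))         # text goes straight to encode()
print(to_utf16(b"caf\xe9"))   # ISO-8859-1 bytes are decoded first
```

The caller still has to know where the bytes came from; the helper only makes
that knowledge explicit instead of implicit.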
Erk. Talk about big design up front. Would you recommend a different
method of dealing with this? Basically, most of the strings in the
database are UTF-16, and I just need to make them readable, and make
sure any of the strings going in are UTF-16 as well.
Alternatively, I've thought about just cycling through the 100 or so
available codecs until I don't get any UnicodeDecodeErrors, but that's no
guarantee that the result will be human readable...oh dear.
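
To illustrate why the absence of a UnicodeDecodeError proves little, here is
a small sketch in Python 3 spelling. The sample bytes and the three codecs
are chosen only for illustration; any single-byte codec that defines every
byte value behaves the same way.

```python
raw = b"\xe9t\xe9"   # reads as "été" only if the bytes are ISO-8859-1

for codec in ("iso-8859-1", "cp1251", "koi8_r"):
    # None of these raises UnicodeDecodeError, yet each yields different text.
    print(codec, repr(raw.decode(codec)))

# A multi-byte codec may reject the very same bytes.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("utf-8 failed:", exc.reason)
```

So the first codec that decodes without error is not necessarily the right
one; the encoding has to come from knowledge of the data source.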
Thanks for any assistance offered.
Liam Clarke
On 10/3/05, Liam Clarke <ml.cyresse at gmail.com> wrote:
> Hi,
>
> If I can just beat this horse one more time, can I just get
> confirmation that I'm going about this the right way?
>
> I have a base object which reads the unicode string as bytes, like so
> (this ignores all but the important bits):
>
> import struct
>
> class Mhod:
>     def __init__(self, f):
>         # unpack() returns a tuple, so take its only element
>         self.payload = struct.unpack("36s", f.read(36))[0]
>
> Which in turn, is utilised in a Song object, which works like this -
>
> class Song(object):  # new-style class, so the property setter is used
>     def __init__(self, mhod):
>         self.mhod = mhod
>     def gLoc(self):
>         # Decode the raw payload on the way out
>         return unicode(self.mhod.payload, "UTF-16")
>     def sLoc(self, value):
>         # Need to coerce data into UTF-16 here
>         self.mhod.payload = value.encode("UTF-16")
>
>     location = property(gLoc, sLoc)
>
> If I were to do a
>
> >>> x = Mhod(open("test", "rb"))
> >>> y = Song(x)
>
> I get
>
> >>> x.payload
> ':\x00i\x00P\x00o\x00d\x00_\x00C\x00o\x00n\x00t\x00r\x00o\x00l
> \x00:\x00M\x00u\x00s\x00i\x00c\x00:\x00F\x004\x004\x00:\x00L
> \x00W\x00B\x00R\x00.\x00m\x00p\x003\x00' #Line breaks added.
>
> >>> y.location
> u':iPod_Control:Music:F44:LWBR.mp3'
>
> Which is what I'm after. What I'm struggling with is coercing the
> string that's being passed to sLoc() into UTF-16, and actually
> creating any form of unicode string at all without using
>
> >>> foo = u'Monkies!'
>
> Which I'm sure is going to be in UTF-8, just to spite me.
>
> So far, the best I've come up with is -
>
> >>> foo = unicode("Hi Bob!".encode("UTF-16"), "UTF-16")
>
> Which, as you mention above, is likely to cause me errors. And
> apparently "Hi Bob!" is an 8 bit string encoded in UTF-16...
> *sigh* I suppose I could go the XP route and expect any further users
> to just deal with it and pass in a UTF-16 string, but there's got to
> be a simple way to handle it, and I'm not having too much luck with
> this.
>
> I've been working from the below document, if anyone can recommend
> something further, I'd much appreciate it.
>
> http://www.amk.ca/python/howto/unicode
>
> Regards,
>
> Liam Clarke
> On 10/3/05, Liam Clarke <ml.cyresse at gmail.com> wrote:
> > Thanks Kent,
> >
> > My first time dealing with Python and unicode vs 'normal' strings, I
> > do look forward to Python 3.0... at the moment I'm just trying to
> > understand how to use UTF-16.
> >
> > Basically, I have data which is coming straight from struct.unpack()
> > and it's a UTF-16 string, and I'm just trying to get my head around
> > dealing with the data coming in from struct, and putting my data out
> > through struct.
> >
> > It doesn't help overly that struct considers all strings to consist of
> > one byte per char, whereas UTF-16 is two. And I was having trouble as
> > to how to write UTF-16 stuff out properly.
> >
> > But, if I understand it correctly, I could use
> >
> > import struct
> >
> > j = u"some unicode string"
> > out = j.encode("UTF-16")
> > pattern = "%ds" % len(out)
> > packed = struct.pack(pattern, out)
> >
> > without too much difficulty.
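
The round trip through struct can be sketched as follows, in Python 3
spelling (bytes/str where this thread says str/unicode). The sample path is
the one quoted earlier in the thread; using "utf-16-le" rather than "utf-16"
is an assumption, made because that payload carries no byte-order mark.

```python
import struct

text = ":iPod_Control:Music:F44:LWBR.mp3"
out = text.encode("utf-16-le")      # two bytes per character here, no BOM
pattern = "%ds" % len(out)          # struct counts bytes, not characters
packed = struct.pack(pattern, out)

# unpack() always returns a tuple, even for a single field.
(payload,) = struct.unpack(pattern, packed)
print(pattern, payload.decode("utf-16-le") == text)
```

Because the "s" format counts bytes, the length has to be taken from the
encoded bytes, not from the original text.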
> >
> > Regards,
> >
> > Liam Clarke
> >
> > On 10/3/05, Kent Johnson <kent37 at tds.net> wrote:
> > > Liam Clarke wrote:
> > > > What's the difference between
> > > >
> > > > x = "Hi"
> > > > y = x.encode("UTF-16")
> > > >
> > > > and
> > > >
> > > > y = unicode(x, "UTF-16")
> > >
> > > They are more-or-less opposite.
> > >
> > > encode() converts away from unicode. (Think of unicode as the 'normal' format, anything else is 'encoded'.) Normally it is used on a unicode string, not a byte string. It means, "interpret this string as unicode, then convert it to an encoded byte string using the given encoding".
> > >
> > > When you encode a non-unicode string (like "Hi"), the string is first converted to unicode (decoded) using sys.getdefaultencoding(), then encoded using the supplied encoding. So
> > > 'Hi'.encode('utf-16')
> > > is the same as
> > > 'Hi'.decode(sys.getdefaultencoding()).encode('utf-16')
> > >
> > > In either case, the result is a string in UTF-16 encoding:
> > > >>> 'Hi'.encode('UTF-16')
> > > '\xff\xfeH\x00i\x00'
> > > >>> import sys
> > > >>> 'Hi'.decode(sys.getdefaultencoding()).encode('utf-16')
> > > '\xff\xfeH\x00i\x00'
> > >
> > > Note that the utf-16 codec puts a byte-order mark ('\xff\xfe') in the output; then 'H' becomes 'H\x00' and 'i' becomes 'i\x00'.
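
The byte-order mark matters when the target format expects raw UTF-16
without one. A short sketch in Python 3 spelling, using only the standard
"utf-16", "utf-16-le" and "utf-16-be" codecs; whether a given file format
wants a BOM is something to check against real data, not something this
example decides.

```python
import codecs

plain = "Hi".encode("utf-16")       # BOM followed by native byte order
little = "Hi".encode("utf-16-le")   # explicit byte order, no BOM
big = "Hi".encode("utf-16-be")      # explicit byte order, no BOM

print(plain[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
print(little, big)

# The generic codec consumes the BOM again when decoding.
print(plain.decode("utf-16"))
```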
> > >
> > > Because sys.getdefaultencoding() is used to convert to unicode, you will get an error if the original string cannot be decoded with this encoding:
> > >
> > > >>> '\xe3'.encode('utf-16')
> > > Traceback (most recent call last):
> > > File "<stdin>", line 1, in ?
> > > UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)
> > >
> > >
> > > What about unicode('Hi', 'utf-16')? This doesn't do anything useful:
> > > >>> unicode('Hi', 'UTF-16')
> > > u'\u6948'
> > >
> > > unicode('Hi', 'utf-16') means the same as 'Hi'.decode('utf-16'). In this case we are saying, "Interpret this string as an encoded byte string in the given encoding, and convert it to a unicode string." Since 'Hi' is not, in fact, a byte string encoded in UTF-16, the results are not very useful.
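
The odd result above is simply the two ASCII bytes read as one 16-bit code
unit. A sketch in Python 3 spelling, with the byte order pinned to
little-endian so that the outcome does not depend on the machine:

```python
raw = b"Hi"                          # bytes 0x48 0x69
as_utf16 = raw.decode("utf-16-le")   # low byte first: 0x6948

print(len(as_utf16), hex(ord(as_utf16)))   # 1 0x6948
print(raw.decode("ascii"))                 # Hi
```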
> > >
> > >
> > > To summarize:
> > > If you have an encoded byte string and you want a unicode string, use str.decode() or unicode()
> > >
> > > If you have a unicode string and you want an encoded byte string, use unicode.encode().
> > >
> > > If you are using str.encode() you probably haven't thought through your problem completely and you will likely get UnicodeDecodeErrors when you have non-ASCII data.
> > >
> > >
> > > If you are writing a unicode-aware application, a good strategy is to keep all strings internally as unicode and to convert to and from the required encodings at the boundaries.
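
A minimal sketch of the "convert at the boundaries" strategy, in Python 3
spelling. Record is a hypothetical stand-in for an object that wraps a raw
payload, and "utf-16-le" is an assumption about the stored format; the point
is only that the rest of the program sees text, never encoded bytes.

```python
class Record:
    """Hypothetical wrapper around a raw UTF-16 payload."""

    def __init__(self, payload):
        self.payload = payload   # bytes, exactly as read from the file

    @property
    def location(self):
        # Boundary in: encoded bytes -> text
        return self.payload.decode("utf-16-le")

    @location.setter
    def location(self, value):
        # Boundary out: text -> encoded bytes
        self.payload = value.encode("utf-16-le")


rec = Record("old.mp3".encode("utf-16-le"))
rec.location = rec.location.replace("old", "new")   # text-only manipulation
print(rec.location, rec.payload)
```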
> > >
> > > Kent