[Python-Dev] getting the UCS-2 representation of a unicode object

Andreas Jung andreas@andreas-jung.com
Sun, 19 May 2002 20:31:04 -0400

I was just confused that part of the documentation talks about UTF-16
vs. UCS-2, since Python uses UCS-2 (or UCS-4) as its internal representation.
I also did not know that UCS-2 is a subset of UTF-16... I think my problems
are now solved... at least from the Python side.
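
To illustrate the "UCS-2 is a subset of UTF-16" point, here is a minimal
sketch. It assumes a modern Python 3 interpreter and the standard
"utf-16-le" codec rather than the Python 2.2 C API discussed in this
thread; the variable names are made up for the illustration.

```python
# Characters inside the Basic Multilingual Plane (below 2**16) take exactly
# one 16-bit unit in UTF-16, which is the same encoding UCS-2 would give.
bmp_text = "computer"
bmp_bytes = bmp_text.encode("utf-16-le")
print(len(bmp_bytes))        # 8 characters * 2 bytes each

# A character outside the BMP cannot be represented in UCS-2 at all;
# UTF-16 encodes it as a surrogate pair (two 16-bit units).
astral_text = "\U00010400"   # DESERET CAPITAL LETTER LONG I
astral_bytes = astral_text.encode("utf-16-le")
print(len(astral_bytes))     # one character, but 4 bytes
print(astral_bytes.hex())    # high surrogate D801, low surrogate DC00
```

As long as the text contains no such characters, the UTF-16 bytes can be
treated as UCS-2.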


----- Original Message -----
From: "John Machin" <sjmachin@lexicon.net>
To: "Andreas Jung" <andreas@andreas-jung.com>
Cc: <python-dev@python.org>
Sent: Sunday, May 19, 2002 20:22
Subject: Re: [Python-Dev] getting the UCS-2 representation of a unicode

> 20/05/2002 12:35:19 AM, "Andreas Jung" <andreas@andreas-jung.com> wrote:
> >Sounds reasonable, but since PyArg_ParseTuple() only applies to function
> >arguments, it cannot be used to convert a unicode object to UCS-2. So what
> >is the easiest way to get the UCS-2 representation? PyUnicode_AS_DATA()
> >returns for u'computer' a char * with strlen()==1, whereas
> >PyUnicode_GET_DATA_SIZE() on the same string returns 16 (which looks fine
> >for the two-byte encoding of UCS-2). Am I missing something?
> >
> Andreas,
> If you don't care about surrogates or weird things like the Hong Kong
> extended character set that are outside the 2**16 range, pretend
> UCS-2 == UTF-16. Then on a narrow Python build, the unicode object is in
> effect in UCS-2; no conversion required.
> You are indeed missing something about PyUnicode_AS_DATA -- the doc says
> it returns a char * pointer to the internal buffer. I can't imagine what
> relevance strlen(such_a_pointer) has. The buffer will contain
> "c\0o\0m\0 etc etc" when viewed as a series of bytes (on a little-endian
> box) so yes strlen -> 1 but so what?
> What is there about the PyUnicode_AS_UNICODE() function that you don't
> Perhaps you might like to (a) say what you are trying to achieve (b) move
> the discussion to c.l.py
> Regards,
> John
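
The strlen() observation in the quoted message can be reproduced without
the C API. The following is only a sketch: it assumes Python 3 and uses the
"utf-16-le" and "utf-16-be" codecs as stand-ins for the internal buffer of
a narrow build on a little-endian and a big-endian machine respectively.
The helper name c_strlen is hypothetical.

```python
def c_strlen(buf: bytes) -> int:
    """Count bytes before the first NUL, as C's strlen() would."""
    nul = buf.find(b"\x00")
    return len(buf) if nul == -1 else nul

# Stand-in for the buffer behind u'computer' on a little-endian box.
little = "computer".encode("utf-16-le")
print(little[:6])        # the "c\0o\0m\0" byte pattern
print(len(little))       # the data size: 8 code units * 2 bytes
print(c_strlen(little))  # stops at the NUL right after 'c'

# On a big-endian box the high (zero) byte comes first, so strlen()
# would stop immediately.
big = "computer".encode("utf-16-be")
print(c_strlen(big))
```

The byte count, not strlen(), is the meaningful length for such a buffer,
because every ASCII character contributes a zero byte.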