[Python-Dev] getting the UCS-2 representation of a unicode object

John Machin sjmachin@lexicon.net
Mon, 20 May 2002 10:22:39 +1000


20/05/2002 12:35:19 AM, "Andreas Jung" <andreas@andreas-jung.com> wrote:

>Sounds reasonable, but since Py_ParseTuple() only applies to function
>arguments, it cannot be used to convert a unicode object to UCS-2. So what
>is the easiest way to get the UCS-2 representation? For u'computer',
>PyUnicode_AS_DATA() returns a char * with strlen()==1, whereas
>PyUnicode_GET_DATA_SIZE() on the same string returns 16 (which looks fine
>for the two-byte encoding of UCS-2). Am I missing something?
>

Andreas,

If you don't care about surrogates, or about characters outside the 2**16 range (such as some of those in the Hong Kong extended character set), you can pretend UCS-2 == UTF-16. On a narrow Python build the unicode object is then in effect already UCS-2; no conversion is required.
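
To illustrate the "pretend UCS-2 == UTF-16" point at the Python level, here is a minimal sketch. It assumes a current Python 3 interpreter (where iterating a str yields whole code points), and the helper name to_ucs2_le is hypothetical, not part of any Python API. It encodes as UTF-16 and simply refuses anything outside the 2**16 range, which is the case where the two encodings differ.

```python
def to_ucs2_le(text):
    """Return little-endian UCS-2 bytes for text (hypothetical helper).

    For characters below 2**16, UTF-16 and UCS-2 produce the same bytes,
    so the UTF-16 codec is used after checking the range.
    """
    for ch in text:
        if ord(ch) > 0xFFFF:
            # Such a character would need a surrogate pair in UTF-16,
            # which UCS-2 cannot represent.
            raise ValueError("character outside the 2**16 range: %r" % ch)
    # "utf-16-le" writes no byte-order mark, just two bytes per code unit.
    return text.encode("utf-16-le")


data = to_ucs2_le("computer")
print(len(data))  # two bytes for each of the eight characters
```

The explicit byte order avoids the byte-order mark that the plain "utf-16" codec would prepend.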

You are indeed missing something about PyUnicode_AS_DATA(): the doc says it returns a char * pointer to the internal buffer, not a NUL-terminated C string, so I can't imagine what relevance strlen(such_a_pointer) has. Viewed as a series of bytes on a little-endian box, the buffer contains "c\0o\0m\0" and so on, so yes, strlen() stops at the first zero byte and returns 1, but so what? The size of the data is what PyUnicode_GET_DATA_SIZE() reports.
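
The byte layout described above can be reproduced without the C API. The following is only a sketch: c_strlen is a hypothetical stand-in for C's strlen(), and the "utf-16-le" and "utf-16-be" codecs are used here to mimic how a two-byte-per-character buffer looks on little-endian and big-endian machines.

```python
def c_strlen(buf):
    """Mimic C strlen(): count the bytes before the first zero byte."""
    pos = buf.find(b"\x00")
    return len(buf) if pos == -1 else pos


little = "computer".encode("utf-16-le")  # b"c\x00o\x00m\x00..."
big = "computer".encode("utf-16-be")     # b"\x00c\x00o\x00m..."

print(len(little))       # total size of the buffer in bytes
print(c_strlen(little))  # stops right after the "c"
print(c_strlen(big))     # stops immediately: the first byte is zero
```

In both byte orders the strlen-style count says nothing about the data; only the explicit size does.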

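What is there about PyUnicode_AS_UNICODE() that you don't like? It gives you a Py_UNICODE * pointer to the same internal buffer, and PyUnicode_GET_SIZE() gives you its length in Py_UNICODE units.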

Perhaps you might like to (a) say what you are trying to achieve and (b) move the discussion to c.l.py.

Regards,

John