Unicode problem in ucs4

Mon Mar 23 07:37:26 EDT 2009

On 2009-03-23 11:50, abhi wrote:
> On Mar 23, 3:04 pm, "M.-A. Lemburg" <m... at egenix.com> wrote:
> Thanks Marc, John,
>          With your help, I am at least getting somewhere. I rewrote
> the code to compare the Py_UNICODE and wchar_t outputs, and they look
> exactly the same.
> 
> #include <Python.h>
> #include <stdio.h>
> #include <stdlib.h>
> 
> static PyObject *unicode_helper(PyObject *self, PyObject *args)
> {
>     PyObject *sampleObj = NULL;
>     Py_UNICODE *sample = NULL;
>     wchar_t *w = NULL;
>     Py_ssize_t len, size, i;
> 
>     if (!PyArg_ParseTuple(args, "O", &sampleObj)) {
>         return NULL;
>     }
> 
>     /* Explicitly convert it to unicode (new reference) and get the
>        Py_UNICODE buffer. */
>     sampleObj = PyUnicode_FromObject(sampleObj);
>     if (sampleObj == NULL) {
>         return NULL;
>     }
>     sample = PyUnicode_AS_UNICODE(sampleObj);
>     len = PyUnicode_GET_SIZE(sampleObj);
>     printf("size of sampleObj is : %ld\n", (long)len);
> 
>     w = (wchar_t *) malloc((len + 1) * sizeof(wchar_t));
>     if (w == NULL) {
>         Py_DECREF(sampleObj);
>         return PyErr_NoMemory();
>     }
>     /* The third argument is a count of wchar_t elements, not bytes. */
>     size = PyUnicode_AsWideChar((PyUnicodeObject *)sampleObj, w, len + 1);
>     printf("%ld chars are copied to w\n", (long)size);
>     printf("size of wchar_t is : %lu\n", (unsigned long)sizeof(wchar_t));
>     printf("size of Py_UNICODE is: %lu\n",
>            (unsigned long)sizeof(Py_UNICODE));
>     for (i = 0; i < len; i++) {
>         printf("sample is : %c\n", sample[i]);
>         printf("w is : %c\n", w[i]);
>     }
>     free(w);  /* the wchar_t copy is no longer needed */
>     return sampleObj;
> }
> 
> static PyMethodDef funcs[] = {
>     {"unicodeTest", (PyCFunction)unicode_helper, METH_VARARGS,
>      "test ucs2, ucs4"},
>     {NULL, NULL, 0, NULL}
> };
> 
> PyMODINIT_FUNC initunicodeTest(void)
> {
>     Py_InitModule3("unicodeTest", funcs, "");
> }
> 
> This gives the following output when I pass "abc" as input:
> 
> size of sampleObj is : 3
> 3 chars are copied to w
> size of wchar_t is : 4
> size of Py_UNICODE is: 4
> sample is : a
> w is : a
> sample is : b
> w is : b
> sample is : c
> w is : c
> 
> So both Py_UNICODE and wchar_t are 4 bytes wide, and since each
> character is followed by three \0 bytes, printf or wprintf prints only
> one letter. I need to process the data further, and the libraries
> involved need it in UCS2 format (2 bytes per character); otherwise
> they fail. Is there any way to force wchar_t to be 2 bytes, or can I
> convert this UCS4 data to UCS2 explicitly?

Sure: just use the appropriate UTF-16 codec for this.

/* Generic codec based encoding API.

   object is passed through the encoder function found for the given
   encoding using the error handling method defined by errors. errors
   may be NULL to use the default method defined for the codec.

   Raises a LookupError in case no encoder can be found.

 */

PyAPI_FUNC(PyObject *) PyCodec_Encode(
       PyObject *object,
       const char *encoding,
       const char *errors
       );

Set encoding to 'utf-16-le' for little-endian output or 'utf-16-be'
for big-endian output. Plain 'utf-16' would prepend a byte order mark.
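
As a rough illustration of what the UTF-16 encoding step does to UCS4
data, here is a minimal plain-C sketch that does not use the Python API
at all. The helper names ucs4_to_utf16le and put_u16le are hypothetical,
made up for this example, and rejecting surrogate code points is a
choice made here, not necessarily what Python's codec does. Code points
below 0x10000 become one 2-byte unit (this is the UCS2 case); anything
above that becomes a surrogate pair.

```c
#include <stddef.h>
#include <stdint.h>

/* Store one 16-bit code unit in little-endian byte order. */
static void put_u16le(unsigned char *dst, uint16_t unit)
{
    dst[0] = (unsigned char)(unit & 0xFF);
    dst[1] = (unsigned char)(unit >> 8);
}

/* Hypothetical helper (not part of the Python C API): encode n UCS4
   code points from src as UTF-16 little-endian bytes into dst.
   dst must have room for at least 4 * n bytes, the worst case in
   which every code point needs a surrogate pair.
   Returns the number of bytes written, or (size_t)-1 if a code point
   is out of range or is itself a surrogate. */
size_t ucs4_to_utf16le(const uint32_t *src, size_t n, unsigned char *dst)
{
    size_t out = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        uint32_t cp = src[i];

        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;

        if (cp < 0x10000) {
            /* BMP code point: a single 16-bit unit. */
            put_u16le(dst + out, (uint16_t)cp);
            out += 2;
        } else {
            /* Beyond the BMP: split into a high and a low surrogate. */
            cp -= 0x10000;
            put_u16le(dst + out, (uint16_t)(0xD800 | (cp >> 10)));
            put_u16le(dst + out + 2, (uint16_t)(0xDC00 | (cp & 0x3FF)));
            out += 4;
        }
    }
    return out;
}
```

For the "abc" input from your test, this writes the 6 bytes
61 00 62 00 63 00, i.e. one 2-byte unit per character.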

-- 
Marc-Andre Lemburg
eGenix.com
