utf-8 and ctypes

Mark Tolonen metolone+gmane at gmail.com
Thu Sep 30 07:55:06 CEST 2010


"Brendan Miller" <catphive at catphive.net> wrote in message 
news:AANLkTi=2f3L++398St-16MPEs8wzfbLbu+Qa8zTPAbsd at mail.gmail.com...
> 2010/9/29 Lawrence D'Oliveiro <ldo at geek-central.gen.new_zealand>:
>> In message <mailman.1132.1285714474.29448.python-list at python.org>,
>> Brendan Miller wrote:
>>
>>> It seems that characters not in the ascii subset of UTF-8 are
>>> discarded by c_char_p during the conversion ...
>>
>> Not a chance.
>>
>>> ... or at least they don't print out when I go to print the string.
>>
>> So it seems there’s a problem on the printing side. What happens when you
>> construct a UTF-8-encoded string directly in Python and try printing it the
>> same way?
>
> Doing this seems to confirm something is broken in ctypes w.r.t. UTF-8...
>
> if I enter:
> str = "日本語のテスト"
>
> Then:
> print str
> 日本語のテスト
>
> However, when I create a string buffer, pass it into my c++ code, and
> write the same UTF-8 string into it, python seems to discard pretty
> much all the text. The same code works for pure ascii strings.
>
> Python code:
> _std_string_size = _lib_mbxclient.std_string_size
> _std_string_size.restype = c_long
> _std_string_size.argtypes = [c_void_p]
>
> _std_string_copy = _lib_mbxclient.std_string_copy
> _std_string_copy.restype = None
> _std_string_copy.argtypes = [c_void_p, POINTER(c_char)]
>
> # This function works for ascii, but breaks on strings with UTF-8!
> def std_string_to_string(str_ptr):
>    buf = create_string_buffer(_std_string_size(str_ptr))
>    _std_string_copy(str_ptr, buf)
>    return buf.raw
>
> C++ code:
>
> extern "C"
> long std_string_size(string* str)
> {
> return str->size();
> }
>
> extern "C"
> void std_string_copy(string* str, char* buf)
> {
> std::copy(str->begin(), str->end(), buf);
> }

I didn't see what OS you are using, but I fleshed out your example code and 
have a working example for Windows.  Below is the code for the DLL and 
script:

--------- x.cpp [cl /LD /EHsc /W4 x.cpp] ----------------------------------------------------
#include <string>
#include <algorithm>
using namespace std;

extern "C" __declspec(dllexport) long std_string_size(string* str)
{
 return static_cast<long>(str->size());  // size_t -> long, made explicit
}

extern "C" __declspec(dllexport) void std_string_copy(string* str, char* buf)
{
 std::copy(str->begin(), str->end(), buf);
}

extern "C" __declspec(dllexport) void* make(const char* s)
{
 return new string(s);
}

extern "C" __declspec(dllexport) void destroy(void* s)
{
 delete (string*)s;
}
---- x.py ---------------------------------------------------------
# coding: utf8
from ctypes import *
_lib_mbxclient = CDLL('x')

_std_string_size = _lib_mbxclient.std_string_size
_std_string_size.restype = c_long
_std_string_size.argtypes = [c_void_p]

_std_string_copy = _lib_mbxclient.std_string_copy
_std_string_copy.restype = None
_std_string_copy.argtypes = [c_void_p, c_char_p]

make = _lib_mbxclient.make
make.restype = c_void_p
make.argtypes = [c_char_p]

destroy = _lib_mbxclient.destroy
destroy.restype = None
destroy.argtypes = [c_void_p]

# Copy the std::string's bytes into a ctypes buffer and return them as a byte string.
def std_string_to_string(str_ptr):
    buf = create_string_buffer(_std_string_size(str_ptr))
    _std_string_copy(str_ptr, buf)
    return buf.raw

s = make(u'我是美国人。'.encode('utf8'))
print std_string_to_string(s).decode('utf8')
------------------------------------------------------

And the output (run in Pythonwin, because the US Windows console doesn't support Chinese):

我是美国人。

I used c_char_p instead of POINTER(c_char) and added functions to create and 
destroy a std::string for Python's use, but it is otherwise the same as your 
code.
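
If you want to rule out ctypes itself without building the DLL, here is a minimal sketch. It is written for Python 3 (the script above is Python 2), and it makes two assumptions for illustration only: ctypes.memmove stands in for the C++ std_string_copy, and roundtrip is a hypothetical helper name. It shows that a buffer from create_string_buffer hands back every UTF-8 byte through .raw:

```python
# Python 3 sketch (the script above is Python 2). No DLL is needed:
# ctypes.memmove stands in for the C++ std_string_copy, purely for illustration.
import ctypes

def roundtrip(data):
    """Copy bytes into a ctypes buffer and read them back unchanged."""
    buf = ctypes.create_string_buffer(len(data))  # exactly len(data) bytes
    ctypes.memmove(buf, data, len(data))          # raw byte copy, no decoding
    return buf.raw                                # .raw returns every byte in the buffer

text = "我是美国人。"
encoded = text.encode("utf-8")

copied = roundtrip(encoded)
print(len(text), len(encoded))           # character count vs. byte count
print(copied == encoded)                 # the bytes survive the buffer intact
print(copied.decode("utf-8") == text)    # decoding restores the original text
```

If this check passes on your machine but the printed text still looks wrong, the more likely culprits are the console's encoding or a missing decode step rather than ctypes.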

Hope this helps you work it out,
-Mark






