utf-8 and ctypes
Mark Tolonen
metolone+gmane at gmail.com
Thu Sep 30 01:55:06 EDT 2010
"Brendan Miller" <catphive at catphive.net> wrote in message
news:AANLkTi=2f3L++398St-16MPEs8wzfbLbu+Qa8zTPAbsd at mail.gmail.com...
> 2010/9/29 Lawrence D'Oliveiro <ldo at geek-central.gen.new_zealand>:
>> In message <mailman.1132.1285714474.29448.python-list at python.org>,
>> Brendan Miller wrote:
>>
>>> It seems that characters not in the ascii subset of UTF-8 are
>>> discarded by c_char_p during the conversion ...
>>
>> Not a chance.
>>
>>> ... or at least they don't print out when I go to print the string.
>>
>> So it seems there’s a problem on the printing side. What happens when you
>> construct a UTF-8-encoded string directly in Python and try printing it the
>> same way?
>
> Doing this seems to confirm something is broken in ctypes w.r.t. UTF-8...
>
> if I enter:
> str = "日本語のテスト"
>
> Then:
> print str
> 日本語のテスト
>
> However, when I create a string buffer, pass it into my c++ code, and
> write the same UTF-8 string into it, python seems to discard pretty
> much all the text. The same code works for pure ascii strings.
>
> Python code:
> _std_string_size = _lib_mbxclient.std_string_size
> _std_string_size.restype = c_long
> _std_string_size.argtypes = [c_void_p]
>
> _std_string_copy = _lib_mbxclient.std_string_copy
> _std_string_copy.restype = None
> _std_string_copy.argtypes = [c_void_p, POINTER(c_char)]
>
> # This function works for ascii, but breaks on strings with UTF-8!
> def std_string_to_string(str_ptr):
>     buf = create_string_buffer(_std_string_size(str_ptr))
>     _std_string_copy(str_ptr, buf)
>     return buf.raw
>
> C++ code:
>
> extern "C"
> long std_string_size(string* str)
> {
>     return str->size();
> }
>
> extern "C"
> void std_string_copy(string* str, char* buf)
> {
>     std::copy(str->begin(), str->end(), buf);
> }
I didn't see what OS you are using, but I fleshed out your example code and
have a working example for Windows. Below is the code for the DLL and
script:
--------- x.cpp [cl /LD /EHsc /W4 x.cpp] ----------------------------------------------------
#include <string>
#include <algorithm>
using namespace std;

extern "C" __declspec(dllexport) long std_string_size(string* str)
{
    return str->size();
}

extern "C" __declspec(dllexport) void std_string_copy(string* str, char* buf)
{
    std::copy(str->begin(), str->end(), buf);
}

extern "C" __declspec(dllexport) void* make(const char* s)
{
    return new string(s);
}

extern "C" __declspec(dllexport) void destroy(void* s)
{
    delete (string*)s;
}
---- x.py ---------------------------------------------------------
# coding: utf8
from ctypes import *
_lib_mbxclient = CDLL('x')
_std_string_size = _lib_mbxclient.std_string_size
_std_string_size.restype = c_long
_std_string_size.argtypes = [c_void_p]
_std_string_copy = _lib_mbxclient.std_string_copy
_std_string_copy.restype = None
_std_string_copy.argtypes = [c_void_p, c_char_p]
make = _lib_mbxclient.make
make.restype = c_void_p
make.argtypes = [c_char_p]
destroy = _lib_mbxclient.destroy
destroy.restype = None
destroy.argtypes = [c_void_p]
# Copy the std::string's bytes into a ctypes buffer and return them
# as a byte string (still UTF-8 encoded).
def std_string_to_string(str_ptr):
    buf = create_string_buffer(_std_string_size(str_ptr))
    _std_string_copy(str_ptr, buf)
    return buf.raw
s = make(u'我是美国人。'.encode('utf8'))
print std_string_to_string(s).decode('utf8')
destroy(s)  # free the std::string allocated by make()
------------------------------------------------------
And the output (run in Pythonwin, since the US Windows console doesn't support Chinese):
我是美国人。
I used c_char_p instead of POINTER(c_char) and added functions to create and
destroy a std::string for Python's use, but it is otherwise the same as your
code.
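If you want to rule out the ctypes buffer itself, here is a minimal sketch that needs no DLL. It assumes Python 3 (the script above is Python 2) and uses ctypes.memmove as a stand-in for the C++ std_string_copy call; copy_through_buffer is a name made up for illustration. What it tries to show is that the buffer has to be sized in encoded bytes, not characters, and that buf.raw hands back every byte unchanged.

```python
import ctypes


def copy_through_buffer(data: bytes) -> bytes:
    """Copy bytes into a ctypes buffer and read them back (hypothetical helper)."""
    # Size the buffer by the encoded byte length, not the character count.
    buf = ctypes.create_string_buffer(len(data))
    # Assumption: memmove stands in for the C++ std::copy done by the DLL.
    ctypes.memmove(buf, data, len(data))
    # .raw returns the whole buffer; .value would stop at the first NUL byte.
    return buf.raw


text = u'我是美国人。'
encoded = text.encode('utf-8')
round_tripped = copy_through_buffer(encoded)

# Character count versus UTF-8 byte count.
print(len(text), len(encoded))
# True if the bytes survived the trip through the ctypes buffer.
print(round_tripped.decode('utf-8') == text)
```

If this round trip succeeds on your machine, the bytes are surviving ctypes, and the remaining suspect is how the result is decoded or printed.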
Hope this helps you work it out,
-Mark