[Cython] Odd behavior with std::string and .decode()
Barry Warsaw
barry at python.org
Fri Jul 6 16:21:54 CEST 2012
Thanks for the follow-up, Stefan.
On Jul 06, 2012, at 06:48 AM, Stefan Behnel wrote:
>This is very weird behaviour indeed. I wouldn't know why that should
>happen. What "return as_bytes.decode('utf-8')" does is that it calls
>strlen() to see how long the string is, then it calls the UTF-8 decode
>C-API function with that.
It seems like either the strlen() or the cast through char* is the problem.
>The string that get_description() returns is allocated internally in the
>C++ object, right? So it can't suddenly die or something?
I don't think so.
>One thing I would generally suggest is to do this:
>
> descr = self._this.get_description()
> return descr.data()[:descr.size()].decode('utf-8')
>
>Avoids the call to strlen() by explicitly slicing the pointer. Also avoids
>needing to make sure the C string is 0-terminated.
According to

    http://www.cplusplus.com/reference/string/string/c_str/

    The returned array points to an internal location with the required
    storage space for this sequence of characters plus its terminating
    null-character, but the values in this array should not be modified in the
    program and are only guaranteed to remain unchanged until the next call to
    a non-constant member function of the string object.

I believe the const char* returned by c_str() is guaranteed to be
null-terminated. AFAICT, there are no embedded NULs. I also don't think any
non-constant member functions of the parent string object are being called
in between.
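For anyone wanting to check the embedded-NUL assumption directly, here is a minimal sketch in plain C++. The helper name has_embedded_nul is hypothetical and not part of the wrapper or of Cython; it only compares what strlen() sees against what std::string::size() reports.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical helper (not from the original code): returns true when
// strlen() on c_str() would stop before the real end of the string,
// i.e. when the string contains a NUL byte somewhere inside it.
bool has_embedded_nul(const std::string& s) {
    // c_str() is null-terminated at index size(), so strlen() can only
    // come up short, never long, while the string object is alive.
    return std::strlen(s.c_str()) != s.size();
}
```

If this returns false for the description string, strlen() and size() agree as long as the std::string itself is still alive, so an embedded NUL would not explain the difference between the two variants.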
Next, I tried two different implementations:

    property description:
        def __get__(self):
            # works
            descr = self._this.get_description()
            return descr.c_str()[:descr.size()].decode('utf-8')

    property destruction:
        def __get__(self):
            # broken
            as_bytes = <char *>self._this.get_description().c_str()
            return as_bytes.decode('utf-8')

The second case requires the cast or you get an error:

    xapian.cpp:1409:67: error: invalid conversion from ‘const char*’ to ‘char*’ [-fpermissive]

but I don't think that's the problem. Looking at the generated C++ code, I
see these two different implementations:
works:

    __pyx_t_1 = ((PyObject *)PyUnicode_Decode(__pyx_v_descr.c_str(), __pyx_v_descr.size(), __pyx_k_1, NULL));
    if (unlikely(!__pyx_t_1)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 84; __pyx_clineno = __LINE__; goto __pyx_L1_error;}

broken:

    __pyx_t_1 = ((PyObject *)PyUnicode_Decode(__pyx_v_as_bytes, strlen(__pyx_v_as_bytes), __pyx_k_1, NULL));
    if (unlikely(!__pyx_t_1)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 91; __pyx_clineno = __LINE__; goto __pyx_L1_error;}

In the working case, __pyx_v_descr is a std::string, so the const char*
returned by .c_str() is passed directly to PyUnicode_Decode() without a cast.
The length is returned by std::string.size().
In the broken case, __pyx_v_as_bytes is a char* (I could not figure out how to
preserve the const char* type) and strlen() is used to find the length.
Those are the only substantive differences I could find.
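One difference the two generated calls do not show is how long the std::string lives. The sketch below is a plain C++ analogue under stated assumptions: get_description() here is a hypothetical stand-in assumed to return std::string by value (the thread does not show the real signature), its return value is made up, and decode() is a stand-in for PyUnicode_Decode() that only copies bytes. If the real method does return by value, the buffer behind c_str() in the second variant would belong to a temporary that is destroyed before strlen() runs, which might explain the failure.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for the wrapped C++ method. ASSUMPTION: it returns
// std::string by value; the returned text is invented for illustration.
std::string get_description() {
    return std::string("example description");
}

// Stand-in for PyUnicode_Decode(ptr, len, ...): copies len bytes from ptr.
std::string decode(const char* ptr, std::size_t len) {
    return std::string(ptr, len);
}

// Mirrors the working variant: the string is held in a named local, so the
// buffer behind c_str() stays valid for the whole call.
std::string working_variant() {
    std::string descr = get_description();
    return decode(descr.c_str(), descr.size());
}

// Mirrors the pointer-plus-strlen() variant, but keeps the string alive by
// binding the temporary to a const reference before taking the pointer.
std::string pointer_variant_kept_alive() {
    const std::string& keep = get_description();
    const char* as_bytes = keep.c_str();
    return decode(as_bytes, std::strlen(as_bytes));
}

// Deliberately NOT compiled or run: undefined behaviour when
// get_description() returns by value, because the temporary is destroyed
// at the end of the first statement.
//   const char* as_bytes = get_description().c_str();
//   decode(as_bytes, std::strlen(as_bytes));
```

With the string kept alive, the strlen() path and the size() path give the same result, so under this assumption the cast and strlen() themselves would not be at fault.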
>I wouldn't know any differences off the top of my head, except that 0.17
>has generally better support for STL containers and std::string (but that's
>unrelated to this failure). I'm planning to enable direct support for
>cpp_string.decode(...) as well, but that's not implemented yet. It would
>basically generate the verbose code above automatically.
>
>> Is this a bug or am I doing something stupid?
>
>Definitely not doing something stupid, but I have no idea why this should
>go wrong.
Okay, at least I have a few workarounds :). I'd file a bug but I don't have
permission to file new issues.
If you have any other suggestions for ways to debug this, I'm happy to give
them a try.
Cheers,
-Barry