[Cython] Odd behavior with std::string and .decode()
Stefan Behnel
stefan_ml at behnel.de
Sat Jul 7 01:54:55 CEST 2012
Barry Warsaw, 06.07.2012 16:21:
> Thanks for the follow up Stefan,
>
> On Jul 06, 2012, at 06:48 AM, Stefan Behnel wrote:
>
>> This is very weird behaviour indeed. I wouldn't know why that should
>> happen. What "return as_bytes.decode('utf-8')" does is that it calls
>> strlen() to see how long the string is, then it calls the UTF-8 decode
>> C-API function with that.
>
> It seems like either the strlen() or the cast through char* is the problem.
Could you try it without the cast?
https://sage.math.washington.edu:8091/hudson/job/cython-docs/doclinks/1/src/tutorial/strings.html#dealing-with-const
In older Cython versions, you can use the declarations directly:
https://github.com/cython/cython/blob/master/Cython/Includes/libc/string.pxd#L3
I just noticed that .c_str() is incorrectly declared (without "const") in
Cython and that .data() is missing completely. I've pushed a fix for that
(note that the current master looks a bit broken, which is rather
unfortunate for testing).
>> One thing I would generally suggest is to do this:
>>
>>     descr = self._this.get_description()
>>     return descr.data()[:descr.size()].decode('utf-8')
>>
>> Avoids the call to strlen() by explicitly slicing the pointer. Also avoids
>> needing to make sure the C string is 0-terminated.
>
> According to
>
> http://www.cplusplus.com/reference/string/string/c_str/
>
> The returned array points to an internal location with the required
> storage space for this sequence of characters plus its terminating
> null-character, but the values in this array should not be modified in the
> program and are only guaranteed to remain unchanged until the next call to
> a non-constant member function of the string object.
>
> I believe the const char* returned by c_str() is guaranteed to be null
> terminated. AFAICT, there are no embedded NULs. I also don't think there are
> any non-constant member function calls of the parent string object getting in
> the way.
Yes to all of the above.
What I meant was that .c_str() may be less efficient than .data() because
the internal string buffer may not be 0-terminated originally.
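To make the difference between the two length sources concrete, here is a minimal C++ sketch. The function names are made up for illustration and are not part of Cython or Xapian. It compares the length that strlen() finds through .c_str() with the length the string itself reports through .size(), and shows the C++ counterpart of the descr.data()[:descr.size()] slice.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Length as seen by strlen(): scanning stops at the first 0 byte.
std::size_t length_via_strlen(const std::string &s) {
    return std::strlen(s.c_str());
}

// Length as stored by the string object: every byte is counted.
std::size_t length_via_size(const std::string &s) {
    return s.size();
}

// Copy exactly size() bytes starting at data(), which mirrors
// descr.data()[:descr.size()] on the Cython side.
std::string copy_sliced(const std::string &s) {
    return std::string(s.data(), s.size());
}

// Illustrative self-check (hypothetical helper, not from the thread).
void check_lengths() {
    const std::string plain("caf\xc3\xa9");   // 5 bytes of UTF-8, no 0 byte
    const std::string with_nul("ab\0cd", 5);  // 5 bytes with an embedded 0
    assert(length_via_strlen(plain) == length_via_size(plain));
    assert(length_via_strlen(with_nul) == 2);
    assert(length_via_size(with_nul) == 5);
    assert(copy_sliced(with_nul) == with_nul);
}
```

For a string without embedded 0 bytes, which is the case described in this thread, both lengths agree; the explicit slice simply does not depend on that.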
> Next, I tried two different implementations:
>
> property description:
>     def __get__(self):
>         # works
>         descr = self._this.get_description()
>         return descr.c_str()[:descr.size()].decode('utf-8')
>
> property destruction:
>     def __get__(self):
>         # broken
>         as_bytes = <char *>self._this.get_description().c_str()
>         return as_bytes.decode('utf-8')
>
> The second case requires the cast or you get an error:
>
> xapian.cpp:1409:67: error: invalid conversion from ‘const char*’ to ‘char*’ [-fpermissive]
>
> but I don't think that's the problem. Looking at the generated C++ code, I
> see these two different implementations:
>
> works:
>
> __pyx_t_1 = ((PyObject *)PyUnicode_Decode(__pyx_v_descr.c_str(), __pyx_v_descr.size(), __pyx_k_1, NULL)); if (unlikely(!__pyx_t_1)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 84; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
>
> broken:
>
> __pyx_t_1 = ((PyObject *)PyUnicode_Decode(__pyx_v_as_bytes, strlen(__pyx_v_as_bytes), __pyx_k_1, NULL)); if (unlikely(!__pyx_t_1)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 91; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
>
> In the working case, __pyx_v_descr is a std::string, so the const char*
> returned by .c_str() is passed directly to PyUnicode_Decode() without a cast.
> The length is returned by std::string.size().
>
> In the broken case, __pyx_v_as_bytes is a char* (I could not figure out how to
> preserve the const char* type) and strlen() is used to find the length.
>
> Those are the only substantive differences I could find.
Maybe the C++ compiler is going mad because of the cast that kills "const"?
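One way to test that hypothesis in isolation is a plain C++ sketch of the two call shapes. All names here are hypothetical, and take_bytes() only stands in for the (pointer, length) pair passed to PyUnicode_Decode(); it is not the real C-API call. The sketch assumes the std::string is a named object that stays alive while the pointer is used; it does not cover a pointer taken from a temporary string.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Stand-in for the (pointer, length) arguments of PyUnicode_Decode():
// it copies `length` bytes so the result can be compared.
std::string take_bytes(const char *p, std::size_t length) {
    return std::string(p, length);
}

// "works" shape: const pointer from c_str(), length from size().
std::string via_size(const std::string &descr) {
    return take_bytes(descr.c_str(), descr.size());
}

// "broken" shape, but with the string still alive: the const is cast
// away and strlen() determines the length.
std::string via_cast_and_strlen(const std::string &descr) {
    char *as_bytes = const_cast<char *>(descr.c_str());
    return take_bytes(as_bytes, std::strlen(as_bytes));
}

// Illustrative self-check (hypothetical helper, not from the thread).
void check_shapes_agree() {
    const std::string descr("description with caf\xc3\xa9");
    assert(via_size(descr) == via_cast_and_strlen(descr));
}
```

Under that assumption both shapes yield the same bytes, so if the check passes with the compiler in question, the cast by itself is an unlikely culprit.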
>> I wouldn't know any differences off the top of my head, except that 0.17
>> has generally better support for STL containers and std::string (but that's
>> unrelated to this failure). I'm planning to enable direct support for
>> cpp_string.decode(...) as well, but that's not implemented yet. It would
>> basically generate the verbose code above automatically.
>>
>>> Is this a bug or am I doing something stupid?
>>
>> Definitely not doing something stupid, but I have no idea why this should
>> go wrong.
>
> Okay, at least I have a few workarounds :). I'd file a bug but I don't have
> permission to file new issues.
Please send an htpasswd entry to me or Robert.
> If you have any other suggestions for ways to debug this, I'm happy to give
> them a try.
Could you try to reproduce this without needing the Xapian library? It
would be good to have a (failing) test case.
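As a possible starting point, here is a rough sketch of a Xapian-free C++ stand-in that a Cython test could wrap. FakeDocument and description_bytes() are invented for this example, and it assumes that get_description() returns a std::string by value, as the generated code for the working case suggests.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the wrapped Xapian object. It provides only
// the method under discussion, returning std::string by value (assumed).
class FakeDocument {
public:
    explicit FakeDocument(const std::string &text) : text_(text) {}

    std::string get_description() const {
        return "FakeDocument(" + text_ + ")";
    }

private:
    std::string text_;
};

// The working variant from this thread, written in C++: keep the result
// in a named variable, then use pointer + size.
std::string description_bytes(const FakeDocument &doc) {
    const std::string descr = doc.get_description();
    return std::string(descr.data(), descr.size());
}

// Illustrative self-check (hypothetical helper, not from the thread).
void check_description() {
    const FakeDocument doc("caf\xc3\xa9");
    assert(description_bytes(doc) == "FakeDocument(caf\xc3\xa9)");
}
```

A Cython test would declare FakeDocument in a "cdef extern" block and exercise both property implementations against it.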
Stefan