[Cython] Odd behavior with std::string and .decode()

Stefan Behnel stefan_ml at behnel.de
Sat Jul 7 01:54:55 CEST 2012


Barry Warsaw, 06.07.2012 16:21:
> Thanks for the follow up Stefan,
> 
> On Jul 06, 2012, at 06:48 AM, Stefan Behnel wrote:
> 
>> This is very weird behaviour indeed. I wouldn't know why that should
>> happen. What "return as_bytes.decode('utf-8')" does is that it calls
>> strlen() to see how long the string is, then it calls the UTF-8 decode
>> C-API function with that.
> 
> It seems like either the strlen() or the cast through char* is the problem.

Could you try it without the cast?

https://sage.math.washington.edu:8091/hudson/job/cython-docs/doclinks/1/src/tutorial/strings.html#dealing-with-const

In older Cython versions, you can use the declarations directly:

https://github.com/cython/cython/blob/master/Cython/Includes/libc/string.pxd#L3

I just noticed that .c_str() is incorrectly declared (without "const") in
Cython and that .data() is missing completely. I've pushed a fix for that
(note that the current master looks a bit broken, which is rather
unfortunate for testing).


>> One thing I would generally suggest is to do this:
>>
>>    descr = self._this.get_description()
>>    return descr.data()[:descr.size()].decode('utf-8')
>>
>> Avoids the call to strlen() by explicitly slicing the pointer. Also avoids
>> needing to make sure the C string is 0-terminated.
> 
> According to
> 
> http://www.cplusplus.com/reference/string/string/c_str/
> 
>     The returned array points to an internal location with the required
>     storage space for this sequence of characters plus its terminating
>     null-character, but the values in this array should not be modified in the
>     program and are only guaranteed to remain unchanged until the next call to
>     a non-constant member function of the string object.
> 
> I believe the const char* returned by c_str() is guaranteed to be null
> terminated.  AFAICT, there are no embedded NULs.  I also don't think there are
> any non-constant member function calls of the parent string object getting in
> the way.

Yes to all of the above.

What I meant was that .c_str() may be less efficient than .data(): the
internal string buffer is not necessarily 0-terminated, in which case
.c_str() has to add the terminator before it can return the pointer.
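
The difference between the two ways of finding the length can be seen in a
small C++ sketch. The helper names below are made up for illustration and are
not part of Cython or CPython. With no embedded 0 bytes both approaches
agree, so this only shows why the explicit slice is the more robust habit,
not why the failure above occurs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: the length as seen by code that only has a char
// pointer and must scan for the terminating 0 byte.
std::size_t length_via_strlen(const std::string& s) {
    return std::strlen(s.c_str());
}

// Hypothetical helper: the length taken from the string object itself,
// as in descr.data()[:descr.size()].
std::size_t length_via_size(const std::string& s) {
    return s.size();
}

// Builds the 7-byte string "abc\0def" with a 0 byte in the middle.
std::string sample_with_embedded_nul() {
    std::string s("abc\0def", 7);
    assert(s.size() == 7);  // the explicit length keeps the bytes after the 0
    return s;
}
```

For the sample string, length_via_strlen() stops at the embedded 0 byte while
length_via_size() reports all seven bytes.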


> Next, I tried two different implementations:
> 
>     property description:
>         def __get__(self):
>             # works
>             descr = self._this.get_description()
>             return descr.c_str()[:descr.size()].decode('utf-8')
> 
>     property destruction:
>         def __get__(self):
>             # broken
>             as_bytes = <char *>self._this.get_description().c_str()
>             return as_bytes.decode('utf-8')
> 
> The second case requires the cast or you get an error:
> 
> xapian.cpp:1409:67: error: invalid conversion from ‘const char*’ to ‘char*’ [-fpermissive]
> 
> but I don't think that's the problem.  Looking at the generated C++ code, I
> see these two different implementations:
> 
> works:
> 
>   __pyx_t_1 = ((PyObject *)PyUnicode_Decode(__pyx_v_descr.c_str(), __pyx_v_descr.size(), __pyx_k_1, NULL)); if (unlikely(!__pyx_t_1)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 84; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
> 
> broken:
> 
>   __pyx_t_1 = ((PyObject *)PyUnicode_Decode(__pyx_v_as_bytes, strlen(__pyx_v_as_bytes), __pyx_k_1, NULL)); if (unlikely(!__pyx_t_1)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 91; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
> 
> In the working case, __pyx_v_descr is a std::string, so the const char*
> returned by .c_str() is passed directly to PyUnicode_Decode() without a cast.
> The length is returned by std::string.size().
> 
> In the broken case, __pyx_v_as_bytes is a char* (I could not figure out how to
> preserve the const char* type) and strlen() is used to find the length.
> 
> Those are the only substantive differences I could find.

Maybe the C++ compiler is going mad because of the cast that kills "const"?
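
Besides the cast, the two variants also differ in that the working one stores
the std::string in a named variable before calling .c_str(). A minimal C++
sketch of that pattern, which keeps the pointer const and needs no cast, is
below. The free function get_description() is a hypothetical stand-in that
returns by value; whether the real Xapian method does so is an assumption
here. In C++, a pointer from .c_str() is only valid while the string it came
from is alive, and a temporary is destroyed at the end of the full expression
that created it.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical stand-in for the wrapped C++ method; returns by value.
std::string get_description() {
    return std::string("Xapian::Database()");
}

// Mirrors the "works" variant: the result lives in a named local, the
// pointer stays const, and the length comes from size().
std::string copy_via_named_local() {
    const std::string descr = get_description();
    const char* p = descr.c_str();  // no cast needed, const is preserved
    // strlen() can never report more bytes than the string actually holds.
    assert(std::strlen(p) <= descr.size());
    return std::string(p, descr.size());
}
```

Here the pointer is used only while descr is still in scope, which is the
property the named local provides.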


>> I wouldn't know any differences off the top of my head, except that 0.17
>> has generally better support for STL containers and std::string (but that's
>> unrelated to this failure). I'm planning to enable direct support for
>> cpp_string.decode(...) as well, but that's not implemented yet. It would
>> basically generate the verbose code above automatically.
>>
>>> Is this a bug or am I doing something stupid?
>>
>> Definitely not doing something stupid, but I have no idea why this should
>> go wrong.
> 
> Okay, at least I have a few workarounds :).  I'd file a bug but I don't have
> permission to file new issues.

Please send a htpasswd entry to me or Robert.


> If you have any other suggestions for ways to debug this, I'm happy to give
> them a try.

Could you try to reproduce this without needing the Xapian library? It
would be good to have a (failing) test case.
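
As a starting point, the C++ side of such a test case could be as small as
the sketch below. The class name FakeDatabase and the string it returns are
invented for illustration; the only assumption carried over from the real
code is a get_description() method that returns a std::string by value. The
Cython wrapper would then be the same two properties as above, pointed at
this class.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the Xapian object being wrapped.
class FakeDatabase {
public:
    // Returns by value (an assumption about the real method).
    std::string get_description() const {
        // "caf\xc3\xa9" is "café" encoded as UTF-8, so the decode step
        // sees at least one multi-byte character.
        std::string descr("FakeDatabase(caf\xc3\xa9)");
        assert(static_cast<unsigned char>(descr[16]) == 0xc3);
        return descr;
    }
};
```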

Stefan

