PyPy 2.4.0 final prerelease now available
Hello,

First, the problem that I have. Right now, when I get a string back from a C function, I have to make two copies of it:

    ffi.cdef("""
    const char *getString(...);
    """)

    tmp = ffi.string(clib.getString(...))  # 1st copy
    pystring = tmp.decode('utf-8')         # 2nd copy

So I thought, why not use an ffi.buffer on it and do the decoding directly on the buffer:

    cstr = ffi.new('char []', 'abcd')
    b = unicode(ffi.buffer(cstr), 'utf-8')

The above works. But the problem is that in C a function cannot be declared as returning an array. So I cannot do:

    b = unicode(ffi.buffer(clib.getString(...)), 'utf-8')

because it will only return the first character of getString's result, since the function is declared as returning a 'char*'.

Is there any way in CFFI to declare a function as returning a 'char[]', so that a buffer can be used directly on its result?

Thank you.

l.
Hi Lefteris,

On 22 September 2014 19:37, Eleytherios Stamatogiannakis <estama@gmail.com> wrote:
> b = unicode( ffi.buffer( clib.getString(...) ) ,'utf-8')
> because it'll only return the first character of getString, due to being declared as a 'char*'.

The issue is only that ffi.buffer() tries to guess how long a buffer you're giving it, and with "char *" the guess is one (only ffi.string() has logic to look for the final null character in the array). You need to get its length explicitly, for example like this:

    p = clib.getString(...)   # a "char *"
    length = clib.strlen(p)   # the standard strlen() function from C
    b = unicode(ffi.buffer(p, length), 'utf-8')

A bientôt,

Armin.
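The behaviour described in this reply can be reproduced with a minimal sketch, which is not part of the original exchange. It assumes Python 3 (so byte strings are written b'...' and str() takes the place of unicode()), a POSIX system where ffi.dlopen(None) exposes the C library's strlen(), and it uses a cast of an owned array as a stand-in for the "char *" that a real getString() would return.

```python
from cffi import FFI

ffi = FFI()
ffi.cdef("size_t strlen(const char *s);")
libc = ffi.dlopen(None)  # assumption: POSIX; None loads the standard C library

# Stand-in for a C function's return value: a bare "char *" of unknown length.
owner = ffi.new("char[]", b"abcd")   # owns the memory and keeps it alive
p = ffi.cast("char *", owner)

# Without an explicit size, ffi.buffer() covers only sizeof(char) bytes.
guessed_size = len(ffi.buffer(p))

# With the length from strlen(), the buffer covers the whole string
# (excluding the trailing NUL) and is decoded straight from C memory.
length = libc.strlen(p)
text = str(ffi.buffer(p, length), "utf-8")

print(guessed_size, length, text)
```

The cast only hides the array length from cffi; with a real library call the pointer would come directly from the C function.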
On 23/09/14 09:52, Armin Rigo wrote:
> Hi Lefteris,
>
> On 22 September 2014 19:37, Eleytherios Stamatogiannakis <estama@gmail.com> wrote:
> > b = unicode( ffi.buffer( clib.getString(...) ) ,'utf-8')
> > because it'll only return the first character of getString, due to being declared as a 'char*'.
>
> The issue is only that ffi.buffer() tries to guess how long a buffer you're giving it, and with "char *" the guess is one (only ffi.string() has logic to look for the final null character in the array).

If only ffi.string has logic to look for the final null character, then how can the following work?

    teststr = ffi.new('char[]', 'asdfasdfasdfasdfasdfasdf')
    unicode(ffi.buffer(teststr), 'utf-8')
    u'asdfasdfasdfasdfasdfasdf\x00'

The above doesn't explicitly set the length in ffi.buffer. There is still one problem with ffi.buffer, namely the trailing "\x00" in the result, but otherwise it works with only one copy to go from a char* to a Python unicode string.

The problem is that I cannot declare a C function as returning a char[] so that ffi.buffer will behave on its result the same way as it does on "teststr" above.

> You need to get its length explicitly, for example like this:
>
>     p = clib.getString(...)   # a "char *"
>     length = clib.strlen(p)   # the standard strlen() function from C
>     b = unicode(ffi.buffer(p, length), 'utf-8')

I've tried that, and the overhead of the second call is more or less equal to the cost of the copy when using ffi.string.

Kind regards,

l.
On Tue, Sep 23, 2014 at 8:54 AM, Eleytherios Stamatogiannakis <estama@gmail.com> wrote:
> On 23/09/14 09:52, Armin Rigo wrote:
> > Hi Lefteris,
> >
> > On 22 September 2014 19:37, Eleytherios Stamatogiannakis <estama@gmail.com> wrote:
> > > b = unicode( ffi.buffer( clib.getString(...) ) ,'utf-8')
> > > because it'll only return the first character of getString, due to being declared as a 'char*'.
> >
> > The issue is only that ffi.buffer() tries to guess how long a buffer you're giving it, and with "char *" the guess is one (only ffi.string() has logic to look for the final null character in the array).
>
> If only ffi.string has logic to look for the final null character, then how can below work?
>
>     teststr=ffi.new('char[]', 'asdfasdfasdfasdfasdfasdf')
>     unicode(ffi.buffer(teststr), 'utf-8')
>     u'asdfasdfasdfasdfasdfasdf\x00'
>
> Above doesn't explicitly set the length in ffi.buffer. There is still one problem with ffi.buffer and the last "\x00" in input, but otherwise it works with only 1 copy to go from a char* to a Python unicode string.

The first line you have returns an object that owns the memory and therefore knows how long it is, which is later used by ffi.buffer to figure out how long the buffer is. Also notice that the result of the second line has '\x00' at the end. This also works even if the string has null bytes in the middle:

    In [9]: unicode(ffi.buffer(ffi.new('char[]', '\0a')))
    Out[9]: u'\x00a\x00'

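A small sketch of the same observation, not taken from the thread and assuming Python 3 with cffi installed: an array allocated with ffi.new() carries its own size, ffi.buffer() uses that size rather than scanning for a terminator, and an explicit size argument is one way to leave out the trailing NUL.

```python
from cffi import FFI

ffi = FFI()

owned = ffi.new("char[]", b"asdf")   # cffi allocates 5 bytes: 4 chars + NUL
whole = ffi.buffer(owned)            # size is taken from the array type
with_nul = whole[:]                  # copies out all 5 bytes

# An explicit size drops the trailing NUL before decoding.
text = str(ffi.buffer(owned, len(owned) - 1), "utf-8")

# Embedded NUL bytes survive, because no terminator scan takes place.
embedded = ffi.buffer(ffi.new("char[]", b"\0a"))[:]

print(len(whole), with_nul, text, embedded)
```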
> The problem is that i cannot declare a C function as returning a char[] so that ffi.buffer will have the same behaviour on its results as it has with above "teststr".
>
> > You need to get its length explicitly, for example like this:
> >
> >     p = clib.getString(...)   # a "char *"
> >     length = clib.strlen(p)   # the standard strlen() function from C
> >     b = unicode(ffi.buffer(p, length), 'utf-8')
>
> I've tried that, and the overhead of the second call is more or less equal to the cost of the copy when using ffi.string.
>
> Kind regards,
>
> l.
> _______________________________________________
> pypy-dev mailing list
> pypy-dev@python.org
> https://mail.python.org/mailman/listinfo/pypy-dev
Hi,

On 23 September 2014 14:54, Eleytherios Stamatogiannakis <estama@gmail.com> wrote:
> >     p = clib.getString(...)   # a "char *"
> >     length = clib.strlen(p)   # the standard strlen() function from C
> >     b = unicode(ffi.buffer(p, length), 'utf-8')
>
> I've tried that, and the overhead of the second call is more or less equal to the cost of the copy when using ffi.string.

You cannot have a C function returning a 'char[]'. That's why you need to declare it returning a 'char *', and then you don't know the length. Sorry, it's the way C works; there is nothing I can do about that :-)

Occasionally, we see C functions with this kind of signature:

    size_t getString(xxx, char **result);

This would return the length, and use 'result' as an output parameter, to store into '*result' a pointer to the string. If you really care about performance, then you might want to change the C library you're binding to in order to do that.

A bientôt,

Armin.
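To show how a "char **" output parameter looks from the Python side, here is a hedged sketch that is not from the thread. No getString() with that signature is available here, so the standard strtol() serves purely as a stand-in, since it also returns one value and stores a pointer through a "char **" argument. It assumes Python 3 and a POSIX system where ffi.dlopen(None) loads the C library.

```python
from cffi import FFI

ffi = FFI()
ffi.cdef("long strtol(const char *nptr, char **endptr, int base);")
libc = ffi.dlopen(None)  # assumption: POSIX; None loads the standard C library

text = ffi.new("char[]", b"123abc")

# One-element array of "char *": the slot that the C function fills in.
# It can be allocated once and reused across many calls.
end = ffi.new("char *[1]")

value = libc.strtol(text, end, 10)   # return value: the parsed number
remainder = ffi.string(end[0])       # output parameter: pointer into 'text'

print(value, remainder)
```

With a function shaped like the getString() signature above, the return value would be the length and end[0] would be the string pointer to hand to ffi.buffer().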
On 24/09/14 20:13, Armin Rigo wrote:
> Hi,
>
> On 23 September 2014 14:54, Eleytherios Stamatogiannakis <estama@gmail.com> wrote:
> > >     p = clib.getString(...)   # a "char *"
> > >     length = clib.strlen(p)   # the standard strlen() function from C
> > >     b = unicode(ffi.buffer(p, length), 'utf-8')
> >
> > I've tried that, and the overhead of the second call is more or less equal to the cost of the copy when using ffi.string.
>
> You cannot have a C function returning a 'char[]'. That's why you need to declare it returning a 'char *', and then you don't know the length. Sorry, it's the way C works; there is nothing I can do about that :-)

Thank you for clarifying. I thought that ffi.buffer scanned for the \0 to find the end of the string for "char[]" types.

> Occasionally, we see C functions with this kind of signature:
>
>     size_t getString(xxx, char **result);
>
> This would return the length, and use 'result' as an output parameter, to store into '*result' a pointer to the string. If you really care about performance, then you might want to change the C library you're binding to in order to do that.

Unfortunately, the C library that I use (libsqlite3) does not provide a function like that :( . It has a function that returns the size of the string, but in my tests the overhead of making another CFFI call (to find the size) is greater than that of making the second copy (depending on the average string size).

We make hundreds of millions of string-passing calls back and forth with the libsqlite3 library, so any way to improve the efficiency of this case would be more than welcome :) .

Best regards,

l.
Hi,

On 25 September 2014 09:06, Elefterios Stamatogiannakis <estama@gmail.com> wrote:
> Unfortunately, the C library that i use (libsqlite3) does not provide a function like that :( . It has a function that returns the size of the string, but in my tests the overhead of doing another CFFI call (to find the size) is greater than doing the 2nd copy (depending on the average string size).

In general, if performance is an issue, particularly if you're running CPython (as opposed to PyPy), you can try to write small helpers in C that regroup a few operations. This can reduce the overhead of doing two calls instead of one. In this case, you can write this in the ffi.verify() part:

    size_t myGetString(xxx, char **presult) {
        *presult = getString(xxx);
        return strlen(*presult);
    }

and then in Python you'd declare the function 'myGetString', and use it like that:

    p = ffi.new("char *[1]")   # you can put this before some loop
    ...
    size = lib.myGetString(xxx, p)
    ..ffi.buffer(p[0], size)..

A bientôt,

Armin.
On 25/09/14 15:10, Armin Rigo wrote:
> Hi,
>
> On 25 September 2014 09:06, Elefterios Stamatogiannakis <estama@gmail.com> wrote:
> > Unfortunately, the C library that i use (libsqlite3) does not provide a function like that :( . It has a function that returns the size of the string, but in my tests the overhead of doing another CFFI call (to find the size) is greater than doing the 2nd copy (depending on the average string size).
>
> In general, if performance is an issue, particularly if you're running CPython (as opposed to PyPy), you can try to write small helpers in C that regroup a few operations. This can reduce the overhead of doing two calls instead of one. In this case, you can write this in the ffi.verify() part:

The tests I'm writing about use PyPy only. In CPython I use a native C wrapper (APSW). I try not to use ffi.verify because I want the program to be easily deployable. I also want to test the maximum performance of CFFI's API.

>     size_t myGetString(xxx, char **presult) {
>         *presult = getString(xxx);
>         return strlen(*presult);
>     }
>
> and then in Python you'd declare the function 'myGetString', and use it like that:
>
>     p = ffi.new("char *[1]")   # you can put this before some loop
>     ...
>     size = lib.myGetString(xxx, p)
>     ..ffi.buffer(p[0], size)..

Wouldn't an "strbuffer" that does this scan (opportunistically) be faster for cases like above?

Thank you very much for your suggestions.

l.
Hi,

On 25 September 2014 16:57, Eleytherios Stamatogiannakis <estama@gmail.com> wrote:
> Wouldn't an "strbuffer" that does this scan (opportunistically) be faster for cases like above?

No, it can't be faster than my last solution. There is no way we're going to add custom logic for a special case into the general ffi library. If you don't want to use ffi.verify(), then you're stuck with two calls instead of one. On PyPy, try the latest version (2.4.0); it reduces the overhead of each call, so the cost of doing two calls instead of one is much lower.

A bientôt,

Armin.
Hello,

maybe the code above / inside getstring already knows that string length, and you could exploit that fact to avoid the strlen calculation...

On Fri, Sep 26, 2014 at 6:51 PM, Armin Rigo <arigo@tunes.org> wrote:
> Hi,
>
> On 25 September 2014 16:57, Eleytherios Stamatogiannakis <estama@gmail.com> wrote:
> > Wouldn't an "strbuffer" that does this scan (opportunistically) be faster for cases like above?
>
> No, it can't be faster than my last solution. There is no way we're going to add custom logic for a special case into the general ffi library. If you don't want to use ffi.verify(), then you're stuck with two calls instead of one. On PyPy, try the latest version (2.4.0); it reduces the overhead of each call, so the cost of doing two calls instead of one is much lower.
>
> A bientôt,
>
> Armin.

--
Vincent Legoll
participants (6): Armin Rigo, Elefterios Stamatogiannakis, Eleytherios Stamatogiannakis, Matti Picus, Vincent Legoll, Yichao Yu