Converting from Python strings to C strings slow?

Hi,

There seems to be a lot of overhead when passing a large string (23 Meg) to C compiled RPython code. For example, this code:

    def small(text):
        return 3

    t = Translation(small)
    t.annotate()
    t.rtype()
    f3 = t.compile_c()

    st = time.time()
    z = f3(xml)
    print time.time() - st

Whereas parsing the 16,000 XML elements using a regular expression only took 22 msec. Even reading the text file inside the compiled RPython is faster than passing it.

Here's the code from rstr.py that (seems to be) doing the conversion. Any idea how I'd put timing code in there to see what's taking all the time? Any idea how to speed it up?

    class __extend__(pairtype(PyObjRepr, AbstractStringRepr)):
        def convert_from_to((r_from, r_to), v, llops):
            v_len = llops.gencapicall('PyString_Size', [v], resulttype=Signed)
            cstr = inputconst(Void, STR)
            v_result = llops.genop('malloc_varsize', [cstr, v_len],
                                   resulttype=Ptr(STR))
            cchars = inputconst(Void, "chars")
            v_chars = llops.genop('getsubstruct', [v_result, cchars],
                                  resulttype=Ptr(STR.chars))
            llops.gencapicall('PyString_ToLLCharArray', [v, v_chars])
            string_repr = llops.rtyper.type_system.rstr.string_repr
            v_result = llops.convertvar(v_result, string_repr, r_to)
            return v_result

Best,
Martin
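One way to check whether the argument conversion is what costs the time, without touching rstr.py, is to time the call from the Python side with payloads of increasing size. If the time grows with the string length even though the function body ignores its argument, the conversion at the call boundary is the likely cause. The following is only a sketch under assumptions: `time_call`, `scaling_report` and `fake_compiled` are hypothetical names, `fake_compiled` is a stand-in that merely copies its argument (it is not the compiled RPython function), and the code is written for Python 3 rather than the Python 2 of the original post.

```python
import time


def time_call(func, arg, repeats=5):
    """Return the best wall-clock time (seconds) of func(arg) over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(arg)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best


def fake_compiled(text):
    # Hypothetical stand-in for the compiled function: it copies its
    # argument, roughly as a Python-to-low-level string conversion would,
    # and then returns a constant like small() in the post.
    bytes(bytearray(text))
    return 3


def scaling_report(func, sizes):
    """Time func on byte strings of each size; return a list of (size, seconds)."""
    return [(n, time_call(func, b"x" * n)) for n in sizes]


if __name__ == "__main__":
    for size, seconds in scaling_report(fake_compiled, [10**5, 10**6, 10**7]):
        print("%9d bytes: %.6f s" % (size, seconds))
```

Replacing `fake_compiled` with the real compiled function would show whether its cost scales with the input size.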

Martin C. Martin wrote:
This is wrong. You should even get a warning; the proper command is t.annotate([str]).

Besides, this is not the official way of writing RPython standalone programs. The official way is to go to pypy/translator/goal and, for example, modify targetnopstandalone for a standalone target (the entry function is entry_point). You translate it with ./translate.py targetnopstandalone.py (or whatever target you choose). You can even use some fancy options (like different GCs).

In your example the XML was converted to a Python object, which will never happen the official way.

Cheers,
fijal
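For readers who have not seen a standalone target, the shape described above might look roughly like the sketch below. Only the name entry_point comes from the message; the `target` function returning `(entry_point, None)` is the convention as recalled from the goal directory and should be treated as an assumption, and `count_tags` is a purely hypothetical inner loop. The file runs as ordinary Python and has not been checked against the RPython translator.

```python
def count_tags(text):
    # Hypothetical inner loop: count '<' characters as a crude stand-in
    # for real XML processing.
    count = 0
    for ch in text:
        if ch == '<':
            count += 1
    return count


def entry_point(argv):
    # Entry function of the standalone program; argv[0] is the program name.
    if len(argv) < 2:
        return 1          # non-zero exit status: no input given
    count_tags(argv[1])
    return 0              # exit status of the translated executable


def target(*args):
    # Assumed convention: the translation driver calls target() and
    # receives the entry function plus its input types (None here).
    return entry_point, None
```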

Maciek Fijalkowski wrote:
Oops, yes, I've been working with variations of this all day, and I hadn't actually compiled & run the example in the email, although I'd done something equivalent.
> Besides, this is not the official way of writing rpython standalone programs.

Thanks, but I'm not trying to write a standalone program; I need to call some 3rd party libraries. For example, the string comes from one of a couple dozen socket connections, managed by Twisted. So I just want my inner loop in RPython. The inner loop turns XML into a MySQL statement, which the main Python program can then send to a database. So I need to get a big string into RPython, and a smaller (but still pretty big) string out of it.

I see some other targets in there for shared libraries. The docs mention that translate.py takes a --gc=generation argument, but when I try that I get:

    $ python ./translator/goal/translate.py --gc=generation fun3.py
    Usage
    =====
      translate [options] [target] [target-specific-options]

    translate: error: invalid value generation for option gc

Am I specifying it wrong?

Thanks,
Martin

Martin C. Martin wrote:
Couldn't you just use a subprocess, read the string from stdin and write the result to stdout? It's quite likely that this is no slower than the way strings are passed in and out now, and it has many advantages. One of them is that if you use the Translation class, your RPython program will use reference counting, which is our slowest GC. If you use a subprocess you get the benefits of our much better generational GC.

You would need to use os.read and os.write, since sys.stdin/stdout is not supported in RPython, but apart from that it should work fine.

Cheers,
Carl Friedrich
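The subprocess approach could be sketched as follows. This is an illustration under assumptions, not code from the thread: `read_all`, `write_all`, `worker_main` and `run_worker` are hypothetical helper names, the child side uses only os.read and os.write on file descriptors 0 and 1 as suggested above, and the code is plain Python 3 that has not been run through the RPython translator.

```python
import os
import subprocess

CHUNK = 65536  # bytes handled per os.read / os.write call


def read_all(fd):
    """Read from file descriptor fd until EOF using only os.read."""
    pieces = []
    while True:
        piece = os.read(fd, CHUNK)
        if not piece:          # an empty result means EOF
            break
        pieces.append(piece)
    return b"".join(pieces)


def write_all(fd, data):
    """Write all of data to fd; os.write may accept only part of a chunk."""
    offset = 0
    while offset < len(data):
        offset += os.write(fd, data[offset:offset + CHUNK])


def worker_main(transform):
    """Child side: read stdin (fd 0), transform it, write to stdout (fd 1)."""
    write_all(1, transform(read_all(0)))
    return 0


def run_worker(command, payload):
    """Parent side: send payload to the child's stdin and return its stdout."""
    proc = subprocess.Popen(command, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)
    output, _ = proc.communicate(payload)
    if proc.returncode != 0:
        raise RuntimeError("worker exited with status %d" % proc.returncode)
    return output
```

In the parent, `run_worker(["./worker"], xml_bytes)` would hand the XML to a translated executable and collect the generated statement, where `./worker` is a placeholder path for whatever binary the translation produces.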

Carl Friedrich Bolz wrote:
What I'm really looking for is a way to write most of my applications in a dynamic language (because it's more productive to write & maintain), then if and when performance is a problem, have some way to speed it up. PyPy promises to do this even before performance is a problem, which will be great!

Until that comes, I was hoping for a language where I could give some hints to the compiler or runtime to speed it up. Things like "although this binding could change each time through the loop, it doesn't actually change, so there's no need to do a hash lookup for every access." Or "this variable is always an int."

The only language I know of that can do that is Lisp, which is a strong possibility. But Lisp's syntax is more verbose and low level than modern dynamic languages, it doesn't have as many libraries, and it doesn't have an IDE with auto completion or a good source level debugger. I had hoped Groovy would be like that, with its optional typing and Java inspired syntax and semantics, but sadly, the developers valued dynamism, however rarely used, over performance.

So the next best thing is to rewrite the performance critical parts in some other language. I had hoped RPython would be that language for Python, but it turns out not to be. I could rewrite in C++, but the semantics of C++ are very different from Python's, so interfacing the two becomes verbose and awkward. The ctypes module looks good for calling C libraries that weren't originally designed to work with Python. But it doesn't have a good way (or any way?) to manipulate Python objects from C. Even Java's JNI makes for a lot of boilerplate code to translate back and forth.

So it looks like my best bet may be Groovy, which interacts with Java seamlessly. A year ago, when I last checked, the IDEs weren't up to the job, but maybe that's changed. And once PyPy is done, that may be an even better solution.

Best,
Martin

participants (3)
- Carl Friedrich Bolz
- Maciek Fijalkowski
- Martin C. Martin