A simple file reading is 2x slower than CPython

Hi, I just downloaded PyPy 2.6.0 to play with it. I have a simple line-by-line file reading example where the file is 324MB. Code:

    # Not doing this import crashes PyPy with MemoryError??
    from io import open

    a = 0
    f = open(fname)
    for line in f.readlines():
        a += len(line)
    f.close()

PyPy: Python 2.7.9 (295ee98b69288471b0fcf2e0ede82ce5209eb90b, Jun 01 2015, 17:30:13) [PyPy 2.6.0 with GCC 4.9.2] on linux2

    real 0m6.068s
    user 0m4.582s
    sys  0m0.846s

CPython (2.7.10):

    real 0m3.799s
    user 0m2.851s
    sys  0m0.860s

Am I doing something wrong or is this expected? Thanks!

--
Ozan Çağlayan
Research Assistant
Galatasaray University - Computer Engineering Dept.
http://www.ozancaglayan.com

I found this: https://bitbucket.org/pypy/pypy/issue/729/ I did an strace test and both CPython and PyPy do read syscalls with chunks of 4096.

On Mon, 29 Jun 2015 at 14:02 Ozan Çağlayan <ozancag@gmail.com> wrote:
I tested this with CPython 2.7 and PyPy 2.7 and I found that it was 2x slower, as you say. It seems that readlines() is somehow slower in PyPy. You don't actually need to call readlines() in this case though, and it's faster not to. With your code (although I didn't import io.open) I found the timings:

    CPython 2.7: 1.4s
    PyPy 2.7:    2.3s

I changed it to:

    for line in f:  # (not f.readlines())
        a += len(line)

With that change I get:

    CPython 2.7: 1.3s
    PyPy 2.7:    0.6s

So calling readlines() makes it slower in both CPython and PyPy. If you don't call readlines() then PyPy is 2x faster than CPython (at least on this machine).

Probably the reason for the MemoryError was also that you are using readlines(). readlines() reads all of the lines of the file into memory as a list of Python string objects. If you just loop over f directly then it reads the file one line at a time, which requires much less memory. I don't know how much spare RAM you have, but if you got a MemoryError it suggests that you don't have enough to load the whole file into memory using readlines().

Note that even though this machine has 8GB of RAM (with over 7GB unused) and can load the file into memory quite comfortably, I still wouldn't write code that assumed it was okay to load a 324MB file into memory unless there was some actual need to do that.

--
Oscar
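For readers following along, here is a minimal, self-contained sketch of the two approaches being compared; `fname` is a placeholder for the test file path and is not part of the original messages:

    # Minimal sketch, not the original test script.  `fname` is a placeholder.

    def total_len_readlines(fname):
        # readlines() materialises every line as a Python string in one big
        # list before the loop starts, so memory use grows with file size.
        total = 0
        f = open(fname)
        for line in f.readlines():
            total += len(line)
        f.close()
        return total

    def total_len_iterate(fname):
        # Iterating over the file object yields one line at a time, so memory
        # use stays roughly constant no matter how large the file is.
        total = 0
        f = open(fname)
        for line in f:
            total += len(line)
        f.close()
        return total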

On Mon, 29 Jun 2015 at 14:44 Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote: With your code (although I didn't import io.open) I found the timings:
Note that all of the timings above are "warm cache". That means the file was already cached in memory by the operating system because I had recently read/written it. The time taken to load the file cold cache (e.g. after rebooting) would probably be much longer in all cases, so the difference between PyPy and CPython would not be significant. The problem you've posted should really be IO bound, which isn't the kind of situation where PyPy gains you much speed over CPython. -- Oscar

Hi, oh, my bad, I thought that readlines() iterates over the file, I'm always confusing this. But the new results are still interesting:

    time bin/pypy testfile.py
    338695509
    real 0m0.591s
    user 0m0.494s
    sys  0m0.096s

    time python testfile.py
    338695509
    real 0m0.560s
    user 0m0.495s
    sys  0m0.064s

So PyPy is still a little slower here wrt CPython. This is on a warm cache too. Anyways, thanks for pointing out the readlines() stuff :)

It's sort-of known, and we have a branch trying to address it. The main problem is that we do the buffering ourselves instead of just using libc buffering, which turns out not to be as good. Sorry about that :/ On Mon, Jun 29, 2015 at 3:54 PM, Ozan Çağlayan <ozancag@gmail.com> wrote:

Hello all, Well, I am searching for my dream scientific language :) The current codebase I am working with is a language translation software written in C++. I wanted to re-implement parts of it in Python and/or Julia, both to learn it (as I didn't write the C++ stuff) and maybe to make it available to other people who are interested. I saw Pyston last night, then I came back to PyPy.

As a first step, I tried to parse a 300MB structured text file containing 1.1M lines like these:

    0 ||| I love you mother . ||| label=number1 number2 number3 number4 label2=number5 number6 ... number19 ||| number20

Line-by-line access was actually pretty fast *but* trying to store the lines in a Python list drains the RAM on my 4G laptop. This is disappointing: a raw text file (UTF-8) of 300MB takes more than 1GB of memory.

Today I went through PyPy and did some benchmarks. Line parsing is as follows:
- Split on " ||| "
- Convert the 1st field to int and the 4th field to float.
- Clean up the label= stuff from the 2nd field using re.sub()
- Append a dict (1) or a class instance (2) representing each line to a list.

    # Dict (1):
    #   PyPy:    ~1.4G RAM, ~12.7 seconds
    #   CPython: ~1.2G RAM,  28.7 seconds
    # Class (2):
    #   PyPy:    ~1.2G, ~11.1 seconds
    #   CPython: ~1.3G, ~32 seconds

The memory measurements are not precise as I tracked them visually using top :) Attaching the code. I'm not an optimization guru, and I'm pretty sure there are suboptimal parts in the code, but the crucial part is memory complexity. Normally those text files are ~1GB on disk, which means I can't represent them in memory with Python with this code. This is bad. Any suggestions? Thanks!
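A rough sketch of the parsing loop described above (the original attachment is not reproduced in the archive, so the field layout, the label-cleanup regex, and the function names here are assumptions based on the sample line):

    import re

    # Assumed format, based on the sample line in the message:
    #   id ||| sentence text ||| label=score score ... ||| final score
    LABEL_RE = re.compile(r'\S+=')  # strips "label=" style prefixes

    def parse_line(line):
        fields = line.rstrip('\n').split(' ||| ')
        return {
            'id': int(fields[0]),                  # 1st field -> int
            'text': fields[1],                     # sentence text
            'scores': [float(x) for x in
                       LABEL_RE.sub('', fields[2]).split()],  # drop label= prefixes
            'final': float(fields[3]),             # last field -> float
        }

    def parse_file(fname):
        parsed = []
        with open(fname) as f:
            for line in f:
                parsed.append(parse_line(line))    # this list is what eats the RAM
        return parsed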

On Mon, 29 Jun 2015 at 16:13 Ozan Çağlayan <ozancag@gmail.com> wrote:
The first question is always: can you avoid storing everything in memory? There are improvements you could make, but you'll still hit the memory limit again with a slightly bigger file, so rethink that part of the code if possible. This is a fundamental algorithmic point: as long as your approach requires everything to be stored in memory, it has an upper size limit. You can incrementally raise that limit with diminishing returns, but it's usually better to think of a way to remove the upper limit altogether.

You're storing a Python object for each line in the file. Each of these objects has an associated dict, and that probably represents a significant part of the memory usage. Try using __slots__, which is intended for the situation where you have lots of small instances. (Not sure how much difference it makes with PyPy though.)

You can also get significantly better memory efficiency if you store the data in arrays of some kind rather than as many different Python objects. I would probably use numpy record arrays for this problem.

--
Oscar
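As a minimal illustration of those two suggestions (the class name, field names, and dtypes below are made up for the example, not taken from the thread):

    import numpy as np

    # __slots__ variant: instances get no per-object __dict__ (on CPython),
    # so millions of small objects take noticeably less memory.
    class Line(object):
        __slots__ = ('sent_id', 'text', 'final_score')

        def __init__(self, sent_id, text, final_score):
            self.sent_id = sent_id
            self.text = text
            self.final_score = final_score

    # Record-array variant: the numeric fields live in one contiguous buffer
    # instead of one Python object per line.  Field names/dtypes are illustrative.
    line_dtype = np.dtype([('sent_id', np.int32), ('final_score', np.float64)])

    def load_numeric_fields(fname):
        records = []
        with open(fname) as f:
            for line in f:
                fields = line.split(' ||| ')
                records.append((int(fields[0]), float(fields[-1])))
        return np.array(records, dtype=line_dtype)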

Hi, yes, I thought of the obvious question, and I think I can avoid keeping everything in memory by doing two passes over the file. Regarding __slots__, it seemed to help using CPython but pypy + slots crashed/trashed in a very hardcore way :)
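One way the two-pass idea could be structured, purely as an illustration (parse_line and process are hypothetical helpers, not code from the thread):

    # Illustrative two-pass structure: the first pass only aggregates, the
    # second pass streams through the file again, so neither pass holds all
    # of the lines in memory at once.

    def two_pass(fname, parse_line, process):
        # Pass 1: collect whatever global information the real work needs
        # (here just the line count, as a stand-in).
        n_lines = 0
        with open(fname) as f:
            for _ in f:
                n_lines += 1

        # Pass 2: do the actual per-line work, one line at a time.
        with open(fname) as f:
            for line in f:
                process(parse_line(line), n_lines)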

Hi, On 29 June 2015 at 21:40, Ozan Çağlayan <ozancag@gmail.com> wrote:
Regarding __slots__, it seemed to help using CPython but pypy + slots crashed/trashed in a very hardcore way :)
__slots__ is mostly ignored in PyPy (it always compacts instances as if they had slots). The crash/trash is probably due to some other issue. A bientôt, Armin.

Hi Ozan, in addition to what the others said about not using readlines() in the first place: I actually discovered a relatively slow part in our file.readlines implementation and fixed it. Tonight's nightly build should improve the situation. Thanks for reporting this! Out of curiosity, what are you using PyPy for? Cheers, Carl Friedrich On 29/06/15 15:02, Ozan Çağlayan wrote:

participants (7)
- Armin Rigo
- Carl Friedrich Bolz
- Maciej Fijalkowski
- Ondřej Bílka
- Oscar Benjamin
- Ozan Çağlayan
- Ryan Gonzalez