A simple file reading is 2x slower than CPython

Hi, I just downloaded PyPy 2.6.0 to play with it. I have a simple line-by-line file reading example where the file is 324MB. Code:

    # Not doing this import crashes PyPy with MemoryError??
    from io import open

    a = 0
    f = open(fname)
    for line in f.readlines():
        a += len(line)
    f.close()

PyPy: Python 2.7.9 (295ee98b69288471b0fcf2e0ede82ce5209eb90b, Jun 01 2015, 17:30:13) [PyPy 2.6.0 with GCC 4.9.2] on linux2

    real 0m6.068s
    user 0m4.582s
    sys  0m0.846s

CPython (2.7.10):

    real 0m3.799s
    user 0m2.851s
    sys  0m0.860s

Am I doing something wrong or is this expected? Thanks!

--
Ozan Çağlayan
Research Assistant
Galatasaray University - Computer Engineering Dept.
http://www.ozancaglayan.com

I found this: https://bitbucket.org/pypy/pypy/issue/729/ I did an strace test and both CPython and PyPy do read syscalls with chunks of 4096.

On Mon, 29 Jun 2015 at 14:02 Ozan Çağlayan <ozancag@gmail.com> wrote:
I tested this with CPython 2.7 and PyPy 2.7 and I found that it was 2x slower, as you say. It seems that readlines() is somehow slower in PyPy. You don't actually need to call readlines() in this case though, and it's faster not to. With your code (although I didn't import io.open) I found the timings:

    CPython 2.7: 1.4s
    PyPy 2.7:    2.3s

I changed it to:

    for line in f:  # (not f.readlines())
        a += len(line)

With that change I get:

    CPython 2.7: 1.3s
    PyPy 2.7:    0.6s

So calling readlines() makes it slower in both CPython and PyPy. If you don't call readlines() then PyPy is 2x faster than CPython (at least on this machine).

Probably the reason for the MemoryError was also that you are using readlines(). readlines() reads all of the lines of the file into memory as a list of Python string objects. If you just loop over f directly then it reads the file one line at a time, which requires much less memory. I don't know how much spare RAM you have, but if you got a MemoryError it suggests that you don't have enough to load the whole file into memory using readlines().

Note that even though this machine has 8GB of RAM (with over 7GB unused) and can load the file into memory quite comfortably, I still wouldn't write code that assumed it was okay to load a 324MB file into memory unless there was some actual need to do that.

--
Oscar
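For readers following along, here is a minimal, self-contained sketch of the two approaches being compared; `fname` is a placeholder for the test file path and is not part of the original messages:

    # Minimal sketch, not the original test script.  `fname` is a placeholder.

    def total_len_readlines(fname):
        # readlines() materialises every line as a Python string in one big
        # list before the loop starts, so memory use grows with file size.
        total = 0
        f = open(fname)
        for line in f.readlines():
            total += len(line)
        f.close()
        return total

    def total_len_iterate(fname):
        # Iterating over the file object yields one line at a time, so memory
        # use stays roughly constant no matter how large the file is.
        total = 0
        f = open(fname)
        for line in f:
            total += len(line)
        f.close()
        return total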

On Mon, 29 Jun 2015 at 14:44 Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote: With your code (although I didn't import io.open) I found the timings:
Note that all of the timings above are "warm cache". That means the file was already cached in memory by the operating system because I had recently read/written it. The time taken to load the file cold cache (e.g. after rebooting) would probably be much longer in all cases, so the difference between PyPy and CPython would not be significant. The problem you've posted should really be IO bound, which isn't the kind of situation where PyPy gains you much speed over CPython. -- Oscar

Hi, oh, my bad, I thought that readlines() iterates over the file, I'm always confusing this. But the new results are still interesting:

    time bin/pypy testfile.py
    338695509
    real 0m0.591s
    user 0m0.494s
    sys  0m0.096s

    time python testfile.py
    338695509
    real 0m0.560s
    user 0m0.495s
    sys  0m0.064s

So PyPy is still a little slower here wrt CPython. This is on a warm cache too. Anyways, thanks for pointing out the readlines() stuff :)

It's sort-of known, and we have a branch trying to address it. The main problem is that we do the buffering ourselves instead of just using libc buffering, which turns out not to be as good. Sorry about that :/ On Mon, Jun 29, 2015 at 3:54 PM, Ozan Çağlayan <ozancag@gmail.com> wrote:

Hello all, Well, I am searching for my dream scientific language :) The current codebase I am working with is a language translation software written in C++. I wanted to re-implement parts of it in Python and/or Julia, both to learn it (as I didn't write the C++ stuff) and maybe to make it available to other people who are interested. I saw Pyston last night, then I came back to PyPy.

As a first step, I tried to parse a 300MB structured text file containing 1.1M lines like these:

    0 ||| I love you mother . ||| label=number1 number2 number3 number4 label2=number5 number6 ... number19 ||| number20

Line-by-line access was actually pretty fast *but* trying to store the lines in a Python list drains the RAM on my 4G laptop. This is disappointing: a raw text file (UTF-8) of 300MB takes more than 1GB of memory.

Today I went through PyPy and did some benchmarks. Line parsing is as follows:
- Split on " ||| "
- Convert the 1st field to int and the 4th field to float.
- Clean up the label= stuff from the 2nd field using re.sub()
- Append a dict (1) or a class instance (2) representing each line to a list.

    # Dict (1):
    #   PyPy:    ~1.4G RAM, ~12.7 seconds
    #   CPython: ~1.2G RAM,  28.7 seconds
    # Class (2):
    #   PyPy:    ~1.2G, ~11.1 seconds
    #   CPython: ~1.3G, ~32 seconds

The memory measurements are not precise as I tracked them visually using top :) Attaching the code. I'm not an optimization guru, and I'm pretty sure there are suboptimal parts in the code, but the crucial part is memory complexity. Normally those text files are ~1GB on disk, which means I can't represent them in memory with Python with this code. This is bad. Any suggestions? Thanks!
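A rough sketch of the parsing loop described above (the original attachment is not reproduced in the archive, so the field layout, the label-cleanup regex, and the function names here are assumptions based on the sample line):

    import re

    # Assumed format, based on the sample line in the message:
    #   id ||| sentence text ||| label=score score ... ||| final score
    LABEL_RE = re.compile(r'\S+=')  # strips "label=" style prefixes

    def parse_line(line):
        fields = line.rstrip('\n').split(' ||| ')
        return {
            'id': int(fields[0]),                  # 1st field -> int
            'text': fields[1],                     # sentence text
            'scores': [float(x) for x in
                       LABEL_RE.sub('', fields[2]).split()],  # drop label= prefixes
            'final': float(fields[3]),             # last field -> float
        }

    def parse_file(fname):
        parsed = []
        with open(fname) as f:
            for line in f:
                parsed.append(parse_line(line))    # this list is what eats the RAM
        return parsed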

On Mon, 29 Jun 2015 at 16:13 Ozan Çağlayan <ozancag@gmail.com> wrote:
The first question is always: can you avoid storing everything in memory? There are improvements you could make, but you'll still hit the memory limit again with a slightly bigger file, so rethink that part of the code if possible. This is a fundamental algorithmic point: as long as your approach requires everything to be stored in memory, it has an upper size limit. You can incrementally raise that limit with diminishing returns, but it's usually better to think of a way to remove the upper limit altogether.

You're storing a Python object for each line in the file. Each of these objects has an associated dict, and that probably represents a significant part of the memory usage. Try using __slots__, which is intended for the situation where you have lots of small instances. (Not sure how much difference it makes with PyPy though.)

You can also get significantly better memory efficiency if you store the data in arrays of some kind rather than as many different Python objects. I would probably use numpy record arrays for this problem.

--
Oscar
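As a minimal illustration of those two suggestions (the class name, field names, and dtypes below are made up for the example, not taken from the thread):

    import numpy as np

    # __slots__ variant: instances get no per-object __dict__ (on CPython),
    # so millions of small objects take noticeably less memory.
    class Line(object):
        __slots__ = ('sent_id', 'text', 'final_score')

        def __init__(self, sent_id, text, final_score):
            self.sent_id = sent_id
            self.text = text
            self.final_score = final_score

    # Record-array variant: the numeric fields live in one contiguous buffer
    # instead of one Python object per line.  Field names/dtypes are illustrative.
    line_dtype = np.dtype([('sent_id', np.int32), ('final_score', np.float64)])

    def load_numeric_fields(fname):
        records = []
        with open(fname) as f:
            for line in f:
                fields = line.split(' ||| ')
                records.append((int(fields[0]), float(fields[-1])))
        return np.array(records, dtype=line_dtype)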

Hi, yes, I thought of the obvious question, and I think I can avoid keeping everything in memory by doing two passes over the file. Regarding __slots__, it seemed to help using CPython but pypy + slots crashed/trashed in a very hardcore way :)
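One way the two-pass idea could be structured, purely as an illustration (parse_line and process are hypothetical helpers, not code from the thread):

    # Illustrative two-pass structure: the first pass only aggregates, the
    # second pass streams through the file again, so neither pass holds all
    # of the lines in memory at once.

    def two_pass(fname, parse_line, process):
        # Pass 1: collect whatever global information the real work needs
        # (here just the line count, as a stand-in).
        n_lines = 0
        with open(fname) as f:
            for _ in f:
                n_lines += 1

        # Pass 2: do the actual per-line work, one line at a time.
        with open(fname) as f:
            for line in f:
                process(parse_line(line), n_lines)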

Hi, On 29 June 2015 at 21:40, Ozan Çağlayan <ozancag@gmail.com> wrote:
Regarding __slots__, it seemed to help using CPython but pypy + slots crashed/trashed in a very hardcore way :)
__slots__ is mostly ignored in PyPy (it always compacts instances as if they had slots). The crash/trash is probably due to some other issue. A bientôt, Armin.

Hi Ozan, in addition to what the others said about not using readlines() in the first place: I actually discovered a relatively slow part in our file.readlines implementation and fixed it. Tonight's nightly build should improve the situation. Thanks for reporting this! Out of curiosity, what are you using PyPy for? Cheers, Carl Friedrich On 29/06/15 15:02, Ozan Çağlayan wrote:

participants (7)
- Armin Rigo
- Carl Friedrich Bolz
- Maciej Fijalkowski
- Ondřej Bílka
- Oscar Benjamin
- Ozan Çağlayan
- Ryan Gonzalez