[pypy-issue] [issue641] reading CSV files with csv module is much slower than CPython 2.6.6

Justin Peel tracker at bugs.pypy.org
Mon Feb 20 17:56:16 CET 2012


Justin Peel <peelpy at gmail.com> added the comment:

I've had some success in improving the csv module's reading speed. Just replace 
lib_pypy/_csv.py with the attached _csv.py file. I also attached two simple 
benchmarks, csvmodtest_read.py and csvmodtest_read_float.py, that use the data 
generated by csvparsingtest_create_test_file.py in csvparsingtest.tar.gz. Here 
are some times on my machine.

python csvmodtest_read.py       : 1.42479109764
python csvmodtest_read_float.py : 2.06221604347

Before changes to csv module's reader:
pypy csvmodtest_read.py         : 4.33989810944
pypy csvmodtest_read_float.py   : 6.02067089081

After changes to csv module's reader:
pypy csvmodtest_read.py         : 1.84223508835
pypy csvmodtest_read_float.py   : 3.14416193962
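
To put those numbers in proportion, here is a small sketch that only does arithmetic on the timings quoted above. The dictionary and function names are made up for illustration and are not part of the benchmarks.

```python
# Timings copied verbatim from the message above.
cpython = {"read": 1.42479109764, "read_float": 2.06221604347}
pypy_before = {"read": 4.33989810944, "read_float": 6.02067089081}
pypy_after = {"read": 1.84223508835, "read_float": 3.14416193962}


def ratios(name):
    """Return (speed-up from the patch, remaining slowdown versus CPython)."""
    speedup = pypy_before[name] / pypy_after[name]
    slowdown = pypy_after[name] / cpython[name]
    return speedup, slowdown


for name in ("read", "read_float"):
    speedup, slowdown = ratios(name)
    print("%-10s %.2fx faster than before, %.2fx slower than CPython"
          % (name, speedup, slowdown))
```

On these figures the patched reader is about 2.4x and 1.9x faster than the old PyPy reader, and still about 1.3x and 1.5x slower than CPython.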

It isn't all the way there, but it is significant progress. Here are the major 
changes that I made:

1. I brought the while loop over _parse_process_char() in next() into 
_parse_process_char(). This was the biggest single improvement because next() 
was calling _parse_process_char() once for each character. Maybe I should 
change the name of _parse_process_char() to _parse_process_chars()?
2. The code was pulling out a single char, c = line[pos], and comparing 
everything against c. However, the jit can't seem to figure out in that case 
that c is always a single character and instead always does a full string 
comparison call. By replacing `c` with `line[pos]` everywhere I got a 
significant speed-up. The same happened when I replaced self.dialect.delimiter 
with self.dialect.delimiter[0]. Maybe this indicates a change that should be 
made in the jit.
3. Make all of dialect's variables that are used in parsing into class variables 
of reader (with underscores at the front of their names). dialect's variables 
like delimiter are all property() objects so that they are read-only. However, 
going through a property on every access causes some slowdown.
4. Don't append the first char during a START_FIELD when the character is a 
normal (non-escaped, etc) character. This is taken care of in IN_FIELD instead.
5. Instead of using `if line[pos] in '\n\r'` to check for new line characters, 
use `if line[pos] == '\n' or line[pos] == '\r'`. This replaces a substring 
search call with two comparisons against constants.
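
As an illustration of changes 1, 2 and 5, here is a minimal sketch of the two styles. It is not the code from the attached _csv.py: the function names are hypothetical, and it only splits on a delimiter with no quoting or escaping. The first version makes one helper call per character and tests for newlines with `in`; the second keeps the loop inside one function and compares `line[pos]` directly against single-character constants.

```python
def _process_char(fields, current, c, delimiter):
    """Handle one character; returns True when the end of the line is reached."""
    if c in '\n\r':              # substring search for the newline test
        return True
    if c == delimiter:
        fields.append(''.join(current))
        del current[:]
    else:
        current.append(c)
    return False


def split_per_char(line, delimiter=','):
    """Old style: one Python-level call per input character."""
    fields, current = [], []
    for c in line:
        if _process_char(fields, current, c, delimiter):
            break
    fields.append(''.join(current))
    return fields


def split_inline(line, delimiter=','):
    """New style: loop inside the function, direct line[pos] comparisons."""
    delim = delimiter[0]         # force a single character, as in change 2
    fields = []
    start = pos = 0
    n = len(line)
    while pos < n:
        if line[pos] == '\n' or line[pos] == '\r':
            break
        if line[pos] == delim:
            fields.append(line[start:pos])
            start = pos + 1
        pos += 1
    fields.append(line[start:pos])
    return fields


print(split_per_char("a,b,,c\r\n"))
print(split_inline("a,b,,c\r\n"))
```

Both functions return the same fields for the same input. Whether the second is actually faster depends on the interpreter and JIT, so it should be measured rather than assumed.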

I think that those are the main changes that I made, but I might have forgotten 
others. I did have to make some small changes to the logic that went along 
with my changes. All tests pass in lib-python/modified-2.7/test/test_csv.py. I 
tried using StringBuilder and was getting a slowdown, but maybe someone else 
can see a better way to use it.

Are these changes acceptable? If not, then please give clear reasons why and 
suggest alternatives.

----------
nosy: +justinpeel

________________________________________
PyPy bug tracker <tracker at bugs.pypy.org>
<https://bugs.pypy.org/issue641>
________________________________________

