[pypy-issue] [issue641] reading CSV files with csv module is much slower than CPython 2.6.6
tracker at bugs.pypy.org
Mon Feb 20 17:56:16 CET 2012
Justin Peel <peelpy at gmail.com> added the comment:
I've had some success in improving the csv module's reading speed. Just replace
lib_pypy/_csv.py with the attached _csv.py file. I also attached two simple
benchmarks, csvmodtest_read.py and csvmodtest_read_float.py, that use the data
generated by csvparsingtest_create_test_file.py in csvparsingtest.tar.gz. Here
are some times on my machine.
python csvmodtest_read.py : 1.42479109764
python csvmodtest_read_float.py : 2.06221604347
Before changes to csv module's reader:
pypy csvmodtest_read.py : 4.33989810944
pypy csvmodtest_read_float.py : 6.02067089081
After changes to csv module's reader:
pypy csvmodtest_read.py : 1.84223508835
pypy csvmodtest_read_float.py : 3.14416193962
It isn't all the way there, but it is significant progress. Here are the major
changes that I made:
1. I moved the while loop in next() that called _parse_process_char() into
_parse_process_char() itself. This was the biggest single improvement because
next() was previously making one call to _parse_process_char() per character.
Maybe I should rename _parse_process_char() to _parse_process_chars()?
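To illustrate the shape of that change (a hypothetical sketch with a toy comma-splitting state machine, not the actual lib_pypy/_csv.py code), compare a caller that pays one Python-level call per character with a version where the loop lives inside the parsing function:

```python
# Before: the caller loops and makes one call per character.
def parse_process_char(state, c, field):
    # toy state machine: accumulate characters until a comma
    if c == ',':
        return 'END', field
    return state, field + c

def next_row_slow(line):
    state, field, fields = 'IN', '', []
    for c in line:
        state, field = parse_process_char(state, c, field)
        if state == 'END':
            fields.append(field)
            state, field = 'IN', ''
    fields.append(field)
    return fields

# After: the per-character loop is inside the function,
# so it is called once per line instead of once per character.
def parse_process_chars(line):
    fields, field = [], ''
    for c in line:
        if c == ',':
            fields.append(field)
            field = ''
        else:
            field += c
    fields.append(field)
    return fields
```

Both versions produce the same fields; the second simply removes the per-character call overhead that the JIT otherwise has to trace through.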
2. The code was pulling out a single char, c = line[pos], and comparing
everything against c. However, the jit can't seem to figure out in that case
that c is always a single character and instead always does a full string
comparison call. By substituting `line[pos]` for `c` everywhere I got a
significant speed-up. The same occurred for the delimiter comparison. Maybe
this indicates a change that should be made in the jit.
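A sketch of the two access styles (hypothetical helper names; under CPython both behave identically, and the claim above is only about how PyPy's JIT specializes the comparison when the indexing expression appears directly in it):

```python
def count_commas_cached(line):
    n = 0
    for pos in range(len(line)):
        c = line[pos]          # cached single char: the JIT may fall back
        if c == ',':           # to a generic string comparison here
            n += 1
    return n

def count_commas_direct(line):
    n = 0
    for pos in range(len(line)):
        if line[pos] == ',':   # direct index: the JIT can see this compares
            n += 1             # a single character
    return n
```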
3. Make all of dialect's variables that are used in parsing into instance
variables of reader (with underscores at the front of their names). dialect's
variables like delimiter are all property()'s so that they are read-only, but
the property access causes some slowdown, so reader caches them up front.
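A minimal sketch of that caching (the class and attribute names here are illustrative, not the real _csv.py definitions): every property() access costs a descriptor call, so the reader copies the values into plain instance attributes once, and the hot parsing loop reads those instead.

```python
class Dialect:
    def __init__(self, delimiter=',', quotechar='"'):
        self._delim = delimiter
        self._quote = quotechar

    # read-only access via property(), as in the csv module
    delimiter = property(lambda self: self._delim)
    quotechar = property(lambda self: self._quote)

class Reader:
    def __init__(self, dialect):
        self.dialect = dialect
        # cache the property values once; the parsing loop then reads
        # cheap instance attributes instead of going through property()
        self._delimiter = dialect.delimiter
        self._quotechar = dialect.quotechar
```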
4. Don't append the first char during a START_FIELD when the character is a
normal (non-escaped, etc) character. This is taken care of in IN_FIELD instead.
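A toy sketch of that state change (hypothetical, much simplified from the real parser): on an ordinary character, START_FIELD switches to IN_FIELD without appending and without consuming the character, so the IN_FIELD branch appends it instead.

```python
START_FIELD, IN_FIELD = 0, 1

def parse_fields(line):
    fields, field, state, pos = [], [], START_FIELD, 0
    while pos < len(line):
        if state == START_FIELD:
            if line[pos] == ',':
                fields.append('')
                pos += 1
            else:
                state = IN_FIELD   # don't append here and don't advance pos;
                                   # IN_FIELD handles this character
        else:  # IN_FIELD
            if line[pos] == ',':
                fields.append(''.join(field))
                field, state = [], START_FIELD
            else:
                field.append(line[pos])
            pos += 1
    fields.append(''.join(field))
    return fields
```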
5. Instead of using `if line[pos] in '\n\r'` to check for new line characters,
use `if line[pos] == '\n' or line[pos] == '\r'`. This changes a call to search
in a string to just checking against two constants.
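The two checks are equivalent for single characters, but the membership test is a substring-search call while the rewritten form is two constant equality comparisons. A sketch:

```python
def is_newline_membership(c):
    # membership: a search through the string '\n\r'
    return c in '\n\r'

def is_newline_equality(c):
    # two equality checks against constants
    return c == '\n' or c == '\r'
```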
I think that those are the main changes that I did, but I might have forgotten
others. I did have to make some small changes to the logic that went along
with my changes. All tests pass in lib-python/modified-2.7/test/test_csv.py. I
tried using StringBuilder and was getting a slow down, but maybe someone else
can see a better way to use it.
Are these changes acceptable? If not, then please give clear reasons why.