how to fast processing one million strings to remove quotes
Tim Daneliuk
info at tundraware.com
Fri Aug 4 10:20:02 EDT 2017
On 08/04/2017 01:52 AM, Peter Otten wrote:
<SNIP>
> It looks like Python is fairly competitive:
>
> $ wc -l hugequote.txt
> 1000000 hugequote.txt
> $ cat unquote.py
> import csv
>
> with open("hugequote.txt") as instream:
>     for field, in csv.reader(instream):
>         print(field)
>
> $ time python3 unquote.py > /dev/null
>
> real 0m3.773s
> user 0m3.665s
> sys 0m0.082s
>
> $ time cat hugequote.txt | sed 's/"""/"/g;s/""/"/g' > /dev/null
>
> real 0m4.862s
> user 0m4.721s
> sys 0m0.330s
>
> Run on ancient AMD hardware ;)
>
It's actually better than sed. What you're seeing is, I believe, load
time dominating the overall time. I reran this with a 20M-line file:
time cat superhuge.txt | sed 's/"""/"/g;s/""/"/g' >/dev/null
real 0m53.091s
user 0m52.861s
sys 0m0.820s
time python unquote.py >/dev/null
real 0m22.377s
user 0m22.021s
sys 0m0.352s
Note that this is with python2, not python3. Also, I confirmed that
piping the file through cat into sed was not a factor in the performance.
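One way to check that is to time sed reading the file directly, with no cat
and no pipe:

time sed 's/"""/"/g;s/""/"/g' superhuge.txt >/dev/null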
My guess is that the delimiter-recognition logic in the csv module is far
more efficient than the general-purpose regular expression/DFA
implementation in sed.
Extra Credit Assignment:
Reimplement in Python using:
- string substitution
- regular expressions
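A rough, untested sketch of both (the file name and the rule order, """ before
"", are taken from the commands above, so these mirror the sed rules rather
than csv's full unquoting; the function names are just illustrative):

import re
import sys

# Alternation tries """ before "", matching the order of the sed rules.
QUOTES = re.compile(r'"""|""')

def unquote_replace(path):
    # String-substitution variant: apply the same two rules as sed,
    # """ -> " first, then "" -> ".
    with open(path) as instream:
        for line in instream:
            sys.stdout.write(line.replace('"""', '"').replace('""', '"'))

def unquote_regex(path):
    # Regular-expression variant: one precompiled pattern applied per line.
    with open(path) as instream:
        for line in instream:
            sys.stdout.write(QUOTES.sub('"', line))

if __name__ == "__main__":
    unquote_replace("hugequote.txt")    # or: unquote_regex("hugequote.txt")

Using sys.stdout.write keeps the same source runnable under both python2 and
python3.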
Tschüss...