should writing Unicode files be so slow
djc
slais-www at ucl.ac.uk
Fri Mar 19 13:18:17 EDT 2010
Ben Finney wrote:
> What happens, then, when you make a smaller program that deals with only
> one file?
>
> What happens when you make a smaller program that only reads the file,
> and doesn't write any? Or a different program that only writes a file,
> and doesn't read any?
>
> It's these sort of reductions that will help narrow down exactly what
> the problem is. Do make sure that each example is also complete (i.e.
> can be run as is by someone who uses only that code with no additions).
>
The program reads one csv file of 9,293,271 lines.
869M wb.csv
It creates a set of files containing the same lines, but where each
output file in the set contains only those lines where the value of a
particular column is the same. The number of output files will depend on
the number of distinct values in that column. In the example that results
in 19 files:
74M tt_11696870405.txt
94M tt_18762175493.txt
15M tt_28668070915.txt
12M tt_28673313795.txt
15M tt_28678556675.txt
11M tt_28683799555.txt
12M tt_28689042435.txt
15M tt_28694285315.txt
7.3M tt_28835845125.txt
6.8M tt_28842136581.txt
12M tt_28848428037.txt
11M tt_28853670917.txt
12M tt_28858913797.txt
15M tt_28864156677.txt
11M tt_28869399557.txt
11M tt_28874642437.txt
283M tt_31002203141.txt
259M tt_33335282691.txt
45 2010-03-19 17:00 tt_taskid.txt
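For anyone wanting a complete, runnable reduction of the splitting step described above, here is a minimal sketch. It is not the actual program: the function name split_by_column, the column index, the tt_<value>.txt naming, and the assumption that the input is comma-separated are all illustrative, and it is written for Python 3, where the built-in open takes an encoding argument.

```python
import csv
import os


def split_by_column(src_path, out_dir, column, encoding="utf-8"):
    """Copy each row of src_path into out_dir/tt_<value>.txt,
    where <value> is the content of the given column (hypothetical naming)."""
    outputs = {}  # column value -> (file handle, csv writer)
    try:
        with open(src_path, newline="", encoding=encoding) as src:
            for row in csv.reader(src):
                key = row[column]
                if key not in outputs:
                    # First time this value is seen: open its output file.
                    path = os.path.join(out_dir, "tt_%s.txt" % key)
                    handle = open(path, "w", newline="", encoding=encoding)
                    outputs[key] = (handle, csv.writer(handle))
                outputs[key][1].writerow(row)
    finally:
        # Close every output file, even if reading failed part-way.
        for handle, _ in outputs.values():
            handle.close()
    return sorted(outputs)
```

All output files stay open for the whole run, which is fine for a few dozen distinct values but would need rethinking if the column had thousands.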
changing
    with open(filename, 'rU') as tabfile:
to
    with codecs.open(filename, 'rU', 'utf-8', 'backslashreplace') as tabfile:
and
    with open(outfile, 'wt') as out_part:
to
    with codecs.open(outfile, 'w', 'utf-8') as out_part:
causes a program that runs in
43 seconds to take 4 minutes to process the same data. In this particular
case that is not very important: any unicode strings in the data are not
worth troubling over, and I have already spent more time satisfying
curiosity than will ever be required to process the dataset in
future. But I have another project in hand where not only is the
unicode significant but the files are very much larger. Scale up the
problem and the difference between 4 hours and 24 becomes a matter worth
some attention.
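To isolate the write side of the slowdown, a small timing harness along the following lines may help. It is only a sketch: time_write and compare_writers are made-up helper names, the code targets Python 3, and io.open is included purely as one alternative worth measuring against codecs.open. Whether it is faster on a given interpreter is something to measure, not assume.

```python
import codecs
import io
import os
import time


def time_write(opener, path, lines):
    """Write all lines through the file object returned by opener(path)
    and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    with opener(path) as out:
        for line in lines:
            out.write(line)
    return time.perf_counter() - start


def compare_writers(lines, directory):
    """Time the same UTF-8 write with codecs.open and with io.open.
    Returns {name: (seconds, output path)}."""
    openers = {
        "codecs.open": lambda p: codecs.open(p, "w", "utf-8"),
        # newline="" disables newline translation, so both files
        # should come out byte-for-byte identical.
        "io.open": lambda p: io.open(p, "w", encoding="utf-8", newline=""),
    }
    results = {}
    for name, opener in openers.items():
        path = os.path.join(directory, name.replace(".", "_") + ".txt")
        results[name] = (time_write(opener, path, lines), path)
    return results
```

Calling compare_writers with a few hundred thousand representative lines and a scratch directory gives a pair of timings that can be compared directly. The same pattern can be turned around to time reading.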
--
David Clark, MSc, PhD. UCL Centre for Publishing
Gower Str London WC1E 6BT
What sort of web animal are you?
<https://www.bbc.co.uk/labuk/experiments/webbehaviour>