should writing Unicode files be so slow

djc slais-www at ucl.ac.uk
Fri Mar 19 13:18:17 EDT 2010


Ben Finney wrote:

> What happens, then, when you make a smaller program that deals with only
> one file?
> 
> What happens when you make a smaller program that only reads the file,
> and doesn't write any? Or a different program that only writes a file,
> and doesn't read any?
> 
> It's these sort of reductions that will help narrow down exactly what
> the problem is. Do make sure that each example is also complete (i.e.
> can be run as is by someone who uses only that code with no additions).
> 


The program reads one csv file of 9,293,271 lines:
869M wb.csv
It creates a set of output files containing the same lines, where each
output file in the set contains only those lines that share a value in a
particular column; the number of output files depends on the number of
distinct values in that column. In this example that results in 19 files:

74M tt_11696870405.txt
94M tt_18762175493.txt
15M  tt_28668070915.txt
12M tt_28673313795.txt
15M  tt_28678556675.txt
11M  tt_28683799555.txt
12M  tt_28689042435.txt
15M  tt_28694285315.txt
7.3M  tt_28835845125.txt
6.8M tt_28842136581.txt
12M  tt_28848428037.txt
11M  tt_28853670917.txt
12M  tt_28858913797.txt
15M  tt_28864156677.txt
11M  tt_28869399557.txt
11M  tt_28874642437.txt
283M  tt_31002203141.txt
259M  tt_33335282691.txt
45 2010-03-19 17:00 tt_taskid.txt

changing
with open(filename, 'rU') as tabfile:
to
with codecs.open(filename, 'rU', 'utf-8', 'backslashreplace') as tabfile:

and
with open(outfile, 'wt') as out_part:
to
with codecs.open(outfile, 'w', 'utf-8') as out_part:

causes a program that runs in 43 seconds to take 4 minutes to process the
same data. In this particular case that is not very important: any unicode
strings in the data are not worth troubling over, and I have already spent
more time satisfying curiosity than will ever be required to process the
dataset in future. But I have another project in hand where not only is
the unicode significant but the files are very much larger. Scale the
problem up and the difference between 4 hours and 24 becomes a matter
worth some attention.



-- 
David Clark, MSc, PhD.              UCL Centre for Publishing
                                    Gower St, London WC1E 6BT
What sort of web animal are you?
            <https://www.bbc.co.uk/labuk/experiments/webbehaviour>
