should writing Unicode files be so slow

Gabriel Genellina gagsl-py2 at yahoo.com.ar
Fri Mar 19 18:23:59 EDT 2010


On Fri, 19 Mar 2010 14:18:17 -0300, djc <slais-www at ucl.ac.uk> wrote:
> Ben Finney wrote:
>
>> What happens, then, when you make a smaller program that deals with only
>> one file?
>>
>> What happens when you make a smaller program that only reads the file,
>> and doesn't write any? Or a different program that only writes a file,
>> and doesn't read any?
>>
>> It's these sort of reductions that will help narrow down exactly what
>> the problem is. Do make sure that each example is also complete (i.e.
>> can be run as is by someone who uses only that code with no additions).
>>
>
>
> The program reads one csv file of 9,293,271 lines.
> 869M wb.csv
> It creates a set of files containing the same lines, but where each
> output file in the set contains only those lines where the value of a
> particular column is the same. The number of output files will depend on
> the number of distinct values in that column; in the example that
> results in 19 files.
>
> changing
> with open(filename, 'rU') as tabfile:
> to
> with codecs.open(filename, 'rU', 'utf-8', 'backslashreplace') as tabfile:
>
> and
> with open(outfile, 'wt') as out_part:
> to
> with codecs.open(outfile, 'w', 'utf-8') as out_part:
>
> causes a program that runs in 43 seconds to take 4 minutes to process
> the same data. In this particular case that is not very important: any
> unicode strings in the data are not worth troubling over, and I have
> already spent more time satisfying curiosity than will ever be required
> to process the dataset in future. But I have another project in hand
> where not only is the unicode significant but the files are very much
> larger. Scale up the problem and the difference between 4 hours and 24
> becomes a matter worth some attention.
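
For reference, the pass described above amounts to something like the
following minimal sketch. The tab delimiter, the hard-coded filename
wb.csv and the key column index are assumptions made purely for
illustration; the real script will differ.

import csv

KEY_COL = 3          # hypothetical index of the column to split on
outfiles = {}        # one open output file per distinct key value

with open('wb.csv', 'rU') as tabfile:
    for row in csv.reader(tabfile, delimiter='\t'):
        key = row[KEY_COL]
        if key not in outfiles:
            # first time this value is seen: open a new output part
            outfiles[key] = open('part_%s.csv' % key, 'wt')
        outfiles[key].write('\t'.join(row) + '\n')

for out_part in outfiles.values():
    out_part.close()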

Ok. Your test program is too large to determine what's going on. First,
try to determine *which* part is slow:

- reading: measure the time it takes only to read the file, with open()
and codecs.open(). The density of non-ascii characters and their code
points may matter (utf-8 is much more efficient for ASCII data than for,
say, Hanzi).
- processing: measure the time the processing part takes (fed with str
vs unicode data).
- writing: measure the time it takes only to write a file, with open()
and codecs.open().

Only then can one focus on optimizing the bottleneck.
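
A rough timing sketch along those lines might look like this. The
filename wb.csv, the utf-8 codec and the sample data are assumptions;
substitute the real file and encoding.

import codecs
import time

def time_read(opener, *args):
    # read the whole file line by line and return elapsed seconds
    start = time.time()
    with opener(*args) as f:
        for line in f:
            pass
    return time.time() - start

def time_write(opener, lines, *args):
    # write the given lines and return elapsed seconds
    start = time.time()
    with opener(*args) as f:
        for line in lines:
            f.write(line)
    return time.time() - start

print('plain read:   %.2fs' % time_read(open, 'wb.csv', 'rU'))
print('codecs read:  %.2fs' % time_read(codecs.open, 'wb.csv', 'rU',
                                        'utf-8', 'backslashreplace'))

str_lines = ['just an ascii sample line\n'] * 1000000
uni_lines = [u'just an ascii sample line\n'] * 1000000
print('plain write:  %.2fs' % time_write(open, str_lines,
                                         'out_str.txt', 'wt'))
print('codecs write: %.2fs' % time_write(codecs.open, uni_lines,
                                         'out_uni.txt', 'w', 'utf-8'))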

-- 
Gabriel Genellina



