should writing Unicode files be so slow
Gabriel Genellina
gagsl-py2 at yahoo.com.ar
Fri Mar 19 18:23:59 EDT 2010
On Fri, 19 Mar 2010 14:18:17 -0300, djc <slais-www at ucl.ac.uk> wrote:
> Ben Finney wrote:
>
>> What happens, then, when you make a smaller program that deals with only
>> one file?
>>
>> What happens when you make a smaller program that only reads the file,
>> and doesn't write any? Or a different program that only writes a file,
>> and doesn't read any?
>>
>> It's these sorts of reductions that will help narrow down exactly what
>> the problem is. Do make sure that each example is also complete (i.e.
>> can be run as is by someone who uses only that code with no additions).
>>
>
> The program reads one csv file of 9,293,271 lines:
> 869M wb.csv
> It creates a set of files containing the same lines, but where each
> output file in the set contains only those lines where the value of a
> particular column is the same; the number of output files depends on
> the number of distinct values in that column. In this example that
> results in 19 files.
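
A minimal sketch of that kind of split-by-column loop (the key column
index and output file naming here are hypothetical, not the OP's actual
code) might look like this:

    # Split a tab-separated file into one output file per distinct value
    # of a chosen column.  KEY_COL and the output names are assumptions.
    KEY_COL = 3
    filename = 'wb.csv'

    out_parts = {}                 # one open output file per key value
    try:
        with open(filename, 'rU') as tabfile:
            for line in tabfile:
                key = line.rstrip('\n').split('\t')[KEY_COL]
                out = out_parts.get(key)
                if out is None:
                    out = open('part_%s.txt' % key, 'wt')
                    out_parts[key] = out
                out.write(line)
    finally:
        for out in out_parts.values():
            out.close()
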
>
> changing
>     with open(filename, 'rU') as tabfile:
> to
>     with codecs.open(filename, 'rU', 'utf-8', 'backslashreplace') as tabfile:
>
> and
>     with open(outfile, 'wt') as out_part:
> to
>     with codecs.open(outfile, 'w', 'utf-8') as out_part:
>
> causes a program that runs in 43 seconds to take 4 minutes to process
> the same data. In this particular case that is not very important: any
> unicode strings in the data are not worth troubling over, and I have
> already spent more time satisfying curiosity than will ever be required
> to process the dataset in future. But I have another project in hand
> where not only is the unicode significant but the files are very much
> larger. Scale up the problem and the difference between 4 hours and 24
> becomes a matter worth some attention.
Ok. Your test program is too large to determine what's going on. Try to
determine first *which* part is slow:

- reading: measure the time it takes only to read the file, with open()
  and with codecs.open(). The density of non-ASCII characters and their
  code points may matter here, since UTF-8 is much more efficient for
  ASCII data than for, say, Hanzi.
- processing: measure the time the processing part takes, fed with str
  vs. unicode data.
- writing: measure the time it takes only to write a file, with open()
  and with codecs.open().

Only then can one focus on optimizing the actual bottleneck; a rough
timing sketch along those lines is given below.
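
The following is a minimal Python 2 timing sketch along those lines, not a
definitive benchmark; the input file name, the sample line, and the line
count are assumptions, and only the reading and writing parts are shown:

    import codecs
    import time

    filename = 'wb.csv'        # assumed: the same tab-separated input

    def timed(label, func):
        start = time.time()
        func()
        print('%-35s %6.1f s' % (label, time.time() - start))

    # --- reading --------------------------------------------------------
    def read_plain():
        with open(filename, 'rU') as f:
            for line in f:
                pass

    def read_codecs():
        # same arguments as in the original post
        with codecs.open(filename, 'rU', 'utf-8', 'backslashreplace') as f:
            for line in f:
                pass

    # --- writing --------------------------------------------------------
    SAMPLE = u'col1\tcol2\tsome text with a non-ASCII char \u00e9\n'

    def write_plain():
        data = SAMPLE.encode('utf-8')
        with open('bench_plain.tmp', 'wt') as out:
            for _ in xrange(1000000):
                out.write(data)

    def write_codecs():
        with codecs.open('bench_codecs.tmp', 'w', 'utf-8') as out:
            for _ in xrange(1000000):
                out.write(SAMPLE)

    timed('read, plain open()', read_plain)
    timed('read, codecs.open()', read_codecs)
    timed('write 1M lines, plain open()', write_plain)
    timed('write 1M lines, codecs.open()', write_codecs)
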
--
Gabriel Genellina