[Tutor] genfromtxt vs. reading csv to a list or dictionary

Danny Yoo dyoo at hashcollision.org
Tue Feb 9 01:16:42 EST 2016

> I normally need to convert csv and text files to a Numpy array. I tried to
> do the same thing using (1) reader=DictReader(MyFile), (2)
> reader=csv.reader(MyFile), or (3) genfromtxt(MyFile, ……).  The first two
> are used after I open the file. They produce a list of lists, a list of
> tuples, or a list of dictionaries, which is later converted to an array.

If we're touching the hard drive as part of input/output operations,
you likely won't need to worry about efficiency, especially for a
program dedicated to reading files.

What I mean is, disk operations are *several orders of magnitude* more
expensive than most other non-I/O operations your program will
perform.  As long as we're reading and processing the input in a
non-crazy way, we should be ok.  ("Non-crazy": A small constant number
of passes over the input file, and if the file is very large, doesn't
try to read the whole file into memory at once).  I think all three of
your described processes will be non-crazy.
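To make the three strategies concrete, here is a minimal sketch of all
three side by side.  I'm using an in-memory StringIO in place of your
MyFile, and the column names "x" and "y" are made up for illustration;
substitute your own file handle and headers:

```python
import csv
import io

import numpy as np

# A small in-memory CSV stands in for MyFile.
text = "x,y\n1.0,2.0\n3.0,4.0\n"

# (1) csv.DictReader -> list of dicts -> array
with io.StringIO(text) as f:
    rows = list(csv.DictReader(f))
arr1 = np.array([[float(r["x"]), float(r["y"])] for r in rows])

# (2) csv.reader -> list of lists -> array (skip the header row manually)
with io.StringIO(text) as f:
    reader = csv.reader(f)
    next(reader)  # discard the header
    arr2 = np.array([[float(v) for v in row] for row in reader])

# (3) np.genfromtxt reads straight into an array
with io.StringIO(text) as f:
    arr3 = np.genfromtxt(f, delimiter=",", skip_header=1)
```

All three make a single pass over the input and end up with the same
2x2 float array, so they all qualify as "non-crazy" in the sense above.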

The time a file-parsing program takes will almost certainly be
dominated by I/O.  And you probably can't do anything to change
the physics of how disk platters spin.  This rough rule is sensitive
to context, and several of my assumptions may be wrong.  I'm assuming
a standard desktop environment on a single machine, with a physical
hard drive.  But maybe you have SSDs or some unusual storage that's
very fast or parallel.  If those assumptions are wrong, then yes, you
may need to be concerned about shaving off every last millisecond to
get performance.

How can we know for sure?  We can measure: a "profile" of a program
lets us see if time is truly being spent in non-I/O computation.  See:
https://docs.python.org/3.5/library/profile.html for more details.  If
you have versions of your reader for those three strategies, try
profiling them.
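For example, here is one way to profile a single reader function with
cProfile and pstats from the standard library.  The reader shown is a
stand-in using genfromtxt on an in-memory file; swap in each of your
three versions to compare them:

```python
import cProfile
import io
import pstats

import numpy as np


def read_with_genfromtxt():
    # Stand-in reader: replace with each of your three strategies.
    with io.StringIO("1.0,2.0\n3.0,4.0\n") as f:
        return np.genfromtxt(f, delimiter=",")


# Collect profiling data for one call to the reader.
profiler = cProfile.Profile()
profiler.enable()
result = read_with_genfromtxt()
profiler.disable()

# Sort by cumulative time to see where the time actually goes,
# and print only the top few entries.
stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats(5)
```

If the top entries are dominated by file reading rather than your own
processing code, that confirms the program is I/O-bound and further
micro-optimization of the parsing won't buy much.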
