sorting 1172026 entries

Sun May 6 19:54:16 EDT 2012

On 06May2012 18:36, J. Mwebaze <jmwebaze at gmail.com> wrote:
| > for filename in txtfiles:
| >    temp=[]
| >    f=open(filename)
| >    for line in f.readlines():
| >      line = line.strip()
| >      line=line.split()
| >      temp.append((parser.parse(line[0]), float(line[1])))

Have you timed the different parts of your code instead of the whole
thing?

Specifically, do you know that the sort time is the large cost?

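I would point out that the loop above builds the list by append(), one
item at a time. I had half expected that to cost something like the square
of the list length, 1172026 * 1172026, but append() is amortised constant
time, so building the list is only linear in its length. I've just done this: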

  [Documents/python]oscar1*> python
  Python 2.7.3 (default, May  4 2012, 16:19:02) 
  [GCC 4.2.1 (Apple Inc. build 5664)] on darwin
  Type "help", "copyright", "credits" or "license" for more information.
  >>> L1 = []
  >>> for i in range(1000000): L1.append(0)
  ... 

and it only took a few seconds.
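
If you want numbers rather than my impression, here is a minimal sketch
that times the same append() loop at a few sizes. The function name
time_appends and the sizes are just my choices for illustration. If the
time roughly doubles when the size doubles, the cost is linear rather
than quadratic:

```python
from time import time

def time_appends(n):
    """Build a list of n zeros with append(); return (list, elapsed seconds)."""
    start = time()
    items = []
    for _ in range(n):
        items.append(0)
    return items, time() - start

if __name__ == "__main__":
    # Sizes are arbitrary; pick ones near your real list length.
    for n in (250000, 500000, 1000000):
        items, elapsed = time_appends(n)
        print("%d appends took %.3f seconds" % (n, elapsed))
```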

As pointed out by others, the readlines() is also a little expensive:
it has to build a huge list holding every line of the file before your
loop even starts.
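
One way to avoid that intermediate list is to iterate over the file
object directly. A rough sketch, not a drop-in replacement: read_entries
is a name I made up, the date parser is passed in as a parameter (you
would pass your parser.parse), and skipping blank lines is my own
addition, which your original loop does not do:

```python
def read_entries(lines, parse_date):
    """Return a list of (date, value) pairs from 'date value' lines.

    lines may be any iterable of strings, including an open file,
    so the whole file never has to sit in memory as a list of lines.
    """
    entries = []
    for line in lines:
        fields = line.split()      # split() also discards the newline
        if not fields:
            continue               # assumption: blank lines are ignored
        entries.append((parse_date(fields[0]), float(fields[1])))
    return entries
```

You would then call it as temp = read_entries(f, parser.parse), with f
being the file you opened.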

Anyway, put some:

  print time.time()

at various points. Not in the inner bits of the loops, but around larger
chunks, for example:

   from time import time
   temp=[]
   f=open(filename)
   print "after open", time()
   lines = f.readlines()
   print "after readlines", time()
   for line in lines:
     line = line.strip()
     line=line.split()
     temp.append((parser.parse(line[0]), float(line[1])))
   print "after read loop", time()

and so on. At least then you will have a better feel for which part of
your code is taking so long.
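
If subtracting the printed timestamps by hand gets tedious, a small
helper can print the elapsed time between checkpoints instead. This is
only a sketch: the Checkpoints class is hypothetical, and the data in
the example is a stand-in, not your real entries:

```python
from time import time

class Checkpoints(object):
    """Record and print the time elapsed between labelled points."""

    def __init__(self):
        self.last = time()
        self.elapsed = []   # (label, seconds since the previous mark)

    def mark(self, label):
        now = time()
        self.elapsed.append((label, now - self.last))
        self.last = now
        print("%-16s %.3fs" % (label, self.elapsed[-1][1]))

if __name__ == "__main__":
    cp = Checkpoints()
    # Stand-in data: a reversed run of numbers, so the sort has work to do.
    temp = [(float(i), 0.0) for i in range(200000, 0, -1)]
    cp.mark("after build")
    temp.sort()
    cp.mark("after sort")
```

That separates the cost of building the list from the cost of the sort
itself, which is the comparison you want here.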

Cheers,
-- 
Cameron Simpson <cs at zip.com.au> DoD#743
http://www.cskk.ezoshosting.com/cs/

The shortest path between any two truths in the real domain passes through
the complex domain.     - J. Hadamard