Splitting a file from specific column content

Eelco hoogendoorn.eelco at gmail.com
Sun Jan 22 15:43:58 EST 2012


The grep solution is not cross-platform, and not really an answer to a
question about Python.

The by-line iteration examples are inefficient and bad practice from a
numpy/vectorization perspective.

I would advise doing it the numpythonic way (untested code):

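import numpy as np

breakpoints = [3, 5, 7]
data = np.loadtxt('data.txt')
# First column holds the time stamps; searchsorted assumes they are sorted.
time = data[:, 0]
# Row index at which each breakpoint falls.
indices = np.searchsorted(time, breakpoints)
# Cut the rows at those indices and write each chunk to its own file.
chunks = np.split(data, indices, axis=0)
for i, d in enumerate(chunks):
    np.savetxt('data' + str(i) + '.txt', d)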
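
To see what searchsorted and split do here without touching any files, the
following is a small sketch on an in-memory array. The sample values are made
up for illustration, and it assumes the time column is sorted in ascending
order, which searchsorted needs in order to give meaningful positions.

```python
import numpy as np

# Hypothetical sample: column 0 is a sorted time stamp, column 1 a value.
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0],
                 [4.0, 40.0],
                 [5.0, 50.0],
                 [6.0, 60.0],
                 [8.0, 80.0]])
breakpoints = [3, 5, 7]

# For each breakpoint, find the first row whose time is >= that breakpoint.
indices = np.searchsorted(data[:, 0], breakpoints)

# Cut the rows at those positions; n breakpoints give n + 1 chunks.
chunks = np.split(data, indices, axis=0)

for i, chunk in enumerate(chunks):
    print(i, chunk[:, 0])
```

With the default side='left', a row whose time equals a breakpoint starts the
next chunk; pass side='right' to searchsorted if it should end the previous
one instead.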

I am not sure how it compares to the grep solution in terms of performance,
but that should be a non-issue for 20 MB of data, and it is sure to
blow the by-line iteration out of the water. If you want to be more
efficient, you are going to have to cut the text-to-numeric parsing
out of the loop, since that is the vast majority of the computational
load here. Whether that is possible at all depends on how structured
your timestamps are. In my opinion, there must be a really compelling
performance gain to justify throwing the elegance of the np.split-based
solution out the window.
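
As a rough sketch of what reducing the parsing could look like, the snippet
below converts only the first field of each line to a number and copies the
rest through as raw text. This is one possible reading of the idea, not a
tested optimization: the helper name split_lines_by_time and the sample lines
are hypothetical, and it assumes whitespace-separated columns with a numeric,
ascending time stamp in the first column. It still loops over lines in
Python, so whether it actually beats loadtxt would have to be measured.

```python
import numpy as np

def split_lines_by_time(lines, breakpoints):
    """Split raw text lines at the given time breakpoints.

    Assumes whitespace-separated columns, a numeric time stamp in the
    first column, and lines already sorted by that time stamp.
    """
    # Parse only the first field of each line; the rest stays as text.
    time = np.array([float(line.split(None, 1)[0]) for line in lines])
    indices = np.searchsorted(time, breakpoints)
    # Slice the list of lines at the positions found above.
    edges = [0, *indices.tolist(), len(lines)]
    return [lines[a:b] for a, b in zip(edges[:-1], edges[1:])]

# Hypothetical sample input standing in for the contents of a data file.
lines = ["1.0 10 100\n", "2.5 20 200\n", "3.0 30 300\n",
         "6.0 40 400\n", "7.5 50 500\n"]
chunks = split_lines_by_time(lines, [3, 5, 7])
```

Each chunk is a list of the original lines, so it can be written back out
with writelines without any number formatting.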


