READ FILE UNTIL THE Xth SPACE IS ENCOUNTERED...

Wed Oct 17 13:43:08 EDT 2001

On Tue, 16 Oct 2001 22:42:47 +0200 (MEST), chr_w at gmx.de wrote:

>Hi list :-)
>
>Yeah, I'm a Python beginner looking for help once again... ;-) It seems I'm
>starting out the hard way with the coding tasks I have to solve here... :-(
>
>I don't know if this has been covered before... Here's what I am trying
>(amongst other things) to do:
>"read in an ASCII file until a given number of delimiting spaces has been
>reached"
>The file contains float numbers delimited by spaces (ASCII).
>
>Motivation: I have massive ASCII files containing floats delimited by spaces
>and run out of memory when trying to get them into the machine in one piece
>(followed by splitting into a list object).
>
>What I am doing right now:
>1) read in the entire file until EOF is reached
>2) split into a list of strings
>3) convert to float array
>
>This works fine for my smaller files. But there are also some big bastards
>to be done.
>So:
>
>What I want to do:
>1) read the file until space number 63477 is encountered
>2) take this string and split it into a list, then convert that into
>a float array
>3) continue reading the file (until the next 63477 spaces have been read)
>4) etc.
>
>Don't worry about the number '63477': this is the total number of fields in
>a 3D array I want to produce.
>No idea how to do this, since all examples I found read until EOF or by
>lines. Forgive me if I missed something really basic here - I just couldn't find
>any related topic in the newsgroups or docs...
>
>cheers,
>christian

You could use mmap to map the file into memory and use a regular
expression to extract the floats, or simply iterate through the buffer
and extract the data yourself.

bob
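
The second option above ("iterate through the buffer and extract the data
yourself") could look roughly like the sketch below. It is written for
Python 3 and does not use mmap at all: it reads the file in fixed-size
blocks and hands back the floats in groups of a given size. The names
read_float_chunks, count and blocksize are made up for this illustration,
and it assumes every whitespace-separated token is something float() accepts.

```python
def read_float_chunks(path, count, blocksize=1 << 16):
    """Yield lists of `count` floats from a whitespace-delimited text file.

    Hypothetical helper for illustration. The last list may be shorter
    than `count` if the file does not divide evenly.
    """
    pending = []  # floats parsed so far but not yet handed out
    tail = ''     # possibly incomplete token left over from the last block
    with open(path, 'r') as f:
        while True:
            block = f.read(blocksize)
            if not block:
                break
            block = tail + block
            tokens = block.split()
            # If the block does not end in whitespace, its last token may
            # have been cut in half; hold it back for the next block.
            if tokens and not block[-1].isspace():
                tail = tokens.pop()
            else:
                tail = ''
            pending.extend(map(float, tokens))
            while len(pending) >= count:
                yield pending[:count]
                del pending[:count]
    # End of file: the held-back token is complete after all.
    if tail:
        pending.append(float(tail))
    while len(pending) >= count:
        yield pending[:count]
        del pending[:count]
    if pending:
        yield pending

# Possible usage (file name and chunk size taken from the thread):
#     for chunk in read_float_chunks('floats.txt', 63477):
#         ...  # convert `chunk` to an array and process it
```

Only one block of text plus at most one group of floats is held in memory
at a time, which is the point of the exercise for the large files.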

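import mmap
import re
import time

# Open read-only and map the whole file (length 0 means "entire file").
f = open('floats.txt', 'rb')
filemap = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# Bytes pattern, because the mmap object is a bytes-like buffer.
# The dot is escaped so that it only matches a literal decimal point.
# Note: this only matches plain unsigned decimals such as 12.5.
FLOAT = rb'\d+\.\d+'

# this will give you a list of all floats (as byte strings)
# which you can slice and dice as you wish
# uses quite a bit of memory though
t = time.perf_counter()
l = re.findall(FLOAT, filemap)
t2 = time.perf_counter()
print("took", t2 - t)
print(len(l))
l = []

# this is a bit slower but seems to use less memory
# building a 2d matrix here
class newMatrix:
    def __init__(self, maxlen=2191):
        self.currentlist = []
        self.lists = [self.currentlist]
        self.maxlen = maxlen

    def __call__(self, match):
        # start a new row once the current one is full
        if len(self.currentlist) == self.maxlen:
            self.currentlist = []
            self.lists.append(self.currentlist)
        # float() ignores the trailing whitespace captured by the pattern
        self.currentlist.append(float(match.group(0)))
        return b''

nm = newMatrix()
t = time.perf_counter()
# re.sub calls nm for every match and replaces the match with nothing,
# so v holds only whatever the pattern did not match
v = re.sub(FLOAT + rb'\s*', nm, filemap)
t2 = time.perf_counter()
print(len(v), len(v.strip()))
print(len(nm.lists))
print('sub took', t2 - t)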
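
A lazier variant of the regular-expression approach, as a rough Python 3
sketch: re.finditer yields one match at a time, so the floats can be
grouped into fixed-size chunks without first building the full list. The
names FLOAT_RE and float_chunks are invented for this example, and the
pattern is an assumption that also allows signs and exponents, which the
script above does not handle.

```python
import re
from itertools import islice

# Assumed number format: optional sign, digits with an optional fraction
# (or a bare fraction such as .25), and an optional exponent.
FLOAT_RE = re.compile(rb'[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?')

def float_chunks(buf, count):
    """Lazily yield lists of up to `count` floats found in a bytes-like buffer."""
    # Generator expression: matches are converted only as they are consumed.
    values = (float(m.group(0)) for m in FLOAT_RE.finditer(buf))
    while True:
        chunk = list(islice(values, count))
        if not chunk:
            return
        yield chunk
```

Since an mmap object is a bytes-like buffer, something like
float_chunks(filemap, 63477) should work on the mapped file, with only one
chunk of floats alive at a time.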