split large file by string/regex

Bengt Richter bokr at oz.net
Mon Nov 22 12:21:15 EST 2004

On Mon, 22 Nov 2004 15:28:54 +0100, Martin Dieringer <dieringe at zedat.fu-berlin.de> wrote:

>Jason Rennie <jrennie at csail.mit.edu> writes:
>> On Mon, Nov 22, 2004 at 09:38:55AM +0100, Martin Dieringer wrote:
>>> I am trying to split a file by a fixed string.
>>> The file is too large to just read it into a string and split this.
>>> I could probably use a lexer, but is there maybe something simpler?
>> If the pattern is contained within a single line, do something like this:
>Hmm it's binary data, I can't tell how long lines would be. OTOH a
>line would certainly contain the pattern as it has no \n in it... and
>the lines probably wouldn't be too large for memory...
Do you want to keep the splitting string? I.e., if you split with xxx
from '1231xxx45646xxx45646xxx78' do you want the long-file equivalent of

 >>> '1231xxx45646xxx45646xxx78'.split('xxx')
 ['1231', '45646', '45646', '78']

or (I chose this for below)
 ['1231', 'xxx', '45646', 'xxx', '45646', 'xxx', '78']

or maybe

 ['1231xxx', '45646xxx', '45646xxx', '78']
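
For comparison, here is a minimal in-memory sketch of the three shapes above. It only illustrates the choice of output format; it is not the large-file solution. The variable names are made up for the example. The point is that the alternating form (the second one) loses nothing, since the other two can be derived from it.

```python
import re

s = '1231xxx45646xxx45646xxx78'
delim = 'xxx'

# A capturing group makes re.split keep the delimiter in the result list.
alternating = re.split('(%s)' % re.escape(delim), s)

# Every second item is data, so dropping the odd positions gives plain split().
plain = alternating[::2]

# Gluing each data item to the delimiter that follows it gives the third form.
attached = [''.join(alternating[i:i + 2]) for i in range(0, len(alternating), 2)]

print(alternating)
print(plain)
print(attached)
```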


Anyway, I'd use a generator to iterate through the file and look for the delimiter.
This is case-sensitive, BTW (practically untested ;-):

--< splitfile.py >----------------------------------------------
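def splitfile(path, splitstr, chunksize=1024*64): # try a megabyte?
    splen = len(splitstr)
    # iter(callable, sentinel): keep reading until read() returns '' at EOF
    chunks = iter(lambda f=open(path,'rb'):f.read(chunksize), '')
    buf = ''
    for chunk in chunks:
        buf += chunk
        start = end = 0
        while end>=0 and len(buf)>=splen:
            start, end = end, buf.find(splitstr, end)
            if end>=0:
                yield buf[start:end]  #not including splitstr
                yield splitstr  # == buf[end:end+splen] # splitstr
                end += splen
            else:
                # no more delimiters in buf: keep the unmatched tail, since
                # a delimiter may straddle the boundary with the next chunk
                buf = buf[start:]

    yield buf  # whatever follows the last delimiter (possibly '')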

def test(*args):
    for chunk in splitfile(*args):
        print repr(chunk)

if __name__ == '__main__':
    import sys
    args = sys.argv[1:]
    try:
        # optional third argument is the chunk size in bytes
        if len(args)==3: args[2]=int(args[2])
        test(*args)
    except Exception:
        raise SystemExit, 'Usage: python splitfile.py path splitstr [chunksize=64k]'

Extent of testing follows :-)

 >>> print '%s\n%s%s'%('-'*40, open('splitfile.txt','rb').read(),'-'*40)
 >>> import ut.splitfile
 >>> ut.splitfile.test('splitfile.txt', 'abc')
 >>> ut.splitfile.test('splitfile.txt', '012')
 >>> it = ut.splitfile.splitfile('splitfile.txt','ab89',4)
 >>> it.next
 <method-wrapper object at 0x02EF1C6C>
 >>> it.next()
 >>> it.next()
 >>> it.next()
 >>> it.next()
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
 StopIteration
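
The session above steps the generator by hand with it.next(), which is the Python 2 spelling. As a small sketch of the same idea on Python 3, where the call is written next(it): the stand-in generator below is hypothetical and simply yields three fixed pieces, so that the fourth call runs off the end and raises StopIteration.

```python
def pieces():
    # Hypothetical stand-in for a splitting generator: data, delimiter, data.
    yield '1231'
    yield 'xxx'
    yield '78'

it = pieces()

# Three successful steps, one per yielded piece.
got = [next(it), next(it), next(it)]

# A further step finds the generator exhausted.
try:
    next(it)
    exhausted = False
except StopIteration:
    exhausted = True

print(got, exhausted)
```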

(I put it in my ut package directory but you can put splitfile.py anywhere handy
and mod it to do what you need).
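
One possible modification, for readers on Python 3: a file opened with 'rb' returns bytes there, so both the delimiter and the end-of-file sentinel need to be bytes. The following is a minimal sketch of the same buffering idea under that assumption, not a drop-in replacement for the listing. The name split_stream and the temporary test file are invented for the example.

```python
import os
import tempfile

def split_stream(path, delim, chunksize=64 * 1024):
    """Yield the file's contents as alternating data and delimiter pieces."""
    if not delim:
        raise ValueError('delim must be a non-empty bytes object')
    buf = b''
    with open(path, 'rb') as f:
        # iter(callable, sentinel) stops when read() returns b'' at EOF.
        for chunk in iter(lambda: f.read(chunksize), b''):
            buf += chunk
            pos = 0
            while True:
                hit = buf.find(delim, pos)
                if hit < 0:
                    break
                yield buf[pos:hit]   # data before the delimiter
                yield delim          # the delimiter itself
                pos = hit + len(delim)
            # Keep the unmatched tail: a delimiter may straddle two chunks.
            buf = buf[pos:]
    yield buf                        # data after the last delimiter

# Demonstration on a small temporary file, with a tiny chunk size so that
# delimiters fall across chunk boundaries.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'1231xxx45646xxx45646xxx78')
try:
    pieces = list(split_stream(tmp.name, b'xxx', chunksize=4))
finally:
    os.remove(tmp.name)
print(pieces)
```

Because the whole unmatched tail is carried over, the buffer can still grow large if the delimiter never appears, so this only helps when delimiters occur reasonably often.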

Bengt Richter
