split large file by string/regex

Mon Nov 22 12:21:15 EST 2004

On Mon, 22 Nov 2004 15:28:54 +0100, Martin Dieringer <dieringe at zedat.fu-berlin.de> wrote:

>Jason Rennie <jrennie at csail.mit.edu> writes:
>
>> On Mon, Nov 22, 2004 at 09:38:55AM +0100, Martin Dieringer wrote:
>>> I am trying to split a file by a fixed string.
>>> The file is too large to just read it into a string and split this.
>>> I could probably use a lexer, but is there maybe something simpler?
>>
>> If the pattern is contained within a single line, do something like this:
>
>Hmm it's binary data, I can't tell how long lines would be. OTOH a
>line would certainly contain the pattern as it has no \n in it... and
>the lines probably wouldn't be too large for memory...
>
>m.

Do you want to keep the splitting string? I.e., if you split on 'xxx' in
'1231xxx45646xxx45646xxx78', do you want the long-file equivalent of

 >>> '1231xxx45646xxx45646xxx78'.split('xxx')
 ['1231', '45646', '45646', '78']

or (this is what I chose below)

 ['1231', 'xxx', '45646', 'xxx', '45646', 'xxx', '78']

or maybe

 ['1231xxx', '45646xxx', '45646xxx', '78']

??
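
For a string that does fit in memory, the three shapes above can be sketched
roughly like this. It is written in Python 3 syntax, which is an assumption
about what a reader would run; the names s, sep, dropped, separate and
attached are made up for illustration only.

```python
import re

s = '1231xxx45646xxx45646xxx78'
sep = 'xxx'

# 1. Delimiters dropped: plain str.split.
dropped = s.split(sep)

# 2. Delimiters kept as separate items: a capturing group makes
#    re.split include the matched text in its result.
separate = re.split('(%s)' % re.escape(sep), s)

# 3. Delimiters left attached to the piece that precedes them.
attached = [p + sep for p in dropped[:-1]] + dropped[-1:]

print(dropped)
print(separate)
print(attached)
```

None of these help with a file too large to read at once, but they pin down
which output the streaming version should imitate.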

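Anyway, I'd use a generator that reads the file in chunks and searches a
running buffer for the delimiter, so a delimiter straddling a chunk boundary
is still found. The buffer grows until a delimiter turns up, so memory use
depends on the longest piece rather than on chunksize.
This is case-sensitive, BTW (practically untested ;-):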

--< splitfile.py >----------------------------------------------
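def splitfile(path, splitstr, chunksize=1024*64): # try a megabyte?
    # Yields data, splitstr, data, splitstr, ..., data (the last may be '').
    if not splitstr:
        raise ValueError('splitstr must not be empty')  # find('') would loop forever
    splen = len(splitstr)
    # iter(callable, sentinel): keep calling f.read(chunksize) until it returns ''
    chunks = iter(lambda f=open(path,'rb'):f.read(chunksize), '')
    buf = ''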
    for chunk in chunks:
        buf += chunk
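        start = end = 0
        while end>=0 and len(buf)>=splen:
            # start = where the current piece begins, end = next delimiter (or -1)
            start, end = end, buf.find(splitstr, end)
            if end>=0:
                yield buf[start:end]  #not including splitstr
                yield splitstr  # == buf[end:end+splen] # splitstr
                end += splen  # resume the search just past the delimiter
            else:
                buf = buf[start:]  # keep the unmatched tail for the next chunk
                break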

    yield buf

def test(*args):
    for chunk in splitfile(*args):
        print repr(chunk)

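if __name__ == '__main__':
    import sys
    args = sys.argv[1:]
    usage = 'Usage: python splitfile.py path splitstr [chunksize=64k]'
    if len(args) not in (2, 3):  # a wrong argument count also gets the usage line
        raise SystemExit(usage)
    try:
        if len(args)==3: args[2]=int(args[2])
    except ValueError:
        raise SystemExit(usage)
    test(*args)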
----------------------------------------------------------------

Extent of testing follows :-)

 >>> print '%s\n%s%s'%('-'*40, open('splitfile.txt','rb').read(),'-'*40)
 ----------------------------------------
 01234abc5678abc901234
 567ab890abc
 ----------------------------------------
 >>> import ut.splitfile
 >>> ut.splitfile.test('splitfile.txt', 'abc')
 '01234'
 'abc'
 '5678'
 'abc'
 '901234\r\n567ab890'
 'abc'
 '\r\n'
 >>> ut.splitfile.test('splitfile.txt', '012')
 ''
 '012'
 '34abc5678abc9'
 '012'
 '34\r\n567ab890abc\r\n'
 >>> it = ut.splitfile.splitfile('splitfile.txt','ab89',4)
 >>> it.next
 <method-wrapper object at 0x02EF1C6C>
 >>> it.next()
 '01234abc5678abc901234\r\n567'
 >>> it.next()
 'ab89'
 >>> it.next()
 '0abc\r\n'
 >>> it.next()
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
 StopIteration
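
For anyone running Python 3, here is a rough sketch of how the same generator
might look there. It is not a drop-in replacement for the script above: the
name split_stream is hypothetical, it takes an already-open binary file
object instead of a path (so the caller decides when to close it), and it
works on bytes rather than str. It also only rescans the part of the buffer
where a new match could start, which is my own tweak and not something the
original does.

```python
import io

def split_stream(f, delim, chunksize=64 * 1024):
    """Yield data, delim, data, delim, ..., data from binary file object f."""
    if not delim:
        raise ValueError('delim must not be empty')
    buf = b''
    # iter(callable, sentinel) stops when read() returns b'' at EOF.
    for chunk in iter(lambda: f.read(chunksize), b''):
        # buf holds no complete delimiter at this point, so a new match
        # can start no earlier than len(delim)-1 bytes before its end.
        scan = max(0, len(buf) - len(delim) + 1)
        buf += chunk
        start = 0
        while True:
            end = buf.find(delim, scan)
            if end < 0:
                break
            yield buf[start:end]   # the piece before the delimiter
            yield delim
            start = scan = end + len(delim)
        buf = buf[start:]          # keep the unmatched tail
    yield buf

# Same data and arguments as the last session above, held in memory.
data = b'01234abc5678abc901234\r\n567ab890abc\r\n'
for piece in split_stream(io.BytesIO(data), b'ab89', chunksize=4):
    print(repr(piece))
```

With a real file you would pass open(path, 'rb') in place of the BytesIO
object, preferably inside a with block so the file gets closed.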

(I put it in my ut package directory but you can put splitfile.py anywhere handy
and mod it to do what you need).
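
If you want one of the other two output shapes, you may not need to touch the
generator at all. A minimal sketch, assuming the input is the alternating
data/delimiter stream that splitfile yields; the helper names drop_delims and
attach_delims are hypothetical, and the list below just stands in for the
generator's output.

```python
def drop_delims(pieces):
    """Keep only the data items (even positions) of the alternating stream."""
    for i, piece in enumerate(pieces):
        if i % 2 == 0:
            yield piece

def attach_delims(pieces):
    """Glue each delimiter onto the data item that precedes it."""
    it = iter(pieces)
    for data in it:
        delim = next(it, None)  # None once the final data item is reached
        if delim is None:
            yield data
        else:
            yield data + delim

stream = ['1231', 'xxx', '45646', 'xxx', '45646', 'xxx', '78']
print(list(drop_delims(stream)))
print(list(attach_delims(stream)))
```

Both helpers are generators themselves, so they still handle one piece at a
time and do not pull the whole file into memory.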

Regards,
Bengt Richter