split large file by string/regex

Bengt Richter bokr at oz.net
Mon Nov 22 12:21:15 EST 2004


On Mon, 22 Nov 2004 15:28:54 +0100, Martin Dieringer <dieringe at zedat.fu-berlin.de> wrote:

>Jason Rennie <jrennie at csail.mit.edu> writes:
>
>> On Mon, Nov 22, 2004 at 09:38:55AM +0100, Martin Dieringer wrote:
>>> I am trying to split a file by a fixed string.
>>> The file is too large to just read it into a string and split this.
>>> I could probably use a lexer but there maybe anything more simple?
>>
>> If the pattern is contained within a single line, do something like this:
>
>Hmm it's binary data, I can't tell how long lines would be. OTOH a
>line would certainly contain the pattern as it has no \n in it... and
>the lines probably wouldn't be too large for memory...
>
>m.

Do you want to keep the splitting string? I.e., if you split on 'xxx' in
'1231xxx45646xxx45646xxx78', do you want the long-file equivalent of

 >>> '1231xxx45646xxx45646xxx78'.split('xxx')
 ['1231', '45646', '45646', '78']

or (this is what I chose for the code below)
 ['1231', 'xxx', '45646', 'xxx', '45646', 'xxx', '78']

or maybe

 ['1231xxx', '45646xxx', '45646xxx', '78']

??
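
For a string that does fit in memory, the three shapes can be reproduced as
follows. This is only a small sketch to pin down what each option means; the
variable names are made up for illustration, and the third form is built from
the second rather than produced by any single library call.

```python
import re

s = '1231xxx45646xxx45646xxx78'
delim = 'xxx'

# 1. Delimiter dropped: plain str.split
dropped = s.split(delim)

# 2. Delimiter kept as its own item: a capturing group makes
#    re.split return the matched separator between the pieces
separate = re.split('(%s)' % re.escape(delim), s)

# 3. Delimiter kept at the end of each piece: pair every data item
#    with the delimiter after it (the last item has none)
attached = [a + b for a, b in zip(separate[0::2], separate[1::2] + [''])]

print(dropped)
print(separate)
print(attached)
```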

Anyway, I'd use a generator to iterate through the file and look for the delimiter.
This is case-sensitive, BTW (practically untested ;-):

--< splitfile.py >----------------------------------------------
def splitfile(path, splitstr, chunksize=1024*64): # try a megabyte?
    splen = len(splitstr)
    chunks = iter(lambda f=open(path,'rb'):f.read(chunksize), '')
    buf = ''
    for chunk in chunks:
        buf += chunk
        start = end = 0
        while end>=0 and len(buf)>=splen:
            start, end = end, buf.find(splitstr, end)
            if end>=0:
                yield buf[start:end]  #not including splitstr
                yield splitstr  # == buf[end:end+splen] # splitstr
                end += splen
            else:
                buf = buf[start:]
                break

    yield buf

def test(*args):
    for chunk in splitfile(*args):
        print repr(chunk)

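if __name__ == '__main__':
    import sys
    args = sys.argv[1:]
    usage = 'Usage: python splitfile.py path splitstr [chunksize=64k]'
    # wrong argument count would otherwise surface as a TypeError from test()
    if len(args) not in (2, 3):
        raise SystemExit(usage)
    try:
        if len(args)==3: args[2]=int(args[2])
    except ValueError:
        raise SystemExit(usage)
    test(*args)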
----------------------------------------------------------------
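
For anyone on Python 3, here is a minimal sketch of the same generator. It is
an adaptation, not the code exercised in the session below, and it makes a few
assumptions of its own: the delimiter is passed as bytes (the file is opened in
binary mode), the file is closed via a with block, and an empty delimiter is
rejected up front, since find() would otherwise keep matching at the same
position without ever advancing.

```python
def splitfile(path, splitstr, chunksize=64 * 1024):
    """Yield pieces of the file at path, alternating data and delimiter."""
    if not splitstr:
        raise ValueError('splitstr must not be empty')
    splen = len(splitstr)
    buf = b''
    with open(path, 'rb') as f:
        # iter() with a sentinel keeps calling f.read until it returns b'' (EOF)
        for chunk in iter(lambda: f.read(chunksize), b''):
            buf += chunk
            start = 0
            while True:
                end = buf.find(splitstr, start)
                if end < 0:
                    break
                yield buf[start:end]   # data before the delimiter
                yield splitstr         # the delimiter itself
                start = end + splen
            # keep the unmatched tail, so a delimiter that straddles
            # two chunks is still found once the next chunk arrives
            buf = buf[start:]
    yield buf
```

Because the unmatched tail stays in buf until a delimiter turns up, memory use
is bounded by the longest piece between delimiters plus one chunk, not by the
size of the file.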

Extent of testing follows :-)

 >>> print '%s\n%s%s'%('-'*40, open('splitfile.txt','rb').read(),'-'*40)
 ----------------------------------------
 01234abc5678abc901234
 567ab890abc
 ----------------------------------------
 >>> import ut.splitfile
 >>> ut.splitfile.test('splitfile.txt', 'abc')
 '01234'
 'abc'
 '5678'
 'abc'
 '901234\r\n567ab890'
 'abc'
 '\r\n'
 >>> ut.splitfile.test('splitfile.txt', '012')
 ''
 '012'
 '34abc5678abc9'
 '012'
 '34\r\n567ab890abc\r\n'
 >>> it = ut.splitfile.splitfile('splitfile.txt','ab89',4)
 >>> it.next
 <method-wrapper object at 0x02EF1C6C>
 >>> it.next()
 '01234abc5678abc901234\r\n567'
 >>> it.next()
 'ab89'
 >>> it.next()
 '0abc\r\n'
 >>> it.next()
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
 StopIteration

(I put it in my ut package directory, hence the ut.splitfile import above, but
you can put splitfile.py anywhere handy and mod it to do what you need).
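
As one example of such a mod, the other two output shapes can be derived from
the alternating data/delimiter stream without touching the generator itself.
This is a sketch under that assumption; the helper names drop_delimiters and
attach_delimiters are hypothetical, and both accept any iterable that
alternates data, delimiter, data, ... and ends on a data piece.

```python
from itertools import islice

def drop_delimiters(pieces):
    """Keep only the data pieces (every second item, starting with the first)."""
    return islice(pieces, 0, None, 2)

def attach_delimiters(pieces):
    """Glue each data piece to the delimiter that follows it."""
    it = iter(pieces)
    for data in it:
        # the final data piece has no delimiter after it; data[:0] is an
        # empty value of the same type (str or bytes)
        yield data + next(it, data[:0])

stream = ['1231', 'xxx', '45646', 'xxx', '45646', 'xxx', '78']
print(list(drop_delimiters(stream)))
print(list(attach_delimiters(stream)))
```

Both helpers are lazy, so they can wrap the file generator directly and still
process one piece at a time.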

Regards,
Bengt Richter


