An Odd Little Script

Terry Hancock hancock at anansispaceworks.com
Wed Mar 9 17:59:48 EST 2005


On Wednesday 09 March 2005 04:06 pm, Greg Lindstrom wrote:
> Hello-
> 
> I have a task which -- dare I say -- would be easy in <asbestos_undies> 
> Perl </asbestos_undies> but would rather do in Python (our primary 
> language at Novasys).  I have a file with varying length records.  All 
> but the first record, that is; it's always 107 bytes long.  What I would 
> like to do is strip out all linefeeds from the file, read the character 
> in position 107 (the end of segment delimiter) and then replace all of 
> the end of segment characters with linefeeds, making a file where each 
> segment is on its own line.  Currently, some vendors supply files with 
> linefeeds, others don't, and some split the file every 80 bytes.  In 
> Perl I would operate on the file in place and be on my way.  The files 
> can be quite large, so I'd rather not be making extra copies unless it's 
> absolutely essential/required.

The only problem I see is the "in place" requirement, which seems silly
unless by "quite large" you mean multiple gigabytes.  Surely Perl
actually makes a copy in the process even though you never see
it? That much shouldn't be hard, but *actually* doing it in-place
runs into problems with the file data being accidentally overwritten,
doesn't it?

I'd make a copy and delete the original afterwards.

delimiter = open('original', 'r').read(108)[107]
# this is actually the 108th character, right?
# if you mean the actual 107th character, that'll be element [106] of course

BLOCKSIZE = 10000
inf = open('original', 'r')
ouf = open('result', 'w')

while True:
    block = inf.read(BLOCKSIZE)
    if not block:
        break
    block = block.replace('\n', '')
    block = block.replace(delimiter, '\n')
    ouf.write(block)

inf.close()
ouf.close()
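
And since I said I'd delete the original afterwards, a couple of lines of
os module housekeeping would finish it off (untested, and assuming you're
happy for the cleaned-up copy to take over the original's name):

import os

os.remove('original')             # throw away the unprocessed file
os.rename('result', 'original')   # the cleaned-up copy takes its place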

Now, if we can GUARANTEE that a block never gets longer after the rewrite --
and here it can't, since linefeeds are only removed and the delimiter is
swapped one-for-one -- then the write position can never overtake the read
position, and you could actually open the file 'r+' and use the same file
handle for both input and output, with explicit seek() and tell() calls to
keep the two positions straight:

# REALLY, COMPLETELY UNTESTED AND DANGEROUS

BLOCKSIZE = 10000
f = open('didnt_really_want_that_file_anyway.dat', 'r+')
delimiter = f.read(108)[107]   # again, the 108th character, or [106] for the 107th

def indata(fp, pos, sz):
    # seek back to where we last read from, grab the next block
    fp.seek(pos)
    data = fp.read(sz)
    return data, fp.tell()

def outdata(fp, pos, data):
    # seek back to where we last wrote to, put the cleaned-up block there
    fp.seek(pos)
    fp.write(data)
    return fp.tell()

inpos = 0
outpos = 0
while True:
    block, inpos = indata(f, inpos, BLOCKSIZE)
    if not block:
        break
    block = block.replace('\n', '')
    block = block.replace(delimiter, '\n')
    outpos = outdata(f, outpos, block)

f.truncate(outpos)   # the result is never longer, so chop off the stale tail
f.close()

# TOTALLY UNTESTED!

I can imagine evil buffering behavior that could make this trash the file,
BTW -- mixing reads and writes on the same handle is only well-defined if you
seek (or flush) in between, which is why every read and write above seeks
first.  I'd still want to test it on a throwaway copy to make sure nothing
like that happens (or that Python magically compensates somehow).
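
Something along these lines would be a reasonable sanity check -- also
untested, with a made-up filename and a made-up '~' delimiter, and with the
loop above just wrapped into a function so it can be exercised on a small
throwaway file:

import os

def rewrite_in_place(filename, delimiter, blocksize=10000):
    # the same seek/tell loop as above, wrapped up for testing
    f = open(filename, 'r+')
    inpos = outpos = 0
    while True:
        f.seek(inpos)
        block = f.read(blocksize)
        inpos = f.tell()
        if not block:
            break
        block = block.replace('\n', '').replace(delimiter, '\n')
        f.seek(outpos)
        f.write(block)
        outpos = f.tell()
    f.truncate(outpos)    # the result is never longer; drop the stale tail
    f.close()

# three fake segments, joined by '~' and then re-broken every 20 bytes
# with linefeeds, like the vendors who split at 80
segments = ['A' * 107, 'B' * 35, 'C' * 200]
raw = '~'.join(segments) + '~'
sample = open('sample.dat', 'w')
sample.write('\n'.join([raw[i:i+20] for i in range(0, len(raw), 20)]))
sample.close()

rewrite_in_place('sample.dat', '~', blocksize=64)
assert open('sample.dat').read().splitlines() == segments
os.remove('sample.dat')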

But is it really worth that?

Cheers,
Terry

--
Terry Hancock ( hancock at anansispaceworks.com )
Anansi Spaceworks  http://www.anansispaceworks.com



