Converting a text data file from positional to tab delimited.
Alex Martelli
aleaxit at yahoo.com
Tue Mar 13 10:40:48 EST 2001
"Lee Joramo" <lee at joramo.com> wrote in message
news:B6D37840.2124%lee at joramo.com...
> I am looking for suggestions to speed up the process of converting a large
> text data file from 'positional' layout to tab delimited. The data file is
> over 200MB in size containing over 40,000 lines which have over 600
> fields.
>
> I suspect that the 'for' loop that splits each line into tab delimited,
> could be optimized. Perhaps it could be replaced with a regex or other
> technique.
[snip]
> layout = [
>     ['STUDY', 0, 7],
>     ['MDLNO', 8, 12],
[snip]
>     for field in layout:
>         #
>         #can this loop be improved??
>         #
>         fieldValue = line[field[1]:field[2]]
First of all, there seems to be a serious bug here: it looks like the
layout has a pair of numbers, the upper one of which is meant to be
*included*, but the a:b slice notation *excludes* b -- so, you most
likely want line[field[1]:field[2]+1].
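A quick way to see the off-by-one, as a small sketch: the sample
line below is made up for illustration, and only the two layout
entries quoted above are used.

```python
# Hypothetical sample record: columns 0-7 hold STUDY, columns 8-12 hold MDLNO.
line = "STUDY001M0042rest-of-record"
layout = [['STUDY', 0, 7], ['MDLNO', 8, 12]]

# Exclusive upper bound, as in the quoted loop: the last character is lost.
truncated = [line[lo:hi] for _name, lo, hi in layout]

# Inclusive upper bound: add 1 to the stop index of the slice.
complete = [line[lo:hi + 1] for _name, lo, hi in layout]

print(truncated)
print(complete)
```

Each field in `truncated` comes out one character shorter than
its counterpart in `complete`.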
>         delimitedLine = delimitedLine + delimit + fieldValue
>         delimit = "\t"
>     outFile.write(delimitedLine+"\n")
A small optimization is to use a _list_ of pieces (strings) and
output them with a single .writelines call (which does not just
write _lines_, but arbitrary strings).
As you know beforehand the number of fields in your layout, you
can prepare the list-of-pieces in advance:
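    # tabs between fields, plus the line terminator as the last piece
    list_of_pieces = ['\t'] * (2*len(layout)-1) + ['\n']
    # fields go in the even slots: 0, 2, 4, ..., 2*len(layout)-2
    indexed_fields = list(zip(layout, range(0, 2*len(layout), 2)))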
and fill alternate pieces with other-than-tabs in the nested
loop:
    for field, index in indexed_fields:
        list_of_pieces[index] = line[field[1]:field[2]+1]
    outFile.writelines(list_of_pieces)
We can do a little more work in advance, outside of the loop:
    indexed_fields = [ (i*2, layout[i][1], layout[i][2]+1)
                       for i in range(len(layout)) ]
and the loop becomes the very-slightly-faster:
    for index, lower, upper in indexed_fields:
        list_of_pieces[index] = line[lower:upper]
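Putting the pieces together, a minimal self-contained sketch might
look like the following. The function name `convert`, the sample
records, and the use of in-memory `io.StringIO` files are
assumptions for illustration; the sketch also assumes a trailing
'\n' piece so that each output record ends its own line.

```python
import io

def convert(in_file, out_file, layout):
    """Copy fixed-width records from in_file to out_file, tab-delimited."""
    n = len(layout)
    # Tabs sit in the odd slots, a newline closes the record;
    # the even slots are overwritten with field text for every line.
    pieces = ['\t'] * (2 * n - 1) + ['\n']
    # Precompute (slot, start, stop), turning the inclusive upper
    # bound of the layout into the exclusive stop a slice needs.
    indexed_fields = [(2 * i, lo, hi + 1)
                      for i, (_name, lo, hi) in enumerate(layout)]
    for line in in_file:
        for index, lower, upper in indexed_fields:
            pieces[index] = line[lower:upper]
        out_file.writelines(pieces)

# Made-up two-field layout and two sample records.
layout = [['STUDY', 0, 7], ['MDLNO', 8, 12]]
src = io.StringIO("STUDY001M0042\nSTUDY002M0043\n")
dst = io.StringIO()
convert(src, dst, layout)
result = dst.getvalue()
```

With real files, `src` and `dst` would simply be the opened input
and output file objects.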
I don't know if a substantially faster approach exists to
avoid the hundreds of calls to line[lower:upper] for each
line. You could break the line into a list of characters
once just before the inner-loop then slice the list, but
I don't know if that would gain you anything (you'd have
to join the sublists at some point, or have a much bigger
list-of-pieces, and either approach seems of doubtful
gain).
Another possibility is avoiding some data-copy by having
the inner-loop's statement as:
    list_of_pieces[index] = buffer(line, offset, size)
you'll need to prepare the indexed_fields a bit differently,
storing each field's size rather than its upper bound:
    indexed_fields = [ (i*2, layout[i][1], layout[i][2]-layout[i][1]+1)
                       for i in range(len(layout)) ]
and the loop itself becomes:
    for index, offset, size in indexed_fields:
        list_of_pieces[index] = buffer(line, offset, size)
Not sure if .writelines supports a list of alternate
strings and buffers, or what the speed becomes then --
just suggesting alternatives for you to try...
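For anyone trying this on Python 3, where the buffer() builtin no
longer exists, the nearest equivalent is probably slicing a
memoryview over bytes. The sketch below is only that, a sketch: it
assumes the files are handled in binary mode, uses made-up sample
records, and stands in io.BytesIO for the real files. Slicing a
memoryview does not copy the underlying bytes, and binary streams
accept such views in .writelines; whether this is actually faster
would need measuring.

```python
import io

# Made-up two-field layout; upper bounds are inclusive.
layout = [['STUDY', 0, 7], ['MDLNO', 8, 12]]
n = len(layout)

# Bytes pieces this time, since memoryview works on bytes-like data.
pieces = [b'\t'] * (2 * n - 1) + [b'\n']
indexed_fields = [(2 * i, lo, hi + 1)
                  for i, (_name, lo, hi) in enumerate(layout)]

src = io.BytesIO(b"STUDY001M0042\nSTUDY002M0043\n")
dst = io.BytesIO()
for line in src:
    view = memoryview(line)
    for index, lower, upper in indexed_fields:
        # Slicing the view references the line's bytes without copying them.
        pieces[index] = view[lower:upper]
    dst.writelines(pieces)

result = dst.getvalue()
```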
Alex