[Tutor] Problem When Iterating Over Large Test Files

William R. Wing (Bill Wing) wrw at mac.com
Thu Jul 19 02:53:28 CEST 2012


On Jul 18, 2012, at 7:33 PM, Ryan Waples wrote:

> I'm seeing some unexpected output when I use a script (included at
> end) to iterate over large text files.  I am unsure of the source of
> the unexpected output and any help would be much appreciated.
> 
> Background
> Python v 2.7.1
> Windows 7 32bit
> Reading and writing to an external USB hard drive
> 
> Data files are ~4GB text (.fastq) files that have been uncompressed
> (gzip).  The files have no errors or formatting problems, and they seem
> to have uncompressed just fine.  Each has 64M lines; each 'entry' is
> split across 4 consecutive lines, giving 16M entries.
> 
> My python script iterates over data files 4 lines at a time, selects
> and writes groups of four lines to the output file.  I will end up
> selecting roughly 85% of the entries.
> 
> In my output I am seeing lines that don't occur in the original file,
> and that don't match any lines in the original file.  The incidences
> of badly formatted lines don't seem to match up with any patterns in
> the data file, and occur across multiple different data files.
> 
> I've included 20 consecutive lines of input and output.  Each of these
> 5 'records' should have been selected and printed to the output file.
> But there is a problem with the 4th and 5th entries in the output, and
> it no longer matches the input as expected.  For example the line:
> TTCTGTGAGTGATTTCCTGCAAGACAGGAATGTCAGT
> never occurs in the original data.
> 
> Sorry for the large block of text below.
> Other pertinent info, I've tried a related perl script, and ran into
> similar issues, but not in the same places.
> 
> Any help or insight would be appreciated.
> 
> Thanks

[Data and program snipped]

With apologies - I'm a Mac/UNIX user, not a Windows user, but those numbers (4GB and 64M lines) look suspiciously close to the file and record pointer limits of a 32-bit file system.  Are you sure you aren't bumping into wraparound issues of some sort?
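
One way to test the wraparound idea is to find the byte offset of the first malformed record in the output and see whether it falls near 2**31 or 2**32 bytes. What follows is only a rough sketch, not the snipped script: the function names, the 1 MiB window, and the well-formedness test (header line starts with '@', third line starts with '+', sequence and quality lines have equal length) are all assumptions made for illustration. Offsets are counted by summing line lengths rather than calling tell(), so the count does not depend on the file pointer being checked.

```python
import io

LIMIT_2GB = 2 ** 31  # largest offset a signed 32-bit pointer can hold, plus one
LIMIT_4GB = 2 ** 32  # largest offset an unsigned 32-bit pointer can hold, plus one


def first_bad_record(stream):
    """Return (record_index, byte_offset) of the first malformed 4-line
    record in a binary stream, or None if every record looks well formed."""
    offset = 0
    index = 0
    while True:
        start = offset
        lines = [stream.readline() for _ in range(4)]
        if not lines[0]:
            return None  # clean end of file, nothing suspicious found
        offset += sum(len(line) for line in lines)
        header, seq, plus, qual = [line.rstrip(b"\r\n") for line in lines]
        ok = (header.startswith(b"@") and plus.startswith(b"+")
              and len(seq) > 0 and len(seq) == len(qual))
        if not ok:
            return index, start
        index += 1


def nearest_boundary(offset, window=1024 * 1024):
    """Name the 32-bit boundary within `window` bytes of offset, if any."""
    for name, limit in (("2 GiB", LIMIT_2GB), ("4 GiB", LIMIT_4GB)):
        if abs(offset - limit) <= window:
            return name
    return None


# Tiny in-memory demonstration: the second record has a short sequence line.
sample = io.BytesIO(b"@r1\nACGT\n+\nIIII\n@r2\nACG\n+\nIIII\n")
print(first_bad_record(sample))  # (1, 16)
```

On a real file, open it in binary mode (for example open(path, "rb"), where path is whatever the output file is called) and pass the file object in place of sample. If the reported offset is nowhere near either boundary, then 32-bit wraparound is probably not the explanation.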

Just a thought…

-Bill

