Unzip: Memory Error
David Bolen
db3l.net at gmail.com
Thu Aug 30 17:07:40 EDT 2007
David Bolen <db3l.net at gmail.com> writes:
> If you are going to read the file data incrementally from the zip file
> (which is what my other post provided) you'll prevent the huge memory
> allocations and risk of running out of resource, but would have to
> implement your own line ending support if you then needed to process
> that data in a line-by-line mode. Not terribly hard, but more
> complicated than my prior sample which just returned raw data chunks.
Here's a small example of a ZipFile subclass (tested a bit this time)
that implements two generator methods:
  read_generator      Yields raw data from the file
  readline_generator  Yields "lines" from the file (per splitlines)
It also corrects my prior code posting which didn't really skip over
the file header properly (due to the variable sized name/extra
fields). Needs Python 2.3+ for generator support (or 2.2 with
__future__ import)
Peak memory use is set "roughly" by the optional chunk parameter.
It's only rough because chunk is the amount of compressed data read at
a time, which will grow in memory as it is decompressed. And the
readline generator adds further copies when the data is split into
lines.
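To see the chunked idea in isolation, here's a standalone sketch using
raw zlib, separate from the zipfile-specific class below (written for
a modern Python with bytes literals): compress a blob, then feed the
compressed stream to a decompressobj a piece at a time, so only one
chunk's worth is expanded at any moment.

```python
import zlib

original = b'hello world\n' * 10000
compressed = zlib.compress(original)

# Feed the compressed stream to the decompressor in small pieces;
# the full decompressed data is never materialized all at once.
dc = zlib.decompressobj()
chunk = 256
pieces = []
for i in range(0, len(compressed), chunk):
    pieces.append(dc.decompress(compressed[i:i + chunk]))
pieces.append(dc.flush())

assert b''.join(pieces) == original
```

The same decompressobj pattern is what read_generator below uses, just
with the raw (headerless) deflate stream pulled from the zip file.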
For your file processing by line, it could be used as in:
zipf = ZipFileGen('somefile.zip')
g = zipf.readline_generator('somefilename.txt')
for line in g:
    dealwithline(line)
zipf.close()
Even if not a perfect match, it should point you further in the right
direction.
-- David
- - - - - - - - - - - - - - - - - - - - - - - - -
import zipfile
import zlib
import struct
class ZipFileGen(zipfile.ZipFile):

    def read_generator(self, name, chunk=65536):
        """Return a generator that yields file bytes for name incrementally.

        The optional chunk parameter controls the chunk size read from the
        underlying zip file.  For compressed files, the data lengths yielded
        by the generator will be larger, since each chunk is decompressed.

        Note that unlike read(), this method does not preserve the internal
        file pointer and should not be mixed with write operations.  Nor does
        it verify that the ZipFile is still open and readable.

        Multiple generators returned by this function are not designed to be
        used simultaneously (they do not re-seek the underlying file for
        each request)."""

        zinfo = self.getinfo(name)
        compressed = (zinfo.compress_type == zipfile.ZIP_DEFLATED)
        if compressed:
            dc = zlib.decompressobj(-15)

        self.fp.seek(zinfo.header_offset)

        # Skip the file header (per zipfile.ZipFile.read()), including the
        # variable-sized filename and extra fields
        fheader = self.fp.read(30)
        if fheader[0:4] != zipfile.stringFileHeader:
            raise zipfile.BadZipfile, "Bad magic number for file header"

        fheader = struct.unpack(zipfile.structFileHeader, fheader)
        fname = self.fp.read(fheader[zipfile._FH_FILENAME_LENGTH])
        if fheader[zipfile._FH_EXTRA_FIELD_LENGTH]:
            self.fp.read(fheader[zipfile._FH_EXTRA_FIELD_LENGTH])

        # Process the compressed file data incrementally
        remain = zinfo.compress_size
        while remain:
            bytes = self.fp.read(min(remain, chunk))
            remain -= len(bytes)
            if compressed:
                bytes = dc.decompress(bytes)
            yield bytes

        # Flush anything left in the decompressor ('Z' is the same dummy
        # byte zipfile.ZipFile.read() feeds to force out the final block)
        if compressed:
            bytes = dc.decompress('Z') + dc.flush()
            if bytes:
                yield bytes
    def readline_generator(self, name, chunk=65536):
        """Return a generator that yields lines from a file within the zip
        incrementally.  Line ending detection is based on splitlines(), and
        like file.readline(), the returned lines do not include the line
        ending.  Efficiency is not guaranteed if used with non-textual files.

        Uses a read_generator() generator to retrieve file data incrementally,
        so it inherits the limitations of that method as well, and the
        optional chunk parameter is passed to read_generator unchanged."""

        partial = ''
        g = self.read_generator(name, chunk=chunk)
        for bytes in g:
            # Break current chunk into lines
            lines = bytes.splitlines()

            # Add any prior partial line to the first line, and clear the
            # partial so it isn't prepended again on the next chunk
            if partial:
                lines[0] = partial + lines[0]
                partial = ''

            # If the current chunk didn't happen to break on a line ending,
            # save the partial line for next time
            if bytes[-1] not in ('\n', '\r'):
                partial = lines.pop()

            # Then yield the lines we've identified so far
            for curline in lines:
                yield curline

        # Return any trailing data (if file didn't end in a line ending)
        if partial:
            yield partial
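As an aside for anyone reading this on a later Python: since 2.6 the
standard library's own ZipFile.open() returns a file-like object that
decompresses on demand, so you get much the same incremental behavior
without subclassing. A small self-contained sketch (the file and member
names here are made up for the example):

```python
import io
import zipfile

# Build a small zip in memory to demonstrate against
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED) as zf:
    zf.writestr('somefilename.txt', 'one\ntwo\nthree\n')

# ZipFile.open() returns a file-like object that decompresses on
# demand, so iterating it line by line keeps memory use bounded
with zipfile.ZipFile(buf) as zf:
    with zf.open('somefilename.txt') as f:
        lines = [line.rstrip(b'\r\n') for line in f]

print(lines)  # [b'one', b'two', b'three']
```

Unlike the readline_generator above, the lines it yields keep their
line endings (hence the rstrip), matching file iteration semantics.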