[ python-Bugs-1215928 ] Large tarfiles cause overflow
SourceForge.net
noreply at sourceforge.net
Thu Aug 25 13:24:21 CEST 2005
Bugs item #1215928, was opened at 2005-06-06 21:19
Message generated for change (Comment added) made by loewis
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1215928&group_id=5470
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: Python 2.4
Status: Open
>Resolution: Accepted
Priority: 5
Submitted By: Tom Emerson (tree)
>Assigned to: Reinhold Birkenfeld (birkenfeld)
Summary: Large tarfiles cause overflow
Initial Comment:
I have a 4 gigabyte bz2-compressed tarfile containing some 3.3
million documents. I have a script which opens this file with "r:bz2"
and simply iterates over the contents using next(). With 2.4.1 I
still get an OverflowError (originally tried with 2.3.5 as packaged
in Mac OS X 10.4.1):
Traceback (most recent call last):
File "extract_part.py", line 47, in ?
main(sys.argv)
File "extract_part.py", line 39, in main
pathnames = find_valid_paths(argv[1], 1024, count)
File "extract_part.py", line 13, in find_valid_paths
f = tf.next()
File "/usr/local/lib/python2.4/tarfile.py", line 1584, in next
self.fileobj.seek(self.offset)
OverflowError: long int too large to convert to int
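The failure mode can be sketched without a 4 GiB archive by simulating a file object whose seek(), like the affected code path, only accepts offsets that fit in a signed 32-bit C int (NarrowSeekFile and INT_MAX below are illustrative names, not the actual tarfile/bz2 code):

```python
# Minimal sketch of the reported failure: tarfile.next() seeks to the
# byte offset of the next member, and the underlying (simulated) file
# object rejects any offset that does not fit in a signed 32-bit C int.
INT_MAX = 2**31 - 1  # largest value a signed 32-bit C int can hold

class NarrowSeekFile:
    """Stand-in for a file object whose seek() takes a plain C int."""
    def __init__(self):
        self.pos = 0
    def seek(self, offset):
        if offset > INT_MAX:
            raise OverflowError("long int too large to convert to int")
        self.pos = offset

f = NarrowSeekFile()
f.seek(INT_MAX)          # the last offset that still works
try:
    f.seek(INT_MAX + 1)  # the first member past the 2 GiB boundary fails
    failed = False
except OverflowError:
    failed = True
```

Any archive member whose offset lies past the 2 GiB mark triggers the same OverflowError as in the traceback above.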
----------------------------------------------------------------------
>Comment By: Martin v. Löwis (loewis)
Date: 2005-08-25 13:24
Message:
Logged In: YES
user_id=21627
The patch is fine, please apply.
As for generalising Py_off_t: there are some issues which I
keep forgetting. fpos_t is not guaranteed to be an integral
type, and indeed, on Linux, it is not. I'm not
completely sure why this patch works; I think that on all
platforms where fpos_t is not integral, off_t happens to be
large enough. The only case where off_t is not large enough
is (IIRC) Windows, where fpos_t can be used instead.
So this is all somewhat muddy, and if this gets generalized,
a more elaborate comment seems to be in order.
----------------------------------------------------------------------
Comment By: Viktor Ferenczi (complex)
Date: 2005-06-21 01:44
Message:
Logged In: YES
user_id=142612
The bug has been reproduced with a 90 MB bz2 file containing more than 4 GB of fairly similar documents. I've diagnosed the same problem with large offsets. Thanks for the patch.
Platform: WinXP Intel P4, Python 2.4.1
----------------------------------------------------------------------
Comment By: Raymond Hettinger (rhettinger)
Date: 2005-06-19 00:05
Message:
Logged In: YES
user_id=80475
Martin, please look at this when you get a chance.
----------------------------------------------------------------------
Comment By: Reinhold Birkenfeld (birkenfeld)
Date: 2005-06-18 23:26
Message:
Logged In: YES
user_id=1188172
I looked into this a bit further, and noticed the following:
The modules bz2, cStringIO and mmap all use plain integers
to represent file offsets given to or returned by seek(),
tell() and truncate().
They should be corrected to use a 64-bit type when large
file support is available. fileobject.c defines its own type
for that, Py_off_t, which should be shared among the other
modules.
Conditional compilation is needed since different
macros/functions must be used.
----------------------------------------------------------------------
Comment By: Raymond Hettinger (rhettinger)
Date: 2005-06-13 03:32
Message:
Logged In: YES
user_id=80475
Is there a way to write a test for this?
Can it be done without a conditional compile?
Is the problem one that occurs in other code outside of bz2?
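One possible answer to the test question: on filesystems with sparse file support, seeking past the 2 GiB boundary and writing a single byte creates a multi-gigabyte file without consuming real disk space, so a seek/tell round-trip can be exercised cheaply. This is only a sketch of the idea, not the regression test that was actually committed:

```python
# Sketch of a large-offset test using a sparse file, so that no real
# 2 GiB of data has to be written to disk.
import os
import tempfile

BOUNDARY = 2**31  # first offset that no longer fits in a signed 32-bit int

fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        f.seek(BOUNDARY)   # on most filesystems this creates a sparse file
        f.write(b"\0")     # one real byte just past the 2 GiB mark
    size = os.path.getsize(path)
finally:
    os.remove(path)
```

A real test would then open such a file through the module under test (bz2, mmap, etc.) and verify that seek() and tell() handle the large offset.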
----------------------------------------------------------------------
Comment By: Reinhold Birkenfeld (birkenfeld)
Date: 2005-06-10 13:45
Message:
Logged In: YES
user_id=1188172
Attaching corrected patch.
----------------------------------------------------------------------
Comment By: Reinhold Birkenfeld (birkenfeld)
Date: 2005-06-09 22:31
Message:
Logged In: YES
user_id=1188172
Attaching a patch which mimics the behaviour of normal file
objects. This should resolve the issue on platforms with
large file support.
----------------------------------------------------------------------
Comment By: Lars Gustäbel (gustaebel)
Date: 2005-06-07 15:23
Message:
Logged In: YES
user_id=642936
A quick look at the problem reveals that this is a bug in
bz2.BZ2File. The seek() method does not allow position
values >= 2 GiB.
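Concretely, the 2 GiB boundary is the range of a signed 32-bit C int, so any member offset in the reporter's 4 GB archive beyond that point cannot be passed through a plain-int seek() (simple arithmetic sketch, not code from the patch):

```python
# The seek() position limit of a signed 32-bit C int vs. the archive size.
GiB = 2**30
seek_limit = 2**31 - 1    # 2147483647, i.e. just under 2 GiB
archive_size = 4 * GiB    # the reported 4-gigabyte tarfile
overflows = archive_size > seek_limit
```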
----------------------------------------------------------------------