[Tutor] Problem reading large files in binary mode

Mirage Web Studio admin at c12.in
Fri Jun 13 13:20:23 CEST 2014


Thank you, I will keep all that in mind.

My Python version is 3.3.5.

George

On 13-06-2014 16:07, Peter Otten wrote:
> Mirage Web Studio wrote:
>
>> Try reading the file in chunks instead:
>>
>> import hashlib
>>
>> CHUNKSIZE = 2**20  # hash 1 MiB at a time
>> md5 = hashlib.md5()
>> with open(filename, "rb") as f:
>>     while True:
>>         chunk = f.read(CHUNKSIZE)
>>         if not chunk:  # b"" signals end of file
>>             break
>>         md5.update(chunk)
>> hashvalue = md5.hexdigest()
>>
>>
>> Thank you, Peter, for the valuable reply above. But shouldn't read() by
>> itself work, since I have enough memory to load the file, or is this a bug?
> I think you are right. At the very least you should get a MemoryError (the
> well-behaved way for the Python interpreter to say that it cannot allocate
> enough memory), whereas your description hints at a segmentation fault.
>
> A quick test with the Python versions I have lying around:
>
> $ python -c 'open("bigfile", "rb").read()'
> Traceback (most recent call last):
>    File "<string>", line 1, in <module>
> MemoryError
> $ python3.3 -c 'open("bigfile", "rb").read()'
> Segmentation fault
> $ python3.3 -V
> Python 3.3.2+
> $ python3.4 -c 'open("bigfile", "rb").read()'
> Traceback (most recent call last):
>    File "<string>", line 1, in <module>
> MemoryError
>
> So the bug occurs in 3.3 at least up to 3.3.2.
>
> If you don't have the latest bugfix release, Python 3.3.4, you can try
> installing that; or, if you are not tied to 3.3, update to 3.4.1.
>
> Note that you may still run out of memory, particularly if you are using a
> 32-bit build of Python.
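>
> A quick way to tell which build you have, for what it's worth: sys.maxsize
> is bounded by the pointer width, so
>
> import sys
> is_64bit = sys.maxsize > 2**32  # False on a 32-bit build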
>
> Also, it is never a good idea to load a lot of data into memory when you
> intend to use it just once. Therefore I recommend that you calculate the
> checksum in chunks, as shown in the example above.
>
> PS: There was an email in my inbox where eryksun suggests potential
> improvements to my code:
>
>> You might see better performance if you preallocate a bytearray and
>> `readinto` it. On Windows, you might see even better performance if
>> you map sections of the file using mmap; the map `length` needs to be
>> a multiple of ALLOCATIONGRANULARITY (except the residual) to set the
>> `offset` for a sliding window.
> While I don't expect significant improvements, since the problem is I/O-
> bound, i.e. the speed limit is imposed by communication with the hard disk
> rather than by the Python interpreter, you may still find it instructive to
> compare the various approaches.
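>
> For the record, here is a rough, untested sketch of the `readinto` variant;
> it assumes a `filename` variable as in the other examples:
>
> import hashlib
>
> CHUNKSIZE = 2**20
> buf = bytearray(CHUNKSIZE)   # preallocated once, reused for every read
> view = memoryview(buf)       # lets us hash a slice without copying it
> md5 = hashlib.md5()
> with open(filename, "rb") as f:
>     while True:
>         n = f.readinto(buf)  # fills buf in place, returns the byte count
>         if not n:
>             break
>         md5.update(view[:n])  # the final chunk may be shorter than buf
> hashvalue = md5.hexdigest()
>
> And a similarly untested sketch of the sliding mmap window; every `offset`
> passed to mmap must be a multiple of ALLOCATIONGRANULARITY (except for the
> residual chunk, which only needs a conforming offset), and the window size
> below guarantees that:
>
> import hashlib
> import mmap
> import os
>
> WINDOW = 16 * mmap.ALLOCATIONGRANULARITY
> md5 = hashlib.md5()
> size = os.path.getsize(filename)
> with open(filename, "rb") as f:
>     offset = 0
>     while offset < size:
>         length = min(WINDOW, size - offset)  # the residual may be shorter
>         with mmap.mmap(f.fileno(), length,
>                        access=mmap.ACCESS_READ, offset=offset) as m:
>             md5.update(m)    # mmap objects expose the buffer protocol
>         offset += length
> hashvalue = md5.hexdigest()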
>
> Another candidate, when you are working in an environment where the md5sum
> utility is available, is to delegate the work to the "specialist":
>
> import subprocess
>
> hashvalue = subprocess.Popen(
>     ["md5sum", filename],
>     stdout=subprocess.PIPE).communicate()[0].split()[0].decode()
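>
> For what it's worth, subprocess.check_output expresses the same thing a bit
> more compactly:
>
> hashvalue = subprocess.check_output(
>     ["md5sum", filename]).split()[0].decode()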
>

