checksum on numpy float array

My app reads in one or more float arrays from a binary file.
Sometimes due to network timeouts etc the array is not read correctly.
What would be the best way of checking the validity of the data?
Would some sort of checksum approach be a good idea? Would that work with an array of floating point values? Or are checksums more for int, byte, or string type data?

On Thu, Dec 4, 2008 at 17:17, Brennan Williams brennan.williams@visualreservoir.com wrote:
My app reads in one or more float arrays from a binary file.
Sometimes due to network timeouts etc the array is not read correctly.
What would be the best way of checking the validity of the data?
Would some sort of checksum approach be a good idea? Would that work with an array of floating point values? Or are checksums more for int, byte, or string type data?
Just use a generic hash on the file's bytes (ignoring their format). MD5 is sufficient for these purposes.
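For example, a file-level check could look something like this (a minimal sketch; the function name and chunk size are just illustrative):

import hashlib

def file_md5(filename, chunk_size=1 << 20):
    # read the file in chunks and feed the raw bytes to MD5
    m = hashlib.md5()
    f = open(filename, 'rb')
    try:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            m.update(chunk)
    finally:
        f.close()
    return m.hexdigest()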

On Thu, Dec 4, 2008 at 6:17 PM, Brennan Williams brennan.williams@visualreservoir.com wrote:
My app reads in one or more float arrays from a binary file.
Sometimes due to network timeouts etc the array is not read correctly.
What would be the best way of checking the validity of the data?
Would some sort of checksum approach be a good idea? Would that work with an array of floating point values? Or are checksums more for int, byte, or string type data?
If you want to verify the file itself, Python provides several more or less secure checksums; in my experience zlib.crc32 was pretty fast on moderate file sizes. crc32 is common inside archive files and for binary newsgroups. If you have large files transported over the network, e.g. GB size, I would work with par2 repair files, which verify and repair at the same time.
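A rough sketch of that approach, assuming the file is read in chunks (the names here are illustrative):

import zlib

def file_crc32(filename, chunk_size=1 << 20):
    # zlib.crc32 can be updated chunk by chunk by passing the previous value
    crc = 0
    f = open(filename, 'rb')
    try:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)
    finally:
        f.close()
    return crc & 0xffffffff  # mask so the result is the same unsigned value everywhere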
Josef

josef.pktd@gmail.com wrote:
On Thu, Dec 4, 2008 at 6:17 PM, Brennan Williams brennan.williams@visualreservoir.com wrote:
My app reads in one or more float arrays from a binary file.
Sometimes due to network timeouts etc the array is not read correctly.
What would be the best way of checking the validity of the data?
Would some sort of checksum approach be a good idea? Would that work with an array of floating point values? Or are checksums more for int, byte, or string type data?
If you want to verify the file itself, Python provides several more or less secure checksums; in my experience zlib.crc32 was pretty fast on moderate file sizes. crc32 is common inside archive files and for binary newsgroups. If you have large files transported over the network, e.g. GB size, I would work with par2 repair files, which verify and repair at the same time.
The file has multiple arrays stored in it.
So I want to have some sort of validity check on just the array that I'm reading.
I will also need to add a check on the file itself, since network problems could affect writing to the file as well as reading from it.
Josef

On Thu, Dec 4, 2008 at 17:43, Brennan Williams brennan.williams@visualreservoir.com wrote:
josef.pktd@gmail.com wrote:
On Thu, Dec 4, 2008 at 6:17 PM, Brennan Williams brennan.williams@visualreservoir.com wrote:
My app reads in one or more float arrays from a binary file.
Sometimes due to network timeouts etc the array is not read correctly.
What would be the best way of checking the validity of the data?
Would some sort of checksum approach be a good idea? Would that work with an array of floating point values? Or are checksums more for int, byte, or string type data?
If you want to verify the file itself, Python provides several more or less secure checksums; in my experience zlib.crc32 was pretty fast on moderate file sizes. crc32 is common inside archive files and for binary newsgroups. If you have large files transported over the network, e.g. GB size, I would work with par2 repair files, which verify and repair at the same time.
The file has multiple arrays stored in it.
So I want to have some sort of validity check on just the array that I'm reading.
So do it on the bytes of the individual arrays. Just don't bother implementing new type-specific checksums.

I didn't check what this does behind the scenes, but try this
m = hashlib.md5()
m.update(np.array(range(100)))
m.update(np.array(range(200)))

m2 = hashlib.md5()
m2.update(np.array(range(100)))
m2.update(np.array(range(200)))

print m.hexdigest()
print m2.hexdigest()

# the same data hashed in the same order gives the same digest
assert m.hexdigest() == m2.hexdigest()

m3 = hashlib.md5()
m3.update(np.array(range(100)))
m3.update(np.array(range(199)))

print m3.hexdigest()

# a different array gives a different digest
assert m.hexdigest() != m3.hexdigest()
Josef

On Thu, Dec 4, 2008 at 6:57 PM, josef.pktd@gmail.com wrote:
I didn't check what this does behind the scenes, but try this
I forgot to paste:
import hashlib #standard python library
Josef

Thanks
josef.pktd@gmail.com wrote:
I didn't check what this does behind the scenes, but try this
import hashlib  # standard python library
import numpy as np

m = hashlib.md5()
m.update(np.array(range(100)))
m.update(np.array(range(200)))

m2 = hashlib.md5()
m2.update(np.array(range(100)))
m2.update(np.array(range(200)))

print m.hexdigest()
print m2.hexdigest()

# the same data hashed in the same order gives the same digest
assert m.hexdigest() == m2.hexdigest()

m3 = hashlib.md5()
m3.update(np.array(range(100)))
m3.update(np.array(range(199)))

print m3.hexdigest()

# a different array gives a different digest
assert m.hexdigest() != m3.hexdigest()
Josef

On Thu, Dec 4, 2008 at 18:54, Brennan Williams brennan.williams@visualreservoir.com wrote:
Thanks
josef.pktd@gmail.com wrote:
I didn't check what this does behind the scenes, but try this
import hashlib  # standard python library
import numpy as np
m = hashlib.md5()
m.update(np.array(range(100)))
m.update(np.array(range(200)))
I would recommend doing this on the strings before you make arrays from them. You don't know if the network cut out in the middle of an 8-byte double.
Of course, sending the lengths and other metadata first, then the data, would let you check without needing to compute relatively expensive hashes or checksums. If truncation is your problem rather than corruption, then that would be sufficient. You may also consider using the NPY format in numpy 1.2 to implement that.
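For instance, a per-array record could carry its length and digest in front of the raw bytes; the layout below is purely illustrative, not Brennan's actual file format:

import hashlib
import struct
import numpy as np

def write_array(f, arr):
    # hypothetical record: 8-byte length, 16-byte MD5 digest, then the raw bytes
    raw = arr.astype(np.float64).tostring()
    f.write(struct.pack('<Q', len(raw)))
    f.write(hashlib.md5(raw).digest())
    f.write(raw)

def read_array(f):
    nbytes = struct.unpack('<Q', f.read(8))[0]
    digest = f.read(16)
    raw = f.read(nbytes)
    if len(raw) != nbytes:
        raise IOError("array record truncated")
    if hashlib.md5(raw).digest() != digest:
        raise IOError("array record corrupted")
    # only build the numpy array once the raw bytes have been verified
    return np.fromstring(raw, dtype=np.float64)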

Robert Kern wrote:
On Thu, Dec 4, 2008 at 18:54, Brennan Williams brennan.williams@visualreservoir.com wrote:
Thanks
josef.pktd@gmail.com wrote:
I didn't check what this does behind the scenes, but try this
import hashlib  # standard python library
import numpy as np
m = hashlib.md5()
m.update(np.array(range(100)))
m.update(np.array(range(200)))
I would recommend doing this on the strings before you make arrays from them. You don't know if the network cut out in the middle of an 8-byte double.
Of course, sending the lengths and other metadata first, then the data, would let you check without needing to compute relatively expensive hashes or checksums. If truncation is your problem rather than corruption, then that would be sufficient. You may also consider using the NPY format in numpy 1.2 to implement that.
Thanks for the ideas. I'm definitely going to add some more basic checks on lengths etc as well. Unfortunately the problem is happening at a client site so (a) I can't reproduce it and (b) most of the time they can't reproduce it either. This is a Windows Python app running on Citrix reading/writing data to a Linux networked drive.
Brennan

On Friday, 05 December 2008, Brennan Williams wrote:
Robert Kern wrote:
On Thu, Dec 4, 2008 at 18:54, Brennan Williams
brennan.williams@visualreservoir.com wrote:
Thanks
josef.pktd@gmail.com wrote:
I didn't check what this does behind the scenes, but try this
import hashlib  # standard python library
import numpy as np
m = hashlib.md5()
m.update(np.array(range(100)))
m.update(np.array(range(200)))
I would recommend doing this on the strings before you make arrays from them. You don't know if the network cut out in the middle of an 8-byte double.
Of course, sending the lengths and other metadata first, then the data, would let you check without needing to compute relatively expensive hashes or checksums. If truncation is your problem rather than corruption, then that would be sufficient. You may also consider using the NPY format in numpy 1.2 to implement that.
Thanks for the ideas. I'm definitely going to add some more basic checks on lengths etc as well. Unfortunately the problem is happening at a client site so (a) I can't reproduce it and (b) most of the time they can't reproduce it either. This is a Windows Python app running on Citrix reading/writing data to a Linux networked drive.
Another possibility would be to use HDF5 as a data container. It supports the fletcher32 filter [1], which basically computes a checksum for every data chunk written to disk and then always checks that the data read satisfies the checksum kept on disk. So, if the HDF5 layer doesn't complain, you are basically safe.
There are at least two usable HDF5 interfaces for Python and NumPy: PyTables [2] and h5py [3]. PyTables has support for that right out of the box. Not sure about h5py though (a quick search in the docs doesn't reveal anything).
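For instance, with PyTables something along these lines should do it (a sketch against the PyTables API of the time; the file and node names are placeholders):

import numpy as np
import tables

arr = np.arange(1000, dtype=np.float64)

# write the array with the fletcher32 checksum filter enabled
fileh = tables.openFile('data.h5', mode='w')
carr = fileh.createCArray(fileh.root, 'myarray', tables.Float64Atom(),
                          arr.shape, filters=tables.Filters(fletcher32=True))
carr[:] = arr
fileh.close()

# reads go through the per-chunk checksum; corruption raises an error
fileh = tables.openFile('data.h5', mode='r')
arr2 = fileh.root.myarray[:]
fileh.close()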
[1] http://rfc.sunsite.dk/rfc/rfc1071.html
[2] http://www.pytables.org
[3] http://h5py.alfven.org
Hope it helps,

Another possibility would be to use HDF5 as a data container. It supports the fletcher32 filter [1], which basically computes a checksum for every data chunk written to disk and then always checks that the data read satisfies the checksum kept on disk. So, if the HDF5 layer doesn't complain, you are basically safe.
There are at least two usable HDF5 interfaces for Python and NumPy: PyTables [2] and h5py [3]. PyTables has support for that right out of the box. Not sure about h5py though (a quick search in the docs doesn't reveal anything).
[1] http://rfc.sunsite.dk/rfc/rfc1071.html
[2] http://www.pytables.org
[3] http://h5py.alfven.org
Hope it helps,
Just to confirm that h5py does in fact have fletcher32; it's one of the options you can specify when creating a dataset, although it could use better documentation:
http://h5py.alfven.org/docs/guide/hl.html#h5py.highlevel.Group.create_dataset
Like other checksums, fletcher32 provides error-detection but not error-correction. You'll still need to throw away data which can't be read. However, I believe that you can still read sections of the dataset which aren't corrupted.
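For example (a minimal sketch; the file and dataset names are placeholders):

import numpy as np
import h5py

arr = np.arange(1000, dtype=np.float64)

# enable the fletcher32 checksum filter when the dataset is created
f = h5py.File('data.h5', 'w')
f.create_dataset('myarray', data=arr, fletcher32=True)
f.close()

# reads are verified against the stored checksums; corrupted chunks raise an error
f = h5py.File('data.h5', 'r')
arr2 = f['myarray'][...]
f.close()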
Andrew Collette

On Friday, 05 December 2008, Andrew Collette wrote:
Another possibility would be to use HDF5 as a data container. It supports the fletcher32 filter [1], which basically computes a checksum for every data chunk written to disk and then always checks that the data read satisfies the checksum kept on disk. So, if the HDF5 layer doesn't complain, you are basically safe.
There are at least two usable HDF5 interfaces for Python and NumPy: PyTables [2] and h5py [3]. PyTables has support for that right out of the box. Not sure about h5py though (a quick search in the docs doesn't reveal anything).
[1] http://rfc.sunsite.dk/rfc/rfc1071.html
[2] http://www.pytables.org
[3] http://h5py.alfven.org
Hope it helps,
Just to confirm that h5py does in fact have fletcher32; it's one of the options you can specify when creating a dataset, although it could use better documentation:
http://h5py.alfven.org/docs/guide/hl.html#h5py.highlevel.Group.create_dataset
My bad. I searched for 'fletcher' instead of 'fletcher32'. I naively thought that the search tool in Sphinx allowed partial name matching. It is a pity that it does not.
Cheers,

OK so maybe I should....
(1) not add some sort of checksum type functionality to my read/write methods
these read/write methods simply read/write numpy arrays to a binary file which contains one or more numpy arrays (and nothing else).
(2) replace my binary files with either HDF5 or PyTables
But....
my app is being used by clients on existing projects - in one case there are over 900 of these numpy binary files in just one project, albeit each file is pretty small (200KB or so)
so.. questions.....
How can I transparently (or at least with minimum user pain) replace my existing read/write methods with PyTables or HDF5?
My initial thoughts are...
(a) have an app version number and a data format version number which I can check against.
(b) if data format version < 1.0 then read from old binary files
(c) if app version number > 1.0 then write to new PyTables or HDF5 files
(d) get clients to open an existing project and then save it, to semi-transparently convert from the old format to the new (a rough sketch of this follows below).
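A very rough sketch of step (d), assuming (hypothetically) that an old file holds a single raw float64 array and nothing else:

import numpy as np
import tables

def convert_file(old_name, new_name):
    # hypothetical converter: real project files hold several arrays,
    # so a real version would loop over them here
    arr = np.fromfile(old_name, dtype=np.float64)

    fileh = tables.openFile(new_name, mode='w')
    node = fileh.createCArray(fileh.root, 'array0', tables.Float64Atom(), arr.shape,
                              filters=tables.Filters(fletcher32=True))
    node[:] = arr
    fileh.close()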
Francesc Alted wrote:
On Friday, 05 December 2008, Andrew Collette wrote:
Another possibility would be to use HDF5 as a data container. It supports the fletcher32 filter [1], which basically computes a checksum for every data chunk written to disk and then always checks that the data read satisfies the checksum kept on disk. So, if the HDF5 layer doesn't complain, you are basically safe.
There are at least two usable HDF5 interfaces for Python and NumPy: PyTables [2] and h5py [3]. PyTables has support for that right out of the box. Not sure about h5py though (a quick search in the docs doesn't reveal anything).
[1] http://rfc.sunsite.dk/rfc/rfc1071.html
[2] http://www.pytables.org
[3] http://h5py.alfven.org
Hope it helps,
Just to confirm that h5py does in fact have fletcher32; it's one of the options you can specify when creating a dataset, although it could use better documentation:
http://h5py.alfven.org/docs/guide/hl.html#h5py.highlevel.Group.create_dataset
My bad. I searched for 'fletcher' instead of 'fletcher32'. I naively thought that the search tool in Sphinx allowed partial name matching. It is a pity that it does not.
Cheers,

On Sunday, 07 December 2008, Brennan Williams wrote:
OK so maybe I should....
(1) not add some sort of checksum type functionality to my read/write methods
these read/write methods simply read/write numpy arrays to a
binary file which contains one or more numpy arrays (and nothing else).
(2) replace my binary files with either HDF5 or PyTables
But....
my app is being used by clients on existing projects - in one case there are over 900 of these numpy binary files in just one project, albeit each file is pretty small (200KB or so)
so.. questions.....
How can I transparently (or at least with minimum user pain) replace my existing read/write methods with PyTables or HDF5?
My initial thoughts are...
(a) have an app version number and a data format version number which I can check against.
(b) if data format version < 1.0 then read from old binary files
(c) if app version number > 1.0 then write to new PyTables or HDF5 files
(d) get clients to open an existing project and then save it, to semi-transparently convert from the old format to the new.
Yeah. That would work perfectly. Also, there is a function in PyTables named 'isHDF5File(filename)' that allows you to know whether a file is in HDF5 format or not. You might want to use it and avoid bothering with data format/app version issues.
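For example, the reader could dispatch on the file itself rather than on a stored version number (a sketch; the old-format fallback and the assumption that every node under the root group is an array are mine):

import numpy as np
import tables

def load_arrays(filename):
    if tables.isHDF5File(filename):
        # new-format file: collect every array stored under the root group
        fileh = tables.openFile(filename, mode='r')
        arrays = [node[:] for node in fileh.root]
        fileh.close()
        return arrays
    # old-format file: fall back to the existing raw-binary reader
    return [np.fromfile(filename, dtype=np.float64)]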
Cheers,
Francesc
Francesc Alted wrote:
On Friday, 05 December 2008, Andrew Collette wrote:
Another possibility would be to use HDF5 as a data container. It supports the fletcher32 filter [1], which basically computes a checksum for every data chunk written to disk and then always checks that the data read satisfies the checksum kept on disk. So, if the HDF5 layer doesn't complain, you are basically safe.
There are at least two usable HDF5 interfaces for Python and NumPy: PyTables [2] and h5py [3]. PyTables has support for that right out of the box. Not sure about h5py though (a quick search in the docs doesn't reveal anything).
[1] http://rfc.sunsite.dk/rfc/rfc1071.html
[2] http://www.pytables.org
[3] http://h5py.alfven.org
Hope it helps,
Just to confirm that h5py does in fact have fletcher32; it's one of the options you can specify when creating a dataset, although it could use better documentation:
http://h5py.alfven.org/docs/guide/hl.html#h5py.highlevel.Group.create_dataset
My bad. I searched for 'fletcher' instead of 'fletcher32'. I naively thought that the search tool in Sphinx allowed partial name matching. It is a pity that it does not.
Cheers,
participants (5)
- Andrew Collette
- Brennan Williams
- Francesc Alted
- josef.pktd@gmail.com
- Robert Kern