Announcing PyTables 0.9.1
-------------------------

This release is mainly a maintenance version. In it, some bugs have been
fixed and a few improvements have been made. One important change is that
chunk sizes in EArrays have been re-tuned to get much better performance
and compression ratios. Besides, it has been tested against the latest
Python 2.4 and all the unit tests seem to pass fine.

What it is
----------

PyTables is a solid hierarchical database package designed to efficiently
manage extremely large amounts of data (with support for full 64-bit file
addressing). It features an object-oriented interface that, combined with
C extensions for the performance-critical parts of the code, makes it a
very easy-to-use tool for high-performance data storage and retrieval.

It is built on top of the HDF5 library and the numarray package, and
provides containers for both heterogeneous data (Tables) and homogeneous
data (Array, EArray), as well as containers for keeping lists of objects
of variable length (like Unicode strings or general Python objects) in a
very efficient way (VLArray). It also sports a series of filters allowing
you to compress your data on-the-fly by using different compressors and
compression enablers.

But perhaps the most interesting features are its powerful browsing and
searching capabilities, which allow you to perform data selections over
heterogeneous datasets exceeding gigabytes of data in just tenths of a
second. Besides, all the PyTables I/O is buffered, implemented in C and
carefully tuned, so that you can reach much better performance with
PyTables than with your own home-grown wrappings to the HDF5 library.

Changes more in depth
---------------------

Improvements:

- The chunksize computation for EArrays has been re-tuned to allow better
  performance and *much* better compression ratios.

- New --unpackshort and --quantize flags have been added to the nctoh5
  script. --unpackshort unpacks short integer variables into float
  variables using the scale_factor and add_offset netCDF variable
  attributes. --quantize quantizes data to improve compression using the
  least_significant_digit netCDF variable attribute (not active by
  default). See
  http://www.cdc.noaa.gov/cdc/conventions/cdc_netcdf_standard.shtml
  for further explanation of what this attribute means. Thanks to Jeff
  Whitaker for providing this.

- Table.itersequence has received a new parameter called "sort". This
  allows disabling the sorting of the sequence in case the user wants to
  (see the sketch after the change lists below).

Backward-incompatible changes:

- The AttributeSet class now throws an AttributeError on __getattr__ for
  nonexistent attributes. Formerly, the routine returned None, which is
  pretty much against convention in Python and breaks the built-in
  hasattr() function. Thanks to Norbert Nemec for noting this and
  offering a patch.

- VLArray.read() has changed its behaviour. Now, it always returns a
  list, as stated in the documentation, even when the number of elements
  to return is 0 or 1. This is much more consistent when representing the
  actual number of elements in a certain VLArray row.

API additions:

- A Row.getTable() has been added. It is an accessor for the associated
  Table object.

- A File.copyAttrs() has been added. It allows copying attributes from
  one leaf to another. Properly speaking, this was already there, but not
  documented :-/

Bug fixes:

- The copy of hierarchies now works even when there are scalar Arrays
  (i.e. Arrays whose shape is ()) in them. Thanks to Norbert Nemec for
  providing a patch.

- Solved a memory leak regarding the Filters instance associated with the
  File object, which was not released after closing the file. Now, there
  are no known leaks in PyTables itself.

- Fixed a bug in Table.append() when the table was indexed. The problem
  was that if the table was in auto-indexing mode, some rows were lost
  during the indexing process and hence were not indexed correctly.

- Improved the security of node name checking. Closes #1074335.
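To make the behavioural changes above more tangible, here is a minimal
sketch written against the 0.9-era API (openFile, createTable and
friends). The file name, node names and the tiny record description are
made up for illustration, and details such as whether itersequence
accepts a plain Python list are assumptions, so take it as an
orientation aid rather than a definitive example:

    import tables

    # Hypothetical record description, file and node names.
    class Tiny(tables.IsDescription):
        id = tables.Int32Col()

    fileh = tables.openFile("changes-demo.h5", mode="w")
    table = fileh.createTable(fileh.root, "demo", Tiny, "Demo table")

    row = table.row
    for i in range(10):
        row["id"] = i
        row.append()
    table.flush()

    # New "sort" parameter of Table.itersequence: with sort=False
    # (keyword name taken from the change list above) the given
    # coordinates are visited in the order supplied instead of being
    # sorted first.
    ids = [r["id"] for r in table.itersequence([7, 2, 5], sort=False)]

    # AttributeSet now raises AttributeError for missing attributes, so
    # the built-in hasattr() behaves as expected.
    table.attrs.units = "arbitrary"
    assert hasattr(table.attrs, "units")
    assert not hasattr(table.attrs, "no_such_attribute")

    # VLArray.read() now always returns a list of rows, even when there
    # are 0 or 1 rows to return.
    vla = fileh.createVLArray(fileh.root, "vla", tables.Int32Atom(),
                              "Demo VLArray")
    vla.append([1, 2, 3])
    rows = vla.read()    # e.g. [[1, 2, 3]]: a list, even for a single row

    fileh.close()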
Important note for Python 2.4 and Windows users
-----------------------------------------------

If you want to use PyTables with Python 2.4 on Windows platforms, you
will need to get the HDF5 library compiled for MSVC 7.1, aka .NET (and
possibly LZO and UCL as well, if you want support for them at all). It
can be found at:

ftp://ftp.ncsa.uiuc.edu/HDF/HDF5/current/bin/windows/5-163-winxp-net2003.zip

Where can PyTables be applied?
------------------------------

PyTables is not designed to work as a relational database competitor,
but rather as a teammate. If you want to work with large datasets of
multidimensional data (for example, for multidimensional analysis), or
just provide a categorized structure for some portions of your cluttered
RDBMS, then give PyTables a try. It works well for storing data from
data acquisition systems (DAS), simulation software, network data
monitoring systems (for example, traffic measurements of IP packets on
routers), very large XML files, or for creating a centralized repository
for system logs, to name only a few possible uses.

What is a table?
----------------

A table is defined as a collection of records whose values are stored in
fixed-length fields. All records have the same structure and all values
in each field have the same data type. The terms "fixed-length" and
"strict data types" may seem a strange requirement for a language like
Python that supports dynamic data types, but they serve a useful
function if the goal is to save very large quantities of data (such as
that generated by many scientific applications, for example) in an
efficient manner that reduces demand on CPU time and I/O resources.
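To make the idea of fixed-length records with strict data types more
concrete, here is a minimal sketch of how such a table is typically
declared, filled and queried with PyTables. The Particle class, the file
name and the field names are invented for this example, and the column
classes shown (StringCol, Int32Col, Float64Col) are assumed to be the
ones available in this release:

    import tables

    # Every field has a fixed length and a fixed data type.
    class Particle(tables.IsDescription):
        name   = tables.StringCol(16)    # 16-character string
        idnum  = tables.Int32Col()       # signed 32-bit integer
        energy = tables.Float64Col()     # double-precision float

    fileh = tables.openFile("particles.h5", mode="w")
    table = fileh.createTable(fileh.root, "particles", Particle,
                              "A table of fixed-length records")

    # Fill a few records through the Row accessor.
    row = table.row
    for i in range(10):
        row["name"]   = "particle: %d" % i
        row["idnum"]  = i
        row["energy"] = float(i * i)
        row.append()
    table.flush()

    # A simple selection, of the kind described in "What it is" above.
    names = [r["name"] for r in table if r["energy"] > 50.0]

    fileh.close()

Because every record shares the same fixed layout, PyTables can store
and scan the data very efficiently, which is the point of the
restriction discussed above.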
What is HDF5?
-------------

For those people who know nothing about HDF5, it is a general-purpose
library and file format for storing scientific data, made at NCSA. HDF5
can store two primary objects: datasets and groups. A dataset is
essentially a multidimensional array of data elements, and a group is a
structure for organizing objects in an HDF5 file. Using these two basic
constructs, one can create and store almost any kind of scientific data
structure, such as images, arrays of vectors, and structured and
unstructured grids. You can also mix and match them in HDF5 files
according to your needs.

Platforms
---------

I'm using Linux (Intel 32-bit) as the main development platform, but
PyTables should be easy to compile/install on many other UNIX machines.
This package has also passed all the tests on an UltraSparc platform
with Solaris 7 and Solaris 8. It also compiles and passes all the tests
on an SGI Origin2000 with MIPS R12000 processors, with the MIPSPro
compiler and running IRIX 6.5. It also runs fine on Linux 64-bit
platforms, like an AMD Opteron running GNU/Linux 2.4.21 Server, an Intel
Itanium (IA64) running GNU/Linux 2.4.21, or a PowerPC G5 with Linux
2.6.x in 64-bit mode. It has also been tested on MacOSX platforms (10.2,
but it should also work on newer versions).

Regarding Windows platforms, PyTables has been tested with Windows 2000
and Windows XP (using the Microsoft Visual C compiler), but it should
work with other flavors as well.

Web site
--------

Go to the PyTables web site for more details:

http://pytables.sourceforge.net/

To know more about the company behind the PyTables development, see:

http://www.carabos.com/

Share your experience
---------------------

Let me know of any bugs, suggestions, gripes, kudos, etc. you may have.

Enjoy data!

-- Francesc Altet

Who's your data daddy? PyTables