=====================
Announcing carray 0.3
=====================

What's new
==========

A lot of stuff. The most outstanding feature in this version is the introduction of a `ctable` object. A `ctable` is similar to a structured array in NumPy, but instead of storing the data row-wise, it uses a column-wise arrangement. This allows much better performance for very wide tables, which is one of the scenarios where a `ctable` makes most sense. Of course, as `ctable` is based on `carray` objects, it inherits all their niceties (like on-the-fly compression and fast iterators).

Also, the `carray` object itself has received many improvements, like new constructors (arange(), fromiter(), zeros(), ones(), fill()), iterators (where(), wheretrue()) and resize methods (resize(), trim()). Most of these also work with the new `ctable`.

In addition, Numexpr is now supported (optionally) in order to carry out stunningly fast queries on `ctable` objects. For example, a query on a table with one million rows and one thousand columns can be up to 2x faster than using a plain structured array, and up to 20x faster than using SQLite (with the ":memory:" backend and indexing). See 'bench/ctable-query.py' for details.

Finally, binaries for Windows (both 32-bit and 64-bit) are provided.

For more detailed info, see the release notes at:

https://github.com/FrancescAlted/carray/wiki/Release-0.3

What it is
==========

carray is a container for numerical data that can be compressed in-memory. The compression process is carried out internally by Blosc, a high-performance compressor that is optimized for binary data. Having data compressed in-memory can reduce the stress on the memory subsystem; the net result is that carray operations may be faster than using a traditional ndarray object from NumPy.

carray also supports fully 64-bit addressing (both on UNIX and Windows).
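[As an aside for readers unfamiliar with column stores: the row-wise vs. column-wise distinction behind `ctable` can be sketched in a few lines of plain Python. This is *not* the carray API, just an illustration of why a query that touches few columns of a very wide table benefits from a column-wise layout.]

```python
# Minimal sketch (plain Python, not the carray API) of row-wise vs.
# column-wise table storage.  For a query touching only two columns,
# a column store scans two contiguous sequences instead of stepping
# through every field of every row.

# Row-wise: one tuple per row (like a NumPy structured array).
rows = [(i, float(i) * 0.5, i % 7) for i in range(1000)]

# Column-wise: one sequence per column (like a ctable).
cols = {
    "a": [r[0] for r in rows],
    "b": [r[1] for r in rows],
    "c": [r[2] for r in rows],
}

# Query "sum of column b where c == 0" against both layouts.
row_result = sum(b for (a, b, c) in rows if c == 0)
col_result = sum(b for b, c in zip(cols["b"], cols["c"]) if c == 0)

assert row_result == col_result
```

With a real `ctable` the columns are additionally chunked and Blosc-compressed, which is where the speedups quoted above come from.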
Below, a carray with 1 trillion rows (7.3 TB total) is created, filled with zeros, some positions are modified, and finally it is summed up::
    %time b = ca.zeros(1e12)
    CPU times: user 54.76 s, sys: 0.03 s, total: 54.79 s
    Wall time: 55.23 s

    %time b[[1, 1e9, 1e10, 1e11, 1e12-1]] = (1, 2, 3, 4, 5)
    CPU times: user 2.08 s, sys: 0.00 s, total: 2.08 s
    Wall time: 2.09 s

    b
    carray((1000000000000,), float64)
      nbytes: 7450.58 GB; cbytes: 2.27 GB; ratio: 3275.35
      cparams := cparams(clevel=5, shuffle=True)
    [0.0, 1.0, 0.0, ..., 0.0, 0.0, 5.0]

    %time b.sum()
    CPU times: user 10.08 s, sys: 0.00 s, total: 10.08 s
    Wall time: 10.15 s
    15.0
['%time' is a magic function provided by the IPython shell]

Please note that the example above is provided for demonstration purposes only. Do not try to run this at home unless you have more than 3 GB of RAM available, or you will get into trouble.

Resources
=========

Visit the main carray site and repository at:

http://github.com/FrancescAlted/carray

You can download a source package from:

http://carray.pytables.org/download

Manual:

http://carray.pytables.org/manual

Home of the Blosc compressor:

http://blosc.pytables.org

User's mailing list:

carray@googlegroups.com
http://groups.google.com/group/carray

Share your experience
=====================

Let us know of any bugs, suggestions, gripes, kudos, etc. you may have.

----

Enjoy!

-- Francesc Alted
On Wed, Dec 22, 2010 at 1:58 PM, Francesc Alted <faltet@pytables.org> wrote:
> %time b = ca.zeros(1e12)
> CPU times: user 54.76 s, sys: 0.03 s, total: 54.79 s
> Wall time: 55.23 s
I know this is somewhat missing the point of your demonstration, but 55 seconds to create an empty 3 GB data structure to represent a multi-TB dense array doesn't seem all that fast to me. Compression can do a lot of things, but isn't this a case where a true sparse data structure would be the right tool for the job?

I'm more interested in seeing what a carray can do with census data, web logs, or something vaguely real-world where direct binary representations are used by default and assumed to be reasonably optimal (i.e., anything sensibly stored in sqlite tables).

-Kevin
2010/12/24, Kevin Jacobs <bioinformed@gmail.com>:
> On Wed, Dec 22, 2010 at 1:58 PM, Francesc Alted <faltet@pytables.org> wrote:
>
>> %time b = ca.zeros(1e12)
>> CPU times: user 54.76 s, sys: 0.03 s, total: 54.79 s
>> Wall time: 55.23 s
>
> I know this is somewhat missing the point of your demonstration, but 55
> seconds to create an empty 3 GB data structure to represent a multi-TB
> dense array doesn't seem all that fast to me.
Yes, this was not the point of the demo; it was just showing off 64-bit addressing (a feature that I implemented recently and was eager to show). But agreed, I'm guilty of showing times, so your observation is pertinent. Mind you, I'm not creating an *empty* structure, but a *zeroed* structure; that's a bit different (which does not mean that the process cannot be sped up, but we can surely agree that there is little sense in optimizing this scenario ;-).
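[For the curious: why a *zeroed* multi-TB array ends up occupying only a couple of GB can be reproduced with the standard library. carray uses Blosc, not zlib, and real chunk sizes differ, so the numbers below are only illustrative of the general point that all-zero blocks are extremely redundant.]

```python
import zlib

# Illustration (stdlib zlib, not Blosc) of why a zeroed-but-huge array
# costs almost nothing once compressed: a block of zero bytes is highly
# redundant, so each compressed chunk stays tiny even though the
# logical size of the array is enormous.
chunk = bytes(8 * 1024 * 1024)          # 8 MB of float64 zeros, as raw bytes
compressed = zlib.compress(chunk, 1)    # fast compression level

ratio = len(chunk) / len(compressed)
print(f"{len(chunk)} bytes -> {len(compressed)} bytes (~{ratio:.0f}x)")
```

The write of a handful of nonzero values then only re-compresses the chunks that were touched, which is why the assignment step in the demo is cheap as well.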
> Compression can do a lot of things, but isn't this a case where a true
> sparse data structure would be the right tool for the job? I'm more
> interested in seeing what a carray can do with census data, web logs, or
> something vaguely real-world where direct binary representations are used
> by default and assumed to be reasonably optimal (i.e., anything sensibly
> stored in sqlite tables).
Well, I'm just creating the tool; it is up to the users to find real-world applications. I'm pretty sure that some of you will find some good ones.

Cheers!

-- Francesc Alted