On 7/30/07, Geoffrey Zhu <zyzhu2000@gmail.com> wrote:
Hi Timothy,

On 7/30/07, Timothy Hochberg <tim.hochberg@ieee.org> wrote:
>
>
> On 7/30/07, Geoffrey Zhu <zyzhu2000@gmail.com> wrote:
> > Hi Everyone,
> >
> > I am wondering what is the best (and fastest) way to build a pivot
> > table aside from the 'brute force' way?
>
> What's the brute force way? It's easier to offer an improved suggestion if
> we know what we're trying to beat.
>
> > I want to transform a numpy array into a pivot table. For example, if
> > I have a numpy array like below:
> >
> > Region     Date      # of Units
> > ------     ----      ----------
> > East       1/1       10
> > East       1/1       20
> > East       1/2       30
> > West       1/1       40
> > West       1/2       50
> > West       1/2       60
> >
> > I want to transform this into the following table, where f() is a
> > given aggregate function:
> >
> >                 Date
> > Region       1/1          1/2
> > ----------   ----------   ----------
> > East         f(10,20)     f(30)
> > West         f(40)        f(50,60)
> >
> >
> > I can regroup them into 'sets' and do it the brute force way, but that
> > is kind of slow to execute. Does anyone know a better way?
>
> I would use a Python dictionary to assemble lists of values, keyed
> off (region, date) tuples. In outline:
>
> map = {}
> dates = set()
> regions = set()
> for (region, date, units) in data:
>     dates.add(date)
>     regions.add(region)
>     key = (region, date)
>     if key not in map:
>         map[key] = []
>     map[key].append(units)
>
> Once you have map, regions and dates, you can trivially make a table as
> above.  The details will depend on what format you want the table to have,
> but it should be easy to do.
>
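
For concreteness, that last table-assembly step could be sketched
like so (an illustrative sketch only, not part of the outline above;
build_pivot is a made-up name, the layout is arbitrary, and f is
whatever aggregate function is wanted):

def build_pivot(cells, regions, dates, f):
    # cells is the `map` dict built above:
    # (region, date) -> list of unit values
    dates = sorted(dates)
    regions = sorted(regions)
    rows = []
    for region in regions:
        row = [region]
        for date in dates:
            values = cells.get((region, date))
            row.append(f(values) if values else None)
        rows.append(row)
    return dates, rows

With f=sum and the sample data above, the East/1/1 cell comes out as
30 and the West/1/2 cell as 110; cells with no records are left as
None.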

[SNIP]

The 'brute-force' way is basically what you suggested -- looping
through all the records and building a two-way hash-table of the data.

The problem with the 'brute-force' approach is that it does not take
advantage of numpy's facilities and can be slow. If only there were
some built-in mechanism in numpy to handle this.

I'm curious; have you tried this and found it slow, or is this a hunch based on the reputed slowness of Python? Algorithmically, you can't do much better: the dictionary and set operations are O(1), so the whole operation is O(N), and you won't do better than that, order-wise. What you're left with is trying to reduce constant factors.
 
There are various ways one might go about reducing constant factors, but they depend on the details of the problem. For example, if the dates are dense and you are going to parse them anyway, you could replace the hash with a table that you index into with the date as an integer. I'm not sure that you are going to do a lot better than the brute-force algorithm in the generic case, though.
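
To make the integer-index idea concrete, a rough sketch for the
special case where the aggregate is a plain sum might look like this
(an illustration only, written against current numpy; it assumes the
three columns are available as separate 1-D arrays region, date and
units, and it uses integer codes for the labels instead of parsing
real dates):

import numpy as np

# Integer codes for each label; `regions` and `dates` end up holding
# the sorted unique labels.
regions, r_idx = np.unique(region, return_inverse=True)
dates, d_idx = np.unique(date, return_inverse=True)

# Accumulate the units into a dense len(regions) x len(dates) table.
table = np.bincount(r_idx * len(dates) + d_idx,
                    weights=units,
                    minlength=len(regions) * len(dates))
table = table.reshape(len(regions), len(dates))

For the sample data this gives [[30., 30.], [40., 110.]], with rows
East/West and columns 1/1 and 1/2. Empty cells show up as 0 rather
than "no data", and the trick only helps when f can be expressed with
bincount or a ufunc reduction; for an arbitrary f you still need the
per-cell lists.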

-tim



The other thing I am not sure about is your map object above: do I
append the row number (an index into the numpy array), or the row
object itself (such as data[r])?
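
Purely as an illustration of the two variants in question (assuming
data can be indexed by row number r, as in data[r]):

# Variant 1: store row numbers, and pull the rows back out of
# `data` later with something like [data[i] for i in map[key]].
map[key].append(r)

# Variant 2: store just the value the aggregate needs.
map[key].append(units)

Which one fits better depends on whether f needs the whole row or
only the units value.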

Thanks,
Geoffrey



--
.  __
.   |-\
.
.  tim.hochberg@ieee.org