[Numpy-discussion] NumPy-Discussion Digest, Vol 63, Issue 43

Konrad Banachewicz konrad.banachewicz at gmail.com
Tue Dec 13 15:50:06 EST 2011


U

On 12/13/11, numpy-discussion-request at scipy.org
<numpy-discussion-request at scipy.org> wrote:
> Send NumPy-Discussion mailing list submissions to
> 	numpy-discussion at scipy.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> 	http://mail.scipy.org/mailman/listinfo/numpy-discussion
> or, via email, send a message with subject or body 'help' to
> 	numpy-discussion-request at scipy.org
>
> You can reach the person managing the list at
> 	numpy-discussion-owner at scipy.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of NumPy-Discussion digest..."
>
>
> Today's Topics:
>
>    1. Re: Fast Reading of ASCII files (Chris Barker)
>    2. Re: Apparently non-deterministic behaviour of complex array
>       multiplication (kneil)
>    3. Re: numpy.mean problems (Eraldo Pomponi)
>    4. Re: Fast Reading of ASCII files (Bruce Southey)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Tue, 13 Dec 2011 10:08:44 -0800
> From: Chris Barker <chris.barker at noaa.gov>
> Subject: Re: [Numpy-discussion] Fast Reading of ASCII files
> To: denis <denis-bz-gg at t-online.de>, 	Discussion of Numerical Python
> 	<numpy-discussion at scipy.org>
> Message-ID:
> 	<CALGmxEJt9Y0oaM1gkFSuFLwaBJNxfLk54x-N8+f8hT5VzjcVtQ at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> NOTE:
>
> Let's keep this on the list.
>
> On Tue, Dec 13, 2011 at 9:19 AM, denis <denis-bz-gg at t-online.de> wrote:
>
>> Chris,
>>  unified, consistent save / load is a nice goal
>>
>> 1) header lines with date, pwd etc.: "where'd this come from ?"
>>
>>    # (5, 5)  svm.py  bz/py/ml/svm  2011-12-13 Dec 11:56  -- automatic
>>    # 80.6 % correct -- user info
>>      245    39     4     5    26
>>    ...
>>
> I'm not sure I understand what you are expecting here: What would be
> automatic? If it parses a datetime in the header, what would it do with it?
> But anyway, this seems to me:
>   - very application specific -- this is for the user's code to write
>   - not what we are talking about at this point anyway -- I think this
> discussion is about a lower-level, does-the-simple-things-fast reader --
> that may or may not be able to form the basis of a higher-level,
> fuller-featured reader.
>
>
>> 2) read any CSVs: comma or blank-delimited, with/without column names,
>>    a la loadcsv() below
>>
>
> yup -- though the column name reading would be part of a higher-level
> reader as far as I'm concerned.
>
>
>> 3) sparse or masked arrays ?
>>
> sparse probably not -- that seems pretty domain-dependent to me -- though
> hopefully one could build such a thing on top of the lower level reader.
> Masked support would be good -- once we're convinced what the future of
> masked arrays is in numpy. I was thinking that the masked array issue
> would really be a higher-level feature -- it certainly could be if you need
> to mask "special value" style files (e.g. 9999), but we may have to build
> it into the lower level reader for cases where the mask is specified by
> non-numerical values -- i.e. there are some met files that use "MM" or some
> other text, so you can't put it into a numerical array first.
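As an illustration of the "MM"/9999 case above, genfromtxt can already treat both a text sentinel and a numeric sentinel as missing when asked for a masked array. A small sketch with made-up data (the `missing_values` tokens here are illustrative assumptions):

```python
import io
import numpy as np

# Made-up sample mimicking a met file that marks missing data with the
# text "MM" as well as the numeric sentinel 9999.
text = io.StringIO(u"1.0,2.0,MM\n9999,5.0,6.0\n7.0,8.0,9.0\n")

arr = np.genfromtxt(text, delimiter=",",
                    missing_values="MM,9999",  # tokens treated as missing
                    usemask=True)              # return a masked array

print(arr.mask)  # True wherever "MM" or "9999" appeared
```

Because the match happens on the raw string token before conversion, the numeric sentinel never has to pass through the float parser.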
>
>>
>> Longterm wishes: beyond the scope of one file <-> one array
>> but essential for larger projects:
>> 1) dicts / dotdicts:
>>    Dotdict( A=anysizearray, N=scalar ... ) <-> a directory of little
>> files
>>    is easy, better than np.savez
>>    (Haven't used hdf5; I believe Matlab v7 does.)
>>
>> 2) workflows: has anyone there used visTrails ?
>>
>
> outside of the spec of this thread...
>
>>
>> Anyway it seems to me (old grey cynic) that Numpy/scipy developers
>> prefer to code first, spec and doc later. Too pessimistic ?
>>
>>
> Well, I think many of us believe in a more agile-style approach --
> incremental development. But as an open source project, it's really
> about scratching an itch -- so there is usually a spec in mind for the itch
> at hand. In this case, however, that has been a weakness -- clearly a
> number of us have written small solutions to our particular problem at
> hand, but we haven't arrived at a more general-purpose solution yet. So a
> bit of spec-ing ahead of time may be called for.
>
> On that:
>
> I've been thinking from the bottom up -- imagining what I need for the
> simple case, and how it might apply to more complex cases -- but maybe we
> should think about this another way:
>
> What we're talking about here is really about core software engineering --
> optimization. It's easy to write a pure-python simple file parser, and
> reasonable to write a complex one (genfromtxt) -- the issue is performance
> -- we need some more C (or Cython) code to really speed it up, but none of
> us wants to write the complex case code in C. So:
>
> genfromtxt is really nice for many of the complex cases. So perhaps
> another approach is to look at genfromtxt, and see what
> high performance lower-level functionality we could develop that could make
> it fast -- then we are done.
>
> This actually mirrors exactly what we all usually recommend for Python
> development in general -- write it in Python, then, if it's really not fast
> enough, rewrite the bottleneck in C.
>
> So where are the bottlenecks in genfromtxt? Are there self-contained
> portions that could be re-written in C/Cython?
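One concrete way to find those bottlenecks is simply to profile a genfromtxt call on synthetic data; a rough sketch (the sizes and data are made up):

```python
import cProfile
import io
import pstats
import numpy as np

# Build a synthetic CSV in memory: 10000 rows x 5 columns of integers.
rows = "\n".join(",".join(str(i * 5 + j) for j in range(5))
                 for i in range(10000))

profiler = cProfile.Profile()
profiler.enable()
arr = np.genfromtxt(io.StringIO(rows), delimiter=",")
profiler.disable()

# The cumulative-time ranking shows which internal helpers dominate --
# those are the candidates for a C/Cython rewrite.
stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats(5)
```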
>
> -Chris
>
>
>
>
>
>
> --
>
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115       (206) 526-6317   main reception
>
> Chris.Barker at noaa.gov
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL:
> http://mail.scipy.org/pipermail/numpy-discussion/attachments/20111213/2b6d09f4/attachment-0001.html
>
> ------------------------------
>
> Message: 2
> Date: Tue, 13 Dec 2011 10:13:31 -0800 (PST)
> From: kneil <magnetotellurics at gmail.com>
> Subject: Re: [Numpy-discussion] Apparently non-deterministic behaviour
> 	of complex array multiplication
> To: numpy-discussion at scipy.org
> Message-ID: <32969114.post at talk.nabble.com>
> Content-Type: text/plain; charset=us-ascii
>
>
> Hi Olivier,
> Sorry for the late reply - I have been on travel.
> I have encountered the error in two separate cases; when I was using numpy
> arrays, and when I was using numpy matrices.
> In the case of a numpy array (Y), the operation is:
> dot(Y,Y.conj().transpose())
> and in the case of a matrix,  with X=asmatrix(Y) and then the operation is:
> X*X.H
> -Karl
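For reference, the two forms Karl describes should be numerically equivalent; a small deterministic sketch (the shape and seed are made up for illustration):

```python
import numpy as np

rng = np.random.RandomState(0)               # fixed seed for repeatability
Y = rng.randn(4, 3) + 1j * rng.randn(4, 3)   # small complex128 array

# ndarray form: Y . Y^H
G1 = np.dot(Y, Y.conj().transpose())

# matrix form: X * X.H (the .H attribute exists on np.matrix, not ndarray,
# which explains the AttributeError Olivier saw)
X = np.asmatrix(Y)
G2 = X * X.H

# Both should give the same Gram matrix, and it should be Hermitian.
assert np.allclose(G1, np.asarray(G2))
assert np.allclose(G1, G1.conj().T)
```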
>
>
> Olivier Delalleau-2 wrote:
>>
>> I was trying to see if I could reproduce this problem, but your code fails
>> with numpy 1.6.1 with:
>> AttributeError: 'numpy.ndarray' object has no attribute 'H'
>> Is X supposed to be a regular ndarray with dtype = 'complex128', or
>> something else?
>>
>> -=- Olivier
>>
>>
>
> --
> View this message in context:
> http://old.nabble.com/Apparently-non-deterministic-behaviour-of-complex-array-multiplication-tp32893004p32969114.html
> Sent from the Numpy-discussion mailing list archive at Nabble.com.
>
>
>
> ------------------------------
>
> Message: 3
> Date: Tue, 13 Dec 2011 20:04:22 +0100
> From: Eraldo Pomponi <eraldo.pomponi at gmail.com>
> Subject: Re: [Numpy-discussion] numpy.mean problems
> To: Discussion of Numerical Python <numpy-discussion at scipy.org>
> Message-ID:
> 	<CAEaCG7eaoVWwqBm3xkjZ8JZp3xgQKs8rCKkVCGi6EdDRSGPvvw at mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Hi Fred,
>
> I would suggest you have a look at pandas
> (http://pandas.sourceforge.net/). It was
> really helpful for me, and it seems well suited for the type of data you
> are working with. It has nice "broadcasting" capabilities for applying
> numpy functions to a set of columns:
> http://pandas.sourceforge.net/basics.html#descriptive-statistics
> http://pandas.sourceforge.net/basics.html#function-application
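As a small illustration of those two pandas features (the column names and values are made up), `DataFrame.mean` gives per-column descriptive statistics and `DataFrame.apply` broadcasts a NumPy function over the columns:

```python
import numpy as np
import pandas as pd

# Hypothetical table resembling the structured data in this thread.
df = pd.DataFrame({"length": [10, 20, 15, 30],
                   "pident": [85.0, 90.5, 77.2, 99.1]})

means = df.mean()             # per-column means: (10+20+15+30)/4 = 18.75
col_means = df.apply(np.mean) # same result via function application

# Boolean indexing makes conditional means (the numpy.mean question
# in this thread) a one-liner:
filtered = df[df["length"] >= 15].mean()
```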
>
> Cheers,
> Eraldo
>
>
> On Sun, Dec 11, 2011 at 1:49 PM, ferreirafm
> <ferreirafm at lim12.fm.usp.br>wrote:
>
>>
>>
>> Aronne Merrelli wrote:
>> >
>> > I can recreate this error if tab is a structured ndarray - what is the
>> > dtype of tab?
>> >
>> > If that is correct, I think you could fix this by simplifying things.
>> > Since
>> > tab is already an ndarray, you should not need to convert it back into a
>> > python list. By converting the ndarray back to a list you are making an
>> > extra level of "wrapping" as a python object, which is ultimately why
>> > you
>> > get that error about adding numpy.void.
>> >
>> > Unfortunately you cannot directly take a mean of a struct dtype;
>> > structs are generic so they could have fields with strings, or objects,
>> > etc, that would be invalid for a mean calculation. However the following
>> > code fragment should work pretty efficiently. It will make a 1-element
>> > array of the same dtype as tab, and then populate it with the mean value
>> > of
>> > all elements where the length is >= 15. Note that dtype.fields.keys()
>> > gives
>> > you a nice way to iterate over the fields in the struct dtype:
>> >
>> > length_mask = tab['length'] >= 15
>> > tab_means = np.zeros(1, dtype=tab.dtype)
>> > for k in tab.dtype.fields.keys():
>> >     tab_means[k] = np.mean( tab[k][length_mask] )
>> >
>> > In general this would not work if tab has a field that is not a simple
>> > numeric type, such as a str, object, ... But it looks like your arrays
>> > are all numeric from your example above.
>> >
>> > Hope that helps,
>> > Aronne
>> >
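A self-contained, runnable version of Aronne's fragment, using a made-up two-field dtype (note that the integer field's mean is truncated when assigned back into the int column):

```python
import numpy as np

# Made-up structured array similar to the BLAST-style table in this thread.
tab = np.array([(12, 80.0), (18, 90.0), (25, 95.0)],
               dtype=[("length", "<i8"), ("pident", "<f8")])

length_mask = tab["length"] >= 15          # rows to include in the mean
tab_means = np.zeros(1, dtype=tab.dtype)   # 1-element array, same dtype
for k in tab.dtype.fields.keys():
    tab_means[k] = np.mean(tab[k][length_mask])

print(tab_means)  # field-wise means over the masked rows
```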
>> HI Aronne,
>> Thanks for your replay. Indeed, tab is a mix of different column types:
>> tab.dtype:
>> [('sgi', '<i8'), ('length', '<i8'), ('nident', '<i8'), ('pident', '<f8'),
>> ('positive', '<i8'), ('ppos', '<f8'), ('mismatch', '<i8'), ('qstart',
>> '<i8'), ('qend', '<i8'), ('sstart', '<i8'), ('send', '<i8'), ('gapopen',
>> '<i8'), ('gaps', '<i8'), ('evalue', '<f8'), ('bitscore', '<f8'), ('score',
>> '<f8')]
>>  Interestingly, I wasn't able to import some columns of digits as
>> strings, as I can with R dataframe objects.
>> I'll try to adapt your example to my needs and let you know the results.
>> Regards.
>>
>> --
>> View this message in context:
>> http://old.nabble.com/numpy.mean-problems-tp32945124p32955052.html
>> Sent from the Numpy-discussion mailing list archive at Nabble.com.
>>
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL:
> http://mail.scipy.org/pipermail/numpy-discussion/attachments/20111213/487dbe82/attachment-0001.html
>
> ------------------------------
>
> Message: 4
> Date: Tue, 13 Dec 2011 13:29:47 -0600
> From: Bruce Southey <bsouthey at gmail.com>
> Subject: Re: [Numpy-discussion] Fast Reading of ASCII files
> To: numpy-discussion at scipy.org
> Message-ID: <4EE7A7AB.8060201 at gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> On 12/13/2011 12:08 PM, Chris Barker wrote:
>> [Chris Barker's message quoted in full; see Message 1 above.]
>>
> Reading data is hard and writing code that suits the diversity in the
> Numerical Python community is even harder!
>
> Both the loadtxt and genfromtxt functions (other functions are perhaps less
> important) need an upgrade to incorporate the new NA object. I
> think that adding the NA object will simplify some of the process, because
> invalid data (missing, or a string in a numerical field) can be set to
> NA without requiring the creation of a new masked array or returning an
> error.
>
> Here I think loadtxt is a better target than genfromtxt because, as I
> understand it, it assumes the user really knows the data, whereas
> genfromtxt can ask the data for the appropriate format.
>
> So I agree that a new 'superfast custom CSV reader for well-behaved data'
> function would be rather useful, especially as a replacement for
> loadtxt. By that I mean reading data using a user-specified format that
> essentially follows the CSV format
> (http://en.wikipedia.org/wiki/Comma-separated_values) -- it needs to
> allow for the NA object, skipping lines, and user-defined delimiters.
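As a sketch of what that "well-behaved data" reader looks like today with plain loadtxt (the "NA" token and the converter are illustrative assumptions; loadtxt itself has no missing-value support, which is Bruce's point):

```python
import io
import numpy as np

# Made-up CSV: a header line to skip, a known delimiter, and an NA token.
text = io.StringIO(u"# header to skip\n1.0,2.0\nNA,4.0\n5.0,6.0\n")

def na_to_nan(tok):
    # Map the (assumed) "NA" token to nan; otherwise parse as float.
    # Handle both str and bytes, since converters have received either
    # depending on the numpy version.
    tok = tok.strip()
    return np.nan if tok in ("NA", b"NA") else float(tok)

arr = np.loadtxt(text, delimiter=",", skiprows=1,
                 converters={0: na_to_nan, 1: na_to_nan})

print(arr)  # the NA cell becomes nan
```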
>
> Bruce
>
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL:
> http://mail.scipy.org/pipermail/numpy-discussion/attachments/20111213/b01db77d/attachment.html
>
> ------------------------------
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>
> End of NumPy-Discussion Digest, Vol 63, Issue 43
> ************************************************
>

-- 
Sent from my mobile device

"Reasonable people adapt themselves to the world. Unreasonable people
attempt to adapt the world to themselves. All progress, therefore, depends
on unreasonable people." - G.B. Shaw


