
Hi folks, This is a continuation of a conversation already started, but I gave it a new, more appropriate, thread and subject. On 12/6/11 2:13 PM, Wes McKinney wrote:
we should start talking about building a *high performance* flat file loading solution with good column type inference and sensible defaults, etc. ...
I personally don't believe in sacrificing an order of magnitude of performance in the 90% case for the 10% case-- so maybe it makes sense to have two functions around: a superfast custom CSV reader for well-behaved data, and a slower, but highly flexible, function like loadtable to fall back on.
I've wanted this for ages, and have done some work towards it, but like others, only had the time for a my-use-case-specific solution. A few thoughts:

* If we have a good, fast ascii (or unicode?) to array reader, hopefully it could be leveraged for use in the more complex cases, so that rather than genfromtxt() being written from scratch, it would be a wrapper around the lower-level reader.

* Key to performance is to have the text to number to numpy type conversion happening in C -- if you read the text with python, then convert to numbers, then to numpy arrays, it's simply going to be slow.

* I think we want a solution that can be adapted to arbitrary text files -- not just tabular, CSV-style data. I have a lot of those to read -- and some thoughts about how.

Efforts I have made so far, and what I've learned from them:

1) fromfile():

fromfile (for text) is nice and fast, but buggy, and a bit too limited. I've posted various notes about this in the past (and, I'm pretty sure, a couple of tickets). The key missing features are:

a) no support for commented lines (this is a lesser need, I think)

b) there can be only one delimiter, and newlines are treated as generic whitespace. What this means is that if you have a whitespace-delimited file, you can read multiple lines, but if it is, for instance, comma-delimited, then you can only read one line at a time, killing performance.

c) there are various bugs if the text is malformed, or doesn't quite match what you're asking for (e.g. reading integers, but the text is float) -- mostly really limited error checking.

I spent some time digging into the code, and found it to be really hard-to-follow C code, and very hard to update. The core idea is pretty nice -- each dtype should know how to read itself from a text file -- but the implementation is painful. The key issue is that, for floats and ints anyway, it relies on the C atoi and atof functions. However, there have been patches to these that handle NaN better, etc., for numpy, and I think a Python patch as well. So the code calls the numpy atoi, which does some checks, then calls the python atoi, which then calls the C lib atoi (I think all that...). In any case, the core bugs are due to the fact that atoi and friends don't return an error code, so you have to check whether the pointer has been incremented to see if the read was successful -- and this error checking is not propagated through all those levels of calls. It got really ugly to try to fix! Also, the use of the C atoi() means that locales can only be handled in the default way -- i.e. no way to read European-style floats on a system with a US locale.

My conclusion -- the current code is too much of a mess to try to deal with and fix! I also think it's a mistake to have text file reading be a special case of fromfile(); it really should be kept separate, though that's a minor API question.

2) FileScanner:

FileScanner is some code I wrote years ago as a C extension -- it's limited, but does the job and is pretty fast. It essentially calls fscanf() as many times as it gets a successful scan, skipping all invalid text, then returning a numpy array. You can also specify how many numbers you want read from the file. It only supports floats. Travis O. asked if it could be included in SciPy way back when, but I suspect none of my code actually made it in. If I had to do it again, I might write something similar in Cython, though I am still using it.
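[A minimal sketch, not part of the original message, illustrating point (b) of the fromfile() discussion above. The file names and the three-column layout are made up. With sep=" ", newlines count as ordinary whitespace, so one call reads the whole file; with a comma delimiter you end up parsing a line at a time in Python, which is the performance killer described above.]

    import numpy as np

    # whitespace-delimited: one call reads every value in the file,
    # because newlines are treated as just more whitespace
    data = np.fromfile("data.txt", dtype=np.float64, sep=" ")
    table = data.reshape(-1, 3)          # we happen to know there are 3 columns

    # comma-delimited: fall back to a (slow) per-line loop in Python
    rows = [np.fromstring(line, dtype=np.float64, sep=",")
            for line in open("data.csv") if line.strip()]
    table_csv = np.array(rows)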
My Conclusions: I think what we need is something similar to MATLAB's fscanf(): what it does is take a C-style format string, and apply it to your file over and over again, as many times as it can, and return an array. What's nice about this is that it can be adapted to read a wide variety of text files efficiently.

For numpy, I imagine something like:

    fromtextfile(f, dtype=np.float64, comment=None, shape=None):
        """
        read data from a text file, returning a numpy array

        f: a filename or file-like object

        comment: a string giving the comment signifier. Anything on a line
            after this string will be ignored.

        dtype: the numpy dtype you want read from the file

        shape: the shape of the resulting array. If shape is None, the file
            will be read until EOF or until there is a read error. By default,
            if there are newlines in the file, a 2-d array will be returned,
            with a newline signifying a new row in the array.
        """

This is actually pretty straightforward. If it supports compound dtypes, then you can read a pretty complex CSV file, once you've determined the dtype for your "record" (row). It is also really simple to use for the simple cases.

But of course, the implementation could be a pain -- I've been thinking that you could get a lot of it by creating a mapping from numpy dtypes to fscanf() format strings, then simply use fscanf for the actual file reading. This would certainly be easy for the easy cases. (Maybe you'd want to use sscanf, so you could have the same code scan strings as well as files.)

Ideally, each dtype would know how to read itself from a string, but as I said above, the code for that is currently pretty ugly, so it may be easier to keep it separate.

Anyway, I'd be glad to help with this effort.

-Chris

-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
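[Not part of the original message: a rough pure-Python sketch of the interface Chris describes, just to make the proposed semantics concrete. fromtextfile is hypothetical -- it is not an existing numpy function -- and a real implementation would do the parsing in C; this version only handles simple numeric dtypes.]

    import numpy as np

    def fromtextfile(f, dtype=np.float64, comment=None, shape=None):
        """Slow pure-Python reference for the proposed reader (illustration only)."""
        if isinstance(f, str):
            f = open(f)
        rows = []
        for line in f:
            if comment is not None:
                line = line.split(comment, 1)[0]      # drop everything after the comment marker
            line = line.replace(",", " ").strip()     # crude: treat commas like whitespace
            if line:
                rows.append([float(tok) for tok in line.split()])
        arr = np.array(rows, dtype=dtype)
        return arr if shape is None else arr.reshape(shape)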

I would like to launch python modules or functions (I don't know which is easier to do, modules or functions) in separate Terminal windows so I can see the output from each as they execute. I need to be able to pass each module or function a set of parameters. I would like to do this from a python script already running in a Terminal window. In other words, I'd start up a "master" script and it would launch, say, three processes using another module or a function, with different parameter values for each launch, and each would run independently in its own Terminal window so stdout from each process would go to its own window. When a process terminated, its window would remain open. I've begun to look at the subprocess module, etc., but that's pretty confusing. I can do what I describe above manually, but it's gotten clumsy as I eventually want to run on 12 cores. I have a Mac Pro running Mac OS X 10.6. If there is a better forum to ask this question, please let me know. Thanks for any advice. -- Lou Pecora, my views are my own.

Maybe try stackoverflow, since this isn't really a numpy question. To run a command like "python myscript.py arg1 arg2" in a separate process, you can do: p = subprocess.Popen("python myscript.py arg1 arg2".split()) You can launch many of these, and if you want to know if a process p is over, you can call p.poll(). I'm sure there are other (and better) options though. -=- Olivier 2011/12/7 Lou Pecora <lou_boog2000@yahoo.com>
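[A small sketch expanding on the suggestion above; it is not part of Olivier's message, and the script name and arguments are placeholders. It launches several independent processes and polls them until they all finish.]

    import subprocess
    import time

    # launch three copies of a (hypothetical) worker script with different arguments
    procs = [subprocess.Popen(["python", "myscript.py", str(i)]) for i in range(3)]

    # poll() returns None while a process is still running, its exit code once it is done
    while any(p.poll() is None for p in procs):
        time.sleep(1)

    print "all workers finished"          # Python 2 print, matching the era of this thread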
I would like to launch python modules or functions (I don't know which is easier to do, modules or functions) in separate Terminal windows so I can see the output from each as they execute. I need to be able to pass each module or function a set of parameters. I would like to do this from a python script already running in a Terminal window. In other words, I'd start up a "master" script and it would launch, say, three processes using another module or a function, with different parameter values for each launch, and each would run independently in its own Terminal window so stdout from each process would go to its own window. When a process terminated, its window would remain open.
I've begun to look at the subprocess module, etc., but that's pretty confusing. I can do what I describe above manually, but it's gotten clumsy as I eventually want to run on 12 cores.
I have a Mac Pro running Mac OS X 10.6.
If there is a better forum to ask this question, please let me know.
Thanks for any advice.
-- Lou Pecora, my views are my own.

You should consider the powerful multiprocessing package. Have a look at this piece of code:

import glob
import os
import multiprocessing as multi
import subprocess as sub
import time

NPROC = 4

Python = '/Library/Frameworks/EPD64.framework/Versions/Current/bin/python'
Xterm = '/usr/X11/bin/xterm '

coord = []
Size = '100x10'
XPos = 810
YPos = 170
XOffset = 0
YOffset = 0
for i in range(NPROC):
    if i % 2 == 0:
        coord.append(Size + '+' + str(YPos) + '+' + str(YOffset))
    else:
        coord.append(Size + '+' + str(XPos) + '+' + str(YOffset))
        YOffset = YOffset + YPos

def CompareColourRef(Champ):
    BaseChamp = os.path.basename(Champ)
    NameProc = int(multi.current_process().name[-1]) - 1
    print 'Processing', BaseChamp, 'on processor', NameProc+1
    os.putenv('ADAM_USER', DirWrk + 'adam_' + str(NameProc+1))
    Command = Xterm + '-geometry ' + '"' + coord[NameProc] + '" -T " Proc' + str(NameProc+1) + ' ' + BaseChamp + ' ' + '" -e " ' + Python + ' ' + DirSrc + \
        'CompareColourRef.py ' + BaseChamp + ' 2>&1 | tee ' + DirLog + BaseChamp + '.log"'
    Process = sub.Popen([Command], shell=True)
    Process.wait()
    print BaseChamp, 'processed on processor', NameProc+1
    return

pool = multi.Pool(processes=NPROC)
Champs = glob.glob(DirImg + '*/*')
results = pool.map_async(CompareColourRef, Champs)
pool.close()

while results._number_left > 0:
    print "Waiting for", results._number_left, 'tasks to complete'
    time.sleep(15)

pool.join()

print 'Process completed'
exit(0)

Cheers Jean-Baptiste

On 7 Dec 2011, at 15:43, Olivier Delalleau wrote:
Maybe try stackoverflow, since this isn't really a numpy question. To run a command like "python myscript.py arg1 arg2" in a separate process, you can do: p = subprocess.Popen("python myscript.py arg1 arg2".split()) You can launch many of these, and if you want to know if a process p is over, you can call p.poll(). I'm sure there are other (and better) options though.
-=- Olivier
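[A stripped-down sketch of the pattern in Jean-Baptiste's example, not part of the original messages: the site-specific paths, window geometry and environment setup are removed, and worker.py is a placeholder script. Each pool worker opens one X11 xterm window; xterm's -hold option keeps the window open after the command exits, which matches what Lou asked for.]

    import multiprocessing as multi
    import subprocess as sub

    def run_in_xterm(arg):
        # one xterm per task; the window shows the script's stdout while it runs
        cmd = ('xterm -hold -T "task %s" '
               '-e sh -c "python worker.py %s 2>&1 | tee task_%s.log"' % (arg, arg, arg))
        sub.Popen(cmd, shell=True).wait()

    if __name__ == '__main__':
        pool = multi.Pool(processes=4)
        pool.map(run_in_xterm, ['a', 'b', 'c', 'd'])    # blocks until all tasks finish
        pool.close()
        pool.join()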

From: Jean-Baptiste Marquette <marquett@iap.fr> To: Discussion of Numerical Python <numpy-discussion@scipy.org> Sent: Wednesday, December 7, 2011 4:23 PM Subject: Re: [Numpy-discussion] Simple way to launch python processes?

[Jean-Baptiste's multiprocessing example quoted in full -- see his message above.]

Wow. I will have to digest that, but thank you. -- Lou Pecora, my views are my own.

From: Olivier Delalleau <shish@keba.be> To: Discussion of Numerical Python <numpy-discussion@scipy.org> Sent: Wednesday, December 7, 2011 3:43 PM Subject: Re: [Numpy-discussion] Simple way to launch python processes? Maybe try stackoverflow, since this isn't really a numpy question. To run a command like "python myscript.py arg1 arg2" in a separate process, you can do: p = subprocess.Popen("python myscript.py arg1 arg2".split()) You can launch many of these, and if you want to know if a process p is over, you can call p.poll(). I'm sure there are other (and better) options though. -=- Olivier

Thank you. -- Lou Pecora, my views are my own.

On Wed, Dec 7, 2011 at 7:50 PM, Chris.Barker <Chris.Barker@noaa.gov> wrote:
Hi folks,
This is a continuation of a conversation already started, but i gave it a new, more appropriate, thread and subject.
On 12/6/11 2:13 PM, Wes McKinney wrote:
we should start talking about building a *high performance* flat file loading solution with good column type inference and sensible defaults, etc. ...
I personally don't believe in sacrificing an order of magnitude of performance in the 90% case for the 10% case-- so maybe it makes sense to have two functions around: a superfast custom CSV reader for well-behaved data, and a slower, but highly flexible, function like loadtable to fall back on.
I've wanted this for ages, and have done some work towards it, but like others, only had the time for a my-use-case-specific solution. A few thoughts:
* If we have a good, fast ascii (or unicode?) to array reader, hopefully it could be leveraged for use in the more complex cases. So that rather than genfromtxt() being written from scratch, it would be a wrapper around the lower-level reader.
You seem to be contradicting yourself here. The more complex cases are Wes' 10% and why genfromtxt is so hairy internally. There's always a trade-off between speed and handling complex corner cases. You want both. A very fast reader for well-behaved files would be very welcome, but I see it as a separate topic from genfromtxt/loadtable. The question for the loadtable pull request is whether it is different enough from genfromtxt that we need/want both, or whether loadtable should replace genfromtxt. Cheers, Ralf
* key to performance is to have the text to number to numpy type conversion happening in C -- if you read the text with python, then convert to numbers, then to numpy arrays, it's simply going to be slow.
* I think we want a solution that can be adapted to arbitrary text files -- not just tabular, CSV-style data. I have a lot of those to read - and some thoughts about how.
Efforts I have made so far, and what I've learned from them:
1) fromfile(): fromfile (for text) is nice and fast, but buggy, and a bit too limited. I've posted various notes about this in the past (and, I'm pretty sure, a couple of tickets). The key missing features are: a) no support for commented lines (this is a lesser need, I think) b) there can be only one delimiter, and newlines are treated as generic whitespace. What this means is that if you have a whitespace-delimited file, you can read multiple lines, but if it is, for instance, comma-delimited, then you can only read one line at a time, killing performance. c) there are various bugs if the text is malformed, or doesn't quite match what you're asking for (e.g. reading integers, but the text is float) -- mostly really limited error checking.
I spent some time digging into the code, and found it to be really hard-to-follow C code, and very hard to update. The core idea is pretty nice -- each dtype should know how to read itself from a text file -- but the implementation is painful. The key issue is that, for floats and ints anyway, it relies on the C atoi and atof functions. However, there have been patches to these that handle NaN better, etc., for numpy, and I think a Python patch as well. So the code calls the numpy atoi, which does some checks, then calls the python atoi, which then calls the C lib atoi (I think all that...). In any case, the core bugs are due to the fact that atoi and friends don't return an error code, so you have to check whether the pointer has been incremented to see if the read was successful -- and this error checking is not propagated through all those levels of calls. It got really ugly to try to fix! Also, the use of the C atoi() means that locales can only be handled in the default way -- i.e. no way to read European-style floats on a system with a US locale.
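[An aside, not from the original message, illustrating the locale point from Python. The de_DE locale name is an assumption -- which locales are installed varies by system.]

    import locale

    # with the default C/US numeric locale, "3,14" is not a valid float:
    # float("3,14") raises ValueError

    # locale.atof() respects LC_NUMERIC, so European-style decimal commas can
    # be parsed once an appropriate locale is set (if it is installed)
    locale.setlocale(locale.LC_NUMERIC, "de_DE.UTF-8")
    print locale.atof("3,14")             # -> 3.14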
My conclusion -- the current code is too much of a mess to try to deal with and fix!
I also think it's a mistake to have text file reading be a special case of fromfile(); it really should be kept separate, though that's a minor API question.
2) FileScanner:
FileScanner is some code I wrote years ago as a C extension -- it's limited, but does the job and is pretty fast. It essentially calls fscanf() as many times as it gets a successful scan, skipping all invalid text, then returning a numpy array. You can also specify how many numbers you want read from the file. It only supports floats. Travis O. asked if it could be included in SciPy way back when, but I suspect none of my code actually made it in.
If I had to do it again, I might write something similar in Cython, though I am still using it.
My Conclusions:
I think what we need is something similar to MATLAB's fscanf():
what it does is take a C-style format string, and apply it to your file over and over again, as many times as it can, and return an array. What's nice about this is that it can be adapted to read a wide variety of text files efficiently.
For numpy, I imagine something like:
fromtextfile(f, dtype=np.float64, comment=None, shape=None): """ read data from a text file, returning a numpy array
f: is a filename or file-like object
comment: is a string of the comment signifier. Anything on a line after this string will be ignored.
dtype: is a numpy dtype that you want read from the file
shape: is the shape of the resulting array. If shape==None, the file will be read until EOF or until there is read error. By default, if there are newlines in the file, a 2-d array will be returned, with the newline signifying a new row in the array. """
This is actually pretty straightforward. If it supports compound dtypes, then you can read a pretty complex CSV file, once you've determined the dtype for your "record" (row). It is also really simple to use for the simple cases.
But of course, the implementation could be a pain -- I've been thinking that you could get a lot of it by creating a mapping from numpy dtypes to fscanf() format strings, then simply use fscanf for the actual file reading. This would certainly be easy for the easy cases. (maybe you'd want to use sscanf, so you could have the same code scan strings as well as files)
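[A sketch, not part of the original message, of what the dtype-to-format-string mapping might look like. The set of dtypes covered and the scanf specifiers chosen are only an illustration; a real implementation would need to cover many more cases.]

    import numpy as np

    # hypothetical mapping from simple numpy dtypes to C scanf conversion specifiers
    SCANF_FORMATS = {
        np.dtype(np.float64): "%lf",
        np.dtype(np.float32): "%f",
        np.dtype(np.int64):   "%ld",
        np.dtype(np.int32):   "%d",
    }

    def record_format(dtype, delimiter=" "):
        """Build one fscanf()-style format string for a (possibly compound) dtype."""
        if dtype.fields is None:
            return SCANF_FORMATS[dtype]
        return delimiter.join(SCANF_FORMATS[dtype.fields[name][0]] for name in dtype.names)

    # e.g. a "record" of two floats and an int
    rec = np.dtype([("x", np.float64), ("y", np.float64), ("n", np.int32)])
    print record_format(rec, ", ")        # -> "%lf, %lf, %d"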
Ideally, each dtype would know how to read itself from a string, but as I said above, the code for that is currently pretty ugly, so it may be easier to keep it separate.
Anyway, I'd be glad to help with this effort.
-Chris
-- Christopher Barker, Ph.D. Oceanographer
Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker@noaa.gov

On 12/11/11 8:40 AM, Ralf Gommers wrote:
On Wed, Dec 7, 2011 at 7:50 PM, Chris.Barker <Chris.Barker@noaa.gov> wrote: * If we have a good, fast ascii (or unicode?) to array reader, hopefully it could be leveraged for use in the more complex cases. So that rather than genfromtxt() being written from scratch, it would be a wrapper around the lower-level reader.
You seem to be contradicting yourself here. The more complex cases are Wes' 10% and why genfromtxt is so hairy internally. There's always a trade-off between speed and handling complex corner cases. You want both.
I don't think the version in my mind is contradictory (not quite).

What I'm imagining is that a good, fast ascii to numpy array reader could read a whole table in at once (the common, easy, fast case), but it could also be used to read snippets of a file at a time, which could be leveraged to handle many of the more complex cases.

I suppose there will always be cases where the user needs to write their own converter from string to dtype, and there is simply no way to leverage what I'm imagining to support that.

Hmm, maybe there is -- for instance, if a "record" consisted of mostly standard, easy-to-parse numbers, but one field was some weird text that needed custom parsing, we could read it as a dtype with a string for that one weird field, and that field could be converted in a post-processing step. Maybe that wouldn't be any faster or easier, but it could be done...

Anyway, whether you can leverage it for the full-featured version or not, I do think there is a call for a good, fast, 90%-case text file parser.

Would anyone like to join/form a small working group to work on this?

Wes, I'd like to see your Cython version -- maybe a starting point?

-Chris

-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
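[A sketch of the post-processing idea described above, not part of the original message. The file name, delimiter, field layout and tag format are all made up: the awkward column is read as a plain string during the fast pass, then converted with custom parsing afterwards.]

    import numpy as np

    # hypothetical record: two easy numeric fields plus one awkward text field
    dt = np.dtype([("x", np.float64), ("y", np.float64), ("tag", "S16")])
    raw = np.genfromtxt("data.csv", delimiter=",", dtype=dt)

    # post-processing step: custom parsing applied only to the string column
    tag_value = np.array([int(t.strip().lstrip("#")) for t in raw["tag"]])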

On Mon, Dec 12, 2011 at 10:22 AM, Chris.Barker <chris.barker@noaa.gov> wrote:
On 12/11/11 8:40 AM, Ralf Gommers wrote:
On Wed, Dec 7, 2011 at 7:50 PM, Chris.Barker <Chris.Barker@noaa.gov> wrote: * If we have a good, fast ascii (or unicode?) to array reader, hopefully it could be leveraged for use in the more complex cases. So that rather than genfromtxt() being written from scratch, it would be a wrapper around the lower-level reader.
You seem to be contradicting yourself here. The more complex cases are Wes' 10% and why genfromtxt is so hairy internally. There's always a trade-off between speed and handling complex corner cases. You want both.
I don't think the version in my mind is contradictory (Not quite).
What I'm imagining is that a good, fast ascii to numpy array reader could read a whole table in at once (the common, easy, fast, case), but it could also be used to read snippets of a file in at a time, which could be leveraged to handle many of the more complex cases.
I suppose there will always be cases where the user needs to write their own converter from string to dtype, and there is simply no way to leverage what I'm imagining to support that.
Hmm, maybe there is -- for instance, if a "record" consisted of mostly standard, easy-to-parse numbers, but one field was some weird text that needed custom parsing, we could read it as a dtype, with a string for that one weird field, and that could be converted in a post-processing step.
Maybe that wouldn't be any faster or easier, but it could be done...
Anyway, whether you can leverage it for the full-featured version or not, I do think there is call for a good, fast, 90% case text file parser.
Would anyone like to join/form a small working group to work on this?
Wes, I'd like to see your Cython version -- maybe a starting point?
-Chris
I'm also working on a faster text file reader, so count me in. I've been experimenting in both C and Cython. I'll put it on github as soon as I can. Warren
-- Christopher Barker, Ph.D. Oceanographer
Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker@noaa.gov

On Mon, Dec 12, 2011 at 12:34 PM, Warren Weckesser <warren.weckesser@enthought.com> wrote:
I'm also working on a faster text file reader, so count me in. I've been experimenting in both C and Cython. I'll put it on github as soon as I can.
Warren
Cool, Warren, I look forward to seeing it. I'm hopeful we can craft a performant tool that will meet the needs of many projects (NumPy, pandas, etc.)...
participants (8)
- Chris.Barker
- Chris.Barker
- Jean-Baptiste Marquette
- Lou Pecora
- Olivier Delalleau
- Ralf Gommers
- Warren Weckesser
- Wes McKinney