[Numpy-discussion] fromfile() for reading text (one more time!)

Mon Jan 4 22:45:04 EST 2010

On Mon, Jan 4, 2010 at 10:39 PM,  <alan at ajackson.org> wrote:
>>Hi folks,
>>
>>I'm taking a look once again at fromfile() for reading text files. I
>>often have the need to read a LOT of numbers from a text file, and it
>>can actually be pretty darn slow to do it the normal Python way:
>>
>>for line in file:
>>    data = map(float, line.strip().split())
>>
>>
>>or various other versions that are similar. It really does take longer
>>to read the text, split it up, convert to a number, then put that number
>>into a numpy array, than it does to simply read it straight into the array.
>>
>>However, as it stands, fromfile() turns out to be next to useless for
>>anything but whitespace separated text. Full set of ideas here:
>>
>>http://projects.scipy.org/numpy/ticket/909
>>
>>However, for the moment, I'm digging into the code to address a
>>particular problem -- reading files like this:
>>
>>123, 65.6, 789
>>23,  3.2,  34
>>...
>>
>>That is comma (or whatever) separated text -- pretty common stuff.
>>
>>The problem with the current code is that you can't read more than one
>>line at a time with fromfile:
>>
>>a = np.fromfile(infile, sep=",")
>>
>>will read until it doesn't find a comma, and thus only one line, as
>>there is no comma after each line. As this is a really typical case, I
>>think it should be supported.
>>
>>Here is the question:
>>
>>The work of finding the separator is done in:
>>
>>multiarray/ctors.c:  fromfile_skip_separator()
>>
>>It looks like it wouldn't be too hard to add some code in there to look
>>for a newline, and consider that a valid separator. However, that would
>>break backward compatibility. So maybe a flag could be passed in, saying
>>you wanted to support newlines. The problem is that flag would have to
>>get passed all the way through to this function (and also for fromstring).
>>
>>I also notice that it supports separators of arbitrary length, and I
>>wonder how useful that is. It also does odd things with spaces
>>embedded in the separator:
>>
>>", $ #" matches all of:  ",$#"   ", $#"  ",$ #"
>>
>>Is it worth trying to fix that?
>>
>>
>>In the longer term, it would be really nice to support comments as well,
>>though that would require more of a refactoring of the code, I think
>>(though maybe not -- I suppose a call to fromfile_skip_separator() could
>>look for a comment character, then if it found one, skip to where the
>>comment ends). Hmmm.
>>
>>thanks for any feedback,
>>
>>-Chris
>>
>
> I agree. I've tried using it, and usually find that it doesn't quite get there.
>
> I rather like the R command(s) for reading text files - except then I have to
> use R which is painful after using python and numpy. Although ggplot2 is
> awfully nice too ... but that is a later post.
>
>     read.table(file, header = FALSE, sep = "", quote = "\"'",
>                dec = ".", row.names, col.names,
>                as.is = !stringsAsFactors,
>                na.strings = "NA", colClasses = NA, nrows = -1,
>                skip = 0, check.names = TRUE, fill = !blank.lines.skip,
>                strip.white = FALSE, blank.lines.skip = TRUE,
>                comment.char = "#",
>                allowEscapes = FALSE, flush = FALSE,
>                stringsAsFactors = default.stringsAsFactors(),
>                fileEncoding = "", encoding = "unknown")
>
>     read.csv(file, header = TRUE, sep = ",", quote="\"", dec=".",
>              fill = TRUE, comment.char="", ...)
>
>     read.csv2(file, header = TRUE, sep = ";", quote="\"", dec=",",
>               fill = TRUE, comment.char="", ...)
>
>     read.delim(file, header = TRUE, sep = "\t", quote="\"", dec=".",
>                fill = TRUE, comment.char="", ...)
>
>     read.delim2(file, header = TRUE, sep = "\t", quote="\"", dec=",",
>                 fill = TRUE, comment.char="", ...)
>
>
> There is really only read.table, the others are just aliases with different
> defaults.  But the flexibility is great, as you can see.


Aren't the newly improved

numpy.genfromtxt(fname, dtype=<type 'float'>, comments='#',
delimiter=None, skiprows=0, converters=None, missing='',
missing_values=None, usecols=None, names=None, excludelist=None,
deletechars=None, case_sensitive=True, unpack=None, usemask=False,
loose=True)

and friends intended to handle all this?
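
For the comma-separated, multi-line case described above, something along
these lines should work. This is only a minimal sketch: it assumes a NumPy
version whose genfromtxt accepts a text-mode file-like object, and the sample
data and the names `text` and `data` are made up for illustration.

```python
import io

import numpy as np

# Hypothetical sample input: a comment line followed by comma-separated rows.
text = """# three columns of numbers
123, 65.6, 789
23,  3.2,  34
"""

# delimiter="," splits the fields on each line, and newlines separate rows,
# so every line is read, not just the first one.
# comments="#" skips anything from a '#' to the end of the line.
data = np.genfromtxt(io.StringIO(text), delimiter=",", comments="#")

print(data.shape)
print(data)
```

The result is a 2-D float array with one row per line of input. It goes
through Python-level parsing, so it will likely not be as fast as a
fromfile()-style reader written in C.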

Josef

>
> --
> -----------------------------------------------------------------------
> | Alan K. Jackson            | To see a World in a Grain of Sand      |
> | alan at ajackson.org          | And a Heaven in a Wild Flower,         |
> | www.ajackson.org           | Hold Infinity in the palm of your hand |
> | Houston, Texas             | And Eternity in an hour. - Blake       |
> -----------------------------------------------------------------------