<br><br><div><span class="gmail_quote">On 7/9/07, <b class="gmail_sendername">Torgil Svensson</b> <<a href="mailto:torgil.svensson@gmail.com">torgil.svensson@gmail.com</a>> wrote:</span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Elegant solution. Very readable and takes care of row0 nicely.<br><br>I want to point out that this is much more efficient than my version<br>for random/late string representation changes throughout the<br>conversion but it suffers from 2*n memory footprint and large block
<br>copying if the string rep change arrives very early on huge datasets.</blockquote><div><br>Yep. <br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
I think we can't have best of both and Tim's solution is better in the<br>general case.</blockquote><div><br>It probably would not be hard to do a hybrid version. One issue is that one doesn't, in general, know the size of the dataset in advance, so you'd have to use an absolute criterion (less than 100 lines) instead of a relative criterion (less than 20% done). I suppose you could stat the file or something, but that seems like overkill.
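<br><br>A rough sketch of what such a hybrid might look like, in plain Python with lists standing in for NumPy arrays. The 100-row cutoff (<code>RESTART_LIMIT</code>) and the helper names (<code>guess</code>, <code>widen</code>, <code>convert</code>, <code>load</code>) are made up for illustration and do not come from either posted version. The assumption is that a format change before the cutoff is handled by reconverting the few rows seen so far, while a later change closes the current chunk and starts a new one:

```python
RESTART_LIMIT = 100        # hypothetical absolute cutoff, in rows
ORDER = (int, float, str)  # converters, narrowest to widest


def guess(text):
    """Return the narrowest converter in ORDER that accepts text."""
    for cvt in ORDER:
        try:
            cvt(text)
            return cvt
        except ValueError:
            continue  # str never raises, so the loop always returns


def widen(converters, fields):
    """Widen each column's converter so it also accepts the new field."""
    return [max(old, guess(x), key=ORDER.index)
            for old, x in zip(converters, fields)]


def convert(converters, fields):
    """Apply one converter per column to a row of fields."""
    return tuple(f(x) for f, x in zip(converters, fields))


def load(lines, delim=","):
    """Return (converters, rows, n_restarts, n_chunks) for delimited text."""
    raw_head = []   # raw fields of the first RESTART_LIMIT rows only
    chunks = [[]]   # converted rows; a new list starts at each late change
    converters, n_done, restarts = None, 0, 0
    for line in lines:
        fields = line.strip().split(delim)
        if converters is None:
            converters = [guess(x) for x in fields]
        early = n_done < RESTART_LIMIT
        try:
            row = convert(converters, fields)
        except ValueError:
            converters = widen(converters, fields)
            row = convert(converters, fields)
            if early:
                # Few rows so far: redo them from the raw text, no extra chunk.
                chunks = [[convert(converters, r) for r in raw_head]]
                restarts += 1
            else:
                # Many rows so far: keep them as they are, start a new chunk.
                chunks.append([])
        if early:
            raw_head.append(fields)
        chunks[-1].append(row)
        n_done += 1
    # Assemble: cast every stored row to the final converters.
    rows = [convert(converters, r) for chunk in chunks for r in chunk]
    return converters, rows, restarts, len(chunks)
```

Only the first <code>RESTART_LIMIT</code> raw rows are retained, so the extra memory is bounded regardless of file size. The final cast plays the role of copying the chunks into one array; with real arrays the late-change path would still pay for that copy.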
<br><br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Maybe "use one_alt if rownumber < xxx else use other_alt" can
<br>fine-tune performance for some cases. But even then, with many cols,<br>it's nearly impossible to know.</blockquote><div><br>That sounds sensible. I have an interesting thought on how to do this that's a bit hard to describe. I'll try to throw it together and post another version today or tomorrow.
<br><br> </div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">//Torgil<br><br><br>On 7/9/07, Timothy Hochberg <<a href="mailto:tim.hochberg@ieee.org">
tim.hochberg@ieee.org</a>> wrote:<br>><br>><br>> On 7/8/07, Vincent Nijs <<a href="mailto:v-nijs@kellogg.northwestern.edu">v-nijs@kellogg.northwestern.edu</a>> wrote:<br>> > Thanks for looking into this Torgil! I agree that this is a much more
<br>> > complicated setup. I'll check if there is anything I can do on the data<br>> end.<br>> > Otherwise I'll go with Timothy's suggestion and read in numbers as floats<br>> > and convert to int later as needed.
<br>><br>> Here is a strategy that should allow auto detection without too much in the<br>> way of inefficiency. The basic idea is to convert till you run into a<br>> problem, store that data away, and continue the conversion with a new dtype.
<br>> At the end you assemble all the chunks of data you've accumulated into one<br>> large array. It should be reasonably efficient in terms of both memory and<br>> speed.<br>><br>> The implementation is a little rough, but it should get the idea across.
<br>><br>> --<br>> . __<br>> . |-\<br>> .<br>> . <a href="mailto:tim.hochberg@ieee.org">tim.hochberg@ieee.org</a><br>><br>> ========================================================================
<br>><br>> def find_formats(items, last):<br>> formats = []<br>> for i, x in enumerate(items):<br>> dt, cvt = string_to_dt_cvt(x)<br>> if last is not None:<br>> last_dt, last_cvt = last[i]
<br>> if last_cvt is float and cvt is int:<br>> cvt = float<br>> formats.append((dt, cvt))<br>> return formats<br>><br>> class LoadInfo(object):<br>> def __init__(self, row0):
<br>> self.done = False<br>> self.lastcols = None<br>> self.row0 = row0<br>><br>> def data_iterator(lines, converters, delim, info):<br>> yield tuple(f(x) for f, x in zip(converters,
info.row0.split(delim)))<br>> try:<br>> for row in lines:<br>> yield tuple(f(x) for f, x in zip(converters, row.split(delim)))<br>> except:<br>> info.row0 = row<br>> else:
<br>> info.done = True<br>><br>> def load2(fname,delim = ',', has_varnm = True, prn_report = True):<br>> """<br>> Loading data from a file using the csv module. Returns a recarray.
<br>> """<br>> f=open(fname,'rb')<br>><br>> if has_varnm:<br>> varnames = [i.strip() for i in f.next().split(delim)]<br>> else:<br>> varnames = None
<br>><br>><br>> info = LoadInfo(f.next())<br>> chunks = []<br>><br>> while not info.done:<br>> row0 = info.row0.split(delim)<br>> formats = find_formats(row0, info.lastcols
)<br>> info.lastcols = formats<br>> if varnames is None:<br>> varnames = varnm = ['col%s' % str(i+1) for i, _ in<br>> enumerate(formats)]<br>> descr=[]<br>> conversion_functions=[]<br>> for name, (dtype, cvt_fn) in zip(varnames, formats):
<br>> descr.append((name,dtype))<br>> conversion_functions.append(cvt_fn)<br>><br>> chunks.append(N.fromiter(data_iterator(f, conversion_functions,<br>> delim, info), descr))
<br>><br>> if len(chunks) > 1:<br>> n = sum(len(x) for x in chunks)<br>> data = N.zeros([n], chunks[-1].dtype)<br>> offset = 0<br>> for x in chunks:<br>> delta = len(x)
<br>> data[offset:offset+delta] = x<br>> offset += delta<br>> else:<br>> [data] = chunks<br>><br>> # load report<br>> if prn_report:<br>> print<br>
> "##########################################\n"<br>> print "Loaded file: %s\n" % fname<br>> print "Nr obs: %s\n" % data.shape[0]<br>> print "Variables and datatypes:\n"
<br>> for i in data.dtype.descr:<br>> print "Varname: %s, Type: %s, Sample: %s" % (i[0], i[1],<br>> str(data[i[0]][0:3]))<br>> print<br>> "\n##########################################\n"
<br>><br>> return data<br>><br>> _______________________________________________<br>> Numpy-discussion mailing list<br>> <a href="mailto:Numpy-discussion@scipy.org">Numpy-discussion@scipy.org</a><br>
> <a href="http://projects.scipy.org/mailman/listinfo/numpy-discussion">http://projects.scipy.org/mailman/listinfo/numpy-discussion</a><br>><br>><br></blockquote>
</div><br><br clear="all"><br>-- <br>. __<br>. |-\<br>.<br>. <a href="mailto:tim.hochberg@ieee.org">tim.hochberg@ieee.org</a>