efficient data loading with Python, is that possible possible?
DouhetSukd
DouhetSukd at gmail.com
Wed Dec 12 22:04:41 EST 2007
About 8 years ago, on PC hardware, I was reading two 5 MB files
and doing a 'fancy' diff between them in about 60 seconds. Granted,
your file is likely bigger, but modern hardware is faster too, and
20 minutes does seem a bit high.
Can't talk about the rest of your code, but some parts of it could be
optimized.
def parseValue(line, col):
    s = line[col.start:col.end+1]
    # no switch in python
    if col.format == ColumnFormat.DATE:
        return Format.parseDate(s)
    if col.format == ColumnFormat.UNSIGNED:
        return Format.parseUnsigned(s)
How about taking the big if clause out? That would require making all
the formatters into functions, rather than in-lining some of them, but
it may clean things up.
# prebuilding a lookup of functions vs. expected formats...
# This is done once.
# Remember, you have to position this dict's computation _after_ all
# the Format.parseXXX declarations. Don't worry, Python _will_ complain
# if you don't.
dict_format_func = {ColumnFormat.DATE: Format.parseDate,
                    ColumnFormat.UNSIGNED: Format.parseUnsigned,
                    ....
                    }
def parseValue(line, col):
    s = line[col.start:col.end+1]
    # get applicable function, apply it to s
    return dict_format_func[col.format](s)
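
For what it's worth, here is a minimal, self-contained sketch of that lookup-table idea, written for a current Python 3. ColumnFormat, Column and the two parse functions are made-up stand-ins, since I haven't seen your real ones; only the dict dispatch is the point.

```python
from collections import namedtuple

# Hypothetical stand-ins for the real ColumnFormat and column objects.
class ColumnFormat:
    UNSIGNED = 1
    STRING = 2

Column = namedtuple("Column", "start end format")

def parse_unsigned(s):
    return int(s)

def parse_string(s):
    return s.rstrip()

# Built once, after the parse functions exist, outside any loop.
FORMAT_FUNCS = {
    ColumnFormat.UNSIGNED: parse_unsigned,
    ColumnFormat.STRING: parse_string,
}

def parse_value(line, col):
    s = line[col.start:col.end + 1]   # 'end' is inclusive, as in the original
    return FORMAT_FUNCS[col.format](s)

line = "00042hello   "
cols = [Column(0, 4, ColumnFormat.UNSIGNED),
        Column(5, 12, ColumnFormat.STRING)]
values = [parse_value(line, c) for c in cols]
```

An unknown format raises KeyError at the lookup, which is arguably nicer than silently falling off the end of a chain of ifs.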
Also...
    if col.format == ColumnFormat.STRING:
        # and-or trick (no x ? y:z in python 2.4)
        return not col.strip and s or rstrip(s)
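Watch out! If 'col.strip' here is the strip _method_ bound to the col
object, rather than a flag you set on the column yourself, then it is
not the result of stripping anything. A bound method is always true,
so 'not col.strip' is always false and you always fall through to
rstrip(s). I get caught by those things all the time :-(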
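
A quick way to convince yourself. The plain string below is only a stand-in for an object that has a strip method (my assumption, not your class), and strip_flag is a hypothetical boolean. From Python 2.5 on there is also a real conditional expression, which avoids the and-or trick altogether.

```python
col = "some column"      # stand-in object that happens to have a .strip method

# A bound method is just an object, and it is always truthy.
method_is_true = bool(col.strip)

# Only calling it gives you a stripped string.
called = "  x  ".strip()

def parse_string(s, strip_flag):
    # Conditional expression (Python 2.5+) instead of "a and b or c".
    return s.rstrip() if strip_flag else s
```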
I agree that taking out the dot.dot.dots would help, but I wouldn't
expect it to matter that much, unless it was in an incredibly tight
loop.
It might be that.
if s.startswith('999999') or s.startswith('000000'): return -1
would be better as...
from time import mktime, strptime

# outside of loop, define a set of values for which you want to
# return -1
set_return = set(['999999', '000000'])

# lookup first 6 chars in your set
def parseDate(s):
    if s[0:6] in set_return:
        return -1
    return int(mktime(strptime(s, "%y%m%d")))
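
Here is the same idea as a standalone sketch you can run on a current Python 3. The names SENTINELS, parse_date and is_sentinel are mine, and I am assuming the input is a 6-character yymmdd string. The second function shows an alternative that keeps startswith: since Python 2.5 it accepts a tuple of prefixes, so the two calls collapse into one.

```python
from time import mktime, strptime

# Built once at import time, not inside the parsing loop.
SENTINELS = frozenset(["999999", "000000"])

def parse_date(s):
    # Set membership on the first 6 characters replaces two startswith calls.
    if s[:6] in SENTINELS:
        return -1
    # mktime interprets the parsed date in the local timezone.
    return int(mktime(strptime(s, "%y%m%d")))

def is_sentinel(s):
    # Alternative: str.startswith with a tuple of prefixes (Python 2.5+).
    return s.startswith(("999999", "000000"))
```

Which of the two is faster on your data is something I would measure with timeit rather than guess.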
Bottom line: Python built-in data objects, such as dictionaries and
sets, are very much optimized. Relying on them, rather than writing a
lot of ifs and doing weird data structure manipulations in Python
itself, is a good approach to try. Try to build those objects outside
of your main processing loops.
Cheers
Douhet-did-suck