efficient data loading with Python, is that possible?
John Machin
sjmachin at lexicon.net
Wed Dec 12 20:27:43 EST 2007
On Dec 13, 11:44 am, igor.tatari... at gmail.com wrote:
> On Dec 12, 4:03 pm, John Machin <sjmac... at lexicon.net> wrote:
>
> > Inside your function
> > [you are doing all this inside a function, not at global level in a
> > script, aren't you?], do this:
> > from time import mktime, strptime # do this ONCE
> > ...
> > blahblah = int(mktime(strptime(s, "%m%d%y%H%M%S")))
>
> > It would help if you told us what platform, what version of Python,
> > how much memory, how much swap space, ...
>
> > Cheers,
> > John
>
> I am using a global 'from time import ...'. I will try to do that within
> the function and see if it makes a difference.
>
> The computer I am using has 8G of RAM. It's a Linux dual-core AMD or
> something like that. Python 2.4
>
> Here is some of my code. Tell me what's wrong with it :)
>
> def loadFile(inputFile, loader):
>     # .zip files don't work with zlib
>     f = popen('zcat ' + inputFile)
>     for line in f:
>         loader.handleLine(line)
> ...
>
> In Loader class:
> def handleLine(self, line):
>     # filter out 'wrong' lines
>     if not self._dataFormat(line): return
>
>     # add a new output record
>     rec = self.result.addRecord()
>
>     for col in self._dataFormat.colFormats:
>         value = parseValue(line, col)
>         rec[col.attr] = value
>
> And here is parseValue (will using a hash-based dispatch make it much
> faster?):
>
> def parseValue(line, col):
>     s = line[col.start:col.end+1]
>     # no switch in python
>     if col.format == ColumnFormat.DATE:
>         return Format.parseDate(s)
>     if col.format == ColumnFormat.UNSIGNED:
>         return Format.parseUnsigned(s)
>     if col.format == ColumnFormat.STRING:
>         # and-or trick (no x ? y:z in python 2.4)
>         return not col.strip and s or rstrip(s)
>     if col.format == ColumnFormat.BOOLEAN:
>         return s == col.arg and 'Y' or 'N'
>     if col.format == ColumnFormat.PRICE:
>         return Format.parseUnsigned(s)/100.
>
> And here is Format.parseDate() as an example:
> def parseDate(s):
>     # missing (infinite) value ?
>     if s.startswith('999999') or s.startswith('000000'): return -1
>     return int(mktime(strptime(s, "%y%m%d")))
>
> Hopefully, this should be enough to tell what's wrong with my code.
>
I have to go out now, so here's a quick overview: too many goddam dots
and too many goddam method calls.
1. Do
       colfmt = col.format # ONCE
       if colfmt == ...
   instead of looking up col.format again in every test.
2. There's no switch, so put the most frequent format at the top.
3. What is ColumnFormat? What is Format? I think you have gone
   class-crazy, and there's more overhead than working code ...
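Here is a rough sketch of points 1 and 2. Everything apart from the shape of
parse_value is a hypothetical stand-in, since the post doesn't show
ColumnFormat, Format or the column objects: the integer format codes, the
Column class, and int() in place of Format.parseUnsigned are all assumptions.
The order of the branches is a guess at which formats are most common, and
date parsing is left out to keep it short.

```python
# Hypothetical module-level constants standing in for ColumnFormat.XXX;
# plain names avoid one attribute lookup per comparison.
UNSIGNED, STRING, BOOLEAN, PRICE = range(4)


class Column(object):
    """Hypothetical column description (not shown in the original post)."""

    def __init__(self, start, end, format, strip=False, arg=None):
        self.start = start
        self.end = end        # inclusive, as in the original slicing
        self.format = format
        self.strip = strip
        self.arg = arg


def parse_value(line, col):
    s = line[col.start:col.end + 1]
    colfmt = col.format  # look the attribute up ONCE

    # Assumed frequency order: most common format first.
    if colfmt == STRING:
        if col.strip:
            return s.rstrip()
        return s
    if colfmt == UNSIGNED:
        return int(s)  # assumption: stands in for Format.parseUnsigned
    if colfmt == PRICE:
        return int(s) / 100.
    if colfmt == BOOLEAN:
        if s == col.arg:
            return 'Y'
        return 'N'
    raise ValueError("unhandled column format: %r" % (colfmt,))
```

Whether this buys much depends on the data, so time it on a real file before
and after.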
Cheers,
John