Populating a dictionary, fast
Steven D'Aprano
steve at REMOVE-THIS-cybersource.com.au
Sat Nov 10 17:46:47 EST 2007
On Sat, 10 Nov 2007 13:56:35 -0800, Michael Bacarella wrote:
> The id2name.txt file is an index of primary keys to strings. They look
> like this:
>
> 11293102971459182412:Descriptive unique name for this record\n
> 950918240981208142:Another name for another record\n
>
> The file's properties are:
>
> # wc -l id2name.txt
>
> 8191180 id2name.txt
> # du -h id2name.txt
> 517M id2name.txt
>
> I'm loading the file into memory with code like this:
>
> id2name = {}
> for line in iter(open('id2name.txt').readline, ''):
>     id, name = line.strip().split(':')
>     id = long(id)
>     id2name[id] = name
That's an awfully complicated way to iterate over a file. Try this
instead:
id2name = {}
for line in open('id2name.txt'):
    id, name = line.strip().split(':')
    id = long(id)
    id2name[id] = name
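For readers on Python 3, a minimal sketch of the same loader (using hypothetical sample data, not the poster's real file): `int` replaced Python 2's `long`, and `str.partition` splits on only the first ':' in case a name contains colons.

```python
import tempfile

# Two sample records in the id:name format from the post (made-up data).
sample = (
    "11293102971459182412:Descriptive unique name for this record\n"
    "950918240981208142:Another name for another record\n"
)

with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write(sample)
    path = f.name

# Same loop as above, in Python 3: iterate the file object directly.
id2name = {}
with open(path) as f:
    for line in f:
        key, _, name = line.rstrip('\n').partition(':')
        id2name[int(key)] = name
```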
On my system, it takes about a minute and a half to produce a dictionary
with 8191180 entries.
> This takes about 45 *minutes*
>
> If I comment out the last line in the loop body it takes only about 30
> _seconds_ to run. This would seem to implicate the line id2name[id] =
> name as being excruciatingly slow.
No, dictionary access is one of the most highly-optimized, fastest, most
efficient parts of Python. What it indicates to me is that your system is
running low on memory, and is struggling to find room for 517MB worth of
data.
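One way to see why 517MB on disk can strain memory is to gauge the dict's own per-entry overhead with `sys.getsizeof`; a rough sketch (exact figures vary by Python version and platform, and this deliberately excludes the key and value objects themselves):

```python
import sys

# Build a dict of large-int keys to short strings, mimicking the
# id:name data, and estimate the hash table's cost per entry.
d = {}
for i in range(100_000):
    d[11293102971459182412 + i] = "Descriptive unique name for this record"

# getsizeof counts only the dict's internal table, not keys or values,
# so the true in-memory footprint is larger still.
per_entry = sys.getsizeof(d) / len(d)
print(f"dict table overhead: roughly {per_entry:.0f} bytes per entry")
```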
> Is there a fast, functionally equivalent way of doing this?
>
> (Yes, I really do need this cached. No, an RDBMS or disk-based hash is
> not fast enough.)
You'll pardon me if I'm skeptical. Considering the convoluted, weird way
you had to iterate over a file, I wonder what other less-than-efficient
parts of your code you are struggling under. Nine times out of ten, if a
program runs too slowly, it's because you're using the wrong algorithm.
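The two iteration styles from the thread can be compared directly with `timeit`; a small self-contained sketch on made-up data (results depend on the machine, so no ordering is asserted here, only that both produce the same dictionary):

```python
import tempfile
import timeit

# Write a small sample file in the id:name format (hypothetical data).
tmp = tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False)
tmp.write("123:some name\n" * 10_000)
tmp.close()
fname = tmp.name

def readline_style():
    # The original poster's convoluted iteration via iter(f.readline, '').
    d = {}
    for line in iter(open(fname).readline, ''):
        k, _, v = line.rstrip('\n').partition(':')
        d[int(k)] = v
    return d

def plain_style():
    # Idiomatic buffered iteration over the file object.
    d = {}
    for line in open(fname):
        k, _, v = line.rstrip('\n').partition(':')
        d[int(k)] = v
    return d

t_readline = timeit.timeit(readline_style, number=5)
t_plain = timeit.timeit(plain_style, number=5)
print(f"readline style: {t_readline:.3f}s, plain iteration: {t_plain:.3f}s")
```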
--
Steven.