optimizing memory utilization

Anon anon at ymous.com
Tue Sep 14 06:39:49 CEST 2004

Hello all,

I'm hoping for some guidance here...  I am a c/c++ "expert", but a
complete python virgin.  I'm trying to create a program that loads in the
entire FreeDB database (excluding the CDDBID itself) and uses this
"database" for other subsequent processing.  The problem is, I'm running
out of memory on a Linux RH8 box with 512MB.  The FreeDB database that I'm
trying to load presently consists of two "CSV" files.  The first file
contains a "list" of albums with artist name, an arbitrary sequential
album ID, and the CDDBID (an ASCII-hex representation of a 32-bit value).
The second file contains a list of all of the tracks on each of the
albums, cross-referenced via the album ID.  When I load into memory, I
create a python list where each entry in the list is itself a list
representing the data for a given album.  The album data list consists of
a small handful of text items like the album title, author, genre, and
year, as well as a list which itself contains a list for each of the
tracks on the album.

[[<Alb1ID#>, '<Alb1Artist>', '<Alb1Title>', '<Alb1Genre>', '<Alb1Year>',
  [["Track1", 1], ["Track2", 2], ["Track3", 3], ..., ["TrackN", N]]],
 [<Alb2ID#>, '<Alb2Artist>', '<Alb2Title>', '<Alb2Genre>', '<Alb2Year>',
  [["Track1", 1], ["Track2", 2], ["Track3", 3], ..., ["TrackN", N]]],
 ...,
 [<AlbNID#>, '<AlbNArtist>', '<AlbNTitle>', '<AlbNGenre>', '<AlbNYear>',
  [["Track1", 1], ["Track2", 2], ["Track3", 3], ..., ["TrackN", N]]]]

So the problem I'm having is that I want to load it all in memory (the two
files total about 250MB of raw data) but just loading the first 50,000
lines of tracks (about 25MB of raw data) consumes 75MB of RAM.  If the
approximation is fairly accurate, I'd need >750MB of available RAM just to
load my in-memory database.
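That ">750MB" figure is just a linear scale-up of the observed blow-up
factor:

```python
# Observed: ~25 MB of raw track data becomes ~75 MB resident,
# i.e. roughly a 3x overhead from Python object headers and pointers.
overhead = 75 / 25
projected = 250 * overhead  # full dataset is ~250 MB of raw data
print(projected)            # 750.0
```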

The bottom line is: is there a more memory-efficient way to load all this
arbitrary-field-length, arbitrary-field-count data into RAM?  I can
already see that creating a separate list for the group of tracks on an
album is probably wasteful versus just appending them to the album list,
but I doubt that alone will yield the desired level of memory savings.
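Two generic things I'm wondering about (sketched below, not measured
against the real data): using tuples instead of lists for the fixed-shape
records, and interning the strings that repeat across many albums, such as
genre names:

```python
import sys

# A tuple has a smaller header than a list of the same length and no
# growth headroom, so fixed-shape records cost less per object.
track_as_list = ["Track1", 1]
track_as_tuple = ("Track1", 1)
print(sys.getsizeof(track_as_list), sys.getsizeof(track_as_tuple))

# Interning repeated strings makes every occurrence share one object
# instead of storing thousands of duplicate copies:
g1 = sys.intern("Rock")
g2 = sys.intern("Rock")
print(g1 is g2)  # True
```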

Any data structure suggestions for this application?  BTW, later
accesses to this database would not benefit in any way from being
presorted, so no need to warn me in advance about concepts like
presorting the albums list to facilitate faster look-up later...

