Data type ideas

Mon Apr 1 04:50:24 EST 2002

In article <c76ff6fc.0203301338.1cf07eb0 at posting.google.com>, John Machin wrote:
>"Sean 'Shaleh' Perry" <shalehperry at attbi.com> wrote in message news:<mailman.1017475907.10586.python-list at python.org>...

>>> So what next?  Any ideas that I can use?  
>> look into pickles.  Act like dictionaries but live on the hard drive.

>Do you mean "shelves" instead of "pickles"? Even so, how do they help
>the OP with his problem?

 If your data doesn't fit in memory, use disk (or even tape).
 So the real question is, how can you structure files on disk to
 partition this data into usefully smaller subsets that can each
 be processed.  

 For example, you might create a directory (for the whole 
 processing job) and, within it, a subdirectory for each group or
 person (or a pair of directories, persons and groups).  Now you can search
 through the original file, parsing parts of it that do fit into 
 memory and splitting out the subsets (appending persons to the 
 group files, and groups to the person files).  Now you should be
 able to do a merge sort to get the information you want, organized
 according to your needs.
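
 A minimal sketch of that splitting pass, in Python.  I haven't seen
 the real file format, so this assumes the input has already been
 parsed into (person, group) string pairs and that both names are safe
 to use as filenames.  The function name partition(), the directory
 names "persons" and "groups", and the batch_size parameter are all
 made up for illustration.

```python
import os
from collections import defaultdict


def partition(pairs, workdir, batch_size=100_000):
    """Split (person, group) pairs into one file per person and per group.

    Assumption: `pairs` is any iterable of (person, group) strings, and
    both strings are usable as filenames as-is.
    """
    persons_dir = os.path.join(workdir, "persons")
    groups_dir = os.path.join(workdir, "groups")
    os.makedirs(persons_dir, exist_ok=True)
    os.makedirs(groups_dir, exist_ok=True)

    by_person = defaultdict(list)  # person -> groups seen in this batch
    by_group = defaultdict(list)   # group  -> persons seen in this batch

    def flush():
        # Append the buffered batch to the per-key files, then free the memory.
        for directory, buckets in ((persons_dir, by_person),
                                   (groups_dir, by_group)):
            for key, values in buckets.items():
                with open(os.path.join(directory, key + ".txt"), "a") as f:
                    f.writelines(value + "\n" for value in values)
            buckets.clear()

    pending = 0
    for person, group in pairs:
        by_person[person].append(group)
        by_group[group].append(person)
        pending += 1
        if pending >= batch_size:  # only this many pairs are held in memory
            flush()
            pending = 0
    flush()  # write whatever is left over
```

 Each file under groups/ then lists the members of one group, and each
 file under persons/ lists the groups of one person.  Any single file
 should be small enough to sort or merge in memory.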

 Alternatively you could import this data into a set of database tables
 and then query the database to sort and gather the data.

 Which of these approaches is better depends quite a bit on your 
 background and your needs.  If you'll have frequent need of this data,
 or you'll have frequent large updates of additional data, the database
 approach might be worth the extra investment of your time and energy.
 If you were particularly experienced with RDBMSes (in general) and
 had one (like PostgreSQL or MySQL) already installed then it might
 be easier to do that than to cook up your own filenaming and 
 directory searching scheme.

 From what little I saw of your data it seems to require a 
 straightforward multi-multi (many-to-many) schema (use three tables,
 one for person_id and person name, one for group_id and group name, and
 a third to relate person_ids to group_ids).  Any elementary text on
 SQL database design will cover a "third normal form" multi-multi scheme,
 usually with a description involving authors and titles (since that's
 the most obvious real world example of entities that can have a
 multi-multi relationship, at least to the authors of these texts).
 Obviously a title/book can have multiple authors, and any author might
 write multiple works.  The first "normalization" rule (E.F. Codd) prohibits
 us from putting multiple values in any single column (and requires
 that all rows have the same number and types of columns), and the
 second normal form requires that we put attributes that don't depend
 on the whole key in their own tables.
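
 To make the three-table layout concrete, here is a rough sketch using
 Python's sqlite3 module.  SQLite is swapped in here only so the example
 runs without a server; the same tables would work in PostgreSQL or
 MySQL.  The table names (person, grp, membership) and the helper
 functions add_membership() and members_of() are hypothetical, not
 anything taken from the original data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: use a file path for real data
conn.executescript("""
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL UNIQUE
);
CREATE TABLE grp (                    -- "group" is a reserved word in SQL
    group_id  INTEGER PRIMARY KEY,
    name      TEXT NOT NULL UNIQUE
);
CREATE TABLE membership (             -- relates person_ids to group_ids
    person_id INTEGER NOT NULL REFERENCES person(person_id),
    group_id  INTEGER NOT NULL REFERENCES grp(group_id),
    PRIMARY KEY (person_id, group_id)
);
""")


def add_membership(conn, person, group):
    """Record that `person` belongs to `group`, creating either if needed."""
    conn.execute("INSERT OR IGNORE INTO person (name) VALUES (?)", (person,))
    conn.execute("INSERT OR IGNORE INTO grp (name) VALUES (?)", (group,))
    conn.execute(
        """INSERT OR IGNORE INTO membership (person_id, group_id)
           SELECT p.person_id, g.group_id
           FROM person p, grp g
           WHERE p.name = ? AND g.name = ?""",
        (person, group),
    )


def members_of(conn, group):
    """Return the names of everyone in `group`, sorted alphabetically."""
    rows = conn.execute(
        """SELECT p.name
           FROM person p
           JOIN membership m ON m.person_id = p.person_id
           JOIN grp g ON g.group_id = m.group_id
           WHERE g.name = ?
           ORDER BY p.name""",
        (group,),
    )
    return [name for (name,) in rows]
```

 Names are stored once in person and grp; the membership table holds
 only pairs of ids, so a person in many groups (or a group with many
 persons) costs one small row per relationship.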

 However, it seems likely that you would have already thought of the 
 RDBMS approach if you were familiar with using SQL and database tables.
 Since you aren't, the learning curve may not be worth it -- especially if
 this is just a one-shot job.