Data type ideas
Jim Dennis
jimd at vega.starshine.org
Mon Apr 1 04:50:24 EST 2002
In article <c76ff6fc.0203301338.1cf07eb0 at posting.google.com>, John Machin wrote:
>"Sean 'Shaleh' Perry" <shalehperry at attbi.com> wrote in message news:<mailman.1017475907.10586.python-list at python.org>...
>>> So what next? Any ideas that I can use?
>> look into pickles. Act like dictionaries but live on the hard drive.
>Do you mean "shelves" instead of "pickles"? Even so, how do they help
>the OP with his problem?
If your data doesn't fit in memory, use disk (or even tape).
So the real question is: how can you structure files on disk to
partition this data into usefully smaller subsets that can each
be processed in memory?
For example, you might create a subdirectory (for the whole
processing job) and a subdirectory for each group or person
(or a pair of directories, persons and groups). Now you can read
through the original file, parsing pieces of it that do fit into
memory and splitting out the subsets (appending persons to the
group files, and groups to the person files). You should then be
able to do a merge sort to get the information you want, organized
according to your needs.
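A minimal sketch of the splitting step, under assumptions of my own: the input is taken to be one "person<TAB>group" pair per line (the real format isn't shown in this thread), and the names are taken to be safe to use as filenames. The function name partition and the directory names are made up for the example.

```python
import os

def partition(source_path, job_dir):
    """Split 'person<TAB>group' lines into one file per person and per group.

    Assumed (hypothetical) input format: one tab-separated pair per line.
    """
    persons_dir = os.path.join(job_dir, "persons")
    groups_dir = os.path.join(job_dir, "groups")
    os.makedirs(persons_dir, exist_ok=True)
    os.makedirs(groups_dir, exist_ok=True)

    with open(source_path) as source:
        # Iterate line by line so the whole file is never held in memory.
        for line in source:
            line = line.rstrip("\n")
            if not line:
                continue
            person, group = line.split("\t")
            # Append the person to the group's file ...
            with open(os.path.join(groups_dir, group), "a") as out:
                out.write(person + "\n")
            # ... and the group to the person's file.
            with open(os.path.join(persons_dir, person), "a") as out:
                out.write(group + "\n")
```

Reopening an output file for every input line is slow but keeps the number of open files small. Each resulting file should be small enough to sort in memory; for already-sorted files, heapq.merge from the standard library can do the merging lazily.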
Alternatively you could import this data into a set of database tables
and then query the database to sort and gather the data.
Which of these approaches is better depends quite a bit on your
background and your needs. If you'll have frequent need of this data,
or you'll have frequent large updates of additional data, the database
approach might be worth the extra investment of your time and energy.
If you are already experienced with RDBMSes (in general) and
have one (like PostgreSQL or MySQL) installed, then it might
be easier to do that than to cook up your own file naming and
directory searching scheme.
From what little I saw of your data it seems to require a
straightforward multi-multi (many-to-many) schema (use three tables: one for
person_id and person name, one for group_id and group name, and
a third to relate person_ids to group_ids). Any elementary text on
SQL database design will cover a "third normal form" multi-multi scheme,
usually with a description involving authors and titles (since that's
the most obvious real world example of entities that can have a
multi-multi relationship, at least to the authors of these texts).
Obviously a title/book can have multiple authors, and any author might
write multiple works. The first "normalization" rule (E.F. Codd) prohibits
us from putting multiple values in any single column (and requires
that all rows have the same number and types of columns), and the
second normal form requires that every non-key attribute depend on the
whole key, so attributes that depend on only part of it go in their
own tables.
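Here is a rough sketch of that three-table layout. It uses the sqlite3 module from current Python standard libraries as a stand-in for PostgreSQL or MySQL, purely so the example is self-contained. The table and column names (person, grp, membership) and the helper functions are my own invention; "grp" is used because GROUP is a reserved word in SQL.

```python
import sqlite3

def build_db(pairs, path=":memory:"):
    """Load (person_name, group_name) pairs into a three-table schema."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS person (
            person_id INTEGER PRIMARY KEY,
            name      TEXT NOT NULL UNIQUE
        );
        CREATE TABLE IF NOT EXISTS grp (
            group_id INTEGER PRIMARY KEY,
            name     TEXT NOT NULL UNIQUE
        );
        -- The linking table: one row per (person, group) relationship.
        CREATE TABLE IF NOT EXISTS membership (
            person_id INTEGER NOT NULL REFERENCES person(person_id),
            group_id  INTEGER NOT NULL REFERENCES grp(group_id),
            PRIMARY KEY (person_id, group_id)
        );
    """)
    for person, group in pairs:
        # INSERT OR IGNORE skips names and links that are already present.
        conn.execute("INSERT OR IGNORE INTO person (name) VALUES (?)", (person,))
        conn.execute("INSERT OR IGNORE INTO grp (name) VALUES (?)", (group,))
        conn.execute(
            """INSERT OR IGNORE INTO membership (person_id, group_id)
               SELECT p.person_id, g.group_id
               FROM person p, grp g
               WHERE p.name = ? AND g.name = ?""",
            (person, group),
        )
    conn.commit()
    return conn

def members_of(conn, group):
    """Return the sorted names of everyone in the named group."""
    rows = conn.execute(
        """SELECT p.name
           FROM person p
           JOIN membership m ON m.person_id = p.person_id
           JOIN grp g ON g.group_id = m.group_id
           WHERE g.name = ?
           ORDER BY p.name""",
        (group,),
    )
    return [name for (name,) in rows]
```

With something like this in place, "which persons are in group X" and "which groups does person Y belong to" are each a single join through the membership table, and the database does the sorting and gathering on disk for you.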
However, it seems likely that you would have already thought of the
RDBMS approach if you were familiar with using SQL and database tables.
Since you aren't, the learning curve may not be worth it -- especially if
this is just a one-shot job.