Transforming ASCII file (pseudo database) into proper database

p. ppetrick at gmail.com
Mon Jan 21 23:41:18 CET 2008


So in answer to some of the questions:
- There are about 15 files, each roughly representing a table.
- Within the files, each line represents a record.
- The formatting for the lines is like so:

File1:
someval1|ID|someval2|someval3|etc.

File2:
ID|someval1|someval2|someval3|etc.

Where ID is the one and only value linking "records" from one file to
"records" in another file - moreover, as far as I can tell, the
relationships are all 1:1 (or 1:0) (I don't have the full dataset yet,
just a sampling, so I'm flying a bit in the dark).
- I believe that individual "records" within each of the files are
unique with respect to the identifier (again, not certain because I'm
only working with sample data).
- As the example shows, the position of the ID is not the same for all
files.
- I don't know how big N is since I only have a sample to work with,
and probably won't get the full dataset anytime soon. (Let's just take
it as a given that I won't get that information until AFTER a first
implementation...politics.)
- I don't know how many identifiers there are either, although the
count has to be at least as large as the number of lines in the
largest file (again, I don't have the actual data yet).
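A minimal sketch of reading one of these files, assuming the layout
described above holds for the full dataset. The ID_COLUMN mapping and
the helper names are hypothetical, and the column positions are only
taken from the File1/File2 examples. The second helper checks the
"unique per file" assumption, which is worth doing once real data
arrives.

```python
from collections import Counter

# Hypothetical: zero-based position of the ID column in each file,
# taken from the File1/File2 examples above.
ID_COLUMN = {"File1": 1, "File2": 0}


def iter_records(lines, id_index):
    """Yield (record_id, other_fields) for each pipe-delimited line."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("|")
        record_id = fields[id_index]
        # Everything except the ID, in the original order.
        rest = fields[:id_index] + fields[id_index + 1:]
        yield record_id, rest


def duplicate_ids(lines, id_index):
    """Return the sorted IDs that appear on more than one line."""
    counts = Counter(rid for rid, _ in iter_records(lines, id_index))
    return sorted(rid for rid, n in counts.items() if n > 1)
```

Since an open file object iterates line by line, something like
iter_records(open(path), ID_COLUMN["File1"]) should stream through a
file without reading it all at once. Note that duplicate_ids still
keeps one counter entry per distinct ID in memory.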

So as an exercise, let's assume an 800MB file, each line of data taking
up roughly 150B (guesstimate - based on examination of sample data)...so
roughly 5.3 million unique IDs.
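The arithmetic behind that figure, as a quick check. Both inputs are
the guesses stated above, and "MB" is assumed to mean decimal
megabytes.

```python
FILE_SIZE_BYTES = 800 * 10**6   # assumed: 800 MB, decimal megabytes
BYTES_PER_LINE = 150            # guesstimate from the sample data

# One record per line, one ID per record (if IDs really are unique).
estimated_lines = FILE_SIZE_BYTES // BYTES_PER_LINE
print(f"{estimated_lines:,}")   # 5,333,333
```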

With that size, I'll have to load them into a temp db. I just can't see
holding that much data in memory...
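One way the temp db step could look, sketched with the standard
library's sqlite3 module. This is an assumption about tooling, not a
settled choice. The load_table helper, the table names and the generic
f0, f1, ... column names are all hypothetical. Records are taken as
(id, other_fields) pairs, and the ID is made the primary key, so a
duplicate ID would raise an error at load time.

```python
import sqlite3


def load_table(conn, table, records, n_fields):
    """Load (record_id, other_fields) pairs into a table keyed by ID.

    `table` is interpolated into the SQL, so it must be a trusted name.
    """
    cols = ", ".join(f"f{i} TEXT" for i in range(n_fields))
    conn.execute(f"CREATE TABLE {table} (id TEXT PRIMARY KEY, {cols})")
    placeholders = ", ".join("?" * (n_fields + 1))
    # executemany accepts any iterable, so rows can be streamed in.
    conn.executemany(
        f"INSERT INTO {table} VALUES ({placeholders})",
        ([rid, *rest] for rid, rest in records),
    )
    conn.commit()


# ":memory:" is only for this tiny demo; for the real data a file path
# would be passed instead so the rows live on disk.
conn = sqlite3.connect(":memory:")
load_table(conn, "file1", [("42", ["a", "b"]), ("43", ["c", "d"])], 2)
load_table(conn, "file2", [("42", ["x", "y"])], 2)

# LEFT JOIN keeps file1 rows that have no partner in file2 (the 1:0 case).
rows = conn.execute(
    "SELECT file1.id, file1.f0, file2.f0 FROM file1 "
    "LEFT JOIN file2 ON file2.id = file1.id ORDER BY file1.id"
).fetchall()
print(rows)
```

With the ID as primary key on every table, the 1:1 (or 1:0) links
between files become ordinary joins, and the uniqueness assumption gets
checked as a side effect of loading.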


