[Tutor] How best to structure a plain text data file for use in program(s) and later updating with new data?

Joel Goldstick joel.goldstick at gmail.com
Wed Oct 8 17:02:42 CEST 2014


On Wed, Oct 8, 2014 at 10:56 AM, boB Stepp <robertvstepp at gmail.com> wrote:
> About two years ago I wrote my most ambitious program to date, a
> hodge-podge collection of proprietary scripting, perl and shell files
> that collectively total about 20k lines of code. Amazingly it actually
> works and has saved my colleagues and me much time and effort. At the
> time I created this mess, I was playing "guess the correct proprietary
> syntax to do something" and "hunt and peck perl" games and squeezing
> this programming work into brief snippets of time away from what I am
> actually paid to do. I did not give much thought to design at the time
> and knew I would regret it later, which is now today! So now in my
> current few snippets of time I wish to redesign this program from
> scratch and make it much, ... , much easier to maintain the code and
> update the data tables, which change from time to time. And now that I
> have some version of python available on all of our current Solaris 10
> systems (python versions 2.4.4 and 2.6.4), it seems like a fine time
> to (finally!) do some serious python learning.
>
> Right now I have separated my data into their own files. Previously I
> had integrated the data with my source code files (Horrors!).
> Currently, a snippet from one of these data files is:
>
> NUMBER_FX:ONE; DATA_SOURCE:Timmerman; RELEASE_DATE:(11-2012);
>
> SERIAL_ROI:Chiasm; TEST_VOLUME:< 0.2 cc; VOLUME_MAX_GY:8.0;
> MAX_PT_DOSE_GY:10.0; MAX_MEAN_DOSE: ;
> SERIAL_ROI:Optic_Nerve_R; TEST_VOLUME:< 0.2 cc; VOLUME_MAX_GY:8.0;
> MAX_PT_DOSE_GY:10.0; MAX_MEAN_DOSE: ;
> SERIAL_ROI:Optic_Nerve_L; TEST_VOLUME:< 0.2 cc; VOLUME_MAX_GY:8.0;
> MAX_PT_DOSE_GY:10.0; MAX_MEAN_DOSE: ;
>
> [...]
>
> PARALLEL_ROI:Lungs_Bilateral; CRITICAL_VOLUME_CC:1500.0;
> CRITICAL_VOLUME_DOSE_MAX_GY:7.0; V8GY: ; V20GY: ; MAX_MEAN_DOSE: ;
> PARALLEL_ROI:Lungs_Bilateral; CRITICAL_VOLUME_CC:1000.0;
> CRITICAL_VOLUME_DOSE_MAX_GY:7.6; V8GY:< 37.0%; V20GY: ; MAX_MEAN_DOSE:
> ;
> PARALLEL_ROI:Liver; CRITICAL_VOLUME_CC:700.0;
> CRITICAL_VOLUME_DOSE_MAX_GY:11.0; V8GY: ; V20GY: ; MAX_MEAN_DOSE: ;
> PARALLEL_ROI:Renal_Cortex_Bilateral; CRITICAL_VOLUME_CC:200.0;
> CRITICAL_VOLUME_DOSE_MAX_GY:9.5; V8GY: ; V20GY: ; MAX_MEAN_DOSE: ;
> [EOF]
>
> I just noticed that copying from my data file into my Google email
> resulted in all extra spaces being condensed into a single space. I do
> not know why this has just happened. Note that there are no tab
> characters. The [...] indicates omitted lines of serial tissue data
> and [EOF] just notes the end-of-file.
>
> I am far from ready to write any code at this point. I am trying to
> organize my data files so that they will be easy for the programs
> that process the data to use and also easy to update
> every time these data values get improved upon. For the latter, I
> envision writing a second program to enable anyone to update the data
> tables when we are given new values. But until that second program
> gets written, the data files would have to be opened and edited
> manually, which is why I have labels included in all-caps ending in a
> colon. This is so the editor will know what he is editing. So,
> basically the actual data fields fall between ":" and ";" . String
> representations of numbers will need to get converted to floats by the
> program. Some fields containing numbers are of a form like "< 0.2 cc"
> . These will get copied as is into a GUI display, while the "0.2" will
> be used in a computation and/or comparison. Also notice that in each
> data file there are two distinct groupings of records--one for serial
> tissue (SERIAL_ROI:) and one for parallel tissue (PARALLEL_ROI:). The
> fields used are different for each grouping. Also, notice that some
> fields will have no values, but in other data files they will have
> values. And finally the header line at the top of the file identifies
> the number of fractions (FX) for which the data is to be used, as well
> as the source of the data and the date that the data was released by
> that source.
>
> Finally the questions! Will I easily be able to use python to parse
> this data as currently structured, or do I need to restructure this? I
> am not at the point where I am aware of what possibilities python
> offers to handle these data files. Also, my efforts to search the 'net
> did not turn up anything that really clicked for me as the way to go.
> I could not seem to come up with a search string that would bring up
> what I was really interested in: What are the best practices for
> organizing plain text data?
>
> Thanks!
>
> --
> boB

It looks like you have CSV-like data, except that you have a semicolon
as a separator. Look at the csv module, which accepts a custom
delimiter. That should work for you.
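
Here is a minimal sketch of that idea, not a finished design. It assumes
Python 3 (the 2.4/2.6 interpreters mentioned above would need small
changes) and assumes each record sits on one physical line in the real
file, with the wrapping shown above coming from the mail client. The
names SAMPLE, parse_records and first_number are made up for
illustration. csv.reader splits each line on ";", and each field is then
split on its first ":" into a label and a value.

```python
import csv
import io
import re

# Sample text modeled on the data quoted above (one record per line is
# an assumption about the real file).
SAMPLE = """\
NUMBER_FX:ONE; DATA_SOURCE:Timmerman; RELEASE_DATE:(11-2012);

SERIAL_ROI:Chiasm; TEST_VOLUME:< 0.2 cc; VOLUME_MAX_GY:8.0; MAX_PT_DOSE_GY:10.0; MAX_MEAN_DOSE: ;
PARALLEL_ROI:Liver; CRITICAL_VOLUME_CC:700.0; CRITICAL_VOLUME_DOSE_MAX_GY:11.0; V8GY: ; V20GY: ; MAX_MEAN_DOSE: ;
"""

# Matches the first plain decimal number in a string such as "< 0.2 cc".
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def parse_records(text):
    """Return one dict per non-blank line, mapping label -> raw string value."""
    records = []
    reader = csv.reader(io.StringIO(text), delimiter=";")
    for row in reader:
        record = {}
        for field in row:
            field = field.strip()
            if not field:
                continue  # skip the empty piece after the trailing ";"
            label, _, value = field.partition(":")
            record[label.strip()] = value.strip()
        if record:
            records.append(record)
    return records


def first_number(value):
    """Return the first number found in value as a float, or None if absent."""
    match = NUMBER.search(value)
    return float(match.group()) if match else None


records = parse_records(SAMPLE)
for record in records:
    print(record)
```

The raw strings stay available for display in a GUI, while
first_number(record["TEST_VOLUME"]) gives the float to compute with.
Empty fields come back as empty strings, for which first_number returns
None.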

-- 
Joel Goldstick
http://joelgoldstick.com

