[Tutor] looking but not finding
o1bigtenor
o1bigtenor at gmail.com
Thu Jul 13 09:02:23 EDT 2023
On Wed, Jul 12, 2023 at 6:51 PM Alan Gauld via Tutor <tutor at python.org> wrote:
>
> This is late at night so I won't reply to all the points
> but I'll pick up a few easier ones! :-)
>
>
> On 12/07/2023 16:26, o1bigtenor wrote:
> > What I'm finding is that the computing world uses terms in a different way than
> > does the sensor world than does the storage world than does the rest of us.
>
> Sadly true. I come from an electronic/telecomms background but
> most of my career was in software engineering. SE is a very new
> field and doesn't have the well established vocabulary of more
> traditional engineering fields. Often one term can have
> multiple meanings depending on context. And other times multiple
> terms are used for the exact same thing.
>
The phrase "divided by a common language" comes to mind.
A pity that practitioners seem to believe that using such arcana makes
them look knowledgeable - - - imo it does the opposite. Clarity promotes
understanding and eases usage - - - and as I think I've seen - - - drives
sales (but maybe not profits).
>
>
> >> from structured files (shelve, JSON, YAML, XML, etc.) to
> >> databases, both SQL and NoSQL based.
> >
> > This is where I start getting 'lost'. I have no idea which of the three
> > listed formats will work well
>
> I actually listed 4! :-)
> shelve is a Python-specific format that will take arbitrary Python
> data structures and save them to a file that can then be accessed like a
> dictionary. It's very easy to use and works well if your data can be
> found via a simple key. For more general use it has issues.
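To make that concrete, here is a minimal shelve sketch; the file name, keys and values are made up for illustration:

```python
# Hypothetical sketch: readings stored in a shelve file, keyed by a
# station id, and read back later as if from a dictionary.
import shelve

# Write a reading (the "readings" file is created on first use).
with shelve.open("readings") as db:
    db["station1"] = {"temp": 21.5, "humidity": 0.44}

# Reopen later and look it up by key.
with shelve.open("readings") as db:
    print(db["station1"]["temp"])   # 21.5
```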
>
> JSON, YAML, XML are all text based files using structured text
> to define the nature of the data. JSON looks a lot like a Python
> dictionary format. XML is good for complex data but is correspondingly
> complex to work with. YAML is somewhere in the middle.
> For your purposes JSON is probably a good choice.
>
> Wikipedia describes all of them in much more detail.
Seems that YAML is a superset of JSON - - - so at this point I'm
thinking of using YAML.
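Either way, the serialisation round-trip looks much the same. A minimal JSON example with the standard library (the reading itself is invented):

```python
# Hypothetical sketch: one sensor reading serialised to JSON text
# and parsed back into a Python dictionary.
import json

reading = {"station": "north", "temp_c": 21.5, "valve_open": True}

text = json.dumps(reading)      # a plain string, easy to log or ship
restored = json.loads(text)     # back to a dict

print(restored["temp_c"])       # 21.5
```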
>
>
> > The first point with the data is to store it. At the time of its collection
> > the data will also be used to trigger other events, so that will (I think
> > that is the proper place) be done in the capturing/organizing/supervising
> > program (in Python).
>
> That raises a whole bunch of other questions such as is the data
> stateful? Does the processing of sample Sn depend on the values
> of Sn-1, Sn-2... If so you need to store a "window" of data in
> memory and then write out the oldest item to storage as it expires.
>
> If not you can probably just read, process, store.
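The "window" idea above can be sketched with a `collections.deque`; the moving-average processing is just a stand-in for whatever depends on Sn-1, Sn-2:

```python
# Hypothetical sketch of a sliding window over samples: process Sn
# using the previous samples, letting the oldest expire automatically.
from collections import deque

window = deque(maxlen=3)        # holds Sn-2, Sn-1, Sn

def process(sample):
    window.append(sample)       # oldest item falls out at maxlen
    if len(window) == window.maxlen:
        # placeholder stateful computation: a simple moving average
        return sum(window) / len(window)
    return None                 # not enough history yet

print(process(10))   # None
print(process(20))   # None
print(process(30))   # 20.0
```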
Found something called MQTT and it would seem that this 'subsystem'
should be useful for what I'm trying to do. Comments?
>
> > honestly cannot tell what makes data
> > 'irregular'
>
> It just means that different data items have different attributes.
> Some fields may be optional and different readings will contain
> different numbers of fields.
> Or the same fields may contain different types of data, for example
> one sensor reports on/off while another gives a magnitude and
> another gives a status message. That would be irregular data
> and SQL databases don't like it much (although there are
> techniques to get round it).
>
> Regular data just means every reading looks the same so you can
> define a table with a set number of columns each of a known type.
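One common way to handle both cases in one table (a sketch, with invented column names) is fixed columns for the regular fields plus a JSON text column for the irregular extras:

```python
# Hypothetical sketch: regular readings fit fixed, typed columns;
# irregular attributes are tucked into a JSON text column.
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE readings
               (station TEXT, ts REAL, value REAL, extra TEXT)""")

# Regular row: every column present, known type.
con.execute("INSERT INTO readings VALUES (?, ?, ?, ?)",
            ("north", 1689200543.0, 21.5, None))

# Irregular row: an on/off sensor that also reports a status message.
con.execute("INSERT INTO readings VALUES (?, ?, ?, ?)",
            ("pump1", 1689200543.5, 1.0,
             json.dumps({"state": "on", "msg": "auto-start"})))

row = con.execute(
    "SELECT extra FROM readings WHERE station = 'pump1'").fetchone()
print(json.loads(row[0])["state"])   # on
```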
Your explanation is what I would have thought, but I've learnt the hard way
that logic and programming don't necessarily work together.
I wonder whatever happened to the concept of mathematically provable
constructs in programming (i.e. the Byte article where programs could be
proven bug-free).
>
> > Now if I only knew what regular and/or irregular data was.
> > Have been considering using Postgresql as a storage engine.
> > AFAIK it has the horsepower to deal with serious large amounts of data.
>
> It will do the job but so would something much lighter like SQLite.
> The really critical factor is how much parallel access you need.
> SQLite is great if only a single program is reading and writing
> the data. Especially if there is only a single thread of execution.
> The server based databases like Postgres come into their own if
> you have multiple clients accessing the database at once. They can
> perform tricks like record level locking (rather than table level
> or even database level) during writes.
As I'm starting with probably over 10 different systems logging, and
future possibilities of perhaps 200, I'm starting with a server-based
database system (PostgreSQL is the plan).
>
> > It's not seriously huge data amounts but it's definitely non-trivial.
>
> Modern storage solutions have made even Gigabytes of data almost
> trivial. Big volumes would be more of an issue if using flat files
> because they need to be read from disk and that takes time for big
> volumes - even using SSDs.
Storage isn't the issue - - - it's making sure that the data is handled well
and accurately.
(I can remember buying a hand-lettered 40 MB HDD that was considered
huge - - - at the time 10 MB was considered decent-sized - - - grin!)
I've already got one at circa 2 TB and another at 6 TB, both RAID 10.
Storage isn't that expensive today.
(I remember working at a facility in early '82 where the 14" HDD was 5 MB - - -
that was HUGE at the time.)
>
> > Concurrent - - - yes, there are multiple sensors per station and
> > multiple stations and hopefully they're all working (problems
> > with the system if they're not!)
>
> You can deal with a relatively small number of sensors in a single
> threaded Python application, just by polling each one in sequence.
> But if the read/process/store time gets bigger then that will limit
> how many cycles you can perform in your 0.5 second window. One option
> is to have multiple threads, each reading a sensor(or sensor group).
> Another is to split the processing out to a separate program that
> reads the recorded data from storage and processes it before issuing
> updates to the sensors as needed.
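The multiple-threads option might be sketched like this; `read_sensor` is a placeholder for real hardware access, and the interval stands in for the 0.5-second window:

```python
# Hypothetical sketch: one thread per sensor, each polling on a fixed
# interval and pushing readings onto a shared queue, so a single
# writer can drain the queue and store them.
import queue
import threading
import time

results = queue.Queue()

def read_sensor(sensor_id):
    # stand-in for real hardware access
    return sensor_id * 1.5

def poll(sensor_id, cycles):
    for _ in range(cycles):
        results.put((sensor_id, read_sensor(sensor_id)))
        time.sleep(0.01)    # poll interval (0.5 s in the real system)

threads = [threading.Thread(target=poll, args=(i, 2)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results.qsize())   # 6 readings collected (3 sensors x 2 cycles)
```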
I am using a number of microcontrollers coupled to one SoC for every
'point', so the issue is more in storing the data and getting it there. That's
why the questions regarding that.
>
> This takes us into the thorny world of systems architecture and
> concurrency management both of which are complex and depend on
> detailed analysis of requirements. Probably at a level beyond
> the scope of this list! (Although that was my day job before
> I retired!)
>
> > There is a small amount of analysis being done as the data is stored
> > but that's to drive the system (i.e. at point x you stop this, at point
> > y you do this, point y does this for z time then a happens, etc. - - - I
> > do not consider that heavy-duty analysis. That happens on a different
> > system (storage happens first to the local governing system).
>
> OK that sounds like the second of my options above. That's a
> perfectly viable approach.
>
> > Once a round of data (one cycle) has completed, that data is shipped in a burst
> > to the long term storage system.
>
> That again is viable but does run the risk of losing all data for the
> cycle if anything breaks. But if the data are interdependent that may
> not matter and may even be a good thing.
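The burst-per-cycle approach maps naturally onto a single transaction; a sketch with SQLite (invented table and readings), where a failure rolls back the whole cycle rather than leaving a partial one:

```python
# Hypothetical sketch: ship one cycle's readings in a single burst
# inside a transaction, so it is stored all-or-nothing.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cycle_data (station TEXT, value REAL)")

cycle = [("north", 21.5), ("south", 19.8), ("pump1", 1.0)]

with con:   # commits on success, rolls back the whole cycle on error
    con.executemany("INSERT INTO cycle_data VALUES (?, ?)", cycle)

print(con.execute("SELECT COUNT(*) FROM cycle_data").fetchone()[0])  # 3
```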
>
> > of information - - - or that's the goal anyway. Maybe one system will store the
> > info and another will do analysis and then return that analysis to the storing
> > system for accumulation - - - also not decided.)
>
> That's fine, but you need very careful analysis to determine the
> architecture, and it will need to be based on throughput
> calculations etc. (This is where the old software engineering
> meme about only considering performance after you find there
> is a problem does not apply. The cost of refactoring a
> whole architecture is very high!)
>
> > what I should be using for what you termed 'structured files'.
>
> The more I read your posts the more I think you should go with a
> database, and probably a server-based one, because I think you'll
> wind up with several independent programs reading/writing data concurrently.
>
> I'm still not 100% sure if it's a SQL or NoSQL solution,
> but my gut says SQL will be adequate.
>
Well - - - I think PostgreSQL should handle what I want it to do adequately.
(Now - - - I'm looking at on-line in-line spectrographic analysis for a
further implementation but that's not today - - - that will drive the data
up considerably!)
Thanking you for your assistance.
Regards