[Tutor] Building dictionary from large txt file

Tue Jul 26 21:11:01 EDT 2022

Sorry, Alan, I found that part quite clear. Then again, one of my degrees is a B.S. in Chemistry. No idea why I ever did that as I have never had any use for it, well, except in Med School, which I also wonder why ...

Bob may not have been clear, but what he is reading in is basically a table of atomic elements starting with the Atomic Number (number of Protons so Hydrogen is 1 and Helium is 2 and so on). Many elements come in a variety of isotopes meaning the number of neutrons varies so Hydrogen can be a single proton or have one neutron too for Deuterium or 2 for tritium. The mass number is loosely how many nucleons it has as they are almost the same mass. 

He wants the key to be the concatenation of the Atomic Symbol (which is always one or two letters, like H or He) with, I believe, the mass number. Thus he should make H1 in his one and only example (albeit you can look up things like a periodic table to see others in whatever format). That field clearly should be text and used as a unique key.

He then wants the values to be another dictionary, where the many second-level dictionaries contain keys of 'Z', 'A', 'm' and whatever his etc. is. The Atomic mass is obvious but not really, as it depends on the mix of isotopes in a sample. Hydrogen normally is predominantly the single proton version, so the atomic weight is very close to one. But if you extracted out almost pure Deuterium, it would be about double. Whatever it is, he needs to extract a value like this "1.00782503223(9)" and either toss the uncertainty digit in parentheses, parens and all, or include it, and then convert the result into a FLOAT of some kind.
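Tossing the parenthesized part could look something like the sketch below. The function name parse_measured is made up for illustration, and it assumes everything from the first "(" onward is uncertainty that can be discarded.

```python
def parse_measured(text):
    """Return the number before any parenthesized uncertainty as a float."""
    # "1.00782503223(9)" -> "1.00782503223"; text without "(" passes through
    number_part = text.split("(", 1)[0]
    return float(number_part.strip())


mass = parse_measured("1.00782503223(9)")
abundance = parse_measured("0.999885(70)")
print(mass, abundance)
```

The same call also copes with a value like '295.21624(69#)', since the '#' sits inside the parentheses that get cut off.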

Isotopic Composition is sort of clear (as mud). As I mentioned, there are more than two isotopes, albeit Tritium does not last long before breaking down, so here it means the H-1 version is 0.999885(70), or way over 99% of the sample, with a tiny bit of H-2 or Deuterium. (Sorry, it is hard to write anything in text mode when subscripts and superscripts are used on both the left and right of symbols.) I am not sure how this scales up when many other elements have many isotopes, including stable ones, but I assume it means the primary is some percent of the total. Chlorine, for example, has over two dozen known isotopes and an Atomic Weight in the neighborhood of 35 1/2, as nothing quite dominates.

And since samples vary in percentage composition, the Atomic Weight is shown as some kind of range in:

Standard Atomic Weight = [1.00784,1.00811]

He needs to extract the two numbers using whatever techniques he wants and either record both as a tuple or list (after converting perhaps to float) or take an average or whatever the assignment requires.
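One possible way to pull out the two numbers, as a rough sketch only. The name parse_weight_range is invented here, and the code assumes the value always has the bracketed "[low,high]" shape of the one example shown; anything else would raise a ValueError.

```python
def parse_weight_range(text):
    """Turn '[low,high]' into a (low, high) tuple of floats."""
    inner = text.strip().strip("[]")
    low, high = (float(part) for part in inner.split(","))
    return low, high


low, high = parse_weight_range("[1.00784,1.00811]")
midpoint = (low + high) / 2   # only if an average is what is wanted
print(low, high)
```

Keeping the pair as a tuple loses nothing, so averaging can be left until the requirement is known.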

I have no idea if Notes matters, as he stopped explaining what he wants his output to be, BUT he should know it may be a pain to deal with the split text, as it may show up as multiple items in his list of tokens.

But as I wrote earlier, his main request was to ask why his badly formatted single dictionary gets overwritten and the answer is because he does that instead of adding it to an outer dictionary first and then starting over.
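A small sketch of the difference, using made-up records rather than his file. The first loop keeps refilling one inner dictionary, so only the last record survives; the second stores each inner dictionary in an outer one and then starts over with a fresh one.

```python
# Hypothetical (key, Z, A) records standing in for parsed file entries.
records = [("H1", 1, 1), ("D2", 1, 2)]

# Refilling a single dictionary: each pass overwrites the previous values.
single = {"Z": 0, "A": 0}
for key, z, a in records:
    single["Z"] = z
    single["A"] = a

# Adding to an outer dictionary first, then starting over with a new inner one.
outer = {}
for key, z, a in records:
    inner = {"Z": z, "A": a}   # a new dictionary for every record
    outer[key] = inner

print(single)
print(outer)
```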

So the rest of your comments do apply. Just satisfying your curiosity and if I am wrong, someone feel free to correct me.

-----Original Message-----
From: Tutor <tutor-bounces+avi.e.gross=gmail.com at python.org> On Behalf Of Alan Gauld
Sent: Tuesday, July 26, 2022 8:34 PM
To: tutor at python.org
Subject: Re: [Tutor] Building dictionary from large txt file

On 26/07/2022 21:58, bobx ander wrote:
> Hi all,
> I'm trying to build a dictionary from a rather large file of following 
> format after it has being read into a list(excerpt from start of list 
> below)
> --------
>
> Atomic Number = 1
>     Atomic Symbol = H
>     Mass Number = 1
>     Relative Atomic Mass = 1.00782503223(9)
>     Isotopic Composition = 0.999885(70)
>     Standard Atomic Weight = [1.00784,1.00811]
>     Notes = m
> --------
>
> My goal is to extract the content into a dictionary that displays each 
> unique triplet as indicated below
> {'H1': {'Z': 1,'A': 1,'m': 1.00782503223},
>               'D2': {'Z': 1,'A': 2,'m': 2.01410177812}
>                ...} etc

Unfortunately to those of us unfamiliar with your data that is as clear as mud.

You refer to a triplet but your sample file entry has 7 fields, some of which have multiple values. Where is the triplet among all that data?

Then you show us a dictionary with keys that do not correspond to any of the fields in your data sample. How do the fields correspond - the only "obvious" one is the mass which evidently corresponds with the key 'm'.

But what are H1 and D2? Another file record or some derived value from the record shown above? Similarly for Z, A and m. How do they relate to the data?

You need to specify your requirement more explicitly for us to be sure we are giving valid advice.

> My code that I have attempted is as follows:
>
> filename='ex.txt'
>
> afile=open(filename,'r') #opens the file
> content=afile.readlines()
> afile.close()

You probably don't need to read the file into a list if you are going to process it line by line. Just read the lines from the file and process them as you go.
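As a rough illustration, a file object can be looped over directly, one line at a time. The temporary file below is only a stand-in for ex.txt so that the sketch runs on its own, and the names sample and fields are invented for the example.

```python
import os
import tempfile

sample = "Atomic Number = 1\n    Atomic Symbol = H\n    Mass Number = 1\n"

# Write a tiny stand-in for ex.txt so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    filename = tmp.name

fields = []
with open(filename) as afile:   # the file is closed when the block ends
    for line in afile:          # one line at a time, no readlines() needed
        line = line.strip()
        if line:
            fields.append(line)

os.remove(filename)
print(fields)
```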

> isotope_data={'Z':0,'A':0,'m':0}#start to create subdictionary for each case of atoms with its unique keys and values
> for line in content:
>     data=line.strip().split()
>
>     if len(data)<1:
>         pass
>     elif data[0]=="Atomic" and data[1]=="Number":
>         atomic_number=data[3]
>
>
>     elif data[0]=="Mass" and data[1]=="Number":
>         mass_number=data[3]
>
>
>
>     elif data[0]=="Relative" and data[1]=="Atomic" and data[2]=="Mass":
>         relative_atomic_mass=data[4]
>
Rather than split the line then compare each field it might be easier (and more readable) to compare the full strings using the startswith() method then split the string:

for line in file:
    # compare the start of the stripped line, then split it into tokens
    if line.strip().startswith("Atomic Number"):
        atomic_number = line.strip().split()[3]
    etc...
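A possible variation, offered only as a sketch: splitting each line once on the "=" sign with str.partition() gives the field name and the value without counting tokens. The helper name split_field is invented for this example.

```python
def split_field(line):
    """Split 'Name = value' into (name, value); return None if there is no '='."""
    name, sep, value = line.partition("=")
    if not sep:
        return None   # blank line or separator line
    return name.strip(), value.strip()


print(split_field("    Relative Atomic Mass = 1.00782503223(9)"))
print(split_field("Atomic Number = 1"))
```

The name can then be compared as a whole string, for example name == "Mass Number".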

> isotope_data['Z']=atomic_number
> isotope_data['A']=mass_number
> isotope_data['m']=relative_atomic_mass
> isotope_data
>
> the output from the programme is only
>
> {'Z': '118', 'A': '295', 'm': '295.21624(69#)'}
>
> I seem to be overwriting each dictionary

Yes, you never detect the end of a record - you never explain how records are separated in the file either!

You need something like

master = {}   # empty dict

for line in file:
    if line.startswith("Atomic Number"):
        create variable....
    if line.startswith(....): ....etc
    if <record separator detected>:   # we don't know what this is...
        # save variables in a dictionary
        record = {key1: variable1, key2: variable2....}
        # insert dictionary into master dictionary
        master[key] = record

How you generate the keys is a mystery to me but presumably you know.

You could write the values directly into the master dictionary if you prefer.

Also note that you are currently storing strings. If you want the numeric data you will need to convert it with int() or float() as appropriate.
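Putting the pieces together might look like the sketch below. It rests on assumptions that are not confirmed anywhere in this thread: that every record begins with an "Atomic Number" line (used here in place of the unknown record separator), and that the key is the symbol followed by the mass number. The names parse_isotopes and store_record are invented, and the second sample record is reconstructed from the desired output, with its symbol "D" guessed.

```python
SAMPLE = """\
Atomic Number = 1
    Atomic Symbol = H
    Mass Number = 1
    Relative Atomic Mass = 1.00782503223(9)

Atomic Number = 1
    Atomic Symbol = D
    Mass Number = 2
    Relative Atomic Mass = 2.01410177812
"""


def store_record(master, fields):
    """Convert one record's raw strings and file it under symbol + mass number."""
    key = fields["Atomic Symbol"] + fields["Mass Number"]   # ASSUMED key format
    master[key] = {
        "Z": int(fields["Atomic Number"]),
        "A": int(fields["Mass Number"]),
        # drop a "(9)"-style uncertainty, if present, before converting
        "m": float(fields["Relative Atomic Mass"].split("(", 1)[0]),
    }


def parse_isotopes(lines):
    master = {}
    fields = {}
    for line in lines:
        name, sep, value = line.partition("=")
        if not sep:
            continue   # blank or malformed line
        name, value = name.strip(), value.strip()
        # ASSUMPTION: an "Atomic Number" line opens a new record
        if name == "Atomic Number" and fields:
            store_record(master, fields)
            fields = {}
        fields[name] = value
    if fields:
        store_record(master, fields)   # the last record has no following line
    return master


isotopes = parse_isotopes(SAMPLE.splitlines())
print(isotopes)
```

Passing an open file object instead of SAMPLE.splitlines() should work the same way, since the function only loops over lines.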

--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos
