[scikit-learn] Using Scikit-Learn to predict magnetism in chemical systems

Tommaso Costanzo tommaso.costanzo01 at gmail.com
Tue Mar 28 14:57:52 EDT 2017


Dear Henrique,

Yes, my previous representation looks like the Z-matrix format (BTW, in
scikit-learn you need the same number of columns on every line, so you
will have to fill in the first line somehow). However, I will take this
email as an opportunity to stress that you do not have to stick to a
specific file format. What matters is that the features (the columns,
i.e. the 2nd index of a 2D matrix) represent the properties/parameters
that directly affect what you are trying to predict. In fact, you can
add more features to your columns than the ones previously cited, and
you will probably describe the system better. What I can think of right
now, just for the sake of a better explanation: bond length, atom types,
number of unpaired electrons, total number of electrons, dihedral angle
of the two atoms, number of atoms between the pair (e.g. if you have
Mn--O--Mn there is an oxygen between the two Mn, which you might want to
look at for the coupling), and so on and so forth. The number of
parameters you need depends solely on your system and on what you need
to describe; it does not affect the scikit-learn routines in any case.
Basically, every 2D matrix of numbers will work in scikit-learn, but
whether those numbers have physical meaning depends on what they
represent.
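
To make this concrete, here is a minimal sketch in Python. It is only an
illustration under stated assumptions, not a recommended model: the helper
pair_features and the Mn--O--Mn coordinates are made up for this example,
the three rows of X and the J(AB) values in y are the toy numbers quoted
further down in this thread, and LinearRegression is just one of many
scikit-learn regressors that could be used.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def pair_features(coords, i, j, bridge, n_unpaired):
    """Return [distance, bridging angle in degrees, unpaired electrons].

    Hypothetical helper (not part of scikit-learn): `coords` is an
    (n_atoms, 3) array of XYZ positions, `i` and `j` index the two
    magnetic centers, `bridge` indexes the atom between them, and
    `n_unpaired` is supplied by the user.
    """
    a, b, c = coords[i], coords[j], coords[bridge]
    distance = np.linalg.norm(a - b)
    v1, v2 = a - c, b - c
    cos_angle = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    angle = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    return [distance, angle, n_unpaired]


# Made-up Mn--O--Mn fragment (arbitrary units), only to exercise the helper.
coords = np.array([[1.0, 0.0, 0.0],   # Mn
                   [0.0, 0.0, 0.0],   # O (bridging atom)
                   [0.0, 1.0, 0.0]])  # Mn
row = pair_features(coords, i=0, j=2, bridge=1, n_unpaired=1)

# Toy matrix from this thread: one row per atom pair, one column per
# feature (distance, angle, number of electrons). Every row has the
# same number of columns, whatever the size of the molecule.
X = np.array([[1.0, 90.0, 1.0],
              [1.0, 90.0, 1.0],
              [2.0, 90.0, 1.0]])
# Known couplings J(AB) for the same rows, kept as a separate vector.
y = np.array([1.0, 1.0, 0.5])

model = LinearRegression()  # assumption: any regressor could be swapped in
model.fit(X, y)
predicted = model.predict(X)  # predicting the training rows, as in the toy case

print(row)
print(predicted)
```

Each molecule contributes as many rows as it has pairs of interest, so
molecules with different numbers of atoms no longer cause a problem: the
number of columns stays fixed and only the number of rows changes.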

Let me know if this makes more sense.
Sincerely
Tommaso
On Mar 28, 2017 12:51 PM, "Henrique C. S. Junior" <henriquecsj at gmail.com>
wrote:

@Tommaso, this is something like Internal Coordinates[1], right?
@Bill, thanks for the hint, I'll definitely take a look at this.

[1] - https://en.wikipedia.org/wiki/Z-matrix_(chemistry)

On Tue, Mar 28, 2017 at 2:12 AM, Bill Ross <ross at cgl.ucsf.edu> wrote:

> Image processing deals with xy coordinates by (as I understand) training
> with multiple permutations of the raw data, in the form of translations and
> rotations in the 2d space. If training with 3d data, there would be that
> much more translating and rotating to do, in order to divorce the learning
> from the incidentals.
>
> Bill
>
> On 3/27/17 4:35 PM, Tommaso Costanzo wrote:
>
> Dear Henrique,
> I am sorry for the poor email I wrote before. What I was saying is simply
> that if you use the coordinates from an .xyz file as "features", then
> machine learning will learn at which coordinates certain atoms occur, so
> you can only make predictions about the coordinates.
> However, if I understood correctly, the "features" representing the
> coupling J are distance, angle, and electron number. These properties can
> definitely be derived from the XYZ file format by simple geometric
> calculations, and the number of electrons depends on the type of atom.
> So, instead of using the XYZ file as input for scikit-learn, I was
> suggesting that you calculate the angles, distances, and numbers of
> electrons in advance (with other software or directly in Python) and use
> the newly calculated matrix as input for scikit-learn. In this case the
> machine will learn how J(AB) varies as a function of angle, distance, and
> number of electrons.
> For example
>
> distance   angle   n el.
> 1          90      1
> 1          90      1
> 2          90      1
> ...        ...     ...
>
> If you are using supervised learning you will have to add a 4th column
> (in reality a separate column vector) with your J(AB), on which you can
> train your model and then predict the unknown samples.
>
> For example
> distance   angle   n el.   J(AB)
> 1          90      1       1
> 1          90      1       1
> 2          90      1       0.5
> ...        ...     ...     ...
>
> Now if you train the model on the second matrix and then try to predict
> the first one, you should expect a result like:
>
> 1
> 1
> 0.5
>
> Of course in this case the "features" are perfectly equal, hence the
> example is completely unrealistic. However, I hope that it helps to
> clarify what I was explaining in the previous email.
> If you want, you can contact me directly at this email, and I hope that
> you got additional hints from Robert, who seems to be even more
> knowledgeable than me.
>
> Sincerely
> Tommaso
>
>
>
> 2017-03-27 18:44 GMT-04:00 Henrique C. S. Junior <henriquecsj at gmail.com>:
>
>> Dear Tommaso, thank you for your kind reply.
>> I know I have a lot to study before actually starting any code and that's
>> why any suggestion is so valuable.
>> So, you're suggesting that a simplification of the system using only the
>> paramagnetic centers can be a good approach? (I'm not sure if I understood
>> it correctly).
>> My main idea was, at first, to try to represent the systems as
>> realistically as possible (using coordinates). I know that the software
>> will not know what a bond is or what an intermolecular interaction is
>> but, let's say, after including 1000s of examples in the training, I was
>> expecting that (as an example) finding a C at 0.000 and an H at 1.000
>> should start to "make sense" because it leads to an experimental trend.
>> And I totally agree that my way of representing the system is not the
>> best.
>>
>> Thank you so much for all the help.
>>
>> On Mon, Mar 27, 2017 at 4:15 PM, Tommaso Costanzo <
>> tommaso.costanzo01 at gmail.com> wrote:
>>
>>> Dear Henrique,
>>>
>>>
>>> I agree with Robert on the use of a supervised algorithm, and I would
>>> also suggest trying a semi-supervised one if you have trouble
>>> labeling your data.
>>>
>>>
>>> Moreover, as a chemist I think that the input you are planning to use is
>>> not in the best form for machine learning, because you are trying to
>>> predict the coupling J(AB) but in the feature space you have only
>>> coordinates (XYZ). What I suggest is to generate the pairs of atoms
>>> externally and then use a matrix of the form (Mx3), where M is the number
>>> of atom pairs for which you want to predict J and 3 is the number of
>>> features of each pair (distance, angle, unpaired electrons). For a
>>> supervised approach you will need a training set where J is known, so
>>> your training data will be of the form Mx4 and the fourth column will be
>>> the J you know.
>>>
>>> Hope that this is clear, if not I will be happy to help more
>>>
>>>
>>> Sincerely
>>>
>>> Tommaso
>>>
>>> 2017-03-27 13:46 GMT-04:00 Henrique C. S. Junior <henriquecsj at gmail.com>
>>> :
>>>
>>>> Dear Robert, thank you. Yes, I'd like to talk about some specifics on
>>>> the project.
>>>> Thank you again.
>>>>
>>>> On Mon, Mar 27, 2017 at 2:25 PM, Robert Slater <rdslater at gmail.com>
>>>> wrote:
>>>>
>>>>> You can definitely use some of the tools in scikit-learn for supervised
>>>>> machine learning.  The real trick will be how representative your
>>>>> training systems are of the ones you want to predict.  All of the
>>>>> various regression algorithms would be of some value, and you may even
>>>>> consider an ensemble to help generalize.  There will be some important
>>>>> questions to answer--what kind of loss function do you want to look at?
>>>>> I assumed regression (continuous response), but it could also be
>>>>> classification--paramagnetic, diamagnetic, ferromagnetic, etc...
>>>>>
>>>>> Another task to think about might be dimension reduction.
>>>>> There is no guarantee you will get fantastic results--every problem is
>>>>> unique and much will depend on exactly what you want out of the
>>>>> solution--it may be that we get '10%' accuracy at best--for some systems
>>>>> that is quite good, others it is horrible.
>>>>>
>>>>> If you'd like to talk specifics, feel free to contact me at this
>>>>> email.  I have a background in magnetism (PhD in magnetic multilayers--I
>>>>> was in physics, but as you are probably aware chemistry and physics
>>>>> blend in this area) and have a fairly good knowledge of scikit-learn and
>>>>> machine learning.
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Mar 27, 2017 at 10:50 AM, Henrique C. S. Junior <
>>>>> henriquecsj at gmail.com> wrote:
>>>>>
>>>>>> I'm a chemist with some rudimentary programming skills (getting
>>>>>> started with python) and in the middle of the year I'll be starting a Ph.D.
>>>>>> project that uses computers to describe magnetism in molecular systems.
>>>>>>
>>>>>> Most of the time I get my results after several simulations and
>>>>>> experiments, so I know that one of the hardest tasks in molecular
>>>>>> magnetism is to predict the nature of magnetic interactions. That's why
>>>>>> I'll try to tackle this problem with Machine Learning (because such
>>>>>> interactions depend, basically, on distances, angles and number of
>>>>>> unpaired electrons). The idea is to feed the computer with a large training
>>>>>> set (with number of unpaired electrons, XYZ coordinates of each molecule
>>>>>> and experimental magnetic couplings) and see if it can predict the magnetic
>>>>>> couplings (J(AB)) of new systems:
>>>>>> (see example in the attached image)
>>>>>>
>>>>>> Can Scikit-Learn handle the task, knowing that the matrix used to
>>>>>> represent atomic coordinates will probably have a different number of atoms
>>>>>> (because some molecules have more atoms than others)? Or is this a job
>>>>>> better suited for another software/approach?
>>>>>>
>>>>>>
>>>>>> --
>>>>>> *Henrique C. S. Junior*
>>>>>> Industrial Chemist - UFRRJ
>>>>>> M. Sc. Inorganic Chemistry - UFRRJ
>>>>>> Data Processing Center - PMP
>>>>>> Visite o Mundo Químico <http://mundoquimico.com.br>
>>>>>>
>>>>>> _______________________________________________
>>>>>> scikit-learn mailing list
>>>>>> scikit-learn at python.org
>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>> --
>>> Please do NOT send Microsoft Office Attachments:
>>> http://www.gnu.org/philosophy/no-word-attachments.html