Blog #1: Dataset Visualizations and Training GPR Models

Hi again!

It’s been two weeks since I started my internship, and I’ve been very busy. On my first day, I called my supervisors and went over the internship plan, outlining key deadlines as well as meeting schedules. Then, I set up my workspace (I SSH into a remote server to do my work), read through the GraphDot API documentation that I will be using extensively, and ran all the example notebooks. I also downloaded the solar.db database—the dataset I will be using to train my models—and converted it to different CSV files, one for each table in the database.
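For anyone curious, the table-to-CSV dump needs nothing beyond Python's standard library. Here is a minimal sketch of how I'd approach it (the table names are read from the database itself, so nothing is hard-coded):

```python
import csv
import sqlite3

def export_tables_to_csv(db_path):
    """Dump every table in a SQLite database to its own CSV file."""
    con = sqlite3.connect(db_path)
    cur = con.cursor()
    # sqlite_master lists every user table in the database
    cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
    tables = [row[0] for row in cur.fetchall()]
    for table in tables:
        cur.execute(f"SELECT * FROM {table}")          # table name comes from the db itself
        headers = [col[0] for col in cur.description]  # column names for the CSV header
        with open(f"{table}.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(headers)
            writer.writerows(cur.fetchall())
    con.close()
    return tables
```

Running `export_tables_to_csv("solar.db")` produces one CSV per table in the current directory.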

The next day, I worked on preprocessing the CSV files from solar.db and generated another CSV file (InChI_SMILES.csv) that maps each chemical to its SMILES representation. I explored several methods of generating molecular representation graphs and chose the most efficient and accurate one: a function that takes InChI representations and converts them into GraphDot graphs.
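The mapping step can be sketched roughly like this. The converter is deliberately pluggable — in practice it would be something like RDKit's `Chem.MolToSmiles(Chem.MolFromInchi(...))` — and the function name below is just illustrative:

```python
import csv

def build_inchi_smiles_csv(inchis, to_smiles, path="InChI_SMILES.csv"):
    """Write a two-column CSV mapping each InChI string to its SMILES string.

    `to_smiles` is any InChI -> SMILES converter (e.g. an RDKit round-trip);
    entries that fail to convert are skipped rather than written empty.
    """
    rows = []
    for inchi in inchis:
        try:
            smiles = to_smiles(inchi)
        except Exception:
            smiles = None
        if smiles:
            rows.append((inchi, smiles))
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["InChI", "SMILES"])
        writer.writerows(rows)
    return rows
```

Keeping the converter as a parameter also makes the preprocessing easy to unit-test without a chemistry toolkit installed.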

Using Seaborn, I also plotted the distributions of, and correlations between, mass, volume, number of atoms, homo-lumo difference, and ground state energy in the dataset. Here are a couple of trends I observed:

  • The distribution of the E_homo/lumo gap resembles a normal distribution with a mean of -2.5; the E_homo/lumo_TB gap, a normal distribution with a mean of -2; and the D_homo/lumo gap, a distribution centered at around -0.2 and skewed toward the positive.
  • There are 5366 datapoints for the E_homo/lumo gap, but only 1346 datapoints each for the E_homo/lumo_TB and D_homo/lumo gaps. Using the E_homo/lumo values therefore provides more training data.
  • There appears to be no significant correlation between any of the homo-lumo gap labels and the corresponding masses, volumes, or numbers of atoms in the molecules.

Figure 1: The dataset trends plotted with Seaborn. vol = volume, natoms = number of atoms, diff = homo/lumo difference, and Energy = ground state energy.

Then, over the next few days, I ran Gaussian Process Regression (GPR) on the molecules, debugging various problems along the way. First, I simply ran the Marginalized Graph Kernel (MGK) on the first 5 molecules of the dataset to see what it did: it output a pairwise similarity (kernel) matrix comparing the 5 molecules. While experimenting, I realized that if I removed the labels from the molecular graphs when fitting the MGK, I would get a matrix of solely 1.0s. It seems that without the labels, the first 5 molecules were similar enough that they were all classified as being the same molecule.
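The all-1.0 matrix makes sense once you account for normalization: graph-kernel similarities are typically reported as K_ij / sqrt(K_ii * K_jj), so any two graphs that collapse onto the same representation score exactly 1.0. Here is a toy NumPy sketch of that effect (feature vectors standing in for graphs — this is an analogy, not GraphDot itself):

```python
import numpy as np

def normalized_kernel(X):
    """Gram matrix K_ij = <x_i, x_j>, normalized so that K_ii = 1:
    K'_ij = K_ij / sqrt(K_ii * K_jj)."""
    K = X @ X.T
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

# With labels, the two "molecules" get distinct representations...
X_labeled = np.array([[1.0, 0.0, 2.0],
                      [0.0, 1.0, 2.0]])
print(normalized_kernel(X_labeled))    # off-diagonal entries are 0.8 (< 1)

# ...but stripping the labels collapses them onto identical representations,
# so every normalized similarity is exactly 1.0.
X_unlabeled = np.array([[2.0, 2.0, 2.0],
                        [2.0, 2.0, 2.0]])
print(normalized_kernel(X_unlabeled))  # all entries 1.0
```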

I wrote a Python script that ran k-fold training with k=10 across different hyperparameters. Through these tests, I was able to find a set of hyperparameters that produced the highest accuracies during training. Along the way, I ran into several problems, including a badly defined accuracy function that reported >99% accuracy on every model, GPU issues (where previous tasks refused to be killed and took up most of the GPU, making training incredibly slow), and trouble saving the model (in particular, serializing graphs into bytes to be saved).
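As a sketch of that k=10 loop — with a closed-form ridge model standing in for the actual GraphDot GPR, and R² as the score, since a bounded, well-defined metric is one guard against the ">99% on everything" accuracy bug:

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Shuffle indices once and split them into k near-equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

def fit_ridge(X, y, alpha):
    # Closed-form ridge regression: w = (X^T X + alpha*I)^(-1) X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

def cv_score(X, y, alpha, k=10):
    """Mean R^2 over k folds; each fold is held out exactly once."""
    folds = kfold_indices(len(y), k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = fit_ridge(X[train], y[train], alpha)
        pred = X[test] @ w
        ss_res = np.sum((y[test] - pred) ** 2)
        ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return float(np.mean(scores))

def grid_search(X, y, alphas, k=10):
    """Keep the hyperparameter with the best mean cross-validation score."""
    return max(alphas, key=lambda a: cv_score(X, y, a, k))
```

Swapping `fit_ridge` for the real GPR fit leaves the fold logic and scoring unchanged, which is exactly what made the bad accuracy function easy to isolate.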

Figure 2: Ground state energy predictions and labels as generated by the GPR model. 

However, along the way, I also learned a lot. I learned the importance of a well-defined accuracy function, different data-plotting methods (and how to properly present data to emphasize key trends), new Python libraries (this was the first time I used Seaborn, for example), and how to work remotely with a Linux machine (using nohup to run scripts in the background, SSH with port forwarding, etc.). It’s definitely been a very rewarding first two weeks for me!
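For reference, the remote-work commands look roughly like this (the script name, log file, and hostname are placeholders):

```shell
# Run training in the background so it survives SSH disconnects;
# stdout and stderr are redirected to train.log instead of the terminal.
nohup python train.py > train.log 2>&1 &
echo $! > train.pid   # save the PID so the job can be killed cleanly later

# SSH with local port forwarding, e.g. to reach a Jupyter server
# listening on port 8888 of the remote machine:
ssh -L 8888:localhost:8888 user@remote-server
```

The saved PID is the fix for the "tasks refusing to die" problem: `kill $(cat train.pid)` targets exactly the job you launched.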

Currently, I am running my final GPR model and moving on to the second part of my project: building a generative model that uses a Monte Carlo Tree Search (MCTS) algorithm to find molecules with a homo-lumo gap closest to a given target. I’m currently building an MCTS agent that plays tic-tac-toe to gain familiarity with the algorithm, and will be implementing MCTS in conjunction with my predictive model soon.
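As a preview of the tic-tac-toe exercise, here is a minimal, self-contained UCT-style MCTS in plain Python — a sketch of the standard algorithm (selection, expansion, simulation, backpropagation), not my project code:

```python
import math
import random

WIN_LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def winner(board):
    """Return 'X' or 'O' if someone has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None

def legal_moves(board):
    return [i for i, cell in enumerate(board) if cell is None]

class Node:
    def __init__(self, board, player, move=None, parent=None):
        self.board = board      # tuple of 9 cells: 'X', 'O', or None
        self.player = player    # side to move at this node
        self.move = move        # move that led here from the parent
        self.parent = parent
        self.children = []
        self.untried = [] if winner(board) else legal_moves(board)
        self.visits = 0
        self.wins = 0.0         # from the perspective of the side that just moved

    def uct_child(self, c=1.4):
        # Upper Confidence bound for Trees: exploit average value, explore rare nodes
        return max(self.children, key=lambda n:
                   n.wins / n.visits + c * math.sqrt(math.log(self.visits) / n.visits))

def rollout(board, player):
    """Play uniformly random moves to the end; return the winner or None."""
    board = list(board)
    while winner(board) is None and any(cell is None for cell in board):
        board[random.choice(legal_moves(board))] = player
        player = 'O' if player == 'X' else 'X'
    return winner(board)

def mcts(board, player, iterations=500):
    root = Node(tuple(board), player)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend via UCT while fully expanded and non-terminal
        while not node.untried and node.children:
            node = node.uct_child()
        # 2. Expansion: try one untried move, if any
        if node.untried:
            move = node.untried.pop()
            nxt = list(node.board)
            nxt[move] = node.player
            child = Node(tuple(nxt), 'O' if node.player == 'X' else 'X',
                         move=move, parent=node)
            node.children.append(child)
            node = child
        # 3. Simulation: random playout from the new node (or terminal result)
        result = winner(node.board) or rollout(node.board, node.player)
        # 4. Backpropagation: credit each node's just-moved side
        while node is not None:
            node.visits += 1
            just_moved = 'O' if node.player == 'X' else 'X'
            if result == just_moved:
                node.wins += 1.0
            elif result is None:    # draw
                node.wins += 0.5
            node = node.parent
    # The most-visited child is the recommended move
    return max(root.children, key=lambda n: n.visits).move
```

For the generative model, the plan is to keep this skeleton and replace the game rules with molecule-editing moves and the random-playout reward with the trained GPR's homo-lumo prediction.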