We aim to demonstrate how A.I. can be utilized to predict molecular properties through an engaging, individualized learning experience.
Specifically, our module uses machine learning, a subtopic of artificial intelligence, to extract knowledge and find patterns hidden in data in order to predict the biodegradability (how capable a molecule is of being naturally broken down by microorganisms) of organic compounds.
Our learning module walks students with zero coding or machine learning experience through the process of creating their own machine learning model, which can then be used to predict the biodegradability of previously unseen compounds.
We hope that after exploring our module, other STEM students and professionals will be inspired to use machine learning to predict other molecular properties, speeding up research and leading to new innovations.
Train and Validate
1) PREPARE DATA:
Choose which feature group to train the model with (for more info on the specific feature groups used in our model, visit Dataset), and split the data into training, validation, and testing sets (for more info on data splitting, visit Dataset and Overfitting). Press Save Configuration.
2) TRAIN:
Choose which supervised learning algorithm to train your model with (for more info on each machine learning algorithm, visit Algorithms), and press Train.
3) VALIDATE:
Tune the hyperparameters of your chosen algorithm (for more info on each algorithm's hyperparameters, visit Algorithms), and press Validate.
Test
4) TEST:
Choose any of your previous saves to run a final test of your model, and press Test.
Predict
5) PREDICT:
Using any of your previous saves, predict the biodegradability of a molecule by inputting its SMILES string (Simplified Molecular-Input Line-Entry System: a computer-friendly string representation of a molecule's structure) and pressing Predict. On the Predict tab, you will find some randomly selected SMILES strings of molecules from the dataset, as well as the option to create your own SMILES string with an online SMILES generator. You have now successfully created a machine learning model that can predict whether molecules are readily biodegradable! A minimal code sketch of this whole workflow is shown below.
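For readers curious about what these five steps look like in code, here is a minimal sketch of the same prepare, train, validate, test, and predict loop using scikit-learn. The feature matrix, labels, algorithm, and hyperparameter values below are placeholders (synthetic data stands in for the module's dataset), not the module's actual implementation.

```python
# Minimal sketch of the prepare -> train -> validate -> test -> predict loop.
# NOTE: the data here is synthetic; the module uses real molecular features
# (descriptors or fingerprints) generated from SMILES strings.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.random((1000, 16))           # placeholder feature matrix (16 features)
y = rng.integers(0, 2, size=1000)    # placeholder biodegradability labels (0/1)

# 1) PREPARE DATA: split into training, validation, and testing sets
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

# 2) TRAIN and 3) VALIDATE: try a few hyperparameter values and keep the best
best_model, best_acc = None, 0.0
for depth in (3, 5, 10):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_valid, model.predict(X_valid))
    if acc > best_acc:
        best_model, best_acc = model, acc

# 4) TEST: evaluate the chosen model once on completely unseen data
print("test accuracy:", accuracy_score(y_test, best_model.predict(X_test)))

# 5) PREDICT: classify the feature vector of a new (here, synthetic) molecule
new_molecule = rng.random((1, 16))
print("predicted class:", int(best_model.predict(new_molecule)[0]))
```

The module performs the equivalent of each of these steps through its interface, so no coding is required to complete it.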
NOTE: If the module below is not appearing, try opening this page in a private/incognito window.
Due to the size of the dataset, it may take a few seconds to load.
The class (the categorical group that a data point belongs to) is the value the model is trained to predict. It is a binary representation of biodegradability (how capable a molecule is of being naturally broken down by microorganisms), where each molecule is labeled as either readily biodegradable or not readily biodegradable.
You will have a choice of four feature groups (features are what the model is trained on; the model finds patterns in the feature data and makes predictions from them) to use throughout the process.
These values are produced by passing a SMILES string (Simplified Molecular-Input Line-Entry System: a computer-friendly string representation of a molecule's structure) to the program, which identifies the molecule and generates the feature values for it.
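As an illustration of how such values can be computed (the module's own feature groups are not necessarily generated this exact way), here is a small sketch using RDKit, a common open-source cheminformatics library, to turn a SMILES string into a few molecular descriptors and a fingerprint:

```python
# Sketch: turning a SMILES string into numeric feature values with RDKit.
# The descriptors and fingerprint settings below are illustrative and are not
# necessarily the exact feature groups used by the module.
from rdkit import Chem
from rdkit.Chem import Descriptors, AllChem

smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"    # aspirin, used here as an example
mol = Chem.MolFromSmiles(smiles)        # parse the string into a molecule object

# A few molecular-property descriptors
properties = {
    "molecular_weight": Descriptors.MolWt(mol),
    "logP": Descriptors.MolLogP(mol),
    "ring_count": Descriptors.RingCount(mol),
}

# A 1024-bit Morgan (circular) fingerprint: each bit flags a substructure
fingerprint = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)
feature_vector = list(fingerprint)      # 0/1 values usable as model features

print(properties)
print(sum(feature_vector), "fingerprint bits set out of", len(feature_vector))
```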
Choosing which features will be used to train the model is an important step in data preparation (the stage where data is chosen, properly cleaned/adjusted, and split to be used throughout the development process).
While one might assume that using more features would lead to better accuracy, this is not necessarily the case. This is the curse of dimensionality: increasing the number of features/dimensions greatly increases the volume of the data space, posing an issue because not every combination of values can be covered by the given dataset; as a result, more features do not automatically equate to better performance. For example, just 10 binary features already allow 2^10 = 1,024 distinct combinations, and a fingerprint with hundreds of bits allows far more combinations than a dataset of a few thousand molecules could ever cover.
These ideas should be kept in mind when choosing whether to use molecular properties or one of the fingerprints to train your model.
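As a rough illustration of how one might weigh that choice, the sketch below compares two hypothetical feature groups, a narrow descriptor-style matrix and a much wider fingerprint-style matrix, on validation accuracy. The arrays are synthetic stand-ins rather than the module's data, so the printed numbers carry no chemical meaning; only the comparison procedure is the point.

```python
# Sketch: comparing two candidate feature groups on validation accuracy.
# The X values are synthetic stand-ins for the "molecular properties" and
# "fingerprint" feature groups; only the comparison procedure matters here.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
n = 1000
y = rng.integers(0, 2, size=n)
feature_groups = {
    "descriptors (few features)": rng.random((n, 20)),
    "fingerprint (many features)": rng.random((n, 1024)),
}

for name, X in feature_groups.items():
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)
    model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    acc = accuracy_score(y_valid, model.predict(X_valid))
    print(f"{name}: validation accuracy = {acc:.2f}")
```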
The data is then split into three segments to be used throughout the making of the model: training (the model learns patterns from this data, to be evaluated later), validation (the model's predictions are checked on a separate set of data while its hyperparameters are tuned to improve performance), and testing (the model is evaluated on completely unseen data to determine its final, true performance).
NOTE: The data table shows a preview of only 15 of the chemicals in the dataset. In reality, the dataset contains over 6000 chemicals. For more information about where the dataset came from, see the original study.
For this module, you will have the choice of three supervised learning algorithms (an algorithm that learns from labeled datasets to predict outcomes), each of which has its own strengths and weaknesses. Furthermore, you will be able to tune some of their hyperparameters (settings of the model that the user sets in order to affect the model's performance). A minimal code sketch of how such algorithms can be set up follows the descriptions below.
Flowchart-like structure that narrows down choices given a large sample.
Compares the input to a stored set of data points, determining which of them the input is most similar to.
Creates a classification boundary using logarithmic probabilities (classification vs. regression: classification predicts predefined classes, which are qualitative, while regression predicts numerical values, which are quantitative).
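The three descriptions above are consistent with a decision tree, k-nearest neighbors, and logistic regression. Assuming those are the algorithms in question, a minimal scikit-learn sketch of setting them up with a few tunable hyperparameters might look like this (the hyperparameter names and values are illustrative, not necessarily the ones exposed by the module):

```python
# Sketch: three common supervised classifiers with example hyperparameters.
# The specific algorithms and hyperparameters exposed by the module may differ.
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

algorithms = {
    # Flowchart-like structure of yes/no splits on feature values
    "decision tree": DecisionTreeClassifier(max_depth=5, min_samples_leaf=2),
    # Classifies an input by the classes of its k most similar training points
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5, weights="distance"),
    # Draws a classification boundary from log-odds (logistic) probabilities
    "logistic regression": LogisticRegression(C=1.0, max_iter=1000),
}

# Each estimator shares the same interface: fit(X_train, y_train) to train and
# predict(X_new) to classify, so they can be swapped freely in the workflow.
for name, model in algorithms.items():
    print(name, "->", model)
```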
To prevent underfitting (a phenomenon where the model fails to learn enough from the data, resulting in poor performance on both training and unseen data) or overfitting (a phenomenon where the model learns the training data too well, leading to excellent training performance but poor generalization to new data), data scientists typically split datasets into training (60-80%), validation (10-20%), and testing (10-20%) sets, and choose appropriate hyperparameters for the respective algorithm. Learning curves, which show how performance changes as more data is introduced, can help identify under- or overfitting. Below are examples of models that are underfitting, optimal, or overfitting, along with their respective learning curves.
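As a rough illustration of how a learning curve can be computed (again with synthetic data standing in for the module's dataset), here is a sketch using scikit-learn's learning_curve helper:

```python
# Sketch: computing a learning curve to diagnose under-/overfitting.
# Low training AND validation scores suggest underfitting; a large, persistent
# gap between high training and low validation scores suggests overfitting.
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((1000, 16))           # synthetic stand-in for molecular features
y = rng.integers(0, 2, size=1000)    # synthetic biodegradability labels

train_sizes, train_scores, valid_scores = learning_curve(
    DecisionTreeClassifier(max_depth=5, random_state=0),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),   # evaluate at 10%, ..., 100% of the data
    cv=5,                                   # 5-fold cross-validation
    scoring="accuracy",
)

# Average the scores across folds and print one row per training-set size
for size, tr, va in zip(train_sizes, train_scores.mean(axis=1), valid_scores.mean(axis=1)):
    print(f"{int(size):4d} training examples: train acc {tr:.2f}, validation acc {va:.2f}")
```

Plotting the two averaged score curves against the training-set size gives the kind of learning curve figures shown in the module.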
NOTE: It is recommended to run steps 1) PREPARE DATA and 2) TRAIN with all default parameters before answering any questions. More detailed instructions on each step can be found on the Introduction tab in the above menu.
For questions 1-6, let “optimal” features, data splits, etc. be defined as producing the highest validation accuracy, without the learning curve showing signs of underfitting or overfitting.
For questions 7-9, let your “most optimal save” be defined as whichever save yielded the highest testing accuracy. “Saves” refer to previously-run models (displayed in the table on the Train and Validate tab).