Biodegradability Classification

Welcome to Biodegradability Classification

We aim to demonstrate how A.I. can be utilized to predict molecular properties through an engaging, individualized learning experience.

Specifically, our module uses a subfield of artificial intelligence called machine learning to extract knowledge and find patterns hidden in data in order to predict the property of biodegradability (how readily a molecule can be naturally broken down by microorganisms) in organic compounds.

Our learning module walks students with zero coding or machine learning experience through the process of creating their own machine learning model, which, once created, can be used to predict the biodegradability of unseen compounds.

We hope that after exploring our module, other STEM students and professionals will be inspired to use machine learning to predict other molecular properties, speeding up research and leading to new innovations.


Procedure:

Train and Validate

1) PREPARE DATA:

Choose which feature group to train the model with (for more info on the specific feature groups used in our model, visit Dataset), and split the data into training, validation, and testing sets (for more info on data splitting, visit Dataset and Overfitting). Press Save Configuration.

2) TRAIN:

Choose which supervised learning algorithm to train your model with (for more info on each machine learning algorithm, visit Algorithms), and press Train.

3) VALIDATE:

Tune the hyperparameters of your chosen algorithm (for more info on each machine learning algorithm's hyperparameters, visit Algorithms), and press Validate.

Test

4) TEST:

Choose any of your previous saves to run a final test of your model, and press Test.

Predict

5) PREDICT:

Using any of your previous saves, predict the biodegradability of a molecule by inputting its SMILES string (Simplified Molecular-Input Line-Entry System: a computer-friendly text representation of a molecule's structure) and pressing Predict. On the Predict tab, you will find some randomly-selected SMILES strings of molecules from the dataset, as well as the option to create your own SMILES string with an online SMILES generator. You have now successfully created a machine learning model that can predict whether molecules are readily biodegradable!
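The interactive module handles all of these steps for you, but for readers curious what they look like in code, here is a minimal, hypothetical sketch using pandas, RDKit, and scikit-learn. The file name, column names, and parameter values are illustrative assumptions, not the module's actual implementation.

    # Hypothetical sketch of steps 1-5; file name, column names, and settings are assumptions.
    import pandas as pd
    from rdkit import Chem
    from rdkit.Chem import AllChem
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    data = pd.read_csv("biodegradability.csv")  # assumed file of SMILES strings + class labels

    def featurize(smiles, radius=1, n_bits=1024):
        """Turn a SMILES string into a Morgan-fingerprint bit vector."""
        mol = Chem.MolFromSmiles(smiles)
        return list(AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits))

    X = [featurize(s) for s in data["smiles"]]  # assumed column name
    y = data["ready_biodegradable"]             # assumed column: 1 = ready, 0 = not ready

    # 1) PREPARE DATA: 60% training, 20% validation, 20% testing
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
    X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

    # 2) TRAIN and 3) VALIDATE: fit a decision tree, then check validation accuracy while tuning
    model = DecisionTreeClassifier(max_depth=5, splitter="best")
    model.fit(X_train, y_train)
    print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))

    # 4) TEST: one final evaluation on completely unseen data
    print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))

    # 5) PREDICT: classify a new molecule from its SMILES string (ethanol, as an example)
    print("prediction:", model.predict([featurize("CCO")]))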

NOTE: If the module below is not appearing, try opening this page in a private/incognito window.

Due to the size of the dataset, it may take a few seconds to load.

Jump to Module


Working With the Dataset

Class:

The class (the categorical group that a data point belongs to) is the value the model is trained to predict. It is a binary representation of biodegradability (how readily a molecule can be naturally broken down by microorganisms), where...

  • Not readily biodegradable (< 60% biodegradation within 28 days) = 0
  • Readily biodegradable (> 60% biodegradation within 28 days) = 1
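In code, this labeling rule is just a threshold. Here is a minimal sketch; the function and variable names are made up for illustration.

    # Hypothetical labeling rule based on the definition above.
    def biodegradability_class(percent_degraded_in_28_days: float) -> int:
        """Return 1 for readily biodegradable, 0 otherwise."""
        return 1 if percent_degraded_in_28_days > 60 else 0

    print(biodegradability_class(72.5))  # 1 (ready)
    print(biodegradability_class(41.0))  # 0 (not ready)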

Features:

You will have a choice between four feature groups (features are what the model is trained on; the model finds patterns in the feature data and makes predictions from them) to use throughout the process:

  • A list of over 200 molecular properties
  • One of three different fingerprints: binary representations of the molecule's substructures
    • Morgan Fingerprint
    • Extended Connectivity Fingerprint (ECFP2)
    • Path Fingerprint

These values are produced by passing a SMILES string (Simplified Molecular-Input Line-Entry System: a computer-friendly text representation of a molecule's structure) to the program, which identifies the molecule and generates the values for it.
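As a concrete illustration, here is a hedged sketch of how such values can be generated with RDKit. The descriptor selection, fingerprint sizes, and radii below are assumptions for demonstration, not necessarily the settings the module uses.

    # Hypothetical featurization of one molecule with RDKit; settings are assumptions.
    from rdkit import Chem
    from rdkit.Chem import AllChem, Descriptors

    mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, as an example

    # A few of the many molecular properties (descriptors) RDKit can compute
    properties = {
        "MolWt": Descriptors.MolWt(mol),
        "LogP": Descriptors.MolLogP(mol),
        "TPSA": Descriptors.TPSA(mol),
    }

    # Fingerprints: binary vectors encoding the molecule's substructures
    morgan = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)
    ecfp2 = AllChem.GetMorganFingerprintAsBitVect(mol, 1, nBits=1024)  # ECFP2 corresponds to radius 1
    path = Chem.RDKFingerprint(mol)  # path-based (topological) fingerprint

    print(properties)
    print(morgan.GetNumOnBits(), ecfp2.GetNumOnBits(), path.GetNumOnBits())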

Selecting/Dropping Data:

Choosing which features will be used to train the model is an important step in data preparation (the stage where data is chosen, properly cleaned/adjusted, and split for use throughout the development process).

While one might assume that using more features would lead to better accuracy, this is not necessarily the case because of the curse of dimensionality: increasing the number of features (dimensions) greatly increases the volume of the data space, posing a problem when not every combination of values can be covered by the given dataset. Because of this, more features does not equate to better performance.

These ideas should be kept in mind when choosing whether to use molecular properties or one of the fingerprints to train your model.

Splitting Data:

The data is then split into three segments to be used throughout the making of the model: training (the model learns patterns from this data), validation (the model's predictions are checked on a separate set of data while its hyperparameters are tuned to improve performance), and testing (the model is evaluated on completely unseen data to determine its final, true performance).
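A common way to produce such a split is two calls to scikit-learn's train_test_split. The sketch below assumes a 60/20/20 split on placeholder data standing in for the featurized molecules; the stratify option (which keeps class proportions similar across the splits) is an added assumption, not something the module necessarily does.

    # Minimal 60/20/20 train/validation/test split sketch on placeholder data.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # First peel off the 60% training portion...
    X_train, X_hold, y_train, y_hold = train_test_split(
        X, y, train_size=0.60, stratify=y, random_state=42)

    # ...then split the remaining 40% evenly into validation and test sets.
    X_val, X_test, y_val, y_test = train_test_split(
        X_hold, y_hold, test_size=0.50, stratify=y_hold, random_state=42)

    print(len(X_train), len(X_val), len(X_test))  # roughly 600 / 200 / 200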

NOTE: The data table shows only a preview of 15 of the chemicals in the dataset. In reality, the dataset contains over 6000 chemicals. For more information about where the dataset came from, see the original study.

Using the Algorithms

For this module, you will have a choice between three supervised learning algorithms (algorithms that learn from labeled data to predict outcomes), each of which has its own strengths and weaknesses. Furthermore, you will be able to tune some of their hyperparameters (settings that the user chooses in order to affect the model's performance).

Decision Tree (DT):

A flowchart-like structure that narrows down choices given a large sample.

  • Pros: Simple, handles numerical & categorical data
  • Cons: Small changes in the data can drastically change the resulting tree
[Figure: Decision Tree]

K-Nearest Neighbor (KNN):

Compares an input to a set number of nearby data points, determining which existing data the input is most similar to.

  • Pros: Non-parametric, versatile
  • Cons: Memory-intensive, suffers from the curse of dimensionality (more features do not equate to better performance; see Dataset above)
[Figure: K-Nearest Neighbor]

Logistic Regression (LR):

Creates a classification boundary using log-odds probabilities. (Classification vs. regression: classification predicts predefined classes, which are qualitative, while regression predicts numerical values, which are quantitative.)

  • Pros: Efficient, less prone to overfitting (learning the training data too well, leading to excellent training performance but poor generalization to new data)
  • Cons: Limited to linear separation, can be oblivious to complex variable relationships
[Figure: Logistic Regression]
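To make the comparison concrete, here is a hedged sketch that fits all three algorithms on the same placeholder data and compares their validation accuracies; the dataset and settings are stand-ins, not the module's configuration.

    # Hypothetical comparison of the three algorithms on placeholder data.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

    models = {
        "DT": DecisionTreeClassifier(random_state=0),
        "KNN": KNeighborsClassifier(),
        "LR": LogisticRegression(max_iter=1000),
    }

    for name, model in models.items():
        model.fit(X_train, y_train)
        print(f"{name}: validation accuracy = {model.score(X_val, y_val):.3f}")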

Hyperparameters for DT:

  • Max depth of tree
    • How many levels deep the tree is allowed to grow
    • Greater depth can fit the data better, but going too deep can also overfit it
  • Splitter strategy:
    • Best
      • Chooses best split at each node
    • Random
      • Chooses best random split at each node

Hyperparameters for KNN:

  • Number of neighbors (k)
    • How many neighbors near a point will be used to classify that point
    • Fewer neighbors can fit data better, but too few can overfit data
  • Weights:
    • Uniform
      • All nearby points are weighted equally
    • Distance
      • Closer neighbors are weighted more than further ones

Hyperparameters for LR:

  • Regularization Parameter (C)
    • Controls how closely the decision boundary can fit the training data
    • Higher values can fit the data better, but can also overfit it
  • Solver
    • The particular optimization algorithm used to fit the model
      • LBFGS: Limited-memory Broyden-Fletcher-Goldfarb-Shanno
      • liblinear: a library for large-scale linear classification
      • SAGA: a variant of Stochastic Average Gradient descent
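If the underlying library is scikit-learn (an assumption; the module does not say), these hyperparameters map onto constructor arguments roughly as follows; the specific values shown are examples only.

    # Example hyperparameter settings; scikit-learn is assumed and the values are illustrative.
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.linear_model import LogisticRegression

    dt = DecisionTreeClassifier(max_depth=5, splitter="best")      # max depth of tree, splitter strategy
    knn = KNeighborsClassifier(n_neighbors=7, weights="distance")  # number of neighbors, weights
    lr = LogisticRegression(C=1.0, solver="lbfgs", max_iter=1000)  # regularization parameter, solver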

Generalize Don't Memorize

To prevent underfitting (a phenomenon where the model fails to learn enough from the data, resulting in poor performance on both training and unseen data) or overfitting (a phenomenon where the model learns the training data too well, leading to excellent training performance but poor generalization to new data), data scientists typically split datasets into training (60-80%), validation (10-20%), and testing (10-20%) sets, and choose appropriate hyperparameters for the respective algorithm. Learning curves, which show how performance changes as more data is introduced, can help identify under- or overfitting. Below are examples of models that are underfitting, optimal, or overfitting, along with their respective learning curves.

Underfitting model

[Figure: underfitting model with high loss and low accuracy]

Underfit learning curve

[Figure: underfit learning curve; 59.9% train acc., 58.8% val. acc., 59.1% test acc.]
This model underfits the data with an overly simplistic linear decision boundary, failing to capture the significant trends. Both the learning curve and the results show poor performance, barely better than random chance (like flipping a coin).

Optimal model

[Figure: optimal model with low loss and high accuracy]

Optimal learning curve

[Figure: optimal learning curve; 78.2% train acc., 75.3% val. acc., 74.5% test acc.]
This model is optimal, capturing macro trends in the data effectively. The learning curve illustrates strong generalization, with training performance decreasing as validation performance improves.

Overfitting model

[Figure: overfitting model with low loss and low accuracy]

Overfit learning curve

[Figure: overfit learning curve; 100% train acc., 76.1% val. acc., 72.5% test acc.]
This model severely overfits the data: failing to capture general trends, it instead memorizes the training set to achieve 100% accuracy on it. As a result, it performs significantly worse on unseen validation instances. Preventing overfitting is a critical responsibility of a data scientist.
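Learning curves like the ones above can be generated with scikit-learn's learning_curve helper. The sketch below uses placeholder data rather than the biodegradability dataset, so the exact curves will differ from those shown.

    # Minimal learning-curve sketch on placeholder data.
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_classification
    from sklearn.model_selection import learning_curve
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, n_features=30, random_state=0)

    train_sizes, train_scores, val_scores = learning_curve(
        DecisionTreeClassifier(max_depth=5), X, y,
        train_sizes=np.linspace(0.1, 1.0, 8), cv=5)

    plt.plot(train_sizes, train_scores.mean(axis=1), label="training accuracy")
    plt.plot(train_sizes, val_scores.mean(axis=1), label="validation accuracy")
    plt.xlabel("number of training examples")
    plt.ylabel("accuracy")
    plt.legend()
    plt.show()

    # A large, persistent gap between the two curves suggests overfitting;
    # two low, flat curves suggest underfitting.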

Guiding Questions

NOTE: It is recommended to run steps 1) PREPARE DATA and 2) TRAIN with all default parameters before answering any questions. More detailed instructions on each step can be found on the Introduction tab in the above menu.

Training and Validation:

For questions 1-6, let “optimal” features, data splits, etc. be defined as producing the highest validation accuracy, without the learning curve showing signs of underfitting or overfitting.

  1. Train the decision tree model using each of the four feature choices: Molecular Properties, Morgan Fingerprint, ECFP2, and Path Fingerprint. Find the optimal features for training your decision tree model. (For more information about decision trees and the other machine learning algorithms used in this module, see Algorithms in the above menu; for more information on each of these features and the role of features in machine learning, see Dataset.)
  2. Train the decision tree model using different data split percentages (the percentages of the given data used for training, validation, and testing; see Dataset and Overfitting in the above menu), ranging from 40% to 80% training. Find the optimal data split for your decision tree model, to the nearest 5%.
  3. Tune the decision tree model by modifying its hyperparameters. (For more information on the hyperparameters of each machine learning algorithm, see Algorithms in the above menu.)
    1. How does increasing the max. depth of tree affect your model's performance?
    2. What about changing the splitter?
  4. Train the model using a different machine learning algorithm (either KNN or LR). Are the same features and data split still optimal for this model? Why or why not?
  5. Tune your chosen model by modifying its hyperparameters.
    1. How does increasing or decreasing your model's numerical hyperparameter affect its performance? (In this module, the numerical hyperparameters are max depth of tree for DT, number of neighbors for KNN, and regularization parameter for LR.)
    2. What about changing its categorical hyperparameter? (In this module, the categorical hyperparameters are splitter strategy for DT, weights for KNN, and solver for LR.)
  6. Using your above findings, try to find an optimal combination of features, data split, machine learning algorithm, and hyperparameters for this model.

Testing and Predicting:

For questions 7-9, let your “most optimal save” be defined as whichever save yielded the highest testing accuracy. “Saves” refer to previously-run models (displayed in the table on the Train and Validate tab).

  7. Test each of your saves and view their testing accuracies. Does the highest validation accuracy always correspond to the highest testing accuracy? Why or why not? If not, which save ended up being most optimal?
  8. Use your most optimal save to predict the biodegradability of each of the 3 randomly-selected SMILES strings from the dataset. Did your model correctly predict all three of the randomly-selected molecules?
  9. For further exploration:
    1. Choose a different save to predict the biodegradability of these SMILES strings. Are any of the prediction results different?
    2. Using the tools provided, create a custom SMILES string for a molecule that does not appear in the dataset (whose Actual Class = “unknown”), and predict its biodegradability.