Catalysis Data Science Model

Welcome to Catalysis Data Science

The dataset used in this module is called the “Catalysis dataset”. It describes a reaction known as the oxidative coupling of methane, in which methane and oxygen react to form desired products (ethane and ethylene) and undesired products (carbon dioxide and carbon monoxide). The dataset contains over 10,000 data points, each representing one instance of this catalytic reaction.

Because the dataset is so large, we apply data science and machine learning techniques such as multivariable regression, classification, clustering, and principal component analysis to uncover trends within the data. To fully understand these concepts, make use of the module's interactivity: try as many scenarios as possible to see how the data changes or how the performance of specific models changes.

NOTE: If the module below is not appearing, try opening this page in a private/incognito window.

Due to the size of the dataset, it may take a few seconds to load.


Model Description

The Data Exploration section shows how the dataset is distributed. You can adjust filters such as minimum temperature, minimum methane conversion, and minimum error to see how the selected data changes.
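The filtering idea can be sketched in a few lines of Python. This is only an illustration: the field names (`temperature`, `ch4_conversion`, `c2_yield`), the values, and the helper `filter_records` are hypothetical and are not taken from the module or the real dataset.

```python
# Hypothetical records: field names and values are made up for illustration.
records = [
    {"temperature": 700, "ch4_conversion": 12.0, "c2_yield": 3.1},
    {"temperature": 750, "ch4_conversion": 20.5, "c2_yield": 6.4},
    {"temperature": 800, "ch4_conversion": 28.0, "c2_yield": 9.8},
    {"temperature": 850, "ch4_conversion": 35.2, "c2_yield": 11.5},
    {"temperature": 900, "ch4_conversion": 41.0, "c2_yield": 8.7},
]


def filter_records(rows, min_temperature, min_conversion):
    """Keep only the rows that meet or exceed both minimum thresholds."""
    return [
        row
        for row in rows
        if row["temperature"] >= min_temperature
        and row["ch4_conversion"] >= min_conversion
    ]


subset = filter_records(records, min_temperature=750, min_conversion=25.0)
print(f"{len(subset)} of {len(records)} records remain after filtering")
```

Raising either threshold can only shrink the subset, which is why stricter settings leave fewer points to plot.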

The Correlation Matrix shows how strongly each pair of features in the dataset is correlated. This matters because the strength of a correlation hints at how useful a feature may be as a predictor in a regression model.
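As a rough sketch of what a correlation matrix contains, the snippet below computes Pearson correlation coefficients for every pair of features. It assumes Pearson correlation is the measure used, which the module does not state; the feature names and values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical feature columns (illustrative values only).
features = {
    "temperature": [700, 750, 800, 850, 900],
    "ch4_conversion": [12.0, 20.5, 28.0, 35.2, 41.0],
    "c2_yield": [3.1, 6.4, 9.8, 11.5, 8.7],
}

names = list(features)
# Nested dict so that corr[a][b] is the correlation between features a and b.
corr = {a: {b: pearson(features[a], features[b]) for b in names} for a in names}

for a in names:
    print(a.ljust(15), [round(corr[a][b], 2) for b in names])
```

Each entry lies between -1 and 1, the diagonal is always 1, and the matrix is symmetric, so only one triangle needs to be read.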

The Multivariable Regression section lets you build your own regression model. The objective of the model is to show how well the selected features predict an output. This is shown in a parity plot, with actual values on the x-axis and predicted values on the y-axis. As you choose which features go into the model, you can also inspect evaluation outputs such as R^2, the regression coefficients, and an error histogram.
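The quantities named above can be illustrated with a small ordinary least squares fit. This is a minimal sketch using only the standard library, not the module's implementation: the data are synthetic, and the helpers `solve_linear_system`, `fit_linear`, `predict`, and `r_squared` are hypothetical names.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_linear(X, y):
    """Ordinary least squares with an intercept, via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)  # [intercept, coef_1, coef_2, ...]


def predict(coefs, X):
    return [coefs[0] + sum(c * v for c, v in zip(coefs[1:], r)) for r in X]


def r_squared(actual, predicted):
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Synthetic data: y is roughly 2 + 0.5*x1 - 1.0*x2 plus small noise.
X = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 5.0), (6.0, 2.0)]
y = [0.6, 1.9, -0.4, 1.1, -0.6, 3.0]

coefficients = fit_linear(X, y)
predicted = predict(coefficients, X)
r2 = r_squared(y, predicted)
parity_pairs = list(zip(y, predicted))             # (actual, predicted) points
residuals = [a - p for a, p in zip(y, predicted)]  # values for an error histogram

print("coefficients:", [round(c, 3) for c in coefficients])
print("R^2:", round(r2, 3))
```

In a parity plot, each `(actual, predicted)` pair becomes one point; the closer the points lie to the diagonal, the closer R^2 is to 1.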

The Unsupervised Learning section introduces two techniques: clustering analysis and principal component analysis (PCA). The objective of the clustering plot is to group similar data points within the dataset. An elbow plot is included to help indicate a suitable number of clusters. The objective of principal component analysis is to reduce the dimensionality of a large dataset to a few key components that still explain most of the information in the dataset. In this section, we show this through the PCA plot, which plots the first two principal components, and through a histogram that shows how much of the variance each principal component explains.
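To make the elbow plot concrete, the sketch below runs a plain k-means (Lloyd's algorithm) for several values of k and records the within-cluster sum of squared distances, often called inertia. The points are synthetic, the deterministic initialization is a simplification chosen for reproducibility, and the function names are hypothetical; the module's own clustering setup may differ.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm. Returns (centroids, inertia)."""
    # Simplified, deterministic start: pick k evenly spaced points.
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members (keep it if empty).
        new_centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members))
            if members
            else centroids[c]
            for c, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    inertia = sum(min(squared_distance(p, c) for c in centroids) for p in points)
    return centroids, inertia


# Synthetic 2-D points forming three loose groups.
points = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
    (9.0, 1.0), (9.1, 1.2), (8.9, 0.9),
]

inertias = {k: kmeans(points, k)[1] for k in range(1, 6)}
for k, value in inertias.items():
    print(k, round(value, 3))
```

Plotting inertia against k gives the elbow plot: the curve drops steeply until k matches the number of natural groups and flattens afterwards.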

The Classification section shows how the data can be partitioned into different “classes”. For this dataset, the two classes are “good catalyst” and “bad catalyst”. The classification is performed by a support vector machine (SVM) model. Within the model, you can choose between four kernels and see how the results change. Evaluation metrics are also included in the form of a classification report and a confusion matrix.
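Whichever kernel produces the predictions, the confusion matrix and the report metrics are computed in the same way. The sketch below uses made-up labels and treats “good” as the positive class, which is an assumption for illustration; it does not train an SVM.

```python
# Made-up labels for ten catalysts (illustrative only).
actual    = ["good", "good", "bad", "bad", "good", "bad", "bad", "good", "bad", "bad"]
predicted = ["good", "bad",  "bad", "bad", "good", "good", "bad", "good", "bad", "bad"]

positive = "good"  # assumption: "good" is treated as the positive class

pairs = list(zip(actual, predicted))
tp = sum(a == positive and p == positive for a, p in pairs)  # good, predicted good
fn = sum(a == positive and p != positive for a, p in pairs)  # good, predicted bad
fp = sum(a != positive and p == positive for a, p in pairs)  # bad, predicted good
tn = sum(a != positive and p != positive for a, p in pairs)  # bad, predicted bad

# Rows are actual classes, columns are predicted classes.
confusion_matrix = [[tp, fn],
                    [fp, tn]]

precision = tp / (tp + fp)  # of the predicted "good", how many really are good
recall = tp / (tp + fn)     # of the actual "good", how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / len(actual)

print(confusion_matrix)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

The diagonal of the confusion matrix counts correct predictions and the off-diagonal entries count the two kinds of mistake, which is why precision and recall can differ even when accuracy looks high.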

Guiding Questions

Introductory Questions:

  1. In the Data Exploration section, which combination of methane conversion, temperature, and C2y values yields the most data points?
  2. In the Correlation Matrix section, which two features are most strongly correlated with each other? What could this indicate about how these features would perform in a regression model?

Regression Section:

  1. As you vary the input features and the output of the model, which combination of features and output yields the best regression model?
  2. Looking at the model with all features selected, is it possible to tell which variables were the most important in predicting the output variable? If so, how, and which variables are the most important?
  3. How does changing the type of regression from linear to quadratic to cubic affect the regression?
  4. Is there any overfitting occurring in the regression model?

Unsupervised Learning Section:

  1. Is there a trend in the amount of variance explained by each principal component? How many principal components does it take to explain at least 80 percent of the variance?
  2. According to the elbow plot, what is the ideal number of clusters (optimal k value) for the k-means clustering?
  3. As can be seen in the clustering plot, there is a lot of overlap between points and even clusters. What might this imply about the dataset used to build this plot?
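For the first question in this list, the way explained variance accumulates across principal components can be sketched as follows. The snippet standardizes three hypothetical features, computes the eigenvalues of their covariance matrix with Jacobi rotations (a standard-library stand-in for what a numerical library would normally do), and counts how many components are needed to reach 80 percent. The data, the feature names, and the choice to standardize first are assumptions, not details taken from the module.

```python
import itertools
import math


def standardize(columns):
    """Scale each column to zero mean and unit sample standard deviation."""
    scaled = []
    for col in columns:
        mean = sum(col) / len(col)
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / (len(col) - 1))
        scaled.append([(v - mean) / std for v in col])
    return scaled


def covariance_matrix(columns):
    """Sample covariance between every pair of already-centered columns."""
    n = len(columns[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in columns]
            for ci in columns]


def symmetric_eigenvalues(matrix, sweeps=50):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations, largest first."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-12:
                    continue
                # Angle that zeroes a[p][q]; folded into [-pi/4, pi/4].
                theta = 0.5 * math.atan2(2 * a[p][q], a[q][q] - a[p][p])
                if theta > math.pi / 4:
                    theta -= math.pi / 2
                elif theta < -math.pi / 4:
                    theta += math.pi / 2
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


# Hypothetical feature columns (illustrative values only).
columns = [
    [700, 750, 800, 850, 900],        # temperature
    [12.0, 20.5, 28.0, 35.2, 41.0],   # methane conversion
    [3.1, 6.4, 9.8, 11.5, 8.7],       # C2 yield
]

eigenvalues = symmetric_eigenvalues(covariance_matrix(standardize(columns)))
total = sum(eigenvalues)
ratios = [value / total for value in eigenvalues]  # variance explained per component
cumulative = list(itertools.accumulate(ratios))
components_needed = next(i + 1 for i, c in enumerate(cumulative) if c >= 0.80)

print([round(r, 3) for r in ratios])
print("components needed for 80%:", components_needed)
```

Because the components are sorted by eigenvalue, each one explains no more variance than the one before it, and the cumulative total always reaches 100 percent at the last component.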

Classification Section:

  1. With all features selected and the linear kernel chosen, how many C2s were predicted correctly? How many were predicted incorrectly?
  2. Looking at the classification report, what do metrics such as precision, recall, and F1-score measure, and how are they useful in evaluating classification models?
  3. Looking at the SVM model, confusion matrix, and classification report, which kernel performs the best?