The dataset used in this module is called the “Catalysis dataset”. It describes a reaction known as the oxidative coupling of methane, in which methane and oxygen react to form the desired products (ethane and ethylene) and the undesired products (carbon dioxide and carbon monoxide). The dataset contains over 10,000 points, each representing an instance of this catalytic reaction.
Because the dataset is so large, we apply data science and machine learning concepts such as multivariable regression, classification, clustering, and principal component analysis to uncover trends within the data. To understand these concepts fully, it is important to use the interactivity of this module. Try as many scenarios as possible to see how the data changes or how the performance of specific models changes.
NOTE: If the module below is not appearing, try opening this page in a private/incognito window.
Due to the size of the dataset, it may take a few seconds to load.
The Data Exploration section helps one understand the distribution of the dataset being used. One can adjust settings such as minimum temperature, minimum methane conversion, and minimum error to see how the data changes.
The Correlation Matrix shows how strongly each pair of features in the dataset is correlated. This is important because the strength of a correlation gives an early indication of how well one feature is likely to predict another in a regression.
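To make the idea concrete, the following is a minimal sketch of how a correlation matrix can be computed with NumPy. It uses synthetic data rather than the Catalysis dataset, and the column names (temperature, ch4_conversion, noise_feature) are hypothetical placeholders, not the dataset's actual headers.

```python
import numpy as np

# Synthetic stand-in data; the names below are hypothetical, not real column headers.
rng = np.random.default_rng(0)
n = 200
temperature = rng.uniform(700, 900, n)
ch4_conversion = 0.1 * temperature + rng.normal(0, 2, n)  # built to depend on temperature
noise_feature = rng.normal(0, 1, n)                       # built to be unrelated

features = np.column_stack([temperature, ch4_conversion, noise_feature])
names = ["temperature", "ch4_conversion", "noise_feature"]

# rowvar=False tells NumPy that each column (not each row) is a variable.
corr = np.corrcoef(features, rowvar=False)

# Print the Pearson correlation matrix, one row per feature.
for name, row in zip(names, corr):
    print(f"{name:>15}", np.round(row, 2))
```

Each entry lies between -1 and 1. Values near the extremes indicate a strong linear relationship, and values near 0 indicate a weak one.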
The Multivariable Regression section allows one to build a regression model. The objective of the model is to show how well the chosen features predict an output. This is shown in a parity plot, which places the actual values on the x-axis and the predicted values on the y-axis. While choosing the features that go into the model, the user can also see evaluation metrics such as R^2, the regression coefficients, and an error histogram.
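As an illustration of the quantities listed above, here is a minimal sketch of a multivariable linear regression fitted by ordinary least squares with NumPy. The data is synthetic, the three input features are hypothetical, and the module itself may fit its model differently.

```python
import numpy as np

# Synthetic data: three hypothetical input features and a noisy linear target.
rng = np.random.default_rng(1)
n = 300
X = rng.normal(size=(n, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y = X @ true_coef + 4.0 + rng.normal(0, 0.3, n)

# Prepend a column of ones for the intercept, then solve least squares.
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
y_pred = A @ beta  # a parity plot would show y on the x-axis and y_pred on the y-axis

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
residuals = y - y_pred
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# Bin the residuals; these counts are what an error histogram displays.
counts, bin_edges = np.histogram(residuals, bins=10)

print("intercept:", round(float(beta[0]), 3))
print("coefficients:", np.round(beta[1:], 3))
print("R^2:", round(float(r2), 3))
```

This sketch scores the model on the same data it was fitted to. Evaluating on held-out data is the usual way to judge how well a model generalizes.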
The Unsupervised Learning section introduces two techniques: clustering analysis and principal component analysis (PCA). The objective of the clustering plot is to group similar data points within the dataset. An elbow plot is also included to help indicate the ideal number of clusters. The objective of principal component analysis is to reduce the dimensionality of a large dataset to a few key components that still explain most of the information in the dataset. In this section, this is shown through the PCA plot, which plots the first two principal components, and through a histogram that shows how much information each principal component accounts for.
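Both techniques can be sketched in a few lines of NumPy. The example below assumes k-means as the clustering method (the text above does not name one), uses synthetic data made of three blobs, and defines a hypothetical helper, kmeans_inertia, to produce the values an elbow plot would show. PCA is computed from a singular value decomposition of the centered data.

```python
import numpy as np

# Synthetic data: three well-separated blobs in four dimensions.
rng = np.random.default_rng(2)
blob_centers = np.array([[0, 0, 0, 0], [5, 5, 0, 0], [0, 5, 5, 0]], dtype=float)
X = np.vstack([c + rng.normal(0, 0.5, size=(50, 4)) for c in blob_centers])

def kmeans_inertia(X, k, n_iter=50, seed=0):
    """Run a basic k-means (Lloyd's algorithm) and return the final inertia."""
    local_rng = np.random.default_rng(seed)
    centers = X[local_rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_points, k).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its points; keep it if the cluster is empty.
        centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return float(np.sum(dists.min(axis=1) ** 2))

# Elbow plot data: inertia (sum of squared distances to the nearest center) per k.
inertias = {k: kmeans_inertia(X, k) for k in range(1, 7)}

# PCA via SVD of the mean-centered data.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained_variance_ratio = S ** 2 / np.sum(S ** 2)  # share of variance per component
scores = Xc @ Vt[:2].T                              # coordinates on the first two components

print("inertia by k:", {k: round(v, 1) for k, v in inertias.items()})
print("explained variance ratio:", np.round(explained_variance_ratio, 3))
```

In an elbow plot, the inertia is plotted against k, and the point where the curve stops dropping sharply is taken as a reasonable number of clusters. The scores array holds the coordinates that a plot of the first two principal components would display.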
The Classification section shows how the data can be partitioned into different “classes”. For this dataset, the classes are good catalyst and bad catalyst. The classification is performed by a support vector machine model. Within the model, one can choose among four kernels and see how the result changes. Evaluation metrics are also included in the form of a classification report and a confusion matrix.
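The evaluation metrics mentioned above can be worked out by hand. The following minimal sketch builds a confusion matrix and the main classification-report numbers for a small set of made-up labels, where 1 stands for a good catalyst and 0 for a bad catalyst. The labels are illustrative only and are not the output of any actual model.

```python
import numpy as np

# Made-up labels for illustration: 1 = good catalyst, 0 = bad catalyst.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1, 0, 1])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # good predicted as good
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # bad predicted as bad
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # bad predicted as good
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # good predicted as bad

# Rows are the true class, columns the predicted class, in the order [bad, good].
confusion = np.array([[tn, fp],
                      [fn, tp]])

# Metrics for the "good catalyst" class.
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / len(y_true)

print(confusion)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

Precision answers how many of the predicted good catalysts really are good, while recall answers how many of the truly good catalysts the model found.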
Introductory Questions:
Regression Section:
Unsupervised Learning Section:
Classification Section: