Shewhart control charts
PRO-Analyst +AI
for Windows, Mac, Linux

Register of Russian software (entry No. 18857 dated 09/05/2023)


Machine learning (ML). Clustering BIRCH, Gaussian Mixture

[Machine Learning Function - BIRCH Clustering] button

Clustering is an unsupervised machine learning technique that groups similar or homogeneous instances into distinct clusters.

You can download an example structured table file for the clustering algorithms: XLSX.

Structured data can be imported from the following table file formats: Excel workbook (*.xlsx); Excel binary workbook (*.xlsb); OpenDocument Spreadsheet (*.ods).

Where it can be applied

Example 1. Data collected by the marketing department about customer purchases reveals whether there are similarities between customers. These similarities divide customers into groups (clusters), and having customer groups helps in targeting campaigns, promotions, and conversions, and in building better customer relationships.

Example 2. Identifying the most homogeneous groups of component mixtures according to their quality indicators, based on the quantitative or qualitative indicators of each component in the mixture.

Example 3. Identifying the most homogeneous groups of finished products according to their qualitative or quantitative indicators, based on the various technological production modes.

Example 4. Identifying atypical objects that cannot be assigned to any of the clusters.

BIRCH Clustering
[Machine Learning Function - BIRCH Clustering] button

BIRCH (balanced iterative reducing and clustering using hierarchies) is a clustering algorithm that performs balanced iterative reduction and clustering with the help of hierarchies.

Cluster analysis with the BIRCH algorithm requires data with metric attributes. A metric attribute is an attribute whose values can be represented by explicit coordinates in Euclidean space (i.e., no categorical variables).
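The application's own implementation is not shown here, but the behaviour of the [Threshold value] and [Number of clusters] settings can be sketched with scikit-learn's `Birch` estimator (an assumed, illustrative backend):

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
# Two well-separated groups of points with metric (coordinate) attributes.
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
               rng.normal(3.0, 0.3, (50, 2))])

# `threshold` caps the radius of a CF-subcluster ([Threshold value]);
# `n_clusters` is the final number of clusters ([Number of clusters]).
model = Birch(threshold=0.5, n_clusters=2)
labels = model.fit_predict(X)   # a cluster code for every (X, Y) pair
print(np.unique(labels))        # the two assigned cluster codes
```

Lowering `threshold` produces more, smaller subclusters before the final grouping; `n_clusters` fixes how many clusters remain at the end.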

Figure 1. Machine learning (ML) functions window. A tooltip is displayed when you hover the mouse over the button that opens the clustering functions using the BIRCH and Gaussian Mixture algorithms.

Figure 2. Machine learning (ML) functions window. A tooltip is displayed when you hover the mouse over the button that opens the clustering function using the BIRCH algorithm.

Figure 3. Machine learning (ML) functions window - Clustering with the BIRCH algorithm. The measures for the metric attributes of the points are selected, the [Threshold value] and [Number of clusters] values are set, and the [Lines between centroids and points] and [Save results] checkboxes are unchecked. Black crosses indicate the centroids (centers of gravity of the clusters) with their cluster numbers.
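The centroids marked with black crosses are the centers of gravity of the clusters: after clustering, each one is simply the per-cluster mean of the points. A minimal sketch, again assuming a scikit-learn-style backend:

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(1)
# Two groups of points around (-2, -2) and (2, 2).
X = np.vstack([rng.normal(-2.0, 0.2, (40, 2)),
               rng.normal( 2.0, 0.2, (40, 2))])

labels = Birch(threshold=0.5, n_clusters=2).fit_predict(X)
# Centroid (center of gravity) of each cluster = mean of its points.
centroids = {k: X[labels == k].mean(axis=0) for k in np.unique(labels)}
for k, c in centroids.items():
    print(int(k), c.round(2))   # cluster number and its centroid
```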

Figure 4. Machine learning (ML) functions window - Clustering with the BIRCH algorithm. A drop-down list of measures to display along the [Y] axis.

Figure 5. Machine learning (ML) functions window - Clustering with the BIRCH algorithm. A drop-down list of measures to display along the [X] axis.

Figure 6. Machine learning (ML) functions window - Clustering with the BIRCH algorithm. The [Lines between centroids and points] and [Save results] checkboxes are checked.

Figure 7. Machine learning (ML) functions window - Clustering with the BIRCH algorithm. A message appears about saving the assigned cluster codes for the data pairs (X and Y) to the source file on the "BIRCH" sheet. The column names for the assigned clusters record the name of the clustering method, whether the clusters were detected automatically or user-defined, the names of the pair of measures, and the [Threshold value] and [Number of clusters] settings selected by the user.
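How the saved output might look can be sketched with pandas (illustrative only: the file name and the cluster-column name below are examples, not the application's actual naming scheme):

```python
import pandas as pd
from sklearn.cluster import Birch

# Example data pairs (X and Y); in practice these come from the imported file.
df = pd.DataFrame({"X": [0.1, 0.2, 5.0, 5.1],
                   "Y": [0.0, 0.1, 5.2, 5.3]})

# The column name records the method and settings used (hypothetical format).
df["BIRCH_threshold-0.5_clusters-2"] = Birch(
    threshold=0.5, n_clusters=2).fit_predict(df[["X", "Y"]])

# Write the assigned cluster codes to a "BIRCH" sheet of an xlsx file.
df.to_excel("birch_results.xlsx", sheet_name="BIRCH", index=False)
```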

Figure 8. Machine learning (ML) functions window - Clustering with the BIRCH algorithm. A tooltip is displayed when you hover the mouse over the button that opens the auxiliary function for drawing vertical and horizontal lines on graphs.

Figure 9. Machine learning (ML) functions window - Clustering with the BIRCH algorithm. Window of the auxiliary function for drawing vertical and horizontal lines on graphs. Two vertical lines with names and one horizontal line have been added. You can display any number of lines with labels (name-value), change the value of any line selected in the list, and delete any line selected in the drop-down list or all lines at once.

Reasons why the quality of a mathematical model built with the BIRCH clustering method may be insufficient
  1. Suboptimal hyperparameter tuning: BIRCH clustering has hyperparameters, such as the threshold and cluster radii, that need to be tuned. An incorrect choice of hyperparameters can lead to poor model quality.
  2. Data inaccuracy and inconsistency: The quality of BIRCH clustering can be poor if the data contains noise or outliers, which can distort the boundaries and structure of the clusters.
  3. Unspecified or incorrectly selected similarity criterion: The quality of BIRCH clustering may depend on the choice and tuning of the similarity criterion. An incorrect choice of similarity criterion can lead to insufficiently accurate clustering.
  4. Incorrect data scaling: If the data has different value ranges or different units of measurement, improper scaling can result in poor-quality BIRCH clustering.
  5. Insufficient data: The quality of BIRCH clustering may suffer if too little data is available to train the model. More data can improve the quality of clustering.
Gaussian Mixture Clustering
[Machine Learning Function - BIRCH Clustering] button

The Gaussian Mixture model is a probabilistic model that assumes that all data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. This machine learning algorithm can assign each sample to the Gaussian it most likely belongs to. In our analysis, Gaussian Mixture uses one option for constraining the covariance of the estimated classes: full covariance.

An expectation-maximization model (Gaussian Mixture) always uses the number of components specified by the user, while a variational inference model (Bayesian Gaussian Mixture) effectively uses only as many components as are needed for a good fit. If the user-specified number of components is less than the effective number, the Bayesian Gaussian Mixture plot will display the user-specified number of components.
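Assuming a scikit-learn-style backend, the difference can be sketched as follows: with data drawn from two Gaussians but five requested components, expectation maximization keeps weight on all five, while variational inference tends to push the weights of unneeded components towards zero.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture, GaussianMixture

rng = np.random.default_rng(0)
# Data actually drawn from 2 Gaussians, but 5 components are requested.
X = np.vstack([rng.normal(0.0, 0.5, (200, 2)),
               rng.normal(5.0, 0.5, (200, 2))])

gm = GaussianMixture(n_components=5, covariance_type="full",
                     random_state=0).fit(X)
bgm = BayesianGaussianMixture(n_components=5, covariance_type="full",
                              random_state=0).fit(X)

print(gm.weights_.round(2))   # EM: weight spread over all 5 components
print(bgm.weights_.round(2))  # variational: unneeded components shrink toward 0
```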

Clustering by the Gaussian Mixture algorithm is demonstrated in two graphs corresponding to the Bayesian Gaussian Mixture and Gaussian Mixture algorithms.

For greater clarity, ellipsoids of the Gaussian mixture model are displayed on the graphs.

Figure 10. Machine learning (ML) functions window. A tooltip is displayed when you hover the mouse over the button that opens the clustering function using the Gaussian Mixture algorithm.

Figure 11. Clustering function window for Bayesian Gaussian Mixture and Gaussian Mixture algorithms. The number of components parameter is set to (3).

Figure 12. Clustering function window for Bayesian Gaussian Mixture and Gaussian Mixture algorithms. The number of components parameter is set to (5).

Figure 13. Clustering function window for Bayesian Gaussian Mixture and Gaussian Mixture algorithms. The number of components parameter is set to (10).

The example in the figure below demonstrates the performance of the BIRCH and Gaussian Mixture clustering algorithms on “interesting” data sets.

Figure 14. Comparative demonstration of the performance of the BIRCH and Gaussian Mixture clustering algorithms on “interesting” data sets. The last data set (right column) is an example of a “null” situation for clustering: the data is homogeneous and does not cluster well.
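Data sets of this "interesting" kind can be generated for your own experiments, e.g. two interleaving half-moons (a sketch assuming scikit-learn; it illustrates the comparison only, not the application's internals):

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_moons
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Non-convex "interesting" data: two interleaving half-circles.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
X = StandardScaler().fit_transform(X)   # same standardization step as below

birch_labels = Birch(threshold=0.3, n_clusters=2).fit_predict(X)
gmm_labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X)

# Both algorithms favour compact/elliptical clusters, so neither is
# guaranteed to recover the two half-moons exactly.
print(np.unique(birch_labels), np.unique(gmm_labels))
```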

Preliminary automatic data preparation

Before clustering is applied, the imported data is automatically scaled using standardization.

Standardization is the process of scaling data so that it has a mean of 0 and a standard deviation of 1.
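A minimal illustration of this step, using scikit-learn's `StandardScaler` as an assumed stand-in for the application's internal scaling:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns with very different ranges (e.g. metres vs. grams).
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
Xs = StandardScaler().fit_transform(X)

print(np.allclose(Xs.mean(axis=0), 0.0))  # True: each column now has mean 0
print(np.allclose(Xs.std(axis=0), 1.0))   # True: ... and standard deviation 1
```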

If the imported data contains a categorical column such as [male, female], the user will be prompted to automatically one-hot encode the column, converting the data into new numeric code columns [0, 1]. The one-hot encoded data is saved in the original [xlsx] file on a new sheet.

One-hot encoding is used to convert categorical variables into a format that can be easily used by machine learning algorithms. The basic idea of one-hot encoding is to create new variables that take the values [0] and [1] to represent the original categorical values. In other words, each unique value from a non-numeric column is converted into a new binary column containing [0] and [1] flags. In this column, [1] indicates the presence of the value and [0] indicates its absence.
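A sketch of one-hot encoding with pandas (the column names are illustrative; the application's actual naming may differ):

```python
import pandas as pd

df = pd.DataFrame({"gender": ["male", "female", "female", "male"]})
# Each unique category becomes its own 0/1 column.
encoded = pd.get_dummies(df, columns=["gender"], dtype=int)
print(list(encoded.columns))            # ['gender_female', 'gender_male']
print(encoded["gender_male"].tolist())  # [1, 0, 0, 1]
```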

Reasons why the quality of a mathematical model using the Bayesian Gaussian Mixture and Gaussian Mixture clustering methods may be insufficient
  1. Incorrect choice of the number of components: Both clustering methods rely on a correct choice of the number of components in the model. If too few components are selected or, conversely, too many, clustering may be insufficiently accurate.
  2. Suboptimal hyperparameter tuning: Both methods have hyperparameters, such as the covariance matrix parameters and prior distributions, that need to be tuned. An incorrect choice or tuning of hyperparameters can lead to a poor-quality clustering model.
  3. Violated distributional assumptions: The Bayesian Gaussian Mixture and Gaussian Mixture methods assume that the data is Gaussian-distributed. If the data does not meet this assumption, the quality of the clustering may be insufficient.
  4. Incorrect handling of outliers and noise: The presence of outliers and noise in the data can negatively affect the quality of clustering. If the methods are not adapted to handle outliers, or the data is not preprocessed, clustering quality can suffer.
  5. Insufficient or incorrect data scaling: If your data has different value ranges or different units of measurement, you need to scale the data properly before clustering. Incorrect scaling can affect the quality of clustering.