Shewhart control charts
PRO-Analyst +AI
for Windows, Mac, Linux

Register of Russian software (entry No. 18857 dated 09/05/2023)

Pair correlation plots (scatter plots) with distribution histograms and a correlation heat map for an unlimited number of factors

Multivariate statistical analysis (MSA).

[Multivariate statistical analysis] button

The scatter plot feature, with distribution histograms and a correlation heat map, provides an effective way to visualize the statistical relationships between the many factors (measurements and counts) in your data. Each plot displays the equation of the trend line, the Pearson correlation coefficient [R], and the coefficient of determination [R²].
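The three quantities shown on each plot can be reproduced by hand. The following is a minimal sketch, assuming an ordinary least-squares trend line; the function name `trend_line_stats` and the sample values are hypothetical and are not taken from the program.

```python
from math import sqrt

def trend_line_stats(x, y):
    """Return slope, intercept, Pearson r and R^2 for paired samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products around the means
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx                      # least-squares slope
    intercept = mean_y - slope * mean_x    # line passes through the means
    r = sxy / sqrt(sxx * syy)              # Pearson correlation coefficient
    return slope, intercept, r, r * r      # R^2 = r^2 for a single predictor

# Hypothetical paired measurements (illustration only)
viscosity = [10.0, 12.0, 14.0, 16.0, 18.0]
ph = [7.9, 7.6, 7.2, 7.1, 6.7]

slope, intercept, r, r2 = trend_line_stats(viscosity, ph)
print(f"y = {slope:.4f}x + {intercept:.4f}, R = {r:.3f}, R² = {r2:.3f}")
```

The sign of the slope and the sign of R always agree, so a falling trend line goes together with a negative correlation coefficient.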

You can download an example of a structured spreadsheet file for creating scatter plots with distribution histograms and a correlation heat map: XLSX.

Structured data can be imported from spreadsheet files: Excel workbook (*.xlsx); Excel binary workbook (*.xlsb); OpenDocument Spreadsheet (*.ods).

It is important to note that a high correlation coefficient does not prove a cause-and-effect relationship between the analyzed factors; it only indicates a statistical association between them. For example, both factors may depend on a third factor or on a group of other factors.

Figure 1. The main program window menu, opened to go to the multivariate data analysis control panel.

Multivariate correlation analysis of quality characteristics-1.

Figure 2. A tooltip appears when you hover the mouse over the button that opens the control panel for pair correlation plots (scatter plots) with histograms of the distribution of individual values.

Multivariate correlation analysis of quality characteristics-2.

Figure 3. Scatter plot control panel with histograms. Left-clicking a point on a scatter plot displays a caption with the number of the data point (row). Left-clicking a colored cell in the heat map displays a caption with the names of the source data columns on the Y and X axes and the correlation coefficient. To hide a caption, right-click on it.

With a large number of monitored measurable factors, it is difficult for even an experienced technologist to keep track of the possible relationships between the monitored process characteristics. With our software, you can analyze an unlimited number of factors in one click and spot anomalous outliers (points outside the overall population on the plot) or discrepancies in the expected size and direction of correlation (negative, zero, positive) in pairs of analyzed values.
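The values behind a correlation heat map are simply the pairwise correlation coefficients of all columns. A minimal sketch of that calculation is shown here, assuming Pearson correlation; the column names and numbers are hypothetical and the program's own implementation may differ.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlation_matrix(columns):
    """Map every pair of column names to its correlation coefficient."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical source data columns (illustration only)
data = {
    "viscosity": [10.0, 12.0, 14.0, 16.0, 18.0],
    "pH": [7.9, 7.6, 7.2, 7.1, 6.7],
    "temperature": [20.1, 20.5, 19.8, 20.9, 20.3],
}

matrix = correlation_matrix(data)
for row_name, row in matrix.items():
    print(row_name, {col: round(value, 2) for col, value in row.items()})
```

Each factor correlates perfectly with itself, so the diagonal of the matrix is 1, and the matrix is symmetric about that diagonal.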

Multivariate correlation analysis of the quality characteristics Viscosity and pH

Figure 4. Scatter plot control panel with histograms. A tooltip appears when you hover the mouse over the button that opens the heat map control panel.

Multivariate Correlation Analysis Ames Housing Dataset

Figure 5. Heat map control panel. Correlation coefficient labels in the heat map are disabled. The full range of all 35 source data columns is selected. A caption is displayed for the correlation cell that the user selected on the heat map. Data source: Ames Housing Dataset.

The expression "4.552e+04" means the number 45,520. It is written in scientific notation, where "e+04" means multiplication by 10 to the power of 4, that is, 4.552 × 10,000 = 45,520.
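The same conversion can be checked in a few lines. This is only an illustration of the notation, with the label copied from the text above.

```python
label = "4.552e+04"          # label as written in scientific notation
value = float(label)         # "e+04" shifts the decimal point four places right

print(value)                 # 45520.0
print(f"{value:,.0f}")       # 45,520
print(f"{value:.3e}")        # 4.552e+04 (back to scientific notation)
```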

Multivariate Correlation Analysis Ames Housing Dataset

Figure 6. Heat map control panel. Correlation value labels are enabled, and the range from column 25 to column 35 (inclusive) of the source data is selected. Data source: Ames Housing Dataset.

Definitions

The correlation coefficient and the coefficient of determination are related to each other and both are used to measure the degree of relationship between two variables.

The correlation coefficient (denoted as R or r) measures the degree of linear relationship between two variables (x) and (y). It takes values from -1 to 1, where -1 means a perfect negative linear relationship, 1 means a perfect positive linear relationship, and 0 means no linear relationship. The correlation coefficient shows how close the data points are to a trend line or regression line: the closer the data points lie to the trend line, the closer the absolute value of the correlation coefficient is to 1 and the stronger the linear relationship between the variables (x) and (y).

The coefficient of determination (denoted as R² or r²) is, for a linear trend line with a single independent variable, the square of the correlation coefficient. It shows how much of the variance in the dependent variable (y) can be explained by the independent variable (x). The coefficient of determination ranges from 0 to 1, where 0 means that the independent variable does not explain the variability in the dependent variable, and 1 means that the independent variable fully explains the variability in the dependent variable.

Thus, the correlation coefficient shows the degree of relationship between variables, while the coefficient of determination shows how well the independent variable explains the variability in the dependent variable.
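For reference, the two definitions can be written as formulas for n paired observations. This sketch assumes a least-squares trend line with an intercept, ŷᵢ = a + b·xᵢ; the symbols a, b and ŷ are notation introduced here for illustration.

```latex
r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}
         {\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}},
\qquad
R^2 = 1-\frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2} = r^2 .
```

For example, r = -0.9 gives R² = 0.81: the trend line accounts for 81% of the variance in y, while the sign (the direction of the relationship) is visible only in r.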

Identifying outliers

Simple graphical methods often make it possible to see which of the two factors in a pair is responsible for an observed outlier. To do this, look at the correlation plot of each factor with itself, together with its histogram; see Figure 7.

Multivariate correlation analysis: graphs of factor 1 and factor 3 correlation with itself

Figure 7. Correlation heat map dashboard: plots of the correlation of all factors, and of Factor-1 in particular, with themselves. There is a problem with the recording of two values of Factor-1.

However, in an operational sense, this identification of the "culprit of the outlier" can only be confirmed or refuted by a Shewhart XmR control chart of individual values built from the original Factor-1 data; see Figure 8 below.

Figure 8. XmR control chart of individual values built from the original Factor-1 data (before removing outliers).

XmR control chart of individual values built from the original Factor-2 data

Figure 9. XmR control chart of individual values built from the original Factor-2 data. The run of red points from 81 to 89 on the mR chart is a reason to investigate the special causes that appeared at these points. Importantly, the multivariate analysis in Figure 7 cannot reveal this.

XmR control chart of individual values built from the original Factor-3 data

Figure 10. XmR control chart of individual values built from the original Factor-3 data.

Important

Sometimes removing just one outlier point can reverse the direction of the correlation (the slope of the trend line) from positive to negative. Be aware that the trend line and every quantity calculated automatically from it can behave this way: the equation of the trend function, the coefficient of determination R² (the goodness of fit of the approximation), and the correlation coefficient R. This caveat also applies to the equations of linear and other types of regression built from the original data. The first step is to look at your data plotted on a Shewhart control chart. Pay attention to how the operator enters the initial data, and improve that process with automated validation of entered values.
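The sign reversal is easy to reproduce. The sketch below uses hypothetical data in which five points fall on a descending line and a single distant point pulls the trend upward; the numbers are invented for illustration only.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: the last point is the outlier
x = [1, 2, 3, 4, 5, 20]
y = [5.0, 4.8, 4.6, 4.4, 4.2, 12.0]

r_with_outlier = pearson(x, y)               # positive: dominated by the outlier
r_without_outlier = pearson(x[:-1], y[:-1])  # negative: the underlying trend

print(f"R with outlier:    {r_with_outlier:+.3f}")
print(f"R without outlier: {r_without_outlier:+.3f}")
```

One point out of six is enough to turn a strictly descending relationship into a strongly positive coefficient, which is why the raw data should be examined before any automatically calculated value is trusted.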

Example. At one large manufacturing enterprise producing a single type of product that varied only slightly in length and diameter, a multivariate correlation analysis of a pair of product roundness indicators showed opposite directions of correlation for these indicators, with no evidence of outliers in the original data. This allowed me to point out to production management that a line operator may control the same processes differently depending on the size of the product, which led to an investigation into what the operator actually does.

Outliers often have trivial causes, for example, an error made by an inspector when recording values read from instruments (and this is a special cause). Shewhart control charts readily expose such erroneous entries when they fall outside the zone bounded by the upper and lower control limits of the process, for example, when the decimal separator is shifted by one digit: instead of 0.232, the value 0.0232 or 2.32 is recorded.

But there are cases when an inspector gets a single digit wrong and the recorded value still falls within the zone bounded by the upper and lower control limits of the process. For example, 0.282 is recorded instead of 0.232. In this case, multivariate statistical functions have a better chance of identifying the data row with the recording error. But you must understand that the robustness (universal applicability) of Shewhart control charts comes from the fact that such errors have no significant impact on the calculation of the process control limits, and this is the most important property of Shewhart control charts.
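Both cases can be illustrated with the usual XmR calculation for individual values, where the limits are the mean plus or minus 2.66 times the average moving range. This is a minimal sketch with a hypothetical series of twelve readings; the function names are invented for illustration and are not the program's API.

```python
def xmr_limits(values):
    """Return (lower, center, upper) limits for the chart of individual values."""
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)   # average moving range
    center = sum(values) / len(values)
    return center - 2.66 * mr_bar, center, center + 2.66 * mr_bar

def points_outside_limits(values):
    """Indices of values beyond the limits computed from the same series."""
    lower, _, upper = xmr_limits(values)
    return [i for i, v in enumerate(values) if v < lower or v > upper]

# Hypothetical readings; the true value at index 5 is 0.232
baseline = [0.221, 0.190, 0.265, 0.210, 0.250, 0.232,
            0.270, 0.225, 0.195, 0.255, 0.215, 0.240]

decimal_shift = list(baseline)
decimal_shift[5] = 2.32      # decimal separator shifted by one digit

digit_error = list(baseline)
digit_error[5] = 0.282       # one digit recorded incorrectly

print(points_outside_limits(decimal_shift))  # the shifted value is flagged
print(points_outside_limits(digit_error))    # the digit error stays inside
```

In the second series the limits barely move compared with the error-free data, which is the robustness property described above.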

Those who enjoy working with numbers can use the machine learning features for regression models (predicting continuous variables) in our software, or the Data Analysis package included with Microsoft Excel, to calculate a linear regression model of the data. Next, you can use a Shewhart control chart (XmR or XbarR) to analyze the residuals (the differences between the actual and model-predicted values). If the control chart shows points or subgroups (red dots) with signs of special causes of variation, which may also indicate a mismatch between the data model and the current process, these causes will have to be investigated and eliminated.

For example, an XmR control chart of individual values and moving ranges, built from the [residuals] of the regression model, will serve as an operational definition of an outlier, replacing a subjective judgment (outlier or not an outlier) about the points observed in the scatter plots.

Moreover, such an analysis retains information about the deviation of the actual value from the value predicted by the linear regression function, which greatly facilitates the interpretation of your data and is an important difference from analyzing data using Hotelling T² charts.
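The residual approach described above can be sketched as follows, assuming a simple least-squares line and the standard 2.66 scaling factor for the chart of individual values. The data are synthetic, with one deliberately shifted point, and all names are hypothetical.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for paired samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def x_chart_limits(values):
    """Lower and upper limits for individual values (mean ± 2.66 × average mR)."""
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    center = sum(values) / len(values)
    return center - 2.66 * mr_bar, center + 2.66 * mr_bar

# Synthetic data: y = 2x + 1 plus small noise, with one shifted point
x = list(range(1, 16))
noise = [0.05, -0.08, 0.02, 0.10, -0.05, -0.02, 0.07, -0.10,
         0.03, 0.00, -0.06, 0.08, -0.03, 0.04, -0.05]
y = [2.0 * xi + 1.0 + e for xi, e in zip(x, noise)]
y[9] += 3.0   # hypothetical special cause at row index 9

slope, intercept = fit_line(x, y)
# A positive residual means the actual value lies above the prediction
residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]

lower, upper = x_chart_limits(residuals)
flagged = [i for i, res in enumerate(residuals) if res < lower or res > upper]
print("Rows with signals on the residual chart:", flagged)
```

Because each residual keeps its sign, the chart shows not only that a row is unusual but also whether the actual value was above or below the prediction.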

On the enthusiasm for multivariate SPC

Many specialists who prefer working with numbers rather than with processes at the shop level claim that multivariate statistical process control manages multifactor processes more effectively than conventional Shewhart control charts. This claim makes no sense. It implies that Shewhart, Deming, and Wheeler built their control charts for single-factor processes, but such processes simply do not exist. Moreover, production processes that you have not yet begun to manage with Shewhart control charts are most likely in a statistically uncontrolled (unpredictable) state. Shewhart control charts for such processes will already show signals that need to be dealt with in order to eliminate special causes and bring the processes into a statistically controlled state.

While multivariate analysis may seem cutting-edge to management, explaining to workers on the shop floor what you have learned using multivariate statistical control will only confuse them, reinforce the impression that improving shop floor processes is "very difficult", and further discourage the company's shop floor employees.