Analysis of rare events using Shewhart control charts

Article: [39] DONALD J. WHEELER: "Working with Rare Events". Source: www.qualitydigest.com
Translation, notes, and additional graphic materials with explanations: Sergey P. Grigoryev, Scientific Director of the AQT Center, using material from the article with the kind permission of Donald Wheeler.

Free access to articles does not in any way diminish the value of the materials contained in them.

What happens when the average quantity becomes very small?

From a data analysis perspective, rare events are problematic. Until an event occurs there is nothing to count, and as a result many of our time periods will end up with counts of zero. Since these zero counts contain no real information, we need to consider alternatives to counting rare events. This article will look at simple and complex ways of working with rare events.

Our first example will involve spills at a chemical plant. Although spills are undesirable, and although every effort is made to prevent them, they do occasionally occur. Over the past few years, a single plant has experienced an average of one spill every eight months. Of course, if a plant experiences a spill on average once every eight months, then 1 spill per month would be 700% higher than average! (When dealing with rare events, a one-unit change can make a huge percentage difference.) A total of six spills occurred in the first four years. Six spills in 48 months gives an average of 0.125 spills per month.

How about putting these counts on an XmR chart for individual values and moving ranges, as suggested earlier in the article "Control charts for attribute data (counts): p-chart, np-chart, c-chart and u-chart, or a single XmR chart of individual values"? Using the first four years as a baseline, the upper control limit for the moving range chart is 0.83, and for the X chart of individual values it is 0.80. This turns every month with a spill into a signal of a change in the system! Clearly this is a misinterpretation of these data. The problem is that the XmR chart suffers here from data starvation. (Sparse data can occur with any type of data. Count data tend to become sparse when the average count falls below 1.00. Sparse data artificially narrow the limits of the process behavior chart and lead to an excessive number of false alarms.)
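The limit calculation described here can be sketched as follows. The spill months below are hypothetical placements (the article gives only the totals: six spills in 48 months), chosen so that no two spill months are adjacent; the 2.66 and 3.268 factors are the standard XmR scaling constants.

```python
# Sketch: XmR limits for sparse monthly spill counts.
# Spill-month positions are illustrative, not from the article.
counts = [0] * 48
for m in [10, 21, 29, 34, 40, 46]:   # hypothetical spill months
    counts[m] = 1

x_bar = sum(counts) / len(counts)                       # 6/48 = 0.125
m_ranges = [abs(a - b) for a, b in zip(counts[1:], counts)]
mr_bar = sum(m_ranges) / len(m_ranges)                  # 12/47 ≈ 0.255

ucl_x  = x_bar + 2.66 * mr_bar    # X chart upper limit
ucl_mr = 3.268 * mr_bar           # moving range upper limit

print(round(ucl_x, 2), round(ucl_mr, 2))   # → 0.8 0.83
```

Every month containing a spill (a count of 1) falls above the 0.80 upper limit, which is exactly the flood of false alarms the text describes.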


Figure 1. XmR chart for the number of spills per month

When rare events are counted, the specialty charts become insensitive and the XmR chart breaks down. This is not a problem with the control charts but with the data themselves. Counts of rare events are inherently insensitive and weak. No matter how these counts are analyzed, placing them on any type of control chart will reveal nothing. But there are other ways to characterize rare events. Instead of counting the number of spills each month (counting events), you could instead measure the number of days between spills (measuring the area of opportunity between rare events). For these data, the time intervals between spills are computed as follows.


Figure 2. Determining the time between spills.

One spill in 322 days translates into a spill rate of 0.0031 spills per day: 1 / 322 ≈ 0.0031

If we multiply this daily spill rate by 365, we get 1.13 spills per year: 0.0031 × 365 = 1.1315

Thus, the interval between the first and second spills is equivalent to a rate of 1.13 spills per year. Likewise, the 247-day interval between the second and third spills translates into a rate of 1.48 spills per year. Continuing in this way, each time an event occurs we obtain an instantaneous spill rate.
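The conversion from an interval to an instantaneous rate is a one-line calculation; a minimal sketch using the two intervals quoted above:

```python
# Instantaneous yearly spill rate from the days between two spills.
def spill_rate(days_between: int) -> float:
    return 365.0 / days_between

print(round(spill_rate(322), 2))   # → 1.13
print(round(spill_rate(247), 2))   # → 1.48
```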


Figure 3. Instantaneous spill rates


Figure 4. XmR-chart for spill rate

The average spill rate during the first four years is 1.418 spills per year. The average moving range is 0.244. Although five values is a bare minimum for constructing an XmR chart, it took four years to obtain these five values!

A future point above the upper control limit will indicate that the spill rate has increased. A future point below the lower control limit will indicate that the spill rate has decreased. Points between the control limits will be interpreted to mean that the spill rate has not changed. The two spills in 2005 occurred after intervals of 172 and 115 days, respectively. These intervals translate into spill rates of 2.12 and 3.17 spills per year. When these values are added to the XmR chart, we get the result shown in Figure 5.
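This check can be sketched from the baseline statistics quoted above (a mean rate of 1.418 spills per year and a mean moving range of 0.244), using the usual XmR scaling factor of 2.66:

```python
# Sketch: baseline XmR limits for the spill rates, then test the 2005 points.
x_bar, mr_bar = 1.418, 0.244          # baseline mean and mean moving range
ucl = x_bar + 2.66 * mr_bar           # ≈ 2.067 spills per year
lcl = x_bar - 2.66 * mr_bar           # ≈ 0.769 spills per year

for days in (172, 115):               # 2005 inter-spill intervals
    rate = 365 / days
    print(round(rate, 2), rate > ucl)  # → 2.12 True, then 3.17 True
```

Both 2005 rates fall above the upper limit, the first only barely, the second decisively.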


Figure 5. Full XmR-chart for spill rates

Although the first spill of 2005 falls outside the limit, it barely exceeds it. Given the softness of limits based on only five values, we might hesitate to interpret this sixth point as a clear signal of change. However, the seventh point is far enough outside the limits to be safely interpreted as a definite signal: the spill rate has increased this year. Returning to Figure 1, we can see that the spills are getting closer together, but we could not detect this change until we moved from counting rare events to measuring the area of opportunity between events.

Note that although Figures 1 and 5 look at spill rates, there was a change in the variable between Figures 1 and 5. In Figure 1, the variable was the number of spills per month. Here the numerator could change (number of spills) while the denominator remained constant (one month). Figure 5 shows instantaneous spill rates where the numerator remains constant (one spill) but the denominator can vary (days between spills).

In Figure 6, instead of instantaneous spill rates, the number of days between spills is used to construct the XmR chart. This chart tracks an inverse measure: as spills become more frequent, the points in Figure 6 move downward. This simple inversion creates cognitive dissonance for those who must interpret the chart. While this is not an insurmountable obstacle, it is an unnecessary one. The instantaneous spill rates in Figure 5 are easier to use and easier to interpret than the days between spills in Figure 6.


Figure 6. XmR chart for days between spills

Besides being an inverse measure, the time-between-events chart is also less sensitive than the chart of instantaneous spill rates. In Figure 5, an increased spill rate will be detected whenever the rate exceeds 2.066 spills per year. The lower limit of Figure 6 corresponds to a spill rate of 2.714 spills per year (365 × 1/134.475). Given that these are methods for rare events, and that we want to detect any increase in the spill rate as soon as possible, the lower sensitivity of Figure 6 is undesirable.
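The sensitivity comparison can be reproduced directly; the 134.475-day lower limit is taken from Figure 6 as quoted above:

```python
# Express the days-between chart's lower limit as an equivalent yearly rate
# and compare it with the rate chart's upper limit.
rate_chart_ucl  = 2.066                  # Figure 5 upper limit, spills/year
days_chart_lcl  = 134.475                # Figure 6 lower limit, days
equivalent_rate = 365 / days_chart_lcl   # rate needed to signal in Figure 6

print(round(equivalent_rate, 3))   # → 2.714
print(equivalent_rate > rate_chart_ucl)  # → True: Figure 6 signals later
```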

Although the instantaneous-rate chart is usually the chart of choice, there is one situation where the time-between-events chart is useful: when the lower limit of Figure 5 falls below zero. When this happens, the instantaneous-rate chart can no longer detect improvements. If you are taking action to reduce the frequency of rare events, so that detecting improvements is important, you may need to construct control charts for both the instantaneous rates and the times between events. The chart of instantaneous rates will let you detect increases in the frequency of the rare events, while the chart of times between events will let you detect decreases, shown as points above its upper control limit. This will be illustrated by the following example.

Summary

When the average count per time period falls below 1.0, you are working with rare events. When this happens, the p-chart, np-chart, c-chart and u-chart all become very insensitive. At the same time, the sparse-data problem will keep you from using an XmR chart with item counts or event counts. You should then stop counting events per time period and instead measure the area of opportunity between rare events. This means you no longer obtain a value every time period; instead you obtain a value each time an event occurs. (This shift in how the data are collected argues against using this approach except for genuinely rare events.)

When working with times between events, you can compute the instantaneous rate for each event and place these rates on an XmR chart, as shown in Figures 4 and 5, or you can work directly with the times between events, as shown in Figure 6. When these control charts become one-sided, you may need to work with both types of chart in order to detect both improvements and deteriorations.

Sergey P. Grigoryev: For example, when analyzing data on failed space launches, the area of opportunity can be the number of successful launches between failures, with the failed launches treated as the rare events. The failure rate can then be taken as 1 failed launch divided by the number of successful launches since the previous failure. Alternatively, you can work directly with the number of successful launches between failures.
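A minimal sketch of this launch example, with illustrative (not real) counts of successful launches between failures:

```python
# Treat each failure as the rare event; the successful launches since the
# previous failure form the area of opportunity. Counts are hypothetical.
successes_between_failures = [45, 38, 21]

# One failure divided by the successes since the previous failure gives
# an instantaneous failure rate per launch for each event.
rates = [1 / n for n in successes_between_failures]

print([round(r, 3) for r in rates])   # → [0.022, 0.026, 0.048]
```

These rates (or the raw success counts themselves) can then go on an XmR chart exactly as with the spill data above.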