Outlier Detection and Filtering Pipelines for SPC Automation

In high-volume manufacturing, raw telemetry and dimensional measurements are never pristine. Sensor drift, probe fouling, electromagnetic interference, and transient mechanical shocks inject artifacts that distort control charts, deflate capability indices, and trip false alarms. The hardest requirement is also the most misunderstood: an outlier filter has to separate measurement-system error from genuine process instability — and the same numeric spike can be either. Delete too aggressively and you erase the assignable-cause signal a chart exists to catch; delete too little and gauge noise widens the limits until nothing ever alarms. This stage of the manufacturing data ingestion and preprocessing pipeline decides, per point, whether a value is a data defect to suppress or a process event to preserve — and records why, so the decision survives an audit.

Raw signals pulled in by connecting Python to MES and SCADA systems arrive as asynchronous event streams with inconsistent timestamps, missing tags, and occasional packet loss. Before any statistical test runs, those streams need engineering units standardized, calibration offsets applied, and schema enforced by batch data validation and error handling. Skip that gate and the outlier layer flags unit-conversion errors and calibration jumps as process anomalies, corrupting the baseline it is supposed to protect. Filtering must also run before gap treatment: a MAD filter applied to values produced by missing-value imputation creates a circular loop in which the filter evaluates synthesized numbers instead of measurements.

What Breaks Without a Filtering Policy

Skip a deliberate policy and the failures are systematic, not random. The first is the masked shift. A blunt global z-score threshold applied across a full shift will happily clip the leading points of a real tool-wear ramp or thermal excursion — the exact assignable cause the chart should report — because early shift points look "extreme" against the pooled mean. The filter cleans away the evidence, the chart stays quiet, and containment never triggers.

The second is autocorrelation-driven over-flagging. High-frequency process data is serially correlated; consecutive readings are not independent draws. A pointwise z-score assumes independence, so on autocorrelated series it inflates the Type I error rate and marks long, physically real drifts as clouds of outliers, which then get imputed into a flat line that collapses the moving range.

The third is row deletion. Dropping a "bad" row silently changes subgroup size, so the X-Bar R chart implementation downstream applies the wrong A₂ constant to a subgroup that quietly lost a point, and it breaks the chronological index that Western Electric and Nelson run rules walk over. The fourth is capability distortion: leaving suppressed points unlogged, or filtering both tails symmetrically when only one is a defect, biases the standard deviation feeding Cp/Cpk and Pp/Ppk and reports a capability the process never demonstrated. None of these announce themselves — the chart still renders. A tiered policy converts each suspect point into an explicit, logged decision before a single limit is drawn.

Statistical Specification: Robust Isolation

Classical outlier tests key off the mean and standard deviation, but both estimators are pulled by the very outliers they are meant to find — a single gross error can inflate $\sigma$ enough to hide itself. Robust filtering replaces them with the median and the median absolute deviation (MAD), which have a 50% breakdown point.

For a rolling window of observations, define the local center and dispersion:

Local center: $\tilde{x} = \operatorname{median}(x_{\,i-w+1}, \dots, x_i)$
Median absolute deviation: $\mathrm{MAD} = \operatorname{median}\!\left(\lvert x_j - \tilde{x} \rvert\right)$
Robust sigma estimate: $\hat{\sigma} = 1.4826 \times \mathrm{MAD}$
Acceptance band (flag $x_i$ when it falls outside): $\tilde{x} - k\,\hat{\sigma} \;\le\; x_i \;\le\; \tilde{x} + k\,\hat{\sigma}$

The constant $1.4826 = 1/\Phi^{-1}(0.75)$ rescales the MAD so that $\hat{\sigma}$ is a consistent estimator of the population standard deviation under a Gaussian model. The threshold multiplier $k$ trades sensitivity against robustness. Common operating points:

Threshold k	Approx. two-sided false-flag rate (Gaussian)	Typical use
2.5	≈ 1.2 %	Tight gauge, low-noise dimensional data
3.0	≈ 0.27 %	General-purpose default
3.5	≈ 0.05 %	Heavy-tailed or vibration-prone signals
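
The constant and the false-flag rates in the table can be reproduced from the standard normal distribution. The check below is a minimal sketch using only the Python standard library; the function name is an illustrative choice, and the rates assume an ideal Gaussian with a known sigma, so a finite rolling window will run somewhat higher.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, sigma 1

# The MAD of a Gaussian equals sigma * Phi^-1(0.75), so the rescaling
# factor is the reciprocal of that quantile.
mad_to_sigma = 1.0 / std_normal.inv_cdf(0.75)
print(f"MAD-to-sigma factor: {mad_to_sigma:.4f}")


def two_sided_false_flag_rate(k: float) -> float:
    """Probability that an in-control Gaussian point lands outside +/- k sigma."""
    return 2.0 * (1.0 - std_normal.cdf(k))


rates = {k: two_sided_false_flag_rate(k) for k in (2.5, 3.0, 3.5)}
for k, rate in rates.items():
    print(f"k = {k}: {100 * rate:.3f} %")
```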

Carry $\hat{\sigma}$ and the band edges in double precision. Filtering on float32 accumulations of a rolling median over long windows can shift a borderline point across the band and change which observations survive into the limit calculation.

When to Filter vs. When to Let the Chart Decide

Not every out-of-band point should be filtered. The pipeline's job is to remove measurement-system error and hand process variation to the charting engine untouched.

Filter it when the point violates a hard engineering or gauge limit, or when a maintenance log or status tag confirms a sensor fault, cable transient, or calibration event coincident with the reading. These are data defects.
Do not filter it when the point is statistically extreme but physically plausible and unaccompanied by any fault evidence. A genuine spindle-load spike or a step change from a tool swap is exactly what the control chart must see. Suppressing it is tampering.
Choose the estimator by data structure. For subgrouped variables data feeding an X-Bar R chart, filter within rational subgroups so you never smear across boundaries. For high-frequency single-stream data feeding an Individuals & Moving Range (I-MR) chart, use the rolling-MAD approach with an AR(1) residual step, because I-MR has no within-subgroup averaging to damp serial correlation.

When multiple asynchronous stations feed one chart, run the time-series alignment pipeline first — filtering misaligned streams flags the phase offset itself as a swarm of outliers. Fine-grained tactics for keeping the filter blind to real shifts live in filtering measurement outliers without masking real shifts.

Tiered Detection Architecture

Layer 1 — Hard engineering limits

Enforce deterministic boundaries derived from gauge R&R studies, sensor specifications, and physical process constraints before any statistical method runs. Values outside these bounds are measurement-system failure or catastrophic tooling events, not process variation. Replace with NaN, tag HARD_LIMIT, and raise an engineering review — a value that is physically impossible must never enter the estimator.
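
As a minimal sketch of this layer in isolation, the function below blanks out-of-range readings to NaN and tags them while keeping every row. The function name, the column names, and the 24.0 to 26.0 mm limits are hypothetical; real limits would come from the gauge R&R study and sensor specification.

```python
import numpy as np
import pandas as pd


def apply_hard_limits(values: pd.Series, lower: float, upper: float) -> pd.DataFrame:
    """Blank out physically impossible readings and tag them, keeping every row."""
    out_of_range = (values < lower) | (values > upper)
    return pd.DataFrame(
        {
            "value": values.where(~out_of_range),  # NaN where a limit is violated
            "status": np.where(out_of_range, "HARD_LIMIT", "PASS"),
        },
        index=values.index,
    )


# Hypothetical bore-diameter readings in mm with assumed limits of 24.0 to 26.0.
raw = pd.Series([25.01, 24.98, 0.0, 25.03, 99.9, 25.00])
screened = apply_hard_limits(raw, lower=24.0, upper=26.0)
print(screened)
```

Because the impossible values become NaN before any median is computed, they cannot pull the rolling estimator in the next layer.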

Layer 2 — Rolling statistical isolation

Static global thresholds generate excessive false positives during normal tool wear or thermal ramp-up. The rolling-window MAD estimator above adapts its center and dispersion to the local process state, so it flags a genuine spike against its immediate neighbourhood instead of against a shift-wide mean that a slow ramp has already dragged off target.

Layer 3 — Autocorrelation and trend compensation

For serially correlated streams, difference the series or evaluate residuals against a lightweight AR(1) model before thresholding. Filtering the residuals — not the raw levels — restores the independence the MAD band assumes, so a real drift passes through as signal instead of being shredded into a cloud of flags. The scipy.stats module supplies the robust fitting and hypothesis routines that slot into a streaming pipeline.
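
The residual step can be sketched with NumPy alone. This is an illustration under stated assumptions, not the pipeline's actual Layer 3: the function name is hypothetical, the AR(1) coefficient is estimated by least squares through the origin on median-centered data, and the test signal (a slow ramp with a small deterministic wiggle and one injected spike) is synthetic.

```python
import numpy as np

MAD_TO_SIGMA = 1.4826


def ar1_residual_flags(x, k: float = 3.0) -> np.ndarray:
    """Flag points whose AR(1) one-step residual falls outside a MAD band."""
    x = np.asarray(x, dtype=np.float64)
    d = x - np.median(x)                       # center on a robust level
    denom = float(np.dot(d[:-1], d[:-1]))
    # Lag-1 coefficient by least squares through the origin (assumed estimator).
    phi = float(np.dot(d[1:], d[:-1])) / denom if denom > 0 else 0.0
    resid = d[1:] - phi * d[:-1]               # one-step-ahead residuals
    center = np.median(resid)
    sigma = MAD_TO_SIGMA * np.median(np.abs(resid - center))
    flags = np.zeros(x.shape, dtype=bool)      # first point has no residual
    if sigma > 0:                              # degenerate band never flags
        flags[1:] = np.abs(resid - center) > k * sigma
    return flags


t = np.arange(200)
signal = 0.01 * t + 0.05 * np.sin(1.7 * t)     # slow drift plus small wiggle
signal[120] += 2.0                             # injected transient
flags = ar1_residual_flags(signal)
print(np.flatnonzero(flags))
```

The ramp itself produces small, roughly constant residuals and passes, while the transient stands out. The point immediately after a spike tends to flag as well, because its residual rebounds by about phi times the spike; a production version would need to decide whether to tag that rebound separately. Estimating phi on a known-stable reference run, rather than on the data being filtered, would also keep the spike from biasing the coefficient.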

Production-Ready Python Implementation

The following pipeline enforces hard limits (Layer 1), applies rolling MAD to the survivors (Layer 2), imputes conditionally, and emits a per-point status column plus a per-segment audit log; the Layer 3 residual step is not included. It iterates in chunks so multi-shift, multi-station telemetry stays within the memory budget of an edge node, and it resets its rolling window at rational-subgroup boundaries so a filter never smears across a lot, shift, or tool change.

from __future__ import annotations

import numpy as np
import pandas as pd

MAD_TO_SIGMA = 1.4826  # 1 / Phi^-1(0.75): scales MAD to a Gaussian-consistent sigma


class SPCOutlierFilter:
    """Tiered, SPC-safe outlier filter with an auditable reason code per point.

    Layer 1 enforces hard engineering limits; Layer 2 applies a rolling-MAD
    band. Flagged points are imputed with the rolling median of surviving
    values so the chronological index and subgroup size are preserved for
    downstream run-rule evaluation.
    """

    def __init__(
        self,
        hard_limits: tuple[float, float],
        window: int = 50,
        k: float = 3.0,
        chunk_size: int = 100_000,
    ) -> None:
        lower, upper = hard_limits
        if lower >= upper:
            raise ValueError("hard_limits must be (lower, upper) with lower < upper")
        if window < 3:
            raise ValueError("window must be >= 3 for a stable rolling median")
        self.lower = lower
        self.upper = upper
        self.window = window
        self.k = k
        self.chunk_size = chunk_size
        self.audit_log: list[dict] = []

    def _rolling_mad_mask(self, series: pd.Series) -> pd.Series:
        """Boolean mask: True where a point falls outside the rolling-MAD band."""
        med = series.rolling(self.window, min_periods=1).median()
        # Streaming approximation of the MAD: each deviation is measured against
        # the rolling median at its own timestamp, not the current window's.
        abs_dev = (series - med).abs()
        mad = abs_dev.rolling(self.window, min_periods=1).median()
        sigma = mad * MAD_TO_SIGMA
        # A zero-MAD window (flat segment) collapses the band to zero and would
        # flag every nonzero deviation; treat it as an infinite band instead.
        band = self.k * sigma
        return abs_dev > band.where(band > 0, np.inf)

    def process_group(self, group: pd.DataFrame, value_col: str) -> pd.DataFrame:
        """Apply tiered filtering to one rational subgroup / segment in order."""
        group = group.copy()
        values = group[value_col]
        mask = pd.Series(False, index=group.index)
        reasons = pd.Series("PASS", index=group.index, dtype="object")

        # Layer 1: hard engineering limits ----------------------------------
        hard = (values < self.lower) | (values > self.upper)
        mask |= hard
        reasons[hard] = "HARD_LIMIT"

        # Layer 2: rolling MAD on the survivors of Layer 1 ------------------
        survivors = values.where(~mask)
        stat = self._rolling_mad_mask(survivors) & ~mask
        mask |= stat
        reasons[stat] = "ROLLING_MAD"

        # Conditional imputation: never delete the row --------------------
        if mask.any():
            clean = values.where(~mask)
            rolling_med = clean.rolling(self.window, min_periods=1).median()
            group.loc[mask, value_col] = rolling_med.loc[mask]
            self.audit_log.append(
                {
                    "segment_start": group.index[0],
                    "segment_end": group.index[-1],
                    "flagged": int(mask.sum()),
                    "hard_limit": int(hard.sum()),
                    "rolling_mad": int(stat.sum()),
                }
            )

        group[f"{value_col}_status"] = reasons
        return group

    def run(
        self,
        filepath: str,
        value_col: str,
        id_col: str,
        time_col: str = "timestamp",
    ) -> pd.DataFrame:
        """Chunked, memory-bounded execution over a CSV of ordered readings."""
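        # Caveat: a segment that straddles a chunk boundary is processed as two
        # pieces, so size chunks to hold whole segments where possible.
        results: list[pd.DataFrame] = []
        for chunk in pd.read_csv(filepath, chunksize=self.chunk_size):
            missing = {value_col, id_col, time_col} - set(chunk.columns)
            if missing:
                raise KeyError(f"input missing required columns: {sorted(missing)}")
            chunk = chunk.sort_values([id_col, time_col])
            # Reset the rolling window at every rational-subgroup boundary so a
            # filter never leaks statistics across a lot / shift / tool change.
            # Iterating the groups explicitly keeps id_col in the output on every
            # pandas version (groupby.apply may drop grouping columns).
            for _, segment in chunk.groupby(id_col, sort=False):
                results.append(self.process_group(segment, value_col))
        # Keep the reader's original row labels and restore input order.
        return pd.concat(results).sort_index()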

Validation and Testing

Confirm the estimator is robust, not just present. Feed a known-clean stable run and assert the false-flag rate matches the $k$ you chose from the table above (≈ 0.27 % at $k = 3.0$). A materially higher rate on stable data means the window is too short to have stabilized, or the stream is autocorrelated and needs the Layer 3 residual step.
Test the masked-shift case explicitly. Inject a synthetic step change and a slow linear ramp into a fixture, run the filter, and assert the ramp survives with status == "PASS". If the filter clips the ramp shoulders, widen $k$ or move to residual thresholding — a filter that eats real shifts is worse than none.
Order MSA before thresholds. The hard limits and the credibility of the whole filter depend on a gauge R&R study: if the measurement system's %R&R is high, "outliers" are just gauge noise and no threshold can recover the true value. Establish measurement-system capability first.
Preserve the index. Assert that len(output) == len(input) and that the original index is intact — a passing filter must never change row count.
Check the audit total. Reconcile sum(status != "PASS") against the total flagged in audit_log; a mismatch means a point was silently altered outside the logged path.
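
Several of these checks can be sketched as plain assertions. The example below uses a small stand-in filter rather than the class above so that it runs on its own; the function name, the seed, the synthetic stable run, and the 5 % tolerance are all assumptions, and the tolerance is deliberately loose compared with the table rate because a finite trailing window inflates the false-flag rate.

```python
import numpy as np
import pandas as pd

MAD_TO_SIGMA = 1.4826


def rolling_mad_status(values: pd.Series, window: int = 50, k: float = 3.0) -> pd.Series:
    """Stand-in filter: 'PASS' or 'ROLLING_MAD' per point, same index as input."""
    med = values.rolling(window, min_periods=1).median()
    abs_dev = (values - med).abs()
    band = k * MAD_TO_SIGMA * abs_dev.rolling(window, min_periods=1).median()
    flagged = abs_dev > band.where(band > 0, np.inf)
    return pd.Series(np.where(flagged, "ROLLING_MAD", "PASS"), index=values.index)


rng = np.random.default_rng(seed=7)
stable = pd.Series(10.0 + 0.02 * rng.standard_normal(2000))  # synthetic stable run
spiked = stable.copy()
spiked.iloc[1500] += 1.0                                     # injected gross error

status = rolling_mad_status(spiked)
false_flag_rate = (rolling_mad_status(stable) != "PASS").mean()

# Row count and index must survive the filter untouched.
assert len(status) == len(spiked) and status.index.equals(spiked.index)
# The injected gross error must be caught.
assert status.iloc[1500] == "ROLLING_MAD"
# Stable data must not be shredded; tighten this bound against the k table.
assert false_flag_rate < 0.05
```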

Failure Modes and Edge Cases

Symptom	Cause	Fix
Long real drift shredded into many flags	Pointwise MAD on an autocorrelated stream; independence violated	Difference the series or threshold AR(1) residuals (Layer 3) before flagging
Filter clips the start of every shift/lot	Rolling window smears across a rational-subgroup boundary	Reset the window per `id_col`; the `groupby` in `run()` enforces this
Downstream chart uses wrong `A₂`; limits meaningless	Rows deleted, silently changing subgroup n	Impute in place, never `dropna()`; keep the chronological index
Everything flags after a flat segment	Zero-MAD window makes the band collapse to zero	Guard `band > 0`; treat a degenerate window as an infinite band
Capability index drifts after "cleaning"	Symmetric filtering suppressed a one-sided real tail	Filter only against fault evidence; log every suppression for Cpk recompute
Borderline points flip between runs	`float32` accumulation of the rolling median	Compute center, MAD, and band in `float64`
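
The zero-MAD row is easy to reproduce. The sketch below uses a hypothetical quantized stream that repeats one value and then moves by a single count, and compares the band with and without the `band > 0` guard; the window of 5 is chosen only to keep the example small.

```python
import numpy as np
import pandas as pd

MAD_TO_SIGMA = 1.4826
window, k = 5, 3.0

# Hypothetical stream: a quantized sensor repeats one value, then moves one count.
values = pd.Series([5.000] * 10 + [5.001])

med = values.rolling(window, min_periods=1).median()
abs_dev = (values - med).abs()
band = k * MAD_TO_SIGMA * abs_dev.rolling(window, min_periods=1).median()

unguarded = abs_dev > band                        # band is 0 on the flat run
guarded = abs_dev > band.where(band > 0, np.inf)  # degenerate band never flags

print("unguarded flags:", int(unguarded.sum()), "| guarded flags:", int(guarded.sum()))
```

The guard trades sensitivity for stability: while the band is degenerate, a genuine spike also passes the statistical layer, so the hard engineering limits remain the backstop for flat segments.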

Compliance Notes

Outlier handling is a controlled data transformation, so it must be governed like one. The AIAG SPC Reference Manual (2nd ed.) requires that assignable causes be investigated rather than discarded; the reason-code audit trail (HARD_LIMIT, ROLLING_MAD, and AUTOCORR_RESIDUAL once the Layer 3 residual step is added) plus the retained original index provides the evidence that a flagged point was reviewed, not silently deleted. ISO 9001:2015 clause 7.1.5 (monitoring and measuring resources) anchors the Layer 1 hard limits to a maintained calibration and gauge R&R record. The MAD-based robust estimator and the $1.4826$ scaling follow the treatment in the NIST/SEMATECH Engineering Statistics Handbook, section 1.3.5.17 (Detection of Outliers). For automated containment, ISO 7870-2 expects control limits computed on data cleaned of measurement error but never of process signal.

Frequently Asked Questions

Why use MAD instead of a standard-deviation z-score?

The standard deviation is not robust: a single gross error inflates $\sigma$ enough to pull the ±3σ band out past itself, so the outlier masks its own detection. The median and MAD have a 50% breakdown point, so just under half the window can be contaminated before the estimate can be driven arbitrarily far. The $1.4826$ factor rescales the MAD so $\hat{\sigma} = 1.4826 \times \mathrm{MAD}$ is Gaussian-consistent and the same $k$ multipliers you know from z-scores still apply.

How do I keep the filter from deleting a real process shift?

Two safeguards. First, filter only against evidence of measurement error — hard-limit violations and fault-tagged readings — and let statistically extreme but fault-free points pass to the chart, which exists to catch them. Second, on serially correlated streams threshold the AR(1) residuals rather than raw levels, so a genuine drift stays in the residual mean and survives. Always regression-test the filter against a fixture containing a known ramp and assert it comes through as PASS.

Should I ever drop the outlier row?

No. Deleting a row changes the subgroup n, which makes the charting engine apply the wrong constant such as A₂, and it breaks the chronological index that run rules walk. Replace the flagged value with the rolling median of surviving points (or NaN for classification downstream), keep the original index, and log the reason code. Row count out must equal row count in.

How do I pick the window size and k?

The window should be long enough for the rolling median to stabilize — at least a few multiples of the dominant cycle — but short enough to track legitimate slow drift; 30–50 points is a common start for high-frequency streams. Pick k from the false-flag table: k = 3.0 (≈ 0.27 %) is the general default, k = 3.5 for vibration-prone or heavy-tailed signals. Re-benchmark both against a known-stable run periodically; a creeping false-positive rate usually signals sensor degradation, not an algorithm fault.

Should outlier filtering run before or after missing-value imputation?

Before. Filtering must operate on raw observations. Running a MAD or Hampel filter on already-imputed values creates a circular loop where the filter evaluates numbers it helped fabricate. The correct order is: validate schema, filter raw data for measurement error, classify remaining nulls by provenance, apply bounded imputation, then compute control statistics.

Filtering measurement outliers without masking real shifts — the residual-based tactics that keep Layer 2 blind to genuine drift
Batch data validation and error handling — the schema gate that must clear before this filter runs
Handling missing values in quality data — runs after filtering, classifying the NaN this stage leaves behind
Time-series alignment for multi-station lines — synchronize streams first so phase offset is not misread as outliers
Rolling window limit recalibration — re-establishes limits on the cleaned series without reusing a contaminated baseline

For where outlier detection sits in the full ingestion sequence, see Manufacturing Data Ingestion and Preprocessing.