How to Filter Measurement Outliers Without Masking Real Process Shifts

A static z-score cutoff cannot tell a probe-bounce spike apart from the first point of a real tool-wear drift — both sit far from the recent mean, so a blunt filter deletes both. Delete the spike and the chart stays honest; delete the shift and you have quietly erased the exact assignable cause the chart exists to catch, inflating Cpk and suppressing the very alarm operators depend on. This how-to is the noise-removal step of the outlier detection and filtering pipeline inside the broader manufacturing data ingestion and preprocessing workflow: it builds a two-layer filter that quarantines transient measurement artifacts while preserving step changes, ramps, and deliberate setpoint moves, and hands a verifiable validity mask to the charting stage before any control limit is computed.

The design goal is that the filter never silently rewrites process history: every flagged point is marked, never deleted, so the sequential integrity that Nelson and Western Electric run-rule evaluation depends on survives intact and a capability recalculation can always trace which points were excluded and why.

Prerequisites

Confirm these are in place before running the filter:

Python 3.10+ with pandas >= 2.0 and numpy >= 1.24 installed (pip install "pandas>=2.0" "numpy>=1.24")
A single measurement channel as a pd.Series with a monotonic index — sort and de-duplicate timestamps upstream via the time-series alignment pipeline first
Short sensor dropouts already resolved by the missing-value policy for quality data; this filter bridges gaps of at most two samples and assumes longer gaps are already masked
Schema-validated numeric input — non-numeric payloads and sentinel values (-999, 0.0) mapped to NaN by the batch data validation gate
A rough expected process cycle length in samples, so the rolling window can be set to span one cycle without straddling a rational-subgroup boundary
The intended chart type known in advance — spike tolerance differs for an I-MR chart, where a single outlier moves both the individual and the moving range, versus an X-Bar R chart that averages within a subgroup
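To make the last prerequisite concrete, the sketch below uses made-up readings to show how one spike contaminates two consecutive moving ranges on an I-MR chart, while a subgroup average dilutes the same spike. The readings and the subgroup size of 5 are illustrative assumptions, not values from any real process.

```python
import numpy as np

# Hypothetical readings: steady near 10.0 with one spike at position 4.
x = np.array([10.0, 10.2, 9.9, 10.1, 15.0, 10.0, 10.1, 9.9, 10.2, 10.0])

# I-MR: the moving range is |x[i] - x[i-1]|, so a single spike inflates
# the range entering the spike and the range leaving it.
moving_range = np.abs(np.diff(x))
inflated = np.flatnonzero(moving_range > 1.0)

# X-Bar: averaging a subgroup of n=5 dilutes the spike to about 1/5 of its size.
subgroup_means = x.reshape(-1, 5).mean(axis=1)

print("inflated moving ranges at positions:", inflated.tolist())
print("subgroup means:", subgroup_means.round(2).tolist())
```

On an individuals chart the spike therefore distorts both the point and the dispersion estimate, which is why the spike tolerance has to be chosen with the chart type in mind.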

Why a global filter masks real shifts

Before the code, it helps to see exactly why the naive approach fails, because the two failure modes drive the two-layer design.

A global threshold assumes one stationary regime. A fixed z-score or IQR fence computed over an entire production shift treats a thermal ramp-up, a catalyst activation, or a fresh material lot as the same population as steady-state running. A legitimate 3σ excursion during a known transition is then deleted as noise: a false positive for the filter that becomes a missed signal on the chart, in precisely the window where the process is most likely to signal.

Standard deviation is not robust to the very shifts you want to keep. A sustained step change inflates the window's standard deviation, which widens the threshold, which then lets subsequent real deviations pass unflagged: the shift both hides itself and desensitizes the filter around it. The fix is a dispersion estimate built on the median, which barely moves until contamination approaches half the window.
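The masking effect is easy to reproduce on a single window. The sketch below uses ten hypothetical readings with one gross spike and scores that spike two ways: against the mean and standard deviation, and against the median and scaled MAD. The readings are invented for illustration.

```python
import numpy as np

# Hypothetical window: nine steady readings and one gross spike.
window = np.array([10.0, 10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 25.0])
spike = window[-1]

# Classic z-score: the spike inflates both the mean and the standard deviation.
z_std = abs(spike - window.mean()) / window.std(ddof=1)

# Robust score: median and scaled MAD are almost unaffected by one point.
median = np.median(window)
mad = np.median(np.abs(window - median))
z_mad = abs(spike - median) / (1.4826 * mad)

print(f"z-score via std: {z_std:.2f}")   # falls below a 3-sigma cutoff
print(f"z-score via MAD: {z_mad:.1f}")   # far above a 3.5 cutoff
```

The spike drags the standard deviation up far enough to slip under its own 3σ fence, while the MAD-based score flags it by a wide margin.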

The two layers below address these directly: a rolling median absolute deviation (MAD) estimator flags spikes without being contaminated by a sustained shift, and a change-point guard vetoes the flag whenever the local variance signature says a structural break — not a transient — is underway.

Step-by-Step Implementation

Step 1 — Enforce a monotonic index and bridge micro-gaps

Rolling statistics silently produce garbage on an unsorted or duplicated index, and a one-sample dropout mid-window fabricates a discontinuity that reads as an outlier. Sort defensively and forward-fill only the shortest gaps — anything longer must already be masked upstream, never imputed here.

import pandas as pd
import numpy as np


def prepare(series: pd.Series) -> pd.Series:
    """Guarantee a sorted index and bridge at most two missing samples."""
    clean = series.sort_index()
    # ffill limit=2 only: a longer gap is a real data hole, not sensor jitter.
    return clean.ffill(limit=2)

Verify in isolation: pass a deliberately shuffled index and assert prepare(s).index.is_monotonic_increasing is True. A window computed on out-of-order timestamps is meaningless regardless of the statistics that follow.
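A minimal version of that check, as a sketch: `prepare` is restated so the snippet runs on its own, and the eight-sample fixture with its gap pattern is an assumption chosen to exercise both a short and a long gap.

```python
import numpy as np
import pandas as pd


def prepare(series: pd.Series) -> pd.Series:
    """Restated from Step 1 so this snippet runs on its own."""
    return series.sort_index().ffill(limit=2)


idx = pd.date_range("2026-07-01", periods=8, freq="s")
ordered = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, np.nan, np.nan, 8.0], index=idx)
shuffled = ordered.iloc[[3, 0, 7, 1, 5, 2, 6, 4]]   # deliberately out of order

out = prepare(shuffled)

assert out.index.is_monotonic_increasing
assert out.iloc[1] == 1.0                      # one-sample gap bridged
assert out.iloc[3:5].tolist() == [3.0, 3.0]    # first two samples of the long gap filled
assert out.iloc[5:7].isna().all()              # remainder of the long gap stays missing
```

Note what the last two assertions show: `ffill(limit=2)` fills the first two samples of any gap, including a long one. That is why longer gaps have to be masked upstream rather than left for this step to handle.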

Step 2 — Flag spikes with a rolling MAD estimator (Layer 1)

MAD resists contamination because the median has a 50% breakdown point: it moves appreciably only once close to half the window is contaminated. A single spike therefore cannot inflate the dispersion estimate and widen the fence around itself, and a genuine step change moves the median only after enough shifted points accumulate. The 1.4826 factor scales MAD to approximate σ for Gaussian data; a threshold k of 3.0 to 3.5 balances sensitivity against robustness for heavy-tailed shop-floor data.

def layer1_spike_mask(clean: pd.Series, window: int, k: float, min_periods: int) -> pd.Series:
    """True where a point deviates beyond k scaled-MADs from the rolling median."""
    rolling_median = clean.rolling(window, min_periods=min_periods).median()
    abs_dev = (clean - rolling_median).abs()
    rolling_mad = abs_dev.rolling(window, min_periods=min_periods).median()
    scaled_mad = rolling_mad * 1.4826          # MAD → ~σ for Gaussian data
    return abs_dev > (scaled_mad * k)

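The threshold expressed as a formula is $\text{flag}_i \iff |x_i - \tilde{x}_w| > k \cdot 1.4826 \cdot \operatorname{MAD}_w$, where $\tilde{x}_w$ is the rolling median and $\operatorname{MAD}_w = \operatorname{median}_{j \in w}(|x_j - \tilde{x}_w|)$ over the trailing window $w$ that ends at sample $i$. The code approximates $\operatorname{MAD}_w$ with the rolling median of each sample's deviation from its own rolling median, which matches the formula while the median is stable. Verify that this layer alone flags a lone injected spike and, deliberately, still flags the leading edge of a step change; that over-flagging is exactly what Layer 2 exists to correct.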
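That check can be sketched as follows. `layer1_spike_mask` is restated so the snippet runs on its own, and a small sinusoid stands in for noise so the result is reproducible without a random seed; the amplitudes, positions, and window length are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def layer1_spike_mask(clean: pd.Series, window: int, k: float, min_periods: int) -> pd.Series:
    """Restated from Step 2 so this snippet runs on its own."""
    rolling_median = clean.rolling(window, min_periods=min_periods).median()
    abs_dev = (clean - rolling_median).abs()
    rolling_mad = abs_dev.rolling(window, min_periods=min_periods).median()
    return abs_dev > (rolling_mad * 1.4826 * k)


n = 120
# Deterministic stand-in for measurement noise (amplitude 0.3).
x = pd.Series(100.0 + 0.3 * np.sin(1.7 * np.arange(n)))
x.iloc[60] += 12.0      # lone single-sample spike
x.iloc[90:] += 6.0      # sustained step change

flags = layer1_spike_mask(x, window=20, k=3.5, min_periods=5)

assert flags.iloc[60]   # the spike is flagged
assert flags.iloc[90]   # so is the leading edge of the real step
print("points flagged after the step begins:", int(flags.iloc[90:].sum()))
```

Both assertions passing is the expected and intended behaviour for Layer 1 on its own: it cannot tell the two events apart.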

Step 3 — Veto structural breaks with a change-point guard (Layer 2)

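A spike and the first point of a real shift look identical to Layer 1. The distinguishing signal is persistence: a shift raises the local variance relative to the recent baseline and keeps it raised because the level stays moved, whereas a spike is a single-sample transient. A raw trailing variance cannot see that difference by itself, because one large spike inflates it for a full window and would veto its own flag. The guard therefore measures variance on a copy passed through a 3-sample centered median, which erases one-sample excursions but leaves steps intact, and compares it to a lagged baseline; where the ratio exceeds a threshold, a structural break is underway and the spike flag must be vetoed.

def layer2_shift_mask(clean: pd.Series, window: int, ratio: float, min_periods: int) -> pd.Series:
    """True where local variance signals a sustained structural shift (not a spike)."""
    # A 3-sample centered median erases single-sample spikes but keeps steps, so a
    # lone spike cannot inflate the variance and veto its own Layer 1 flag.
    despiked = clean.rolling(3, center=True, min_periods=1).median()
    local_var = despiked.rolling(window, min_periods=min_periods).var()
    baseline_var = local_var.shift(window)
    # Warm-up: no lagged baseline exists yet, so fall back to the first valid variance.
    valid_var = local_var.dropna()
    fallback = valid_var.iloc[0] if len(valid_var) else 1.0
    baseline_var = baseline_var.fillna(fallback).clip(lower=1e-12)  # keep baseline > 0
    return local_var > (baseline_var * ratio)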

The guard is $\operatorname{Var}_w > r \cdot \operatorname{Var}_{w,\text{lagged}}$, written as a product so that no division takes place. The clip(lower=1e-12) floor keeps the baseline strictly positive: a dead-flat baseline window (a stuck sensor) yields zero variance, and the floor is what prevents a division by zero and a NaN-poisoned mask if the comparison is ever refactored into an explicit ratio. Verify by feeding a clean step change and asserting the guard fires across the transition region.
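The clean-step check can be sketched with the comparison written inline. This is a stripped-down illustration of the variance-ratio inequality on a spike-free fixture, not a replacement for `layer2_shift_mask`; the sinusoidal stand-in for noise, the step size, and the window of 20 are assumptions.

```python
import numpy as np
import pandas as pd

window, ratio = 20, 2.0
n = 160
# Deterministic stand-in for noise; the fixture contains a step and no spikes.
x = pd.Series(100.0 + 0.5 * np.sin(1.7 * np.arange(n)))
x.iloc[100:] += 6.0     # clean sustained step at sample 100

local_var = x.rolling(window, min_periods=window).var()
baseline_var = local_var.shift(window).clip(lower=1e-12)   # NaN stays NaN during warm-up
guard = local_var > baseline_var * ratio                   # comparisons with NaN are False

# The guard fires while the window holds a mix of old and new levels...
assert guard.iloc[100:119].all()
# ...and is quiet in steady state, both before the step and once the window has refilled.
assert not guard.iloc[:100].any()
assert not guard.iloc[119:].any()
print("guard active for", int(guard.sum()), "samples")
```

The guard stays active for roughly one window length after the step, which is the region where Layer 1 would otherwise flag the new level as a run of spikes.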

Step 4 — Compose the layers into one validity mask

The final rule is deliberately asymmetric: quarantine a point only if it is a spike and no structural shift is present. This keeps step changes, deliberate interventions, and ramps steep enough to trip the guard, while quarantining probe bounce, electrical interference, and momentary calibration glitches. Return the mask realigned to the caller's original index, defaulting the warm-up period to valid so no data is dropped before the window fills.

def robust_outlier_filter(
    series: pd.Series,
    window: int = 50,
    mad_threshold: float = 3.5,
    min_periods: int = 5,
    variance_ratio: float = 2.0,
) -> pd.Series:
    """
    Flag transient outliers while preserving sustained process shifts.

    Returns a boolean mask aligned to `series.index`:
    True = valid observation, False = transient artifact to quarantine.

    Layer 1 (rolling MAD) flags spikes; Layer 2 (local variance-ratio
    change-point guard) vetoes flags that coincide with a structural shift,
    so genuine step changes are not masked. Slow ramps may need detrending first.

    Parameters
    ----------
    series : pd.Series
        Single measurement channel with a monotonic index.
    window : int
        Rolling window in samples; set to ~one process cycle.
    mad_threshold : float
        MAD multiplier k for spike detection. Typical 3.0–3.5.
    min_periods : int
        Minimum observations before rolling stats are trusted.
    variance_ratio : float
        Local variance must exceed this multiple of baseline to
        count as a structural shift and veto the spike flag.
    """
    clean = prepare(series)

    is_spike = layer1_spike_mask(clean, window, mad_threshold, min_periods)
    shift_detected = layer2_shift_mask(clean, window, variance_ratio, min_periods)

    # Quarantine only if it's a spike AND not part of a structural shift.
    valid_mask = ~is_spike | shift_detected

    # Realign to caller's index; default any unmatched label to valid (keep data).
    # fill_value keeps the result a plain boolean Series.
    return valid_mask.reindex(series.index, fill_value=True)


# Usage — mark, never delete: keep the flagged rows for the audit trail.
# df["is_valid"] = robust_outlier_filter(df["temperature_c"], window=60, mad_threshold=3.5)
# df["temperature_c_status"] = np.where(df["is_valid"], "PASS", "ROLLING_MAD")
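Because the composition rule has only two boolean inputs, it can be checked exhaustively. The sketch below enumerates the four cases; the column names mirror the variables in `robust_outlier_filter`, and the table itself is only an illustration.

```python
import pandas as pd

cases = pd.DataFrame({
    "is_spike":       [False, False, True,  True],
    "shift_detected": [False, True,  False, True],
})

# Same expression as in robust_outlier_filter: valid unless spike AND no shift.
cases["valid"] = ~cases["is_spike"] | cases["shift_detected"]

print(cases)
```

Only the third row, a spike with no structural shift underway, is quarantined; every other combination is kept.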

Verification

Confirm the filter keeps a real shift while quarantining a spike using a minimal synthetic fixture; no live data is required. Build a steady baseline, inject one single-sample spike, then apply a sustained step, and assert that the mask marks the spike invalid but leaves the step valid:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2026-07-01", periods=300, freq="s")
signal = pd.Series(rng.normal(100.0, 0.5, size=300), index=idx)

signal.iloc[150] += 12.0          # transient spike (should be quarantined)
signal.iloc[220:] += 6.0          # sustained step shift (must be preserved)

mask = robust_outlier_filter(signal, window=40, mad_threshold=3.5, variance_ratio=2.0)

assert not mask.iloc[150], "spike should be flagged as invalid"
# The step is preserved: the transition region is not wholesale quarantined.
preserved = mask.iloc[220:260].mean()
assert preserved > 0.9, f"real shift was masked (only {preserved:.0%} kept)"
print("spike removed, shift preserved")

Expected output: spike removed, shift preserved. The second assertion is the load-bearing one — a filter that passes the spike check but fails the shift check has quietly reverted to the global-threshold failure mode and will erase assignable causes in production.

Root-Cause Table

Symptom	Cause	Fix
A real step change vanishes from the chart	Global z-score/IQR filter deleted the leading edge as noise	Add the Layer 2 variance-ratio guard so structural breaks veto the spike flag (Steps 3–4)
Cpk drifts upward with no process improvement	Sustained shift inflated the rolling standard deviation, widening the fence until later deviations pass	Use rolling MAD, not standard deviation, for the dispersion estimate (Step 2)
Whole transition window flagged invalid	`variance_ratio` set too high, so the guard never fires and Layer 1's flags on a normal ramp-up go unvetoed	Lower `variance_ratio`, and reset the window at lot/shift/tool-change markers (Steps 3–4)
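Mask fills with `NaN`/`False` mid-series	Guard refactored into an explicit `local_var / baseline_var` ratio, so a zero-variance baseline (stuck sensor) divided by zero and NaN-poisoned Layer 2	Compare multiplicatively and keep the `clip(lower=1e-12)` floor on `baseline_var` (Step 3)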
Isolated points flagged at every gap	Non-monotonic index or an unbridged one-sample dropout read as a discontinuity	Sort the index and `ffill(limit=2)` before rolling; mask longer gaps upstream (Step 1)

Never delete a flagged row: keep it, tag it with a reason code (ROLLING_MAD) and timestamp, and publish the validity mask alongside the raw channel so a capability recalculation stays reproducible and the electronic batch record remains defensible (21 CFR Part 11; AIAG SPC Reference Manual, ch. I on data integrity; NIST/SEMATECH e-Handbook §1.3.5.17 on robust detection of outliers). For teams logging via MES, carry the flag through the same audit path used when connecting Python to MES and SCADA systems.
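One way to carry that out, as a sketch: the frame, the hand-written mask, and the `flagged_at` column name are hypothetical stand-ins for the output of the filter and for whatever audit schema the MES expects.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2026-07-01", periods=6, freq="s")
df = pd.DataFrame({"temperature_c": [100.1, 99.9, 112.0, 100.2, 100.0, 99.8]}, index=idx)

# Hypothetical mask, standing in for the result of robust_outlier_filter.
df["is_valid"] = [True, True, False, True, True, True]

# Tag every row with a reason code; stamp only the quarantined rows.
df["temperature_c_status"] = np.where(df["is_valid"], "PASS", "ROLLING_MAD")
flag_time = pd.Timestamp.now(tz="UTC")
df["flagged_at"] = pd.Series(flag_time, index=df.index).where(~df["is_valid"])

# Statistics use valid rows only, but no row ever leaves the frame.
center = df.loc[df["is_valid"], "temperature_c"].mean()

print(df)
print(f"centerline from valid rows: {center:.2f}")
```

The raw value, the mask, the reason code, and the flag time all travel together, so a later capability recalculation can reproduce exactly which points were excluded.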

FAQ

Why use MAD instead of standard deviation for the spike threshold?

Standard deviation is not robust: a single spike or a sustained shift inflates it, which widens the detection fence and lets subsequent real deviations pass unflagged — the outlier hides itself. MAD is bounded by the median, so contamination up to nearly half the window cannot move it much. That robustness is what lets Layer 1 keep flagging spikes even in the presence of a step change, leaving the change-point guard free to decide whether the deviation is transient or structural.

How do I choose the window size?

Set the window to span roughly one process cycle — long enough that steady-state noise is well estimated, short enough that a real shift is not diluted across a huge baseline. Too small and normal autocorrelation trips the spike layer; too large and the rolling median lags real transitions, delaying detection. Critically, never let a window straddle a rational-subgroup boundary: reset it at lot, shift, or tool-change markers, or the filter blends two distinct populations.
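Resetting the window at a boundary can be done by grouping on the boundary marker before rolling. The sketch below assumes a hypothetical `lot_id` column and exaggerated values so the blending effect is obvious.

```python
import pandas as pd

df = pd.DataFrame({
    "lot_id": ["A"] * 4 + ["B"] * 4,     # hypothetical rational-subgroup marker
    "value":  [1.0, 1.0, 1.0, 1.0, 10.0, 10.0, 10.0, 10.0],
})
window = 3

# One window over everything: the first rows of lot B are judged against lot A.
straddling = df["value"].rolling(window, min_periods=1).median()

# Grouping first restarts the window at each lot boundary.
per_lot = df.groupby("lot_id", sort=False)["value"].transform(
    lambda s: s.rolling(window, min_periods=1).median()
)

print(pd.DataFrame({"straddling": straddling, "per_lot": per_lot}))
```

At the first sample of lot B the straddling median still reports the lot A level, so a perfectly normal reading would look like a large deviation; the per-lot median does not have that problem.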

What if a genuine shift is a slow ramp, not a step?

A slow linear ramp raises local variance gradually rather than abruptly, so the variance-ratio guard may not fire and Layer 1 can nibble points off the leading edge. For known ramp regimes (thermal warm-up, catalyst activation) either widen the guard's window (the window parameter, not variance_ratio) relative to the ramp rate, or difference the series or fit a lightweight AR(1) trend and filter the residuals instead of the raw signal. That removes the trend before thresholding, so the ramp itself is never mistaken for a string of outliers.
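The differencing option can be sketched as follows. The ramp slope, the sinusoidal stand-in for noise, and the use of a single global median and MAD on the differenced series are simplifying assumptions for illustration; a production version would apply the rolling layers to the residuals.

```python
import numpy as np
import pandas as pd

n = 120
i = np.arange(n)
# Hypothetical warm-up ramp: +0.2 per sample, small deterministic "noise".
ramp = pd.Series(100.0 + 0.2 * i + 0.3 * np.sin(1.7 * i))
ramp.iloc[60] += 12.0    # one transient spike riding on the ramp

# On the raw signal, a trailing median lags the ramp by about slope * (window - 1) / 2.
lag = (ramp - ramp.rolling(40, min_periods=40).median()).iloc[100]

# First differences remove the linear trend; only the spike stands out.
diff = ramp.diff()
center = diff.median()
mad = (diff - center).abs().median()
flags = (diff - center).abs() > 3.5 * 1.4826 * mad
flagged = flags[flags].index.tolist()

print(f"trailing-median lag on the raw ramp: {lag:.1f}")
print("flagged differences:", flagged)
```

Two differences are flagged for one spike, the jump into it and the jump back out, so map a flagged pair back to the single raw sample between them before writing the validity mask.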

Should I ever delete the flagged rows?

No. Deleting rows breaks the sequential integrity that run-rule detection depends on: Nelson Rule 2 (nine consecutive points on one side of the centerline) and moving-range statistics both assume an unbroken sequence. Mark the row with a status flag instead and let the charting layer decide whether to exclude it from limit calculation. A deleted point is invisible to the audit; a marked one preserves the trail and lets you measure your false-flag rate over time as an early warning of sensor degradation.
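Tracking the flag rate over time needs nothing more than the stored mask. The sketch below assumes a hypothetical two-day, one-sample-per-minute mask with a handful of quarantined points and aggregates it per day.

```python
import pandas as pd

# Hypothetical validity mask: two days at one sample per minute.
idx = pd.date_range("2026-07-01", periods=2 * 24 * 60, freq="min")
mask = pd.Series(True, index=idx)
mask.iloc[[5, 1500, 1501, 2000]] = False   # quarantined samples

# Fraction of samples flagged per day; a rising trend suggests sensor degradation.
flag_rate = (~mask).astype(float).resample("D").mean()

print(flag_rate)
```

In this fixture the second day shows three times the flag rate of the first, the kind of drift worth alerting on before the sensor fails outright.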
