How to Validate CSV Batch Uploads Against an SPC Schema in Python

Batch CSV ingestion into a Statistical Process Control system fails quietly when schema drift, implicit type coercion, or misaligned subgrouping metadata slips past the parser. An unvalidated upload corrupts control limits, fires false Western Electric rule violations, and breaks the audit trail back to the shop floor. This how-to belongs to the batch data validation and error handling stage of the manufacturing data ingestion and preprocessing pipeline: it shows a deterministic, fail-fast validator that enforces a strict schema contract before rows reach the time-series alignment pipeline or any control limit calculation.

The goal is a validator that never silently mutates data: every rejected row is preserved with a reason code and its original index, so a non-conformance investigation can trace any measurement back to the MES transaction that produced it.

Prerequisites

Before running the validator, confirm the following are in place:

Python 3.10+ with pandas >= 2.0 and pyarrow installed (pip install "pandas>=2.0" pyarrow)
A raw CSV export from the MES, historian, or operator upload — one measurement per row
A documented schema contract: the exact column names, dtypes, and physical bounds the process owner has signed off
The nominal sampling interval per station (needed to validate timestamp cadence)
The intended chart type, decided in advance: subgroup rules differ for X-Bar R charts (n = 2–9) versus I-MR charts (n = 1)
Sentinel-value handling agreed upstream, so hardware codes are already mapped when handling missing values in quality data

The SPC Schema Contract

An SPC dataset needs structural guarantees beyond ordinary relational constraints. Define the contract explicitly before writing any parsing code:

Field	Dtype	Rule
`timestamp`	datetime64[ns, UTC]	ISO 8601, timezone-aware, monotonic increasing per station
`station_id`	category	Bounded to the MES station registry
`subgroup_id`	int64 / string	Non-null; uniform cardinality for rational subgrouping
`measurement_value`	float64	Within physical sensor bounds (LSL/USL are separate)
`spec_limits_*`	float64	LSL, USL, target in separate columns (e.g. `spec_limits_lsl`, `spec_limits_usl`); nullable but validated against engineering tolerances when present

Relying on the default pd.read_csv() behaviour introduces silent failures: trailing whitespace in categorical fields, scientific-notation truncation, or string-to-float coercion that masks a sensor dropout as NaN. The validator below therefore drives every stage from this contract rather than from whatever pandas infers.
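To make those silent failures concrete, here is a small sketch of a default read on a hypothetical three-row export (the values are invented for illustration). It shows a trailing space creating a phantom station, and an "ERR" string becoming indistinguishable from an empty field once it is coerced.

```python
import io

import pandas as pd

# Hypothetical export: trailing space after ST-1, an "ERR" string, an empty field
raw = io.StringIO(
    "station_id,measurement_value\n"
    "ST-1,50.2\n"
    "ST-1 ,ERR\n"
    "ST-2,\n"
)
df = pd.read_csv(raw)  # default inference, no contract applied

# "ST-1" and "ST-1 " are counted as different stations
n_stations = df["station_id"].nunique()

# The "ERR" token stops pandas from inferring a float column
is_numeric = pd.api.types.is_numeric_dtype(df["measurement_value"])

# After coercion the "ERR" row and the empty row are both NaN
coerced = pd.to_numeric(df["measurement_value"], errors="coerce")
n_missing = int(coerced.isna().sum())

print(n_stations, is_numeric, n_missing)
```

On this input the read reports three stations where the shop floor has two, and the two NaN values can only be told apart by comparing them with the raw strings, which is why the validator keeps the raw column alongside the coerced one.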

Step-by-Step Implementation

Step 1 — Declare the schema as data, not code

Keep the contract in one place so it can be version-controlled and reused across stations. Each entry carries the dtype and, for numerics, the physical bounds.

from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ColumnSpec:
    name: str
    dtype: str
    required: bool = True
    low: Optional[float] = None   # physical lower bound, numerics only
    high: Optional[float] = None  # physical upper bound, numerics only


SPC_SCHEMA: tuple[ColumnSpec, ...] = (
    ColumnSpec("timestamp", "datetime64[ns, UTC]"),
    ColumnSpec("station_id", "category"),
    ColumnSpec("subgroup_id", "int64"),
    ColumnSpec("measurement_value", "float64", low=0.0, high=100.0),
    ColumnSpec("spec_limits_lsl", "float64", required=False),
    ColumnSpec("spec_limits_usl", "float64", required=False),
)
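Because the contract is plain data, it can be serialised and diffed like any other versioned artefact. The sketch below round-trips a schema through JSON; `schema_to_json` and `schema_from_json` are hypothetical helper names, and the two-column schema is a cut-down example rather than the full contract.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ColumnSpec:
    name: str
    dtype: str
    required: bool = True
    low: Optional[float] = None
    high: Optional[float] = None


def schema_to_json(schema) -> str:
    """Serialise the contract so it can be committed and diffed (hypothetical helper)."""
    return json.dumps([asdict(spec) for spec in schema], indent=2)


def schema_from_json(text: str):
    """Rebuild the immutable schema tuple from its JSON form (hypothetical helper)."""
    return tuple(ColumnSpec(**entry) for entry in json.loads(text))


schema = (
    ColumnSpec("measurement_value", "float64", low=0.0, high=100.0),
    ColumnSpec("spec_limits_lsl", "float64", required=False),
)
restored = schema_from_json(schema_to_json(schema))
print(restored == schema)
```

The frozen dataclass compares by value, so an equality check after the round trip is enough to confirm nothing was lost.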

Step 2 — Read with explicit dtypes and a strict engine

Read every column as a string with pandas' default C parser, so numeric coercion is done deliberately in the next step and never implicitly during the read. The pyarrow engine is not used here: it infers column types itself and does not support skipinitialspace.

import pandas as pd

def read_raw(path: str) -> pd.DataFrame:
    """Read a batch CSV without letting pandas guess numeric types."""
    return pd.read_csv(
        path,
        dtype=str,               # keep raw tokens; casting happens in Step 4
        skipinitialspace=True,   # strips whitespace after the delimiter only
        keep_default_na=True,    # empty fields and NA tokens become NaN
    )
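Note that skipinitialspace only skips whitespace that follows a delimiter, so trailing whitespace still reaches the frame. A minimal sketch of stripping it after the read, assuming a string-only read with the default C parser (an assumption of this sketch, as is the sample data):

```python
import io

import pandas as pd

raw = io.StringIO(
    "station_id,measurement_value\n"
    "ST-1 ,50.2\n"   # trailing space in the station id
    "ST-2,7.5 \n"    # trailing space in the measurement
)
df = pd.read_csv(raw, dtype=str, skipinitialspace=True)

# The trailing space is still present after the read
trailing_survives = df.loc[0, "station_id"] == "ST-1 "

# Strip every string column; .str.strip() leaves missing values untouched
stripped = df.apply(lambda column: column.str.strip())

print(trailing_survives, stripped["station_id"].tolist())
```

The strip returns a new frame with the same index, so row positions still map back to the source file.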

Step 3 — Phase 1: structural validation

Check column presence first and fail fast if any required column is missing — there is no point range-checking a frame with the wrong shape. Preserve the original index throughout.

def validate_structure(df: pd.DataFrame, schema=SPC_SCHEMA):
    errors = []
    present = set(df.columns)
    missing = [c.name for c in schema if c.required and c.name not in present]
    if missing:
        errors.append({"type": "MISSING_COLUMNS", "columns": missing})
    return errors
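Schema drift often shows up as a renamed column rather than a deleted one, in which case reporting the unexpected name next to the missing one speeds up diagnosis. The sketch below is one possible extension, not part of the validator above: `structure_report` and the `UNEXPECTED_COLUMNS` code are hypothetical names, and the column tuples restate the contract.

```python
import pandas as pd

# Column names taken from the schema contract
REQUIRED = ("timestamp", "station_id", "subgroup_id", "measurement_value")
OPTIONAL = ("spec_limits_lsl", "spec_limits_usl")


def structure_report(df: pd.DataFrame) -> list[dict]:
    """Report missing required columns and any column outside the contract."""
    present = set(df.columns)
    errors = []
    missing = [name for name in REQUIRED if name not in present]
    if missing:
        errors.append({"type": "MISSING_COLUMNS", "columns": missing})
    unexpected = sorted(present - set(REQUIRED) - set(OPTIONAL))
    if unexpected:
        errors.append({"type": "UNEXPECTED_COLUMNS", "columns": unexpected})
    return errors


# A drifted export where station_id was renamed upstream
drifted = pd.DataFrame(
    columns=["timestamp", "Station_ID", "subgroup_id", "measurement_value"]
)
report = structure_report(drifted)
print(report)
```

Seeing `station_id` missing and `Station_ID` unexpected in the same report points straight at a rename instead of a dropped signal.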

Step 4 — Phase 2: type casting and physical bounds

Cast each numeric column to its contracted dtype. A value that will not cast (an "ERR" calibration string, or a scientific-notation field truncated to something like 1.23E-) becomes NaN under errors="coerce", and the mask below flags it rather than silently accepting it.

import numpy as np

def validate_values(df: pd.DataFrame, schema=SPC_SCHEMA):
    errors = []
    valid = pd.Series(True, index=df.index)

    for spec in schema:
        if spec.name not in df.columns:
            continue

        if spec.dtype.startswith("float"):
            col = pd.to_numeric(df[spec.name], errors="coerce")
            bad_cast = col.isna() & df[spec.name].notna()
            if bad_cast.any():
                errors.append({"type": "UNCASTABLE_VALUE", "column": spec.name,
                               "indices": df.index[bad_cast].tolist()})
                valid &= ~bad_cast

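            # Check each bound independently so a one-sided spec never
            # compares against None
            oob = pd.Series(False, index=df.index)
            if spec.low is not None:
                oob |= (col < spec.low).fillna(False).astype(bool)
            if spec.high is not None:
                oob |= (col > spec.high).fillna(False).astype(bool)
            if oob.any():
                errors.append({"type": "OUT_OF_BOUNDS", "column": spec.name,
                               "count": int(oob.sum()),
                               "indices": df.index[oob].tolist()})
                valid &= ~oob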

    return errors, valid
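The two masks are easier to follow on a handful of values. This sketch applies the same coerce-then-compare idea to a single hypothetical column with a non-default index; the 0 to 100 bounds mirror the schema, everything else is illustrative.

```python
import pandas as pd

# Raw string tokens as they would arrive from a string-only read
raw = pd.Series(
    ["50.2", "ERR", None, "999.0", "1.23E-04"],
    index=[10, 11, 12, 13, 14],
)

col = pd.to_numeric(raw, errors="coerce")

# Uncastable: NaN after coercion but not missing in the raw data
bad_cast = col.isna() & raw.notna()

# Out of bounds: comparisons against NaN are False, so only real numbers trip this
oob = ((col < 0.0) | (col > 100.0)).fillna(False)

valid = ~(bad_cast | oob)
print(raw.index[bad_cast].tolist(), raw.index[oob].tolist())
```

Intact scientific notation such as 1.23E-04 casts cleanly and stays valid. The genuinely missing reading at index 12 also passes these two masks; whether a missing measurement is acceptable is a separate contract decision.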

Step 5 — Phase 2 (cont.): timestamp cadence and subgroup cardinality

Parse timestamps as timezone-aware UTC, floor to the sampling interval, and confirm subgroups are uniform. Splitting a rational subgroup on a 500 ms clock offset artificially inflates within-subgroup variance, so this check protects the eventual control limits before they are ever computed.

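def validate_spc_rules(df: pd.DataFrame, sampling_interval: str, subgroup_n: int):
    errors = []
    ts = pd.to_datetime(df["timestamp"], utc=True, errors="coerce")
    bad_ts = ts.isna()
    if bad_ts.any():  # report missing or unparseable stamps instead of dropping them
        errors.append({"type": "UNPARSEABLE_TIMESTAMP",
                       "indices": df.index[bad_ts].tolist()})

    # Floor to the sampling interval so sub-interval clock jitter is not
    # reported as disorder, then check ordering station by station
    floored = ts.dt.floor(sampling_interval)
    for station, stamps in floored.dropna().groupby(df["station_id"]):
        if not stamps.is_monotonic_increasing:
            errors.append({"type": "NON_MONOTONIC_TIMESTAMP",
                           "station_id": station})

    # Subgroup cardinality: n must be uniform for X-Bar R rational subgrouping
    sizes = df.groupby("subgroup_id").size()
    off = sizes[sizes != subgroup_n]
    if not off.empty:
        errors.append({"type": "SUBGROUP_SIZE_MISMATCH",
                       "expected": subgroup_n,
                       "offending": off.to_dict()})
    return errors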
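The effect of flooring is easiest to see on timestamps with a deliberate offset. In this sketch (hypothetical readings, one-minute sampling interval assumed) two PLCs report the same sample 500 ms apart:

```python
import pandas as pd

# Four readings: two per minute, each pair split by a 500 ms clock offset
ts = pd.Series(pd.to_datetime(
    [
        "2026-07-01T08:00:00.000Z",
        "2026-07-01T08:00:00.500Z",
        "2026-07-01T08:01:00.100Z",
        "2026-07-01T08:01:00.600Z",
    ],
    utc=True,
))

# Grouping on the raw stamps would give four groups of one
raw_groups = ts.nunique()

# Flooring to the sampling interval rejoins each pair
floored = ts.dt.floor("1min")
sizes = floored.value_counts().sort_index()

print(raw_groups, sizes.tolist())
```

Flooring only helps when the jittered readings fall inside the same interval. A pair that straddles an interval boundary (07:59:59.9 and 08:00:00.1, say) still splits, which is why the cardinality check runs afterwards.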

Step 6 — Compose the phases into one result object

Return a structured result — clean frame, quarantined frame, and the full error list — instead of raising. That lets the ingestion worker route bad rows to a dead-letter queue and still chart the valid subset.

@dataclass
class ValidationResult:
    ok: bool
    clean: pd.DataFrame
    quarantine: pd.DataFrame
    errors: list = field(default_factory=list)


def validate_batch(path, sampling_interval="1min", subgroup_n=5) -> ValidationResult:
    df = read_raw(path)
    errors = validate_structure(df)
    if errors:  # wrong shape → stop before touching values
        return ValidationResult(False, df.head(0), df, errors)

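    val_errs, valid = validate_values(df)
    clean = df[valid].copy()
    quarantine = df[~valid].copy()

    # Check SPC rules on the surviving rows: quarantining a reading can
    # leave its subgroup short, which the raw frame would not reveal
    spc_errs = validate_spc_rules(clean, sampling_interval, subgroup_n)
    errors += val_errs + spc_errs
    return ValidationResult(len(clean) > 0, clean, quarantine, errors)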
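The error list carries row indices, so a reason code can be attached to each quarantined row before it goes to the dead-letter queue. The sketch below shows one way to do that; `attach_reasons` and the `reason_codes` column are hypothetical names, and the error dictionaries follow the shapes produced in Steps 4 and 5.

```python
import pandas as pd


def attach_reasons(quarantine: pd.DataFrame, errors: list[dict]) -> pd.DataFrame:
    """Add a reason_codes column keyed on the original row index."""
    reasons = {idx: [] for idx in quarantine.index}
    for err in errors:
        # Batch-level errors carry no "indices" key and are skipped here
        for idx in err.get("indices", []):
            if idx in reasons:
                reasons[idx].append(f"{err['type']}:{err.get('column', '')}")
    out = quarantine.copy()
    out["reason_codes"] = [";".join(reasons[idx]) for idx in out.index]
    return out


quarantine = pd.DataFrame({"measurement_value": ["999.0", "ERR"]}, index=[1, 4])
errors = [
    {"type": "OUT_OF_BOUNDS", "column": "measurement_value",
     "count": 1, "indices": [1]},
    {"type": "UNCASTABLE_VALUE", "column": "measurement_value", "indices": [4]},
    {"type": "NON_MONOTONIC_TIMESTAMP"},
]
tagged = attach_reasons(quarantine, errors)
print(tagged)
```

Batch-level findings such as a non-monotonic timestamp stay in the error list only, since they describe the upload rather than a single row.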

Verification

Confirm the validator behaves deterministically with a minimal fixture. Feed it one clean row and one row that violates the physical bound, then assert the split:

import io

fixture = io.StringIO(
    "timestamp,station_id,subgroup_id,measurement_value\n"
    "2026-07-01T08:00:00Z,ST-1,1,50.2\n"
    "2026-07-01T08:00:00Z,ST-1,1,999.0\n"   # out of [0,100] bound
)

res = validate_batch(fixture, subgroup_n=2)
assert len(res.clean) == 1
assert len(res.quarantine) == 1
assert any(e["type"] == "OUT_OF_BOUNDS" for e in res.errors)
assert res.quarantine.index.tolist() == [1]   # original index preserved
print("validation contract holds")

Expected output: validation contract holds. The final assertion is the load-bearing one — a validator that reindexes on the way out severs traceability to the MES transaction log and makes root-cause analysis impossible.

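For large exports that exhaust RAM during this pass, read in chunks with pd.read_csv(path, chunksize=500_000) and accumulate the error list across chunks. chunksize is not supported by the pyarrow engine, so chunked reads need the default C parser. A subgroup can straddle a chunk boundary, so accumulate subgroup sizes across chunks as well and check cardinality once at the end. Converting low-cardinality identifiers (station_id, and product_code if the export carries it) to category immediately after casting typically cuts the memory footprint by 60–80%, depending on how few distinct values the column holds.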
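A chunked pass needs a little bookkeeping, because a subgroup can be split across two chunks. The sketch below accumulates subgroup sizes with a Counter and checks cardinality only after the last chunk; the tiny chunksize, the sample data, and the expected n of 3 are all illustrative.

```python
import io
from collections import Counter

import pandas as pd

raw = io.StringIO(
    "subgroup_id,measurement_value\n"
    "1,50.1\n"
    "1,50.3\n"
    "1,49.9\n"   # subgroup 1 straddles the first chunk boundary
    "2,50.0\n"
    "2,50.2\n"
)

sizes = Counter()
first_indices = []
for chunk in pd.read_csv(raw, dtype=str, chunksize=2):
    # The row index continues across chunks, so traceability is kept
    first_indices.append(int(chunk.index[0]))
    sizes.update(chunk["subgroup_id"].value_counts().to_dict())

expected_n = 3
off = {subgroup: n for subgroup, n in sizes.items() if n != expected_n}
print(first_indices, off)
```

Checking cardinality inside the loop would have flagged subgroup 1 as short in the first chunk even though it is complete in the file.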

Root-Cause Table

Symptom	Cause	Fix
`measurement_value` arrives as `object` dtype	Truncated scientific notation or `"ERR"` calibration strings in the CSV	Coerce with `pd.to_numeric(..., errors="coerce")` and quarantine the `bad_cast` mask (Step 4)
Phantom subgroup with `n` off by one	Clock jitter between PLCs split a rational subgroup	Floor timestamps to the sampling interval before grouping; validate cardinality in Step 5
False Western Electric alarm on first chart render	Uncleaned batch artefacts entered the charting engine	Run `validate_batch` before charting and route flagged rows to quarantine, not the chart
Rows silently vanish after validation	Frame was reindexed, severing the link to the MES log	Preserve the original index end to end and split with a boolean mask, never `reset_index`
`MemoryError` on a multi-station export	Whole CSV loaded into memory at once	Use `chunksize=500_000` and downcast identifiers to `category` after casting

Do not blind-impute quarantined rows: forward-fill only short gaps (< 3 sampling intervals) and flag longer gaps with a status column rather than substituting values, since imputation across a maintenance window distorts Cp/Cpk and masks true special-cause variation. Compliance-wise, log every quarantined record with a reason code and timestamp to keep the electronic batch record defensible (21 CFR Part 11; AIAG SPC Reference Manual, ch. I on data integrity).
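The gap rule can be sketched as follows, assuming the series has already been placed on a regular grid with one slot per sampling interval, so each NaN is exactly one missed reading. The status labels and the `MAX_FILL` name are hypothetical.

```python
import numpy as np
import pandas as pd

# One slot per sampling interval: a 2-slot gap, then a 3-slot gap
values = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, np.nan, np.nan, 8.0])

MAX_FILL = 3  # gaps shorter than this many intervals may be forward-filled

is_gap = values.isna()
# Label each run of consecutive equal flags, then measure the NaN runs
run_id = (is_gap != is_gap.shift()).cumsum()
run_len = is_gap.astype(int).groupby(run_id).transform("sum")

short_gap = is_gap & (run_len < MAX_FILL)
filled = values.where(~short_gap, values.ffill())

# Status column: long gaps are flagged, never substituted
status = pd.Series("MEASURED", index=values.index)
status[short_gap] = "FILLED"
status[is_gap & ~short_gap] = "GAP"

print(filled.tolist(), status.tolist())
```

Measuring the whole run first matters: a plain `ffill(limit=2)` would also fill the first two slots of the three-slot gap, which is exactly the partial imputation the rule is meant to prevent.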

For chart selection criteria see SPC Fundamentals & Control Chart Taxonomy.