Async Batch Processing Architectures for ACH/Wire Reconciliation

Reconciliation at institutional volume breaks the moment ingestion, decoding, matching, and exception routing are wired together as one synchronous procedure. A single 2 GB NACHA batch loaded into memory, sliced positionally, and joined against the ledger on one thread will hold the process hostage for minutes — and every minute a payment sits undecoded is a minute counted against a Reg E investigation clock. This guide sits within the broader Automated File Ingestion & Parsing Pipelines framework and specifies the asynchronous, chunked, event-driven architecture that replaces that monolith: a topology where I/O-bound file acquisition never blocks CPU-bound transaction matching, where memory stays bounded regardless of file size, and where every unmatched entry routes to an exception queue without stalling the batch behind it.

The core discipline is separation of concerns by concurrency primitive. Network I/O — SFTP polling, object-storage reads, downstream API calls — belongs on an asyncio event loop; CPU work — fixed-width NACHA decoding, hashing, and matching — belongs in isolated processes. The precise decision boundary between those two worlds is developed in Asyncio vs Multiprocessing for Payment Ingestion; this page assembles those primitives into a complete batch architecture and shows how the pieces coordinate under load.

Concept Definition: What "Async Batch" Means for Payments

An async batch architecture is not "run everything concurrently." It is a bounded, staged dataflow in which each stage runs on the concurrency model appropriate to its workload, and stages are connected by back-pressured queues rather than direct calls. Four properties define it in a payments context:

Streaming, not slurping. Files are read as a generator of newline-aligned byte chunks. A NACHA record is exactly 94 characters; chunk boundaries must fall on record boundaries so that a type 6 Entry Detail record is never split across two workers. Peak resident memory is a function of chunk size and worker count, never of file size.
I/O off the CPU path. The asyncio event loop owns sockets and file descriptors. Any CPU work that runs inside the loop — even a Pydantic model instantiation or a decimal.Decimal sum — blocks the reactor, starving keep-alives and stalling every concurrent download. CPU work is dispatched to a concurrent.futures.ProcessPoolExecutor.
Bounded concurrency. Unbounded asyncio.gather over 500,000 records queues 500,000 coroutines and their payloads into the heap before a single one runs. A semaphore sized to the worker pool enforces back-pressure so the pipeline is memory-bounded by construction.
Idempotent, deterministic keys. Every emitted record carries a correlation ID derived from a pure function of its immutable fields, so a retried chunk or a re-delivered file collapses to the same logical event instead of double-posting.
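
The back-pressure property can be illustrated with a bounded asyncio.Queue between two stages. This is a minimal sketch, not the pipeline's actual wiring: the names producer, consumer, and run_stage and the maxsize value are hypothetical, and the consumer's work is a placeholder.

```python
import asyncio


async def producer(queue: asyncio.Queue, items: list[bytes]) -> None:
    """Feed items downstream; `put` suspends whenever the queue is full."""
    for item in items:
        await queue.put(item)   # back-pressure: waits for a free slot
    await queue.put(None)       # sentinel: no more work


async def consumer(queue: asyncio.Queue, seen_depths: list[int]) -> int:
    """Drain the queue, recording the depth observed after each get."""
    handled = 0
    while True:
        item = await queue.get()
        if item is None:
            return handled
        seen_depths.append(queue.qsize())
        await asyncio.sleep(0)  # placeholder for handing the item onward
        handled += 1


async def run_stage(items: list[bytes], maxsize: int = 4) -> tuple[int, int]:
    """Return (items handled, deepest queue depth observed)."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
    depths: list[int] = []
    feeder = asyncio.create_task(producer(queue, items))
    handled = await consumer(queue, depths)
    await feeder
    return handled, max(depths, default=0)
```

However many items the producer has, the queue never holds more than maxsize of them, so the memory a stage can pin is set by configuration rather than by input size.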

Complexity-wise, the target is a single sequential pass: a streaming, generator-based reader runs in $O(n)$ time over the record count with $O(1)$ resident memory. The failure mode to avoid is any stage that fully materializes or sorts a file, which jumps to $O(n)$ memory and $O(n \log n)$ time, precisely the transition that produces container OOM kills during a peak settlement window.

Architecture: Stages, Queues, and the CPU/I/O Split

The pipeline is a directed flow of five stages joined by bounded queues. Acquisition and dispatch are asynchronous; decode and match run in the process pool; routing and audit are asynchronous again because they are I/O-bound writes. A file that fails any stage is quarantined with its raw bytes intact rather than dropped.

The batch as a staged dataflow: the asyncio event loop fingerprints and back-pressures files through a bounded semaphore, a ProcessPoolExecutor decodes and keys records across parallel workers, the chunked matcher joins against the ledger, and the stream splits back onto the loop into a settlement path and a circuit-breaker-guarded exception path — every hop recorded to an append-only audit log.

The event loop never touches record bytes. Its job is to keep the process pool saturated but not overrun: it reads the next chunk, waits on the semaphore, hands the chunk to a worker via run_in_executor, and immediately loops to fetch the next chunk while workers decode in parallel. Matched records stream onward to settlement; the transaction matching and reconciliation algorithms that consume them — including sliding-window date reconciliation and tolerance-threshold configuration — are downstream consumers of the canonical records this architecture emits, not part of the batch loop itself.

Phase-by-Phase Implementation

The following phases build the pipeline bottom-up. Every code block uses type hints, decimal.Decimal (or integer cents) for money, and generator patterns for I/O.

Phase 1 — Stream the file as record-aligned chunks

The reader yields byte chunks that always end on a record boundary. This guarantees a worker receives whole 94-byte records and never a truncated tail.

python

from __future__ import annotations

from pathlib import Path
from typing import Iterator

RECORD_LEN = 94  # NACHA fixed-width record length, in bytes (excluding the line terminator)

def stream_record_chunks(path: Path, records_per_chunk: int = 50_000) -> Iterator[bytes]:
    """Yield byte chunks aligned to NACHA 94-byte record boundaries.

    Resident memory is O(records_per_chunk * RECORD_LEN), independent of file size.
    """
    chunk_bytes = records_per_chunk * (RECORD_LEN + 1)  # +1 for the newline terminator
    with path.open("rb") as fh:
        carry = b""
        while True:
            block = fh.read(chunk_bytes)
            if not block:
                break
            data = carry + block
            # Align to the last complete line so no record is split across chunks.
            cut = data.rfind(b"\n") + 1
            if cut == 0:
                carry = data
                continue
            yield data[:cut]
            carry = data[cut:]
        if carry.strip():
            yield carry

Phase 2 — Decode and validate inside a worker (off the event loop)

The worker runs in a separate process, so it holds no reference to the event loop and cannot block it. It decodes positional fields, coerces money to integer cents, and rejects malformed records early with a strict Pydantic model.

python

from __future__ import annotations

from decimal import Decimal
from typing import Iterator

from pydantic import BaseModel, field_validator

class EntryDetail(BaseModel):
    transaction_code: str
    routing_number: str
    account_number: str
    amount_cents: int          # store money as integer cents; never float
    trace_number: str

    @field_validator("routing_number")
    @classmethod
    def _aba_length(cls, v: str) -> str:
        if len(v) != 9 or not v.isdigit():
            raise ValueError("routing number must be 9 ABA digits")
        return v

    @property
    def amount(self) -> Decimal:
        return (Decimal(self.amount_cents) / Decimal(100)).quantize(Decimal("0.01"))

def decode_chunk(chunk: bytes) -> list[dict[str, object]]:
    """CPU-bound: runs in a ProcessPoolExecutor worker, never on the event loop."""
    out: list[dict[str, object]] = []
    for line in chunk.decode("ascii", errors="strict").splitlines():
        if len(line) < RECORD_LEN or line[0] != "6":  # Entry Detail records only
            continue
        record = EntryDetail(
            transaction_code=line[1:3],
            routing_number=line[3:12],
            account_number=line[12:29].strip(),
            amount_cents=int(line[29:39]),   # positions 30-39, implied 2 decimals
            trace_number=line[79:94].strip(),
        )
        out.append(record.model_dump())
    return out

Phase 3 — Derive a deterministic correlation ID

Idempotency comes from a pure function of immutable fields. Retries and re-deliveries produce the same key, so reconciliation state is safe to replay.

python

import hashlib

def correlation_id(trace_number: str, amount_cents: int, effective_date: str) -> str:
    """Stable SHA-256 key: identical inputs always yield the same reconciliation ID."""
    seed = f"{trace_number}|{amount_cents}|{effective_date}".encode("ascii")
    return hashlib.sha256(seed).hexdigest()
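
One way to see why the key makes replays safe is a posting step that skips any ID it has already seen. The sketch below repeats the derivation above so it runs on its own; post_once and the in-memory seen set are hypothetical stand-ins for a durable store, and the effective_date key on each record is an assumption.

```python
import hashlib


def correlation_id(trace_number: str, amount_cents: int, effective_date: str) -> str:
    # Same derivation as above, repeated so this sketch is self-contained.
    seed = f"{trace_number}|{amount_cents}|{effective_date}".encode("ascii")
    return hashlib.sha256(seed).hexdigest()


def post_once(records: list[dict[str, object]], seen: set[str]) -> list[str]:
    """Return the IDs newly posted; records whose ID is already in `seen` are skipped."""
    posted: list[str] = []
    for rec in records:
        cid = correlation_id(
            str(rec["trace_number"]),
            int(rec["amount_cents"]),
            str(rec["effective_date"]),
        )
        if cid in seen:
            continue            # replayed or re-delivered record: no second post
        seen.add(cid)
        posted.append(cid)
    return posted
```

Running the same chunk through post_once twice posts each record once; the second pass returns an empty list.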

Phase 4 — Orchestrate with asyncio + a bounded process pool

The event loop reads chunks, throttles via a semaphore, and dispatches decode work to the pool. Concurrency is bounded, so peak memory is predictable.

python

from __future__ import annotations

import asyncio
import os
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

async def run_batch(path: Path, records_per_chunk: int = 50_000) -> list[dict[str, object]]:
    max_workers = max(1, (os.cpu_count() or 2) - 1)  # reserve a core for the loop
    semaphore = asyncio.Semaphore(max_workers * 2)    # bound in-flight chunks
    loop = asyncio.get_running_loop()
    results: list[dict[str, object]] = []

    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        async def dispatch(chunk: bytes) -> list[dict[str, object]]:
            try:
                return await loop.run_in_executor(pool, decode_chunk, chunk)
            finally:
                semaphore.release()               # free the slot for the next chunk

        tasks: list[asyncio.Task] = []
        for chunk in stream_record_chunks(path, records_per_chunk):
            # Take a slot BEFORE scheduling, so the reader pauses while the pool
            # is saturated instead of reading the whole file into pending tasks.
            await semaphore.acquire()
            tasks.append(asyncio.create_task(dispatch(chunk)))
        for coro in asyncio.as_completed(tasks):
            results.extend(await coro)
    return results

Phase 5 — Route exceptions with back-pressure and a circuit breaker

Unmatched, duplicate, and discrepant records leave the batch immediately via an async queue. Downstream resolution calls (core banking, fraud scoring, notifications) are guarded by a circuit breaker so a degraded dependency degrades gracefully instead of cascading.

python

from __future__ import annotations

import time
from dataclasses import dataclass, field

@dataclass
class CircuitBreaker:
    """Trips open after `threshold` consecutive failures; half-opens after `cooldown`."""
    threshold: int = 5
    cooldown: float = 30.0
    _failures: int = 0
    _opened_at: float | None = field(default=None)

    def allow(self) -> bool:
        if self._opened_at is None:
            return True
        if time.monotonic() - self._opened_at >= self.cooldown:
            # Half-open: admit trial calls, but stay one failure away from re-tripping.
            self._opened_at = None
            self._failures = self.threshold - 1
            return True
        return False

    def record(self, ok: bool) -> None:
        if ok:
            self._failures = 0
            self._opened_at = None
        else:
            self._failures += 1
            if self._failures >= self.threshold:
                self._opened_at = time.monotonic()

EXCEPTION_CLASSES = ("FUNDING_MISMATCH", "DUPLICATE_TRACE", "INVALID_ACCOUNT", "REG_E_DISPUTE")

When the breaker is open, exceptions are persisted to a local staging table and deferred to a retry scheduler, so the main batch completes inside its SLA window while data integrity is preserved for later resolution.
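
The routing loop itself might look like the sketch below, under stated assumptions: route_exceptions, the resolve callable, and the staging list are hypothetical names, the list stands in for the local staging table, and SimpleBreaker is a reduced stand-in for the CircuitBreaker above (no cooldown) so the block runs on its own.

```python
import asyncio
from typing import Awaitable, Callable


class SimpleBreaker:
    """Stand-in breaker: opens after `threshold` consecutive failures."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold
        self.failures = 0

    def allow(self) -> bool:
        return self.failures < self.threshold

    def record(self, ok: bool) -> None:
        self.failures = 0 if ok else self.failures + 1


async def route_exceptions(
    queue: asyncio.Queue,
    breaker: SimpleBreaker,
    resolve: Callable[[dict], Awaitable[None]],
    staging: list[dict],
) -> None:
    """Drain exception records until a None sentinel arrives."""
    while True:
        item = await queue.get()
        if item is None:
            return
        if not breaker.allow():
            staging.append(item)      # breaker open: defer without calling downstream
            continue
        try:
            await resolve(item)
            breaker.record(True)
        except Exception:
            breaker.record(False)
            staging.append(item)      # keep the record for the retry scheduler
```

Creating the queue with a maxsize gives the matcher the same back-pressure the decode stage has: a full exception queue makes the producer wait instead of growing the heap.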

Edge Cases & Known Failure Modes

Async batch pipelines fail in ways synchronous ones do not — the failures are concurrency-shaped, and most are silent until reconciliation counts drift.

Failure mode	Root cause	Mitigation
Container `OOMKilled` mid-batch	`asyncio.gather` over all chunks queues every payload into the heap before workers start	Bound in-flight chunks with a `Semaphore(max_workers * 2)`; stream chunks lazily rather than building the full task list eagerly
Record split across two workers	Chunk boundary fell inside a 94-byte record	Align every chunk to the last `\n`; carry the remainder into the next read
Event loop stalls, keep-alives drop	CPU work (decode, Pydantic, Decimal) ran inside the loop instead of a worker	Keep all record-byte work in `run_in_executor`; enable `PYTHONASYNCIODEBUG=1` and alarm on the slow-callback warnings asyncio logs for callbacks exceeding `loop.slow_callback_duration`
Double-posted entries after a retry	Non-idempotent processing of a re-delivered or replayed file	Deterministic `correlation_id` plus a processed-hash ledger keyed on `originator_id + file_hash`
Silent `UnicodeDecodeError` swallowed	`decode("ascii", errors="ignore")` mangles a non-ASCII byte in an addenda memo	Decode with `errors="strict"` and quarantine the record; see encoding drift in legacy exports
Value-date drift across timezones	`effective_date` computed in local time crosses a Fed cutoff boundary	Normalize to a single timezone before keying; range-check effective vs settlement date
Cascading timeout under a slow dependency	Unbounded concurrent calls to a degraded core-banking API	Circuit breaker + exponential backoff with jitter; defer to a retry scheduler when open
`float` rounding breaks control totals	Money summed as `float` accumulates representation error	Sum integer cents or `decimal.Decimal`; never introduce a `float` intermediate
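
The "exponential backoff with jitter" mitigation in the table can be sketched as follows. This uses the common full-jitter variant (a uniform draw between zero and the capped exponential); the function names and the base and cap defaults are illustrative assumptions, not values taken from the pipeline.

```python
from __future__ import annotations

import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


def backoff_delays(attempts: int, base: float = 0.5, cap: float = 30.0,
                   rng: random.Random | None = None) -> list[float]:
    """Full jitter: delay n is drawn uniformly from [0, min(cap, base * 2**n)]."""
    rng = rng or random.Random()
    return [rng.uniform(0.0, min(cap, base * 2 ** n)) for n in range(attempts)]


async def call_with_backoff(fn: Callable[[], Awaitable[T]], attempts: int = 5,
                            base: float = 0.5, cap: float = 30.0) -> T:
    """Await fn(), retrying on any exception; re-raise after the last attempt."""
    for n, delay in enumerate(backoff_delays(attempts, base, cap)):
        try:
            return await fn()
        except Exception:
            if n == attempts - 1:
                raise
            await asyncio.sleep(delay)   # randomized wait spreads out retries
    raise ValueError("attempts must be >= 1")
```

The jitter matters because many chunks tend to fail at the same instant when a dependency degrades; randomizing each delay keeps their retries from arriving as a synchronized burst.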

Compliance & Auditability

The architecture exists inside a regulatory envelope, and several design decisions are dictated by rule rather than preference.

Reg E error-resolution timers (12 CFR 1005.11). A financial institution must generally complete its investigation of a consumer's error notice within 10 business days, extendable to 45 calendar days if it provisionally credits the account. Batch latency is therefore not merely a performance metric: a stalled decode stage consumes the same clock. The pipeline must hard-timeout via asyncio.wait_for at the batch window boundary and route unprocessed chunks to a dead-letter queue for next-cycle ingestion rather than silently missing the deadline.
NACHA idempotency and duplicate handling. The NACHA Operating Rules treat duplicate entries as originating-side errors subject to reversal timelines. The deterministic correlation_id and the originator_id + file_hash receipt key are the technical controls that keep a re-transmitted batch from manufacturing phantom debits.
Fed EPM / audit reconstruction. Every stage emits a structured JSON log record carrying batch_id, chunk_offset, correlation_id, processing_stage, processing_duration_ms, and — for exceptions — the classification from EXCEPTION_CLASSES. These are written to an append-only, tamper-evident store so an examiner can reconstruct exactly what happened to any file, at any offset, at any point in the run. Nothing is overwritten; failures preserve the raw payload alongside the normalized object and the triggering rule.
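
The hard-timeout rule could be sketched like this. It is a simplified, sequential illustration: process_with_deadline, handle, and window_seconds are hypothetical names, and the returned list stands in for the dead-letter queue.

```python
import asyncio
from typing import Awaitable, Callable


async def process_with_deadline(
    chunks: list[bytes],
    handle: Callable[[bytes], Awaitable[None]],
    window_seconds: float,
) -> list[bytes]:
    """Process chunks in order; return the ones left unprocessed at the deadline."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + window_seconds
    for index, chunk in enumerate(chunks):
        remaining = deadline - loop.time()
        if remaining <= 0:
            return chunks[index:]            # window closed before this chunk started
        try:
            await asyncio.wait_for(handle(chunk), timeout=remaining)
        except asyncio.TimeoutError:
            return chunks[index:]            # includes the chunk that was cut off
    return []
```

Because the cut-off chunk is returned along with the untouched ones, its handler needs to be idempotent: the next cycle will process that chunk again from the start.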

Audit determinism is the non-negotiable constraint that shapes the whole topology: because the log is append-only and each state transition is keyed, the entire run is replayable and every exception is timestamped, traceable, and recoverable from any failure state.

Testing & Verification

Verify the concurrency invariants, not just the happy path. Two properties matter most: chunk boundaries never split a record, and identical inputs produce identical correlation IDs.

python

import asyncio
from pathlib import Path

def test_chunks_are_record_aligned(tmp_path: Path) -> None:
    # 3 valid 94-byte '6' records + newlines
    line = "6" + "22" + "021000021" + "12345678901234567" + "0000010000" + " " * 55  # 39 + 55 = 94
    raw = ("\n".join([line, line, line]) + "\n").encode("ascii")
    f = tmp_path / "ppd.ach"
    f.write_bytes(raw)

    for chunk in stream_record_chunks(f, records_per_chunk=2):
        # Every chunk must decode cleanly into whole 94-character records.
        for parsed in chunk.decode("ascii").splitlines():
            assert len(parsed) == 94

def test_correlation_id_is_deterministic() -> None:
    a = correlation_id("072000326000001", 1000, "2026-07-02")
    b = correlation_id("072000326000001", 1000, "2026-07-02")
    assert a == b and len(a) == 64

def test_run_batch_decodes_all_entry_details(tmp_path: Path) -> None:
    line = "6" + "22" + "021000021" + "12345678901234567" + "0000010000" + " " * 55  # 39 + 55 = 94
    f = tmp_path / "b.ach"
    f.write_bytes(("\n".join([line] * 5) + "\n").encode("ascii"))
    records = asyncio.run(run_batch(f, records_per_chunk=2))
    assert len(records) == 5
    assert all(r["amount_cents"] == 10000 for r in records)

A structured audit record emitted per stage should validate against a fixed shape, which makes log-completeness itself testable:

json

{
  "batch_id": "20260702-PPD-0007",
  "correlation_id": "9f2c…e41a",
  "processing_stage": "decode",
  "chunk_offset": 100000,
  "processing_duration_ms": 42,
  "exception_class": null
}
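
A shape check for such records can be written with the standard library alone. The sketch below takes the field names and types from the sample above; audit_record_errors and AUDIT_FIELDS are hypothetical names, and treating extra fields as errors is an assumption about how strict the shape should be.

```python
import json

# Expected fields and their permitted JSON types, taken from the sample record.
AUDIT_FIELDS: dict[str, tuple[type, ...]] = {
    "batch_id": (str,),
    "correlation_id": (str,),
    "processing_stage": (str,),
    "chunk_offset": (int,),
    "processing_duration_ms": (int,),
    "exception_class": (str, type(None)),
}


def audit_record_errors(line: str) -> list[str]:
    """Return shape violations for one JSON log line; an empty list means valid."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    errors: list[str] = []
    for name, types in AUDIT_FIELDS.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif isinstance(record[name], bool) or not isinstance(record[name], types):
            # bool is rejected explicitly because it is a subclass of int.
            errors.append(f"wrong type for {name}")
    for extra in sorted(record.keys() - AUDIT_FIELDS.keys()):
        errors.append(f"unexpected field: {extra}")
    return errors
```

Running every emitted line through a check like this in a test suite turns a missing field into a failing test rather than a gap discovered during an examination.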

Frequently Asked Questions

Why not just run everything with asyncio and skip the process pool?

Because asyncio runs every coroutine on a single thread, and Python's GIL means that adding threads would not parallelize CPU work either. asyncio gives you cheap concurrency for I/O, but decoding 94-byte records, instantiating Pydantic models, and doing Decimal arithmetic are all CPU-bound and will block the event loop. The moment they do, downloads stall, keep-alives drop, and correspondent-bank acknowledgements time out. Put I/O on the loop and CPU in a ProcessPoolExecutor. The full decision matrix is in Asyncio vs Multiprocessing for Payment Ingestion.

How do I size chunks and worker counts for a given container?

Peak memory held in raw chunk payloads is roughly records_per_chunk * record_len * (max_workers * 2), because the semaphore caps in-flight chunks at twice the worker count; decoded records and pickling copies add to that figure. Start with max_workers = os.cpu_count() - 1, pick a chunk size that keeps peak RSS at or below 70% of the container limit, and confirm under a real peak-window file. Larger chunks amortize IPC overhead but raise the memory floor; keep pickled chunk payloads under a few megabytes.
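
The sizing arithmetic can be written down directly. This sketch counts raw chunk bytes only; decoded dicts, pickling copies, and interpreter overhead come on top, so the result is an upper bound on chunk size, not a recommendation. The function names and the budget_fraction parameter are hypothetical.

```python
RECORD_LEN = 94  # NACHA record length in bytes


def peak_chunk_bytes(records_per_chunk: int, max_workers: int,
                     record_len: int = RECORD_LEN) -> int:
    """Raw bytes held by in-flight chunks when the semaphore allows max_workers * 2."""
    return records_per_chunk * record_len * (max_workers * 2)


def max_records_per_chunk(container_limit_bytes: int, max_workers: int,
                          budget_fraction: float = 0.70,
                          record_len: int = RECORD_LEN) -> int:
    """Largest chunk size whose raw in-flight bytes fit inside the memory budget."""
    budget = int(container_limit_bytes * budget_fraction)
    return budget // (record_len * max_workers * 2)
```

For example, peak_chunk_bytes(50_000, 7) is 50,000 × 94 × 14 = 65,800,000 bytes, roughly 66 MB of raw payload in flight on an eight-core host.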

A re-delivered file double-posted every entry — what went wrong?

The pipeline processed the batch without an idempotency gate. Fingerprint every file with a streaming SHA-256 pass and check originator_id + file_hash against a processed-hash ledger before decoding, and give every record a deterministic correlation_id. Together they collapse byte-identical re-deliveries and same-content re-sends to a single logical event, which is exactly what the NACHA duplicate-entry rules require you to prevent.
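
A file-level gate along these lines might look like the sketch below. The names file_sha256 and should_process are hypothetical, and the in-memory set stands in for the processed-hash ledger, which in practice would be a durable store.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, block_size: int = 1 << 20) -> str:
    """Hash the file in fixed-size blocks so memory stays constant."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def should_process(path: Path, originator_id: str,
                   ledger: set[tuple[str, str]]) -> bool:
    """Return True the first time an (originator, file hash) pair is seen."""
    key = (originator_id, file_sha256(path))
    if key in ledger:
        return False        # byte-identical re-delivery: skip decoding entirely
    ledger.add(key)
    return True
```

This sketch records the key before any processing happens. A production gate would more likely write the key in the same transaction that commits the postings, so that a crash mid-batch does not mark an unprocessed file as done.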

The batch runs but the event loop periodically freezes. How do I find the culprit?

Set PYTHONASYNCIODEBUG=1. In debug mode asyncio logs a warning for every callback that runs longer than loop.slow_callback_duration (0.1 seconds by default), and each such warning points at CPU work leaking into the reactor. The usual offenders are a Pydantic validation or a Decimal sum that was accidentally called directly inside a coroutine instead of dispatched to the pool via run_in_executor. Move it off the loop.
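
To see what that diagnostic looks like, the sketch below runs a loop in debug mode with a deliberately blocking coroutine and collects the warnings asyncio writes to its "asyncio" logger. find_slow_callbacks and leaky are hypothetical names, and the 50 ms threshold is chosen only to keep the demonstration short.

```python
import asyncio
import logging
import time


def find_slow_callbacks(threshold_s: float = 0.05) -> list[str]:
    """Run a debug-mode loop and return asyncio's slow-callback warnings."""
    messages: list[str] = []

    class Collect(logging.Handler):
        def emit(self, record: logging.LogRecord) -> None:
            messages.append(record.getMessage())

    async def leaky() -> None:
        time.sleep(threshold_s * 3)   # deliberate bug: blocking call inside a coroutine

    async def main() -> None:
        asyncio.get_running_loop().slow_callback_duration = threshold_s
        await asyncio.create_task(leaky())

    logger = logging.getLogger("asyncio")
    handler = Collect(level=logging.WARNING)
    logger.addHandler(handler)
    try:
        asyncio.run(main(), debug=True)   # same effect as PYTHONASYNCIODEBUG=1
    finally:
        logger.removeHandler(handler)
    return [m for m in messages if "took" in m]
```

Each returned message names the task or handle that overran the threshold, which is usually enough to locate the call that should have gone through run_in_executor.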

What happens to exceptions when a downstream API is down?

The circuit breaker trips after consecutive failures and, while open, exceptions are persisted to a local staging table and handed to a retry scheduler rather than retried inline. This lets the main reconciliation loop finish inside its SLA window while preserving every exception for deferred resolution — no cascading timeouts, no lost disputes.

Can I use polars or duckdb instead of pandas for the matching stage?

Yes, and for high volumes you should. Once chunks are decoded into a columnar frame, the matching join benefits from an Arrow-backed engine; the tradeoffs and drop-in patterns are covered in High-Volume Pandas Parsing Strategies. Whatever engine you choose, keep the join vectorized and per-chunk rather than row-by-row across the whole file.

Related

Asyncio vs Multiprocessing for Payment Ingestion — the exact I/O-vs-CPU decision boundary and the memory-leak scenarios that force a hybrid model.
Fixed-Width File Decoding — the positional slicing and boundary enforcement each worker performs before validation.
High-Volume Pandas Parsing Strategies — vectorized matching, categorical dtypes, and polars/duckdb tradeoffs for the aggregation stage.
Pydantic Schema Validation for Payments — the strict validation gate that rejects malformed records before they reach the matching engine.
Automated File Ingestion & Parsing Pipelines — the parent reference that maps this batch architecture into the full ingestion domain.
