Optimizing Async Batch Jobs for 100k+ Daily Reads in Municipal Utility Billing

A pipeline that ingested twenty thousand meters overnight can quietly miss its settlement window at a hundred thousand — not because the code is wrong, but because a fixed 10,000-record chunk size and unbounded concurrency turn a post-outage data burst into connection-pool exhaustion, cascading TimeoutErrors, and a stalled batch. This page is the concurrency-tuning drill-down for Async Batch Processing for High-Volume Reads inside the Meter Data Ingestion & Validation Pipelines subsystem: given a decoupled, already-validated read stream, how do you size batches and cap writers so 100k+ daily reads land inside the billing window without saturating the ledger database? The answer is to stop guessing batch sizes and instead let measured write latency drive them, with a semaphore ceiling and backpressure that throttle ingestion the moment the ledger starts to strain.

Minimal prerequisites

This page assumes the upstream stages already exist: a decoupled ingestion queue, Schema Validation & Data Quality Checks at the boundary, and a ledger table with a UNIQUE constraint on the idempotency key. The tuning layer needs only:

import asyncio
import time
from collections import deque
from dataclasses import dataclass, field
from decimal import Decimal

import asyncpg  # PostgreSQL async driver; connection pool is the scarce resource

Data assumptions: each read is a validated record carrying meter_id: str, interval_utc (timezone-aware, UTC), kwh: Decimal, sequence: int, and a precomputed idempotency_key: str. Consumption and money are Decimal throughout — never float — so summing millions of intervals never drifts a penny at month-end reconciliation.
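To make the float warning concrete, here is a minimal sketch that accumulates the same ten interval reads once as float and once as Decimal. The 0.1 kWh value is made up for illustration; only the drift matters.

```python
from decimal import Decimal

# Ten illustrative interval reads of 0.1 kWh each (hypothetical values).
float_total = 0.0
decimal_total = Decimal("0")
for _ in range(10):
    float_total += 0.1               # binary float cannot represent 0.1 exactly
    decimal_total += Decimal("0.1")  # exact base-10 arithmetic

# The float running total misses 1.0; the Decimal total is exact.
float_is_exact = float_total == 1.0
decimal_is_exact = decimal_total == Decimal("1.0")
```

Ten reads already lose exactness in float; a month of intervals across a whole service territory compounds the same error.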

Annotated implementation — a latency-driven adaptive batch controller

The core optimization is a controller that measures how long each ledger write actually takes and moves the next batch size toward a target latency band. A bounded asyncio.Semaphore caps how many batches hit the connection pool at once, and an EWMA (exponentially weighted moving average) of write latency smooths out single-batch noise so the size does not oscillate. Every decision below is annotated with the billing reason it exists.

@dataclass
class AdaptiveBatcher:
    pool: asyncpg.Pool
    # Concurrency ceiling: never open more concurrent writers than the pool
    # can serve, or you trade throughput for TimeoutError storms.
    max_concurrent_writers: int = 8
    # Batch-size search space. Municipal ledgers rarely benefit above ~5k
    # rows per executemany call; below ~500 the per-round-trip overhead dominates.
    min_batch: int = 500
    max_batch: int = 5_000
    batch_size: int = 1_000
    # Target write latency band (seconds). Grow the batch below the low
    # watermark, shrink it above the high watermark. Keeps the batch inside
    # the settlement window without starving the pool.
    low_watermark: float = 0.15
    high_watermark: float = 0.40
    _latency_ewma: float = field(default=0.0)
    _sem: asyncio.Semaphore = field(init=False)

    def __post_init__(self) -> None:
        self._sem = asyncio.Semaphore(self.max_concurrent_writers)

    def _resize(self, observed_latency: float) -> None:
        # Smooth the signal so one slow write doesn't halve the batch.
        alpha = 0.3
        self._latency_ewma = (
            observed_latency if self._latency_ewma == 0.0
            else alpha * observed_latency + (1 - alpha) * self._latency_ewma
        )
        if self._latency_ewma > self.high_watermark:
            # Ledger is straining — back off multiplicatively (fast retreat).
            self.batch_size = max(self.min_batch, self.batch_size // 2)
        elif self._latency_ewma < self.low_watermark:
            # Headroom available — grow additively (gentle probe upward).
            self.batch_size = min(self.max_batch, self.batch_size + 250)

    async def _write_batch(self, rows: list[tuple]) -> None:
        # Semaphore is the hard backpressure valve: when all permits are
        # taken, ingestion awaits here instead of piling work onto the pool.
        async with self._sem:
            start = time.monotonic()
            async with self.pool.acquire() as conn:
                # Exactly-once handoff: idempotency_key is UNIQUE, so a
                # re-delivered or retried read is a safe no-op, never a
                # double charge on the customer account.
                await conn.executemany(
                    """
                    INSERT INTO meter_reads
                        (meter_id, interval_utc, kwh, sequence, idempotency_key)
                    VALUES ($1, $2, $3, $4, $5)
                    ON CONFLICT (idempotency_key) DO NOTHING
                    """,
                    rows,
                )
            self._resize(time.monotonic() - start)

    async def run(self, source: "asyncio.Queue[dict | None]") -> None:
        # Drain the validated read stream into right-sized batches. A None
        # sentinel signals end-of-feed so in-flight work can settle cleanly.
        pending: deque[tuple] = deque()
        tasks: set[asyncio.Task] = set()
        while True:
            record = await source.get()
            if record is not None:
                pending.append((
                    record["meter_id"],
                    record["interval_utc"],
                    Decimal(record["kwh"]),  # Decimal end-to-end, no float
                    record["sequence"],
                    record["idempotency_key"],
                ))
            drain = record is None
            # Flush whenever we've accumulated a full adaptive batch, or on
            # end-of-feed so the tail of the run is never left unposted.
            while len(pending) >= self.batch_size or (drain and pending):
                take = min(self.batch_size, len(pending))
                rows = [pending.popleft() for _ in range(take)]
                # Producer-side backpressure: while every writer slot is busy,
                # stop pulling from the queue instead of stacking up tasks.
                while len(tasks) >= self.max_concurrent_writers:
                    done, tasks = await asyncio.wait(
                        tasks, return_when=asyncio.FIRST_COMPLETED
                    )
                    for finished in done:
                        finished.result()  # re-raise a failed write at once
                tasks.add(asyncio.create_task(self._write_batch(rows)))
            if drain:
                break
        # Surface any remaining writer exception instead of swallowing it.
        await asyncio.gather(*tasks)

Two limits work together here. The semaphore caps concurrent load on the connection pool, and run refuses to start another writer task while max_concurrent_writers batches are already in flight, so the producer stops pulling from the queue and the backlog waits there rather than as an ever-growing pile of tasks. That is the mechanism that keeps a burst, such as the head-end replaying a day of buffered intervals after a comms outage, from opening more connections than PostgreSQL can serve. Failed writes are re-raised as soon as they are observed, so a writer exception stops the run instead of being discarded with its finished task.
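The concurrency ceiling is easy to verify in isolation. The sketch below is not part of the controller: measure_peak and fake_writer are hypothetical helpers, and asyncio.sleep stands in for the ledger round-trip. It launches far more writers than permits and records how many ever overlap.

```python
import asyncio


async def measure_peak(jobs: int, permits: int) -> int:
    """Run `jobs` fake writers behind a semaphore and return the peak overlap."""
    sem = asyncio.Semaphore(permits)
    in_flight = 0
    peak = 0

    async def fake_writer() -> None:
        nonlocal in_flight, peak
        async with sem:  # same guard _write_batch uses
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for the ledger round-trip
            in_flight -= 1

    await asyncio.gather(*(fake_writer() for _ in range(jobs)))
    return peak


peak = asyncio.run(measure_peak(jobs=50, permits=8))
```

With 50 jobs and 8 permits the peak overlap is 8, no matter how many writers are queued behind the semaphore.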

Edge cases and billing gotchas

Post-outage burst overwhelms the pool. When AMI/AMR feed synchronization recovers, a head-end can dump a full day of buffered intervals in minutes. The semaphore ceiling is your first defense, but also watch the database side: query pg_stat_activity for the count of active connections and the wait_event_type column. If writers are stacking on Lock waits, the adaptive controller will already be shrinking batch_size, but you should confirm max_concurrent_writers is below the pool’s max_size with headroom for other services.
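One way to run that database-side check from the same process is sketched below. pg_stat_activity, state, and wait_event_type are standard PostgreSQL names; the helper names (summarize_waits, writers_have_headroom, fetch_wait_profile) and the reserved_for_others default are assumptions made for illustration, and pool is expected to be an asyncpg.Pool.

```python
from collections import Counter
from typing import Any, Iterable, Mapping

# Count backends on the current database, grouped by state and wait type.
WAIT_PROFILE_SQL = """
SELECT state, wait_event_type, count(*) AS n
FROM pg_stat_activity
WHERE datname = current_database()
GROUP BY state, wait_event_type
"""


def summarize_waits(rows: Iterable[Mapping[str, Any]]) -> dict:
    """Collapse grouped rows into {wait_event_type: active backend count}."""
    summary: Counter = Counter()
    for row in rows:
        if row["state"] == "active":
            # wait_event_type is NULL when a backend is not waiting.
            summary[row["wait_event_type"] or "none"] += row["n"]
    return dict(summary)


def writers_have_headroom(
    max_concurrent_writers: int, pool_max_size: int, reserved_for_others: int = 2
) -> bool:
    """True if the writer ceiling leaves spare pool connections (assumed margin)."""
    return max_concurrent_writers + reserved_for_others <= pool_max_size


async def fetch_wait_profile(pool: Any) -> dict:
    # Pool.fetch runs the query on a pooled connection and returns records
    # that support row["column"] lookup.
    return summarize_waits(await pool.fetch(WAIT_PROFILE_SQL))
```

A large "Lock" count in the result during a burst is the cue to lower max_concurrent_writers before touching the batch bounds.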

Batch-size oscillation (flapping). Without smoothing, a single slow write halves the batch, the next fast write grows it, and throughput sawtooths. The EWMA (alpha = 0.3) plus multiplicative-decrease / additive-increase asymmetry damps this: retreat fast under strain, probe up gently when there is headroom. If the batch size still flaps, widen the watermark band rather than lowering alpha to near zero.
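The damping effect can be traced by hand. The sketch below restates the resize rule as a pure function (step and trace are hypothetical names; the defaults mirror the AdaptiveBatcher fields) and feeds it an invented latency series with one 0.90 s spike, once with alpha = 0.3 and once with alpha = 1.0, which is equivalent to using the raw last-write latency.

```python
def step(ewma, batch, latency, *, alpha=0.3, low=0.15, high=0.40,
         min_batch=500, max_batch=5_000, grow=250):
    """One resize decision: update the EWMA, then apply AIMD to the batch size."""
    ewma = latency if ewma == 0.0 else alpha * latency + (1 - alpha) * ewma
    if ewma > high:
        batch = max(min_batch, batch // 2)      # multiplicative decrease
    elif ewma < low:
        batch = min(max_batch, batch + grow)    # additive increase
    return ewma, batch


def trace(latencies, *, alpha, batch=1_000):
    """Return the batch size chosen after each observed write latency."""
    ewma, sizes = 0.0, []
    for latency in latencies:
        ewma, batch = step(ewma, batch, latency, alpha=alpha)
        sizes.append(batch)
    return sizes


samples = [0.10, 0.10, 0.90, 0.10, 0.10]  # invented series with one spike
smoothed = trace(samples, alpha=0.3)
raw = trace(samples, alpha=1.0)
```

With smoothing, the spike lifts the average only to 0.34 s, inside the band, so the size holds: 1250, 1500, 1500, 1500, 1500. Without it, the same spike halves the batch and the controller spends the next writes climbing back: 1250, 1500, 750, 1000, 1250.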

Meter rollover inside a batch. A fixed-width mechanical register wraps back through zero, so a naive consumption delta reads as a huge negative value. Do not let a rollover-corrupted kwh inflate a batch or trip your write-latency reading — quarantine oversized decreases and apply modulus-aware correction upstream, so the batch controller only ever sees clean, validated deltas. Disposition of these anomalies belongs in Reading Anomaly Detection Algorithms, not in the tuning layer.
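For reference, the upstream correction might look like the sketch below. It is not part of the tuning layer, and rollover_delta, register_modulus, and max_plausible are hypothetical names; the register width and plausibility ceiling would come from meter metadata in a real pipeline.

```python
from decimal import Decimal
from typing import Optional


def rollover_delta(
    previous: Decimal,
    current: Decimal,
    register_modulus: Decimal,
    max_plausible: Decimal,
) -> Optional[Decimal]:
    """Return the consumption delta, or None if the read should be quarantined."""
    delta = current - previous
    if delta < 0:
        # The register wrapped through zero: add one full register span.
        delta += register_modulus
    # A corrected delta that is still implausibly large is more likely a
    # meter swap or a bad read than a genuine rollover.
    return delta if delta <= max_plausible else None


# A five-digit register (modulus 100000) rolling from 99990 to 15.
wrapped = rollover_delta(Decimal("99990"), Decimal("15"),
                         Decimal("100000"), Decimal("500"))
# A drop from 50000 to 40000 is not explained by a rollover.
suspect = rollover_delta(Decimal("50000"), Decimal("40000"),
                         Decimal("100000"), Decimal("500"))
```

Here wrapped is Decimal("25") and suspect is None, so only the first read would reach the batch controller.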

Partial-batch flush at end-of-feed. The final partial batch (fewer rows than batch_size) must still post, or the tail of the run silently vanishes and those meters miss the billing cycle. The drain and pending branch handles it — verify it in tests, because a fencepost bug here does not raise, it just under-bills.

Verification snippet

Assert the controller both preserves every read (no dropped tail) and reacts to latency in the right direction, using an in-memory fake instead of a live ledger:

import asyncio
from decimal import Decimal


def test_adaptive_batcher_posts_all_reads_and_shrinks_under_latency():
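    # min_batch=1 so halving from batch_size=2 is a real shrink; with the
    # default min_batch=500 the clamp would raise the size to 500 instead.
    batcher = AdaptiveBatcher(pool=None, min_batch=1, batch_size=2, high_watermark=0.05)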
    written: list[tuple] = []

    async def fake_write(rows):
        written.extend(rows)
        batcher._resize(observed_latency=0.50)  # simulate a slow ledger

    batcher._write_batch = fake_write  # type: ignore[method-assign]

    async def drive():
        q: asyncio.Queue = asyncio.Queue()
        for i in range(5):  # odd count -> forces a partial tail flush
            await q.put({
                "meter_id": f"M{i}", "interval_utc": None,
                "kwh": Decimal("1.5"), "sequence": i,
                "idempotency_key": f"k{i}",
            })
        await q.put(None)
        await batcher.run(q)

    asyncio.run(drive())

    assert len(written) == 5           # nothing dropped, tail included
    assert batcher.batch_size == batcher.min_batch  # shrank under strain
    assert all(isinstance(r[2], Decimal) for r in written)  # decimal preserved

The odd read count (5 with batch_size=2) is deliberate: it forces the partial-tail path that a fencepost bug would silently skip.

Frequently Asked Questions

What batch size should I start with for 100k daily reads?

Do not pick a fixed number; that is the anti-pattern this page replaces. Seed batch_size at roughly 1,000 and let the latency watermarks move it. On typical municipal PostgreSQL ledgers the controller settles somewhere between 1,000 and 3,000 rows per batch; the exact figure depends on row width, indexes, and what else is contending for the pool, which is precisely why you measure it instead of hard-coding it.

How do I stop a post-outage burst from exhausting the connection pool?

Cap concurrent writers with asyncio.Semaphore(max_concurrent_writers) set below the pool’s max_size, so a batch waits for a permit instead of opening another connection. The adaptive controller then shrinks the batch as write latency climbs, so the burst is absorbed as slower, smaller writes rather than a TimeoutError cascade.

Why use an EWMA instead of the raw last-write latency?

A single slow write — a checkpoint, a vacuum, a brief lock — should not halve your batch size. The exponentially weighted moving average keeps recent history so the controller reacts to a sustained trend, not one noisy sample. Combined with fast multiplicative decrease and slow additive increase, it converges smoothly instead of sawtoothing.

Does shrinking the batch risk missing the billing window?

Smaller batches mean more round-trips, but they only trigger when the ledger is already latency-bound, where large batches would time out and force blind retries — the slower outcome. Sustained shrinking to min_batch is a signal to investigate the database (locks, missing index, undersized pool), not to raise the ceiling and push it harder.
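A back-of-envelope estimate shows how much round-trip cost the floor actually adds. The function below is a deliberately crude model with a hypothetical name: it assumes every wave of concurrent writers takes exactly latency_s and ignores retries, queue waits, and lock contention.

```python
import math


def estimate_drain_seconds(
    reads: int, batch_size: int, latency_s: float, writers: int
) -> float:
    """Rough write time: batches run in waves of `writers`, each taking latency_s."""
    batches = math.ceil(reads / batch_size)
    waves = math.ceil(batches / writers)
    return waves * latency_s


# Worst case inside the controller's envelope: floor batch size, latency
# sitting at the high watermark, default writer ceiling.
floor_case = estimate_drain_seconds(
    reads=100_000, batch_size=500, latency_s=0.40, writers=8
)
```

At the floor, 100,000 reads become 200 batches in 25 waves, or about 10 seconds of write time under this model. If a run at min_batch is taking far longer than that, the extra time is coming from the database, not from the smaller batches.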

Where does idempotency fit into the tuning layer?

It is orthogonal but non-negotiable: every write uses ON CONFLICT (idempotency_key) DO NOTHING, so a batch retried after a worker crash or broker replay re-posts as a safe no-op. That property is what lets you tune concurrency aggressively — a retried batch can never double-bill a customer. The full retry and dead-letter machinery lives in Error Handling & Retry Workflows.
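The replay-safety property can be checked without a live ledger. The sketch below uses SQLite's in-memory database purely as a stand-in, which is an assumption for illustration: the production table is PostgreSQL, and SQLite needs version 3.24 or later for the ON CONFLICT clause. The rows are made-up values, with kwh stored as text to keep it exact.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE meter_reads (
        meter_id TEXT,
        interval_utc TEXT,
        kwh TEXT,
        sequence INTEGER,
        idempotency_key TEXT UNIQUE
    )
    """
)

INSERT_SQL = """
INSERT INTO meter_reads (meter_id, interval_utc, kwh, sequence, idempotency_key)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT (idempotency_key) DO NOTHING
"""

# Hypothetical batch of two validated reads.
batch = [
    ("M1", "2024-01-01T00:00:00+00:00", "1.5", 1, "k1"),
    ("M2", "2024-01-01T00:00:00+00:00", "2.0", 1, "k2"),
]

conn.executemany(INSERT_SQL, batch)  # first delivery
conn.executemany(INSERT_SQL, batch)  # simulated retry or broker replay

row_count = conn.execute("SELECT count(*) FROM meter_reads").fetchone()[0]
```

The second delivery inserts nothing, so row_count stays at 2 and no meter is billed twice.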

Related Topics

Async Batch Processing for High-Volume Reads: the end-to-end decoupled architecture this tuning layer plugs into.
Schema Validation & Data Quality Checks: the boundary gate that guarantees the controller only ever batches clean reads.
AMI/AMR Feed Synchronization Protocols: where the post-outage bursts this page defends against originate.
Reading Anomaly Detection Algorithms: rollover and spike disposition kept out of the tuning path.
Error Handling & Retry Workflows: jittered backoff, dead-letter queues, and circuit breakers around the ledger write.
