Large Files

Tune Reconify's output format, index backend, and memory settings for high-volume reconciliation runs.

Reconify is built around streaming: it indexes the right source, then streams the left source row by row and emits events incrementally. That design handles large files well, but the default settings (full JSON output, memory index) can become bottlenecks past a certain scale. This guide covers the specific flags to reach for when a run is slow, consuming too much memory, or producing output files too large to open comfortably.

What you'll learn

By the end of this guide, you'll know:

which output format to choose for files with hundreds of thousands of rows,
how the index backend affects memory usage and when to switch modes,
and how to enable progress logging and tune token-match buffering for large unmatched sets.

How it works

The right source is fully indexed before the left source is streamed. With the memory backend, the entire right-side index lives in RAM. With the disk backend, it spills to a SQLite-backed file. auto chooses between the two based on right-file size. See Overview for the pipeline and Config Reference for the full index schema.
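The index-then-stream order can be pictured with a minimal Python sketch. This is not Reconify's implementation: the function name `reconcile`, the `ref` key, and the event shapes are all hypothetical, chosen only to show why memory use tracks the size of the right source and not the left.

```python
from typing import Dict, Iterable, Iterator, List


def reconcile(left: Iterable[dict], right: Iterable[dict], key: str = "ref") -> Iterator[dict]:
    """Index the right source fully, then stream the left source row by row."""
    # Phase 1: the whole right side is held in the index
    # (this is the part that lives in RAM with an in-memory backend).
    index: Dict[str, List[dict]] = {}
    for row in right:
        index.setdefault(row[key], []).append(row)

    # Phase 2: left rows are consumed one at a time and events are
    # emitted incrementally, so the left source is never fully loaded.
    for row in left:
        candidates = index.get(row[key])
        if candidates:
            yield {"event": "matched", "left": row, "right": candidates.pop(0)}
        else:
            yield {"event": "unmatched_left", "left": row}

    # Anything still in the index was never claimed by a left row.
    for rows in index.values():
        for row in rows:
            yield {"event": "unmatched_right", "right": row}
```

Because `reconcile` is a generator, a caller can write each event out as soon as it is produced, which is the property the streaming output formats rely on.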

Command

reconify reconcile \
  --config reconify.yaml \
  --pair bank_vs_ledger \
  --format ndjson \
  --progress \
  --out results.ndjson

Steps

Switch to a streaming output format

For files above roughly 500k rows, avoid json and table. Use one of:

Format	Best for
`ndjson`	Piping through `jq`, streaming to downstream tools.
`csv`	Loading into spreadsheets or databases row by row.
`json-stream`	Section-by-section JSON without one massive object.

reconify reconcile \
  --config reconify.yaml \
  --pair bank_vs_ledger \
  --format ndjson \
  --out results.ndjson
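One benefit of NDJSON is that a downstream script can process the output one line at a time. The sketch below tallies records by a single field without loading the whole file. It assumes only that each line is a standalone JSON object; the field name `type` is a placeholder, not a documented Reconify field, so substitute whatever your output actually contains.

```python
import json
from collections import Counter
from typing import Iterable


def count_events(lines: Iterable[str], field: str = "type") -> Counter:
    """Tally records by one field while holding a single line in memory."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        # "type" is a hypothetical field name; records without it are grouped.
        counts[record.get(field, "<missing>")] += 1
    return counts
```

To use it, pass an open file handle, for example `with open("results.ndjson") as fh: print(count_events(fh))`. Iterating over the handle reads lazily, so memory stays flat regardless of file size.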

Configure the index backend

Add an index block to your config:

index:
  backend: auto
  spill_dir: "/tmp/reconify"
  auto_max_right_file_mb: 2048

memory — fastest. Use when the right-side index fits comfortably in RAM.
disk — lower memory. Lookups are slower but the process doesn't balloon in size.
auto — switches to disk when the right-side file exceeds auto_max_right_file_mb. Start here.
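The auto rule described above amounts to a size comparison. The following sketch is an illustration only: the function name `choose_backend` is hypothetical, and it assumes a megabyte is 1024 × 1024 bytes, which the text here does not confirm.

```python
BYTES_PER_MB = 1024 * 1024  # assumption: binary megabytes


def choose_backend(right_file_bytes: int, auto_max_right_file_mb: int = 2048) -> str:
    """Pick "disk" only when the right-side file exceeds the threshold."""
    size_mb = right_file_bytes / BYTES_PER_MB
    return "disk" if size_mb > auto_max_right_file_mb else "memory"
```

With the threshold of 2048 from the config above, a right-side file of exactly 2048 MB would stay in memory under this reading, and anything larger would spill to disk. Calling `choose_backend(os.path.getsize(path))` on your own right-side file gives a quick estimate of which mode to expect.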

Enable progress logging

reconify reconcile \
  --config reconify.yaml \
  --pair bank_vs_ledger \
  --format ndjson \
  --progress \
  --progress-every 1000000 \
  --out results.ndjson

Progress logs go to stderr, separate from the output file, so they don't corrupt NDJSON output or interfere with piped commands.
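The reason this separation matters can be shown with a small sketch that writes records to one stream and progress messages to another. The function name `emit` and the log message wording are hypothetical and do not reflect Reconify's actual log format.

```python
import json
import sys
from typing import Iterable, TextIO


def emit(records: Iterable[dict], out: TextIO = sys.stdout,
         log: TextIO = sys.stderr, every: int = 1_000_000) -> int:
    """Write one JSON record per line to `out`; report progress on `log`."""
    count = 0
    for count, record in enumerate(records, start=1):
        out.write(json.dumps(record) + "\n")
        if count % every == 0:
            # Progress goes to a different stream, so `out` stays valid NDJSON.
            log.write(f"processed {count} rows\n")
    return count
```

Since the two streams never mix, every line of the data stream still parses as JSON, and a consumer reading from a pipe sees only records.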

Skip raw fields to reduce allocation pressure

In each source's parser config:

parser:
  skip_raw: true

By default, every parsed transaction carries its original row fields in raw. For large sources where you don't need those original fields in the output, skip_raw: true significantly reduces the memory allocated per row.
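To see why dropping raw helps, consider a rough model of a parsed transaction. The `Transaction` class, its field names, and `parse_row` are hypothetical stand-ins, not Reconify's internal types; the point is only that keeping raw means holding a second copy of every row.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Transaction:
    reference: str
    amount: str
    raw: Optional[Dict[str, str]] = None  # original row fields, if kept


def parse_row(row: Dict[str, str], skip_raw: bool = False) -> Transaction:
    """Build a transaction, copying the original row only when asked to."""
    return Transaction(
        reference=row["reference"],
        amount=row["amount"],
        # With skip_raw, no per-row dictionary copy is allocated.
        raw=None if skip_raw else dict(row),
    )
```

In this model, each row with raw retained costs one extra dictionary plus references to all of its values, which adds up quickly across millions of rows.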

Tune token-match buffering

If name_mode: "tokens" is set, Reconify buffers unmatched rows after reference matching to run a second token-similarity pass. For large datasets, that buffer can become substantial:

reconify reconcile \
  --config reconify.yaml \
  --pair bank_vs_ledger \
  --max-token-buffer 100000

If you don't need fallback name matching, set name_mode: "none" in the pair config to disable the buffer entirely.
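The cost of the second pass is easier to reason about with a sketch. The code below uses Jaccard similarity over lowercase word tokens, which is a common choice but only an assumption about how Reconify scores names. The overflow policy (rows beyond the cap are reported as unmatched) and the names `token_pass`, `threshold`, and `max_buffer` are likewise hypothetical.

```python
from typing import FrozenSet, List, Optional, Tuple


def tokens(name: str) -> FrozenSet[str]:
    """Split a name into a set of lowercase word tokens."""
    return frozenset(name.lower().split())


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Shared tokens divided by total distinct tokens."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def token_pass(unmatched_left: List[str], unmatched_right: List[str],
               threshold: float = 0.5,
               max_buffer: int = 100_000) -> List[Tuple[str, Optional[str]]]:
    """Pair buffered left names with their best-scoring unused right name."""
    # Only the first max_buffer rows are held (assumed overflow policy).
    buffered = unmatched_left[:max_buffer]
    overflow = unmatched_left[max_buffer:]
    right = [(name, tokens(name)) for name in unmatched_right]
    used = set()
    results: List[Tuple[str, Optional[str]]] = []
    for name in buffered:
        toks = tokens(name)
        best_i, best_score = None, 0.0
        for i, (_, rtoks) in enumerate(right):
            if i in used:
                continue
            score = jaccard(toks, rtoks)
            if score > best_score:
                best_i, best_score = i, score
        if best_i is not None and best_score >= threshold:
            used.add(best_i)
            results.append((name, right[best_i][0]))
        else:
            results.append((name, None))
    # Rows that did not fit in the buffer are never compared.
    results.extend((name, None) for name in overflow)
    return results
```

In this sketch every buffered row is compared against every remaining right-side row, so both memory and run time grow with the buffer. That is the trade-off a cap on the buffer is managing.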

Verify it worked

Check the summary section of the output for row counts and match rate. Confirm that memory usage stayed within expected bounds during the run, using the --progress output to track row throughput. If the process still exhausts memory after switching to the disk index, the most likely culprit is a very large token-match buffer: reduce --max-token-buffer or disable token mode.
