NATS 2.14 (RC 1) Fast-Ingest Batch Publishing

The first release candidate of NATS 2.14 is out, and one of its standout JetStream additions is fast-ingest batch publishing: a new batch-publishing mode built purely for throughput, complementing the atomic batch publishing that shipped with 2.12.

This post compares fast-ingest to the existing atomic batches, shows how to use it via the orbit.go library, and benchmarks the three publishing paths (plain publish, atomic batch, and fast-ingest) on both memory and file storage.

Heads-up: 2.14 is still a release candidate. The fast-ingest client API just shipped in orbit.go's jetstreamext/v0.3.0 (from PR #32). The benchmarks in this post were run against that PR's branch at commit 4a29bfca, which matches what landed in the release.

Atomic vs Fast-Ingest: The Short Version

Both modes batch multiple messages under a single logical operation, but they trade off differently:

	Atomic batch (2.12+)	Fast-ingest batch (2.14+)
Guarantee	All-or-nothing: either every message persists or none do	Best-effort: messages can be lost; gap behavior is configurable (`ok` or `fail`)
Ordering	Guaranteed via atomicity (all-or-nothing staging)	Guaranteed with `gap: fail` (batch aborts on first missing seq, what landed is contiguous and in order)
Batch size limit	1,000 messages	Unlimited
Server behavior	Stages messages until commit	Persists messages as they arrive
Flow control	None	Server-driven, with tunable ack frequency
Leader change mid-batch	Staged state lost, so batch fails	Continues if gap mode is `ok`; abandoned if `fail`
Intended for	Consistency across related messages	High-throughput ingest

Reach for atomic batching when a group of messages must succeed or fail as a unit. Think “these three events belong to one transaction”. Reach for fast-ingest when you’re pumping telemetry, IoT samples, or log lines into JetStream as fast as possible and can tolerate occasional gaps. ADR-50 defines both.

One key protocol difference: atomic batches carry control information in headers (Nats-Batch-Id, Nats-Batch-Sequence, Nats-Batch-Commit). Fast-ingest encodes control information in the reply subject (<prefix>.<uuid>.<flow>.<gap-mode>.<seq>.<op>.$FI). That tiny shift gives fast-ingest its throughput edge: no per-message header parsing, and the server uses lightweight flow control instead of staging.

Enabling Fast-Ingest on a Stream

Fast-ingest, like atomic batching, is opt-in per stream:

cfg := jetstream.StreamConfig{
    Name:               "EVENTS",
    Subjects:           []string{"events.>"},
    AllowAtomicPublish: true, // atomic batches (allow_atomic)
    AllowBatchPublish:  true, // fast-ingest    (allow_batched)
}

The server added AllowBatchPublish in 2.14. The nats.go field already lives in the 2.14-dev branch but hasn’t landed on main or in a release yet (the latest tagged version, v1.51.0, still only has AllowAtomicPublish). Until it ships, set allow_batched: true manually when creating the stream via the JSON API, or pull from the 2.14-dev branch.

Publishing with `FastPublisher`

The orbit.go API mirrors the existing BatchPublisher, but drops the atomicity guarantee:

import "github.com/synadia-io/orbit.go/jetstreamext"

fp, err := jetstreamext.NewFastPublisher(js)
if err != nil {
    log.Fatal(err)
}

for i := 0; i < 100_000; i++ {
    if _, err := fp.Add("events.sensor", payload(i)); err != nil {
        log.Fatal(err)
    }
}

// Commit returns a final ack that includes the last persisted sequence.
ack, err := fp.Commit(ctx, "events.sensor", payload(100_000))
if err != nil {
    log.Fatal(err)
}
fmt.Printf("last sequence: %d\n", ack.Sequence)

A few things worth calling out:

Non-concurrent use. A FastPublisher ties to a single goroutine. For parallel producers, create one publisher per goroutine.
Close() instead of Commit() works when you don’t want to persist a final message. You just want a clean end-of-batch signal and ack.
Flow control is tunable. The server drives how often it acks; you can nudge it:

fp, err := jetstreamext.NewFastPublisher(js, jetstreamext.FastPublishFlowControl{
    Flow:               200,               // client-side max; server caps at min(500/N, Flow) where N = concurrent fast publishers on the stream
    MaxOutstandingAcks: 3,                 // client-side back-pressure threshold
    AckTimeout:         10 * time.Second,
})

Gap handling. By default (gap: fail), if the server notices a gap (lost message), it abandons the batch — which is also what gives you guaranteed ordered publishing: what persisted is contiguous and in publish order up to the abort point. Opting into gap: ok via WithFastPublisherContinueOnGap(true) trades that ordering guarantee for higher resilience: the batch keeps going and you hear about each dropped message via the error handler (gaps are only surfaced there, there is no automatic logging):
gap-handler.go
```
fp, err := jetstreamext.NewFastPublisher(js,
    jetstreamext.WithFastPublisherContinueOnGap(true),
    jetstreamext.WithFastPublisherErrorHandler(func(err error) {
        log.Printf("fast-ingest: %v", err) // fires on BatchFlowGap and flow errors
    }),
)
```

The Benchmark

Enough theory. Each cell below is the median of 5 timed runs of 100,000 messages (plus a 5,000-message warm-up), on an Apple M3 with 16 GB RAM and an APFS SSD. Single nats-server (v2.12.7 or v2.14.0-RC.1), R1 stream on loopback, no clustering. Payloads of 64 B, 256 B, and 4 KiB; batch sizes 10/100/1000 for both batch methods plus 2000/5000/10000 for fast-ingest. Fast-ingest uses orbit.go defaults (Flow=100, MaxOutstandingAcks=2); tuning comes further down.

Results: File storage

Throughput, file storage

On file storage, every write still goes through a syscall path and lands in NATS’s on-disk message blocks, even though the server only fsyncs periodically in the background (sync_interval, default 2 minutes; sync_always off by default). That per-write cost plus OS page-cache flushing narrows the spread between methods. Plain sync publish tops out around 32k msgs/s, and PublishAsync pulls that up to ~370k by not waiting for each ack.

At batch size 10, atomic batching runs slower than sync publish. You pay the batch setup cost without many messages’ worth of amortization. The crossover happens at batch size ≥ 100, where both atomic and fast-ingest surge past sync.

At batch size 1000, fast-ingest delivers ~360k msgs/s versus ~220k msgs/s for atomic, about +64 %. Larger fast-ingest batches don’t help further on file storage (2000, 5000, 10000 all hover in the ~340-365k msgs/s range); the disk saturates well before the protocol does, which puts fast-ingest roughly at parity with PublishAsync here.

One data point worth noting: atomic-batch throughput is identical between 2.12.7 and 2.14.0-RC.1 (~220k msgs/s at batch 1000 on both). Fast-ingest moves the needle in 2.14, not atomic-batch tuning.

Results: Memory storage

Throughput, memory storage

Memory storage is where fast-ingest pulls ahead. With no disk in the path, per-message overhead becomes the bottleneck, and the lighter-weight control protocol matters:

publish-sync: ~35k/s (dominated by RPC round trip, about the same as file storage).
publish-async: ~650k/s.
atomic-batch at batch 1000: ~544k/s on 2.14 (and ~511k on 2.12).
fast-ingest at batch 1000: ~922k/s.
fast-ingest at batch 2000 / 5000 / 10000: ~941k / 960k / 981k msgs/s.

That’s roughly +69 % over atomic at batch 1000 and +42 % over the fastest plain publish path. Fast-ingest keeps gaining modestly past batch 1000, whereas atomic can’t go there.

The numbers above use the FastPublisher defaults (Flow=100, MaxOutstandingAcks=2). The shipped defaults prioritize stable flow control over absolute peak throughput.

Flow-control tuning

The FastPublishFlowControl knobs visibly change throughput. I ran a small sweep at batch=1000, 256 B (5 iterations per combo, median reported). The shipped defaults prioritize stable flow control over absolute peak throughput:

Throughput vs. flow-control settings (batch=1000, 256B payload). Best config per storage type highlighted in darker purple.

Flow is an upper bound the client gives the server; the server picks the actual cadence. Looking at the 2.14.0-RC.1 implementation (fastBatchInit and checkFlowControl), a solo publisher starts at min(500, Flow) and targets min(500 / N, Flow) where N is the concurrent fast publishers on the stream, ramping by doubling or halving until it matches. When a second publisher joins, the server drops the initial rate to 1 so it can coordinate.

Two consequences for tuning:

Raising the client-side cap from 100 to 500 lets the server pick a much higher rate, and it does: +17 % on file, +32 % on memory.
Past Flow=500, a solo publisher sees almost no change. The server’s internal target is already capped at 500 (411k vs 416k on file, 1.20M vs 1.14M on memory is the same cadence with slightly different MaxOutstandingAcks).

Raising MaxOutstandingAcks from 2 to 5 helps when the publisher outpaces the ack loop, giving the client more room to keep adding while earlier batches settle.

Expect these numbers to change. The server source still carries TODO markers on both halves of the scheduler:

Initial flow: the rate the server picks when a batch starts. Today it’s min(500, Flow) for the first publisher on a stream, 1 for anyone joining after. (fastBatchInit, L174)
Dynamic flow: how the server adjusts that rate mid-batch. Today the target is simply 500 / N where N is the number of concurrent fast publishers on the stream, ramped via doubling/halving. The comment in the code says it should eventually weigh by average message size, RAFT in-flight pressure, and each publisher’s contribution. (checkFlowControl, L286-289)

Once the scheduler gets smarter, the Flow=500 ceiling for a solo publisher and the linear 500/N split are likely to move.

Async persist mode

ADR-50 notes that streams configured with persist_mode: async are compatible with fast-ingest. In that mode the server batches file writes asynchronously instead of flushing in place, so disk I/O leaves the publish hot path.

Stream requirement: persist_mode: async is mutually exclusive with allow_atomic — the server rejects stream creation with err_code 10052 if you set both. Atomic batching depends on synchronous staging, which is exactly what async flushing removes. So this mode is fast-ingest-only.

At 256 B on file storage, fast-ingest alone, default vs async persist mode (100k messages, median of 5):

Batch	`persist_mode: default`	`persist_mode: async`	Delta
100	~252k msgs/s	~588k msgs/s	+133 %
1000	~377k msgs/s	~1.03M msgs/s	+172 %
10000	~382k msgs/s	~1.09M msgs/s	+185 %

At batch=10000 the file-backed stream in async mode pushes ~1.09M msgs/s, slightly above the memory-storage peak we saw earlier (~981k msgs/s). Setting persist_mode: async effectively removes the “file storage is the bottleneck” narrative for fast-ingest.

The tradeoff sits in the durability model: writes that haven’t been flushed yet can be lost on an unclean shutdown. That’s fine for the workloads fast-ingest targets (telemetry, logs, sensor data with tolerable gaps) and already aligned with fast-ingest’s best-effort semantics, but it’s not a free lunch for durability-sensitive streams.

Headline comparison

Headline comparison, file storage

Pulling the best-case numbers for each method on file storage:

Method	Server	Best throughput
`publish-sync`	2.14.0-RC.1	~32k msgs/s
`publish-async`	2.14.0-RC.1	~370k msgs/s
`atomic-batch` (b=1000)	2.14.0-RC.1	~220k msgs/s
`fast-ingest` (b=1000)	2.14.0-RC.1	~360k msgs/s
`fast-ingest` (b=2000)	2.14.0-RC.1	~364k msgs/s

The takeaway: on realistic file-backed streams, fast-ingest matches PublishAsync in raw speed while giving you explicit batching semantics, flow control, and the ability to commit on a final message. On memory-backed streams, it leaves everything else behind.

PublishAsync vs Fast-Ingest: Why Not Just Use Async?

Looking at the file numbers above, PublishAsync sits at ~370k msgs/s and fast-ingest at ~360k msgs/s, near parity. It’s a fair question: don’t wait for each ack, let messages fly. So why a new mode?

The difference is what the server sees. PublishAsync is still an independent per-message publish: each message gets its own async reply token under a single wildcard reply subscription, and the server emits a full PubAck per message. The client hides the round-trip behind a PubAckFuture and a pending-ack map (default MaxPending = 4000; the client stalls for up to stallWait when that fills up). Every per-message cost stays; only the wait is hidden.

There’s also a correctness catch: PublishAsync does not guarantee the message landed in the stream until you observe the ack. It returns a PubAckFuture and moves on. If you never read paf.Err() (or set an error callback) and only wait on PublishAsyncComplete(), you’ll see the “all outstanding publishes resolved” signal but can still miss individual failures. On a reconnect, nats.go resolves every in-flight future with nats.ErrDisconnected; that error still has to be surfaced via those futures or the configured error handler. The benchmark numbers in this post wait on PublishAsyncComplete(), which in the no-error path corresponds to all acks arriving, but the code does not inspect each PubAckFuture individually. In real code, “I called PublishAsync” still differs from “it’s persisted”.

Fast-ingest uses a different protocol. The client opens a batched session, streams messages under one batch id, and the server emits one BatchFlowAck every ackMessages messages (starts at min(500, Flow) for the first publisher on the stream and adjusts toward 500 / N where N is the concurrent fast publishers). That means:

Far fewer acks on the wire. One periodic flow-control ack per N messages, plus a final batch ack, instead of a PubAck per message.
Much lighter client bookkeeping. There’s no per-message PubAckFuture map to maintain; control information is carried in the batch reply subjects and periodic BatchFlowAck messages.
Explicit gap reporting. When the server detects a missing batch sequence it surfaces a BatchFlowGap, so the client sees dropped messages as a first-class event. With PublishAsync, individual failures still reach you, but only if you look at each PubAckFuture or the error callback.
One backpressure knob. MaxOutstandingAcks limits how far ahead the client can run relative to the latest BatchFlowAck; the server drives the actual cadence via BatchFlowAck messages, so there’s no 100k-entry PubAckFuture map to walk.
Coordinated flow control across clients. PublishAsync backpressure is per-client (you wait on acks), so multiple producers can’t see each other. Push too many PublishAsync clients at too high a rate and the server’s internal subscription queue fills past its threshold and gets cleared, dropping messages, much like publishing to a stream with core NATS. Fast-ingest is server-coordinated: when a second publisher joins, the server drops each publisher’s rate (initially to 1, then ramping toward 500/N) so combined throughput stays inside what the stream can absorb. If your machine sustains 1 GB/s with one producer, a second PublishAsync producer can push you past that limit and lose messages; with fast-ingest you can run many more concurrent producers, see each one throttled, and lose none while sustaining high combined throughput.
Guaranteed ordered publishing with gap: fail. Fast-ingest with the default gap: fail mode aborts the batch the moment the server detects a missing sequence, so what does land is contiguous and in publish order, up to the abort point. Atomic batches give you ordering through atomicity (all-or-nothing staging), which is strictly more expensive. PublishAsync can’t strictly guarantee this: TCP keeps messages ordered on the wire, but individual messages can still fail (queue overflow, reconnect) and any retry reorders them relative to later ones already on the wire. With fast-ingest, you get cheap ordering even at higher MaxOutstandingAcks values where throughput is highest.

On file storage, disk I/O buries those protocol savings, so the two look similar. On memory storage, where protocol overhead drives throughput, fast-ingest pulls ahead cleanly (~922k vs ~650k msgs/s at batch 1000) and scales further with batch size.

Use PublishAsync for independent messages with standard per-message acks. Reach for fast-ingest when you want a dedicated ingest path, explicit batches, and lightweight flow control.

Payload Scaling

The charts above fix the payload at 256 bytes. To see how much the payload size matters, I also ran the matrix at 64 B (think: sensor sample, tight structured log line) and 4 KiB (a decent-sized JSON event).

Payload scaling, file storage

On file storage, per-message throughput drops as the payload grows. Expected: writing more bytes per message increases both the syscall cost and the page-cache pressure. At batch size 1000 on v2.14.0-RC.1:

Payload	`PublishAsync`	Atomic batch	Fast-ingest
64 B	381k msgs/s	231k msgs/s	373k msgs/s
256 B	370k msgs/s	220k msgs/s	360k msgs/s
4 KiB	156k msgs/s	91k msgs/s	155k msgs/s

On file storage, fast-ingest runs neck-and-neck with PublishAsync, with slightly lower raw throughput at smaller payloads but still dramatically ahead of atomic-batch. At 4 KiB its bandwidth hits ~620 MB/s. The disk, not the protocol, is the bottleneck.

Payload scaling, memory storage

On memory storage the story shifts: the absolute numbers jump dramatically, and fast-ingest’s advantage over atomic-batch grows at larger payloads. At 4 KiB on memory, fast-ingest at batch=1000 hits ~464k msgs/s ≈ 1.81 GB/s of ingest, compared to ~256k msgs/s (~1.0 GB/s) for atomic-batch.

Two practical takeaways:

Small messages (64 B) on file storage: protocol overhead dominates, and fast-ingest extracts the most from the server for almost no extra cost over PublishAsync, while giving you batch commit semantics.
Large messages (4 KiB) on memory storage: fast-ingest wins for high-throughput ingest pipelines. At close to 2 GB/s on a single R1 stream, it pushes hardware, not client overhead.

Caveats

A few things to keep in mind before extrapolating:

This is a release candidate. Numbers may shift before 2.14 GA.
Single server, R1, loopback, one producer. Real clusters with replication will look different. Replication amplifies the benefit of larger batches, but also shifts where the bottleneck sits.
File storage here runs on an APFS-backed SSD laptop with NATS’s default sync settings (background fsync every 2 minutes). Changing sync_interval (including sync_always) or the underlying storage will move the absolute numbers.
I measured raw publisher throughput, not end-to-end latency or consumer delivery. Fast-ingest concerns getting messages into the stream; the consumer side is unchanged.

What Else Will Come With 2.14

Fast-ingest is one of many improvements in the RC. A pile of other JetStream additions deserve attention. Briefly, based on the release notes:

Repeating & cron-based message schedules: Nats-Schedule: @every 5m or crontab syntax on publish.
Scheduled subject sampling: the Nats-Schedule-Source header lets you resample the last message for a subject at a different rate.
Scheduled subject rollups: trigger a subject rollup on a schedule.
Consumer reset API: $JS.API.CONSUMER.RESET.stream.consumer rewinds a consumer without deleting it.
End-of-batch commit (eob) for atomic batches: the commit marker can now be a non-persisted sentinel instead of an extra payload.
Asynchronous stream state snapshots for replicated streams: lower tail latency on streams with many interior deletes.
Leafnode runtime reloads: add and remove remote leafnode configs without restarting.

Plenty to cover in future posts. Keep an eye on the NATS blog for the 2.14 GA announcement.

Resources

NATS 2.14.0-RC.1 release notes
ADR-50: Batch Publishing
orbit.go jetstreamext/v0.3.0 release (FastPublisher, from PR #32)
Our earlier post on atomic batching: NATS 2.12 Atomic Batch Publishing

If you prefer a GUI for managing the streams you’re pumping data into, check out Qaze.