NATS vs Kafka vs MQTT: Same Category, Very Different Jobs
All three are 'messaging systems.' None of them is interchangeable with the others. A practical breakdown of NATS, Kafka, and MQTT — by the actual design axes that determine which one breaks when you misuse it.
The number of times I’ve watched a team pick a message system based on “Company X uses it” is depressing. Right behind it: the team that picks the one they already know, regardless of whether it fits the workload. NATS, Kafka, and MQTT get lumped together because they all pass messages between processes. That’s like lumping trucks, sedans, and motorbikes together because they all have wheels.
They are three different tools for three different shapes of problem. Once you know the axes that matter, the decision is usually easy.
tl;dr — NATS is the low-latency nervous system for request/reply, fan-out, and loosely-coupled services. Kafka is a partitioned, replayable log optimized for ingest, ordered processing, and stream analytics. MQTT is a wire-efficient broadcast protocol for large fleets of intermittently-connected devices. The wrong one looks “slow” or “complicated” not because it is bad, but because it’s optimizing for something you don’t need.
The Axes That Actually Matter
Before comparing features, pick the axes that will make or break your system:
- Delivery guarantee: at-most-once, at-least-once, effectively-once (via dedup)
- Ordering: no ordering, partition-level ordering, global ordering
- Persistence / replay: ephemeral, durable with short retention, durable with long retention and replay
- Throughput pattern: many small messages vs few large messages; sustained high throughput vs bursty
- Client shape: services on fast reliable networks vs devices on flaky cellular links
- Operational complexity tolerance: can you run a ZooKeeper/KRaft quorum? or do you want a single binary with zero ops?
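The "effectively-once via dedup" item deserves a concrete shape: the broker delivers at-least-once, and the consumer makes redelivery harmless by tracking message IDs. A minimal sketch of the consumer side (the `handle` function and in-memory `processed_ids` set are illustrative, not any specific client API — production code would persist the IDs durably):

```python
# Effectively-once = at-least-once delivery + idempotent consumption.
# The broker may redeliver; the consumer deduplicates by message ID.

processed_ids: set[str] = set()
results: list[str] = []

def handle(msg_id: str, payload: str) -> bool:
    """Process a message at most once per ID. Returns True if work was done."""
    if msg_id in processed_ids:
        return False           # redelivery: acknowledge, do nothing
    processed_ids.add(msg_id)  # in production: a durable store, not a set
    results.append(payload)
    return True

# A redelivered message (same ID) is a no-op:
handle("m1", "charge $5")
handle("m1", "charge $5")  # duplicate delivery
assert results == ["charge $5"]
```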
Every tool makes a different bet on these. Let’s walk through.
NATS: the low-latency nervous system
NATS is a pub/sub bus with native request/reply, plus wildcards for subject hierarchies. Core NATS is fire-and-forget, at-most-once, no persistence. JetStream (built in since 2.2) adds durable streams, at-least-once delivery, and replay.
What NATS optimizes for:
- Sub-millisecond publish latency in most topologies. The design is ruthlessly minimal — TCP connection per client, topic routing, done.
- Request/reply as a first-class operation. `nc.Request(subject, data, timeout)` gives you RPC ergonomics on the message bus.
- Subject hierarchies with wildcards. `orders.*.created`, `orders.US.*`, `orders.>` — easy to model domains.
- Low operational overhead. One binary, written in Go, clustered with Raft, no external dependencies.
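The wildcard semantics are simple enough to sketch in a few lines — `*` matches exactly one dot-separated token, `>` matches one or more trailing tokens. A toy matcher showing the rules (the real routing lives in the server; this is only an illustration of the semantics):

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Match a NATS subject against a pattern with '*' (one token) and '>' (tail)."""
    p_toks = pattern.split(".")
    s_toks = subject.split(".")
    for i, p in enumerate(p_toks):
        if p == ">":                     # '>' matches one or more remaining tokens
            return i < len(s_toks)
        if i >= len(s_toks):
            return False
        if p != "*" and p != s_toks[i]:  # '*' matches exactly one token
            return False
    return len(p_toks) == len(s_toks)

assert subject_matches("orders.*.created", "orders.US.created")
assert not subject_matches("orders.*.created", "orders.US.cancelled")
assert subject_matches("orders.>", "orders.US.created")
assert not subject_matches("orders.>", "orders")  # '>' needs at least one token
```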
What NATS is not optimized for:
- Long retention. JetStream handles durable streams, but it’s not designed for months-long event logs the way Kafka is.
- Partitioned ordered processing at scale. You can do it with JetStream work queues, but the ergonomics and tooling are behind Kafka’s consumer groups.
- Stream processing frameworks. The Flink/ksqlDB/Spark ecosystem is Kafka’s home turf.
Pick NATS when the shape of your workload is “lots of services, mostly talking to each other in short exchanges, some broadcast, some work queues, and I want to stop running three different message systems.” It’s the default I reach for in modern backend stacks.
Kafka: the append-only log
Kafka is fundamentally a distributed commit log. Topics are partitioned. Each partition is an append-only ordered sequence of records. Consumers track their own offsets. Messages stick around for the configured retention (days, weeks, or forever).
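The mechanics in that paragraph fit in a toy model: an append-only list per partition, and consumers who read from whatever offset they choose. Replay is just rewinding the offset. (A deliberately simplified in-memory sketch — no replication, batching, or retention; names are illustrative.)

```python
class PartitionLog:
    """Toy model of one Kafka partition: append-only; consumers own their offsets."""

    def __init__(self) -> None:
        self.records: list[str] = []

    def append(self, record: str) -> int:
        self.records.append(record)
        return len(self.records) - 1           # offset of the new record

    def read(self, offset: int, max_records: int = 10) -> list[str]:
        return self.records[offset : offset + max_records]

log = PartitionLog()
for event in ["created", "paid", "shipped"]:
    log.append(event)

# Two consumers at different offsets see different slices of the same log:
assert log.read(0) == ["created", "paid", "shipped"]  # replay from the start
assert log.read(2) == ["shipped"]                     # caught-up consumer
```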
What Kafka optimizes for:
- Sustained high ingest. The append-only log plus zero-copy send makes Kafka handle hundreds of MB/sec per broker without breathing hard.
- Partition-level ordering. Within a partition, order is guaranteed. This is how you get “all events for user X are processed in sequence” — just key by user ID.
- Replay and reprocessing. Offset management means you can rewind a consumer to last Tuesday and replay everything. Critical for analytics, for rebuilding downstream state after a bug, for change-data-capture.
- Stream processing integration. Flink, ksqlDB, Spark Streaming, Kafka Streams — the ecosystem assumes Kafka semantics.
- Large event histories. Tiered storage (pushing older segments to S3) makes long retention cheap.
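"Just key by user ID" works because the producer maps a key deterministically to a partition, so every event for that key lands on the same ordered log. A sketch of the idea — Kafka's default partitioner uses murmur2, but any stable hash demonstrates the property:

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Deterministic key -> partition mapping. (Kafka's default partitioner uses
    murmur2; a stable stand-in hash shows the same property: same key, same
    partition, hence per-key ordering.)"""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All events for one user land on one partition, so they stay in sequence:
p = partition_for("user-42", 12)
assert all(partition_for("user-42", 12) == p for _ in range(100))

# Different keys spread across partitions (spread, not guaranteed distinct):
spread = {partition_for(f"user-{i}", 12) for i in range(1000)}
assert len(spread) > 1
```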
What Kafka is not optimized for:
- Request/reply. The log model actively fights against it. You can hack it with correlation IDs and reply topics, but you’ll fight the framework.
- Low operational overhead. ZooKeeper was always a pain; KRaft helps but running Kafka in production is still real work.
- Low-latency small messages. A single publish round-trip is typically 5-10ms even on a hot path. That’s fine for most workloads but doesn’t compete with NATS on tight RPC loops.
- Large fan-out to thin clients. Every consumer is assumed to be a persistent process tracking offsets. Not suitable for IoT devices that connect intermittently.
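The correlation-ID hack mentioned above looks like this in miniature: every request carries an ID, and the caller watches a reply topic until its ID shows up. (In-memory dicts stand in for the topics; a real version needs a consumer loop, timeouts, a reply topic per service, and partition-aware routing — which is exactly the friction.)

```python
import uuid

reply_topic: dict[str, str] = {}  # correlation_id -> reply payload (stand-in)

def handle_request(corr_id: str, request: str) -> None:
    """The 'service': consumes a request, produces a reply keyed by corr_id."""
    reply_topic[corr_id] = request.upper()

def rpc_over_log(request: str) -> str:
    """RPC emulated over a log: tag the request, then poll for the reply."""
    corr_id = str(uuid.uuid4())
    handle_request(corr_id, request)    # stand-in for produce + remote consume
    while corr_id not in reply_topic:   # real code: a consume loop with a timeout
        pass
    return reply_topic.pop(corr_id)

assert rpc_over_log("ping") == "PING"
```

Compare this with `nc.Request(...)` doing the same thing in one call, and the "you'll fight the framework" point is concrete.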
Pick Kafka when you have event histories that matter, ordered per-key processing at scale, stream-processing pipelines downstream, or CDC integration with your databases. Also when you already have it and a new workload can reasonably ride on the existing platform.
Don’t pick Kafka because “it’s what big companies use.” Big companies have Kafka teams. You probably don’t.
MQTT: the device protocol
MQTT is a lightweight pub/sub protocol designed in the late 1990s for SCADA over satellite links — constrained bandwidth, intermittent connectivity, thousands of devices per broker. It’s a wire protocol first, infrastructure second. Popular brokers include EMQX, HiveMQ, Mosquitto, VerneMQ.
What MQTT optimizes for:
- Tiny wire overhead. A PUBLISH packet header can be as small as 2 bytes. Critical for cellular-cost-sensitive deployments.
- Intermittent connections. Persistent sessions, QoS levels 0/1/2, last-will-and-testament. Designed to survive a device being offline for hours.
- Massive broadcast fan-out. One publish to a subject with 100,000 subscribers is feasible on a modern broker.
- Constrained clients. Low CPU, low memory, simple state machine — fits on a microcontroller.
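The 2-byte figure comes from MQTT's fixed header: one control byte plus a variable-length "Remaining Length" field that costs one byte for payloads under 128 bytes. The encoding is specified as 7 bits per byte with a continuation bit, and is small enough to show whole:

```python
def encode_remaining_length(n: int) -> bytes:
    """MQTT 'Remaining Length' varint: 7 bits per byte, high bit = continuation."""
    assert 0 <= n <= 268_435_455          # spec maximum (4 encoded bytes)
    out = bytearray()
    while True:
        byte, n = n & 0x7F, n >> 7
        if n:
            out.append(byte | 0x80)       # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

# Fixed header = 1 control byte + this varint, so a tiny publish costs 2 bytes:
assert encode_remaining_length(0) == b"\x00"
assert encode_remaining_length(127) == b"\x7f"
assert encode_remaining_length(128) == b"\x80\x01"   # second byte kicks in at 128
```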
What MQTT is not optimized for:
- Inter-service messaging on reliable networks. You’re paying for reliability features (QoS 2, retained messages, sessions) that you don’t need between two services in the same VPC.
- Long-term persistence and replay. The protocol has retained messages but nothing like Kafka’s log model.
- Complex routing. Topic wildcards work (`+` single-level, `#` multi-level) but the matching semantics are simpler than NATS subjects.
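For comparison with the NATS rules, MQTT's topic filter matching also fits in a few lines — `+` matches one slash-separated level, `#` must be last and matches the remaining levels (including the parent itself). A simplified illustration of the spec's semantics (it ignores the special rules for `$`-prefixed topics):

```python
def topic_matches(filter_: str, topic: str) -> bool:
    """MQTT topic filter match: '+' = one level, '#' = all remaining levels."""
    f_levels = filter_.split("/")
    t_levels = topic.split("/")
    for i, f in enumerate(f_levels):
        if f == "#":                      # must be the final level of the filter
            return True                   # matches the parent and everything below
        if i >= len(t_levels):
            return False
        if f != "+" and f != t_levels[i]:
            return False
    return len(f_levels) == len(t_levels)

assert topic_matches("sensors/+/temp", "sensors/kitchen/temp")
assert not topic_matches("sensors/+/temp", "sensors/kitchen/humidity")
assert topic_matches("sensors/#", "sensors/kitchen/temp")
```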
Pick MQTT when you have actual devices on the other end — sensors, meters, vehicles, consumer hardware. For anything server-to-server on a reliable network, MQTT is over-engineered on one axis (device resilience) and under-engineered on another (rich routing / replay).
A Decision Flow
When a team asks me which to use, the path I walk them through is usually some version of this:
```mermaid
flowchart TD
    Start([New messaging need]) --> Q1{Are your clients<br/>actual devices?}
    Q1 -->|Yes · IoT, sensors, cellular| MQTT["MQTT<br/>device pub/sub<br/>tiny wire overhead"]
    Q1 -->|No · services on reliable networks| Q2{Do you need<br/>replay of past events?}
    Q2 -->|Yes · long retention, analytics, CDC| Q3{Partitioned ordering<br/>required?}
    Q2 -->|No · pub/sub or request/reply| NATS["NATS<br/>low-latency service bus<br/>JetStream if durable"]
    Q3 -->|Yes · per-key ordering at high volume| Kafka["Kafka<br/>partitioned commit log<br/>days to months retention"]
    Q3 -->|No, but I want durable streams| NATSJet["NATS JetStream<br/>simpler ops<br/>shorter retention than Kafka"]
    classDef mqtt fill:#fef5e7,stroke:#b7791f
    classDef nats fill:#e8f4f8,stroke:#2c5282
    classDef kafka fill:#f0fff4,stroke:#2f855a
    class MQTT mqtt
    class NATS,NATSJet nats
    class Kafka kafka
```
The Matrix
If you want the one-page decision:
| | NATS | Kafka | MQTT |
|---|---|---|---|
| Primary use | Service-to-service bus | Event log, stream processing | Device pub/sub |
| Delivery default | At-most-once (JetStream: at-least-once) | At-least-once | Configurable (QoS 0/1/2) |
| Ordering | Not guaranteed (JetStream: per stream) | Per partition | Per topic per client |
| Persistence | None in core; durable with JetStream | Built-in; long retention | Retained messages only |
| Replay | JetStream only, with some friction | First-class | No |
| Latency | Sub-ms | 5-10ms | Device-bound |
| Throughput per node | Millions of msg/s | 100s of MB/s | Highly variable |
| Ops complexity | Low | High | Medium |
| Request/reply | First-class | Awkward | Not really |
| Client assumption | Reliable services | Reliable consumers | Intermittent devices |
| Good default for | Microservice mesh | Event sourcing, analytics | IoT fleets |
The Real-World Patterns
A few shapes I’ve seen work well, and the corresponding mismatch patterns that caused pain.
Works: Internal service mesh on NATS, CDC on Kafka, IoT on MQTT
A reasonable large-company pattern is all three, each doing what it’s good at:
- NATS (or NATS JetStream) for inter-service request/reply, pub/sub, work queues.
- Kafka for the event log: database CDC, audit events, analytics pipelines, anything that feeds Flink or the data warehouse.
- MQTT for actual devices in the field.
Bridging happens at defined boundaries: an MQTT-to-Kafka connector for device telemetry you want replayable. A NATS-to-Kafka shipper for events that need long retention. Services don’t cross the boundaries directly; platform infra does.
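The core of such a bridge is small: consume from one side, produce to the other, key sensibly. A toy sketch with in-memory stand-ins (`device_updates` for the MQTT side, `telemetry_log` for the Kafka topic — both names hypothetical; a real connector adds batching, retries, and ack/offset handling):

```python
import queue

# In-memory stand-ins for the two systems at the boundary (illustrative only):
device_updates: "queue.Queue[tuple[str, bytes]]" = queue.Queue()  # MQTT side
telemetry_log: list[tuple[str, bytes]] = []                       # Kafka topic

def bridge_once() -> bool:
    """Move one device message onto the replayable log, keyed by topic."""
    try:
        topic, payload = device_updates.get_nowait()
    except queue.Empty:
        return False
    telemetry_log.append((topic, payload))  # real shipper: produce with key=topic
    return True

device_updates.put(("sensors/kitchen/temp", b"21.5"))
while bridge_once():
    pass
assert telemetry_log == [("sensors/kitchen/temp", b"21.5")]
```

The point of the pattern is that only this component knows both systems; the services on either side see one protocol each.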
Mismatch: “let’s replace our RPC with Kafka”
I’ve seen this at least four times. Someone reads an event-driven-architecture book, decides RPC is old-fashioned, publishes every inter-service call through Kafka topics. What happens:
- Latency goes from ~5ms to ~30-50ms round-trip because Kafka’s commit-log design isn’t tuned for low-latency reply.
- Debugging gets painful — a request that used to be one span in Jaeger is now half a dozen topics and offsets.
- Backpressure disappears — consumers can fall arbitrarily behind, and the publisher has no idea.
Every time, the fix was “put RPC back in front for the actual synchronous call paths, keep Kafka for the async event flow.” The event log is a great thing to have. It is not a substitute for RPC.
Mismatch: “let’s standardize on MQTT for everything”
Organizations with a strong IoT background sometimes try this. MQTT is what they know. So they run inter-service communication on it too.
Problems:
- Subject matching is less expressive than NATS’s hierarchical patterns — complex routing becomes awkward.
- No persistence/replay means any design requiring “rebuild downstream state” is blocked.
- Broker clusters are tuned for device fan-out, not low-latency service-to-service, so tail latencies are higher than they need to be.
The advice I give: if you’re not sending to devices, don’t use a device protocol.
Mismatch: “we picked NATS, now we need replay”
Teams that picked NATS core for its simplicity sometimes discover six months in that they need event replay — maybe for a bug-induced reprocessing, maybe for a new downstream that needs historical data. Two fixes:
- Migrate to JetStream. Usually the right answer — it’s the same product with durable streams. The upgrade is mostly configuration.
- Add Kafka alongside for the replay use case. More operational overhead, but gives you the full Kafka tooling ecosystem.
Neither is terrible. The real lesson is to check the replay question at the design-review stage.
How to Make the Decision
“Which message system should I use” is not a tech question; it’s a workload-fit question. Answer these first:
- Who talks to whom, and how long do those conversations last?
- What’s the delivery semantics I actually need — at-most-once, at-least-once, effectively-once?
- Do I need to replay history to rebuild downstream state?
- What’s the shape of my clients — reliable services, flaky devices, or mixed?
- What’s my operational appetite — do I want one binary, or can I run a real platform team?
The three tools map onto the answers cleanly. The trouble starts when you skip the questions.
Related
- RPC vs NATS: It’s Not About Sync vs Async — It’s About Who Owns Completion — the prior question: do you even want messaging, or RPC?
- Why Your “Fail-Fast” Strategy is Killing Your Distributed System — what happens when any of these fails.