Intro

KIP-1150 was just announced in Apache Kafka. It has only been two days and we are already seeing the effects of the announcement: AutoMQ, a Kafka fork with a rich feature set, changed its license to Apache and announced its intention to donate more of its proprietary features to the community.

I am super excited about the KIP. I have long been talking about the direct-to-S3, stateless Kafka systems we have seen an explosion of since 2023.

Why KIP-1150?

KIP-1150 finally promises to bring this innovative design into Kafka. I say innovative, but it’s frankly just intelligently circumventing the arbitrary prices the cloud providers set. Those prices happen to disproportionately punish software like Kafka that uses a lot of inter-zone data transfer - so KIP-1150 proposes pushing those costs down to S3 (where you don’t pay through the nose for them).

Because of that, the KIP comes with massive cost-savings potential for any open source Kafka deployment on AWS (where inter-zone traffic is $0.02/GiB) or GCP (where it is $0.01/GiB). On Azure, inter-zone traffic is free, but with KIP-1150 you still wouldn’t have to pay for expensive disks. Disks can be a major cost in a Kafka deployment, although with KIP-405 now generally available, it’s fair to assume most of the historical data already lives in object storage.

On that note, it’s worth emphasizing — KIP-405 != KIP-1150.

  • KIP-405 is the asynchronous tiering of older topic data to S3 while new data is synchronously replicated and persisted on disk.
  • KIP-1150 is the direct writing of topic data to S3 without any replication between Kafka brokers.

Besides the cost savings this architecture offers, KIP-1150 should also come with massive operational simplifications. Among the biggest benefits are:

  • not having to manage free disk space buffers and the risk of running out of disk (and bricking your cluster)
  • preventing hot spots by being able to instantly load-balance diskless traffic around brokers (since any broker can be a leader, the topic traffic can be moved instantly now)
  • load-balance clients - for diskless traffic, you now have complete control over which broker a client connects to. Previously, a client had to connect to the specific partition leader. This also lets you reduce the overall connection count, because a client can now do all of its work connected to just one broker (as opposed to many)
  • zero-ops auto-scaling - add or remove brokers serving diskless traffic instantly. Scaling the cluster up should be as easy as scaling a fleet of stateless nginx servers.

Latency

This all comes with one trade-off - latency. Three fundamental knobs control it:

  1. the batch size/interval - how long does a broker buffer produce requests until it starts to commit them to S3
  2. the S3 PUT latency - how long does S3 take to save the file?
  3. the Batch Coordinator latency - how long does the coordinator take to commit the record batch coordinates?

1) is the easiest to tune. If your broker ships a file every 200ms, you’d add at most 200ms to your usual Produce request processing time. If you then want to cut 180ms off your Produce request time, you’d lower the batch interval down to 20ms. Granted, that would result in 10x more S3 PUT calls, more work on the background object-merging side and generally more load on the batch coordinator.

Similarly, if you don’t really care about latency - nothing stops you from cranking the batch interval up to 1 second and paying 5x less for S3 PUTs, as well as generally having less load on the cluster. This is the magic of KIP-1150.
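To make the tradeoff concrete, here is a tiny napkin-math sketch (plain Python, my own illustrative numbers - not anything prescribed by the KIP) of how the batch interval trades added produce latency against S3 PUT volume and cost:

```python
# Napkin math: how the diskless batch interval trades latency for S3 PUT cost.
# All constants are illustrative assumptions, not values from the KIP.
PER_BROKER_MIB_S = 102.4        # compressed produce traffic per broker
BROKERS = 10
S3_PUT_PRICE = 0.005 / 1000     # USD per PUT (S3 Standard)
SECONDS_PER_DAY = 86_400

for interval_ms in (20, 200, 1000):
    puts_per_day = BROKERS * (1000 / interval_ms) * SECONDS_PER_DAY  # one PUT per interval
    object_size_mib = PER_BROKER_MIB_S * interval_ms / 1000
    print(f"{interval_ms:>5} ms interval -> ~{object_size_mib:6.1f} MiB objects, "
          f"{puts_per_day:,.0f} PUTs/day (~${puts_per_day * S3_PUT_PRICE:,.2f}/day), "
          f"up to {interval_ms} ms of extra produce latency")
```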

Cost Calculation

It’s easy to paint a narrative with words, so let’s be real and run the numbers.

Cluster Setup

I’m gonna pick an arbitrary cluster size and workload. Once the proposal gets more traction/acceptance, I will merge this into AKalculator so you can calculate it too. For now, treat this as napkin math - a lot of things in the proposal aren’t clear yet, and are subject to change anyway - but the major cost drivers should be stable.

  • assume we have a 10-broker Kafka cluster
  • assume we have 1 GiB/s of compressed producer traffic. Each broker therefore receives 102.4 MiB/s.
    • I emphasize compressed because many vendors advertise uncompressed (big gotcha).
  • assume our KIP-1150 batch size is configured at 16 MiB and our batch interval at 250ms. Because each broker issues a PUT every 16 MiB, its effective batch interval is 156.25ms (at 102.4 MiB/s, the size threshold is hit well before the 250ms timer).
  • assume 2 GiB/s of consumer traffic across 10 brokers (2x fanout), all reading from the tail of the log.
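The rest of the post reuses these assumptions, so here they are as a small snippet (my own constants, nothing from the KIP) that you can tweak and re-run:

```python
# Workload assumptions for the napkin math below (mine, not the KIP's).
BROKERS = 10
PRODUCE_MIB_S = 1024              # 1 GiB/s compressed produce traffic, cluster-wide
CONSUME_MIB_S = 2048              # 2 GiB/s consumer traffic (2x fanout), tail reads
BATCH_SIZE_MIB = 16               # diskless object size per PUT
BATCH_INTERVAL_MS = 250           # upper bound; the size threshold is hit first

PER_BROKER_MIB_S = PRODUCE_MIB_S / BROKERS                        # 102.4 MiB/s
EFFECTIVE_INTERVAL_MS = BATCH_SIZE_MIB / PER_BROKER_MIB_S * 1000  # 156.25 ms
print(PER_BROKER_MIB_S, EFFECTIVE_INTERVAL_MS)
```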

KIP-1150 Costs

Request Counts

  • S3 write: each broker does 6.4 PUTs a second, the whole cluster does 64 PUT/s
    • 64 PUT/s
  • coordinator write cost: each PUT results in one call to the BatchCoordinator, so the whole cluster issues 64 Batch Coordinator requests a second. The coordinator is implemented with a classic Kafka topic, so our only real cost is the storage + replication. The current proposal has around 118 bytes of metadata per partition; assuming each broker’s 102.4 MiB/s workload spans around 50 partitions, that’s ~5.9 KiB of metadata (call it 6 KiB) per commit call. That’s a measly throughput of 377.6 KiB/s for the coordination topic.
    • 380 KiB/s BatchCoordinator produce
  • merging: while it’s still unspecified how and when, the many small objects created by the initial PUTs get merged into larger ones. We can assume this requires, at a minimum, reading the data once (one S3 GET per file), writing it out as larger multi-GiB objects (e.g. with one multi-part PUT per 25 MiB) and updating the BatchCoordinator (let’s handwave this to an additional 100 KiB/s of traffic). How big the new merged files are doesn’t matter for the cost calculation, because we will always upload them via multiple smaller multi-part PUT requests.
    • 64 GET/s
    • 100 KiB/s BatchCoordinator produce
    • 41 multi-part PUT/s
  • reads: caching plays a very important role here for e2e latency. We should assume that any write going into a broker is cached within that availability zone. Because 1/3rd of producer traffic is always in the same zone, we can roughly assume that 1/3rd of the read traffic is always cached. At 2048 MiB/s of reads, roughly 1365 MiB/s of read traffic will have to be fetched from S3. It would be more efficient if the reads could be served from the larger merged files, but let’s be conservative and assume we read from the small files. This results in roughly 85.3 GETs/s for the cluster (1365 MiB/s ÷ 16 MiB file size). In practice it’s probably a bit less, due to caching and the fact that two requests may need portions of the same already-fetched 16 MiB file.
    • 85.3 GET/s
  • coordinator reads: we are venturing further out into speculation territory - it’s really unclear to me how many reads brokers would need from the BatchCoordinator’s backing store. Let’s stay conservative and assume it is 1.5x the number of S3 GETs - so ~130 reads/s, which results in ~767 KiB/s of topic data reads.
    • 770 KiB/s BatchCoordinator reads
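The per-second rates above are just a few divisions; here they are as a self-contained sketch (the ~5.9 KiB of metadata per commit and the 1.5x coordinator-read factor are the same guesses as in the list):

```python
# Per-second request rates for the diskless cluster (constants re-declared so
# the snippet stands alone; same napkin-math assumptions as above).
BROKERS, PER_BROKER_MIB_S, BATCH_SIZE_MIB = 10, 102.4, 16
CONSUME_MIB_S = 2048
META_KIB_PER_COMMIT = 5.9            # ~118 bytes x ~50 partitions per commit (guess)

put_s = BROKERS * PER_BROKER_MIB_S / BATCH_SIZE_MIB        # 64 PUT/s
coord_write_kib_s = put_s * META_KIB_PER_COMMIT            # ~378 KiB/s
uncached_read_mib_s = CONSUME_MIB_S * 2 / 3                # ~1365 MiB/s misses the cache
read_get_s = uncached_read_mib_s / BATCH_SIZE_MIB          # ~85.3 GET/s
merge_get_s = put_s                                        # re-read every small object once
merge_put_s = BROKERS * PER_BROKER_MIB_S / 25              # ~41 multi-part PUT/s
coord_read_kib_s = read_get_s * 1.5 * META_KIB_PER_COMMIT  # ~755-770 KiB/s depending on rounding

print(put_s, round(coord_write_kib_s), round(read_get_s, 1),
      merge_get_s, round(merge_put_s), round(coord_read_kib_s))
```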

Daily Values

There are 86400 seconds in a day.

  • data: the cluster receives 86400 GiB of fresh data every day
  • S3 writes: 5,529,600 PUTs a day
  • S3 reads: 7,371,000 GETs a day
  • coordinator writes: 31.31 GiB Batch Coordinator produce a day
  • coordinator reads: 63.4 GiB Batch Coordinator consume a day
  • merging:
    • 5,529,600 GETs to read all the non-merged objects
    • 3,542,400 multi-part PUTs
    • 8.24 GiB Batch Coordinator produce

Cloud Prices

Our costs are:

  • S3 GET: $0.0004 per 1000 requests
  • S3 PUT: $0.005 per 1000 requests
  • S3 DELETEs: free
  • BatchCoordinator replication: $0.02/GiB for cross-zone traffic
  • BatchCoordinator storage: $0.08/GiB-month for EBS gp3 SSDs (note we have to pay for the 3x replicated data, as well as for the provisioned free space)
  • BatchCoordinator Produce: each GiB results in ~2.66 GiB in cross-zone traffic (2x for replication, 0.66x for the produce traffic as 2/3rds crosses zone) and 6 GiB of storage (as it’s durably replicated 3x and we keep 50% free space)
  • BatchCoordinator Consume: each GiB results in ~0.66 GiB in cross-zone traffic (0.66x for the consume traffic as 2/3rds crosses zone)
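The two BatchCoordinator multipliers are worth spelling out. Here is a sketch of how one day’s worth of coordinator topic traffic turns into billable cross-zone transfer and storage, treating a day’s data as the steady-state stored amount the way this post does:

```python
# How BatchCoordinator topic traffic turns into billable cost
# (RF=3, 2/3 of client traffic crosses zones, 50% free disk headroom).
CROSS_ZONE = 0.02        # $/GiB
EBS_GP3 = 0.08           # $/GiB-month

def coordinator_produce_cost(gib_per_day: float) -> float:
    cross_zone_gib = gib_per_day * (2 + 2 / 3)   # 2x replication + 2/3 of the produce leg
    stored_gib = gib_per_day * 3 * 2             # 3 replicas, 50% free space -> 6x
    return cross_zone_gib * CROSS_ZONE + stored_gib * EBS_GP3 / 30

def coordinator_consume_cost(gib_per_day: float) -> float:
    return gib_per_day * (2 / 3) * CROSS_ZONE    # only the consume leg crosses zones

print(round(coordinator_produce_cost(31.31), 2))  # ~$2.2/day
print(round(coordinator_consume_cost(63.4), 2))   # ~$0.85/day
```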

Daily Cost Calculation

A KIP-1150 topic receiving 1 GiB/s of traffic and serving 2 GiB/s of reads should roughly cost $54 a day.

  • S3 writes: $27.64/day (5,529,600 PUTs a day @ $0.005/1000)
  • S3 reads: $2.94/day (7,371,000 GETs a day @ $0.0004/1000)
  • coordinator writes: $2.12/day
    • cross-zone networking: $1.62/day (81.4 GiB cross-zone traffic @ $0.02/GiB)
    • storage: $0.5/day (187.86 GiB of storage @ $0.08/GiB-month)
  • coordinator reads: $0.84/day (41.8 GiB cross-zone traffic @ $0.02/GiB)
  • merging: $20.47/day
    • GETs: $2.2/day (5,529,600 GETs @ $0.0004/1000)
    • PUTs: $17.7/day (3,542,400 PUTs @ $0.005/1000)
    • Batch Coordinator Produce: $0.57/day
      • $0.44/day (22 GiB cross-zone traffic @ $0.02/GiB)
      • $0.13/day (50 GiB storage @ $0.08/GiB-month)
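Putting it all together, a sketch that reproduces the ~$54/day figure from the prices and daily counts above:

```python
# Daily cost roll-up for the diskless (S3 Standard) setup, using the counts above.
PUT, GET = 0.005 / 1000, 0.0004 / 1000    # $/request
XZONE, EBS = 0.02, 0.08                   # $/GiB cross-zone, $/GiB-month gp3

s3_writes   = 5_529_600 * PUT                                           # ~$27.6
s3_reads    = 7_371_000 * GET                                           # ~$2.9
coord_write = 31.31 * (2 + 2 / 3) * XZONE + 31.31 * 6 * EBS / 30        # ~$2.2
coord_read  = 63.4 * (2 / 3) * XZONE                                    # ~$0.8
merging     = (5_529_600 * GET + 3_542_400 * PUT
               + 8.24 * (2 + 2 / 3) * XZONE + 8.24 * 6 * EBS / 30)      # ~$20.5

print(round(s3_writes + s3_reads + coord_write + coord_read + merging, 2))  # ~$54
```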

Traditional Kafka Costs

This is easy to compute - we have:

  • 1 GiB/s of produce traffic going into the cluster. On average, when your cluster is spread across 3 AZs, 2/3rds of your produce traffic will cross availability zones. This results in ~682.7 MiB/s of cross-zone traffic
  • 2 GiB/s of replication traffic. With a replication factor of 3, the leader egresses to two followers (which live in other zones).
  • 2 GiB/s of consumer traffic. On average, when your cluster is spread across 3 AZs, 2/3rds of your consumer traffic will cross availability zones. This results in ~1365.3 MiB/s of cross-zone traffic. This can be eliminated via KIP-392 (fetch from follower).

Without KIP-392, the total is 4,096 MiB/s of cross-zone traffic. That’s 345,600 GiB a day @ $0.02/GiB = $6,912/day

With KIP-392, it’s ~2,730.7 MiB/s. That’s 230,400 GiB a day @ $0.02/GiB = $4,608/day
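The same arithmetic in code form (the 2/3-crosses-zones fraction is the usual assumption for clients spread evenly across 3 AZs):

```python
# Traditional Kafka cross-zone traffic for the same workload (3 AZs, RF=3).
XZONE_PRICE = 0.02          # $/GiB
SECONDS_PER_DAY = 86_400

produce_xz     = 1024 * 2 / 3   # 2/3 of the 1 GiB/s produce traffic crosses zones
replication_xz = 1024 * 2       # each leader ships to 2 followers in other zones
consume_xz     = 2048 * 2 / 3   # eliminated entirely by KIP-392 fetch-from-follower

for label, mib_s in (("without KIP-392", produce_xz + replication_xz + consume_xz),
                     ("with KIP-392", produce_xz + replication_xz)):
    gib_day = mib_s * SECONDS_PER_DAY / 1024
    print(f"{label}: {gib_day:,.0f} GiB/day -> ${gib_day * XZONE_PRICE:,.0f}/day")
# without KIP-392: 345,600 GiB/day -> $6,912/day
# with KIP-392:    230,400 GiB/day -> $4,608/day
```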

Kafka cross-zone data transfer costs are 85x to 128x more than KIP-1150.

Traditional Disks

We would need some disks too. While some calculators pessimistically assume you’d run 16 TB worth of expensive SSDs per broker in the cloud, we will be realistic and assume you only run modestly-sized SSDs because you’re using KIP-405 (Tiered Storage) to host the majority of your data cost-efficiently in S3.

Let’s simply assume you store the last 12 hours’ worth of data on disk (the rest is tiered to S3) and want to keep 50% free space as a buffer. 1024 MiB/s after replication is 3072 MiB/s for the cluster, which is 126.56 TiB after 12 hours; with the 50% free-space buffer, that’s 253.125 TiB of provisioned disk.

Our cluster is suddenly bottlenecked on storage! If we tried to run the same 10-broker setup, we’d need ~25.3 TiB per broker. Most EBS volumes don’t go beyond 16 TiB, and even if they did, it is extremely impractical to manage that amount of state in a single broker. We would realistically scale the cluster up to 40 brokers to arrive at a much more manageable ~6.3 TiB per broker. At that size, every broker would be writing 76.8 MiB/s to disk (not much in IOPS terms), which lets us opt for either st1 HDDs ($0.045/GiB-month) or gp3 SSDs ($0.08/GiB-month), depending on our desired performance profile. The end disk cost would be $11,664 to $20,736 per month. Per day, that’s ~$383 to ~$681.
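The disk sizing, in the same napkin-math style (12 hours kept locally, 50% free-space headroom, st1 vs gp3 prices as above):

```python
# Local disk sizing for traditional Kafka with KIP-405 keeping ~12h on disk.
REPLICATED_MIB_S = 1024 * 3           # 1 GiB/s of produce traffic x RF=3
HOT_SET_HOURS, FREE_SPACE = 12, 0.5   # hours kept locally, fraction kept free
DAYS_PER_MONTH = 365 / 12
PRICES = {"st1": 0.045, "gp3": 0.08}  # $/GiB-month

hot_gib = REPLICATED_MIB_S * HOT_SET_HOURS * 3600 / 1024   # 129,600 GiB (~126.6 TiB)
provisioned_gib = hot_gib / (1 - FREE_SPACE)               # 259,200 GiB (~253 TiB)
print(f"{provisioned_gib / 40 / 1024:.1f} TiB per broker at 40 brokers")
for name, price in PRICES.items():
    monthly = provisioned_gib * price   # st1: $11,664/month, gp3: $20,736/month
    print(f"{name}: ${monthly:,.0f}/month (~${monthly / DAYS_PER_MONTH:,.0f}/day)")
```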

Just the disks alone cost 7x to 12.6x the cost of KIP-1150!

There is also a minor cost to tier the data to S3. Assuming a 25 MiB multi-part PUT, we get to the same ~41 PUT/s that the KIP-1150 merging job issues. This is an extra ~$17.7 per day for traditional Kafka.

The careful eye may notice I am not counting the S3 storage in the KIP-1150 calculation, but that’s because the same copy exists in the traditional replication example. With KIP-405 (Tiered Storage), Kafka always stores the last N {days,bytes} worth of data in the remote store (S3) and the last M {days,bytes} worth of data on the local disks. M is stored in both the local and remote store, hence there is an extra storage fee for the copy that the traditional model requires.

If we counted without KIP-405, it would get way worse for Traditional Kafka.

Cost Comparison

Let’s take the absolute best possible case for Kafka - KIP-405, KIP-392 and the cheaper st1 HDDs are used.

The end result is still astounding:

| System | Per Day | Per Month (30 days) | Per Year (365 days) |
| --- | --- | --- | --- |
| Traditional Kafka | $5,008 | $150,240 | $1,827,920 |
| KIP-1150 | $54 | $1,620 | $19,710 |

We are talking about 1.8 million dollars saved per year. That can buy you an engineering team of 5-20 people depending on seniority and location!

This is why I’m so excited about KIP-1150. It’s a relatively small change that pulls a giant lever. Aiven say that KIP-1150 needs to modify just 1.7% of the Kafka codebase, and by the looks of it, it will add a completely separate code path, so it should not pose a risk to traditional Kafka workloads.

How can it be so cheap?

  • Clouds run 30x+ margins (at least) on the underlying infra (80x+ on network). At such astronomical markups, it only takes an arbitrary decision and the stroke of a pen to decide which architecture is cheaper for those of us building on top of the cloud.
  • S3 is freaking cheap (for whatever business reasons AWS decided to make it so).

If I were on a marketing team at a B2B company, I’d say KIP-1150 carefully tuned its cutting-edge engine according to cloud unit economics. But the truth is simpler - S3 is freaking cheap!

Nothing Comes for Free - Latency Tradeoff

Latency is the tradeoff here. As mentioned earlier, the latency profile consists of three steps, which we can now reason about:

  • write latency: p50: ~180ms; p99: ~370ms

    • the batch size/interval: p50: ~78ms, p99: 160ms; (our max is 156.25ms)
    • the Batch Coordinator latency: p50: ~20ms, p99: 70ms; (it’s just a Kafka write to SSDs)
    • the S3 PUT latency: a p50 of ~80ms for S3 PUT (AWS S3 PUT latency benchmark). p99 can be ~140ms.
  • read latency: p50: ~10-50ms (???), p99: ~100ms

    • assuming the worst case of having to fetch every file from S3, the slowest S3 GET latency I could achieve from my benchmark was 86.6ms. (AWS S3 PUT latency benchmark)
    • this heavily depends on the caching strategy, for which the KIP is still not yet released
  • end to end latency: p99 of 470-570ms
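These totals are just the component estimates stacked on top of each other, which is exactly what the breakdown above does; here is the quick sanity check (real percentiles don’t add this neatly, so treat it as a rough upper bound):

```python
# Naively stacking the component latency estimates from the breakdown above.
write_p50 = 78 + 20 + 80     # buffer fill + Batch Coordinator + S3 PUT ≈ 178 ms
write_p99 = 160 + 70 + 140   # ≈ 370 ms
read_p99  = 100              # assumed cache-miss S3 GET path
print(write_p50, write_p99, write_p99 + read_p99)  # ~178 ms, ~370 ms, ~470 ms e2e p99
```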

I have to preface that this is napkin math. The batch coordinator may be slower, S3 may be slower than my benchmarks, I may be missing something else. But this projected p99 latency is consistent with WarpStream’s p99 latency - so I think it’s directionally correct.


Stretch: Diskless with S3 Express

Let’s venture further out into speculation land. Somewhat serendipitously, S3 Express lowered its prices by 50%+ last week. KIP-1150’s design should allow implementing an S3 Express-backed solution for lower latencies, with less ops burden thanks to the statelessness of the brokers. By my calculations, it may still be cheaper than traditional Kafka!

What Needs to Change

Three major things need to change in the current design:

  • since S3 Express is single-zone, the brokers would need to write to 3 different S3 buckets (in 3 different zones) in parallel, and only begin committing the batch once the data is persisted in all 3 buckets. This could probably be implemented in the storage plugin part of the KIP without any additional changes to the open source code (see the sketch after this list).
  • we would need to vastly reduce the batch interval ms config, resulting in a lot more requests per second to the topic-based Batch Coordinator. The Batch Coordinator will probably need to be heavily optimized, as it’s unlikely to be built for this workload. As of today, it’s unclear whether it’d be able to handle this workload at all.
  • the background merge job would need to merge S3 Express files and offload them to S3 Standard for cheaper longer-term storage. Let’s assume it does this at a conservative 1 hour interval.
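For the first point, here is a minimal sketch of what the storage-plugin side could look like: write the same object to three single-zone buckets in parallel with boto3 and only proceed to commit once all three PUTs succeed. The bucket names and the commit_to_batch_coordinator() call are hypothetical placeholders, and S3 Express directory-bucket specifics are glossed over:

```python
# Sketch: persist an object in all three zones before committing the batch.
# Bucket names and commit_to_batch_coordinator() are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")
EXPRESS_BUCKETS = ["diskless-az1", "diskless-az2", "diskless-az3"]  # one per AZ

def put_everywhere(key: str, body: bytes) -> None:
    with ThreadPoolExecutor(max_workers=len(EXPRESS_BUCKETS)) as pool:
        futures = [pool.submit(s3.put_object, Bucket=bucket, Key=key, Body=body)
                   for bucket in EXPRESS_BUCKETS]
        for future in futures:
            future.result()  # raises if any zone's PUT failed - nothing gets committed
    # Only once the object exists in all three zones is the batch safe to commit:
    # commit_to_batch_coordinator(key)
```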

Cluster Setup

Let’s stick to our original setup, but modify the batch interval:

  • 10-brokers, 1 GiB/s compressed traffic, 102.4 MiB/s per broker, 2 GiB/s consumer traffic from the tail of the log
  • assume our KIP-1150 batch interval is configured to 10ms (very low!). That’s 100 batch commits a second per broker.

S3 Express KIP-1150 Cost Calculation

Request Counts

  • S3 writes: each broker does 300 PUTs a second (100 per zone), the whole cluster does 3000 PUT/s
    • 3000 S3 Express PUT/s
  • coordinator write cost: each batch, once persisted in all 3 zones, results in one call to the BatchCoordinator. The whole cluster issues 1000 Batch Coordinator requests a second. Each commit carries ~6 KiB of metadata. That’s 5.85 MiB/s of produce throughput.
    • 5.85 MiB/s BatchCoordinator produce
  • merging: there will need to be a TON of merges here. The merged files are offloaded to an S3 Standard bucket. We can similarly assume:
    • 1000 S3 Express GET/s (to read the original files, off of one zone only)
    • 100 KiB/s BatchCoordinator produce (to persist the newly-committed merged data)
    • 41 multi-part PUT/s (25 MiB per PUT to capture the 1 GiB/s of data)
    • 3 GiB/s of data are stored in S3 Express for an hour before they’re merged and offloaded out.
  • reads: again assuming 1/3rd of reads are cached, roughly 1365 MiB/s of read traffic will be fetched from S3 Express. Each S3 Express file is tiny at ~1.02 MiB, so this results in an astounding ~1365 GETs/s for the cluster.
    • 1365 S3 Express GET/s
  • coordinator reads: handwaving 1.5x the amount of written data gets read
    • 8.77 MiB/s BatchCoordinator reads
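Same request-rate arithmetic, S3 Express edition (the 6 KiB of metadata per commit and the 1/3 cache-hit rate are the same guesses as before):

```python
# Per-second rates for the S3 Express variant (10 ms batch interval, 3 zones).
BROKERS, PER_BROKER_MIB_S = 10, 102.4
INTERVAL_MS, ZONES = 10, 3
META_KIB = 6                                         # per BatchCoordinator commit

commits_s = BROKERS * 1000 / INTERVAL_MS             # 1,000 commits/s
express_put_s = commits_s * ZONES                    # 3,000 PUT/s
coord_write_mib_s = commits_s * META_KIB / 1024      # ~5.86 MiB/s
object_mib = PER_BROKER_MIB_S * INTERVAL_MS / 1000   # ~1.02 MiB per object
read_get_s = (2048 * 2 / 3) / object_mib             # ~1,333 GET/s (the post rounds to ~1,365)

print(commits_s, express_put_s, round(coord_write_mib_s, 2), round(read_get_s))
```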

Daily Values

There are 86400 seconds in a day.

  • S3 Express Writes: 259,200,000 PUTs a day, pushing 259,200 GiB of data (3 GiB/s)
  • S3 Express Reads: 117,936,000 GETs a day, reading 115,172 GiB of data (1365 MiB/s)
  • Coordinator Writes: 493.59 GiB Batch Coordinator produce a day
  • Coordinator Reads: 740 GiB Batch Coordinator consume a day
  • Merging:
    • 86,400,000 S3 Express GETs reading 86400 GiB a day
    • 3,542,400 multi-part S3 Standard PUTs
    • 8.24 GiB Batch Coordinator produce (the 100 KiB/s merged produce)
    • 10800 GiB stored in S3 Express (3 GiB/s for 60 minutes)

Cloud Prices

The difference here is that S3 Express also charges for bandwidth (per GiB). With the recently-reduced prices, our costs are:

  • S3 Standard PUT: $0.005 per 1000 requests
  • S3 Express PUT Bandwidth: $0.0032 per GiB
  • S3 Express PUT Requests: $0.00113 per 1000 requests
  • S3 Express GET Bandwidth: $0.0006 per GiB
  • S3 Express GET Requests: $0.00003 per 1000 requests
  • S3 Express Storage: $0.11 per GiB-month
  • S3 DELETEs: free
  • BatchCoordinator replication: $0.02/GiB for cross-zone traffic
  • BatchCoordinator storage: $0.08/GiB for EBS SSDs
  • BatchCoordinator Produce: each GiB results in ~2.66 GiB in cross-zone traffic (2x for replication, 0.66x for the produce traffic as 2/3rds crosses zone) and 6 GiB of storage (as it’s durably replicated 3x and we keep 50% free space)
  • BatchCoordinator Consume: each GiB results in ~0.66 GiB in cross-zone traffic (0.66x for the consume traffic as 2/3rds crosses zone)

Daily Cost Calculation

A KIP-1150 Express topic receiving 1 GiB/s of traffic and serving 2 GiB/s of reads should roughly cost $1,351.2 a day.

  • S3 Express Writes: $1122.34/day
    • requests: $292.9/day (259,200,000 PUTs @ $0.00113/1000)
    • bandwidth: $829.44/day (259,200 GiB @ $0.0032/GiB)
  • S3 Express Reads: $72.63/day
    • requests: $3.53/day (117,936,000 GETs @ $0.00003/1000)
    • bandwidth: $69.1/day (115,172 GiB @ $0.0006/GiB)
  • Coordinator Writes: $34.16/day
    • cross-zone networking: $26.26/day (1313 GiB cross-zone traffic @ $0.02/GiB)
    • storage: $7.9/day (2962 GiB of storage @ $0.08/GiB-month)
  • Coordinator Reads: $9.77/day (488.4 GiB cross-zone traffic @ $0.02/GiB)
  • Merging: $112.3/day
    • S3 Express GETs: $54.43/day
      • requests: $2.59/day (86,400,000 GETs @ $0.00003/1000)
      • bandwidth: $51.84/day (86,400 GiB @ $0.0006/GiB)
    • S3 Standard PUTs: $17.7/day (3,542,400 PUTs @ $0.005/1000)
    • Batch Coordinator Produce: $0.57/day
      • $0.44/day (22 GiB cross-zone traffic @ $0.02/GiB)
      • $0.13/day (50 GiB storage @ $0.08/GiB-month)
    • S3 Express Storage: $39.6/day (10800 GiB stored @ $0.11/GiB-month)
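And the Express roll-up, reproducing the ~$1,351/day total from the prices and daily counts above:

```python
# Daily cost roll-up for the S3 Express variant, using the daily counts above.
EXP_PUT, EXP_GET = 0.00113 / 1000, 0.00003 / 1000   # $/request
EXP_PUT_GB, EXP_GET_GB = 0.0032, 0.0006             # $/GiB of bandwidth
STD_PUT, XZONE, EBS, EXP_STORE = 0.005 / 1000, 0.02, 0.08, 0.11

writes  = 259_200_000 * EXP_PUT + 259_200 * EXP_PUT_GB               # ~$1,122
reads   = 117_936_000 * EXP_GET + 115_172 * EXP_GET_GB               # ~$73
coord   = (493.59 * (2 + 2 / 3) * XZONE + 493.59 * 6 * EBS / 30
           + 740 * (2 / 3) * XZONE)                                  # ~$44
merging = (86_400_000 * EXP_GET + 86_400 * EXP_GET_GB
           + 3_542_400 * STD_PUT
           + 8.24 * (2 + 2 / 3) * XZONE + 8.24 * 6 * EBS / 30
           + 10_800 * EXP_STORE / 30)                                # ~$112

print(round(writes + reads + coord + merging, 2))                    # ~$1,351/day
```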

As you can see, the $830/day S3 Express PUT bandwidth charge dominates the projected $1,351 total. In other words, the major cost driver is essentially unavoidable and is decoupled from less-settled specifics like the BatchCoordinator implementation and the read-caching strategy.

What Would Our Express Latency be?

The 3-step latency profile just had all of its steps reduced in latency:

  • our batch interval is (the extreme) 10ms, meaning no broker waits more than 10ms to send a PUT to S3 Express.

  • I’m not able to find an S3 Express benchmark, but AWS claims it’s purpose-built to deliver single-digit millisecond access. Let’s assume PUT latencies of ~7ms at p50 and ~15ms at p99; GETs can be 3ms to 10ms.

  • our batch coordinator latency will be the bottleneck here - it’s getting hammered 100x more than last time. Let’s assume we deploy some fast SSDs and instances to keep it from slowing down too much.

    • the Batch Coordinator latency: p50: ~20ms, p99: ~70-100ms under this load (it’s just a Kafka write to SSDs)
  • write latency: p50: ~33ms; p99: ~100ms

    • the batch size/interval: p50: ~5ms, p99: 15ms; (our max is 10ms, but let’s assume the software is slow at times and introduces a 5ms pause)
    • the Batch Coordinator latency: p50: ~20ms, p99: 70ms; (optimistically the same as before, despite the higher load)
    • the S3 Express PUT latency: a p50 of ~8ms (we are writing to 3 zones so we won’t get the p50 of a single one). p99: ~15ms.
  • read latency: p50: ~10ms, p99: ~20ms

    • assuming the worst case of having to fetch every file from S3.
  • end to end latency: p99 of 120-160ms

This napkin math is even more of a stretch, but it’s my best attempt at projecting it.

The Bottom Line

Again assuming the absolute best possible case for Kafka - KIP-405, KIP-392 and the cheaper st1 HDDs are used. Note this can mean Kafka’s latencies are on the higher side due to the IOPS-limited HDD.

| System | Per Day | Per Month (30 days) | Per Year (365 days) |
| --- | --- | --- | --- |
| Traditional Kafka | $5,008 | $150,240 | $1,827,920 |
| Slow KIP-1150 | $54 | $1,620 | $19,710 |
| Express KIP-1150 | $1,351.20 | $40,536 | $493,188 |

KIP-1150 implemented with triple S3 Express writes would still be 3.7x cheaper than Traditional Kafka.

It seems absurd, but it’s the cloud world we live in. As Aiven themselves said in their announcement blog, Kafka’s architecture just so happens to be heavily taxed by the cloud providers’ current pricing model. With KIP-1150, Kafka becomes cheaper, more reliable and easier to operate, and it remains competitive no matter what pricing model is thrown at it.