Kafka Community Spotlight #6

1. Personal

Please tell us about yourself, and where you are from.

I was born in the north of England near Manchester and had quite a typical family upbringing until my parents divorced in my teens. I went to university in Nottingham and graduated in Computer Science, which included a placement year at Siemens Mobile. This was my first taste of real-world programming (Java & MIDP).

Jason

How did you get into technology?

I always liked gadgets and working things out as a kid. I took apart my mum’s vacuum at 10 years old to “fix it”, and I discovered “free” computer games at a similar age by copying BASIC code from library books into my Acorn 3000.

Did you really fix the vacuum?

Haha, I did indeed. It turned out that an internal part was simply clogged up and the simple act of taking it apart got rid of the blockage which made it work again. My mum tells this story to anyone who will listen.

How do you spend your free time? What are your hobbies?

I recently had my first child, who now occupies a lot of my time (more than I expected!). Before that I had a constant need to be trying new things. I paint, I play the violin, I was working towards my pilot’s license (now on hold), and I like to surf when holidays align. And of course, I’m an avid computer gamer when my wife allows!

What does your ideal weekend look like?

A walk in the countryside ending with good food at a pub in the sun

Where is one place you’d recommend anybody reading this travel to?

In the UK, The Cotswolds, it’s stunning and often overlooked by tourists. In the world, it is too hard to choose. Go somewhere where the culture is totally different to what you know.

A book you’d recommend to readers.

Fiction: The Eyre Affair by Jasper Fforde, Non-Fiction: The Trading Game by Gary Stevenson

Favorite food? Best cuisine you’d recommend?

Indian Food - Black Daal in particular. My wife is of Indian heritage so I’m a little biased.

Favorite Movie?

The Matrix - nothing has compared since

Favorite drinks? Non-alcoholic or alcoholic?

Margarita

Any sports you do (or watch!)?

I definitely can’t claim to be a sports person. I did a lot of calisthenics, mainly Olympic rings, in the last few years until the baby came along and ruined my routine.

What is the best advice you ever got?

With everything you build, aim to design yourself out of the job—if the system doesn’t need you, you’ve likely built it well.

What is the best advice you would give someone?

Never be afraid to fail or ask for help, it’s the only way we truly grow and learn.


2. Kafka

How did you get into Kafka? What was your first experience like?

I’ve been a user/consumer of Kafka for much longer than I’ve been an admin of it. I typically took it for granted especially at larger companies like Twitter where we had an entire team looking after it for us. I only started to appreciate its nuance and complexity when I started to run it as a platform when I worked for Deliveroo. Before that I was more in the key/value, blob storage, and DB space in terms of expertise.

Can you tell me more about the user-side experience of Kafka back then? Anything you wish you had known earlier?

I didn’t realise how much batching mattered, or that in some cases you couldn’t just blindly scale up producers without effectively watering down your average batch size. Back then I didn’t even think about linger.ms or batch.size, and I mistakenly saturated the request queue a few times because of this. I also learned the hard way that slow clients could saturate the response queue too if they got stuck.
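Editor's sketch: the batch dilution described above can be illustrated with a toy calculation. It assumes traffic is spread evenly over every producer and partition. The function names, parameters, and numbers are hypothetical and are not taken from the interview or from any Kafka client.

```python
def avg_records_per_batch(total_msgs_per_sec, producers, partitions,
                          linger_ms, batch_size_bytes, record_bytes):
    """Toy estimate of how many records one producer fits into one batch
    for one partition before linger.ms expires (assumes an even spread)."""
    per_pair_rate = total_msgs_per_sec / (producers * partitions)
    accumulated = per_pair_rate * linger_ms / 1000.0
    cap = batch_size_bytes // record_bytes  # batch.size bounds the batch
    return max(1.0, min(accumulated, cap))


def batches_per_sec(total_msgs_per_sec, **kwargs):
    """Total batches per second the cluster must absorb at this rate."""
    return total_msgs_per_sec / avg_records_per_batch(total_msgs_per_sec, **kwargs)


# Same total traffic, same settings, only the producer count changes.
settings = dict(partitions=50, linger_ms=10, batch_size_bytes=16384, record_bytes=200)
few = batches_per_sec(100_000, producers=10, **settings)
many = batches_per_sec(100_000, producers=40, **settings)
```

In this model, quadrupling the producer count at constant traffic shrinks the average batch from 2 records to 1 and doubles the number of batches, which is the kind of extra load that lands in the broker request queue. Real clients group batches per broker into requests, so treat the numbers as directional only.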

What Version of Kafka did you start with?

Hard to say. I think as a user I was definitely in 1.x territory, but as an admin, probably more like ~2.4, as I remember when we still used the ZooKeeper flag.

When should one choose Kafka? When should one not choose it?

Choose it when you’re close to high scale and expect to grow further, or need streaming operations for use cases like trading, alerting, and telemetry aggregation. Don’t choose it early on when growth is uncertain or you’re just starting (but don’t design it out entirely), and also don’t choose it if you can’t handle the latency impact of failure modes or maintenance for your use case.

What do you think is the most common use case(s) for Kafka?

Probably service-to-service events (e.g. the Like button at Twitter) or data delivery with some ETL/CDC into a data lake or similar.

What do you think are niche(r) use cases for Kafka?

Cache hydration, token allocation (both used a lot in gambling systems)

Do you think Kafka has a high entry barrier?

Not if you pick a fully managed service (if you can afford it). But yes, a very high barrier if you run it yourself.

Do you think Kafka is hard to manage?

Definitely, especially if you run it with multi-tenancy for different teams/clients of varying capability and use cases. Running it on K8s with an operator can help - until it breaks!

What’s the most annoying thing in Kafka you can think of?

I’ve always found it to be quite annoying that Kafka fails quite badly when the disk is full and control plane operations/retention die so you can’t easily recover.

What’s the weirdest bug you’ve hit in Kafka?

I’ve seen so many. The one that stands out is probably the topic ID mismatch problem if you use older clients with ZooKeeper flags on newer cluster versions. I’ve seen so many people caught out by this. I’ve also seen some weird ISR states when folks have tried to re-assign themselves out of trouble. Third-party client implementations have also shown me some weird scenarios. I remember a customer using Rust would occasionally block Kafka’s response queue because of their threading paradigm.

I’m of the impression Kafka doesn’t have that many bugs. Do you disagree, or were those bugs more present in the early days?

I think the number of bugs I’ve seen is a function of the fact that when I started at MSK, they still supported 1.x and early 2.x versions. The bugs I’ve seen in later versions of Kafka are in more niche features like tiered storage and transactions, which are much less commonly used. Largely I would agree that modern 4.x Kafka is very stable.

If you had a magic wand and could instantly and frictionlessly contribute/fix one thing to Kafka, what would it be?

I would add a feature to allow acks=AtLeastMinIsr to be used instead of acks=all. This would trade off durability during hardware degradation without impacting the latency of the cluster.
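Editor's sketch: the trade-off behind this idea can be shown with a toy latency model. The "min_isr" mode below is the hypothetical feature from the answer and does not exist in Kafka. The function name, parameters, and latency figures are illustrative assumptions.

```python
def ack_latency_ms(leader_ms, follower_ms, acks, min_isr=2):
    """Toy model: time until the leader may acknowledge a produce request.

    leader_ms:   time for the leader to append the record.
    follower_ms: time for each in-sync follower to replicate it.
    acks:        "1", "all", or the hypothetical "min_isr" mode.
    """
    if acks == "1":
        # Leader append only; followers are not waited on.
        return leader_ms
    if acks == "all":
        # Wait for every replica currently in the ISR.
        return max([leader_ms] + list(follower_ms))
    if acks == "min_isr":
        # Hypothetical: wait only until min_isr replicas (leader included) have it.
        if len(follower_ms) + 1 < min_isr:
            raise ValueError("not enough in-sync replicas")
        fastest = sorted(follower_ms)[:max(min_isr - 1, 0)]
        return max([leader_ms] + fastest)
    raise ValueError(f"unknown acks mode: {acks}")


# One healthy follower (3 ms) and one degraded follower (250 ms).
followers = [3, 250]
latency_all = ack_latency_ms(2, followers, "all")
latency_min_isr = ack_latency_ms(2, followers, "min_isr", min_isr=2)
latency_one = ack_latency_ms(2, followers, "1")
```

With these made-up numbers, acks=all is held back by the degraded follower, while the hypothetical mode acknowledges as soon as the fastest follower has the record. In a real cluster the slow replica would eventually drop out of the ISR once it exceeds replica.lag.time.max.ms, which this sketch does not model.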

Interesting idea! Another idea I’ve had: with network-attached storage (EBS), fsync-ing a single replica means you’re already durably replicated within a single AZ - and perhaps that’s enough for some cases. Do you think durability knobs are generally under-explored/under-appreciated in the Kafka world? Surely within a cluster some workloads are less sensitive to durability/availability?

Absolutely! In the cloud we’re using durable storage by default in most cases. I doubt most folks care about the difference in durability between X9’s and Y9’s. Having more durability levers like this hypothetical feature makes sense to me. I don’t want to choose only between acks=1 or acks=all. And I don’t think replica.lag.time.max.ms by itself is a strong enough lever. The problem is when you ask most customers, they don’t really know what they want in SLO terms, they largely just want their system to “keep working”.

How has Kafka changed over the years from your point of view, for better and for worse?

I’d say for the better. Moving away from ZooKeeper certainly helped a lot as it’s hard to manage well, and I really like the direction we’re heading in, where the most valuable part of Kafka is the protocol. Stateless brokers and diskless are definitely a positive change.

Have your thoughts about Kafka changed with it over the years?

Yes, when I was just a user I used to think it was simpler than it really is. I didn’t appreciate how one bad client could ruin a cluster for other tenants.

What is the hardest thing you’ve worked on in or around Kafka?

I’ve worked on changing the underlying storage destination to use engines other than a log. This probably gave me my deepest insights into Kafka. That and automation, finding a one size fits all approach for the different sizes of clusters is very tricky.

When you say “engines”, what does that mean? A different data structure? (e.g. an LSM tree)

I’ve experimented, and continue to do so, with various tree models (LSM included, and I recently learned about the Fractal tree in TokuDB). I’ve also played with some in-memory-only models for when durability really doesn’t matter (or async dual write). For each of these structures, if I can come close to Kafka parity then there are obvious new use cases to explore.

What are your opinions on stream processing?

I think it’s a hard thing for imperative programmers to wrap their head around. If I had to prefer an option then I think Flink is the way to go as running Spark is too fat/heavy unless you already have that expertise. But I wish there were simpler options to make concepts like shuffling, windowing, and aggregation more accessible to folks without the learning curve.


3. Business/Work

What do you do at your current job as a Principal Software Engineer at Cisco ThousandEyes? Do you use Kafka there?

I support our platform teams with a particular focus on data storage and observability. Yes, we are heavy Kafka users for a variety of use cases (probably most of them).

What is the biggest problem you’re currently hitting there?

In the Kafka space, client-side best practice configuration and capacity planning.

Producer or Consumer best practice? Have you considered using a proxy layer?

Both. Yes, I’ve considered it, central and sidecar. Consumer side proxies are not particularly nice although gRPC streaming can help somewhat. The blast radius and maintenance overhead are the main considerations here vs just getting folks to fix their clients, which is now somewhat easier to audit with AI.

You worked on AWS MSK in your previous job, right? Tell us about that

I largely owned “Availability & Operations” at MSK. When I joined there was a lot of work to be done on improving its posture, and I was able to make some very large improvements there over the years. Suffice to say the number of support tickets had massively reduced by the time I left and there were a lot more happy customers. One of the epiphanies I had working for them was that it’s really hard to generalize an offering for such a wide and varying customer base when you control so little.

What’s the biggest thing you learned most recently about Kafka that somebody reading this may not realize?

Every time I sit down with someone and help them with a Kafka problem I observe a fundamental knowledge gap about the behaviour of Kafka. People are surprised that they need to think about tuning and recovery to live happy Kafka lives, especially on the client side.

Can you give an example of the tuning/recovery gaps that people have?

A few that are top of mind: Linger/Batch, Client Memory Buffer, Retry Backoff, Connection Backoff/Max, Thread pool tuning (io/network/replica.fetchers), replication throttling, replica max lag, replication batch size, Consumer rebalances (just use the new protocol if you can), client graceful shutdown, and very old slow clients.
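Editor's sketch: several of the items above correspond to configuration keys in Apache Kafka. The mapping below is one reading of the list, not the interviewee's own, and the helper function, the sample config, and its values are hypothetical and are not recommendations.

```python
# Hedged mapping from some of the topics listed above to Kafka config keys.
# The first four groups are client settings, the next two are broker settings,
# and group.protocol is the consumer setting that selects the rebalance protocol.
TUNING_KEYS = {
    "linger/batch": ["linger.ms", "batch.size"],
    "client memory buffer": ["buffer.memory"],
    "retry backoff": ["retry.backoff.ms"],
    "connection backoff/max": ["reconnect.backoff.ms", "reconnect.backoff.max.ms"],
    "thread pools": ["num.io.threads", "num.network.threads", "num.replica.fetchers"],
    "replica max lag": ["replica.lag.time.max.ms"],
    "consumer rebalances": ["group.protocol"],
}


def untuned(config, topics):
    """Return the keys for the given topics that `config` leaves at defaults."""
    return [key for topic in topics for key in TUNING_KEYS[topic] if key not in config]


# Hypothetical producer config with arbitrary values.
producer_config = {"linger.ms": 5, "retry.backoff.ms": 100}
missing = untuned(producer_config, ["linger/batch", "client memory buffer", "retry backoff"])
```

Here `missing` lists the settings this sample producer never set explicitly, which is a cheap first pass when auditing client configurations.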

Any fun experiences you’d like to share from AWS?

I really enjoyed working on MSK’s “Express” brokers. This was the first time AWS took some ownership of the storage layer of Kafka with a proprietary offering. As part of this I was able to push through some meaningful changes around guardrailing of configurations and alternative maintenance patterns. I also learned way more about Kafka internals than I’d ever had the chance to previously. During this time, I gained the confidence to take on some small PRs in open source too.

How did working on a cloud service differ from your past jobs?

Running a platform team at a company is one thing, running a product that will be used by many platform teams is something else. There are customers who know what they’re doing and want many levers, there are customers who want the simple life and things to just work without tradeoffs, and then there are customers who are somewhere in between. This variance makes life very interesting as you’re constantly working to find a natural equilibrium in features and behaviour; not to mention you get to see so many Kafka use cases!

What’s your take on on-prem versus cloud (with relation to Kafka)?

With my business hat on, especially for larger businesses, I think managed Kafka is largely the way to go. For growth you really want to minimise the time and effort you put into your stack, and keeping things as simple as possible will only ever make your life easier. That being said, if I ran a startup, I would absolutely run Kafka myself without concern. That wouldn’t necessarily be true for technologies that I’m less familiar with.


4. General/Parting

What do you think about queues for Kafka?

It’s a great feature. I wonder if this is the end for RabbitMQ, ActiveMQ, etc. I particularly like it because a lot of people using Kafka really want true queue semantics, don’t care about ordering, and already have a Kafka cluster.

What do you think about how schemas are handled in Kafka?

Data governance eventually pops up as a conversation in any company. Eventually we all have to think about it. I would certainly like Kafka to grow in this space. Stricter schemas also open up a world of some nice features like filtering or lightweight ETL.

What do you think about Kafka proxies?

If your use case allows, they are worth investing in. The gains from being able to standardize the majority of produce traffic outweigh the increased blast radius imho.

What do you think about Diskless Topics?

I’m excited to see Kafka become cheaper/simpler, and I wonder if this also opens the door to having Kafka send data directly to other engines without a separate consumer application.

What do you think about all the different Kafka forks/implementations? Any favorites?

I only recently learned about Tansu. I like the simplicity of this approach very much as it aligns with my own design philosophies. But massive respect to WarpStream for being the first (I think) to really move the needle here.

What do you think about Kafka’s current state of “cloud nativeness”?

It’s nonexistent imho, Kafka assumes a stable network and stable IO. The cloud is not this.

How do you see the future of Kafka usage and development, 5 years out?

I think simplicity and the use of Kafka data from the point of produce is coming. Nobody wants multiple applications just to move data from A to B with some processing along the way.

What other tech besides messaging do you have interest in?

I’ve looked after some very large Postgres databases and I’ve worked on proprietary key/value store implementations. I’ve started to dabble in the invention space now that AI is able to 10x my productivity and fill in the gaps in my knowledge when I want to change something.

Do you use AI in your day-to-day? How? Any favorite tools?

Yes, all the time. I’m a heavy Codex user. I know Claude is typically faster, but I find the Codex workflow more natural and in line with the way I reason. Following on from my S3 Files blog, I was able to use Codex to completely disable replication to allow multiple brokers to run against a shared file system with unclean leader election. Without AI this would have been months of work for me personally. Sadly the cost footprint of S3 Files means I won’t pursue this any further, but it was a fun experiment!

How many Kafka Summits have you been to? How has the conference changed over the years?

Just one, sadly. In the time I’ve been involved with Kafka, the great cost crunch has hit and gone are the days of plentiful conferences. Back in the day I would never miss a JavaOne.

Anything else you’d like to add?

Don’t be afraid to look at what you’ve built and change it. The status quo is not a reason to avoid trying new things.

Any Social Media channels of yours we should be aware of?

It’s only in the last few years I’ve started to post, and for now I’ve stuck with LinkedIn.