Reduce Kafka traffic costs by up to 25% with one simple configuration change. Running a real-time data streaming platform at scale is expensive, but it doesn’t have to break your budget.
At Grab, our Kafka infrastructure serves millions of events daily across Southeast Asia’s leading superapp. However, there was one line item on our AWS bill that kept growing: cross-availability zone (cross-AZ) traffic charges.
Does this sound familiar? If you’re running Kafka on AWS, you’re probably bleeding money on network transfer costs too. Fortunately, we discovered a game-changing solution that can reduce Kafka traffic costs dramatically.
The $100K Problem Hidden in Plain Sight

Picture this scenario: You’ve got Kafka brokers spread across three availability zones for high availability. Smart move, right? Additionally, your partition replicas are distributed perfectly for resilience.
However, here’s what’s quietly draining your budget:
By default, every Kafka consumer talks only to partition leaders.
With three AZs and leaders spread evenly across them, there's a roughly 67% chance (two zones out of three) that the leader lives in a different zone from your consumer. Every. Single. Time.
Furthermore, the math is brutal:
- Producer → Leader (minimal cross-AZ cost)
- Leader → Followers (2x replication cost)
- Consumers → Leader (massive cost multiplier: every consumer group pulls its own full copy of the data)
That last bullet point? It was eating 50% of our total Kafka platform costs. Consequently, we knew we had to find a better solution.
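To see why that last line dominates, here's an illustrative back-of-the-envelope (the numbers are made up, not our actual figures). Take a topic ingesting 100 TB a month, with 3x replication, three consumer groups, and the typical ~$0.01/GB that AWS charges on each side of a cross-AZ transfer (effectively ~$0.02 per GB that crosses a zone boundary):
- Producer → Leader: ~67 TB crosses AZs ≈ $1,300/month
- Leader → Followers: ~200 TB crosses AZs ≈ $4,000/month
- Consumers → Leader: ~200 TB crosses AZs ≈ $4,000/month, and this line grows with every consumer group you add
Replication cost is capped by your replication factor; the consumption line is the one that scales with your business.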
The Game-Changing Solution (It’s Simpler Than You Think)
Kafka 2.4 introduced a feature that most teams overlook (KIP-392): letting consumers fetch from the closest replica.
Instead of always hitting the partition leader, consumers can now read from the replica in their own AZ. As a result, the consume path generates no cross-AZ traffic, and no extra charges, whenever a replica lives in the consumer's zone.
Moreover, here's the beautiful part: followers only serve records that are already committed, so you read exactly the same data you would get from the leader. Therefore, you're not sacrificing consistency for cost savings; at most you trade a little freshness, which we come back to later.
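To make it concrete, here's a minimal sketch of what opting in looks like with the plain Java client. The brokers, topic, group, and the hard-coded AZ ID below are placeholders; in practice you would detect the zone at runtime, which we get to shortly.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ClosestReplicaConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-1:9092,kafka-2:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "cost-aware-consumer");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // The only new line: tell the brokers which "rack" (AZ) this consumer runs in.
        // It must use the same identifiers as broker.rack, e.g. the AZ ID "apse1-az1".
        props.put(ConsumerConfig.CLIENT_RACK_CONFIG, "apse1-az1");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}
```

One catch: nothing changes unless the brokers also opt in by setting replica.selector.class to the rack-aware selector (covered in Phase 2 below). With the default selector, fetches keep going to the leader.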
Our Step-by-Step Implementation Guide
Phase 1: The Kafka Upgrade Journey
We jumped straight to Kafka 3.1 (rather than stopping at 2.4, where the feature first shipped) for better stability and bug fixes. Here's our zero-downtime upgrade strategy:
Step 1: ZooKeeper First
- First, we upgraded ZooKeeper to maintain compatibility with the newer Kafka version
- Subsequently, we verified backward compatibility with the existing Kafka version
- Finally, we achieved zero service disruption
Step 2: Rolling Kafka Upgrade
- Next, we deployed the Kafka 3.1 binaries while keeping the inter-broker protocol backward compatible
- We ran a mixed-version cluster during the transition
- Additionally, we updated Cruise Control at the same time
Step 3: Protocol Version Bump
- Lastly, we enabled the latest inter-broker protocol version
- Furthermore, we rolled it out gradually, keeping a fallback path open
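In Kafka terms, the knob behind this phase is the inter.broker.protocol.version broker setting: it stays pinned to the old version while the 3.1 binaries roll out, and only once every broker is running 3.1 and healthy does a second rolling restart bump it to 3.1. That pinning is what keeps the fallback path open; once the protocol version is bumped, rolling back to the old binaries is no longer supported, which is why we staged it.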
Phase 2: The Configuration Magic
Broker Setup (2 lines of config):
# Already had this for rack awareness
broker.rack={{ ec2_az_id }}
# This is the new magic
replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector
Consumer Setup (1 environment variable):
# Our SDK handles the rest automatically
export KAFKA_FETCH_FROM_CLOSEST_REPLICA=true
That’s it. Seriously.
Our internal SDK automatically:
- First, detects the host's AZ from EC2 metadata (roughly as sketched below)
- Then, sets the client.rack parameter
- Finally, routes fetch requests to local replicas
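The SDK itself is internal, so here is only an illustrative sketch of the detection step using IMDSv2. The class name, token TTL, and the absence of error handling are ours, not the SDK's:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Looks up the AZ ID (e.g. "apse1-az1") of the current EC2 host via IMDSv2. */
public final class AzDetector {
    private static final String IMDS = "http://169.254.169.254/latest";

    public static String availabilityZoneId() throws Exception {
        HttpClient http = HttpClient.newHttpClient();

        // IMDSv2: fetch a short-lived session token first.
        HttpRequest tokenRequest = HttpRequest.newBuilder(URI.create(IMDS + "/api/token"))
                .header("X-aws-ec2-metadata-token-ttl-seconds", "60")
                .PUT(HttpRequest.BodyPublishers.noBody())
                .build();
        String token = http.send(tokenRequest, HttpResponse.BodyHandlers.ofString()).body();

        // Then read the AZ ID, which lines up with the broker.rack values templated above.
        HttpRequest azRequest = HttpRequest.newBuilder(
                        URI.create(IMDS + "/meta-data/placement/availability-zone-id"))
                .header("X-aws-ec2-metadata-token", token)
                .GET()
                .build();
        return http.send(azRequest, HttpResponse.BodyHandlers.ofString()).body().trim();
    }
}
```

The value that comes back is an AZ ID such as apse1-az1, which is what ends up in client.rack. AZ IDs matter here: names like ap-southeast-1a map to different physical zones in different AWS accounts, while AZ IDs are stable, so they are the safe identifiers to use on both the broker and client side.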
Phase 3: Extending Beyond Basic Consumers
However, we didn’t stop at application consumers. Instead, we updated:
- Flink pipelines – Stream processing workloads (see the sketch below)
- Kafka Connect – Data integration pipelines
- Internal tooling – Monitoring and management tools
Ultimately, every consumer that could benefit got the upgrade.
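For Flink, for example, the same client.rack setting can be passed straight through to the connector's underlying Kafka consumer. Here's a minimal sketch with the KafkaSource API; the brokers, topic, group, and the hard-coded AZ ID are placeholders:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RackAwareFlinkJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // In practice this comes from instance metadata at startup; hard-coded here.
        String azId = "apse1-az1";

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka-1:9092,kafka-2:9092")
                .setTopics("orders")
                .setGroupId("flink-orders")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                // Forwarded to the underlying consumer: fetch from the replica in our AZ.
                .setProperty("client.rack", azId)
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "orders-source")
           .print();

        env.execute("rack-aware-flink-consumer");
    }
}
```

For Kafka Connect, the usual route is the worker configuration: properties prefixed with consumer. (for example consumer.client.rack) are passed through to the consumers that sink connectors use.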
The Results Will Shock You
Three months after rollout, the numbers spoke for themselves:
- 25% reduction in cross-AZ traffic
- 50% cut in network transfer costs
- Stable data consumption volume maintained
Furthermore, AWS Cost Explorer showed the impact immediately. Both ends of the traffic flow benefited:
- On one hand, brokers sent less cross-AZ data (steep cost drop)
- On the other hand, consumers received less cross-AZ data (another steep drop)
As a result, costs fell in step with the traffic we moved off cross-AZ links, with zero data loss or service degradation.
The Hidden Gotchas (And How to Handle Them)
Latency Trade-off: Up to 500ms Added
Reading from followers means your consumers can trail the leader by the replication lag. For ultra-low-latency use cases, we kept the traditional leader-only approach.
Our decision framework:
- Cost-sensitive workloads → Closest replica
- Latency-critical workloads → Leader only
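Mechanically, the split can be as simple as leaving client.rack unset for the latency-critical consumers; the rack-aware selector then falls back to serving them from the leader, and nothing else in the deployment has to change.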
Maintenance Complexity Increased
Broker rotations became trickier. Consumers now connect to follower replicas, so moving partition leadership off a broker ("demoting" it) no longer drains its consumer traffic or isolates it completely.
Our solution: Enhanced error handling and retry logic in consumer applications.
Load Balancing Challenges
Consumer distribution across AZs directly impacts broker load. Uneven consumer placement creates hot spots.
Our monitoring approach:
- Added AZ distribution metrics to our SDK (sketched below)
- Proactive consumer rebalancing
- Capacity planning per AZ
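The exact metrics live in our internal SDK, but the idea is easy to reproduce: tag consumer throughput with the AZ it runs in so dashboards show per-zone load. An illustrative sketch with Micrometer follows; the metric and tag names are invented for this example.

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class AzConsumptionMetrics {
    private final Counter recordsConsumed;

    public AzConsumptionMetrics(MeterRegistry registry, String azId, String consumerGroup) {
        // One counter per (AZ, consumer group); group dashboards by "az" to spot hot zones.
        this.recordsConsumed = Counter.builder("kafka.consumer.records.consumed")
                .tag("az", azId)
                .tag("group", consumerGroup)
                .register(registry);
    }

    // Call after each poll() with the number of records in the batch.
    public void recordBatch(int recordCount) {
        recordsConsumed.increment(recordCount);
    }

    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();
        AzConsumptionMetrics metrics = new AzConsumptionMetrics(registry, "apse1-az1", "orders-consumer");
        metrics.recordBatch(500);
    }
}
```

Grouped by the az tag, that counter makes uneven consumer placement, and the broker hot spots it creates, visible before they bite.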
Beyond Kafka: The Bigger Picture
This wasn't just about Kafka. It sparked a company-wide initiative to reduce cross-AZ traffic costs, and our service mesh team (Sentry) adopted similar strategies.
The ripple effect:
- Other teams audited their cross-AZ usage
- Shared learnings across engineering org
- Standardized cost optimization practices
Your Next Steps
Ready to implement this yourself? Here’s your action plan:
- Audit your current setup – Measure your cross-AZ traffic baseline (see the note after this list)
- Plan the upgrade – Kafka 2.4+ required
- Test in staging – Verify latency impact for your workloads
- Gradual rollout – Start with non-critical consumers
- Monitor and optimize – Watch for load distribution issues
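For the audit step, AWS Cost Explorer usually gets you most of the way: intra-region cross-AZ transfer typically appears under usage types ending in DataTransfer-Regional-Bytes, so filtering on those gives you a clean before-and-after baseline for the rollout.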
The Bottom Line
Sometimes the biggest wins come from the simplest changes. A few configuration tweaks saved us thousands monthly and will save hundreds of thousands annually.
Your Kafka bill doesn’t have to be a budget buster. With replica fetching, you can have both high availability and cost efficiency.
What’s your biggest cloud cost pain point? Reply and let me know – I’d love to hear what infrastructure challenges you’re tackling.
Tags: #Kafka #AWS #CostOptimization #Infrastructure #Engineering #DataStreaming #CloudCosts