Streaming Audio features all things Apache Kafka®, Confluent, real-time data, and the cloud. We cover frequently asked questions, best practices, and use cases from the Kafka community—from Kafka connectors and distributed systems, to data mesh, data integration, modern data architectures, and data mesh built with Confluent and cloud Kafka as a service. Join host Kris Jenkins (Senior Developer Advocate, Confluent) as he streams through a series of interviews, stories, and use cases with guests from the data streaming industry. Apache®️, Apache Kafka, Kafka, and the Kafka logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.
Improving Kafka Scalability and Elasticity with Tiered Storage
What happens when you need to store more than a few petabytes of data? Rittika Adhikari (Software Engineer, Confluent) discusses how her team implemented tiered storage, a method for improving the scalability and elasticity of data storage in Apache Kafka®. She also explores the motivating factors for building it in the first place: cost, performance, and manageability.
Before Tiered Storage, there was no real way to retain Kafka data indefinitely. Because of the tight coupling between compute and storage, users were forced to use different tools to access cold and hot data. Additionally, the cost of re-replication was prohibitive because Kafka had to process large amounts of data rather than small hot sets.
As a member of the Kafka Storage Foundations team, Rittika explains to Kris Jenkins how her team initially considered a Kafka data lake but settled on a more cost-effective method – tiered storage. With tiered storage, one tier handles elasticity and throughput for long-term storage, while the other tier is dedicated to high-cost, low-latency, short-term storage. Before, re-replication impacted all brokers, slowing down performance because it required more replication cycles. By decoupling compute and storage, they now only replicate the hot set rather than weeks of data.
Ultimately, this tiered storage method broke down the barrier between compute and storage by separating data into multiple tiers across the cloud. This allowed for better scalability and elasticity that reduced operational toil.
In preparation for a broader rollout to customers who heavily rely on compacted topics, Rittika’s team will be implementing tier compaction to support tiering of compacted topics. The goal is to have the partition leader perform compaction. This will substantially reduce compaction costs (CPU/disk) because the number of replicas compacting is significantly smaller. It also protects the broker resource consumption through a new compaction algorithm and throttling.
Jun Rao explains: What is Tiered Storage?Enabling Tiered StorageInfinite Storage in Confluent PlatformKafka Storage and Processing FundamentalsKIP-405: Kafka Tiered StorageOptimizing Apache Kafka’s Internals with Its Co-Creator Jun RaoWatch the video version of this podcastKris Jenkins’ TwitterStreaming Audio Playlist Join the Confluent CommunityLearn more with Kafka tutorials, resources, and guides at Confluent DeveloperLive demo: Intro to Event-Driven Microservices with ConfluentUse PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)
Decoupling with Event-Driven Architecture
In principle, data mesh architecture should liberate teams to build their systems and gather data in a distributed way, without having to explicitly coordinate. Data is the thing that can and should decouple teams, but proper implementation has its challenges.
In this episode, Kris talks to Florian Albrecht (Solution Architect, Hermes Germany) about Galapagos, an open-source DevOps software tool for Apache Kafka® that Albrecht created with his team at Hermes, a German parcel delivery company.
After Hermes chose Kafka to implement company-wide event-driven architecture, Albrecht’s team created rules and guidelines on how to use and really make the most out of Kafka. But the hands-off approach wasn’t leading to greater independence, so Albrecht’s team tried something different to documentation— they encoded the rules as software.
This method pushed the teams to stop thinking in terms of data and to start thinking in terms of events. Previously, applications copied data from one point to another, with slight changes each time. In the end, teams with conflicting data were left asking when the data changed and why, with a real impact on customers who might be left wondering when their parcel was redirected and how. Every application would then have to be checked to find out when exactly the data was changed. Event architecture terminates this cycle.
Events are immutable and changes are registered as new domain-specific events. Packaged together as event envelopes, they can be safely copied to other applications, and can provide significant insights. No need to check each application to find out when manually entered or imported data was changed—the complete history exists in the event envelope. More importantly, no more time-consuming collaborations where teams help each other to interpret the data.
Using Galapagos helped the teams at Hermes to switch their thought process from raw data to event-driven. Galapagos also empowers business teams to take charge of their own data needs by providing a protective buffer. When specific teams, providers of data or events, want to change something, Galapagos enforces a method which will not kill the production applications already reading the data. Teams can add new fields which existing applications can ignore, but a previously required field that an application could be relying on won’t be changeable.
Business partners using Galapagos found they were better prepared to give answers to their developer colleagues, allowing different parts of the business to communicate in ways they hadn’t before. Through Galapagos, Hermes saw better success decoupling teams.
A Guide to Data MeshPractical Data Mesh ebookGalapagos GitHubFlorian Albrecht GitHubWatch the videoJoin the Confluent CommunityLearn more with Kafka tutorials, resources, and guides at Confluent DeveloperLive demo: Intro to Event-Driven Microservices with ConfluentUse PODCAST100 to get $100 of free Confluent Cloud usage (details)
If Streaming Is the Answer, Why Are We Still Doing Batch?
Is real-time data streaming the future, or will batch processing always be with us? Interest in streaming data architecture is booming, but just as many teams are still happily batching away. Batch processing is still simpler to implement than stream processing, and successfully moving from batch to streaming requires a significant change to a team’s habits and processes, as well as a meaningful upfront investment. Some are even running dbt in micro batches to simulate an effect similar to streaming, without having to make the full transition. Will streaming ever fully take over?
In this episode, Kris talks to a panel of industry experts with decades of experience building and implementing data systems. They discuss the state of streaming adoption today, if streaming will ever fully replace batch, and whether it even could (or should). Is micro batching the natural stepping stone between batch and streaming? Will there ever be a unified understanding on how data should be processed over time? Is the lack of agreement on best practices for data streaming an insurmountable obstacle to widespread adoption? What exactly is holding teams back from fully adopting a streaming model?
Recorded live at Current 2022: The Next Generation of Kafka Summit, the panel includes Adi Polak (Vice President of Developer Experience, Treeverse), Amy Chen (Partner Engineering Manager, dbt Labs), Eric Sammer (CEO, Decodable), and Tyler Akidau (Principal Software Engineer, Snowflake).
dbt LabsDecodablelakeFSSnowflakeView sessions and slides from Current 2022Stream Processing vs. Batch Processing: What to KnowFrom Batch to Real-Time: Tips for Streaming Data Pipelines with Apache Kafka ft. Danica FineWatch the video version of this podcastKris Jenkins’ TwitterStreaming Audio Playlist Join the Confluent CommunityLearn more with Kafka tutorials, resources, and guides at Confluent DeveloperLive demo: Intro to Event-Driven Microservices with ConfluentUse PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)
Security for Real-Time Data Stream Processing with Confluent Cloud
Streaming real-time data at scale and processing it efficiently is critical to cybersecurity organizations like SecurityScorecard. Jared Smith, Senior Director of Threat Intelligence, and Brandon Brown, Senior Staff Software Engineer, Data Platform at SecurityScorecard, discuss their journey from using RabbitMQ to open-source Apache Kafka® for stream processing. As well as why turning to fully-managed Kafka on Confluent Cloud is the right choice for building real-time data pipelines at scale.
SecurityScorecard mines data from dozens of digital sources to discover security risks and flaws with the potential to expose their client’ data. This includes scanning and ingesting data from a large number of ports to identify suspicious IP addresses, exposed servers, out-of-date endpoints, malware-infected devices, and other potential cyber threats for more than 12 million companies worldwide.
To allow real-time stream processing for the organization, the team moved away from using RabbitMQ to open-source Kafka for processing a massive amount of data in a matter of milliseconds, instead of weeks or months. This makes the detection of a website’s security posture risk happen quickly for constantly evolving security threats. The team relied on batch pipelines to push data to and from Amazon S3 as well as expensive REST API based communication carrying data between systems. They also spent significant time and resources on open-source Kafka upgrades on Amazon MSK.
Self-maintaining the Kafka infrastructure increased operational overhead with escalating costs. In order to scale faster, govern data better, and ultimately lower the total cost of ownership (TOC), Brandon, lead of the organization’s Pipeline team, pivoted towards a fully-managed, cloud-native approach for more scalable streaming data pipelines, and for the development of a new Automatic Vendor Detection (AVD) product.
Jared and Brandon continue to leverage the Cloud for use cases including using PostgreSQL and pushing data to downstream systems using CSC connectors, increasing data governance and security for streaming scalability, and more.
SecurityScorecard Case StudyBuilding Data Pipelines with Apache Kafka and ConfluentWatch the video version of this podcastKris Jenkins’ TwitterStreaming Audio Playlist Join the Confluent CommunityLearn more with Kafka tutorials, resources, and guides at Confluent DeveloperLive demo: Intro to Event-Driven Microservices with ConfluentUse PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)
Running Apache Kafka in Production
What are some recommendations to consider when running Apache Kafka® in production? Jun Rao, one of the original Kafka creators, as well as an ongoing committer and PMC member, shares the essential wisdom he's gained from developing Kafka and dealing with a large number of Kafka use cases.
Here are 6 recommendations for maximizing Kafka in production:
1. Nail Down the Operational Part
When setting up your cluster, in addition to dealing with the usual architectural issues, make sure to also invest time into alerting, monitoring, logging, and other operational concerns. Managing a distributed system can be tricky and you have to make sure that all of its parts are healthy together. This will give you a chance at catching cluster problems early, rather than after they have become full-blown crises.
2. Reason Properly About Serialization and Schemas Up Front
At the Kafka API level, events are just bytes, which gives your application the flexibility to use various serialization mechanisms. Avro has the benefit of decoupling schemas from data serialization, whereas Protobuf is often preferable to those practiced with remote procedure calls; JSON Schema is user friendly but verbose. When you are choosing your serialization, it's a good time to reason about schemas, which should be well-thought-out contracts between your publishers and subscribers. You should know who owns a schema as well as the path for evolving that schema over time.
3. Use Kafka As a Central Nervous System Rather Than As a Single Cluster
Teams typically start out with a single, independent Kafka cluster, but they could benefit, even from the outset, by thinking of Kafka more as a central nervous system that they can use to connect disparate data sources. This enables data to be shared among more applications.
4. Utilize Dead Letter Queues (DLQs)
DLQs can keep service delays from blocking the processing of your messages. For example, instead of using a unique topic for each customer to which you need to send data (potentially millions of topics), you may prefer to use a shared topic, or a series of shared topics that contain all of your customers. But if you are sending to multiple customers from a shared topic and one customer's REST API is down—instead of delaying the process entirely—you can have that customer's events divert into a dead letter queue. You can then process them later from that queue.
5. Understand Compacted Topics
By default in Kafka topics, data is kept by time. But there is also another type of topic, a compacted topic, which stores data by key and replaces old data with new data as it comes in. This is particularly useful for working with data that is updateable, for example, data that may be coming in through a change-data-capture log. A practical example of this would be a retailer that needs to update prices and product descriptions to send out to all of its locations.
6. Imagine New Use Cases Enabled by Kafka's Recent Evolution
The biggest recent change in Kafka's history is its migration to the cloud. By using Kafka there, you can reserve your engineering talent for business logic. The unlimited storage enabled by the cloud also means that you can truly keep data forever at reasonable cost, and thus you don't have to build a separate system for your historical data needs.
Kafka Internals 101 Watch in videoKris Jenkins' TwitterUse PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)
Build a Real Time AI Data Platform with Apache Kafka
Is it possible to build a real-time data platform without using stateful stream processing? Forecasty.ai is an artificial intelligence platform for forecasting commodity prices, imparting insights into the future valuations of raw materials for users. Nearly all AI models are batch-trained once, but precious commodities are linked to ever-fluctuating global financial markets, which require real-time insights. In this episode, Ralph Debusmann (CTO, Forecasty.ai) shares their journey of migrating from a batch machine learning platform to a real-time event streaming system with Apache Kafka® and delves into their approach to making the transition frictionless.
Ralph explains that Forecasty.ai was initially built on top of batch processing, however, updating the models with batch-data syncs was costly and environmentally taxing. There was also the question of scalability—progressing from 60 commodities on offer to their eventual plan of over 200 commodities. Ralph observed that most real-time systems are non-batch, streaming-based real-time data platforms with stateful stream processing, using Kafka Streams, Apache Flink®, or even Apache Samza. However, stateful stream processing involves resources, such as teams of stream processing specialists to solve the task.
With the existing team, Ralph decided to build a real-time data platform without using any sort of stateful stream processing. They strictly keep to the out-of-the-box components, such as Kafka topics, Kafka Producer API, Kafka Consumer API, and other Kafka connectors, along with a real-time database to process data streams and implement the necessary joins inside the database.
Additionally, Ralph shares the tool he built to handle historical data, kash.py—a Kafka shell based on Python; discusses issues the platform needed to overcome for success, and how they can make the migration from batch processing to stream processing painless for the data science team.
Kafka Streams 101 courseThe Difference Engine for Unlocking the Kafka Black BoxGitHub repo: kash.pyWatch the video version of this podcastKris Jenkins’ TwitterStreaming Audio Playlist Join the Confluent CommunityLearn more with Kafka tutorials, resources, and guides at Confluent DeveloperLive demo: Intro to Event-Driven Microservices with ConfluentUse PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)
Tim is probably the best tech speaker I have ever listened. Please keep doing them.
Can’t stop listening.
This podcast is so good that I started at episode 1 and am going through them all. It’s amazing how well these complex topics are explained via audio.
a perfect pod
the things i learn from listening this podcast is insurmountable. i’m eternally grateful to you all.