Partitioned Logs: The Backbone of Kafka

Andrew-Shepard
2 min read · Dec 19, 2023


Kafka’s main draw is its performance, scalability, and fault tolerance. This is thanks to its robust implementation of a design pattern dubbed partitioned logs.

Kafka’s Distributed Dataflow and Partitioning

What’s a Partitioned Log?

In Kafka, a log is a data structure that is essentially an append-only, ordered sequence of messages, and each partition is one such log. Every message in a partition is assigned its own unique, sequential identifier known as an offset. Partitioning, then, is Kafka’s method of splitting the data it processes across these logs. Because each partition is its own ordered sequence of messages, it acts as a separate queue that can be read and written independently, which is what allows processing to be parallelized.
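To make offsets concrete, here is a minimal sketch (assuming the kafka-python client, a broker on localhost:9092, and a hypothetical topic named events) that prints the partition and offset of each message it reads; you’ll see offsets counting up independently within each partition:

```python
from kafka import KafkaConsumer

# Minimal sketch: read a hypothetical "events" topic and show that each
# message carries a (partition, offset) pair -- offsets are sequential
# within a partition, not across the whole topic.
consumer = KafkaConsumer(
    "events",                            # hypothetical topic name
    bootstrap_servers="localhost:9092",  # assumed local broker
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,            # stop iterating once the topic is drained
)

for record in consumer:
    print(f"partition={record.partition} offset={record.offset} value={record.value}")

consumer.close()
```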

And how does it partition?

With the partitions set up, a producer will either distribute messages across partitions round-robin, or, if ordering is needed, assign messages with the same key to the same partition. Because each partition has its own offsets, ordering is only guaranteed among messages that land in the same partition; when you don’t assign a key, messages are spread across partitions and no such guarantee exists. With good key usage, or the default round-robin when ordering doesn’t matter, you can take advantage of the distributed structure that gives Kafka its performance. Beyond these two defaults, custom partitioners are configurable in Kafka, so you can route messages based on any aspect of the message.
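Here’s a hedged sketch of both behaviors using the kafka-python producer (the broker address and the events topic are assumptions carried over from above): unkeyed messages get spread across partitions, while messages sharing a key always land in the same one.

```python
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")  # assumed local broker

# No key: the client spreads these across partitions, so there is no
# ordering guarantee between them.
for i in range(5):
    producer.send("events", value=f"unkeyed-{i}".encode())

# Same key: the default partitioner hashes the key, so all of these land
# in one partition and are read back in the order they were produced.
for i in range(5):
    producer.send("events", key=b"user-42", value=f"keyed-{i}".encode())

producer.flush()
producer.close()
```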

Additionally, replication is configurable via a replication factor, which is how many copies of each partition are maintained. Replicas exist for fault tolerance, though: the distribution of data is done at the partition and message level, not at the replication level.

The number of partitions is also configurable and can be used for more parallelism (given a good message distribution strategy), but each partition comes with overhead, so unused partitions can detract from performance.
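Both of these settings are fixed per topic when it is created. A minimal sketch with kafka-python’s admin client (the broker address, topic name, and the 6-partition / 3-replica values are purely illustrative assumptions):

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")  # assumed local broker

# 6 partitions for parallelism, 3 copies of each partition for fault tolerance.
admin.create_topics([
    NewTopic(name="events", num_partitions=6, replication_factor=3)
])

admin.close()
```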

What to consider

Messages that are partitioned with a key always go to the same partition. If the keys are chosen poorly (say, one hot key dominating traffic), this can become a slowdown, because it doesn’t utilize the parallelism and distributed nature that give Kafka its performance. Consider using another strategy when ordering is not needed.
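If you do need keys but one of them runs hot, a custom partitioner is one way out. The sketch below uses kafka-python’s partitioner option; the hot-customer key, the routing rule, and the broker address are all illustrative assumptions, not a prescribed approach.

```python
import zlib
from kafka import KafkaProducer

def hot_key_partitioner(key_bytes, all_partitions, available_partitions):
    """Illustrative partitioner: pin one known hot key to its own partition
    and spread every other key over the remaining ones."""
    partitions = available_partitions or all_partitions
    if key_bytes == b"hot-customer":          # hypothetical hot key
        return partitions[0]                  # dedicate a partition to it
    others = partitions[1:] or partitions     # steer other keys away from it
    if key_bytes is None:
        return others[0]                      # unkeyed: any partition will do
    return others[zlib.crc32(key_bytes) % len(others)]  # stable hash keeps per-key ordering

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",       # assumed local broker
    partitioner=hot_key_partitioner,
)
producer.send("events", key=b"hot-customer", value=b"heavy traffic")
producer.send("events", key=b"user-7", value=b"regular traffic")
producer.flush()
```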

Changing the number of partitions gets harder the more data you have and the longer your cluster has been configured that way, since the change affects how future data is distributed (new messages for an existing key may land in a different partition than the old ones) and how consumers process it.

And your consumers can only be as parallel as the number of partitions you set: within a consumer group, each partition is read by at most one consumer!
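For example, with the assumed 6-partition events topic from earlier, running this sketch in up to six processes with the same group_id splits the partitions among them; a seventh process would sit idle:

```python
from kafka import KafkaConsumer

# Every process running this joins the same consumer group. Kafka assigns
# each partition to at most one consumer in the group, so parallelism is
# capped at the partition count.
consumer = KafkaConsumer(
    "events",                            # hypothetical topic from the sketches above
    bootstrap_servers="localhost:9092",  # assumed local broker
    group_id="events-workers",           # same group id across all processes
)

for record in consumer:
    print(f"partition {record.partition}, offset {record.offset}: {record.value}")
```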

I encourage you to experiment with different partitioning strategies in your Kafka deployments. Knowing the core architecture lets you build good practices on top of that foundation and get the most scalable, performant, and reliable data streams you can out of Kafka.



Andrew Shepard, a Software Engineer with hands-on experience in SaaS and pharmaceutical tech, focuses on Python, REST, Docker, and Kubernetes.