Apache Kafka Streams Repartition Transformations

Apache Kafka Streams is a free and open-source client library for building real-time streaming applications on top of Apache Kafka that process data streams in a scalable and fault-tolerant way.

One of the fundamental features of Kafka Streams is the ability to perform stream processing operations such as transformations, aggregations, and joins on data streams.

Kafka Streams offers a wide range of transformations that allow us to analyze the data in a Kafka stream quickly and efficiently.

One important transformation operation in Kafka Streams is the repartition transformation. A repartition transformation allows us to redistribute the data in a stream based on a specific key. This is useful for downstream operations that require the data to be partitioned by a specific key.
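To see why key-based partitioning matters, the following self-contained sketch simulates how records with the same key are always routed to the same partition. Note that this is a simplification: Kafka's default partitioner actually applies murmur2 hashing to the serialized key, while we use plain hashCode() here purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PartitionDemo {
    // Simplified stand-in for Kafka's default partitioner
    // (Kafka actually uses murmur2 on the serialized key bytes).
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 4;
        List<String> keys = List.of("cust-1", "cust-2", "cust-1", "cust-3", "cust-2");

        Map<Integer, List<String>> partitions = new TreeMap<>();
        for (String key : keys) {
            int p = partitionFor(key, numPartitions);
            partitions.computeIfAbsent(p, k -> new ArrayList<>()).add(key);
        }

        // Every occurrence of the same key lands on the same partition, which
        // is what makes per-key aggregations correct after a repartition.
        partitions.forEach((p, ks) -> System.out.println("partition " + p + ": " + ks));
    }
}
```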

In this tutorial, we will learn how we can perform a repartition transformation in a Kafka stream using the Kafka Streams DSL API.

NOTE: It is good to note that a repartition is an extremely expensive operation on the cluster, since every record must be written to and re-read from an intermediate topic. It is therefore best to minimize the number of repartition steps in the stream’s lifecycle.

Kafka Streams Repartition

In a Kafka Stream, we can perform a repartition operation using the through() and to() methods. We can carry out these operations on KStream or KTable objects, depending on the target object.

The following shows the method definition for the to() method in Kafka Streams:

public void to(final String topic, final Produced<K, V> produced);

The method takes two main arguments:

  1. Topic – The name of the target topic to which the repartitioned data will be written.
  2. Produced – An instance of the Produced<K, V> class that defines how the data is serialized and written to the new topic.

The following snippet demonstrates how to use the to() method to repartition a given stream.

stream
    .selectKey((key, value) -> value.getField("customer_id")) // assumes the value type exposes a getField() accessor
    .groupByKey()
    .count()
    .toStream() // count() returns a KTable, so we convert it back to a KStream before writing
    .to("customer-purchases-count", Produced.with(Serdes.String(), Serdes.Long()));

In the provided example, we use the selectKey() method to get the customer_id field from the value and use it as the new key.

We then proceed to group the stream by the new key using the groupByKey() transformation, count the number of purchases per customer, and write the resulting stream to a new topic.
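Conceptually, the groupByKey()/count() step maintains a per-key running total, much like the following plain-Java sketch. This is only an approximation of what Kafka Streams does internally (it keeps the counts in a local state store); the names here are illustrative.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CountByKeyDemo {
    // Rough equivalent of groupByKey().count(): a per-key running total,
    // which Kafka Streams would keep in a local state store.
    static Map<String, Long> countByKey(List<String> keys) {
        Map<String, Long> counts = new TreeMap<>();
        for (String key : keys) {
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Purchase events after selectKey(): the customer id is the record key.
        List<String> purchaseKeys = List.of("cust-1", "cust-2", "cust-1", "cust-1", "cust-2");
        System.out.println(countByKey(purchaseKeys)); // prints {cust-1=3, cust-2=2}
    }
}
```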

NOTE: The role of the Produced.with() method is to define how the data is serialized and written to the new topic. In this case, we use the string and long serializers.
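As a conceptual illustration of what Serdes.String() and Serdes.Long() do under the hood, the following stdlib-only sketch serializes a key and a count to bytes and back. Kafka's actual serde classes live in org.apache.kafka.common.serialization; this sketch only approximates their behavior (UTF-8 for strings, an 8-byte big-endian encoding for longs).

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class SerdeSketch {
    // Roughly what Serdes.String() does: UTF-8 encode/decode.
    static byte[] serializeString(String s) { return s.getBytes(StandardCharsets.UTF_8); }
    static String deserializeString(byte[] b) { return new String(b, StandardCharsets.UTF_8); }

    // Roughly what Serdes.Long() does: an 8-byte big-endian encoding.
    static byte[] serializeLong(long v) { return ByteBuffer.allocate(8).putLong(v).array(); }
    static long deserializeLong(byte[] b) { return ByteBuffer.wrap(b).getLong(); }

    public static void main(String[] args) {
        byte[] key = serializeString("cust-1");
        byte[] value = serializeLong(42L);
        System.out.println(deserializeString(key) + " -> " + deserializeLong(value));
        // prints: cust-1 -> 42
    }
}
```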

As mentioned, we can also use the through() method to perform a repartition transformation. Note that through() has been deprecated since Kafka Streams 2.6 in favor of the repartition() method, but it remains widely used in existing code.

The method definition is as shown in the following:

public KStream<K, V> through(final String topic, final Produced<K, V> produced);

The method accepts similar parameters as the to() method.

The following snippet shows the usage for the through() method:

KStream<String, String> originalStream = builder.stream("original-topic");

// The value type is assumed to expose a getField() accessor.
KStream<String, Long> repartitionedStream = originalStream
    .selectKey((key, value) -> value.getField("customer_id"))
    .groupByKey()
    .count()
    .toStream() // count() returns a KTable, so we convert it back to a KStream
    .through("customer-purchases-count", Produced.with(Serdes.String(), Serdes.Long()));

As you can see, the functionality is very similar to that of the to() method. The key difference is that through() returns a new KStream that we can continue to process, whereas to() is a terminal operation.

Conclusion

In this tutorial, we covered the fundamentals of working with repartition transformations using the through() and to() methods from the Kafka Streams API.

About the author

John Otieno

My name is John and I am a fellow geek like you. I am passionate about all things computers, from hardware and operating systems to programming. My dream is to share my knowledge with the world and help out fellow geeks. Follow my content by subscribing to the LinuxHint mailing list.