However, although this feature does help Kafka manage data the right way, it can also introduce some drawbacks.
Take, for example, an Apache Kafka system that stores the state of IoT devices. The system might log each state change for each device, such as temperature, humidity, and other sensor readings, to a Kafka topic. Over time, this log can grow very large, making it difficult to manage and slowing down processing.
To address this, Apache Kafka provides us with a feature known as “log compaction”.
Kafka log compaction is a feature in Apache Kafka that allows for efficient storage and retrieval of the most recent version of a record in a Kafka topic. It works by keeping the latest value for each key in a compacted topic while discarding the older, superseded values. This helps to maintain a smaller, more manageable log size and ensures that the consumers receive the latest value for each key, even in the presence of failures or restarts.
Hence, using a feature such as Kafka log compaction, the system could retain only the most recent state for each device, discarding the older states that are no longer relevant. For example, if a device sends multiple temperature and humidity readings, only the most recent reading for that device is kept in the log. This helps to keep the log compact and manageable while still providing a complete picture of the current state of each device.
In this scenario, log compaction allows the system to maintain a smaller and more efficient log while still providing valuable information about the state of the devices. It also reduces the storage space required for the log, which makes the system more cost-effective and scalable.
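As a simplified illustration, consider the following hypothetical contents of a compacted topic (the device keys and values are made up for this example). Compaction removes the superseded records but preserves the offsets of the surviving ones:

```
Before compaction (key => value, in offset order):
  offset 0: device-1 => {"temp": 20, "humidity": 40}
  offset 1: device-2 => {"temp": 25, "humidity": 35}
  offset 2: device-1 => {"temp": 22, "humidity": 41}
  offset 3: device-1 => {"temp": 23, "humidity": 42}

After compaction (only the latest value per key survives):
  offset 1: device-2 => {"temp": 25, "humidity": 35}
  offset 3: device-1 => {"temp": 23, "humidity": 42}
```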
How Log Compaction Works – Simplified
We can simplify how Kafka log compaction works into the following steps:
- When a new record is written to a compacted topic, it is appended to the log along with its key, offset, and timestamp.
- Over time, multiple records with the same key may be written to the log, each with a different value.
- The compaction process periodically scans the log and removes all but the latest record for each key, where “latest” is determined by the record’s offset in the log.
- The cleaned segments are then rewritten and swapped into the log, which allows it to consume as little space as possible.
- Consumers that subscribe to the compacted topic can retrieve the latest value for each key even if the log has been compacted multiple times.
Hence, by keeping only the latest record for each key, log compaction helps reduce the log’s size while still providing the consumers with the most up-to-date information. It also minimizes the disk space required to store the log, making it more cost-effective and scalable.
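You can observe this behavior with the console tools that ship with Kafka. The following is a sketch which assumes that a compacted topic named device-states already exists on a broker at localhost:9092 (both names are placeholders). First, produce several keyed records:

```
kafka-console-producer.sh --bootstrap-server localhost:9092 \
  --topic device-states \
  --property parse.key=true \
  --property key.separator=:
>device-1:{"temp": 20}
>device-1:{"temp": 22}
>device-1:{"temp": 23}
```

Then, read the topic from the beginning and print the keys:

```
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic device-states \
  --from-beginning \
  --property print.key=true
```

Keep in mind that the active (most recently written) segment is never compacted, so the older duplicates for device-1 only disappear after the segment rolls and the log cleaner has run.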
Kafka Log Compaction Configurations
We can configure the compaction rules in the broker configuration file. The compaction behavior is mainly determined by the log.cleanup.policy parameter.
Setting the value of the “log.cleanup.policy” parameter to “compact” allows Kafka to enable the log compaction. By default, the policy is set to “delete”, which removes old segments based on retention time or size instead of compacting them. It is also good to keep in mind that not all topics are suited to compaction; every record in a compacted topic must have a non-null key, so compaction only makes sense for keyed data.
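For example, a minimal sketch of the relevant line in server.properties, which sets compaction as the default cleanup policy for all topics on the broker (topic-level settings can still override it):

```
# server.properties: use compaction as the default cleanup policy
log.cleanup.policy=compact
```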
The following are some other configurations that you need to be aware of when working with log compaction:
log.cleaner.enable – This is a Boolean value that allows you to enable or disable the log cleaner process in the cluster. The log cleaner process runs in the background of the Kafka broker and is responsible for compacting the logs of topics that use the compact cleanup policy. Keep in mind that the log cleaner is a trade-off between the disk space usage and broker resources. Disabling it avoids the background CPU and I/O overhead of cleaning, but compacted topics then grow without bound. Enabling it reclaims the disk space at the cost of that background work.
log.cleaner.threads – This sets the number of background threads that are used for log cleaning.
log.roll.ms – This determines the maximum amount of time to wait before closing an active segment and rolling a new one (the topic-level equivalent is segment.ms). This matters for compaction because only closed segments are eligible for cleaning.
log.cleaner.delete.retention.ms – This sets the amount of time, in milliseconds, that the delete tombstones and transaction markers are retained in a compacted topic before they are removed.
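Put together, these broker settings might look like the following in server.properties (the values shown are illustrative, not recommendations):

```
# server.properties: log cleaner settings (example values)
log.cleaner.enable=true
log.cleaner.threads=2
log.roll.ms=3600000
log.cleaner.delete.retention.ms=86400000
```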
It is good to note that these are some of the most common log compaction configurations within the Kafka cluster. You can find many more in the Kafka documentation.
Enable the Log Compaction
In recent Kafka versions, you can enable the log compaction for an existing topic using the kafka-configs utility and the --add-config parameter (older releases used the kafka-topics utility with the --config parameter for this).
An example is as shown in the following sketch, which assumes a broker listening on localhost:9092; adjust the address to match your cluster:
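```
# Assumes a broker at localhost:9092 and an existing topic named "users"
kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name users \
  --add-config cleanup.policy=compact,min.cleanable.dirty.ratio=0.5,segment.ms=3600
```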
Once we run the previous command, the “users” topic uses the log compaction policy and retains only the latest version of each record. The minimum cleanable dirty ratio of 0.5 means that the log cleaner only starts cleaning a log when at least 50% of it consists of dirty (not yet compacted) records. The segment interval of 3600 ms means that a new segment is rolled after, at most, 3600 milliseconds.
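You can confirm that the configuration was applied using the --describe option of the same utility:

```
kafka-configs.sh --bootstrap-server localhost:9092 \
  --describe --entity-type topics --entity-name users
```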
You can also enable the log compaction when creating the topic, as shown in the following command syntax (the partition and replication values here are illustrative):
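```
kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic users \
  --partitions 1 \
  --replication-factor 1 \
  --config cleanup.policy=compact \
  --config min.cleanable.dirty.ratio=0.5 \
  --config segment.ms=3600
```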
Conclusion
This post explored the fundamentals of log compaction in Apache Kafka. We covered what log compaction is, how it works, and how to enable it in a Kafka cluster. It is good to note that this is a basic tutorial; there is much more about log compaction in Kafka that is not included in this post. We recommend that you go through the other articles on this website to expand your knowledge.