Apache Cassandra

Cassandra Full Repair

Apache Cassandra is built to ensure high availability in all environments. This means that if a single or multiple nodes with a cluster are unavailable, the other nodes within the cluster will handle the requested operations.

However, when the unavailable nodes become reachable, they need to know what operations they missed within the cluster. Therefore, Apache Cassandra uses a hint feature to alert the node(s) of all the functions that they missed while unavailable. Although hints can perform notification of the missing operations, it does not guarantee a complete data consistency across the cluster. This inconsistency can lead to data loss, especially if an active cluster multiplies the CRUD operations.

To prevent a data loss, you need to perform the data repairs, allowing the nodes to synchronize the data across the cluster with the updated information.

In this tutorial, you will discover how to perform the repairs manually in the Cassandra cluster using the nodetool utility.

Types of Repairs in Cassandra

Cassandra supports two main types of repairs:

  1. Incremental Repairs
  2. Full Repairs

Incremental Repairs

By default, Cassandra performs an incremental repair. This repair only repairs the data that has changed since the previous repair. This is less resource intensive and very useful when you regularly perform the repairs.

One disadvantage of incremental repairs is that once the data is marked as repaired, Cassandra will not attempt to repair it again. This can lead to data loss, especially if the repair becomes corrupt.

Full Repairs

On the other hand, full repairs are much resource intensive, especially on disk and network I/O operations. However, they perform the data repairs across the cluster, synchronizing the correct, up-to-date information.

We could spend this entire article on the various types of Cassandra repairs and how Cassandra handles the repairs. However, let’s get into the main course of the article.

The Nodetool Repair Command

To perform a data repair on a Cassandra cluster, we use the nodetool repair command. The command syntax and options are as shown:

Depending on the specified method, the repair command performs either a full or incremental repair on one or more nodes.

How to Perform a Full Repair in Cassandra Cluster

In this section, let us look at how we can perform a full repair on a Cassandra cluster.

NOTE: To best illustrate a full repair, run the commands in this tutorial in a cluster with three or more nodes.

Step 1: Create a keyspace with a replication factor of 3 (or for the number of available nodes).

cassandra@cqlsh> CREATE KEYSPACE development
... WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 3};

Step 2: Create a table and add a sample data.

cassandra@cqlsh:development> CREATE TABLE t ( id int, name text, age int, PRIMARY KEY(id) );

Step 3: Add a sample data.

cassandra@cqlsh:development> INSERT INTO t (id, name, age) VALUES (0, 'User1', 2);
cassandra@cqlsh:development> INSERT INTO t (id, name, age) VALUES (1, 'User2', 3);
cassandra@cqlsh:development> INSERT INTO t (id, name, age) VALUES (2, 'User3', 5);

Step 4: Get a data stored in the table.

cassandra@cqlsh:development> SELECT * FROM t;

Output:

id | age | name
----+-----+-------
1 |   3 | User2
0 |   2 | User1
2 |   5 | User3

Step 5: Update the keyspace to include 4 replicas.

cassandra@cqlsh:development> ALTER KEYSPACE development
... WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 4};

Increasing the number of replicas mimics the operation of a node in the cluster going down and coming back up.

Increasing the replication factor should give you a message to perform a data repair.

Step 6: Perform a full data repair as:

$ nodetool repair -full development

The previous command performs a full repair on all the tables in the specified keyspace. To repair only one table, we can run the following command:

$ nodetool repair -full development t

This should repair only the table “t” in the keyspace.

To view the repair status, you can use the tpstats command:

$ nodetool tpstats

Conclusion

In this article, you learned how to perform Cassandra’s full repair using the nodetool utility.

Thanks for reading!

About the author

John Otieno

My name is John and am a fellow geek like you. I am passionate about all things computers from Hardware, Operating systems to Programming. My dream is to share my knowledge with the world and help out fellow geeks. Follow my content by subscribing to LinuxHint mailing list