Let’s take a deep dive into Cassandra and see why it comes up so often alongside the keyword “Big Data”.
Relational databases are the most commonly used databases. They work well with applications designed for a wide range of business operations. Some of the popular ones are MySQL, PostgreSQL, etc.
But some applications need to write large volumes of data quickly, while others need extremely fast response times, and on top of that they require high availability.
Relational databases can support high read and write throughput, but some of their key properties (ACID) make this hard to achieve. Take consistency (the C in ACID): maintaining it adds extra overhead to every operation. NoSQL databases typically don’t enforce the strict ACID properties, but they are faster in their operations.
Cassandra is a wide-column NoSQL database. It has many similarities to relational databases: both use tables as the basic data structure, and both have data types, keys, etc.
Being NoSQL, it has a flexible schema, which means some rows may have values for columns that other rows don’t. It is also eventually consistent, which means replicas of a row may hold different versions of the data for a short time, but they gradually become consistent. Replicas of the data are kept on multiple nodes to provide fault tolerance: if a node fails, the user can still retrieve the data from a replica. The number of replicas is set by the replication factor and strategy defined by the user.
In Cassandra there is no distinction between master and worker nodes; they all run the same set of services.
Cassandra runs on a cluster of servers (called nodes) instead of a single one to provide high availability, so a row first has to be located on the right node before it can be found within that node. For that reason the primary key is made up of two parts:
- Partition key: It defines which node in the cluster to use when storing or retrieving a row. It is used to distribute data across your nodes.
- Clustering key: It defines the order in which rows are stored. It is used to sort data within a partition.
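To make the two roles concrete, here is a minimal Python sketch of the idea, not of Cassandra's actual implementation. The node names, the `Cluster` class, and the sensor data are all hypothetical, and MD5 with a simple modulo stands in for Cassandra's real partitioner and token ranges.

```python
import hashlib
from bisect import insort

NODES = ["node-a", "node-b", "node-c"]  # hypothetical node names


def node_for(partition_key: str) -> str:
    """Pick a node by hashing the partition key (MD5 + modulo, for illustration only)."""
    token = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    return NODES[token % len(NODES)]


class Cluster:
    """Toy cluster: the partition key picks the node, the clustering key orders rows."""

    def __init__(self):
        # node -> partition key -> list of (clustering key, value), kept sorted
        self.storage = {node: {} for node in NODES}

    def insert(self, partition_key, clustering_key, value):
        partition = self.storage[node_for(partition_key)].setdefault(partition_key, [])
        insort(partition, (clustering_key, value))  # keeps rows sorted by clustering key

    def read_partition(self, partition_key):
        return self.storage[node_for(partition_key)].get(partition_key, [])


cluster = Cluster()
# Rows arrive out of order but are stored sorted by the clustering key (the date).
cluster.insert("sensor-1", "2024-01-03", 21.5)
cluster.insert("sensor-1", "2024-01-01", 19.0)
cluster.insert("sensor-1", "2024-01-02", 20.2)
rows = cluster.read_partition("sensor-1")
```

All rows that share the partition key `sensor-1` land on the same node, and reading the partition returns them already ordered by date.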
Cassandra uses CQL (Cassandra Query Language), which is similar to SQL but with restrictions, to query and manipulate data.
Cassandra is often used to store large volumes of data, so it's always recommended to pick the proper data type when defining the columns of your tables.
Refer to this to know more about Data Type.
Data model design in Cassandra is driven by the queries. In an RDBMS, we start data modeling by defining the entities and then the relationships between them. Here we don’t start with entities; we start with the queries we want to run. Depending on those, we often use a single table in Cassandra, instead of multiple tables, to produce the result of a query. The reason is that the goal is not relational modeling but fast reads and fast writes over very large volumes of data.
There are no joins and a lot of data duplication. Joins are expensive, so to avoid them we duplicate the data among the rows. A table here is therefore a mix of different attributes rather than one well-defined entity. This goes against database design best practices, which help to reduce data anomalies, but here we strive for higher performance, which might come at the cost of some anomalies.
And yes, duplicating data leads to the need for more storage. Cassandra is used when your top priority is being able to write and read data quickly. But this does not mean we waste space with Cassandra: we should use data types that are sufficient for what we need, but no more.
We denormalize instead of sorting and joining, and there are two ways of doing that:
- Keeping multiple copies of data across tables, duplicating data instead of joining tables.
- Using collections (sets, lists, and maps) to store multiple values instead of using a separate table.
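Both approaches can be sketched in a few lines of Python. This is only an illustration of the modeling idea using plain dictionaries; the table names (`videos_by_user`, `videos_by_tag`), the `add_video` helper, and the sample data are made up for the example.

```python
# Two query-specific "tables", each keyed by the column its query filters on.
videos_by_user = {}  # user_id -> list of video rows
videos_by_tag = {}   # tag -> list of video rows


def add_video(video_id, user_id, title, tags):
    """Write the same video into every table that needs it (duplication instead of joins)."""
    row = {
        "video_id": video_id,
        "user_id": user_id,
        "title": title,
        "tags": set(tags),  # a set-valued column instead of a separate video_tags table
    }
    videos_by_user.setdefault(user_id, []).append(dict(row))
    for tag in tags:
        videos_by_tag.setdefault(tag, []).append(dict(row))  # independent copy per table


add_video("v1", "alice", "Intro to CQL", ["cassandra", "cql"])
add_video("v2", "bob", "Ring basics", ["cassandra"])

# Each query is answered from exactly one table, with no join.
titles_for_alice = [r["title"] for r in videos_by_user["alice"]]
titles_for_cassandra = [r["title"] for r in videos_by_tag["cassandra"]]
```

The price is visible in `add_video`: one logical write turns into several physical writes, and the copies have to be kept in sync by whoever writes the data.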
In Cassandra, you generally can’t query a table without restricting on its primary key, starting with the partition key.
We usually define a Cassandra table to answer a single query, but sometimes a single table can answer multiple queries. For that we have secondary indexes: indexes on columns that allow us to specify those columns in a `where` clause. Secondary indexes are useful when:
- Many rows contain the indexed value; with only a few rows per value they don’t perform well.
- The table does not have a counter column (a Cassandra column type for values that are incremented or decremented).
- The indexed column is not frequently updated.
We also have materialized views in Cassandra, which help reduce the overhead of managing denormalized tables. They are managed by Cassandra and are read-only. But they come at an added cost in write performance, as data needs to be written to these materialized views too.
Refer to this to learn more about what materialized views are and how to create them.
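The write-time cost of a materialized view can be illustrated with a small Python sketch. It is a toy model, not how Cassandra implements views: the `TableWithView` class, its `physical_writes` counter, and the user data are hypothetical.

```python
from types import MappingProxyType


class TableWithView:
    """A base table keyed by user_id plus a view of the same rows keyed by email."""

    def __init__(self):
        self._base = {}       # user_id -> row
        self._by_email = {}   # email -> row (the "view")
        self.physical_writes = 0

    def write(self, user_id, email, name):
        """Write to the base table; the view is updated automatically."""
        old = self._base.get(user_id)
        if old is not None and old["email"] != email:
            # The view's key changed, so the stale view row must be removed first.
            del self._by_email[old["email"]]
            self.physical_writes += 1
        row = {"user_id": user_id, "email": email, "name": name}
        self._base[user_id] = row
        self._by_email[email] = dict(row)
        self.physical_writes += 2  # one write for the base table, one for the view

    @property
    def view(self):
        """Read-only access: callers can query the view but cannot write to it."""
        return MappingProxyType(self._by_email)


table = TableWithView()
table.write("u1", "a@example.com", "Alice")
name_by_email = table.view["a@example.com"]["name"]
writes_after_insert = table.physical_writes
table.write("u1", "new@example.com", "Alice")  # changing the view key costs extra work
writes_after_update = table.physical_writes
```

The application only ever calls `write` on the base table, which is the convenience a materialized view offers, while the counter shows the additional work each write triggers.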
To provide higher availability and performance, Cassandra keeps multiple copies of data. The cluster is organized into a logical ring. When creating a keyspace on a cluster, we specify the number of replicas we would like to keep and the strategy, based on single or multiple data centers, for where to keep them. A replication factor of ‘n’ means keeping ‘n’ copies of the data.
The replication strategy for a single data center is called `SimpleStrategy`, and the one for multiple data center configurations is called `NetworkTopologyStrategy`.
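As a rough illustration of the ring and the replication factor, the sketch below places the first replica on the node that the partition key hashes to and the remaining replicas on the next nodes around the ring, which is the general idea behind the single data center strategy. The node names and the `replicas_for` helper are hypothetical, MD5 with a modulo replaces Cassandra's real partitioner and token ranges, and the multi data center strategy is not modeled.

```python
import hashlib

RING = ["node-a", "node-b", "node-c", "node-d", "node-e"]  # hypothetical nodes, in ring order


def replicas_for(partition_key: str, replication_factor: int) -> list:
    """Return the nodes that hold a copy of the row with this partition key."""
    if not 1 <= replication_factor <= len(RING):
        raise ValueError("replication factor must be between 1 and the number of nodes")
    token = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    start = token % len(RING)  # node that owns the key (simplified)
    # Walk clockwise around the ring, wrapping past the last node back to the first.
    return [RING[(start + i) % len(RING)] for i in range(replication_factor)]


replicas = replicas_for("sensor-1", 3)  # replication factor of 3 -> three distinct nodes
```

With a replication factor of 3, the row survives the loss of any two of the three nodes in `replicas`.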
When we use data replication, there is the possibility that the copies of data may get out of sync. We can specify how we want to handle consistency here: in some cases we might be OK with replicas being inconsistent, but in other cases we might want a majority or all of the replicas to report the same answer. Based on our requirements we can specify the consistency level, and once it is reached we mark the read or write operation as a success.
We can specify these levels at three scopes: the entire cluster, a single data center, and an individual read or write operation.
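The arithmetic behind "once that is reached" is simple enough to sketch in Python. The example covers three of Cassandra's consistency levels (`ONE`, `QUORUM`, `ALL`) within a single data center; the helper names `required_acks` and `operation_succeeds` are made up for illustration.

```python
def required_acks(level: str, replication_factor: int) -> int:
    """How many replicas must respond before the operation counts as successful."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1  # a strict majority of the replicas
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {level}")


def operation_succeeds(level: str, replication_factor: int, acks_received: int) -> bool:
    """True once enough replicas have acknowledged the read or write."""
    return acks_received >= required_acks(level, replication_factor)


# With 3 replicas and one of them down, only 2 acknowledgements can arrive.
quorum_ok = operation_succeeds("QUORUM", 3, 2)
all_ok = operation_succeeds("ALL", 3, 2)
```

This is the trade-off in miniature: `QUORUM` tolerates one unavailable replica out of three, while `ALL` gives the strongest guarantee but fails as soon as any replica is unreachable.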
Let’s end here, but we will also have a follow-up article on Cassandra architecture soon.
You can learn more about Cassandra from the official Apache documentation: https://cassandra.apache.org/doc/latest/
If you liked this and it helped you get familiar with Cassandra, please give a shout-out via claps. Keep in touch, we have more to come.