Cassandra Data Model Design Guide

2023-01-05 13:05:56

Abstract: This article walks through the five steps of Cassandra data modeling using a simple example. A translation follows.

We recently published an article at Instaclustr about data modeling mistakes that often occur in Cassandra. The article proved very popular, and it prompted me to think about how to design a high-quality Cassandra data model without falling into those traps.

On the Internet, you can find many excellent articles on the rules and patterns of data model design, such as the Apache Cassandra data modeling guide and data modeling best practices.

However, there is no detailed step-by-step procedure to guide you in analyzing your data and applying those rules and patterns. This article tries to fill that gap.

Phase 1: Understanding the data

There are two steps in this phase, both of which are designed to better understand the data you are modeling and the access patterns you need.

Define the data domain

The first step is to understand the data domain in depth. Coming from a relational data modeling background, I tend to capture the entities, their keys, and their relationships by drawing an ER diagram. If you are more comfortable with another notation, use that instead. You need to understand the following key points at a logical level:

What are the entities (or objects) in the data model?

What are the primary key attributes of each entity?

What are the relationships between entities (i.e., references from one to another)?

What is the relative cardinality of the relationship (for example, assuming a one-to-many relationship, is the average 1 to 10 or 1 to 10000)?
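
The answers to the questions above can be written down as plain data before any table is designed. The following is a minimal sketch, assuming a hypothetical customer and transaction domain; the class names, field names, and the cardinality figure are illustrative assumptions, not part of the original article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str
    key: tuple  # attributes that make up the primary key


@dataclass(frozen=True)
class Relationship:
    parent: str
    child: str
    avg_children: int  # rough average number of children per parent


# Hypothetical domain, used only for illustration.
ENTITIES = [
    Entity("customer", ("customer_id",)),
    Entity("transaction", ("transaction_id",)),
]
RELATIONSHIPS = [
    Relationship("customer", "transaction", avg_children=1000),
]


def describe(rel: Relationship) -> str:
    """Render a one-to-many relationship with its relative cardinality."""
    return f"{rel.parent} 1 -> {rel.avg_children} {rel.child}"


for rel in RELATIONSHIPS:
    print(describe(rel))
```

Recording the average cardinality next to each relationship keeps the "1 to 10 or 1 to 10000" question visible later, when deciding how much data ends up in one partition.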

Define the required access patterns

Next, work out how you need to access the data:

List the access paths the data needs to support, for example:

Look up the transaction records within a date range by customer ID, and then look up the details of a particular transaction from the results.

Look up by a specific server and metric, and retrieve x metric values in ascending order of age.

Look up by a specific server and metric, and retrieve x metric values starting from a specific point in time.

For a given sensor, retrieve all readings for multiple metrics for a given date.

For a given sensor, retrieve the current value.

Keep in mind that any update to a record is also an access path and needs to be considered carefully.

Determine which access paths are the most critical from a performance perspective. Do some need to be as fast as possible, while others can tolerate multiple reads or range scans?

Keep in mind that you need a fairly complete understanding of how the data will be accessed at this stage: the trade-off for Cassandra's performance, reliability, and scalability is that a table can only be queried efficiently in the ways it was designed for.
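
The access paths listed in this step can also be captured as data, which makes it easy to see which lookup keys exist and which paths are performance critical. This is a sketch only: the field names, the write path, and the choice of which paths are marked latency critical are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessPath:
    description: str
    lookup_key: tuple            # attributes the lookup starts from
    is_write: bool = False       # updates are access paths too
    latency_critical: bool = False


# Which paths are critical is an assumption for this example.
ACCESS_PATHS = [
    AccessPath("transactions in a date range for a customer", ("customer_id",)),
    AccessPath("details of one transaction", ("transaction_id",)),
    AccessPath("x values of a metric on a server", ("server", "metric"),
               latency_critical=True),
    AccessPath("current value for a sensor", ("sensor_id",),
               latency_critical=True),
    AccessPath("record a new sensor reading", ("sensor_id",), is_write=True),
]


def critical_paths(paths):
    """Paths that must be served as fast as possible."""
    return [p for p in paths if p.latency_critical]


def lookup_keys(paths):
    """Distinct lookup keys in first-seen order."""
    seen = []
    for p in paths:
        if p.lookup_key not in seen:
            seen.append(p.lookup_key)
    return seen
```

Each distinct lookup key in this list is a candidate for its own table in the next phase.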

Phase 2: Understanding the entities

There are two steps in this phase, which identify the primary and secondary entities associated with the data.

Identify the primary access entity

Now we move from analyzing the data domain and application requirements to designing the data model. Before starting this phase, make sure you have done a solid job on the two steps above.

The main idea at this stage is to denormalize the data into as few tables as possible based on your access patterns. For each lookup by key that your access patterns require, you need a table that satisfies it. I coined the term "primary access entity" to describe the entity a lookup uses (for example, a lookup by customer ID uses the customer as the primary access entity, and a lookup by server and metric name uses the server-metric entity as the primary access entity).

The primary access entity defines the partition level of the resulting denormalized table (i.e., the table has one partition for each instance of the primary access entity).
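
To make "one partition per instance of the primary access entity" concrete, here is a toy in-memory model of a table partitioned by (server, metric), with rows inside each partition ordered by timestamp. It only imitates the layout and is not how Cassandra is implemented; the class and method names are hypothetical.

```python
import bisect
from collections import defaultdict


class ServerMetricTable:
    """Toy stand-in for a table partitioned by (server, metric)."""

    def __init__(self):
        # partition key -> rows kept sorted by timestamp (the clustering column)
        self._partitions = defaultdict(list)

    def insert(self, server, metric, ts, value):
        bisect.insort(self._partitions[(server, metric)], (ts, value))

    def newest_first(self, server, metric, limit):
        """Up to `limit` rows of one partition, youngest row first."""
        rows = self._partitions[(server, metric)]
        return rows[::-1][:limit]

    def since(self, server, metric, start_ts, limit):
        """Up to `limit` rows with timestamp >= start_ts, oldest first."""
        rows = self._partitions[(server, metric)]
        start = bisect.bisect_left(rows, (start_ts,))
        return rows[start:start + limit]


table = ServerMetricTable()
for ts, value in [(3, 0.3), (1, 0.1), (2, 0.2)]:
    table.insert("web1", "cpu", ts, value)
table.insert("web2", "cpu", 1, 0.9)
```

Both lookups touch exactly one partition, which is what makes the server-metric entity a good primary access entity for the two metric access paths.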

You can choose to use a secondary index to satisfy some access patterns instead of duplicating data under a different primary access entity. Keep in mind that a column included in a secondary index should have a significantly lower cardinality than the table being indexed, and be aware of how often the indexed values are updated.
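
One rough way to reason about the cardinality guideline is to compare the number of distinct values in a column with the number of rows in a sample. The sketch below is only a heuristic; the function names, the sample data, and the 0.1 threshold are arbitrary assumptions for illustration, not a Cassandra rule.

```python
def cardinality_ratio(rows, column):
    """Distinct values of `column` divided by the number of sampled rows."""
    if not rows:
        raise ValueError("need at least one row")
    return len({row[column] for row in rows}) / len(rows)


def looks_indexable(rows, column, max_ratio=0.1):
    # max_ratio is an illustrative threshold chosen for this example only.
    return cardinality_ratio(rows, column) <= max_ratio


# Hypothetical sample: 100 customers spread over 4 countries.
sample = [
    {"customer_id": i, "country": ["NZ", "AU", "US", "DE"][i % 4]}
    for i in range(100)
]
```

In this sample, `country` has far fewer distinct values than there are rows, while `customer_id` is unique per row and would be better served by its own table.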

For the example of the access pattern above, we will define the following primary access entities:

Customer and transaction (get the list of transactions from the customer entity, then look up the transaction details from the transaction entity)

Server-metric

Sensor
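
As a sketch of how a primary access entity might translate into a table definition, the helper below renders a CQL CREATE TABLE statement in which the entity's key becomes the partition key and the ordering attributes become clustering columns. The table name, column names, and column types are assumptions for illustration.

```python
def create_table_cql(table, columns, partition_key, clustering=()):
    """Render a CQL CREATE TABLE statement.

    columns: dict of column name -> CQL type
    partition_key: column names forming the partition key
    clustering: column names forming the clustering key, in order
    """
    cols = ", ".join(f"{name} {cql_type}" for name, cql_type in columns.items())
    pk = "(" + ", ".join(partition_key) + ")"
    key = ", ".join([pk, *clustering])
    return f"CREATE TABLE {table} ({cols}, PRIMARY KEY ({key}));"


# Hypothetical table for "transactions in a date range for a customer".
ddl = create_table_cql(
    "transactions_by_customer",
    {"customer_id": "uuid", "tx_date": "timestamp", "tx_id": "uuid"},
    partition_key=("customer_id",),
    clustering=("tx_date", "tx_id"),
)
print(ddl)
```

Clustering by date inside the customer partition is what allows the date-range lookup to be answered from a single partition.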

Allocation of secondary entities

The next step is to find a place to store the data of entities that were not chosen as primary access entities (these are called secondary entities). You can do this in two ways:

Take data from a parent secondary entity in a one-to-many relationship and store multiple copies of it at the primary access entity level (for example, storing the customer's phone number in each of the customer's order records).

Take data from a child secondary entity in a one-to-many relationship and store it at the primary access entity level, using either a clustering key or a multi-value type such as a list or map (for example, adding a list of records to a transaction table).
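
The two options above can be illustrated with plain dictionaries standing in for rows. This is a minimal sketch: the function names and the sample customer, order, and transaction data are hypothetical.

```python
def copy_parent_into_children(parent, children, fields):
    """Store a copy of selected parent fields on every child row."""
    extra = {name: parent[name] for name in fields}
    return [{**child, **extra} for child in children]


def embed_children_in_parent(parent, children, into):
    """Store the child rows on the parent row as a list-valued column."""
    return {**parent, into: list(children)}


# Parent secondary entity (customer) copied into primary entity rows (orders).
customer = {"customer_id": 1, "phone": "555-0100"}
orders = [{"order_id": 10}, {"order_id": 11}]
orders_with_phone = copy_parent_into_children(customer, orders, ["phone"])

# Child secondary entity rows embedded in a primary entity row (transaction).
transaction = {"tx_id": 7}
records = [{"item": "A"}, {"item": "B"}]
transaction_with_records = embed_children_in_parent(transaction, records, "records")
```

The first option trades extra storage and extra writes for reads that need no second lookup; the second keeps related child rows in the same partition as their parent.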

For some secondary entities, there is only one related primary access entity, so there is no need to choose which direction to push the data into. For other entities, you need to choose which primary access entities to push the data into.

For optimal read performance, push a copy of the data into every primary access entity that serves as an access path to the data in the secondary entity.
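
The cost of this choice is that every write has to reach each copy. The following sketch, using dictionaries as stand-ins for tables, shows one record being written under two hypothetical access paths; the table names and key functions are assumptions.

```python
def write_everywhere(tables, key_funcs, record):
    """Store a copy of `record` in every table, keyed by that table's key function.

    Returns the number of writes issued, one per access path.
    """
    for name, key_of in key_funcs.items():
        tables.setdefault(name, {})[key_of(record)] = dict(record)
    return len(key_funcs)


# Hypothetical access paths for a transaction record.
key_funcs = {
    "by_customer": lambda r: (r["customer_id"], r["tx_id"]),
    "by_transaction": lambda r: r["tx_id"],
}
tables = {}
writes = write_everywhere(
    tables, key_funcs, {"tx_id": 7, "customer_id": 1, "amount": 20}
)
```

Each read is then served from the table built for its access path, at the price of one additional write per copy.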
