Azure Cosmos Database Concepts: Part 1

Published in

FAUN — Developer Community 🐾

May 23, 2020

NoSQL has become a common choice in modern applications. Over the last decade, many NoSQL databases were created, driven largely by the exponential growth in the amount of data that applications have to store.

NoSQL databases are typically used when we deal with a high volume of data (events), e.g. time series, user profiling, click-stream data, etc. A few examples of open-source NoSQL databases are Apache Cassandra, Apache HBase and Couchbase. Cloud companies like Google, Amazon and Microsoft not only provide open-source databases as a service, they also have their own NoSQL databases, which are often preferred when developing applications on their infrastructure. A few examples are:

Google: Google Cloud Datastore

Microsoft: Azure Cosmos DB

Amazon: DynamoDB

There is not much information available on the internet about how these databases work internally, because they are offered as a service by the Cloud companies. According to the vendors, these are the fastest ones available in the market. Moving from open-source NoSQL to a Cloud offering, and vice versa, is quite common. An open-source NoSQL database requires a lot of learning and POCs: the more documentation there is, the more testing you need to do. Cloud companies mostly present how they store data and how you can interact with the database using their SDKs. On the internet, there is always debate about whether we should use a NoSQL database offered by a Cloud company.

Now you might ask why people use databases provided by Cloud companies if the developers who use them don’t even know what’s happening inside. The best thing Cloud companies offer is support: they built the database and they offer it as a service, so they are responsible for it. In the case of open source, you have to depend on the community, which might not fix issues immediately.

Recently, I had a requirement to use the Cosmos database, and I had to model my data so that it fits in Cosmos and I can avoid using other databases.

The Cosmos database is a NoSQL database provided by Microsoft. It is distributed and elastically scalable. Data is saved in the form of documents. Cosmos provides several ways to access the stored data; one of the popular ones is the SQL API, because we are all familiar with SQL. We will go through some Cosmos database terminology.

Database

A database is a high-level unit in Cosmos. It can be thought of as the “database” in any RDBMS. When we create the database, we can provision throughput in Request Units per second (RU/s), which determines the cost of using the Cosmos database. A Request Unit is the measure of what an operation costs in Cosmos, whether it is a Save, Upsert, Delete or Retrieve. The minimum throughput that can be provisioned is 400 RU per second as per the Cosmos Database documentation.
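To make the provisioning number more concrete, here is a rough sizing sketch. The per-operation costs are illustrative assumptions (real costs depend on item size, indexing and consistency level and should be measured), and `estimate_rus` is a hypothetical helper, not part of any SDK. It assumes throughput is set in steps of 100 RU/s.

```python
import math

# Assumed per-operation costs in Request Units. These are placeholders:
# measure the real charge of your own operations before relying on them.
ASSUMED_RU_COST = {"read": 1.0, "write": 5.0}
MIN_PROVISIONED_RUS = 400  # minimum cited in this article


def estimate_rus(ops_per_second: dict) -> int:
    """Return the RU/s to provision, rounded up to the next multiple of 100."""
    needed = sum(ASSUMED_RU_COST[op] * rate for op, rate in ops_per_second.items())
    rounded = math.ceil(needed / 100) * 100
    return max(MIN_PROVISIONED_RUS, rounded)


print(estimate_rus({"read": 500, "write": 100}))  # 500*1 + 100*5 = 1000
print(estimate_rus({"read": 50}))                 # below the minimum, so 400
```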

Containers

These are like tables in an RDBMS and collections in MongoDB. Each container is tied to a database. We can either provision separate Request Units/second for a container or set them at the database level. If we don’t provision Request Units/second for a container, throughput is taken from the database level, which means all such containers share the Request Units provisioned on the database.

If you want to provision Request Units/second per database rather than per container, see https://azure.microsoft.com/en-in/blog/sharing-provisioned-throughput-across-multiple-containers-in-azure-cosmosdb/.

When creating a container, we can choose whether to add it to an existing database or create a new database. We also have to provide a partition key per container. A unique key can also be provided at the time of provisioning a container. Once the container is created, the partition key and unique key can’t be changed.

Partition Key

One of the important concepts in the Cosmos database is the partition key. There are two kinds of partitions: logical partitions and physical partitions.

The partition key creates logical partitions of data. Each logical partition is mapped to a physical partition in the Cosmos database. The mapping is managed internally by Azure Cosmos to provide performance and scalability, and we can’t modify it. Whenever more throughput or storage is needed, Cosmos spreads the logical partitions over more physical partitions.

The idea behind logical partitions is to distribute data evenly. If the partition key does not distribute the data evenly across physical partitions, some partitions may become “hot” and rate-limiting might occur on them.

If the workload running on a logical partition consumes more than the throughput that was allocated to that logical partition, your operations get rate-limited. When rate-limiting occurs, you can either increase the provisioned throughput for the entire container or retry the operations.
~ Cosmos DB Documentation
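The "retry the operations" option from the quote can be sketched as follows. This is a minimal illustration, not the SDK's behaviour: `RateLimited` is a hypothetical stand-in for a throttled response (the service signals this with HTTP 429 and a suggested wait time), and `flaky_write` simulates an operation that is throttled twice.

```python
import time


class RateLimited(Exception):
    """Stand-in for a throttled (HTTP 429) response."""

    def __init__(self, retry_after_s: float):
        super().__init__("rate limited")
        self.retry_after_s = retry_after_s


def with_retries(operation, max_attempts: int = 5, sleep=time.sleep):
    """Call operation(); on RateLimited, wait as instructed and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except RateLimited as err:
            if attempt == max_attempts:
                raise  # give up and surface the throttling error
            sleep(err.retry_after_s)


# Hypothetical operation that is throttled twice before it succeeds.
calls = {"count": 0}


def flaky_write():
    calls["count"] += 1
    if calls["count"] <= 2:
        raise RateLimited(retry_after_s=0.01)
    return "written"


print(with_retries(flaky_write))  # succeeds on the third attempt
```

If retries keep failing, the other option from the quote applies: raise the provisioned throughput, or revisit the partition key so the load is spread more evenly.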

Cosmos also limits the size of a document (~2 MB) and of a logical partition (~10 GB). If your data is bigger than that, it is better to store it in a blob store and keep a link to the blob in the Cosmos document as a reference.
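The blob-link idea can be sketched like this. The size check is only an approximation based on the serialized JSON, and `offload_large_field`, `fake_upload` and the `BlobUrl` suffix are hypothetical names chosen for illustration; a real implementation would call an actual blob-store client.

```python
import json

MAX_DOCUMENT_BYTES = 2 * 1024 * 1024  # ~2 MB document limit mentioned above


def document_size(doc: dict) -> int:
    """Approximate size of the document when serialized as UTF-8 JSON."""
    return len(json.dumps(doc).encode("utf-8"))


def offload_large_field(doc: dict, field: str, upload) -> dict:
    """If doc is too large, move doc[field] to blob storage and keep a link.

    `upload` is a hypothetical callable that stores the payload and returns a URL.
    """
    if document_size(doc) <= MAX_DOCUMENT_BYTES:
        return doc
    slim = {k: v for k, v in doc.items() if k != field}
    slim[field + "BlobUrl"] = upload(doc[field])
    return slim


def fake_upload(payload) -> str:
    """Stand-in for a real blob-store upload; returns a placeholder link."""
    return "https://example.invalid/blobs/manual-1"


big = {"id": "1", "manual": "x" * (3 * 1024 * 1024)}
slim = offload_large_field(big, "manual", fake_upload)
print(sorted(slim))  # ['id', 'manualBlobUrl']
```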

If no single field in your document works well as a partition key, you can also create a synthetic partition key. Azure Cosmos DB uses hash-based partitioning to spread logical partitions across physical partitions: it hashes the partition key, and the hashed value determines the physical partition. It is therefore advisable to choose a property with high cardinality as the partition key. A low-cardinality key is a bad choice because it concentrates data in a few partitions, which defeats the purpose of Cosmos.
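The effect of cardinality can be seen in a small simulation. This is only a sketch: the SHA-256 hash and the modulo step are assumptions made for illustration and are not how Cosmos assigns partitions internally, and `physical_partition` and `distribution` are hypothetical helpers.

```python
import hashlib
from collections import Counter


def physical_partition(partition_key: str, partition_count: int) -> int:
    """Map a partition key to a physical partition by hashing (illustrative)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


def distribution(keys, partition_count: int = 4) -> Counter:
    """Count how many items land on each physical partition."""
    return Counter(physical_partition(k, partition_count) for k in keys)


# Low cardinality: 10,000 items but only two distinct key values,
# so at most two physical partitions ever receive data.
low = distribution(["Sports", "Toys"] * 5000)

# High cardinality: 10,000 items, each with its own key value.
high = distribution(f"product-{i}" for i in range(10000))

print("low :", sorted(low.items()))
print("high:", sorted(high.items()))
```

With only two key values, the busiest partition holds at least half of all items no matter how many physical partitions exist, which is the "hot" partition problem described earlier.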

In the above diagram, if we are storing product data and take category as the partition key, we can store only 10 GB of data for a single category, so we will face issues as the number of items grows. To avoid that, we can’t use category alone as the partition key; we have to combine it with another field, and a good choice is the subcategory. This means the items in a single subcategory must stay under 10 GB.
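Building such a synthetic key is straightforward. A minimal sketch, assuming the field names `category`, `subCategory` and `partitionKey` from the sample document in this article and an underscore as the separator; both helper functions are hypothetical.

```python
def synthetic_partition_key(category: str, sub_category: str) -> str:
    """Combine two fields into one partition key value."""
    return f"{category}_{sub_category}"


def with_partition_key(product: dict) -> dict:
    """Return a copy of the product with the synthetic key added."""
    key = synthetic_partition_key(product["category"], product["subCategory"])
    return {**product, "partitionKey": key}


product = {"category": "Sports", "subCategory": "Athletic"}
print(with_partition_key(product)["partitionKey"])  # Sports_Athletic
```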

Id Field

Every document has an id field. Depending on the tool or SDK, it can be generated automatically when we don’t supply one. It is unique per logical partition. The combination of id and partition key acts as the primary key in the Cosmos database.

Unique Key

We declare this at the time of container creation, and we can’t change it once the container is created. Uniqueness is enforced within a logical partition, so it is the combination of unique key and partition key that must be unique in the container.
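The two uniqueness rules can be modelled with a toy in-memory container. This is a sketch of the rules only, not of the real service: `InMemoryContainer` is a hypothetical class, and it assumes both the id and the unique key are scoped to a logical partition.

```python
class InMemoryContainer:
    """Toy model of a container's uniqueness rules (not the real service)."""

    def __init__(self, partition_key_field: str, unique_key_field: str):
        self.pk_field = partition_key_field
        self.uk_field = unique_key_field
        self.items = {}             # (partition key, id) -> document
        self.unique_values = set()  # (partition key, unique key value)

    def create_item(self, doc: dict) -> None:
        pk = doc[self.pk_field]
        id_key = (pk, doc["id"])
        unique_key = (pk, doc[self.uk_field])
        if id_key in self.items:
            raise ValueError(f"id {doc['id']!r} already exists in partition {pk!r}")
        if unique_key in self.unique_values:
            raise ValueError(f"{self.uk_field} must be unique within partition {pk!r}")
        self.items[id_key] = doc
        self.unique_values.add(unique_key)


container = InMemoryContainer("partitionKey", "productId")
container.create_item({"id": "1", "productId": "p-1", "partitionKey": "Sports_Athletic"})
# Same id and productId in a different logical partition: accepted.
container.create_item({"id": "1", "productId": "p-1", "partitionKey": "Toys_Puzzles"})
try:
    # Same productId in the same logical partition: rejected.
    container.create_item({"id": "2", "productId": "p-1", "partitionKey": "Sports_Athletic"})
except ValueError as err:
    print("rejected:", err)
```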

Sample Document

{
  "productId": "65886266-f424-4285-9844-c0eefb6576ba",
  "category": "Sports",
  "subCategory": "Athletic",
  "partitionKey": "Sports_Athletic"
}

We will create a database “all-products” and a container “product”, with “partitionKey” as the partition key and “productId” as the unique key.
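As a rough illustration of what this setup amounts to, the sketch below builds a container definition as plain JSON. The field names (`partitionKey`, `paths`, `kind`, `uniqueKeyPolicy`, `uniqueKeys`) follow the shape I understand the Cosmos DB REST API to use for a container resource; treat them as an assumption to check against the official documentation. `container_definition` is a hypothetical helper and nothing here talks to Azure.

```python
import json


def container_definition(container_id: str, partition_key_path: str,
                         unique_key_paths: list) -> dict:
    """Build a container definition with a partition key and one unique key."""
    return {
        "id": container_id,
        "partitionKey": {"paths": [partition_key_path], "kind": "Hash"},
        "uniqueKeyPolicy": {"uniqueKeys": [{"paths": list(unique_key_paths)}]},
    }


database_id = "all-products"
definition = container_definition("product", "/partitionKey", ["/productId"])

print(database_id)
print(json.dumps(definition, indent=2))
```

Paths start with a slash because they point at a property inside the document, so `/partitionKey` refers to the `partitionKey` field of the sample document.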