Supervising Infrastructure and Contractual Information in Managed Applications

Why is monitoring so important when it comes to infrastructure and contractual data?

Raghunandan Gupta
Adobe Tech Blog

--

In the world of Managed Applications, the monitoring of infrastructure and contractual data plays a crucial role.

By keeping a close eye on these aspects, application owners can ensure optimal performance, security, compliance, and resource utilization.

In enterprise applications, a vendor offers multiple licenses, each with its own associated costs. Each license carries different contractual information, which is important for providing a better experience to customers.

Let us delve into the key reasons why monitoring infrastructure and contractual data is essential in managed applications.

Infrastructure Monitoring

Managed applications heavily rely on dedicated infrastructure, managed either by the application owner or a hosting provider.

Monitoring this infrastructure is vital to ensuring stability, performance, and availability.

By promptly identifying and resolving potential issues or bottlenecks, application owners can optimize the overall performance of their applications. Proactive maintenance also becomes possible, preventing any downtime that may disrupt user experience.

Contractual Obligations

Managed applications often operate under specific contractual agreements with customers.

These agreements may include Service Level Agreements (SLAs) that define performance guarantees, uptime requirements, and response times.

Monitoring contractual data allows application owners to track and report compliance with these obligations. It enables them to detect any deviations from agreed-upon metrics and fosters effective communication with customers, ensuring service quality and adherence to contractual terms.

Security and Compliance

Monitoring infrastructure data is instrumental in detecting and mitigating security threats, ensuring compliance with data protection regulations, and safeguarding sensitive information.

By closely monitoring infrastructure activities, application owners can identify any suspicious or unauthorized access attempts, as well as vulnerabilities that may compromise the application’s security.

Compliance requirements often necessitate continuous monitoring and auditing of infrastructure data, ensuring data privacy and protection.

Performance Optimization

To optimize performance, application owners need to monitor infrastructure data such as database usage, SFTP space, memory consumption, and disk I/O.

These metrics provide valuable insights into resource utilization. By analyzing these metrics, owners can identify areas for optimization, such as increasing resource allocation or optimizing code. This improves the overall efficiency of their applications.

Monitoring in practice

We provide a product that offers multiple licenses, depending on the size of the customer. Each license includes unique contractual data, and infrastructure capacity is associated with this data.

We require a system to closely monitor various metrics to enhance customer experience and provide visibility to customers. This system will also highlight any necessary actions that customers need to take to ensure the smooth operation of the application.

Requirements

This system needs to:

  • Present the collected data in an aggregated format, allowing users to view it on a weekly, monthly, and yearly basis.
  • Display the contractual terms and utilization trends of different metrics to the end customer.
  • Enable users to retrieve data at any desired point in time.
  • Ensure the ability to update the metadata of data points in the event of changes, such as post-migration updates.
  • Provide the capability to configure the data retention policy to comply with legal requirements.
  • Allow users to set custom alert conditions on the collected data for different monitoring metrics, triggering notifications or actions based on specific criteria, for example, sending an alert only when two consecutive data points breach a certain threshold (a sketch of such a check follows this list).
  • Store each data point, so that we can address customer inquiries regarding sent alerts.
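
As a sketch of the last alerting requirement, assuming metrics land in a table shaped like the db_usage_metrics schema shown later, a check for instances whose two most recent data points both exceed their allocated capacity could look like this:

-- Flag instances whose two most recent data points both breach allocated capacity
-- (the "two consecutive breaches" rule); such instances would trigger an alert.
SELECT instance_id
FROM (
    SELECT instance_id,
           value,
           allocated,
           ROW_NUMBER() OVER (PARTITION BY instance_id ORDER BY timestamp DESC) AS rn
    FROM db_usage_metrics
) recent
WHERE rn <= 2
GROUP BY instance_id
HAVING COUNT(*) = 2
   AND bool_and(value > allocated);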

Exploration

We required a mechanism to record events occurring over a series of time intervals. To fulfil the requirements of the product, we investigated several options for storing time series data.

Open Time Series Database (OpenTSDB)

In OpenTSDB, the primary focus is on storing and retrieving time series data rather than updating individual data points.

OpenTSDB is designed for efficient storage and querying of time series data, making it suitable for applications that require high-speed data ingestion and retrieval.

While it is technically possible to update data points in OpenTSDB, it is not a frequent practice or recommended approach. OpenTSDB follows an append-only architecture, meaning that once data points are written, they are typically immutable and cannot be directly modified.

Instead, the general approach is to write new data points with updated values or metadata.

If you need to adjust or correct data points, it is advisable to write new data points with the corrected values rather than updating existing ones. This ensures data integrity and maintains a reliable audit trail of the data history.

In summary, while OpenTSDB supports limited ways to update data points, it is more aligned with storing and retrieving time series data efficiently, rather than focusing on frequent updates or modifications of individual data points.

Timescale Database

TimescaleDB is an extension for PostgreSQL that turns it into a time series database, offering the advantages of PostgreSQL combined with the speed needed to handle time series data.

We chose Azure because our services are already deployed there, and virtual network peering makes it easy for the services and the database to communicate.
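
For completeness, on a PostgreSQL server where the extension binaries are available, TimescaleDB is enabled per database with a single statement:

-- Enable the TimescaleDB extension in the target database
CREATE EXTENSION IF NOT EXISTS timescaledb;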

Capacity Estimation

We collect more than five metrics every 15 minutes across thousands of instances, which on average amounts to roughly a million events per hour.

Each data point in storage consists of several columns: a mandatory timestamp column (BIGINT), an ID column (VARCHAR(255)), a value column (BIGINT), an allocated column (BIGINT) that records the capacity allocated for metrics such as DB usage and SFTP usage, additional metadata, and a unique message ID that serves as the event identifier.

When storing each data point, the average size of a table row is calculated as follows: three BIGINT columns (timestamp, value, allocated) at 8 bytes each, four VARCHAR(255) columns (id, instance_id, tenant_id, org_id) at up to 255 bytes each, and a VARCHAR(1000) message_id column, which amounts to 24 + 1020 + 1000 = 2044 bytes (approximately 2 KB). It's important to note that indexes also exist for each table, which adds to the overall space consumption.

Computing the total database space that should be provisioned: roughly a million data points per hour, each taking approximately 2 KB:

1,000,000 × 2 KB ≈ 2 GB per hour.

This will keep increasing as we add more metrics.

We selected auto-scale while creating the Postgres instance so that storage growth does not cause any downtime.
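
Once data is flowing, the estimate can be sanity-checked against actual on-disk usage; a minimal sketch, assuming TimescaleDB 2.x and the db_usage_metrics hypertable defined in the next section:

-- Report the actual on-disk size of the hypertable, including all chunks and indexes.
SELECT pg_size_pretty(hypertable_size('db_usage_metrics')) AS total_size;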

Database Schema

Below is one of the table schemas we created in the Postgres database to store this data.

CREATE TABLE db_usage_metrics (
    timestamp   BIGINT        NOT NULL,
    id          VARCHAR(255)  NOT NULL,
    instance_id VARCHAR(255)  NOT NULL,
    value       BIGINT        NOT NULL,
    allocated   BIGINT        NOT NULL,
    metadata    JSON          DEFAULT NULL,
    tenant_id   VARCHAR(255)  DEFAULT NULL,
    org_id      VARCHAR(255)  DEFAULT NULL,
    message_id  VARCHAR(1000) NOT NULL,
    CONSTRAINT unique_message_id_db_usage_metrics UNIQUE (message_id, timestamp),
    PRIMARY KEY (timestamp, id)
)
WITH (
    OIDS = FALSE
);

We created a hypertable to store the time series data, using the timestamp column as the time dimension.

We wanted each chunk to cover 7 days, so we set the chunk interval to 7 × 24 × 60 × 60 × 1000 = 604,800,000 milliseconds.

SELECT create_hypertable('db_usage_metrics', 'timestamp', chunk_time_interval => 604800000);
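
Because the time column here is a BIGINT holding epoch milliseconds rather than a native timestamp type, TimescaleDB also needs to be told how to compute the current time for this table before background policies (such as the retention policy described later) can run. A minimal sketch, where unix_now_ms is simply a helper name chosen for illustration:

-- Tell TimescaleDB what "now" means in the same unit as the timestamp column
-- (epoch milliseconds), so that time-based policies know where the present is.
CREATE OR REPLACE FUNCTION unix_now_ms() RETURNS BIGINT
LANGUAGE SQL STABLE AS $$
    SELECT (extract(epoch FROM now()) * 1000)::BIGINT
$$;

SELECT set_integer_now_func('db_usage_metrics', 'unix_now_ms');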

Indexes were created based on search queries.

CREATE INDEX ON db_usage_metrics (timestamp DESC, org_id, tenant_id, instance_id);

CREATE INDEX ON db_usage_metrics (instance_id DESC, timestamp DESC);

CREATE INDEX ON db_usage_metrics (tenant_id DESC, timestamp DESC);

CREATE INDEX ON db_usage_metrics (org_id DESC, timestamp DESC);

Data Ingestion and validation

A producer is a long-running process that constantly retrieves data from the source system and pushes it to different Kafka topics which are created separately for different metrics. Subsequently, a monitoring service consumes this data and validates it before ingesting it into the database.
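
Since each event carries a unique message ID and the table enforces uniqueness on it, ingestion can be made idempotent against consumer retries or Kafka redeliveries. A minimal sketch (the :parameter placeholders are illustrative):

-- Idempotent insert: a replayed message hits the unique (message_id, timestamp)
-- constraint and is silently skipped instead of creating a duplicate row.
INSERT INTO db_usage_metrics
    (timestamp, id, instance_id, value, allocated, metadata, tenant_id, org_id, message_id)
VALUES
    (:timestamp, :id, :instanceId, :value, :allocated, :metadata, :tenantId, :orgId, :messageId)
ON CONFLICT ON CONSTRAINT unique_message_id_db_usage_metrics DO NOTHING;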

Data Retrieval

We offer customers both current and historical perspectives. Customers can choose from predefined time periods, and based on their selection, they’ll see detailed data points along with instances where thresholds were exceeded.

To fetch daily aggregates, the query below can be used.

SELECT to_timestamp(time_bucket(86400000, rum.timestamp) / 1000) AS date,
       last(rum.value, rum.timestamp)                             AS latest_used,
       last(rum.allocated, rum.timestamp)                         AS latest_allocated,
       rum.instance_id                                            AS instance_id,
       rum.tenant_id                                              AS tenant_id,
       rum.org_id                                                 AS org_id,
       avg(rum.value)                                             AS avg_used,
       avg(rum.allocated)                                         AS avg_allocated,
       min(rum.value)                                             AS min_used,
       min(rum.allocated)                                         AS min_allocated,
       max(rum.value)                                             AS max_used,
       max(rum.allocated)                                         AS max_allocated,
       max(rum.build_number)                                      AS build_number,
       max(rum.environment)                                       AS environment
FROM (SELECT first_value(build_number)
               OVER (PARTITION BY instance_id, tenant_id, org_id
                     ORDER BY timestamp DESC) AS build_number,
             first_value(environment)
               OVER (PARTITION BY instance_id, tenant_id, org_id
                     ORDER BY timestamp DESC) AS environment,
             timestamp,
             instance_id,
             tenant_id,
             org_id,
             value,
             allocated
      FROM db_usage_metrics dum
      WHERE dum.timestamp > :minimumTimestamp
        AND dum.timestamp < :maximumTimestamp) rum
GROUP BY date,
         rum.instance_id,
         rum.tenant_id,
         rum.org_id
ORDER BY date ASC,
         instance_id ASC,
         tenant_id ASC,
         org_id ASC;

Data Retention

According to the product requirements, it is necessary to retain data in the database for only the past year. Therefore, a retention policy has been implemented on each metric to ensure that data beyond the specified time period is automatically removed.

Newer versions of TimescaleDB provide functions such as add_retention_policy (which supersedes the older add_drop_chunks_policy) that simplify this task and enhance usability.
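
As a sketch on this schema (TimescaleDB 2.x, where the time column is epoch milliseconds and one year is 365 × 24 × 60 × 60 × 1000 ms), the policy could be added as:

-- Automatically drop chunks whose data is older than one year (in milliseconds).
SELECT add_retention_policy('db_usage_metrics', drop_after => 31536000000::BIGINT);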

Conclusion

Monitoring infrastructure and contractual data is highly important within managed applications. By harnessing the valuable insights obtained through monitoring, owners can make well-informed decisions, take proactive measures to resolve issues, and establish a transparent and trustworthy relationship with their customers. With meticulous monitoring in place, managed applications can flourish in delivering unparalleled user experiences.

To read more tips and tech discussions from Raghunandan Gupta, look up more of his articles on Medium.
