vmstorage disk capacity planning

vmstorage stores the multicluster metrics used for observability. To keep vmstorage stable, size its disk according to the number of clusters and the size of each cluster. For more information, refer to vmstorage retention period and disk space.

Test Results

After observing the vmstorage disks of clusters of different sizes for 14 days, we found that vmstorage disk usage is positively correlated with the number of metrics it stores and with the disk usage of a single data point.

Instantaneous metrics (the number of metrics added within 30s): increase(vm_rows{type!="indexdb"}[30s])
Disk usage of a single data point: sum(vm_data_size_bytes{type!="indexdb"}) / sum(vm_rows{type!="indexdb"})
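
As a rough sketch of how these two values might be read programmatically, the Python below builds a Prometheus-style instant query URL and sums the series in the JSON response. The base URL is a hypothetical placeholder (the address and path prefix depend on your deployment), and the sample response is hand-written for illustration, not measured data.

```python
import json
from urllib.parse import urlencode

# Queries quoted from the text above.
NEW_ROWS_QUERY = 'increase(vm_rows{type!="indexdb"}[30s])'
BYTES_PER_POINT_QUERY = (
    'sum(vm_data_size_bytes{type!="indexdb"}) / sum(vm_rows{type!="indexdb"})'
)


def instant_query_url(base_url: str, query: str) -> str:
    """Build an instant query URL; base_url is a hypothetical, deployment-specific value."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"


def sum_instant_vector(body: str) -> float:
    """Sum the sample values of every series in an instant-query JSON response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return sum(float(series["value"][1]) for series in payload["data"]["result"])


# Hand-written sample response, for illustration only (not measured data).
SAMPLE_BODY = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {}, "value": [1700000000, "0.9"]}],
    },
})
bytes_per_point = sum_instant_vector(SAMPLE_BODY)
```

Summing in Python rather than in the query keeps the first expression exactly as quoted while still yielding a single number when it returns several series.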

Calculation method

Disk usage = instantaneous metrics x 2 x disk usage of a single data point x 60 x 24 x storage duration (days)

Parameter Description:

Disk usage is expressed in bytes.
Storage duration (days) x 60 x 24 converts the storage duration from days into minutes.
The default scrape interval of Prometheus in Insight Agent is 30s, so every metric produces two data points per minute. This is the factor of 2 in the formula.
The default storage duration in vmstorage is 1 month. To change it, refer to Modify System Configuration.
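
The formula and its parameters can be sketched as a small Python function. The function and parameter names are illustrative, not part of any product API. The 0.9 bytes per data point and the 80,000 metrics are the reference figures this page uses for a 100-Pod cluster without service mesh.

```python
GIB = 1024 ** 3  # bytes per GiB


def vmstorage_disk_bytes(
    instantaneous_metrics: int,
    bytes_per_point: float,
    retention_days: int,
    scrapes_per_minute: int = 2,  # 30s scrape interval => 2 data points per minute
) -> float:
    """Estimated disk usage in bytes, following the formula above."""
    minutes = 60 * 24 * retention_days  # storage duration converted to minutes
    return instantaneous_metrics * scrapes_per_minute * bytes_per_point * minutes


# 100 Pods without service mesh: 80,000 metrics, 0.9 bytes per point, 30 days.
usage_bytes = vmstorage_disk_bytes(80_000, 0.9, 30)
usage_gib = usage_bytes / GIB
```

For these inputs the estimate is about 5.8 GiB, which corresponds to the 6 GiB reference value for a 100-Pod cluster. If the scrape interval is changed from the 30s default, `scrapes_per_minute` needs to change with it.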

Warning

This formula gives a general estimate. It is recommended to reserve additional disk capacity on top of the calculated result to ensure that vmstorage operates normally.

Reference capacity

The data in the tables is calculated for the default storage duration of one month (30 days), with the disk usage of a single data point taken as 0.9 bytes. In a multicluster scenario, the number of Pods is the total number of Pods across all clusters.

When the service mesh is not enabled

Cluster size (number of Pods)	Metrics	Disk capacity
100	80,000	6 GiB
200	160,000	12 GiB
300	240,000	18 GiB
400	320,000	24 GiB
500	400,000	30 GiB
800	640,000	48 GiB
1000	800,000	60 GiB
2000	1,600,000	120 GiB
3000	2,400,000	180 GiB

When the service mesh is enabled

Cluster size (number of Pods)	Metrics	Disk capacity
100	150,000	12 GiB
200	310,000	24 GiB
300	460,000	36 GiB
400	620,000	48 GiB
500	780,000	60 GiB
800	1,250,000	94 GiB
1000	1,560,000	120 GiB
2000	3,120,000	235 GiB
3000	4,680,000	350 GiB

Example

The AI platform has two clusters: the global management cluster runs 500 Pods with the service mesh enabled, and the worker cluster runs 1000 Pods with the service mesh disabled. Metrics are expected to be stored for 30 days.

Metrics in the global management cluster: 800 x 500 + 768 x 500 = 784000
Metrics in the worker cluster: 800 x 1000 = 800000

The vmstorage disk capacity should therefore be set to at least (784000 + 800000) x 2 x 0.9 x 60 x 24 x 30 = 123171840000 bytes ≈ 115 GiB
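
The arithmetic for this scenario can be re-checked with a short Python sketch. The per-Pod figures (800 metrics per Pod, plus 768 more when the service mesh is enabled) and the 0.9 bytes per data point come from this page. The constant and function names are illustrative, and the 30-day retention is taken from the scenario description.

```python
import math

METRICS_PER_POD = 800             # per-Pod metrics, from the example above
MESH_EXTRA_METRICS_PER_POD = 768  # additional per-Pod metrics with service mesh
BYTES_PER_POINT = 0.9             # disk usage of a single data point
SCRAPES_PER_MINUTE = 2            # default 30s scrape interval
RETENTION_DAYS = 30               # storage duration stated in the scenario
GIB = 1024 ** 3


def cluster_metrics(pods: int, mesh_enabled: bool) -> int:
    """Estimated number of metrics for one cluster."""
    per_pod = METRICS_PER_POD + (MESH_EXTRA_METRICS_PER_POD if mesh_enabled else 0)
    return pods * per_pod


# (number of Pods, service mesh enabled) for each cluster in the scenario
clusters = [(500, True), (1000, False)]
total_metrics = sum(cluster_metrics(pods, mesh) for pods, mesh in clusters)

disk_bytes = (
    total_metrics * SCRAPES_PER_MINUTE * BYTES_PER_POINT * 60 * 24 * RETENTION_DAYS
)
disk_gib = math.ceil(disk_bytes / GIB)  # round up to whole GiB
```

With these inputs the total is 1,584,000 metrics and roughly 114.7 GiB, rounded up to 115 GiB before any reserve capacity is added.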

Note

For the relationship between the number of metrics and the number of Pods in the cluster, refer to Prometheus Resource Planning.