Monitoring Metrics Items

CubeFS integrates Prometheus as the monitoring metric collection module, which can be combined with specific situations to enable monitoring metrics or configure monitoring panels.

Support reporting go gc stats and mem stats, as follows

# GC-related
go_gc_duration_seconds_sum
go_gc_duration_seconds
...
# Memory allocation
go_memstats_alloc_bytes
go_memstats_heap_idle_bytes
...
# Process-related
process_resident_memory_bytes
process_start_time_seconds
process_open_fds
...

Master

The monitoring metrics reported by the Master module mainly include the health status, usage, and statistical data of nodes in the cluster.

Metric Name	Description
cfs_master_dataNodes_count	Number of data nodes in the cluster
cfs_master_dataNodes_inactive	Number of abnormal data nodes in the cluster
cfs_master_dataNodes_increased_GB	Change in disk space usage in the cluster within 2 minutes
cfs_master_dataNodes_total_GB	Total disk space size in the cluster
cfs_master_dataNodes_used_GB	Disk space size used in the cluster
cfs_master_disk_error{addr="xx",path="xx"}	Bad disk monitoring in the cluster, including abnormal node IP and disk path
cfs_master_metaNodes_count	Total number of metadata nodes in the cluster
cfs_master_metaNodes_inactive	Number of abnormal metadata nodes in the cluster
cfs_master_metaNodes_increased_GB	Change in memory usage of metadata within 2 minutes in the cluster
cfs_master_metaNodes_total_GB	Total memory size of metadata in the cluster
cfs_master_metaNodes_used_GB	Memory size used by metadata in the cluster
cfs_master_vol_count	Number of volumes in the cluster
cfs_master_vol_meta_count{type="dentry",volName="vol"}	Details of the specified volume, type: dentry, inode, dp, mp
cfs_master_vol_total_GB{volName="xx"}	Capacity of the specified volume
cfs_master_vol_usage_ratio{volName="xx"}	Usage rate of the specified volume
cfs_master_vol_used_GB{volName="xx"}	Used capacity of the specified volume

MetaNode

The monitoring metrics of the MetaNode can be used to monitor the qps and latency data of various metadata operations of each volume, such as lookup, createInode, createDentry, etc.

Metric Name	Description
cfs_metanode_$op_count	Total number of requests for the corresponding operation of the meta node, which can be used to calculate the request qps
cfs_metanode_$op_hist_bucket	Hist data of the corresponding operation request of the meta node, which can be used to calculate the 95 value of the latency
cfs_metanode_$op_hist_count	Total number of corresponding requests for the meta node, same as cfs_metanode_$op_count
cfs_metanode_$op_hist_sum	Total time consumption of the corresponding operation request of the meta node, which can be used to calculate the average latency with hist_count

DataNode

The monitoring metrics of the DataNode can be used to monitor the qps and latency data of various data operations of each volume, as well as bandwidth, such as read, write, etc.

Metric Name	Description
cfs_dataNode_$op_count	Total number of requests for the corresponding operation of the data node, which can be used to calculate the request qps
cfs_dataNode_$op_hist_bucket	Hist data of the corresponding operation request of the data node, which can be used to calculate the 95 value of the latency
cfs_dataNode_$op_hist_count	Total number of corresponding requests for the data node, same as cfs_datanode_$op_count
cfs_dataNode_$op_hist_sum	Total time consumption of the corresponding operation request of the data node, which can be used to calculate the average latency with hist_count
cfs_dataNode_dataPartitionIOBytes	Total amount of data read and written by the data node, which can be used to calculate the bandwidth data of the specified disk and volume
cfs_dataNode_dataPartitionIO_count	Total number of IOs of the data node, which can be used to calculate the disk IO qps data
cfs_dataNode_dataPartitionIO_hist_bucket	Histogram data of the IO operation of the data node, which can be used to calculate the 95 value of the IO
cfs_dataNode_dataPartitionIO_hist_count	Total number of IO operations of the data node, same as above
cfs_dataNode_dataPartitionIO_hist_sum	Total delay of the IO operation of the data node, which can be used to calculate the average delay with hist_count

ObjectNode

The monitoring metrics of the ObjectNode are mainly used to monitor the request volume and time consumption of various S3 operations, such as copyObject, putObject, etc.

Metric Name	Description
cfs_objectnode_$op_count	Total number of requests for the corresponding operation of the object node, which can be used to calculate the qps
cfs_objectnode_$op_hist_count	Same as above
cfs_objectnode_$op_hist_sum	Total time consumption of the corresponding operation request of the object node, which can be used to calculate the average latency with hist_count
cfs_objectnode_$op_hist_bucket	Histogram data of the corresponding request of the object node, which can be used to calculate the 95 value and 99 value of the request latency

FuseClient

The monitoring metrics of the FuseClient module are mainly used to monitor the request volume, time consumption, cache hit rate, etc. of the interaction with the data module or metadata module, such as fileread, filewrite, etc. The details are as follows:

Metric Name	Description
cfs_fuseclient_$dp_count	Total number of requests for the corresponding operation of the client, which can be used to calculate the qps
cfs_fuseclient_$dp_hist_count	Same as above
cfs_fuseclient_$dp_hist_sum	Total time consumption of the corresponding operation request of the client, which can be used to calculate the average latency with hist_count
cfs_fuseclient_$dp_hist_bucket	Histogram data of the corresponding request of the client, which can be used to calculate the 95 value of the request latency

Blobstore

Common Metrics Items

Label	Description
api	Request API name, can configure path depth
code	Response status code
host	Host name, automatically obtain the current host name
idc	Configuration file configuration item
method	Request type
service	Service name
tag	Custom tag
team	Custom team name

You can modify the service audit log configuration item to enable related metrics

Configuration Item	Description	Required
idc	IDC name	No, it is recommended to fill in if the metric is enabled
service	Module name	No, it is recommended to fill in if the metric is enabled
tag	Custom tag, such as configuring clusterid	No, it is recommended to fill in if the metric is enabled
enable_http_method	Whether to enable status code statistics	No, default is off
enable_req_length_cnt	Whether to enable request length statistics	No, default is off
enable_resp_length_cnt	Whether to enable response length statistics	No, default is off
enable_resp_duration	Whether to enable request/response interval time consumption statistics	No, default is off
max_api_level	Maximum API path depth	No

{
  "auditlog": {
    "metric_config": {
      "idc": "z0",
      "service": "SCHEDULER",
      "tag": "100",
      "team": "cubefs",
      "enable_http_method": true,
      "enable_req_length_cnt": true,
      "enable_resp_length_cnt": true,
      "enable_resp_duration": true,
      "max_api_level": 3
    }
  }
}

service_response_code

Request status code metrics, record request API, status code, module name and other information, generally used to calculate request error rate, request volume, etc.

# TYPE service_response_code counter
service_response_code{api="scheduler.inspect.acquire",code="200",host="xxx",idc="z0",method="GET",service="SCHEDULER",tag="100",team=""} 8.766433e+06

service_request_length

Request body length statistics metrics, generally used to calculate request bandwidth, traffic.

# TYPE service_request_length counter
service_request_length{api="clustermgr.chunk.report",code="200",host="xxx",idc="z0",method="POST",service="CLUSTERMGR",tag="",team=""} 27631

service_response_length

Request response body length statistics metrics, generally used to calculate download bandwidth, traffic.

# TYPE service_response_length counter
service_response_length{api="clustermgr.chunk.report",code="200",host="xxxx",idc="z0",method="POST",service="CLUSTERMGR",tag="",team=""} 6

service_response_duration_ms

Request response time statistics metrics, which can be used to calculate the 95 latency of the request or response.

# TYPE service_response_duration_ms histogram
service_response_duration_ms_bucket{api="clustermgr.config.get",code="200",host="xxx",idc="z0",method="GET",reqlength="",resplength="",service="CLUSTERMGR",tag="",team="",le="1"} 22

Access

blobstore_access_cache_hit_rate

Cache hit rate, status includes four types: none (the hit value is nil), hit (hit), miss (not hit), expired (expired)

Label	Description
cluster	Cluster ID
service	Monitoring component, such as memcache, proxy, etc.
status	Status, none, hit, miss, expired

# TYPE blobstore_access_cache_hit_rate counter
blobstore_access_cache_hit_rate{cluster="100",service="memcache",status="hit"} 3.4829103e+13
blobstore_access_cache_hit_rate{cluster="100",service="proxy",status="hit"} 2.4991594e+07

blobstore_access_unhealth

Failure degradation metrics statistics

Label	Description
cluster	Cluster ID
action	allocate, punish, repair.msg, delete.msg
host	Failed node
reason	Failure reason
module	Degradation dimension, diskwith, volume, service

# TYPE blobstore_access_unhealth counter
blobstore_access_unhealth{action="punish",cluster="100",host="xxx",module="diskwith",reason="Timeout"} 7763

blobstore_access_download

Download failure statistics metrics

Label	Description
cluster	Cluster ID
way	Download method, EC read or direct read (Direct)

# TYPE blobstore_access_download counter
blobstore_access_download{cluster="100",way="Direct"} 37
blobstore_access_download{cluster="100",way="EC"} 3016

Clustermgr

blobstore_clusterMgr_chunk_stat_info

Chunk status metrics, statistics of the total number of chunks and available chunks

Label	Description
cluster	Cluster ID
idc	IDC
region	Region information
is_leader	Whether it is the master node
item	TotalChunk, TotalFreeChunk

# TYPE blobstore_clusterMgr_chunk_stat_info gauge
blobstore_clusterMgr_chunk_stat_info{cluster="100",idc="z0",is_leader="false",item="TotalChunk",region="cn-south-2"} 55619

blobstore_clusterMgr_disk_stat_info

Disk status metrics

Label	Description
cluster	Cluster ID
idc	IDC
region	Region information
is_leader	Whether it is the master node
item	Available, Broken, Dropped, Dropping, Expired, Readonly, Repaired, Repairing, Total

# TYPE blobstore_clusterMgr_disk_stat_info gauge
blobstore_clusterMgr_disk_stat_info{cluster="100",idc="z0",is_leader="false",item="Available",region="cn-south-2"} 107

blobstore_clusterMgr_raft_stat

Raft status metrics

Label	Description
cluster	Cluster ID
idc	IDC
region	Region information
is_leader	Whether it is the master node
item	applied_index, committed_index, peers, term

# TYPE blobstore_clusterMgr_raft_stat gauge
blobstore_clusterMgr_raft_stat{cluster="100",is_leader="false",item="applied_index",region="cn-south-2"} 2.97061597e+08

blobstore_clusterMgr_space_stat_info

Cluster space metrics

Label	Description
cluster	Cluster ID
idc	IDC
region	Region information
is_leader	Whether it is the master node
item	FreeSpace, TotalBlobNode, TotalDisk, TotalSpace, UsedSpace, WritableSpace

# TYPE blobstore_clusterMgr_space_stat_info gauge
blobstore_clusterMgr_space_stat_info{cluster="100",is_leader="false",item="FreeSpace",region="cn-south-2"} 1.76973072064512e+15

blobstore_clusterMgr_vol_status_vol_count

Volume status metrics

Label	Description
cluster	Cluster ID
idc	IDC
region	Region information
is_leader	Whether it is the master node
status	active, allocatable, idle, lock, total, unlocking

# TYPE blobstore_clusterMgr_vol_status_vol_count gauge
blobstore_clusterMgr_vol_status_vol_count{cluster="100",is_leader="false",region="cn-south-2",status="active"} 316

BlobNode

blobstore_blobnode_disk_stat

Disk status metrics

Label	Description
cluster_id	Cluster ID
idc	IDC
disk_id	Disk ID
host	Local service address
rack	Rack information
item	free, reserved, total_disk_size, used

# TYPE blobstore_blobnode_disk_stat gauge
blobstore_blobnode_disk_stat{cluster_id="100",disk_id="243",host="xxx",idc="z2",item="free",rack="testrack"} 6.47616868352e+12

Scheduler

scheduler_task_shard_cnt

Task shard count

Label	Description
cluster_id	Cluster ID
kind	success, failed
task_type	Task type, delete, shard_repair, balance, disk_drop, disk_repair, manual_migrate

# TYPE scheduler_task_shard_cnt counter
scheduler_task_shard_cnt{cluster_id="100",kind="failed",task_type="delete"} 7.4912551e+07

scheduler_task_reclaim

Task redistribution metrics

Label	Description
cluster_id	Cluster ID
kind	success, failed
task_type	Task type, balance, disk_drop, disk_repair, manual_migrate

# TYPE scheduler_task_reclaim counter
scheduler_task_reclaim{cluster_id="100",kind="success",task_type="balance"} 0

scheduler_task_data_size

Task data migration metrics, unit: byte

Label	Description
cluster_id	Cluster ID
kind	success, failed
task_type	Task type, balance, disk_drop, disk_repair, manual_migrate

# TYPE scheduler_task_data_size counter
scheduler_task_data_size{cluster_id="100",kind="success",task_type="balance"} 0

scheduler_task_cnt

Task count metrics

Label	Description
cluster_id	Cluster ID
kind	success, failed
task_type	Task type, balance, disk_drop, disk_repair, manual_migrate
task_status	finishing, preparing, worker_doing

# TYPE scheduler_task_cnt gauge
scheduler_task_cnt{cluster_id="100",kind="success",task_status="finishing",task_type="balance"} 0

scheduler_task_cancel

Task cancellation metrics

Label	Description
cluster_id	Cluster ID
kind	success, failed
task_type	Task type, balance, disk_drop, disk_repair, manual_migrate

# TYPE scheduler_task_cancel counter
scheduler_task_cancel{cluster_id="100",kind="success",task_type="balance"} 0

scheduler_free_chunk_cnt_range

Cluster idle chunk statistics

Label	Description
cluster_id	Cluster ID
idc	IDC
rack	Rack

# TYPE scheduler_free_chunk_cnt_range histogram
scheduler_free_chunk_cnt_range_bucket{cluster_id="100",idc="z0",rack="testrack",le="5"} 0

kafka_topic_partition_consume_lag

Kafka consumer latency

Label	Description
cluster_id	Cluster ID
module_name	Service name
partition	Partition
topic	Topic

# TYPE kafka_topic_partition_consume_lag gauge
kafka_topic_partition_consume_lag{cluster_id="100",module_name="SCHEDULER",partition="0",topic="dg_blob_delete"} 1.488541e+06

kafka_topic_partition_offset

Label	Description
cluster_id	Cluster ID
module_name	Service name
partition	Partition
topic	Topic
type	consume, newest, oldest

# TYPE kafka_topic_partition_offset gauge
kafka_topic_partition_offset{cluster_id="100",module_name="SCHEDULER",partition="0",topic="dg_blob_delete",type="consume"} 5.37820629e+08

Proxy

blobstore_proxy_volume_status

Volume status metrics

Label	Description
cluster	Cluster ID
idc	IDC
codemode	Volume mode
type	total_free_size, volume_nums

# TYPE blobstore_proxy_volume_status gauge
blobstore_proxy_volume_status{cluster="100",codemode="EC15P12",idc="z0",service="PROXY",type="total_free_size"} 9.01538397118e+11

blobstore_proxy_cache

Volume and disk cache status metrics

Label	Description
cluster	Cluster ID
service	Service name, disk, volume
name	Cache layer, memcache, diskv, clustermgr
action	Cache value, hit, miss, expired, error

# TYPE blobstore_proxy_cache counter
blobstore_proxy_cache{action="hit",cluster="100",name="clustermgr",service="disk"} 6345
blobstore_proxy_cache{action="hit",cluster="100",name="memcache",service="volume"} 2.3056289e+07
blobstore_proxy_cache{action="miss",cluster="100",name="diskv",service="volume"} 230595

Edit on GitHub

Quick Start

Operation Guide

FAQ

Monitoring Metrics Items

Master

MetaNode

DataNode

ObjectNode

FuseClient

Blobstore

Common Metrics Items

Access

Clustermgr

BlobNode

Scheduler

Proxy

# Monitoring Metrics Items

# Process-Related Metrics

# Master

# MetaNode

# DataNode

# ObjectNode

# FuseClient

# Blobstore

# Common Metrics Items

# Access

# Clustermgr

# BlobNode

# Scheduler

# Proxy

Monitoring Metrics Items

Process-Related Metrics

Master

MetaNode

DataNode

ObjectNode

FuseClient

Blobstore

Common Metrics Items

Access

Clustermgr

BlobNode

Scheduler

Proxy