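Comprehensive Kafka Monitoring: From Configuration to Metrics
1.1. Monitoring Configuration
Enable the JMX port: Kafka consists of three components (broker, producer and consumer), and each of them is started through the $KAFKA_HOME/bin/kafka-run-class.sh script. That script contains the following statements: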
if ...
  KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"
fi
if ...
  KAFKA_JMX_OPTS="$KAFKA_JMX_OPTS -Dcom.sun.management.jmxremote.port=$JMX_PORT "
fi
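When Kafka is started, setting JMX_PORT is all that is needed to make the broker, producer and consumer available for monitoring. There are currently two ways to do this:
- Add an export JMX_PORT=XXXX statement to each of the three scripts $KAFKA_HOME/bin/kafka-server-start.sh, $KAFKA_HOME/bin/kafka-console-consumer.sh and $KAFKA_HOME/bin/kafka-console-producer.sh. This only covers the case where topics are used through the console tools.
- Modify the statements above in $KAFKA_HOME/bin/kafka-run-class.sh so that the port is chosen at random, then look up the port actually in use with ps -ef | grep kafka and point the monitoring at it.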
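The second approach depends on reading the JMX port back from the process list. The following is a minimal sketch of that lookup, assuming the JVM command line carries the -Dcom.sun.management.jmxremote.port flag set by kafka-run-class.sh. The function name extract_jmx_port and the sample command line are illustrative only and are not part of Kafka.

```python
import re
from typing import Optional

# Matches the JVM flag that kafka-run-class.sh appends when JMX_PORT is set.
_JMX_PORT_RE = re.compile(r"-Dcom\.sun\.management\.jmxremote\.port=(\d+)")


def extract_jmx_port(cmdline: str) -> Optional[int]:
    """Return the JMX port found in a process command line, or None if absent."""
    match = _JMX_PORT_RE.search(cmdline)
    return int(match.group(1)) if match else None


# Hypothetical line, shaped like the output of `ps -ef | grep kafka`.
sample = (
    "kafka 4321 1 3 10:02 ? 00:01:12 java -Xmx1G "
    "-Dcom.sun.management.jmxremote "
    "-Dcom.sun.management.jmxremote.port=9999 kafka.Kafka server.properties"
)

if __name__ == "__main__":
    print(extract_jmx_port(sample))
```

A process started without JMX_PORT has no such flag, so the function returns None and the caller can skip that process.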
1.2. Monitoring Tools
Monitoring Kafka with Prometheus
1. Deploy kafka-exporter, for example with Docker:
docker run -ti -d --rm -p 9308:9308 danielqsj/kafka-exporter --kafka.server=192.168.0.4:9092
Monitored item | Threshold | Expression |
---|---|---|
Online Kafka brokers | != 1 for 1m: critical | count(kafka_server_replicamanager_leadercount{job=~"$job"}) |
Partitions whose replicas are out of sync or failed | > 0: critical | sum(kafka_topic_partition_under_replicated_partition{topic=~"$topic", namespace=~"$kubernetes_namespace"}) |
Active controllers in the cluster | != 1: critical | sum(kafka_controller_kafkacontroller_activecontrollercount{job=~"$job"}) |
Offline partitions | > 0: critical | sum(kafka_controller_kafkacontroller_offlinepartitionscount{job=~"$job"}) |
Inbound network traffic (MB/s) | >= 150: moderate | avg_over_time(kafka_server_BrokerTopicMetrics_OneMinuteRate{name="BytesInPerSec",topic=""}[1m]) / 1024 / 1024 |
Average idle fraction of request handler threads | <= 0.3: moderate | avg_over_time(kafka_server_KafkaRequestHandlerPool_OneMinuteRate{name="RequestHandlerAvgIdlePercent"}[1m]) |
2. Add the Kafka job to prometheus.yml:
  - job_name: 'kafka_exporter'
    static_configs:
      - targets: ['$node1:9308']
- Restart Prometheus to load the new configuration.
- Check the target status in the Prometheus web UI.
- Then configure Grafana to display the charts.
- Alert items, for reference:
Monitored item | Threshold | Expression |
---|---|---|
Average idle fraction of network processor threads | <= 0.3: moderate | avg_over_time(kafka_network_SocketServer_Value{name="NetworkProcessorAvgIdlePercent"}[1m]) |
Established connections | > 3000: moderate; > 5000: critical | sum(avg_over_time(kafka_server_socket_server_metrics_connection_count{listener="PLAINTEXT"}[1m])) by (instance,app) |
New connections per second | > 100: moderate; > 200: critical | sum(avg_over_time(kafka_server_socket_server_metrics_connection_creation_rate[1m])) by (instance) |
Time a request waits in the request queue (ms) | > 5000: moderate | avg_over_time(kafka_network_RequestMetrics_999thPercentile{name="RequestQueueTimeMs",request="Produce"}[1m]) |
Time the leader spends processing a request (ms) | > 5000: moderate | avg_over_time(kafka_network_RequestMetrics_999thPercentile{name="LocalTimeMs",request="Produce"}[1m]) |
Time a request waits for followers (ms) | > 1000: moderate | avg_over_time(kafka_network_RequestMetrics_999thPercentile{name="RemoteTimeMs",request="Produce"}[1m]) |
Time a request waits in the response queue (ms) | > 1000: moderate | avg_over_time(kafka_network_RequestMetrics_999thPercentile{name="ResponseQueueTimeMs",request="Produce"}[1m]) |
Time to send the response (ms) | > 1000: moderate | avg_over_time(kafka_network_RequestMetrics_999thPercentile{name="ResponseSendTimeMs",request="Produce"}[1m]) |
Aggregate incoming message rate | > 200000: moderate | avg_over_time(kafka_server_BrokerTopicMetrics_OneMinuteRate{name="MessagesInPerSec",topic=""}[1m]) |
Consumer lag alert | > 1000 | sum(kafka_consumergroup_lag{topic!="sop_free_study_fix-student_wechat_detail"}) by (consumergroup, topic) > 1000 |
kafka-exporter down | < 1 | kafka_exporter_build_info |
Kafka server down | < 1 | kafka_brokers |
Real-time produce rate of monitored topics | >= 0 | sum(irate(kafka_topic_partition_current_offset{topic !~ "__consumer_offsets… (expression truncated in the source) |
Consumer-side partition offset | >= 0 over 5m | sum(delta(kafka_consumergroup_current_offset[5m])/5) by (consumergroup, topic) |
Sum of a consumer group's current offsets across topic partitions | | sum(delta(kafka_consumergroup_current_offset_sum[5m])/5) by (consumergroup, topic) |
Consumption lag of a consumer group | > 100000 for 5m: moderate | sum(kafka_consumergroup_lag) by (consumergroup,partition,topic) |
Approximate lag of a consumer group on a topic partition, summed | | sum(kafka_consumergroup_lag_sum) by (consumergroup,partition,topic) |
Members of a consumer group | | kafka_consumergroup_members{instance="$instance"} |
Sum of partition offsets | | sum(kafka_topic_partition_current_offset) by (partition,topic) |
In-sync replicas of a partition | = 0 for 1m: moderate | sum(kafka_topic_partition_in_sync_replica) |
Oldest offset of a topic partition | | sum(kafka_topic_partition_oldest_offset{topic=~"$topic"}) by (partition,topic) |
Replicas of a topic partition | < 3 for 1m: moderate | sum(kafka_topic_partition_replicas{topic=~"$topic"}) |
Under-replicated topic partitions | | sum(kafka_topic_partition_under_replicated_partition{topic=~"$topic"}) |
Total partitions | > 1000 for 5m: moderate | sum(kafka_topic_partitions) by(topic) |
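To make the consumer-lag rule concrete, here is a rough sketch that applies sum(kafka_consumergroup_lag) by (consumergroup, topic) > 1000 to a few hand-written samples in the Prometheus text format. The group names, topic names and values are made up, and the small parser only handles plain name{labels} value lines. In a real deployment Prometheus evaluates the rule itself; this only illustrates the aggregation and the threshold.

```python
import re
from collections import defaultdict

# One sample line in the text format: metric_name{label="value",...} number
SAMPLE_LINE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def lag_by_group_topic(text):
    """Sum kafka_consumergroup_lag per (consumergroup, topic), across partitions."""
    totals = defaultdict(float)
    for line in text.splitlines():
        match = SAMPLE_LINE.match(line.strip())
        if not match or match.group(1) != "kafka_consumergroup_lag":
            continue  # skip comments, blank lines and other metrics
        labels = dict(LABEL.findall(match.group(2)))
        totals[(labels["consumergroup"], labels["topic"])] += float(match.group(3))
    return dict(totals)


def firing(totals, threshold=1000):
    """Return the (consumergroup, topic) pairs whose summed lag exceeds the threshold."""
    return sorted(key for key, value in totals.items() if value > threshold)


# Hypothetical scrape output; every name and number here is invented.
SAMPLE = '''
# TYPE kafka_consumergroup_lag gauge
kafka_consumergroup_lag{consumergroup="billing",partition="0",topic="orders"} 700
kafka_consumergroup_lag{consumergroup="billing",partition="1",topic="orders"} 450
kafka_consumergroup_lag{consumergroup="audit",partition="0",topic="orders"} 12
'''

if __name__ == "__main__":
    totals = lag_by_group_topic(SAMPLE)
    print(totals)
    print(firing(totals))
```

Neither partition of the billing group exceeds 1000 on its own, but their sum does, which is why the rule aggregates by (consumergroup, topic) before comparing.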
1.3. Performance Metrics
System metrics
- System information: java.lang:type=OperatingSystem
- Thread information: java.lang:type=Threading
- Mapped (mmap) and direct buffer space: used, capacity and count, obtained through BufferPoolMXBean
GC metrics
- Young GC: java.lang:type=GarbageCollector,name=G1 Young Generation
- Old GC: java.lang:type=GarbageCollector,name=G1 Old Generation
JVM metrics
HeapMemoryUsage and NonHeapMemoryUsage are obtained through MemoryMXBean; other JVM memory areas, such as Metaspace and Code Cache, are obtained through MemoryPoolMXBean.
Topic metrics
- Topic inbound byte rate: kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec,topic=<topic>
- Topic outbound byte rate: kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec,topic=<topic>
- Topic rejected byte rate: kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec,topic=<topic>
- Topic failed fetch request rate: kafka.server:type=BrokerTopicMetrics,name=FailedFetchRequestsPerSec,topic=<topic>
- Topic failed produce request rate: kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec,topic=<topic>
- Topic inbound message rate (messages): kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic=<topic>
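The per-topic MBean names all follow one pattern, with the topic name appended as the last key. A small sketch of building them for a given topic is shown below. The helper names topic_mbean and topic_mbeans and the topic "orders" are hypothetical; only the MBean name pattern and the metric names come from the list above.

```python
# Metric names taken from the topic metrics listed above.
TOPIC_METRICS = (
    "BytesInPerSec",
    "BytesOutPerSec",
    "BytesRejectedPerSec",
    "FailedFetchRequestsPerSec",
    "FailedProduceRequestsPerSec",
    "MessagesInPerSec",
)


def topic_mbean(metric: str, topic: str) -> str:
    """Build the JMX object name of one BrokerTopicMetrics metric for a topic."""
    return f"kafka.server:type=BrokerTopicMetrics,name={metric},topic={topic}"


def topic_mbeans(topic: str) -> list:
    """Build the object names of all per-topic metrics listed above."""
    return [topic_mbean(metric, topic) for metric in TOPIC_METRICS]


if __name__ == "__main__":
    for name in topic_mbeans("orders"):  # "orders" is an example topic name
        print(name)
```

Keeping the metric names in one tuple means a collector can iterate over topics and metrics without repeating the string pattern.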
Broker metrics
- Log flush rate and time: kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs
- Under-replicated partitions: kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
- Inbound message rate (messages): kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec
- Inbound byte rate: kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec
- Outbound byte rate: kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec
- Rejected byte rate: kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec
- Failed fetch request rate: kafka.server:type=BrokerTopicMetrics,name=FailedFetchRequestsPerSec
- Failed produce request rate: kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec
- Leader replica count: kafka.server:type=ReplicaManager,name=LeaderCount
- Partition count: kafka.server:type=ReplicaManager,name=PartitionCount
- Offline partition count: kafka.controller:type=KafkaController,name=OfflinePartitionsCount
- Request handler (I/O) thread idle ratio: kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent
- Leader election rate and time: kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs
- Unclean leader election rate: kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec
- Active controller count: kafka.controller:type=KafkaController,name=ActiveControllerCount
- Produce request rate: kafka.network:type=RequestMetrics,name=RequestsPerSec,request=Produce
- Consumer fetch request rate: kafka.network:type=RequestMetrics,name=RequestsPerSec,request=FetchConsumer
- Follower fetch request rate: kafka.network:type=RequestMetrics,name=RequestsPerSec,request=FetchFollower
- Produce request total time: kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce
- Consumer fetch total time: kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer
- Follower fetch total time: kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchFollower
- Time the follower fetch request waits in the request queue: kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request=FetchFollower
- Time the consumer fetch request waits in the request queue: kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request=FetchConsumer
- Time the produce request waits in the request queue: kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request=Produce
- Network processor thread idle ratio: kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent
- ISR shrink rate: kafka.server:type=ReplicaManager,name=IsrShrinksPerSec
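Every name in this list has the same shape: a domain, a colon, then comma-separated key=value pairs. A collector that groups or filters these names may find it handy to split them first. The sketch below does that under the assumption that no value is quoted or contains a comma, which holds for the names listed here; parse_object_name and group_by_type are illustrative helpers, not a JMX API.

```python
from collections import defaultdict


def parse_object_name(name: str):
    """Split 'domain:k1=v1,k2=v2' into (domain, {k1: v1, k2: v2}).

    Assumes unquoted values without commas, as in the names listed above.
    """
    domain, _, props = name.partition(":")
    pairs = (item.split("=", 1) for item in props.split(","))
    return domain, {key: value for key, value in pairs}


def group_by_type(names):
    """Group metric names by their (domain, type) pair."""
    groups = defaultdict(list)
    for name in names:
        domain, props = parse_object_name(name)
        groups[(domain, props["type"])].append(props["name"])
    return dict(groups)


# A few names copied from the broker metrics list.
NAMES = [
    "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce",
    "kafka.server:type=ReplicaManager,name=LeaderCount",
    "kafka.server:type=ReplicaManager,name=IsrShrinksPerSec",
]

if __name__ == "__main__":
    print(parse_object_name(NAMES[0]))
    print(group_by_type(NAMES))
```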
1.4. Performance Metric Descriptions
Metric | Unit | Meaning |
---|---|---|
kafka.broker_offset | offsets | Current message offset on the broker |
kafka.consumer.bytes_in | bytes/second | Consumer bytes-in rate |
kafka.consumer.delayed_requests | requests | Number of delayed consumer requests |
kafka.consumer.expires_per_second | evictions/second | Rate at which delayed consumer requests expire |
kafka.consumer.fetch_rate | requests | Minimum rate at which the consumer sends fetch requests to a broker |
kafka.consumer.kafka_commits | writes/second | Rate of offset commits to Kafka |
kafka.consumer.max_lag | offsets | Maximum consumer lag |
kafka.consumer.messages_in | messages/second | Rate at which the consumer consumes messages |
kafka.consumer.zookeeper_commits | writes/second | Rate of offset commits to ZooKeeper |
kafka.consumer_lag | offsets | Message lag between consumer and broker |
kafka.consumer_offset | offsets | Current message offset of the consumer |
kafka.expires_sec | evictions/second | Rate at which delayed producer requests expire |
kafka.follower.expires_per_second | evictions/second | Rate at which follower requests expire |
kafka.log.flush_rate | flushes/second | Log flush rate |
kafka.messages_in | messages | Incoming message rate |
kafka.net.bytes_in | bytes/second | Incoming byte rate |
kafka.net.bytes_out | bytes/second | Outgoing byte rate |
kafka.net.bytes_rejected | bytes/second | Rejected byte rate |
kafka.producer.bytes_out | bytes/second | Producer bytes-out rate |
kafka.producer.delayed_requests | requests | Number of delayed producer requests |
kafka.producer.expires_per_seconds | evictions/second | Rate at which producer requests expire |
kafka.producer.io_wait | nanoseconds | Producer I/O wait time |
kafka.producer.message_rate | messages/second | Producer message rate |
kafka.producer.request_latency_avg | milliseconds | Average producer request latency |
kafka.producer.request_rate | requests/second | Producer requests per second |
kafka.producer.response_rate | responses/second | Producer responses per second |
kafka.replication.isr_expands | nodes/second | Rate at which replicas join the ISR pool |
kafka.replication.isr_shrinks | nodes/second | Rate at which replicas leave the ISR pool |
kafka.replication.leader_elections | events/second | Leader election rate |
kafka.replication.unclean_leader_elections | events/second | Unclean leader election rate |
kafka.replication.under_replicated_partitions | | Number of under-replicated partitions |
kafka.request.fetch.failed | requests | Number of failed client fetch requests |
kafka.request.fetch.failed_per_second | requests/second | Rate of failed client fetch requests per second |
kafka.request.fetch.time.99percentile | milliseconds | 99th percentile of fetch request time |
kafka.request.fetch.time.avg | milliseconds | Average fetch request time |
kafka.request.handler.avg.idle.pct | fractions | Average fraction of time the request handler threads are idle |
kafka.request.metadata.time.99percentile | milliseconds | 99th percentile of metadata request time |
kafka.request.metadata.time.avg | milliseconds | Average metadata request time |
kafka.request.offsets.time.99percentile | milliseconds | 99th percentile of offset request time |
kafka.request.offsets.time.avg | milliseconds | Average offset request time |
kafka.request.produce.failed | requests | Number of failed produce requests |
kafka.request.produce.failed_per_second | requests/second | Rate of failed produce requests per second |
kafka.request.produce.time.99percentile | milliseconds | 99th percentile of produce request time |
kafka.request.produce.time.avg | milliseconds | Average produce request time |
kafka.request.update_metadata.time.99percentile | milliseconds | 99th percentile of update metadata request time |
kafka.request.update_metadata.time.avg | milliseconds | Average update metadata request time |
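1.5. Key Metrics Explained
Based on the kafka-manager management tool
1. Under Replicated Partitions (kafka.replication.under_replicated_partitions): In a healthy cluster, the number of in-sync replicas (ISR) should exactly equal the total number of replicas (AR, Assigned Replicas). If a replica of a partition falls too far behind the leader, that follower is removed from the ISR pool, and IsrShrinksPerSec (the rate at which the ISR shrinks) rises accordingly. Because Kafka's high availability depends on replication, this metric deserves close attention: it should stay at 0, and a value that remains above 0 for an extended period needs investigation.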
2. Brokers Spread: the share of brokers used by a topic. For example, in a cluster of 9 brokers, a topic with 7 partitions has a broker spread of 7 / 9 ≈ 78%.
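3. Brokers Leader Skew: whether leader partitions are unevenly distributed. For example, in a cluster of 9 brokers with a topic of 14 partitions, each broker would normally hold about 2 leader partitions (14 / 9, rounded up). If one broker holds 0 leader partitions and another holds 4, the broker leader skew is (4 - 2) / 14 ≈ 14%.
Because all Kafka reads and writes go through the leader, leader skew leaves the read/write load unevenly spread across brokers. Setting auto.leader.rebalance.enable=true makes Kafka rebalance leaders automatically every 5 minutes, which removes the problem.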
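4. Lag: reflects how well a consumer keeps up with production. It is computed as Lag = LogSize - Consumer Offset. Kafka Manager reads LogSize from ZooKeeper and reads the Offset from Kafka's __consumer_offsets topic. There is a time gap between the two reads, so on high-throughput topics the Offset, read later, can exceed the LogSize read earlier (Offset > LogSize), which makes Lag negative.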
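The four indicators above come down to simple arithmetic. The sketch below reproduces the numbers used in the examples, assuming the formulas exactly as the text states them; Kafka Manager's own implementation may compute spread and skew differently. The function names and the sample partition data are hypothetical.

```python
import math


def under_replicated_count(partitions):
    """Count partitions whose ISR is smaller than their assigned replica set (AR)."""
    return sum(1 for p in partitions if len(p["isr"]) < len(p["replicas"]))


def broker_spread(brokers_used, total_brokers):
    """Fraction of the cluster's brokers that host partitions of the topic."""
    return brokers_used / total_brokers


def leader_skew(max_leaders_on_a_broker, partition_count, broker_count):
    """Skew as in the example: excess leaders on the busiest broker / partitions."""
    expected = math.ceil(partition_count / broker_count)
    return max(0, max_leaders_on_a_broker - expected) / partition_count


def consumer_lag(log_size, consumer_offset):
    """Lag = LogSize - Consumer Offset; negative when the offset was read later."""
    return log_size - consumer_offset


# Hypothetical partition state: the second partition has lost two replicas from its ISR.
partitions = [
    {"replicas": [1, 2, 3], "isr": [1, 2, 3]},
    {"replicas": [2, 3, 4], "isr": [2]},
]

if __name__ == "__main__":
    print(under_replicated_count(partitions))
    print(round(broker_spread(7, 9), 3))
    print(round(leader_skew(4, 14, 9), 3))
    print(consumer_lag(1000, 1005))
```

The last call shows the negative-lag case: the LogSize was sampled at 1000, and by the time the offset was read the consumer had already committed 1005.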
Original article: https://blog.csdn.net/qq_40477248/article/details/144745053