
A Comprehensive Kafka Monitoring Guide: From Configuration to Metrics

1.1. Monitoring Configuration

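Enable the JMX port: Kafka consists of three components, the broker, the producer, and the consumer. Each of them is started through the $KAFKA_HOME/bin/kafka-run-class.sh script, which contains the following statements: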

if ...
  KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false  -Dcom.sun.management.jmxremote.ssl=false"
fi
if ...
  KAFKA_JMX_OPTS="$KAFKA_JMX_OPTS -Dcom.sun.management.jmxremote.port=$JMX_PORT "
fi

When Kafka is started, setting JMX_PORT is all that is needed to monitor the broker, producer, or consumer. There are currently two ways to do this:

  1. Add a JMX_PORT=XXXX statement to each of the three scripts $KAFKA_HOME/bin/kafka-server-start.sh, $KAFKA_HOME/bin/kafka-console-consumer.sh, and $KAFKA_HOME/bin/kafka-console-producer.sh. This only covers the case where topics are used through the console tools.
  2. Modify the statements above in $KAFKA_HOME/bin/kafka-run-class.sh so that the port is chosen at random, then run ps -ef | grep kafka to find the port each process received and monitor it.
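
With the random-port approach, the port has to be read back from the process list. The following is a minimal sketch of that step, assuming the usual `ps -ef` column layout (PID in the second column). The function names and the sample line are hypothetical and only serve as illustration.

```python
import re
from typing import Dict

# Matches the JVM flag that kafka-run-class.sh appends when JMX_PORT is set.
JMX_PORT_FLAG = re.compile(r"-Dcom\.sun\.management\.jmxremote\.port=(\d+)")


def extract_jmx_ports(ps_output: str) -> Dict[int, int]:
    """Map PID -> JMX port for every process line that carries the flag."""
    ports = {}
    for line in ps_output.splitlines():
        match = JMX_PORT_FLAG.search(line)
        if not match:
            continue
        fields = line.split()
        # Assumption: `ps -ef` layout, where the second column is the PID.
        if len(fields) > 1 and fields[1].isdigit():
            ports[int(fields[1])] = int(match.group(1))
    return ports


def jmx_service_url(host: str, port: int) -> str:
    """Build the standard JMX-over-RMI service URL for a remote JVM."""
    return f"service:jmx:rmi:///jndi/rmi://{host}:{port}/jmxrmi"


if __name__ == "__main__":
    # Hypothetical, shortened line of `ps -ef | grep kafka` output.
    sample = (
        "kafka 4321 1 5 10:00 ? 00:01:02 java -Xmx1G "
        "-Dcom.sun.management.jmxremote "
        "-Dcom.sun.management.jmxremote.port=9999 kafka.Kafka server.properties"
    )
    for pid, port in extract_jmx_ports(sample).items():
        print(pid, jmx_service_url("localhost", port))
```

The resulting URL is what a JMX client such as jconsole expects as the remote address.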

1.2. Monitoring Tools

Monitoring Kafka with Prometheus

  1. Deploy kafka-exporter, for example with Docker:
docker run -ti -d --rm -p 9308:9308 danielqsj/kafka-exporter --kafka.server=192.168.0.4:9092
| Alert item | Threshold | Severity | Expression |
| --- | --- | --- | --- |
| Kafka brokers online | != 1 for 1m | Critical | count(kafka_server_replicamanager_leadercount{job=~"$job"}) |
| Partitions whose replicas are out of sync or unavailable | > 0 | Critical | sum(kafka_topic_partition_under_replicated_partition{topic=~"$topic", namespace=~"$kubernetes_namespace"}) |
| Number of controllers in the cluster | != 1 | Critical | sum(kafka_controller_kafkacontroller_activecontrollercount{job=~"$job"}) |
| Offline partitions | > 0 | Critical | sum(kafka_controller_kafkacontroller_offlinepartitionscount{job=~"$job"}) |
| Inbound network traffic per second (MB/s) | >= 150 | Moderate | avg_over_time(kafka_server_BrokerTopicMetrics_OneMinuteRate{name="BytesInPerSec",topic=""}[1m]) / 1024 / 1024 |
| Average idle fraction of request handler threads | <= 0.3 | Moderate | avg_over_time(kafka_server_KafkaRequestHandlerPool_OneMinuteRate{name="RequestHandlerAvgIdlePercent"}[1m]) |
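
As a quick sanity check after step 1, the exporter's /metrics endpoint on port 9308 can be read before Prometheus is wired up. The sketch below parses that text format offline. It assumes lines of the form `name{labels} value` without trailing timestamps, and the sample text is hypothetical rather than captured from a real exporter.

```python
from typing import Dict


def parse_simple_metrics(text: str) -> Dict[str, float]:
    """Parse Prometheus text-format lines into {series: value}.

    Assumption: samples carry no trailing timestamp, so the value is
    always the last whitespace-separated token on the line.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        metrics[series] = float(value)
    return metrics


if __name__ == "__main__":
    # Hypothetical excerpt of the exporter output.
    sample = """
# TYPE kafka_brokers gauge
kafka_brokers 3
kafka_topic_partitions{topic="orders"} 6
"""
    parsed = parse_simple_metrics(sample)
    print("brokers seen by the exporter:", parsed["kafka_brokers"])
```

A kafka_brokers value below the expected broker count is an early hint that the --kafka.server address is wrong or that a broker is down.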

  2. Add the Kafka job to prometheus.yml:

  - job_name: 'kafka_exporter'
    static_configs:
    - targets: ['$node1:9308']
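  3. Restart Prometheus to load the new configuration.
  4. Check the target's status in the Prometheus web UI.
  5. Configure Grafana to display the dashboards.
  6. Alert rules, listed in the table below for reference: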
| Alert item | Threshold | Severity | Expression |
| --- | --- | --- | --- |
| Average idle fraction of network processor threads | <= 0.3 | Moderate | avg_over_time(kafka_network_SocketServer_Value{name="NetworkProcessorAvgIdlePercent"}[1m]) |
| Established connections | > 3000 / > 5000 | Moderate / Critical | sum(avg_over_time(kafka_server_socket_server_metrics_connection_count{listener="PLAINTEXT"}[1m])) by (instance,app) |
| New connections per second | > 100 / > 200 | Moderate / Critical | sum(avg_over_time(kafka_server_socket_server_metrics_connection_creation_rate[1m])) by (instance) |
| Time a request waits in the request queue (ms) | > 5000 | Moderate | avg_over_time(kafka_network_RequestMetrics_999thPercentile{name="RequestQueueTimeMs",request="Produce"}[1m]) |
| Time the leader spends processing a request (ms) | > 5000 | Moderate | avg_over_time(kafka_network_RequestMetrics_999thPercentile{name="LocalTimeMs",request="Produce"}[1m]) |
| Time a request waits for followers (ms) | > 1000 | Moderate | avg_over_time(kafka_network_RequestMetrics_999thPercentile{name="RemoteTimeMs",request="Produce"}[1m]) |
| Time a request waits in the response queue (ms) | > 1000 | Moderate | avg_over_time(kafka_network_RequestMetrics_999thPercentile{name="ResponseQueueTimeMs",request="Produce"}[1m]) |
| Time to send the response (ms) | > 1000 | Moderate | avg_over_time(kafka_network_RequestMetrics_999thPercentile{name="ResponseSendTimeMs",request="Produce"}[1m]) |
| Aggregate incoming message rate | > 200000 | Moderate | avg_over_time(kafka_server_BrokerTopicMetrics_OneMinuteRate{name="MessagesInPerSec",topic=""}[1m]) |
| Consumer lag alert | > 1000 | | sum(kafka_consumergroup_lag{topic!="sop_free_study_fix-student_wechat_detail"}) by (consumergroup, topic) > 1000 |
| kafka-exporter stopped | < 1 | | kafka_exporter_build_info |
| Kafka server stopped | < 1 | | kafka_brokers |
| Real-time produce rate per topic | >= 0 | | sum(irate(kafka_topic_partition_current_offset{topic !~ "__consumer_offsets… |
| Consumer-side partition offset | >= 0 over 5m | | sum(delta(kafka_consumergroup_current_offset[5m])/5) by (consumergroup, topic) |
| Sum of a consumer group's current topic partition offsets | | | sum(delta(kafka_consumergroup_current_offset_sum[5m])/5) by (consumergroup, topic) |
| Consumption lag of a consumer group | > 100000 over 5m | Moderate | sum(kafka_consumergroup_lag) by (consumergroup,partition,topic) |
| Sum of a consumer group's approximate lag on a topic partition | | | sum(kafka_consumergroup_lag_sum) by (consumergroup,partition,topic) |
| Members of a consumer group | | | kafka_consumergroup_members{instance="$instance"} |
| Sum of partition offsets | | | sum(kafka_topic_partition_current_offset) by (partition,topic) |
| In-sync replicas of a partition | = 0 for 1m | Moderate | sum(kafka_topic_partition_in_sync_replica) |
| Oldest offset of a topic partition | | | sum(kafka_topic_partition_oldest_offset{topic=~"$topic"}) by (partition,topic) |
| Replicas of a topic partition | < 3 for 1m | Moderate | sum(kafka_topic_partition_replicas{topic=~"$topic"}) |
| Under-replicated topic partitions | | | sum(kafka_topic_partition_under_replicated_partition{topic=~"$topic"}) |
| Total number of partitions | > 1000 over 5m | Moderate | sum(kafka_topic_partitions) by(topic) |
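
Several rows above use two tiers, for example established connections at > 3000 (moderate) and > 5000 (critical). A minimal sketch of that two-tier evaluation follows. The function name and return labels are hypothetical, and in a real deployment the same logic would normally live in Prometheus alerting rules rather than in application code.

```python
from typing import Optional


def classify(value: float, moderate: float, critical: Optional[float] = None) -> str:
    """Classify a sample against upper-bound thresholds.

    Returns "critical", "moderate", or "ok". The critical tier is
    optional because most rows in the table define only one threshold.
    """
    if critical is not None and value > critical:
        return "critical"
    if value > moderate:
        return "moderate"
    return "ok"


if __name__ == "__main__":
    # Thresholds taken from the "Established connections" row.
    for connections in (1200, 4200, 5200):
        print(connections, classify(connections, moderate=3000, critical=5000))
```

Checking the critical tier first keeps the result unambiguous when a value exceeds both thresholds.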

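1.3. Performance Metrics

System metrics

  1. Operating system information: java.lang:type=OperatingSystem
  2. Thread information: java.lang:type=Threading
  3. Mapped and direct buffer pool usage
  4. Used memory, total capacity, and buffer count, obtained through BufferPoolMXBean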

GC metrics

  1. Young GC
    java.lang:type=GarbageCollector,name=G1 Young Generation
  2. Old GC
    java.lang:type=GarbageCollector,name=G1 Old Generation

JVM metrics

HeapMemoryUsage and NonHeapMemoryUsage are available through MemoryMXBean. Figures for the other JVM memory areas, such as Metaspace and Code Cache, are available through MemoryPoolMXBean.

Topic metrics

In the names below, <topic> stands for the topic name.

  1. Topic bytes-in rate
    kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec,topic=<topic>
  2. Topic bytes-out rate
    kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec,topic=<topic>
  3. Topic rejected bytes rate
    kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec,topic=<topic>
  4. Topic failed fetch request rate
    kafka.server:type=BrokerTopicMetrics,name=FailedFetchRequestsPerSec,topic=<topic>
  5. Topic failed produce request rate
    kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec,topic=<topic>
  6. Topic message-in rate (messages)
    kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic=<topic>
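
The per-topic MBean names differ from the broker-wide ones only by the trailing topic key, so they can be generated rather than typed out. A small sketch follows, with a hypothetical helper name. It relies on Kafka topic names being limited to ASCII letters, digits, '.', '_' and '-', so no ObjectName quoting is attempted.

```python
from typing import Optional

# Metric names taken from the list above.
BROKER_TOPIC_METRICS = (
    "BytesInPerSec",
    "BytesOutPerSec",
    "BytesRejectedPerSec",
    "FailedFetchRequestsPerSec",
    "FailedProduceRequestsPerSec",
    "MessagesInPerSec",
)


def broker_topic_mbean(name: str, topic: Optional[str] = None) -> str:
    """Return the BrokerTopicMetrics MBean name, per topic if one is given."""
    if name not in BROKER_TOPIC_METRICS:
        raise ValueError(f"unknown BrokerTopicMetrics name: {name}")
    mbean = f"kafka.server:type=BrokerTopicMetrics,name={name}"
    if topic is not None:
        # Per-topic variant: the same bean with an extra topic key.
        mbean += f",topic={topic}"
    return mbean


if __name__ == "__main__":
    for metric in BROKER_TOPIC_METRICS:
        print(broker_topic_mbean(metric, topic="orders"))  # "orders" is a made-up topic
```

Calling the helper without a topic yields the broker-wide bean of the same name.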

Broker metrics

  1. Log flush rate and time
    kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs
  2. Under-replicated partitions
    kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
  3. Message-in rate (messages)
    kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec
  4. Bytes-in rate
    kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec
  5. Bytes-out rate
    kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec
  6. Rejected bytes rate
    kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec
  7. Failed fetch request rate
    kafka.server:type=BrokerTopicMetrics,name=FailedFetchRequestsPerSec
  8. Failed produce request rate
    kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec
  9. Leader replica count
    kafka.server:type=ReplicaManager,name=LeaderCount
  10. Partition count
    kafka.server:type=ReplicaManager,name=PartitionCount
  11. Offline partition count
    kafka.controller:type=KafkaController,name=OfflinePartitionsCount
  12. Request handler (I/O) thread idle ratio
    kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent
  13. Leader election rate and time
    kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs
  14. Unclean leader election rate
    kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec
  15. Active controller count
    kafka.controller:type=KafkaController,name=ActiveControllerCount
  16. Produce request rate
    kafka.network:type=RequestMetrics,name=RequestsPerSec,request=Produce
  17. Consumer fetch request rate
    kafka.network:type=RequestMetrics,name=RequestsPerSec,request=FetchConsumer
  18. Follower fetch request rate
    kafka.network:type=RequestMetrics,name=RequestsPerSec,request=FetchFollower
  19. Produce request total time
    kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce
  20. Consumer fetch total time
    kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer
  21. Follower fetch total time
    kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchFollower
  22. Time the follower fetch request waits in the request queue
    kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request=FetchFollower
  23. Time the consumer fetch request waits in the request queue
    kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request=FetchConsumer
  24. Time the Produce request waits in the request queue
    kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request=Produce
  25. Network processor thread idle ratio
    kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent
  26. ISR shrink rate
    kafka.server:type=ReplicaManager,name=IsrShrinksPerSec
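
When TotalTimeMs is high, the usual next step is to see which stage of the request dominates. RequestQueueTimeMs is listed above, and the other stages (LocalTimeMs, RemoteTimeMs, ResponseQueueTimeMs, ResponseSendTimeMs) appear as name labels in the alert table in 1.2. The sketch below picks the slowest stage. It assumes the inputs are mean values over the same window (percentiles do not add up this way), it ignores throttle time, and both the helper name and the sample numbers are hypothetical.

```python
from typing import Dict, Tuple

# Stages of a request, in the order the broker goes through them.
STAGES = (
    "RequestQueueTimeMs",
    "LocalTimeMs",
    "RemoteTimeMs",
    "ResponseQueueTimeMs",
    "ResponseSendTimeMs",
)


def dominant_stage(stage_ms: Dict[str, float]) -> Tuple[str, float]:
    """Return the slowest stage and its share of the summed stage time."""
    total = sum(stage_ms.get(stage, 0.0) for stage in STAGES)
    if total <= 0:
        raise ValueError("no stage timings supplied")
    slowest = max(STAGES, key=lambda stage: stage_ms.get(stage, 0.0))
    return slowest, stage_ms.get(slowest, 0.0) / total


if __name__ == "__main__":
    # Made-up mean timings for Produce requests, in milliseconds.
    sample = {
        "RequestQueueTimeMs": 2.0,
        "LocalTimeMs": 5.0,
        "RemoteTimeMs": 40.0,
        "ResponseQueueTimeMs": 1.0,
        "ResponseSendTimeMs": 2.0,
    }
    stage, share = dominant_stage(sample)
    print(f"{stage} accounts for {share:.0%} of the request time")
```

A dominant RemoteTimeMs points at slow followers, while a dominant RequestQueueTimeMs suggests the request handler threads are saturated.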

1.4. Performance Metric Reference

| Metric | Unit | Meaning |
| --- | --- | --- |
| kafka.broker_offset | offsets | Current message offset on the broker |
| kafka.consumer.bytes_in | bytes/second | Consumer bytes-in rate |
| kafka.consumer.delayed_requests | requests | Number of delayed consumer requests |
| kafka.consumer.expires_per_second | evictions/second | Expiration rate of delayed consumer requests |
| kafka.consumer.fetch_rate | requests | Minimum rate at which the consumer sends fetch requests to a broker |
| kafka.consumer.kafka_commits | writes/second | Rate of offset commits to Kafka |
| kafka.consumer.max_lag | offsets | Maximum consumer lag |
| kafka.consumer.messages_in | messages/second | Rate at which the consumer consumes messages |
| kafka.consumer.zookeeper_commits | writes/second | Rate of offset commits to ZooKeeper |
| kafka.consumer_lag | offsets | Message lag between the consumer and the broker |
| kafka.consumer_offset | offsets | Current message offset of the consumer |
| kafka.expires_sec | evictions/second | Expiration rate of delayed producer requests |
| kafka.follower.expires_per_second | evictions/second | Request expiration rate on followers |
| kafka.log.flush_rate | flushes/second | Log flush rate |
| kafka.messages_in | messages | Incoming message rate |
| kafka.net.bytes_in | bytes/second | Incoming byte rate |
| kafka.net.bytes_out | bytes/second | Outgoing byte rate |
| kafka.net.bytes_rejected | bytes/second | Rejected byte rate |
| kafka.producer.bytes_out | bytes/second | Producer bytes-out rate |
| kafka.producer.delayed_requests | requests | Number of delayed producer requests |
| kafka.producer.expires_per_seconds | evictions/second | Producer request expiration rate |
| kafka.producer.io_wait | nanoseconds | Producer I/O wait time |
| kafka.producer.message_rate | messages/second | Producer message rate |
| kafka.producer.request_latency_avg | milliseconds | Producer average request latency |
| kafka.producer.request_rate | requests/second | Producer requests per second |
| kafka.producer.response_rate | responses/second | Producer responses per second |
| kafka.replication.isr_expands | nodes/second | Rate at which replicas join the ISR pool |
| kafka.replication.isr_shrinks | nodes/second | Rate at which replicas leave the ISR pool |
| kafka.replication.leader_elections | events/second | Leader election rate |
| kafka.replication.unclean_leader_elections | events/second | Unclean leader election rate |
| kafka.replication.under_replicated_partitions | | Number of under-replicated partitions |
| kafka.request.fetch.failed | requests | Number of failed client fetch requests |
| kafka.request.fetch.failed_per_second | requests/second | Rate of failed client fetch requests |
| kafka.request.fetch.time.99percentile | milliseconds | 99th percentile of fetch request time |
| kafka.request.fetch.time.avg | milliseconds | Average fetch request time |
| kafka.request.handler.avg.idle.pct | fraction | Average fraction of time the request handler threads are idle |
| kafka.request.metadata.time.99percentile | milliseconds | 99th percentile of metadata request time |
| kafka.request.metadata.time.avg | milliseconds | Average metadata request time |
| kafka.request.offsets.time.99percentile | milliseconds | 99th percentile of offset request time |
| kafka.request.offsets.time.avg | milliseconds | Average offset request time |
| kafka.request.produce.failed | requests | Number of failed produce requests |
| kafka.request.produce.failed_per_second | requests/second | Rate of failed produce requests |
| kafka.request.produce.time.99percentile | milliseconds | 99th percentile of produce request time |
| kafka.request.produce.time.avg | milliseconds | Average produce request time |
| kafka.request.update_metadata.time.99percentile | milliseconds | 99th percentile of update metadata request time |
| kafka.request.update_metadata.time.avg | milliseconds | Average update metadata request time |

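1.5. Key Metrics Explained

The items below follow the kafka-manager management tool.
1. Under Replicated Partitions (kafka.replication.under_replicated_partitions):
In a healthy cluster, the number of in-sync replicas (ISR) should exactly equal the total number of replicas (AR, Assigned Replicas). If a partition's replica falls too far behind the leader, that follower is removed from the ISR pool, and IsrShrinksPerSec (the rate at which the ISR shrinks) rises accordingly. Because Kafka's high availability depends on replication, this metric deserves close attention: it should stay at 0, and a value above 0 for an extended period needs to be investigated.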
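2. Brokers Spread:
Broker utilization for a topic. For example, in a cluster of 9 brokers, a topic with 7 partitions has a broker spread of 7 / 9 ≈ 78%.
3. Brokers Leader Skew:
Whether leader partitions are unevenly distributed. For example, in a cluster of 9 brokers, a topic with 14 partitions should have about 2 leader partitions per broker. If one broker has 0 leader partitions and another has 4, the broker leader skew is (4 - 2) / 14 ≈ 14%.
Because Kafka reads and writes go through the leader by default, leader skew leads to unbalanced read/write load across brokers. Setting auto.leader.rebalance.enable=true lets Kafka rebalance leaders automatically every 5 minutes, which removes the problem.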
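4. Lag:
Indicates how well a consumer keeps up, computed as Lag = LogSize - Consumer Offset. Kafka Manager reads LogSize from ZooKeeper and reads the offset from the Kafka __consumer_offsets topic. Because there is a time gap between the two reads, on high-throughput topics the offset can end up larger than LogSize, which produces a negative lag.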
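
The four indicators above reduce to a few lines of arithmetic. The sketch below follows the worked examples in the text. The function names and input shapes are hypothetical, and rounding the even share of leaders up (14 / 9 becomes 2) is my reading of the skew example rather than kafka-manager's documented formula.

```python
import math
from typing import Dict, List


def under_replicated(assigned: Dict[int, List[int]], isr: Dict[int, List[int]]) -> List[int]:
    """Partitions whose ISR is smaller than their assigned replica set (AR)."""
    return [p for p, replicas in assigned.items() if len(isr.get(p, [])) < len(replicas)]


def broker_spread(brokers_hosting_topic: int, total_brokers: int) -> float:
    """Fraction of the cluster's brokers that host partitions of the topic."""
    return brokers_hosting_topic / total_brokers


def leader_skew(leaders_per_broker: List[int]) -> float:
    """(most leaders on one broker - even share) / total partitions.

    Assumes a non-empty list with at least one leader in total.
    """
    partitions = sum(leaders_per_broker)
    even_share = math.ceil(partitions / len(leaders_per_broker))
    return (max(leaders_per_broker) - even_share) / partitions


def consumer_lag(log_size: int, consumer_offset: int) -> int:
    """Lag = LogSize - Consumer Offset; negative if the offset was read later."""
    return log_size - consumer_offset


if __name__ == "__main__":
    # Partition 1 has lost two of its three replicas from the ISR.
    print(under_replicated({0: [1, 2, 3], 1: [1, 2, 3]}, {0: [1, 2, 3], 1: [1]}))
    print(f"spread: {broker_spread(7, 9):.1%}")
    # 9 brokers, 14 partitions, one broker with 0 leaders and one with 4.
    print(f"skew: {leader_skew([0, 4, 2, 2, 2, 1, 1, 1, 1]):.1%}")
    # The offset was sampled after LogSize, so the result is negative.
    print("lag:", consumer_lag(1000, 1005))
```

Dashboards often clamp lag with max(0, lag), which hides the negative values but also hides the sampling gap that causes them.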


Original article: https://blog.csdn.net/qq_40477248/article/details/144745053
