# [XJ-NPM Site] NEZHA 22.02: Loki Status Repeatedly Abnormal
| ID | Creation Date | Assignee | Status |
|----|----------------|----------|--------|
| OMPUB-688 | 2022-11-07T17:12:44.000+0800 | 史振东 | Resolved |
---
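The Nezha troubleshooting steps were as follows:
1. Charts report the error "No PromServer available" (the Prometheus service is unavailable).
!image-2022-11-07-16-59-33-332.png|width=692,height=286!
2. Running a metric query in Explore in the Nezha web UI also reports that the Prometheus service is unavailable.
!image-2022-11-07-16-59-48-757.png|width=683,height=293!
3. Opening the Prometheus UI in a browser and running the metric query returns:
(1) Warning: unexpected response status when fetching the server time: Service Unavailable.
(2) Error executing the query: the query timed out in the query queue.
Metric data cannot be displayed normally.
!image-2022-11-07-17-00-03-215.png|width=691,height=204!
4. Check the nz-agent error log:
/var/log/nezha/nz-agent/nz-agent-error-2022-11-06.0.log
The error log shows a "Read timed out" error when connecting to the application on port 13100.
!image-2022-11-07-17-01-24-057.png|width=702,height=27!
5. According to the port plan in the conf space, port 13100 belongs to the loki component.
!image-2022-11-07-17-02-33-101.png|width=351,height=82!
6. Check the loki status: it is shown as stopped.
!image-2022-11-07-17-02-59-339.png|width=686,height=153!
7. Check the loki process information. The process state is Dsl, where D means uninterruptible sleep.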
!image-2022-11-07-17-03-30-219.png|width=696,height=28!
---
**shizhendong** commented on *2022-11-09T17:04:46.162+0800*:
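Cause of the bug: the loki process was in uninterruptible sleep (disk sleep, state D), which put the nz-agent program into an abnormal state and broke monitoring.
Investigation:
1. The loki.service log shows that systemd failed to restart loki because the port was already in use:
`level=error msg="error running loki" err="listen tcp :13100: bind: address already in use"`
2. Checking port usage showed that the port was held by a loki process.
!1.png|thumbnail!
3. Checking that process's state confirmed that the loki process was hung and could not be killed.
!2.png|thumbnail!
4. The suspected causes were (1) NIC sleep and (2) abnormal disk I/O. Disk checks and NIC status checks both came back normal; no anomalies were found.
5. The bug was temporarily resolved by rebooting the server.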
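
Two steps of this investigation (reading the bind error from the loki.service log and interpreting the process state) can be sketched in Python. This is a minimal illustration, not tooling used on site: the function names are hypothetical, and the state descriptions follow the `ps` man page (D = uninterruptible sleep, s = session leader, l = multi-threaded).

```python
import re
from typing import List, Optional

# Meanings of the ps(1) STAT characters relevant to this ticket.
STAT_FLAGS = {
    "D": "uninterruptible sleep (usually I/O)",
    "R": "running or runnable",
    "S": "interruptible sleep",
    "T": "stopped by job control signal",
    "Z": "defunct (zombie)",
    "s": "session leader",
    "l": "multi-threaded",
    "<": "high priority",
    "N": "low priority",
    "+": "foreground process group",
}


def decode_stat(stat: str) -> List[str]:
    """Translate a ps STAT string such as 'Dsl' into readable descriptions."""
    return [STAT_FLAGS.get(ch, "unknown flag {!r}".format(ch)) for ch in stat]


def is_uninterruptible(stat: str) -> bool:
    """True when the primary state (the first character) is D."""
    return stat.startswith("D")


def port_in_use_from_log(line: str) -> Optional[int]:
    """Extract the port from a 'bind: address already in use' log line."""
    match = re.search(r"listen tcp \S*:(\d+): bind: address already in use", line)
    return int(match.group(1)) if match else None
```

A process whose primary state is D does not act on signals, including SIGKILL, until the blocking kernel operation returns. That matches the observation that the loki process could not be killed and that only a reboot released port 13100.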
---
**shizhendong** commented on *2022-11-09T17:08:37.513+0800*:
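This bug remains under observation. The following monitoring has been added for NEZHA in this environment:
1. Basic monitoring of 10.111.231.101 (CPU, Disk, Memory...), with the corresponding monitoring charts added to asset info.
2. NZ component monitoring (Dashboard = NEZHA monitoring), covering the program components (Prometheus, loki, redis, mariadb, etc.).
3. NZ alert rules covering the NEZHA components and basic metrics on device 10.111.231.101.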
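
As an illustration of what a component check for this failure mode could look like, the sketch below probes a TCP port and distinguishes a refused connection from one that connects but never answers, which is the "Read timed out" symptom nz-agent logged for port 13100. It is an assumption-laden sketch, not the NEZHA alert rule itself: the function name `probe_http` and the `/ready` path are illustrative choices, and the check only looks for any response bytes rather than parsing the HTTP status.

```python
import socket


def probe_http(host: str, port: int, timeout: float = 3.0, path: str = "/ready") -> str:
    """Return 'ok', 'refused', 'timeout', or 'error' for a plain HTTP GET."""
    try:
        # create_connection applies the timeout to connect, send, and recv.
        with socket.create_connection((host, port), timeout=timeout) as sock:
            request = "GET {} HTTP/1.0\r\nHost: {}\r\n\r\n".format(path, host)
            sock.sendall(request.encode("ascii"))
            data = sock.recv(1024)
    except ConnectionRefusedError:
        return "refused"  # nothing is listening on the port
    except socket.timeout:
        return "timeout"  # connected, but no answer arrived in time
    except OSError:
        return "error"
    # Any response bytes count as alive; the status line is not inspected.
    return "ok" if data else "error"
```

For example, `probe_http("10.111.231.101", 13100)` returning `"timeout"` would indicate a process that still holds the port but no longer responds, whereas `"refused"` would indicate that the service is simply down.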
---
## Attachments
**32643/1.png**
---
**32644/2.png**
---
**32604/image-2022-11-07-16-59-33-332.png**
---
**32603/image-2022-11-07-16-59-48-757.png**
---
**32602/image-2022-11-07-17-00-03-215.png**
---
**32601/image-2022-11-07-17-01-24-057.png**
---
**32600/image-2022-11-07-17-02-17-148.png**
---
**32599/image-2022-11-07-17-02-33-101.png**
---
**32598/image-2022-11-07-17-02-59-339.png**
---
**32597/image-2022-11-07-17-03-30-219.png**
---