# [E21-NZ] BJR-IGW NPB frequently raises short-lived asset_ping_failed alerts
| ID | Creation Date | Assignee | Status |
|----|----------------|----------|--------|
| OMPUB-575 | 2022-08-04T21:30:17.000+0800 | 方顺健 | Closed |
---
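Recently, the NPBs at BJR-IGW and DIR-IGW and the national center servers have been raising short-lived asset_ping_failed alerts. The BJR-IGW NPB alerts have been especially frequent: each time a single NPB alerts, and most alerts last 4m30s.
Progress so far:
During an alert, pinging the NPB from nz-agent succeeds with no packet loss, yet the ping status collected by NZ is failed (value 0). The switch logs for the alert period show no related anomalies either.

See the attachments for the alert message details.
---
**fangshunjian** commented on *2022-08-05T11:00:03.392+0800*: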
{code:bash}
# Log in to the agent server
ssh 10.243.12.3
# Back up the configuration file
cp /opt/nezha/nz-agent/blackbox_exporter/config.conf /opt/nezha/nz-agent/blackbox_exporter/config.conf.bak
# Raise the log level to debug
echo "OPTION=\"--web.listen-address='0.0.0.0:19115' --config.file='/opt/nezha/nz-agent/blackbox_exporter/blackbox.yml' --log.level=debug\"" > /opt/nezha/nz-agent/blackbox_exporter/config.conf
# Restart blackbox-exporter
systemctl restart blackbox-exporter.service
{code}
[~liuju] Please follow the steps above to enable debug-level logging so that we can investigate further the next time the alert occurs.
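
Before waiting for the next alert, it may help to confirm that the new log level actually landed in the config file. This is a minimal sketch, assuming the config path from the steps above; the helper name `has_debug_level` is hypothetical and not part of nz-agent:

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given config file sets --log.level=debug.
has_debug_level() {
  grep -qF -e '--log.level=debug' "$1"
}

# Path taken from the steps in this comment; override CONF to check another file.
CONF="${CONF:-/opt/nezha/nz-agent/blackbox_exporter/config.conf}"
if [ -f "$CONF" ]; then
  if has_debug_level "$CONF"; then
    echo "debug logging configured in $CONF"
  else
    echo "debug logging NOT configured in $CONF"
  fi
fi
```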
---
**liuju** commented on *2022-08-05T15:28:25.953+0800*:
Received. Debug-level logging is now enabled on 10.243.12.3; monitoring.
---
**fangshunjian** commented on *2022-08-08T14:39:13.968+0800*:
Export the blackbox exporter logs:
journalctl --since "2022-08-06" -u blackbox-exporter.service > blackbox.log && tar -zcvf blackbox.tar.gz ./blackbox.log
---
**fangshunjian** commented on *2022-08-08T15:08:58.422+0800*:
Investigation shows that the ping response timed out when the alert fired:
Cannot get TTL from the received packet
!image-2022-08-08-15-07-58-373.png!
!image-2022-08-08-16-09-06-021.png!
!image-2022-08-08-16-09-49-835.png!
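
To see how often this error appears in an exported log, a small filter can help. This is a sketch, assuming the log was exported to blackbox.log as in the earlier comment; the function name `count_ttl_errors` is hypothetical, and the search string is the message quoted above:

```shell
#!/bin/sh
# Hypothetical helper: count log lines containing the error quoted in this ticket.
count_ttl_errors() {
  # grep exits 1 when nothing matches but still prints 0, so ignore the status.
  grep -cF 'Cannot get TTL from the received packet' "$1" || true
}

# File name from the export command in this thread; override LOG for other files.
LOG="${LOG:-blackbox.log}"
if [ -f "$LOG" ]; then
  echo "TTL errors in $LOG: $(count_ttl_errors "$LOG")"
fi
```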
---
**fangshunjian** commented on *2022-08-08T17:07:34.954+0800*:
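1. Upload the script to 10.243.12.3
* [^ping_test.sh]
2. chmod +x ping_test.sh  # make the script executable
3. nohup ./ping_test.sh 10.243.11.2 &  # run a ping every 10s
4. The next time 10.243.11.2 raises an alert, check whether ping_test.log shows the same errors, to determine whether this is a blackbox exporter bug.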
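
The attached ping_test.sh is not reproduced in this ticket. For readers without the attachment, the sketch below shows one way such a script could look; it is a hypothetical stand-in, not the actual attachment. The `probe_once` function, the `LOG_FILE` and `PING_CMD` variables, and the log line format are all assumptions; only the target argument, the 10s interval, and the ping_test.log file name come from the steps above.

```shell
#!/bin/sh
# Hypothetical stand-in for ping_test.sh (the real attachment is not shown here).
# Usage: ./ping_test.sh <target> [interval_seconds]

LOG_FILE="${LOG_FILE:-ping_test.log}"
PING_CMD="${PING_CMD:-ping}"   # overridable so the probe can be stubbed

# Send one probe and append a timestamped OK/FAIL line to the log.
probe_once() {
  target="$1"
  ts="$(date '+%Y-%m-%d %H:%M:%S')"
  # -c 1: one echo request; -W 2: wait up to 2 seconds (Linux iputils syntax, assumed).
  if output="$("$PING_CMD" -c 1 -W 2 "$target" 2>&1)"; then
    echo "$ts $target OK" >> "$LOG_FILE"
  else
    echo "$ts $target FAIL: $(printf '%s\n' "$output" | tail -n 1)" >> "$LOG_FILE"
    return 1
  fi
}

# Loop only when a target is given on the command line.
if [ "$#" -ge 1 ]; then
  while true; do
    probe_once "$1" || true
    sleep "${2:-10}"   # default interval: 10 seconds
  done
fi
```

Logging a timestamp on every probe makes it possible to line up any FAIL entries with the alert times shown in NZ.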
 
---
**liuju** commented on *2022-08-08T18:03:28.525+0800*:
Received, OK.
---
**liuju** commented on *2022-08-08T18:16:35.375+0800*:
The ping status check script has been deployed.
---
**liuju** commented on *2022-08-08T21:47:50.140+0800*:
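At 2022-08-08 16:24:58, 10.243.11.2 raised a ping alert. While the alert was active, the alert message showed a duration of 9m, but after the alert cleared the same message showed a duration of 4m30s. While the alert was active, at 2022-08-08 16:30:00, I pinged 10.243.11.2 from NZ-agent and took a screenshot: no packet loss.
After the alert cleared, I ran the following command: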
 journalctl --since "2022-08-08" -u blackbox-exporter.service > blackbox20220808.log && tar -zcvf blackbox20220808.tar.gz ./blackbox20220808.log
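I exported blackbox20220808.tar.gz, the ping_test.log log, and screenshots of the alert messages on NZ, packed them into the folder ping20220808, compressed it, and uploaded it as an attachment.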
---
**fangshunjian** commented on *2022-08-10T17:33:27.292+0800*:
[~liuju] Please follow the steps below to update the blackbox_exporter binary:
1. ssh 10.243.12.3
2. mv /opt/nezha/nz-agent/blackbox_exporter/blackbox_exporter /opt/nezha/nz-agent/blackbox_exporter/blackbox_exporter_bak  # back up the current binary
3. Upload the blackbox_exporter file to /opt/nezha/nz-agent/blackbox_exporter/
* [^blackbox_exporter]
4. systemctl restart blackbox-exporter.service  # restart the service
5. systemctl status blackbox-exporter.service  # check the status
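
The backup-and-replace steps above can be sketched as a small helper. This is an illustration under assumptions, not an official procedure: the function name `replace_binary` is hypothetical, and it expects the new file to be uploaded to a staging path (shown here as /tmp/blackbox_exporter, an assumed location) rather than directly into the install directory.

```shell
#!/bin/sh
# Hypothetical helper: back up the current binary, then install the new one.
# $1 = install directory, $2 = path to the uploaded blackbox_exporter file
replace_binary() {
  dir="$1"
  new="$2"
  [ -f "$new" ] || { echo "new binary not found: $new" >&2; return 1; }
  # Keep the old binary so the change can be rolled back.
  if [ -f "$dir/blackbox_exporter" ]; then
    mv "$dir/blackbox_exporter" "$dir/blackbox_exporter_bak" || return 1
  fi
  cp "$new" "$dir/blackbox_exporter" && chmod +x "$dir/blackbox_exporter"
}

# Example (install path from the steps above; the staging path is an assumption):
# replace_binary /opt/nezha/nz-agent/blackbox_exporter /tmp/blackbox_exporter \
#   && systemctl restart blackbox-exporter.service \
#   && systemctl status blackbox-exporter.service
```

If the new binary misbehaves, moving blackbox_exporter_bak back into place and restarting the service restores the previous version.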
 
 
---
**liuju** commented on *2022-08-11T14:33:12.745+0800*:
OK, updated. Continuing to monitor.
---
**liuju** commented on *2022-08-12T14:52:44.939+0800*:
I updated blackbox-exporter.service on 10.243.12.3 at 2022-08-11 09:30:46. Checking the ping alerts on NZ since the update, there has still been one so far. !image-2022-08-12-09-52-24-114.png!
---
**fangshunjian** commented on *2022-08-16T10:43:22.687+0800*:
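To avoid false alarms, adjust the alert rule duration so that an alert is raised only when the failure persists for more than two check cycles.
1. Log in to the NZ system
2. Go to Configuration / APM Settings
* Change ping interval to 60s
* Save
3. Go to Alerts / Rules
* Change the duration of the asset_ping_failed rule to 120s
* Save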
 
!image-2022-08-16-10-36-52-190.png!
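
The effect of these two settings can be checked with simple arithmetic. The sketch below uses a simplified model that is an assumption, not documented NZ behavior: a failed probe result stays in effect until the next probe, so the rule fires only once enough consecutive probes have failed to cover the rule duration. The function name `min_failed_probes` is hypothetical.

```shell
#!/bin/sh
# Simplified model (assumption): a failed result holds until the next probe,
# so n consecutive failures keep the state failed for about n * interval seconds.
# $1 = rule duration in seconds, $2 = ping interval in seconds
min_failed_probes() {
  echo $(( ($1 + $2 - 1) / $2 ))   # ceiling of duration / interval
}

min_failed_probes 120 60    # settings recommended above
min_failed_probes 60 300    # settings before the change, as reported later in this thread
```

Under this model, the recommended settings require two consecutive failed probes, whereas a 300s interval with a 60s duration lets a single failed probe raise the alert.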
---
**liuju** commented on *2022-08-16T21:07:38.243+0800*:
Changed the NZ Configuration / APM Settings ping interval from 300s to 60s.
Changed the NZ Alerts / Rules asset_ping_failed duration from 60s to 130s.
Continuing to monitor the effect.
 
---
**liuju** commented on *2022-08-19T16:42:07.187+0800*:
As of today, 2022-08-19, I queried the ping alert messages on NZ for the past two days; the problem reported in this issue has not recurred.
---
**liuju** commented on *2022-08-26T15:46:12.825+0800*:
As of 2022-08-26, the problem reported in this issue has not recurred since the update. Closing the issue.
---
## Attachments
**30119/20220804_Last_7days_ping_failed.xlsx**
---
**30120/alert-message-2022-08-04+16-28-01.xlsx**
---
**30290/blackbox_exporter**
---
**30187/image-2022-08-08-15-07-58-373.png**
---
**30196/image-2022-08-08-16-09-06-021.png**
---
**30195/image-2022-08-08-16-09-49-835.png**
---
**30377/image-2022-08-12-09-52-24-114.png**
---
**30456/image-2022-08-16-10-36-52-190.png**
---
**30201/ping_test.sh**
---
**30212/ping20220808.rar**
---
**30115/微信图片_20220804162536.png**
---
**30116/微信图片_20220804162542.png**
---
**30117/微信图片_20220804162548.png**
---
**30118/微信图片_20220804162619.png**
---
**30121/微信图片_20220804162907.png**
---