# Multiple NPB devices at the BOL-IGW site raised TSG-OS container restart alerts
| ID | Creation Date | Assignee | Status |
|----|----------------|----------|--------|
| OMPUB-1446 | 2024-09-02T20:06:19.000+0800 | 杨威 | Open |
---
No description
---
**yangwei** commented on *2024-09-03T11:20:14.754+0800*:
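Observations
* Of the 10 NPBs in the two Bole-IGW groups, every traffic_engine except the one on Bole-IGW T9K02 NPB05 restarted between 00:40 and 00:42 on September 1. The restarts were triggered by marsio reporting no mbuf.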
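Analysis
* Checked the logs of the NPBs that restarted
** On each of them, the firewall process raised an alert at 00:42 for CPU usage above 99%, covering all packet-processing cores.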
!image-2024-09-03-10-45-22-623.png|width=536,height=179!
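** NPB05, which did not restart, raised the same CPU-above-99% alert at the same moment; it lasted 2 s and then cleared. Presumably this NPB had slightly more free mbufs than the other nodes at the time, so marsio did not report no mbuf. This NPB also did not raise a deadlock-detection alert.
** Because CPU usage on all firewall packet-processing threads was close to 100%, the mbufs in the marsio buffer queues were not processed in time, and most NPBs raised the marsio no mbuf alert within 1 s.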
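* Checked the monitoring of the NPB that did not restart
** The restarts occurred during a low-traffic period. At around 00:41 the rate of new UDP sessions rose sharply, and monitor hits jumped from 100~300 per second to 9k+.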
!image-2024-09-03-11-03-22-597.png|width=389,height=266!!image-2024-09-03-11-07-07-174.png|width=393,height=277!
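Conclusion
* In summary, based on the logs and monitoring, the likely cause of the restarts is that at 00:41 on September 1 all NPBs at the Bole-IGW site received a burst of UDP traffic that also hit monitor policies in large numbers. CPU usage on all processing threads exceeded 99%, which triggered the marsio no mbuf alert and caused all pods to restart.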
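
The reasoning above (stalled packet-processing threads, a draining mbuf pool, and NPB05 surviving because it had slightly more free mbufs) can be illustrated with a rough back-of-the-envelope model. This is only a sketch: the function name, the one-mbuf-per-packet simplification, and every number in it are hypothetical and were not measured on site.

```python
def seconds_until_no_mbuf(free_mbufs: int, arrival_pps: float, drain_pps: float) -> float:
    """Estimate how long a free mbuf pool lasts when packets arrive faster
    than the saturated worker threads can release them.

    Simplifying assumption: each queued packet holds exactly one mbuf.
    Returns infinity when the workers keep up with the arrival rate.
    """
    net_fill_rate = arrival_pps - drain_pps  # mbufs consumed per second
    if net_fill_rate <= 0:
        return float("inf")
    return free_mbufs / net_fill_rate


# Hypothetical figures, chosen only to show the shape of the argument.
DETECTION_PERIOD_S = 1.0  # the 1 s minimum detection period quoted in the ticket

typical_node = seconds_until_no_mbuf(200_000, arrival_pps=500_000, drain_pps=100_000)
roomier_node = seconds_until_no_mbuf(450_000, arrival_pps=500_000, drain_pps=100_000)

for name, t in (("typical node", typical_node), ("node with more free mbufs", roomier_node)):
    verdict = "pool outlasts" if t >= DETECTION_PERIOD_S else "pool empties before"
    print(f"{name}: {t:.3f} s, {verdict} one detection period")
```

Under these assumed numbers the first pool empties in 0.5 s, before a 1 s detection period can elapse, while the second lasts 1.125 s. The point is only that a small difference in free mbufs can decide which side of the detection period a node falls on.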
 
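Issues
* The burst was short, about 1~2 s according to the logs. On every NPB except NPB05, marsio no mbuf was triggered before the minimum overload protection detection period of 1 s had elapsed. The firewall needs to consider reducing the overload protection detection period to a finer granularity.
* Consider adding a limit on the monitor policy hit rate.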
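
One common way to cap a hit rate is a token bucket. The following is a minimal sketch of that general technique, not the firewall's actual design: the class, the `rate_per_sec` and `burst` parameters, and the choice of 300 hits per second are all hypothetical. Timestamps are passed in explicitly so the example is deterministic.

```python
class TokenBucket:
    """Generic token-bucket limiter (illustrative; not taken from the firewall code)."""

    def __init__(self, rate_per_sec: float, burst: float, now: float = 0.0):
        self.rate = rate_per_sec  # tokens refilled per second
        self.burst = burst        # maximum tokens that can accumulate
        self.tokens = burst       # start with a full bucket
        self.last = now           # timestamp of the previous refill

    def allow(self, now: float) -> bool:
        """Return True if one monitor hit may be processed at time `now`."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def count_allowed(hits_per_sec: int, rate_per_sec: float, burst: float) -> int:
    """Feed one second of evenly spaced hits through a fresh bucket."""
    bucket = TokenBucket(rate_per_sec, burst)
    return sum(bucket.allow(i / hits_per_sec) for i in range(hits_per_sec))


# A 9k hits/s burst against a hypothetical 300 hits/s limit with a burst of 300.
print(count_allowed(9000, rate_per_sec=300, burst=300))
# The usual 100~300 hits/s load stays within the same limit.
print(count_allowed(200, rate_per_sec=300, burst=300))
```

With these assumed settings, only about 600 of the 9000 hits in the burst second are admitted (the initial burst allowance plus one second of refill), while the normal load passes untouched. What should happen to the hits that are not admitted (for example, skipping the monitor action while still forwarding the packet) is a design decision this sketch does not address.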
 
---
# Attachments
Attachment: image-2024-09-03-10-45-22-623.png
![image-2024-09-03-10-45-22-623.png](https://gfwleak.exec.li/admin/geedge-jira/raw/branch/master/attachment/62173/image-2024-09-03-10-45-22-623.png)
Attachment: image-2024-09-03-11-03-22-597.png
![image-2024-09-03-11-03-22-597.png](https://gfwleak.exec.li/admin/geedge-jira/raw/branch/master/attachment/62174/image-2024-09-03-11-03-22-597.png)
Attachment: image-2024-09-03-11-07-07-174.png
![image-2024-09-03-11-07-07-174.png](https://gfwleak.exec.li/admin/geedge-jira/raw/branch/master/attachment/62175/image-2024-09-03-11-07-07-174.png)