# Multiple NPB devices at the BOL-IGW site raised TSG-OS container restart alerts
| ID | Creation Date | Assignee | Status |
|----|----------------|----------|--------|
| OMPUB-1446 | 2024-09-02T20:06:19.000+0800 | 杨威 | Open |
---
No description
---
**yangwei** commented on *2024-09-03T11:20:14.754+0800*:
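On-site observations
* Of the 10 NPBs in the two Bole-IGW groups, every traffic_engine except the one on Bole-IGW T9K02 NPB05 restarted between 00:40 and 00:42 on September 1. In each case the restart was triggered by marsio reporting no mbuf.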
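Analysis
* Checked the logs of the NPBs that restarted
** On each of them, the firewall process raised an alert at 00:42 for CPU usage above 99%, covering all packet-processing cores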
!image-2024-09-03-10-45-22-623.png|width=536,height=179!
*
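** NPB05, which did not restart, raised the same CPU usage above 99% alert at the same moment; it cleared after 2s. Presumably this NPB had slightly more free mbufs than the other nodes at that time, so marsio did not hit no mbuf. This NPB also did not trigger a deadlock detection alert.
** Because CPU usage on all firewall packet-processing threads was close to 100%, the mbufs in the marsio buffer queues were not processed in time, and most NPBs triggered the marsio no mbuf alert within 1s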
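* Checked the monitoring of the NPB that did not restart
** The restarts happened during a traffic trough. At around 00:41, new UDP sessions rose sharply, and monitor hits jumped from 100-300 per second to 9k+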
!image-2024-09-03-11-03-22-597.png|width=389,height=266!!image-2024-09-03-11-07-07-174.png|width=393,height=277!
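The mbuf reasoning above can be illustrated with a rough back-of-the-envelope model. This is only a sketch: it assumes each queued packet holds one mbuf until firewall processes it and that arrival and processing rates are constant during the burst. The function name and all numbers are hypothetical and are not taken from the logs.

```python
import math


def seconds_until_no_mbuf(free_mbufs: int, arrival_pps: float, drain_pps: float) -> float:
    """Estimate how long a free mbuf pool lasts under a constant-rate burst.

    Simplified model: every arriving packet takes one mbuf, every processed
    packet returns one. Returns math.inf when processing keeps up with arrivals.
    """
    net_growth = arrival_pps - drain_pps  # mbufs consumed per second
    if net_growth <= 0:
        return math.inf
    return free_mbufs / net_growth


# Hypothetical values for illustration only (not measured on the NPBs).
burst_seconds = 1.5
for name, free in (("node with fewer free mbufs", 500_000),
                   ("node with more free mbufs", 1_000_000)):
    t = seconds_until_no_mbuf(free, arrival_pps=1_000_000, drain_pps=400_000)
    outcome = "no mbuf before burst ends" if t < burst_seconds else "survives the burst"
    print(f"{name}: pool lasts {t:.2f}s -> {outcome}")
```

Under this model, a node with a somewhat larger free pool can outlast a burst of 1-2s while its peers run out in under 1s, which is consistent with the explanation given for NPB05.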
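Conclusion
* In summary, based on the on-site logs and monitoring, the presumed cause of the restarts is as follows: at 00:41 on September 1, all NPBs at the Bole-IGW site received a burst of UDP traffic that also hit monitor policies in large numbers. CPU usage on all processing threads exceeded 99%, which triggered the marsio no mbuf alert and caused all pods to restart.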
 
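Issues
* The traffic burst was short; the logs show it lasted about 1-2s. Except for NPB05, the NPBs triggered marsio no mbuf before reaching the 1s minimum detection period of overload protection. firewall should consider adjusting the overload protection detection period to a finer granularity.
* Consider adding a limit on the hit rate of monitor policies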
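One common way to cap a hit rate is a token bucket, sketched below. This is a minimal illustration, not firewall's actual design: the class name `TokenBucket`, the rate of 1000 hits/s, and the burst allowance of 300 are all hypothetical, and what to do with hits over the limit (for example, skipping the monitor action while still forwarding the packet) is a design decision this ticket does not specify.

```python
class TokenBucket:
    """Token-bucket limiter; timestamps are passed in so behavior is deterministic."""

    def __init__(self, rate_per_sec: float, burst: float, now: float = 0.0):
        self.rate = rate_per_sec   # tokens refilled per second
        self.burst = burst         # maximum tokens the bucket can hold
        self.tokens = burst        # start with a full bucket
        self.last = now            # time of the previous refill

    def allow(self, now: float) -> bool:
        """Return True if one hit may be processed at time `now` (seconds)."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def count_allowed(hits_per_sec: int, limiter: TokenBucket) -> int:
    """Replay one second of evenly spaced hits and count how many pass."""
    return sum(limiter.allow(i / hits_per_sec) for i in range(hits_per_sec))


# Hypothetical limit: 1000 monitor hits/s with a burst allowance of 300.
quiet = count_allowed(300, TokenBucket(rate_per_sec=1000, burst=300))
spike = count_allowed(9000, TokenBucket(rate_per_sec=1000, burst=300))
print(f"quiet period: {quiet} of 300 hits processed")
print(f"burst period: {spike} of 9000 hits processed")
```

Because the check runs on every hit rather than once per detection period, a limiter of this kind reacts within the first fraction of a second of a burst, while normal traffic in the 100-300 hits/s range passes untouched.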
 
---
## Attachments
**62173/image-2024-09-03-10-45-22-623.png**
---
**62174/image-2024-09-03-11-03-22-597.png**
---
**62175/image-2024-09-03-11-07-07-174.png**
---