# [E21 Site] After upgrading to 23.07, Bole-IGW restarts repeatedly because HTTP session logging allocates large amounts of memory
| ID | Creation Date | Assignee | Status |
|----|----------------|----------|--------|
| OMPUB-1054 | 2023-11-05T12:10:53.000+0800 | 杨威 | Closed |
---
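Starting at around 12:00 (UTC+3) on 2023-11-03, Bole-IGW NPB03 has been restarting continuously. The restarts are triggered by a watchdog timeout, caused by some threads allocating large amounts of memory and incurring page faults. With the watchdog disabled, the process hits OOM quickly after startup.

---
**liuyang** commented on *2023-11-05T12:11:43.168+0800*: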
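Preliminary diagnosis: abnormal HTTP URLs in the traffic cause the session plugin to use an abnormal amount of memory. As a temporary workaround, the session plugin now records only the Host for HTTP. This is done by editing session_record.inf on the affected NPB: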
```ini
[HTTP]
#FUNC_FLAG=ALL
FUNC_FLAG=HTTP_HOST
```
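The effect of this change can be pictured as a field mask: `ALL` lets every HTTP field into the log record, while `HTTP_HOST` lets through only the Host, so an oversized URL never reaches the log builder. The sketch below is hypothetical and assumes that model; `enum http_field`, `parse_func_flag`, `should_log`, and the choice of modeling only Host and URL are illustrative names and simplifications, not taken from the session_record source.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical bitmask of HTTP fields the session logger may record.
 * Only Host and URL are modeled in this sketch. */
enum http_field {
    HTTP_FIELD_HOST = 1 << 0,
    HTTP_FIELD_URL  = 1 << 1,
    HTTP_FIELD_ALL  = HTTP_FIELD_HOST | HTTP_FIELD_URL,
};

/* Map a FUNC_FLAG value from the [HTTP] section of session_record.inf
 * to a field mask. In this sketch, unknown values map to 0 (log no
 * HTTP fields) rather than guessing. */
unsigned parse_func_flag(const char *value)
{
    assert(value != NULL);
    if (strcmp(value, "ALL") == 0)
        return HTTP_FIELD_ALL;
    if (strcmp(value, "HTTP_HOST") == 0)
        return HTTP_FIELD_HOST;
    return 0;
}

/* Nonzero if the field is enabled in the mask, i.e. it would be
 * appended to the HTTP log record. */
int should_log(unsigned mask, enum http_field field)
{
    return (mask & (unsigned)field) != 0;
}
```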
---
**yangwei** commented on *2023-11-06T14:24:02.557+0800*:
* Fault start time: 2023-11-03 11:40 (UTC+3)
* Troubleshooting window: 2023-11-03 15:11-17:48 (UTC+3)

Restart symptom: after running for a while, per-packet processing latency becomes high and triggers a watchdog timeout.
!anydesk00001.png|thumbnail!
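State at the time of restart: session_record calls sendlog to send a log, and the realloc call on that path takes long enough to cause the timeout.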
!anydesk00000.png|thumbnail!
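One plausible way log assembly ends up spending this much time in realloc is growing the record buffer by exactly the needed amount on every append, so a very long field can trigger repeated reallocation (and potentially copying) of a large block. The sketch below assumes a simple heap-backed record buffer; `struct logbuf`, `logbuf_append`, and `logbuf_free` are hypothetical names, not session_record code. It grows capacity geometrically so the number of realloc calls stays logarithmic in the record size.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer used to assemble one log record. */
struct logbuf {
    char  *data;
    size_t len; /* bytes used, excluding the terminating NUL */
    size_t cap; /* bytes allocated */
};

/* Append n bytes from s. Capacity doubles when full, so building a record
 * from many fields costs O(log size) realloc() calls rather than one per
 * append. Returns 0 on success, -1 on overflow or allocation failure
 * (the existing contents are left intact). */
int logbuf_append(struct logbuf *b, const char *s, size_t n)
{
    assert(b != NULL && (s != NULL || n == 0));
    if (n > SIZE_MAX - b->len - 1)
        return -1;
    size_t need = b->len + n + 1; /* +1 for the NUL terminator */
    if (need > b->cap) {
        size_t cap = b->cap ? b->cap : 64;
        while (cap < need) {
            if (cap > SIZE_MAX / 2)
                return -1;
            cap *= 2;
        }
        char *p = realloc(b->data, cap);
        if (p == NULL)
            return -1;
        b->data = p;
        b->cap = cap;
    }
    if (n > 0)
        memcpy(b->data + b->len, s, n);
    b->len += n;
    b->data[b->len] = '\0';
    return 0;
}

void logbuf_free(struct logbuf *b)
{
    free(b->data);
    b->data = NULL;
    b->len = 0;
    b->cap = 0;
}
```

Geometric growth only reduces how often realloc runs; it does not bound memory. A record containing an abnormally large URL would still allocate at least that much, which is why limiting what gets logged matters.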
Diagnosis steps:
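* With the watchdog disabled and the process run in the foreground under gdb, after a few minutes the sys time on some CPU cores rose while memory grew rapidly, triggering OOM.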
!anydesk00002.png|thumbnail!
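* Running `perf top -C` on the cores with elevated sys time showed heavy calls to memory-allocation functions. The corresponding flame graph showed that the sys-time increase came from page-fault handling.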
!anydesk00003.png|thumbnail!
!image-2023-11-06-14-14-36-831.png|width=468,height=336!
* `bt` on the cores with high sys time showed their call stacks were mostly in HTTP log processing.
!image-2023-11-06-14-16-02-180.png|width=636,height=358!
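* With the HTTP data-processing entry in session_record disabled, the memory growth did not recur.
* With session_record processing only HTTP URL data, the memory growth recurred.
* With session_record processing only the HTTP Host, the memory growth did not recur. A check of the restart scope across the whole network showed that only Bole-IGW NPB03 was affected in that period. Preliminary conclusion: abnormal HTTP URLs in the traffic cause abnormal memory usage when session_record concatenates log entries, which triggers the watchdog timeout.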
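Given that oversized URLs inflate the concatenated log record, a common defensive measure is to cap each variable-length field before it is written. The ticket itself used a different workaround (logging only the Host), so the sketch below is purely illustrative: `format_capped_field`, the `MAX_URL_LOG_LEN` value, and the `...` truncation marker are all assumptions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical cap, kept small for the example; a real limit would be larger
 * but must fit in an int because it is used as a printf precision. */
#define MAX_URL_LOG_LEN 16

/* Write "key=value" into out (size outsz), keeping at most max_len bytes of
 * value and appending "..." when it was truncated. Returns the number of
 * bytes written (excluding NUL), or -1 if out is too small. */
int format_capped_field(char *out, size_t outsz, const char *key,
                        const char *value, size_t max_len)
{
    assert(out != NULL && key != NULL && value != NULL);
    size_t vlen = strlen(value);
    int truncated = vlen > max_len;
    int n = snprintf(out, outsz, "%s=%.*s%s", key,
                     (int)(truncated ? max_len : vlen), value,
                     truncated ? "..." : "");
    if (n < 0 || (size_t)n >= outsz)
        return -1;
    return n;
}
```

With a cap in place, the memory needed per record is bounded by the number of fields times the cap, regardless of what arrives on the wire.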

Remediation:
* 2023-11-03 17:48 (UTC+3): temporarily changed the HTTP entry of session_record on Bole-IGW NPB03 to process only the Host; monitoring for further occurrences.
---
**yangwei** commented on *2023-11-06T14:32:12.243+0800*:
At 2023-11-03 21:02 (UTC+3), Bole-IGW NPB03 restarted again. The symptom was still a watchdog timeout, and PSI monitoring showed high CPU waiting during that period.
!image-2023-11-06-14-29-10-155.png|width=521,height=1098!
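Assuming "PSI" here refers to Linux Pressure Stall Information, the CPU figures come from `/proc/pressure/cpu`, whose first line has the form `some avg10=... avg60=... avg300=... total=...`. A minimal sketch for sampling it directly on the NPB follows; `psi_parse_some_avg10` and `psi_read_cpu_avg10` are hypothetical helper names, and the monitoring system used in this ticket may collect PSI differently.

```c
#include <assert.h>
#include <stdio.h>

/* Parse the avg10 value from a PSI line such as
 * "some avg10=1.25 avg60=0.80 avg300=0.30 total=123456".
 * Returns 0 on success, -1 if the line does not match. */
int psi_parse_some_avg10(const char *line, double *avg10)
{
    assert(line != NULL && avg10 != NULL);
    return sscanf(line, "some avg10=%lf", avg10) == 1 ? 0 : -1;
}

/* Read the first ("some") line of /proc/pressure/cpu and parse avg10.
 * Returns -1 if the file is missing (kernel without PSI) or unparsable. */
int psi_read_cpu_avg10(double *avg10)
{
    char line[256];
    FILE *f = fopen("/proc/pressure/cpu", "r");
    if (f == NULL)
        return -1;
    int rc = fgets(line, sizeof line, f) ? psi_parse_some_avg10(line, avg10) : -1;
    fclose(f);
    return rc;
}
```

`avg10` is the percentage of the last 10 seconds in which at least one runnable task was waiting for CPU, which makes it a quick way to confirm a stall window like the one at 21:02.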
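Since the session_record parameter change at 17:48, memory usage has shown no anomaly, but the Application Drop counter has risen noticeably. The cause is presumably still related to the abnormal traffic.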
!image-2023-11-06-14-30-31-368.png|width=507,height=535!
---
**yangwei** commented on *2023-11-27T09:49:58.473+0800*:
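On 11-24 the problem recurred on *Bole-IGW* *NPB04*: starting at 14:29 (UTC+3) it restarted frequently, with a watchdog timeout triggered after each restart.
Applying the same workaround used on NPB03 on 11-03 restored normal operation.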
!image-2023-11-27-09-47-55-825.png|width=802,height=435!
---
**liuyang** commented on *2024-08-31T18:34:04.849+0800*:
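The system has been upgraded to TSG24.02, so this bug is being closed. If a similar problem occurs after the upgrade, a new bug will be filed.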
---
## Attachments
**46696/anydesk00000.png**
---
**46695/anydesk00001.png**
---
**46697/anydesk00002.png**
---
**46698/anydesk00003.png**
---
**46700/image-2023-11-06-14-14-36-831.png**
---
**46701/image-2023-11-06-14-16-02-180.png**
---
**46705/image-2023-11-06-14-29-10-155.png**
---
**46706/image-2023-11-06-14-30-31-368.png**
---
**47582/image-2023-11-27-09-47-55-825.png**
---
**46699/perf+(3).svg**
---