# [E21 Site] After upgrading to 23.07, Bole-IGW restarts continuously because recording HTTP session logs allocates large amounts of memory

| ID | Creation Date | Assignee | Status |
|----|----------------|----------|--------|
| OMPUB-1054 | 2023-11-05T12:10:53.000+0800 | 杨威 | Closed |

---
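
Bole-IGW NPB03 began restarting continuously at around 12:00 (UTC+3) on 2023-11-03. Each restart was triggered by a watchdog timeout, caused by some threads allocating large amounts of memory and incurring page faults. With the watchdog disabled, the process instead hit OOM shortly after startup.

---

**liuyang** commented on *2023-11-05T12:11:43.168+0800*:

The cause has been provisionally traced to abnormal HTTP URLs in the traffic, which drive the session plugin's memory usage abnormally high. As a temporary workaround, the session plugin now records only the host for HTTP. Specifically, session_record.inf on the affected NPB was modified as follows:

    [HTTP]
    #FUNC_FLAG=ALL
    FUNC_FLAG=HTTP_HOST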
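
The edit above can be scripted when it has to be applied to more than one NPB. The following is a minimal sketch, assuming session_record.inf is an INI-like file in which `#` starts a comment. The function name `restrict_http_logging` and the sample text are illustrative and not part of the product.

```python
def restrict_http_logging(text: str, new_flag: str = "HTTP_HOST") -> str:
    """Return config text with FUNC_FLAG in the [HTTP] section set to new_flag.

    The previous active value is kept as a '#' comment so the change is easy
    to revert. Lines in other sections are left untouched.
    """
    out = []
    section = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            section = stripped[1:-1]
        elif (section == "HTTP"
              and stripped.startswith("FUNC_FLAG=")
              and stripped != f"FUNC_FLAG={new_flag}"):
            out.append("#" + stripped)        # keep the old value as a comment
            line = f"FUNC_FLAG={new_flag}"    # activate the restricted value
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical file contents before the workaround.
sample = "[HTTP]\nFUNC_FLAG=ALL\n"
print(restrict_http_logging(sample), end="")
```

Running the function a second time on its own output leaves the text unchanged, so the script is safe to re-run.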

---

**yangwei** commented on *2023-11-06T14:24:02.557+0800*:

Fault start time: UTC+3 2023-11-03 11:40

Troubleshooting window: UTC+3 2023-11-03 15:11-17:48
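
The site times in these comments are given in UTC+3, while the ticket timestamps use +0800. A small sketch for lining the two up, using only the Python standard library (the helper name `site_to_ticket` is illustrative):

```python
from datetime import datetime, timedelta, timezone

SITE_TZ = timezone(timedelta(hours=3))     # timezone used in the comments (UTC+3)
TICKET_TZ = timezone(timedelta(hours=8))   # timezone of the ticket timestamps (+0800)


def site_to_ticket(local: str) -> str:
    """Convert a 'YYYY-MM-DD HH:MM' time in UTC+3 to the same format in UTC+8."""
    dt = datetime.strptime(local, "%Y-%m-%d %H:%M").replace(tzinfo=SITE_TZ)
    return dt.astimezone(TICKET_TZ).strftime("%Y-%m-%d %H:%M")


print(site_to_ticket("2023-11-03 11:40"))  # 2023-11-03 16:40
```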
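
Restart symptom: after running for a while, per-packet processing latency becomes high, which triggers the watchdog timeout.

!anydesk00001.png|thumbnail!

State at restart: session_record calls sendlog to send logs, and the realloc call takes so long that the watchdog times out.

!anydesk00000.png|thumbnail!

Diagnosis process:

* With the watchdog disabled and the process running in the foreground under gdb, after a few minutes the sys usage on some CPU cores rose while memory grew rapidly, triggering OOM.

!anydesk00002.png|thumbnail!

* `perf top -C <core with rising sys usage>` showed a high share of calls in memory-allocation functions. The corresponding flame graph shows that the rise in sys usage is caused by page faults.

!anydesk00003.png|thumbnail!

!image-2023-11-06-14-14-36-831.png|width=468,height=336!

* `bt` on the cores with high sys usage showed call stacks mainly in HTTP log processing.

!image-2023-11-06-14-16-02-180.png|width=636,height=358!

* With the HTTP data-processing entry in session_record disabled, the memory growth did not reproduce.

* With session_record processing only HTTP URL data, the memory growth reproduced.

* With session_record processing only the HTTP Host, the memory growth did not reproduce. A check of the scope of the restarts found that Bole-IGW NPB03 was the only device network-wide affected in that period. The cause is provisionally identified as abnormal HTTP URLs in the traffic making session_record's log assembly use memory abnormally, which triggers the watchdog timeout.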
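
Each enable/disable trial in the steps above comes down to whether the process's resident memory keeps growing. The following is a minimal sketch of quantifying that from `/proc/<pid>/status` on Linux. It is not the tooling used in this investigation, and the helper names are illustrative.

```python
import re
import time


def parse_vmrss_kb(status_text: str) -> int:
    """Extract VmRSS (resident set size, in kB) from /proc/<pid>/status text."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("VmRSS not found")
    return int(match.group(1))


def growth_kb_per_s(samples) -> float:
    """Average growth rate between the first and last (time_s, rss_kb) sample."""
    (t0, r0), (t1, r1) = samples[0], samples[-1]
    return (r1 - r0) / (t1 - t0)


def watch(pid: int, interval_s: float = 1.0, count: int = 10):
    """Collect (monotonic_time_s, rss_kb) samples from a live process (Linux only)."""
    samples = []
    for _ in range(count):
        with open(f"/proc/{pid}/status") as f:
            samples.append((time.monotonic(), parse_vmrss_kb(f.read())))
        time.sleep(interval_s)
    return samples
```

For example, `growth_kb_per_s(watch(pid))` can be compared between a run with URL logging enabled and a run restricted to Host only.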

Fault handling:

* At UTC+3 2023-11-03 17:48, the HTTP entry of session_record on Bole-IGW NPB03 was temporarily changed to process only Host. Further observation is pending.

---

**yangwei** commented on *2023-11-06T14:32:12.243+0800*:

At UTC+3 2023-11-03 21:02, Bole-IGW NPB03 restarted again. The restart was still a watchdog timeout, and PSI monitoring showed high CPU waiting during that period.

!image-2023-11-06-14-29-10-155.png|width=521,height=1098!
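
If the PSI monitoring mentioned above is Linux pressure stall information (an assumption, since the ticket does not say how it is collected), the raw figures come from `/proc/pressure/cpu`. A minimal sketch of parsing that format follows. The sample values are made up for illustration and are not taken from the incident.

```python
def parse_psi(text: str) -> dict:
    """Parse PSI text into {'some': {'avg10': ..., 'total': ...}, ...}.

    Each line looks like:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    avg* values are percentages; total is stall time in microseconds.
    """
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *fields = line.split()
        values = {}
        for field in fields:
            key, _, value = field.partition("=")
            values[key] = int(value) if key == "total" else float(value)
        result[kind] = values
    return result


# Made-up sample; on a live system read the text from /proc/pressure/cpu.
sample = "some avg10=42.17 avg60=30.05 avg300=12.90 total=987654321\n"
psi = parse_psi(sample)
print(psi["some"]["avg10"])  # 42.17
```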

Since the session_record parameter change at 17:48, memory usage has shown no anomaly, but the Application Drop counter has risen noticeably. The cause is presumed to still be related to the abnormal traffic.

!image-2023-11-06-14-30-31-368.png|width=507,height=535!

---

**yangwei** commented on *2023-11-27T09:49:58.473+0800*:

On 11-24 the issue recurred on *Bole-IGW NPB04*: frequent restarts began at UTC+3 14:29, and after each restart a watchdog timeout was triggered again.

Applying the same handling as on NPB03 on 11-03 restored normal operation.

!image-2023-11-27-09-47-55-825.png|width=802,height=435!

---

**liuyang** commented on *2024-08-31T18:34:04.849+0800*:

The system has been upgraded to TSG24.02, so this bug is closed. If a similar problem appears after the upgrade, open a new bug.

---

## Attachments

**46696/anydesk00000.png**

---

**46695/anydesk00001.png**

---

**46697/anydesk00002.png**

---

**46698/anydesk00003.png**

---

**46700/image-2023-11-06-14-14-36-831.png**

---

**46701/image-2023-11-06-14-16-02-180.png**

---

**46705/image-2023-11-06-14-29-10-155.png**

---

**46706/image-2023-11-06-14-30-31-368.png**

---

**47582/image-2023-11-27-09-47-55-825.png**

---

**46699/perf+(3).svg**

---