# [E21 Site] After upgrading to 23.07, Bole-IGW restarts continuously because HTTP session logging allocates large amounts of memory

| ID | Creation Date | Assignee | Status |
|----|----------------|----------|--------|
| OMPUB-1054 | 2023-11-05T12:10:53.000+0800 | 杨威 | Closed |

---

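Bole-IGW NPB03 has been restarting continuously since around 12:00 (UTC+3) on 2023-11-03. The restarts are triggered by watchdog timeout. The cause is that some threads allocate large amounts of memory, which leads to page faults. With watchdog disabled, the process quickly hits OOM after startup.

---

**liuyang** commented on *2023-11-05T12:11:43.168+0800*:

The cause has tentatively been traced to abnormal HTTP URLs in the traffic, which make the session plugin use an abnormal amount of memory. As a temporary workaround, the session plugin records only the host for HTTP. Specifically, modify session_record.inf on the affected NPB as follows:

    [HTTP]
    #FUNC_FLAG=ALL
    FUNC_FLAG=HTTP_HOST
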
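
The manual edit above could be scripted roughly as follows. This is a minimal sketch, not the team's actual tooling: the function name `restrict_http_logging` is hypothetical, the `[DNS]` section in the sample exists only to show that other sections are left alone, and the file is assumed to use plain `[SECTION]` / `KEY=VALUE` lines as in the snippet from the ticket. The previously active `FUNC_FLAG` line is kept as a `#` comment so the change is easy to revert, and running the function twice leaves the text unchanged.

```python
def restrict_http_logging(inf_text, flag="HTTP_HOST"):
    """Set FUNC_FLAG in the [HTTP] section to `flag`.

    The old FUNC_FLAG line is kept as a '#' comment, mirroring the manual
    edit in the ticket. Lines in other sections are not touched.
    """
    target = f"FUNC_FLAG={flag}"
    out = []
    in_http = False
    for line in inf_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # A new section header: track whether it is the HTTP section.
            in_http = stripped == "[HTTP]"
        elif in_http and stripped.startswith("FUNC_FLAG=") and stripped != target:
            # Comment out the old value and write the restricted one after it.
            out.append("#" + stripped)
            out.append(target)
            continue
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical file content; only the [HTTP] lines come from the ticket.
    sample = "[HTTP]\nFUNC_FLAG=ALL\n[DNS]\nFUNC_FLAG=ALL\n"
    print(restrict_http_logging(sample), end="")
```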
---
**yangwei** commented on *2023-11-06T14:24:02.557+0800*:

Fault start time: UTC+3 2023-11-03 11:40

Troubleshooting window: UTC+3 2023-11-03 15:11-17:48

Restart symptom: after running for a while, per-packet processing latency becomes high, which triggers watchdog timeout

!anydesk00001.png|thumbnail!

State at restart: session_record calls sendlog to send logs, and the realloc call takes so long that it times out

!anydesk00000.png|thumbnail!

Diagnosis process:

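* Disabled watchdog and ran the process in the foreground under gdb. After a few minutes, sys usage rose on some CPU cores while memory grew rapidly, triggering OOM

!anydesk00002.png|thumbnail!

* Ran perf top -C 'cores with rising sys usage': memory-allocation functions ranked high, and the corresponding flame graph shows that the rise in sys usage is caused by page faults

!anydesk00003.png|thumbnail!

!image-2023-11-06-14-14-36-831.png|width=468,height=336!

* Ran bt on the cores with high sys usage: the call stacks are mainly in HTTP log processing

!image-2023-11-06-14-16-02-180.png|width=636,height=358!

* Disabled the HTTP data-processing entry in session_record: the memory growth did not reproduce

* Had session_record process only HTTP URL-related data: the memory growth reproduced

* Then had session_record process only the HTTP Host: the memory growth did not reproduce. A check of the restart scope showed that Bole-IGW NPB03 was the only device in the whole network restarting during that period. The tentative conclusion is that abnormal HTTP URLs in the traffic cause abnormal memory use when session_record concatenates its logs, which triggers watchdog timeout

Fault handling:

* At UTC+3 2023-11-03 17:48, the HTTP entry of session_record on Bole-IGW NPB03 was temporarily changed to process only the Host, pending further observation
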
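
The diagnosis above depended on watching memory climb within minutes of startup. As a minimal sketch of how such growth could be flagged automatically, the code below parses the `VmRSS` line that Linux exposes in `/proc/<pid>/status` and computes an average growth rate from timestamped samples. The function names, the sample values, and the idea of comparing against a fixed limit are assumptions for illustration. None of this comes from the ticket or from the product's own monitoring.

```python
def parse_vmrss_kb(status_text):
    """Return the VmRSS value in kB from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # Line looks like: "VmRSS:      123456 kB"
            return int(line.split()[1])
    raise ValueError("no VmRSS line found")


def growth_rate_kb_per_s(samples):
    """Average RSS growth between the first and last (seconds, rss_kb) sample."""
    if len(samples) < 2:
        return 0.0
    (t0, m0), (t1, m1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("timestamps must be increasing")
    return (m1 - m0) / (t1 - t0)


def is_runaway(samples, limit_kb_per_s):
    """True when memory grows faster than the (hypothetical) limit."""
    return growth_rate_kb_per_s(samples) > limit_kb_per_s


if __name__ == "__main__":
    # Hypothetical samples: (seconds since start, RSS in kB).
    samples = [(0, 1_000_000), (60, 4_000_000), (120, 7_000_000)]
    print(growth_rate_kb_per_s(samples), is_runaway(samples, 10_000))
```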
---
**yangwei** commented on *2023-11-06T14:32:12.243+0800*:

At UTC+3 2023-11-03 21:02, Bole-IGW NPB03 restarted again, still with the watchdog timeout symptom. PSI monitoring showed high CPU waiting during that period

!image-2023-11-06-14-29-10-155.png|width=521,height=1098!

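
If the PSI monitoring mentioned above is the Linux Pressure Stall Information interface (an assumption, since the ticket does not say how the metric is collected), the raw figures come from `/proc/pressure/cpu`, whose lines look like `some avg10=... avg60=... avg300=... total=...`. The sketch below parses that text format. The function names and the threshold are hypothetical, and the sample values are made up rather than taken from the screenshot.

```python
def parse_psi(text):
    """Parse the contents of a /proc/pressure/* file.

    Returns {"some": {"avg10": ..., "avg60": ..., "avg300": ..., "total": ...}, ...}
    with every value converted to float.
    """
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        kind, fields = parts[0], parts[1:]
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def cpu_pressure_high(psi, threshold_pct):
    """True when the 10-second 'some' average reaches the given percentage."""
    return psi["some"]["avg10"] >= threshold_pct


if __name__ == "__main__":
    # Made-up sample in the kernel's documented format.
    sample = "some avg10=42.50 avg60=30.10 avg300=12.00 total=123456789\n"
    psi = parse_psi(sample)
    print(psi["some"]["avg10"], cpu_pressure_high(psi, 20.0))
```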
Since the session_record parameters were changed at 17:48, memory usage has shown no anomalies, but the Application Drop counter has risen noticeably. The suspected cause is still related to the abnormal traffic

!image-2023-11-06-14-30-31-368.png|width=507,height=535!

---

**yangwei** commented on *2023-11-27T09:49:58.473+0800*:

On 11-24 the issue recurred on **Bole-IGW NPB04**: frequent restarts began at UTC+3 14:29, with watchdog timeout triggered after each restart

Applying the same handling used for NPB03 on 11-03 restored normal operation

!image-2023-11-27-09-47-55-825.png|width=802,height=435!

---

**liuyang** commented on *2024-08-31T18:34:04.849+0800*:

The system has been upgraded to TSG24.02, so this bug is being closed. If a similar problem appears after the upgrade, create a new bug

---

## Attachments

**46696/anydesk00000.png**

---

**46695/anydesk00001.png**

---

**46697/anydesk00002.png**

---

**46698/anydesk00003.png**

---

**46700/image-2023-11-06-14-14-36-831.png**

---

**46701/image-2023-11-06-14-16-02-180.png**

---

**46705/image-2023-11-06-14-29-10-155.png**

---

**46706/image-2023-11-06-14-30-31-368.png**

---

**47582/image-2023-11-27-09-47-55-825.png**

---

**46699/perf+(3).svg**

---