[WMS-UTR Project]: Slow memory release in Firewall triggers watchdog timeout, causing the SAPP application to restart
| ID | Creation Date | Assignee | Status |
|---|---|---|---|
| OMPUB-1196 | 2024-03-26T11:34:16.000+0800 | 杨威 | Closed |
- P19 site: slow memory release in Firewall triggers watchdog timeout, causing the SAPP application to restart
** Time: 2024-03-23 22:31:34
** Frequency: once every 2~4 days
** Version: v24.01.1-8ee198a (x86_64_COTS)
*** firewall-3.0.37.685e5e8
*** sapp-pr-4.3.36.abab760
** sapp log
{code:java}
[root@msh-tsgx01 sapp]# cat runtimelog.2024-03-23 | grep dead
Sat Mar 23 08:20:28 2024, FATAL, sapp , file /tmp/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX/MESA_Platform/sapp/src/timer/sapp_timer.c, line 463, ##### detect deadlock in PID:103, at timestamp:1711164028, thread index:1, TID(LWP):280, sd_notify_enable:0, deadlock_detected:1, deadlock_cnt:0 #####
Sat Mar 23 08:20:29 2024, FATAL, sapp , file /tmp/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX/MESA_Platform/sapp/src/timer/sapp_timer.c, line 463, ##### detect deadlock in PID:103, at timestamp:1711164029, thread index:0, TID(LWP):279, sd_notify_enable:0, deadlock_detected:1, deadlock_cnt:1 #####
Sat Mar 23 08:20:30 2024, FATAL, sapp , file /tmp/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX/MESA_Platform/sapp/src/timer/sapp_timer.c, line 463, ##### detect deadlock in PID:103, at timestamp:1711164030, thread index:0, TID(LWP):279, sd_notify_enable:0, deadlock_detected:1, deadlock_cnt:2 #####
Sat Mar 23 22:31:13 2024, FATAL, sapp , file /tmp/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX/MESA_Platform/sapp/src/timer/sapp_timer.c, line 463, ##### detect deadlock in PID:103, at timestamp:1711215073, thread index:8, TID(LWP):287, sd_notify_enable:0, deadlock_detected:1, deadlock_cnt:0 #####
Sat Mar 23 22:31:14 2024, FATAL, sapp , file /tmp/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX/MESA_Platform/sapp/src/timer/sapp_timer.c, line 463, ##### detect deadlock in PID:103, at timestamp:1711215074, thread index:8, TID(LWP):287, sd_notify_enable:0, deadlock_detected:1, deadlock_cnt:1 #####
Sat Mar 23 22:31:15 2024, FATAL, sapp , file /tmp/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX/MESA_Platform/sapp/src/timer/sapp_timer.c, line 463, ##### detect deadlock in PID:103, at timestamp:1711215075, thread index:8, TID(LWP):287, sd_notify_enable:0, deadlock_detected:1, deadlock_cnt:2 #####
Sat Mar 23 22:31:16 2024, FATAL, sapp , file /tmp/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX/MESA_Platform/sapp/src/timer/sapp_timer.c, line 463, ##### detect deadlock in PID:103, at timestamp:1711215076, thread index:8, TID(LWP):287, sd_notify_enable:0, deadlock_detected:1, deadlock_cnt:3 #####
Sat Mar 23 22:31:17 2024, FATAL, sapp , file /tmp/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX/MESA_Platform/sapp/src/timer/sapp_timer.c, line 463, ##### detect deadlock in PID:103, at timestamp:1711215077, thread index:8, TID(LWP):287, sd_notify_enable:0, deadlock_detected:1, deadlock_cnt:4 #####
Sat Mar 23 22:31:18 2024, FATAL, sapp , file /tmp/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX/MESA_Platform/sapp/src/timer/sapp_timer.c, line 463, ##### detect deadlock in PID:103, at timestamp:1711215078, thread index:8, TID(LWP):287, sd_notify_enable:0, deadlock_detected:1, deadlock_cnt:5 #####
Sat Mar 23 22:31:19 2024, FATAL, sapp , file /tmp/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX/MESA_Platform/sapp/src/timer/sapp_timer.c, line 463, ##### detect deadlock in PID:103, at timestamp:1711215079, thread index:8, TID(LWP):287, sd_notify_enable:0, deadlock_detected:1, deadlock_cnt:6 #####
Sat Mar 23 22:31:20 2024, FATAL, sapp , file /tmp/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX/MESA_Platform/sapp/src/timer/sapp_timer.c, line 463, ##### detect deadlock in PID:103, at timestamp:1711215080, thread index:8, TID(LWP):287, sd_notify_enable:0, deadlock_detected:1, deadlock_cnt:7 #####
Sat Mar 23 22:31:21 2024, FATAL, sapp , file /tmp/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX/MESA_Platform/sapp/src/timer/sapp_timer.c, line 463, ##### detect deadlock in PID:103, at timestamp:1711215081, thread index:8, TID(LWP):287, sd_notify_enable:0, deadlock_detected:1, deadlock_cnt:8 #####
Sat Mar 23 22:31:22 2024, FATAL, sapp , file /tmp/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX/MESA_Platform/sapp/src/timer/sapp_timer.c, line 463, ##### detect deadlock in PID:103, at timestamp:1711215082, thread index:8, TID(LWP):287, sd_notify_enable:0, deadlock_detected:1, deadlock_cnt:9 #####
Sat Mar 23 22:31:23 2024, FATAL, sapp , file /tmp/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX/MESA_Platform/sapp/src/timer/sapp_timer.c, line 463, ##### detect deadlock in PID:103, at timestamp:1711215083, thread index:8, TID(LWP):287, sd_notify_enable:0, deadlock_detected:1, deadlock_cnt:10 #####
Sat Mar 23 22:31:23 2024, FATAL, sapp , file /tmp/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX/MESA_Platform/sapp/src/timer/sapp_timer.c, line 486, ##### detect deadlock in PID:103, at timestamp:1711215083, thread index:8, TID(LWP):0, sd_notify_enable:1, deadlock_detected:11, deadlock_cnt:287, Trigger ABORT #####
[root@msh-tsgx01 sapp]#
{code}
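To see how long each worker thread stalled before the watchdog fired, the grep output above can be grouped into episodes. This is a minimal Python sketch and not part of sapp. The function name {{stall_episodes}}, the regular expressions, and the rule that an episode is a run of detections for the same TID at most one second apart are assumptions made for illustration. The sample is abbreviated from the log above, with the source file path shortened and the middle lines omitted.

```python
import re

# One log entry is the text between a pair of "#####" markers.
ENTRY_RE = re.compile(r"#####\s*(detect deadlock.*?)\s*#####", re.DOTALL)
# Fields printed by the watchdog; \s* tolerates entries wrapped across lines.
FIELD_RE = re.compile(r"timestamp:(\d+),\s*thread index:(\d+),\s*TID\(LWP\):(\d+),")

# Abbreviated sample: path shortened to sapp_timer.c, middle entries omitted.
SAMPLE = """
Sat Mar 23 08:20:28 2024, FATAL, sapp , file sapp_timer.c, line 463, ##### detect deadlock in PID:103, at timestamp:1711164028, thread index:1, TID(LWP):280, sd_notify_enable:0, deadlock_detected:1, deadlock_cnt:0 #####
Sat Mar 23 08:20:29 2024, FATAL, sapp , file sapp_timer.c, line 463, ##### detect deadlock in PID:103, at timestamp:1711164029, thread index:0, TID(LWP):279, sd_notify_enable:0, deadlock_detected:1, deadlock_cnt:1 #####
Sat Mar 23 22:31:13 2024, FATAL, sapp , file sapp_timer.c, line 463, ##### detect deadlock in PID:103, at timestamp:1711215073, thread index:8, TID(LWP):287, sd_notify_enable:0, deadlock_detected:1, deadlock_cnt:0 #####
Sat Mar 23 22:31:14 2024, FATAL, sapp , file sapp_timer.c, line 463, ##### detect deadlock in PID:103, at timestamp:1711215074, thread index:8, TID(LWP):287, sd_notify_enable:0, deadlock_detected:1, deadlock_cnt:1 #####
Sat Mar 23 22:31:23 2024, FATAL, sapp , file sapp_timer.c, line 486, ##### detect deadlock in PID:103, at timestamp:1711215083, thread index:8, TID(LWP):0, sd_notify_enable:1, deadlock_detected:11, deadlock_cnt:287, Trigger ABORT #####
"""


def stall_episodes(log_text):
    """Group deadlock detections into per-thread stall episodes."""
    episodes = []
    current = None
    for body in ENTRY_RE.findall(log_text):
        fields = FIELD_RE.search(body)
        if fields is None:
            continue
        ts, index, tid = (int(g) for g in fields.groups())
        if "Trigger ABORT" in body:
            # The abort line prints different TID/count values, so it is
            # only used to mark the episode that was open when it fired.
            if current is not None:
                current["aborted"] = True
            continue
        # Assumption: same TID and a gap of at most 1 s continues an episode.
        if current is None or current["tid"] != tid or ts - current["last"] > 1:
            current = {"tid": tid, "thread_index": index,
                       "first": ts, "last": ts, "aborted": False}
            episodes.append(current)
        else:
            current["last"] = ts
    return episodes


for ep in stall_episodes(SAMPLE):
    print(ep["tid"], ep["last"] - ep["first"], ep["aborted"])
```

Run against the full log above rather than the abbreviated sample, this grouping should report TID 287 stalled from timestamp 1711215073 to 1711215083 before the abort.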
** Stack information
{code:java}
106 LWP 287 0x00007f2ab52b4a4b in madvise () from /lib64/libc.so.6
107 LWP 295 0x00007f2ab52b4a4b in madvise () from /lib64/libc.so.6
108 LWP 362 0x00007f2ab52b4a4b in madvise () from /lib64/libc.so.6
109 LWP 329 0x00007f2ab52b4a4b in madvise () from /lib64/libc.so.6
110 LWP 308 0x00007f2ab52b4a4b in madvise () from /lib64/libc.so.6
111 LWP 283 0x00007f2ab52b4a4b in madvise () from /lib64/libc.so.6
112 LWP 321 0x00007f2ab52b4a4b in madvise () from /lib64/libc.so.6
(gdb) thr 106
[Switching to thread 106 (LWP 287)]
#0 0x00007f2ab52b4a4b in madvise () from /lib64/libc.so.6
(gdb) bt
#0 0x00007f2ab52b4a4b in madvise () from /lib64/libc.so.6
#1 0x00000000004e9d88 in je_pages_purge_forced ()
#2 0x00000000004dc765 in je_extent_dalloc_wrapper ()
#3 0x00000000004e9031 in pac_decay_to_limit.part ()
#4 0x00000000004e9607 in je_pac_maybe_decay_purge ()
#5 0x000000000048c418 in je_arena_decay ()
#6 0x00000000004ffad2 in je_tcache_bin_flush_small ()
#7 0x0000000000481b38 in je_free_default ()
#8 0x00007f28a58b7237 in nmx_destroy_pool () from /opt/tsg/framework/lib/libutable.so
#9 0x00007f28a587c3d9 in utable_free () from /opt/tsg/framework/lib/libutable.so
#10 0x00007f28a69418fe in firewall_context_exdata_free(session*, int, void*, void*) () from ./plug/business/firewall/firewall.so
#11 0x00007f28a71f6014 in session_free () from ./plug/stellar_on_sapp/stellar_on_sapp.so
#12 0x000000000044f272 in stream_bridge_destroy_per_stream ()
#13 0x000000000044455a in udp_free_stream ()
#14 0x0000000000446892 in del_stream_by_time ()
#15 0x0000000000449aba in polling_stream_timeout ()
#16 0x0000000000508e1d in marsio4_worker ()
#17 0x00007f2ab63c61ca in start_thread () from /lib64/libpthread.so.0
#18 0x00007f2ab52b4e73 in clone () from /lib64/libc.so.6
(gdb)
{code}
yangwei commented on 2024-03-26T16:36:03.845+0800:
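Memory usage spiked around the time of the restart. At the point of the suspected deadlock, jemalloc had triggered memory purging and entered the madvise system call. Consider adjusting the jemalloc parameters to disable memory defragmentation, then observe whether the problem recurs on site.
Proposed action:
- Before the pod starts, set the environment variable export MALLOC_CONF="retain:true,dirty_decay_ms:300000,muzzy_decay_ms:300000"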
!image-2024-03-26-14-39-06-978.png|width=637,height=448!
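A quick way to sanity-check the proposed MALLOC_CONF value is to split it into options and read the decay times back in seconds. The sketch below is illustrative only: the helper names {{parse_malloc_conf}} and {{describe_decay}} are hypothetical and jemalloc itself does the real parsing. Per the jemalloc manual, {{dirty_decay_ms}} and {{muzzy_decay_ms}} are approximate times before unused pages are purged, 0 purges immediately and -1 disables purging, so the value 300000 delays purging to roughly five minutes rather than turning it off.

```python
def parse_malloc_conf(conf):
    """Split a MALLOC_CONF string ("key:value,key:value") into a dict of strings."""
    options = {}
    for item in conf.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition(":")
        if not sep:
            raise ValueError(f"malformed option: {item!r}")
        options[key.strip()] = value.strip()
    return options


def describe_decay(options, key):
    """Describe a *_decay_ms option using the semantics in the jemalloc manual."""
    ms = int(options[key])
    if ms == -1:
        return "purging disabled"
    if ms == 0:
        return "purged immediately"
    return f"purged after about {ms / 1000:g} s"


proposed = parse_malloc_conf("retain:true,dirty_decay_ms:300000,muzzy_decay_ms:300000")
for name in ("dirty_decay_ms", "muzzy_decay_ms"):
    print(name, "->", describe_decay(proposed, name))
```

If the intent is to switch purging off entirely, the manual's -1 value would be the setting to evaluate. Whether that is acceptable for memory growth on site is not something this ticket establishes.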
yangwei commented on 2024-04-08T17:38:43.698+0800:
See https://docs.geedge.net/pages/viewpage.action?pageId=129087928 for the procedure that adjusts the memory parameters in the hotfix script.
yangwei commented on 2024-05-06T14:23:58.404+0800:
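Same root cause as https://jira.geedge.net/browse/TSG-20774.
Because the WMS-UTR site runs OS version 24.01 with firewall version 3.0.44, the recommended solutions are:
- Upgrade the site to 24.02.17 or later. In that version the firewall limits the number of concurrently cached transactions, which avoids the problem in this issue.
- The traffic handled at the WMS-UTR site is already filtered, so SIP is the only traffic that could produce enough concurrent UDP transactions to trigger the watchdog. If the OS is not upgraded, reduce the number of concurrent SIP transactions per session to avoid the problem in this issue, as follows: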
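{code:java}
1. Append the following line to the end of /etc/tsg-os/tsg-traffic-engine-vsys-1/firewall_prestart_script.sh:
sed -ci 's/session_expire_num=1000/session_expire_num=10/' /opt/tsg/sapp/conf/sip/sip_main.conf
2. Restart the container:
kubectl rollout restart deploy/tsg-traffic-engine-vsys-1-firewall
3. After the restart completes, run the command below (tsg-traffic-engine-vsys-1-firewall-xxx-xxx is the pod name) and confirm that the output contains the line session_expire_num=10:
kubectl exec -it tsg-traffic-engine-vsys-1-firewall-xxx-xxx -- cat /opt/tsg/sapp/conf/sip/sip_main.conf
{code}
** The hotfix above was applied on April 29 to MSH01 and TWA03, the two nodes with the heaviest traffic on site. As of May 6, neither node has had another watchdog timeout. It is recommended that the whole WMS-UTR site apply this hotfix until the OS is upgraded.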
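The sed substitution in step 1 only changes a line that reads exactly session_expire_num=1000, which is why the check in step 3 matters. The Python sketch below mirrors that substitution and the verification on an in-memory string. The function names and the sample config content are hypothetical, and the real layout of sip_main.conf is assumed to be plain key=value lines.

```python
OLD = "session_expire_num=1000"
NEW = "session_expire_num=10"


def apply_hotfix(conf_text):
    """Replace the first OLD on each line with NEW, like sed 's/OLD/NEW/'."""
    lines = conf_text.splitlines(keepends=True)
    return "".join(line.replace(OLD, NEW, 1) for line in lines)


def hotfix_applied(conf_text):
    """Step 3 check: some line must be exactly session_expire_num=10."""
    return any(line.strip() == NEW for line in conf_text.splitlines())


# Hypothetical config content, for illustration only.
sample = "[SIP]\nsession_expire_num=1000\n"
patched = apply_hotfix(sample)
print(hotfix_applied(patched))

# The prestart script runs on every start, so a second pass must be a no-op.
print(apply_hotfix(patched) == patched)

# A site whose value is not 1000 is left untouched, and the check reports it.
print(hotfix_applied(apply_hotfix("session_expire_num=500\n")))
```

Matching the exact line in the check, rather than a substring, also catches values such as session_expire_num=100 that merely start with the expected text.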
liuxueli commented on 2024-05-16T17:45:34.082+0800:
- For the root cause, see TSG-21289.
- 2024/05/16: updated to {{firewall-3.0.45.fa65364}}; observing the effect.
Attachments
54210/image-2024-03-26-14-39-06-978.png