geedge-jira/md/OMPUB-1196.md
2025-09-14 21:52:36 +00:00

[WMS-UTR project]: Firewall releases memory slowly, triggering a watchdog timeout that restarts the SAPP application

| ID | Creation Date | Assignee | Status |
|---|---|---|---|
| OMPUB-1196 | 2024-03-26T11:34:16.000+0800 | 杨威 | Closed |

  • P19 site: Firewall releases memory slowly, triggering a watchdog timeout that restarts the SAPP application
  ** Time: 2024-03-23 22:31:34
  ** Frequency: once every 2 to 4 days
  ** Version: v24.01.1-8ee198a (x86_64_COTS)
  *** firewall-3.0.37.685e5e8
  *** sapp-pr-4.3.36.abab760
  ** sapp log

{code:java}
[root@msh-tsgx01 sapp]# cat runtimelog.2024-03-23 | grep dead
Sat Mar 23 08:20:28 2024, FATAL, sapp , file /tmp/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX/MESA_Platform/sapp/src/timer/sapp_timer.c, line 463, ##### detect deadlock in PID:103, at timestamp:1711164028,  thread index:1, TID(LWP):280, sd_notify_enable:0, deadlock_detected:1, deadlock_cnt:0 #####
Sat Mar 23 08:20:29 2024, FATAL, sapp , file /tmp/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX/MESA_Platform/sapp/src/timer/sapp_timer.c, line 463, ##### detect deadlock in PID:103, at timestamp:1711164029,  thread index:0, TID(LWP):279, sd_notify_enable:0, deadlock_detected:1, deadlock_cnt:1 #####
Sat Mar 23 08:20:30 2024, FATAL, sapp , file /tmp/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX/MESA_Platform/sapp/src/timer/sapp_timer.c, line 463, ##### detect deadlock in PID:103, at timestamp:1711164030,  thread index:0, TID(LWP):279, sd_notify_enable:0, deadlock_detected:1, deadlock_cnt:2 #####
Sat Mar 23 22:31:13 2024, FATAL, sapp , file /tmp/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX/MESA_Platform/sapp/src/timer/sapp_timer.c, line 463, ##### detect deadlock in PID:103, at timestamp:1711215073,  thread index:8, TID(LWP):287, sd_notify_enable:0, deadlock_detected:1, deadlock_cnt:0 #####
Sat Mar 23 22:31:14 2024, FATAL, sapp , file /tmp/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX/MESA_Platform/sapp/src/timer/sapp_timer.c, line 463, ##### detect deadlock in PID:103, at timestamp:1711215074,  thread index:8, TID(LWP):287, sd_notify_enable:0, deadlock_detected:1, deadlock_cnt:1 #####
Sat Mar 23 22:31:15 2024, FATAL, sapp , file /tmp/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX/MESA_Platform/sapp/src/timer/sapp_timer.c, line 463, ##### detect deadlock in PID:103, at timestamp:1711215075,  thread index:8, TID(LWP):287, sd_notify_enable:0, deadlock_detected:1, deadlock_cnt:2 #####
Sat Mar 23 22:31:16 2024, FATAL, sapp , file /tmp/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX/MESA_Platform/sapp/src/timer/sapp_timer.c, line 463, ##### detect deadlock in PID:103, at timestamp:1711215076,  thread index:8, TID(LWP):287, sd_notify_enable:0, deadlock_detected:1, deadlock_cnt:3 #####
Sat Mar 23 22:31:17 2024, FATAL, sapp , file /tmp/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX/MESA_Platform/sapp/src/timer/sapp_timer.c, line 463, ##### detect deadlock in PID:103, at timestamp:1711215077,  thread index:8, TID(LWP):287, sd_notify_enable:0, deadlock_detected:1, deadlock_cnt:4 #####
Sat Mar 23 22:31:18 2024, FATAL, sapp , file /tmp/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX/MESA_Platform/sapp/src/timer/sapp_timer.c, line 463, ##### detect deadlock in PID:103, at timestamp:1711215078,  thread index:8, TID(LWP):287, sd_notify_enable:0, deadlock_detected:1, deadlock_cnt:5 #####
Sat Mar 23 22:31:19 2024, FATAL, sapp , file /tmp/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX/MESA_Platform/sapp/src/timer/sapp_timer.c, line 463, ##### detect deadlock in PID:103, at timestamp:1711215079,  thread index:8, TID(LWP):287, sd_notify_enable:0, deadlock_detected:1, deadlock_cnt:6 #####
Sat Mar 23 22:31:20 2024, FATAL, sapp , file /tmp/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX/MESA_Platform/sapp/src/timer/sapp_timer.c, line 463, ##### detect deadlock in PID:103, at timestamp:1711215080,  thread index:8, TID(LWP):287, sd_notify_enable:0, deadlock_detected:1, deadlock_cnt:7 #####
Sat Mar 23 22:31:21 2024, FATAL, sapp , file /tmp/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX/MESA_Platform/sapp/src/timer/sapp_timer.c, line 463, ##### detect deadlock in PID:103, at timestamp:1711215081,  thread index:8, TID(LWP):287, sd_notify_enable:0, deadlock_detected:1, deadlock_cnt:8 #####
Sat Mar 23 22:31:22 2024, FATAL, sapp , file /tmp/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX/MESA_Platform/sapp/src/timer/sapp_timer.c, line 463, ##### detect deadlock in PID:103, at timestamp:1711215082,  thread index:8, TID(LWP):287, sd_notify_enable:0, deadlock_detected:1, deadlock_cnt:9 #####
Sat Mar 23 22:31:23 2024, FATAL, sapp , file /tmp/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX/MESA_Platform/sapp/src/timer/sapp_timer.c, line 463, ##### detect deadlock in PID:103, at timestamp:1711215083,  thread index:8, TID(LWP):287, sd_notify_enable:0, deadlock_detected:1, deadlock_cnt:10 #####
Sat Mar 23 22:31:23 2024, FATAL, sapp , file /tmp/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX_PREFIX/MESA_Platform/sapp/src/timer/sapp_timer.c, line 486, ##### detect deadlock in PID:103, at timestamp:1711215083,  thread index:8, TID(LWP):0, sd_notify_enable:1, deadlock_detected:11, deadlock_cnt:287,  Trigger ABORT #####
[root@msh-tsgx01 sapp]#
{code}
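When a runtime log contains many of these entries, it can help to see at a glance which worker threads the watchdog flagged and how often. The following is a minimal sketch, not part of the original investigation: `summarize_deadlocks` is a hypothetical helper name, and it assumes only the `detect deadlock` and `TID(LWP):<n>` fields visible in the log above.

```shell
# Hypothetical helper: count "detect deadlock" entries per thread ID.
# Reads a sapp runtime log on stdin and prints "<count> <TID>" lines,
# sorted by TID. Assumes the TID(LWP):<n> field format shown in the log above.
summarize_deadlocks() {
  grep 'detect deadlock' \
    | grep -o 'TID(LWP):[0-9]*' \
    | cut -d: -f2 \
    | sort -n \
    | uniq -c \
    | awk '{print $1, $2}'
}
```

Usage would look like `summarize_deadlocks < runtimelog.2024-03-23`. Note that the final `Trigger ABORT` entry in the log above reports `TID(LWP):0`, so it would be counted under TID 0 rather than 287.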

** Stack information


{code:java}
  106  LWP 287           0x00007f2ab52b4a4b in madvise () from /lib64/libc.so.6
  107  LWP 295           0x00007f2ab52b4a4b in madvise () from /lib64/libc.so.6
  108  LWP 362           0x00007f2ab52b4a4b in madvise () from /lib64/libc.so.6
  109  LWP 329           0x00007f2ab52b4a4b in madvise () from /lib64/libc.so.6
  110  LWP 308           0x00007f2ab52b4a4b in madvise () from /lib64/libc.so.6
  111  LWP 283           0x00007f2ab52b4a4b in madvise () from /lib64/libc.so.6
  112  LWP 321           0x00007f2ab52b4a4b in madvise () from /lib64/libc.so.6

(gdb) thr 106
[Switching to thread 106 (LWP 287)]
#0  0x00007f2ab52b4a4b in madvise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f2ab52b4a4b in madvise () from /lib64/libc.so.6
#1  0x00000000004e9d88 in je_pages_purge_forced ()
#2  0x00000000004dc765 in je_extent_dalloc_wrapper ()
#3  0x00000000004e9031 in pac_decay_to_limit.part ()
#4  0x00000000004e9607 in je_pac_maybe_decay_purge ()
#5  0x000000000048c418 in je_arena_decay ()
#6  0x00000000004ffad2 in je_tcache_bin_flush_small ()
#7  0x0000000000481b38 in je_free_default ()
#8  0x00007f28a58b7237 in nmx_destroy_pool ()
   from /opt/tsg/framework/lib/libutable.so
#9  0x00007f28a587c3d9 in utable_free ()
   from /opt/tsg/framework/lib/libutable.so
#10 0x00007f28a69418fe in firewall_context_exdata_free(session*, int, void*, void*) () from ./plug/business/firewall/firewall.so
#11 0x00007f28a71f6014 in session_free ()
   from ./plug/stellar_on_sapp/stellar_on_sapp.so
#12 0x000000000044f272 in stream_bridge_destroy_per_stream ()
#13 0x000000000044455a in udp_free_stream ()
#14 0x0000000000446892 in del_stream_by_time ()
#15 0x0000000000449aba in polling_stream_timeout ()
#16 0x0000000000508e1d in marsio4_worker ()
#17 0x00007f2ab63c61ca in start_thread () from /lib64/libpthread.so.0
#18 0x00007f2ab52b4e73 in clone () from /lib64/libc.so.6
(gdb)
{code}

yangwei commented on 2024-03-26T16:36:03.845+0800:

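Memory usage spiked during the restart window. The suspected deadlock is in fact jemalloc triggering memory purging and entering the madvise system call. Consider adjusting the jemalloc parameters to disable memory defragmentation, then observe whether the problem recurs on site.

Proposed action:

  • Before the pod starts, set the environment variable: export MALLOC_CONF="retain:true,dirty_decay_ms:300000,muzzy_decay_ms:300000"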

!image-2024-03-26-14-39-06-978.png|width=637,height=448!
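To confirm that the setting actually reached the running process, one option is to read the process environment from /proc. This is a minimal sketch, assuming a Linux /proc filesystem; `malloc_conf_of` is a hypothetical helper name, and PID 103 in the usage line is taken from the log above purely as an example.

```shell
# Hypothetical helper: print the MALLOC_CONF value from an environ file.
# /proc/<pid>/environ is a NUL-separated list of KEY=VALUE entries, so it is
# split into lines first. Prints nothing if the variable is not set.
malloc_conf_of() {
  tr '\0' '\n' < "$1" | sed -n 's/^MALLOC_CONF=//p'
}

# Example (inside the container): malloc_conf_of /proc/103/environ
```

In jemalloc, `dirty_decay_ms` and `muzzy_decay_ms` set the approximate time unused pages are kept before being purged, so a value of 300000 defers purging by roughly five minutes rather than turning it off entirely. Whether the jemalloc build bundled here reads `MALLOC_CONF` under that exact name is an assumption carried over from the proposed action above.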


yangwei commented on 2024-04-08T17:38:43.698+0800:

See https://docs.geedge.net/pages/viewpage.action?pageId=129087928 for the procedure that adjusts the memory parameters in the hotfix script.


yangwei commented on 2024-05-06T14:23:58.404+0800:

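Same root cause as https://jira.geedge.net/browse/TSG-20774

Because the WMS-UTR site runs OS version 24.01 and firewall version 3.0.44, the recommended solutions are:

  • Upgrade the site in step to 24.02.17 or later. That release limits the number of concurrently cached transactions in firewall, which avoids the problem in this issue.
  • The traffic handled at the WMS-UTR site is already filtered, so the only traffic that could produce enough concurrent UDP transactions to trigger the watchdog is SIP. If the OS is not upgraded, the number of concurrent SIP transactions per session can be reduced to avoid the problem in this issue. The procedure is as follows: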

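{code:java}
1. Append the following line to the end of /etc/tsg-os/tsg-traffic-engine-vsys-1/firewall_prestart_script.sh:
   sed -ci 's/session_expire_num=1000/session_expire_num=10/' /opt/tsg/sapp/conf/sip/sip_main.conf
2. Restart the container:
   kubectl rollout restart deploy/tsg-traffic-engine-vsys-1-firewall
3. After the restart completes, run the following (tsg-traffic-engine-vsys-1-firewall-xxx-xxx is the pod name) and confirm that the line session_expire_num=10 is present:
   kubectl exec -it tsg-traffic-engine-vsys-1-firewall-xxx-xxx -- cat /opt/tsg/sapp/conf/sip/sip_main.conf
{code}

  ** The hotfix above was applied on April 29 to MSH01 and TWA03, the two sites with the heaviest traffic. As of May 6, neither has seen another watchdog timeout. It is recommended that WMS-UTR apply this hotfix across all sites before the OS upgrade.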
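The substitution and the follow-up check in the steps above can be combined so that a silent no-op (for example, when the file does not contain `session_expire_num=1000`) is noticed. This is a minimal sketch under stated assumptions: `apply_sip_hotfix` is a hypothetical helper, it uses plain `sed -i` as found in GNU sed rather than the `-c` variant from the steps above, and it assumes the setting appears in the file as `session_expire_num=<n>`.

```shell
# Hypothetical helper: apply the session_expire_num change to a config file
# and return non-zero if the expected setting is not present afterwards.
apply_sip_hotfix() {
  conf="$1"
  # Same substitution as in the hotfix steps; a no-op if already applied.
  sed -i 's/session_expire_num=1000/session_expire_num=10/' "$conf" || return 1
  # Require "=10" followed by a non-digit or end of line, so that 100 or 1000
  # does not count as a match.
  grep -Eq 'session_expire_num=10([^0-9]|$)' "$conf"
}
```

Running it twice is harmless, since the second pass finds nothing to replace and the check still succeeds. A non-zero return means the file needs manual inspection.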


liuxueli commented on 2024-05-16T17:45:34.082+0800:

  • For the root cause, see TSG-21289.
  • 2024/05/16: updated to {{firewall-3.0.45.fa65364}}; observing the effect.

Attachments

54210/image-2024-03-26-14-39-06-978.png