# [M22 Project] Multiple TSGX devices at the YGN-MYTEL site report "container firewall restarted"

| ID | Creation Date | Assignee | Status |
|----|----------------|----------|--------|
| OMPUB-1321 | 2024-06-11T13:59:01.000+0800 | 杨威 | Closed |

---

TSGX devices at YGN-MYTEL that reported "container firewall restarted", with their restart times:

| Device | Restart time |
|--------|--------------|
| TSG-OS-YGN-MYTEL-TSGX004 | 2024-06-11T03:39:11+06:30 |
| TSG-OS-YGN-MYTEL-TSGX005 | 2024-06-11T03:37:11+06:30 |
| TSG-OS-YGN-MYTEL-TSGX011 | 2024-06-11T03:38:11+06:30 |
| TSG-OS-YGN-MYTEL-TSGX012 | 2024-06-11T03:37:41+06:30 |
| TSG-OS-YGN-MYTEL-TSGX018 | 2024-06-11T03:37:41+06:30 |
| TSG-OS-YGN-MYTEL-TSGX019 | 2024-06-11T03:39:11+06:30 |
| TSG-OS-YGN-MYTEL-TSGX021 | 2024-06-11T03:31:11+06:30 |
| TSG-OS-YGN-MYTEL-TSGX022 | 2024-06-11T03:37:41+06:30 |
| TSG-OS-YGN-MYTEL-TSGX023 | 2024-06-11T03:38:41+06:30 |

On three of these devices (YGN-MYTEL-TSGX004, YGN-MYTEL-TSGX019, and YGN-MYTEL-TSGX023), the traffic processed by the program dropped to nearly 0 after the firewall restarted.

---

**gitlab** commented on *2024-06-12T15:51:30.460+0800*:

[陆秋文|https://git.mesalab.cn/luqiuwen] mentioned this issue in [a commit|https://git.mesalab.cn/MESA_Platform/marsio/-/commit/857de62018a8eaa6547fab541f251dd7e566d0f6] of [MESA Platform / marsio|https://git.mesalab.cn/MESA_Platform/marsio] on branch [bugfix-OMPUB-1321|https://git.mesalab.cn/MESA_Platform/marsio/-/tree/bugfix-OMPUB-1321]:{quote}OMPUB-1321 bugfix: clear the inflight lean counter when prod recreated.{quote}

---

**gitlab** commented on *2024-06-14T09:38:31.590+0800*:

[陆秋文|https://git.mesalab.cn/luqiuwen] mentioned this issue in [a merge request|https://git.mesalab.cn/MESA_Platform/marsio/-/merge_requests/480] of [MESA Platform / marsio|https://git.mesalab.cn/MESA_Platform/marsio] on branch [bugfix-OMPUB-1321|https://git.mesalab.cn/MESA_Platform/marsio/-/tree/bugfix-OMPUB-1321]:{quote}OMPUB-1321 bugfix: clear the inflight lean counter when prod recreated.{quote}

---

**gitlab** commented on *2024-06-14T09:38:34.791+0800*:

[陆秋文|https://git.mesalab.cn/luqiuwen] mentioned this issue in [a commit|https://git.mesalab.cn/MESA_Platform/marsio/-/commit/b4c19cdd847fce4cbb32c6981fbaf9d3a3d30b94] of [MESA Platform / marsio|https://git.mesalab.cn/MESA_Platform/marsio] on branch [bugfix-OMPUB-1321|https://git.mesalab.cn/MESA_Platform/marsio/-/tree/bugfix-OMPUB-1321]:{quote}OMPUB-1321 bugfix: clear the inflight lean counter when prod recreated.{quote}

---

**gitlab** commented on *2024-06-14T14:57:22.329+0800*:

[陆秋文|https://git.mesalab.cn/luqiuwen] mentioned this issue in [a commit|https://git.mesalab.cn/MESA_Platform/marsio/-/commit/06f515f5fee9457b99841cfd3bb42bcf31f91f7d] of [MESA Platform / marsio|https://git.mesalab.cn/MESA_Platform/marsio] on branch [backport-from-dev-4-8|https://git.mesalab.cn/MESA_Platform/marsio/-/tree/backport-from-dev-4-8]:{quote}OMPUB-1321 bugfix: clear the inflight lean counter when prod recreated.{quote}

---

**gitlab** commented on *2024-06-14T14:58:10.372+0800*:

[陆秋文|https://git.mesalab.cn/luqiuwen] mentioned this issue in [a merge request|https://git.mesalab.cn/MESA_Platform/marsio/-/merge_requests/481] of [MESA Platform / marsio|https://git.mesalab.cn/MESA_Platform/marsio] on branch [backport-from-dev-4-8|https://git.mesalab.cn/MESA_Platform/marsio/-/tree/backport-from-dev-4-8]:{quote}OMPUB-1321 bugfix: clear the inflight lean counter when prod recreated.{quote}

---

**yangwei** commented on *2024-06-17T09:55:10.131+0800*:

1. At the time of the restarts (June 11, 3:31 to 3:42), the firewall engine on each affected device raised alarms for high per-packet processing latency in the third-party DPI. The protocol stack being processed was IPv4->UDP.

!image-2024-06-17-09-40-59-680.png|width=995,height=461!

2. Per-device monitoring shows a sudden spike in new UDP sessions at the same time.

!image-2024-06-17-09-38-55-435.png|width=460,height=288!

3. In the coredump captured on site during debugging, the high-latency packets had destination port 3389, and both the source and destination IPs were in the 141.11.197.xx range. A lookup shows that this range belongs to Path Network ([Path Network|https://path.net/]).

!image-2024-06-17-10-01-39-693.png|width=404,height=247!

The company's main business includes DDoS mitigation. The restarts may therefore have been caused by the company diverting UDP 3389 attack traffic: the traffic itself drove up qdpi processing latency, which triggered the firewall watchdog and caused the restart.

!image-2024-06-17-09-49-36-214.png|width=710,height=397!

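Approach: construct UDP 3389 traffic locally and try to reproduce the spin_lock seen in the captured stack, to determine whether the high processing latency is triggered by the traffic payload itself or by preallocated resources being exhausted.

---

**yangwei** commented on *2024-06-19T14:36:30.512+0800*:

Reproduced again on YGN-MYTEL-TSGX026 at 2024-06-15T03:37:40+06:30.

The 4-tuples of the packets being processed by the two worker threads that hit the spin_lock on site:

* {saddr = 21559714, daddr = 3917357249, source = 45811, dest = 50537}
* {saddr = 21559714, daddr = 3917357249, source = 45811, dest = 58562}

!anydesk00000.png|width=560,height=315! !anydesk00001.png|width=558,height=314!

* Client IP: 193.32.126.233, org "AS39351 31173 Services AB"
* Server IP: 162.249.72.1, org "AS6507 Riot Games, Inc"

!image-2024-06-19-14-30-38-097.png|width=455,height=271!!image-2024-06-19-14-31-37-570.png|width=436,height=269!

---

**yangwei** commented on *2024-06-19T14:40:44.572+0800*:

Reproduced on TSG-OS-YGN-MYTEL-TSGX024 at 2024-06-19T06:18:11+06:30.

!image-2024-06-19-14-41-03-900.png|width=631,height=311! !image-2024-06-19-14-40-33-071.png|width=1049,height=613!

---

**gitlab** commented on *2024-06-20T11:10:44.703+0800*:

[陆秋文|https://git.mesalab.cn/luqiuwen] mentioned this issue in [a merge request|https://git.mesalab.cn/MESA_Platform/marsio/-/merge_requests/483] of [MESA Platform / marsio|https://git.mesalab.cn/MESA_Platform/marsio] on branch [dev-4.7|https://git.mesalab.cn/MESA_Platform/marsio/-/tree/dev-4.7]:{quote}OMPUB-1321 bugfix: clear the inflight lean counter when prod recreated.{quote}

---

**yangwei** commented on *2024-06-21T12:20:04.328+0800*:

Tried to reproduce the conditions that trigger the spin_lock, both locally and in the Fujian test environment. Preliminary conclusions:

* The spin_lock has little to do with resource allocation. With the nflow allocated to each thread reduced, neither looped packet replay locally nor the Fujian test environment triggered the spin_lock.
* With resource allocation ruled out, the spin_lock is more likely triggered by specific traffic payloads.

In the stacks where spin_lock contention caused high processing latency, the spin_lock is usually called from libqmbundle.so.

* The current qdpi_detector implementation follows the sample code: it initializes one global engine object and, after the engine loads qmbundle, registers N worker_process instances in that engine object (N matches the number of packet-io threads).
* At runtime, identification is performed by passing in the worker_id.

The likely reason the libqmbundle.so interfaces call spin_lock is that identifying certain payloads requires a contended resource. Because the single global engine shares one qmbundle, the spin_lock overhead becomes large when the contended resource is held for a long time.

Next step: update the qdpi_detector initialization logic so that each worker_process gets its own engine object, each loading its own copy of qmbundle, and verify whether spin_lock is no longer called after the change.

---

**gitlab** commented on *2024-06-21T15:39:51.350+0800*: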
[宋延超|https://git.mesalab.cn/syc] mentioned this issue in [a commit|https://git.mesalab.cn/tsg/tsg-os-buildimage/-/commit/3d80c610dc8685b6df7ba05cd0da6c18b0cbad8b] of [TSG / tsg-os-buildimage|https://git.mesalab.cn/tsg/tsg-os-buildimage] on branch [update-mrzcpd-v4.7.7-for-24.02|https://git.mesalab.cn/tsg/tsg-os-buildimage/-/tree/update-mrzcpd-v4.7.7-for-24.02]:{quote}🐞 fix(OMPUB-1321): Upgrade mrzcpd to v4.7.7{quote}

---

**gitlab** commented on *2024-06-21T15:40:28.996+0800*:

[宋延超|https://git.mesalab.cn/syc] mentioned this issue in [a merge request|https://git.mesalab.cn/tsg/tsg-os-buildimage/-/merge_requests/2586] of [TSG / tsg-os-buildimage|https://git.mesalab.cn/tsg/tsg-os-buildimage] on branch [update-mrzcpd-v4.7.7-for-24.02|https://git.mesalab.cn/tsg/tsg-os-buildimage/-/tree/update-mrzcpd-v4.7.7-for-24.02]:{quote}🐞 fix(OMPUB-1321): Upgrade mrzcpd to v4.7.7{quote}

---

**gitlab** commented on *2024-06-21T15:45:54.505+0800*:

[宋延超|https://git.mesalab.cn/syc] mentioned this issue in [a commit|https://git.mesalab.cn/tsg/tsg-os-buildimage/-/commit/21be469b6311f00a3fca22c5f2172ca299a008ef] of [TSG / tsg-os-buildimage|https://git.mesalab.cn/tsg/tsg-os-buildimage] on branch [update-mrzcpd-v4.7.7-for-24.02|https://git.mesalab.cn/tsg/tsg-os-buildimage/-/tree/update-mrzcpd-v4.7.7-for-24.02]:{quote}🐞 fix(OMPUB-1321): Upgrade mrzcpd to v4.7.7{quote}

---

## Attachments

* **58894/1.png**
* **59174/anydesk00000.png**
* **59175/anydesk00001.png**
* **59129/image-2024-06-17-09-38-55-435.png**
* **59130/image-2024-06-17-09-40-59-680.png**
* **59131/image-2024-06-17-09-49-36-214.png**
* **59132/image-2024-06-17-10-01-39-693.png**
* **59176/image-2024-06-19-14-30-38-097.png**
* **59177/image-2024-06-19-14-31-20-805.png**
* **59178/image-2024-06-19-14-31-37-570.png**
* **59179/image-2024-06-19-14-39-59-360.png**
* **59180/image-2024-06-19-14-40-33-071.png**
* **59181/image-2024-06-19-14-41-03-900.png**
* **58899/YGN-MYTEL-TSGX004.html**
* **58887/YGN-MYTEL-TSGX004.png**
* **58889/企业微信截图_b0135b85-7a53-42f5-ac12-f77da916093e.png**
* **58891/企业微信截图_b0135b85-7a53-42f5-ac12-f77da916093e-1.png**
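
A note on the 4-tuples quoted in the 2024-06-19 comment: the `saddr` and `daddr` values look like raw 32-bit header fields printed as plain integers. Assuming the dump was taken on a little-endian host and the fields hold IPv4 addresses in network byte order (an assumption, the ticket does not state this), the following minimal Python sketch decodes them. The helper name `decode_addr` is illustrative only and is not part of any project code.

```python
import ipaddress
import struct


def decode_addr(raw: int) -> str:
    """Decode a uint32 that was read on a little-endian host from a field
    holding an IPv4 address in network byte order (assumed layout)."""
    # Re-serialise the integer as the 4 bytes it occupied in memory,
    # then read those bytes in network order as a dotted-quad address.
    return str(ipaddress.IPv4Address(struct.pack("<I", raw)))


# Values copied from the 4-tuples in the 2024-06-19 comment.
saddr = decode_addr(21559714)
daddr = decode_addr(3917357249)
print(saddr, daddr)
```

Under that assumption the two values decode to 162.249.72.1 and 193.32.126.233, which match the server and client IPs listed in the same comment and suggest the captured packets were travelling from server to client. Whether the `source` and `dest` port fields need the same byte swap is not confirmed by the ticket, so they are left undecoded here.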