[M22 Project] Multiple TSGX nodes at the YGN-MYTEL site report "container firewall restarted"
| ID | Creation Date | Assignee | Status |
|---|---|---|---|
| OMPUB-1321 | 2024-06-11T13:59:01.000+0800 | 杨威 | Closed |
TSGX nodes at YGN-MYTEL that reported "container firewall restarted", with their restart times:
| Device | Restart time |
|---|---|
| TSG-OS-YGN-MYTEL-TSGX004 | 2024-06-11T03:39:11+06:30 |
| TSG-OS-YGN-MYTEL-TSGX005 | 2024-06-11T03:37:11+06:30 |
| TSG-OS-YGN-MYTEL-TSGX011 | 2024-06-11T03:38:11+06:30 |
| TSG-OS-YGN-MYTEL-TSGX012 | 2024-06-11T03:37:41+06:30 |
| TSG-OS-YGN-MYTEL-TSGX018 | 2024-06-11T03:37:41+06:30 |
| TSG-OS-YGN-MYTEL-TSGX019 | 2024-06-11T03:39:11+06:30 |
| TSG-OS-YGN-MYTEL-TSGX021 | 2024-06-11T03:31:11+06:30 |
| TSG-OS-YGN-MYTEL-TSGX022 | 2024-06-11T03:37:41+06:30 |
| TSG-OS-YGN-MYTEL-TSGX023 | 2024-06-11T03:38:41+06:30 |
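The restart times above carry the site's +06:30 offset, while the comment timestamps in this issue use +0800. A minimal Python sketch, using only the timestamps listed above, that computes the restart window and expresses it in +08:00 for cross-referencing (the dictionary and function names are illustrative, not part of any TSG tooling):

```python
from datetime import datetime, timedelta, timezone

# Restart times copied from the list above (site-local offset +06:30).
RESTARTS = {
    "TSG-OS-YGN-MYTEL-TSGX004": "2024-06-11T03:39:11+06:30",
    "TSG-OS-YGN-MYTEL-TSGX005": "2024-06-11T03:37:11+06:30",
    "TSG-OS-YGN-MYTEL-TSGX011": "2024-06-11T03:38:11+06:30",
    "TSG-OS-YGN-MYTEL-TSGX012": "2024-06-11T03:37:41+06:30",
    "TSG-OS-YGN-MYTEL-TSGX018": "2024-06-11T03:37:41+06:30",
    "TSG-OS-YGN-MYTEL-TSGX019": "2024-06-11T03:39:11+06:30",
    "TSG-OS-YGN-MYTEL-TSGX021": "2024-06-11T03:31:11+06:30",
    "TSG-OS-YGN-MYTEL-TSGX022": "2024-06-11T03:37:41+06:30",
    "TSG-OS-YGN-MYTEL-TSGX023": "2024-06-11T03:38:41+06:30",
}


def restart_window(restarts):
    """Return (earliest, latest) restart as timezone-aware datetimes."""
    times = [datetime.fromisoformat(ts) for ts in restarts.values()]
    return min(times), max(times)


# Offset used by the comment timestamps in this issue.
UTC_PLUS_8 = timezone(timedelta(hours=8))

first, last = restart_window(RESTARTS)
print("first restart (+08:00):", first.astimezone(UTC_PLUS_8).isoformat())
print("last restart  (+08:00):", last.astimezone(UTC_PLUS_8).isoformat())
print("window length:", last - first)
```

All nine restarts fall inside an eight-minute window, which fits a common external trigger better than independent per-node faults.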
Of these, three nodes (YGN-MYTEL-TSGX004, YGN-MYTEL-TSGX019, and YGN-MYTEL-TSGX023) processed almost no traffic after the firewall restarted.
gitlab commented on 2024-06-12T15:51:30.460+0800:
[陆秋文|https://git.mesalab.cn/luqiuwen] mentioned this issue in [a commit|857de62018] of [MESA Platform / marsio|https://git.mesalab.cn/MESA_Platform/marsio] on branch [bugfix-OMPUB-1321|https://git.mesalab.cn/MESA_Platform/marsio/-/tree/bugfix-OMPUB-1321]:{quote}OMPUB-1321 bugfix: clear the inflight lean counter when prod recreated.{quote}
gitlab commented on 2024-06-14T09:38:31.590+0800:
[陆秋文|https://git.mesalab.cn/luqiuwen] mentioned this issue in [a merge request|https://git.mesalab.cn/MESA_Platform/marsio/-/merge_requests/480] of [MESA Platform / marsio|https://git.mesalab.cn/MESA_Platform/marsio] on branch [bugfix-OMPUB-1321|https://git.mesalab.cn/MESA_Platform/marsio/-/tree/bugfix-OMPUB-1321]:{quote}OMPUB-1321 bugfix: clear the inflight lean counter when prod recreated.{quote}
gitlab commented on 2024-06-14T09:38:34.791+0800:
[陆秋文|https://git.mesalab.cn/luqiuwen] mentioned this issue in [a commit|b4c19cdd84] of [MESA Platform / marsio|https://git.mesalab.cn/MESA_Platform/marsio] on branch [bugfix-OMPUB-1321|https://git.mesalab.cn/MESA_Platform/marsio/-/tree/bugfix-OMPUB-1321]:{quote}OMPUB-1321 bugfix: clear the inflight lean counter when prod recreated.{quote}
gitlab commented on 2024-06-14T14:57:22.329+0800:
[陆秋文|https://git.mesalab.cn/luqiuwen] mentioned this issue in [a commit|06f515f5fe] of [MESA Platform / marsio|https://git.mesalab.cn/MESA_Platform/marsio] on branch [backport-from-dev-4-8|https://git.mesalab.cn/MESA_Platform/marsio/-/tree/backport-from-dev-4-8]:{quote}OMPUB-1321 bugfix: clear the inflight lean counter when prod recreated.{quote}
gitlab commented on 2024-06-14T14:58:10.372+0800:
[陆秋文|https://git.mesalab.cn/luqiuwen] mentioned this issue in [a merge request|https://git.mesalab.cn/MESA_Platform/marsio/-/merge_requests/481] of [MESA Platform / marsio|https://git.mesalab.cn/MESA_Platform/marsio] on branch [backport-from-dev-4-8|https://git.mesalab.cn/MESA_Platform/marsio/-/tree/backport-from-dev-4-8]:{quote}OMPUB-1321 bugfix: clear the inflight lean counter when prod recreated.{quote}
yangwei commented on 2024-06-17T09:55:10.131+0800:
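1. At the time of the restarts (June 11, 03:31 to 03:42), the firewall engine on each affected device raised alerts for high per-packet processing latency in the third-party DPI. The protocol stack being processed was IPv4->UDP.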
!image-2024-06-17-09-40-59-680.png|width=995,height=461!
2. Per-node monitoring shows a sudden surge in new UDP sessions at the same time.
!image-2024-06-17-09-38-55-435.png|width=460,height=288!
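3. In the coredump captured during on-site debugging, the packets with high processing latency have destination port 3389, and both source and destination IPs fall in the 141.11.197.xx range. A lookup shows that this range belongs to [Path Network|https://path.net/].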
!image-2024-06-17-10-01-39-693.png|width=404,height=247!
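This company's main business is DDoS mitigation and related services. Hypothesis: the company was diverting UDP/3389 attack traffic, the traffic itself caused high processing latency in qdpi, and that triggered the firewall watchdog, which led to the restart.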
!image-2024-06-17-09-49-36-214.png|width=710,height=397!
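Approach: construct UDP/3389 traffic locally and try to reproduce the spin_lock seen in the on-site stack trace, to determine whether the trigger is the traffic payload itself or high processing latency caused by exceeding pre-allocated resources.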
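A minimal sketch of one way to generate UDP/3389 test traffic in a lab, using only the Python standard library. The function name and parameters are hypothetical, and the payloads are random bytes, so this reproduces the surge in new UDP sessions but not the content of the captured traffic; if the payload itself is the trigger, replaying a capture of the real packets would be the more faithful test.

```python
import os
import socket


def send_udp_flows(host, port=3389, flows=100, packets_per_flow=10, payload_size=64):
    """Send random-payload UDP datagrams to host:port and return the total sent.

    Each flow uses a fresh socket, so the OS assigns a new ephemeral source
    port and the receiver typically sees a new session per flow.
    """
    sent = 0
    for _ in range(flows):
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            for _ in range(packets_per_flow):
                sock.sendto(os.urandom(payload_size), (host, port))
                sent += 1
    return sent
```

Pointing `host` at a device under test inside an isolated lab network, for example `send_udp_flows(lab_host, flows=10000)`, gives a rough stand-in for the new-session surge seen in monitoring; `lab_host` is a placeholder for an address in that environment.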
yangwei commented on 2024-06-19T14:36:30.512+0800:
Reproduced again on YGN-MYTEL-TSGX026 at 2024-06-15T03:37:40+06:30.
On site, the two worker threads that hit the spin_lock were processing packets with the following 4-tuples:
{saddr = 21559714, daddr = 3917357249, source = 45811, dest = 50537}
{saddr = 21559714, daddr = 3917357249, source = 45811, dest = 58562}
!anydesk00000.png|width=560,height=315!
!anydesk00001.png|width=558,height=314!
Client IP 193.32.126.233, org "AS39351 31173 Services AB"
Server IP 162.249.72.1, org "AS6507 Riot Games, Inc"
!image-2024-06-19-14-30-38-097.png|width=455,height=271!!image-2024-06-19-14-31-37-570.png|width=436,height=269!
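The integer saddr/daddr values in the 4-tuples above can be mapped back to the dotted-quad addresses quoted in this comment. A minimal sketch, assuming the values were printed on a little-endian host from fields stored in network byte order (the helper name is illustrative):

```python
import socket
import struct


def raw_u32_to_ipv4(value):
    """Decode a 32-bit address field, as printed on a little-endian host, to dotted-quad."""
    # Packing little-endian restores the in-memory (network order) byte sequence.
    return socket.inet_ntoa(struct.pack("<I", value))


saddr = raw_u32_to_ipv4(21559714)
daddr = raw_u32_to_ipv4(3917357249)
print(saddr, "->", daddr)
```

Under that assumption saddr decodes to the server IP 162.249.72.1 and daddr to the client IP 193.32.126.233, which suggests both packets were travelling from server to client. The source/dest port fields may also be in network byte order; they are left undecoded here because the comment does not confirm it.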
yangwei commented on 2024-06-19T14:40:44.572+0800:
Reproduced on TSG-OS-YGN-MYTEL-TSGX024 at 2024-06-19T06:18:11+06:30.
!image-2024-06-19-14-41-03-900.png|width=631,height=311!
!image-2024-06-19-14-40-33-071.png|width=1049,height=613!
gitlab commented on 2024-06-20T11:10:44.703+0800:
[陆秋文|https://git.mesalab.cn/luqiuwen] mentioned this issue in [a merge request|https://git.mesalab.cn/MESA_Platform/marsio/-/merge_requests/483] of [MESA Platform / marsio|https://git.mesalab.cn/MESA_Platform/marsio] on branch [dev-4.7|https://git.mesalab.cn/MESA_Platform/marsio/-/tree/dev-4.7]:{quote}OMPUB-1321 bugfix: clear the inflight lean counter when prod recreated.{quote}
yangwei commented on 2024-06-21T12:20:04.328+0800:
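We tried to reproduce the conditions that trigger the spin_lock, both locally and in the Fujian test environment. Preliminary conclusions:
- The spin_lock has little to do with resource allocation. After reducing the nflow allocated to each thread, neither looped packet replay locally nor the Fujian test environment triggered the spin_lock.
- With resource allocation ruled out, the spin_lock is more likely triggered by specific traffic payloads.
In the stacks where spin_lock contention caused high processing latency, the spin_lock is usually called from libqmbundle.so.
- The current qdpi_detector implementation follows the sample code: it initializes one global engine object, the engine loads the qmbundle, and N worker_process instances are then registered in the engine object (one per packet-io thread).
- At runtime, the worker_id is passed in to perform identification.
Hypothesis: the interfaces in libqmbundle.so call spin_lock because identifying certain payloads requires contended resources. Because the single global engine shares one qmbundle, the spin_lock overhead becomes large when those resources are held for a long time.
Next step: update the qdpi_detector initialization logic so that each worker_process gets its own engine object with its own loaded copy of the qmbundle, then verify whether spin_lock is no longer called after the change.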
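The difference between the two initialization layouts can be sketched as follows. This is a toy model, not the qdpi_detector or libqmbundle.so API: `Bundle`, `Engine`, and every function name here are hypothetical stand-ins, and the lock simply represents whatever contended resource the real library guards with spin_lock.

```python
import threading


class Bundle:
    """Hypothetical stand-in for a loaded qmbundle that guards shared state with a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.users = set()  # worker ids that have taken this bundle's lock

    def classify(self, worker_id, payload):
        with self._lock:
            self.users.add(worker_id)
            return len(payload)  # placeholder result, not real classification


class Engine:
    """Hypothetical engine that owns exactly one Bundle."""

    def __init__(self):
        self.bundle = Bundle()


def build_shared(n_workers):
    """Current layout: every worker uses the same global engine."""
    engine = Engine()
    return [engine] * n_workers


def build_per_worker(n_workers):
    """Proposed layout: one engine, and one bundle copy, per worker."""
    return [Engine() for _ in range(n_workers)]


def run(engines, packets_per_worker=100):
    """Run one thread per worker; return the most workers sharing any single lock."""

    def worker(worker_id):
        for _ in range(packets_per_worker):
            engines[worker_id].bundle.classify(worker_id, b"payload")

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(len(engines))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return max(len(e.bundle.users) for e in engines)


print("shared engine, workers per lock:", run(build_shared(4)))
print("per-worker engines, workers per lock:", run(build_per_worker(4)))
```

In the shared layout all four workers pass through one lock, so a slow classification in one thread can stall the others; in the per-worker layout each lock has a single user. The expected cost of the per-worker layout is memory, since the bundle is loaded once per worker.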
gitlab commented on 2024-06-21T15:39:51.350+0800:
[宋延超|https://git.mesalab.cn/syc] mentioned this issue in [a commit|3d80c610dc] of [TSG / tsg-os-buildimage|https://git.mesalab.cn/tsg/tsg-os-buildimage] on branch [update-mrzcpd-v4.7.7-for-24.02|https://git.mesalab.cn/tsg/tsg-os-buildimage/-/tree/update-mrzcpd-v4.7.7-for-24.02]:{quote}🐞 fix(OMPUB-1321): Upgrade mrzcpd to v4.7.7{quote}
gitlab commented on 2024-06-21T15:40:28.996+0800:
[宋延超|https://git.mesalab.cn/syc] mentioned this issue in [a merge request|https://git.mesalab.cn/tsg/tsg-os-buildimage/-/merge_requests/2586] of [TSG / tsg-os-buildimage|https://git.mesalab.cn/tsg/tsg-os-buildimage] on branch [update-mrzcpd-v4.7.7-for-24.02|https://git.mesalab.cn/tsg/tsg-os-buildimage/-/tree/update-mrzcpd-v4.7.7-for-24.02]:{quote}🐞 fix(OMPUB-1321): Upgrade mrzcpd to v4.7.7{quote}
gitlab commented on 2024-06-21T15:45:54.505+0800:
[宋延超|https://git.mesalab.cn/syc] mentioned this issue in [a commit|21be469b63] of [TSG / tsg-os-buildimage|https://git.mesalab.cn/tsg/tsg-os-buildimage] on branch [update-mrzcpd-v4.7.7-for-24.02|https://git.mesalab.cn/tsg/tsg-os-buildimage/-/tree/update-mrzcpd-v4.7.7-for-24.02]:{quote}🐞 fix(OMPUB-1321): Upgrade mrzcpd to v4.7.7{quote}
Attachments
58894/1.png
59174/anydesk00000.png
59175/anydesk00001.png
59129/image-2024-06-17-09-38-55-435.png
59130/image-2024-06-17-09-40-59-680.png
59131/image-2024-06-17-09-49-36-214.png
59132/image-2024-06-17-10-01-39-693.png
59176/image-2024-06-19-14-30-38-097.png
59177/image-2024-06-19-14-31-20-805.png
59178/image-2024-06-19-14-31-37-570.png
59179/image-2024-06-19-14-39-59-360.png
59180/image-2024-06-19-14-40-33-071.png
59181/image-2024-06-19-14-41-03-900.png
58899/YGN-MYTEL-TSGX004.html
58887/YGN-MYTEL-TSGX004.png
58889/企业微信截图_b0135b85-7a53-42f5-ac12-f77da916093e.png
58891/企业微信截图_b0135b85-7a53-42f5-ac12-f77da916093e-1.png