# [M22 Project] Multiple tsgx devices at the YGN-MYTEL site reported container firewall restarted
| ID | Creation Date | Assignee | Status |
|----|----------------|----------|--------|
| OMPUB-1321 | 2024-06-11T13:59:01.000+0800 | 杨威 | Closed |
---
TSGX devices at YGN-MYTEL that reported container firewall restarted, with their restart times:
TSG-OS-YGN-MYTEL-TSGX004 2024-06-11T03:39:11+06:30
TSG-OS-YGN-MYTEL-TSGX005 2024-06-11T03:37:11+06:30
TSG-OS-YGN-MYTEL-TSGX011 2024-06-11T03:38:11+06:30
TSG-OS-YGN-MYTEL-TSGX012 2024-06-11T03:37:41+06:30
TSG-OS-YGN-MYTEL-TSGX018 2024-06-11T03:37:41+06:30
TSG-OS-YGN-MYTEL-TSGX019 2024-06-11T03:39:11+06:30
TSG-OS-YGN-MYTEL-TSGX021 2024-06-11T03:31:11+06:30
TSG-OS-YGN-MYTEL-TSGX022 2024-06-11T03:37:41+06:30
TSG-OS-YGN-MYTEL-TSGX023 2024-06-11T03:38:41+06:30
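The restart times above carry a +06:30 offset, while the comment timestamps in this ticket use +0800. As a rough aid for correlating the two, the sketch below parses the listed times and reports the earliest restart, the latest restart, and the spread. The dictionary keys are shortened hostnames used only for illustration.

```python
from datetime import datetime, timedelta, timezone

# Restart times copied from the list above (site-local offset +06:30).
RESTARTS = {
    "TSGX004": "2024-06-11T03:39:11+06:30",
    "TSGX005": "2024-06-11T03:37:11+06:30",
    "TSGX011": "2024-06-11T03:38:11+06:30",
    "TSGX012": "2024-06-11T03:37:41+06:30",
    "TSGX018": "2024-06-11T03:37:41+06:30",
    "TSGX019": "2024-06-11T03:39:11+06:30",
    "TSGX021": "2024-06-11T03:31:11+06:30",
    "TSGX022": "2024-06-11T03:37:41+06:30",
    "TSGX023": "2024-06-11T03:38:41+06:30",
}


def restart_window(restarts):
    """Return (earliest, latest, span) across all restart timestamps."""
    times = [datetime.fromisoformat(ts) for ts in restarts.values()]
    earliest, latest = min(times), max(times)
    return earliest, latest, latest - earliest


earliest, latest, span = restart_window(RESTARTS)

# Express the window in +08:00, the offset used by the comment timestamps.
plus8 = timezone(timedelta(hours=8))
print(earliest.astimezone(plus8).isoformat())
print(latest.astimezone(plus8).isoformat())
print(span)
```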
Of these, after the firewall restarted on YGN-MYTEL-TSGX004, YGN-MYTEL-TSGX019, and YGN-MYTEL-TSGX023, the traffic processed by the program dropped to nearly 0.
---
**gitlab** commented on *2024-06-12T15:51:30.460+0800*:
[陆秋文|https://git.mesalab.cn/luqiuwen] mentioned this issue in [a commit|https://git.mesalab.cn/MESA_Platform/marsio/-/commit/857de62018a8eaa6547fab541f251dd7e566d0f6] of [MESA Platform / marsio|https://git.mesalab.cn/MESA_Platform/marsio] on branch [bugfix-OMPUB-1321|https://git.mesalab.cn/MESA_Platform/marsio/-/tree/bugfix-OMPUB-1321]:{quote}OMPUB-1321 bugfix: clear the inflight lean counter when prod recreated.{quote}
---
**gitlab** commented on *2024-06-14T09:38:31.590+0800*:
[陆秋文|https://git.mesalab.cn/luqiuwen] mentioned this issue in [a merge request|https://git.mesalab.cn/MESA_Platform/marsio/-/merge_requests/480] of [MESA Platform / marsio|https://git.mesalab.cn/MESA_Platform/marsio] on branch [bugfix-OMPUB-1321|https://git.mesalab.cn/MESA_Platform/marsio/-/tree/bugfix-OMPUB-1321]:{quote}OMPUB-1321 bugfix: clear the inflight lean counter when prod recreated.{quote}
---
**gitlab** commented on *2024-06-14T09:38:34.791+0800*:
[陆秋文|https://git.mesalab.cn/luqiuwen] mentioned this issue in [a commit|https://git.mesalab.cn/MESA_Platform/marsio/-/commit/b4c19cdd847fce4cbb32c6981fbaf9d3a3d30b94] of [MESA Platform / marsio|https://git.mesalab.cn/MESA_Platform/marsio] on branch [bugfix-OMPUB-1321|https://git.mesalab.cn/MESA_Platform/marsio/-/tree/bugfix-OMPUB-1321]:{quote}OMPUB-1321 bugfix: clear the inflight lean counter when prod recreated.{quote}
---
**gitlab** commented on *2024-06-14T14:57:22.329+0800*:
[陆秋文|https://git.mesalab.cn/luqiuwen] mentioned this issue in [a commit|https://git.mesalab.cn/MESA_Platform/marsio/-/commit/06f515f5fee9457b99841cfd3bb42bcf31f91f7d] of [MESA Platform / marsio|https://git.mesalab.cn/MESA_Platform/marsio] on branch [backport-from-dev-4-8|https://git.mesalab.cn/MESA_Platform/marsio/-/tree/backport-from-dev-4-8]:{quote}OMPUB-1321 bugfix: clear the inflight lean counter when prod recreated.{quote}
---
**gitlab** commented on *2024-06-14T14:58:10.372+0800*:
[陆秋文|https://git.mesalab.cn/luqiuwen] mentioned this issue in [a merge request|https://git.mesalab.cn/MESA_Platform/marsio/-/merge_requests/481] of [MESA Platform / marsio|https://git.mesalab.cn/MESA_Platform/marsio] on branch [backport-from-dev-4-8|https://git.mesalab.cn/MESA_Platform/marsio/-/tree/backport-from-dev-4-8]:{quote}OMPUB-1321 bugfix: clear the inflight lean counter when prod recreated.{quote}
---
**yangwei** commented on *2024-06-17T09:55:10.131+0800*:
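1. During the restart window on June 11, 3:31~3:42, the firewall engine on each affected device raised alerts for high per-packet processing latency in the third-party DPI. The stack being processed was IPv4->UDP.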
!image-2024-06-17-09-40-59-680.png|width=995,height=461!
2. Per-device monitoring shows a sudden spike in newly created UDP sessions at the same time.
!image-2024-06-17-09-38-55-435.png|width=460,height=288!
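3. In the coredump captured on site during debugging, the high-latency processing involved destination port 3389, and both the source and destination IPs were in the 141.11.197.xx range. A lookup shows that this range belongs to Path Network ([Path Network|https://path.net/]).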
!image-2024-06-17-10-01-39-693.png|width=404,height=247!
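The company's main business includes DDoS mitigation. The suspected cause of the restart is that the company was diverting UDP 3389 attack traffic, and that the traffic itself caused high processing latency in qdpi, which triggered the firewall watchdog and led to the restart.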
!image-2024-06-17-09-49-36-214.png|width=710,height=397!
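To make the suspected chain concrete (slow qdpi processing, then watchdog, then restart), here is a minimal sketch of a heartbeat-style watchdog. How the firewall watchdog is actually implemented is not described in this ticket; the class name, the 10-second timeout, and the worker ids are all hypothetical.

```python
class HeartbeatWatchdog:
    """Hypothetical watchdog: a worker counts as stalled if its last heartbeat is too old."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_beat = {}  # worker_id -> time of last heartbeat, in seconds

    def beat(self, worker_id: int, now: float) -> None:
        """Record that a worker finished a packet at time `now`."""
        self.last_beat[worker_id] = now

    def stalled_workers(self, now: float) -> list:
        """Return the ids of workers whose last heartbeat is older than the timeout."""
        return sorted(w for w, t in self.last_beat.items() if now - t > self.timeout_s)


wd = HeartbeatWatchdog(timeout_s=10.0)  # timeout value is an assumption
wd.beat(0, now=0.0)
wd.beat(1, now=0.0)
wd.beat(0, now=9.0)  # worker 0 keeps up
# Worker 1 is stuck on a slow packet and never beats again.
stalled = wd.stalled_workers(now=12.0)
print(stalled)
```

In a design like this, any non-empty result would be the signal to restart the process, which is why a single slow packet path can take down the whole engine.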
 
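Approach: construct UDP 3389 traffic locally and try to reproduce the spin_lock seen in the on-site stack, to determine whether the high processing latency is triggered by the traffic load itself or by pre-allocated resources being exceeded.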
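One possible way to generate such traffic in a lab is sketched below. It only covers the simplest case (random payloads sent to UDP port 3389 on a test target). The function name, payload size, and target address are assumptions, and the actual attack payloads seen on site are not known from this ticket.

```python
import os
import socket


def send_udp_burst(host: str, port: int, count: int, payload_size: int = 64) -> int:
    """Send `count` UDP datagrams with random payloads; return how many were sent."""
    sent = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for _ in range(count):
            sock.sendto(os.urandom(payload_size), (host, port))
            sent += 1
    return sent


# Example (lab target only; the address here is a placeholder):
# send_udp_burst("127.0.0.1", 3389, count=1000)
```

Each call uses one socket, so all datagrams share a source port. Exercising new-session creation the way the monitoring spike suggests would need many source ports or addresses, which this sketch does not attempt.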
---
**yangwei** commented on *2024-06-19T14:36:30.512+0800*:
2024-06-15T03:37:40+06:30: reproduced again on YGN-MYTEL-TSGX026.
The 4-tuples of the packets being processed by the two worker threads that hit the spin_lock on site are as follows:
{saddr = 21559714, daddr = 3917357249, source = 45811, dest = 50537}
{saddr = 21559714, daddr = 3917357249, source = 45811, dest = 58562}
!anydesk00000.png|width=560,height=315!
!anydesk00001.png|width=558,height=314!
Client IP: 193.32.126.233, org “AS39351 31173 Services AB”
Server IP: 162.249.72.1, org “AS6507 Riot Games, Inc”
!image-2024-06-19-14-30-38-097.png|width=455,height=271!!image-2024-06-19-14-31-37-570.png|width=436,height=269!
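The saddr and daddr integers in the 4-tuples map to the two IPs above if they are read as IPv4 addresses in network byte order that were printed as a little-endian 32-bit integer. A minimal sketch under that assumption follows (the helper name is made up for illustration). Under this reading, saddr is the server IP and daddr is the client IP. The ports are left as printed, since the ticket does not say which byte order they are in.

```python
import socket
import struct


def u32_to_ipv4(value: int) -> str:
    """Convert a little-endian uint32 holding a network-order IPv4 address to dotted form."""
    return socket.inet_ntoa(struct.pack("<I", value))


saddr = u32_to_ipv4(21559714)
daddr = u32_to_ipv4(3917357249)
print(saddr, daddr)
```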
 
---
**yangwei** commented on *2024-06-19T14:40:44.572+0800*:
2024-06-19T06:18:11+06:30: reproduced on TSG-OS-YGN-MYTEL-TSGX024.
!image-2024-06-19-14-41-03-900.png|width=631,height=311!
!image-2024-06-19-14-40-33-071.png|width=1049,height=613!
---
**gitlab** commented on *2024-06-20T11:10:44.703+0800*:
[陆秋文|https://git.mesalab.cn/luqiuwen] mentioned this issue in [a merge request|https://git.mesalab.cn/MESA_Platform/marsio/-/merge_requests/483] of [MESA Platform / marsio|https://git.mesalab.cn/MESA_Platform/marsio] on branch [dev-4.7|https://git.mesalab.cn/MESA_Platform/marsio/-/tree/dev-4.7]:{quote}OMPUB-1321 bugfix: clear the inflight lean counter when prod recreated.{quote}
---
**yangwei** commented on *2024-06-21T12:20:04.328+0800*:
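Attempted to reproduce the conditions that trigger the spin_lock in the local and Fujian test environments. Preliminary conclusions:
* The spin_lock has little to do with resource allocation. With the nflow allocated to each thread reduced, the spin_lock was not triggered either when replaying packets in a loop locally or in the Fujian test environment.
* With resource allocation ruled out, the spin_lock is more likely to be triggered by specific traffic payloads.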
 
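In the stacks where spin_lock contention causes high processing latency, the spin_lock is usually called from libqmbundle.so:
* The current qdpi_detector implementation follows the sample code: it initializes one global engine object, the engine loads qmbundle, and N worker_process instances are then registered in the engine object, matching the number of packet-io threads.
* At runtime, the worker_id is passed in to perform identification.
The suspected reason that interfaces in libqmbundle.so call spin_lock is that identifying certain payloads requires a contended resource. Because the single global engine shares one qmbundle, the spin_lock overhead becomes large when the contended resource is held for a long time.
Next step: update the qdpi_detector initialization logic so that each worker_process gets its own engine object and loads its own copy of qmbundle, then verify whether spin_lock is no longer called after the update.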
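The difference between the two initialization strategies can be illustrated with a small model. This is not the qdpi or qmbundle API: `Engine`, `classify`, and the bundle path are hypothetical stand-ins, and a plain mutex stands in for the spin_lock. The sketch only shows why one shared engine can make workers wait on each other while per-worker engines cannot.

```python
import threading


class Engine:
    """Hypothetical stand-in for a DPI engine that holds one loaded bundle."""

    def __init__(self, bundle_path: str):
        self.bundle_path = bundle_path  # stands in for loading qmbundle
        self._lock = threading.Lock()   # models the contended internal resource
        self.contended = 0              # times a caller found the lock already held
        self.processed = 0

    def classify(self, worker_id: int, packet: bytes) -> int:
        if not self._lock.acquire(blocking=False):
            self._lock.acquire()        # another worker holds the resource: wait
            self.contended += 1         # updated while holding the lock
        try:
            self.processed += 1
            return len(packet) % 2      # placeholder for the real classification
        finally:
            self._lock.release()


def run_workers(engine_for_worker, n_workers: int, packets_per_worker: int) -> None:
    """Start n_workers threads; each classifies packets on the engine it is given."""
    def worker(worker_id: int) -> None:
        engine = engine_for_worker(worker_id)
        for _ in range(packets_per_worker):
            engine.classify(worker_id, b"\x00" * 64)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


# Current layout: one global engine shared by all workers.
shared = Engine("bundle")
run_workers(lambda _worker_id: shared, n_workers=4, packets_per_worker=1000)

# Proposed layout: one engine (and one bundle copy) per worker.
per_worker = [Engine("bundle") for _ in range(4)]
run_workers(lambda worker_id: per_worker[worker_id], n_workers=4, packets_per_worker=1000)

print("shared contended:", shared.contended)
print("per-worker contended:", [e.contended for e in per_worker])
```

In the per-worker layout each lock is only ever touched by one thread, so the contention count stays at zero by construction. The trade-off is memory: every worker holds its own copy of the bundle.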
 
 
---
**gitlab** commented on *2024-06-21T15:39:51.350+0800*:
[宋延超|https://git.mesalab.cn/syc] mentioned this issue in [a commit|https://git.mesalab.cn/tsg/tsg-os-buildimage/-/commit/3d80c610dc8685b6df7ba05cd0da6c18b0cbad8b] of [TSG / tsg-os-buildimage|https://git.mesalab.cn/tsg/tsg-os-buildimage] on branch [update-mrzcpd-v4.7.7-for-24.02|https://git.mesalab.cn/tsg/tsg-os-buildimage/-/tree/update-mrzcpd-v4.7.7-for-24.02]:{quote}🐞 fix(OMPUB-1321): Upgrade mrzcpd to v4.7.7{quote}
---
**gitlab** commented on *2024-06-21T15:40:28.996+0800*:
[宋延超|https://git.mesalab.cn/syc] mentioned this issue in [a merge request|https://git.mesalab.cn/tsg/tsg-os-buildimage/-/merge_requests/2586] of [TSG / tsg-os-buildimage|https://git.mesalab.cn/tsg/tsg-os-buildimage] on branch [update-mrzcpd-v4.7.7-for-24.02|https://git.mesalab.cn/tsg/tsg-os-buildimage/-/tree/update-mrzcpd-v4.7.7-for-24.02]:{quote}🐞 fix(OMPUB-1321): Upgrade mrzcpd to v4.7.7{quote}
---
**gitlab** commented on *2024-06-21T15:45:54.505+0800*:
[宋延超|https://git.mesalab.cn/syc] mentioned this issue in [a commit|https://git.mesalab.cn/tsg/tsg-os-buildimage/-/commit/21be469b6311f00a3fca22c5f2172ca299a008ef] of [TSG / tsg-os-buildimage|https://git.mesalab.cn/tsg/tsg-os-buildimage] on branch [update-mrzcpd-v4.7.7-for-24.02|https://git.mesalab.cn/tsg/tsg-os-buildimage/-/tree/update-mrzcpd-v4.7.7-for-24.02]:{quote}🐞 fix(OMPUB-1321): Upgrade mrzcpd to v4.7.7{quote}
---
## Attachments
**58894/1.png**
---
**59174/anydesk00000.png**
---
**59175/anydesk00001.png**
---
**59129/image-2024-06-17-09-38-55-435.png**
---
**59130/image-2024-06-17-09-40-59-680.png**
---
**59131/image-2024-06-17-09-49-36-214.png**
---
**59132/image-2024-06-17-10-01-39-693.png**
---
**59176/image-2024-06-19-14-30-38-097.png**
---
**59177/image-2024-06-19-14-31-20-805.png**
---
**59178/image-2024-06-19-14-31-37-570.png**
---
**59179/image-2024-06-19-14-39-59-360.png**
---
**59180/image-2024-06-19-14-40-33-071.png**
---
**59181/image-2024-06-19-14-41-03-900.png**
---
**58899/YGN-MYTEL-TSGX004.html**
---
**58887/YGN-MYTEL-TSGX004.png**
---
**58889/企业微信截图_b0135b85-7a53-42f5-ac12-f77da916093e.png**
---
**58891/企业微信截图_b0135b85-7a53-42f5-ac12-f77da916093e-1.png**
---