# [E21 On-site] BOLE-IGW: CPU on multiple NPBs near 100%

| ID | Creation Date | Assignee | Status |
|----|----------------|----------|--------|
| OMPUB-670 | 2022-10-15T17:53:41.000+0800 | 杨威 | Closed |

---

2022-10-15: After the traffic cutover at the BOLE-IGW site was completed and traffic was fed into the system, CPU alarms were raised on 10.225.11.1, 10.225.11.2, and 10.225.11.5. Almost all CPU cores stayed between 98% and 100%.

Troubleshooting and handling results:

The fs2_sysinfo.log on the BOLE-IGW NPB shows:

Tcp_Link_NEW:48975

Udp_link_NEW: 13630

Traffic per NPB: 8 Gbps

Under normal conditions, about 2,000 new TCP+UDP connections per 1 Gbps is considered normal.

The current new connection count, TCP+UDP = 62605, is too high.

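The comparison above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: it uses the counters quoted from fs2_sysinfo.log and the rule of thumb of about 2,000 new connections per Gbps, and it assumes the counters and the baseline refer to the same time unit, which the log excerpt does not state. The variable names are hypothetical.

```python
# Figures quoted above from fs2_sysinfo.log on one BOLE-IGW NPB.
tcp_link_new = 48975
udp_link_new = 13630
traffic_gbps = 8

# Rule of thumb from the ticket: about 2,000 new TCP+UDP connections per Gbps.
baseline_per_gbps = 2000

total_new = tcp_link_new + udp_link_new
per_gbps = total_new / traffic_gbps
ratio = per_gbps / baseline_per_gbps

print(f"new connections: {total_new}")
print(f"per Gbps: {per_gbps:.0f}")
print(f"vs. baseline: {ratio:.1f}x")
```

Under these assumptions the NPB sees roughly four times the new-connection density that is considered normal.
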
Enabled the DDOS_Bypass feature on 10.225.11.1/2/5.

Changed stream_bypass_enabled in etc/sapp.toml from 0 to 1 and restarted sapp.

After the restart, CPU usage dropped to a peak of about 80% and the alarm cleared.

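For repeating the same change on several NPBs, the edit could be scripted. This is a minimal sketch, assuming the key appears on its own line in `key=value` form; the real layout of sapp.toml (sections, comments) is not shown in this ticket, and the function name and sample text are hypothetical. sapp still has to be restarted afterwards, as described above.

```python
import re

def enable_stream_bypass(config_text: str) -> str:
    """Return config_text with stream_bypass_enabled set to 1."""
    # Match the key at the start of a line and tolerate spaces around '='.
    pattern = re.compile(r"^(\s*stream_bypass_enabled\s*=\s*)\d+", re.MULTILINE)
    new_text, count = pattern.subn(r"\g<1>1", config_text)
    if count == 0:
        # Refuse to guess where the key belongs if it is missing.
        raise KeyError("stream_bypass_enabled not found")
    return new_text

# Hypothetical excerpt, not the real file contents.
sample = "# hypothetical excerpt\nstream_bypass_enabled=0\n"
print(enable_stream_bypass(sample))
```
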
The attachments include:

SSM-IGW fs2_sysinfo.log

fs2_sysinfo.log and CPU status of BOLE-IGW-05 before DDOS_bypass was enabled, and fs2_sysinfo.log details and CPU status after it was enabled

Screenshots of the relevant data seen in DOS Events and session records.

**liuyang** commented on *2022-10-18T09:39:26.149+0800*:

Supplement based on the on-site feedback from 2022-10-17 (if anything is missing or unclear, please note it so that the on-site colleagues can fill it in):

The 5 NPBs at BOLE-IGW have the DDoS bypass switch turned on, yet during the evening traffic peak (about 6.5 Gbps per NPB) the CPU alarm still fires and the new-connection rate remains high (about 65,000 new TCP+UDP connections per second).

Cause: the Nezha CPU alarm fires at 40/43, about 93%, whereas DDOS Bypass is triggered only when a single core exceeds 95% for 500 milliseconds. So at peak traffic with a high new-connection rate, CPU usage can exceed the alarm threshold even with DDOS Bypass enabled.

!image-2022-10-18-09-39-16-594.png|width=621,height=305!

!image-2022-10-18-09-38-22-432.png|width=608,height=354!

!image-2022-10-18-09-38-37-072.png|width=600,height=275!

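The gap between the two thresholds can be made concrete. This is a minimal sketch, assuming the Nezha alarm compares a usage percentage against 40/43 and that DDOS Bypass needs a core above 95% for at least 500 ms. The sampling interval, the function names, and the sample values are hypothetical and are not taken from Nezha or sapp.

```python
ALARM_THRESHOLD = 40 / 43 * 100   # about 93%, per the comment above
BYPASS_THRESHOLD = 95.0           # single-core trigger, per the comment above
BYPASS_HOLD_MS = 500              # required duration, per the comment above

def alarm_fires(usage_pct: float) -> bool:
    """True if usage is at or above the alarm threshold (comparison assumed)."""
    return usage_pct >= ALARM_THRESHOLD

def bypass_triggers(samples_pct, interval_ms: int) -> bool:
    """True if usage stays above BYPASS_THRESHOLD for BYPASS_HOLD_MS."""
    held_ms = 0
    for usage in samples_pct:
        # Accumulate time above the threshold; any dip resets the timer.
        held_ms = held_ms + interval_ms if usage > BYPASS_THRESHOLD else 0
        if held_ms >= BYPASS_HOLD_MS:
            return True
    return False

# Hypothetical core hovering at 94%: above the alarm line, below the bypass line.
samples = [94.0] * 10   # 10 samples, 100 ms apart
print(alarm_fires(94.0), bypass_triggers(samples, 100))
```

Any load that settles between roughly 93% and 95% raises the alarm without ever engaging the bypass, which matches the behaviour reported from the site.
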
---

**yangwei** commented on *2022-10-18T10:22:20.545+0800*:

[~liuju] Please provide the following information from the functional side to help with the diagnosis:

* Single-core perf information, command: perf top -C [core_id]
* Functional-side sysinfo, command: cat [sapp_run_path]/sysinfo.log
* Functional-side rule scan information, commands: cat [sapp_run_path]/tsg_static_maat.status, cat [sapp_run_path]/app_sketch_maat.status

---

**liuju** commented on *2022-10-18T20:39:45.033+0800*:

[~yangwei] When the CPU alarm was observed again on 10.225.11.5, I collected the on-site data you asked for, together with the detailed asset monitoring data for that NPB over the past day, and uploaded them to the attachments.

---

**liuju** commented on *2022-10-19T16:44:06.635+0800*:

2022-10-19: 5 cores were added to sapp on BOLE-IGW 10.225.11.5.

The current configuration in /opt/tsg/sapp/etc/sapp.toml is:

worker_threads=48

bind_mask=[5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52]

The configuration in /opt/tsg/tfe/conf/tfe/tfe.conf is:

nr_worker_threads=1

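When cores are added by hand, it is easy for the thread count and the core list to drift apart. The sketch below checks the two values quoted above against each other. It assumes each worker thread is pinned to exactly one entry of bind_mask, which the ticket does not state, and the helper name is hypothetical.

```python
def check_binding(worker_threads: int, bind_mask: list[int]) -> list[str]:
    """Return a list of human-readable problems; empty means consistent."""
    problems = []
    if len(bind_mask) != worker_threads:
        problems.append(
            f"{worker_threads} workers but {len(bind_mask)} cores in bind_mask"
        )
    if len(set(bind_mask)) != len(bind_mask):
        problems.append("bind_mask contains duplicate cores")
    return problems

# Values from the sapp.toml quoted above: cores 5 through 52 inclusive.
worker_threads = 48
bind_mask = list(range(5, 53))

print(check_binding(worker_threads, bind_mask) or "consistent")
```
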
---

**zhengchao** commented on *2022-10-19T17:07:43.685+0800*:

Try changing the Top SNI Learning limit to 100 entries.

---
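
**yangwei** commented on *2022-10-19T19:06:56.177+0800*:

Interim cause analysis:

* Symptom
** After traffic was restored on October 15, the NPBs at the Bole-IGW site raised CPU usage alarms
* CPU usage changes
** Looking at CPU usage after October 15: with DDoS Bypass enabled, usage is smoother than before and most CPUs no longer run at 100% as they did previously, but peak-hour usage is still above the NZ alarm threshold
** !image-2022-10-19-18-40-03-374.png|width=557,height=294!
** Historical records show that before the October 15 restoration, CPU usage at this site followed a regular pattern; peak usage came close to the alarm threshold but stayed below it overall
** After traffic was restored on October 15 (TSG was upgraded and repaired in the interim), overall CPU usage rose
** !image-2022-10-19-18-36-44-180.png|width=346,height=386!
* CPU usage distribution
** perf top on the cores with high CPU usage shows that the hot spots are concentrated in the automaton scan functions, plus some free and malloc functions introduced by the jemalloc upgrade (before the upgrade, dictator served as the memory pool and this cost did not exist)
** !image-2022-10-19-18-43-06-186.png|width=552,height=208!
** The scan status statistics show a hit rate above 94% for TSG_OBJ_FQDN
** !image-2022-10-19-18-44-25-300.png|width=819,height=307!
** According to sapp's flow statistics, sessions at this site are mainly one-way C2S flows. This matches the observation that overall traffic on this NPB is modest (6.5 Gbps) while the new-connection count is high (C2S payloads are usually shorter than S2C payloads)
** !image-2022-10-19-18-46-53-312.png|width=618,height=338!
* Suspected causes
** Cause 1: jemalloc increases CPU usage
*** After the sapp upgrade, jemalloc is used as the memory pool, which raised CPU usage by about 5% overall; see the record at https://docs.geedge.net/pages/viewpage.action?pageId=82871119
*** Before the October 1 National Day holiday, some NPBs at Bole IGW had already been upgraded to this sapp version without triggering CPU alarms, so this is not the decisive cause
** Cause 2: presumably, fixing the TopSNI learning bug raised the overall SNI scan hit rate, which in turn raised CPU usage
*** For the bug, see https://jira.geedge.net/browse/OMPUB-664
* Action
** Change the Top SNI Learning limit to 100 entries to test the hypothesis in Cause 2
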
---

**liuju** commented on *2022-10-19T21:32:27.287+0800*:

The Top SNI Learning limit has been changed to 100 entries.

---

**liuju** commented on *2022-10-19T22:46:12.737+0800*:

The 5 cores added to sapp on BOLE-IGW 10.225.11.5 on the morning of 2022-10-19 were unbound again in the afternoon, restoring the previous configuration. sapp and tfe were restarted.

---
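
**yangwei** commented on *2022-10-24T09:56:57.133+0800*:

Notes

* Issue 1: TopSNI learned *.com
** After the TopSNI count was reduced to 100 on Thursday, the FQDN hit rate was still above 85%
** Exporting the 100 learned SNIs showed that one of the entries is *.com
*** The cause is that the session logs contain SNI records for www.com
*** Under the TopSNI learning rules, an SNI qualifies for learning only after its access count exceeds a certain base, so traffic to www.com has evidently reached a significant volume
*** Visiting https://www.com directly returns a normal response with the SNI www.com, which rules out anomalous traffic
** Note: auto-learning may need correction rules
* Issue 2: after TopSNI was changed to 1 entry on Thursday, the FQDN hit rate dropped to 46%, but the peak-hour CPU alarm persists
** Next steps
*** Rule out the CPU cost of jemalloc: provide a sapp build without jemalloc and verify it on one NPB
*** Using historical traffic records, analyze whether a change in traffic has left CPU capacity insufficient
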
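How an entry such as *.com could arise, and what a correction rule might look like, can be sketched as follows. This is a guess at the mechanism, not the actual TopSNI learning code: it assumes the learner generalizes an SNI by replacing its leftmost label with a wildcard. The function names and the tiny suffix set are hypothetical; a real rule would more likely consult the Public Suffix List.

```python
# Hypothetical stand-in for a public-suffix lookup; not a complete list.
PUBLIC_SUFFIXES = {"com", "net", "org", "co.uk"}

def naive_wildcard(sni: str) -> str:
    """Replace the leftmost label with '*': www.example.com -> *.example.com."""
    _, _, rest = sni.partition(".")
    return f"*.{rest}" if rest else sni

def corrected_wildcard(sni: str) -> str:
    """Like naive_wildcard, but keep the exact name when the wildcard
    would cover an entire public suffix (for example *.com)."""
    _, _, rest = sni.partition(".")
    if not rest or rest in PUBLIC_SUFFIXES:
        return sni
    return f"*.{rest}"

for name in ("www.example.com", "www.com"):
    print(name, naive_wildcard(name), corrected_wildcard(name))
```

Under the naive rule, a single popular name like www.com collapses into *.com, which would match nearly every .com SNI and would be consistent with an FQDN hit rate that stays above 85%. The corrected rule keeps www.com as an exact entry.
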
---

**yangwei** commented on *2022-10-25T16:47:54.281+0800*:

On October 24, one NPB at Bole was updated to a sapp build without jemalloc. CPU usage showed no significant change and the alarm persists.

---
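
**yangwei** commented on *2022-10-25T16:54:20.545+0800*:

Based on the monitoring statistics exported from NZ for the fifth NPB at each IGW site before and after the upgrade (September 24 and October 22, both Saturdays), the Throughput and Connection values at 12:00 noon compare as follows:

!image-2022-10-25-16-54-48-494.png|width=608,height=58!

* Traffic at the MWV and DIR sites shows no significant change
* Traffic at Bole, SSM, and BJR rose significantly over the past month
** The largest bandwidth increase is at SSM: 48%, from 4.7 Gbps to 7 Gbps
** The largest session growth is at BJR: 37.1%, with its traffic baseline rising from 3.96 Gbps to 5.33 Gbps
* Bole's traffic growth is second only to BJR, and its session count is significantly higher than at the other sites, averaging 2,400 sessions per Gbps, which is 2.27 times higher than the second-ranked MWV site
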
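The growth figures in the list above can be cross-checked from the quoted endpoints. The sketch below uses only the rounded Gbps values that appear in the text (the underlying table is available only as an image), so small differences from the reported percentages are expected. The helper name is hypothetical.

```python
def growth_pct(before: float, after: float) -> float:
    """Relative growth from before to after, in percent."""
    return (after - before) / before * 100

ssm = growth_pct(4.7, 7.0)     # SSM bandwidth, Gbps
bjr = growth_pct(3.96, 5.33)   # BJR bandwidth, Gbps

print(f"SSM bandwidth growth: {ssm:.1f}%")
print(f"BJR bandwidth growth: {bjr:.1f}%")
```

The SSM result is in line with the 48% quoted above. The BJR result is a bandwidth figure and is not directly comparable to the 37.1% quoted for BJR, which refers to session growth.
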
---

**liuju** commented on *2023-02-21T14:53:06.592+0800*:

After BOLE-IGW traffic dropped from 90 Gbps to the current 50 Gbps, the CPU alarm cleared.
Closing this issue for now.

---

## Attachments

**31788/app_sketch_maat.status**

---

**31791/BOL-IGW-T9K001-NPB05.html**

---

**31789/fs2_sysinfo+(5).log**

---

**31707/image-2022-10-18-09-38-22-432.png**

---

**31708/image-2022-10-18-09-38-37-072.png**

---

**31709/image-2022-10-18-09-39-16-594.png**

---

**31870/image-2022-10-19-18-36-44-180.png**

---

**31871/image-2022-10-19-18-38-48-643.png**

---

**31872/image-2022-10-19-18-40-03-374.png**

---

**31874/image-2022-10-19-18-43-06-186.png**

---

**31875/image-2022-10-19-18-44-25-300.png**

---

**31876/image-2022-10-19-18-46-53-312.png**

---

**32130/image-2022-10-25-16-54-48-494.png**

---

**31790/tsg_static_maat.status**

---

**31657/微信图片_20221015124139.png**

---

**31656/微信图片_20221015124226.png**

---

**31655/微信图片_20221015124401.png**

---

**31654/微信图片_20221015124410.png**

---

**31653/微信图片_20221015124431.png**

---

**31652/微信图片_20221015124456.png**

---

**31651/微信图片_20221015124544.png**

---

**31650/微信图片_20221015124615.png**

---

**31649/微信图片_20221015124648.jpg**

---

**31648/微信图片_20221015124700.png**

---

**31647/微信图片_20221015124753.png**

---

**31658/微信图片_20221015131442.png**

---

**31659/微信图片_20221015131503.png**

---

**31787/微信图片_20221018153649.png**

---

**31786/微信图片_20221018153656.png**

---

**31785/微信图片_20221018153702.png**

---

**31784/微信图片_20221018153707.png**

---