[XJ-TEST] TSG-OS at the trial site drops packets when processing more than 25 Gbps of traffic
| ID | Creation Date | Assignee | Status |
|---|---|---|---|
| OMPUB-1039 | 2023-10-19T18:07:53.000+0800 | 刘洋 | Closed |
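TSG-OS is running version v23.07.19-9dc3a7e. When night-time traffic reaches 25 Gbps, SAPP drops packets (CPU usage around 50%, memory around 30%). The traffic of all 4 devices was then steered onto a single device (3.17), about 60 Gbps in total: SAPP drops half of the traffic, CPU hard-interrupt load is high, and CPU usage does not rise (it stays around 30% at both 25 Gbps and 60 Gbps).
luqiuwen commented on 2023-10-19T18:19:13.697+0800: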
Investigation shows that while sapp is dropping packets, its packet-processing thread is waiting on a lock and not receiving packets; the receive queue fills up, which causes the drops.
{code:java}
658.234 ( 1.660 ms): futex(uaddr: 0x7f1ac4cfcd88, op: WAIT_BITSET|PRIVATE_FLAG|CLOCK_REALTIME, val3: MATCH_ANY) = 0
    syscall (/usr/lib64/libc-2.28.so)
    [0xb62227] (/opt/tsg/sapp/plug/business/tsg_vulpes/libonnxruntime.so.1.10.0)
    [0xb616d0] (/opt/tsg/sapp/plug/business/tsg_vulpes/libonnxruntime.so.1.10.0)
    [0xb617be] (/opt/tsg/sapp/plug/business/tsg_vulpes/libonnxruntime.so.1.10.0)
    [0x8206ac] (/opt/tsg/sapp/plug/business/tsg_vulpes/libonnxruntime.so.1.10.0)
    [0x8a82f1] (/opt/tsg/sapp/plug/business/tsg_vulpes/libonnxruntime.so.1.10.0)
    [0x82cd71] (/opt/tsg/sapp/plug/business/tsg_vulpes/libonnxruntime.so.1.10.0)
    [0x1857a1] (/opt/tsg/sapp/plug/business/tsg_vulpes/libonnxruntime.so.1.10.0)
    [0x1895d0] (/opt/tsg/sapp/plug/business/tsg_vulpes/libonnxruntime.so.1.10.0)
    auto_label_call_ML_c (/opt/tsg/sapp/plug/business/tsg_vulpes/tsg_vulpes.so)
    traffic_process (/opt/tsg/sapp/plug/business/tsg_vulpes/tsg_vulpes.so)
    plugin_call_streamentry (/opt/tsg/sapp/sapp)
    call_streamentry (/opt/tsg/sapp/sapp)
    stream_process (/opt/tsg/sapp/sapp)
    stream_process_udp (/opt/tsg/sapp/sapp)
    udp_free_stream (/opt/tsg/sapp/sapp)
    streamaddlist (/opt/tsg/sapp/sapp)
    [0x43948] (/opt/tsg/sapp/sapp)
    dealipv4udppkt (/opt/tsg/sapp/sapp)
    ipv4_entry (/opt/tsg/sapp/sapp)
    eth_entry (/opt/tsg/sapp/sapp)
    [0x2e9e1] (/opt/tsg/sapp/sapp)
    [0x107c66] (/opt/tsg/sapp/sapp)
    [0x108051] (/opt/tsg/sapp/sapp)
    start_thread (/usr/lib64/libpthread-2.28.so)
    __GI___clone (inlined)
{code}
This lock is used by tsg_vulpes. After the feature was disabled through tsg-os-cli, packets are no longer dropped and the system runs normally.
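The failure mode described above (a packet thread blocked on a lock while the receive queue overflows) can be illustrated with a small model. This is only a sketch under stated assumptions: `simulate_rx_queue` and all of its parameters are hypothetical names, and the rates, queue capacity, and stall window are made-up illustrative numbers, not measurements from sapp or the trial site.

```python
def simulate_rx_queue(arrivals_per_tick, service_per_tick, capacity,
                      stalled_ticks, total_ticks):
    """Discrete-time model of a bounded receive queue drained by one worker.

    Hypothetical illustration only: every number passed in is an assumption,
    not a value taken from sapp or from the traffic on site.
    Returns the number of packets dropped because the queue was full.
    """
    queued = 0
    dropped = 0
    for tick in range(total_ticks):
        # New packets arrive; whatever does not fit in the queue is dropped.
        free = capacity - queued
        accepted = min(arrivals_per_tick, free)
        dropped += arrivals_per_tick - accepted
        queued += accepted
        # The worker drains the queue only while it is not blocked on the lock.
        if tick not in stalled_ticks:
            queued -= min(queued, service_per_tick)
    return dropped


if __name__ == "__main__":
    # The worker can process twice the arrival rate, so CPU is not the limit.
    no_stall = simulate_rx_queue(10, 20, 50, set(), 100)
    # Same worker, but blocked (e.g. waiting on a lock) for ticks 10..19.
    with_stall = simulate_rx_queue(10, 20, 50, set(range(10, 20)), 100)
    print("dropped without stall:", no_stall)
    print("dropped with stall:", with_stall)
```

In this model the worker has spare capacity at all times, yet a stall longer than the queue can absorb still produces drops, which is consistent with the observation that packets were lost while CPU usage stayed low.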
yangwei commented on 2023-10-19T18:35:57.760+0800:
Temporary workaround: disable the encrypted voice identification feature by applying the following setting in tsg-os-cli:
set template name tsg_traffic_engine_default encrypt_traffic_identify voice_bahavior_engine no
xiapeng commented on 2023-10-20T12:20:30.973+0800:
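After upgrading TSG-OS to tsg-os-v23.07.22-c92d517 and disabling the encrypted voice identification feature, no serious packet loss occurred while single-device traffic stayed below 75 Gbps. Above 75 Gbps packet loss begins, and it is highest when traffic peaks at 98 Gbps. While traffic exceeded 75 Gbps, memory usage stayed below 35% and CPU usage fluctuated frequently between 70% and 95%.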
!image-2023-10-20-12-52-09-665.png|width=394,height=141!
!image-2023-10-20-12-52-33-703.png|width=394,height=139!
!image-2023-10-20-12-53-12-337.png|width=157,height=178!
yangwei commented on 2023-10-20T12:36:33.918+0800:
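Please post the on-site monitoring graphs; the text description does not show the magnitude of the packet loss or the resource usage.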
yangwei commented on 2023-10-20T12:36:59.957+0800:
How is the on-site test environment wired up? Is there a topology diagram?
xiapeng commented on 2023-10-20T13:04:53.878+0800:
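Design topology of the test environment: [https://docs.geedge.net/pages/viewpage.action?pageId=94778025]
The actual environment differs from the design as follows:
1. The inline devices, the return-traffic switch, the RCP switch, and the links they sit on were removed.
2. The direct link between the ATCA general traffic access device and the optical protection device was removed and replaced by the path: optical protection device --> optical amplifier device --> ATCA general traffic access device.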
yangwei commented on 2023-10-20T13:09:08.065+0800:
Please upload the complete monitoring data for the device on NZ.
yangwei commented on 2023-10-20T18:31:45.663+0800:
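The cause of the >25 Gbps packet loss described in this issue has been identified and resolved, so I am closing it. Please open a new bug for any other problems.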
Attachments
Attachment: 1697705757545.png
Attachment: 1697705784889.png
Attachment: image-2023-10-20-12-52-09-665.png
Attachment: image-2023-10-20-12-52-33-703.png
Attachment: image-2023-10-20-12-53-12-337.png




