[WMS-UTR Project] tsg_os_container_restart alerts on tsgx devices at multiple sites

ID	Creation Date	Assignee	Status
OMPUB-1207	2024-03-29T20:47:45.000+0800	杨威	Closed

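According to the alerts from March 28 (local time), msh-tsgx01, pcap-tsgx01, and twa-tsgx01 through twa-tsgx05 each raised one or more tsg_os_container_restart alerts, although no hotfix or other operation involving a restart was performed on the 28th. The attachment contains the alert messages for the 28th (local time).

yangwei commented on 2024-03-31T15:20:32.134+0800: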

Symptoms

The tsg_os_container_restart alerts fall into two time windows:

1. 2024-03-28, 13:00 through the 16:00 hour: Firewall container restart

pcap-tsgx01 16:15:14
twa-tsgx05 15:31:14
msh-tsgx01 14:51:14
twa-tsgx02 13:52:44

2. 2024-03-28, 00:00-00:26

twa-tsgx01 through twa-tsgx05 saw a total of 18 restarts of the packet-IO engine and Firewall containers
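
The split into two windows can be reproduced with a short script when triaging a new alert export. The following is a minimal sketch, not the tooling used on site: the `WINDOWS` boundaries, `classify`, and `group_by_window` are hypothetical names, and the end of window 1 is assumed to cover the whole 16:00 hour because the 16:15:14 alert is listed under it. Only the four timestamps given above are included; the 18 window 2 timestamps are not listed in this issue.

```python
from datetime import datetime, time

# Restart alerts listed in this issue for 2024-03-28 (local time).
ALERTS = [
    ("pcap-tsgx01", "2024-03-28 16:15:14"),
    ("twa-tsgx05", "2024-03-28 15:31:14"),
    ("msh-tsgx01", "2024-03-28 14:51:14"),
    ("twa-tsgx02", "2024-03-28 13:52:44"),
]

# Assumed window boundaries (inclusive). Window 1 is taken to run to the
# end of the 16:00 hour, since the 16:15:14 alert is listed under it.
WINDOWS = {
    "window 1": (time(13, 0, 0), time(16, 59, 59)),
    "window 2": (time(0, 0, 0), time(0, 26, 59)),
}


def classify(timestamp):
    """Return the name of the window containing the timestamp, or None."""
    t = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").time()
    for name, (start, end) in WINDOWS.items():
        if start <= t <= end:
            return name
    return None


def group_by_window(alerts):
    """Group (host, timestamp) pairs by window, sorted by timestamp."""
    grouped = {name: [] for name in WINDOWS}
    for host, timestamp in alerts:
        name = classify(timestamp)
        if name is not None:
            grouped[name].append((timestamp, host))
    for entries in grouped.values():
        entries.sort()
    return grouped


if __name__ == "__main__":
    for name, entries in group_by_window(ALERTS).items():
        print(name, len(entries), entries)
```

Alerts that fall outside both windows are dropped by `group_by_window`; `classify` returns `None` for them, so they can be reported separately if needed.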

Analysis

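Checked the operation audit log (the sos_command/auditd_info file) in the sos report exported on site from twa-tsgx02 (generated with the command sos report --log-size=0). In window 2 (00:00-00:26) the log shows a NIC driver upgrade and container restarts issued through tsg-os-cli, so the 18 restarts in window 2 were most likely caused by the on-site upgrade.
!image-2024-03-31-15-31-25-944.png|width=983,height=252!
Window 1, twa-tsgx02 at 13:52:44: the restart has the same cause as OMPUB-1196 [WMS-UTR Project]: Firewall releases memory slowly, which triggers a watchdog timeout and restarts the SAPP application
Window 1, pcap-tsgx01, twa-tsgx05, and msh-tsgx01: further analysis is pending until the Firewall watchdog logs are exported from the site; the cause is presumed to be the same as the twa-tsgx02 restart at 13:52:44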

yangwei commented on 2024-05-06T14:38:35.587+0800:

Between April 30 and May 6 there were two container restarts, with the following causes:

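msh06, May 1: segmentation fault in the ssl decoder; the crash occurred while parsing chello->BtoL1BytesNum. The ssl decoder needs a hotfix
twa02, April 30: watchdog timeout; the local log files had already been reclaimed, so the cause is presumed to be the same as https://jira.geedge.net/browse/OMPUB-1196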

yangwei commented on 2024-05-17T16:00:33.358+0800:

As of May 16, the WMS site has seen no further segmentation faults triggered by the ssl decoder, so this issue is being closed for now

Restarts caused by watchdog timeout are tracked in https://jira.geedge.net/browse/OMPUB-1196

caoshanfeng commented on 2024-08-29T17:30:03.070+0800:

Updated; testing complete

Attachments

Attachment: alert-message-2024-03-29+06-39-46.xlsx

Attachment: image-2024-03-31-15-31-25-944.png
