2025-09-14 21:52:36 +00:00
# [E21 Site - OLAP] Recent SSM-IGW OLAP Flink TaskManager Down Alerts

| ID | Creation Date | Assignee | Status |
|----|---------------|----------|--------|
| OMPUB-808 | 2023-02-15T19:22:47.000+0800 | 戚岱杰 | Closed |

---
No description

---
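**qidaijie** commented on *2023-02-17T10:44:05.212+0800*:

Root cause: the Flink TaskManager process hung (still running but unresponsive), so log data could not be processed and the alert was raised.

On-site situation:

1. All IGW sites currently show a large volume of unstructured file writes; MWV-IGW and SSM-IGW have the largest volumes.
2. The MWV-IGW and SSM-IGW sites write 3 TB+ of eml files per day, averaging 300 requests per second with a peak of about 1000. See attachments: MWV和SSM-IGW磁盘存储情况.png, SSM-IGW-HOS请求数量.png
   1. At the current write rate, the remaining storage can sustain continuous writes for at most about 4 days.
3. Flink is affected by resource contention and by the processing latency of aggregating data to the national-center Kafka. Data accumulates in memory and cannot be processed in time, which causes the TaskManager process to restart.
   1. HOS uses 50% of total resources, so the TaskManager cannot process data in time and data backs up. Reference file: MWV-SSM资源使用.txt
   2. With SSL encryption, aggregating to the national-center Kafka shows roughly 1.3 times more processing latency than with SASL user authentication. Reference file: Kafka不同认证生产者延迟情况.txt
   3. Monitoring shows that the SSM-IGW TaskManager recovers automatically most of the time, but too many restarts eventually left the process hung. See attachment: SSM-IGW-Taskmanager重启情况.png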
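
The storage and request figures in the list above imply a couple of derived numbers. The sketch below works them out under assumptions that the ticket does not confirm: 3 TB is treated as the daily write volume (the ticket says "3 TB+", so this is a lower bound), units are decimal (1 TB = 10^12 bytes), the 300 requests per second average is assumed to hold over a full day, and each request is assumed to write exactly one file. All function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60
TB = 10 ** 12  # assumption: decimal terabytes; the ticket does not say TB vs TiB


def implied_remaining_capacity_tb(daily_write_tb: float, days_left: float) -> float:
    """Remaining capacity implied by a daily write volume and a days-until-full estimate."""
    return daily_write_tb * days_left


def average_object_size_bytes(daily_write_tb: float, avg_requests_per_sec: float) -> float:
    """Mean bytes written per request, assuming one file per request (not confirmed)."""
    requests_per_day = avg_requests_per_sec * SECONDS_PER_DAY
    return daily_write_tb * TB / requests_per_day


def days_until_full(remaining_tb: float, daily_write_tb: float) -> float:
    """Days of continuous writing the remaining capacity can absorb."""
    return remaining_tb / daily_write_tb
```

With the ticket's figures, `implied_remaining_capacity_tb(3.0, 4.0)` gives 12 TB of implied headroom, and `average_object_size_bytes(3.0, 300)` gives about 115.7 KB per eml file. Both are rough estimates that inherit the assumptions listed above.
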

Temporary mitigation: lower the HOS request rate limit from 2000 to 100 and continue monitoring.
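
The ticket does not say how HOS enforces its request limit or what unit the values 2000 and 100 use. As a generic illustration only, the sketch below shows a token-bucket limiter, a common way to cap request rates, assuming the limit is expressed in requests per second. The `TokenBucket` class and its methods are hypothetical and are not HOS APIs.

```python
import time


class TokenBucket:
    """Hypothetical limiter: admits at most `rate_per_sec` requests per second on average."""

    def __init__(self, rate_per_sec, clock=time.monotonic):
        if rate_per_sec <= 0:
            raise ValueError("rate_per_sec must be positive")
        self._clock = clock            # injectable so the behaviour can be tested
        self.rate = float(rate_per_sec)
        self.capacity = self.rate      # allow a burst of at most one second of traffic
        self._tokens = self.capacity
        self._last = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last call, capped at capacity.
        now = self._clock()
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now

    def set_rate(self, rate_per_sec):
        """Change the limit at runtime, e.g. from 2000 down to 100."""
        if rate_per_sec <= 0:
            raise ValueError("rate_per_sec must be positive")
        self._refill()
        self.rate = float(rate_per_sec)
        self.capacity = self.rate
        self._tokens = min(self._tokens, self.capacity)

    def try_acquire(self):
        """Return True if one request may proceed now, False if it should be rejected."""
        self._refill()
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```
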

---
**qidaijie** commented on *2023-02-17T10:45:04.179+0800*:

The HOS rate limit has been changed at four sites: MWV-IGW, SSM-IGW, BJR-IGW, and BOLE-IGW.

---
**qidaijie** commented on *2023-02-21T13:53:22.608+0800*:

The HOS rate limit has been changed at the DIR-IGW site.

---
**qidaijie** commented on *2023-02-27T15:21:45.085+0800*:

After the HOS rate limit was changed at the IGW sites listed above, a period of continued monitoring showed no recurrence of the issue.

See attachment: 230222-27OLAP告警情况.png

---
# Attachments
Attachment: 230222-27OLAP告警情况.png

Attachment: Kafka不同认证生产者延迟情况.txt

[Kafka不同认证生产者延迟情况.txt](https://gfwleak.exec.li/admin/geedge-jira/raw/branch/master/attachment/35233/Kafka不同认证生产者延迟情况.txt)

Attachment: MWV-SSM资源使用.txt

[MWV-SSM资源使用.txt](https://gfwleak.exec.li/admin/geedge-jira/raw/branch/master/attachment/35231/MWV-SSM资源使用.txt)

Attachment: MWV和SSM-IGW磁盘存储情况.png

Attachment: SSM-IGW-HOS请求数量.png

Attachment: SSM-IGW-Taskmanager重启情况.png

Attachment: 微信图片_20230215141453.png

Attachment: 微信图片_20230215141917.png

Attachment: 微信图片_20230215141928.png