# [E21-OLAP] After sub-centers were upgraded to 22.07, multiple upgraded sites raised OLAP HBase Down alarms after running for some time

| ID | Creation Date | Assignee | Status |
|----|----------------|----------|--------|
| OMPUB-605 | 2022-09-01T21:22:40.000+0800 | 戚岱杰 | Closed |

---
During the 22.02-to-22.07 OLAP sub-center upgrade, we found that after a sub-center completed the 22.07 upgrade and passed upgrade verification, an OLAP HBase Down alarm appeared once the program had been running for some time. Investigation points to a problem in HBase, possibly caused by data corruption.

The following 8 sites have been upgraded so far:

2022-08-31: BOLE-IGW

2022-09-01 (morning): SSM-IGW, MWV-IGW, DIR-IGW

2022-09-01 (afternoon): BOL-PE, LGH-PE, OAP-PE

After the upgrade to 22.07, the OLAP HBase Down alarm has appeared at the following sites:

BOL-IGW, MWV-IGW, BJR-IGW, DIR-IGW

---

**qidaijie** commented on *2022-09-05T15:58:14.725+0800*:

Investigation findings:

1. The problem occurs only at IGW sites.
2. After the functional-side update on August 29, HOS request volume surged and the resources used by the service spiked; server resources are essentially fully loaded.

!HOS成功请求.jpg|thumbnail! !HOS失败请求.png|thumbnail!

To address this, the following changes were made for optimization and comparison testing:

1. BJR-IGW: added HBase rate-limiting configuration.
2. BOL, MWV, DIR, SSM: added HOS rate-limiting configuration.

---
**qidaijie** commented on *2022-09-06T16:50:59.318+0800*:

After confirmation, the HBase configuration changes made at BJR-IGW had no noticeable effect, so the modified parameters will be reverted.

GC logging configuration has been added to further pin down the problem: [^HBase故障排查-20220906.txt]

---
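**qidaijie** commented on *2022-10-06T23:51:31.991+0800*:

After enabling HBase INFO-level logging, the logs confirm that the regionserver process died because of an excessively long GC pause. The pause lasted 203s, exceeding the 180s timeout on the regionserver's ZooKeeper session, so the process was killed.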
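
The failure condition described in this comment reduces to a simple comparison: a GC pause longer than the ZooKeeper timeout lets the session expire, after which the regionserver is killed. A minimal Python sketch of that check, using the 203s and 180s figures from this comment. The function name and the extra sample pauses are hypothetical and only for illustration:

```python
from typing import List

# Timeout reported in this issue, in seconds. In HBase this is normally set
# through the `zookeeper.session.timeout` property (milliseconds) in
# hbase-site.xml; the value actually in effect at these sites is assumed.
ZK_SESSION_TIMEOUT_S = 180.0


def pauses_exceeding_timeout(pauses_s: List[float],
                             timeout_s: float = ZK_SESSION_TIMEOUT_S) -> List[float]:
    """Return the GC pauses long enough to let the ZooKeeper session expire."""
    return [p for p in pauses_s if p > timeout_s]


# 203 s is the pause reported in this comment; the other values are made up.
observed = [1.2, 35.0, 203.0]
print(pauses_exceeding_timeout(observed))  # [203.0]
```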

!hbase日志.png|thumbnail!

Resolution: perform GC tuning and increase the HBase regionserver memory.

---
**qidaijie** commented on *2022-10-19T16:14:27.203+0800*:

After discussion, the HBase memory at the MWV-IGW site will be raised to 40GB, and the site will then be monitored.
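
The attached procedure document is not reproduced here. As a rough illustration only: a regionserver heap is commonly set with the JVM `-Xms`/`-Xmx` flags in `HBASE_REGIONSERVER_OPTS` in `hbase-env.sh`. Whether this deployment uses that mechanism, and whether "HBase memory" here means the regionserver heap, are assumptions. The sketch below renders such a line for the 40GB value, and the helper name is hypothetical:

```python
def regionserver_heap_opts(heap_gb: int) -> str:
    """Render an hbase-env.sh line that pins the regionserver heap to heap_gb GB."""
    if heap_gb <= 0:
        raise ValueError("heap size must be positive")
    # Setting -Xms equal to -Xmx avoids heap resizing at runtime. This is a
    # common choice, not something stated in this issue.
    flags = f"-Xms{heap_gb}g -Xmx{heap_gb}g"
    return f'export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS {flags}"'


print(regionserver_heap_opts(40))
```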

Procedure document: [^HBase内存修改-20221017.txt]

---
**qidaijie** commented on *2022-11-02T17:02:58.483+0800*:

After a period of observation, MWV-IGW has had no further regionserver process crashes since its HBase memory was changed, while the sites that were not changed still see regionserver processes crash.

!screenshot-1.png|thumbnail!

The memory configuration needs to be adjusted for the remaining sites (Bole-IGW, Shashamane-IGW, Bahir Dar-IGW, Dire Dawa-IGW, and Bole-PE), following the same procedure document as before. [~liuju]

---
**liuju** commented on *2022-11-14T14:50:34.254+0800*:

The memory configuration changes have been completed. [~qidaijie]

---
**qidaijie** commented on *2022-11-18T09:50:16.107+0800*:

Current status:

1: After the memory change on 11-03, the sites were observed over 10 days of continuous operation, and no further Region crashes were found.

!增加内存后情况图20221031-1114.png|thumbnail!

2: The main sites (the IGW sites and BOLE-PE) have been updated. The remaining sites will all be updated when the field team upgrades to version 22.11.

---
## Attachments

**30932/HBase故障排查-20220902.txt**

---

**30933/HBase故障排查-20220905.txt**

---

**30950/HBase故障排查-20220906.txt**

---

**31856/HBase内存修改-20221017.txt**

---

**31507/hbase日志.png**

---

**30909/HOS成功请求.jpg**

---

**30910/HOS失败请求.png**

---

**32530/screenshot-1.png**

---

**31869/增加内存后情况图.png**

---

**32919/增加内存后情况图20221031-1114.png**

---