# [M22 Project] Websites cannot be opened after changing the rate of a Split By type Shaping Profile

| ID | Creation Date | Assignee | Status |
|----|----------------|----------|--------|
| OMPUB-1269 | 2024-05-03T18:08:04.000+0800 | 杨威 | In Progress |

---

Description: After the rate limit is changed, websites can no longer be opened. Once the Rule is disabled, websites open normally.

Steps to reproduce:

* Create a Shaping Rule as shown in the screenshot: !image-2024-05-03-16-35-44-425.png|thumbnail!
* Change the rate limit of the Shaping Profile as shown in the screenshot: !image-2024-05-03-16-36-34-036.png|thumbnail!
* Visit: [https://www.youtube.com|https://www.youtube.com/]

Current problem: YouTube cannot be opened.

The packet capture has been added to the attachments.

---

**liuxueli** commented on *2024-05-04T12:25:33.291+0800*:

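* Symptoms:
** After the shaping policy is hit, the client's in-progress file download is cut off (download rate drops to 0) and the client fails to open web pages (YouTube).
* Interfering factors
** Monitor, statistics, and shaping policies all existed during the test, but only one shaping policy was in effect.
** Devices were restarted because of the Hotfix; only one device was restarting at any given time.
* Reproduction
** Killing the consul and shaping processes did not reproduce the problem.
** Restarting a randomly chosen device that carried no traffic reproduced the problem. During the test only one shaping policy was in effect and only one profile existed.
*** Only one device was restarted, and it was restarted only once.
* Conclusion
** While the problem was being reproduced, the cluster sanity check command was run to list the abnormal keys. The output is as follows:
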
{code:java}
[root@tsg-traffic-engine-vsys-1-shaping-c7d76f9c9-sgkd4 shaping_engine]# exit
[root@MDY-ATOM-TSGX009 tsg-traffic-engine-vsys-1]# kubectl exec -it tsg-traffic-engine-vsys-1-shaping-c7d76f9c9-sgkd4 -- bash
Defaulted container "shaping" out of: shaping, telegraf-shaping, log-dir-hook, init-default-svc (init), init-announce-svc (init), init-cm-svc (init), init-packet-io-engine-ready (init), shaping-init (init)
[root@tsg-traffic-engine-vsys-1-shaping-c7d76f9c9-sgkd4 shaping_engine]# /opt/tsg/framework/bin/swarmkv-cli -n tsg-shaping-vsys1 -c 10.172.12.9:8500
tsg-shaping-vsys1> cluster sanity check
1) "tsg-shaping-4023-incoming"
tsg-shaping-vsys1> cluster sanity check
1) "tsg-shaping-4023-incoming"
tsg-shaping-vsys1> cluster sanity check
1) "tsg-shaping-4023-incoming"
2) "tsg-shaping-4023-incoming"
tsg-shaping-vsys1> cluster sanity check
(integer) 0
tsg-shaping-vsys1> cluster sanity check
(integer) 0
tsg-shaping-vsys1>
{code}
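
* Open question
** The working assumption was that only a restart of the swarmkv cluster leader or of the key owner would cause the anomaly, yet restarting a randomly chosen device with no traffic reproduced the problem.
* Next retest
** When reproducing the problem, confirm whether the restarted device correlates with the swarmkv cluster leader or the key owner device.
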
---

**yangwei** commented on *2024-05-04T13:49:56.142+0800*:

Retest at the M22 site on 2024-05-04: with no cluster node restarted, a symptom similar to https://jira.geedge.net/browse/TSG-17649 appeared intermittently, i.e. the client lost network connectivity after a period of time.

The Shaping Engine statistics output shows a P80 latency of 17.7 ms and an average latency of 6.2 ms for asynchronous calls.

!image-2024-05-04-13-47-43-156.png|width=720,height=210!

---

**yangwei** commented on *2024-05-04T17:49:59.229+0800*:

Further verification testing was carried out with the same policy applied:

* Split By token bucket, rate limit 7.5 Mbps
* Rate-limit condition: Client IP

**After testing continuously for some time, the client lost network connectivity**.

!image-2024-05-04-17-41-45-545.png|width=613,height=305!

The test **client traffic was spread across different nodes at different sites**. At the **YGN-MYTEL site**, the Shaping Engine's **request-response latency was checked and averaged 20 ms.**

!image-2024-05-04-17-45-02-867.png|width=834,height=240!

Checking the cluster state with CLUSTER SANITY check **showed abnormal keys**.

!image-2024-05-04-17-46-14-535.png|width=712,height=396!

**The SwarmKV CLI was used to simulate a new member (1233) consuming tokens**. Because the Split By unit is local host, the simulated member should in principle have a 7.5 Mbps Token_bucket entirely to itself. However, **repeatedly consuming a small number of tokens from the command line still failed, which is not the expected behavior.**

!image-2024-05-04-17-48-39-582.png!
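
The expectation stated above (a newly simulated member gets its own bucket, so small consumptions should succeed even when other members are saturated) can be sketched with a toy model. This is only an illustration under stated assumptions, not SwarmKV or Shaping Engine code: the class names `TokenBucket` and `SplitByShaper`, the member keys, and the burst capacity of one second's worth of tokens are all hypothetical.

```python
class TokenBucket:
    """Plain token bucket; rates and sizes are in bits (hypothetical model)."""

    def __init__(self, rate_bps, capacity_bits, now=0.0):
        self.rate_bps = rate_bps
        self.capacity_bits = capacity_bits
        self.tokens = capacity_bits  # assumption: a bucket starts full
        self.last_refill = now

    def consume(self, bits, now):
        # Refill according to elapsed time, capped at the bucket capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity_bits, self.tokens + elapsed * self.rate_bps)
        self.last_refill = now
        if bits > self.tokens:
            return False  # not enough tokens: the request is refused
        self.tokens -= bits
        return True


class SplitByShaper:
    """One independent bucket per split key (here: one per member)."""

    def __init__(self, rate_bps, capacity_bits):
        self.rate_bps = rate_bps
        self.capacity_bits = capacity_bits
        self.buckets = {}

    def consume(self, key, bits, now):
        # Lazily create a dedicated bucket the first time a key is seen.
        if key not in self.buckets:
            self.buckets[key] = TokenBucket(self.rate_bps, self.capacity_bits, now)
        return self.buckets[key].consume(bits, now)


# 7.5 Mbps per key; the capacity of 7_500_000 bits is an assumed burst size.
shaper = SplitByShaper(rate_bps=7_500_000, capacity_bits=7_500_000)
busy_ok = shaper.consume("member-1", 7_500_000, now=0.0)       # drains member-1
busy_again = shaper.consume("member-1", 1_000, now=0.0)        # member-1 is empty
new_member_ok = shaper.consume("member-1233", 1_000, now=0.0)  # fresh bucket
print(busy_ok, busy_again, new_member_ok)  # True False True
```

In this model a saturated member never affects a new one, which is why failed small consumptions by the simulated member point at the shared state (synchronization) rather than at the configured rate.
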
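
The client disconnection described above occurs intermittently and heals itself after some time.

Preliminary conclusion: the problem described in this issue **is not directly related to modifying the profile as stated in the title**. Instead, **after a SplitBy rate-limit Profile is applied**, **there is some probability that the client loses network connectivity**. A possible cause: once communication latency inside the cluster becomes large, the large volume of data synchronized for the SplitBy Token Bucket causes the Shaping Engine to fail to obtain tokens, which the client experiences as an apparent network outage.

The M site will try applying a Generic Token Bucket and test it to further confirm the conclusion above.
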
---

**hebingning** commented on *2024-05-05T12:38:54.157+0800*:

Multiple tests were run on May 4. Results:

Shaping Rule referencing a Shaping Profile of Type Generic/Fair Share:

No client disconnection occurred.

Shaping Rule referencing a Shaping Profile of Type Split By:

Client disconnection occurred. [~yangwei]

---

**zhengchao** commented on *2024-05-06T10:46:21.985+0800*:

Use the SWARMKV INFO command to check the synchronization bandwidth of the nodes.

---

**liuxueli** commented on *2024-05-06T11:51:31.355+0800*:

* Results of running the three commands CLUSTER SANITY check / CLUSTER NODES / CLUSTER INFO while the client was disconnected:
** [^10.161.12.26_2024-05-04_15_34_58.txt]

---

**zhengchao** commented on *2024-05-06T13:31:10.804+0800*:

Bulk Token Bucket synchronization messages are 4Mb/msg; a single worker thread may not be able to keep up with that much synchronization traffic.

{quote}17) 1) "10.168.12.6:30745"

sync_err: 882383

instantaneous_input_kbps: 234599.00

instantaneous_input_msgs: 69.00{quote}

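
The per-message size can be cross-checked from the two counters quoted above. This is a small sketch that assumes `instantaneous_input_kbps` is in kilobits per second and `instantaneous_input_msgs` is in messages per second; the source does not state the units, and the helper name `per_message_size_kb` is made up for illustration.

```python
def per_message_size_kb(input_kbps: float, input_msgs_per_s: float) -> float:
    """Average size of one sync message in kilobits (units are assumed)."""
    if input_msgs_per_s <= 0:
        raise ValueError("message rate must be positive")
    return input_kbps / input_msgs_per_s


# Values copied from the quoted node statistics.
size_kb = per_message_size_kb(234599.00, 69.00)
print(f"{size_kb:.0f} kb/msg, about {size_kb / 1000:.1f} Mb/msg")
# prints: 3400 kb/msg, about 3.4 Mb/msg
```

Under those unit assumptions the result is on the order of the 4Mb/msg figure mentioned in the comment.
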
---

## Attachments

**56643/10.161.12.26_2024-05-04_15_34_58.txt**

---

**56612/image-2024-05-03-16-35-44-425.png**

---

**56611/image-2024-05-03-16-36-34-036.png**

---

**56623/image-2024-05-04-13-47-43-156.png**

---

**56625/image-2024-05-04-17-41-45-545.png**

---

**56626/image-2024-05-04-17-45-02-867.png**

---

**56627/image-2024-05-04-17-46-14-535.png**

---

**56628/image-2024-05-04-17-48-39-582.png**

---

**56610/youtube.pcapng**

---