[M22 Project] MDY-ORD-TSGX006 raised a "container proxy restarted" alert
| ID | Creation Date | Assignee | Status |
|---|---|---|---|
| OMPUB-1508 | 2024-10-21T12:21:14.000+0800 | 杨威 | Resolved |
At 2024-10-16T16:02:48+06:30, MDY-ORD-TSGX006 raised a "container proxy restarted" alert. The attachments are that day's asset details and the corresponding sosreport.
wangmenglan commented on 2024-10-21T18:40:56.092+0800:
TFE stack trace:
!image-2024-10-21-18-27-19-945.png!
Printed hash node information:
!image-2024-10-21-18-31-05-194.png!
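The num_buckets field of the four-tuple flow table is abnormal: its value is 0.
Further investigation showed that the control packets received by TFE at this site carry the following protocol: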
!image-2024-10-21-18-19-43-308.png!
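TFE cannot parse this protocol, so it can only obtain the outermost four-tuple.
TFE creates flow-table entries keyed by the four-tuple, so this produces a large number of colliding hash nodes.
The current suspicion is that the large number of hash collisions causes an error when a four-tuple flow-table entry is added.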
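A minimal sketch of why keying on the outer four-tuple alone leads to collisions, assuming the tunnel is L2TP over UDP port 1701 (the protocol is identified later in this thread). The struct, field names, addresses, and the FNV-1a hash are illustrative stand-ins, not TFE's or uthash's actual code. Because every session inside the tunnel yields a byte-identical key, all entries fall into one bucket regardless of the bucket count:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical outer four-tuple key; field names are illustrative only. */
struct four_tuple {
    uint32_t src_ip;
    uint32_t dst_ip;
    uint16_t src_port;
    uint16_t dst_port;
};

/* FNV-1a over the key bytes; a stand-in for the table's real hash function. */
uint32_t tuple_hash(const struct four_tuple *key)
{
    const unsigned char *p = (const unsigned char *)key;
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < sizeof(*key); i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Fill n keys as seen when only the outer header of one tunnel is parsed:
 * different inner sessions, identical outer four-tuple. */
void fill_tunnel_keys(struct four_tuple *keys, unsigned n)
{
    for (unsigned i = 0; i < n; i++) {
        keys[i].src_ip = 0x0A000001u;  /* 10.0.0.1, example address */
        keys[i].dst_ip = 0x0A000002u;  /* 10.0.0.2, example address */
        keys[i].src_port = 1701;       /* L2TP registered UDP port */
        keys[i].dst_port = 1701;
    }
}

/* Count how many of the n keys share a bucket with keys[0].
 * num_buckets must be a power of two. */
unsigned same_bucket_count(const struct four_tuple *keys, unsigned n,
                           uint32_t num_buckets)
{
    uint32_t first = tuple_hash(&keys[0]) & (num_buckets - 1u);
    unsigned count = 0;
    for (unsigned i = 0; i < n; i++) {
        if ((tuple_hash(&keys[i]) & (num_buckets - 1u)) == first)
            count++;
    }
    return count;
}
```

Calling `same_bucket_count` on keys produced by `fill_tunnel_keys` returns `n` for any power-of-two bucket count, so growing the table cannot spread these entries out.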
wangmenglan commented on 2024-10-21T18:47:10.892+0800:
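During the investigation we found that, for this tunnel protocol, the control packets sent by firewall do not have the tunnel flag set. Please have firewall check the corresponding logic. [~yangwei]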
gitlab commented on 2024-10-22T14:33:49.728+0800:
[李佳|https://git.mesalab.cn/lijia] mentioned this issue in [a commit|0cf18915e9] of [TSG-OS / sf_classifier|https://git.mesalab.cn/tango/sf_classifier] on branch [hotfix-OMPUB1508-baseon-24.02|https://git.mesalab.cn/tango/sf_classifier/-/tree/hotfix-OMPUB1508-baseon-24.02]:{quote}fix OMPUB-1508: not set policy_update ctrl packet 'is_tunnel' flag{quote}
gitlab commented on 2024-10-22T14:54:34.609+0800:
[李佳|https://git.mesalab.cn/lijia] mentioned this issue in [a commit|b7dbe15778] of [TSG-OS / sf_classifier|https://git.mesalab.cn/tango/sf_classifier] on branch [hotfix-OMPUB1508-baseon-24.02|https://git.mesalab.cn/tango/sf_classifier/-/tree/hotfix-OMPUB1508-baseon-24.02]:{quote}fix OMPUB-1508: not set policy_update ctrl packet 'is_tunnel' flag{quote}
gitlab commented on 2024-10-22T15:46:22.753+0800:
[李佳|https://git.mesalab.cn/lijia] mentioned this issue in [a commit|908ac22069] of [TSG-OS / sf_classifier|https://git.mesalab.cn/tango/sf_classifier] on branch [dev-24.10-tunnel-flag|https://git.mesalab.cn/tango/sf_classifier/-/tree/dev-24.10-tunnel-flag]:{quote}fix OMPUB-1508: not set policy_update ctrl packet 'is_tunnel' flag{quote}
gitlab commented on 2024-10-22T16:06:22.751+0800:
[李佳|https://git.mesalab.cn/lijia] mentioned this issue in [a merge request|https://git.mesalab.cn/tango/sf_classifier/-/merge_requests/43] of [TSG-OS / sf_classifier|https://git.mesalab.cn/tango/sf_classifier] on branch [dev-24.10-tunnel-flag|https://git.mesalab.cn/tango/sf_classifier/-/tree/dev-24.10-tunnel-flag]:{quote}fix OMPUB-1508: not set policy_update ctrl packet 'is_tunnel' flag{quote}
lijia commented on 2024-10-23T09:34:21.923+0800:
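When sf_classifier sent policy_update control packets, it set the is_tunnel flag incorrectly for tunnel traffic. This has been fixed: https://jira.geedge.net/browse/TSG-22776
This SFC bug has existed for several months, so for the root cause of issue 1508 I suggest [~wangmenglan] investigate further.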
yangwei commented on 2024-10-23T17:37:04.381+0800:
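sf_classifier v1.1.6 has been hotfixed on MDY-ORD-TSGX006. We captured L2TP-carried traffic sent by firewall, and the is_tunnel bit is now set in the control packets (see the screenshot: the last byte of the control packet payload is 0x02).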
[^m22_l2tp_ssl_ctrl.pcapng]
^!image-2024-10-23-17-35-52-582.png|width=628,height=353!^
gitlab commented on 2024-10-24T11:46:00.170+0800:
[王孟岚|https://git.mesalab.cn/wangmenglan] mentioned this issue in [a commit|0dfee4f6e4] of [TSG-OS / tfe|https://git.mesalab.cn/tango/tfe] on branch [develop-l2tp|https://git.mesalab.cn/tango/tfe/-/tree/develop-l2tp]:{quote}OMPUB-1508 support l2tpv2 protocol{quote}
wangmenglan commented on 2024-10-25T11:25:39.395+0800:
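TFE has been hotfixed on MDY-ORD-TSGX006 and now supports parsing the l2tpv2 protocol, which avoids the large number of hash collisions seen on site. We will keep observing for a while.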
wangmenglan commented on 2024-10-30T11:38:26.263+0800:
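Root cause:
A uthash dynamic expansion crashed the program. The coredump shows that uthash had expanded until log2_num_buckets reached 33, so num_buckets should have been 2^33. However, num_buckets is declared as unsigned int, so it overflowed and ended up as 0.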
Coredump information:
{code:java}
(gdb) p *table->root_by_addr->hh2.tbl.tail.tbl
$4 = {buckets = 0x7fd2c401ca10, num_buckets = 0, log2_num_buckets = 33, num_items = 84, tail = 0x7fd2c40139f8, hho = 120, ideal_chain_maxlen = 42, nonideal_items = 0, ineff_expands = 0, noexpand = 0, signature = 2685476833}
{code}
Here log2_num_buckets tracks uthash expansions: it is incremented by one each time the table expands.
Once the overflow has occurred, any further add to the hash table is very likely to crash.
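The overflow arithmetic can be checked in isolation. The sketch below is not uthash code: the function names are hypothetical, `uint32_t` stands in for a 32-bit unsigned int, and the starting value of 32 buckets is an assumption used for illustration. It shows that repeated doubling wraps to 0, and that the usual power-of-two mask `hashv & (num_buckets - 1)` then degenerates to the full hash value, which is far outside a zero-sized bucket array:

```c
#include <stdint.h>

/* Double a 32-bit bucket count `times` times, mirroring the effect of
 * num_buckets *= 2U on each expansion. Unsigned arithmetic wraps
 * modulo 2^32, so the count eventually becomes 0. */
uint32_t double_buckets(uint32_t num_buckets, unsigned times)
{
    for (unsigned i = 0; i < times; i++)
        num_buckets *= 2u;
    return num_buckets;
}

/* Power-of-two bucket selection. With num_buckets == 0 the mask is
 * 0xFFFFFFFF, so the "index" is the unmodified hash value. */
uint32_t bucket_index(uint32_t hashv, uint32_t num_buckets)
{
    return hashv & (num_buckets - 1u);
}
```

Starting from 32 (2^5), 26 doublings give 2^31, the 27th wraps to 0, and every later doubling stays at 0, which matches the num_buckets = 0 seen in the coredump.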
{code:java}
/* First, during expansion, the number of new buckets created is 0 */
#define HASH_EXPAND_BUCKETS(hh, tbl, oomed)
do
{
    unsigned _he_bkt;
    unsigned _he_bkt_i;
    struct UT_hash_handle *_he_thh, *_he_hh_nxt;
    UT_hash_bucket *_he_new_buckets, *_he_newbkt;
    _he_new_buckets = (UT_hash_bucket *)uthash_malloc(
        sizeof(struct UT_hash_bucket) * (tbl)->num_buckets * 2U);
    ……
} while (0)
/* Because (tbl)->num_buckets is 0, zero buckets are allocated for _he_new_buckets */

/* Second, when the bucket position is computed, num_bkts is 0, so the access goes out of bounds */
#define HASH_TO_BKT(hashv, num_bkts, bkt)
do
{
    bkt = ((hashv) & ((num_bkts)-1U));
} while (0)
{code}
Analysis of what causes this:
In uthash's dynamic expansion logic, if heavy hash collisions persist even after an expansion, uthash stops expanding.
{code:java}
#define HASH_EXPAND_BUCKETS(hh, tbl, oomed)
……
uthash_free((tbl)->buckets, (tbl)->num_buckets * sizeof(struct UT_hash_bucket));
(tbl)->num_buckets *= 2U;
(tbl)->log2_num_buckets++;
(tbl)->buckets = _he_new_buckets;
(tbl)->ineff_expands = ((tbl)->nonideal_items > ((tbl)->num_items >> 1)) ? ((tbl)->ineff_expands + 1U) : 0U;
if ((tbl)->ineff_expands > 1U)
{
(tbl)->noexpand = 1;
uthash_noexpand_fyi(tbl);
}
/* Once (tbl)->noexpand is set to 1, no further expansion is performed */
{code}
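However, this logic cannot handle the following scenario:
When uthash already holds a sufficient amount of well-dispersed data and a large amount of colliding data is then inserted suddenly, uthash keeps expanding over and over, and the logic that suppresses expansion is never triggered.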
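The suppression rule and the scenario that defeats it can be modelled in a few lines. This is a simplified sketch, not uthash's implementation: `struct expand_state` and both functions are hypothetical, and it assumes that only the colliding entries are counted as non-ideal after each expansion.

```c
#include <stdint.h>

/* Minimal model of the fields the suppression rule reads. */
struct expand_state {
    uint64_t num_items;       /* entries in the table */
    uint64_t nonideal_items;  /* entries beyond the ideal chain length */
    unsigned ineff_expands;   /* consecutive ineffective expansions */
    unsigned noexpand;        /* 1 once expansion is suppressed */
};

/* Bookkeeping applied at the end of one expansion: an expansion counts as
 * ineffective when more than half of the entries are still non-ideal, and
 * two ineffective expansions in a row disable further expansion. */
void after_expand(struct expand_state *st)
{
    st->ineff_expands = (st->nonideal_items > (st->num_items >> 1))
                            ? st->ineff_expands + 1u
                            : 0u;
    if (st->ineff_expands > 1u)
        st->noexpand = 1;
}

/* Run `expansions` expansions over a table holding `dispersed` well-spread
 * entries plus `colliding` entries with one shared key (assumption: only
 * the colliding entries are non-ideal). Returns the noexpand flag. */
unsigned suppressed_after(uint64_t dispersed, uint64_t colliding,
                          unsigned expansions)
{
    struct expand_state st = { dispersed + colliding, colliding, 0, 0 };
    for (unsigned i = 0; i < expansions; i++)
        after_expand(&st);
    return st.noexpand;
}
```

Under this model a table that contains only colliding entries is suppressed after two expansions, while 100001 colliding entries on top of 150000000 dispersed ones never exceed the one-half threshold, so expansion is never suppressed.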
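Local reproduction:
{code:java}
#include <stdio.h>
#include <stdlib.h>
#include "uthash.h"

/* The struct definition is not shown in the original comment; this layout is
 * assumed: one node linked into two uthash tables through two handles. */
struct session_node {
    unsigned long id;
    unsigned long session_id;
    UT_hash_handle hh1; /* index by id */
    UT_hash_handle hh2; /* index by session_id */
};

int main(void)
{
    struct session_node *root_by_id = NULL;
    struct session_node *root_by_session_id = NULL;
    unsigned long cnt = 0;
    unsigned long add = 150000000;
    struct session_node *temp = NULL;

    while (1) {
        cnt++;
        temp = (struct session_node *)calloc(1, sizeof(struct session_node));
        if (temp == NULL) {
            break; /* out of memory */
        }
        temp->id = cnt;
        /* Unique session_id for the first `add` nodes, then a fixed value */
        if (cnt < add) {
            temp->session_id = cnt;
        } else {
            temp->session_id = add;
        }
        HASH_ADD(hh1, root_by_id, id, sizeof(temp->id), temp);
        HASH_ADD(hh2, root_by_session_id, session_id, sizeof(temp->session_id), temp);
        if (cnt % 100000 == 0) {
            printf("add hash cnt:%lu\n", cnt);
        }
    }
    return 0;
}
{code}
The first 150000000 entries inserted into uthash have strictly increasing session_id values; every entry inserted after that uses the fixed session_id 150000000.
The test program crashes:
{code:java}
Program received signal SIGSEGV, Segmentation fault.
0x0000000000401713 in main () at test.c:39
39 HASH_ADD(hh2, root_by_session_id, session_id, sizeof(temp->session_id), temp);
(gdb) bt
#0 0x0000000000401713 in main () at test.c:39
(gdb) p *root_by_session_id->hh2.tbl
$7 = {buckets = 0x7422c0, num_buckets = 0, log2_num_buckets = 33, num_items = 150100001, tail = 0x508b1d438, hho = 72, ideal_chain_maxlen = 75050001, nonideal_items = 0, ineff_expands = 0, noexpand = 0, signature = 2685476833}
{code}
The crash occurs when entry 150100001 is inserted, by which point log2_num_buckets has reached 33.
Note: if a large amount of colliding data is inserted directly, without the dispersed entries first, far more than 150100001 entries can be inserted.
[uthash User Guide (troydhanson.github.io)|https://troydhanson.github.io/uthash/userguide.html#expansion]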
wangmenglan commented on 2024-10-30T20:13:16.966+0800:
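Problem description: tfe uses uthash to implement a dual-index flow table that can be queried by session id or by four-tuple. Every flow is added to both indexes. When tfe handles tunnel protocol traffic (such as l2tp), it cannot parse the inner packet, so a large number of flows with different session ids but the same outer four-tuple are added to the flow table. In the four-tuple index these flows hash to the same bucket, which drives the index to keep expanding and eventually causes the system failure.
Detailed flow:
1. tfe obtains the session id and the original packet from the control packet and first looks up the flow table through the session id index. If no entry is found, it creates a new flow-table entry and adds it to both the session id index and the four-tuple index.
2. For L2TP packets, only the outer packet is parsed and the innermost four-tuple is never reached, so the four-tuple keys of the newly created entries all map to the same bucket.
3. As the number of L2TP tunnel flows grows, the collision rate of the four-tuple index keeps rising, and the index keeps expanding to reduce it. When the index capacity reaches 2^32, uthash hits its capacity limit, the variable that holds the capacity overflows, and a segmentation fault follows.
Solution: for tunnel traffic, unidirectional traffic, and traffic that hits a no intercept policy, only the session id index is built and the four-tuple index is no longer built. In these scenarios the flow table is used only to record metric information, so removing the four-tuple index does not affect lookups or business logic.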
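The index selection described in the solution can be sketched as a small decision function. All names here (`struct flow_attrs`, `enum index_mask`, `indexes_for_flow`) are hypothetical and chosen for illustration; the actual tfe change may be structured differently.

```c
#include <stdbool.h>

/* Hypothetical per-flow attributes; names are illustrative only. */
struct flow_attrs {
    bool is_tunnel;       /* tunnel traffic, e.g. l2tp */
    bool is_asymmetric;   /* only one direction of the flow is seen */
    bool no_intercept;    /* flow hit a no intercept policy */
};

/* Indexes a flow-table entry can be linked into. */
enum index_mask {
    INDEX_SESSION_ID = 1 << 0,
    INDEX_FOUR_TUPLE = 1 << 1
};

/* Decide which indexes a new flow entry should be added to. */
int indexes_for_flow(const struct flow_attrs *attrs)
{
    int mask = INDEX_SESSION_ID;  /* every flow keeps the session id index */
    if (!attrs->is_tunnel && !attrs->is_asymmetric && !attrs->no_intercept)
        mask |= INDEX_FOUR_TUPLE; /* only flows that need four-tuple lookups */
    return mask;
}
```

With this rule, flows that share one outer four-tuple never enter the four-tuple index, so they cannot trigger the repeated expansions described above.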
gitlab commented on 2024-11-04T15:26:38.117+0800:
[王孟岚|https://git.mesalab.cn/wangmenglan] mentioned this issue in [a commit|2675a79578] of [TSG-OS / tfe|https://git.mesalab.cn/tango/tfe] on branch [feature-session-table|https://git.mesalab.cn/tango/tfe/-/tree/feature-session-table]:{quote}OMPUB-1508 For tunnel traffic, asymmetric traffic, and traffic matching the...{quote}
gitlab commented on 2024-11-05T09:55:54.823+0800:
[王孟岚|https://git.mesalab.cn/wangmenglan] mentioned this issue in [a commit|7e131d851c] of [TSG-OS / tfe|https://git.mesalab.cn/tango/tfe] on branch [feature-session-table|https://git.mesalab.cn/tango/tfe/-/tree/feature-session-table]:{quote}OMPUB-1508 For tunnel traffic, asymmetric traffic, and traffic matching the...{quote}
gitlab commented on 2024-11-05T10:21:44.345+0800:
[王孟岚|https://git.mesalab.cn/wangmenglan] mentioned this issue in [a commit|02fae975ed] of [TSG-OS / tfe|https://git.mesalab.cn/tango/tfe] on branch [feature-24.02-session-table|https://git.mesalab.cn/tango/tfe/-/tree/feature-24.02-session-table]:{quote}OMPUB-1508 For tunnel traffic, asymmetric traffic, and traffic matching the...{quote}
gitlab commented on 2024-11-05T10:25:43.650+0800:
[王孟岚|https://git.mesalab.cn/wangmenglan] mentioned this issue in [a merge request|https://git.mesalab.cn/tango/tfe/-/merge_requests/680] of [TSG-OS / tfe|https://git.mesalab.cn/tango/tfe] on branch [feature-24.02-session-table|https://git.mesalab.cn/tango/tfe/-/tree/feature-24.02-session-table]:{quote}OMPUB-1508 For tunnel traffic, asymmetric traffic, and traffic matching the...{quote}
gitlab commented on 2024-11-05T11:34:28.688+0800:
[王孟岚|https://git.mesalab.cn/wangmenglan] mentioned this issue in [a merge request|https://git.mesalab.cn/tango/tfe/-/merge_requests/681] of [TSG-OS / tfe|https://git.mesalab.cn/tango/tfe] on branch [feature-session-table|https://git.mesalab.cn/tango/tfe/-/tree/feature-session-table]:{quote}OMPUB-1508 For tunnel traffic, asymmetric traffic, and traffic matching the...{quote}
gitlab commented on 2024-11-05T11:34:34.308+0800:
[王孟岚|https://git.mesalab.cn/wangmenglan] mentioned this issue in [a commit|be0bdc08e3] of [TSG-OS / tfe|https://git.mesalab.cn/tango/tfe] on branch [feature-session-table|https://git.mesalab.cn/tango/tfe/-/tree/feature-session-table]:{quote}OMPUB-1508 For tunnel traffic, asymmetric traffic, and traffic matching the...{quote}
gitlab commented on 2024-11-05T14:40:34.679+0800:
[王孟岚|https://git.mesalab.cn/wangmenglan] mentioned this issue in [a commit|4e392b2c10] of [TSG / tsg-os-buildimage|https://git.mesalab.cn/tsg/tsg-os-buildimage] on branch [update-tfe-24.02|https://git.mesalab.cn/tsg/tsg-os-buildimage/-/tree/update-tfe-24.02]:{quote}fix(OMPUB-1508):update tfe to v4.8.84{quote}
gitlab commented on 2024-11-05T14:41:23.721+0800:
[王孟岚|https://git.mesalab.cn/wangmenglan] mentioned this issue in [a merge request|https://git.mesalab.cn/tsg/tsg-os-buildimage/-/merge_requests/2822] of [TSG / tsg-os-buildimage|https://git.mesalab.cn/tsg/tsg-os-buildimage] on branch [update-tfe-24.02|https://git.mesalab.cn/tsg/tsg-os-buildimage/-/tree/update-tfe-24.02]:{quote}fix(OMPUB-1508):update tfe to v4.8.84{quote}
Attachments
63551/image-2024-10-21-18-19-43-308.png
63552/image-2024-10-21-18-27-19-945.png
63553/image-2024-10-21-18-31-05-194.png
63644/image-2024-10-23-17-35-52-582.png
63643/m22_l2tp_ssl_ctrl.pcapng
63544/MDY-ORD-TSGX006.html
63554/proxy_ctrl_pkt.pcapng
63543/sosreport-MDY-ORD-TSGX006-20241021093426.tar.xz