[M22 Project] MDY-ORD-TSGX006 raised a "container proxy restarted" alert

| ID | Creation Date | Assignee | Status |
|---|---|---|---|
| OMPUB-1508 | 2024-10-21T12:21:14.000+0800 | 杨威 | Resolved |

2024-10-16T16:02:48+06:30: MDY-ORD-TSGX006 raised a "container proxy restarted" alert. Attached are that day's asset details and the corresponding sosreport.


wangmenglan commented on 2024-10-21T18:40:56.092+0800:

TFE stack trace:

!image-2024-10-21-18-27-19-945.png!

Printing the hash node information:

!image-2024-10-21-18-31-05-194.png!

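The four-tuple flow table shows an abnormal num_buckets value of 0.

Further investigation showed that the control packets received by TFE at this site use the following protocol: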

!image-2024-10-21-18-19-43-308.png!

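TFE cannot parse this protocol, so it can only obtain the outermost four-tuple.

Because TFE builds its flow table from the four-tuple, this produces a large number of colliding hash nodes.

The current suspicion is that the large number of hash collisions caused the failure when adding entries to the four-tuple flow table.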


wangmenglan commented on 2024-10-21T18:47:10.892+0800:

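During the investigation we found that the tunnel flag is not set in the control packets that firewall sends for this tunnel protocol. Please have firewall check the corresponding logic. [~yangwei]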


gitlab commented on 2024-10-22T14:33:49.728+0800:

[李佳|https://git.mesalab.cn/lijia] mentioned this issue in [a commit|0cf18915e9] of [TSG-OS / sf_classifier|https://git.mesalab.cn/tango/sf_classifier] on branch [hotfix-OMPUB1508-baseon-24.02|https://git.mesalab.cn/tango/sf_classifier/-/tree/hotfix-OMPUB1508-baseon-24.02]:{quote}fix OMPUB-1508: not set policy_update ctrl packet 'is_tunnel' flag{quote}


gitlab commented on 2024-10-22T14:54:34.609+0800:

[李佳|https://git.mesalab.cn/lijia] mentioned this issue in [a commit|b7dbe15778] of [TSG-OS / sf_classifier|https://git.mesalab.cn/tango/sf_classifier] on branch [hotfix-OMPUB1508-baseon-24.02|https://git.mesalab.cn/tango/sf_classifier/-/tree/hotfix-OMPUB1508-baseon-24.02]:{quote}fix OMPUB-1508: not set policy_update ctrl packet 'is_tunnel' flag{quote}


gitlab commented on 2024-10-22T15:46:22.753+0800:

[李佳|https://git.mesalab.cn/lijia] mentioned this issue in [a commit|908ac22069] of [TSG-OS / sf_classifier|https://git.mesalab.cn/tango/sf_classifier] on branch [dev-24.10-tunnel-flag|https://git.mesalab.cn/tango/sf_classifier/-/tree/dev-24.10-tunnel-flag]:{quote}fix OMPUB-1508: not set policy_update ctrl packet 'is_tunnel' flag{quote}


gitlab commented on 2024-10-22T16:06:22.751+0800:

[李佳|https://git.mesalab.cn/lijia] mentioned this issue in [a merge request|https://git.mesalab.cn/tango/sf_classifier/-/merge_requests/43] of [TSG-OS / sf_classifier|https://git.mesalab.cn/tango/sf_classifier] on branch [dev-24.10-tunnel-flag|https://git.mesalab.cn/tango/sf_classifier/-/tree/dev-24.10-tunnel-flag]:{quote}fix OMPUB-1508: not set policy_update ctrl packet 'is_tunnel' flag{quote}


lijia commented on 2024-10-23T09:34:21.923+0800:

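sf_classifier set the is_tunnel flag incorrectly for tunnel traffic when sending policy_update control packets. This has been fixed: https://jira.geedge.net/browse/TSG-22776

This SFC bug has existed for several months, so for the root cause of issue 1508 I suggest that [~wangmenglan] investigate further.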


yangwei commented on 2024-10-23T17:37:04.381+0800:

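The sf_classifier v1.1.6 hotfix has been applied on MDY-ORD-TSGX006. In the captured control packets that firewall sends for L2TP-carried traffic, the is_tunnel bit is now set; see the screenshot, where the last byte of the control-packet payload is 0x02.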

[^m22_l2tp_ssl_ctrl.pcapng]

^!image-2024-10-23-17-35-52-582.png|width=628,height=353!^
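
The check described in this comment can be written as a small helper. This is only a sketch: the comment states that the last payload byte is 0x02 when is_tunnel is set, so the mask `CTRL_FLAG_IS_TUNNEL` and the function name `ctrl_payload_is_tunnel` are assumptions made for illustration, not the real sf_classifier or TFE definitions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed mask: the comment only says the last payload byte is 0x02. */
#define CTRL_FLAG_IS_TUNNEL 0x02u

/* Returns true when the last payload byte has the assumed is_tunnel bit set. */
static bool ctrl_payload_is_tunnel(const uint8_t *payload, size_t len)
{
    if (payload == NULL || len == 0) {
        return false; /* nothing to inspect */
    }
    return (payload[len - 1] & CTRL_FLAG_IS_TUNNEL) != 0;
}
```

Applied to the payload bytes exported from a capture such as m22_l2tp_ssl_ctrl.pcapng, a helper like this would tell whether a given control packet carries the flag.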


gitlab commented on 2024-10-24T11:46:00.170+0800:

[王孟岚|https://git.mesalab.cn/wangmenglan] mentioned this issue in [a commit|0dfee4f6e4] of [TSG-OS / tfe|https://git.mesalab.cn/tango/tfe] on branch [develop-l2tp|https://git.mesalab.cn/tango/tfe/-/tree/develop-l2tp]:{quote}OMPUB-1508 support l2tpv2 protocol{quote}


wangmenglan commented on 2024-10-25T11:25:39.395+0800:

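The TFE hotfix has been applied on MDY-ORD-TSGX006. TFE now supports parsing the l2tpv2 protocol, which avoids the large number of hash collisions seen on site. We will keep observing for a while.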


wangmenglan commented on 2024-10-30T11:38:26.263+0800:

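Root cause:

The crash was caused by uthash's dynamic expansion. The coredump shows that log2_num_buckets had reached 33, so num_buckets should have been 2^33. However, num_buckets is declared as unsigned int, so it overflowed and ended up as 0.

{code:java}
coredump output
(gdb) p *table->root_by_addr->hh2.tbl.tail.tbl
$4 = {buckets = 0x7fd2c401ca10, num_buckets = 0, log2_num_buckets = 33,
  num_items = 84, tail = 0x7fd2c40139f8, hho = 120, ideal_chain_maxlen = 42,
  nonideal_items = 0, ineff_expands = 0, noexpand = 0, signature = 2685476833}

log2_num_buckets is the base-2 logarithm of the bucket count; uthash increments it on every expansion.
{code}

After the overflow, any further add to the hash table is very likely to crash.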
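
The wrap-around can be reproduced in isolation. The sketch below is not uthash code: `double_buckets` and `to_bucket` are hypothetical helpers that mirror the `num_buckets *= 2U` update and the `hashv & (num_bkts - 1U)` bucket selection on a 32-bit counter, assuming uthash's initial size of 32 buckets.

```c
#include <stdint.h>

/* Double a 32-bit bucket counter `doublings` times, as `num_buckets *= 2U` does. */
static uint32_t double_buckets(uint32_t num_buckets, unsigned doublings)
{
    while (doublings-- > 0) {
        num_buckets *= 2U; /* unsigned arithmetic wraps modulo 2^32 */
    }
    return num_buckets;
}

/* Same expression as the bucket selection: hashv & (num_bkts - 1). */
static uint32_t to_bucket(uint32_t hashv, uint32_t num_bkts)
{
    return hashv & (num_bkts - 1U);
}
```

Starting from 32 buckets, 26 doublings give 2^31 and the 27th wraps the counter to 0. With a count of 0 the mask becomes 0xFFFFFFFF, so the bucket index is the raw hash value, far outside any allocated array.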

{code:java}
/* First, the expansion allocates zero new buckets. */
#define HASH_EXPAND_BUCKETS(hh, tbl, oomed) \
  do \
  { \
    unsigned _he_bkt; \
    unsigned _he_bkt_i; \
    struct UT_hash_handle *_he_thh, *_he_hh_nxt; \
    UT_hash_bucket *_he_new_buckets, *_he_newbkt; \
    _he_new_buckets = (UT_hash_bucket *)uthash_malloc( \
        sizeof(struct UT_hash_bucket) * (tbl)->num_buckets * 2U); \
    …… \
  } while (0)

/* Because (tbl)->num_buckets is 0, _he_new_buckets is allocated with zero elements. */

/* Second, when the bucket position is computed with num_bkts == 0, the access goes out of bounds. */
#define HASH_TO_BKT(hashv, num_bkts, bkt) \
  do \
  { \
    bkt = ((hashv) & ((num_bkts)-1U)); \
  } while (0)
{code}

Analysis of the cause:

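In uthash's dynamic-expansion logic, if a large number of hash collisions remains even after an expansion, uthash stops expanding:

{code:java}
#define HASH_EXPAND_BUCKETS(hh, tbl, oomed) \
  …… \
  uthash_free((tbl)->buckets, (tbl)->num_buckets * sizeof(struct UT_hash_bucket)); \
  (tbl)->num_buckets *= 2U; \
  (tbl)->log2_num_buckets++; \
  (tbl)->buckets = _he_new_buckets; \
  (tbl)->ineff_expands = ((tbl)->nonideal_items > ((tbl)->num_items >> 1)) ? ((tbl)->ineff_expands + 1U) : 0U; \
  if ((tbl)->ineff_expands > 1U) \
  { \
    (tbl)->noexpand = 1; \
    uthash_noexpand_fyi(tbl); \
  }
/* Once (tbl)->noexpand is set to 1, no further expansion is performed. */
{code}

However, this logic does not cover the following scenario:

When uthash already holds a sufficient amount of well-distributed data and a large number of colliding entries is then inserted suddenly, uthash expands again and again without ever triggering the expansion-suppression logic.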
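
Why the suppression never fires in that scenario can be seen by evaluating the quoted condition on its own. The following is a minimal model, not uthash itself: `struct expand_state` and `update_expand_state` are hypothetical names that copy only the `ineff_expands` and `noexpand` update from the macro, and any `nonideal_items` value fed into it is an illustrative input rather than a measurement.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the UT_hash_table fields used by the check. */
struct expand_state {
    unsigned num_items;      /* total items in the table */
    unsigned nonideal_items; /* items sitting beyond the ideal chain length */
    unsigned ineff_expands;  /* consecutive ineffective expansions */
    bool noexpand;           /* set once expansion is suppressed */
};

/* Mirrors the tail of HASH_EXPAND_BUCKETS: count ineffective expansions
 * and suppress further expansion after two in a row. */
static void update_expand_state(struct expand_state *s)
{
    s->ineff_expands = (s->nonideal_items > (s->num_items >> 1))
                           ? s->ineff_expands + 1U
                           : 0U;
    if (s->ineff_expands > 1U) {
        s->noexpand = true;
    }
}
```

With 150100001 items of which at most 100001 collide, `nonideal_items` can never exceed half of `num_items`, so `ineff_expands` is reset to 0 on every expansion and `noexpand` stays unset. With a table that holds only colliding entries, the condition is met on consecutive expansions and suppression engages after the second one.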

Local reproduction:

{code:java}
    unsigned long cnt = 0;
    unsigned long add = 150000000;
    struct session_node *temp = NULL;
    struct session_node *d_temp = NULL;

    while (1) {
        cnt++;
        temp = (struct session_node *)calloc(1, sizeof(struct session_node));
        temp->id = cnt;
        if (cnt < add) {
            temp->session_id = cnt;
        }
        else {
            temp->session_id = add;
        }
        HASH_ADD(hh1, root_by_id, id, sizeof(temp->id), temp);
        HASH_ADD(hh2, root_by_session_id, session_id, sizeof(temp->session_id), temp);
        if (cnt % 100000 == 0) {
            printf("add hash cnt:%lu\n", cnt);
        }
    }

/* The first 150000000 entries are inserted with a steadily increasing session_id;
 * every entry after that uses the fixed session_id 150000000. */
{code}

{code:java}
The test program crashes:
Program received signal SIGSEGV, Segmentation fault.
0x0000000000401713 in main () at test.c:39
39              HASH_ADD(hh2, root_by_session_id, session_id, sizeof(temp->session_id), temp);
(gdb) bt
#0  0x0000000000401713 in main () at test.c:39
(gdb) bt
#0  0x0000000000401713 in main () at test.c:39
(gdb) p *root_by_session_id->hh2.tbl
$7 = {buckets = 0x7422c0, num_buckets = 0, log2_num_buckets = 33, num_items = 150100001, tail = 0x508b1d438, hho = 72, ideal_chain_maxlen = 75050001,
  nonideal_items = 0, ineff_expands = 0, noexpand = 0, signature = 2685476833}

The crash occurs when entry 150100001 is inserted, by which point log2_num_buckets has reached 33.

Note: if the colliding data is inserted directly, without the well-distributed data first, far more than 150100001 entries can be inserted.
{code}

[uthash User Guide (troydhanson.github.io)|https://troydhanson.github.io/uthash/userguide.html#expansion]


wangmenglan commented on 2024-10-30T20:13:16.966+0800:

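Problem description: tfe uses uthash to implement a dual-index flow table that can be queried by either session id or four-tuple. Every flow is registered under both indexes. When handling tunnel-protocol traffic such as l2tp, tfe cannot parse the inner packet, so a large number of flows with different session ids but the same outer four-tuple are added to the flow table. In the four-tuple index these flows all hash to the same bucket, which triggers expansion again and again and eventually crashes the process.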

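Detailed flow:

1. tfe obtains the session id and the original packet from the control packet and first looks up the flow table through the session id index. If no entry is found, it creates a new flow entry and registers it in both the session id index and the four-tuple index.
2. For L2TP packets only the outer packet is parsed and the innermost four-tuple is never reached, so the four-tuple keys of the new flow entries all map to the same bucket.
3. As the number of L2TP tunnel flows grows, the collision rate of the four-tuple index keeps rising, and the table keeps expanding to lower it. When the index capacity reaches 2^32, uthash hits its capacity limit, the variable that holds the capacity overflows, and a segmentation fault follows.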
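
Step 2 can be illustrated with a small sketch. It assumes a hypothetical `struct tuple4` key layout and uses FNV-1a as a simple stand-in hash (uthash's own default hash function is different); the point is only that identical outer four-tuples produce identical hash values, so they share a bucket at every table size.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical outer four-tuple key; field names are illustrative. */
struct tuple4 {
    uint32_t src_ip;
    uint32_t dst_ip;
    uint16_t src_port;
    uint16_t dst_port;
};

/* FNV-1a, used here only as a simple stand-in hash function. */
static uint32_t fnv1a(const void *data, size_t len)
{
    const uint8_t *p = (const uint8_t *)data;
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Bucket index for a power-of-two bucket count. */
static uint32_t tuple4_bucket(const struct tuple4 *key, uint32_t num_buckets)
{
    return fnv1a(key, sizeof(*key)) & (num_buckets - 1U);
}
```

Two L2TP sessions carried between the same pair of endpoints (UDP port 1701 on both sides, for example) yield byte-identical keys, so doubling `num_buckets` never separates them. That is why expansion cannot lower the collision rate in this case.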

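Solution: For tunnel traffic, asymmetric (one-way) traffic, and traffic that matches a no intercept policy, only the session id index is created; the four-tuple index is no longer created. In these scenarios the flow table is used only to record metric information, so removing the four-tuple index does not affect lookups or business logic.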
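
The decision described in the solution can be sketched as a single predicate. The real tfe data structures are not shown in this issue, so `struct flow_attrs`, its fields, and `needs_tuple_index` are hypothetical names used only to make the rule concrete.

```c
#include <stdbool.h>

/* Hypothetical per-flow attributes; not the actual tfe structures. */
struct flow_attrs {
    bool is_tunnel;     /* flow is carried inside a tunnel such as L2TP */
    bool is_asymmetric; /* only one direction of the flow is seen */
    bool no_intercept;  /* flow matched a no intercept policy */
};

/* Returns true when the flow should also be added to the four-tuple index.
 * Every flow is still added to the session id index. */
static bool needs_tuple_index(const struct flow_attrs *f)
{
    return !(f->is_tunnel || f->is_asymmetric || f->no_intercept);
}
```

A caller would always insert the entry into the session id index and insert it into the four-tuple index only when `needs_tuple_index` returns true, which keeps flows that share an outer four-tuple out of that index.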


gitlab commented on 2024-11-04T15:26:38.117+0800:

[王孟岚|https://git.mesalab.cn/wangmenglan] mentioned this issue in [a commit|2675a79578] of [TSG-OS / tfe|https://git.mesalab.cn/tango/tfe] on branch [feature-session-table|https://git.mesalab.cn/tango/tfe/-/tree/feature-session-table]:{quote}OMPUB-1508 For tunnel traffic, asymmetric traffic, and traffic matching the...{quote}


gitlab commented on 2024-11-05T09:55:54.823+0800:

[王孟岚|https://git.mesalab.cn/wangmenglan] mentioned this issue in [a commit|7e131d851c] of [TSG-OS / tfe|https://git.mesalab.cn/tango/tfe] on branch [feature-session-table|https://git.mesalab.cn/tango/tfe/-/tree/feature-session-table]:{quote}OMPUB-1508 For tunnel traffic, asymmetric traffic, and traffic matching the...{quote}


gitlab commented on 2024-11-05T10:21:44.345+0800:

[王孟岚|https://git.mesalab.cn/wangmenglan] mentioned this issue in [a commit|02fae975ed] of [TSG-OS / tfe|https://git.mesalab.cn/tango/tfe] on branch [feature-24.02-session-table|https://git.mesalab.cn/tango/tfe/-/tree/feature-24.02-session-table]:{quote}OMPUB-1508 For tunnel traffic, asymmetric traffic, and traffic matching the...{quote}


gitlab commented on 2024-11-05T10:25:43.650+0800:

[王孟岚|https://git.mesalab.cn/wangmenglan] mentioned this issue in [a merge request|https://git.mesalab.cn/tango/tfe/-/merge_requests/680] of [TSG-OS / tfe|https://git.mesalab.cn/tango/tfe] on branch [feature-24.02-session-table|https://git.mesalab.cn/tango/tfe/-/tree/feature-24.02-session-table]:{quote}OMPUB-1508 For tunnel traffic, asymmetric traffic, and traffic matching the...{quote}


gitlab commented on 2024-11-05T11:34:28.688+0800:

[王孟岚|https://git.mesalab.cn/wangmenglan] mentioned this issue in [a merge request|https://git.mesalab.cn/tango/tfe/-/merge_requests/681] of [TSG-OS / tfe|https://git.mesalab.cn/tango/tfe] on branch [feature-session-table|https://git.mesalab.cn/tango/tfe/-/tree/feature-session-table]:{quote}OMPUB-1508 For tunnel traffic, asymmetric traffic, and traffic matching the...{quote}


gitlab commented on 2024-11-05T11:34:34.308+0800:

[王孟岚|https://git.mesalab.cn/wangmenglan] mentioned this issue in [a commit|be0bdc08e3] of [TSG-OS / tfe|https://git.mesalab.cn/tango/tfe] on branch [feature-session-table|https://git.mesalab.cn/tango/tfe/-/tree/feature-session-table]:{quote}OMPUB-1508 For tunnel traffic, asymmetric traffic, and traffic matching the...{quote}


gitlab commented on 2024-11-05T14:40:34.679+0800:

[王孟岚|https://git.mesalab.cn/wangmenglan] mentioned this issue in [a commit|4e392b2c10] of [TSG / tsg-os-buildimage|https://git.mesalab.cn/tsg/tsg-os-buildimage] on branch [update-tfe-24.02|https://git.mesalab.cn/tsg/tsg-os-buildimage/-/tree/update-tfe-24.02]:{quote}fix(OMPUB-1508):update tfe to v4.8.84{quote}


gitlab commented on 2024-11-05T14:41:23.721+0800:

[王孟岚|https://git.mesalab.cn/wangmenglan] mentioned this issue in [a merge request|https://git.mesalab.cn/tsg/tsg-os-buildimage/-/merge_requests/2822] of [TSG / tsg-os-buildimage|https://git.mesalab.cn/tsg/tsg-os-buildimage] on branch [update-tfe-24.02|https://git.mesalab.cn/tsg/tsg-os-buildimage/-/tree/update-tfe-24.02]:{quote}fix(OMPUB-1508):update tfe to v4.8.84{quote}


Attachments

63551/image-2024-10-21-18-19-43-308.png


63552/image-2024-10-21-18-27-19-945.png


63553/image-2024-10-21-18-31-05-194.png


63644/image-2024-10-23-17-35-52-582.png


63643/m22_l2tp_ssl_ctrl.pcapng


63544/MDY-ORD-TSGX006.html


63554/proxy_ctrl_pkt.pcapng


63543/sosreport-MDY-ORD-TSGX006-20241021093426.tar.xz