update readme && output_example
256
README_zh.md
@@ -6,87 +6,78 @@ changelog
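11.9 Support monitoring multiple variables.
11.10 Kernel structures are separated by pid; each process can request and cancel its own watches independently.
11.13 User interface: cancel_all_watch -> cancel_watch; processes do not interfere with each other.
11.28 Complete refactor; documentation updated.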
```
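
## Description

Monitors numeric variables (given an address and a length) and prints system stack information when the configured condition is exceeded.
Monitors numeric variables (given an address and a length); when the configured condition is met, prints information about the tasks in the system (user-space stack, kernel stack, call-chain information).
- Supports multiple processes; when a process exits, all of its watches are cancelled.
- Watches with the same timer interval are assigned to the same timer; one timer monitors at most 32 variables, and there are at most 128 timers globally.
- These limits are defined in `source/module/monitor_timer.h`.
- `testcase/helloworld.c` has been tested with 2049 variables in a single process.

Number of simultaneous watches
- Watches with the same timer interval are grouped together, and each group maps to one timer.
- A group holds at most 32 variables; beyond that, a new timer is allocated.
- At most 128 timers exist globally.
- These limits are defined as macros at the top of `watch_module.h`.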
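
The grouping rule above can be checked with a little arithmetic. The following is a minimal sketch, not the module's code: the macro and function names are hypothetical, and only the two limits (32 variables per timer, 128 timers) come from the README.

```c
#include <stddef.h>

/* Limits quoted in the README; these macro names are illustrative only. */
#define DEMO_MAX_VARS_PER_TIMER 32
#define DEMO_MAX_TIMERS 128

/* Timers needed for `count` variables that share one interval
 * (ceiling division by the per-timer limit). */
size_t timers_needed(size_t count)
{
    return (count + DEMO_MAX_VARS_PER_TIMER - 1) / DEMO_MAX_VARS_PER_TIMER;
}

/* Returns 1 if `count` same-interval variables fit in the global
 * timer budget, 0 otherwise. */
int fits_timer_budget(size_t count)
{
    return timers_needed(count) <= DEMO_MAX_TIMERS;
}
```

Under these assumptions, the 2049 variables exercised by `testcase/helloworld.c` would occupy 65 timers if they all used the same interval, and 4096 same-interval variables would use up the whole budget.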
File structure

```log
├── build                 // output
├── source                // all source code
│   ├── buffer            // buffer for communication between the module and user space
│   ├── module            // kernel module code
│   ├── uapi              // user-space API
│   ├── ucli              // user-space command-line tool
│   └── ucli_py           // user-space command-line tool in Python (test only, unfinished)
│       └── libunwind     // ported library used by the Python code to parse stack information
├── testcase              // test cases
└── tools                 // test tools
```
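
## Usage

See helloworld.c for an example.
- Add `#include "watch.h"`
There are two ways to set up a watch on a variable: the macros, or a watch_arg structure.
- Both require the header under `source/uapi`: `#include "monitor_user.h"`

To cancel monitoring, call `cancel_watch();`. variant_monitor then cancels all watches of the calling process.
- The same cleanup runs when the process exits, cancelling all of its watches.
- Calling `cancel_watch();` is therefore optional, but still recommended to avoid possible memory leaks.

Collecting task information is time-consuming, so it is handled in a workqueue, and after each dump the timer restarts after an interval that defaults to 5s.
- This value can be viewed and changed in `/proc/variable_monitor/dump_reset_sec`.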
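
A program can change the restart interval itself instead of going through a shell. The helper below is a minimal sketch, assuming the proc entry accepts a plain decimal number as text (the README does not state the format); `write_ulong_to_file` is a hypothetical name, not part of the project.

```c
#include <stdio.h>

/* Writes `value` as decimal text to `path`.
 * Returns 0 on success, -1 if the file cannot be opened or written.
 * Assumption: the target accepts a decimal number followed by a newline. */
int write_ulong_to_file(const char *path, unsigned long value)
{
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;

    int ok = fprintf(f, "%lu\n", value) > 0;
    if (fclose(f) != 0) /* buffered data is flushed here, so check it too */
        ok = 0;
    return ok ? 0 : -1;
}
```

With the module loaded and sufficient privileges, `write_ulong_to_file("/proc/variable_monitor/dump_reset_sec", 10)` would request a 10s restart interval under that assumption.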

### Loading the driver

From the project root:

```bash
# Build and load the module
make && insmod source/variable_monitor.ko
# Unload the module and clean the build files
# rmmod source/variable_monitor.ko && make clean
# Only tested on `kernel 5.17.15-1.el8.x86_64`; other kernel versions are untested.
```

### Macros

See `testcase/helloworld.c` for an example. Macros are provided for common numeric types for convenience:
- For other types, see `source/uapi/monitor_user_sw.h`
```c
// variable name | address | threshold
START_WATCH_INT("temp", &temp, 150);
START_WATCH_INT_LESS("temp", &temp, 150);
```

By default, watches started through the macros use a timer interval of 10us; this value can be viewed and changed in `/proc/variable_monitor/def_interval_ns`.
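
To confirm which default interval is in effect before relying on the macros, the value can be read back from the proc entry. This is a minimal sketch, assuming the entry prints a decimal number of nanoseconds (10us would then read as 10000); `read_ulong_from_file` is a hypothetical helper, not part of the project.

```c
#include <stdio.h>

/* Reads one unsigned decimal number from `path` into `*out`.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 * Assumption: the file starts with a decimal number. */
int read_ulong_from_file(const char *path, unsigned long *out)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    int matched = fscanf(f, "%lu", out);
    fclose(f);
    return matched == 1 ? 0 : -1;
}
```

A caller would pass `/proc/variable_monitor/def_interval_ns` as `path` once the module is loaded.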
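
### watch_arg structure

For finer control over the timer interval and other settings, define a watch_arg structure and start monitoring with start_watch:
- For each variable to monitor, set the name, address, and length, plus the threshold, comparison direction, timer interval (ns), and so on.
- `start_watch(watch_arg);` starts monitoring.
- Call `cancel_watch();` to cancel monitoring.

When the configured condition is exceeded, system stack information is printed and can be viewed with `dmesg`, as in the example below:
- Within one timer, if several variables exceed their thresholds, the stack information is not printed repeatedly;
- After printing the stacks, the timer restarts after 1s, and the next monitoring round begins 1s later.
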
```log
[ 713.225894] -------------------------------------
[ 713.225900] -------------watch monitor-----------
[ 713.225900] Threshold reached:
[ 713.225901] name: temp0, threshold: 150, pid: 4261
[ 713.225902] name: temp1, threshold: 151, pid: 4261
[ 713.225903] name: temp2, threshold: 152, pid: 4261
[ 713.225904] name: temp3, threshold: 153, pid: 4261
[ 713.225904] name: temp4, threshold: 154, pid: 4261
[ 713.225905] name: temp5, threshold: 155, pid: 4261
[ 713.225905] name: temp6, threshold: 156, pid: 4261
[ 713.225906] name: temp7, threshold: 157, pid: 4261
[ 713.225906] name: temp8, threshold: 158, pid: 4261
[ 713.225907] name: temp9, threshold: 159, pid: 4261
[ 713.225907] name: temp10, threshold: 160, pid: 4261
[ 713.225908] name: temp11, threshold: 161, pid: 4261
[ 713.225908] name: temp12, threshold: 162, pid: 4261
[ 713.225909] name: temp13, threshold: 163, pid: 4261
[ 713.225909] name: temp14, threshold: 164, pid: 4261
[ 713.225910] name: temp15, threshold: 165, pid: 4261
[ 713.225910] name: temp16, threshold: 166, pid: 4261
[ 713.225911] name: temp17, threshold: 167, pid: 4261
[ 713.225911] name: temp18, threshold: 168, pid: 4261
[ 713.225912] name: temp19, threshold: 169, pid: 4261
[ 713.225912] name: temp20, threshold: 170, pid: 4261
[ 713.225913] name: temp21, threshold: 171, pid: 4261
[ 713.225913] name: temp22, threshold: 172, pid: 4261
[ 713.225914] name: temp23, threshold: 173, pid: 4261
[ 713.225914] name: temp24, threshold: 174, pid: 4261
[ 713.225915] name: temp25, threshold: 175, pid: 4261
[ 713.225915] name: temp26, threshold: 176, pid: 4261
[ 713.225916] name: temp27, threshold: 177, pid: 4261
[ 713.225916] name: temp28, threshold: 178, pid: 4261
[ 713.225916] name: temp29, threshold: 179, pid: 4261
[ 713.225917] name: temp30, threshold: 180, pid: 4261
[ 713.225917] name: temp31, threshold: 181, pid: 4261
[ 713.225918] Timestamp (ns): 1699846710299420862
[ 713.225919] Recent Load: 0.05, 0.12, 0.08
[ 713.225921] task: name rcu_gp, pid 3, state 1026
[ 713.225926] rescuer_thread+0x290/0x390
[ 713.225931] kthread+0xd7/0x100
[ 713.225932] ret_from_fork+0x1f/0x30
[ 713.225935] task: name rcu_par_gp, pid 4, state 1026
[ 713.225936] rescuer_thread+0x290/0x390
[ 713.225937] kthread+0xd7/0x100
[ 713.225938] ret_from_fork+0x1f/0x30
[ 713.225940] task: name netns, pid 5, state 1026
[ 713.225941] rescuer_thread+0x290/0x390
[ 713.225942] kthread+0xd7/0x100
```

### Parameters

start_watch takes a watch_arg structure. The fields have the following meaning:
- name is limited to `MAX_NAME_LEN` (15) valid characters

```c
// start_watch takes a watch_arg structure. The fields have the following meaning:
// - name is limited to `MAX_NAME_LEN` (15) valid characters
typedef struct
{
    pid_t task_id; // current process id
@@ -98,44 +89,129 @@ typedef struct
    unsigned char greater_flag; // reverse flag (true: >, false: <)
    unsigned long time_ns; // timer interval (ns)
} watch_arg;
```
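
How `length_byte`, `unsigned_flag`, and `greater_flag` could combine into one check is sketched below in user space. This is an illustration under stated assumptions, not the module's implementation: `load_value` and `condition_met` are hypothetical names, the comparison is taken to be strict as the `greater_flag` comment suggests, and 8-byte unsigned values are left out because they do not fit the signed type used here.

```c
#include <stdint.h>
#include <string.h>

/* Loads a 1-, 2-, 4- or 8-byte integer from `ptr` into `*out`.
 * Returns 0 on success, -1 for a length this sketch does not handle. */
int load_value(const void *ptr, int length_byte, int is_unsigned, long long *out)
{
    switch (length_byte) {
    case 1: {
        uint8_t u;
        memcpy(&u, ptr, sizeof u);
        if (is_unsigned) *out = u; else *out = (int8_t)u;
        return 0;
    }
    case 2: {
        uint16_t u;
        memcpy(&u, ptr, sizeof u);
        if (is_unsigned) *out = u; else *out = (int16_t)u;
        return 0;
    }
    case 4: {
        uint32_t u;
        memcpy(&u, ptr, sizeof u);
        if (is_unsigned) *out = u; else *out = (int32_t)u;
        return 0;
    }
    case 8: {
        int64_t s;
        if (is_unsigned)
            return -1; /* may not fit in long long; out of scope here */
        memcpy(&s, ptr, sizeof s);
        *out = s;
        return 0;
    }
    default:
        return -1;
    }
}

/* Returns 1 if the watched value satisfies the condition, 0 if not,
 * -1 if the length is unsupported. greater_flag != 0 means "value > threshold",
 * greater_flag == 0 means "value < threshold". */
int condition_met(const void *ptr, int length_byte, long long threshold,
                  int unsigned_flag, int greater_flag)
{
    long long value;
    if (load_value(ptr, length_byte, unsigned_flag, &value) != 0)
        return -1;
    return greater_flag ? value > threshold : value < threshold;
}
```

For example, with `int temp = 151;`, the call `condition_met(&temp, sizeof(int), 150, 0, 1)` reports that the condition holds, while a value of exactly 150 does not trigger it.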

An initialization example:

```c
// An initialization example
watch_args = (watch_arg){
    .task_id = getpid(),
    .ptr = &temp,
    .name = "temp",
    .length_byte = sizeof(int),
    .threshold = 150 + i,
    .threshold = 150,
    .unsigned_flag = 0,
    .greater_flag = 1,
    .time_ns = 2000 + (i / 33) * 5000
    .time_ns = 2000 + 5000
};
start_watch(watch_args);
```
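
### Output

The timer polls the variables at the configured interval. When the configured condition is met, it collects information about the matching tasks in the system at that moment (user-space stack, kernel stack, call-chain information).
- `dmesg` shows which variables exceeded their configured condition;
- The task information is written to a buffer and can be viewed with the ucli tool.

Example `dmesg` output:
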
```log
[42865.640988] -------------------------------------
[42865.640992] -----------variable monitor----------
[42865.640993] 超出阈值:1701141698684973655
[42865.640994] : pid: 63936, name: temp0, ptr: 00000000bade6e61, threshold:110
[42865.648068] -------------------------------------
[42875.640703] -------------------------------------
[42875.640706] -----------variable monitor----------
[42875.640706] 超出阈值:1701141708684881779
[42875.640708] : pid: 63936, name: temp0, ptr: 00000000bade6e61, threshold:110
[42875.640710] : pid: 63936, name: temp1, ptr: 00000000ee645b96, threshold:111
[42875.640711] : pid: 63936, name: temp2, ptr: 00000000f62b7afe, threshold:112
[42875.640711] : pid: 63936, name: temp3, ptr: 00000000d100fa3c, threshold:113
[42875.640712] : pid: 63936, name: temp4, ptr: 000000006d31cae1, threshold:114
[42875.640712] : pid: 63936, name: temp5, ptr: 00000000723c7a2a, threshold:115
[42875.640713] : pid: 63936, name: temp6, ptr: 0000000026ef6e83, threshold:116
[42875.640714] : pid: 63936, name: temp7, ptr: 00000000fc1e5d5e, threshold:117
[42875.640714] : pid: 63936, name: temp8, ptr: 0000000069b2666e, threshold:118
[42875.640715] : pid: 63936, name: temp9, ptr: 000000000176263d, threshold:119
[42875.648023] -------------------------------------
```
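
When post-processing such `dmesg` lines, the per-variable entries can be picked apart with `sscanf`. The sketch below assumes the line layout stays exactly as in the sample above; `demo_hit` and `parse_hit_line` are hypothetical names, and the 15-character name buffer follows the `MAX_NAME_LEN` limit quoted earlier.

```c
#include <stdio.h>
#include <string.h>

/* One parsed entry; field choice is illustrative. */
typedef struct {
    int pid;
    char name[16];       /* 15 characters plus the terminating NUL */
    long long threshold;
} demo_hit;

/* Parses a line such as
 *   "[42865.640994] : pid: 63936, name: temp0, ptr: 00000000bade6e61, threshold:110"
 * Returns 0 on success, -1 if the line is not a per-variable entry. */
int parse_hit_line(const char *line, demo_hit *hit)
{
    const char *p = strstr(line, "pid:");
    if (p == NULL)
        return -1;

    /* %*[^,] skips the pointer field without storing it. */
    int n = sscanf(p, "pid: %d, name: %15[^,], ptr: %*[^,], threshold:%lld",
                   &hit->pid, hit->name, &hit->threshold);
    return n == 3 ? 0 : -1;
}
```

Separator and header lines contain no `pid:` field, so the function rejects them and a caller can simply skip every line that returns -1.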

By default, `ucli` is placed in the build folder after compilation.

`ucli > output`
- ucli parses the buffer contents and writes them to the `output` file.
- **This operation clears the buffer.**

Example output of the `ucli` tool (see output_example for details):
- userstack is the stack-information test program under testcase.

```log
##CGROUP:[/] 51666 [510] 采样命中[D]
进程信息: [/ / userstack], PID: 51666 / 51666
##C++ pid 51666
用户态堆栈SP:7ffcd5822298, BP:2, IP:7f071c720838
#~ 0x7f071c720838 __GI___nanosleep ([symbol])
#~ 0x7f071c72076e __sleep ([symbol])
#~ 0x400a08 customFunction1 ([symbol])
#~ 0x400a64 customFunction3 ([symbol])
#~ 0x400a42 customFunction2 ([symbol])
#~ 0x400a21 customFunction1 ([symbol])
#~ 0x400a64 customFunction3 ([symbol])
#~ 0x400a42 customFunction2 ([symbol])
#~ 0x400a21 customFunction1 ([symbol])
#~ 0x400a64 customFunction3 ([symbol])
#~ 0x400a42 customFunction2 ([symbol])
#~ 0x400a21 customFunction1 ([symbol])
#~ 0x400a64 customFunction3 ([symbol])
#~ 0x400a42 customFunction2 ([symbol])
#~ 0x400a21 customFunction1 ([symbol])
#~ 0x400a64 customFunction3 ([symbol])
#~ 0x400a42 customFunction2 ([symbol])
#~ 0x400a21 customFunction1 ([symbol])
#~ 0x400a64 customFunction3 ([symbol])
#~ 0x400a42 customFunction2 ([symbol])
#~ 0x400a21 customFunction1 ([symbol])
#~ 0x400a64 customFunction3 ([symbol])
#~ 0x400a42 customFunction2 ([symbol])
#~ 0x400a21 customFunction1 ([symbol])
#~ 0x400a64 customFunction3 ([symbol])
#~ 0x400a42 customFunction2 ([symbol])
#~ 0x400a21 customFunction1 ([symbol])
#~ 0x400a64 customFunction3 ([symbol])
#~ 0x400a42 customFunction2 ([symbol])
#~ 0x400a21 customFunction1 ([symbol])
#~ 0x400a64 customFunction3 ([symbol])
#~ 0x400a42 customFunction2 ([symbol])
#~ 0x400a21 customFunction1 ([symbol])
#~ 0x400a75 main ([symbol])
#~ 0x7f071c661d85 __libc_start_main ([symbol])
#~ 0x40081e _start ([symbol])
内核态堆栈:
#@ 0xffffffff811730dd hrtimer_nanosleep ([kernel.kallsyms])
#@ 0xffffffff811733a6 __x64_sys_nanosleep ([kernel.kallsyms])
#@ 0xffffffff819fa117 do_syscall_64 ([kernel.kallsyms])
#@ 0xffffffff81c0007c entry_SYSCALL_64_after_hwframe ([kernel.kallsyms])
#@ 0xffffffff819fa117 do_syscall_64 ([kernel.kallsyms])
#@ 0xffffffff81c0007c entry_SYSCALL_64_after_hwframe ([kernel.kallsyms])
#@ 0xffffffff819fa117 do_syscall_64 ([kernel.kallsyms])
#@ 0xffffffff81c0007c entry_SYSCALL_64_after_hwframe ([kernel.kallsyms])
#* 0xffffffffffffff userstack (UNKNOWN)
进程链信息:
#^ 0xffffffffffffff ./build/userstack (UNKNOWN)
#^ 0xffffffffffffff /bin/bash --init-file /root/.vscode-server-insiders/cli/servers/Insiders-ca9da6c177fc4cf7429e1d0c1c52f710d6d953c6/server/out/vs/workbench/contrib/terminal/browser/media/shellIntegration-bash.sh (UNKNOWN)
#^ 0xffffffffffffff /root/.vscode-server-insiders/cli/servers/Insiders-ca9da6c177fc4cf7429e1d0c1c52f710d6d953c6/server/node /root/.vscode-server-insiders/cli/servers/Insiders-ca9da6c177fc4cf7429e1d0c1c52f710d6d953c6/server/out/bootstrap-fork --type=ptyHost --logsPath /root/ (UNKNOWN)
#^ 0xffffffffffffff /root/.vscode-server-insiders/cli/servers/Insiders-ca9da6c177fc4cf7429e1d0c1c52f710d6d953c6/server/node /root/.vscode-server-insiders/cli/servers/Insiders-ca9da6c177fc4cf7429e1d0c1c52f710d6d953c6/server/out/server-main.js --connection-token=remotessh --a (UNKNOWN)
#^ 0xffffffffffffff sh /root/.vscode-server-insiders/cli/servers/Insiders-ca9da6c177fc4cf7429e1d0c1c52f710d6d953c6/server/bin/code-server-insiders --connection-token=remotessh --accept-server-license-terms --start-server --enable-remote-auto-shutdown --socket-path=/tmp/code (UNKNOWN)
#^ 0xffffffffffffff /root/.vscode-server-insiders/code-insiders-ca9da6c177fc4cf7429e1d0c1c52f710d6d953c6 command-shell --cli-data-dir /root/.vscode-server-insiders/cli --on-port --require-token b5a047063eb7 (UNKNOWN)
#^ 0xffffffffffffff /usr/lib/systemd/systemd --switched-root --system --deserialize 17 (UNKNOWN)
##
```
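
For scripts that consume the `ucli` output, the two-character line prefixes are a convenient hook. Judging only from the sample above (an inference, not a documented format), `#~` lines sit under the user-space stack section, `#@` lines under the kernel stack section, and `#^` lines under the process-chain section. The sketch below classifies lines on that assumption; all names in it are hypothetical.

```c
#include <stdio.h>
#include <string.h>

typedef enum {
    LINE_OTHER,
    LINE_USER_FRAME,    /* "#~" in the sample output */
    LINE_KERNEL_FRAME,  /* "#@" in the sample output */
    LINE_PROCESS_CHAIN  /* "#^" in the sample output */
} demo_line_kind;

/* Skips leading spaces and tabs. */
static const char *skip_blanks(const char *s)
{
    while (*s == ' ' || *s == '\t')
        s++;
    return s;
}

/* Classifies one output line by its prefix. */
demo_line_kind classify_line(const char *line)
{
    line = skip_blanks(line);
    if (strncmp(line, "#~", 2) == 0) return LINE_USER_FRAME;
    if (strncmp(line, "#@", 2) == 0) return LINE_KERNEL_FRAME;
    if (strncmp(line, "#^", 2) == 0) return LINE_PROCESS_CHAIN;
    return LINE_OTHER;
}

/* Extracts address and symbol from a user or kernel frame line.
 * `symbol` must have room for 64 bytes. Returns 0 on success, -1 otherwise. */
int parse_frame(const char *line, unsigned long long *addr, char *symbol)
{
    demo_line_kind kind = classify_line(line);
    if (kind != LINE_USER_FRAME && kind != LINE_KERNEL_FRAME)
        return -1;
    line = skip_blanks(line);
    /* %llx accepts the leading "0x"; %63s stops at the first blank. */
    return sscanf(line + 2, "%llx %63s", addr, symbol) == 2 ? 0 : -1;
}
```

Process-chain lines are only classified, not parsed, because their payload is a full command line rather than a single symbol.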

## demo

In the project's main directory
In the usercase folder
- `helloworld.c`: tests monitoring a large number of variables
- `userstack.c`: tests user-space stack output
- `hptest.c`: tests hugePage mounting

```bash
# Build and load the module
make && insmod variable_monitor.ko
./helloworld
```

The printed stack information is visible in dmesg

```bash
# Unload the module and clean the build files
rmmod variable_monitor.ko && make clean
```

Only tested on `kernel 5.17.15-1.el8.x86_64`; other kernel versions are untested.

## Miscellaneous

The program has two parts: a character device and a user-space interface, which communicate through ioctl.

11504
output_example
Normal file
File diff suppressed because it is too large