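Translator's note
While developing eBPF programs in a Linux virtual machine on a MacBook M2, I found that some LSM-type eBPF programs would not run; even on Ubuntu Server with a 5.15 kernel they still failed. Clearly, the aarch64 and x86_64 kernels differ in the features they support. While trying to pin down those differences I came across this article, which gives a more direct picture of how LSM eBPF support differs between the kernels for the two CPU architectures.
Original article
This blog post is a summary of our internal research on `BPF LSM` support on `aarch64` in Linux. If you are not familiar with the kernel codebase, starting to read the kernel source is very hard, so we decided to publish this post and show our approach, since it may help anyone who wants to explore kernel internals.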
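Introduction
On x86_64 we already use BPF LSM, while on aarch64 we rely on Kprobes, so we wanted to know which kernel features are missing for BPF LSM to work on aarch64.
We have dug into the kernel source many times, but usually we search for something that already exists in order to understand how it works. In this case we were looking for something that does not exist: code paths that return an error because a feature is not implemented.
Recalling Steven Rostedt's talk on how to start learning the Linux kernel, we started with ftrace (and the tools built on the tracing infrastructure) to understand what happens when we load an unsupported BPF program into the kernel.
The problem
This is the output of our software, pulsar[2], when we try to load a BPF LSM program on an aarch64 Linux 5.15 kernel: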
    root@pine64-1:/home/exein# ./pulsar-enterprise-exec pulsard
    [2023-02-16T1445Z INFO  pulsar::daemon] Starting module process-monitor
    [2023-02-16T1445Z INFO  pulsar::daemon] Starting module file-system-monitor
    [2023-02-16T1446Z INFO  pulsar::daemon] Starting module network-monitor
    [2023-02-16T1446Z INFO  pulsar::daemon] Starting module logger
    [2023-02-16T1446Z INFO  pulsar::daemon] Starting module rules-engine
    [2023-02-16T1446Z INFO  pulsar::daemon] Starting module desktop-notifier
    [2023-02-16T1446Z ERROR pulsar::module_manager] Module error in file-system-monitor: failed program attach lsm path_mknod

    Caused by:
        0: `bpf_raw_tracepoint_open` failed
        1: No error information (os error 524)
    [2023-02-16T1446Z INFO  pulsar::daemon] Starting module anomaly-detection
    [2023-02-16T1446Z INFO  pulsar::daemon] Starting module malware-detection
    [2023-02-16T1446Z ERROR pulsar::module_manager] Module error in malware-detection: /var/lib/pulsar/malware_detection/models/parameters.json not found
    [2023-02-16T1446Z INFO  pulsar::daemon] Starting module platform-connector
    [2023-02-16T1446Z INFO  platform_connector::client] Connected to https://platform-dev-instance.exein.io:8001/
    [2023-02-16T1446Z INFO  pulsar::daemon] Starting module threat-response
    [2023-02-16T1446Z ERROR pulsar::module_manager] Module error in network-monitor: failed program attach lsm socket_bind

    Caused by:
        0: `bpf_raw_tracepoint_open` failed
        1: No error information (os error 524)
When trying to load the BPF program attached to the path_mknod LSM hook, pulsar fails with error 524, that is, ENOTSUPP. Let's dig into this problem.
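The "No error information" text in the log is a hint that 524 is not an ordinary userspace errno: ENOTSUPP is one of the kernel-internal codes defined in include/linux/errno.h, so the C library has no message for it. A minimal sketch of a lookup helper follows; the name `decode_kernel_errno` is hypothetical, and only three of the internal codes are covered.

```shell
# decode_kernel_errno: hypothetical helper, not part of pulsar or any kernel tool.
# Maps a few kernel-internal errno values (include/linux/errno.h) to their names.
decode_kernel_errno() {
    code=${1#-}          # accept both 524 and -524
    case "$code" in
        512) echo ERESTARTSYS ;;
        517) echo EPROBE_DEFER ;;
        524) echo ENOTSUPP ;;
        *)   echo "unknown ($code)" ;;
    esac
}
```

For example, `decode_kernel_errno -524` prints ENOTSUPP, which is handy when a tracing tool reports the raw negative return value.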
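Note: at the time of this research we could not find a prebuilt aarch64 kernel with BPF and BTF enabled, so we had to compile a custom kernel. We also enabled the tracing options and the function_graph plugin so that we could use the tools below.
All experiments were run on a Pine A64 with a custom Armbian[3] image.
These images have a custom kernel with a standard Ubuntu 22.04 LTS Jammy userspace.
Tools
To investigate the problem we used the following tools:
bpftrace[4]: a BPF-based tool that attaches probes dynamically using a custom C-like language.
trace-cmd[5]: a wrapper around the tracefs filesystem for interacting with the ftrace infrastructure.
To use these tools you need to enable some options in the Linux kernel; see the official documentation for the full requirements.
Note: other tools can do the same job, for example funcgraph and kprobe from perf-tools[6].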
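Before starting, it can save time to confirm that the tools are actually installed on the target board. The following is a small sketch under that assumption; `missing_tools` is a hypothetical helper name, and it only checks that the commands are on PATH, not that the kernel options they need are enabled.

```shell
# missing_tools: hypothetical helper; prints each named command that is not found in PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

Running `missing_tools bpftrace trace-cmd` prints nothing when both tools are available and lists the absent ones otherwise.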
Linux 5.15
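Let's now use these tools to see what happens when we try to load our BPF program on kernel 5.15.
From here to the end of the article we use the probe binary instead of pulsar because it is simpler. As a quick summary of how it works, here is its command-line help: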
    exein@pine64-1:~$ ./probe
    Test runner for eBPF programs

    Usage: probe [OPTIONS]

    Commands:
      file-system-monitor  Watch file creations
      process-monitor      Watch process events (fork/exec/exit)
      network-monitor      Watch network events
      help                 Print this message or the help of the given subcommand(s)

    Options:
      -v, --verbose
      -h, --help     Print help
      -V, --version  Print version
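In these examples we will try to load the file-system-monitor probe.
By running the following commands we can see the function graph of the calls made by __sys_bpf, the entry point of the BPF system call: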
    trace-cmd record -p function_graph -g __sys_bpf ./probe file-system-monitor
    trace-cmd report
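The output is a very large function graph, too big to paste here in full. Since we hit an error, we are interested in the last few functions called before the program stops. These are the last lines of the trace-cmd report output: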
... tokio-runtime-w-1666 [003] 1318.058019: funcgraph_entry: | bpf_trampoline_link_prog() { tokio-runtime-w-1666 [003] 1318.058020: funcgraph_entry: 2.292 us | bpf_attach_type_to_tramp(); tokio-runtime-w-1666 [003] 1318.058024: funcgraph_entry: 1.250 us | mutex_lock(); tokio-runtime-w-1666 [003] 1318.058028: funcgraph_entry: | bpf_trampoline_update() { tokio-runtime-w-1666 [003] 1318.058030: funcgraph_entry: | kmem_cache_alloc_trace() { tokio-runtime-w-1666 [003] 1318.058031: funcgraph_entry: 1.167 us | should_failslab(); tokio-runtime-w-1666 [003] 1318.058036: funcgraph_exit: 6.792 us | } tokio-runtime-w-1666 [003] 1318.058039: funcgraph_entry: | kmem_cache_alloc_trace() { tokio-runtime-w-1666 [003] 1318.058042: funcgraph_entry: 2.750 us | should_failslab(); tokio-runtime-w-1666 [003] 1318.058046: funcgraph_exit: 6.417 us | } tokio-runtime-w-1666 [003] 1318.058048: funcgraph_entry: 2.708 us | bpf_jit_charge_modmem(); tokio-runtime-w-1666 [003] 1318.058053: funcgraph_entry: | bpf_jit_alloc_exec_page() { tokio-runtime-w-1666 [003] 1318.058055: funcgraph_entry: | bpf_jit_alloc_exec() { tokio-runtime-w-1666 [003] 1318.058057: funcgraph_entry: | vmalloc() { tokio-runtime-w-1666 [003] 1318.058059: funcgraph_entry: | __vmalloc_node() { tokio-runtime-w-1666 [003] 1318.058061: funcgraph_entry: | __vmalloc_node_range() { tokio-runtime-w-1666 [003] 1318.058064: funcgraph_entry: | __get_vm_area_node.constprop.64() { tokio-runtime-w-1666 [003] 1318.058067: funcgraph_entry: | kmem_cache_alloc_node_trace() { tokio-runtime-w-1666 [003] 1318.058069: funcgraph_entry: 1.459 us | should_failslab(); tokio-runtime-w-1666 [003] 1318.058073: funcgraph_exit: 6.292 us | } tokio-runtime-w-1666 [003] 1318.058075: funcgraph_entry: | alloc_vmap_area() { tokio-runtime-w-1666 [003] 1318.058077: funcgraph_entry: | kmem_cache_alloc_node() { tokio-runtime-w-1666 [003] 1318.058079: funcgraph_entry: 1.167 us | should_failslab(); tokio-runtime-w-1666 [003] 1318.058085: funcgraph_exit: 7.625 us | } 
tokio-runtime-w-1666 [003] 1318.058088: funcgraph_entry: | kmem_cache_alloc_node() { tokio-runtime-w-1666 [003] 1318.058089: funcgraph_entry: 1.208 us | should_failslab(); tokio-runtime-w-1666 [003] 1318.058092: funcgraph_exit: 4.584 us | } tokio-runtime-w-1666 [003] 1318.058104: funcgraph_entry: | kmem_cache_free() { tokio-runtime-w-1666 [003] 1318.058107: funcgraph_entry: 2.084 us | __slab_free(); tokio-runtime-w-1666 [003] 1318.058110: funcgraph_exit: 5.667 us | } tokio-runtime-w-1666 [003] 1318.058112: funcgraph_entry: 6.375 us | insert_vmap_area.constprop.74(); tokio-runtime-w-1666 [003] 1318.058119: funcgraph_exit: + 44.667 us | } tokio-runtime-w-1666 [003] 1318.058122: funcgraph_exit: + 58.250 us | } tokio-runtime-w-1666 [003] 1318.058124: funcgraph_entry: | __kmalloc_node() { tokio-runtime-w-1666 [003] 1318.058125: funcgraph_entry: 1.625 us | kmalloc_slab(); tokio-runtime-w-1666 [003] 1318.058128: funcgraph_entry: 1.167 us | should_failslab(); tokio-runtime-w-1666 [003] 1318.058131: funcgraph_exit: 7.208 us | } tokio-runtime-w-1666 [003] 1318.058133: funcgraph_entry: | alloc_pages() { tokio-runtime-w-1666 [003] 1318.058135: funcgraph_entry: 1.583 us | get_task_policy.part.48(); tokio-runtime-w-1666 [003] 1318.058138: funcgraph_entry: 1.500 us | policy_node(); tokio-runtime-w-1666 [003] 1318.058141: funcgraph_entry: 1.209 us | policy_nodemask(); tokio-runtime-w-1666 [003] 1318.058143: funcgraph_entry: | __alloc_pages() { tokio-runtime-w-1666 [003] 1318.058145: funcgraph_entry: 1.458 us | should_fail_alloc_page(); tokio-runtime-w-1666 [003] 1318.058147: funcgraph_entry: | get_page_from_freelist() { tokio-runtime-w-1666 [003] 1318.058150: funcgraph_entry: 1.583 us | prep_new_page(); tokio-runtime-w-1666 [003] 1318.058153: funcgraph_exit: 5.459 us | } tokio-runtime-w-1666 [003] 1318.058154: funcgraph_exit: + 10.542 us | } tokio-runtime-w-1666 [003] 1318.058155: funcgraph_exit: + 22.083 us | } tokio-runtime-w-1666 [003] 1318.058157: funcgraph_entry: | 
__cond_resched() { tokio-runtime-w-1666 [003] 1318.058158: funcgraph_entry: 1.833 us | rcu_all_qs(); tokio-runtime-w-1666 [003] 1318.058161: funcgraph_exit: 4.167 us | } tokio-runtime-w-1666 [003] 1318.058166: funcgraph_entry: 5.542 us | vmap_pages_range_noflush(); tokio-runtime-w-1666 [003] 1318.058173: funcgraph_exit: ! 112.375 us | } tokio-runtime-w-1666 [003] 1318.058175: funcgraph_exit: ! 116.000 us | } tokio-runtime-w-1666 [003] 1318.058176: funcgraph_exit: ! 119.292 us | } tokio-runtime-w-1666 [003] 1318.058177: funcgraph_exit: ! 122.542 us | } tokio-runtime-w-1666 [003] 1318.058179: funcgraph_entry: | find_vm_area() { tokio-runtime-w-1666 [003] 1318.058180: funcgraph_entry: 1.375 us | find_vmap_area(); tokio-runtime-w-1666 [003] 1318.058183: funcgraph_exit: 4.333 us | } tokio-runtime-w-1666 [003] 1318.058185: funcgraph_entry: | set_memory_x() { tokio-runtime-w-1666 [003] 1318.058186: funcgraph_entry: | change_memory_common() { tokio-runtime-w-1666 [003] 1318.058188: funcgraph_entry: | find_vm_area() { tokio-runtime-w-1666 [003] 1318.058189: funcgraph_entry: 1.333 us | find_vmap_area(); tokio-runtime-w-1666 [003] 1318.058192: funcgraph_exit: 3.875 us | } tokio-runtime-w-1666 [003] 1318.058193: funcgraph_entry: | vm_unmap_aliases() { tokio-runtime-w-1666 [003] 1318.058194: funcgraph_entry: | _vm_unmap_aliases.part.58() { tokio-runtime-w-1666 [003] 1318.058196: funcgraph_entry: 1.542 us | rcu_read_unlock_strict(); tokio-runtime-w-1666 [003] 1318.058199: funcgraph_entry: 1.208 us | rcu_read_unlock_strict(); tokio-runtime-w-1666 [003] 1318.058202: funcgraph_entry: 1.166 us | rcu_read_unlock_strict(); tokio-runtime-w-1666 [003] 1318.058205: funcgraph_entry: 1.208 us | rcu_read_unlock_strict(); tokio-runtime-w-1666 [003] 1318.058207: funcgraph_entry: 1.208 us | mutex_lock(); tokio-runtime-w-1666 [003] 1318.058210: funcgraph_entry: | purge_fragmented_blocks_allcpus() { tokio-runtime-w-1666 [003] 1318.058212: funcgraph_entry: 1.500 us | rcu_read_unlock_strict(); 
tokio-runtime-w-1666 [003] 1318.058214: funcgraph_entry: 1.500 us | rcu_read_unlock_strict(); tokio-runtime-w-1666 [003] 1318.058217: funcgraph_entry: 1.500 us | rcu_read_unlock_strict(); tokio-runtime-w-1666 [003] 1318.058220: funcgraph_entry: 1.167 us | rcu_read_unlock_strict(); tokio-runtime-w-1666 [003] 1318.058222: funcgraph_exit: + 11.917 us | } tokio-runtime-w-1666 [003] 1318.058224: funcgraph_entry: | __purge_vmap_area_lazy() { tokio-runtime-w-1666 [003] 1318.058232: funcgraph_entry: | kmem_cache_free() { tokio-runtime-w-1666 [003] 1318.058234: funcgraph_entry: 1.250 us | __slab_free(); tokio-runtime-w-1666 [003] 1318.058237: funcgraph_exit: 4.791 us | } tokio-runtime-w-1666 [003] 1318.058241: funcgraph_entry: 1.209 us | __cond_resched_lock(); tokio-runtime-w-1666 [003] 1318.058244: funcgraph_exit: + 19.625 us | } tokio-runtime-w-1666 [003] 1318.058245: funcgraph_entry: 1.167 us | mutex_unlock(); tokio-runtime-w-1666 [003] 1318.058247: funcgraph_exit: + 53.042 us | } tokio-runtime-w-1666 [003] 1318.058248: funcgraph_exit: + 55.625 us | } tokio-runtime-w-1666 [003] 1318.058250: funcgraph_entry: | __change_memory_common() { tokio-runtime-w-1666 [003] 1318.058251: funcgraph_entry: | apply_to_page_range() { tokio-runtime-w-1666 [003] 1318.058253: funcgraph_entry: | __apply_to_page_range() { tokio-runtime-w-1666 [003] 1318.058255: funcgraph_entry: 1.250 us | pud_huge(); tokio-runtime-w-1666 [003] 1318.058258: funcgraph_entry: 1.166 us | pmd_huge(); tokio-runtime-w-1666 [003] 1318.058260: funcgraph_entry: 1.208 us | change_page_range(); tokio-runtime-w-1666 [003] 1318.058263: funcgraph_exit: 9.834 us | } tokio-runtime-w-1666 [003] 1318.058264: funcgraph_exit: + 12.709 us | } tokio-runtime-w-1666 [003] 1318.058266: funcgraph_exit: + 15.459 us | } tokio-runtime-w-1666 [003] 1318.058268: funcgraph_exit: + 80.791 us | } tokio-runtime-w-1666 [003] 1318.058270: funcgraph_exit: + 84.834 us | } tokio-runtime-w-1666 [003] 1318.058272: funcgraph_exit: ! 
218.500 us | } tokio-runtime-w-1666 [003] 1318.058274: funcgraph_entry: | __alloc_percpu_gfp() { tokio-runtime-w-1666 [003] 1318.058276: funcgraph_entry: | pcpu_alloc() { tokio-runtime-w-1666 [003] 1318.058281: funcgraph_entry: 2.250 us | mutex_lock_killable(); tokio-runtime-w-1666 [003] 1318.058290: funcgraph_entry: | pcpu_find_block_fit() { tokio-runtime-w-1666 [003] 1318.058293: funcgraph_entry: 2.833 us | pcpu_next_fit_region.constprop.38(); tokio-runtime-w-1666 [003] 1318.058299: funcgraph_exit: 9.084 us | } tokio-runtime-w-1666 [003] 1318.058301: funcgraph_entry: | pcpu_alloc_area() { tokio-runtime-w-1666 [003] 1318.058315: funcgraph_entry: 4.000 us | pcpu_block_update_hint_alloc(); tokio-runtime-w-1666 [003] 1318.058320: funcgraph_entry: 2.208 us | pcpu_chunk_relocate(); tokio-runtime-w-1666 [003] 1318.058324: funcgraph_exit: + 22.625 us | } tokio-runtime-w-1666 [003] 1318.058327: funcgraph_entry: 1.208 us | mutex_unlock(); tokio-runtime-w-1666 [003] 1318.058332: funcgraph_entry: 1.584 us | pcpu_memcg_post_alloc_hook(); tokio-runtime-w-1666 [003] 1318.058335: funcgraph_exit: + 58.833 us | } tokio-runtime-w-1666 [003] 1318.058336: funcgraph_exit: + 61.834 us | } tokio-runtime-w-1666 [003] 1318.058338: funcgraph_entry: | kmem_cache_alloc_trace() { tokio-runtime-w-1666 [003] 1318.058339: funcgraph_entry: 1.167 us | should_failslab(); tokio-runtime-w-1666 [003] 1318.058342: funcgraph_exit: 4.458 us | } tokio-runtime-w-1666 [003] 1318.058359: funcgraph_entry: | bpf_image_ksym_add() { tokio-runtime-w-1666 [003] 1318.058360: funcgraph_entry: | bpf_ksym_add() { tokio-runtime-w-1666 [003] 1318.058363: funcgraph_entry: 1.583 us | __local_bh_enable_ip(); tokio-runtime-w-1666 [003] 1318.058366: funcgraph_exit: 5.750 us | } tokio-runtime-w-1666 [003] 1318.058369: funcgraph_exit: 9.834 us | } tokio-runtime-w-1666 [003] 1318.058371: funcgraph_entry: 1.250 us | arch_prepare_bpf_trampoline(); tokio-runtime-w-1666 [003] 1318.058373: funcgraph_entry: 2.292 us | kfree(); 
tokio-runtime-w-1666 [003] 1318.058377: funcgraph_exit: ! 348.625 us | } tokio-runtime-w-1666 [003] 1318.058379: funcgraph_entry: 1.250 us | mutex_unlock(); tokio-runtime-w-1666 [003] 1318.058382: funcgraph_exit: ! 363.167 us | } tokio-runtime-w-1666 [003] 1318.058384: funcgraph_entry: | bpf_link_cleanup() { tokio-runtime-w-1666 [003] 1318.058386: funcgraph_entry: | bpf_link_free_id.part.30() { tokio-runtime-w-1666 [003] 1318.058392: funcgraph_entry: | call_rcu() { tokio-runtime-w-1666 [003] 1318.058396: funcgraph_entry: 1.834 us | rcu_segcblist_enqueue(); tokio-runtime-w-1666 [003] 1318.058401: funcgraph_exit: 9.333 us | } tokio-runtime-w-1666 [003] 1318.058403: funcgraph_entry: 1.542 us | __local_bh_enable_ip(); tokio-runtime-w-1666 [003] 1318.058406: funcgraph_exit: + 19.542 us | } tokio-runtime-w-1666 [003] 1318.058408: funcgraph_entry: | fput() { tokio-runtime-w-1666 [003] 1318.058409: funcgraph_entry: | fput_many() { tokio-runtime-w-1666 [003] 1318.058411: funcgraph_entry: | task_work_add() { tokio-runtime-w-1666 [003] 1318.058414: funcgraph_entry: 1.625 us | kick_process(); tokio-runtime-w-1666 [003] 1318.058418: funcgraph_exit: 6.750 us | } tokio-runtime-w-1666 [003] 1318.058419: funcgraph_exit: + 10.333 us | } tokio-runtime-w-1666 [003] 1318.058420: funcgraph_exit: + 12.708 us | } tokio-runtime-w-1666 [003] 1318.058422: funcgraph_entry: 2.250 us | put_unused_fd(); tokio-runtime-w-1666 [003] 1318.058426: funcgraph_exit: + 41.416 us | } tokio-runtime-w-1666 [003] 1318.058428: funcgraph_entry: 1.292 us | mutex_unlock(); tokio-runtime-w-1666 [003] 1318.058430: funcgraph_entry: 1.250 us | kfree(); tokio-runtime-w-1666 [003] 1318.058433: funcgraph_exit: ! 567.458 us | } tokio-runtime-w-1666 [003] 1318.058435: funcgraph_entry: 2.125 us | __bpf_prog_put.isra.47(); tokio-runtime-w-1666 [003] 1318.058438: funcgraph_exit: ! 602.291 us | } tokio-runtime-w-1666 [003] 1318.058439: funcgraph_exit: ! 
631.791 us | }

This is the source code in `kernel/bpf/trampoline.c` of `bpf_trampoline_update`, the last function executed:

```c
static int bpf_trampoline_update(struct bpf_trampoline *tr)
{
    struct bpf_tramp_image *im;
    struct bpf_tramp_progs *tprogs;
    u32 flags = BPF_TRAMP_F_RESTORE_REGS;
    bool ip_arg = false;
    int err, total;

    tprogs = bpf_trampoline_get_progs(tr, &total, &ip_arg);
    if (IS_ERR(tprogs))
        return PTR_ERR(tprogs);

    if (total == 0) {
        err = unregister_fentry(tr, tr->cur_image->image);
        bpf_tramp_image_put(tr->cur_image);
        tr->cur_image = NULL;
        tr->selector = 0;
        goto out;
    }

    im = bpf_tramp_image_alloc(tr->key, tr->selector);
    if (IS_ERR(im)) {
        err = PTR_ERR(im);
        goto out;
    }

    if (tprogs[BPF_TRAMP_FEXIT].nr_progs ||
        tprogs[BPF_TRAMP_MODIFY_RETURN].nr_progs)
        flags = BPF_TRAMP_F_CALL_ORIG | BPF_TRAMP_F_SKIP_FRAME;

    if (ip_arg)
        flags |= BPF_TRAMP_F_IP_ARG;

    err = arch_prepare_bpf_trampoline(im, im->image, im->image + PAGE_SIZE,
                                      &tr->func.model, flags, tprogs,
                                      tr->func.addr);
    if (err < 0)
        goto out;

    WARN_ON(tr->cur_image && tr->selector == 0);
    WARN_ON(!tr->cur_image && tr->selector);
    if (tr->cur_image)
        /* progs already running at this address */
        err = modify_fentry(tr, tr->cur_image->image, im->image);
    else
        /* first time registering */
        err = register_fentry(tr, im->image);
    if (err)
        goto out;
    if (tr->cur_image)
        bpf_tramp_image_put(tr->cur_image);
    tr->cur_image = im;
    tr->selector++;
out:
    kfree(tprogs);
    return err;
}
```
From the previous output we can see:
    tokio-runtime-w-1666 [003] 1318.058371: funcgraph_entry: 1.250 us | arch_prepare_bpf_trampoline();
    tokio-runtime-w-1666 [003] 1318.058373: funcgraph_entry: 2.292 us | kfree();
There are no other function calls between arch_prepare_bpf_trampoline and kfree, so it is very likely that the former returned an error code, which ended up in the err variable. Let's verify this.
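The adjacency read off by eye here can also be checked mechanically on larger reports. The sketch below assumes one trace entry per line, as trace-cmd report normally prints them; `called_before` is a hypothetical helper name. If the function it prints appeared in the report as a leaf entry (ending in `();`), nothing traced ran between it and the target.

```shell
# called_before: hypothetical helper. Reads `trace-cmd report` text on stdin and
# prints the function whose funcgraph_entry line directly precedes the first
# funcgraph_entry line of the function named in $1.
called_before() {
    awk -v target="$1" '
        /funcgraph_entry:/ {
            name = $0
            sub(/^.*[|] */, "", name)    # drop everything up to the "|" column
            sub(/[(][)].*$/, "", name)   # drop the trailing "();" or "() {"
            if (name == target) { print prev; exit }
            prev = name
        }
    '
}
```

It reports the entry before the first match in whatever text it is given, so it helps to narrow the input to the region of interest first, for example `trace-cmd report | tail -n 40 | called_before kfree`.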
By launching bpftrace in a shell as follows, we can capture the return value of arch_prepare_bpf_trampoline and print it to the console:
    bpftrace -e 'kretprobe:arch_prepare_bpf_trampoline { printf("retval link: %d\n", retval); }'
After starting probe in another terminal, bpftrace printed the following:
    root@pine64-1:/home/exein# bpftrace -e 'kretprobe:arch_prepare_bpf_trampoline { printf("retval link: %d\n", retval); }'
    Attaching 1 probe...
    retval link: -524
This happens because kernel 5.15 has no arch_prepare_bpf_trampoline implementation for the aarch64 architecture and falls back to the default placeholder implementation:
    int __weak
    arch_prepare_bpf_trampoline(struct bpf_tramp_image *tr, void *image, void *image_end,
                                const struct btf_func_model *m, u32 flags,
                                struct bpf_tramp_links *tlinks,
                                void *orig_call)
    {
        return -ENOTSUPP;
    }
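So this feature is not supported on this kernel version. The good news is that, thanks to this patch[7], it has been implemented in the 6.x kernels.
Let's move on to the 6.x kernels.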
Linux 6.1
If we try to run probe on kernel 6.1, we get the following output:
    root@pine64:/home/exein# ./probe file-system-monitor
    thread 'main' panicked at 'initialization failed: ProgramAttachError { program: "lsm path_mknod", program_error: SyscallError { call: "bpf_raw_tracepoint_open", io_error: Os { code: 524, kind: Uncategorized, message: "No error information" } } }', src/bin/probe.rs43
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
On kernel 6.1 we still hit the same error as on 5.15! Let's find out why.
Running bpftrace on arch_prepare_bpf_trampoline this time, we get the following output:
    root@pine64:/home/exein# bpftrace -e 'kretprobe:arch_prepare_bpf_trampoline { printf("retval tp link: %d\n", retval); }'
    Attaching 1 probe...
    retval tp link: 284
So the problem is not here: this function no longer returns an error. Let's go back to the function call graph.
This time we launch trace-cmd and skip some functions to get cleaner output:
    trace-cmd record -p function_graph -g bpf_trampoline_link_prog -n bpf_jit_alloc_exec -n kmalloc_trace -n arch_prepare_bpf_trampoline -n generic_handle_domain_irq -n do_interrupt_handler -n irq_exit_rcu ./probe file-system-monitor
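In this command, -g selects the function whose call graph is recorded and each -n excludes one function from the trace. Since the exclusion list tends to grow while iterating, it can be convenient to build the command from a list. The following is only a sketch; `graph_cmd` is a hypothetical helper that prints the command instead of running it.

```shell
# graph_cmd: hypothetical helper. Prints (does not run) a trace-cmd invocation that
# graphs the function in $1 and excludes every further argument with -n.
graph_cmd() {
    root=$1; shift
    cmd="trace-cmd record -p function_graph -g $root"
    for fn in "$@"; do
        cmd="$cmd -n $fn"
    done
    echo "$cmd"
}
```

The printed line still needs the traced program appended, for example `eval "$(graph_cmd bpf_trampoline_link_prog kmalloc_trace irq_exit_rcu) ./probe file-system-monitor"`.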
We get the following output from trace-cmd report:
root@pine64:/home/exein# trace-cmd report CPU 0 is empty CPU 1 is empty CPU 3 is empty cpus=4 tokio-runtime-w-11886 [002] 193385.056283: funcgraph_entry: | bpf_trampoline_link_prog() { tokio-runtime-w-11886 [002] 193385.056321: funcgraph_entry: + 15.042 us | mutex_lock(); tokio-runtime-w-11886 [002] 193385.056373: funcgraph_entry: | __bpf_trampoline_link_prog() { tokio-runtime-w-11886 [002] 193385.056395: funcgraph_entry: + 14.833 us | bpf_attach_type_to_tramp(); tokio-runtime-w-11886 [002] 193385.056428: funcgraph_entry: | bpf_trampoline_update.isra.23() { tokio-runtime-w-11886 [002] 193385.056459: funcgraph_entry: 2.917 us | bpf_jit_charge_modmem(); tokio-runtime-w-11886 [002] 193385.056531: funcgraph_entry: | find_vm_area() { tokio-runtime-w-11886 [002] 193385.056540: funcgraph_entry: 3.000 us | find_vmap_area(); tokio-runtime-w-11886 [002] 193385.056547: funcgraph_exit: + 16.208 us | } tokio-runtime-w-11886 [002] 193385.056554: funcgraph_entry: | __alloc_percpu_gfp() { tokio-runtime-w-11886 [002] 193385.056563: funcgraph_entry: | pcpu_alloc() { tokio-runtime-w-11886 [002] 193385.056568: funcgraph_entry: 4.875 us | mutex_lock_killable(); tokio-runtime-w-11886 [002] 193385.056591: funcgraph_entry: | pcpu_find_block_fit() { tokio-runtime-w-11886 [002] 193385.056599: funcgraph_entry: 8.625 us | pcpu_next_fit_region.constprop.38(); tokio-runtime-w-11886 [002] 193385.056608: funcgraph_exit: + 17.166 us | } tokio-runtime-w-11886 [002] 193385.056610: funcgraph_entry: | pcpu_alloc_area() { tokio-runtime-w-11886 [002] 193385.056639: funcgraph_entry: 9.167 us | pcpu_block_update(); tokio-runtime-w-11886 [002] 193385.056656: funcgraph_entry: 7.667 us | pcpu_block_update_hint_alloc(); tokio-runtime-w-11886 [002] 193385.056671: funcgraph_entry: 7.750 us | pcpu_chunk_relocate(); tokio-runtime-w-11886 [002] 193385.056679: funcgraph_exit: + 69.667 us | } tokio-runtime-w-11886 [002] 193385.056682: funcgraph_entry: 7.042 us | mutex_unlock(); tokio-runtime-w-11886 [002] 
193385.056703: funcgraph_entry: 2.792 us | pcpu_memcg_post_alloc_hook(); tokio-runtime-w-11886 [002] 193385.056712: funcgraph_exit: ! 148.709 us | } tokio-runtime-w-11886 [002] 193385.056719: funcgraph_exit: ! 165.250 us | } tokio-runtime-w-11886 [002] 193385.056866: funcgraph_entry: | bpf_image_ksym_add() { tokio-runtime-w-11886 [002] 193385.056873: funcgraph_entry: | bpf_ksym_add() { tokio-runtime-w-11886 [002] 193385.056882: funcgraph_entry: 2.750 us | __local_bh_disable_ip(); tokio-runtime-w-11886 [002] 193385.056897: funcgraph_entry: 4.625 us | __local_bh_enable_ip(); tokio-runtime-w-11886 [002] 193385.056905: funcgraph_exit: + 32.459 us | } tokio-runtime-w-11886 [002] 193385.056922: funcgraph_entry: 7.584 us | perf_event_ksymbol(); tokio-runtime-w-11886 [002] 193385.056944: funcgraph_exit: + 78.417 us | } tokio-runtime-w-11886 [002] 193385.057492: funcgraph_entry: | set_memory_ro() { tokio-runtime-w-11886 [002] 193385.057501: funcgraph_entry: | change_memory_common() { tokio-runtime-w-11886 [002] 193385.057504: funcgraph_entry: | find_vm_area() { tokio-runtime-w-11886 [002] 193385.057506: funcgraph_entry: 8.875 us | find_vmap_area(); tokio-runtime-w-11886 [002] 193385.057518: funcgraph_exit: + 14.250 us | } tokio-runtime-w-11886 [002] 193385.057522: funcgraph_entry: | __change_memory_common() { tokio-runtime-w-11886 [002] 193385.057531: funcgraph_entry: | apply_to_page_range() { tokio-runtime-w-11886 [002] 193385.057538: funcgraph_entry: | __apply_to_page_range() { tokio-runtime-w-11886 [002] 193385.057544: funcgraph_entry: + 12.791 us | pud_huge(); tokio-runtime-w-11886 [002] 193385.057559: funcgraph_entry: 2.708 us | pmd_huge(); tokio-runtime-w-11886 [002] 193385.057574: funcgraph_entry: + 15.125 us | change_page_range(); tokio-runtime-w-11886 [002] 193385.057591: funcgraph_exit: + 53.792 us | } tokio-runtime-w-11886 [002] 193385.057597: funcgraph_exit: + 66.083 us | } tokio-runtime-w-11886 [002] 193385.057610: funcgraph_exit: + 88.125 us | } 
tokio-runtime-w-11886 [002] 193385.057619: funcgraph_entry: | vm_unmap_aliases() { tokio-runtime-w-11886 [002] 193385.057622: funcgraph_entry: | _vm_unmap_aliases.part.77() { tokio-runtime-w-11886 [002] 193385.057625: funcgraph_entry: 9.125 us | mutex_lock(); tokio-runtime-w-11886 [002] 193385.057637: funcgraph_entry: 3.084 us | purge_fragmented_blocks_allcpus(); tokio-runtime-w-11886 [002] 193385.057643: funcgraph_entry: | __purge_vmap_area_lazy() { tokio-runtime-w-11886 [002] 193385.057687: funcgraph_entry: | kmem_cache_free() { tokio-runtime-w-11886 [002] 193385.057693: funcgraph_entry: + 13.250 us | __slab_free(); tokio-runtime-w-11886 [002] 193385.057705: funcgraph_exit: + 18.750 us | } tokio-runtime-w-11886 [002] 193385.057718: funcgraph_entry: 7.416 us | __cond_resched_lock(); tokio-runtime-w-11886 [002] 193385.057733: funcgraph_exit: + 90.042 us | } tokio-runtime-w-11886 [002] 193385.057741: funcgraph_entry: 2.792 us | mutex_unlock(); tokio-runtime-w-11886 [002] 193385.057747: funcgraph_exit: ! 124.666 us | } tokio-runtime-w-11886 [002] 193385.057749: funcgraph_exit: ! 130.291 us | } tokio-runtime-w-11886 [002] 193385.057756: funcgraph_entry: | __change_memory_common() { tokio-runtime-w-11886 [002] 193385.057759: funcgraph_entry: | apply_to_page_range() { tokio-runtime-w-11886 [002] 193385.057765: funcgraph_entry: | __apply_to_page_range() { tokio-runtime-w-11886 [002] 193385.057768: funcgraph_entry: 4.125 us | pud_huge(); tokio-runtime-w-11886 [002] 193385.057778: funcgraph_entry: 8.750 us | pmd_huge(); tokio-runtime-w-11886 [002] 193385.057790: funcgraph_entry: 4.625 us | change_page_range(); tokio-runtime-w-11886 [002] 193385.057797: funcgraph_exit: + 31.958 us | } tokio-runtime-w-11886 [002] 193385.057803: funcgraph_exit: + 44.375 us | } tokio-runtime-w-11886 [002] 193385.057817: funcgraph_exit: + 61.208 us | } tokio-runtime-w-11886 [002] 193385.057820: funcgraph_exit: ! 319.292 us | } tokio-runtime-w-11886 [002] 193385.057826: funcgraph_exit: ! 
333.667 us | } tokio-runtime-w-11886 [002] 193385.057840: funcgraph_entry: | set_memory_x() { tokio-runtime-w-11886 [002] 193385.057847: funcgraph_entry: | change_memory_common() { tokio-runtime-w-11886 [002] 193385.057855: funcgraph_entry: | find_vm_area() { tokio-runtime-w-11886 [002] 193385.057858: funcgraph_entry: 2.917 us | find_vmap_area(); tokio-runtime-w-11886 [002] 193385.057870: funcgraph_exit: + 14.375 us | } tokio-runtime-w-11886 [002] 193385.057876: funcgraph_entry: | vm_unmap_aliases() { tokio-runtime-w-11886 [002] 193385.057879: funcgraph_entry: | _vm_unmap_aliases.part.77() { tokio-runtime-w-11886 [002] 193385.057882: funcgraph_entry: 3.959 us | mutex_lock(); tokio-runtime-w-11886 [002] 193385.057893: funcgraph_entry: 3.000 us | purge_fragmented_blocks_allcpus(); tokio-runtime-w-11886 [002] 193385.057900: funcgraph_entry: 2.791 us | __purge_vmap_area_lazy(); tokio-runtime-w-11886 [002] 193385.057907: funcgraph_entry: 2.709 us | mutex_unlock(); tokio-runtime-w-11886 [002] 193385.057913: funcgraph_exit: + 33.708 us | } tokio-runtime-w-11886 [002] 193385.057915: funcgraph_exit: + 43.000 us | } tokio-runtime-w-11886 [002] 193385.057922: funcgraph_entry: | __change_memory_common() { tokio-runtime-w-11886 [002] 193385.057925: funcgraph_entry: | apply_to_page_range() { tokio-runtime-w-11886 [002] 193385.057930: funcgraph_entry: | __apply_to_page_range() { tokio-runtime-w-11886 [002] 193385.057933: funcgraph_entry: 4.292 us | pud_huge(); tokio-runtime-w-11886 [002] 193385.057945: funcgraph_entry: 8.750 us | pmd_huge(); tokio-runtime-w-11886 [002] 193385.057956: funcgraph_entry: 3.958 us | change_page_range(); tokio-runtime-w-11886 [002] 193385.058037: funcgraph_exit: + 32.083 us | } tokio-runtime-w-11886 [002] 193385.058089: funcgraph_entry: 7.667 us | irq_enter_rcu(); tokio-runtime-w-11886 [002] 193385.058233: funcgraph_exit: ! 308.041 us | } tokio-runtime-w-11886 [002] 193385.058239: funcgraph_exit: ! 
316.709 us | } tokio-runtime-w-11886 [002] 193385.058247: funcgraph_exit: ! 400.417 us | } tokio-runtime-w-11886 [002] 193385.058255: funcgraph_exit: ! 415.000 us | } tokio-runtime-w-11886 [002] 193385.058555: funcgraph_entry: 8.250 us | irq_enter_rcu(); tokio-runtime-w-11886 [002] 193385.058958: funcgraph_entry: | kallsyms_lookup_size_offset() { tokio-runtime-w-11886 [002] 193385.058974: funcgraph_entry: + 36.333 us | get_symbol_pos(); tokio-runtime-w-11886 [002] 193385.059017: funcgraph_exit: + 59.750 us | } tokio-runtime-w-11886 [002] 193385.059043: funcgraph_entry: | kfree() { tokio-runtime-w-11886 [002] 193385.059057: funcgraph_entry: 3.000 us | __kmem_cache_free(); tokio-runtime-w-11886 [002] 193385.059065: funcgraph_exit: + 22.833 us | } tokio-runtime-w-11886 [002] 193385.059073: funcgraph_exit: # 2644.708 us | } tokio-runtime-w-11886 [002] 193385.059079: funcgraph_exit: # 2706.292 us | } tokio-runtime-w-11886 [002] 193385.059095: funcgraph_entry: 2.792 us | mutex_unlock(); tokio-runtime-w-11886 [002] 193385.059101: funcgraph_exit: # 2870.416 us | }
This time the program gets past arch_prepare_bpf_trampoline, set_memory_ro and set_memory_x, and the last function we see is kallsyms_lookup_size_offset.
As the bpf_trampoline_update function in kernel/bpf/trampoline.c shows, there is no explicit call to kallsyms_lookup_size_offset there:
```c
static int bpf_trampoline_update(struct bpf_trampoline *tr, bool lock_direct_mutex)
{
    // ... OTHER CODE ...

#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
again:
    if ((tr->flags & BPF_TRAMP_F_SHARE_IPMODIFY) &&
        (tr->flags & BPF_TRAMP_F_CALL_ORIG))
        tr->flags |= BPF_TRAMP_F_ORIG_STACK;
#endif

    err = arch_prepare_bpf_trampoline(im, im->image, im->image + PAGE_SIZE,
                                      &tr->func.model, tr->flags, tlinks,
                                      tr->func.addr);
    if (err < 0)
        goto out;

    set_memory_ro((long)im->image, 1);
    set_memory_x((long)im->image, 1);

    WARN_ON(tr->cur_image && tr->selector == 0);
    WARN_ON(!tr->cur_image && tr->selector);
    if (tr->cur_image)
        /* progs already running at this address */
        err = modify_fentry(tr, tr->cur_image->image, im->image, lock_direct_mutex);
    else
        /* first time registering */
        err = register_fentry(tr, im->image);

#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
    if (err == -EAGAIN) {
        /* -EAGAIN from bpf_tramp_ftrace_ops_func. Now
         * BPF_TRAMP_F_SHARE_IPMODIFY is set, we can generate the
         * trampoline again, and retry register.
         */
        /* reset fops->func and fops->trampoline for re-register */
        tr->fops->func = NULL;
        tr->fops->trampoline = 0;

        /* reset im->image memory attr for arch_prepare_bpf_trampoline */
        set_memory_nx((long)im->image, 1);
        set_memory_rw((long)im->image, 1);
        goto again;
    }
#endif
    if (err)
        goto out;

    if (tr->cur_image)
        bpf_tramp_image_put(tr->cur_image);
    tr->cur_image = im;
    tr->selector++;
out:
    /* If any error happens, restore previous flags */
    if (err)
        tr->flags = orig_flags;
    kfree(tlinks);
    return err;
}
```

> **Note:** the implementation of `bpf_trampoline_update` differs slightly from the one in kernel 5.15.

The call to `kallsyms_lookup_size_offset` is hidden inside another function. That function does not appear in the function graph because the compiler inlined it.

It looks like `kallsyms_lookup_size_offset` is called by `ftrace_location`:

```c
unsigned long ftrace_location(unsigned long ip)
{
    struct dyn_ftrace *rec;
    unsigned long offset;
    unsigned long size;

    rec = lookup_rec(ip, ip);
    if (!rec) {
        if (!kallsyms_lookup_size_offset(ip, &size, &offset))
            goto out;

        /* map sym+0 to __fentry__ */
        if (!offset)
            rec = lookup_rec(ip, ip + size - 1);
    }

    if (rec)
        return rec->ip;

out:
    return 0;
}
```
ftrace_location is called by register_fentry, which, right after that call, checks the fops field of struct bpf_trampoline *tr.
    /* first time registering */
    static int register_fentry(struct bpf_trampoline *tr, void *new_addr)
    {
        void *ip = tr->func.addr;
        unsigned long faddr;
        int ret;

        faddr = ftrace_location((unsigned long)ip);
        if (faddr) {
            if (!tr->fops)
                return -ENOTSUPP;
            tr->func.ftrace_managed = true;
        }

        if (bpf_trampoline_module_get(tr))
            return -ENOENT;

        if (tr->func.ftrace_managed) {
            ftrace_set_filter_ip(tr->fops, (unsigned long)ip, 0, 1);
            ret = register_ftrace_direct_multi(tr->fops, (long)new_addr);
        } else {
            ret = bpf_arch_text_poke(ip, BPF_MOD_CALL, NULL, new_addr);
        }

        if (ret)
            bpf_trampoline_module_put(tr);
        return ret;
    }
Indeed, if tr->fops is NULL, the function returns -ENOTSUPP.
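The branch structure of register_fentry can be summarised as a small decision table. The sketch below is a simplified model written for illustration, not kernel code: `register_fentry_path` is a hypothetical name, its two arguments stand in for the result of ftrace_location and for whether tr->fops was allocated, and the bpf_trampoline_module_get check is left out.

```shell
# register_fentry_path: hypothetical model of the branch in register_fentry (6.1).
#   $1 = 1 if ftrace_location() returned a non-zero address, else 0
#   $2 = 1 if tr->fops was allocated, else 0
# Prints the error returned early, or the function used to install the trampoline.
register_fentry_path() {
    if [ "$1" -eq 1 ]; then
        if [ "$2" -eq 0 ]; then
            echo "-ENOTSUPP"
            return
        fi
        echo "register_ftrace_direct_multi"
    else
        echo "bpf_arch_text_poke"
    fi
}
```

The first combination, `register_fentry_path 1 0`, is the one the trace points to: the target is managed by ftrace but no ftrace_ops was set up, so the attach fails with the same 524 seen from userspace.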
Let's find out where tr->fops is initialized.
If we are right, the trampoline should be created inside the bpf_trampoline_lookup function.
    static struct bpf_trampoline *bpf_trampoline_lookup(u64 key)
    {
        struct bpf_trampoline *tr;
        struct hlist_head *head;
        int i;

        mutex_lock(&trampoline_mutex);
        head = &trampoline_table[hash_64(key, TRAMPOLINE_HASH_BITS)];
        hlist_for_each_entry(tr, head, hlist) {
            if (tr->key == key) {
                refcount_inc(&tr->refcnt);
                goto out;
            }
        }
        tr = kzalloc(sizeof(*tr), GFP_KERNEL);
        if (!tr)
            goto out;
    #ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
        tr->fops = kzalloc(sizeof(struct ftrace_ops), GFP_KERNEL);
        if (!tr->fops) {
            kfree(tr);
            tr = NULL;
            goto out;
        }
        tr->fops->private = tr;
        tr->fops->ops_func = bpf_tramp_ftrace_ops_func;
    #endif

        tr->key = key;
        INIT_HLIST_NODE(&tr->hlist);
        hlist_add_head(&tr->hlist, head);
        refcount_set(&tr->refcnt, 1);
        mutex_init(&tr->mutex);
        for (i = 0; i < BPF_TRAMP_MAX; i++)
            INIT_HLIST_HEAD(&tr->progs_hlist[i]);
    out:
        mutex_unlock(&trampoline_mutex);
        return tr;
    }
After the allocation, the trampoline's fops field is populated only when the CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS option is set. That option depends on HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS, which is not available on aarch64.
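Whether a given kernel build has this option can be checked in its config file. The following is a minimal sketch, assuming the config is available as plain text (for instance a copy of /boot/config-<release>, which many distributions ship, or an unpacked /proc/config.gz); `config_enabled` is a hypothetical helper name.

```shell
# config_enabled: hypothetical helper. Succeeds if option $2 is set to y in the
# plain-text kernel config file $1, and fails otherwise.
config_enabled() {
    grep -q "^$2=y\$" "$1"
}
```

For example, `config_enabled "/boot/config-$(uname -r)" CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS && echo enabled`. The same helper works for other options relevant here, such as CONFIG_BPF_LSM and CONFIG_DEBUG_INFO_BTF.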
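Conclusion
At present, BPF LSM cannot be used on aarch64 because the ftrace direct calls feature is missing. Fortunately, the current mainline branch has already merged a patch[8] that will enable LSMs (and other features) on aarch64.
These changes are expected to ship in the upcoming Linux 6.4 release.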
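A quick first filter on a target machine is to compare its kernel release with that version. The sketch below assumes release strings of the usual major.minor.patch form, optionally followed by a suffix; `kernel_at_least` is a hypothetical helper name. The version number alone does not prove the feature is present, since it still depends on how the kernel was configured.

```shell
# kernel_at_least: hypothetical helper. Succeeds if the release string in $2
# (e.g. the output of `uname -r`) is at least the major.minor version in $1.
kernel_at_least() {
    want_major=${1%%.*}
    want_minor=${1#*.}
    have=${2%%-*}                 # strip suffixes such as "-generic"
    have_major=${have%%.*}
    rest=${have#*.}
    have_minor=${rest%%.*}
    [ "$have_major" -gt "$want_major" ] ||
        { [ "$have_major" -eq "$want_major" ] && [ "$have_minor" -ge "$want_minor" ]; }
}
```

Typical use is `kernel_at_least 6.4 "$(uname -r)" && echo "worth trying BPF LSM"`.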
Reviewing editor: 汤梓红