eBPF in Practice: Observability for virtio-net NIC Queues

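Description

In systems work, the hardest problems usually involve deciding which side of a component boundary is at fault, and the boundary between the virtio-net frontend and backend is among the hardest. The path a packet takes from the kernel to the virtio-net backend, or from the backend back to the kernel, is difficult to observe, yet some complex network jitter problems are likely caused by a NIC queue that is not working properly. To address this class of problems, we used eBPF to extend the observability of NIC queues, so that fault isolation between the virtio NIC frontend and backend is no longer a headache.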

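Introduction to the virtio-net frontend and backend drivers

virtio-net (referred to below as the virtio NIC) generally consists of two components: the virtio driver (also called the virtio frontend) and the virtio device (also called the virtio backend). The frontend runs in the guest kernel, while the backend can be provided by the host kernel. A virtio NIC usually supports multiple queues, divided into send queues and receive queues. Each queue is implemented with three rings: the avail ring, the used ring, and the desc ring. The sections below walk through how the frontend sends and receives packets, which makes the overall workflow easier to follow.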

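Sending packets in the virtio NIC frontend

The main steps by which the virtio NIC frontend sends a packet are:

a. start_xmit: the packet transmit entry point of the virtio NIC driver. It first cleans up packets that have already been sent by calling free_old_xmit_skbs, which releases the packets held in used descriptors until last_used_idx catches up with used->idx;

b. xmit_skb: adds the vnet_hdr header to the packet and represents the skb as a scatter-gather list that records the address and length of the packet data;

c. virtqueue_add_outbuf: performs the DMA mapping, adds the addresses and lengths recorded in the scatter-gather list to the desc ring, and increments avail->idx;

d. virtqueue_notify: notifies the backend when the send queue holds data.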
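
To make the index bookkeeping in the steps above concrete, here is a minimal userspace sketch of a send queue. It is not kernel code: toy_txq and the toy_* functions are hypothetical names invented for this illustration, it tracks indices only, and it assumes one descriptor per packet. The indices are modeled as free-running 16-bit counters, so the subtractions stay correct across wraparound.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of one send queue; it tracks indices only, no real buffers. */
struct toy_txq {
    uint16_t size;          /* number of descriptors in the ring */
    uint16_t avail_idx;     /* bumped by the frontend for each queued packet */
    uint16_t used_idx;      /* bumped by the backend for each packet it sent */
    uint16_t last_used_idx; /* how far the frontend has reclaimed used entries */
};

/* Descriptors still held by packets the frontend has not reclaimed yet. */
static uint16_t toy_in_flight(const struct toy_txq *q)
{
    return (uint16_t)(q->avail_idx - q->last_used_idx);
}

/* Step a: reclaim every entry the backend has marked as used. */
static uint16_t toy_free_old_xmit(struct toy_txq *q)
{
    uint16_t freed = (uint16_t)(q->used_idx - q->last_used_idx);

    q->last_used_idx = q->used_idx;
    return freed;
}

/* Steps a to c: reclaim, then queue one packet. Returns 0, or -1 if the ring is full. */
static int toy_xmit(struct toy_txq *q)
{
    assert(q->size > 0);
    (void)toy_free_old_xmit(q);
    if (toy_in_flight(q) >= q->size)
        return -1;
    q->avail_idx++; /* wraps naturally at 65536 */
    return 0;
}

/* Backend side: send up to n queued packets and mark them as used. */
static uint16_t toy_backend_consume(struct toy_txq *q, uint16_t n)
{
    uint16_t pending = (uint16_t)(q->avail_idx - q->used_idx);

    if (n > pending)
        n = pending;
    q->used_idx = (uint16_t)(q->used_idx + n);
    return n;
}
```

In this model, a backend that stops calling toy_backend_consume leaves used_idx frozen, so toy_xmit eventually returns -1. That is the kind of stuck queue the metrics later in this article are meant to expose.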

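Receiving packets in the virtio NIC frontend

The main steps by which the virtio NIC frontend receives a packet are:

a. NIC hard interrupt: the hard interrupt adds the napi to the CPU's processing queue, enables interrupt suppression, and raises the softirq;

b. net_rx_action: the entry point of the network softirq;

c. virtnet_poll: the NAPI poll callback of the virtio NIC. It first cleans the paired send queue by calling virtnet_poll_cleantx, and then receives packets from the receive queue;

d. virtnet_receive: based on used->idx, reads packet data from the descriptor ring and updates last_used_idx. The kernel allocates an skb for the packet data and passes it into the GRO path, where packets are merged;

e. try_fill_recv: adds empty buffers to the desc ring and increments avail->idx, so that the receive queue always has memory available;

f. virtqueue_napi_complete: when fewer packets were received than the budget (typically 64), there is no more data to receive. virtqueue_napi_complete is then called to mark this NAPI round as complete, and virtqueue_enable_cb_prepare turns interrupt suppression back off.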
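
The budget logic in the last step can be sketched as a small userspace model. This is a simplification under stated assumptions, not the driver's code: toy_rxq, toy_rx_interrupt and toy_poll are hypothetical names, and the sketch leaves out GRO, buffer refill, and the re-check for packets that arrive while interrupts are being re-enabled.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of one receive queue; indices only. */
struct toy_rxq {
    uint16_t used_idx;      /* advanced by the backend per delivered packet */
    uint16_t last_used_idx; /* advanced by the frontend per received packet */
    int irq_enabled;        /* 0 while interrupts are suppressed */
};

/* Step a: the hard interrupt suppresses further interrupts before polling starts. */
static void toy_rx_interrupt(struct toy_rxq *q)
{
    q->irq_enabled = 0;
}

/* Steps c to f: receive at most `budget` packets and return how many were received. */
static int toy_poll(struct toy_rxq *q, int budget)
{
    int received = 0;

    assert(budget > 0);
    while (received < budget && q->last_used_idx != q->used_idx) {
        q->last_used_idx++;
        received++;
    }
    if (received < budget)
        q->irq_enabled = 1; /* queue drained: polling is done, interrupts come back */
    return received;
}
```

When toy_poll returns exactly the budget, interrupts stay suppressed and the caller is expected to poll again; only a short round ends polling mode.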

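NIC queue observability

The analysis above identifies the key parameters of a virtio NIC queue: avail->idx, used->idx, and last_used_idx. With them we can tell how many packets a NIC queue currently holds, and derive the following observability metrics:

a. Send queue packet count: the number of packets not yet sent by the virtio NIC backend, computed as avail->idx - used->idx;

b. Receive queue packet count: the number of packets not yet received by the virtio NIC frontend, computed as used->idx - last_used_idx;

c. last_used_idx of the NIC queue: indicates how far the virtio NIC backend has progressed in processing packets;

d. Queue saturation: how much of the NIC queue is currently in use, computed as queue packet count / queue length.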
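
The formulas above can be sketched in a few lines of C. One detail worth making explicit is that the ring indices are 16-bit counters that wrap around, so the differences should be taken in 16-bit unsigned arithmetic. The struct vq_sample type and the function names are assumptions made for this sketch, not rtrace's actual data structures.

```c
#include <assert.h>
#include <stdint.h>

/* One snapshot of a queue's indices (hypothetical layout for this sketch). */
struct vq_sample {
    uint16_t avail_idx;
    uint16_t used_idx;
    uint16_t last_used_idx;
    uint16_t size; /* queue length in descriptors */
};

/* Metric a: packets queued by the frontend that the backend has not sent yet. */
static uint16_t tx_pending(const struct vq_sample *s)
{
    return (uint16_t)(s->avail_idx - s->used_idx);
}

/* Metric b: packets delivered by the backend that the frontend has not received yet. */
static uint16_t rx_pending(const struct vq_sample *s)
{
    return (uint16_t)(s->used_idx - s->last_used_idx);
}

/* Metric d: queue saturation as a percentage of the queue length. */
static double saturation_percent(uint16_t pending, uint16_t size)
{
    assert(size > 0);
    return 100.0 * pending / size;
}
```

The cast back to uint16_t matters: with avail_idx = 3 and used_idx = 65534 the send queue holds 5 packets, which a plain int subtraction would report as a large negative number.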

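How it works

We integrated the observability code into rtrace, a network diagnostics and analysis tool in SysAK, the system tool suite released by the OpenAnolis community (龙蜥社区). How rtrace itself works will be covered in a later article. For the eBPF code, see:

https://gitee.com/anolis/sysak/blob/opensource_branch_sync/source/tools/detect/net/rtrace/src/bpf/virtio.bpf.c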

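The main steps for collecting the virtio NIC queue metrics are:

a. rtrace attaches the eBPF collection programs to the kernel functions dev_id_show and dev_port_show;

b. rtrace periodically reads the two files /sys/class/net/[interface]/dev_id and /sys/class/net/[interface]/dev_port. Reading dev_id requests collection of send queue information, and reading dev_port requests collection of receive queue information;

c. Reading these files makes the kernel execute dev_id_show and dev_port_show. Because the eBPF collection programs are attached to them, the kernel runs the eBPF programs first;

d. The eBPF collection program obtains the NIC queue vring through the struct net_device argument of dev_id_show and dev_port_show, and then extracts avail idx, used idx, the queue length, and last_used_idx from the vring;

e. The data is sent to rtrace for further processing.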
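
Step b, the userspace side of the trigger, can be sketched as follows. This is an illustration rather than rtrace's implementation: the function names are hypothetical, and the sketch assumes the eBPF programs are already attached, so that each read of the sysfs file is what causes the kernel to run the corresponding show function.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build "/sys/class/net/<ifname>/<attr>". Returns 0 on success, -1 on a bad name or short buffer. */
static int sysfs_attr_path(char *buf, size_t len, const char *ifname, const char *attr)
{
    int n;

    assert(buf != NULL && ifname != NULL && attr != NULL);
    if (strchr(ifname, '/') != NULL)
        return -1; /* refuse names that would escape /sys/class/net */
    n = snprintf(buf, len, "/sys/class/net/%s/%s", ifname, attr);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read the attribute once; the value itself is discarded. Returns 0 on success. */
static int trigger_show(const char *ifname, const char *attr)
{
    char path[256];
    char line[64];
    FILE *f;
    int ok;

    if (sysfs_attr_path(path, sizeof(path), ifname, attr) != 0)
        return -1;
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    ok = fgets(line, sizeof(line), f) != NULL;
    fclose(f);
    return ok ? 0 : -1;
}

/* One collection round: dev_id for the send queues, dev_port for the receive queues. */
static int collect_once(const char *ifname)
{
    int tx = trigger_show(ifname, "dev_id");
    int rx = trigger_show(ifname, "dev_port");

    return (tx == 0 && rx == 0) ? 0 : -1;
}
```

A caller would invoke collect_once on a timer, for example once per second. Using a sysfs read as the trigger lets the sampling rate be controlled entirely from userspace.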

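Fault detection

The NIC queue information collected by rtrace is shown below. Each entry is printed as saturation/last_used_idx, with one column per queue.

At 0926, send queue 1 shows a saturation of 0.05% and a last_used_idx of 3593; at 0928 it shows 0.07% and 3593. The saturation of the send queue is increasing, while last_used_idx stays unchanged across several collection cycles. We can therefore conclude that send queue 1 has failed.

After we fixed the fault on send queue 1, the entry for send queue 1 at 0906 reads 0.00%/3599: no packets linger in the queue any more, and it is back to normal.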

0924
SendQueue 0.05%/3593  0.00%/852   0.00%/4506  0.00%/1600  0.00%/457   0.00%/509   0.00%/3140  0.00%/1352  0.00%/386   0.00%/410   0.00%/1714  0.00%/1758  0.00%/1619  0.00%/446   0.00%/3577  0.00%/2443  0.00%/46    0.00%/94    0.00%/212   0.00%/231   0.00%/146   0.00%/148   0.00%/226   0.00%/64    0.00%/109   0.00%/84    0.00%/78    0.00%/56    0.00%/87    0.00%/88    0.00%/85    0.00%/52    
RecvQueue 0.00%/2805  0.00%/13297 0.00%/475   0.00%/367   0.00%/12378 0.00%/130   0.00%/222   0.00%/11120 0.00%/355   0.00%/3016  0.00%/133   0.00%/180   0.00%/12980 0.00%/10363 0.00%/2825  0.00%/650   0.00%/151   0.00%/505   0.00%/5180  0.00%/200   0.00%/26670 0.00%/169   0.00%/1042  0.00%/9820  0.00%/9586  0.00%/3374  0.00%/229   0.00%/1402  0.00%/8796  0.00%/117   0.00%/301   0.00%/275   
0925
SendQueue 0.05%/3593  0.00%/852   0.00%/4506  0.00%/1600  0.00%/457   0.00%/509   0.00%/3140  0.00%/1352  0.00%/386   0.00%/410   0.00%/1714  0.00%/1758  0.00%/1619  0.00%/446   0.00%/3577  0.00%/2444  0.00%/46    0.00%/94    0.00%/212   0.00%/231   0.00%/146   0.00%/148   0.00%/226   0.00%/64    0.00%/109   0.00%/84    0.00%/78    0.00%/56    0.00%/87    0.00%/89    0.00%/85    0.00%/52    
RecvQueue 0.00%/2805  0.00%/13297 0.00%/475   0.00%/367   0.00%/12378 0.00%/130   0.00%/222   0.00%/11120 0.00%/355   0.00%/3016  0.00%/133   0.00%/180   0.00%/12980 0.00%/10363 0.00%/2825  0.00%/650   0.00%/151   0.00%/505   0.00%/5180  0.00%/200   0.00%/26670 0.00%/169   0.00%/1042  0.00%/9820  0.00%/9586  0.00%/3374  0.00%/229   0.00%/1402  0.00%/8796  0.00%/117   0.00%/303   0.00%/275   
0926
SendQueue 0.05%/3593  0.00%/852   0.00%/4506  0.00%/1600  0.00%/457   0.00%/509   0.00%/3140  0.00%/1352  0.00%/386   0.00%/410   0.00%/1714  0.00%/1758  0.00%/1619  0.00%/446   0.00%/3577  0.00%/2444  0.00%/46    0.00%/94    0.00%/212   0.00%/231   0.00%/146   0.00%/148   0.00%/226   0.00%/64    0.00%/109   0.00%/84    0.00%/78    0.00%/56    0.00%/87    0.00%/91    0.00%/85    0.00%/52    
RecvQueue 0.00%/2805  0.00%/13297 0.00%/475   0.00%/367   0.00%/12378 0.00%/130   0.00%/222   0.00%/11120 0.00%/355   0.00%/3016  0.00%/133   0.00%/180   0.00%/12980 0.00%/10363 0.00%/2825  0.00%/650   0.00%/151   0.00%/505   0.00%/5180  0.00%/200   0.00%/26670 0.00%/169   0.00%/1042  0.00%/9820  0.00%/9586  0.00%/3374  0.00%/229   0.00%/1402  0.00%/8796  0.00%/117   0.00%/305   0.00%/275   
0927
SendQueue 0.07%/3593  0.00%/852   0.00%/4506  0.00%/1600  0.00%/457   0.00%/509   0.00%/3140  0.00%/1352  0.00%/386   0.00%/410   0.00%/1714  0.00%/1758  0.00%/1619  0.00%/446   0.00%/3577  0.00%/2444  0.00%/46    0.00%/94    0.00%/212   0.00%/231   0.00%/146   0.00%/148   0.00%/226   0.00%/64    0.00%/109   0.00%/84    0.00%/78    0.00%/56    0.00%/87    0.00%/93    0.00%/85    0.00%/52    
RecvQueue 0.00%/2805  0.00%/13298 0.00%/475   0.00%/367   0.00%/12378 0.00%/130   0.00%/222   0.00%/11120 0.00%/355   0.00%/3016  0.00%/133   0.00%/180   0.00%/12980 0.00%/10363 0.00%/2825  0.00%/650   0.00%/151   0.00%/505   0.00%/5180  0.00%/200   0.00%/26670 0.00%/169   0.00%/1042  0.00%/9820  0.00%/9586  0.00%/3374  0.00%/229   0.00%/1402  0.00%/8796  0.00%/117   0.00%/307   0.00%/275   
0928
SendQueue 0.07%/3593  0.00%/852   0.00%/4506  0.00%/1600  0.00%/457   0.00%/509   0.00%/3140  0.00%/1352  0.00%/386   0.00%/414   0.00%/1714  0.00%/1758  0.00%/1619  0.00%/446   0.00%/3577  0.00%/2445  0.00%/46    0.00%/94    0.00%/212   0.00%/231   0.00%/146   0.00%/149   0.00%/226   0.00%/64    0.00%/109   0.00%/84    0.00%/78    0.00%/56    0.00%/87    0.00%/96    0.00%/87    0.00%/52    
RecvQueue 0.00%/2805  0.00%/13298 0.00%/475   0.00%/367   0.00%/12378 0.00%/130   0.00%/222   0.00%/11120 0.00%/355   0.00%/3016  0.00%/133   0.00%/180   0.00%/12980 0.00%/10363 0.00%/2825  0.00%/650   0.00%/151   0.00%/505   0.00%/5180  0.00%/205   0.00%/26670 0.00%/169   0.00%/1042  0.00%/9820  0.00%/9586  0.00%/3374  0.00%/229   0.00%/1402  0.00%/8797  0.00%/118   0.00%/309   0.00%/275   
0929
SendQueue 0.07%/3593  0.00%/852   0.00%/4506  0.00%/1600  0.00%/457   0.00%/509   0.00%/3140  0.00%/1352  0.00%/386   0.00%/414   0.00%/1714  0.00%/1758  0.00%/1619  0.00%/446   0.00%/3577  0.00%/2445  0.00%/46    0.00%/94    0.00%/212   0.00%/231   0.00%/146   0.00%/149   0.00%/226   0.00%/64    0.00%/109   0.00%/84    0.00%/78    0.00%/56    0.00%/87    0.00%/98    0.00%/87    0.00%/52    
RecvQueue 0.00%/2805  0.00%/13298 0.00%/475   0.00%/367   0.00%/12378 0.00%/130   0.00%/222   0.00%/11120 0.00%/355   0.00%/3016  0.00%/133   0.00%/180   0.00%/12980 0.00%/10363 0.00%/2825  0.00%/650   0.00%/151   0.00%/505   0.00%/5180  0.00%/205   0.00%/26670 0.00%/169   0.00%/1042  0.00%/9820  0.00%/9586  0.00%/3374  0.00%/229   0.00%/1402  0.00%/8797  0.00%/118   0.00%/311   0.00%/275   
0930
SendQueue 0.07%/3593  0.00%/852   0.00%/4506  0.00%/1600  0.00%/457   0.00%/509   0.00%/3140  0.00%/1352  0.00%/386   0.00%/414   0.00%/1714  0.00%/1758  0.00%/1619  0.00%/446   0.00%/3577  0.00%/2445  0.00%/46    0.00%/94    0.00%/212   0.00%/231   0.00%/146   0.00%/149   0.00%/226   0.00%/64    0.00%/109   0.00%/84    0.00%/78    0.00%/56    0.00%/87    0.00%/100   0.00%/87    0.00%/52    
RecvQueue 0.00%/2805  0.00%/13298 0.00%/475   0.00%/367   0.00%/12378 0.00%/130   0.00%/222   0.00%/11120 0.00%/355   0.00%/3016  0.00%/133   0.00%/180   0.00%/12980 0.00%/10363 0.00%/2825  0.00%/650   0.00%/151   0.00%/505   0.00%/5180  0.00%/205   0.00%/26670 0.00%/169   0.00%/1042  0.00%/9820  0.00%/9586  0.00%/3374  0.00%/229   0.00%/1402  0.00%/8797  0.00%/118   0.00%/313   0.00%/275  
// ... (omitted)
0906
SendQueue 0.00%/3599  0.00%/856   0.00%/4511  0.00%/1602  0.00%/465   0.00%/510   0.00%/3140  0.00%/1352  0.00%/386   0.00%/420   0.00%/1716  0.00%/1766  0.00%/1619  0.00%/448   0.00%/3578  0.00%/2451  0.00%/46    0.00%/94    0.00%/212   0.00%/231   0.00%/148   0.00%/149   0.00%/226   0.00%/64    0.00%/109   0.00%/85    0.00%/87    0.00%/56    0.00%/87    0.00%/101   0.00%/103   0.00%/52    
RecvQueue 0.00%/2807  0.00%/13299 0.00%/477   0.00%/369   0.00%/12378 0.00%/140   0.00%/223   0.00%/11120 0.00%/355   0.00%/3032  0.00%/142   0.00%/180   0.00%/12980 0.00%/10363 0.00%/2825  0.00%/652   0.00%/151   0.00%/505   0.00%/5180  0.00%/205   0.00%/26670 0.00%/170   0.00%/1057  0.00%/9820  0.00%/9586  0.00%/3374  0.00%/230   0.00%/1414  0.00%/8800  0.00%/118   0.00%/327   0.00%/275  
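
The reasoning applied to this output (packets keep sitting in a send queue while last_used_idx does not move) can be written as a simple check over consecutive samples. This is a hedged sketch of one possible heuristic, not a feature of rtrace: struct tx_sample and tx_queue_stalled are hypothetical names, and the check works on the pending packet count rather than on the printed saturation percentage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One sample of a send queue (hypothetical layout for this sketch). */
struct tx_sample {
    uint16_t pending;       /* avail->idx - used->idx at sampling time */
    uint16_t last_used_idx; /* last_used_idx at sampling time */
};

/*
 * Returns 1 if, across all n samples, the queue always held packets and
 * last_used_idx never moved; returns 0 otherwise or if n < 2.
 */
static int tx_queue_stalled(const struct tx_sample *s, size_t n)
{
    size_t i;

    assert(s != NULL);
    if (n < 2)
        return 0; /* a single sample cannot show lack of progress */
    for (i = 0; i < n; i++) {
        if (s[i].pending == 0)
            return 0; /* queue drained at some point */
        if (s[i].last_used_idx != s[0].last_used_idx)
            return 0; /* progress was made */
    }
    return 1;
}
```

The number of samples to require is a tuning choice: more samples lower the chance of flagging a queue that is merely busy, at the cost of slower detection.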

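Summary

In a virtio NIC, the frontend and backend communicate through shared NIC queues. Observing avail idx, used idx, last_used_idx and the metrics derived from them gives a clear view of the state of each queue, which helps with fault isolation between frontend and backend, troubleshooting, and performance tuning of the virtio NIC.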

 
