epoll's Basic Data Structures

科技绿洲 2023-11-10


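I. epoll's basic data structures

Before digging into the source code, let us look at the data structures epoll uses: eventpoll, epitem, and eppoll_entry.

1. eventpoll

We start with eventpoll. The kernel creates this structure when we call epoll_create, and it represents one epoll instance. Later calls to epoll_ctl and epoll_wait all operate on this eventpoll object, which is stored in the private_data field of the anonymous file created by epoll_create.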

/*
 * This structure is stored inside the "private_data" member of the file
 * structure and represents the main data structure for the eventpoll
 * interface.
 */
struct eventpoll {
    /* Protect the access to this structure */
    spinlock_t lock;

    /*
     * This mutex is used to ensure that files are not removed
     * while epoll is using them. This is held during the event
     * collection loop, the file cleanup path, the epoll file exit
     * code and the ctl operations.
     */
    struct mutex mtx;

    /* Wait queue used by sys_epoll_wait() */
    // Processes that called epoll_wait and are blocked sleep on this queue
    wait_queue_head_t wq;

    /* Wait queue used by file->poll() */
    // Waiters that poll this eventpoll itself are queued here:
    // an eventpoll is also a file, so it supports the poll operation too
    wait_queue_head_t poll_wait;

    /* List of ready file descriptors */
    // The fds whose events are ready; each list element is an epitem (see below)
    struct list_head rdllist;

    /* RB tree root used to store monitored fd structs */
    // Red-black tree used to look up a monitored fd quickly
    struct rb_root_cached rbr;

    /*
     * This is a single linked list that chains all the "struct epitem" that
     * happened while transferring ready events to userspace w/out
     * holding ->lock.
     */
    struct epitem *ovflist;

    /* wakeup_source used when ep_scan_ready_list is running */
    struct wakeup_source *ws;

    /* The user that created the eventpoll descriptor */
    struct user_struct *user;

    // The anonymous file backing this eventpoll, a clear example of
    // the Linux idea that everything is a file
    struct file *file;

    /* used to optimize loop detection check */
    int visited;
    struct list_head visited_list_link;

#ifdef CONFIG_NET_RX_BUSY_POLL
    /* used to track busy poll napi_id */
    unsigned int napi_id;
#endif
};
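The poll_wait field exists because an epoll descriptor can itself be polled. The user-space sketch below illustrates this by registering one epoll instance inside another; the function name nested_epoll_is_readable is made up for this example, and a pipe stands in for a real socket.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Registers a pipe in an "inner" epoll instance and the inner instance in
 * an "outer" one, then makes the pipe readable.
 * Returns the number of events the outer instance reports (expected: 1),
 * or -1 if the setup fails.
 */
int nested_epoll_is_readable(void)
{
    int pipefd[2];
    int inner = -1, outer = -1, n = -1;
    struct epoll_event ev = { 0 }, out = { 0 };

    if (pipe(pipefd) < 0)
        return -1;
    inner = epoll_create1(0);
    outer = epoll_create1(0);
    if (inner < 0 || outer < 0)
        goto done;

    /* inner watches the read end of the pipe */
    ev.events = EPOLLIN;
    ev.data.fd = pipefd[0];
    if (epoll_ctl(inner, EPOLL_CTL_ADD, pipefd[0], &ev) < 0)
        goto done;

    /* outer watches the inner epoll fd, which is just another file */
    ev.events = EPOLLIN;
    ev.data.fd = inner;
    if (epoll_ctl(outer, EPOLL_CTL_ADD, inner, &ev) < 0)
        goto done;

    if (write(pipefd[1], "x", 1) != 1)
        goto done;

    n = epoll_wait(outer, &out, 1, 1000);
    if (n == 1)
        assert(out.data.fd == inner);
done:
    if (outer >= 0)
        close(outer);
    if (inner >= 0)
        close(inner);
    close(pipefd[0]);
    close(pipefd[1]);
    return n;
}
```

Once the pipe becomes readable, the inner instance has a ready item, so the outer instance sees the inner descriptor as readable.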

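2. epitem

Every time we call epoll_ctl to add an fd, the kernel creates an epitem instance and inserts it as a node into the red-black tree of the eventpoll structure (the rbr field). From then on, checking whether an event has occurred on a given fd goes through its epitem in the tree.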

/*
 * Each file descriptor added to the eventpoll interface will
 * have an entry of this type linked to the "rbr" RB tree.
 * Avoid increasing the size of this struct, there can be many thousands
 * of these on a server and we do not want this to take another cache line.
 */
struct epitem {
    union {
        /* RB tree node links this structure to the eventpoll RB tree */
        struct rb_node rbn;
        /* Used to free the struct epitem */
        struct rcu_head rcu;
    };

    /* List header used to link this structure to the eventpoll ready list */
    // List node that links this epitem into the rdllist of its eventpoll
    struct list_head rdllink;

    /*
     * Works together "struct eventpoll"->ovflist in keeping the
     * single linked chain of items.
     */
    struct epitem *next;

    /* The file descriptor information this item refers to */
    // The fd that epoll is monitoring
    struct epoll_filefd ffd;

    /* Number of active wait queue attached to poll operations */
    // How many wait queues this epitem is currently hooked into;
    // ep_ptable_queue_proc sets it to -1 to signal an error
    int nwait;

    /* List containing poll wait queues */
    struct list_head pwqlist;

    /* The "container" of this item */
    // The eventpoll this epitem belongs to
    struct eventpoll *ep;

    /* List header used to link this item to the "struct file" items list */
    struct list_head fllink;

    /* wakeup_source used when EPOLLWAKEUP is set */
    struct wakeup_source __rcu *ws;

    /* The structure that describe the interested events and the source fd */
    struct epoll_event event;
};

3. eppoll_entry

Each time an fd is associated with an epoll instance, an eppoll_entry is created. Its structure is as follows:

/* Wait structure used by the poll hooks */
struct eppoll_entry {
    /* List header used to link this structure to the "struct epitem" */
    struct list_head llink;

    /* The "base" pointer is set to the container "struct epitem" */
    struct epitem *base;

    /*
     * Wait queue item that will be linked to the target file wait
     * queue head.
     */
    wait_queue_entry_t wait;

    /* The wait queue head that linked the "wait" wait queue item */
    wait_queue_head_t *whead;
};

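II. How epoll works underneath

Consider a high-concurrency scenario in which one million users hold TCP connections to a single process, while at any given moment only a few dozen or a few hundred of those connections are active (receiving TCP packets). At each moment the process only needs to handle a small fraction of the one million connections.

In this scenario, the select and poll event-driven approaches rely on polling: to collect the connections that have events, the process hands all one million connections to the operating system on every call. The problem is obvious: at any moment, most of these connections have no events at all. Passing a million sockets to the kernel on every call means a large copy from user-space memory to kernel memory, which is very expensive, and having the kernel scan every connection for pending events wastes even more resources. Yet this is exactly what select and poll do, which is why in practice they handle at most a few thousand concurrent connections.

epoll takes a different approach. It sets up a simple file system inside the Linux kernel and splits what used to be a single select or poll call into three parts:

int epoll_create(int size);
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);

Call epoll_create to create an epoll object (resources for this handle are allocated in the epoll file system);
Call epoll_ctl to add the sockets of the one million connections to the epoll object;
Call epoll_wait to collect the connections on which events have occurred.

This way, the process creates a single epoll object at startup and adds or removes connections as needed. Collecting events is then very efficient, because epoll_wait is not passed the one million connections and the kernel does not have to traverse all of them.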
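The three calls can be sketched in user space as follows. This is a minimal illustration rather than server code: a pipe stands in for a TCP connection, and the function name wait_for_one_pipe_event is invented for the example.

```c
#include <assert.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Creates an epoll object, registers the read end of a pipe, makes it
 * readable, and collects the event.
 * Returns the number of ready events (expected: 1), or -1 on failure.
 */
int wait_for_one_pipe_event(void)
{
    int pipefd[2];
    int epfd, n = -1;
    struct epoll_event ev, ready[8];

    if (pipe(pipefd) < 0)
        return -1;

    /* Step 1: create the epoll object (size must be > 0, otherwise unused) */
    epfd = epoll_create(1);
    if (epfd < 0)
        goto out_pipe;

    /* Step 2: register the descriptor once, not on every wait */
    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLIN;
    ev.data.fd = pipefd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) < 0)
        goto out_ep;

    /* Simulate incoming data on the "connection" */
    if (write(pipefd[1], "hi", 2) != 2)
        goto out_ep;

    /* Step 3: collect only the descriptors that have events */
    n = epoll_wait(epfd, ready, 8, 1000);
    if (n == 1)
        assert(ready[0].data.fd == pipefd[0]);
out_ep:
    close(epfd);
out_pipe:
    close(pipefd[0]);
    close(pipefd[1]);
    return n;
}
```

Note that the set of monitored descriptors is passed to the kernel only in step 2; the epoll_wait call in step 3 carries just an output buffer.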

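1. epoll_create

When we call epoll_create, the kernel creates a file node in the epoll file system and a red-black tree in the kernel cache to hold the sockets later passed in through epoll_ctl. It also creates the rdllist doubly linked list, which holds the events that are ready. When epoll_wait is called, it only has to check whether rdllist contains data: if it does, epoll_wait returns; if not, the caller sleeps, and once the timeout expires it returns even if the list is still empty. That is why epoll_wait is so efficient.

Red-black tree operations are protected by a mutex, which must be held when adding or removing nodes.

The doubly linked list is protected by a spinlock. A thread that fails to get the lock spins instead of sleeping, which keeps list operations fast; adding and removing entries also requires the lock.

In short, the red-black tree stores the nodes for all monitored file descriptors, and the ready list stores the nodes for the file descriptors that are ready.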


The epoll_create workflow

First, the flags argument (the one epoll_create1 takes) is given a simple validity check.

    /* Check the EPOLL_* constant for consistency. */
    BUILD_BUG_ON(EPOLL_CLOEXEC != O_CLOEXEC);

    if (flags & ~EPOLL_CLOEXEC)
        return -EINVAL;

Next, the kernel allocates the memory that eventpoll needs.

    /*
     * Create the internal data structure ("struct eventpoll").
     */
    error = ep_alloc(&ep);
    if (error < 0)
        return error;

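Next, epoll_create allocates an anonymous file and a file descriptor for the epoll instance: fd is the file descriptor and file is the anonymous file. This again reflects the UNIX idea that everything is a file. Note that the eventpoll instance keeps a reference to the anonymous file, and fd_install binds the anonymous file to the file descriptor.

One point deserves special attention: in the call to anon_inode_getfile, epoll_create stores the eventpoll as the private_data of the anonymous file. Later, when the kernel looks things up through the file descriptor of the epoll instance, it can quickly locate the eventpoll object.

Finally, the file descriptor is returned to the caller of epoll_create as the handle of the epoll instance.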

    /*
     * Creates all the items needed to setup an eventpoll file. That is,
     * a file structure and a free file descriptor.
     */
    fd = get_unused_fd_flags(O_RDWR | (flags & O_CLOEXEC));
    if (fd < 0) {
        error = fd;
        goto out_free_ep;
    }
    file = anon_inode_getfile("[eventpoll]", &eventpoll_fops, ep,
                              O_RDWR | (flags & O_CLOEXEC));
    if (IS_ERR(file)) {
        error = PTR_ERR(file);
        goto out_free_fd;
    }
    ep->file = file;
    fd_install(fd, file);
    return fd;
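The flag handling described above can be observed from user space. The sketch below assumes the caller uses epoll_create1, the variant that takes flags; the two function names are invented for the example.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Returns 1 if the descriptor created with EPOLL_CLOEXEC carries
 * FD_CLOEXEC, 0 if it does not, -1 if the descriptor cannot be created.
 */
int cloexec_is_applied(void)
{
    int epfd = epoll_create1(EPOLL_CLOEXEC);
    int fdflags, result;

    if (epfd < 0)
        return -1;
    fdflags = fcntl(epfd, F_GETFD);
    result = (fdflags >= 0 && (fdflags & FD_CLOEXEC)) ? 1 : 0;
    close(epfd);
    return result;
}

/*
 * Passes a flag bit other than EPOLL_CLOEXEC and returns the resulting
 * errno (0 if the call unexpectedly succeeds).
 */
int errno_for_bad_flags(void)
{
    int epfd = epoll_create1(EPOLL_CLOEXEC | 0x1);

    if (epfd >= 0) {
        close(epfd);
        return 0;
    }
    return errno;
}
```

The second function exercises the `flags & ~EPOLL_CLOEXEC` check from the kernel excerpt: any bit besides EPOLL_CLOEXEC is rejected with EINVAL.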

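2. epoll_ctl

Next, let us see how a socket is added to an epoll instance, which means walking through the implementation of epoll_ctl.

Looking up the epoll instance

First, epoll_ctl uses the epoll handle to get the corresponding anonymous file. This is easy to understand: in UNIX everything is a file, and an epoll instance is an anonymous file too.

    // Get the anonymous file behind the epoll instance
    f = fdget(epfd);
    if (!f.file)
        goto error_return;

Next, it gets the file behind the socket being added. Here tf stands for "target file", the file to be operated on.

    /* Get the "struct file *" for the target file */
    // Get the actual file, such as a listening socket or a connected socket
    tf = fdget(fd);
    if (!tf.file)
        goto error_fput;

Then comes a series of checks to make sure the arguments passed in by the user are valid, for example that epfd really is an epoll handle and not an ordinary file descriptor.

    /* The target file descriptor must support poll */
    // A file descriptor that does not support poll is not valid here
    error = -EPERM;
    if (!tf.file->f_op->poll)
        goto error_tgt_fput;
    ...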
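These checks surface in user space as errno values. The sketch below triggers two of them; the function names are invented for the example, and the root directory "/" is used only as a convenient file that does not support poll.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Tries to add a directory descriptor, which has no poll support.
 * Returns the errno of the failed epoll_ctl, 0 if it unexpectedly
 * succeeds, or -1 if the setup fails.
 */
int errno_for_unpollable_target(void)
{
    int epfd = epoll_create1(0);
    int dirfd = open("/", O_RDONLY);
    struct epoll_event ev = { 0 };
    int err = 0;

    ev.events = EPOLLIN;
    if (epfd < 0 || dirfd < 0)
        err = -1;
    else if (epoll_ctl(epfd, EPOLL_CTL_ADD, dirfd, &ev) < 0)
        err = errno;

    if (dirfd >= 0)
        close(dirfd);
    if (epfd >= 0)
        close(epfd);
    return err;
}

/*
 * Passes a pipe end where an epoll descriptor is expected.
 * Returns the errno of the failed epoll_ctl, 0 if it unexpectedly
 * succeeds, or -1 if the setup fails.
 */
int errno_for_non_epoll_epfd(void)
{
    int p[2];
    struct epoll_event ev = { 0 };
    int err = 0;

    if (pipe(p) < 0)
        return -1;
    ev.events = EPOLLIN;
    if (epoll_ctl(p[0], EPOLL_CTL_ADD, p[1], &ev) < 0)
        err = errno;
    close(p[0]);
    close(p[1]);
    return err;
}
```

The first case corresponds to the -EPERM branch in the excerpt above; the second corresponds to the check that epfd really refers to an epoll instance, which fails with EINVAL.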

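Red-black tree lookup

Next, epoll_ctl uses the target file and its descriptor to check whether the socket is already in the red-black tree; this lookup is one reason epoll is efficient. A red-black tree (RB-tree) is a common data structure. Here eventpoll uses it to track every file descriptor currently being monitored, and the root of the tree is stored in the eventpoll structure.

Every monitored file descriptor has a corresponding epitem, which is stored as a node in the red-black tree.

A red-black tree is a binary search tree, so its nodes must be comparable in order to build an ordered tree. For epitem, the ordering comes from the epoll_filefd structure, which can be thought of simply as the monitored file descriptor and serves as the key of the node in the tree.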
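The ordering can be modeled in user space as below. This is a sketch, not kernel code: struct filefd_key and filefd_cmp are hypothetical stand-ins for struct epoll_filefd and the kernel's comparison helper, and the sketch assumes the key is the pair (file pointer, fd), compared by file pointer first and by fd second.

```c
#include <stdint.h>

/* Hypothetical user-space stand-in for the kernel's struct epoll_filefd. */
struct filefd_key {
    const void *file;   /* stands in for struct file * */
    int fd;
};

/*
 * Three-way comparison: negative if a sorts before b, zero if the keys
 * are equal, positive if a sorts after b.
 * Keys are ordered by file pointer first and by fd second.
 */
int filefd_cmp(const struct filefd_key *a, const struct filefd_key *b)
{
    uintptr_t fa = (uintptr_t)a->file;
    uintptr_t fb = (uintptr_t)b->file;

    if (fa != fb)
        return fa > fb ? 1 : -1;
    return (a->fd > b->fd) - (a->fd < b->fd);
}
```

With a total order like this, a lookup in the tree walks left or right at each node according to the sign of the comparison, so finding a monitored descriptor takes a number of steps logarithmic in the number of registered items.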

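ep_insert

ep_insert first checks whether the number of files being watched has exceeded the limit preset in /proc/sys/fs/epoll/max_user_watches, and returns an error if it has. It then allocates resources and initializes the item.

    if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
        return -ENOMEM;

    /* Item initialization follow here ... */
    INIT_LIST_HEAD(&epi->rdllink);
    INIT_LIST_HEAD(&epi->fllink);
    INIT_LIST_HEAD(&epi->pwqlist);
    epi->ep = ep;
    ep_set_ffd(&epi->ffd, tfile, fd);
    epi->event = *event;
    epi->nwait = 0;
    epi->next = EP_UNACTIVE_PTR;

What follows is very important: ep_insert sets up a callback for every file descriptor that is added. The setup is done by ep_ptable_queue_proc, and the callback it installs is ep_poll_callback. What is the callback for? Whenever an event occurs on the file descriptor, for example when data arrives in the socket buffer, this function is called. As you can see, kernel design is full of event callbacks as well.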

/*
 * This is the callback that is used to add our wait queue to the
 * target file wakeup lists.
 */
static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
                                 poll_table *pt)
{
    struct epitem *epi = ep_item_from_epqueue(pt);
    struct eppoll_entry *pwq;

    if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
        init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
        pwq->whead = whead;
        pwq->base = epi;
        if (epi->event.events & EPOLLEXCLUSIVE)
            add_wait_queue_exclusive(whead, &pwq->wait);
        else
            add_wait_queue(whead, &pwq->wait);
        list_add_tail(&pwq->llink, &epi->pwqlist);
        epi->nwait++;
    } else {
        /* We have to signal that an error occurred */
        epi->nwait = -1;
    }
}

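In short, when we register a socket with epoll_ctl(), the kernel does the following:

Allocates an epitem object, the red-black tree node
Adds a wait entry to the wait queue of the socket
Inserts the epitem into the red-black tree of the epoll object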
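The callback idea behind these steps can be shown with a toy user-space model. Nothing here is kernel code: every toy_ name is invented, fixed-size arrays replace the kernel lists, and locking is left out. The model only illustrates how a callback hooked onto a wait queue moves an item onto a ready list when the source wakes its waiters.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WAITERS 4
#define MAX_READY   8

struct toy_item;
typedef void (*toy_wake_fn)(struct toy_item *item);

struct toy_epoll {                       /* models struct eventpoll */
    struct toy_item *ready[MAX_READY];   /* models rdllist */
    size_t nready;
};

struct toy_item {                        /* models struct epitem */
    int fd;
    int on_ready_list;
    struct toy_epoll *ep;
};

struct toy_waitqueue {                   /* models the wait queue of the target file */
    toy_wake_fn fn[MAX_WAITERS];
    struct toy_item *item[MAX_WAITERS];
    size_t n;
};

/* Plays the role of ep_poll_callback: put the item on the ready list once. */
static void toy_poll_callback(struct toy_item *item)
{
    struct toy_epoll *ep = item->ep;

    if (!item->on_ready_list && ep->nready < MAX_READY) {
        ep->ready[ep->nready++] = item;
        item->on_ready_list = 1;
    }
}

/* Plays the role of ep_ptable_queue_proc: hook the callback onto the queue. */
int toy_register(struct toy_waitqueue *wq, struct toy_item *item)
{
    if (wq->n >= MAX_WAITERS)
        return -1;
    wq->fn[wq->n] = toy_poll_callback;
    wq->item[wq->n] = item;
    wq->n++;
    return 0;
}

/* Plays the role of the data source waking its waiters when data arrives. */
void toy_wake_up(struct toy_waitqueue *wq)
{
    for (size_t i = 0; i < wq->n; i++)
        wq->fn[i](wq->item[i]);
}

/* Registers one item, wakes the queue twice, returns the ready-list length. */
size_t toy_demo(void)
{
    struct toy_epoll ep = { .nready = 0 };
    struct toy_item item = { .fd = 5, .on_ready_list = 0, .ep = &ep };
    struct toy_waitqueue wq = { .n = 0 };

    if (toy_register(&wq, &item) < 0)
        return 0;
    toy_wake_up(&wq);
    toy_wake_up(&wq);
    assert(ep.ready[0] == &item);
    return ep.nready;
}
```

Two wake-ups still leave a single entry on the ready list, because the model links an item only if it is not already there.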

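3. epoll_wait

When epoll_wait is called, it checks whether the eventpoll->rdllist list contains data. If it does, epoll_wait returns. If not, it creates a wait queue entry, adds it to the wait queue of the eventpoll (the wq field of type wait_queue_head_t shown in the eventpoll structure), and then blocks the calling process.

Looking up the epoll instance

epoll_wait starts with a series of checks, for example that maxevents is greater than 0. As in epoll_ctl, it uses the epoll handle to find the corresponding anonymous file and descriptor and validates them.

The eventpoll instance is again obtained by reading the private_data of the anonymous file behind the epoll instance.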

    /* The maximum number of event must be greater than zero */
    if (maxevents <= 0 || maxevents > EP_MAX_EVENTS)
        return -EINVAL;

    /* Verify that the area passed by the user is writeable */
    if (!access_ok(VERIFY_WRITE, events, maxevents * sizeof(struct epoll_event)))
        return -EFAULT;

    /* Get the "struct file *" for the eventpoll file */
    f = fdget(epfd);
    if (!f.file)
        return -EBADF;

    /*
     * We have to check that the file structure underneath the fd
     * the user passed to us _is_ an eventpoll file.
     */
    error = -EINVAL;
    if (!is_file_epoll(f.file))
        goto error_fput;
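The argument checks in this excerpt can be reproduced from user space. The sketch below uses invented function names and a pipe end as the "not an epoll file" case.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Calls epoll_wait with maxevents == 0; returns the resulting errno. */
int errno_for_zero_maxevents(void)
{
    struct epoll_event ev;
    int epfd = epoll_create1(0);
    int err = 0;

    if (epfd < 0)
        return -1;
    if (epoll_wait(epfd, &ev, 0, 0) < 0)
        err = errno;
    close(epfd);
    return err;
}

/* Calls epoll_wait on a pipe end instead of an epoll fd; returns errno. */
int errno_for_wait_on_pipe(void)
{
    struct epoll_event ev;
    int p[2];
    int err = 0;

    if (pipe(p) < 0)
        return -1;
    if (epoll_wait(p[0], &ev, 1, 0) < 0)
        err = errno;
    close(p[0]);
    close(p[1]);
    return err;
}

/* With nothing registered and a zero timeout, epoll_wait returns 0 at once. */
int empty_wait_result(void)
{
    struct epoll_event ev;
    int epfd = epoll_create1(0);
    int n;

    if (epfd < 0)
        return -1;
    n = epoll_wait(epfd, &ev, 1, 0);
    close(epfd);
    return n;
}
```

Both failure cases map to the -EINVAL paths shown above: the first to the maxevents range check, the second to the is_file_epoll check.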

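4. Summary

epoll_create() creates the red-black tree and the ready list;
epoll_ctl(), when adding a socket, checks whether it is already in the red-black tree: if so it returns immediately, otherwise it inserts the socket into the tree and registers a callback with the kernel that puts the item on the ready list when an event arrives;
epoll_wait() only has to return the data on the ready list.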
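The tree lookup in epoll_ctl is visible from user space through two errno values: adding a descriptor that is already registered fails with EEXIST, and modifying one that was never registered fails with ENOENT. A minimal sketch, with invented function names and a pipe as the monitored file:

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Runs one epoll_ctl operation and returns 0 on success or errno on failure. */
static int ctl_errno(int epfd, int op, int fd)
{
    struct epoll_event ev = { 0 };

    ev.events = EPOLLIN;
    ev.data.fd = fd;
    return epoll_ctl(epfd, op, fd, &ev) < 0 ? errno : 0;
}

/* Adds the same descriptor twice; returns the errno of the second add. */
int errno_for_duplicate_add(void)
{
    int p[2];
    int epfd, err = -1;

    if (pipe(p) < 0)
        return -1;
    epfd = epoll_create1(0);
    if (epfd >= 0) {
        if (ctl_errno(epfd, EPOLL_CTL_ADD, p[0]) == 0)
            err = ctl_errno(epfd, EPOLL_CTL_ADD, p[0]);
        close(epfd);
    }
    close(p[0]);
    close(p[1]);
    return err;
}

/* Modifies a descriptor that was never added; returns the errno. */
int errno_for_mod_unregistered(void)
{
    int p[2];
    int epfd, err = -1;

    if (pipe(p) < 0)
        return -1;
    epfd = epoll_create1(0);
    if (epfd >= 0) {
        err = ctl_errno(epfd, EPOLL_CTL_MOD, p[0]);
        close(epfd);
    }
    close(p[0]);
    close(p[1]);
    return err;
}
```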

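III. epoll's two trigger modes

epoll has two trigger modes, level-triggered (LT) and edge-triggered (ET). LT is the default; ET, requested with the EPOLLET flag, is the "high-speed" mode.

In LT (level-triggered) mode, as long as the file descriptor still has data to read, every epoll_wait call returns its event to remind the program to handle it.

Compared with ET, LT involves the extra step of switching the EPOLLOUT event on and off (which costs system calls and context switches). For a listening sockfd, level-triggered mode is the better choice (see nginx); with edge-triggered mode some clients may fail to connect under high concurrency. LT suits urgent events. For a connfd used for reading and writing, blocking and non-blocking behave the same in level-triggered mode, but non-blocking is still recommended to guard against corner cases. LT programming is close to poll/select, matches long-standing habits, and is less error-prone.

In ET (edge-triggered) mode, when an I/O event is detected, epoll_wait returns the file descriptors that have event notifications. For each notified descriptor, if it is readable, you must keep reading until it is empty, that is, until the read fails with errno set to EAGAIN. Otherwise the next epoll_wait will not report the remaining data until a new event arrives, and the event is effectively lost. If the descriptor is not non-blocking in ET mode, this read-until-empty or write-until-full loop is bound to block on its final call.

Edge-triggered mode greatly reduces the number of times the same epoll event is reported, so it is more efficient. For a connfd used for reading and writing, ET requires non-blocking I/O, and all available data must be read or written in one go. ET code can be more concise and, in some scenarios, more efficient, but it is also easier to miss events and introduce bugs.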
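The difference between the two modes can be observed with a small experiment, sketched below with invented function names and a pipe as the data source. The first function writes once and then calls epoll_wait twice without reading; the second is the read-until-EAGAIN loop that ET mode requires, assuming the descriptor was already set to non-blocking.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Pass 0 for level-triggered or EPOLLET for edge-triggered.
 * Returns how many of two consecutive epoll_wait calls report the pipe
 * as readable when the data is never read, or -1 on failure.
 */
int count_reports_without_reading(uint32_t trigger_flag)
{
    int p[2];
    int epfd, reports = -1;
    struct epoll_event ev = { 0 }, out;

    if (pipe(p) < 0)
        return -1;
    epfd = epoll_create1(0);
    if (epfd < 0)
        goto out_pipe;

    ev.events = EPOLLIN | trigger_flag;
    ev.data.fd = p[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) < 0)
        goto out_ep;
    if (write(p[1], "data", 4) != 4)
        goto out_ep;

    reports = 0;
    for (int i = 0; i < 2; i++) {
        int n = epoll_wait(epfd, &out, 1, 100);
        if (n < 0) {
            reports = -1;
            break;
        }
        reports += n;
    }
out_ep:
    close(epfd);
out_pipe:
    close(p[0]);
    close(p[1]);
    return reports;
}

/*
 * Reads from a non-blocking descriptor until it is empty (EAGAIN) or at
 * end of file. Returns the number of bytes read, or -1 on a real error.
 */
long drain_nonblocking(int fd)
{
    char buf[4096];
    long total = 0;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            total += n;
            continue;
        }
        if (n == 0)
            return total;                 /* end of file */
        if (errno == EINTR)
            continue;                     /* interrupted, try again */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;                 /* nothing left to read */
        return -1;
    }
}
```

In LT mode both epoll_wait calls report the descriptor, because the data is still there. In ET mode only the first call does, which is why the handler has to drain the descriptor before waiting again.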

In short, ET and LT each have their strengths and weaknesses; choose the mode that best fits the workload.

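IV. epoll's shortcomings

1. The timeout precision is limited, only to about the 5 ms level, whereas select can reach about 0.1 ms (epoll_wait takes its timeout as a whole number of milliseconds, while select takes a struct timeval with a microsecond field);

2. When there are few connections and all of them are very active, select and poll may outperform epoll;

3. epoll_ctl can modify only one fd per call (kevent can change several at once); every change costs epoll a system call and cannot be batched, which may hurt performance;

4. epoll_wait may return before the timeout expires, so another epoll_wait call is needed;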
