Programmable Network Data Plane: DPDK Graph Library

——lvyilong316

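The graph library is a new feature introduced in DPDK 20.05. I recently took some time to read through its key code, mainly to see whether its design has ideas worth borrowing. What follows is a summary of how it is implemented.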

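The graph architecture in DPDK abstracts data processing as a graph made of nodes and links. The idea is not new: VPP is built on a similar approach. What DPDK does is package the graph idea as a library framework. It provides graph create, destroy, and lookup, plus node clone, edge update, and edge shrink operations, so that applications can be developed on top of it.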

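What are the advantages of this graph architecture? First, it groups similar packet-processing logic into a single node, which reduces I-cache and D-cache misses during packet processing. Second, it improves reuse: processing logic abstracted as a node can appear several times along a graph path. Third, it makes packet-processing modules more flexible, since different nodes can be combined in any order to build different processing pipelines.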

Structure of a node

A node in the graph library is made up of the following parts, as shown in the figure below:

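(1) process: a callback function holding the node's core processing logic, i.e. the actual data processing. rte_graph_walk() walks the nodes and calls each node's process function. If the data must be handed to the next node once processing is done, process has to call one of the rte_node_enqueue*() functions.

(2) init: the node's initialization function. rte_graph_create() calls the init function of every node when it creates the graph.

(3) fini: the node's teardown function. rte_graph_destroy() calls the fini function of every node when it destroys the graph.

(4) Context memory: memory holding the private data the node needs. The process, init, and fini functions may all use it.

(5) nb_edges: the number of edges associated with the node.

(6) next_nodes[]: the node's neighbors. Because the graph is directed, next_nodes[] holds the downstream neighbors only.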
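To make these fields concrete, here is a minimal toy model in plain C. It is only a sketch and not the DPDK definition: struct toy_node, the toy_* typedefs, and the "toy_counter" and "toy_drop" node names are all made up for illustration, and the process callback here simply counts objects in its context memory instead of forwarding them.

```c
#include <stdint.h>
#include <stddef.h>

#define TOY_MAX_EDGES 4

struct toy_node;
typedef uint16_t (*toy_process_t)(struct toy_node *node, void **objs,
                                  uint16_t nb_objs);
typedef int (*toy_init_t)(struct toy_node *node);
typedef void (*toy_fini_t)(struct toy_node *node);

/* Hypothetical, simplified stand-in for a graph node (not the DPDK layout). */
struct toy_node {
    const char *name;
    toy_process_t process;  /* core per-burst handler */
    toy_init_t init;        /* run when the graph is created */
    toy_fini_t fini;        /* run when the graph is destroyed */
    uint64_t ctx[2];        /* private context memory */
    uint16_t nb_edges;      /* number of outgoing edges */
    const char *next_nodes[TOY_MAX_EDGES]; /* downstream neighbors, by name */
};

/* init: reset the counter kept in context memory. */
static int counter_init(struct toy_node *node)
{
    node->ctx[0] = 0;
    return 0;
}

/* process: account for the burst and report how many objects were handled. */
static uint16_t counter_process(struct toy_node *node, void **objs,
                                uint16_t nb_objs)
{
    (void)objs;
    node->ctx[0] += nb_objs;
    return nb_objs;
}

/* fini: clear the context again. */
static void counter_fini(struct toy_node *node)
{
    node->ctx[0] = 0;
}

static struct toy_node toy_counter_node = {
    .name = "toy_counter",
    .process = counter_process,
    .init = counter_init,
    .fini = counter_fini,
    .nb_edges = 1,
    .next_nodes = { [0] = "toy_drop" },
};
```

Calling init, then process on a few bursts, then fini mirrors the lifecycle that graph create, graph walk, and graph destroy drive in the real library.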

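Built-in node types

Users can implement and register their own nodes as needed, and the rte_node library in DPDK also ships a set of built-in basic nodes. Below we take the ethdev_tx node as an example and look at how a node is defined and registered.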

ethdev_tx

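The ethdev_tx node defines the transmit behavior of an Ethernet port. In rte_node, every node is defined by registering an rte_node_register structure, which carries the basic information the node needs, including its process function. The rte_node_register for ethdev_tx looks like this: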


static struct rte_node_register ethdev_tx_node_base = {
    .process = ethdev_tx_node_process,
    .name = "ethdev_tx",
    .init = ethdev_tx_node_init,
    .nb_edges = ETHDEV_TX_NEXT_MAX,
    .next_nodes = {
        [ETHDEV_TX_NEXT_PKT_DROP] = "pkt_drop",
    },
};

It is then registered with RTE_NODE_REGISTER:

RTE_NODE_REGISTER(ethdev_tx_node_base);

RTE_NODE_REGISTER is implemented as follows and ends up calling __rte_node_register.


#define RTE_NODE_REGISTER(node)                       \
    RTE_INIT(rte_node_register_##node)                \
    {                                                 \
        node.parent_id = RTE_NODE_ID_INVALID;         \
        node.id = __rte_node_register(&node);         \
    }

The main job of __rte_node_register is to allocate a struct node together with its node id and append it to the global list node_list.


rte_node_t
__rte_node_register(const struct rte_node_register *reg)
{
    struct node *node;
    rte_edge_t i;
    size_t sz;
    graph_spinlock_lock();
    sz = sizeof(struct node) + (reg->nb_edges * RTE_NODE_NAMESIZE);
    node = calloc(1, sz);
    if (node == NULL) {
        rte_errno = ENOMEM;
        goto fail;
    }
    /* Initialize the node */
    if (rte_strscpy(node->name, reg->name, RTE_NODE_NAMESIZE) < 0) {
        rte_errno = E2BIG;
        goto free;
    }
    node->flags = reg->flags;
    node->process = reg->process;
    node->init = reg->init;
    node->fini = reg->fini;
    node->nb_edges = reg->nb_edges;
    node->parent_id = reg->parent_id;
    for (i = 0; i < reg->nb_edges; i++) {
        if (rte_strscpy(node->next_nodes[i], reg->next_nodes[i],
                        RTE_NODE_NAMESIZE) < 0) {
            rte_errno = E2BIG;
            goto free;
        }
    }
    node->id = node_id++;
    /* Add the node at tail */
    STAILQ_INSERT_TAIL(&node_list, node, next);
    graph_spinlock_unlock();
    return node->id;
free:
    free(node);
fail:
    graph_spinlock_unlock();
    return RTE_NODE_ID_INVALID;
}
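The registration pattern (a macro that emits a startup constructor, which in turn appends an entry to a global list) can be tried out without DPDK. The sketch below is my own simplified model, not DPDK code: every toy_* name is hypothetical, it relies on the GCC/Clang constructor attribute (the mechanism I understand RTE_INIT to be built on), and it leaves out the spinlock and the edge names.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TOY_NAMESIZE 64
#define TOY_ID_INVALID UINT32_MAX

/* Hypothetical registration record, loosely modeled on rte_node_register. */
struct toy_reg {
    const char *name;
    uint32_t id;
};

/* Hypothetical internal entry kept on a global singly linked list. */
struct toy_entry {
    char name[TOY_NAMESIZE];
    uint32_t id;
    struct toy_entry *next;
};

static struct toy_entry *toy_head;
static struct toy_entry **toy_tail = &toy_head;
static uint32_t toy_next_id;

/* Allocate an entry, assign the next id, and append it at the tail. */
static uint32_t toy_register(const struct toy_reg *reg)
{
    struct toy_entry *e;

    if (reg == NULL || reg->name == NULL ||
        strlen(reg->name) >= TOY_NAMESIZE)
        return TOY_ID_INVALID;
    e = calloc(1, sizeof(*e));
    if (e == NULL)
        return TOY_ID_INVALID;
    strcpy(e->name, reg->name); /* length checked above */
    e->id = toy_next_id++;
    *toy_tail = e;
    toy_tail = &e->next;
    return e->id;
}

/* Look up an id by name; TOY_ID_INVALID if the name was never registered. */
static uint32_t toy_lookup(const char *name)
{
    const struct toy_entry *e;

    for (e = toy_head; e != NULL; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e->id;
    return TOY_ID_INVALID;
}

/* Emit a constructor that registers the node before main() runs. */
#define TOY_NODE_REGISTER(node)                                        \
    static void __attribute__((constructor)) toy_register_##node(void) \
    {                                                                  \
        node.id = toy_register(&node);                                 \
    }

static struct toy_reg toy_rx_node = { .name = "toy_rx" };
TOY_NODE_REGISTER(toy_rx_node)

static struct toy_reg toy_tx_node = { .name = "toy_tx" };
TOY_NODE_REGISTER(toy_tx_node)
```

By the time main() starts, both records already carry valid ids, which is why an application can refer to nodes by name when it builds a graph later.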

The corresponding data structures are shown below:

Next, let's look at the process function of the ethdev_tx node, ethdev_tx_node_process.


static uint16_t
ethdev_tx_node_process(struct rte_graph *graph, struct rte_node *node,
                       void **objs, uint16_t nb_objs)
{
    ethdev_tx_node_ctx_t *ctx = (ethdev_tx_node_ctx_t *)node->ctx;
    uint16_t port, queue;
    uint16_t count;
    /* Get Tx port id */
    port = ctx->port;
    queue = ctx->queue;
    count = rte_eth_tx_burst(port, queue, (struct rte_mbuf **)objs,
                             nb_objs);
    /* Redirect unsent pkts to drop node */
    if (count != nb_objs) {
        rte_node_enqueue(graph, node, ETHDEV_TX_NEXT_PKT_DROP,
                         &objs[count], nb_objs - count);
    }
    return count;
}

Here the role of node->ctx is easy to see: it stores the port and queue that process needs.

ethdev_rx

ethdev_rx is the node that handles packet reception on an Ethernet port. Its rte_node_register is defined as follows:


static struct rte_node_register ethdev_rx_node_base = {
    .process = ethdev_rx_node_process,
    .flags = RTE_NODE_SOURCE_F,
    .name = "ethdev_rx",
    .init = ethdev_rx_node_init,
    .nb_edges = ETHDEV_RX_NEXT_MAX,
    .next_nodes = {[ETHDEV_RX_NEXT_IP4_LOOKUP] = "ip4_lookup"},
};

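Unlike ethdev_tx, this one carries the extra flag RTE_NODE_SOURCE_F, which marks it as a source node. In a DPDK graph, a source node is a special kind of node that acts as a starting point of the graph, and a graph may have several of them. Also note that the node's neighbor is statically defined here as ip4_lookup.

Now let's look at its process function, ethdev_rx_node_process.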


static __rte_always_inline uint16_t
ethdev_rx_node_process(struct rte_graph *graph, struct rte_node *node,
                       void **objs, uint16_t cnt)
{
    ethdev_rx_node_ctx_t *ctx = (ethdev_rx_node_ctx_t *)node->ctx;
    uint16_t n_pkts = 0;
    RTE_SET_USED(objs);
    RTE_SET_USED(cnt);
    n_pkts = ethdev_rx_node_process_inline(graph, node, ctx->port_id,
                                           ctx->queue_id);
    return n_pkts;
}

Its main work is done by ethdev_rx_node_process_inline.


static __rte_always_inline uint16_t
ethdev_rx_node_process_inline(struct rte_graph *graph, struct rte_node *node,
                              uint16_t port, uint16_t queue)
{
    uint16_t count, next_index = ETHDEV_RX_NEXT_IP4_LOOKUP;
    /* Get pkts from port */
    count = rte_eth_rx_burst(port, queue, (struct rte_mbuf **)node->objs,
                             RTE_GRAPH_BURST_SIZE);
    if (!count)
        return 0;
    node->idx = count;
    /* Enqueue to next node */
    rte_node_next_stream_move(graph, node, next_index);
    return count;
}

After packets are received with rte_eth_rx_burst, rte_node_next_stream_move hands them to the next node. That next node is fixed in advance by next_index = ETHDEV_RX_NEXT_IP4_LOOKUP, i.e. the ip4_lookup node.

ip4_rewrite

This is the IPv4 rewrite node. It rewrites the IPv4 and Ethernet headers and is configured through rte_node_ip4_rewrite_add.

ip4_lookup

The IPv4 lookup node: it combines IPv4 header extraction with an LPM lookup. The application populates the routing table through rte_node_ip4_route_add.


static struct rte_node_register ip4_lookup_node = {
    .process = ip4_lookup_node_process,
    .name = "ip4_lookup",
    .init = ip4_lookup_node_init,
    .nb_edges = RTE_NODE_IP4_LOOKUP_NEXT_MAX,
    .next_nodes = {
        [RTE_NODE_IP4_LOOKUP_NEXT_REWRITE] = "ip4_rewrite",
        [RTE_NODE_IP4_LOOKUP_NEXT_PKT_DROP] = "pkt_drop",
    },
};

Its neighbors are ip4_rewrite and pkt_drop.

null

The null node: a skeleton node that defines the general structure of a node.

pkt_drop

The packet drop node: it frees the packets it receives back to their respective mempools.

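Ways to add a node to a graph

There are several ways to link a node into a graph:

1. Provide and initialize the next_nodes[] array when the node is registered. As mentioned above, next_nodes[] stores the node's downstream neighbors, so it also expresses the links between nodes. This is known as the static method.

2. Use rte_node_edge_get(), rte_node_edge_update(), and rte_node_edge_shrink() to update next_nodes[] at run time. This has to happen before the graph is created.

3. Use rte_node_clone() to clone an existing node. This suits scale-out scenarios where more CPUs in the system are brought into the forwarding path. The clone copies all attributes of the original node except its context memory, which holds the port and queue pair information.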

Creating the graph object

A graph is created with rte_graph_create, and this function is also the key to understanding the DPDK graph architecture.


rte_graph_t
rte_graph_create(const char *name, struct rte_graph_param *prm)
{
    rte_node_t src_node_count;
    struct graph *graph;
    const char *pattern;
    uint16_t i;
    graph_spinlock_lock();
    /* Check arguments sanity */
    if (prm == NULL)
        SET_ERR_JMP(EINVAL, fail, "Param should not be NULL");
    if (name == NULL)
        SET_ERR_JMP(EINVAL, fail, "Graph name should not be NULL");
    /* Check for existence of duplicate graph */
    /* Search the global graph list for a graph with the same name */
    STAILQ_FOREACH(graph, &graph_list, next)
        if (strncmp(name, graph->name, RTE_GRAPH_NAMESIZE) == 0)
            SET_ERR_JMP(EEXIST, fail, "Found duplicate graph %s",
                        name);
    /* Create graph object */
    /* Allocate the graph structure */
    graph = calloc(1, sizeof(*graph));
    if (graph == NULL)
        SET_ERR_JMP(ENOMEM, fail, "Failed to calloc graph object");
    /* Initialize the graph object */
    STAILQ_INIT(&graph->node_list);
    if (rte_strscpy(graph->name, name, RTE_GRAPH_NAMESIZE) < 0)
        SET_ERR_JMP(E2BIG, free, "Too big name=%s", name);
    /* Expand node pattern and add the nodes to the graph */
    for (i = 0; i < prm->nb_node_patterns; i++) {
        pattern = prm->node_patterns[i];
        if (expand_pattern_to_node(graph, pattern))
            goto graph_cleanup;
    }
    /* Go over all the nodes edges and add them to the graph */
    if (graph_node_edges_add(graph))
        goto graph_cleanup;
    /* Update adjacency list of all nodes in the graph */
    if (graph_adjacency_list_update(graph))
        goto graph_cleanup;
    /* Make sure at least a source node present in the graph */
    src_node_count = graph_src_nodes_count(graph);
    if (src_node_count == 0)
        goto graph_cleanup;
    /* Make sure no node is pointing to source node */
    if (graph_node_has_edge_to_src_node(graph))
        goto graph_cleanup;
    /* Don't allow node has loop to self */
    if (graph_node_has_loop_edge(graph))
        goto graph_cleanup;
    /* Do BFS from src nodes on the graph to find isolated nodes */
    if (graph_has_isolated_node(graph))
        goto graph_cleanup;
    /* Initialize graph object */
    graph->socket = prm->socket_id;
    graph->src_node_count = src_node_count;
    graph->node_count = graph_nodes_count(graph);
    graph->id = graph_id;
    /* Allocate the Graph fast path memory and populate the data */
    if (graph_fp_mem_create(graph))
        goto graph_cleanup;
    /* Call init() of the all the nodes in the graph */
    if (graph_node_init(graph))
        goto graph_mem_destroy;
    /* All good, Lets add the graph to the list */
    graph_id++;
    STAILQ_INSERT_TAIL(&graph_list, graph, next);
    graph_spinlock_unlock();
    return graph->id;
graph_mem_destroy:
    graph_fp_mem_destroy(graph);
graph_cleanup:
    graph_cleanup(graph);
free:
    free(graph);
fail:
    graph_spinlock_unlock();
    return RTE_GRAPH_ID_INVALID;
}

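I expanded this function's implementation and annotated it, which gives the figure below. The nested helper functions are not described separately.

The core of this function is to create and initialize a struct graph, including its nodes and their neighbor relations, and then add the struct graph to the global list graph_list. The related data structures are shown in the figure below. Note that every struct graph has a corresponding low-level struct rte_graph, and every struct node has a corresponding struct rte_node.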
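The four sanity checks that rte_graph_create runs before it allocates fast-path memory (at least one source node, no edge into a source node, no self loop, no node unreachable from the sources) are easy to reproduce on a toy adjacency matrix. The sketch below is my own illustration rather than the DPDK helpers: struct toy_graph, toy_graph_validate, and the negative return codes are hypothetical.

```c
#include <stdbool.h>

#define TOY_MAX_NODES 8

/* Hypothetical graph description: edge[a][b] means a -> b. */
struct toy_graph {
    int nb_nodes;
    bool is_src[TOY_MAX_NODES];
    bool edge[TOY_MAX_NODES][TOY_MAX_NODES];
};

/*
 * Returns 0 when the graph passes all checks, otherwise a negative code:
 * -1 no source node, -2 edge into a source node, -3 self loop,
 * -4 node not reachable from any source, -5 bad node count.
 */
static int toy_graph_validate(const struct toy_graph *g)
{
    bool seen[TOY_MAX_NODES] = { false };
    int queue[TOY_MAX_NODES];
    int qh = 0, qt = 0, nb_src = 0;
    int a, b;

    if (g->nb_nodes < 0 || g->nb_nodes > TOY_MAX_NODES)
        return -5;

    /* Count the source nodes and seed the BFS queue with them. */
    for (a = 0; a < g->nb_nodes; a++) {
        if (g->is_src[a]) {
            nb_src++;
            seen[a] = true;
            queue[qt++] = a;
        }
    }
    if (nb_src == 0)
        return -1;

    /* No node may point at a source node. */
    for (a = 0; a < g->nb_nodes; a++)
        for (b = 0; b < g->nb_nodes; b++)
            if (g->edge[a][b] && g->is_src[b])
                return -2;

    /* No node may have an edge to itself. */
    for (a = 0; a < g->nb_nodes; a++)
        if (g->edge[a][a])
            return -3;

    /* BFS from all sources; each node enters the queue at most once. */
    while (qh < qt) {
        a = queue[qh++];
        for (b = 0; b < g->nb_nodes; b++) {
            if (g->edge[a][b] && !seen[b]) {
                seen[b] = true;
                queue[qt++] = b;
            }
        }
    }
    for (a = 0; a < g->nb_nodes; a++)
        if (!seen[a])
            return -4;

    return 0;
}
```

A chain such as node 0 (source) -> node 1 -> node 2 passes all checks; removing the last edge leaves node 2 isolated and yields -4.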

Walking the graph

Once the graph is created, it has to be set in motion. That job is done mainly by rte_graph_walk.


/**
 * Perform graph walk on the circular buffer and invoke the process function
 * of the nodes and collect the stats.
 *
 * @param graph
 *   Graph pointer returned from rte_graph_lookup function.
 *
 * @see rte_graph_lookup()
 */
__rte_experimental
static inline void
rte_graph_walk(struct rte_graph *graph)
{
    const rte_graph_off_t *cir_start = graph->cir_start;
    const rte_node_t mask = graph->cir_mask;
    uint32_t head = graph->head;
    struct rte_node *node;
    uint64_t start;
    uint16_t rc;
    void **objs;
    /*
     * Walk on the source node(s) ((cir_start - head) -> cir_start) and then
     * on the pending streams (cir_start -> (cir_start + mask) -> cir_start)
     * in a circular buffer fashion.
     *
     *  +-----+ <= cir_start - head [number of source nodes]
     *  |     |
     *  | ... | <= source nodes
     *  |     |
     *  +-----+ <= cir_start [head = 0] [tail = 0]
     *  |     |
     *  | ... | <= pending streams
     *  |     |
     *  +-----+ <= cir_start + mask
     */
    while (likely(head != graph->tail)) {
        node = RTE_PTR_ADD(graph, cir_start[(int32_t)head++]);
        RTE_ASSERT(node->fence == RTE_GRAPH_FENCE);
        objs = node->objs;
        rte_prefetch0(objs);
        if (rte_graph_has_stats_feature()) { /* graph stats feature enabled */
            start = rte_rdtsc();
            rc = node->process(graph, node, objs, node->idx);
            node->total_cycles += rte_rdtsc() - start;
            node->total_calls++;
            node->total_objs += rc;
        } else {
            node->process(graph, node, objs, node->idx);
        }
        node->idx = 0;
        head = likely((int32_t)head > 0) ? head & mask : head;
    }
    graph->tail = 0;
}

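To understand this process we need to look at the layout of rte_graph again, shown in the figure below. The function walks the nodes starting from graph->head and calls each one's process function. graph->head is initialized as follows when the graph is created: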

graph->head = (int32_t)-_graph->src_node_count;

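That is, the negative of src_node_count. From the rte_graph layout it follows that cir_start[head] points at the first source node. So the function first walks from cir_start[-src_node_count] up to cir_start[0], visiting every source node once, and then walks the pending streams in the ring between cir_start[0] and cir_start[mask], i.e. the nodes that had objects enqueued to them during this walk, until head catches up with tail. Why walk the source nodes first? The reason is simple: source nodes usually produce the raw data. ethdev_rx, for example, pulls packets in from the NIC, and without source nodes the downstream nodes would have nothing to process.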
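The negative head index and the pending-stream ring are easier to follow in a small simulation. The sketch below is a toy model under my own assumptions and not the DPDK data layout: it stores node pointers instead of offsets, counts objects instead of carrying mbufs, gives each node at most one downstream edge, and lets the single source node "receive" two objects per walk. All toy_* names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define TOY_NB_SRC 1  /* number of source nodes */
#define TOY_RING   4  /* pending-stream ring size, a power of two */

struct toy_walk_node {
    int is_src;
    uint16_t idx;                /* objects waiting to be processed */
    uint32_t calls;
    uint32_t total_objs;
    struct toy_walk_node *next;  /* single downstream edge, or NULL */
};

struct toy_walk_graph {
    struct toy_walk_node *slots[TOY_NB_SRC + TOY_RING];
    struct toy_walk_node **cir_start; /* points at slots + TOY_NB_SRC */
    int32_t head;                     /* starts at -TOY_NB_SRC */
    int32_t tail;
    int32_t mask;
};

/* Hand nb objects to a node; schedule it when it had nothing pending. */
static void toy_enqueue(struct toy_walk_graph *g, struct toy_walk_node *to,
                        uint16_t nb)
{
    if (nb == 0)
        return;
    if (to->idx == 0) {
        g->cir_start[g->tail] = to;
        g->tail = (g->tail + 1) & g->mask;
    }
    to->idx += nb;
}

/* Toy process callback: a source produces 2 objects, others pass them on. */
static void toy_process(struct toy_walk_graph *g, struct toy_walk_node *node)
{
    uint16_t nb = node->is_src ? 2 : node->idx;

    node->calls++;
    node->total_objs += nb;
    if (node->next != NULL)
        toy_enqueue(g, node->next, nb);
}

/* Same control flow as the walk loop: sources first, then pending nodes. */
static void toy_walk(struct toy_walk_graph *g)
{
    int32_t head = g->head;

    while (head != g->tail) {
        struct toy_walk_node *node = g->cir_start[head++];

        toy_process(g, node);
        node->idx = 0;
        head = head > 0 ? (head & g->mask) : head;
    }
    g->tail = 0;
}

/* Build the chain nodes[0] (source) -> nodes[1] -> nodes[2]. */
static void toy_build(struct toy_walk_graph *g, struct toy_walk_node nodes[3])
{
    memset(g, 0, sizeof(*g));
    memset(nodes, 0, 3 * sizeof(nodes[0]));
    nodes[0].is_src = 1;
    nodes[0].next = &nodes[1];
    nodes[1].next = &nodes[2];
    g->cir_start = g->slots + TOY_NB_SRC;
    g->cir_start[-1] = &nodes[0]; /* source nodes sit at negative indexes */
    g->head = -TOY_NB_SRC;
    g->tail = 0;
    g->mask = TOY_RING - 1;
}
```

In one walk the source is visited at index -1 and schedules nodes[1] into slot 0; nodes[1] in turn schedules nodes[2] into slot 1. The loop ends when head reaches tail, and tail is reset so the next walk starts from the sources again. A node that receives nothing is never scheduled and therefore never called.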

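Graph statistics

The DPDK graph also comes with statistics: it can count how many times each node in the graph was called and how many cycles each node consumed. The data can be fetched with rte_graph_cluster_stats_get(), and the output looks like the figure below.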

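l3fwd-graph example walkthrough

l3fwd-graph is a reimplementation of the l3fwd example on top of the graph architecture, and it is a good example for learning how to use the graph library.

l3fwd-graph uses the built-in DPDK nodes ethdev_rx, ip4_lookup, ip4_rewrite, ethdev_tx, and pkt_drop, and creates one graph on each forwarding core to implement layer-3 forwarding.

The startup arguments look like this: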


./build/l3fwd-graph -l 1,2 -n 4 -- -p 0x3 --config="(0,0,1),(1,0,2)"

Here --config (port,queue,lcore)[,(port,queue,lcore)] describes the binding between port, queue, and lcore. The sections below walk through the implementation of this example.
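To show what the --config value encodes, here is a small standalone parser for the "(port,queue,lcore)" tuple list. It is a sketch written for this article and not the parser used by l3fwd-graph; struct toy_lcore_param and toy_parse_config are hypothetical names, and the 16-bit range check is my own assumption.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical record for one (port,queue,lcore) binding. */
struct toy_lcore_param {
    uint16_t port_id;
    uint16_t queue_id;
    uint16_t lcore_id;
};

/*
 * Parse "(p,q,l)[,(p,q,l)...]" into out[0..max-1].
 * Returns the number of tuples parsed, or -1 on malformed input or overflow.
 */
static int toy_parse_config(const char *s, struct toy_lcore_param *out,
                            int max)
{
    int n = 0;

    while (*s != '\0') {
        unsigned int port, queue, lcore;
        int used = 0;

        if (n == max)
            return -1;
        /* %n is only stored once the closing ')' has matched. */
        if (sscanf(s, "(%u,%u,%u)%n", &port, &queue, &lcore, &used) != 3 ||
            used == 0)
            return -1;
        if (port > UINT16_MAX || queue > UINT16_MAX || lcore > UINT16_MAX)
            return -1;
        out[n].port_id = (uint16_t)port;
        out[n].queue_id = (uint16_t)queue;
        out[n].lcore_id = (uint16_t)lcore;
        n++;
        s += used;
        if (*s == ',') {
            s++;
            if (*s == '\0')
                return -1; /* trailing comma */
        } else if (*s != '\0') {
            return -1;     /* unexpected character between tuples */
        }
    }
    return n;
}
```

For the command line above, "(0,0,1),(1,0,2)" yields two bindings: queue 0 of port 0 is polled by lcore 1, and queue 0 of port 1 by lcore 2.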

Graph Node Pre-Init Configuration

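Like any regular DPDK program, l3fwd-graph starts by calling rte_eth_dev_configure for each port, then rte_eth_tx_queue_setup to set up the Tx queues and rte_eth_rx_queue_setup to set up the Rx queues, and finally rte_eth_dev_start to start each port. This generic logic is not covered here; we focus only on the graph-related parts.

The first of these is the call to rte_node_eth_config after rte_eth_tx_queue_setup and rte_eth_rx_queue_setup. This function clones the ethdev_rx and ethdev_tx nodes. The clones are named ethdev_rx-X-Y and ethdev_tx-X, where X and Y are the port id and the queue id. So there is one ethdev_tx clone per port, with its Tx queue id given by the graph id, and one ethdev_rx clone per (port, queue) pair.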

Graph Initialization

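A graph object is created on each forwarding lcore. Depending on the Rx and Tx configuration of the core, its graph contains the matching ethdev_rx and ethdev_tx nodes: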


static const char *const default_patterns[] = {
    "ip4*",
    "ethdev_tx-*",
    "pkt_drop",
};
const char **node_patterns;
uint16_t nb_patterns;
/* ... */
/* Create a graph object per lcore with common nodes and
 * lcore specific nodes based on application arguments
 */
nb_patterns = RTE_DIM(default_patterns);
node_patterns = malloc((MAX_RX_QUEUE_PER_LCORE + nb_patterns) *
                       sizeof(*node_patterns));
memcpy(node_patterns, default_patterns,
       nb_patterns * sizeof(*node_patterns));
memset(&graph_conf, 0, sizeof(graph_conf));
/* Common set of nodes in every lcore's graph object */
graph_conf.node_patterns = node_patterns;
for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
    /* ... */
    /* Skip graph creation if no source exists */
    if (!qconf->n_rx_queue)
        continue;
    /* Add rx node patterns of this lcore based on --config */
    for (i = 0; i < qconf->n_rx_queue; i++) {
        graph_conf.node_patterns[nb_patterns + i] =
            qconf->rx_queue_list[i].node_name;
    }
    graph_conf.nb_node_patterns = nb_patterns + i;
    graph_conf.socket_id = rte_lcore_to_socket_id(lcore_id);
    snprintf(qconf->name, sizeof(qconf->name), "worker_%u", lcore_id);
    graph_id = rte_graph_create(qconf->name, &graph_conf);
    /* ... */
    qconf->graph = rte_graph_lookup(qconf->name);
    /* ... */
}

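Note: graph creation fails if the set of node patterns passed in (given in shell pattern form) does not satisfy the nodes' inter-dependencies, or if any given node pattern matches no node.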
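The patterns "ip4*" and "ethdev_tx-*" read like shell globs, so a reasonable way to picture the expansion step is POSIX fnmatch() applied to the names of all registered nodes. The sketch below works on a plain array of names and is only my approximation of what expand_pattern_to_node does; toy_expand_pattern and the name list are hypothetical.

```c
#include <fnmatch.h>
#include <stddef.h>

/*
 * Collect every name matching a shell-style pattern.
 * Writes at most max_out matches into out and returns how many were stored.
 * A return value of 0 corresponds to the "pattern matches no node" failure.
 */
static size_t toy_expand_pattern(const char *pattern,
                                 const char *const *names, size_t nb_names,
                                 const char **out, size_t max_out)
{
    size_t i, n = 0;

    for (i = 0; i < nb_names && n < max_out; i++) {
        /* fnmatch() returns 0 when the name matches the pattern. */
        if (fnmatch(pattern, names[i], 0) == 0)
            out[n++] = names[i];
    }
    return n;
}

/* Hypothetical set of registered node names for a two-port setup. */
static const char *const toy_node_names[] = {
    "ip4_lookup", "ip4_rewrite", "pkt_drop",
    "ethdev_tx-0", "ethdev_tx-1",
    "ethdev_rx-0-0", "ethdev_rx-1-0",
};
```

With this name list, "ip4*" selects ip4_lookup and ip4_rewrite, "ethdev_tx-*" selects both Tx clones, and an exact name such as "ethdev_rx-0-0" selects only the Rx clone bound to that lcore.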

Forwarding data(Route, Next-Hop) addition

Once the graph is created, rte_node_ip4_route_add() and rte_node_ip4_rewrite_add() can be used to add rules to the ip4_lookup and ip4_rewrite nodes respectively.


/* Add route to ip4 graph infra */
for (i = 0; i < IPV4_L3FWD_LPM_NUM_ROUTES; i++) {
    /* ... */
    dst_port = ipv4_l3fwd_lpm_route_array[i].if_out;
    next_hop = i;
    /* ... */
    ret = rte_node_ip4_route_add(ipv4_l3fwd_lpm_route_array[i].ip,
                                 ipv4_l3fwd_lpm_route_array[i].depth, next_hop,
                                 RTE_NODE_IP4_LOOKUP_NEXT_REWRITE);
    /* ... */
    memcpy(rewrite_data, val_eth + dst_port, rewrite_len);
    /* Add next hop for a given destination */
    ret = rte_node_ip4_rewrite_add(next_hop, rewrite_data,
                                   rewrite_len, dst_port);
    RTE_LOG(INFO, L3FWD_GRAPH, "Added route %s, next_hop %u\n",
            route_str, next_hop);
}

Packet Forwarding using Graph Walk

With the graph configured, forwarding can start. The core function is graph_main_loop.


/* Main processing loop */
static int
graph_main_loop(void *conf)
{
    // ...
    lcore_id = rte_lcore_id();
    qconf = &lcore_conf[lcore_id];
    graph = qconf->graph;
    RTE_LOG(INFO, L3FWD_GRAPH,
            "Entering main loop on lcore %u, graph %s(%p)\n", lcore_id,
            qconf->name, graph);
    /* Walk over graph until signal to quit */
    while (likely(!force_quit))
        rte_graph_walk(graph);
    return 0;
}

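In other words, each core calls rte_graph_walk on its own graph. As analyzed above, rte_graph_walk first visits the graph's source nodes, here ethdev_rx-X-Y, then the other nodes that have pending objects, and calls the process function of each node it visits.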
