QoS control is divided into ingress and egress. This article mainly analyzes the egress path.
Debugging requires the tc utility from iproute2:
Linux Traffic Control is configured with the utility tc. It is part of the iproute2 package. Some Linux distributions already include tc, but it may be an old version, without support for Diffserv.
The version used for debugging here is iproute2-4.2.0.
Building iproute2 requires the following dependencies:
bison
flex
libdb-dev
int main(int argc, char **argv)
{
    int ret;
    char *batch_file = NULL;

    while (argc > 1) {
        if (argv[1][0] != '-')
            break;
        if (matches(argv[1], "-stats") == 0 ||
            matches(argv[1], "-statistics") == 0) {
            ++show_stats;
        } else if (matches(argv[1], "-details") == 0) {
            ++show_details;
        } else if (matches(argv[1], "-raw") == 0) {
            ++show_raw;
        } else if (matches(argv[1], "-pretty") == 0) {
            ++show_pretty;
        } else if (matches(argv[1], "-graph") == 0) {
            show_graph = 1;
        } else if (matches(argv[1], "-Version") == 0) {
            printf("tc utility, iproute2-ss%s\n", SNAPSHOT);
            return 0;
        } else if (matches(argv[1], "-iec") == 0) {
            ++use_iec;
        } else if (matches(argv[1], "-help") == 0) {
            usage();
            return 0;
        } else if (matches(argv[1], "-force") == 0) {
            ++force;
        } else if (matches(argv[1], "-batch") == 0) {
            argc--; argv++;
            if (argc <= 1)
                usage();
            batch_file = argv[1];
        } else if (matches(argv[1], "-netns") == 0) {
            NEXT_ARG();
            if (netns_switch(argv[1]))
                return -1;
        } else if (matches(argv[1], "-names") == 0 ||
                   matches(argv[1], "-nm") == 0) {
            use_names = true;
        } else if (matches(argv[1], "-cf") == 0 ||
                   matches(argv[1], "-conf") == 0) {
            NEXT_ARG();
            conf_file = argv[1];
        } else {
            fprintf(stderr, "Option \"%s\" is unknown, try \"tc -help\".\n", argv[1]);
            return -1;
        }
        argc--; argv++;
    }

    if (batch_file)
        return batch(batch_file);

    if (argc <= 1) {
        usage();
        return 0;
    }

    tc_core_init();
    if (rtnl_open(&rth, 0) < 0) {
        fprintf(stderr, "Cannot open rtnetlink\n");
        exit(1);
    }

    if (use_names && cls_names_init(conf_file)) {
        ret = -1;
        goto Exit;
    }

    ret = do_cmd(argc-1, argv+1);
Exit:
    rtnl_close(&rth);

    if (use_names)
        cls_names_uninit();

    return ret;
}
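The option names in main are compared with matches(), which as far as I can tell treats the user's argument as a prefix of the full keyword and returns 0 on a match. The following is a minimal sketch of that behavior, not the iproute2 source itself; the function name prefix_match is made up for this illustration.

```c
#include <string.h>

/*
 * Prefix match in the style of tc's matches():
 * returns 0 when cmd is a (possibly complete) prefix of pattern,
 * and a non-zero value otherwise.
 * prefix_match is a hypothetical name used only in this sketch.
 */
int prefix_match(const char *cmd, const char *pattern)
{
    size_t len = strlen(cmd);

    /* An argument longer than the keyword can never be a prefix of it. */
    if (len > strlen(pattern))
        return -1;

    /* Compare only the characters the user actually typed. */
    return memcmp(pattern, cmd, len);
}
```

With this rule, "-s" already selects "-stats" and "qd" selects "qdisc", which is why the order of the comparisons in main and do_cmd matters: the first keyword that the argument is a prefix of wins.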
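2. Batch handling: batch commands can be placed in a file, which tc parses and loads automatically (specified with -batch filename).
3. tc_core_init, rtnl_open and cls_names_init
tc_core_init sets up things such as the clock tick parameters;
rtnl_open creates the socket: rth->fd = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, protocol); // protocol: NETLINK_ROUTE
It also sets the send and receive buffers, which opens the netlink communication channel.
5. do_cmd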
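To make the rtnl_open step concrete, here is a minimal sketch of opening a NETLINK_ROUTE socket, sizing its buffers and binding it. It is an illustration under assumptions, not the libnetlink code: the function name open_rtnetlink and the two buffer sizes are chosen for this example.

```c
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Open and bind a NETLINK_ROUTE socket; returns the fd, or -1 on failure. */
int open_rtnetlink(void)
{
    int sndbuf = 32768;        /* illustrative size, an assumption of this sketch */
    int rcvbuf = 1024 * 1024;  /* illustrative size, an assumption of this sketch */
    struct sockaddr_nl local;
    int fd;

    fd = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, NETLINK_ROUTE);
    if (fd < 0)
        return -1;

    /* Set the send and receive buffer sizes. */
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf)) < 0 ||
        setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)) < 0) {
        close(fd);
        return -1;
    }

    /* Bind to a kernel-assigned port id, with no multicast groups. */
    memset(&local, 0, sizeof(local));
    local.nl_family = AF_NETLINK;
    local.nl_groups = 0;
    if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The caller owns the returned descriptor and should close() it when done, much as tc does with rtnl_close.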
static int do_cmd(int argc, char **argv)
{
    if (matches(*argv, "qdisc") == 0)
        return do_qdisc(argc-1, argv+1);
    if (matches(*argv, "class") == 0)
        return do_class(argc-1, argv+1);
    if (matches(*argv, "filter") == 0)
        return do_filter(argc-1, argv+1);
    if (matches(*argv, "actions") == 0)
        return do_action(argc-1, argv+1);
    if (matches(*argv, "monitor") == 0)
        return do_tcmonitor(argc-1, argv+1);
    if (matches(*argv, "exec") == 0)
        return do_exec(argc-1, argv+1);
    if (matches(*argv, "help") == 0) {
        usage();
        return 0;
    }

    fprintf(stderr, "Object \"%s\" is unknown, try \"tc help\".\n",
            *argv);
    return -1;
}
tc qdisc del dev eth0 root
tc qdisc add dev eth0 root handle 1: htb r2q 1
tc class add dev eth0 parent 1: classid 1:1 htb rate 20mbit ceil 100mbit
tc class add dev eth0 parent 1:1 classid 1:10 htb rate 200kbit ceil 250kbit
tc filter add dev eth0 parent 1: protocol ip prio 16 u32 match ip dst 192.168.1.0/24 flowid 1:10
For qdisc add, do_qdisc calls:
tc_qdisc_modify(RTM_NEWQDISC, NLM_F_EXCL|NLM_F_CREATE, argc-1, argv+1);
cmd = RTM_NEWQDISC
flags = NLM_F_EXCL|NLM_F_CREATE
argc
argv ---> dev
NEXT_ARG:
utils.h (include): #define NEXT_ARG() do { argv++; if (--argc <= 0) incomplete_command(); } while(0)
utils.h (include): #define NEXT_ARG_OK() (argc - 1 > 0)
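1. dev ---> d = eth0
2. The root parameter:
req.t.tcm_parent = TC_H_ROOT;
3. The handle parameter: this handle is the 1: passed on the command line (major number 1).
req.t.tcm_handle = handle;
The handle parameter is normally followed by the specific qdisc algorithm.
4. htb ---> falls through to the default branch:
q = get_qdisc_kind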
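The four steps above can be sketched as a small argument walker. This is a simplified illustration, not tc_qdisc_modify itself: the struct and function names are hypothetical, the keywords are matched exactly rather than by prefix, and the handle parser assumes the "major:" form with a hexadecimal major number stored in the upper 16 bits.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_TC_H_ROOT 0xFFFFFFFFU  /* same value as the kernel's TC_H_ROOT */

/* Hypothetical holder for what the walk extracts from the command line. */
struct qdisc_req_sketch {
    const char *dev;    /* interface name, e.g. "eth0" */
    const char *kind;   /* qdisc algorithm, e.g. "htb" */
    uint32_t parent;    /* would go into tcm_parent */
    uint32_t handle;    /* would go into tcm_handle */
};

/* Parse a "1:" style handle: hex major number placed in the upper 16 bits. */
int parse_handle(const char *s, uint32_t *out)
{
    char *end;
    unsigned long maj = strtoul(s, &end, 16);

    if (end == s || (*end != ':' && *end != '\0') || maj > 0xFFFF)
        return -1;
    *out = (uint32_t)maj << 16;
    return 0;
}

/* Walk argv the way NEXT_ARG does: advance, and fail if nothing is left. */
int parse_qdisc_args(int argc, char **argv, struct qdisc_req_sketch *req)
{
    memset(req, 0, sizeof(*req));
    while (argc > 0) {
        if (strcmp(*argv, "dev") == 0) {
            argv++;
            if (--argc <= 0)
                return -1;              /* incomplete command */
            req->dev = *argv;
        } else if (strcmp(*argv, "root") == 0) {
            req->parent = SKETCH_TC_H_ROOT;
        } else if (strcmp(*argv, "handle") == 0) {
            argv++;
            if (--argc <= 0)
                return -1;              /* incomplete command */
            if (parse_handle(*argv, &req->handle) < 0)
                return -1;
        } else {
            req->kind = *argv;          /* default branch: the algorithm name */
            break;                      /* the rest belongs to that algorithm */
        }
        argc--;
        argv++;
    }
    return 0;
}
```

For "dev eth0 root handle 1: htb r2q 1" this yields dev eth0, parent 0xFFFFFFFF, handle 0x00010000 and kind htb, leaving "r2q 1" for the algorithm's own option parser.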
struct qdisc_util {
    struct qdisc_util *next;
    const char *id;
    int (*parse_qopt)(struct qdisc_util *qu, int argc, char **argv, struct nlmsghdr *n);
    int (*print_qopt)(struct qdisc_util *qu, FILE *f, struct rtattr *opt);
    int (*print_xstats)(struct qdisc_util *qu, FILE *f, struct rtattr *xstats);

    int (*parse_copt)(struct qdisc_util *qu, int argc, char **argv, struct nlmsghdr *n);
    int (*print_copt)(struct qdisc_util *qu, FILE *f, struct rtattr *opt);
};
struct {
    struct nlmsghdr n;
    struct tcmsg t;
    char buf[TCA_BUF_MAX];
} req;
For example, for htb:
struct rtattr {
    unsigned short rta_len;
    unsigned short rta_type; // TCA_KIND
};
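Note, however, that get_qdisc_kind returns a usable q whether or not the algorithm is found: either (1) it is found through the dynamic symbol lookup, or (2) a new entry is created. For a created q, option parsing naturally fails, because the default parse_qopt does not support parsing arguments.
We have just added the rtattr for the kind (htb); next come the algorithm's option parameters, for example through htb_parse_opt.
In the same way, htb r2q 1 is added as options information laid out as rtattr + data + … + rtattr + data.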
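The rtattr + data layout can be illustrated with a small helper that appends one attribute at the aligned tail of the message. This is a sketch in the spirit of libnetlink's attribute helpers rather than a copy of them; the names tc_req_sketch and append_attr and the 1024-byte buffer are assumptions made for the example.

```c
#include <string.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Request layout as used by tc: header, tcmsg, then room for attributes. */
struct tc_req_sketch {
    struct nlmsghdr n;
    struct tcmsg t;
    char buf[1024];   /* illustrative size; tc uses TCA_BUF_MAX */
};

/*
 * Append one rtattr (header + payload) at the aligned end of the message.
 * Returns 0 on success, -1 if the attribute would not fit in maxlen bytes.
 */
int append_attr(struct nlmsghdr *n, size_t maxlen, unsigned short type,
                const void *data, size_t alen)
{
    size_t len = RTA_LENGTH(alen);             /* rtattr header + payload */
    size_t tail = NLMSG_ALIGN(n->nlmsg_len);   /* where the attribute starts */
    struct rtattr *rta;

    if (tail + RTA_ALIGN(len) > maxlen)
        return -1;

    rta = (struct rtattr *)((char *)n + tail);
    rta->rta_type = type;
    rta->rta_len = (unsigned short)len;
    memcpy(RTA_DATA(rta), data, alen);

    /* The message grows by the padded attribute size. */
    n->nlmsg_len = (unsigned int)(tail + RTA_ALIGN(len));
    return 0;
}
```

Starting from nlmsg_len = NLMSG_LENGTH(sizeof(struct tcmsg)), appending TCA_KIND with the 4-byte payload "htb\0" grows the message by 8 bytes: 4 for the rtattr header and 4 for the string.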
struct tc_sizespec {
    unsigned char cell_log;
    unsigned char size_log;
    short cell_align;
    int overhead;
    unsigned int linklayer;
    unsigned int mpu;
    unsigned int mtu;
    unsigned int tsize;
};
struct {
    struct tc_sizespec szopts;
    __u16 *data;
} stab;
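req.t.tcm_ifindex = idx;
The message is then sent over rtnetlink to push the configuration down. Class and filter operations follow the same call pattern as qdisc (for example tc_class_modify).
1. As noted earlier, the socket has already been created: rtnl_open opens the socket and sets the send and receive buffers.
2. ll_init_map initializes the rtnetlink link map.
It first sends a request message:
nlh = nlmsg_hdr(skb);
nlmsg_type = RTM_GETLINK
nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP
After that, rtnl_dump_filter_l mainly does the work of receiving the reply messages.
3. Finally, rtnl_talk sends down the actual configuration.
rtnetlink is one kind of netlink, whose usage was covered earlier. Since the user-space socket has been set up, the kernel must do the corresponding work on its side: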
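As an illustration of the request sent in step 2, the sketch below fills in an RTM_GETLINK dump request with the type and flags listed above. It only builds the message; sending it and reading the replies is left out. The struct and function names are made up for this example.

```c
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Minimal dump request: netlink header followed by an ifinfomsg. */
struct link_dump_req_sketch {
    struct nlmsghdr nlh;
    struct ifinfomsg ifm;
};

/* Fill in a "dump all links" request; seq is chosen by the caller. */
void build_link_dump_req(struct link_dump_req_sketch *req, unsigned int seq)
{
    memset(req, 0, sizeof(*req));
    req->nlh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
    req->nlh.nlmsg_type = RTM_GETLINK;
    req->nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
    req->nlh.nlmsg_seq = seq;
    req->ifm.ifi_family = AF_UNSPEC;   /* all address families */
}
```

Because NLM_F_DUMP is set on a GET-type message, the kernel answers with a series of messages, one per interface, which is what the dump filter on the user-space side then consumes.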
static int __net_init rtnetlink_net_init(struct net *net)
{
    struct sock *sk;
    struct netlink_kernel_cfg cfg = {
        .groups = RTNLGRP_MAX,
        .input = rtnetlink_rcv,
        .cb_mutex = &rtnl_mutex,
        .flags = NL_CFG_F_NONROOT_RECV,
    };

    sk = netlink_kernel_create(net, NETLINK_ROUTE, &cfg);
    if (!sk)
        return -ENOMEM;
    net->rtnl = sk;
    return 0;
}
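But the messages we send fall into many command categories with different message types, so how are they told apart and handled? For the qdisc-related, class-related and filter-related ones, the kernel's sch_api.c shows: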
static int __init pktsched_init(void)
{
    int err;

    err = register_pernet_subsys(&psched_net_ops);
    if (err) {
        pr_err("pktsched_init: "
               "cannot initialize per netns operations\n");
        return err;
    }

    register_qdisc(&pfifo_qdisc_ops);
    register_qdisc(&bfifo_qdisc_ops);
    register_qdisc(&pfifo_head_drop_qdisc_ops);
    register_qdisc(&mq_qdisc_ops);

    rtnl_register(PF_UNSPEC, RTM_NEWQDISC, tc_modify_qdisc, NULL, NULL);
    rtnl_register(PF_UNSPEC, RTM_DELQDISC, tc_get_qdisc, NULL, NULL);
    rtnl_register(PF_UNSPEC, RTM_GETQDISC, tc_get_qdisc, tc_dump_qdisc, NULL);
    rtnl_register(PF_UNSPEC, RTM_NEWTCLASS, tc_ctl_tclass, NULL, NULL);
    rtnl_register(PF_UNSPEC, RTM_DELTCLASS, tc_ctl_tclass, NULL, NULL);
    rtnl_register(PF_UNSPEC, RTM_GETTCLASS, tc_ctl_tclass, tc_dump_tclass, NULL);

    return 0;
}
static int __init tc_filter_init(void)
{
    rtnl_register(PF_UNSPEC, RTM_NEWTFILTER, tc_ctl_tfilter, NULL, NULL);
    rtnl_register(PF_UNSPEC, RTM_DELTFILTER, tc_ctl_tfilter, NULL, NULL);
    rtnl_register(PF_UNSPEC, RTM_GETTFILTER, tc_ctl_tfilter,
                  tc_dump_tfilter, NULL);

    return 0;
}
rtnl_register(PF_UNSPEC, RTM_GETLINK, rtnl_getlink,
              rtnl_dump_ifinfo, rtnl_calcit);
/**
 * rtnl_register - Register a rtnetlink message type
 *
 * Identical to __rtnl_register() but panics on failure. This is useful
 * as failure of this function is very unlikely, it can only happen due
 * to lack of memory when allocating the chain to store all message
 * handlers for a protocol. Meant for use in init functions where lack
 * of memory implies no sense in continuing.
 */
void rtnl_register(int protocol, int msgtype,
                   rtnl_doit_func doit, rtnl_dumpit_func dumpit,
                   rtnl_calcit_func calcit)
{
    if (__rtnl_register(protocol, msgtype, doit, dumpit, calcit) < 0)
        panic("Unable to register rtnetlink message handler, "
              "protocol = %d, message type = %d\n",
              protocol, msgtype);
}
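Conceptually, registration fills a table of handlers indexed by message type, and dispatch later looks the handler up again. The sketch below shows that idea for a single protocol family only; it is a user-space illustration, not the kernel implementation, and the handler signatures and all function names are simplified assumptions.

```c
#include <stddef.h>
#include <linux/rtnetlink.h>

/* Simplified handler signatures; the kernel's take skb/nlmsghdr arguments. */
typedef int (*doit_fn)(int msgtype);
typedef int (*dumpit_fn)(int msgtype);

struct handler_sketch {
    doit_fn doit;
    dumpit_fn dumpit;
};

/* One slot per message type, indexed by msgtype - RTM_BASE. */
static struct handler_sketch handlers[RTM_NR_MSGTYPES];

/* Record the handlers for one message type; returns -1 if out of range. */
int register_handler(int msgtype, doit_fn doit, dumpit_fn dumpit)
{
    if (msgtype < RTM_BASE || msgtype > RTM_MAX)
        return -1;
    if (doit)
        handlers[msgtype - RTM_BASE].doit = doit;
    if (dumpit)
        handlers[msgtype - RTM_BASE].dumpit = dumpit;
    return 0;
}

/* Find the doit handler for a message type, or NULL if none is registered. */
doit_fn lookup_doit(int msgtype)
{
    if (msgtype < RTM_BASE || msgtype > RTM_MAX)
        return NULL;
    return handlers[msgtype - RTM_BASE].doit;
}

/* Stand-in for tc_modify_qdisc, used only to exercise the table. */
int demo_modify_qdisc(int msgtype)
{
    return msgtype;
}
```

Registering demo_modify_qdisc for RTM_NEWQDISC and then calling lookup_doit(RTM_NEWQDISC) returns the same function pointer, which mirrors how an RTM_NEWQDISC message ends up in tc_modify_qdisc.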
Having said all this, we still need to see where everything is tied together. The kernel's rtnetlink receive function is:
static void rtnetlink_rcv(struct sk_buff *skb)
{
    rtnl_lock();
    netlink_rcv_skb(skb, &rtnetlink_rcv_msg);
    rtnl_unlock();
}
If the message has the NLM_F_REQUEST flag set, netlink_rcv_skb calls err = cb(skb, nlh); and this cb is:
/* Process one rtnetlink message. */

static int rtnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
{
    struct net *net = sock_net(skb->sk);
    rtnl_doit_func doit;
    int sz_idx, kind;
    int min_len;
    int family;
    int type;
    int err;

    type = nlh->nlmsg_type;
    if (type > RTM_MAX)
        return -EOPNOTSUPP;

    type -= RTM_BASE;

    /* All the messages must have at least 1 byte length */
    if (nlh->nlmsg_len < NLMSG_LENGTH(sizeof(struct rtgenmsg)))
        return 0;

    family = ((struct rtgenmsg *)NLMSG_DATA(nlh))->rtgen_family;
    sz_idx = type>>2;
    kind = type&3;

    if (kind != 2 && !ns_capable(net->user_ns, CAP_NET_ADMIN))
        return -EPERM;

    if (kind == 2 && nlh->nlmsg_flags&NLM_F_DUMP) {
        struct sock *rtnl;
        rtnl_dumpit_func dumpit;
        rtnl_calcit_func calcit;
        u16 min_dump_alloc = 0;

        dumpit = rtnl_get_dumpit(family, type);
        if (dumpit == NULL)
            return -EOPNOTSUPP;
        calcit = rtnl_get_calcit(family, type);
        if (calcit)
            min_dump_alloc = calcit(skb, nlh);

        __rtnl_unlock();
        rtnl = net->rtnl;
        {
            struct netlink_dump_control c = {
                .dump = dumpit,
                .min_dump_alloc = min_dump_alloc,
            };
            err = netlink_dump_start(rtnl, skb, nlh, &c);
        }
        rtnl_lock();
        return err;
    }

    memset(rta_buf, 0, (rtattr_max * sizeof(struct rtattr *)));

    min_len = rtm_min[sz_idx];
    if (nlh->nlmsg_len < min_len)
        return -EINVAL;

    if (nlh->nlmsg_len > min_len) {
        int attrlen = nlh->nlmsg_len - NLMSG_ALIGN(min_len);
        struct rtattr *attr = (void *)nlh + NLMSG_ALIGN(min_len);

        while (RTA_OK(attr, attrlen)) {
            unsigned int flavor = attr->rta_type & NLA_TYPE_MASK;
            if (flavor) {
                if (flavor > rta_max[sz_idx])
                    return -EINVAL;
                rta_buf[flavor-1] = attr;
            }
            attr = RTA_NEXT(attr, attrlen);
        }
    }

    doit = rtnl_get_doit(family, type);
    if (doit == NULL)
        return -EOPNOTSUPP;

    return doit(skb, nlh, (void *)&rta_buf[0]);
}
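The sz_idx = type>>2 and kind = type&3 lines rely on rtnetlink message types being numbered in groups of four (NEW, DEL, GET, SET) starting at RTM_BASE. A small sketch makes the arithmetic visible; the struct and function names are invented for this example.

```c
#include <linux/rtnetlink.h>

/* Result of splitting a message type the way rtnetlink_rcv_msg does. */
struct rtm_decoded {
    int sz_idx;   /* which object family: link, addr, route, ..., qdisc, ... */
    int kind;     /* 0 = NEW, 1 = DEL, 2 = GET, 3 = SET */
};

/* Decode an rtnetlink message type into its family index and operation. */
struct rtm_decoded decode_rtm_type(int nlmsg_type)
{
    int type = nlmsg_type - RTM_BASE;
    struct rtm_decoded d;

    d.sz_idx = type >> 2;   /* four message types per object family */
    d.kind = type & 3;      /* position within the group of four */
    return d;
}
```

For RTM_NEWQDISC (36) this gives sz_idx 5 and kind 0, while RTM_GETQDISC (38) gives kind 2. Only kind 2 (GET) passes without CAP_NET_ADMIN, which is why listing qdiscs works unprivileged but adding one does not.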
Let's look again at the req message sent from user space:
req.n.nlmsg_len = NLMSG_LENGTH(sizeof(struct tcmsg)); // message header, 4-byte aligned
req.n.nlmsg_flags = NLM_F_REQUEST|flags; // NLM_F_EXCL|NLM_F_CREATE
req.n.nlmsg_type = cmd; // RTM_NEWQDISC
req.t.tcm_family = AF_UNSPEC;
The situation is different in ll_init_map:
int rtnl_wilddump_req_filter(struct rtnl_handle *rth, int family, int type,
                             __u32 filt_mask)
{
    struct {
        struct nlmsghdr nlh;
        struct ifinfomsg ifm;
        /* attribute has to be NLMSG aligned */
        struct rtattr ext_req __attribute__ ((aligned(NLMSG_ALIGNTO)));
        __u32 ext_filter_mask;
    } req;
/* Process one rtnetlink message. */
static int rtnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
    doit = rtnl_get_doit(family, type);
    if (doit == NULL)
        return -EOPNOTSUPP;

    return doit(skb, nlh, (void *)&rta_buf[0]);
}
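For RTM_NEWQDISC the doit function, tc_modify_qdisc, is called directly, so that is the function to concentrate on.
tc_modify_qdisc
1. Find the interface:
dev = __dev_get_by_index(net, tcm->tcm_ifindex);
2. Get the traffic-control message:
tcm = nlmsg_data(n);
On Ubuntu 14 the default queueing discipline is pfifo_fast:
# tc qdisc show
qdisc pfifo_fast 0: dev eth0 root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1
3. Look up the device's qdisc and compare handle values; if none is found, create one with
qdisc_create and then attach it with qdisc_graft. This is the flow for adding a qdisc.
What about adding a class or a filter? Those go through tc_ctl_tclass / tc_ctl_tfilter (the corresponding doit handlers mentioned above).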
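As mentioned earlier, the qdisc_util is looked up by algorithm name, that is, through dlopen(NULL, XXX) in get_qdisc_kind.
This opens a handle covering everything already loaded into the program,
and dlsym then looks up the address of the symbol. The symbol is one that was compiled and linked into tc, for example from q_htb.c.
Once found, it is added to qdisc_list; if it is not found, a new entry is created.
A machine executes machine code, whereas we program in high-level languages with human-readable variable names, function names and so on rather than obscure machine code; these names are symbolic names. The compiler
collects symbol information at compile time and stores it in the object file's .symtab and .dynsym sections. Symbols are required by the linker and the debugger: without symbols, how could a debugger present debugging information to the user?
With only addresses (offsets within the object file) and no symbols, how could the linker link multiple object files?
The contents of .dynsym, however, are required at run time, so strip does not remove them.
Tools for inspecting symbols: a debugger, nm, readelf, etc.
nm test
readelf -s test
If you pass NULL as pathname, the returned handle refers to a global object table, including the process image that was loaded before the call. This usage is rare; the English explanation is: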
If pathname is NULL, dlopen() provides a handle to the running process's global symbol object. This provides access to the symbols from the original program image file, the dependencies it loaded at startup, plus any objects opened with dlopen() calls using the RTLD_GLOBAL flag. This set of symbols can change dynamically if the application subsequently calls dlopen() using RTLD_GLOBAL.
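But when reading tc I saw dlopen(NULL, …) and was puzzled: why could it find the symbol htb_qdisc_util? (The puzzle arose because my own test program failed while the tc tool worked.)
tc's Makefile only links q_htb.c in; nothing calls into it directly, yet the symbol is found during the dynamic symbol lookup. A small verification program I wrote did not work, until I found that a flag has to be given at build time:
-Wl,-export-dynamic  This flag exports the symbols from the symbol table into .dynsym.
I won't paste the test code. Inspecting the test binary with readelf -a shows that the symbol is present; after strip it is no longer there, and it is not in .dynsym, so dlsym certainly cannot find it.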
--export-dynamic
When creating a dynamically linked executable, add all symbols to the dynamic symbol table. The dynamic symbol table is the set of symbols which are visible from dynamic objects at run time.
If you do not use this option, the dynamic symbol table will normally contain only those symbols which are referenced by some dynamic object mentioned in the link.
If you use "dlopen" to load a dynamic object which needs to refer back to the symbols defined by the program, rather than some other dynamic object, then you will probably need to use this option when linking the program itself.
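Since the test program was not shown, here is a minimal sketch of the lookup pattern discussed above: build the symbol name from the algorithm name, then search the running program's own image with dlopen(NULL, ...) and dlsym. It is an illustration under assumptions, not tc's get_qdisc_kind; the function names are hypothetical, and the "<kind>_qdisc_util" naming simply follows the htb_qdisc_util symbol mentioned in the text.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Build "<kind>_qdisc_util" into buf; returns 0 on success, -1 if it does not fit. */
int qdisc_symbol_name(const char *kind, char *buf, size_t buflen)
{
    int n = snprintf(buf, buflen, "%s_qdisc_util", kind);

    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}

/*
 * Look a symbol up in the running program's global symbol object.
 * Returns its address, or NULL when it is not in the dynamic symbol table.
 */
void *find_in_own_image(const char *symbol)
{
    /* NULL pathname: handle for the main program and its loaded dependencies. */
    void *handle = dlopen(NULL, RTLD_LAZY);

    if (handle == NULL)
        return NULL;
    return dlsym(handle, symbol);
}
```

A symbol defined in the executable itself is only found this way when the program is linked with -Wl,-export-dynamic (or -rdynamic); otherwise it never reaches .dynsym and find_in_own_image returns NULL. On older glibc versions the program also needs to be linked with -ldl.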