
Linux Scheduling Management (3): The CFS Scheduler in Detail

1. A brief history of Linux schedulers

Scheduler         Kernel versions
O(n) scheduler    Linux 0.11 - 2.4
O(1) scheduler    Linux 2.6
CFS scheduler     Linux 2.6 to present

  • The O(n) scheduler was used in kernel 2.4 and earlier. Its algorithm was simple and direct: the run queue was a single global list, and finding the next best task meant scanning every task in the system, hence the name O(n) scheduler (its time complexity).
  • Kernel 2.6 adopted the O(1) scheduler, which gives each CPU its own run queue and thus reduces lock contention. Each run queue consists of two priority arrays, the active array and the expired array. Each array contains 140 priority queues, one per priority level; the first 100 levels are for real-time tasks and the remaining 40 for normal tasks.
  • With this design, picking the next task becomes simple and fast: the scheduler only has to find the highest priority in the active array whose queue contains a runnable task. A bitmap records which queues have runnable tasks (the corresponding bit is set to 1), so choosing the next task reduces to a bitmap lookup.
  • The scheme above has a problem, though: a high-priority multi-threaded application gets far more CPU than a low-priority single-threaded one, so within a scheduling period the low-priority application may be unable to respond at all until the high-priority one finishes. The CFS scheduler solves this by treating all tasks evenly: within one scheduling period every task is guaranteed a chance to run, and how long it runs depends on its weight. The rest of this article looks at how CFS dynamically adjusts each task's runtime to achieve fair scheduling.

2. Scheduler classes in the current kernel

Linux defines five scheduler classes, corresponding to stop, deadline, realtime, cfs and idle; they are chained together through the next pointer.

const struct sched_class stop_sched_class = {
    .next            = &dl_sched_class,
...
};

const struct sched_class dl_sched_class = {
    .next            = &rt_sched_class,
...
};

const struct sched_class rt_sched_class = {
    .next            = &fair_sched_class,
...
};

const struct sched_class fair_sched_class = {
    .next            = &idle_sched_class,
...
};

const struct sched_class idle_sched_class = {
    /* .next is NULL */
...
};

/*
 * Scheduling policies
 */
#define SCHED_NORMAL        0
#define SCHED_FIFO        1
#define SCHED_RR        2
#define SCHED_BATCH        3
/* SCHED_ISO: reserved but not implemented yet */
#define SCHED_IDLE        5
#define SCHED_DEADLINE        6

The kernel also defines six scheduling policies and three scheduling entities; their relationship is shown in the table below:

Scheduler class     Scheduling policy            Scheduling entity   Priority
stop_sched_class    -                            -                   -
dl_sched_class      SCHED_DEADLINE               sched_dl_entity     (-, 0)
rt_sched_class      SCHED_FIFO, SCHED_RR         sched_rt_entity     [0, 100)
fair_sched_class    SCHED_NORMAL, SCHED_BATCH    sched_entity        [100, 140)
idle_sched_class    SCHED_IDLE                   -                   -

When a process is created, its scheduler class is initialized according to its scheduling policy. The real-time scheduler class is covered in a following article.

The three scheduling entities are embedded in struct task_struct, which makes it easy to place a process into one of the scheduler classes.
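
As a rough sketch (field order and surrounding members omitted), the three entities live side by side in struct task_struct:

struct task_struct {
...
    const struct sched_class *sched_class; /* selected according to prio */
    struct sched_entity       se;   /* used by fair_sched_class (CFS) */
    struct sched_rt_entity    rt;   /* used by rt_sched_class */
    struct sched_dl_entity    dl;   /* used by dl_sched_class */
...
};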

3. Key algorithms in the CFS scheduler

1. Weight calculation

1.1 Computing the priority

Before looking at how priorities are computed, it helps to understand the priority-related members of struct task_struct:

struct task_struct {
...
    int prio, static_prio, normal_prio;
    unsigned int rt_priority;
...
    unsigned int policy;
...
};
  • prio: the dynamic priority of the process. The kernel picks the scheduler class based on prio, and in some situations the priority is boosted temporarily.
  • static_prio: the static priority, assigned when the process is started. The kernel does not store the nice value; it derives it from task_struct->static_prio via PRIO_TO_NICE. It can be changed with nice/renice or setpriority().
  • normal_prio: a priority computed from static_prio and the scheduling policy; a new process inherits its parent's normal_prio. For normal processes normal_prio equals static_prio; for real-time processes it is recomputed from rt_priority.
  • rt_priority: the priority of a real-time process, equivalent to sched_param.sched_priority as set by the user.

rt_priority is 0 for normal processes and ranges from 1 to 99 for real-time processes.

normal_prio equals static_prio for normal processes; for real-time processes normal_prio = 99 - rt_priority.

A few questions follow.

First: how does the kernel compute normal_prio? The function that returns it is normal_prio().

static inline int __normal_prio(struct task_struct *p)
{
    return p->static_prio;
}

static inline int normal_prio(struct task_struct *p)
{
    int prio;

    if (task_has_dl_policy(p))
        prio = MAX_DL_PRIO-1;-----------------------------------for DEADLINE tasks this is the fixed value -1.
    else if (task_has_rt_policy(p))
        prio = MAX_RT_PRIO-1 - p->rt_priority;------------------for real-time tasks, normal_prio = 100 - 1 - rt_priority
    else
        prio = __normal_prio(p);--------------------------------for normal tasks, normal_prio = static_prio
    return prio;
}

normal_prio() computes the priority from the scheduling policy. Note that when calling the sched_setscheduler() system call, the priority passed in must match the scheduling policy.

Normal process: prio equals static_prio;

Real-time process: prio and rt_priority satisfy prio + rt_priority = 99.

Second: how does the kernel obtain prio? The function is effective_prio(), which simply returns the prio of a given task.

static int effective_prio(struct task_struct *p)
{
    p->normal_prio = normal_prio(p); // normal_prio is computed here
    /*
     * If we are RT tasks or we were boosted to RT priority,
     * keep the priority unchanged. Otherwise, update priority
     * to the normal priority:
     */
    if (!rt_prio(p->prio))-------------------i.e. prio is greater than 99: a normal process, so prio = normal_prio = static_prio.
        return p->normal_prio;
    return p->prio;  // normal_prio was set above; here the dynamic priority prio is returned.
}

Third: what is the dynamic priority prio when a process is created? It is inherited from the parent's normal_prio; to change the child's priority, sched_setscheduler() must be called, which updates prio.

Process creation call chain: kernel_clone --> copy_process

copy_process()
  sched_fork()
    __sched_fork()
    fair_sched_class->task_fork()->task_fork_fair()
      __set_task_cpu()
      update_curr()
      place_entity()
  wake_up_new_task()
    activate_task()
      enqueue_task
        fair_sched_class->enqueue_task-->enqueue_task_fair()

sched_fork() calls __sched_fork() to initialize struct task_struct:

int sched_fork(unsigned long clone_flags, struct task_struct *p)
{
    unsigned long flags;
    int cpu = get_cpu();--------------------------------------------------disable preemption and get the current CPU number

    __sched_fork(clone_flags, p);
    p->state = TASK_RUNNING;----------------------------------------------not actually running yet; the task has not been added to the scheduler
    p->prio = current->normal_prio;

    /*
     * Revert to default priority/policy on fork if requested.
     */
    if (unlikely(p->sched_reset_on_fork)) {-------------------------------if sched_reset_on_fork is true, reset policy, static_prio, prio, weight, inv_weight, etc.
        if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
            p->policy = SCHED_NORMAL;
            p->static_prio = NICE_TO_PRIO(0);
            p->rt_priority = 0;
        } else if (PRIO_TO_NICE(p->static_prio) < 0)
            p->static_prio = NICE_TO_PRIO(0);

        p->prio = p->normal_prio = __normal_prio(p);
        set_load_weight(p);
        p->sched_reset_on_fork = 0;
    }

    if (dl_prio(p->prio)) {
        put_cpu();
        return -EAGAIN;
    } else if (rt_prio(p->prio)) {
        p->sched_class = &rt_sched_class;
    } else {
        p->sched_class = &fair_sched_class;-------------------------------select the scheduler class according to task_struct->prio
    }

    if (p->sched_class->task_fork)
        p->sched_class->task_fork(p);-------------------------------------call the class's task_fork method; for CFS this is task_fork_fair().

    raw_spin_lock_irqsave(&p->pi_lock, flags);
    set_task_cpu(p, cpu);-------------------------------------------------make p run on cpu; if task_struct->stack->cpu differs from the current CPU, the CPU-related state must be moved to the new CPU.
    raw_spin_unlock_irqrestore(&p->pi_lock, flags);

#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
    if (likely(sched_info_on()))
        memset(&p->sched_info, 0, sizeof(p->sched_info));
#endif
#if defined(CONFIG_SMP)
    p->on_cpu = 0;
#endif
    init_task_preempt_count(p);-------------------------------------------initialize preempt_count
#ifdef CONFIG_SMP
    plist_node_init(&p->pushable_tasks, MAX_PRIO);
    RB_CLEAR_NODE(&p->pushable_dl_tasks);
#endif

    put_cpu();------------------------------------------------------------re-enable preemption
    return 0;
}

As the code shows, a new process's prio is:

 p->prio = current->normal_prio;

The priority is changed by calling sched_setscheduler():

sched_setscheduler->_sched_setscheduler->__setscheduler_prio

static void __setscheduler_prio(struct task_struct *p, int prio)
{
        if (dl_prio(prio))
                p->sched_class = &dl_sched_class;
        else if (rt_prio(prio))
                p->sched_class = &rt_sched_class;
        else
                p->sched_class = &fair_sched_class;

        p->prio = prio;// assign the new priority to prio
}

The code above shows that changing a task's priority also switches its scheduler class according to the new priority. In other words, the kernel selects the scheduler class, and therefore the scheduling algorithm, based on prio, so setting a task's priority correctly is important.

Fourth: if the parent is a normal process, how do the parent's and child's priorities relate when a child is created? And if the parent is a real-time process?

The answer: the child inherits the parent's scheduling policy and priority settings. The dynamic priority prio, however, is set to the parent's normal_prio, so it can differ from the parent's current (possibly boosted) prio.

Summary:

Normal process: static_prio = prio = normal_prio; rt_priority = 0.

Real-time process: prio = normal_prio = 99 - rt_priority; rt_priority = sched_param.sched_priority, in the range [1, 99]; static_prio keeps its default value.
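
For example (a user-space sketch; the kernel-internal values in the comments are derived from the formulas above, not read back from the kernel): a SCHED_FIFO task created with sched_priority = 50 ends up with rt_priority = 50 and prio = normal_prio = 99 - 50 = 49 inside the kernel, while static_prio keeps its default of 120 (nice 0).

#include <sched.h>
#include <stdio.h>

int main(void)
{
    struct sched_param param = { .sched_priority = 50 };

    /* Request SCHED_FIFO with priority 50 for the current process. */
    if (sched_setscheduler(0, SCHED_FIFO, &param) == -1) {
        perror("sched_setscheduler");   /* needs CAP_SYS_NICE / root */
        return 1;
    }
    /* Kernel-internal view (per the formulas above):
     *   rt_priority = 50, prio = normal_prio = 99 - 50 = 49,
     *   static_prio unchanged (120 for nice 0). */
    printf("now running as SCHED_FIFO, sched_priority=50\n");
    return 0;
}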

1.1.1 Relationship between static_prio and nice

The kernel represents priorities with the values 0-139; the lower the value, the higher the priority. 0-99 are used by real-time processes and 100-139 by normal processes (SCHED_NORMAL/SCHED_BATCH).

The nice value passed from user space maps onto the normal-process range, i.e. 100-139.

For converting between nice and prio, the kernel provides the NICE_TO_PRIO and PRIO_TO_NICE macros.

#define MAX_USER_RT_PRIO    100
#define MAX_RT_PRIO        MAX_USER_RT_PRIO

#define MAX_PRIO        (MAX_RT_PRIO + NICE_WIDTH)
#define DEFAULT_PRIO        (MAX_RT_PRIO + NICE_WIDTH / 2)

/*
 * Convert user-nice values [ -20 ... 0 ... 19 ]
 * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
 * and back.
 */
#define NICE_TO_PRIO(nice)    ((nice) + DEFAULT_PRIO)
#define PRIO_TO_NICE(prio)    ((prio) - DEFAULT_PRIO)

/*
 * 'User priority' is the nice value converted to something we
 * can work with better when scaling various scheduler parameters,
 * it's a [ 0 ... 39 ] range.
 */
#define USER_PRIO(p)        ((p)-MAX_RT_PRIO)
#define TASK_USER_PRIO(p)    USER_PRIO((p)->static_prio)
#define MAX_USER_PRIO        (USER_PRIO(MAX_PRIO))
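
A quick check with the macros above: DEFAULT_PRIO is MAX_RT_PRIO + NICE_WIDTH/2 = 100 + 20 = 120, so NICE_TO_PRIO(-20) = 100, NICE_TO_PRIO(0) = 120 and NICE_TO_PRIO(19) = 139, while PRIO_TO_NICE(120) = 0.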

Reference: Linux之进程优先级 多方面分析 (CSDN blog)

1.2 Computing the weight

The kernel records a scheduling entity's weight in struct load_weight.

The weight is derived from the priority; a process's weight can be read through task_struct->se.load.

Weights only apply to normal processes, whose nice values range from -20 to 19.

struct task_struct {
...
    struct sched_entity se;
...
};

struct sched_entity {
    struct load_weight    load;        /* for load-balancing */
...
};

struct load_weight {
    unsigned long weight;----------------the weight of the scheduling entity
    u32 inv_weight;----------------------inverse weight, an intermediate value pre-computed from the weight.
};

set_load_weight() sets a process's weight: it uses task_struct->static_prio to look the values up in prio_to_weight[] and prio_to_wmult[].

static void set_load_weight(struct task_struct *p)
{
    int prio = p->static_prio - MAX_RT_PRIO;---------------------the weight is determined by static_prio; 100 (not 120) is subtracted so the result indexes the arrays below.
    struct load_weight *load = &p->se.load;

    /*
     * SCHED_IDLE tasks get minimal weight:
     */
    if (p->policy == SCHED_IDLE) {
        load->weight = scale_load(WEIGHT_IDLEPRIO);-------------SCHED_IDLE tasks use a fixed weight, 1/5 of the weight of the lowest normal priority.
        load->inv_weight = WMULT_IDLEPRIO;----------------------and 5 times the inverse weight of the lowest normal priority.
        return;
    }

    load->weight = scale_load(prio_to_weight[prio]);
    load->inv_weight = prio_to_wmult[prio];
}

prio_to_weight[] takes nice 0 as the baseline with weight 1024 and pre-computes the weights for nice -20 through 19, so set_load_weight() can obtain a process's weight directly from its priority.

prio_to_wmult[] holds results pre-computed to make the vruntime calculation cheap:

inv_weight = 2^32 / weight

static const int prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};

static const u32 prio_to_wmult[40] = {
 /* -20 */     48388,     59856,     76040,     92818,    118348,
 /* -15 */    147320,    184698,    229616,    287308,    360437,
 /* -10 */    449829,    563644,    704093,    875809,   1099582,
 /*  -5 */   1376151,   1717300,   2157191,   2708050,   3363326,
 /*   0 */   4194304,   5237765,   6557202,   8165337,  10153587,
 /*   5 */  12820798,  15790321,  19976592,  24970740,  31350126,
 /*  10 */  39045157,  49367440,  61356676,  76695844,  95443717,
 /*  15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
};

First, static_prio determines which weight is chosen.

Second, the weights are pre-computed into lookup tables, so the kernel can use them directly, quickly and efficiently, without doing the arithmetic at run time.

Third, the weight is the parameter that determines how much runtime a CFS task gets: the larger the weight, the more time it is allocated; the smaller the weight, the less.
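
For instance, using the table above: if a nice 0 task (weight 1024) and a nice 5 task (weight 335) share one CPU, the nice 0 task receives 1024 / (1024 + 335) ≈ 75% of each scheduling period and the nice 5 task about 25%.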

1.3 Computing runtime and vruntime

The CFS scheduler no longer has the notion of a time slice; tasks are ordered and selected by their actual runtime and virtual runtime. How are these computed? The flow is as follows:

(1) The kernel's default sysctl_sched_latency is 6ms and can be tuned from user space. sched_period is the interval that guarantees every runnable task gets to run at least once;

(2) When there are more than 8 runnable tasks, sched_period is instead computed as the number of tasks multiplied by the minimum scheduling granularity, which defaults to 0.75ms;

(3) Each task's runtime is sched_period multiplied by the task's share of the total weight of the CFS run queue;

(4) Virtual runtime = actual runtime * NICE_0_LOAD / task weight.

These formulas show that a higher-weight task gets a larger runtime, but because the weight cancels between the numerator and the denominator, every task's vruntime advances by the same amount over one scheduling period, regardless of weight.
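
A minimal user-space sketch of formulas (3) and (4), using the prio_to_weight[] values above (this is a simplification of the kernel's __calc_delta(); NICE_0_LOAD is taken as 1024 and the fixed-point inv_weight trick is ignored):

#include <stdio.h>
#include <stdint.h>

#define NICE_0_LOAD 1024ULL

/* slice = period * weight / total_weight  (formula 3) */
static uint64_t sched_slice_ns(uint64_t period_ns, uint64_t weight, uint64_t total_weight)
{
    return period_ns * weight / total_weight;
}

/* vruntime delta = runtime * NICE_0_LOAD / weight  (formula 4) */
static uint64_t calc_vruntime_ns(uint64_t runtime_ns, uint64_t weight)
{
    return runtime_ns * NICE_0_LOAD / weight;
}

int main(void)
{
    uint64_t period = 6000000;            /* 6ms scheduling period */
    uint64_t w0 = 1024, w5 = 335;         /* nice 0 and nice 5 weights */
    uint64_t total = w0 + w5;

    uint64_t rt0 = sched_slice_ns(period, w0, total);   /* ~4.5ms */
    uint64_t rt5 = sched_slice_ns(period, w5, total);   /* ~1.5ms */

    /* Both vruntime deltas come out roughly equal (~4.5ms). */
    printf("nice 0: runtime %llu ns, vruntime %llu ns\n",
           (unsigned long long)rt0, (unsigned long long)calc_vruntime_ns(rt0, w0));
    printf("nice 5: runtime %llu ns, vruntime %llu ns\n",
           (unsigned long long)rt5, (unsigned long long)calc_vruntime_ns(rt5, w5));
    return 0;
}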

When the CFS enqueue path enqueue_entity() runs, the virtual runtime above is computed:

enqueue_entity->place_entity->sched_vslice->calc_delta_fair

static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
        return calc_delta_fair(sched_slice(cfs_rq, se), se);
}

sched_slice: 

/*
 * delta /= w
 */
static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
{
        //if the task's weight is NICE_0_LOAD, the virtual time is simply delta and __calc_delta() is not needed.
        if (unlikely(se->load.weight != NICE_0_LOAD))
                delta = __calc_delta(delta, NICE_0_LOAD, &se->load);

        return delta;
}

/*
 * The idea is to set a period in which each task runs once.
 *
 * When there are too many tasks (sched_nr_latency) we have to stretch
 * this period because otherwise the slices get too small.
 *
 * p = (nr <= nl) ? l : l*nr/nl
 */
static u64 __sched_period(unsigned long nr_running)
{
        if (unlikely(nr_running > sched_nr_latency))// more than 8 runnable tasks
                return nr_running * sysctl_sched_min_granularity;// nr_running * 0.75ms
        else
                return sysctl_sched_latency;// 6ms
}

/*
 * We calculate the wall-time slice from the period by taking a part
 * proportional to the weight.
 *
 * s = p*P[w/rw]
 */
static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
        unsigned int nr_running = cfs_rq->nr_running;
        u64 slice;

        if (sched_feat(ALT_PERIOD))
                nr_running = rq_of(cfs_rq)->cfs.h_nr_running;

        slice = __sched_period(nr_running + !se->on_rq); //get the scheduling period of the CFS run queue

        for_each_sched_entity(se) {
                struct load_weight *load;
                struct load_weight lw;

                cfs_rq = cfs_rq_of(se);
                load = &cfs_rq->load;

                if (unlikely(!se->on_rq)) {
                        lw = cfs_rq->load;

                        update_load_add(&lw, se->load.weight);
                        load = &lw;
                }
                slice = __calc_delta(slice, se->load.weight, load);
        }

        if (sched_feat(BASE_SLICE))
                slice = max(slice, (u64)sysctl_sched_min_granularity);// never below 0.75ms

        return slice;
}

calc_delta_fair() computes the virtual time and returns it.

__calc_delta() is the core of the vruntime calculation: delta_exec is the task's actual runtime, weight is nice_0_weight, and lw is the task's load_weight, which contains its inv_weight.

static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
{
    u64 fact = scale_load_down(weight);--------------fact equals weight.
    int shift = WMULT_SHIFT;-------------------------WMULT_SHIFT is 32

    __update_inv_weight(lw);-------------------------update load_weight->inv_weight; it is normally already set, so nothing needs to be done.

    if (unlikely(fact >> 32)) {----------------------fact >> 32 is normally 0, so this is skipped
        while (fact >> 32) {
            fact >>= 1;
            shift--;
        }
    }

    /* hint to use a 32x32->64 mul */
    fact = (u64)(u32)fact * lw->inv_weight;----------equivalent to nice_0_weight * inv_weight

    while (fact >> 32) {
        fact >>= 1;
        shift--;
    }

    return mul_u64_u32_shr(delta_exec, fact, shift);----equivalent to delta_exec * (nice_0_weight * inv_weight) >> 32.
}

The lower a task's priority, the larger its inv_weight; nice_0_weight and the shift are the same for every task.

So for the same delta_exec, a lower-priority task accumulates a larger vruntime.

CFS always picks the task with the smallest vruntime from the red-black tree. For the same actual runtime the higher-priority task has the smallest vruntime, so it is always chosen first; but as its vruntime grows, lower-priority tasks get their chance to run as well.

Worked example:
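
As a numeric illustration (derived from the tables above): suppose a nice 0 task (weight 1024, inv_weight 4194304) and a nice 5 task (weight 335, inv_weight 12820798) each run for 1ms of wall time. The nice 0 task's vruntime grows by 1ms * 1024 * 4194304 >> 32 = 1ms exactly, while the nice 5 task's vruntime grows by 1ms * 1024 * 12820798 >> 32 ≈ 3.06ms, so the nice 5 task falls behind in the red-black tree about three times as fast.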

Related kernel parameters:

unsigned int sysctl_sched_latency = 6000000ULL;              // 6ms by default

unsigned int sysctl_sched_min_granularity = 750000ULL;       // 0.75ms by default

static unsigned int sched_nr_latency = 8;                     // 8 by default

Note that this 6ms is not the same thing as the 4ms scheduler_tick period: with fewer than 8 runnable tasks, 6ms is the maximum scheduling latency a task may see, while the 4ms tick period is the rate at which the timer interrupt drives the scheduler.

1.4 Load calculation

The kernel computes CPU load with PELT (Per-Entity Load Tracking), which goes beyond process weights and tracks the load of every scheduling entity.

PELT was introduced in kernel 3.8 to track the load of each sched entity, evolving load tracking from per-CPU to per-entity. With PELT the kernel knows not only a CPU's load but also which scheduling entity the load comes from, so load balancing can be done more accurately.

sched_entity contains a struct sched_avg that describes the entity's load.

struct sched_avg {
        u64                             last_update_time;
        u64                             load_sum;
        u64                             runnable_sum;
        u32                             util_sum;
        u32                             period_contrib;
        unsigned long                   load_avg;
        unsigned long                   runnable_avg;
        unsigned long                   util_avg;
        struct util_est                 util_est;
} ____cacheline_aligned;


struct sched_entity {
...
#ifdef CONFIG_SMP
    /* Per entity load average tracking */
    struct sched_avg    avg;
#endif
}
  • last_update_time: sched_avg is updated periodically; last_update_time is the time of the last update. Combined with the current time it gives the delta used to update load_sum, runnable_sum, util_sum and load_avg, runnable_avg, util_avg. For a task se there is a special case: when the task migrates to a new CPU, last_update_time is set to 0 so the load is recomputed from scratch on the new CPU; a brand-new task's last_update_time is also 0.
  • load_sum: a geometrically decayed sum (with 1ms windows; the further a window is from now, the more it decays, and after 32 windows a contribution has decayed by 50%). load_sum accumulates running + runnable time.
  • util_sum: a geometrically decayed sum in the same way; util_sum accumulates running time only.
  • runnable_sum: for a task se, runnable_sum equals util_sum; for a group se it also takes into account the number of tasks in the whole hierarchy below it and the time the group se spends running + runnable.
  • period_contrib: an intermediate value. A load update is split into three segments: d1 (the remainder of the previous, not-yet-full 1ms window), d2 (whole windows) and d3 (the current, not-yet-full window). period_contrib records d3 so it can be folded into the next update's d1.
  • load_avg: the load average derived from load_sum.
  • util_avg: the utilization average derived from util_sum.
  • runnable_avg: the runnable average derived from runnable_sum.
  • util_est: a blocked task's load keeps decaying; if a heavy task blocks for a long time, its PELT load becomes very small, and when it wakes up the scheduler could make a wrong decision based on that small value. util_est was introduced to remember the load average from before the task blocked.

CFS load balancing is covered in detail in another article; an overview is enough for now.
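
A minimal sketch of the geometric decay described above (the 1ms window and the "decays by half after 32 windows" rule come from the text; 1024 stands for one full window's contribution; this only illustrates the idea, not the kernel's fixed-point implementation, which uses pre-computed decay tables):

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* decay factor y chosen so that y^32 = 0.5: a contribution loses
     * half of its value after 32 one-millisecond windows */
    double y = pow(0.5, 1.0 / 32.0);

    /* a task that was running in each of the last 3 windows and idle before:
     * the most recent window counts fully, older windows are decayed */
    double load_sum = 1024 * (1.0 + y + y * y);

    printf("y = %.6f, load_sum after 3 busy windows = %.1f\n", y, load_sum);
    return 0;
}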

4. Process creation: task_fork_fair()

struct sched_class is the set of methods of a scheduler class; fair_sched_class defines the CFS implementations of those methods.

/*
 * All the scheduling class methods:
 */
const struct sched_class fair_sched_class
        __section("__fair_sched_class") = {
        .enqueue_task           = enqueue_task_fair,
        .dequeue_task           = dequeue_task_fair,
        .yield_task             = yield_task_fair,
        .yield_to_task          = yield_to_task_fair,

        .check_preempt_curr     = check_preempt_wakeup,

        .pick_next_task         = __pick_next_task_fair,
        .put_prev_task          = put_prev_task_fair,
        .set_next_task          = set_next_task_fair,

#ifdef CONFIG_SMP
        .balance                = balance_fair,
        .select_task_rq         = select_task_rq_fair,
        .migrate_task_rq        = migrate_task_rq_fair,

        .rq_online              = rq_online_fair,
        .rq_offline             = rq_offline_fair,

        .task_dead              = task_dead_fair,
        .set_cpus_allowed       = set_cpus_allowed_common,
#endif

        .task_tick              = task_tick_fair,
        .task_fork              = task_fork_fair,

        .prio_changed           = prio_changed_fair,
        .switched_from          = switched_from_fair,
        .switched_to            = switched_to_fair,

        .get_rr_interval        = get_rr_interval_fair,

        .update_curr            = update_curr_fair,

#ifdef CONFIG_FAIR_GROUP_SCHED
        .task_change_group      = task_change_group_fair,
#endif

#ifdef CONFIG_UCLAMP_TASK
        .uclamp_enabled         = 1,
#endif
};
4.1 Process creation

A process is created by kernel_clone(); the kernel_clone --> copy_process path performs the scheduler-related initialization.

copy_process()
  sched_fork()
    __sched_fork()
  sched_post_fork()
    fair_sched_class->task_fork()->task_fork_fair()
      __set_task_cpu()
      update_curr()
      place_entity()
  wake_up_new_task()
    activate_task()
      enqueue_task
        fair_sched_class->enqueue_task-->enqueue_task_fair()
4.1.1 sched_fork()

sched_fork() calls __sched_fork() to initialize struct task_struct:

/*
 * fork()/clone()-time setup:
 */
int sched_fork(unsigned long clone_flags, struct task_struct *p)
{
        __sched_fork(clone_flags, p);
        /*
         * We mark the process as NEW here. This guarantees that
         * nobody will actually run it, and a signal or other external
         * event cannot wake it up and insert it on the runqueue either.
         */
        p->state = TASK_NEW;------------------------not actually running yet; the task has not been added to the scheduler

        /*
         * Make sure we do not leak PI boosting priority to the child.
         */
        p->prio = current->normal_prio;

        uclamp_fork(p);

        /*
         * Revert to default priority/policy on fork if requested.
         */
  //if sched_reset_on_fork is true, reset policy, static_prio, prio, weight, inv_weight, etc.
        if (unlikely(p->sched_reset_on_fork)) {
                if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
                        p->policy = SCHED_NORMAL;
                        p->static_prio = NICE_TO_PRIO(0);
                        p->rt_priority = 0;
                } else if (PRIO_TO_NICE(p->static_prio) < 0)
                        p->static_prio = NICE_TO_PRIO(0);

                p->prio = p->normal_prio = p->static_prio;
                set_load_weight(p, false);

                /*
                 * We don't need the reset flag anymore after the fork. It has
                 * fulfilled its duty:
                 */
                p->sched_reset_on_fork = 0;
        }

        if (dl_prio(p->prio))--select the scheduler class according to task_struct->prio
                return -EAGAIN;
        else if (rt_prio(p->prio))
                p->sched_class = &rt_sched_class;
        else
                p->sched_class = &fair_sched_class;

        init_entity_runnable_average(&p->se);

#ifdef CONFIG_SCHED_INFO
        if (likely(sched_info_on()))
                memset(&p->sched_info, 0, sizeof(p->sched_info));
#endif
#if defined(CONFIG_SMP)
        p->on_cpu = 0;
#endif
        init_task_preempt_count(p);-----initialize preempt_count
#ifdef CONFIG_HAVE_PREEMPT_LAZY
        task_thread_info(p)->preempt_lazy_count = 0;
#endif
#ifdef CONFIG_SMP
        plist_node_init(&p->pushable_tasks, MAX_PRIO);
        RB_CLEAR_NODE(&p->pushable_dl_tasks);
#endif
        return 0;
}

__sched_fork() sets the initial values in the task_struct:

static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
{
        p->on_rq                        = 0;

        p->se.on_rq                     = 0;
        p->se.exec_start                = 0;
        p->se.sum_exec_runtime          = 0;
        p->se.prev_sum_exec_runtime     = 0;
        p->se.nr_migrations             = 0;
        p->se.vruntime                  = 0;
        INIT_LIST_HEAD(&p->se.group_node);

#ifdef CONFIG_FAIR_GROUP_SCHED
        p->se.cfs_rq                    = NULL;
#endif

#ifdef CONFIG_SCHEDSTATS
        /* Even if schedstat is disabled, there should not be garbage */
        memset(&p->se.statistics, 0, sizeof(p->se.statistics));
#endif

        RB_CLEAR_NODE(&p->dl.rb_node);
        init_dl_task_timer(&p->dl);
        init_dl_inactive_task_timer(&p->dl);
        __dl_clear_params(p);

        INIT_LIST_HEAD(&p->rt.run_list);
        p->rt.timeout           = 0;
        p->rt.time_slice        = sched_rr_timeslice;
        p->rt.on_rq             = 0;
        p->rt.on_list           = 0;

#ifdef CONFIG_PREEMPT_NOTIFIERS
        INIT_HLIST_HEAD(&p->preempt_notifiers);
#endif

#ifdef CONFIG_COMPACTION
        p->capture_control = NULL;
#endif
        init_numa_balancing(clone_flags, p);
#ifdef CONFIG_SMP
        p->wake_entry.u_flags = CSD_TYPE_TTWU;
        p->migration_pending = NULL;
#endif
}

copy_process() ---> sched_post_fork() ---> task_fork_fair(); the argument to task_fork_fair() is the newly created process.

static void task_fork_fair(struct task_struct *p)
{
        struct cfs_rq *cfs_rq;
        struct sched_entity *se = &p->se, *curr;
        struct rq *rq = this_rq();
        struct rq_flags rf;

        rq_lock(rq, &rf);
        update_rq_clock(rq);

        cfs_rq = task_cfs_rq(current);----get the cfs_rq of the CPU the current (parent) process is running on
        curr = cfs_rq->curr;
        if (curr) {
                update_curr(cfs_rq);-----update the information of the current scheduling entity cfs_rq->curr
                se->vruntime = curr->vruntime;
        }
        place_entity(cfs_rq, se, 1);//cfs_rq is the parent's CFS run queue, se is the scheduling entity of process p, initial is 1.

        if (sysctl_sched_child_runs_first && curr && entity_before(curr, se)) {
                /*
                 * Upon rescheduling, sched_class::put_prev_task() will place
                 * 'current' within the tree based on its new key value.
                 */
                swap(curr->vruntime, se->vruntime);
                resched_curr_lazy(rq);
        }

        se->vruntime -= cfs_rq->min_vruntime;
        rq_unlock(rq, &rf);
}

update_curr() is a core CFS function; it updates cfs_rq->curr, i.e. the currently running scheduling entity.

It mainly updates the entity's vruntime, sum_exec_runtime, exec_start and so on.

static void update_curr(struct cfs_rq *cfs_rq)
{
        struct sched_entity *curr = cfs_rq->curr;--------------curr points to the parent's scheduling entity.
        u64 now = rq_clock_task(rq_of(cfs_rq));//the run queue's rq->clock_task value, updated on every timer tick
        u64 delta_exec;

        if (unlikely(!curr))
                return;

        delta_exec = now - curr->exec_start;//delta_exec is the time elapsed since the last call to update_curr() for this task
        if (unlikely((s64)delta_exec <= 0))
                return;

        curr->exec_start = now;

        schedstat_set(curr->statistics.exec_max,
                      max(delta_exec, curr->statistics.exec_max));

        curr->sum_exec_runtime += delta_exec;---sum_exec_runtime is simply incremented by delta_exec.
        schedstat_add(cfs_rq->exec_clock, delta_exec);

        curr->vruntime += calc_delta_fair(delta_exec, curr);--compute the task's virtual time from delta_exec and curr->load and add it to curr->vruntime.
        update_min_vruntime(cfs_rq);//update cfs_rq->min_vruntime

        if (entity_is_task(curr)) {--if curr->my_q is NULL, this scheduling entity is a task (not a group)
                struct task_struct *curtask = task_of(curr);

                trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
                cgroup_account_cputime(curtask, delta_exec);
                account_group_exec_runtime(curtask, delta_exec);
        }

        account_cfs_rq_runtime(cfs_rq, delta_exec);
}

place_entity(): cfs_rq is the CFS run queue of the parent of the process se belongs to, se is the new process's scheduling entity, and initial is 1.

place_entity() takes the total weight of the cfs_rq the entity sits on into account and then updates se->vruntime.

static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
        //min_vruntime increases monotonically and tracks the smallest vruntime in the red-black tree of this CFS run queue
        u64 vruntime = cfs_rq->min_vruntime;

        /*
         * The 'current' period is already promised to the current tasks,
         * however the extra weight of the new task will slow them down a
         * little, place the new task so that it fits in the slot that
         * stays open at the end.
         */
//when the current process forks a new one, the new process's vruntime is penalized a little here, because the new task changes the total weight of the CFS run queue.
//sched_vslice() computes the virtual time used as the penalty, which is added to vruntime.
        if (initial && sched_feat(START_DEBIT))
                vruntime += sched_vslice(cfs_rq, se);

        /* sleeps up to a single latency don't count. */
        if (!initial) {
                unsigned long thresh = sysctl_sched_latency;

                /*
                 * Halve their sleep time's effect, to allow
                 * for a gentler effect of sleepers:
                 */
                if (sched_feat(GENTLE_FAIR_SLEEPERS))
                        thresh >>= 1;

                vruntime -= thresh;
        }

        /* ensure we never gain time by being placed backwards. */

//take the maximum of se->vruntime and the penalized vruntime, to prevent vruntime from going backwards.
        se->vruntime = max_vruntime(se->vruntime, vruntime);
}


static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    //compute the virtual time from the runtime returned by sched_slice() and the weight in se.
        return calc_delta_fair(sched_slice(cfs_rq, se), se);
}




static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
        unsigned int nr_running = cfs_rq->nr_running;
        u64 slice;

        if (sched_feat(ALT_PERIOD))
                nr_running = rq_of(cfs_rq)->cfs.h_nr_running;
//compute the run queue's scheduling period from the number of runnable tasks.
        slice = __sched_period(nr_running + !se->on_rq);

        for_each_sched_entity(se) {-walk all scheduling entities in the hierarchy of the current se.
                struct load_weight *load;
                struct load_weight lw;
            //find the cfs_rq the sched_entity belongs to, and from it cfs_rq->load.
                cfs_rq = cfs_rq_of(se);
                load = &cfs_rq->load;

                if (unlikely(!se->on_rq)) {
                        lw = cfs_rq->load;

                        update_load_add(&lw, se->load.weight);
                        load = &lw;
                }
        //compute the share of scheduling time this entity can claim, from its weight relative to the total weight of the CFS run queue.
                slice = __calc_delta(slice, se->load.weight, load);
        }

        if (sched_feat(BASE_SLICE))
                slice = max(slice, (u64)sysctl_sched_min_granularity);

        return slice;
}


static u64 __sched_period(unsigned long nr_running)
{
        if (unlikely(nr_running > sched_nr_latency))
                return nr_running * sysctl_sched_min_granularity;
        else
                return sysctl_sched_latency;-default CFS scheduling period of 6ms
}


unsigned int sysctl_sched_latency = 6000000ULL;

static unsigned int sched_nr_latency = 8;

unsigned int sysctl_sched_min_granularity = 750000ULL; // minimum scheduling granularity per task: 0.75ms







Summary:

When a parent creates a child via fork, task_fork_fair() is called; its argument is the child's task_struct. Its main job is to determine the child's vruntime, which in turn determines where the child's scheduling entity will be placed in the red-black tree.

task_fork_fair() itself is fairly simple.

sysctl_sched_child_runs_first defaults to 0, so the parent runs first.

4.1.2 wake_up_new_task() 
// kernel/kernel/sched/core.c

void wake_up_new_task(struct task_struct *p)
{
        struct rq_flags rf;
        struct rq *rq;

        raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
        p->state = TASK_RUNNING;
#ifdef CONFIG_SMP
        /*
         * Fork balancing, do it here and not earlier because:
         *  - cpus_ptr can change in the fork path
         *  - any previously selected CPU might disappear through hotplug
         *
         * Use __set_task_cpu() to avoid calling sched_class::migrate_task_rq,
         * as we're not fully set-up yet.
         */
        p->recent_used_cpu = task_cpu(p);
        rseq_migrate(p);
//re-select a CPU: cpus_ptr may have changed during fork, or the previously selected CPU may have been hot-unplugged.
        __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
#endif
        rq = __task_rq_lock(p, &rf);
        update_rq_clock(rq);
        post_init_entity_util_avg(p);

//this eventually calls enqueue_task_fair() to add process p to the CFS run queue.
        activate_task(rq, p, ENQUEUE_NOCLOCK);
        trace_sched_wakeup_new(p);

//check whether a task should preempt the currently running one.
        check_preempt_curr(rq, p, WF_FORK);
#ifdef CONFIG_SMP
        if (p->sched_class->task_woken) {
                /*
                 * Nothing relies on rq->lock after this, so its fine to
                 * drop it.
                 */
                rq_unpin_lock(rq, &rf);
                p->sched_class->task_woken(rq, p);
                rq_repin_lock(rq, &rf);
        }
#endif
        task_rq_unlock(rq, p, &rf);
}


void activate_task(struct rq *rq, struct task_struct *p, int flags)
{
        enqueue_task(rq, p, flags);

        p->on_rq = TASK_ON_RQ_QUEUED;
}


static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
{
        if (!(flags & ENQUEUE_NOCLOCK))
                update_rq_clock(rq);

        if (!(flags & ENQUEUE_RESTORE)) {
                sched_info_queued(rq, p);
                psi_enqueue(p, flags & ENQUEUE_WAKEUP);
        }

        uclamp_rq_inc(rq, p);
        p->sched_class->enqueue_task(rq, p, flags);
}

enqueue_task_fair() places the new process p onto the CFS run queue rq.

static void
enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
        struct cfs_rq *cfs_rq;
        struct sched_entity *se = &p->se;
        int idle_h_nr_running = task_has_idle_policy(p);
        int task_new = !(flags & ENQUEUE_WAKEUP);

        /*
         * The code below (indirectly) updates schedutil which looks at
         * the cfs_rq utilization to select a frequency.
         * Let's add the task's estimated utilization to the cfs_rq's
         * estimated utilization, before we update schedutil.
         */
        util_est_enqueue(&rq->cfs, p);

        /*
         * If in_iowait is set, the code below may not trigger any cpufreq
         * utilization updates, so do it here explicitly with the IOWAIT flag
         * passed.
         */
        if (p->in_iowait)
                cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
//---when CONFIG_FAIR_GROUP_SCHED is not defined, this for loop runs only once, i.e. there is just the single entity se.
        for_each_sched_entity(se) {
                if (se->on_rq)
                        break;
                cfs_rq = cfs_rq_of(se);
                enqueue_entity(cfs_rq, se, flags);--add the scheduling entity se to the cfs_rq run queue.

                cfs_rq->h_nr_running++;
                cfs_rq->idle_h_nr_running += idle_h_nr_running;

                /* end evaluation on encountering a throttled cfs_rq */
                if (cfs_rq_throttled(cfs_rq))
                        goto enqueue_throttle;

                flags = ENQUEUE_WAKEUP;
        }

        for_each_sched_entity(se) {
                cfs_rq = cfs_rq_of(se);

//update the load of this scheduling entity and of the run queue.

                update_load_avg(cfs_rq, se, UPDATE_TG);
                se_update_runnable(se);
                update_cfs_group(se);

                cfs_rq->h_nr_running++;
                cfs_rq->idle_h_nr_running += idle_h_nr_running;

                /* end evaluation on encountering a throttled cfs_rq */
                if (cfs_rq_throttled(cfs_rq))
                        goto enqueue_throttle;

               /*
                * One parent has been throttled and cfs_rq removed from the
                * list. Add it back to not break the leaf list.
                */
               if (throttled_hierarchy(cfs_rq))
                       list_add_leaf_cfs_rq(cfs_rq);
        }

        /* At this point se is NULL and we are at root level*/
        add_nr_running(rq, 1);

        /*
         * Since new tasks are assigned an initial util_avg equal to
         * half of the spare capacity of their CPU, tiny tasks have the
         * ability to cross the overutilized threshold, which will
         * result in the load balancer ruining all the task placement
         * done by EAS. As a way to mitigate that effect, do not account
         * for the first enqueue operation of new tasks during the
         * overutilized flag detection.
         *
         * A better way of solving this problem would be to wait for
         * the PELT signals of tasks to converge before taking them
         * into account, but that is not straightforward to implement,
         * and the following generally works well enough in practice.
         */
        if (!task_new)
                update_overutilized_status(rq);

enqueue_throttle:
        if (cfs_bandwidth_used()) {
                /*
                 * When bandwidth control is enabled; the cfs_rq_throttled()
                 * breaks in the above iteration can result in incomplete
                 * leaf list maintenance, resulting in triggering the assertion
                 * below.
                 */
                for_each_sched_entity(se) {
                        cfs_rq = cfs_rq_of(se);

                        if (list_add_leaf_cfs_rq(cfs_rq))
                                break;
                }
        }

        assert_list_leaf_cfs_rq(rq);

        hrtick_update(rq);
}
static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
        bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATED);
        bool curr = cfs_rq->curr == se;

        /*
         * If we're the current task, we must renormalise before calling
         * update_curr().
         */
        if (renorm && curr)
                se->vruntime += cfs_rq->min_vruntime;
----update the current task's vruntime and the run queue's min_vruntime.
        update_curr(cfs_rq);

        /*
         * Otherwise, renormalise after, such that we're placed at the current
         * moment in time, instead of some random moment in the past. Being
         * placed in the past could significantly boost this task to the
         * fairness detriment of existing tasks.
         */
        if (renorm && !curr)
                se->vruntime += cfs_rq->min_vruntime;

        /*
         * When enqueuing a sched_entity, we must:
         *   - Update loads to have both entity and cfs_rq synced with now.
         *   - Add its load to cfs_rq->runnable_avg
         *   - For group_entity, update its weight to reflect the new share of
         *     its group cfs_rq
         *   - Add its new weight to cfs_rq->load.weight
         */
--compute the entity's load contribution and add it to the run queue's total average load cfs_rq->runnable_load_avg.
        update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
        se_update_runnable(se);
        update_cfs_group(se);
        account_entity_enqueue(cfs_rq, se);

        if (flags & ENQUEUE_WAKEUP)--handle a task that has just been woken up.
-a woken task gets some compensation, at most half a scheduling period: its vruntime is reduced by up to half a period.
                place_entity(cfs_rq, se, 0);

        check_schedstat_required();
        update_stats_enqueue(cfs_rq, se, flags);
        check_spread(cfs_rq, se);
        if (!curr)------------insert the scheduling entity se into the red-black tree of the CFS run queue.
                __enqueue_entity(cfs_rq, se);
        se->on_rq = 1;---------------mark the scheduling entity as being on the CFS run queue.

        /*
         * When bandwidth control is enabled, cfs might have been removed
         * because of a parent been throttled but cfs->nr_running > 1. Try to
         * add it unconditionnally.
         */
        if (cfs_rq->nr_running == 1 || cfs_bandwidth_used())
                list_add_leaf_cfs_rq(cfs_rq);

        if (cfs_rq->nr_running == 1)
                check_enqueue_throttle(cfs_rq);
}

Summary:

The CFS scheduler enqueues tasks onto the CFS queue with enqueue_task_fair() and dequeues them with dequeue_task_fair().

(1) The core logic of enqueue and dequeue has two parts:

1) updating runtime data such as load, weight and group-scheduling shares;

2) inserting the sched_entity into, or removing it from, the red-black tree.

This involves CPU load tracking, task_group group scheduling and CFS bandwidth control, which are analysed later.

5. Process scheduling

__schedule() is the core of the scheduler; its job is to pick a suitable process and switch to it. The call path is:

__schedule()
  ->pick_next_task()
    ->pick_next_task_fair()
  ->context_switch()
    ->switch_mm()
      ->cpu_v7_switch_mm()
    ->switch_to()
      ->__switch_to
5.1 When scheduling happens

There are three kinds of scheduling points:

1. Blocking operations: mutexes, semaphores, wait queues and so on.

2. On return from an interrupt and on return from a system call to user space, the TIF_NEED_RESCHED flag is checked to decide whether to reschedule.

3. A task that is about to be woken up does not call schedule() right away; it is added to the CFS run queue and TIF_NEED_RESCHED is set. When is the woken task actually scheduled? That depends on whether the kernel is preemptible (CONFIG_PREEMPT=y).

3.1 If the kernel is preemptible (as the kernel discussed here is):

  • If the wakeup happens in system-call or exception context, the next call to preempt_enable() checks whether preemptive scheduling is needed.
  • If the wakeup happens in hard-interrupt context, the check is made just before the interrupt handler returns (regardless of whether the interrupt hit kernel or user space).

3.2 If the kernel is not preemptible:

  • when the current task calls cond_resched(), a scheduling check is made (see the sketch after this list);
  • when it calls schedule() explicitly;
  • on return from a system call or exception handler to user space;
  • on return from an interrupt handler to user space (only checked when the interrupt hit user space).
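
A minimal sketch of the cond_resched() case (struct item and do_work() are hypothetical placeholders; cond_resched() itself is the real kernel helper that checks whether a reschedule is pending and calls into the scheduler if so):

#include <linux/sched.h>

/* hypothetical long-running kernel loop offering voluntary scheduling points */
static void process_items(struct item *items, int n)
{
        int i;

        for (i = 0; i < n; i++) {
                do_work(&items[i]);     /* hypothetical per-item work */
                cond_resched();         /* give up the CPU here if a reschedule is pending */
        }
}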
5.2 preempt_schedule()
#ifdef CONFIG_PREEMPTION
/*
 * This is the entry point to schedule() from in-kernel preemption
 * off of preempt_enable.
 */
asmlinkage __visible void __sched notrace preempt_schedule(void)
{
        /*
         * If there is a non-zero preempt_count or interrupts are disabled,
         * we do not want to preempt the current task. Just return..
         */
        if (likely(!preemptible()))
                return;
        if (!preemptible_lazy())
                return;
        preempt_schedule_common();
}
NOKPROBE_SYMBOL(preempt_schedule);
EXPORT_SYMBOL(preempt_schedule);



static void __sched notrace preempt_schedule_common(void)
{
        do {
                /*
                 * Because the function tracer can trace preempt_count_sub()
                 * and it also uses preempt_enable/disable_notrace(), if
                 * NEED_RESCHED is set, the preempt_enable_notrace() called
                 * by the function tracer will call this function again and
                 * cause infinite recursion.
                 *
                 * Preemption must be disabled here before the function
                 * tracer can trace. Break up preempt_disable() into two
                 * calls. One to disable preemption without fear of being
                 * traced. The other to still record the preemption latency,
                 * which can also be traced by the function tracer.
                 */
                preempt_disable_notrace();
                preempt_latency_start(1);
                __schedule(true, false);
                preempt_latency_stop(1);
                preempt_enable_no_resched_notrace();

                /*
                 * Check again in case we missed a preemption opportunity
                 * between schedule and now.
                 */
        } while (need_resched());
}
5.3 __schedule()

__schedule() calls pick_next_task() to let the scheduler pick the most suitable process, next, from the run queue, and then context_switch() to switch to it.

static void __sched notrace __schedule(bool preempt, bool spinning_lock)
{
        struct task_struct *prev, *next;
        unsigned long *switch_count;
        unsigned long prev_state;
        struct rq_flags rf;
        struct rq *rq;
        int cpu;

        cpu = smp_processor_id();
        rq = cpu_rq(cpu);
        prev = rq->curr;

        schedule_debug(prev, preempt);

        if (sched_feat(HRTICK))
                hrtick_clear(rq);

        local_irq_disable();
        rcu_note_context_switch(preempt);

        /*
         * Make sure that signal_pending_state()->signal_pending() below
         * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
         * done by the caller to avoid the race with signal_wake_up():
         *
         * __set_current_state(@state)          signal_wake_up()
         * schedule()                             set_tsk_thread_flag(p, TIF_SIGPENDING)
         *                                        wake_up_state(p, state)
         *   LOCK rq->lock                          LOCK p->pi_state
         *   smp_mb__after_spinlock()               smp_mb__after_spinlock()
         *     if (signal_pending_state())          if (p->state & @state)
         *
         * Also, the membarrier system call requires a full memory barrier
         * after coming from user-space, before storing to rq->curr.
         */
        rq_lock(rq, &rf);
        smp_mb__after_spinlock();

        /* Promote REQ to ACT */
        rq->clock_update_flags <<= 1;
        update_rq_clock(rq);

        switch_count = &prev->nivcsw;

        /*
         * We must load prev->state once (task_struct::state is volatile), such
         * that:
         *
         *  - we form a control dependency vs deactivate_task() below.
         *  - ptrace_{,un}freeze_traced() can change ->state underneath us.
         */
        prev_state = prev->state;
        if ((!preempt || spinning_lock) && prev_state) {--the current task is not in TASK_RUNNING state
                if (signal_pending_state(prev_state, prev)) {
                        prev->state = TASK_RUNNING;
                } else {
                        prev->sched_contributes_to_load =
                                (prev_state & TASK_UNINTERRUPTIBLE) &&
                                !(prev_state & TASK_NOLOAD) &&
                                !(prev->flags & PF_FROZEN);

                        if (prev->sched_contributes_to_load)
                                rq->nr_uninterruptible++;

                        /*
                         * __schedule()                 ttwu()
                         *   prev_state = prev->state;    if (p->on_rq && ...)
                         *   if (prev_state)                goto out;
                         *     p->on_rq = 0;              smp_acquire__after_ctrl_dep();
                         *                                p->state = TASK_WAKING
                         *
                         * Where __schedule() and ttwu() have matching control dependencies.
                         *
                         * After this, schedule() must not care about p->state any more.
                         */
                        deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);

                        if (prev->in_iowait) {
                                atomic_inc(&rq->nr_iowait);
                                delayacct_blkio_start();
                        }
                }
                switch_count = &prev->nvcsw;
        }
-call pick_next_task_fair() to pick a suitable task from run queue rq and return it as next.
        next = pick_next_task(rq, prev, &rf);
        clear_tsk_need_resched(prev);
        clear_tsk_need_resched_lazy(prev);
        clear_preempt_need_resched();

-if the task to switch in (next) differs from the task to switch out (prev), call context_switch() to perform the context switch
        if (likely(prev != next)) {
                rq->nr_switches++;
                /*
                 * RCU users of rcu_dereference(rq->curr) may not see
                 * changes to task_struct made by pick_next_task().
                 */
                RCU_INIT_POINTER(rq->curr, next);
                /*
                 * The membarrier system call requires each architecture
                 * to have a full memory barrier after updating
                 * rq->curr, before returning to user-space.
                 *
                 * Here are the schemes providing that barrier on the
                 * various architectures:
                 * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
                 *   switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
                 * - finish_lock_switch() for weakly-ordered
                 *   architectures where spin_unlock is a full barrier,
                 * - switch_to() for arm64 (weakly-ordered, spin_unlock
                 *   is a RELEASE barrier),
                 */
                ++*switch_count;

                migrate_disable_switch(rq, prev);
                psi_sched_switch(prev, next, !task_on_rq_queued(prev));

                trace_sched_switch(preempt, prev, next);

                /* Also unlocks the rq: */
                rq = context_switch(rq, prev, next, &rf);
        } else {
                rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);

                rq_unpin_lock(rq, &rf);
                __balance_callbacks(rq);
                raw_spin_unlock_irq(&rq->lock);
        }
}

The two key steps, pick_next_task() (choosing the task to switch in) and context_switch() (performing the switch), are analysed below.

5.3.1 pick_next_task()

pick_next_task() is a wrapper around the pick_next_task methods of the scheduler classes; for the CFS policy it ends up in pick_next_task_fair().

// kernel/kernel/sched/core.c

/*
 * Pick up the highest-prio task:
 */
static inline struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
        const struct sched_class *class;
        struct task_struct *p;

        /*
         * Optimization: we know that if all tasks are in the fair class we can
         * call that function directly, but only if the @prev task wasn't of a
         * higher scheduling class, because otherwise those loose the
         * opportunity to pull in more work from other CPUs.
         */
-if the current task prev belongs to the CFS class and the number of tasks on this CPU's run queue equals the number on its CFS run queue, the run queue contains only normal tasks and no tasks of other classes.
        if (likely(prev->sched_class <= &fair_sched_class &&
                   rq->nr_running == rq->cfs.h_nr_running)) {

                p = pick_next_task_fair(rq, prev, rf);
                if (unlikely(p == RETRY_TASK))
                        goto restart;

                /* Assumes fair_sched_class->next == idle_sched_class */
                if (!p) {
                        put_prev_task(rq, prev);
                        p = pick_next_task_idle(rq);
                }

                return p;
        }

-otherwise every scheduler class is walked, in the priority order stop->deadline->realtime->cfs->idle, which also shows the relative priority of the scheduling policies.
restart:
        put_prev_task_balance(rq, prev, rf);

        for_each_class(class) {
                p = class->pick_next_task(rq);
                if (p)
                        return p;
        }

        /* The idle class should always have a runnable task: */
        BUG();
}

pick_next_task_fair

static struct task_struct *
pick_next_task_fair(struct rq *rq, struct task_struct *prev)
{
    struct cfs_rq *cfs_rq = &rq->cfs;
    struct sched_entity *se;
    struct task_struct *p;
    int new_tasks;

again:
#ifdef CONFIG_FAIR_GROUP_SCHED
...
#endif

    if (!cfs_rq->nr_running)--------------------------------if the CFS run queue has no tasks, pick the idle task.
        goto idle;

    put_prev_task(rq, prev);

    do {
        se = pick_next_entity(cfs_rq, NULL);----------------pick the leftmost task in the red-black tree of the CFS run queue.
        set_next_entity(cfs_rq, se);
        cfs_rq = group_cfs_rq(se);--------------------------with CONFIG_FAIR_GROUP_SCHED, the group's own cfs_rq has to be walked as well; without it, this returns NULL.
    } while (cfs_rq);

    p = task_of(se);

    if (hrtick_enabled(rq))
        hrtick_start_fair(rq, p);

    return p;

idle:
    new_tasks = idle_balance(rq);
    /*
     * Because idle_balance() releases (and re-acquires) rq->lock, it is
     * possible for any higher priority task to appear. In that case we
     * must re-start the pick_next_entity() loop.
     */
    if (new_tasks < 0)
        return RETRY_TASK;

    if (new_tasks > 0)
        goto again;

    return NULL;
}

When CONFIG_FAIR_GROUP_SCHED is not defined, pick_next_entity() is called with curr == NULL, which means it prefers cfs_rq->rb_leftmost.

set_next_entity() points cfs_rq->curr at se and updates se's exec_start and prev_sum_exec_runtime.

static struct sched_entity *
pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
    struct sched_entity *left = __pick_first_entity(cfs_rq);
    struct sched_entity *se;

    /*
     * If curr is set we have to see if its left of the leftmost entity
     * still in the tree, provided there was anything in the tree at all.
     */
    if (!left || (curr && entity_before(curr, left)))-----------------if left does not exist, left becomes curr; or, if curr is not NULL and its vruntime is smaller than left's, left becomes curr.
        left = curr;

    se = left; /* ideally we run the leftmost entity */---------------with curr == NULL, se is the leftmost node of the cfs_rq.
...
    /*
     * Prefer last buddy, try to return the CPU to a preempted task.
     */
    if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)----if cfs_rq->last exists and choosing it over left would not be unfair, se becomes cfs_rq->last.
        se = cfs_rq->last;

    /*
     * Someone really wants this to run. If it's not unfair, run it.
     */
    if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)----same check for cfs_rq->next: if choosing it over left is not unfair, se becomes cfs_rq->next.
        se = cfs_rq->next;

    clear_buddies(cfs_rq, se);

    return se;
}

static void
set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    /* 'current' is not kept within the tree. */
    if (se->on_rq) {-----------------------------------------------------if the scheduling entity is still in the run queue's tree, remove it.
        /*
         * Any task has to be enqueued before it get to execute on
         * a CPU. So account for the time it spent waiting on the
         * runqueue.
         */
        update_stats_wait_end(cfs_rq, se);
        __dequeue_entity(cfs_rq, se);
    }

    update_stats_curr_start(cfs_rq, se);
    cfs_rq->curr = se;
#ifdef CONFIG_SCHEDSTATS
    /*
     * Track our maximum slice length, if the CPU's load is at
     * least twice that of our own weight (i.e. dont track it
     * when there are only lesser-weight tasks around):
     */
    if (rq_of(cfs_rq)->load.weight >= 2*se->load.weight) {
        se->statistics.slice_max = max(se->statistics.slice_max,
            se->sum_exec_runtime - se->prev_sum_exec_runtime);
    }
#endif
    se->prev_sum_exec_runtime = se->sum_exec_runtime;
}

Note one detail above: when a scheduling entity that is on_rq is picked to run, it is removed from the red-black tree (the current task is not kept in the tree).

Summary: every time schedule() runs, the scheduler must choose the next task to execute. In CFS this is done by pick_next_task_fair(); the flow is:

(1) When a task switch is needed, pick_next_task_fair() receives the task to be switched out, prev, as a parameter;

(2) If prev is not a normal process, i.e. its scheduler class is not CFS, it does not use a sched_entity to take part in CFS scheduling; the simple path is taken and put_prev_task() is used to tell the system the current task is being switched out, rather than put_prev_entity();

(3) If prev is a normal process, pick_next_entity() selects the next task. There are two cases: when the scheduling entity corresponds to a task, the do-while loop runs once; when it corresponds to a task_group, the group hierarchy has to be walked to choose the next task;

(4) put_prev_entity() does the preparation before the switch: it updates the runtime statistics, does not dequeue, and sets the CFS queue's curr pointer to NULL;

(5) set_next_entity() sets the next scheduling entity to run and sets the CFS queue's curr pointer;

(6) If hrtick is enabled, the hrtimer expiry is set to the entity's remaining runtime;

(7) The priority order of the scheduler classes shows up in pick_next_task().

5.3.2 context_switch()

context_switch() takes four parameters: rq is the run queue the switch happens on, prev is the task being switched out, next is the task being switched in, and rf is the rq_flags.

/*
 * context_switch - switch to the new MM and the new thread's register state.
 */
static __always_inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
               struct task_struct *next, struct rq_flags *rf)
{
    // paired with finish_task_switch(); among other things, next->on_cpu is set to 1.
        prepare_task_switch(rq, prev, next);

        /*
         * For paravirt, this is coupled with an exit in switch_to to
         * combine the page table reload and the switch backend into
         * one hypercall.
         */
        arch_start_context_switch(prev);

        /*
         * kernel -> kernel   lazy + transfer active
         *   user -> kernel   lazy + mmgrab() active
         *
         * kernel ->   user   switch + mmdrop() active
         *   user ->   user   switch
         */
        if (!next->mm) { --a kernel thread has no user address space   // to kernel
                enter_lazy_tlb(prev->active_mm, next);

//scheduling needs an address space to run on, so a kernel thread borrows one; that is what active_mm is for. Why not use prev->mm? Because prev may also be a kernel thread.
                next->active_mm = prev->active_mm;
                if (prev->mm)                           // from user
                        mmgrab(prev->active_mm);
                else
                        prev->active_mm = NULL;
        } else {                                          // to user: next is a normal process
                membarrier_switch_mm(rq, prev->active_mm, next->mm);
                /*
                 * sys_membarrier() requires an smp_mb() between setting
                 * rq->curr / membarrier_switch_mm() and returning to userspace.
                 *
                 * The below provides this either through switch_mm(), or in
                 * case 'prev->active_mm == next->mm' through
                 * finish_task_switch()'s mmdrop().
                 */
-for a normal process, switch_mm() must be called to do the address-space switch.
                switch_mm_irqs_off(prev->active_mm, next->mm, next);

--if prev is a kernel thread, prev->active_mm is cleared and rq->prev_mm remembers prev->active_mm.
                if (!prev->mm) {                        // from kernel
                        /* will mmdrop() in finish_task_switch(). */
                        rq->prev_mm = prev->active_mm;
                        prev->active_mm = NULL;
                }
        }

        rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);

        prepare_lock_switch(rq, next, rf);

        /* Here we just switch the register state and the stack. */
--switch from prev to next. When this returns, the CPU is running next and prev has been scheduled out ("put to sleep").
        switch_to(prev, next, prev);
        barrier();

--clean-up after the switch: prev->on_cpu is cleared and old_mm->mm_count is dropped; next finishes prev's leftovers.
        return finish_task_switch(prev);
}

switch_mm_irqs_off() is architecture specific; on arm64, switch_mm_irqs_off = switch_mm().

switch_mm() and switch_to() are both tightly coupled to the architecture.

switch_mm() installs the new process's page-table base address into the page-table base register.

switch_mm() first sets the current CPU in the next process's cpumask and then calls check_and_switch_context() to do the ARM-specific hardware work, such as flushing the TLB.

//kernel/arch/arm64/include/asm/mmu_context.h

static inline void
switch_mm(struct mm_struct *prev, struct mm_struct *next,
          struct task_struct *tsk)
{
        if (prev != next)
                __switch_mm(next);

        /*
         * Update the saved TTBR0_EL1 of the scheduled-in task as the previous
         * value may have not been initialised yet (activate_mm caller) or the
         * ASID has changed since the last run (following the context switch
         * of another thread of the same process).
         */
        update_saved_ttbr0(tsk, next);
}

switch_to() eventually calls __switch_to():

/*
 * Thread switching.
 */
__notrace_funcgraph struct task_struct *__switch_to(struct task_struct *prev,
                                struct task_struct *next)
{
        struct task_struct *last;

        fpsimd_thread_switch(next);
        tls_thread_switch(next);
        hw_breakpoint_thread_switch(next);
        contextidr_thread_switch(next);
        entry_task_switch(next);
        uao_thread_switch(next);
        ssbs_thread_switch(next);
        erratum_1418040_thread_switch(next);

        /*
         * Complete any pending TLB or cache maintenance on this CPU in case
         * the thread migrates to a different CPU.
         * This full barrier is also required by the membarrier system
         * call.
         */
        dsb(ish);

        /*
         * MTE thread switching must happen after the DSB above to ensure that
         * any asynchronous tag check faults have been logged in the TFSR*_EL1
         * registers.
         */
        mte_thread_switch(next);

        /* the actual thread switch */
        last = cpu_switch_to(prev, next);

        return last;
}

Here prev's register context is saved into its cpu_context structure (thread_info->cpu_context on 32-bit ARM; on arm64 it lives in task_struct->thread.cpu_context), and the values in next's cpu_context are loaded into the physical CPU registers, which switches the stack and completes the task switch.
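
For reference, the arm64 cpu_context as defined in the arm64 headers, shown here as a reminder of exactly which registers cpu_switch_to() saves and restores (the callee-saved x19-x28 plus fp, sp and pc):

struct cpu_context {
        unsigned long x19;
        unsigned long x20;
        unsigned long x21;
        unsigned long x22;
        unsigned long x23;
        unsigned long x24;
        unsigned long x25;
        unsigned long x26;
        unsigned long x27;
        unsigned long x28;
        unsigned long fp;
        unsigned long sp;
        unsigned long pc;
};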

5.4 Red-black tree operations on sched_entity

CFS manages scheduling entities with a red-black tree whose key is sched_entity->vruntime.

__enqueue_entity() inserts a scheduling entity se into the cfs_rq run queue, specifically into the red-black tree cfs_rq->tasks_timeline.

/*
 * Enqueue an entity into the rb-tree:
 */
static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;----------------------start from the root of cfs_rq->tasks_timeline; note that this is not necessarily the leftmost node.
    struct rb_node *parent = NULL;
    struct sched_entity *entry;
    int leftmost = 1;

    /*
     * Find the right place in the rbtree:
     */
    while (*link) {---------------------------------------------------------------walk the tree from the root until an empty insertion point is found.
        parent = *link;
        entry = rb_entry(parent, struct sched_entity, run_node);------------------get the scheduling entity that parent corresponds to
        /*
         * We dont care about collisions. Nodes with
         * the same key stay together.
         */
        if (entity_before(se, entry)) {-------------------------------------------true if se->vruntime < entry->vruntime; the insertion point moves to entry's left child.
            link = &parent->rb_left;
        } else {------------------------------------------------------------------otherwise descend into entry's right subtree and clear leftmost.
            link = &parent->rb_right;
            leftmost = 0;
        }
    }

    /*
     * Maintain a cache of leftmost tree entries (it is frequently
     * used):
     */
    if (leftmost)----------------------------------------------------------------if the new node ends up leftmost, update the cached cfs_rq->rb_leftmost.
        cfs_rq->rb_leftmost = &se->run_node;

    rb_link_node(&se->run_node, parent, link);-----------------------------------link se->run_node into the tree at the insertion point found above.
    rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);---------------------recolor/rebalance the tree after the insertion.
}
static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    if (cfs_rq->rb_leftmost == &se->run_node) {---------------------------------if the node being removed is cfs_rq->rb_leftmost, update rb_leftmost to its successor before erasing.
        struct rb_node *next_node;

        next_node = rb_next(&se->run_node);
        cfs_rq->rb_leftmost = next_node;
    }

    rb_erase(&se->run_node, &cfs_rq->tasks_timeline);---------------------------remove se->run_node from cfs_rq->tasks_timeline.
}

struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)-----------------return the scheduling entity for cfs_rq->rb_leftmost, i.e. the entity with the smallest vruntime.
{
    struct rb_node *left = cfs_rq->rb_leftmost;

    if (!left)
        return NULL;

    return rb_entry(left, struct sched_entity, run_node);
}

static struct sched_entity *__pick_next_entity(struct sched_entity *se)----------return the scheduling entity immediately to the right of se (its in-order successor).
{
    struct rb_node *next = rb_next(&se->run_node);

    if (!next)
        return NULL;

    return rb_entry(next, struct sched_entity, run_node);
}

struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)------------------return the rightmost scheduling entity of cfs_rq, i.e. the one with the largest vruntime.
{
    struct rb_node *last = rb_last(&cfs_rq->tasks_timeline);--------------------rb_last() keeps following right children until it reaches the rightmost node.

    if (!last)
        return NULL;

    return rb_entry(last, struct sched_entity, run_node);
}

static inline int entity_before(struct sched_entity *a,
                struct sched_entity *b)
{
    return (s64)(a->vruntime - b->vruntime) < 0;----------------------------compare a->vruntime with b->vruntime; returns true if a is "before" (smaller than) b.
}

六、schedule tick

Ticks come in two flavours, periodic and one-shot; the underlying clock event devices are registered with clockevents_register_device().

Broadcast and non-broadcast clocks differ in how the device's clock_event_device->cpumask is set.

clockevents_register_device()
  ->tick_check_new_device()
    ->tick_setup_device()
      ->tick_setup_periodic()----------------------------if tick_device->mode is TICKDEV_MODE_PERIODIC, register a periodic tick.
        ->tick_set_periodic_handler()
          ->tick_handle_periodic()-----------------------periodic tick handler.
          ->tick_handle_periodic_broadcast()-------------periodic broadcast tick handler.
      ->tick_setup_oneshot()-----------------------------if tick_device->mode is TICKDEV_MODE_ONESHOT, register a one-shot tick.

tick_set_periodic_handler() sets the event_handler of struct clock_event_device to tick_handle_periodic().

The above covers tick registration. Ticks are driven by interrupts: the interrupt handler ends up calling clock_event_device->event_handler().

For the periodic tick the call chain is tick_handle_periodic() --> tick_periodic() --> update_process_times() --> scheduler_tick().

The details of that path are covered in another article: linux之调度管理(2)-调度器 如何触发运行-CSDN博客.

CFS scheduling tick

The CFS tick function is task_tick_fair(); it is called on every scheduler tick, and also from the hrtick path when high-resolution timers are used. The flow is roughly as follows (sketched below):
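A minimal sketch of that flow, assuming the mainline structure of task_tick_fair()/entity_tick() in kernel/sched/fair.c (hrtick, NUMA balancing and other details omitted). scheduler_tick() invokes curr->sched_class->task_tick(), which for a CFS task is task_tick_fair():

static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
    struct sched_entity *se = &curr->se;
    struct cfs_rq *cfs_rq;

    for_each_sched_entity(se) {                   /* curr->se first, then its group entities */
        cfs_rq = cfs_rq_of(se);
        entity_tick(cfs_rq, se, queued);
    }
}

static void entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
    update_curr(cfs_rq);                          /* account the running entity's vruntime */
    update_load_avg(cfs_rq, curr, UPDATE_TG);     /* refresh PELT load tracking */

    if (cfs_rq->nr_running > 1)
        check_preempt_tick(cfs_rq, curr);         /* request a resched if the slice is used up */
}

So each tick updates vruntime and load statistics at every level of the entity hierarchy, and check_preempt_tick() decides whether the current task has exhausted its ideal slice and should be preempted.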

七、Group scheduling

The scheduling granularity of the CFS scheduler is the process; in some scenarios it is desirable for the granularity to be a group of processes.

Fairness is then applied between groups, and again between the scheduling entities inside each group. For example, with two groups of equal weight each group gets half of the CPU regardless of how many tasks it contains, and that half is shared fairly among the group's own tasks. Group scheduling exists to meet this kind of requirement.

The CFS scheduler defines a data structure, struct task_group, to represent a scheduling group.

/* task group related information */
struct task_group {
    struct cgroup_subsys_state css;

#ifdef CONFIG_FAIR_GROUP_SCHED
    /* schedulable entities of this group on each cpu */
    struct sched_entity **se;
    /* runqueue "owned" by this group on each cpu */
    struct cfs_rq **cfs_rq;
    unsigned long shares;

#ifdef    CONFIG_SMP
    atomic_long_t load_avg;
    atomic_t runnable_avg;
#endif
#endif

#ifdef CONFIG_RT_GROUP_SCHED
    struct sched_rt_entity **rt_se;
    struct rt_rq **rt_rq;

    struct rt_bandwidth rt_bandwidth;
#endif

    struct rcu_head rcu;
    struct list_head list;

    struct task_group *parent;
    struct list_head siblings;
    struct list_head children;

#ifdef CONFIG_SCHED_AUTOGROUP
    struct autogroup *autogroup;
#endif

    struct cfs_bandwidth cfs_bandwidth;
};

7.1 Creating a scheduling group

Group scheduling belongs to the cpu subsystem of the cgroup framework; CONFIG_CGROUP_SCHED and CONFIG_FAIR_GROUP_SCHED must be enabled in the kernel configuration.

The interface for creating a scheduling group is sched_create_group().

//kernel/kernel/sched/core.c

/* allocate runqueue etc for a new task group */

-parent is the parent group node; the system has one root scheduling group, root_task_group.
struct task_group *sched_create_group(struct task_group *parent)
{
        struct task_group *tg;

-allocate the task_group structure.
        tg = kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
        if (!tg)
                return ERR_PTR(-ENOMEM);

-create the group-scheduling data structures needed by the cfs scheduler.
        if (!alloc_fair_sched_group(tg, parent))
                goto err;

-create the group-scheduling data structures needed by the rt scheduler.
        if (!alloc_rt_sched_group(tg, parent))
                goto err;

        alloc_uclamp_sched_group(tg, parent);

        return tg;

err:
        sched_free_group(tg);
        return ERR_PTR(-ENOMEM);
}

alloc_fair_sched_group() creates the group-scheduling data structures used by the cfs scheduler.

int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
{
    struct cfs_rq *cfs_rq;
    struct sched_entity *se;
    int i;

    tg->cfs_rq = kzalloc(sizeof(cfs_rq) * nr_cpu_ids, GFP_KERNEL);--allocate an array of nr_cpu_ids cfs_rq pointers; note sizeof(cfs_rq) is a pointer size here, not sizeof(struct cfs_rq).
    if (!tg->cfs_rq)
        goto err;
    tg->se = kzalloc(sizeof(se) * nr_cpu_ids, GFP_KERNEL);----------allocate an array of nr_cpu_ids sched_entity pointers; again not sizeof(struct sched_entity).
    if (!tg->se)
        goto err;

    tg->shares = NICE_0_LOAD;---------------------------------------initialise the group's weight to the weight of nice 0.

    init_cfs_bandwidth(tg_cfs_bandwidth(tg));

    for_each_possible_cpu(i) {--------------------------------------iterate over all possible CPUs and allocate one struct cfs_rq run queue and one struct sched_entity per CPU.
        cfs_rq = kzalloc_node(sizeof(struct cfs_rq),----------------the arrays above hold pointers; here the real struct cfs_rq and struct sched_entity objects are allocated per CPU.
                      GFP_KERNEL, cpu_to_node(i));
        if (!cfs_rq)
            goto err;

        se = kzalloc_node(sizeof(struct sched_entity),
                  GFP_KERNEL, cpu_to_node(i));
        if (!se)
            goto err_free_rq;

        init_cfs_rq(cfs_rq);----------------------------------------initialise the cfs_rq run queue: tasks_timeline, min_vruntime, etc.
        init_tg_cfs_entry(tg, cfs_rq, se, i, parent->se[i]);--------the key function that wires up the group-scheduling structures.
    }

    return 1;

err_free_rq:
    kfree(cfs_rq);
err:
    return 0;
}

init_cfs_rq() initialises the cfs_rq's tasks_timeline red-black tree and min_vruntime.

init_tg_cfs_entry() is the key function that builds the group-scheduling structure: it links the tg, the cfs_rq and the se together.

void init_cfs_rq(struct cfs_rq *cfs_rq)
{
    cfs_rq->tasks_timeline = RB_ROOT;
    cfs_rq->min_vruntime = (u64)(-(1LL << 20));
#ifndef CONFIG_64BIT
    cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
#endif
#ifdef CONFIG_SMP
    atomic64_set(&cfs_rq->decay_counter, 1);
    atomic_long_set(&cfs_rq->removed_load, 0);
#endif
}

void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
            struct sched_entity *se, int cpu,
            struct sched_entity *parent)
{
    struct rq *rq = cpu_rq(cpu);

    cfs_rq->tg = tg;
    cfs_rq->rq = rq;
    init_cfs_rq_runtime(cfs_rq);

    tg->cfs_rq[cpu] = cfs_rq;-----------------------------associate the pointer arrays allocated in alloc_fair_sched_group() with the per-CPU objects.
    tg->se[cpu] = se;

    /* se could be NULL for root_task_group */
    if (!se)
        return;

    if (!parent) {
        se->cfs_rq = &rq->cfs;
        se->depth = 0;
    } else {
        se->cfs_rq = parent->my_q;
        se->depth = parent->depth + 1;
    }

    se->my_q = cfs_rq;------------------------------------my_q is only set for group entities: it points at the group's own cfs_rq.
    /* guarantee group entities always have weight */
    update_load_set(&se->load, NICE_0_LOAD);
    se->parent = parent;
}

7.2 Adding a process to a scheduling group

A process is attached to a scheduling group through cpu_cgroup_attach(), one of the callbacks of cpu_cgrp_subsys.

struct cgroup_subsys cpu_cgrp_subsys = {
        .css_alloc      = cpu_cgroup_css_alloc,
        .css_online     = cpu_cgroup_css_online,
        .css_released   = cpu_cgroup_css_released,
        .css_free       = cpu_cgroup_css_free,
        .css_extra_stat_show = cpu_extra_stat_show,
        .fork           = cpu_cgroup_fork,
        .can_attach     = cpu_cgroup_can_attach,
        .attach         = cpu_cgroup_attach,
        .legacy_cftypes = cpu_legacy_files,
        .dfl_cftypes    = cpu_files,
        .early_init     = true,
        .threaded       = true,
};

static void cpu_cgroup_attach(struct cgroup_taskset *tset)
{
        struct task_struct *task;
        struct cgroup_subsys_state *css;

        cgroup_taskset_for_each(task, css, tset)-iterate over the list of tasks contained in tset.
                sched_move_task(task);--move the task into the scheduling group.
}

sched_move_task:

void sched_move_task(struct task_struct *tsk)
{
    struct task_group *tg;
    int queued, running;
    unsigned long flags;
    struct rq *rq;

    rq = task_rq_lock(tsk, &flags);

    running = task_current(rq, tsk);--------------------------is tsk currently running on this rq?
    queued = task_on_rq_queued(tsk);--------------------------is tsk on a run queue? tsk->on_rq == TASK_ON_RQ_QUEUED means it is queued.

    if (queued)
        dequeue_task(rq, tsk, 0);-----------------------------if the task is queued, temporarily remove it from its run queue first.
    if (unlikely(running))------------------------------------if the task is currently running, also call put_prev_task() so the scheduler class stops treating it as the running task; it is made current again with set_curr_task() below.
        put_prev_task(rq, tsk);

    /*
     * All callers are synchronized by task_rq_lock(); we do not use RCU
     * which is pointless here. Thus, we pass "true" to task_css_check()
     * to prevent lockdep warnings.
     */
    tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
              struct task_group, css);
    tg = autogroup_task_group(tsk, tg);
    tsk->sched_task_group = tg;

#ifdef CONFIG_FAIR_GROUP_SCHED
    if (tsk->sched_class->task_move_group)
        tsk->sched_class->task_move_group(tsk, queued);
    else
#endif
        set_task_rq(tsk, task_cpu(tsk));---------------------point tsk's scheduling entity's cfs_rq and parent at the new task_group's per-CPU cfs_rq and se.

    if (unlikely(running))
        tsk->sched_class->set_curr_task(rq);
    if (queued)
        enqueue_task(rq, tsk, 0);

    task_rq_unlock(rq, tsk, &flags);
}

static void task_move_group_fair(struct task_struct *p, int queued)
{
    struct sched_entity *se = &p->se;
    struct cfs_rq *cfs_rq;

    if (!queued && (!se->sum_exec_runtime || p->state == TASK_WAKING))
        queued = 1;

    if (!queued)
        se->vruntime -= cfs_rq_of(se)->min_vruntime;
    set_task_rq(p, task_cpu(p));
    se->depth = se->parent ? se->parent->depth + 1 : 0;
    if (!queued) {
        cfs_rq = cfs_rq_of(se);
        se->vruntime += cfs_rq->min_vruntime;
#ifdef CONFIG_SMP
        se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
        cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
#endif
    }
}

static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
{
#if defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED)
    struct task_group *tg = task_group(p);----------get the task_group of the current process.
#endif

#ifdef CONFIG_FAIR_GROUP_SCHED
    p->se.cfs_rq = tg->cfs_rq[cpu];-----------------set the scheduling entity's cfs_rq and parent for this CPU.
    p->se.parent = tg->se[cpu];
#endif...
}

static void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
{
    update_rq_clock(rq);
    sched_info_queued(rq, p);
    p->sched_class->enqueue_task(rq, p, flags);
}

This again ends up in enqueue_task_fair():

static void
enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
        struct cfs_rq *cfs_rq;
        struct sched_entity *se = &p->se;
        int idle_h_nr_running = task_has_idle_policy(p);
        int task_new = !(flags & ENQUEUE_WAKEUP);

        /*
         * The code below (indirectly) updates schedutil which looks at
         * the cfs_rq utilization to select a frequency.
         * Let's add the task's estimated utilization to the cfs_rq's
         * estimated utilization, before we update schedutil.
         */
        util_est_enqueue(&rq->cfs, p);

        /*
         * If in_iowait is set, the code below may not trigger any cpufreq
         * utilization updates, so do it here explicitly with the IOWAIT flag
         * passed.
         */
        if (p->in_iowait)
                cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);

--With CONFIG_FAIR_GROUP_SCHED enabled, for_each_sched_entity() walks the task's scheduling entity and then its parent group entities: the first iteration handles p->se, later iterations handle the corresponding group entities tg->se[].

        for_each_sched_entity(se) {
                if (se->on_rq)
                        break;
                cfs_rq = cfs_rq_of(se);
                enqueue_entity(cfs_rq, se, flags);

                cfs_rq->h_nr_running++;
                cfs_rq->idle_h_nr_running += idle_h_nr_running;

                /* end evaluation on encountering a throttled cfs_rq */
                if (cfs_rq_throttled(cfs_rq))
                        goto enqueue_throttle;

                flags = ENQUEUE_WAKEUP;
        }

        for_each_sched_entity(se) {
                cfs_rq = cfs_rq_of(se);

                update_load_avg(cfs_rq, se, UPDATE_TG);
                se_update_runnable(se);
                update_cfs_group(se);

                cfs_rq->h_nr_running++;
                cfs_rq->idle_h_nr_running += idle_h_nr_running;

                /* end evaluation on encountering a throttled cfs_rq */
                if (cfs_rq_throttled(cfs_rq))
                        goto enqueue_throttle;

               /*
                * One parent has been throttled and cfs_rq removed from the
                * list. Add it back to not break the leaf list.
                */
               if (throttled_hierarchy(cfs_rq))
                       list_add_leaf_cfs_rq(cfs_rq);
        }

        /* At this point se is NULL and we are at root level*/
        add_nr_running(rq, 1);

        /*
         * Since new tasks are assigned an initial util_avg equal to
         * half of the spare capacity of their CPU, tiny tasks have the
         * ability to cross the overutilized threshold, which will
         * result in the load balancer ruining all the task placement
         * done by EAS. As a way to mitigate that effect, do not account
         * for the first enqueue operation of new tasks during the
         * overutilized flag detection.
         *
         * A better way of solving this problem would be to wait for
         * the PELT signals of tasks to converge before taking them
         * into account, but that is not straightforward to implement,
         * and the following generally works well enough in practice.
         */
        if (!task_new)
                update_overutilized_status(rq);

enqueue_throttle:
        if (cfs_bandwidth_used()) {
                /*
                 * When bandwidth control is enabled; the cfs_rq_throttled()
                 * breaks in the above iteration can result in incomplete
                 * leaf list maintenance, resulting in triggering the assertion
                 * below.
                 */
                for_each_sched_entity(se) {
                        cfs_rq = cfs_rq_of(se);

                        if (list_add_leaf_cfs_rq(cfs_rq))
                                break;
                }
        }

        assert_list_leaf_cfs_rq(rq);

        hrtick_update(rq);
}

The basic strategy of group scheduling is as follows (see the sketch after this list):

  • When a scheduling group tg is created, tg allocates a private cfs_rq run queue for each CPU.
  • The group itself is inserted as a scheduling entity into the system cfs run queue rq->cfs_rq.
  • Once a process joins a group, it leaves the system cfs run queue and is enqueued on the group's own cfs run queue tg->cfs_rq[].
  • When picking the next process, the walk starts at the system cfs run queue; if the selected entity is a group tg, the walk continues inside that group's run queue until an actual process is chosen.
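A minimal sketch of that descent, assuming the simplified structure of pick_next_task_fair() (group_cfs_rq() and __pick_first_entity() are real kernel helpers; bandwidth throttling and the "pick next vs. curr" fast-path details are omitted):

static struct task_struct *pick_next_task_fair_sketch(struct rq *rq)
{
    struct cfs_rq *cfs_rq = &rq->cfs;
    struct sched_entity *se;

    if (!cfs_rq->nr_running)
        return NULL;

    do {
        se = __pick_first_entity(cfs_rq); /* leftmost entity: smallest vruntime */
        cfs_rq = group_cfs_rq(se);        /* se->my_q: non-NULL only for group entities */
    } while (cfs_rq);                     /* keep descending while the pick is a group */

    return task_of(se);                   /* the entity embedded in a task_struct */
}

The loop terminates because only group entities have a non-NULL my_q; a task entity always ends the descent.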

 


Original article: https://blog.csdn.net/z20230508/article/details/143703918
