Linux Kernel Internals (2)
--------------------------------------------------------------------------------
2. Process and Interrupt Management
2.1 Task Structure and Process Table
Every process under Linux is dynamically allocated a struct task_struct structure. The maximum number of processes that can be created is limited only by the amount of physical memory present (see kernel/fork.c:fork_init()):
--------------------------------------------------------------------------------
/*
* The default maximum number of threads is set to a safe
* value: the thread structures can take up at most half
* of memory.
*/
max_threads = mempages / (THREAD_SIZE/PAGE_SIZE) / 2;
--------------------------------------------------------------------------------
which, on the IA32 architecture, basically means num_physpages/4. As an example, on a 512M machine you can create 32k threads: 512M of 4K pages gives mempages = 131072, the 8K kernel stack means THREAD_SIZE/PAGE_SIZE = 2, so max_threads = 131072/2/2 = 32768. This is a considerable improvement over the 4k-epsilon limit of older (2.2 and earlier) kernels. Moreover, it can be changed at runtime using the KERN_MAX_THREADS sysctl(2), or simply through the procfs interface to kernel tunables:
--------------------------------------------------------------------------------
# cat /proc/sys/kernel/threads-max
32764
# echo 100000 > /proc/sys/kernel/threads-max
# cat /proc/sys/kernel/threads-max
100000
# gdb -q vmlinux /proc/kcore
Core was generated by `BOOT_IMAGE=240ac18 ro root=306 video=matrox:vesa:0x118'.
#0 0x0 in ?? ()
(gdb) p max_threads
$1 = 100000
--------------------------------------------------------------------------------
The set of processes on a Linux system is represented as a collection of struct task_struct structures, which are linked in two ways:
as a hashtable, hashed by pid, and
as a circular, doubly-linked list using the p->next_task and p->prev_task pointers.
The hashtable is called pidhash[] and is defined in include/linux/sched.h:
--------------------------------------------------------------------------------
/* PID hashing. (shouldnt this be dynamic?) */
#define PIDHASH_SZ (4096 >> 2)
extern struct task_struct *pidhash[PIDHASH_SZ];
#define pid_hashfn(x) ((((x) >> 8) ^ (x)) & (PIDHASH_SZ - 1))
--------------------------------------------------------------------------------
The tasks are hashed by their pid value, and the above hashing function is supposed to distribute the elements uniformly over their domain (0 to PID_MAX-1). The hashtable is used to quickly find a task by a given pid, using the find_task_by_pid() inline function from include/linux/sched.h:
--------------------------------------------------------------------------------
static inline struct task_struct *find_task_by_pid(int pid)
{
struct task_struct *p, **htable = &pidhash[pid_hashfn(pid)];
for(p = *htable; p && p->pid != pid; p = p->pidhash_next)
;
return p;
}
--------------------------------------------------------------------------------
The tasks on each hash list (i.e. those that hash to the same value) are linked by p->pidhash_next/pidhash_pprev, which hash_pid() and unhash_pid() use to insert a given process into, and remove it from, the hashtable. All of this is done under protection of the read-write spinlock called tasklist_lock.
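For illustration, head insertion into a hash chain and removal from it can be sketched as follows (a reconstruction based on the fields described above, not necessarily the verbatim kernel source):
--------------------------------------------------------------------------------
/* Insert p at the head of its hash chain. */
static inline void hash_pid(struct task_struct *p)
{
	struct task_struct **htable = &pidhash[pid_hashfn(p->pid)];

	if ((p->pidhash_next = *htable) != NULL)
		(*htable)->pidhash_pprev = &p->pidhash_next;
	*htable = p;
	p->pidhash_pprev = htable;
}

/* Remove p from its hash chain; pidhash_pprev points back at
 * whatever pointer (head slot or predecessor's next) leads to p. */
static inline void unhash_pid(struct task_struct *p)
{
	if (p->pidhash_next)
		p->pidhash_next->pidhash_pprev = p->pidhash_pprev;
	*p->pidhash_pprev = p->pidhash_next;
}
--------------------------------------------------------------------------------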
The circular doubly-linked list built on p->next_task/prev_task is maintained so that one can walk through all tasks on the system easily. This is achieved by the for_each_task() macro from include/linux/sched.h:
--------------------------------------------------------------------------------
#define for_each_task(p) \
	for (p = &init_task ; (p = p->next_task) != &init_task ; )
--------------------------------------------------------------------------------
Users of for_each_task() should take tasklist_lock for READ. Note that for_each_task() uses init_task to mark the beginning (and end) of the list - this is safe because the idle task (pid 0) never exits.
The modifiers of the process hashtable and/or the process table links, notably fork(), exit() and ptrace(), must take tasklist_lock for WRITE. What is more interesting is that the writers must also disable interrupts on the local CPU. The reason is not trivial: the send_sigio() function walks the task list and thus takes tasklist_lock for READ, and it is called from kill_fasync() in interrupt context. This is why writers must disable interrupts while readers don't need to.
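In practice, a reader and a writer of the task list follow this locking pattern (a minimal sketch; count_tasks() and modify_task_links() are hypothetical illustrations):
--------------------------------------------------------------------------------
/* Reader: several readers may hold tasklist_lock concurrently. */
static int count_tasks(void)
{
	struct task_struct *p;
	int n = 0;

	read_lock(&tasklist_lock);
	for_each_task(p)
		n++;
	read_unlock(&tasklist_lock);
	return n;
}

/* Writer: must also disable local interrupts, because
 * kill_fasync() -> send_sigio() takes the lock for read
 * from interrupt context. */
static void modify_task_links(struct task_struct *p)
{
	write_lock_irq(&tasklist_lock);
	/* ... modify pidhash and/or next_task/prev_task links ... */
	write_unlock_irq(&tasklist_lock);
}
--------------------------------------------------------------------------------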
Now that we understand how the task_struct structures are linked together, let us examine the members of task_struct. They loosely correspond to the members of the UNIX 'proc' and 'user' structures combined.
Other versions of UNIX separated the task state information into one part that is kept memory-resident at all times (called the 'proc structure', which includes process state, scheduling information etc.) and another part that is only needed when the process is running (called the 'u area', which includes the file descriptor table, disk quota information etc.). The only reason for such an ugly design was that memory was a very scarce resource. Modern operating systems (well, only Linux at the moment, but others, e.g. FreeBSD, seem to be improving in this direction towards Linux) do not need such separation and therefore maintain process state in a kernel memory-resident data structure at all times.
The task_struct structure is declared in include/linux/sched.h and is currently 1680 bytes in size.
The state field is declared as:
--------------------------------------------------------------------------------
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
#define TASK_RUNNING 0
#define TASK_INTERRUPTIBLE 1
#define TASK_UNINTERRUPTIBLE 2
#define TASK_ZOMBIE 4
#define TASK_STOPPED 8
#define TASK_EXCLUSIVE 32
--------------------------------------------------------------------------------
Why is TASK_EXCLUSIVE defined as 32 and not 16? Because 16 was used up by TASK_SWAPPING, and I forgot to shift TASK_EXCLUSIVE up when I removed all references to TASK_SWAPPING (sometime in 2.3.x).
The volatile in the p->state declaration means it can be modified asynchronously (from an interrupt handler):
TASK_RUNNING: means the task is "supposed to be" on the run queue. The reason it may not yet be on the runqueue is that marking a task as TASK_RUNNING and placing it on the runqueue is not atomic. You need to hold the runqueue_lock read-write spinlock for read in order to look at the runqueue; if you do so, you will see that every task on the runqueue is in TASK_RUNNING state. However, the converse is not true, for the reason explained above. Similarly, drivers can mark themselves (or rather the process context they run in) as TASK_INTERRUPTIBLE (or TASK_UNINTERRUPTIBLE) and then call schedule(), which will remove the task from the runqueue (unless there is a pending signal, in which case it is left on the runqueue).
TASK_INTERRUPTIBLE: means the task is sleeping but can be woken up by a signal or by expiry of a timer.
TASK_UNINTERRUPTIBLE: same as TASK_INTERRUPTIBLE, except it cannot be woken up.
TASK_ZOMBIE: the task has terminated but has not yet had its status collected (wait()-ed for) by the parent (natural or adoptive).
TASK_STOPPED: the task was stopped, either due to job control signals or due to ptrace(2).
TASK_EXCLUSIVE: this is not a separate state but can be OR-ed with either TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE. It means that when this task sleeps on a wait queue with many other tasks, it will be woken up alone, instead of causing a "thundering herd" problem by waking up all the waiters, as sketched below.
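A sleeper that wants to avoid the thundering herd might therefore look like this (a minimal sketch, assuming a wait queue wq and a condition cond defined elsewhere):
--------------------------------------------------------------------------------
static int wait_for_event(void)
{
	DECLARE_WAITQUEUE(wait, current);

	add_wait_queue(&wq, &wait);
	for (;;) {
		/* TASK_EXCLUSIVE is OR-ed into the sleep state, so a
		 * wake-up on wq wakes only one exclusive sleeper */
		current->state = TASK_INTERRUPTIBLE | TASK_EXCLUSIVE;
		if (cond)
			break;
		schedule();
	}
	current->state = TASK_RUNNING;
	remove_wait_queue(&wq, &wait);
	return 0;
}
--------------------------------------------------------------------------------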
The task flags contain information about process states which are not mutually exclusive:
--------------------------------------------------------------------------------
unsigned long flags; /* per process flags, defined below */
/*
* Per process flags
*/
#define PF_ALIGNWARN 0x00000001 /* Print alignment warning msgs */
/* Not implemented yet, only for 486*/
#define PF_STARTING 0x00000002 /* being created */
#define PF_EXITING 0x00000004 /* getting shut down */
#define PF_FORKNOEXEC 0x00000040 /* forked but didn't exec */
#define PF_SUPERPRIV 0x00000100 /* used super-user privileges */
#define PF_DUMPCORE 0x00000200 /* dumped core */
#define PF_SIGNALED 0x00000400 /* killed by a signal */
#define PF_MEMALLOC 0x00000800 /* Allocating memory */
#define PF_VFORK 0x00001000 /* Wake up parent in mm_release */
#define PF_USEDFPU 0x00100000 /* task used FPU this quantum (SMP) */
--------------------------------------------------------------------------------
The fields p->has_cpu, p->processor, p->counter, p->priority, p->policy and p->rt_priority are related to the scheduler and will be looked at later.
The fields p->mm and p->active_mm point, respectively, to the process' address space described by the mm_struct structure and to the active address space if the process doesn't have a real one (e.g. kernel threads). This helps minimise TLB flushes when address spaces are switched as tasks are scheduled out. So, if we are scheduling in a kernel thread (which has no p->mm), its next->active_mm will be set to the prev->active_mm of the task that was scheduled out, which will be the same as prev->mm if prev->mm != NULL. The address space can be shared between threads if the CLONE_VM flag is passed to the clone(2) system call, or by means of the vfork(2) system call.
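Inside schedule() this lazy handling looks roughly as follows (a condensed sketch of the idea, using the prev, next and this_cpu locals of schedule(); not the literal kernel code):
--------------------------------------------------------------------------------
struct mm_struct *mm = next->mm;
struct mm_struct *oldmm = prev->active_mm;

if (!mm) {				/* kernel thread: borrow the   */
	next->active_mm = oldmm;	/* address space of prev       */
	atomic_inc(&oldmm->mm_count);
	enter_lazy_tlb(oldmm, next, this_cpu);
} else {
	/* real user task: switch to its own address space */
	switch_mm(oldmm, mm, next, this_cpu);
}
--------------------------------------------------------------------------------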
The fields p->exec_domain and p->personality relate to the personality of the task, i.e. to the way certain system calls behave in order to emulate the "personality" of foreign flavours of UNIX.
The field p->fs contains filesystem information, which under Linux means three pieces of information:
root directory's dentry and mountpoint,
alternate root directory's dentry and mountpoint,
current working directory's dentry and mountpoint.
This structure also includes a reference count, because it can be shared between cloned tasks when the CLONE_FS flag is passed to the clone(2) system call.
The field p->files contains the file descriptor table. This too can be shared between tasks, provided CLONE_FILES is specified with the clone(2) system call.
The field p->sig contains signal handlers and can be shared between cloned tasks by means of CLONE_SIGHAND.
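To see how these CLONE_* flags combine, here is a minimal user-space sketch that creates a thread-like child sharing the parent's address space, filesystem information, file descriptors and signal handlers (error handling omitted for brevity):
--------------------------------------------------------------------------------
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static int child_fn(void *arg)
{
	printf("child: shares VM, fs, files and signal handlers\n");
	return 0;
}

int main(void)
{
	char *stack = malloc(16384);

	/* the stack grows down on x86, so pass the top of the area;
	 * SIGCHLD in the flags gives the parent the same exit
	 * notification a plain fork(2) child would */
	clone(child_fn, stack + 16384,
	      CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | SIGCHLD,
	      NULL);
	sleep(1);	/* give the child a chance to run */
	return 0;
}
--------------------------------------------------------------------------------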
2.2 Creation and termination of tasks and kernel threads
Different books on operating systems define a "process" in different ways, starting from "instance of a program in execution" and ending with "that which is produced by the clone(2) or fork(2) system calls". Under Linux, there are three kinds of processes:
the idle thread(s),
kernel threads,
user tasks.
The idle thread is created at compile time for the first CPU; it is then "manually" created for each CPU by means of the arch-specific fork_by_hand() in arch/i386/kernel/smpboot.c, which unrolls the fork(2) system call by hand (on some archs). Idle tasks share one init_task structure but have a private TSS structure, in the per-CPU array init_tss. Idle tasks all have pid = 0, and no other task can share this pid, i.e. use the CLONE_PID flag with clone(2).
Kernel threads are created using the kernel_thread() function, which invokes the clone(2) system call in kernel mode. Kernel threads usually have no user address space, i.e. p->mm = NULL, because they explicitly do exit_mm(), e.g. via the daemonize() function. Kernel threads can always access the kernel address space directly. They are allocated pid numbers in the low range. Running at the processor's ring 0 implies that kernel threads enjoy all I/O privileges and cannot be preempted by the scheduler.
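A typical kernel thread therefore starts out like this (a minimal sketch; my_daemon and my_daemon_wait are hypothetical names):
--------------------------------------------------------------------------------
static DECLARE_WAIT_QUEUE_HEAD(my_daemon_wait);

static int my_daemon(void *unused)
{
	daemonize();		/* drops user resources, ends up with
				   p->mm = NULL via exit_mm() */
	strcpy(current->comm, "mydaemond");

	for (;;) {
		/* do the actual work here, then sleep until needed */
		interruptible_sleep_on(&my_daemon_wait);
		if (signal_pending(current))
			break;
	}
	return 0;
}

static void start_my_daemon(void)
{
	kernel_thread(my_daemon, NULL, CLONE_FS | CLONE_FILES | CLONE_SIGHAND);
}
--------------------------------------------------------------------------------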
User tasks are created by means of the clone(2) or fork(2) system calls, both of which internally invoke kernel/fork.c:do_fork().
Let us understand what happens when a user process makes a fork(2) system call. Although fork(2) is architecture-dependent, due to the different ways of passing the user stack and registers, the actual underlying function do_fork() that does the job is portable and is located in kernel/fork.c.
The following steps are done:
The local variable retval is set to -ENOMEM, as this is the value errno should be set to if fork(2) fails to allocate a new task structure.
If CLONE_PID is set in clone_flags, return an error (-EPERM), unless the caller is the idle thread (during boot only). So, normal user threads cannot pass CLONE_PID to clone(2) and expect it to succeed. For fork(2) this is irrelevant, as clone_flags is set to SIGCHLD - it is only relevant when do_fork() is invoked from sys_clone(), which passes clone_flags from the value requested from userspace.
current->vfork_sem is initialised (it is later cleared in the child). It is used by sys_vfork() (the vfork(2) system call, corresponding to clone_flags = CLONE_VFORK|CLONE_VM|SIGCHLD) to make the parent sleep until the child does mm_release(), for example as a result of exec()ing another program or exit(2)-ing.
A new task structure is allocated using the arch-dependent alloc_task_struct() macro. On x86 it simply grabs free pages at GFP_KERNEL priority. This is the first reason why the fork(2) system call may sleep. If the allocation fails, we return -ENOMEM.
All the values from the current process' task structure are copied into the new one, using the structure assignment *p = *current. Perhaps this should be replaced by a memset? Later on, the values that should not be inherited by the child are set to the correct values.
The big kernel lock is taken, as the rest of the code would otherwise be non-reentrant.
If the parent has user resources (a concept of UID - Linux is flexible enough to make it a question rather than a fact), then verify whether the user has exceeded the RLIMIT_NPROC soft limit - if so, fail with -EAGAIN; if not, increment the count of processes for the given uid, p->user->count.
If the system-wide number of tasks exceeds the value of the tunable max_threads, fail with -EAGAIN.
If the binary being executed belongs to a modularised execution domain, increment the corresponding module's reference count.
If the binary being executed belongs to a modularised binary format, increment the corresponding module's reference count.
The child is marked as 'has not execed' (p->did_exec = 0).
The child is marked as 'not swappable' (p->swappable = 0).
The child is put into 'uninterruptible sleep' state, i.e. p->state = TASK_UNINTERRUPTIBLE (TODO: why is this done? I think it is unneeded - get rid of it; Linus confirms it is unneeded).
The child's p->flags are set according to the value of clone_flags; for plain fork(2), this will be p->flags = PF_FORKNOEXEC.
The child's pid p->pid is set using the fast algorithm in kernel/fork.c:get_pid() (TODO: the lastpid_lock spinlock can be made redundant, since get_pid() is always called under the big kernel lock from do_fork(); also remove the flags argument of get_pid(); patch sent to Alan on 20/06/2000 - followed up later).
The rest of the code in do_fork() initialises the rest of the child's task structure. At the very end, the child's task structure is hashed into the pidhash hashtable and the child is woken up (TODO: wake_up_process(p) sets p->state = TASK_RUNNING and adds the process to the runqueue, therefore we probably didn't need to set p->state to TASK_RUNNING earlier on in do_fork()). The interesting part is setting p->exit_signal to clone_flags & CSIGNAL, which for fork(2) means just SIGCHLD, and setting p->pdeath_signal to 0. pdeath_signal is used when a process 'forgets' its original parent (because it died); it can be set/got by means of the PR_GET/SET_PDEATHSIG commands of the prctl(2) system call, as in the example below (you might argue that the way the value of pdeath_signal is returned via a userspace pointer argument in prctl(2) is rather silly - mea culpa; after Andries Brouwer updated the manpage it was too late to fix it ;)
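As an aside, pdeath_signal can be exercised from user space like this (a small self-contained example; note how PR_GET_PDEATHSIG returns the value through a pointer argument, as lamented above):
--------------------------------------------------------------------------------
#include <sys/prctl.h>
#include <signal.h>
#include <stdio.h>

int main(void)
{
	int sig = 0;

	/* ask to be sent SIGHUP when our parent dies */
	prctl(PR_SET_PDEATHSIG, SIGHUP);

	/* read the setting back - via the pointer argument */
	prctl(PR_GET_PDEATHSIG, &sig);
	printf("pdeath_signal = %d\n", sig);
	return 0;
}
--------------------------------------------------------------------------------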
Thus tasks are created. There are several ways for a task to terminate:
by making the exit(2) system call;
by being delivered a signal whose default disposition is to die;
by being forced to die under certain exceptional conditions;
by calling bdflush(2) with func == 1 (this is Linux-specific, kept for compatibility with old distributions that still had the 'update' line in /etc/inittab - nowadays the work of update is done by the kernel thread kupdate).
Functions implementing system calls under Linux are prefixed with sys_, but they are usually concerned only with argument checking or with arch-specific ways of passing some information, while the actual work is done by the do_ functions. So it is with sys_exit(), which calls do_exit() to do the work. Still, other parts of the kernel sometimes invoke sys_exit() when they should really call do_exit().
The function do_exit() is found in kernel/exit.c. The points to note about do_exit():
It uses the global kernel lock (locks but doesn't unlock).
It calls schedule() at the end, which never returns.
It sets the task state to TASK_ZOMBIE.
It notifies any child with current->pdeath_signal, if not 0.
It notifies the parent with current->exit_signal, which is usually equal to SIGCHLD.
It releases the resources allocated by fork, closes open files etc.
On architectures that use lazy FPU switching (ia64, mips, mips64) (TODO: remove the 'flags' argument of sparc, sparc64), it does whatever the hardware requires to pass FPU ownership (if owned by current) to "none".
2.3 Linux Scheduler
The job of a scheduler is to arbitrate access to the current CPU between multiple processes. The scheduler is implemented in the 'main kernel file' kernel/sched.c. The corresponding header file include/linux/sched.h is included (either explicitly or indirectly) by virtually every kernel source file.
The fields of task structure relevant to scheduler include:
p->need_resched: this field is set if schedule() should be invoked at the 'next opportunity'.
p->counter: number of clock ticks left to run in this scheduling slice, decremented by a timer. When this field becomes lower than or equal to zero, it is reset to 0 and p->need_resched is set. This is also sometimes called 'dynamic priority' of a process because it can change by itself.
p->priority: the process' static priority, only changed through well-known system calls like nice(2), POSIX.1b sched_setparam(2) or 4.4BSD/SVR4 setpriority(2).
p->rt_priority: realtime priority
p->policy: the scheduling policy, specifies which scheduling class the task belongs to. Tasks can change their scheduling class using the sched_setscheduler(2) system call. The valid values are SCHED_OTHER (traditional UNIX process), SCHED_FIFO (POSIX.1b FIFO realtime process) and SCHED_RR (POSIX round-robin realtime process). One can also OR SCHED_YIELD to any of these values to signify that the process decided to yield the CPU, for example by calling sched_yield(2) system call. A FIFO realtime process will run until either a) it blocks on I/O, b) it explicitly yields the CPU or c) it is preempted by another realtime process with a higher p->rt_priority value. SCHED_RR is the same as SCHED_FIFO, except that when its timeslice expires it goes back to the end of the runqueue.
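For example, a process can move itself into the FIFO realtime class like this (a minimal user-space sketch; realtime scheduling requires root privileges):
--------------------------------------------------------------------------------
#include <sched.h>
#include <stdio.h>

int main(void)
{
	struct sched_param param;

	param.sched_priority = 50;	/* becomes p->rt_priority */
	if (sched_setscheduler(0, SCHED_FIFO, &param) < 0) {
		perror("sched_setscheduler");
		return 1;
	}
	/* from here on, this is a SCHED_FIFO realtime process */
	return 0;
}
--------------------------------------------------------------------------------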
The scheduler's algorithm is simple, despite the great apparent complexity of the schedule() function. The function is complex because it implements three scheduling algorithms in one and also because of the subtle SMP-specifics.
The apparently 'useless' gotos in schedule() are there for a purpose - to generate the best optimised (for i386) code. Also, note that the scheduler (like most of the kernel) was completely rewritten for 2.4, therefore the discussion below does not apply to 2.2 or earlier kernels.
Let us look at the function in detail:
If current->active_mm == NULL then something is wrong. The current process, even a kernel thread (current->mm == NULL), must have a valid p->active_mm at all times.
If there is something to do on the tq_scheduler task queue, process it now. Task queues provide a kernel mechanism to schedule execution of functions at a later time. We shall look at them in detail elsewhere.
Initialise local variables prev and this_cpu to current task and current CPU respectively.
Check if schedule() was invoked from interrupt handler (due to a bug) and panic if so.
Release the global kernel lock.
If there is some work to do via softirq mechanism, do it now.
Initialise the local pointer struct schedule_data *sched_data to point to the per-CPU (cacheline-aligned to prevent cacheline ping-pong) scheduling data area, which contains the TSC value of last_schedule and the pointer to the last scheduled task structure (TODO: sched_data is used on SMP only, but why does init_idle() initialise it on UP as well?).
runqueue_lock spinlock is taken. Note that we use spin_lock_irq() because in schedule() we guarantee that interrupts are enabled. Therefore, when we unlock runqueue_lock, we can just re-enable them instead of saving/restoring eflags (spin_lock_irqsave/restore variant).
task state machine: if the task is in TASK_RUNNING state, it is left alone; if it is in TASK_INTERRUPTIBLE state and a signal is pending, it is moved into TASK_RUNNING state. In all other cases, it is deleted from the runqueue.
next (best candidate to be scheduled) is set to the idle task of this cpu. However, the goodness of this candidate is set to a very low value (-1000), in hope that there is someone better than that.
If the prev (current) task is in TASK_RUNNING state, then the current goodness is set to its goodness and it is marked as a better candidate to be scheduled than the idle task.
Now the runqueue is examined and the goodness of each process that can be scheduled on this cpu is compared with the current value; the process with the highest goodness wins. The concept of "can be scheduled on this cpu" must be clarified: on UP, every process on the runqueue is eligible to be scheduled; on SMP, only a process not already running on another cpu is eligible to be scheduled on this cpu. The goodness is calculated by a function called goodness(), which treats realtime processes by making their goodness very high (1000 + p->rt_priority); being greater than 1000 guarantees that no SCHED_OTHER process can win, so they only contend with other realtime processes that may have a greater p->rt_priority. The goodness function returns 0 if the process' time slice (p->counter) is over. For non-realtime processes, the initial value of goodness is set to p->counter - this way, the process is less likely to get the CPU if it already had it for a while, i.e. interactive processes are favoured over CPU-bound number crunchers. The arch-specific constant PROC_CHANGE_PENALTY attempts to implement "cpu affinity" (i.e. give an advantage to a process staying on the same CPU). It also gives a slight advantage to processes with mm pointing to the current active_mm, or to processes with no (user) address space, i.e. kernel threads. A condensed sketch follows.
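The logic just described can be condensed as follows (a simplified reconstruction, not the verbatim kernel code; the real goodness() in kernel/sched.c handles a few more details, such as SCHED_YIELD):
--------------------------------------------------------------------------------
static int goodness(struct task_struct *p, int this_cpu,
		    struct mm_struct *this_mm)
{
	int weight;

	if (p->policy != SCHED_OTHER)		/* realtime class */
		return 1000 + p->rt_priority;

	weight = p->counter;			/* dynamic priority */
	if (!weight)				/* slice used up */
		return 0;
#ifdef CONFIG_SMP
	if (p->processor == this_cpu)		/* cpu affinity bonus */
		weight += PROC_CHANGE_PENALTY;
#endif
	if (p->mm == this_mm || !p->mm)		/* cheap mm switch */
		weight += 1;
	weight += p->priority;			/* static priority */
	return weight;
}
--------------------------------------------------------------------------------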
If the current value of goodness is 0, then the entire list of processes (not just the ones on the runqueue!) is examined and their dynamic priorities are recalculated using a simple algorithm:
--------------------------------------------------------------------------------
recalculate:
{
struct task_struct *p;
spin_unlock_irq(&runqueue_lock);
read_lock(&tasklist_lock);
for_each_task(p)
p->counter = (p->counter >> 1) + p->priority;
read_unlock(&tasklist_lock);
spin_lock_irq(&runqueue_lock);
}
--------------------------------------------------------------------------------
Note that we drop the runqueue_lock before we recalculate. The reason is that we go through the entire set of processes; this can take a long time, during which schedule() could be called on another CPU and select a process with a goodness good enough for that CPU, whilst we on this CPU are forced to recalculate. Admittedly this is somewhat inconsistent, because while we (on this CPU) are selecting the process with the best goodness, schedule() running on another CPU could be recalculating dynamic priorities.
From this point on it is certain that next points to the task to be scheduled, so we initialise next->has_cpu to 1 and next->processor to this_cpu. The runqueue_lock can now be unlocked.
If we are switching back to the same task (next == prev) then we can simply reacquire the global kernel lock and return, i.e. skip all the hardware-level (registers, stack etc.) and VM-related (switch page directory, recalculate active_mm etc.) stuff.
The macro switch_to() is architecture specific. On i386, it is concerned with a) FPU handling, b) LDT handling, c) reloading segment registers, d) TSS handling and e) reloading debug registers.
2.4 Linux linked list implementation
Before we go on to examine implementation of wait queues, we must acquaint ourselves with the Linux standard doubly-linked list implementation. Wait queues (as well as everything else in Linux) make heavy use of them and they are called in jargon "list.h implementation" because the most relevant file is include/linux/list.h.
The fundamental data structure here is struct list_head:
--------------------------------------------------------------------------------
struct list_head {
struct list_head *next, *prev;
};
#define LIST_HEAD_INIT(name) { &(name), &(name) }
#define LIST_HEAD(name) \
	struct list_head name = LIST_HEAD_INIT(name)
#define INIT_LIST_HEAD(ptr) do { \
	(ptr)->next = (ptr); (ptr)->prev = (ptr); \
} while (0)
#define list_entry(ptr, type, member) \
	((type *)((char *)(ptr)-(unsigned long)(&((type *)0)->member)))
#define list_for_each(pos, head) \
	for (pos = (head)->next; pos != (head); pos = pos->next)
--------------------------------------------------------------------------------
The first three macros are for initialising an empty list by pointing both the next and prev pointers to itself. It is obvious from C syntactic restrictions which ones should be used where - for example, LIST_HEAD_INIT() can be used to initialise a structure element in its declaration, LIST_HEAD() can be used for static variable initialising declarations, and INIT_LIST_HEAD() can be used inside a function, as the snippet below illustrates.
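Concretely, the three forms are used like this (an illustrative snippet; struct foo is a made-up type):
--------------------------------------------------------------------------------
struct foo {
	struct list_head list;
	int data;
};

/* 1. in-declaration initialisation of a structure element */
struct foo f = { LIST_HEAD_INIT(f.list), 0 };

/* 2. static/global definition of a standalone list head */
static LIST_HEAD(foo_list);

/* 3. run-time initialisation, e.g. for dynamically allocated objects */
void setup(struct foo *p)
{
	INIT_LIST_HEAD(&p->list);
}
--------------------------------------------------------------------------------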
The macro list_entry() gives access to an individual list element, for example (from fs/file_table.c:fs_may_remount_ro()):
--------------------------------------------------------------------------------
struct super_block {
...
struct list_head s_files;
...
} *sb = &some_super_block;
struct file {
...
struct list_head f_list;
...
} *file;
struct list_head *p;
for (p = sb->s_files.next; p != &sb->s_files; p = p->next) {
struct file *file = list_entry(p, struct file, f_list);
/* do something to 'file' */
}
--------------------------------------------------------------------------------
A good example of the use of the list_for_each() macro is in the scheduler, where we walk the runqueue looking for the process with the highest goodness:
--------------------------------------------------------------------------------
static LIST_HEAD(runqueue_head);
struct list_head *tmp;
struct task_struct *p;
list_for_each(tmp, &runqueue_head) {
p = list_entry(tmp, struct task_struct, run_list);
if (can_schedule(p)) {
int weight = goodness(p, this_cpu, prev->active_mm);
if (weight > c)
c = weight, next = p;
}
}
--------------------------------------------------------------------------------
Here, p->run_list is declared as struct list_head run_list inside the task_struct structure and serves as an anchor into the list. Removing an element from the list and adding one (to the head or tail of the list) is done by the list_del()/list_add()/list_add_tail() macros. The examples below add and remove a task from the runqueue:
--------------------------------------------------------------------------------
static inline void del_from_runqueue(struct task_struct * p)
{
nr_running--;
list_del(&p->run_list);
p->run_list.next = NULL;
}
static inline void add_to_runqueue(struct task_struct * p)
{
list_add(&p->run_list, &runqueue_head);
nr_running++;
}
static inline void move_last_runqueue(struct task_struct * p)
{
list_del(&p->run_list);
list_add_tail(&p->run_list, &runqueue_head);
}
static inline void move_first_runqueue(struct task_struct * p)
{
list_del(&p->run_list);
list_add(&p->run_list, &runqueue_head);
}
--------------------------------------------------------------------------------
2.5 Wait Queues
When a process requests the kernel to do something which is currently impossible but that may become possible later, the process is put to sleep and is woken up when the request is more likely to be satisfied. One of the kernel mechanisms used for this is called a 'wait queue'.
The Linux implementation allows wake-one semantics by means of the TASK_EXCLUSIVE flag. With waitqueues, you can either use a well-known queue and then simply call sleep_on/sleep_on_timeout/interruptible_sleep_on/interruptible_sleep_on_timeout, or you can define your own waitqueue and use add/remove_wait_queue to add and remove yourself from it and wake_up/wake_up_interruptible to wake up when needed.
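The first style looks like this (a minimal sketch; some_event_queue and the producer/consumer split are placeholders):
--------------------------------------------------------------------------------
static DECLARE_WAIT_QUEUE_HEAD(some_event_queue);

static void consumer(void)
{
	/* block until the queue is woken (or a signal arrives); note
	 * that this style can miss a wake-up that occurs just before
	 * we sleep - the second, explicit style avoids that race */
	interruptible_sleep_on(&some_event_queue);
}

static void producer(void)
{
	wake_up_interruptible(&some_event_queue);
}
--------------------------------------------------------------------------------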
An example of the first usage of waitqueues is the interaction between the page allocator (in mm/page_alloc.c:__alloc_pages()) and the kswapd kernel daemon (in mm/vmscan.c:kswapd()), by means of the wait queue kswapd_wait, declared in mm/vmscan.c; the kswapd daemon sleeps on this queue, and it is woken up whenever the page allocator needs to free up some pages.
An example of autonomous waitqueue usage is the interaction between a user process requesting data via the read(2) system call and the kernel running in interrupt context to supply the data. An interrupt handler might look like this (simplified from rtc_interrupt() in drivers/char/rtc.c):
--------------------------------------------------------------------------------
static DECLARE_WAIT_QUEUE_HEAD(rtc_wait);
void rtc_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
spin_lock(&rtc_lock);
rtc_irq_data = CMOS_READ(RTC_INTR_FLAGS);
spin_unlock(&rtc_lock);
wake_up_interruptible(&rtc_wait);
}
--------------------------------------------------------------------------------
So, the interrupt handler obtains the data by reading from some device-specific I/O port (the CMOS_READ() macro turns into a couple of outb/inb instructions) and then wakes up whoever is sleeping on the rtc_wait wait queue.
Now, the read(2) system call could be implemented as:
--------------------------------------------------------------------------------
ssize_t rtc_read(struct file *file, char *buf, size_t count, loff_t *ppos)
{
DECLARE_WAITQUEUE(wait, current);
unsigned long data;
ssize_t retval;
add_wait_queue(&rtc_wait, &wait);
current->state = TASK_INTERRUPTIBLE;
do {
spin_lock_irq(&rtc_lock);
data = rtc_irq_data;
rtc_irq_data = 0;
spin_unlock_irq(&rtc_lock);
if (data != 0)
break;
if (file->f_flags & O_NONBLOCK) {
retval = -EAGAIN;
goto out;
}
if (signal_pending(current)) {
retval = -ERESTARTSYS;
goto out;
}
schedule();
} while(1);
retval = put_user(data, (unsigned long *)buf);
if (!retval)
retval = sizeof(unsigned long);
out:
current->state = TASK_RUNNING;
remove_wait_queue(&rtc_wait, &wait);
return retval;
}
--------------------------------------------------------------------------------
What happens in rtc_read() is this:
We declare a wait queue element pointing to current process context.
We add this element to the rtc_wait waitqueue.
We mark the current context as TASK_INTERRUPTIBLE, which means that once it sleeps (i.e. after the next call to schedule()) it will not be rescheduled until it is woken up.
We check if there is no data available; if there is, we break out of the loop, copy the data to the user buffer, mark ourselves as TASK_RUNNING, remove ourselves from the wait queue and return.
If there is no data yet, we check whether the user specified non-blocking I/O and, if so, fail with EAGAIN (which is the same as EWOULDBLOCK).
We also check if a signal is pending and, if so, inform the "higher layers" to restart the system call if necessary. By "if necessary" we mean the details of signal disposition as specified in the sigaction(2) system call.
Then we "switch out", i.e. fall asleep, until woken up by the interrupt handler. If we didn't mark ourselves as TASK_INTERRUPTIBLE then the scheduler could schedule us sooner than when the data is available, thus causing unneeded processing.
It is also worth pointing out that, using wait queues, it is rather easy to implement the poll(2) system call:
--------------------------------------------------------------------------------
static unsigned int rtc_poll(struct file *file, poll_table *wait)
{
unsigned long l;
poll_wait(file, &rtc_wait, wait);
spin_lock_irq(&rtc_lock);
l = rtc_irq_data;
spin_unlock_irq(&rtc_lock);
if (l != 0)
return POLLIN | POLLRDNORM;
return 0;
}
--------------------------------------------------------------------------------
All the work is done by the device-independent function poll_wait() which does the necessary waitqueue manipulations; all we need to do is point it to the waitqueue which is woken up by our device-specific interrupt handler.
2.6 Kernel Timers
Now let us turn our attention to kernel timers. Kernel timers are used to dispatch execution of a particular function (called 'timer handler') at a specified time in the future. The main data structure is struct timer_list declared in include/linux/timer.h:
--------------------------------------------------------------------------------
struct timer_list {
struct list_head list;
unsigned long expires;
unsigned long data;
void (*function)(unsigned long);
volatile int running;
};
--------------------------------------------------------------------------------
The list field is for linking into the internal list, protected by the timerlist_lock spinlock. The expires field is the value of jiffies at which the function handler should be invoked, with data passed as a parameter. The running field is used on SMP to test whether the timer handler is currently running on another CPU.
The functions add_timer() and del_timer() add a given timer to, and remove it from, the list. When a timer expires, it is removed automatically. Before a timer is used, it MUST be initialised by means of the init_timer() function, and before it is added, the function and expires fields must be set. Typical usage is sketched below.
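Putting it together (a minimal sketch; my_timer_fn is a hypothetical handler):
--------------------------------------------------------------------------------
#include <linux/timer.h>
#include <linux/sched.h>	/* jiffies, HZ */

static struct timer_list my_timer;

static void my_timer_fn(unsigned long data)
{
	/* invoked once, in timer context, when jiffies reaches
	 * my_timer.expires; 'data' carries what was set below */
}

static void start_my_timer(void)
{
	init_timer(&my_timer);			/* MUST come first */
	my_timer.function = my_timer_fn;
	my_timer.data = 0;
	my_timer.expires = jiffies + HZ;	/* one second from now */
	add_timer(&my_timer);
}
--------------------------------------------------------------------------------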