Linux Kernel Internals (2, cont.)

Date: 2007-02-17 Source: PHP爱好者

2.7 Bottom Halves

Sometimes it is reasonable to split the amount of work to be performed inside an interrupt handler into immediate work (e.g. acknowledging the interrupt, updating the stats etc.) and work which can be postponed until later, when interrupts are enabled (e.g. to do some postprocessing on data, wake up processes waiting for this data, etc).

Bottom halves are the oldest mechanism for deferred execution of kernel tasks and have been available since Linux 1.x. In Linux 2.0, a new mechanism called 'task queues' was added, which will be the subject of the next section.

Bottom halves are serialised by the global_bh_lock spinlock, i.e. there can only be one bottom half running on any CPU at a time. However, when attempting to execute the handler, if global_bh_lock is not available, the bottom half is marked (i.e. scheduled) for execution - so processing can continue, as opposed to a busy loop on global_bh_lock.

There can only be 32 bottom halves registered in total. The functions required to manipulate bottom halves are as follows (all exported to modules):

void init_bh(int nr, void (*routine)(void)): installs a bottom half handler pointed to by routine argument into slot nr. The slot ought to be enumerated in include/linux/interrupt.h in the form XXXX_BH, e.g. TIMER_BH or TQUEUE_BH. Typically, a subsystem's initialisation routine (init_module() for modules) installs the required bottom half using this function.
void remove_bh(int nr): does the opposite of init_bh(), i.e. de-installs bottom half installed at slot nr. There is no error checking performed there, so, for example remove_bh(32) will panic/oops the system. Typically, a subsystem's cleanup routine (cleanup_module() for modules) uses this function to free up the slot that can later be reused by some other subsystem. (TODO: wouldn't it be nice to have /proc/bottom_halves list all registered bottom halves on the system? That means global_bh_lock must be made read/write, obviously)
void mark_bh(int nr): marks bottom half in slot nr for execution. Typically, an interrupt handler will mark its bottom half (hence the name!) for execution at a "safer time".

Bottom halves are globally locked tasklets, so the question "when are bottom half handlers executed?" is really "when are tasklets executed?". And the answer is, in two places: a) on each schedule() and b) on each interrupt/syscall return path in entry.S (TODO: therefore, the schedule() case is really boring - it is like adding yet another very very slow interrupt, why not get rid of the handle_softirq label from schedule() altogether?).
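
The slot-and-mark behaviour of init_bh(), mark_bh() and remove_bh() can be modelled in a few lines of user-space C. This is only a sketch: every sim_ name is invented for illustration, the pending bits are kept in a single unsigned long, and nothing here reproduces the kernel's locking (global_bh_lock) or its tasklet machinery.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_NR_BH 32                          /* fixed number of slots, as in the text */

static void (*sim_bh_base[SIM_NR_BH])(void);  /* handler installed in each slot */
static unsigned long sim_bh_active;           /* one "pending" bit per slot */

/* Install a handler in slot nr (models init_bh()). */
void sim_init_bh(int nr, void (*routine)(void))
{
    assert(nr >= 0 && nr < SIM_NR_BH);        /* the model checks the slot number */
    sim_bh_base[nr] = routine;
}

/* Remove the handler from slot nr (models remove_bh()). */
void sim_remove_bh(int nr)
{
    assert(nr >= 0 && nr < SIM_NR_BH);        /* the text notes the real remove_bh() has no check */
    sim_bh_base[nr] = NULL;
    sim_bh_active &= ~(1UL << nr);
}

/* Mark slot nr as pending (models mark_bh()). */
void sim_mark_bh(int nr)
{
    assert(nr >= 0 && nr < SIM_NR_BH);
    sim_bh_active |= 1UL << nr;
}

/* Run every pending handler once, clearing its pending bit; returns how many ran. */
int sim_run_bh(void)
{
    int nr, ran = 0;

    for (nr = 0; nr < SIM_NR_BH; nr++) {
        if (!(sim_bh_active & (1UL << nr)))
            continue;
        sim_bh_active &= ~(1UL << nr);
        if (sim_bh_base[nr]) {
            sim_bh_base[nr]();
            ran++;
        }
    }
    return ran;
}

static int sim_hits;                          /* counts calls to the demo handler */
void sim_demo_bh(void) { sim_hits++; }
```

In this model, marking a slot twice before it runs still results in a single call to the handler, because "pending" is one bit rather than a counter.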

2.8 Task Queues

Task queues can be thought of as a dynamic extension to old bottom halves. In fact, in the source code they are sometimes referred to as "new" bottom halves. More specifically, the old bottom halves discussed in the previous section have these limitations:

There are only a fixed number (32) of them.
Each bottom half can only be associated with one handler function.
Bottom halves are consumed with a spinlock held so they cannot block.

So, with task queues, an arbitrary number of functions can be chained and processed one after another at a later time. One creates a new task queue using the DECLARE_TASK_QUEUE() macro and queues a task onto it using the queue_task() function. The task queue can then be processed using run_task_queue(). Instead of creating your own task queue (and having to consume it manually) you can use one of Linux's predefined task queues, which are consumed at well-known points:

tq_timer: the timer task queue, run on each timer interrupt and when releasing a tty device (closing or releasing a half-opened terminal device). Since the timer handler runs in interrupt context, the tq_timer tasks also run in interrupt context and thus cannot block.
tq_scheduler: the scheduler task queue, consumed by the scheduler (and also when closing tty devices, like tq_timer). Since the scheduler executes in the context of the process being re-scheduled, the tq_scheduler tasks can do anything they like, i.e. block, use process context data (but why would they want to), etc.
tq_immediate: this is really a bottom half IMMEDIATE_BH, so drivers can queue_task(task, &tq_immediate) and then mark_bh(IMMEDIATE_BH) to be consumed in interrupt context.
tq_disk: used by low level block device access (and RAID) to start the actual requests. This task queue is exported to modules but shouldn't be used except for the special purposes which it was designed for.

Unless a driver uses its own task queues, it does not need to call run_task_queue() to process the queue, except under circumstances explained below.

The reason tq_timer/tq_scheduler task queues are consumed not only in the usual places but elsewhere (closing tty device is but one example) becomes clear if one remembers that the driver can schedule tasks on the queue, and these tasks only make sense while a particular instance of the device is still valid - which usually means until the application closes it. So, the driver may need to call run_task_queue() to flush the tasks it (and anyone else) has put on the queue, because allowing them to run at a later time may make no sense - i.e. the relevant data structures may have been freed/reused by a different instance. This is the reason you see run_task_queue() on tq_timer and tq_scheduler in places other than timer interrupt and schedule() respectively.
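
The queue-then-run pattern, including the "flush before the device goes away" case, can be sketched in user space as follows. The structure layout and all sim_ names are assumptions made for this illustration and are not the kernel's definitions; in particular there is no locking, so the model is single-threaded only.

```c
#include <assert.h>
#include <stddef.h>

/* One deferred task: a function plus its argument (hypothetical layout). */
struct sim_task {
    struct sim_task *next;
    int queued;                  /* non-zero while the task sits on a queue */
    void (*routine)(void *);
    void *data;
};

/* A task queue is a FIFO list of tasks. */
struct sim_task_queue {
    struct sim_task *head;
    struct sim_task **tail;
};

void sim_queue_init(struct sim_task_queue *q)
{
    q->head = NULL;
    q->tail = &q->head;
}

/* Append a task; returns 1 if queued, 0 if it was already pending. */
int sim_queue_task(struct sim_task *t, struct sim_task_queue *q)
{
    if (t->queued)
        return 0;
    t->queued = 1;
    t->next = NULL;
    *q->tail = t;
    q->tail = &t->next;
    return 1;
}

/* Detach the whole list, then run the tasks in order; returns how many ran. */
int sim_run_task_queue(struct sim_task_queue *q)
{
    struct sim_task *t = q->head;
    int ran = 0;

    sim_queue_init(q);           /* a running task may queue itself again */
    while (t) {
        struct sim_task *next = t->next;

        t->queued = 0;
        t->routine(t->data);
        ran++;
        t = next;
    }
    return ran;
}

/* Demo routine: add 1 to the int that data points to. */
void sim_bump(void *data) { ++*(int *)data; }
```

A driver's close path in this model would call sim_run_task_queue() before freeing whatever the data pointers refer to, which is the flushing situation the paragraph above describes.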

2.9 Tasklets

Not yet, will be in future revision.

2.10 Softirqs

Not yet, will be in future revision.

2.11 How System Calls Are Implemented on i386 Architecture?

There are two mechanisms under Linux for implementing system calls:

lcall7/lcall27 call gates;
int 0x80 software interrupt.

Native Linux programs use int 0x80 whilst binaries from foreign flavours of UNIX (Solaris, UnixWare 7 etc.) use the lcall7 mechanism. The name 'lcall7' is historically misleading because it also covers lcall27 (e.g. Solaris/x86), but the handler function is called lcall7_func.

When the system boots, the function arch/i386/kernel/traps.c:trap_init() is called which sets up the IDT so that vector 0x80 (of type 15, dpl 3) points to the address of system_call entry from arch/i386/kernel/entry.S.

When a userspace application makes a system call, the arguments are passed via registers and the application executes the 'int 0x80' instruction. This causes a trap into kernel mode and the processor jumps to the system_call entry point in entry.S. What this does is:

Save registers.
Set %ds and %es to KERNEL_DS, so that all data (and extra segment) references are made in kernel address space.
If the value of %eax is greater than or equal to NR_syscalls (currently 256), fail with ENOSYS error.
If the task is being ptraced (tsk->ptrace & PF_TRACESYS), do special processing. This is to support programs like strace (analogue of SVR4 truss(1)) or debuggers.
Call sys_call_table+4*(syscall_number from %eax). This table is initialised in the same file (arch/i386/kernel/entry.S) to point to individual system call handlers which under Linux are (usually) prefixed with sys_, e.g. sys_open, sys_exit, etc. These C system call handlers will find their arguments on the stack where SAVE_ALL stored them.
Enter 'system call return path'. This is a separate label because it is used not only by int 0x80 but also by lcall7, lcall27. This is concerned with handling tasklets (including bottom halves), checking if a schedule() is needed (tsk->need_resched != 0), checking if there are signals pending and if so handling them.

Linux supports up to 6 arguments for system calls. They are passed in %ebx, %ecx, %edx, %esi, %edi (and %ebp used temporarily, see _syscall6() in asm-i386/unistd.h). The system call number is passed via %eax.
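
The range check and the indexed call through sys_call_table can be modelled in portable C. This is a user-space sketch, not the assembly in entry.S: the sim_ names, the four-entry table and the two handlers are invented for illustration, and the register structure only mirrors the order in which the text lists the argument registers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Saved registers as the dispatcher sees them (hypothetical layout). */
struct sim_regs {
    long eax;                               /* system call number */
    long ebx, ecx, edx, esi, edi, ebp;      /* up to six arguments */
};

typedef long (*sim_syscall_fn)(const struct sim_regs *);

/* Two made-up handlers standing in for sys_xxx() routines. */
long sim_sys_zero(const struct sim_regs *r) { (void)r; return 0; }
long sim_sys_add(const struct sim_regs *r)  { return r->ebx + r->ecx; }

#define SIM_NR_SYSCALLS 4                   /* the text gives 256 for the real table */

static const sim_syscall_fn sim_call_table[SIM_NR_SYSCALLS] = {
    sim_sys_zero,                           /* 0 */
    sim_sys_add,                            /* 1 */
    /* entries 2 and 3 stay NULL: not implemented in this model */
};

/* Check the number against the table size, then call through the table. */
long sim_system_call(const struct sim_regs *r)
{
    if (r->eax < 0 || r->eax >= SIM_NR_SYSCALLS)
        return -ENOSYS;
    if (sim_call_table[r->eax] == NULL)
        return -ENOSYS;
    return sim_call_table[r->eax](r);
}
```

The handlers read their arguments from the saved register block, which plays the role of the stack frame that SAVE_ALL builds for the real sys_ functions.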

2.12 Atomic Operations

There are two types of atomic operations: bitmaps and atomic_t. Bitmaps are very convenient for maintaining a concept of "allocated" or "free" units from some large collection where each unit is identified by some number, for example free inodes or free blocks. They are also widely used for simple locking, for example to provide exclusive access to open a device. An example of this can be found in arch/i386/kernel/microcode.c:

--------------------------------------------------------------------------------

/*
* Bits in microcode_status. (31 bits of room for future expansion)
*/
#define MICROCODE_IS_OPEN 0 /* set if device is in use */

static unsigned long microcode_status;

--------------------------------------------------------------------------------

There is no need to initialise microcode_status to 0 as BSS is zero-cleared under Linux explicitly.

--------------------------------------------------------------------------------

/*
 * We enforce only one user at a time here with open/close.
 */
static int microcode_open(struct inode *inode, struct file *file)
{
        if (!capable(CAP_SYS_RAWIO))
                return -EPERM;

        /* one at a time, please */
        if (test_and_set_bit(MICROCODE_IS_OPEN, &microcode_status))
                return -EBUSY;

        MOD_INC_USE_COUNT;
        return 0;
}

--------------------------------------------------------------------------------

The operations on bitmaps are:

void set_bit(int nr, volatile void *addr): set bit nr in the bitmap pointed to by addr.
void clear_bit(int nr, volatile void *addr): clear bit nr in the bitmap pointed to by addr.
void change_bit(int nr, volatile void *addr): toggle bit nr (if set clear, if clear set) in the bitmap pointed to by addr.
int test_and_set_bit(int nr, volatile void *addr): atomically set bit nr and return the old bit value.
int test_and_clear_bit(int nr, volatile void *addr): atomically clear bit nr and return the old bit value.
int test_and_change_bit(int nr, volatile void *addr): atomically toggle bit nr and return the old bit value.

These operations use the LOCK_PREFIX macro, which on SMP kernels evaluates to the bus lock instruction prefix and to nothing on UP. This guarantees atomicity of access in an SMP environment.
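
For readers who want to experiment outside the kernel, the exclusive-open pattern can be reproduced with C11 atomics. This is a sketch under stated assumptions: the sim_ functions are invented stand-ins that operate on a single word only, and they rely on the compiler's stdatomic.h rather than on the kernel's LOCK_PREFIX implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

#define SIM_IS_OPEN 0                    /* bit number: set while the device is in use */

static atomic_ulong sim_status;          /* static storage, so it starts at zero */

/* Atomically set bit nr and return its previous value (0 or 1). */
int sim_test_and_set_bit(int nr, atomic_ulong *addr)
{
    unsigned long mask = 1UL << nr;

    /* atomic_fetch_or() returns the word as it was before the OR */
    return (atomic_fetch_or(addr, mask) & mask) != 0;
}

/* Atomically clear bit nr. */
void sim_clear_bit(int nr, atomic_ulong *addr)
{
    atomic_fetch_and(addr, ~(1UL << nr));
}

/* Only the first caller gets in; later callers see -EBUSY until release. */
int sim_open(void)
{
    if (sim_test_and_set_bit(SIM_IS_OPEN, &sim_status))
        return -EBUSY;
    return 0;
}

void sim_release(void)
{
    sim_clear_bit(SIM_IS_OPEN, &sim_status);
}
```

The point of using test-and-set rather than a separate test followed by a set is that two callers racing on the same bit cannot both observe it as clear.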

Sometimes bit manipulations are not convenient, and instead we need to perform arithmetic operations - add, subtract, increment, decrement. The typical cases are reference counts (e.g. for inodes). This facility is provided by the atomic_t data type and the following operations:

atomic_read(&v): read the value of atomic_t variable v.
atomic_set(&v, i): set the value of atomic_t variable v to integer i.
void atomic_add(int i, volatile atomic_t *v): add integer i to the value of atomic variable pointed to by v.
void atomic_sub(int i, volatile atomic_t *v): subtract integer i from the value of atomic variable pointed to by v.
int atomic_sub_and_test(int i, volatile atomic_t *v): subtract integer i from the value of atomic variable pointed to by v; return 1 if the new value is 0, return 0 otherwise.
void atomic_inc(volatile atomic_t *v): increment the value by 1.
void atomic_dec(volatile atomic_t *v): decrement the value by 1.
int atomic_dec_and_test(volatile atomic_t *v): decrement the value; return 1 if the new value is 0, return 0 otherwise.
int atomic_inc_and_test(volatile atomic_t *v): increment the value; return 1 if the new value is 0, return 0 otherwise.
int atomic_add_negative(int i, volatile atomic_t *v): add the value of i to v and return 1 if the result is negative. Return 0 if the result is greater than or equal to 0. This operation is used for implementing semaphores.
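
The reference-counting use of these operations can be sketched with C11 atomics in user space. The object type and the sim_ helpers below are hypothetical; sim_dec_and_test() only imitates the return convention described for atomic_dec_and_test().

```c
#include <assert.h>
#include <stdatomic.h>

/* A hypothetical reference-counted object. */
struct sim_obj {
    atomic_int count;
    int freed;                   /* set when the last reference is dropped */
};

/* Decrement; return 1 if the new value is 0, return 0 otherwise. */
int sim_dec_and_test(atomic_int *v)
{
    /* atomic_fetch_sub() returns the value held before the subtraction */
    return atomic_fetch_sub(v, 1) == 1;
}

/* Take one more reference. */
void sim_get(struct sim_obj *o)
{
    atomic_fetch_add(&o->count, 1);
}

/* Drop one reference; only the caller that drops the last one "frees". */
void sim_put(struct sim_obj *o)
{
    if (sim_dec_and_test(&o->count))
        o->freed = 1;
}
```

Because the decrement and the zero test are a single atomic step, exactly one of several concurrent sim_put() callers sees the count reach zero.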

2.13 Spinlocks, Read-write Spinlocks and Big-Reader Spinlocks

Since the early days of Linux (the early 1990s), developers have faced the classical problem of accessing shared data between different types of context (user process vs interrupt) and between different instances of the same context on multiple CPUs.

SMP support was added to Linux 1.3.42 on 15 Nov 1995 (the original patch was made to 1.3.37 in October the same year).

If the critical region of code may be executed by either process context or interrupt context, then the way to protect it using cli/sti instructions on UP is:

--------------------------------------------------------------------------------

unsigned long flags;

save_flags(flags);
cli();
/* critical code */
restore_flags(flags);

--------------------------------------------------------------------------------

While this is ok on UP, it obviously is of no use on SMP because the same code sequence may be executed simultaneously on another CPU, and while cli() provides protection against races with interrupt context on each CPU individually, it provides no protection at all against races between contexts running on different CPUs. This is where spinlocks are useful.

There are three types of spinlocks: vanilla (basic), read-write and big-reader spinlocks. Read-write spinlocks should be used when there is a natural tendency of 'many readers and few writers'. Example of this is access to the list of registered filesystems (see fs/super.c). The list is guarded by the file_systems_lock read-write spinlock because one needs exclusive access only when registering/unregistering a filesystem, but any process can read the file /proc/filesystems or use the sysfs(2) system call to force a read-only scan of the file_systems list. This makes it sensible to use read-write spinlocks. With read-write spinlocks, one can have multiple readers at a time but only one writer and there can be no readers while there is a writer. Btw, it would be nice if new readers would not get a lock while there is a writer trying to get a lock, i.e. if Linux could correctly deal with the issue of potential writer starvation by multiple readers. This would mean that readers must be blocked while there is a writer attempting to get the lock. This is not currently the case and it is not obvious whether this should be fixed - the argument to the contrary is - readers usually take the lock for a very short time so should they really be starved while the writer takes the lock for potentially longer periods?
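
The reader/writer rules just described (many readers, one writer, no readers while a writer holds the lock) can be modelled with a single atomic counter. This is a user-space sketch with invented sim_ names, written with C11 atomics; it is not how the kernel implements rwlock_t. Like the behaviour the paragraph describes, the model lets new readers in while a writer is still waiting, so a steady stream of readers can starve the writer.

```c
#include <assert.h>
#include <stdatomic.h>

/* state > 0: that many readers; state == 0: free; state == -1: one writer. */
typedef struct { atomic_int state; } sim_rwlock_t;

/* Try to enter as a reader; fails only while a writer holds the lock. */
int sim_read_trylock(sim_rwlock_t *l)
{
    int old = atomic_load(&l->state);

    while (old >= 0) {
        /* on failure, old is refreshed with the current value */
        if (atomic_compare_exchange_weak(&l->state, &old, old + 1))
            return 1;
    }
    return 0;
}

void sim_read_unlock(sim_rwlock_t *l)
{
    atomic_fetch_sub(&l->state, 1);
}

/* Try to enter as the writer; needs the lock completely free. */
int sim_write_trylock(sim_rwlock_t *l)
{
    int expected = 0;

    return atomic_compare_exchange_strong(&l->state, &expected, -1);
}

void sim_write_unlock(sim_rwlock_t *l)
{
    atomic_store(&l->state, 0);
}

/* The spinning variants simply retry until the attempt succeeds. */
void sim_read_lock(sim_rwlock_t *l)
{
    while (!sim_read_trylock(l))
        continue;
}

void sim_write_lock(sim_rwlock_t *l)
{
    while (!sim_write_trylock(l))
        continue;
}
```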

Big-reader spinlocks are a form of read-write spinlocks heavily optimised for very light read access, with a penalty for writes. There is a limited number of big-reader spinlocks - currently only two exist, of which one is used only on sparc64 (global irq) and the other is used for networking. In all other cases where the access pattern does not fit into any of these two scenarios, one should use basic spinlocks. You cannot block while holding any kind of spinlock.

Spinlocks come in three flavours: plain, _irq() and _irqsave().

Plain spin_lock()/spin_unlock(): if you know the interrupts are always disabled or if you do not race with interrupt context (e.g. from within interrupt handler), then you can use this one. It does not touch interrupt state on the current CPU.
spin_lock_irq()/spin_unlock_irq(): if you know that interrupts are always enabled then you can use this version, which simply disables (on lock) and re-enables (on unlock) interrupts on the current CPU. For example, rtc_read() uses spin_lock_irq(&rtc_lock) (interrupts are always enabled inside read()) whilst rtc_interrupt() uses spin_lock(&rtc_lock) (interrupts are always disabled inside interrupt handler). Note that rtc_read() uses spin_lock_irq() and not the more generic spin_lock_irqsave() because on entry to any system call interrupts are always enabled.
spin_lock_irqsave()/spin_unlock_irqrestore(): the strongest form, to be used when the interrupt state is not known, but only if interrupts matter at all, i.e. there is no point in using it if our interrupt handlers don't execute any critical code.

The reason you cannot use plain spin_lock() if you race against interrupt handlers is that if you take the lock and then an interrupt comes in on the same CPU, the handler will busy wait for the lock forever: the lock holder, having been interrupted, will not continue until the interrupt handler returns.

The most common usage of a spinlock is to access a data structure shared between user process context and interrupt handlers:

--------------------------------------------------------------------------------

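spinlock_t my_lock = SPIN_LOCK_UNLOCKED;

my_ioctl()
{
        /* process context: interrupts are enabled on entry */
        spin_lock_irq(&my_lock);
        /* critical section */
        spin_unlock_irq(&my_lock);
}

my_irq_handler()
{
        /* interrupt context: interrupts are already disabled */
        spin_lock(&my_lock);
        /* critical section */
        spin_unlock(&my_lock);
}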

--------------------------------------------------------------------------------

There are a couple of things to note about this example:

The process context, represented here as a typical driver method - ioctl() (arguments and return values omitted for clarity), must use spin_lock_irq() because it knows that interrupts are always enabled while executing the device ioctl() method.
Interrupt context, represented here by my_irq_handler() (again arguments omitted for clarity) can use plain spin_lock() form because interrupts are disabled inside an interrupt handler.

2.14 Semaphores and read/write Semaphores

Sometimes, while accessing a shared data structure, one must perform operations that can block, for example copy data to userspace. The locking primitive available for such scenarios under Linux is called a semaphore. There are two types of semaphores: basic and read-write semaphores. Depending on the initial value of the semaphore, they can be used for either mutual exclusion (initial value of 1) or to provide more sophisticated type of access.

Read-write semaphores differ from basic semaphores in the same way as read-write spinlocks differ from basic spinlocks: one can have multiple readers at a time but only one writer and there can be no readers while there are writers - i.e. the writer blocks all readers and new readers block while a writer is waiting.

Also, basic semaphores can be interruptible - just use down_interruptible() instead of the plain down() (the semaphore is still released with up()) and check the value returned from down_interruptible(): it will be non-zero if the operation was interrupted.

Using semaphores for mutual exclusion is ideal in situations where a critical code section may call, by reference, unknown functions registered by other subsystems/modules, i.e. the caller cannot know a priori whether the function blocks or not.
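
How the initial value decides whether a basic semaphore acts as a mutex or admits several holders can be shown with a small counting model. This is a user-space sketch with invented sim_ names; it offers only a non-blocking attempt to take the semaphore, whereas a real down() would put the caller to sleep instead of returning failure.

```c
#include <assert.h>
#include <stdatomic.h>

/* count = number of holders that may still enter. */
struct sim_sem { atomic_int count; };

void sim_sema_init(struct sim_sem *s, int val)
{
    atomic_init(&s->count, val);
}

/* Non-blocking acquire: returns 0 on success, 1 if the caller would have to sleep. */
int sim_down_trylock(struct sim_sem *s)
{
    int old = atomic_load(&s->count);

    while (old > 0) {
        /* on failure, old is refreshed with the current value */
        if (atomic_compare_exchange_weak(&s->count, &old, old - 1))
            return 0;
    }
    return 1;
}

/* Release: let one more holder in. */
void sim_up(struct sim_sem *s)
{
    atomic_fetch_add(&s->count, 1);
}
```

Initialised to 1, the model admits one holder at a time (mutual exclusion); initialised to 2, it admits two before the third attempt fails.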

A simple example of semaphore usage is in kernel/sys.c, implementation of gethostname(2)/sethostname(2) system calls.

--------------------------------------------------------------------------------

asmlinkage long sys_sethostname(char *name, int len)
{
        int errno;

        if (!capable(CAP_SYS_ADMIN))
                return -EPERM;
        if (len < 0 || len > __NEW_UTS_LEN)
                return -EINVAL;
        down_write(&uts_sem);
        errno = -EFAULT;
        if (!copy_from_user(system_utsname.nodename, name, len)) {
                system_utsname.nodename[len] = 0;
                errno = 0;
        }
        up_write(&uts_sem);
        return errno;
}

asmlinkage long sys_gethostname(char *name, int len)
{
        int i, errno;

        if (len < 0)
                return -EINVAL;
        down_read(&uts_sem);
        i = 1 + strlen(system_utsname.nodename);
        if (i > len)
                i = len;
        errno = 0;
        if (copy_to_user(name, system_utsname.nodename, i))
                errno = -EFAULT;
        up_read(&uts_sem);
        return errno;
}

--------------------------------------------------------------------------------

The points to note about this example are:

The functions may block while copying data from/to userspace in copy_from_user()/copy_to_user(). Therefore they could not use any form of spinlock here.
The semaphore type chosen is read-write as opposed to basic because there may be lots of concurrent gethostname(2) requests which need not be mutually exclusive.

Although the Linux implementation of semaphores and read-write semaphores is very sophisticated, there are possible scenarios one can think of which are not yet implemented, for example there is no concept of interruptible read-write semaphores. This is obviously because there are no real-world situations which require these exotic flavours of the primitives.

2.15 Kernel Support for Loading Modules

Linux is a monolithic operating system and despite all the modern hype about some "advantages" offered by operating systems based on micro-kernel design, the truth remains (quoting Linus Torvalds himself):

... message passing as the fundamental operation of the OS is just an exercise in computer science masturbation. It may feel good, but you don't actually get anything DONE.

Therefore, Linux is and will always be based on a monolithic design, which means that all subsystems run in the same privileged mode and share the same address space; communication between them is achieved by the usual C function call means.

However, although separating kernel functionality into separate "processes" as is done in micro-kernels is definitely a bad idea, separating it into dynamically loadable on demand kernel modules is desirable in some circumstances (e.g. on machines with low memory or for installation kernels which could otherwise contain ISA auto-probing device drivers that are mutually exclusive). The decision whether to include support for loadable modules is made at compile time and is determined by the CONFIG_MODULES option. Support for module autoloading via request_module() mechanism is a separate compilation option (CONFIG_KMOD).

The following functionality can be implemented as loadable modules under Linux:

Character and block device drivers, including misc device drivers.
Terminal line disciplines.
Virtual (regular) files in /proc and in devfs (e.g. /dev/cpu/microcode vs /dev/misc/microcode).
Binary file formats (e.g. ELF, aout, etc).
Execution domains (e.g. Linux, UnixWare7, Solaris, etc).
Filesystems.
System V IPC.

There are a few things that cannot be implemented as modules under Linux (probably because it makes no sense for them to be modularised):

Scheduling algorithms.
VM policies.
Buffer cache, page cache and other caches.

Linux provides several system calls to assist in loading modules:

caddr_t create_module(const char *name, size_t size): allocates size bytes using vmalloc() and maps a module structure at the beginning thereof. This new module is then linked into the list headed by module_list. Only a process with CAP_SYS_MODULE can invoke this system call, others will get EPERM returned.
long init_module(const char *name, struct module *image): loads the relocated module image and causes the module's initialisation routine to be invoked. Only a process with CAP_SYS_MODULE can invoke this system call, others will get EPERM returned.
long delete_module(const char *name): attempts to unload the module. If name == NULL, attempt is made to unload all unused modules.
long query_module(const char *name, int which, void *buf, size_t bufsize, size_t *ret): returns information about a module (or about all modules).

The command interface available to users consists of:

insmod: insert a single module.
modprobe: insert a module including all other modules it depends on.
rmmod: remove a module.
modinfo: print some information about a module, e.g. author, description, parameters the module accepts, etc.

Apart from being able to load a module manually using either insmod or modprobe, it is also possible to have the module inserted automatically by the kernel when a particular functionality is required. The kernel interface for this is the function called request_module(name) which is exported to modules, so that modules can load other modules as well. The request_module(name) function internally creates a kernel thread which execs the userspace command modprobe -s -k module_name, using the standard exec_usermodehelper() kernel interface (which is also exported to modules). The function returns 0 on success; however, it is usually not worth checking the return code from request_module(). Instead, the programming idiom is:

--------------------------------------------------------------------------------

if (check_some_feature() == NULL)
        request_module(module);
if (check_some_feature() == NULL)
        return -ENODEV;

--------------------------------------------------------------------------------

For example, this is done by fs/block_dev.c:get_blkfops() to load a module block-major-N when an attempt is made to open a block device with major N. Obviously, there is no such module called block-major-N (Linux developers only chose sensible names for their modules) but it is mapped to a proper module name using the file /etc/modules.conf. However, for most well-known major numbers (and other kinds of modules) the modprobe/insmod commands know which real module to load without needing an explicit alias statement in /etc/modules.conf.
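
The "check, request, check again" idiom can be exercised in user space with a toy registry. Everything below is hypothetical: the sim_ names, the fixed-size table and the rule that only a feature called "good" can be loaded are assumptions made so that the control flow can be run and tested.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define SIM_MAX_FEATURES 8

static const char *sim_registered[SIM_MAX_FEATURES];  /* names of present features */
static int sim_nr_registered;

/* Returns the registered name, or NULL if the feature is absent. */
const char *sim_check_feature(const char *name)
{
    int i;

    for (i = 0; i < sim_nr_registered; i++)
        if (strcmp(sim_registered[i], name) == 0)
            return sim_registered[i];
    return NULL;
}

/* Stand-in for request_module(): "loading" succeeds only for the name "good". */
int sim_request_module(const char *name)
{
    if (strcmp(name, "good") != 0)
        return -ENOENT;
    if (sim_nr_registered < SIM_MAX_FEATURES)
        sim_registered[sim_nr_registered++] = "good";
    return 0;
}

/* The idiom: check, try to load, then check again instead of trusting the return code. */
int sim_use_feature(const char *name)
{
    if (sim_check_feature(name) == NULL)
        sim_request_module(name);
    if (sim_check_feature(name) == NULL)
        return -ENODEV;
    return 0;
}
```

The second check is what decides the outcome: the caller cares whether the feature is now present, not whether the loader reported success.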

A good example of loading a module is inside the mount(2) system call. The mount(2) system call accepts the filesystem type as a string which fs/super.c:do_mount() then passes on to fs/super.c:get_fs_type():

--------------------------------------------------------------------------------

static struct file_system_type *get_fs_type(const char *name)
{
        struct file_system_type *fs;

        read_lock(&file_systems_lock);
        fs = *(find_filesystem(name));
        if (fs && !try_inc_mod_count(fs->owner))
                fs = NULL;
        read_unlock(&file_systems_lock);
        if (!fs && (request_module(name) == 0)) {
                read_lock(&file_systems_lock);
                fs = *(find_filesystem(name));
                if (fs && !try_inc_mod_count(fs->owner))
                        fs = NULL;
                read_unlock(&file_systems_lock);
        }
        return fs;
}

--------------------------------------------------------------------------------

A few things to note in this function:

First we attempt to find the filesystem with the given name amongst those already registered. This is done under protection of file_systems_lock taken for read (as we are not modifying the list of registered filesystems).
If such a filesystem is found then we attempt to get a new reference to it by trying to increment its module's hold count. This always returns 1 for statically linked filesystems or for modules not presently being deleted. If try_inc_mod_count() returned 0 then we consider it a failure - i.e. if the module is there but is being deleted, it is as good as if it were not there at all.
We drop the file_systems_lock because what we are about to do next (request_module()) is a blocking operation, and therefore we can't hold a spinlock over it. Actually, in this specific case, we would have to drop file_systems_lock anyway, even if request_module() were guaranteed to be non-blocking and the module loading were executed in the same context atomically. The reason for this is that the module's initialisation function will try to call register_filesystem(), which will take the same file_systems_lock read-write spinlock for write.
If the attempt to load was successful, then we take the file_systems_lock spinlock and try to locate the newly registered filesystem in the list. Note that this is slightly wrong because it is in principle possible for a bug in modprobe command to cause it to coredump after it successfully loaded the requested module, in which case request_module() will fail even though the new filesystem will be registered, and yet get_fs_type() won't find it.
If the filesystem is found and we are able to get a reference to it, we return it. Otherwise we return NULL.

When a module is loaded into the kernel, it can refer to any symbols that are exported as public by the kernel using the EXPORT_SYMBOL() macro or by other currently loaded modules. If the module uses symbols from another module, it is marked as depending on that module during dependency recalculation, achieved by running the depmod -a command on boot (e.g. after installing a new kernel).

Usually, one must match the set of modules with the version of the kernel interfaces they use, which under Linux simply means the "kernel version" as there is no special kernel interface versioning mechanism in general. However, there is a limited functionality called "module versioning" or CONFIG_MODVERSIONS which makes it possible to avoid recompiling modules when switching to a new kernel. What happens here is that the kernel symbol table is treated differently for internal access and for access from modules. The elements of the public (i.e. exported) part of the symbol table are built by 32-bit checksumming the C declaration. So, in order to resolve a symbol used by a module during loading, the loader must match the full representation of the symbol that includes the checksum; it will refuse to load the module if these symbols differ. This only happens when both the kernel and the module are compiled with module versioning enabled. If either one of them uses the original symbol names, the loader simply tries to match the kernel version declared by the module and the one exported by the kernel and refuses to load if they differ.
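
The idea of matching on "name plus checksum of the declaration" can be illustrated with a toy model. The sketch below is hypothetical throughout: it uses a generic 32-bit FNV-1a hash rather than the kernel's actual checksum, the _R suffix format is an assumption for illustration, and the declaration is passed in as a plain string.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy 32-bit checksum over a declaration string (FNV-1a, NOT the kernel's algorithm). */
unsigned long sim_checksum(const char *decl)
{
    unsigned long h = 2166136261UL;

    while (*decl) {
        h ^= (unsigned char)*decl++;
        h = (h * 16777619UL) & 0xffffffffUL;   /* keep the result to 32 bits */
    }
    return h;
}

/* Build "name_R" followed by 8 hex digits into buf; the suffix format is an assumption. */
void sim_versioned_name(char *buf, size_t len, const char *name, const char *decl)
{
    snprintf(buf, len, "%s_R%08lx", name, sim_checksum(decl));
}

/* A module's reference resolves only if the full versioned names are identical. */
int sim_symbol_matches(const char *kernel_sym, const char *module_sym)
{
    return strcmp(kernel_sym, module_sym) == 0;
}
```

In this model a module built against the declaration "int foo(int)" resolves against a kernel exporting the same declaration, while one built against "int foo(long)" carries a different suffix and is refused.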
