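Linux kernel code analysis: slab.c (version 2.4.22), by 刘亢 (Part 1..

Date: 2007-04-17  Source: Echo CHEN

Linux kernel code analysis: slab.c, by 刘亢 [email protected]

slab.c is taken from Linux kernel 2.4.22; the file is released under the GNU General Public License.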

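1. Background:

The slab concept:

Motivation: a running operating system constantly creates, uses, and frees large numbers of objects of the same kinds, so making the creation of these recurring objects cheaper improves efficiency considerably.

The idea was first proposed by a Sun engineer in 1994 and was first deployed in SunOS 5.4.

The basic idea of the slab algorithm: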

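    Allocation:
        if (the matching cache has a free slot)
            use that slot; no re-initialization is needed;
        else {
            allocate memory;
            initialize the object;
        }
    Freeing:
        mark the slot as free in the cache; do not run the destructor;
    When memory runs short:
        find object space that is not in use;
        run the destructor on some of the objects, as required;
        release the memory those objects occupy;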
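The pseudocode above can be tried out in user space. What follows is only a minimal sketch under stated assumptions, not the kernel's implementation: the names `sketch_cache`, `sketch_alloc`, `sketch_free` and `sketch_reap` are hypothetical, `malloc` stands in for the page allocator, and the cache holds a small fixed number of free slots. It shows the key property: the constructor runs only when fresh memory is obtained, and the destructor runs only when memory is given back.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_SLOTS 4  /* hypothetical capacity of the free-slot array */

typedef struct {
    int payload;        /* stands in for state that is costly to set up */
} sketch_obj;

typedef struct {
    sketch_obj *free_slots[SKETCH_SLOTS]; /* constructed objects not in use */
    int nfree;
    int ctor_calls;     /* number of constructor runs */
    int dtor_calls;     /* number of destructor runs */
} sketch_cache;

static void sketch_ctor(sketch_cache *c, sketch_obj *o)
{
    o->payload = 42;
    c->ctor_calls++;
}

static void sketch_dtor(sketch_cache *c, sketch_obj *o)
{
    o->payload = 0;
    c->dtor_calls++;
}

/* Allocation: reuse a free slot if there is one, else allocate and construct. */
static sketch_obj *sketch_alloc(sketch_cache *c)
{
    sketch_obj *o;

    if (c->nfree > 0)
        return c->free_slots[--c->nfree];  /* already initialized */
    o = malloc(sizeof(*o));
    if (o)
        sketch_ctor(c, o);
    return o;
}

/* Freeing: only mark the object as free; the destructor is not run. */
static void sketch_free(sketch_cache *c, sketch_obj *o)
{
    assert(o != NULL);
    if (c->nfree < SKETCH_SLOTS) {
        c->free_slots[c->nfree++] = o;
        return;
    }
    /* No room left in this toy cache: destroy the object right away. */
    sketch_dtor(c, o);
    free(o);
}

/* Memory pressure: destruct and release every unused object. */
static int sketch_reap(sketch_cache *c)
{
    int released = 0;

    while (c->nfree > 0) {
        sketch_obj *o = c->free_slots[--c->nfree];
        sketch_dtor(c, o);
        free(o);
        released++;
    }
    return released;
}
```

Allocating, freeing and allocating again returns the same, still-initialized object with only one constructor call; `sketch_reap` is the only place where the destructor runs in the normal path.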

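Cache: each type of object is kept in a cache of its own.

Slab: each slab consists of 2^n contiguous pages (gfporder), with an upper limit of MAX_GFP_ORDER = 5, i.e. 32 pages.

Colouring: the objects in each new slab start at a different offset, a multiple of the alignment unit (colour_off), so that objects from different slabs fall on different hardware cache lines. This improves the utilization and efficiency of the hardware cache.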
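A short sketch may make colouring concrete. It reuses the field names `colour`, `colour_off` and `colour_next` from `struct kmem_cache_s` in the listing below, but the struct and function names (`sketch_colour`, `sketch_next_colour_offset`) are hypothetical, and the cycling rule is an assumption modelled on how those three fields are described, not code copied from this part of the file.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t colour;       /* number of distinct colours: left_over / colour_off */
    size_t colour_off;   /* step between two colours, in bytes */
    size_t colour_next;  /* colour the next new slab will get */
} sketch_colour;

/* Derive the colour range from the space left over in a slab. */
static void sketch_colour_init(sketch_colour *c, size_t left_over,
                               size_t colour_off)
{
    assert(colour_off > 0);
    c->colour = left_over / colour_off;
    c->colour_off = colour_off;
    c->colour_next = 0;
}

/* Return the byte offset for the next new slab and advance the colour. */
static size_t sketch_next_colour_offset(sketch_colour *c)
{
    size_t offset = c->colour_next * c->colour_off;

    c->colour_next++;
    if (c->colour_next >= c->colour)
        c->colour_next = 0;   /* wrap around once all colours are used */
    return offset;
}
```

With 100 bytes left over and a 32-byte step there are three colours, so successive slabs start their objects at offsets 0, 32, 64 and then 0 again. When the leftover space is smaller than one step, every slab gets offset 0.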

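The two ways of managing a slab:

on-slab: used for small objects (smaller than 1/8 of a page); the slab management structure is stored inside the slab itself.

off-slab: used for large objects (1/8 of a page or larger); the slab management structure (slab_t plus the bufctl array) is allocated from a separate general cache, referenced through the cache's slabp_cache pointer. According to the paper by the inventor of the slab allocator, slabs are not well suited to large objects.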
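To see what the two modes cost, here is a user-space sketch modelled on `kmem_cache_estimate()` from the listing below. The constants are assumptions chosen for illustration (4096-byte pages, 32-byte L1 cache lines, a 24-byte `slab_t` and a 4-byte `kmem_bufctl_t`, roughly what a 32-bit machine would have), the `sk_` names are hypothetical, and the `SLAB_LIMIT` clamp of the original is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values, for illustration only. */
#define SK_PAGE_SIZE   4096UL
#define SK_L1_BYTES    32UL
#define SK_SLAB_T_SIZE 24UL   /* assumed sizeof(slab_t) */
#define SK_BUFCTL_SIZE 4UL    /* assumed sizeof(kmem_bufctl_t) */
#define SK_L1_ALIGN(x) (((x) + SK_L1_BYTES - 1) & ~(SK_L1_BYTES - 1))

/* Objects of 1/8 page or more start out with off-slab management. */
static int sk_is_off_slab(size_t size)
{
    return size >= (SK_PAGE_SIZE >> 3);
}

/*
 * Compute how many objects of `size` bytes fit into a slab of 2^order pages
 * and how many bytes are left over. With on-slab management the slab_t and
 * one bufctl entry per object also have to fit into the slab.
 */
static void sk_estimate(unsigned int order, size_t size, int off_slab,
                        size_t *left_over, unsigned int *num)
{
    size_t total = SK_PAGE_SIZE << order;
    size_t base = off_slab ? 0 : SK_SLAB_T_SIZE;
    size_t extra = off_slab ? 0 : SK_BUFCTL_SIZE;
    size_t i = 0;

    assert(size > 0);
    while (i * size + SK_L1_ALIGN(base + i * extra) <= total)
        i++;
    if (i > 0)
        i--;

    *num = (unsigned int)i;
    *left_over = total - i * size - SK_L1_ALIGN(base + i * extra);
}
```

Under these assumptions a one-page slab holds 30 objects of 128 bytes on-slab with 96 bytes left over. For 512-byte objects, off-slab management fits 8 objects with nothing left over, while on-slab management fits only 7, which illustrates why large objects start out off-slab.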

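The important slab operations:

Cache creation, kmem_cache_create, and destruction, kmem_cache_destroy

Cache shrinking, kmem_cache_shrink, and growing, kmem_cache_grow

Object allocation, kmem_cache_alloc, and release, kmem_cache_free

Kernel memory allocation, kmalloc, and release, kfree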

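2. Important data structures:

typedef unsigned int kmem_bufctl_t: the per-object control entry inside a slab; the bufctl array links the free objects of the slab together.

The cache_sizes table: for each general object size (powers of two from 32 bytes to 131072 bytes) it holds two pointers to general caches, one for DMA memory and one for non-DMA memory.

Lists: the most important are the three lists in the cache structure that manage the slabs: fully used slabs, partially used slabs, and completely unused slabs.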
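The rule that decides which of the three lists a slab belongs to can be written down in a few lines. This is a sketch with hypothetical names (`sk_slab_list`, `sk_pick_source`); `inuse` and `num` mirror the fields of `slab_t` and `kmem_cache_s` in the listing below, and the allocation preference follows the header comment of slab.c (partial slabs first, then empty ones, otherwise a new slab).

```c
#include <assert.h>

typedef enum { SK_LIST_FREE, SK_LIST_PARTIAL, SK_LIST_FULL } sk_list;

/* Which list a slab with `inuse` active objects out of `num` belongs to. */
static sk_list sk_slab_list(unsigned int inuse, unsigned int num)
{
    assert(num > 0 && inuse <= num);
    if (inuse == 0)
        return SK_LIST_FREE;     /* no object allocated */
    if (inuse == num)
        return SK_LIST_FULL;     /* no free object left */
    return SK_LIST_PARTIAL;
}

typedef enum { SK_FROM_PARTIAL, SK_FROM_FREE, SK_MUST_GROW } sk_source;

/* Where the next allocation comes from, given how many slabs each list has. */
static sk_source sk_pick_source(unsigned int n_partial, unsigned int n_free)
{
    if (n_partial > 0)
        return SK_FROM_PARTIAL;
    if (n_free > 0)
        return SK_FROM_FREE;
    return SK_MUST_GROW;         /* a new slab has to be allocated */
}
```

Preferring partial slabs keeps completely empty slabs empty, so they remain candidates for being handed back to the page allocator.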

Structures: see the code analysis below.
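As a small illustration of the cache_sizes table described above, the following sketch finds the general cache that would serve a request of a given size. It is an assumption-laden simplification: the table keeps only the sizes (the real entries also carry the `cs_cachep` and `cs_dmacachep` pointers), the 32-byte entry assumes 4096-byte pages, and `sk_find_general_size` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Sizes of the general caches, terminated by 0, as in cache_sizes[]. */
static const size_t sk_cache_sizes[] = {
    32, 64, 128, 256, 512, 1024, 2048, 4096, 8192,
    16384, 32768, 65536, 131072, 0
};

/*
 * Return the object size of the smallest general cache that can hold
 * `size` bytes, or 0 if the request is larger than the largest cache.
 */
static size_t sk_find_general_size(size_t size)
{
    const size_t *p;

    for (p = sk_cache_sizes; *p != 0; p++) {
        if (size <= *p)
            return *p;
    }
    return 0;
}
```

A 33-byte request is served from the 64-byte cache, so up to almost half of each object can be wasted; that is the price of having only power-of-two general caches.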

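3. Code analysis:

What each colour means:

Red: code comments;

Lilac: preprocessor directives;

Blue: C keywords and function definitions;

Green: macro definitions;

Black: code;

Grey: output messages;

Dark blue: my own annotations.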

    /*
     * linux/mm/slab.c
     * Written by Mark Hemment, 1996/97.
     * ([email protected])
     *
     * kmem_cache_destroy() + some cleanup - 1999 Andrea Arcangeli
     *
     * Major cleanup, different bufctl logic, per-cpu arrays
     *  (c) 2000 Manfred Spraul
     * [Note: the lines above are the copyright notice.]
     *
     * An implementation of the Slab Allocator as described in outline in;
     *  UNIX Internals: The New Frontiers by Uresh Vahalia
     *  Pub: Prentice Hall ISBN 0-13-101908-2
     * [Note: a book that covers the slab allocator.]
     * or with a little more detail in;
     *  The Slab Allocator: An Object-Caching Kernel Memory Allocator
     *  Jeff Bonwick (Sun Microsystems).
     *  Presented at: USENIX Summer 1994 Technical Conference
     * [Note: Bonwick first presented the slab (object cache) concept at
     *  this 1994 USENIX conference; see www.usenix.org.]
     *
     * The memory is organized in caches, one cache for each object type.
     * (e.g. inode_cache, dentry_cache, buffer_head, vm_area_struct)
     * Each cache consists out of many slabs (they are small (usually one
     * page long) and always contiguous), and each slab contains multiple
     * initialized objects.
     *
     * Each cache can only support one memory type (GFP_DMA, GFP_HIGHMEM,
     * normal). If you need a special memory type, then must create a new
     * cache for that memory type.
     * [Note: the GFP_* values are macros defined in include/linux/mm.h.]
     *
     * In order to reduce fragmentation, the slabs are sorted in 3 groups:
     *   full slabs with 0 free objects
     *   partial slabs
     *   empty slabs with no allocated objects
     *
     * If partial slabs exist, then new allocations come from these slabs,
     * otherwise from empty slabs or new slabs are allocated.
     *
     * kmem_cache_destroy() CAN CRASH if you try to allocate from the cache
     * during kmem_cache_destroy(). The caller must prevent concurrent allocs.
     *
     * On SMP systems, each cache has a short per-cpu head array, most allocs
     * and frees go into that array, and if that array overflows, then 1/2
     * of the entries in the array are given back into the global cache.
     * This reduces the number of spinlock operations.
     *
     * The c_cpuarray may not be read with enabled local interrupts.
     *
     * SMP synchronization:
     *  constructors and destructors are called without any locking.
     *  Several members in kmem_cache_t and slab_t never change, they
     *  are accessed without any locking.
     *  The per-cpu arrays are never accessed from the wrong cpu, no locking.
     *  The non-constant members are protected with a per-cache irq spinlock.
     *
     * Further notes from the original documentation:
     *
     * 11 April '97.  Started multi-threading - markhe
     *  The global cache-chain is protected by the semaphore 'cache_chain_sem'.
     *  The sem is only needed when accessing/extending the cache-chain, which
     *  can never happen inside an interrupt (kmem_cache_create(),
     *  kmem_cache_shrink() and kmem_cache_reap()).
     *
     *  To prevent kmem_cache_shrink() trying to shrink a 'growing' cache (which
     *  maybe be sleeping and therefore not holding the semaphore/lock), the
     *  growing field is used. This also prevents reaping from a cache.
     * [Note: it is the growing field, not the semaphore, that keeps
     *  kmem_cache_shrink() and the reaper away from a cache that is growing.]
     *
     *  At present, each engine can be growing a cache. This should be blocked.
     *
     */

    #include <linux/config.h>     /* [Note: pulls in the autoconf.h generated when the kernel is configured] */
    #include <linux/slab.h>       /* [Note: the slab allocator's own header] */
    #include <linux/interrupt.h>  /* [Note: interrupt handling] */
    #include <linux/init.h>       /* [Note: initialization] */
    #include <linux/compiler.h>   /* [Note: compiler specifics] */
    #include <linux/seq_file.h>   /* [Note: sequential file operations] */
    #include <asm/uaccess.h>      /* [Note: access to user-space memory] */

    /*
     * DEBUG        - 1 for kmem_cache_create() to honour; SLAB_DEBUG_INITIAL,
     *                SLAB_RED_ZONE & SLAB_POISON.
     *                0 for faster, smaller code (especially in the critical paths).
     *
     * STATS        - 1 to collect stats for /proc/slabinfo.
     *                0 for faster, smaller code (especially in the critical paths).
     *
     * FORCED_DEBUG - 1 enables SLAB_RED_ZONE and SLAB_POISON (if possible)
     */

    /* [Note: if CONFIG_DEBUG_SLAB is defined, the three macros below are set to 1, otherwise to 0.] */
    #ifdef CONFIG_DEBUG_SLAB
    #define DEBUG           1
    #define STATS           1
    #define FORCED_DEBUG    1
    #else
    #define DEBUG           0
    #define STATS           0
    #define FORCED_DEBUG    0
    #endif

    /*
     * Parameters for kmem_cache_reap
     */
    #define REAP_SCANLEN    10
    #define REAP_PERFECT    10

    /* Shouldn't this be in a header file somewhere? */
    #define BYTES_PER_WORD  sizeof(void *)

    /* Legal flag mask for kmem_cache_create(). */
    #if DEBUG   /* [Note: debug build] */
    # define CREATE_MASK    (SLAB_DEBUG_INITIAL | SLAB_RED_ZONE | \
                             SLAB_POISON | SLAB_HWCACHE_ALIGN | \
                             SLAB_NO_REAP | SLAB_CACHE_DMA | \
                             SLAB_MUST_HWCACHE_ALIGN)
    #else       /* [Note: non-debug build] */
    # define CREATE_MASK    (SLAB_HWCACHE_ALIGN | SLAB_NO_REAP | \
                             SLAB_CACHE_DMA | SLAB_MUST_HWCACHE_ALIGN)
    #endif

    /*
     * kmem_bufctl_t:
     *
     * Bufctl's are used for linking objs within a slab
     * linked offsets.
     *
     * This implementation relies on "struct page" for locating the cache &
     * slab an object belongs to.
     * This allows the bufctl structure to be small (one int), but limits
     * the number of objects a slab (not a cache) can contain when off-slab
     * bufctls are used. The limit is the size of the largest general cache
     * that does not use off-slab slabs.
     * For 32bit archs with 4 kB pages, is this 56.
     * This is not serious, as it is only for large objects, when it is unwise
     * to have too many per slab.
     * Note: This limit can be raised by introducing a general cache whose size
     * is less than 512 (PAGE_SIZE<<3), but greater than 256.
     */

    #define BUFCTL_END 0xffffFFFF   /* [Note: terminates a slab's chain of free objects] */
    #define SLAB_LIMIT 0xffffFFFE   /* [Note: upper bound on the number of objects per slab] */
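The comment above says that bufctls link the objects of a slab by index. A minimal sketch of that idea, assuming a slab with four objects: the `sk_` names are hypothetical, while `free`, `inuse` and the end marker mirror `slab_t` and `BUFCTL_END`. The alloc and free steps are written after the way the allocation path is commonly described, not copied from this part of the file.

```c
#include <assert.h>

#define SK_BUFCTL_END 0xffffFFFFu
#define SK_NUM        4           /* assumed number of objects per slab */

typedef unsigned int sk_bufctl_t;

typedef struct {
    unsigned int inuse;           /* objects currently handed out */
    sk_bufctl_t  free;            /* index of the first free object */
    sk_bufctl_t  bufctl[SK_NUM];  /* bufctl[i] = index of the next free object */
} sk_slab;

/* Chain all objects together: 0 -> 1 -> 2 -> 3 -> end. */
static void sk_slab_init(sk_slab *s)
{
    unsigned int i;

    for (i = 0; i < SK_NUM - 1; i++)
        s->bufctl[i] = i + 1;
    s->bufctl[SK_NUM - 1] = SK_BUFCTL_END;
    s->free = 0;
    s->inuse = 0;
}

/* Take the first free object; returns its index or SK_BUFCTL_END if full. */
static sk_bufctl_t sk_slab_alloc(sk_slab *s)
{
    sk_bufctl_t idx = s->free;

    if (idx == SK_BUFCTL_END)
        return SK_BUFCTL_END;
    s->free = s->bufctl[idx];     /* unlink the head of the free chain */
    s->inuse++;
    return idx;
}

/* Put object `idx` back at the head of the free chain. */
static void sk_slab_free(sk_slab *s, sk_bufctl_t idx)
{
    assert(idx < SK_NUM && s->inuse > 0);
    s->bufctl[idx] = s->free;
    s->free = idx;
    s->inuse--;
}
```

Because the chain stores indices rather than pointers, one `unsigned int` per object is enough, and the object address follows from the slab's base address plus index times object size. The most recently freed object is reused first, which tends to keep it warm in the hardware cache.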

    typedef unsigned int kmem_bufctl_t;   /* [Note: kmem_bufctl_t is simply an unsigned int] */

    /* Max number of objs-per-slab for caches which use off-slab slabs.
     * Needed to avoid a possible looping condition in kmem_cache_grow().
     */
    static unsigned long offslab_limit;   /* [Note: an unsigned long] */

    /*
     * slab_t
     *
     * Manages the objs in a slab. Placed either at the beginning of mem allocated
     * for a slab, or allocated from an general cache.
     * Slabs are chained into three list: fully used, partial, fully free slabs.
     */
    typedef struct slab_s {
        struct list_head    list;
        unsigned long       colouroff;
        void                *s_mem;     /* including colour offset */
                                        /* [Note: address of the first object, colour offset included] */
        unsigned int        inuse;      /* num of objs active in slab */
        kmem_bufctl_t       free;       /* [Note: index of the first free object in this slab] */
    } slab_t;

    /* [Note: the bufctl array is stored immediately after the slab_t.] */
    #define slab_bufctl(slabp) \
        ((kmem_bufctl_t *)(((slab_t*)slabp)+1))

    /*
     * cpucache_t
     *
     * Per cpu structures
     * The limit is stored in the per-cpu structure to reduce the data cache
     * footprint.
     */
    typedef struct cpucache_s {
        unsigned int avail;     /* [Note: number of entries available] */
        unsigned int limit;     /* [Note: maximum number of entries] */
    } cpucache_t;

    /* [Note: cc_entry yields the array of object pointers stored right after the cpucache_t.] */
    #define cc_entry(cpucache) \
        ((void **)(((cpucache_t*)(cpucache))+1))
    /* [Note: cc_data yields the cpucache_t pointer of the CPU we are running on.] */
    #define cc_data(cachep) \
        ((cachep)->cpudata[smp_processor_id()])

    /*
     * kmem_cache_t
     *
     * manages a cache.
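     */

    #define CACHE_NAMELEN   20  /* max name length for a slab cache */

    struct kmem_cache_s {
    /* 1) each alloc & free */
        /* full, partial first, then free */
        /* [Note: members touched on every alloc and free come first; these are the three slab lists mentioned earlier.] */
        struct list_head    slabs_full;
        struct list_head    slabs_partial;
        struct list_head    slabs_free;
        unsigned int        objsize;    /* [Note: object size] */
        unsigned int        flags;      /* constant flags */
        /*
         * [Note: flag values that may appear here:
         *   SLAB_POISON: fill unused objects with a known pattern
         *     (POISON_BYTE 0x5a, closed by POISON_END 0xa5) to catch the use
         *     of uninitialised memory.
         *   SLAB_RED_ZONE: place a marker word before and after each object.
         *     RED_MAGIC1 (0x5A2CF071) means the object is active,
         *     RED_MAGIC2 (0x170FC2A5) means it is inactive. The marker turns
         *     active when the object is allocated and inactive when a free
         *     object is initialised or an object is released. A damaged
         *     marker reveals a write beyond the object's bounds.
         *   SLAB_NO_REAP: never shrink this cache automatically, even when
         *     memory is short.
         *   SLAB_HWCACHE_ALIGN: align objects to the hardware cache line.
         *   CFLGS_OFF_SLAB: off-slab mode, used for large objects.
         *  The SLAB_* flags are defined in include/linux/slab.h;
         *  CFLGS_OFF_SLAB is internal to slab.c.]
         */
        unsigned int        num;        /* # of objs per slab */
        spinlock_t          spinlock;
    #ifdef CONFIG_SMP
        unsigned int        batchcount; /* [Note: batch size, only on SMP] */
    #endif

    /* 2) slab additions /removals */
        /* order of pgs per slab (2^n) */
        unsigned int        gfporder;

        /* force GFP flags, e.g.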
           GFP_DMA */
        unsigned int        gfpflags;   /* [Note: flags forced on page requests; the GFP_* values are defined in include/linux/mm.h] */

        size_t              colour;         /* cache colouring range */
        unsigned int        colour_off;     /* colour offset */
        unsigned int        colour_next;    /* cache colouring */
        kmem_cache_t        *slabp_cache;   /* [Note: in off-slab mode, the general cache that holds the slab management structures] */
        unsigned int        growing;        /* [Note: set while the cache is growing, so that it is not shrunk at the same time] */
        unsigned int        dflags;         /* dynamic flags */

        /* constructor func */
        void (*ctor)(void *, kmem_cache_t *, unsigned long);

        /* de-constructor func */
        void (*dtor)(void *, kmem_cache_t *, unsigned long);

        unsigned long       failures;       /* [Note: failure counter] */

    /* 3) cache creation/removal */
        char                name[CACHE_NAMELEN];    /* [Note: the name shown in /proc/slabinfo] */
        struct list_head    next;                   /* [Note: links this cache into the cache chain] */
    #ifdef CONFIG_SMP
    /* 4) per-cpu data */
        cpucache_t          *cpudata[NR_CPUS];      /* [Note: one object array per CPU; NR_CPUS is a macro from include/linux/threads.h] */
    #endif
    #if STATS   /* [Note: only when statistics are collected] */
        unsigned long       num_active;
        unsigned long       num_allocations;
        unsigned long       high_mark;      /* [Note: highest number of active objects seen] */
        unsigned long       grown;
        unsigned long       reaped;
        unsigned long       errors;
    #ifdef CONFIG_SMP
        atomic_t            allochit;       /* [Note: atomic counters for allocation hits and misses] */
        atomic_t            allocmiss;
        atomic_t            freehit;        /* [Note: atomic counters for free hits and misses] */
        atomic_t            freemiss;
    #endif
    #endif
    };

    /* internal c_flags */
    #define CFLGS_OFF_SLAB  0x010000UL  /* slab management in own cache */
    #define CFLGS_OPTIMIZE  0x020000UL  /* optimized slab lookup */

    /* c_dflags (dynamic flags).
       Need to hold the spinlock to access this member */
    #define DFLGS_GROWN     0x000001UL  /* don't reap a recently grown */

    /* [Note: the three macros below test a flag; they do not set it.] */
    #define OFF_SLAB(x)     ((x)->flags & CFLGS_OFF_SLAB)
    #define OPTIMIZE(x)     ((x)->flags & CFLGS_OPTIMIZE)
    #define GROWN(x)        ((x)->dlags & DFLGS_GROWN)

    #if STATS   /* [Note: statistics enabled] */
    #define STATS_INC_ACTIVE(x)     ((x)->num_active++)
    #define STATS_DEC_ACTIVE(x)     ((x)->num_active--)
    #define STATS_INC_ALLOCED(x)    ((x)->num_allocations++)
    #define STATS_INC_GROWN(x)      ((x)->grown++)
    #define STATS_INC_REAPED(x)     ((x)->reaped++)
    #define STATS_SET_HIGH(x)       do { if ((x)->num_active > (x)->high_mark) \
                                            (x)->high_mark = (x)->num_active; \
                                    } while (0)
    #define STATS_INC_ERR(x)        ((x)->errors++)
    #else       /* [Note: statistics disabled, every macro is a no-op] */
    #define STATS_INC_ACTIVE(x)     do { } while (0)
    #define STATS_DEC_ACTIVE(x)     do { } while (0)
    #define STATS_INC_ALLOCED(x)    do { } while (0)
    #define STATS_INC_GROWN(x)      do { } while (0)
    #define STATS_INC_REAPED(x)     do { } while (0)
    #define STATS_SET_HIGH(x)       do { } while (0)
    #define STATS_INC_ERR(x)        do { } while (0)
    #endif

    #if STATS && defined(CONFIG_SMP)    /* [Note: statistics on an SMP kernel, counted atomically] */
    #define STATS_INC_ALLOCHIT(x)   atomic_inc(&(x)->allochit)
    #define STATS_INC_ALLOCMISS(x)  atomic_inc(&(x)->allocmiss)
    #define STATS_INC_FREEHIT(x)    atomic_inc(&(x)->freehit)
    #define STATS_INC_FREEMISS(x)   atomic_inc(&(x)->freemiss)
    #else       /* [Note: otherwise no-ops] */
    #define STATS_INC_ALLOCHIT(x)   do { } while (0)
    #define STATS_INC_ALLOCMISS(x)  do { } while (0)
    #define STATS_INC_FREEHIT(x)    do { } while (0)
    #define STATS_INC_FREEMISS(x)   do { } while (0)
    #endif

    #if DEBUG   /* [Note: debug build only] */
    /* Magic nums for obj red zoning.
     * Placed in the first word before and the first word after an obj.
     */
    #define RED_MAGIC1  0x5A2CF071UL    /* when obj is active */
    #define RED_MAGIC2  0x170FC2A5UL    /* when obj is inactive */

    /* ...and for poisoning */
    #define POISON_BYTE 0x5a    /* byte value for poisoning */  /* [Note: 01011010, the fill byte] */
    #define POISON_END  0xa5    /* end-byte of poisoning */     /* [Note: 10100101, written to the last byte] */
    /*
     * [Note: background on why a pattern such as 0xA5 is used to fill
     *  uninitialised memory. 0xFF or 0x00 could be used as well, but 0xA5
     *  exposes accidental shorts between adjacent data lines. Example with
     *  data lines D0 D1 D2 D3 ... D7 and a short between D1 and D2:
     *    filled with 0x00, D0-D7 reads 00000000
     *    filled with 0xFF, D0-D7 reads 11111111
     *    filled with 0xA5, D0-D7 reads 10000101
     *  Only in the 0xA5 case does the value read back differ from the value
     *  written, so a hardware failure or a stray error is easy to detect.
     *  Reference: Software-Based Memory Testing, 1997, by Michael Barr,
     *  http://www.netrino.com/Articles/MemoryTesting/paper.html]
     */
    #endif

    /* maximum size of an obj (in 2^order pages) */
    #define MAX_OBJ_ORDER   5   /* 32 pages */

    /*
     * Do not go above this order unless 0 objects fit into the slab.
     */
    /* [Note: unless not even one object fits, a slab stops growing at order 2 (4 pages) or order 1 (2 pages).] */
    #define BREAK_GFP_ORDER_HI  2
    #define BREAK_GFP_ORDER_LO  1
    static int slab_break_gfp_order = BREAK_GFP_ORDER_LO;   /* [Note: starts at order 1, i.e. 2 pages] */

    /*
     * Absolute limit for the gfp order
     */
    /* [Note: the hard upper limit is 2^5 = 32 pages.] */
    #define MAX_GFP_ORDER   5   /* 32 pages */

    /* Macros for storing/retrieving the cachep and or slab from the
     * global 'mem_map'. These are used to find the slab an obj belongs to.
     * With kfree(), these are used to find the cache which an obj belongs to.
     */
    #define SET_PAGE_CACHE(pg,x)  ((pg)->list.next = (struct list_head *)(x))
    #define GET_PAGE_CACHE(pg)    ((kmem_cache_t *)(pg)->list.next)
    #define SET_PAGE_SLAB(pg,x)   ((pg)->list.prev = (struct list_head *)(x))
    #define GET_PAGE_SLAB(pg)     ((slab_t *)(pg)->list.prev)
    /* [Note: these macros work by reusing the list pointers in struct page.] */

    /* Size description struct for general caches.
     */
    typedef struct cache_sizes {
        size_t          cs_size;        /* [Note: object size served by this entry] */
        kmem_cache_t    *cs_cachep;     /* [Note: the general cache (a struct kmem_cache_s allocated from cache_cache) for this size] */
        kmem_cache_t    *cs_dmacachep;  /* [Note: the same, for DMA memory] */
    } cache_sizes_t;

    static cache_sizes_t cache_sizes[] = {
    #if PAGE_SIZE == 4096   /* [Note: the 32-byte cache exists only with 4096-byte pages] */
        {    32,    NULL, NULL},
    #endif
        {    64,    NULL, NULL},
        {   128,    NULL, NULL},
        {   256,    NULL, NULL},
        {   512,    NULL, NULL},
        {  1024,    NULL, NULL},
        {  2048,    NULL, NULL},
        {  4096,    NULL, NULL},
        {  8192,    NULL, NULL},
        { 16384,    NULL, NULL},
        { 32768,    NULL, NULL},
        { 65536,    NULL, NULL},
        {131072,    NULL, NULL},
        {     0,    NULL, NULL}
    };
    /* [Note: the NULL entries are placeholders for cs_cachep and cs_dmacachep.] */

    /* internal cache of cache description objs */
    static kmem_cache_t cache_cache = {
        slabs_full:     LIST_HEAD_INIT(cache_cache.slabs_full),
        slabs_partial:  LIST_HEAD_INIT(cache_cache.slabs_partial),
        slabs_free:     LIST_HEAD_INIT(cache_cache.slabs_free),
        objsize:        sizeof(kmem_cache_t),
        flags:          SLAB_NO_REAP,           /* [Note: never reaped automatically] */
        spinlock:       SPIN_LOCK_UNLOCKED,     /* [Note: the spinlock starts out unlocked] */
        colour_off:     L1_CACHE_BYTES,         /* [Note: the colour step is one L1 cache line] */
        name:           "kmem_cache",
    };

    /* Guard access to the cache-chain. */
    static struct semaphore cache_chain_sem;

    /* Place maintainer for reaping. */
    static kmem_cache_t *clock_searchp = &cache_cache;

    #define cache_chain (cache_cache.next)

    #ifdef CONFIG_SMP
    /*
     * chicken and egg problem: delay the per-cpu array allocation
     * until the general caches are up.
     */
    static int g_cpucache_up;   /* [Note: set once the general caches are ready] */

    static void enable_cpucache (kmem_cache_t *cachep);
    static void enable_all_cpucaches (void);
    #endif

    /* Cal the num objs, wastage, and bytes left over for a given slab size.
     */
    static void kmem_cache_estimate (unsigned long gfporder, size_t size,
             int flags, size_t *left_over, unsigned int *num)
    {
        int i;
        size_t wastage = PAGE_SIZE<<gfporder;
        size_t extra = 0;
        size_t base = 0;

        if (!(flags & CFLGS_OFF_SLAB)) {
            base = sizeof(slab_t);
            extra = sizeof(kmem_bufctl_t);
        }
        i = 0;
        while (i*size + L1_CACHE_ALIGN(base+i*extra) <= wastage)
            i++;
        if (i > 0)
            i--;

        if (i > SLAB_LIMIT)
            i = SLAB_LIMIT;

        *num = i;
        wastage -= i*size;
        wastage -= L1_CACHE_ALIGN(base+i*extra);
        *left_over = wastage;   /* [Note: the space that is wasted] */
    }

    /* Initialisation - setup the `cache' cache. */
    void __init kmem_cache_init(void)
    {
        size_t left_over;

        init_MUTEX(&cache_chain_sem);
        INIT_LIST_HEAD(&cache_chain);

        kmem_cache_estimate(0, cache_cache.objsize, 0,
                &left_over, &cache_cache.num);
        if (!cache_cache.num)
            BUG();

        cache_cache.colour = left_over/cache_cache.colour_off;
        cache_cache.colour_next = 0;
    }

    /* Initialisation - setup remaining internal and general caches.
     * Called after the gfp() functions have been enabled, and before smp_init().
     */
    /* [Note: this function fills in the cache_sizes table; gfp stands for "get free page".] */
    void __init kmem_cache_sizes_init(void)
    {
        cache_sizes_t *sizes = cache_sizes;
        char name[20];
        /* [Note: questionable. CACHE_NAMELEN is already defined above, yet it
         *  is not used here, presumably because the code was written by
         *  different people. This may cause trouble when the code is
         *  changed later.]
         */
        /*
         * Fragmentation resistance on low memory - only use bigger
         * page orders on machines with more than 32MB of memory.
         */
        if (num_physpages > (32 << 20) >> PAGE_SHIFT)
            slab_break_gfp_order = BREAK_GFP_ORDER_HI;
        do {
            /* For performance, all the general caches are L1 aligned.
             * This should be particularly beneficial on SMP boxes, as it
             * eliminates "false sharing".
             * Note for systems short on memory removing the alignment will
             * allow tighter packing of the smaller caches.
             */
            /* [Note: it is the L1 alignment that eliminates false sharing on SMP machines.] */
            snprintf(name, sizeof(name), "size-%Zd",sizes->cs_size);    /* [Note: the name shown in /proc/slabinfo] */
            if (!(sizes->cs_cachep =
                kmem_cache_create(name, sizes->cs_size,
                        0, SLAB_HWCACHE_ALIGN, NULL, NULL))) {
                BUG();  /* [Note: creating the cache failed] */
            }

            /* Inc off-slab bufctl limit until the ceiling is hit. */
            if (!(OFF_SLAB(sizes->cs_cachep))) {
                offslab_limit = sizes->cs_size-sizeof(slab_t);
                offslab_limit /= 2;
                /* [Note: this looks wrong; it should read
                 *  offslab_limit /= sizeof(kmem_bufctl_t). The author argues
                 *  that with /= 2 the intended ceiling is never reached. The
                 *  2.6 kernel has this fixed. Reference:
                 *  http://www.cs.helsinki.fi/linux/linux-kernel/2001-17/1193.html]
                 */
            }
            snprintf(name, sizeof(name), "size-%Zd(DMA)",sizes->cs_size);
            sizes->cs_dmacachep = kmem_cache_create(name, sizes->cs_size, 0,
                      SLAB_CACHE_DMA|SLAB_HWCACHE_ALIGN, NULL, NULL);
            if (!sizes->cs_dmacachep)
                BUG();
            sizes++;
        } while (sizes->cs_size);
    }

    int __init kmem_cpucache_init(void)
    {
    #ifdef CONFIG_SMP
        g_cpucache_up = 1;          /* [Note: mark the general caches as ready] */
        enable_all_cpucaches();     /* [Note: enable the per-cpu arrays of all caches] */
        /* [Note: the author doubts this: is it safe without atomic operations,
         *  and why is the flag set before the arrays are actually enabled?]
         */
    #endif
        return 0;
    }

    __initcall(kmem_cpucache_init);

    /* Interface to system's page allocator. No need to hold the cache-lock.
     */
    static inline void * kmem_getpages (kmem_cache_t *cachep, unsigned long flags)
    {
        void *addr;

        /*
         * If we requested dmaable memory, we will get it. Even if we
         * did not request dmaable memory, we might get it, but that
         * would be relatively rare and ignorable.
         */
        flags |= cachep->gfpflags;
        addr = (void*) __get_free_pages(flags, cachep->gfporder);
        /* Assume that now we have the pages no one else can legally
         * messes with the 'struct page's.
         * However vm_scan() might try to test the structure to see if
         * it is a named-page or buffer-page. The members it tests are
         * of no interest here.....
         */
        return addr;
    }

    /* Interface to system's page release.
     */
    static inline void kmem_freepages (kmem_cache_t *cachep, void *addr)
    {
        unsigned long i = (1<<cachep->gfporder);
        struct page *page = virt_to_page(addr);

        /* free_pages() does not clear the type bit - we do that.
         * The pages have been unlinked from their cache-slab,
         * but their 'struct page's might be accessed in
         * vm_scan(). Shouldn't be a worry.
         */
        while (i--) {
            PageClearSlab(page);    /* [Note: clear the slab bit of each page] */
            page++;
        }
        free_pages((unsigned long)addr, cachep->gfporder);
    }

    #if DEBUG   /* [Note: debug build only] */
    static inline void kmem_poison_obj (kmem_cache_t *cachep, void *addr)
    {
        int size = cachep->objsize;
        if (cachep->flags & SLAB_RED_ZONE) {
            addr += BYTES_PER_WORD;
            size -= 2*BYTES_PER_WORD;
        }   /* [Note: skip the red zone words] */
        memset(addr, POISON_BYTE, size);                /* [Note: fill the object with POISON_BYTE] */
        *(unsigned char *)(addr+size-1) = POISON_END;   /* [Note: write the end marker] */
    }

    /* [Note: checks that the poison pattern of an unused object is intact.] */
    static inline int kmem_check_poison_obj (kmem_cache_t *cachep, void *addr)
    {
        int size = cachep->objsize;
        void *end;
        if (cachep->flags & SLAB_RED_ZONE) {
            addr += BYTES_PER_WORD;
            size -= 2*BYTES_PER_WORD;
        }
        end = memchr(addr, POISON_END, size);
        if (end != (addr+size-1))
            return 1;   /* [Note: pattern damaged] */
        return 0;       /* [Note: pattern intact] */
    }
    #endif

    /* Destroy all the objs in a slab, and release the mem back to the system.
     * Before calling the slab must have been unlinked from the cache.
     * The cache-lock is not held/needed.
     */
    static void kmem_slab_destroy (kmem_cache_t *cachep, slab_t *slabp)
    {
        if (cachep->dtor
    #if DEBUG
            || cachep->flags & (SLAB_POISON | SLAB_RED_ZONE)    /* [Note: debug builds also walk the objects to check them] */
    #endif
        ) {
            int i;
            for (i = 0; i < cachep->num; i++) {
                void* objp = slabp->s_mem+cachep->objsize*i;
    #if DEBUG
                if (cachep->flags & SLAB_RED_ZONE) {
                    if (*((unsigned long*)(objp)) != RED_MAGIC1)
                        BUG();
                    if (*((unsigned long*)(objp + cachep->objsize
                            -BYTES_PER_WORD)) != RED_MAGIC1)
                        BUG();  /* [Note: a red zone marker is damaged] */
                    objp += BYTES_PER_WORD;
                }
    #endif
                if (cachep->dtor)
                    (cachep->dtor)(objp, cachep, 0);    /* [Note: run the destructor] */
    #if DEBUG
                if (cachep->flags & SLAB_RED_ZONE) {
                    objp -= BYTES_PER_WORD;     /* [Note: step back by one word] */
                }
                if ((cachep->flags & SLAB_POISON) &&
                    kmem_check_poison_obj(cachep, objp))    /* [Note: the poison pattern is damaged] */
                    BUG();
    #endif
            }
        }

        kmem_freepages(cachep, slabp->s_mem-slabp->colouroff);  /* [Note: give the pages back] */
        if (OFF_SLAB(cachep))   /* [Note: off-slab mode: also free the management structure] */
            kmem_cache_free(cachep->slabp_cache, slabp);
    }

    /**
     * kmem_cache_create - Create a cache.
     * @name: A string which is used in /proc/slabinfo to identify this cache.
     * @size: The size of objects to be created in this cache.
     * @offset: The offset to use within the page.
     * @flags: SLAB flags
     * @ctor: A constructor for the objects.
     * @dtor: A destructor for the objects.
     *
     * Returns a ptr to the cache on success, NULL on failure.
     * Cannot be called within a int, but can be interrupted.
     * The @ctor is run when new pages are allocated by the cache
     * and the @dtor is run before the pages are handed back.
     * The flags are
     *
     * %SLAB_POISON - Poison the slab with a known test pattern (a5a5a5a5)
     * to catch references to uninitialised memory.
     *
     * %SLAB_RED_ZONE - Insert `Red' zones around the allocated memory to check
     * for buffer overruns.
     *
     * %SLAB_NO_REAP - Don't automatically reap this cache when we're under
     * memory pressure.
     *
     * %SLAB_HWCACHE_ALIGN - Align the objects in this cache to a hardware
     * cacheline.  This can be beneficial if you're counting cycles as closely
     * as davem.
     */
    kmem_cache_t *
    kmem_cache_create (const char *name, size_t size, size_t offset,
        unsigned long flags, void (*ctor)(void*, kmem_cache_t *, unsigned long),
        void (*dtor)(void*, kmem_cache_t *, unsigned long))
    {
        const char *func_nm = KERN_ERR "kmem_create: ";
        size_t left_over, align, slab_size;
        kmem_cache_t *cachep = NULL;

        /*
         * Sanity checks... these are all serious usage bugs.
         */
        if ((!name) ||
            ((strlen(name) >= CACHE_NAMELEN - 1)) ||
            in_interrupt() ||
            (size < BYTES_PER_WORD) ||
            (size > (1<<MAX_OBJ_ORDER)*PAGE_SIZE) ||
            (dtor && !ctor) ||
            (offset < 0 || offset > size))
                BUG();

    #if DEBUG
        if ((flags & SLAB_DEBUG_INITIAL) && !ctor) {
            /* No constructor, but inital state check requested */
            printk("%sNo con, but init state check requested - %s\n", func_nm, name);
            flags &= ~SLAB_DEBUG_INITIAL;
        }

        if ((flags & SLAB_POISON) && ctor) {
            /* request for poisoning, but we can't do that with a constructor */
            printk("%sPoisoning requested, but con given - %s\n", func_nm, name);
            flags &= ~SLAB_POISON;
        }
    #if FORCED_DEBUG
        if ((size < (PAGE_SIZE>>3)) && !(flags & SLAB_MUST_HWCACHE_ALIGN))
            /*
             * do not red zone large object, causes severe
             * fragmentation.
             */
            flags |= SLAB_RED_ZONE;
        if (!ctor)
            flags |= SLAB_POISON;
    #endif
    #endif

        /*
         * Always checks flags, a caller might be expecting debug
         * support which isn't available.
         */
        BUG_ON(flags & ~CREATE_MASK);

        /* Get cache's description obj. */
        /* [Note: kmem_cache_alloc takes one object out of cache_cache.] */
        cachep = (kmem_cache_t *) kmem_cache_alloc(&cache_cache, SLAB_KERNEL);
        if (!cachep)
            goto opps;
        memset(cachep, 0, sizeof(kmem_cache_t));    /* [Note: zero the new descriptor] */

        /* Check that size is in terms of words.  This is needed to avoid
         * unaligned accesses for some archs when redzoning is used, and makes
         * sure any on-slab bufctl's are also correctly aligned.
         */
        if (size & (BYTES_PER_WORD-1)) {
            size += (BYTES_PER_WORD-1);
            size &= ~(BYTES_PER_WORD-1);
            printk("%sForcing size word alignment - %s\n", func_nm, name);
        }

    #if DEBUG
        if (flags & SLAB_RED_ZONE) {
            /*
             * There is no point trying to honour cache alignment
             * when redzoning.
             */
            flags &= ~SLAB_HWCACHE_ALIGN;
            size += 2*BYTES_PER_WORD;   /* words for redzone */
        }
    #endif
        align = BYTES_PER_WORD;
        if (flags & SLAB_HWCACHE_ALIGN)     /* [Note: with hardware alignment, align to the L1 cache line, otherwise to the word size] */
            align = L1_CACHE_BYTES;

        /* Determine if the slab management is 'on' or 'off' slab. */
        if (size >= (PAGE_SIZE>>3))
            /*
             * Size is large, assume best to place the slab management obj
             * off-slab (should allow better packing of objs).
             */
            /* [Note: large means 512 bytes or more with 4 KB pages.] */
            flags |= CFLGS_OFF_SLAB;

        if (flags & SLAB_HWCACHE_ALIGN) {
            /* Need to adjust size so that objs are cache aligned. */
            /* Small obj size, can get at least two per cache line. */
            /* FIXME: only power of 2 supported, was better */
            while (size < align/2)
                align /= 2;
            size = (size+align-1)&(~(align-1));
        }

        /* Cal size (in pages) of slabs, and the num of objs per slab.
         * This could be made much more intelligent.  For now, try to avoid
         * using high page-orders for slabs.  When the gfp() funcs are more
         * friendly towards high-order requests, this should be changed.
         */
        do {
            unsigned int break_flag = 0;
    cal_wastage:
            kmem_cache_estimate(cachep->gfporder, size, flags,
                            &left_over, &cachep->num);
            /* [Note: left_over receives the unused space, cachep->num the number of objects that fit into one slab.] */
            if (break_flag)
                break;
            if (cachep->gfporder >= MAX_GFP_ORDER)  /* [Note: already at the maximum, 32*4096 = 128 KB] */
                break;
            if (!cachep->num)   /* [Note: not even one object fits, so try the next order] */
                goto next;
            if (flags & CFLGS_OFF_SLAB && cachep->num > offslab_limit) {
                /* Oops, this num of objs will cause problems. */
                /* [Note: over the off-slab limit: go back one order, recompute, then leave the loop.] */
                cachep->gfporder--;
                break_flag++;
                goto cal_wastage;
            }

            /*
             * Large num of objs is good, but v.
               large slabs are currently
             * bad for the gfp()s.
             */
            if (cachep->gfporder >= slab_break_gfp_order)
                break;

            /* [Note: guard against waste. If the slab is only slightly larger
             *  than the objects, almost a whole object's worth of space may
             *  be lost, so the slab is enlarged to hold more objects. Once
             *  the waste is at most 1/8 of the slab, it stops growing.]
             */
            if ((left_over*8) <= (PAGE_SIZE<<cachep->gfporder))
                break;  /* Acceptable internal fragmentation. */
    next:
            cachep->gfporder++;
        } while (1);

        if (!cachep->num) {     /* [Note: no order was found that holds an object: report, free the descriptor, return] */
            printk("kmem_cache_create: couldn't create cache %s.\n", name);
            kmem_cache_free(&cache_cache, cachep);
            cachep = NULL;
            goto opps;
        }
        slab_size = L1_CACHE_ALIGN(cachep->num*sizeof(kmem_bufctl_t)+sizeof(slab_t));
        /* [Note: slab_size is the total size of the slab management data, L1 aligned.] */

        /*
         * If the slab has been placed off-slab, and we have enough space then
         * move it on-slab. This is at the expense of any extra colouring.
         */
        if (flags & CFLGS_OFF_SLAB && left_over >= slab_size) {
            flags &= ~CFLGS_OFF_SLAB;
            left_over -= slab_size;
        }

        /* Offset must be a multiple of the alignment. */
        offset += (align-1);
        offset &= ~(align-1);
        if (!offset)
            offset = L1_CACHE_BYTES;    /* [Note: no offset given, so use the L1 cache line size] */
        cachep->colour_off = offset;
        cachep->colour = left_over/offset;  /* [Note: the number of colours available] */

        /* init remaining fields */
        if (!cachep->gfporder && !(flags & CFLGS_OFF_SLAB))
            flags |= CFLGS_OPTIMIZE;

        cachep->flags = flags;
        cachep->gfpflags = 0;
        if (flags & SLAB_CACHE_DMA)
            cachep->gfpflags |= GFP_DMA;
        spin_lock_init(&cachep->spinlock);
        cachep->objsize = size;
        INIT_LIST_HEAD(&cachep->slabs_full);
        INIT_LIST_HEAD(&cachep->slabs_partial);
        INIT_LIST_HEAD(&cachep->slabs_free);

        if (flags & CFLGS_OFF_SLAB)
            cachep->slabp_cache = kmem_find_general_cachep(slab_size,0);    /* [Note: the general cache whose size matches slab_size] */
        cachep->ctor = ctor;
        cachep->dtor = dtor;
        /* Copy name over so we don't have problems with unloaded modules */
        strcpy(cachep->name, name);

    #ifdef CONFIG_SMP
        if (g_cpucache_up)  /* [Note: on SMP, enable the per-cpu array once the general caches are up] */
            enable_cpucache(cachep);
    #endif
        /* Need the semaphore to access the chain.
         */
        down(&cache_chain_sem);     /* [Note: acquire the semaphore before touching the cache chain] */
        {
            struct list_head *p;

            list_for_each(p, &cache_chain) {
                kmem_cache_t *pc = list_entry(p, kmem_cache_t, next);

                /* The name field is constant - no lock needed. */
                if (!strcmp(pc->name, name))    /* [Note: a cache with this name already exists] */
                    BUG();
            }
        }

        /* There is no reason to lock our new cache before we
         * link it in - no one knows about it yet...
         */
        list_add(&cachep->next, &cache_chain);
        up(&cache_chain_sem);       /* [Note: release the semaphore] */
    opps:
        return cachep;              /* [Note: pointer to the new cache, or NULL] */
    }

    #if DEBUG
    /*
     * This check if the kmem_cache_t pointer is chained in the cache_cache
     * list. -arca
     */
    static int is_chained_kmem_cache(kmem_cache_t * cachep)
    {
        struct list_head *p;
        int ret = 0;

        /* Find the cache in the chain of caches. */
        down(&cache_chain_sem);
        list_for_each(p, &cache_chain) {
            if (p == &cachep->next) {
                ret = 1;
                break;
            }
        }
        up(&cache_chain_sem);

        return ret;
    }
    #else   /* [Note: without debugging the check always succeeds] */
    #define is_chained_kmem_cache(x) 1
    #endif

    #ifdef CONFIG_SMP
    /*
     * Waits for all CPUs to execute func().
     */
    static void smp_call_function_all_cpus(void (*func) (void *arg), void *arg)
    {
        local_irq_disable();    /* [Note: interrupts off] */
        func(arg);              /* [Note: run the function on this CPU] */
        local_irq_enable();     /* [Note: interrupts on] */

        if (smp_call_function(func, arg, 1, 1))     /* [Note: run it on the other CPUs; failure is a bug] */
            BUG();
    }
    typedef struct ccupdate_struct_s
    {
        kmem_cache_t *cachep;
        cpucache_t *new[NR_CPUS];
    } ccupdate_struct_t;

    /* [Note: swaps this CPU's cpucache pointer with the one supplied in info.] */
    static void do_ccupdate_local(void *info)
    {
        ccupdate_struct_t *new = (ccupdate_struct_t *)info;
        cpucache_t *old = cc_data(new->cachep);

        cc_data(new->cachep) = new->new[smp_processor_id()];
        new->new[smp_processor_id()] = old;
    }

    static void free_block (kmem_cache_t* cachep, void** objpp, int len);

    /* [Note: empties the per-cpu arrays of a cache.] */
    static void drain_cpu_caches(kmem_cache_t *cachep)
    {
        ccupdate_struct_t new;
        int i;

        memset(&new.new,0,sizeof(new.new));

        new.cachep = cachep;

        down(&cache_chain_sem);
        smp_call_function_all_cpus(do_ccupdate_local, (void *)&new);

        for (i = 0; i < smp_num_cpus; i++) {
            cpucache_t* ccold = new.new[cpu_logical_map(i)];
            if (!ccold || (ccold->avail == 0))
                continue;
            local_irq_disable();
            free_block(cachep, cc_entry(ccold), ccold->avail);
            local_irq_enable();
            ccold->avail = 0;
        }
        smp_call_function_all_cpus(do_ccupdate_local, (void *)&new);
        up(&cache_chain_sem);
    }

    #else   /* [Note: on uniprocessor kernels draining is a no-op] */
    #define drain_cpu_caches(cachep)    do { } while (0)
    #endif
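The double call to `do_ccupdate_local` in `drain_cpu_caches` is easier to follow in a single-threaded model. The sketch below is only an illustration under simplifying assumptions: there are no real CPUs, interrupts or locks, the "CPU" is just an index, the `sk_` names are hypothetical, and instead of calling `free_block` it merely counts the entries that would be returned to the global cache.

```c
#include <assert.h>
#include <stddef.h>

#define SK_NR_CPUS 2    /* assumed number of CPUs in this model */

typedef struct {
    unsigned int avail; /* objects currently held in the per-cpu array */
    unsigned int limit;
} sk_cpucache;

typedef struct {
    sk_cpucache *cpudata[SK_NR_CPUS];
} sk_cache;

typedef struct {
    sk_cache    *cachep;
    sk_cpucache *new_cc[SK_NR_CPUS];
} sk_ccupdate;

/* Swap the installed per-cpu array of `cpu` with the one held in `u`. */
static void sk_ccupdate_local(sk_ccupdate *u, int cpu)
{
    sk_cpucache *old = u->cachep->cpudata[cpu];

    u->cachep->cpudata[cpu] = u->new_cc[cpu];
    u->new_cc[cpu] = old;
}

/* Drain all per-cpu arrays; returns how many entries were given back. */
static unsigned int sk_drain(sk_cache *c)
{
    sk_ccupdate u = { c, { NULL } };
    unsigned int drained = 0;
    int cpu;

    /* First swap: install NULL everywhere and take the old arrays away. */
    for (cpu = 0; cpu < SK_NR_CPUS; cpu++)
        sk_ccupdate_local(&u, cpu);

    /* The old arrays are now private to us and can be emptied safely. */
    for (cpu = 0; cpu < SK_NR_CPUS; cpu++) {
        sk_cpucache *old = u.new_cc[cpu];
        if (!old || old->avail == 0)
            continue;
        drained += old->avail;  /* stands in for free_block() */
        old->avail = 0;
    }

    /* Second swap: put the (now empty) arrays back in place. */
    for (cpu = 0; cpu < SK_NR_CPUS; cpu++)
        sk_ccupdate_local(&u, cpu);

    return drained;
}
```

The point of swapping twice is that between the two swaps no CPU can reach its array through the cache, so the drain loop needs no per-array locking; afterwards every CPU has its original array back, emptied.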