
023 mm/swap_state.c

Date: 2009-03-27  Source: hylpro

2006-8-10 
mm/swap_state.c

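When a page needs to interact with an external storage device, an address_space has to be
set up; for swap this is swapper_space. An address_space_operations must also be provided;
for swap this is swap_aops. The page cache, swap cache, shmem and filemap all follow this
pattern.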

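Setting up these two structures only solves the problem of writing pages out. Reading pages
in relies on direct function calls from handle_pte_fault: for swap this is do_swap_page,
while file map/mmap goes through vma->vm_ops->nopage. There is no unified solution.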
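
The asymmetry described above can be sketched in userspace C. This is only an illustrative model: every model_* name is a hypothetical stand-in invented for this sketch, not a kernel definition. Write-out goes through one operations table hung off the mapping, while read-in is chosen by a direct branch in the fault path.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are kernel definitions. */
struct model_page;

struct model_aops {
    int (*writepage)(struct model_page *page);
};

struct model_address_space {
    const struct model_aops *a_ops;
};

enum model_backend { MODEL_NONE, MODEL_SWAP, MODEL_FILE };

struct model_page {
    struct model_address_space *mapping;
    enum model_backend written_by;  /* which backend wrote the page out */
    enum model_backend read_by;     /* which backend read the page in */
};

static int model_swap_writepage(struct model_page *page)
{
    page->written_by = MODEL_SWAP;
    return 0;
}

static int model_file_writepage(struct model_page *page)
{
    page->written_by = MODEL_FILE;
    return 0;
}

static const struct model_aops model_swap_aops = { model_swap_writepage };
static const struct model_aops model_file_aops = { model_file_writepage };
static struct model_address_space model_swapper_space = { &model_swap_aops };
static struct model_address_space model_file_space = { &model_file_aops };

/* Write-out: one uniform path through the operations table. */
static int model_write_out(struct model_page *page)
{
    assert(page->mapping != NULL);
    return page->mapping->a_ops->writepage(page);
}

/* Read-in: the fault handler picks a function directly, case by case. */
static void model_fault_in(struct model_page *page, int pte_is_swap_entry)
{
    if (pte_is_swap_entry)
        page->read_by = MODEL_SWAP;   /* stands in for do_swap_page */
    else
        page->read_by = MODEL_FILE;   /* stands in for vma->vm_ops->nopage */
}
```

In this model the caller of model_write_out never needs to know which backend a page belongs to, whereas model_fault_in has to decide for itself. That is the lack of a unified read path the text points out.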

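I do not intend to analyze these in much depth. The focus here is on physical memory pages,
page->count, and the reference count of swp entries. (Is a function-by-function listing
really needed here?)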

page: where does it go?

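Look at page_alloc.c and the buddy system: all physical pages are managed by buddy (except
reserved ones, which are device memory or have special uses). To see where pages go, it is
enough to follow the callers of the allocation functions.
Allocation interfaces provided by page_alloc.c (only these are used in 2.4):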

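1. (alloc_pages, called by) --> page_cache_alloc
Pages that leave through this interface are all in the page cache (swap cache) and are
used to cache disk (or disk-like) files. The concrete users are: swap cache, page cache,
file read (page cache), filemap (page cache, or a copy from the page cache), COW (may
not be in the page cache), shmem_no_page (page cache).

2. __get_free_pages:
Widely used by drivers, kernel hash tables, the task struct, networking (hashes etc.),
buffers (filesystem metadata, block device file reads and writes; there are also
on-the-fly buffers created for pages that need I/O, and such pages do not come from
__get_free_pages), and slab.

3. __get_free_page:
Page tables (pdir, pmd), drivers, user argument pages.

4. get_zeroed_page:
Drivers (tty), shmem (for the direct/indirect mapping tables kept in the kernel, which
never deal with backing store).

5. alloc_page:
buffers, string argument pages, anonymous pages (page fault), vmalloc (kernel pages,
never swapped).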

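Now we can answer the question: where does all the physical memory go? (fix me, i think everything is here)
1) Used by the kernel 'itself'
This includes drivers, networking, page tables (kernel or process), the various hash
tables, arguments copied from user space, slab caches for kernel data structures
(inode, dentry, ...), shmem mapping tables, and kernel pages used by vmalloc.

2) page cache/swap cache
A page can be in only one of these two caches. They cache the contents (not the
metadata) of files on disk, including regular files, filemap (shared), and shmem.

3) buffers
Used to cache filesystem metadata and for device file reads and writes. This excludes
the on-the-fly buffers created to perform page I/O, although those buffers also do
page->count++.

4) User process pages
This is a mixture. Pages used by a process can also sit in the page cache/swap cache,
and can also have buffers. Besides these owned pages, a process also uses anonymous
pages, i.e. pages with no mapping. These include process pages that have not yet
entered swap, filemap (not shared), and COW pages.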

page->count

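First, a comment quoted from mm.h that is worth reading. Note that the comment is very old:
inode->i_pages no longer exists in 2.4. The passage "For pages belonging to inodes, the
page->count is the number of attaches, plus 1 if buffers are allocated to the page." is no
longer correct. (Like our own documents, it has not been updated for a long time; in 2.6 it
is all right.)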

/*
* Various page->flags bits:
*
* PG_reserved is set for a page which must never be accessed (which
* may not even be present).
*
* PG_DMA has been removed, page->zone now tells exactly wether the
* page is suited to do DMAing into.
*
* Multiple processes may "see" the same page. E.g. for untouched
* mappings of /dev/null, all processes see the same page full of
* zeroes, and text pages of executables and shared libraries have
* only one copy in memory, at most, normally.
*
* For the non-reserved pages, page->count denotes a reference count.
* page->count == 0 means the page is free.
* page->count == 1 means the page is used for exactly one purpose
* (e.g. a private data page of one process).
*
* A page may be used for kmalloc() or anyone else who does a
* __get_free_page(). In this case the page->count is at least 1, and
* all other fields are unused but should be 0 or NULL. The
* management of this page is the responsibility of the one who uses
* it.
*
* The other pages (we may call them "process pages") are completely
* managed by the Linux memory manager: I/O, buffers, swapping etc.
* The following discussion applies only to them.
*
* A page may belong to an inode's memory mapping. In this case,
* page->inode is the pointer to the inode, and page->offset is the
* file offset of the page (not necessarily a multiple of PAGE_SIZE).
*
* A page may have buffers allocated to it. In this case,
* page->buffers is a circular list of these buffer heads. Else,
* page->buffers == NULL.
*
* For pages belonging to inodes, the page->count is the number of
* attaches, plus 1 if buffers are allocated to the page.
*
* All pages belonging to an inode make up a doubly linked list
* inode->i_pages, using the fields page->next and page->prev. (These
* fields are also used for freelist management when page->count==0.)
* There is also a hash table mapping (inode,offset) to the page
* in memory if present. The lists for this hash table use the fields
* page->next_hash and page->pprev_hash.
*
* All process pages can do I/O:
* - inode pages may need to be read from disk,
* - inode pages which have been modified and are MAP_SHARED may need
* to be written to disk,
* - private pages which have been modified may need to be swapped out
* to swap space and (later) to be read back into memory.
* During disk I/O, PG_locked is used. This bit is set before I/O
* and reset when I/O completes. page->wait is a wait queue of all
* tasks waiting for the I/O on this page to complete.
* PG_uptodate tells whether the page's contents is valid.
* When a read completes, the page becomes uptodate, unless a disk I/O
* error happened.
*
* For choosing which pages to swap out, inode pages carry a
* PG_referenced bit, which is set any time the system accesses
* that page through the (inode,offset) hash table.
*
* PG_skip is used on sparc/sparc64 architectures to "skip" certain
* parts of the address space.
*
* PG_error is set to indicate that an I/O error occurred on this page.
*
* PG_arch_1 is an architecture specific page state bit. The generic
* code guarentees that this bit is cleared for a page when it first
* is entered into the page cache.
*/

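Based on the analysis above of where physical pages go, page->count can be described
roughly as follows:
1) Pages of the first kind, used by the kernel itself, generally have a reference count of 1. (fixme)

2) For pages in the page/swap cache, the cache adds 1 and buffers add 1.

3) User processes: each process adds 1.

4) Many places take a temporary reference to keep a page from being freed and drop it
shortly afterwards. These are ignored here.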
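
As a rough illustration of rules 2) and 3), the steady-state count of a page can be modeled as a sum of its reference sources. The struct and function below are hypothetical names made up for this sketch (they are not kernel code), and temporary references from rule 4) are deliberately left out.

```c
#include <assert.h>

/* Hypothetical bookkeeping struct, not the kernel's struct page. */
struct model_page_refs {
    int in_cache;      /* 1 if the page is in the page cache or the swap cache */
    int has_buffers;   /* 1 if buffers are attached to the page */
    int mappers;       /* number of processes mapping the page */
};

/* Steady-state count per rules 2) and 3); temporary references are ignored. */
static int model_expected_count(const struct model_page_refs *p)
{
    assert(p->mappers >= 0);
    return (p->in_cache ? 1 : 0) + (p->has_buffers ? 1 : 0) + p->mappers;
}
```

Under this model an anonymous page mapped by one process has a count of 1, while a swap cache page with buffers that is mapped by two processes has a count of 4.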

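page->count: a worked example

I picked the function is_page_shared for a detailed analysis. Back in October 2002 linuxforum
was very lively, and the discussion of this function was drowned in the flood. Still, the
curiosity and argument about page->count never stopped. Perhaps active discussion of this
question has long since disappeared from forums abroad, but ours will continue.
Please read the comments carefully.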
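/*
 * Work out if there are any other processes sharing this page, ignoring
 * any page reference coming from the swap cache, or from outstanding
 * swap IO on this page. (The page cache _does_ count as another valid
 * reference to the page, however.)
 */
/*
 * I) Where the page's reference count comes from in this situation:
 *    1. processes, one per process  2. swap cache or page cache, one  3. buffers, one
 *
 * II) The swap entry associated with the page:
 *    The page has been added to the swap cache. When the reference count of the
 *    page's swap entry is not 1 (e.g. tmpfs), some other place still wants to find
 *    this page through the swap entry (tmpfs). So the page effectively has one
 *    more, anonymous, way of being referenced.
 *
 * III) The page cache counts as "another process" (see the English comment above).
 */
static inline int is_page_shared(struct page *page)
{
    unsigned int count;
    if (PageReserved(page))
        return 1;
    count = page_count(page); /* the page's own reference count */

    /*
     * II) The page is in the swap cache (not in the page cache):
     * references from all processes = page count + swap entry count
     *                                 - (references held by swap itself)
     * The references held by swap itself are:
     * 1 for the swap cache's reference to the page, 1 for this page's
     * reference to the swap entry, and, if there are buffers, 1 more
     * counted as swap's reference (it is not a process in any case).
     */
    if (PageSwapCache(page))
        count += swap_count(page) - 2 - !!page->buffers;

    /*
     * III) The page is in the page cache, or in no cache at all.
     * In this case, if it has buffers it must belong to the page cache
     * (filemap); otherwise why would a process page need to be written
     * to disk?
     * references from processes + page cache (with bound buffers) = page count
     */

    /*
     * If the page is in the swap cache, one of the remaining references
     * belongs to the current process, so only count > 1 means that other
     * processes are using the page.
     */
    return count > 1;
}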
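
To make the arithmetic concrete, here is a userspace model of the same calculation with plain integer fields in place of the kernel's page state. The struct and its field names are hypothetical and exist only for this sketch; the expression itself mirrors the one in is_page_shared.

```c
#include <assert.h>

/* Hypothetical plain-data stand-in for the page state that is_page_shared reads. */
struct model_shared_page {
    int reserved;       /* models PageReserved(page) */
    int count;          /* models page_count(page) */
    int in_swap_cache;  /* models PageSwapCache(page) */
    int swap_count;     /* models swap_count(page), the swap entry's count */
    int has_buffers;    /* models page->buffers != NULL */
};

/* Same arithmetic as is_page_shared, applied to the model struct. */
static int model_is_page_shared(const struct model_shared_page *p)
{
    int count;

    assert(p->count >= 0 && p->swap_count >= 0);
    if (p->reserved)
        return 1;
    count = p->count;
    if (p->in_swap_cache)
        count += p->swap_count - 2 - !!p->has_buffers;
    return count > 1;
}
```

Some cases to trace by hand: a swap cache page mapped by one process, whose swap entry is held only by the swap cache, has count 2 and swap_count 1, giving 2 + 1 - 2 = 1, so it is not shared. If tmpfs also holds the entry (swap_count 2) the result is 2, so the page is shared. Attaching buffers raises the page count to 3 but also subtracts 1, so the result for the first case stays at 1.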

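With the meaning analyzed above, now look at the preconditions and the concrete call sites.
The function assumes that some process is already using the page (one reference); that is
the precondition. There are three call sites:
1. do_wp_page: the goal is pte_mkwrite. The reference count is known to be 2; if only the
swap cache references the page (there will be no buffers), the operation is safe, so the
function applies.
2. do_swap_page: the page is certainly in the swap cache, and even if it has buffers the
read has already completed, so the buffers reference can be subtracted.
3. memory.c: free_pte -> free_page_and_swap_cache (in every case a process wants to release
its own pte). The page is known to be in the swap cache, and the buffers will be released
afterwards as well (with the page locked). So this use is probably what the function was
originally written for.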

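Then there is the function deactivate_page_nolock; see try_to_swap_out -> deactivate_page ->
deactivate_page_nolock.
try_to_swap_out: while examining the current process it decides to deactivate the page, but
if the page is referenced from anywhere besides the swap cache, the current process and
possibly buffers, it is not deactivated yet. The real deactivation waits until the other
process also decides to deactivate it.
In addition, page_ramdisk pages should not be deactivated, so that ramdisk pages stay
resident in memory.
refill_inactive_scan is a special case; please refer to the relevant code.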

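After deactivation the page moves to the LRU inactive_dirty_list. What happens to pages on
this list?
That is the job of page_launder, which cleans dirty pages (if it is dirty, wash it! ^_^).
Cleaning requires locking the page, and if another process, or something like tmpfs or
ramdisk, were quietly using the page, the consequences would be dire. So pages that have
references other than the swap cache/buffers must not be cleaned. (The caller's extra
reference, or the current process's reference, is dropped with page->count-- after this
function returns; see try_to_swap_out.)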

/**
 * (de)activate_page - move pages from/to active and inactive lists
 * @page: the page we want to move
 * @nolock - are we already holding the pagemap_lru_lock?
 *
 * Deactivate_page will move an active page to the right
 * inactive list, while activate_page will move a page back
 * from one of the inactive lists to the active list. If
 * called on a page which is not on any of the lists, the
 * page is left alone.
 */
void deactivate_page_nolock(struct page * page)
{
    /*
     * One for the cache, one for the extra reference the
     * caller has and (maybe) one for the buffers.
     *
     * This isn't perfect, but works for just about everything.
     * Besides, as long as we don't move unfreeable pages to the
     * inactive_clean list it doesn't need to be perfect...
     */
    /*
     * Extra reference: the current process or the caller. Keep the
     * three sources of the ref count in mind and they can be applied
     * flexibly.
     */
    int maxcount = (page->buffers ? 3 : 2);
    page->age = 0;
    ClearPageReferenced(page);

    /*
     * Don't touch it if it's not on the active list.
     * (some pages aren't on any list at all)
     */
    if (PageActive(page) && page_count(page) <= maxcount && !page_ramdisk(page)) {
        del_page_from_active_list(page);
        add_page_to_inactive_dirty_list(page);
    }
}
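
The maxcount test can be isolated as a small predicate. The function below is a hypothetical userspace sketch (its name and integer parameters are not from the kernel); it only reproduces the condition under which deactivate_page_nolock actually moves the page.

```c
#include <assert.h>

/*
 * Model of the move condition in deactivate_page_nolock.
 * Allowed references: cache (1) + caller or current process (1)
 * + buffers (1, only if present).
 */
static int model_may_deactivate(int on_active_list, int count,
                                int has_buffers, int is_ramdisk)
{
    int maxcount = has_buffers ? 3 : 2;

    assert(count >= 0);
    return on_active_list && count <= maxcount && !is_ramdisk;
}
```

For example, an active page with count 3 and no buffers is left alone, because the third reference must come from somewhere other than the cache and the caller. The same count with buffers attached is within the limit, so the page is moved.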

That is the way to reason about page count.

Aside: the swap entry reference count
This only analyzes how shmem_writepage handles the reference count of the swap entry.
/*
 * Move the page from the page cache to the swap cache
 * (no real write is done here; that is left to the swap cache)
 */
/*
 * page_launder: page->mapping->a_ops->writepage
 * filemap_fdatasync -> page->mapping->a_ops->writepage
 */
static int shmem_writepage(struct page * page)
{
    int error;
    struct shmem_inode_info *info;
    swp_entry_t *entry, swap;

    info = &page->mapping->host->u.shmem_i;
    if (info->locked)
        return 1;
    /* allocate a swap page: tmpfs (mapping table) + swap cache (page->index), so refs is 2 */
    swap = __get_swap_page(2);
    if (!swap.val)
        return 1;

    spin_lock(&info->lock);
    /* look up the slot in tmpfs's table that records the swap entry */
    entry = shmem_swp_entry (info, page->index);
    if (!entry) /* this had been allocated on page allocation */
        BUG();
    error = -EAGAIN;
    if (entry->val) { /* a swap entry already corresponds to this page */
        __swap_free(swap, 2);
        goto out;
    }

    /* tmpfs refs the swap entry; for the release see shmem_unuse-..>shmem_clear_swp */
    *entry = swap;
    error = 0;
    /* Remove the from the page cache */
    lru_cache_del(page);
    remove_inode_page(page);

    /* Add it to the swap cache */
    /* swap cache refs the swap entry; for the release see try_to_unuse or __delete_from_swap_cache */
    add_to_swap_cache(page, swap);
    page_cache_release(page);
    set_page_dirty(page);
    info->swapped++;
out:
    spin_unlock(&info->lock);
    UnlockPage(page);
    return error;
}
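
The reference flow in shmem_writepage can be traced with a toy swap map. Everything here is an assumption made for illustration: the fixed-size array, the model_* function names, and the return codes are hypothetical, and locking and page handling are omitted entirely. The point is only that the entry is allocated with two references, one for each place that will point at it, and that both are returned on the early-exit path.

```c
#include <assert.h>

#define MODEL_SWAP_SLOTS 8

/* Per-entry reference counts; entry 0 is reserved to mean "no entry". */
static unsigned short model_swap_map[MODEL_SWAP_SLOTS];

/* Stand-in for __get_swap_page(count): claim a free entry with `count` references. */
static unsigned int model_get_swap_page(unsigned short count)
{
    unsigned int i;

    for (i = 1; i < MODEL_SWAP_SLOTS; i++) {
        if (model_swap_map[i] == 0) {
            model_swap_map[i] = count;
            return i;
        }
    }
    return 0;
}

/* Stand-in for __swap_free(entry, count): drop `count` references. */
static void model_swap_free(unsigned int entry, unsigned short count)
{
    assert(entry > 0 && entry < MODEL_SWAP_SLOTS);
    assert(model_swap_map[entry] >= count);
    model_swap_map[entry] -= count;
}

/*
 * Reference flow of shmem_writepage only (no locking, no page handling).
 * Returns 0 on success, -1 when the tmpfs slot already holds an entry
 * (the -EAGAIN path), and 1 when no swap entry is available.
 */
static int model_shmem_writepage(unsigned int *tmpfs_slot,
                                 unsigned int *swap_cache_slot)
{
    unsigned int swap = model_get_swap_page(2);

    if (!swap)
        return 1;
    if (*tmpfs_slot) {
        model_swap_free(swap, 2);   /* give back both references */
        return -1;
    }
    *tmpfs_slot = swap;             /* reference 1: tmpfs mapping table */
    *swap_cache_slot = swap;        /* reference 2: swap cache */
    return 0;
}
```

After a successful call the entry's count is 2. Dropping the tmpfs reference and the swap cache reference, one each, brings it back to 0, which matches the two release paths named in the comments above.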

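In short, the purpose of a ref count is this: whenever the object can be reached from some
place, that place should hold a corresponding reference.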