
026_pre_fs.c

Date: 2009-03-27  Source: hylpro

2006-12-21 
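 Before reading the fs code: an ordinary file system contains a great deal of code that interacts with the disk. To understand those operations better, I first study the relevant parts of the IDE driver and clear up a few points that have always been a little fuzzy.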

Large-capacity disk issues


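The original problem is that the BIOS designers and the ATA interface designers never agreed with each other: the BIOS and ATA CHS schemes differ both in the total number of address bits and in how many bits are given to the cylinder, head, and sector fields. Worse, apparently nobody foresaw how quickly disk capacity would grow.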

 First, the capacity limits of the various interfaces:

Interface         Cylinder  Head  Sector  Total bits  Limit     Time the limit was reached
BIOS int 13       10        8     6       24          8.4 GB
ATA               16        4     8       28          137.4 GB  Sept 2001 (160 GB Maxtor DiamondMax)
Extended int 13   8 bytes * 8 = 64-bit LBA number     9.4 trillion gigabytes
ATA-6             6 bytes * 8 = 48-bit LBA number     less than extended int 13, but large enough

Extended int 13 is supported by essentially every BIOS made after 1997.

 As the table shows, the BIOS was designed to address at most about 8 GB, and ATA-1 through ATA-5 can address only 137 GB. These are design flaws, and they are where the two limits come from.

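Before the 8 GB limit was reached, the limit worth noting was the 528 MB limit. It comes from combining BIOS int 13 with ATA (up to ATA-5): an old BIOS (BIOS mode "Normal") passes the CHS values supplied by the caller straight to the ATA interface, so each field is limited to the narrower of the two widths:

       Cylinder  Head  Sector  Total bits  Limit                                      Became a problem
bits   10        4     6       20          1024 * 16 * 63 * 512 bytes = 528 MB        around 1993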
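The limits quoted above are plain arithmetic on field widths, and a few lines of C make that explicit. This is only a sketch: the helper names chs_capacity and lba_capacity are made up for illustration, and a 512-byte sector is assumed throughout.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_BYTES 512ULL   /* assumed sector size */

/* Bytes addressable by a CHS scheme with the given number of
 * cylinders, heads and sectors per track. */
static uint64_t chs_capacity(uint64_t cylinders, uint64_t heads, uint64_t sectors)
{
    return cylinders * heads * sectors * SECTOR_BYTES;
}

/* Bytes addressable by an LBA number of the given width.
 * The width is capped so that the result fits in 64 bits. */
static uint64_t lba_capacity(unsigned bits)
{
    assert(bits < 55);
    return (1ULL << bits) * SECTOR_BYTES;
}
```

With these helpers, chs_capacity(1024, 16, 63) gives 528,482,304 bytes (the 528 MB limit), chs_capacity(1024, 256, 63) gives 8,455,716,864 bytes (the int 13 limit), and lba_capacity(28) gives 137,438,953,472 bytes (the 137.4 GB ATA limit).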
 To solve this problem the BIOS introduced Extended CHS (ECHS), which some BIOSes call "large mode".
Take a 2.95 GiB disk as an example, for which the hardware reports a CHS geometry of 6136/16/63:

                             Cylinders    Heads          Sectors  Capacity
IDE/ATA limits               65,536       16             256      128 GiB
Hard disk logical geometry   6,136        16             63       2.95 GiB
BIOS translation factor      divide by 8  multiply by 8
BIOS translated geometry     767          128            63       2.95 GiB
BIOS int 13h limits          1,024        256            63       7.88 GiB
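The translation in the table can be sketched in C. This is an illustrative guess at one way to pick the factor (keep doubling until the cylinder count fits in 1024), not the algorithm of any particular BIOS; struct geometry and echs_translate are hypothetical names.

```c
#include <assert.h>

struct geometry {
    unsigned cylinders;
    unsigned heads;
    unsigned sectors;
};

/* Double the translation factor until the cylinder count fits the
 * 1024-cylinder int 13h limit, without letting heads exceed 256.
 * Returns the factor used, or 0 if no factor works. */
static unsigned echs_translate(struct geometry phys, struct geometry *out)
{
    unsigned factor = 1;

    assert(phys.heads > 0 && phys.sectors > 0);
    while (phys.cylinders / factor > 1024 &&
           phys.heads * factor * 2 <= 256)
        factor *= 2;
    if (phys.cylinders / factor > 1024)
        return 0;                       /* too large even for ECHS */

    out->cylinders = phys.cylinders / factor;  /* divide by factor */
    out->heads     = phys.heads * factor;      /* multiply by factor */
    out->sectors   = phys.sectors;             /* unchanged */
    return factor;
}
```

For the 6136/16/63 disk in the table this picks a factor of 8 and yields 767/128/63.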
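 The only way past the 8 GB limit is extended int 13 together with LBA mode (ATA). If a program still uses the old int 13 interface while the disk is in LBA mode, the BIOS converts the CHS address directly into an LBA address; this is called assisted LBA. Either way, as long as the old int 13 BIOS interface is used, whether with ECHS translation or with assisted LBA, the 8.5 GB limit cannot be broken. One more point is worth mentioning: the maximum number of heads is 255 rather than 256, because DOS and Windows 95 cannot handle 256 heads, so the real limit is slightly below 8.5 GB.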
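 ATA specifies that a disk larger than 8.4 GB should report a CHS of 16383/16/63. This means the `geometry' is obsolete: the total size of the disk can no longer be computed from the geometry and can only be obtained from the LBA capacity field returned by the IDENTIFY command. A disk larger than 137.4 GB should report an LBA capacity of 0xfffffff = 268435455 sectors (137 GB); the correct disk size is in the new 48-bit field.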
 Linux support for large disks is as follows:
>8.4 GB kernel should be 2.0.34 or later.
>33.8 GB kernel should be 2.0.39/2.2.14/2.3.21 or later.
> 137 GB kernel should be 2.4.19/2.5.3 or later.
 To check whether a given Linux version supports large disks, look at the function do_rw_disk (ide-disk.c).
References:
1. Large Disk Drives >8.4Gb (in addition, an IBM doc is attached)
http://www-oss.fnal.gov/projects/fermilinux/common/faq/old/0009.html
2. The PC Guide, hard disk reference
http://www.pcguide.com/ref/hdd/index.htm
 
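Block size issues
 When analyzing mm I discussed several aspects of do_generic_file_read; the key is to understand the basic idea. First, files on disk are cached in memory as much as possible so that reads and writes are faster. The basic unit of that cache is the memory page, whose usual size on i386 is 4k. When a file is read through the cache, the byte-based offset and size given by the user are first converted into 4k pages, so that data can be copied to the user directly. If the file data is not in the cache it has to be read from disk, for example by using block_read_full_page to read one page worth of data.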
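The conversion from a byte range to page indices can be sketched as follows. This is a user-space illustration, not the kernel's do_generic_file_read; struct page_span and pages_for_range are hypothetical names, and a 4k page is assumed.

```c
#include <assert.h>

#define PAGE_SHIFT 12                       /* assumed 4k pages, as on i386 */
#define PAGE_SIZE  (1ULL << PAGE_SHIFT)

struct page_span {
    unsigned long long first_index;   /* index of the first page touched */
    unsigned long long last_index;    /* index of the last page touched */
    unsigned long long first_offset;  /* byte offset inside the first page */
};

/* Convert a byte range (pos, count) into the range of 4k pages that
 * cover it; count must be non-zero. */
static struct page_span pages_for_range(unsigned long long pos,
                                        unsigned long long count)
{
    struct page_span span;

    assert(count > 0);
    span.first_index  = pos >> PAGE_SHIFT;
    span.first_offset = pos & (PAGE_SIZE - 1);
    span.last_index   = (pos + count - 1) >> PAGE_SHIFT;
    return span;
}
```

For example, reading 4000 bytes at offset 5000 starts 904 bytes into page 1 and ends in page 2.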

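A file is stored in a specific file system, and that file system has its own allocation unit, the block. For ext2, for example, the block is the smallest unit ext2 can allocate; the common ext2 block size is 1k, and it can also be 2k or 4k but no larger than 4k (refer to ext2_read_super).
For a file stored on this ext2 file system, all the blocks that belong to it are recorded in the file's inode, which stores the block number of each block. On ext2, block numbers start from 1 (block #0 is the boot block) and the maximum depends on the disk capacity. In this way each file is divided, in units of the block size, into blocks numbered from 0. Each file is such a linear space, mapped through an array in the inode onto the ext2 file system's block space, which starts from 1.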
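The mapping from a file's linear block space to file system block numbers can be pictured with a toy lookup. This sketch covers only a small array of direct block numbers and ignores ext2's indirect blocks; struct toy_inode and toy_bmap are hypothetical names, not ext2 code.

```c
#include <assert.h>

#define TOY_DIRECT_BLOCKS 12   /* toy limit: direct blocks only */

/* A toy inode: entry i holds the file system block number that
 * stores block i of the file (0 means "no block"). */
struct toy_inode {
    unsigned long block[TOY_DIRECT_BLOCKS];
};

/* Map a byte offset in the file to a file system block number.
 * Returns 0 if the offset lies beyond the toy inode's direct blocks. */
static unsigned long toy_bmap(const struct toy_inode *inode,
                              unsigned long offset, unsigned block_size)
{
    unsigned long file_block;

    assert(block_size > 0);
    file_block = offset / block_size;   /* block index inside the file */
    if (file_block >= TOY_DIRECT_BLOCKS)
        return 0;
    return inode->block[file_block];
}
```

With a 1k block size, byte offset 2500 falls in file block 2, so the lookup returns whatever block number the inode records at index 2.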

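After this step we have to deal with the disk. The usual interface is bread(block#, size).
struct buffer_head * bread(kdev_t dev, int block, int size)
{
    struct buffer_head * bh;

    bh = getblk(dev, block, size); /* bh carries the block# and the block size */
    if (buffer_uptodate(bh))
        return bh;
    ll_rw_block(READ, 1, &bh); /* hand the request to the disk driver */
    wait_on_buffer(bh);
    if (buffer_uptodate(bh))
        return bh;
    brelse(bh);
    return NULL;
}
This function reads the block numbered block, with the block size given by size. Put another way, bread provides an interface to the disk in the terms the file system uses to view it: size gives the block size and block selects which block to read. How the disk is divided into sectors is not the caller's concern.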
 IDE Driver overview
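 Starting from bread, let us see how the disk driver reads disk sectors. bread was covered above, so we begin at ll_rw_block. The process looks complicated in the code, but the main line is simple: a callback function is attached to the buffer; when the disk finishes the read, the callback sets the uptodate bit of the bh and wakes up any task waiting for this bh to complete. When the operation is submitted to the disk, it is placed on a queue; all queued requests are then scheduled to improve disk I/O speed, and the reads are carried out according to the result of that scheduling.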
void ll_rw_block(int rw, int nr, struct buffer_head * bhs[])
{
    unsigned int major;
    int correct_size;
    int i;

    /* First, a series of checks: */
    /* 1. Determine correct block size for this device. */
    /* 2. Verify requested block sizes. */
    /* 3. For a write, check whether the device allows writing. */

    /* Then b_end_io of each bh is set to end_buffer_io_sync; waiting processes are notified through this function. */
    for (i = 0; i < nr; i++) {
        struct buffer_head *bh;
        bh = bhs[i];

        /* Only one thread can actually submit the I/O. */
        if (test_and_set_bit(BH_Lock, &bh->b_state))
            continue;

        /* We have the buffer lock */
        bh->b_end_io = end_buffer_io_sync;

        ......... /* handle some possible races */

        submit_bh(rw, bh); /* submit the request to the disk driver */
    }
    return;
    .... /* clean */
}
 Next, the request is submitted to the disk driver through submit_bh:
void submit_bh(int rw, struct buffer_head * bh)
{
    if (!test_bit(BH_Lock, &bh->b_state))
        BUG();

    set_bit(BH_Req, &bh->b_state);

    /*
     * First step, 'identity mapping' - RAID or LVM might
     * further remap this.
     * Here the block# defined by the file system (in units of
     * b_size) is converted into a sector number.
     */
    bh->b_rdev = bh->b_dev;
    bh->b_rsector = bh->b_blocknr * (bh->b_size>>9);

    generic_make_request(rw, bh);

    switch (rw) {
        case WRITE:
            kstat.pgpgout++;
            break;
        default:
            kstat.pgpgin++;
            break;
    }
}
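The block-to-sector conversion done in submit_bh is worth isolating. A minimal user-space sketch, assuming 512-byte sectors and a power-of-two block size; block_to_sector is a hypothetical helper name, not a kernel function:

```c
#include <assert.h>

#define SECTOR_SHIFT 9   /* 512-byte sectors: size >> 9 is sectors per block */

/* Convert a file system block number into the number of the first
 * 512-byte sector of that block, mirroring
 * b_rsector = b_blocknr * (b_size >> 9). */
static unsigned long block_to_sector(unsigned long blocknr, unsigned block_size)
{
    /* the block size must be a power of two and at least one sector */
    assert(block_size >= 512 && (block_size & (block_size - 1)) == 0);
    return blocknr * (block_size >> SECTOR_SHIFT);
}
```

So block 10 of a file system with 1k blocks starts at sector 20, while block 10 with 4k blocks starts at sector 80.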
 Now let us see how the request is submitted to the disk driver:
void generic_make_request (int rw, struct buffer_head * bh)
{
    int major = MAJOR(bh->b_rdev);
    request_queue_t *q;

    ..... /* check that the requested range exists on the disk, e.g. that it does not exceed the largest sector number */

    /*
     * Resolve the mapping until finished. (drivers are
     * still free to implement/resolve their own stacking
     * by explicitly returning 0)
     */
    /* NOTE: we don't repeat the blk_size check for each new device.
     * Stacking drivers are expected to know what they are doing.
     */
    do {
        q = blk_get_queue(bh->b_rdev);
        if (!q) {
            printk(KERN_ERR
                   "generic_make_request: Trying to access nonexistent block-device %s (%ld)\n",
                   kdevname(bh->b_rdev), bh->b_rsector);
            buffer_IO_error(bh);
            break;
        }
    } while (q->make_request_fn(q, rw, bh)); /* see blk_init_queue: initialized to __make_request */
}
 A while loop is used here to submit the request, but for IDE this is unnecessary: __make_request always returns 0.
static int __make_request(request_queue_t * q, int rw,
struct buffer_head * bh)
{
unsigned int sector, count;
int max_segments = MAX_SEGMENTS;
struct request * req = NULL, *freereq = NULL;
int rw_ahead, max_sectors, el_ret;
struct list_head *head;
int latency;
elevator_t *elevator = &q->elevator;
 again:
........
 if (list_empty(head)) {
q->plug_device_fn(q, bh->b_rdev); /* is atomic */
/* for IDE this function is generic_plug_device; see blk_init_queue */
goto get_rq;
}
 el_ret = elevator->elevator_merge_fn(q, &req, bh, rw,
&max_sectors, &max_segments);
switch (el_ret) {
 case ELEVATOR_BACK_MERGE:
if (!q->back_merge_fn(q, req, bh, max_segments))
break;
req->bhtail->b_reqnext = bh;
req->bhtail = bh;
req->nr_sectors = req->hard_nr_sectors += count;
req->e = elevator;
drive_stat_acct(req->rq_dev, req->cmd, count, 0);
attempt_back_merge(q, req, max_sectors, max_segments);
goto out;
 case ELEVATOR_FRONT_MERGE:
if (!q->front_merge_fn(q, req, bh, max_segments))
break;
bh->b_reqnext = req->bh;
req->bh = bh;
req->buffer = bh->b_data;
req->current_nr_sectors = count;
req->sector = req->hard_sector = sector;
req->nr_sectors = req->hard_nr_sectors += count;
req->e = elevator;
drive_stat_acct(req->rq_dev, req->cmd, count, 0);
attempt_front_merge(q, head, req, max_sectors, max_segments);
goto out;
/*
* elevator says don't/can't merge. get new request
*/
case ELEVATOR_NO_MERGE:
break;
 default:
printk("elevator returned crap (%d)\n", el_ret);
BUG();
}
 /*
* Grab a free request from the freelist. Read first try their
* own queue - if that is empty, we steal from the write list.
* Writes must block if the write list is empty, and read aheads
* are not crucial.
*/
get_rq:
if (freereq) {
req = freereq;
freereq = NULL;
} else if ((req = get_request(q, rw)) == NULL) {
spin_unlock_irq(&io_request_lock);
if (rw_ahead)
goto end_io;
 freereq = __get_request_wait(q, rw);
goto again;
}
/* fill up the request-info, and add it to the queue */
req->cmd = rw;
req->errors = 0;
req->hard_sector = req->sector = sector;
req->hard_nr_sectors = req->nr_sectors = count;
req->current_nr_sectors = count;
req->nr_segments = 1; /* Always 1 for a new request. */
req->nr_hw_segments = 1; /* Always 1 for a new request. */
req->buffer = bh->b_data;
req->sem = NULL;
req->bh = bh;
req->bhtail = bh;
req->rq_dev = bh->b_rdev;
req->e = elevator;
add_request(q, req, head, latency); /* queue the request for the disk driver */
out:
if (!q->plugged)
(q->request_fn)(q); /* see ide_init_queue, which sets this to do_ide_request */
 if (freereq)
blkdev_release_request(freereq);
spin_unlock_irq(&io_request_lock);
return 0;
end_io:
bh->b_end_io(bh, test_bit(BH_Uptodate, &bh->b_state));
return 0;
}
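The back-merge and front-merge cases in __make_request come down to whether the new buffer is adjacent to an existing request. The following toy model illustrates just that decision; it is not the kernel's elevator code, the names toy_request and try_merge are hypothetical, and segment limits and latency accounting are left out.

```c
#include <assert.h>

enum merge_kind { NO_MERGE, BACK_MERGE, FRONT_MERGE };

/* A toy request: a run of contiguous sectors. */
struct toy_request {
    unsigned long sector;      /* first sector of the run */
    unsigned long nr_sectors;  /* length of the run */
};

/* Try to merge a new buffer (sector, count) into req.
 * Back merge: the buffer starts right after the request ends.
 * Front merge: the buffer ends right where the request starts. */
static enum merge_kind try_merge(struct toy_request *req,
                                 unsigned long sector, unsigned long count,
                                 unsigned long max_sectors)
{
    if (req->nr_sectors + count > max_sectors)
        return NO_MERGE;                    /* request would grow too large */
    if (req->sector + req->nr_sectors == sector) {
        req->nr_sectors += count;
        return BACK_MERGE;
    }
    if (sector + count == req->sector) {
        req->sector = sector;
        req->nr_sectors += count;
        return FRONT_MERGE;
    }
    return NO_MERGE;                        /* not adjacent: new request */
}
```

A buffer that cannot be merged takes the get_rq path in the listing above and becomes a request of its own.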
The meaning of q->plugged is discussed a little later. First, let us look at what do_ide_request does:
void do_ide_request(request_queue_t *q)
{
ide_do_request(q->queuedata, 0);
}
static void ide_do_request(ide_hwgroup_t *hwgroup, int masked_irq)
{
ide_drive_t *drive;
ide_hwif_t *hwif;
ide_startstop_t startstop;
 ide_get_lock(&ide_lock, ide_intr, hwgroup); /* for atari only: POSSIBLY BROKEN HERE(?) */
 __cli(); /* necessary paranoia: ensure IRQs are masked on local CPU */
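 while (!hwgroup->busy) { /* work is needed only while the hwgroup is not busy; otherwise this function does nothing */
hwgroup->busy = 1; /* If busy is already set, another thread has entered this loop. The first
thread to enter the loop is responsible for handling the requests of
every drive attached to this hwgroup. One hwgroup shares a single
interrupt.
*/

drive = choose_drive(hwgroup); /* Choose a drive. The request handled is not necessarily the one
you just submitted; you may have read hda and yet hdc is chosen here.
Note that a drive is chosen only if drive->queue.plugged == 0:
a plugged queue is still collecting requests and is not dispatched
here. Once a drive has started processing requests, this thread no
longer needs to call ide_do_request for it; the next request is
started from the interrupt path, ide_intr -> ide_do_request.
*/
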

if (drive == NULL) {
unsigned long sleep = 0;
hwgroup->rq = NULL;
drive = hwgroup->drive;
do {
if (drive->sleep && (!sleep || 0 < (signed long)(sleep - drive->sleep)))
sleep = drive->sleep;
} while ((drive = drive->next) != hwgroup->drive);
if (sleep) {
/*
* Take a short snooze, and then wake up this hwgroup again.
* This gives other hwgroups on the same a chance to
* play fairly with us, just in case there are big differences
* in relative throughputs.. don't want to hog the cpu too much.
*/
if (0 < (signed long)(jiffies + WAIT_MIN_SLEEP - sleep))
sleep = jiffies + WAIT_MIN_SLEEP;
#if 1
if (timer_pending(&hwgroup->timer))
printk("ide_set_handler: timer already active\n");
#endif
hwgroup->sleeping = 1; /* so that ide_timer_expiry knows what to do */
mod_timer(&hwgroup->timer, sleep);
/* we purposely leave hwgroup->busy==1 while sleeping */
} else {
/* Ugly, but how can we sleep for the lock otherwise? perhaps from tq_disk? */
ide_release_lock(&ide_lock); /* for atari only */
hwgroup->busy = 0;
}
return; /* no more work for this hwgroup (for now) */
}
hwif = HWIF(drive);
if (hwgroup->hwif->sharing_irq && hwif != hwgroup->hwif && hwif->io_ports[IDE_CONTROL_OFFSET]) {
/* set nIEN for previous hwif */
SELECT_INTERRUPT(hwif, drive);
}
hwgroup->hwif = hwif;
hwgroup->drive = drive;
drive->sleep = 0;
drive->service_start = jiffies;
 if ( drive->queue.plugged ) /* paranoia */
printk("%s: Huh? nuking plugged queue\n", drive->name);
hwgroup->rq = blkdev_entry_next_request(&drive->queue.queue_head);
/*
* Some systems have trouble with IDE IRQs arriving while
* the driver is still setting things up. So, here we disable
* the IRQ used by this interface while the request is being started.
* This may look bad at first, but pretty much the same thing
* happens anyway when any interrupt comes in, IDE or otherwise
* -- the kernel masks the IRQ while it is being handled.
*/
if (masked_irq && hwif->irq != masked_irq)
disable_irq_nosync(hwif->irq);
spin_unlock(&io_request_lock);
ide__sti(); /* allow other IRQs while we start this request */
startstop = start_request(drive);
spin_lock_irq(&io_request_lock);
if (masked_irq && hwif->irq != masked_irq)
enable_irq(hwif->irq);
if (startstop == ide_stopped)
hwgroup->busy = 0;
}
}
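 With the IDE analysis at this point, we start to touch the 'core' logic of disk operation: __make_request, ide_do_request, plugged, ide_intr, and tq_disk.

__make_request and tq_disk are mainly responsible for scheduling disk read and write requests, while ide_do_request and ide_intr carry out the IDE interface operations that actually read and write the disk. When __make_request receives the first request for a disk (the queue is empty), it puts the request on the queue, sets plugged, and adds the queue to tq_disk, which postpones the call to ide_do_request. Later requests are first scheduled, and only then is it decided whether to start a hardware operation immediately. Once a request has been issued to the hardware (ide_do_request has run), the interrupt handler takes over the calls to ide_do_request; at that point the queue's plugged bit has been cleared and the hwgroup's busy bit is set. While a queue is plugged and waiting on tq_disk, no hardware operation is started for it, because ide_do_request selects only unplugged queues.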

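After the interrupt handler has taken over the calls to ide_do_request, it does not necessarily process every pending request; that depends on the scheduling done at the disk level, since ide_do_request is responsible for scheduling among drives. Note head_active here: for IDE this flag is always 1, which means that when requests are scheduled while the queue is unplugged, the first req must not be touched (in the unplugged state I/O may be in progress, i.e. ide_intr may already be performing the real I/O on it).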
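The interplay of plugged, busy and the interrupt path can be pictured with a small state model. This is a deliberately simplified sketch with one drive and one queue, where requests are only counted; all names (toy_queue, toy_make_request and so on) are hypothetical stand-ins, not kernel functions.

```c
#include <assert.h>

struct toy_queue {
    int plugged;    /* 1 while the queue is collecting requests */
    int busy;       /* 1 while the head request is on the hardware */
    int pending;    /* requests on the queue, including the active head */
    int completed;  /* requests finished so far */
};

/* Stand-in for request_fn / ide_do_request: start the head request
 * unless the hardware is already busy. */
static void toy_request_fn(struct toy_queue *q)
{
    if (!q->busy && q->pending > 0)
        q->busy = 1;
}

/* Stand-in for __make_request: the first request on an empty queue
 * plugs it; request_fn is called only if the queue is not plugged. */
static void toy_make_request(struct toy_queue *q)
{
    if (q->pending == 0)
        q->plugged = 1;
    q->pending++;
    if (!q->plugged)
        toy_request_fn(q);
}

/* Stand-in for running tq_disk: unplug and start the hardware. */
static void toy_run_tq_disk(struct toy_queue *q)
{
    if (q->plugged) {
        q->plugged = 0;
        toy_request_fn(q);
    }
}

/* Stand-in for ide_intr: finish the active request, then start the next. */
static void toy_interrupt(struct toy_queue *q)
{
    assert(q->busy);
    q->pending--;
    q->completed++;
    q->busy = 0;
    toy_request_fn(q);
}
```

In this model two requests queued back to back leave the queue plugged and idle; running tq_disk unplugs it and starts the first request, and from then on each interrupt completes one request and starts the next until the queue drains.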

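A plugged queue is really waiting for its requests to be scheduled so that good I/O throughput is achieved. But it cannot wait like this forever. So a search for tq_disk shows many places in the kernel that balance throughput against latency; the details are not listed here. The code that actually drives the IDE hardware is start_request and drive->do_request (do_rw_disk for an IDE hard disk):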
/*
* do_rw_disk() issues READ and WRITE commands to a disk,
* using LBA if supported, or CHS otherwise, to address sectors.
* It also takes care of issuing special DRIVE_CMDs.
*/
static ide_startstop_t do_rw_disk (ide_drive_t *drive, struct request *rq, unsigned long block)
{
if (IDE_CONTROL_REG)
OUT_BYTE(drive->ctl,IDE_CONTROL_REG);
OUT_BYTE(rq->nr_sectors,IDE_NSECTOR_REG);

if (drive->select.b.lba) { /* LBA: as can be seen, the 2.4.0 kernel does not yet support 48-bit LBA, so it cannot support disks larger than 137 GB */

#ifdef DEBUG
printk("%s: %sing: LBAsect=%ld, sectors=%ld, buffer=0x%08lx\n",
drive->name, (rq->cmd==READ)?"read":"writ",
block, rq->nr_sectors, (unsigned long) rq->buffer);
#endif
OUT_BYTE(block,IDE_SECTOR_REG);
OUT_BYTE(block>>=8,IDE_LCYL_REG);
OUT_BYTE(block>>=8,IDE_HCYL_REG);
OUT_BYTE(((block>>8)&0x0f)|drive->select.all,IDE_SELECT_REG);
} else {
unsigned int sect,head,cyl,track;
track = block / drive->sect;
sect = block % drive->sect + 1;
OUT_BYTE(sect,IDE_SECTOR_REG);
head = track % drive->head;
cyl = track / drive->head;
OUT_BYTE(cyl,IDE_LCYL_REG);
OUT_BYTE(cyl>>8,IDE_HCYL_REG);
OUT_BYTE(head|drive->select.all,IDE_SELECT_REG);
#ifdef DEBUG
printk("%s: %sing: CHS=%d/%d/%d, sectors=%ld, buffer=0x%08lx\n",
drive->name, (rq->cmd==READ)?"read":"writ", cyl,
head, sect, rq->nr_sectors, (unsigned long) rq->buffer);
#endif
}
#ifdef CONFIG_BLK_DEV_PDC4030
if (IS_PDC4030_DRIVE) {
extern ide_startstop_t do_pdc4030_io(ide_drive_t *, struct request *);
return do_pdc4030_io (drive, rq);
}
#endif /* CONFIG_BLK_DEV_PDC4030 */
if (rq->cmd == READ) {
#ifdef CONFIG_BLK_DEV_IDEDMA
if (drive->using_dma && !(HWIF(drive)->dmaproc(ide_dma_read, drive)))
return ide_started;
#endif /* CONFIG_BLK_DEV_IDEDMA */
ide_set_handler(drive, &read_intr, WAIT_CMD, NULL);
OUT_BYTE(drive->mult_count ? WIN_MULTREAD : WIN_READ, IDE_COMMAND_REG);
return ide_started;
}
if (rq->cmd == WRITE) {
ide_startstop_t startstop;
#ifdef CONFIG_BLK_DEV_IDEDMA
if (drive->using_dma && !(HWIF(drive)->dmaproc(ide_dma_write, drive)))
return ide_started;
#endif /* CONFIG_BLK_DEV_IDEDMA */
OUT_BYTE(drive->mult_count ? WIN_MULTWRITE : WIN_WRITE, IDE_COMMAND_REG);
if (ide_wait_stat(&startstop, drive, DATA_READY, drive->bad_wstat, WAIT_DRQ)) {
printk(KERN_ERR "%s: no DRQ after issuing %s\n", drive->name,
drive->mult_count ? "MULTWRITE" : "WRITE");
return startstop;
}
if (!drive->unmask)
__cli(); /* local CPU only */
if (drive->mult_count) {
ide_hwgroup_t *hwgroup = HWGROUP(drive);
/*
* Ugh.. this part looks ugly because we MUST set up
* the interrupt handler before outputting the first block
* of data to be written. If we hit an error (corrupted buffer list)
* in ide_multwrite(), then we need to remove the handler/timer
* before returning. Fortunately, this NEVER happens (right?).
*
* Except when you get an error it seems...
*/
hwgroup->wrq = *rq; /* scratchpad */
ide_set_handler (drive, &multwrite_intr, WAIT_CMD, NULL);
if (ide_multwrite(drive, drive->mult_count)) {
unsigned long flags;
spin_lock_irqsave(&io_request_lock, flags);
hwgroup->handler = NULL;
del_timer(&hwgroup->timer);
spin_unlock_irqrestore(&io_request_lock, flags);
return ide_stopped;
}
} else {
ide_set_handler (drive, &write_intr, WAIT_CMD, NULL);
idedisk_output_data(drive, rq->buffer, SECTOR_WORDS);
}
return ide_started;
}
printk(KERN_ERR "%s: bad command: %d\n", drive->name, rq->cmd);
ide_end_request(0, HWGROUP(drive));
return ide_stopped;
}
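The two addressing branches of do_rw_disk can be reproduced in user space to see which values end up in the task file registers. This is a sketch under assumptions: struct taskfile and both helpers are hypothetical names, and select_base stands in for drive->select.all (taken here to be 0xe0 for an LBA master and 0xa0 for a CHS master).

```c
#include <assert.h>
#include <stdint.h>

/* The four address registers written by do_rw_disk. */
struct taskfile {
    uint8_t sector;  /* IDE_SECTOR_REG */
    uint8_t lcyl;    /* IDE_LCYL_REG */
    uint8_t hcyl;    /* IDE_HCYL_REG */
    uint8_t select;  /* IDE_SELECT_REG */
};

/* LBA branch: split a 28-bit block number into 8 + 8 + 8 + 4 bits. */
static struct taskfile lba28_taskfile(uint32_t block, uint8_t select_base)
{
    struct taskfile tf;

    assert(block < (1UL << 28));
    tf.sector = (uint8_t)(block & 0xff);
    tf.lcyl   = (uint8_t)((block >> 8) & 0xff);
    tf.hcyl   = (uint8_t)((block >> 16) & 0xff);
    tf.select = (uint8_t)(((block >> 24) & 0x0f) | select_base);
    return tf;
}

/* CHS branch: derive cylinder/head/sector from the block number and
 * the drive geometry (sectors are numbered from 1). */
static struct taskfile chs_taskfile(uint32_t block, unsigned heads,
                                    unsigned sectors, uint8_t select_base)
{
    struct taskfile tf;
    uint32_t track, cyl;

    assert(heads > 0 && heads <= 16 && sectors > 0);
    track     = block / sectors;
    cyl       = track / heads;
    tf.sector = (uint8_t)(block % sectors + 1);
    tf.lcyl   = (uint8_t)(cyl & 0xff);
    tf.hcyl   = (uint8_t)((cyl >> 8) & 0xff);
    tf.select = (uint8_t)((track % heads) | select_base);
    return tf;
}
```

Only four bits of the select register carry address bits in the LBA branch, which is where the 28-bit (137 GB) ceiling of this code path comes from.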
 
 