026_pre_fs.c

Date: 2009-03-27  Source: hylpro

2006-12-21

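 Ordinary file systems contain a large amount of code that interacts with the disk. Before reading the fs code, and to understand those operations better, these notes first study the relevant parts of the IDE driver and settle a few questions that had remained vague.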

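Large-capacity disk issues

The original problem is that the BIOS designers and the ATA interface designers never agreed with each other: the total number of bits that BIOS and ATA allocate to a CHS address differs, and so does the number of bits given to the cylinder, head, and sector fields. Worse, apparently nobody foresaw how quickly disk capacity would grow.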

 First, the capacity limits of the various interfaces:

 Interface           Address fields (bits)               Total     Limit                    Limit reached
 BIOS Int 13h        cylinder 10, head 8, sector 6       24 bits   8.4 GB
 ATA                 cylinder 16, head 4, sector 8       28 bits   137.4 GB                 Sept 2001 (160 GB Maxtor DiamondMax)
 Extended Int 13h    8 bytes * 8 = 64-bit LBA number     64 bits   9.4 trillion gigabytes
 ATA-6               6 bytes * 8 = 48-bit LBA number     48 bits   less than Extended Int 13h, but large enough

 Extended Int 13h is supported by practically every BIOS released after 1997.

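 As shown above, the BIOS Int 13h interface was designed to address at most about 8 GB, and ATA-1 through ATA-5 can address only about 137 GB. These are design flaws, and they are where the two limits come from.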
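
 The two figures can be checked with a little arithmetic. The following is only a sketch: the helper names chs_capacity and lba_capacity are made up for this note, and a 512-byte sector is assumed throughout. Note that CHS sector numbers start at 1, so a 6-bit sector field gives 63 usable sectors per track.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_BYTES 512ULL   /* assumption: classic 512-byte sectors */

/* Capacity in bytes of a CHS geometry (counts, not bit widths). */
static uint64_t chs_capacity(uint64_t cylinders, uint64_t heads, uint64_t sectors)
{
    return cylinders * heads * sectors * SECTOR_BYTES;
}

/* Capacity in bytes addressable with an LBA number of the given width. */
static uint64_t lba_capacity(unsigned bits)
{
    assert(bits < 55);   /* keep (1 << bits) * 512 inside 64 bits */
    return (1ULL << bits) * SECTOR_BYTES;
}
```

 With these helpers, chs_capacity(1024, 256, 63) gives 8,455,716,864 bytes (the 8.4 GB BIOS limit) and lba_capacity(28) gives 137,438,953,472 bytes (the 137.4 GB ATA limit).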

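The limit worth noting before 8 GB was reached is the 528 MB limit.
It comes from combining BIOS Int 13h with the ATA (up to ATA-5) CHS fields: an old BIOS passes the CHS supplied by the caller straight to the ATA interface (BIOS mode "Normal"), so every field is restricted to the narrower of the two widths.

           Cylinder   Head   Sector   Total     Limit     Became a problem
 Bits      10         4      6        20 bits   528 MB    around 1993

 (1024 cylinders * 16 heads * 63 sectors * 512 bytes = 528,482,304 bytes.)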
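
 The "narrower field wins" rule can be written down directly. This is a sketch only; struct chs_limit, the two constants and the helpers are illustrative names, and the ATA row uses the commonly quoted maximum counts (65,536 cylinders, 16 heads, 255 sectors) rather than anything taken from kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Maximum counts (not bit widths) per CHS field. */
struct chs_limit {
    uint32_t cylinders;
    uint32_t heads;
    uint32_t sectors;
};

/* Illustrative constants: what each side can express on its own. */
static const struct chs_limit BIOS_INT13_LIMIT = { 1024, 256, 63 };
static const struct chs_limit ATA_CHS_LIMIT    = { 65536, 16, 255 };

static uint32_t min_u32(uint32_t a, uint32_t b)
{
    return a < b ? a : b;
}

/* An old BIOS passes CHS through unchanged, so each field is capped
 * by the smaller of the two limits. */
static struct chs_limit combine_limits(struct chs_limit a, struct chs_limit b)
{
    struct chs_limit r;
    r.cylinders = min_u32(a.cylinders, b.cylinders);
    r.heads     = min_u32(a.heads, b.heads);
    r.sectors   = min_u32(a.sectors, b.sectors);
    return r;
}

/* Capacity in bytes, assuming 512-byte sectors. */
static uint64_t limit_bytes(struct chs_limit l)
{
    assert(l.cylinders > 0 && l.heads > 0 && l.sectors > 0);
    return (uint64_t)l.cylinders * l.heads * l.sectors * 512u;
}
```

 Combining the two constants yields 1024/16/63, and limit_bytes() on that result is 528,482,304 bytes.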

 To solve this, BIOSes introduced Extended CHS (ECHS), which some BIOSes call "Large" mode.
Take a 2.95 GiB disk as an example; the hardware reports a CHS geometry of 6136/16/63:

                               Cylinders     Heads           Sectors   Capacity
 IDE/ATA Limits                65,536        16              256       128 GiB
 Hard Disk Logical Geometry    6,136         16              63        2.95 GiB
 BIOS Translation Factor       divide by 8   multiply by 8   --        --
 BIOS Translated Geometry      767           128             63        2.95 GiB
 BIOS Int 13h Limits           1,024         256             63        7.88 GiB
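
 The translation in the table can be sketched as a search for the smallest power-of-two factor that brings the cylinder count under the Int 13h limit. This is an assumption about how a typical BIOS picks the factor, not a description of any particular BIOS; struct geometry, EXAMPLE_DISK and echs_translate are names invented for the sketch.

```c
#include <assert.h>

struct geometry {
    unsigned cylinders;
    unsigned heads;
    unsigned sectors;
};

/* The 2.95 GiB example disk from the table above. */
static const struct geometry EXAMPLE_DISK = { 6136, 16, 63 };

/* Divide cylinders and multiply heads by the same factor (1, 2, 4, 8, 16)
 * until the result fits the Int 13h limits of 1024 cylinders and 256 heads.
 * Returns {0, 0, 0} if no factor fits. */
static struct geometry echs_translate(struct geometry phys)
{
    struct geometry none = { 0, 0, 0 };
    unsigned factor;

    assert(phys.heads > 0 && phys.sectors > 0);
    for (factor = 1; factor <= 16; factor *= 2) {
        struct geometry t;
        t.cylinders = phys.cylinders / factor;   /* remainder cylinders are lost */
        t.heads     = phys.heads * factor;
        t.sectors   = phys.sectors;
        if (t.cylinders <= 1024 && t.heads <= 256)
            return t;
    }
    return none;
}
```

 For EXAMPLE_DISK the loop rejects factors 1, 2 and 4 (6136, 3068 and 1534 cylinders) and settles on 8, giving 767/128/63 as in the table.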

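 The 8 GB limit can only be broken by using Extended Int 13h together with LBA mode on the ATA side. If LBA mode is in use but a program still calls the old Int 13h, the BIOS converts the CHS address directly into an LBA address; this is called assisted LBA. Either way, as long as the old BIOS interface is used, whether through ECHS translation or assisted LBA, the 8.5 GB limit cannot be exceeded. One more detail is worth mentioning: the maximum number of heads is 255 rather than 256, because DOS and Windows 95 cannot handle 256 heads, so the real limit is slightly below 8.5 GB.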

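 ATA specifies that disks larger than 8.4 GB should report a CHS of 16383/16/63. This means the `geometry' is obsolete: the total size can no longer be computed from the geometry and can only be obtained from the LBA capacity field returned by the IDENTIFY command. Disks larger than 137.4 GB should report an LBA capacity of 0xfffffff = 268435455 sectors (137 GB); the correct disk size is in the new 48-bit field.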
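
 A minimal sketch of that rule follows. It assumes the 256-word IDENTIFY DEVICE layout of ATA-6 (words 60-61 hold the 28-bit LBA capacity, bit 10 of word 83 flags 48-bit addressing, words 100-103 hold the 48-bit capacity) and skips the validity checks a real driver would perform; identify_capacity_sectors is a name made up for this note, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/* Number of addressable sectors according to an IDENTIFY DEVICE block. */
static uint64_t identify_capacity_sectors(const uint16_t id[256])
{
    uint32_t lba28;

    assert(id != 0);
    lba28 = (uint32_t)id[60] | ((uint32_t)id[61] << 16);

    if (id[83] & (1u << 10)) {              /* 48-bit address feature set */
        uint64_t lba48 = (uint64_t)id[100]
                       | ((uint64_t)id[101] << 16)
                       | ((uint64_t)id[102] << 32)
                       | ((uint64_t)id[103] << 48);
        if (lba48 != 0)
            return lba48;                   /* the real size of a large disk */
    }
    return lba28;                           /* at most 0x0fffffff sectors */
}
```

 A disk above 137 GB reports 0x0fffffff in words 60-61, so only the 48-bit field reveals its true size.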

 Linux support for large-capacity disks:
 >8.4 GB kernel should be 2.0.34 or later.
 >33.8 GB kernel should be 2.0.39/2.2.14/2.3.21 or later.
 > 137 GB kernel should be 2.4.19/2.5.3 or later.

 To check whether a given Linux version supports large disks, look at the function do_rw_disk (ide-disk.c).

References:
1. Large Disk Drives >8.4GB (in addition, an IBM document is attached)
http://www-oss.fnal.gov/projects/fermilinux/common/faq/old/0009.html

2. PC Guide: hard disk drives
 http://www.pcguide.com/ref/hdd/index.htm

 
 Block size issues

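 When analyzing mm we discussed several aspects of do_generic_file_read; the key is to understand the basic ideas. First, files on disk are cached in memory as much as possible so that reads and writes are faster. The basic unit of that cache is the memory page, which on i386 is usually 4 KB. When a file is read through the cache, the byte-based offset and size given by the user are first converted into 4 KB pages, so that data can be copied directly to the user. If the data is not in the cache it has to be read from disk, for example by reading one page worth of data with block_read_full_page.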
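
 The offset-to-page conversion is plain shifting and masking. The sketch below assumes a 4 KB page; the TOY_ macros and helper names are illustrative and are not the kernel's PAGE_SHIFT or PAGE_CACHE_SIZE definitions.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12                      /* assumption: 4 KB pages */
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)

/* Index of the page that contains byte position pos. */
static unsigned long first_page_index(unsigned long long pos)
{
    return (unsigned long)(pos >> TOY_PAGE_SHIFT);
}

/* Offset of pos inside its page. */
static unsigned long offset_in_page(unsigned long long pos)
{
    return (unsigned long)(pos & (TOY_PAGE_SIZE - 1));
}

/* Index of the last page touched by a read of size bytes starting at pos. */
static unsigned long last_page_index(unsigned long long pos, unsigned long size)
{
    assert(size > 0);
    return (unsigned long)((pos + size - 1) >> TOY_PAGE_SHIFT);
}
```

 A read of 10000 bytes at offset 5000 therefore starts 904 bytes into page 1 and ends in page 3.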

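A file lives in a concrete file system, and that file system has its own allocation unit, the block. For ext2, the block is the smallest unit ext2 can allocate; the usual ext2 block size is 1 KB, 2 KB and 4 KB are also possible, but it cannot exceed 4 KB (see ext2_read_super).
 For a file stored on ext2, all the blocks that belong to it are recorded in the file's on-disk inode, as the block number of each block. ext2 block numbers start from 1 (block 0 is the boot block), and the maximum depends on the disk capacity. In this way each file is divided, in units of the block size, into blocks numbered from 0. Every file is such a linear space, mapped through an array in the inode onto the ext2 block space that starts from 1.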
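
 As a rough illustration of that mapping, the sketch below looks up the disk block behind a file offset using only the twelve direct block slots of an ext2-style inode. struct toy_inode and toy_bmap are hypothetical; the real ext2 code also walks indirect blocks, which are not modelled here.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NDIR_BLOCKS 12   /* ext2 keeps 12 direct block numbers per inode */

/* Hypothetical, stripped-down inode: direct block numbers only. */
struct toy_inode {
    uint32_t i_block[TOY_NDIR_BLOCKS];
};

/* Map a byte offset in the file to a block number on the file system.
 * Returns 0 for a hole or for offsets beyond the direct blocks. */
static uint32_t toy_bmap(const struct toy_inode *inode,
                         uint64_t file_offset, uint32_t block_size)
{
    uint64_t logical;

    assert(inode != 0 && block_size > 0);
    logical = file_offset / block_size;   /* file-relative block, from 0 */
    if (logical >= TOY_NDIR_BLOCKS)
        return 0;                         /* indirect blocks not modelled */
    return inode->i_block[logical];       /* fs-relative block, from 1 */
}
```

 With a 1 KB block size, offset 2500 falls into logical block 2 of the file, and the inode array says where that block sits on the file system.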

After this step we have to deal with the disk. The usual interface is bread(block#, size):
struct buffer_head * bread(kdev_t dev, int block, int size)
{
	struct buffer_head * bh;

	bh = getblk(dev, block, size);	/* bh records the block number and the block size */
	if (buffer_uptodate(bh))
		return bh;
	ll_rw_block(READ, 1, &bh);	/* hand the read over to the disk driver */
	wait_on_buffer(bh);
	if (buffer_uptodate(bh))
		return bh;
	brelse(bh);
	return NULL;
}
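 This function reads block number block, treating the device as an array of blocks of size bytes. Put differently, bread provides access to the disk the way the file system sees it: size gives the block size and block selects which block to read. How the disk is divided into sectors is not the caller's concern.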

 IDE Driver overview

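 We start from bread to see how the disk driver reads disk sectors. bread itself was covered above, so we begin with ll_rw_block. The code looks complicated, but the main line is simple: a callback is attached to the buffer; when the disk finishes the read, the callback sets the uptodate bit of the bh and wakes any task waiting for the bh. When the request is submitted, the driver places it on a queue, schedules all queued requests to improve disk I/O throughput, and then performs the reads in the scheduled order.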
void ll_rw_block(int rw, int nr, struct buffer_head * bhs[])
{
	unsigned int major;
	int correct_size;
	int i;

	/* First, a series of checks (code elided): */
	/* 1. Determine correct block size for this device. */
	/* 2. Verify requested block sizes. */
	/* 3. For a write, check whether the device allows writing. */

	/* Then set b_end_io to end_buffer_io_sync for each bh;
	 * waiting processes are notified through this function. */
	for (i = 0; i < nr; i++) {
		struct buffer_head *bh;
		bh = bhs[i];

		/* Only one thread can actually submit the I/O. */
		if (test_and_set_bit(BH_Lock, &bh->b_state))
			continue;

		/* We have the buffer lock */
		bh->b_end_io = end_buffer_io_sync;

		......... /* handle some possible races */

		submit_bh(rw, bh); /* submit the request to the disk driver */
	}
	return;
	.... /* cleanup */
}

 The request is then submitted to the disk driver through submit_bh:
void submit_bh(int rw, struct buffer_head * bh)
{
	if (!test_bit(BH_Lock, &bh->b_state))
		BUG();

	set_bit(BH_Req, &bh->b_state);

	/*
	 * First step, 'identity mapping' - RAID or LVM might
	 * further remap this.
	 * Here the block number defined by the file system (in units
	 * of b_size) is converted into a sector number.
	 */
	bh->b_rdev = bh->b_dev;
	bh->b_rsector = bh->b_blocknr * (bh->b_size>>9);

	generic_make_request(rw, bh);

	switch (rw) {
		case WRITE:
			kstat.pgpgout++;
			break;
		default:
			kstat.pgpgin++;
			break;
	}
}
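
 The block-to-sector conversion in submit_bh is worth isolating: b_size >> 9 is the number of 512-byte sectors per block. A small standalone sketch (block_to_sector is an illustrative name, not a kernel function):

```c
#include <assert.h>

/* Convert a file-system block number into a 512-byte sector number,
 * mirroring b_rsector = b_blocknr * (b_size >> 9). */
static unsigned long block_to_sector(unsigned long blocknr, unsigned int block_size)
{
    /* block sizes are powers of two and at least one sector long */
    assert(block_size >= 512 && (block_size & (block_size - 1)) == 0);
    return blocknr * (block_size >> 9);
}
```

 Block 10 of a file system with 1 KB blocks starts at sector 20; with 4 KB blocks the same block number starts at sector 80.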

 Now see how the request is passed on to the disk driver:
void generic_make_request (int rw, struct buffer_head * bh)
{
	int major = MAJOR(bh->b_rdev);
	request_queue_t *q;

	..... /* check that the requested range exists on the disk, e.g. that it does not exceed the last sector */

	/*
	 * Resolve the mapping until finished. (drivers are
	 * still free to implement/resolve their own stacking
	 * by explicitly returning 0)
	 */
	/* NOTE: we don't repeat the blk_size check for each new device.
	 * Stacking drivers are expected to know what they are doing.
	 */
	do {
		q = blk_get_queue(bh->b_rdev);
		if (!q) {
			printk(KERN_ERR
			       "generic_make_request: Trying to access nonexistent block-device %s (%ld)\n",
			       kdevname(bh->b_rdev), bh->b_rsector);
			buffer_IO_error(bh);
			break;
		}

	}
	while (q->make_request_fn(q, rw, bh)); /* see blk_init_queue: initialized to __make_request */
}
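
 The do/while loop exists for stacking drivers such as RAID or LVM: a make_request_fn that returns non-zero has rewritten b_rdev and b_rsector and wants the request resubmitted to the new device. The toy model below only illustrates that control flow; every toy_ name, the device numbers and the offset of 63 sectors are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

struct toy_bh {
    int rdev;                  /* device the request is currently aimed at */
    unsigned long rsector;     /* sector on that device */
};

struct toy_queue {
    int (*make_request_fn)(struct toy_bh *bh);
};

#define TOY_DEV_REMAP 1        /* hypothetical stacking device */
#define TOY_DEV_DISK  2        /* hypothetical real disk */
#define TOY_REMAP_OFFSET 63UL  /* hypothetical start of the mapped area */

static int toy_disk_requests;  /* how many requests reached the disk */

/* Stacking driver: redirect to the real disk and ask for another pass. */
static int toy_remap_make_request(struct toy_bh *bh)
{
    bh->rdev = TOY_DEV_DISK;
    bh->rsector += TOY_REMAP_OFFSET;
    return 1;
}

/* Real disk: queue the request, nothing left to resolve. */
static int toy_disk_make_request(struct toy_bh *bh)
{
    (void)bh;
    toy_disk_requests++;
    return 0;
}

static struct toy_queue toy_queues[] = {
    { NULL }, { toy_remap_make_request }, { toy_disk_make_request }
};

static struct toy_queue *toy_get_queue(int dev)
{
    if (dev < TOY_DEV_REMAP || dev > TOY_DEV_DISK)
        return NULL;
    return &toy_queues[dev];
}

/* Same shape as generic_make_request; returns the number of passes,
 * or -1 if the device has no queue. */
static int toy_generic_make_request(struct toy_bh *bh)
{
    struct toy_queue *q;
    int passes = 0;

    assert(bh != NULL);
    do {
        q = toy_get_queue(bh->rdev);
        if (!q)
            return -1;
        passes++;
    } while (q->make_request_fn(bh));
    return passes;
}
```

 A request aimed at the remapping device takes two passes and ends up on the disk at sector + 63; a request aimed directly at the disk takes one pass.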

 A while loop is used here to submit the request, but for IDE this is unnecessary: __make_request always returns 0.
static int __make_request(request_queue_t * q, int rw,
 struct buffer_head * bh)
{
 unsigned int sector, count;
 int max_segments = MAX_SEGMENTS;
 struct request * req = NULL, *freereq = NULL;
 int rw_ahead, max_sectors, el_ret;
 struct list_head *head;
 int latency;
 elevator_t *elevator = &q->elevator;

 again:
 ........

 if (list_empty(head)) {
 q->plug_device_fn(q, bh->b_rdev); /* is atomic */
 /* for IDE this function is generic_plug_device, see blk_init_queue */
 goto get_rq;
 }

 el_ret = elevator->elevator_merge_fn(q, &req, bh, rw,
 &max_sectors, &max_segments);
 switch (el_ret) {

 case ELEVATOR_BACK_MERGE:
 if (!q->back_merge_fn(q, req, bh, max_segments))
 break;
 req->bhtail->b_reqnext = bh;
 req->bhtail = bh;
 req->nr_sectors = req->hard_nr_sectors += count;
 req->e = elevator;
 drive_stat_acct(req->rq_dev, req->cmd, count, 0);
 attempt_back_merge(q, req, max_sectors, max_segments);
 goto out;

 case ELEVATOR_FRONT_MERGE:
 if (!q->front_merge_fn(q, req, bh, max_segments))
 break;
 bh->b_reqnext = req->bh;
 req->bh = bh;
 req->buffer = bh->b_data;
 req->current_nr_sectors = count;
 req->sector = req->hard_sector = sector;
 req->nr_sectors = req->hard_nr_sectors += count;
 req->e = elevator;
 drive_stat_acct(req->rq_dev, req->cmd, count, 0);
 attempt_front_merge(q, head, req, max_sectors, max_segments);
 goto out;
 /*
 * elevator says don't/can't merge. get new request
 */
 case ELEVATOR_NO_MERGE:
 break;

 default:
 printk("elevator returned crap (%d)\n", el_ret);
 BUG();
 }

 /*
 * Grab a free request from the freelist. Read first try their
 * own queue - if that is empty, we steal from the write list.
 * Writes must block if the write list is empty, and read aheads
 * are not crucial.
 */
get_rq:
 if (freereq) {
 req = freereq;
 freereq = NULL;
 } else if ((req = get_request(q, rw)) == NULL) {
 spin_unlock_irq(&io_request_lock);
 if (rw_ahead)
 goto end_io;

 freereq = __get_request_wait(q, rw);
 goto again;
 }

/* fill up the request-info, and add it to the queue */
 req->cmd = rw;
 req->errors = 0;
 req->hard_sector = req->sector = sector;
 req->hard_nr_sectors = req->nr_sectors = count;
 req->current_nr_sectors = count;
 req->nr_segments = 1; /* Always 1 for a new request. */
 req->nr_hw_segments = 1; /* Always 1 for a new request. */
 req->buffer = bh->b_data;
 req->sem = NULL;
 req->bh = bh;
 req->bhtail = bh;
 req->rq_dev = bh->b_rdev;
 req->e = elevator;
 add_request(q, req, head, latency); /* queue the request for the disk driver */
out:
 if (!q->plugged) 
 (q->request_fn)(q); /* see ide_init_queue: initialized to do_ide_request */

 if (freereq)
 blkdev_release_request(freereq);
 spin_unlock_irq(&io_request_lock);
 return 0;
end_io:
 bh->b_end_io(bh, test_bit(BH_Uptodate, &bh->b_state));
 return 0;
}
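
 The two merge cases in __make_request come down to sector contiguity. The sketch below keeps only that test and the bookkeeping of sector and nr_sectors; the real elevator also checks the command, the device and the max_sectors/max_segments limits. All toy_ names are invented for the example.

```c
#include <assert.h>

enum toy_merge { TOY_NO_MERGE, TOY_FRONT_MERGE, TOY_BACK_MERGE };

struct toy_request {
    unsigned long sector;       /* first sector of the request */
    unsigned long nr_sectors;   /* length in sectors */
};

/* Try to merge a buffer covering [sector, sector + count) into req.
 * Returns what kind of merge was performed. */
static enum toy_merge toy_try_merge(struct toy_request *req,
                                    unsigned long sector, unsigned long count)
{
    assert(req != 0 && count > 0);

    if (req->sector + req->nr_sectors == sector) {
        /* buffer starts right after the request: append at the tail */
        req->nr_sectors += count;
        return TOY_BACK_MERGE;
    }
    if (sector + count == req->sector) {
        /* buffer ends right before the request: it becomes the new head */
        req->sector = sector;
        req->nr_sectors += count;
        return TOY_FRONT_MERGE;
    }
    return TOY_NO_MERGE;        /* not adjacent: a new request is needed */
}
```

 Starting from a request for sectors 100-107, a buffer at sector 108 is back-merged and a buffer at sector 92 is front-merged, leaving one request for sectors 92-115.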
 The meaning of q->plugged is discussed later. First, what does do_ide_request do?
void do_ide_request(request_queue_t *q)
{
 ide_do_request(q->queuedata, 0);
}
static void ide_do_request(ide_hwgroup_t *hwgroup, int masked_irq)
{
 ide_drive_t *drive;
 ide_hwif_t *hwif;
 ide_startstop_t startstop;

 ide_get_lock(&ide_lock, ide_intr, hwgroup); /* for atari only: POSSIBLY BROKEN HERE(?) */

 __cli(); /* necessary paranoia: ensure IRQs are masked on local CPU */

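 while (!hwgroup->busy) { /* work is done only while the hwgroup is idle; otherwise this function does nothing */
 hwgroup->busy = 1; /* If busy is already set, another thread has entered this loop.
 The first thread to enter handles the requests of every drive
 attached to this hwgroup. All drives of a hwgroup share one
 interrupt.
 */
 drive = choose_drive(hwgroup); /* Choose a drive. The request handled is not necessarily the one
 just submitted: you may be reading hda while hdc is chosen here.
 Only a drive with drive->queue.plugged == 0 can be chosen; a
 plugged queue is still collecting requests and is started later,
 when tq_disk runs and unplugs it. Once a request is in flight,
 further calls to ide_do_request for this hwgroup come from the
 interrupt path (ide_intr) rather than from this thread.
 */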
 if (drive == NULL) {
 unsigned long sleep = 0;
 hwgroup->rq = NULL;
 drive = hwgroup->drive;
 do {
 if (drive->sleep && (!sleep || 0 < (signed long)(sleep - drive->sleep)))
 sleep = drive->sleep;
 } while ((drive = drive->next) != hwgroup->drive);
 if (sleep) {
 /*
 * Take a short snooze, and then wake up this hwgroup again.
 * This gives other hwgroups on the same a chance to
 * play fairly with us, just in case there are big differences
 * in relative throughputs.. don't want to hog the cpu too much.
 */
 if (0 < (signed long)(jiffies + WAIT_MIN_SLEEP - sleep)) 
 sleep = jiffies + WAIT_MIN_SLEEP;
#if 1
 if (timer_pending(&hwgroup->timer))
 printk("ide_set_handler: timer already active\n");
#endif
 hwgroup->sleeping = 1; /* so that ide_timer_expiry knows what to do */
 mod_timer(&hwgroup->timer, sleep);
 /* we purposely leave hwgroup->busy==1 while sleeping */
 } else {
 /* Ugly, but how can we sleep for the lock otherwise? perhaps from tq_disk? */
 ide_release_lock(&ide_lock); /* for atari only */
 hwgroup->busy = 0;
 }
 return; /* no more work for this hwgroup (for now) */
 }
 hwif = HWIF(drive);
 if (hwgroup->hwif->sharing_irq && hwif != hwgroup->hwif && hwif->io_ports[IDE_CONTROL_OFFSET]) {
 /* set nIEN for previous hwif */
 SELECT_INTERRUPT(hwif, drive);
 }
 hwgroup->hwif = hwif;
 hwgroup->drive = drive;
 drive->sleep = 0;
 drive->service_start = jiffies;

 if ( drive->queue.plugged ) /* paranoia */
 printk("%s: Huh? nuking plugged queue\n", drive->name);
 hwgroup->rq = blkdev_entry_next_request(&drive->queue.queue_head);
 /*
 * Some systems have trouble with IDE IRQs arriving while
 * the driver is still setting things up. So, here we disable
 * the IRQ used by this interface while the request is being started.
 * This may look bad at first, but pretty much the same thing
 * happens anyway when any interrupt comes in, IDE or otherwise
 * -- the kernel masks the IRQ while it is being handled.
 */
 if (masked_irq && hwif->irq != masked_irq)
 disable_irq_nosync(hwif->irq);
 spin_unlock(&io_request_lock);
 ide__sti(); /* allow other IRQs while we start this request */
 startstop = start_request(drive);
 spin_lock_irq(&io_request_lock);
 if (masked_irq && hwif->irq != masked_irq)
 enable_irq(hwif->irq);
 if (startstop == ide_stopped)
 hwgroup->busy = 0;
 }
}

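 At this point in the IDE analysis we reach the 'core' logic of disk operation: __make_request, ide_do_request, plugged, ide_intr, and tq_disk.
__make_request and tq_disk are mainly responsible for scheduling disk read/write requests. ide_do_request and ide_intr carry out the operations on the IDE interface and actually read and write the disk. When __make_request receives the first request (the queue is empty), it puts the request on the queue, sets the plug flag, and registers with tq_disk, which postpones the call to ide_do_request. Later requests are first scheduled (merged or sorted), and only then is it decided whether to start a hardware operation right away. Once a request has been issued to the hardware (ide_do_request has run), the interrupt handler takes over calling ide_do_request; by then the queue's plug flag has been cleared and the hwgroup's busy flag is set. (While a queue is plugged and waiting on tq_disk no hardware operation is started for it, because ide_do_request only picks queues that are not plugged.)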
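
 The plug/unplug life cycle can be modelled in a few lines. This is a deliberately simplified sketch: the hardware is pretended to take every pending request at once, there is no interrupt path, and struct toy_plug_queue and the toy_ functions are invented names that only mimic the roles of generic_plug_device, tq_disk and request_fn.

```c
#include <assert.h>

struct toy_plug_queue {
    int plugged;            /* queue is collecting requests */
    int unplug_scheduled;   /* stands in for an entry on tq_disk */
    int pending;            /* requests waiting on the queue */
    int dispatched;         /* requests handed to the "hardware" */
};

/* Stand-in for request_fn (do_ide_request): start everything pending. */
static void toy_request_fn(struct toy_plug_queue *q)
{
    q->dispatched += q->pending;
    q->pending = 0;
}

/* Stand-in for the tail of __make_request. */
static void toy_add_request(struct toy_plug_queue *q)
{
    assert(q != 0);
    if (q->pending == 0 && !q->plugged) {
        /* first request on an empty queue: plug and defer the start */
        q->plugged = 1;
        q->unplug_scheduled = 1;
    }
    q->pending++;           /* while plugged, requests only accumulate */
}

/* Stand-in for running tq_disk: unplug and start the hardware. */
static void toy_run_tq_disk(struct toy_plug_queue *q)
{
    assert(q != 0);
    if (!q->unplug_scheduled)
        return;
    q->unplug_scheduled = 0;
    q->plugged = 0;
    if (q->pending)
        toy_request_fn(q);
}
```

 Three requests added back to back stay on the plugged queue, where the elevator would have the chance to merge and sort them; nothing reaches the hardware until toy_run_tq_disk is called.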

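After the interrupt handler takes over the calls to ide_do_request, not all read/write requests are necessarily processed; that depends on the scheduling among drives, which is the job of ide_do_request. Note head_active here: for IDE it is always 1, which means that when requests are scheduled on an unplugged queue, the first req must not be touched (an unplugged queue may have I/O in progress, i.e. ide_intr may already be performing the real I/O for that request).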

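A plugged queue is really waiting for its requests to be scheduled so that better I/O throughput can be achieved, but it cannot wait forever. Searching the kernel for tq_disk shows that there are many places where throughput is balanced against latency; the details are not listed here. The code that really drives the IDE hardware is start_request and drive->do_request (do_rw_disk for an IDE disk):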

/*
 * do_rw_disk() issues READ and WRITE commands to a disk,
 * using LBA if supported, or CHS otherwise, to address sectors.
 * It also takes care of issuing special DRIVE_CMDs.
 */
static ide_startstop_t do_rw_disk (ide_drive_t *drive, struct request *rq, unsigned long block)
{
 if (IDE_CONTROL_REG)
 OUT_BYTE(drive->ctl,IDE_CONTROL_REG);
 OUT_BYTE(rq->nr_sectors,IDE_NSECTOR_REG);

if (drive->select.b.lba) { /* LBA: as can be seen, the 2.4.0 kernel does not yet support 48-bit LBA and so cannot handle disks larger than 137 GB */

#ifdef DEBUG
 printk("%s: %sing: LBAsect=%ld, sectors=%ld, buffer=0x%08lx\n",
 drive->name, (rq->cmd==READ)?"read":"writ",
 block, rq->nr_sectors, (unsigned long) rq->buffer);
#endif
 OUT_BYTE(block,IDE_SECTOR_REG);
 OUT_BYTE(block>>=8,IDE_LCYL_REG);
 OUT_BYTE(block>>=8,IDE_HCYL_REG);
 OUT_BYTE(((block>>8)&0x0f)|drive->select.all,IDE_SELECT_REG);
 } else {
 unsigned int sect,head,cyl,track;
 track = block / drive->sect;
 sect = block % drive->sect + 1;
 OUT_BYTE(sect,IDE_SECTOR_REG);
 head = track % drive->head;
 cyl = track / drive->head;
 OUT_BYTE(cyl,IDE_LCYL_REG);
 OUT_BYTE(cyl>>8,IDE_HCYL_REG);
 OUT_BYTE(head|drive->select.all,IDE_SELECT_REG);
#ifdef DEBUG
 printk("%s: %sing: CHS=%d/%d/%d, sectors=%ld, buffer=0x%08lx\n",
 drive->name, (rq->cmd==READ)?"read":"writ", cyl,
 head, sect, rq->nr_sectors, (unsigned long) rq->buffer);
#endif
 }
#ifdef CONFIG_BLK_DEV_PDC4030
 if (IS_PDC4030_DRIVE) {
 extern ide_startstop_t do_pdc4030_io(ide_drive_t *, struct request *);
 return do_pdc4030_io (drive, rq);
 }
#endif /* CONFIG_BLK_DEV_PDC4030 */
 if (rq->cmd == READ) {
#ifdef CONFIG_BLK_DEV_IDEDMA
 if (drive->using_dma && !(HWIF(drive)->dmaproc(ide_dma_read, drive)))
 return ide_started;
#endif /* CONFIG_BLK_DEV_IDEDMA */
 ide_set_handler(drive, &read_intr, WAIT_CMD, NULL);
 OUT_BYTE(drive->mult_count ? WIN_MULTREAD : WIN_READ, IDE_COMMAND_REG);
 return ide_started;
 }
 if (rq->cmd == WRITE) {
 ide_startstop_t startstop;
#ifdef CONFIG_BLK_DEV_IDEDMA
 if (drive->using_dma && !(HWIF(drive)->dmaproc(ide_dma_write, drive)))
 return ide_started;
#endif /* CONFIG_BLK_DEV_IDEDMA */
 OUT_BYTE(drive->mult_count ? WIN_MULTWRITE : WIN_WRITE, IDE_COMMAND_REG);
 if (ide_wait_stat(&startstop, drive, DATA_READY, drive->bad_wstat, WAIT_DRQ)) {
 printk(KERN_ERR "%s: no DRQ after issuing %s\n", drive->name,
 drive->mult_count ? "MULTWRITE" : "WRITE");
 return startstop;
 }
 if (!drive->unmask)
 __cli(); /* local CPU only */
 if (drive->mult_count) {
 ide_hwgroup_t *hwgroup = HWGROUP(drive);
 /*
 * Ugh.. this part looks ugly because we MUST set up
 * the interrupt handler before outputting the first block
 * of data to be written. If we hit an error (corrupted buffer list)
 * in ide_multwrite(), then we need to remove the handler/timer
 * before returning. Fortunately, this NEVER happens (right?).
 *
 * Except when you get an error it seems...
 */
 hwgroup->wrq = *rq; /* scratchpad */
 ide_set_handler (drive, &multwrite_intr, WAIT_CMD, NULL);
 if (ide_multwrite(drive, drive->mult_count)) {
 unsigned long flags;
 spin_lock_irqsave(&io_request_lock, flags);
 hwgroup->handler = NULL;
 del_timer(&hwgroup->timer);
 spin_unlock_irqrestore(&io_request_lock, flags);
 return ide_stopped;
 }
 } else {
 ide_set_handler (drive, &write_intr, WAIT_CMD, NULL);
 idedisk_output_data(drive, rq->buffer, SECTOR_WORDS);
 }
 return ide_started;
 }
 printk(KERN_ERR "%s: bad command: %d\n", drive->name, rq->cmd);
 ide_end_request(0, HWGROUP(drive));
 return ide_stopped;
}
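
 The two addressing branches of do_rw_disk can be reproduced outside the kernel. The sketch below mirrors the arithmetic only; struct toy_chs, struct toy_lba28_regs and both functions are illustrative names, and no port I/O is performed.

```c
#include <assert.h>
#include <stdint.h>

struct toy_chs {
    unsigned cyl;
    unsigned head;
    unsigned sect;    /* 1-based, as the hardware expects */
};

/* CHS branch: same arithmetic as the else-part of do_rw_disk. */
static struct toy_chs lba_to_chs(unsigned long block,
                                 unsigned heads, unsigned sectors_per_track)
{
    struct toy_chs r;
    unsigned long track;

    assert(heads > 0 && sectors_per_track > 0);
    track  = block / sectors_per_track;
    r.sect = (unsigned)(block % sectors_per_track) + 1;
    r.head = (unsigned)(track % heads);
    r.cyl  = (unsigned)(track / heads);
    return r;
}

/* LBA branch: a 28-bit block number spread over four task-file registers. */
struct toy_lba28_regs {
    uint8_t sector;       /* bits 0-7   -> IDE_SECTOR_REG */
    uint8_t lcyl;         /* bits 8-15  -> IDE_LCYL_REG */
    uint8_t hcyl;         /* bits 16-23 -> IDE_HCYL_REG */
    uint8_t select_low;   /* bits 24-27 -> low nibble of IDE_SELECT_REG */
};

static struct toy_lba28_regs lba28_split(uint32_t block)
{
    struct toy_lba28_regs r;

    assert(block <= 0x0fffffffu);   /* only 28 bits fit */
    r.sector     = (uint8_t)(block & 0xff);
    r.lcyl       = (uint8_t)((block >> 8) & 0xff);
    r.hcyl       = (uint8_t)((block >> 16) & 0xff);
    r.select_low = (uint8_t)((block >> 24) & 0x0f);
    return r;
}
```

 On a 16-head, 63-sector geometry, block 1000 maps to cylinder 0, head 15, sector 56. The four-bit select_low field is the reason this code path stops at 2^28 sectors.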