2006-12-21
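Before reading the fs code: an ordinary filesystem contains many parts that interact with the disk. To understand those operations better, I first study the relevant parts of the IDE driver, which also clears up some questions that had long been fuzzy to me.
Large-capacity disk issues
The original problem is that the designers of the BIOS and the designers of the ATA interface never agreed with each other: the BIOS and ATA CHS schemes differ both in the total number of address bits and in how many bits go to cylinder, head and sector. Worse, apparently nobody foresaw that disk capacity would grow so fast.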
First, the capacity limits of the various interfaces:

BIOS int 13h interface:
            Cylinder  Head  Sector  Total    Limit
    bits    10        8     6       24 bits  8.4 GB

ATA interface:
            Cylinder  Head  Sector  Total    Limit     Limit reached
    bits    16        4     8       28 bits  137.4 GB  Sept 2001 (160 GB Maxtor DiamondMax)

Extended int 13h interface (supported by essentially every BIOS since 1997):
    bits    8 bytes * 8 = 64-bit LBA number, 9.4 trillion gigabytes!

ATA-6 interface:
    bits    6 bytes * 8 = 48-bit LBA number, less than extended int 13h, but large enough!
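As shown, the BIOS was designed to address at most about 8 GB, and ATA-1 through ATA-5 can address only 137 GB. These are design flaws, and the two limits follow directly from them.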
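The limits above are plain arithmetic on the field widths, so they are easy to recompute. A minimal sketch, assuming 512-byte sectors; the function names are mine and purely illustrative. Sector numbers in CHS start at 1, so a 6-bit sector field gives 63 usable values, not 64.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_BYTES 512ULL   /* assumption: classic 512-byte sectors */

/* Addressable bytes for a CHS scheme with the given field widths.
 * If sector_from_one is non-zero, sector number 0 is not usable,
 * so an n-bit sector field yields 2^n - 1 sectors per track. */
uint64_t chs_limit_bytes(unsigned cyl_bits, unsigned head_bits,
                         unsigned sect_bits, int sector_from_one)
{
    assert(sect_bits > 0 && cyl_bits + head_bits + sect_bits < 55);
    uint64_t cyls  = 1ULL << cyl_bits;
    uint64_t heads = 1ULL << head_bits;
    uint64_t sects = (1ULL << sect_bits) - (sector_from_one ? 1 : 0);
    return cyls * heads * sects * SECTOR_BYTES;
}

/* Addressable bytes for a flat LBA number of the given width. */
uint64_t lba_limit_bytes(unsigned lba_bits)
{
    assert(lba_bits < 55);   /* keep the product inside 64 bits */
    return (1ULL << lba_bits) * SECTOR_BYTES;
}
```

With these, chs_limit_bytes(10, 8, 6, 1) gives the int 13h limit, lba_limit_bytes(28) the 137.4 GB ATA limit, and lba_limit_bytes(48) the ATA-6 limit.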
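Before 8 GB was reached, another limit worth noting was the 528 MB limit. It comes from combining BIOS int 13h with ATA: an old BIOS passes the CHS values supplied by the caller straight to the ATA drive, so each field is limited to the narrower of the two widths (BIOS mode: Normal):

            Cylinder  Head  Sector  Total    Limit   Became a problem
    bits    10        4     6       20 bits  528 MB  around 1993

(1024 * 16 * 63 sectors * 512 bytes is about 528 MB; sector numbers start at 1, so 6 bits give 63 sectors.)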
To solve this problem the BIOS introduced Extended CHS (ECHS), called "large mode" in some BIOSes. Take a 2.95 GiB disk as an example, for which the hardware reports a CHS geometry of 6136/16/63:

                                 Cylinders    Heads          Sectors  Capacity
    IDE/ATA Limits               65,536       16             256      128 GiB
    Hard Disk Logical Geometry   6,136        16             63       2.95 GiB
    BIOS Translation Factor      divide by 8  multiply by 8
    BIOS Translated Geometry     767          128            63       2.95 GiB
    BIOS Int 13h Limits          1,024        256            63       7.88 GiB
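The translation in the table can be reproduced with a simple shift loop. This is only a sketch of the idea (halve the cylinders and double the heads until the cylinder count fits into 10 bits); real BIOSes differ in the details, and the struct and function names here are made up for illustration.

```c
#include <assert.h>

struct geometry {
    unsigned cyls;
    unsigned heads;
    unsigned sects;
};

/* Hypothetical ECHS-style translation: trade cylinders for heads
 * by factors of two until cyls <= 1024 or heads would exceed 256. */
struct geometry echs_translate(struct geometry hw)
{
    assert(hw.cyls > 0 && hw.heads > 0 && hw.sects > 0);
    struct geometry g = hw;
    while (g.cyls > 1024 && g.heads * 2 <= 256) {
        g.cyls /= 2;    /* integer division may drop a partial cylinder */
        g.heads *= 2;
    }
    return g;
}
```

For the 6136/16/63 disk the loop runs three times (a factor of 8) and yields 767/128/63, the same total of 6,185,088 sectors.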
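The 8 GB limit can only be broken by using extended int 13h together with LBA mode (ATA). If a program still uses the old int 13h while the drive is in LBA mode, the BIOS converts the CHS address directly into an LBA address; this is called assisted LBA. Either way, as long as the CHS-based int 13h interface is used, whether through ECHS translation or assisted LBA, the 8.5 GB limit cannot be exceeded. One more point worth mentioning: the maximum number of heads is 255, not 256, because DOS and Windows 95 cannot handle 256 heads, so the real limit is slightly below 8.5 GB.
ATA specifies that drives larger than 8.4 GB report a CHS geometry of 16383/16/63. This means the `geometry' is obsolete: the total size of the disk can no longer be computed from it and can only be taken from the LBA capacity field returned by the IDENTIFY command. Drives larger than 137.4 GB report an LBA capacity of 0xfffffff = 268435455 sectors (137 GB); the true disk size is in the new 48-bit field.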
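How a driver might pick the capacity out of the IDENTIFY data can be sketched as follows. The word offsets are my reading of the ATA-6 layout (words 60-61 for the 28-bit capacity, bit 10 of word 83 for 48-bit support, words 100-103 for the 48-bit capacity) and should be treated as an assumption to check against the spec; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Returns the number of addressable sectors described by a 256-word
 * IDENTIFY block. Assumed layout (from ATA-6, not verified here):
 *   words 60-61   : LBA28 capacity
 *   word  83 bit10: 48-bit address feature set supported
 *   words 100-103 : LBA48 capacity */
uint64_t identify_capacity_sectors(const uint16_t *id)
{
    assert(id != 0);
    uint32_t lba28 = (uint32_t)id[60] | ((uint32_t)id[61] << 16);
    int has_lba48 = (id[83] & (1u << 10)) != 0;

    if (has_lba48) {
        uint64_t lba48 = (uint64_t)id[100]
                       | ((uint64_t)id[101] << 16)
                       | ((uint64_t)id[102] << 32)
                       | ((uint64_t)id[103] << 48);
        if (lba48 > lba28)
            return lba48;   /* drive is larger than the 28-bit field can say */
    }
    return lba28;
}
```

A drive beyond 137 GB would then show 0x0fffffff in the 28-bit field and its real size only in the 48-bit one.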
Linux support for large disks:
    > 8.4 GB:  kernel should be 2.0.34 or later.
    > 33.8 GB: kernel should be 2.0.39/2.2.14/2.3.21 or later.
    > 137 GB:  kernel should be 2.4.19/2.5.3 or later.
To check whether a given version of Linux supports large disks, look at the function do_rw_disk (ide-disk.c).
References:
1. Large Disk Drives >8.4Gb (in addition, an IBM doc is attached) http://www-oss.fnal.gov/projects/fermilinux/common/faq/old/0009.html
2. PC Guide, hard disk drives http://www.pcguide.com/ref/hdd/index.htm
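Block size issues
When analyzing mm I discussed several questions about do_generic_file_read; the key is to understand the basic idea. First, file data on disk is cached in memory as far as possible, so that reads and writes are faster. The basic unit of that cache is the memory page, which on i386 is usually 4 KB. When a file is read through the cache, the byte-granular offset and size given by the user are first converted into 4 KB pages, so that data can be copied to the user directly. If the data is not in the cache it has to be read from disk, for example one page-sized chunk at a time through block_read_full_page.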
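A file lives on a concrete filesystem, and that filesystem has its own allocation unit: the block. For ext2, the block is the smallest unit ext2 can allocate. The common ext2 block size is 1 KB; 2 KB and 4 KB are also possible, but it cannot exceed 4 KB (see ext2_read_super). For a file stored on ext2, all the blocks that belong to it are recorded in the file's on-disk inode, as the block number of each block. On ext2 with 1 KB blocks, block numbers start at 1 (block 0 is the boot block) and go up as far as the disk capacity allows. In this way every file is divided, in units of the block size, into blocks numbered from 0. Each file is such a linear space, mapped through an array in the inode onto the block space of the ext2 filesystem.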
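The two conversions described so far (byte offset to page, page to file-relative block) are just shifts. A small sketch, assuming 4 KB pages; the struct and function are mine and only illustrate the arithmetic, they are not kernel code.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12   /* assumption: 4 KB pages, as on i386 */

struct file_pos {
    uint64_t page_index;       /* which page of the file */
    unsigned page_offset;      /* byte offset inside that page */
    unsigned blocks_per_page;  /* filesystem blocks covering one page */
    uint64_t first_block;      /* file-relative index of the page's first block */
};

/* block_bits is log2 of the filesystem block size (10 for 1 KB). */
struct file_pos locate(uint64_t offset, unsigned block_bits)
{
    assert(block_bits >= 9 && block_bits <= PAGE_SHIFT);
    struct file_pos p;
    p.page_index      = offset >> PAGE_SHIFT;
    p.page_offset     = (unsigned)(offset & ((1u << PAGE_SHIFT) - 1));
    p.blocks_per_page = 1u << (PAGE_SHIFT - block_bits);
    p.first_block     = p.page_index << (PAGE_SHIFT - block_bits);
    return p;
}
```

For example, byte offset 10000 on a filesystem with 1 KB blocks falls into page 2 at offset 1808, and that page is backed by file blocks 8 through 11. The inode's block array then maps those file-relative indices to on-disk block numbers.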
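After this step we have to deal with the disk. The usual interface is bread(block#, size).

struct buffer_head * bread(kdev_t dev, int block, int size)
{
    struct buffer_head * bh;

    bh = getblk(dev, block, size);  /* bh carries the block number and the block size */
    if (buffer_uptodate(bh))
        return bh;
    ll_rw_block(READ, 1, &bh);      /* hand over to the disk driver */
    wait_on_buffer(bh);
    if (buffer_uptodate(bh))
        return bh;
    brelse(bh);
    return NULL;
}

This function reads the block numbered block, with the block size given by size. Seen from another angle, bread offers access to the disk the way the filesystem understands the disk: size gives the block size and block selects which block to read. How the disk is divided into sectors is not the caller's concern.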
IDE Driver overview
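Starting from bread, let us see how the disk driver reads disk sectors. bread was covered above, so we begin at ll_rw_block. The code looks complicated, but the main line is not: a callback is attached to the buffer; when the disk finishes the read, the callback sets the uptodate bit of the bh and wakes up any task waiting for this bh. When the request is submitted, it is placed on a queue; all queued requests are then scheduled to improve disk I/O speed, and the reads are carried out according to the scheduling result.

void ll_rw_block(int rw, int nr, struct buffer_head * bhs[])
{
    unsigned int major;
    int correct_size;
    int i;

    /* First, a series of checks: */
    /* 1. Determine correct block size for this device. */
    /* 2. Verify requested block sizes. */
    /* 3. For a write, check whether the device allows writing. */

    /* Then set b_end_io to end_buffer_io_sync for each bh;
     * waiting processes are notified through this function. */
    for (i = 0; i < nr; i++) {
        struct buffer_head *bh;
        bh = bhs[i];

        /* Only one thread can actually submit the I/O. */
        if (test_and_set_bit(BH_Lock, &bh->b_state))
            continue;

        /* We have the buffer lock */
        bh->b_end_io = end_buffer_io_sync;

        ......... /* handle some possible race conditions */

        submit_bh(rw, bh);  /* submit the request to the disk driver */
    }
    return;
    .... /* cleanup */
}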
Next, the request is submitted to the disk driver through submit_bh:

void submit_bh(int rw, struct buffer_head * bh)
{
    if (!test_bit(BH_Lock, &bh->b_state))
        BUG();

    set_bit(BH_Req, &bh->b_state);

    /*
     * First step, 'identity mapping' - RAID or LVM might
     * further remap this.
     * Here the filesystem-defined block number (in units of b_size)
     * is converted into a sector number.
     */
    bh->b_rdev = bh->b_dev;
    bh->b_rsector = bh->b_blocknr * (bh->b_size>>9);

    generic_make_request(rw, bh);

    switch (rw) {
        case WRITE:
            kstat.pgpgout++;
            break;
        default:
            kstat.pgpgin++;
            break;
    }
}
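The block-to-sector conversion in submit_bh is worth isolating, because it is the single point where the filesystem's view (blocks of b_size bytes) meets the driver's view (512-byte sectors). A standalone sketch of the same arithmetic; the function name is mine.

```c
#include <assert.h>

/* Same arithmetic as b_rsector = b_blocknr * (b_size >> 9):
 * block_size >> 9 is the number of 512-byte sectors per block. */
unsigned long block_to_sector(unsigned long blocknr, unsigned int block_size)
{
    /* block sizes are powers of two and at least one sector */
    assert(block_size >= 512 && (block_size & (block_size - 1)) == 0);
    return blocknr * (block_size >> 9);
}
```

So block 100 of a filesystem with 1 KB blocks starts at sector 200 of the device, and with 4 KB blocks at sector 800.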
Now look at how the request is handed to the disk driver:

void generic_make_request (int rw, struct buffer_head * bh)
{
    int major = MAJOR(bh->b_rdev);
    request_queue_t *q;

    ..... /* check that the requested range exists on the disk, e.g. does not go past the last sector */

    /*
     * Resolve the mapping until finished. (drivers are
     * still free to implement/resolve their own stacking
     * by explicitly returning 0)
     */
    /* NOTE: we don't repeat the blk_size check for each new device.
     * Stacking drivers are expected to know what they are doing.
     */
    do {
        q = blk_get_queue(bh->b_rdev);
        if (!q) {
            printk(KERN_ERR
                   "generic_make_request: Trying to access nonexistent block-device %s (%ld)\n",
                   kdevname(bh->b_rdev), bh->b_rsector);
            buffer_IO_error(bh);
            break;
        }

    } while (q->make_request_fn(q, rw, bh)); /* see blk_init_queue: initialized to __make_request */
}
A while loop is used here to submit the request, but for IDE this is unnecessary: __make_request always returns 0.

static int __make_request(request_queue_t * q, int rw, struct buffer_head * bh)
{
    unsigned int sector, count;
    int max_segments = MAX_SEGMENTS;
    struct request * req = NULL, *freereq = NULL;
    int rw_ahead, max_sectors, el_ret;
    struct list_head *head;
    int latency;
    elevator_t *elevator = &q->elevator;

again:
    ........

    if (list_empty(head)) {
        q->plug_device_fn(q, bh->b_rdev); /* is atomic */
        /* for IDE this function is generic_plug_device, see blk_init_queue */
        goto get_rq;
    }

    el_ret = elevator->elevator_merge_fn(q, &req, bh, rw, &max_sectors, &max_segments);
    switch (el_ret) {

        case ELEVATOR_BACK_MERGE:
            if (!q->back_merge_fn(q, req, bh, max_segments))
                break;
            req->bhtail->b_reqnext = bh;
            req->bhtail = bh;
            req->nr_sectors = req->hard_nr_sectors += count;
            req->e = elevator;
            drive_stat_acct(req->rq_dev, req->cmd, count, 0);
            attempt_back_merge(q, req, max_sectors, max_segments);
            goto out;

        case ELEVATOR_FRONT_MERGE:
            if (!q->front_merge_fn(q, req, bh, max_segments))
                break;
            bh->b_reqnext = req->bh;
            req->bh = bh;
            req->buffer = bh->b_data;
            req->current_nr_sectors = count;
            req->sector = req->hard_sector = sector;
            req->nr_sectors = req->hard_nr_sectors += count;
            req->e = elevator;
            drive_stat_acct(req->rq_dev, req->cmd, count, 0);
            attempt_front_merge(q, head, req, max_sectors, max_segments);
            goto out;
        /*
         * elevator says don't/can't merge. get new request
         */
        case ELEVATOR_NO_MERGE:
            break;

        default:
            printk("elevator returned crap (%d)\n", el_ret);
            BUG();
    }

    /*
     * Grab a free request from the freelist. Read first try their
     * own queue - if that is empty, we steal from the write list.
     * Writes must block if the write list is empty, and read aheads
     * are not crucial.
     */
get_rq:
    if (freereq) {
        req = freereq;
        freereq = NULL;
    } else if ((req = get_request(q, rw)) == NULL) {
        spin_unlock_irq(&io_request_lock);
        if (rw_ahead)
            goto end_io;

        freereq = __get_request_wait(q, rw);
        goto again;
    }

    /* fill up the request-info, and add it to the queue */
    req->cmd = rw;
    req->errors = 0;
    req->hard_sector = req->sector = sector;
    req->hard_nr_sectors = req->nr_sectors = count;
    req->current_nr_sectors = count;
    req->nr_segments = 1; /* Always 1 for a new request. */
    req->nr_hw_segments = 1; /* Always 1 for a new request. */
    req->buffer = bh->b_data;
    req->sem = NULL;
    req->bh = bh;
    req->bhtail = bh;
    req->rq_dev = bh->b_rdev;
    req->e = elevator;
    add_request(q, req, head, latency); /* hand over to the disk driver */
out:
    if (!q->plugged)
        (q->request_fn)(q); /* see ide_init_queue, which initializes it to do_ide_request */
    if (freereq)
        blkdev_release_request(freereq);
    spin_unlock_irq(&io_request_lock);
    return 0;
end_io:
    bh->b_end_io(bh, test_bit(BH_Uptodate, &bh->b_state));
    return 0;
}

I will come back to the meaning of q->plugged later. First, what does do_ide_request do:

void do_ide_request(request_queue_t *q)
{
    ide_do_request(q->queuedata, 0);
}

static void ide_do_request(ide_hwgroup_t *hwgroup, int masked_irq)
{
    ide_drive_t *drive;
    ide_hwif_t *hwif;
    ide_startstop_t startstop;

    ide_get_lock(&ide_lock, ide_intr, hwgroup); /* for atari only: POSSIBLY BROKEN HERE(?) */

    __cli(); /* necessary paranoia: ensure IRQs are masked on local CPU */

    while (!hwgroup->busy) {
        /* Work is done only while the hwgroup is not busy;
         * otherwise this function is a no-op. */
        hwgroup->busy = 1;
        /* If busy is set, another context has already entered this
         * loop. The first thread to enter it handles the requests of
         * all drives attached to this hwgroup. One hwgroup shares a
         * single interrupt. */
        drive = choose_drive(hwgroup);
        /* Choose a drive. The request handled here is not necessarily
         * the one you just submitted: you may be reading hda while hdc
         * gets picked. Note that a drive is picked only if
         * drive->queue.plugged == 0; a plugged queue is still
         * collecting requests. Once a drive has started processing
         * requests, this thread no longer needs to call
         * ide_do_request for it: the interrupt path
         * ide_intr -> ide_do_request gets the CPU to process
         * further requests. */
        if (drive == NULL) {
            unsigned long sleep = 0;
            hwgroup->rq = NULL;
            drive = hwgroup->drive;
            do {
                if (drive->sleep && (!sleep || 0 < (signed long)(sleep - drive->sleep)))
                    sleep = drive->sleep;
            } while ((drive = drive->next) != hwgroup->drive);
            if (sleep) {
                /*
                 * Take a short snooze, and then wake up this hwgroup again.
                 * This gives other hwgroups on the same a chance to
                 * play fairly with us, just in case there are big differences
                 * in relative throughputs.. don't want to hog the cpu too much.
                 */
                if (0 < (signed long)(jiffies + WAIT_MIN_SLEEP - sleep))
                    sleep = jiffies + WAIT_MIN_SLEEP;
#if 1
                if (timer_pending(&hwgroup->timer))
                    printk("ide_set_handler: timer already active\n");
#endif
                hwgroup->sleeping = 1; /* so that ide_timer_expiry knows what to do */
                mod_timer(&hwgroup->timer, sleep);
                /* we purposely leave hwgroup->busy==1 while sleeping */
            } else {
                /* Ugly, but how can we sleep for the lock otherwise? perhaps from tq_disk? */
                ide_release_lock(&ide_lock); /* for atari only */
                hwgroup->busy = 0;
            }
            return; /* no more work for this hwgroup (for now) */
        }
        hwif = HWIF(drive);
        if (hwgroup->hwif->sharing_irq && hwif != hwgroup->hwif && hwif->io_ports[IDE_CONTROL_OFFSET]) {
            /* set nIEN for previous hwif */
            SELECT_INTERRUPT(hwif, drive);
        }
        hwgroup->hwif = hwif;
        hwgroup->drive = drive;
        drive->sleep = 0;
        drive->service_start = jiffies;

        if ( drive->queue.plugged ) /* paranoia */
            printk("%s: Huh? nuking plugged queue\n", drive->name);
        hwgroup->rq = blkdev_entry_next_request(&drive->queue.queue_head);
        /*
         * Some systems have trouble with IDE IRQs arriving while
         * the driver is still setting things up. So, here we disable
         * the IRQ used by this interface while the request is being started.
         * This may look bad at first, but pretty much the same thing
         * happens anyway when any interrupt comes in, IDE or otherwise
         * -- the kernel masks the IRQ while it is being handled.
         */
        if (masked_irq && hwif->irq != masked_irq)
            disable_irq_nosync(hwif->irq);
        spin_unlock(&io_request_lock);
        ide__sti(); /* allow other IRQs while we start this request */
        startstop = start_request(drive);
        spin_lock_irq(&io_request_lock);
        if (masked_irq && hwif->irq != masked_irq)
            enable_irq(hwif->irq);
        if (startstop == ide_stopped)
            hwgroup->busy = 0;
    }
}
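Going back to the merge cases in __make_request: whether a new buffer can be appended to or prepended to an existing request comes down to whether the sector ranges are adjacent. The sketch below is a deliberately simplified model of that decision; the real elevator also compares device, command and segment counts, and all names here are hypothetical.

```c
#include <assert.h>

enum merge { NO_MERGE, FRONT_MERGE, BACK_MERGE };

/* A request reduced to the sector range it covers. */
struct req_span {
    unsigned long sector;      /* first sector */
    unsigned long nr_sectors;  /* length in sectors */
};

/* Decide how a buffer covering [bh_sector, bh_sector + bh_count)
 * relates to an existing request. Simplified: only adjacency and
 * a maximum request size are considered. */
enum merge classify_merge(const struct req_span *req,
                          unsigned long bh_sector,
                          unsigned long bh_count,
                          unsigned long max_sectors)
{
    assert(req != 0 && bh_count > 0);
    if (req->nr_sectors + bh_count > max_sectors)
        return NO_MERGE;                 /* merged request would be too big */
    if (req->sector + req->nr_sectors == bh_sector)
        return BACK_MERGE;               /* buffer starts where the request ends */
    if (bh_sector + bh_count == req->sector)
        return FRONT_MERGE;              /* buffer ends where the request starts */
    return NO_MERGE;
}
```

A back merge only extends nr_sectors, while a front merge also moves the request's starting sector, which matches the extra assignments in the ELEVATOR_FRONT_MERGE case above.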
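At this point in the IDE analysis we reach the 'core' logic of disk operation: __make_request, ide_do_request, plugged, ide_intr and tq_disk. __make_request and tq_disk are mainly responsible for scheduling disk read/write requests. ide_do_request and ide_intr carry out the operations on the IDE interface and actually read and write the disk. When __make_request receives the first request (the queue is empty), it puts the request on the queue, sets plugged, and adds the queue to tq_disk (deferring the call to ide_do_request). Later requests are first run through the scheduler, and only then is it decided whether to start a hardware operation right away. Once a request has been issued to the hardware (ide_do_request has run), the interrupt handler takes over the calls to ide_do_request; the queue's plugged flag is cleared and the hwgroup's busy flag is set. (While a queue is plugged and waiting on tq_disk, no hardware operation is started for it: ide_do_request only picks unplugged queues.)
Even after the interrupt handler has taken over the calls to ide_do_request, it will not necessarily finish all pending requests; that depends on the drive-level scheduling, since ide_do_request schedules among the drives. Note head_active here: for IDE it is always 1, which means that when requests are scheduled on an unplugged queue, the first request must not be touched (while unplugged, I/O may be in progress, i.e. ide_intr may already be doing the real I/O on it).
A plugged queue is really waiting for requests to accumulate and be scheduled, so as to get better I/O throughput. But it cannot wait forever. A search for tq_disk therefore shows many places in the kernel that balance throughput against latency. I will not list the details. The code that really drives the IDE hardware is start_request, which calls drive->do_request (do_rw_disk for IDE hard disks):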
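The interplay of plugged, busy and the interrupt handler is easier to follow in a toy model. The sketch below is not kernel code: it reduces a queue to three counters and one flag each for "plugged" and "busy", and every name in it is made up. It only illustrates the sequence described above: the first request plugs the queue, unplugging starts the hardware, and after that each completion interrupt starts the next request.

```c
#include <assert.h>

struct toy_queue {
    int plugged;    /* queue is collecting requests, hardware not started */
    int queued;     /* requests waiting in the queue */
    int busy;       /* one request is in flight on the hardware */
    int completed;  /* requests finished so far */
};

/* Plays the role of ide_do_request: start one request if possible. */
static void toy_request_fn(struct toy_queue *q)
{
    if (q->busy || q->plugged || q->queued == 0)
        return;
    q->queued--;
    q->busy = 1;
}

/* Plays the role of __make_request. */
void toy_submit(struct toy_queue *q)
{
    assert(q != 0);
    if (q->queued == 0 && !q->busy)
        q->plugged = 1;         /* empty queue: plug and defer the start */
    q->queued++;
    if (!q->plugged)
        toy_request_fn(q);      /* no-op while the hardware is busy */
}

/* Plays the role of running tq_disk. */
void toy_unplug(struct toy_queue *q)
{
    assert(q != 0);
    if (q->plugged) {
        q->plugged = 0;
        toy_request_fn(q);
    }
}

/* Plays the role of ide_intr: finish one request, start the next. */
void toy_interrupt(struct toy_queue *q)
{
    assert(q != 0);
    if (!q->busy)
        return;
    q->busy = 0;
    q->completed++;
    toy_request_fn(q);
}
```

Submitting three requests leaves the toy queue plugged with nothing in flight; one toy_unplug starts the first request, and from then on only toy_interrupt moves the queue forward.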
/*
 * do_rw_disk() issues READ and WRITE commands to a disk,
 * using LBA if supported, or CHS otherwise, to address sectors.
 * It also takes care of issuing special DRIVE_CMDs.
 */
static ide_startstop_t do_rw_disk (ide_drive_t *drive, struct request *rq, unsigned long block)
{
    if (IDE_CONTROL_REG)
        OUT_BYTE(drive->ctl,IDE_CONTROL_REG);
    OUT_BYTE(rq->nr_sectors,IDE_NSECTOR_REG);

    if (drive->select.b.lba) {
        /* LBA: as this shows, the 2.4.0 kernel does not yet support
         * 48-bit LBA, so it cannot handle disks > 137 GB */

#ifdef DEBUG
        printk("%s: %sing: LBAsect=%ld, sectors=%ld, buffer=0x%08lx\n",
               drive->name, (rq->cmd==READ)?"read":"writ",
               block, rq->nr_sectors, (unsigned long) rq->buffer);
#endif
        OUT_BYTE(block,IDE_SECTOR_REG);
        OUT_BYTE(block>>=8,IDE_LCYL_REG);
        OUT_BYTE(block>>=8,IDE_HCYL_REG);
        OUT_BYTE(((block>>8)&0x0f)|drive->select.all,IDE_SELECT_REG);
    } else {
        unsigned int sect,head,cyl,track;
        track = block / drive->sect;
        sect = block % drive->sect + 1;
        OUT_BYTE(sect,IDE_SECTOR_REG);
        head = track % drive->head;
        cyl = track / drive->head;
        OUT_BYTE(cyl,IDE_LCYL_REG);
        OUT_BYTE(cyl>>8,IDE_HCYL_REG);
        OUT_BYTE(head|drive->select.all,IDE_SELECT_REG);
#ifdef DEBUG
        printk("%s: %sing: CHS=%d/%d/%d, sectors=%ld, buffer=0x%08lx\n",
               drive->name, (rq->cmd==READ)?"read":"writ", cyl,
               head, sect, rq->nr_sectors, (unsigned long) rq->buffer);
#endif
    }
#ifdef CONFIG_BLK_DEV_PDC4030
    if (IS_PDC4030_DRIVE) {
        extern ide_startstop_t do_pdc4030_io(ide_drive_t *, struct request *);
        return do_pdc4030_io (drive, rq);
    }
#endif /* CONFIG_BLK_DEV_PDC4030 */
    if (rq->cmd == READ) {
#ifdef CONFIG_BLK_DEV_IDEDMA
        if (drive->using_dma && !(HWIF(drive)->dmaproc(ide_dma_read, drive)))
            return ide_started;
#endif /* CONFIG_BLK_DEV_IDEDMA */
        ide_set_handler(drive, &read_intr, WAIT_CMD, NULL);
        OUT_BYTE(drive->mult_count ? WIN_MULTREAD : WIN_READ, IDE_COMMAND_REG);
        return ide_started;
    }
    if (rq->cmd == WRITE) {
        ide_startstop_t startstop;
#ifdef CONFIG_BLK_DEV_IDEDMA
        if (drive->using_dma && !(HWIF(drive)->dmaproc(ide_dma_write, drive)))
            return ide_started;
#endif /* CONFIG_BLK_DEV_IDEDMA */
        OUT_BYTE(drive->mult_count ? WIN_MULTWRITE : WIN_WRITE, IDE_COMMAND_REG);
        if (ide_wait_stat(&startstop, drive, DATA_READY, drive->bad_wstat, WAIT_DRQ)) {
            printk(KERN_ERR "%s: no DRQ after issuing %s\n", drive->name,
                   drive->mult_count ? "MULTWRITE" : "WRITE");
            return startstop;
        }
        if (!drive->unmask)
            __cli(); /* local CPU only */
        if (drive->mult_count) {
            ide_hwgroup_t *hwgroup = HWGROUP(drive);
            /*
             * Ugh.. this part looks ugly because we MUST set up
             * the interrupt handler before outputting the first block
             * of data to be written. If we hit an error (corrupted buffer list)
             * in ide_multwrite(), then we need to remove the handler/timer
             * before returning. Fortunately, this NEVER happens (right?).
             *
             * Except when you get an error it seems...
             */
            hwgroup->wrq = *rq; /* scratchpad */
            ide_set_handler (drive, &multwrite_intr, WAIT_CMD, NULL);
            if (ide_multwrite(drive, drive->mult_count)) {
                unsigned long flags;
                spin_lock_irqsave(&io_request_lock, flags);
                hwgroup->handler = NULL;
                del_timer(&hwgroup->timer);
                spin_unlock_irqrestore(&io_request_lock, flags);
                return ide_stopped;
            }
        } else {
            ide_set_handler (drive, &write_intr, WAIT_CMD, NULL);
            idedisk_output_data(drive, rq->buffer, SECTOR_WORDS);
        }
        return ide_started;
    }
    printk(KERN_ERR "%s: bad command: %d\n", drive->name, rq->cmd);
    ide_end_request(0, HWGROUP(drive));
    return ide_stopped;
}
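The addressing part of do_rw_disk can be pulled out into a standalone sketch that computes the four register values instead of writing them to the hardware. It follows the same arithmetic as the code above: in LBA mode the 28-bit block number is spread over the sector, low-cylinder and high-cylinder registers plus the low nibble of the select register; in CHS mode the block number is split using the drive's geometry. The struct and function names are mine, and select_all stands in for drive->select.all.

```c
#include <assert.h>
#include <stdint.h>

/* The four address registers as do_rw_disk would load them. */
struct taskfile {
    uint8_t sector;  /* IDE_SECTOR_REG */
    uint8_t lcyl;    /* IDE_LCYL_REG   */
    uint8_t hcyl;    /* IDE_HCYL_REG   */
    uint8_t select;  /* IDE_SELECT_REG */
};

/* LBA28: bits 0-7, 8-15, 16-23 and 24-27 of the block number. */
struct taskfile lba28_taskfile(uint32_t block, uint8_t select_all)
{
    assert(block <= 0x0FFFFFFFu);   /* only 28 bits can be expressed */
    struct taskfile tf;
    tf.sector = (uint8_t)(block & 0xff);
    tf.lcyl   = (uint8_t)((block >> 8) & 0xff);
    tf.hcyl   = (uint8_t)((block >> 16) & 0xff);
    tf.select = (uint8_t)(((block >> 24) & 0x0f) | select_all);
    return tf;
}

/* CHS: split the block number using heads and sectors per track. */
struct taskfile chs_taskfile(uint32_t block, unsigned heads,
                             unsigned sects, uint8_t select_all)
{
    assert(heads > 0 && heads <= 16 && sects > 0 && sects <= 255);
    uint32_t track = block / sects;
    uint32_t cyl   = track / heads;
    assert(cyl <= 0xFFFFu);         /* cylinder must fit in 16 bits */
    struct taskfile tf;
    tf.sector = (uint8_t)(block % sects + 1);   /* sectors count from 1 */
    tf.lcyl   = (uint8_t)(cyl & 0xff);
    tf.hcyl   = (uint8_t)(cyl >> 8);
    tf.select = (uint8_t)((track % heads) | select_all);
    return tf;
}
```

The assert in lba28_taskfile is exactly the 137 GB limit discussed at the top: with only four bits left in the select register there is no place for a larger block number, which is why 48-bit support needed new code in later kernels.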