A short history of btrfs

Date: 2009-08-04 Source: Linux forum

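Btrfs, a file system whose development Oracle announced in 2007, is intended to replace ext3/4, the current Linux file systems. Linux file system developer and former ZFS designer Valerie Aurora has written a brief but insightful article on the history and inner workings of Btrfs.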

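Before Btrfs: in 2007 the outlook for Linux file systems looked rocky. Reiserfs, after being plagued by quality and sustainability problems, lost all credibility with the arrest of its founder Hans Reiser; ext4 was still in development and was fundamentally just an extension of a decades-old format; to make matters worse, companies were cutting funding for Linux development. Even so, developers did not give up hope. Ohad Rodeh invented copy-on-write (COW) B-trees, and former Reiserfs developer Chris Mason added exciting new features to these B-trees: small file packing, B-trees for fast lookup, and flexible layout. Finally he proposed a prototype of the B-tree file system (B-tree FS, or Btrfs)... Valerie Aurora says that, from a personal point of view, Btrfs and ZFS are very similar: both are copy-on-write checksummed file systems that support multiple devices and writable snapshots. In architecture, development model, maturity, licensing, and so on, however, the two are entirely different.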

Source: solidot

You probably have heard of the cool new kid on the file system block, btrfs (pronounced "butter-eff-ess") - after all, Linus Torvalds is using it as his root file system on one of his laptops. But you might not know much about it beyond a few high-level keywords - copy-on-write, checksums, writable snapshots - and a few sensational rumors and stories - the Phoronix benchmarks, btrfs is a ZFS ripoff, btrfs is a secret plan for Oracle domination of Linux, etc. When it comes to file systems, it's hard to tell truth from rumor from vile slander: the code is so complex, the personalities are so exaggerated, and the users are so angry when they lose their data. You can't even settle things with a battle of the benchmarks: file system workloads vary so wildly that you can make a plausible argument for why any benchmark is either totally irrelevant or crucially important.
In this article, we'll take a behind-the-scenes look at the design and development of btrfs on many levels - technical, political, personal - and trace it from its origins at a workshop to its current position as Linus's root file system. Knowing the background and motivation for each step will help you understand why btrfs was started, how it works, and where it's going in the future. By the end, you should be able to hand-wave your way through a description of btrfs's on-disk format.
Disclaimer: I have two huge disclaimers to make: One, I worked on ZFS for several years while at Sun. Two, I have already been subpoenaed and deposed for the various Sun/NetApp patent lawsuits and I'd like to avoid giving them any excuse to subpoena me again. I'll do my best to be fair, honest, and scrupulously correct.
btrfs: Pre-history
Imagine you are a Linux file system developer. It's 2007, and you are at the Linux Storage and File systems workshop. Things are looking dim for Linux file systems: Reiserfs, plagued with quality issues and an unsustainable funding model, has just lost all credibility with the arrest of Hans Reiser a few months ago. ext4 is still in development; in fact, it isn't even called ext4 yet. Fundamentally, ext4 is just a straightforward extension of a 30-year-old format and is light-years behind the competition in terms of features. At the same time, companies are clamping down on funding for Linux development; IBM's Linux division is coming to the end of its grace period and needs to show profitability now. Other companies are catching wind of an upcoming recession and are cutting research across the board. They want projects with time to results measured in months, not years.
Ever hopeful, the file systems developers are meeting anyway. Since the workshop is co-located with USENIX FAST'07, several researchers from academia and industry are presenting their ideas to the workshop. One of them is Ohad Rodeh. He's invented a kind of btree that is copy-on-write (COW) friendly [PDF]. To start with, btrees in their native form are wildly incompatible with COW. The leaves of the tree are linked together, so when the location of one leaf changes (via a write - which implies a copy to a new block), the link in the adjacent leaf changes, which triggers another copy-on-write and location change, which changes the link in the next leaf... The result is that the entire btree, from top to bottom, has to be rewritten every time one leaf is changed.
Rodeh's btrees are different: first, he got rid of the links between leaves of the tree - which also "throws out a lot of the existing b-tree literature", as he says in his slides [PDF] - but keeps enough btree traits to be useful. (This is a fairly standard form of btrees in file systems, sometimes called "B+trees".) He added some algorithms for traversing the btree that take advantage of reference counts to limit the amount of the tree that has to be traversed when deleting a snapshot, as well as a few other things, like proactive split and merge of interior nodes so that inserts and deletes don't require any backtracking. The result is a simple, robust, generic data structure which very efficiently tracks extents (groups of contiguous data blocks) in a COW file system. Rodeh successfully prototyped the system some years ago, but he's done with that area of research and just wants someone to take his COW-friendly btrees and put them to good use.
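To make the difference concrete, here is a minimal C sketch of a copy-on-write update on a tree whose leaves are not linked to each other. It is not Rodeh's algorithm and not btrfs code: the toy_ names, the fixed fan-out of 4, and the in-memory pointers standing in for disk block addresses are all assumptions made for illustration. It shows only that changing one leaf copies the nodes on the path from the root to that leaf and shares everything else, so the old root still describes the old contents.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_FANOUT 4u  /* hypothetical fan-out, chosen only for the example */

struct toy_node {
    unsigned level;                      /* 0 means leaf */
    struct toy_node *child[TOY_FANOUT];  /* used when level > 0 */
    int value[TOY_FANOUT];               /* used when level == 0 */
};

/* Number of slots covered by each child of a node at the given level. */
static unsigned toy_span(unsigned level)
{
    unsigned span = 1;
    while (level-- > 0)
        span *= TOY_FANOUT;
    return span;
}

/* Build a full tree of the given height with every slot set to 0. */
static struct toy_node *toy_build(unsigned level)
{
    struct toy_node *n = calloc(1, sizeof(*n));
    if (!n)
        abort();
    n->level = level;
    if (level > 0)
        for (unsigned i = 0; i < TOY_FANOUT; i++)
            n->child[i] = toy_build(level - 1);
    return n;
}

/*
 * Return a new root in which slot `index` holds `value`.
 * Nothing reachable from `node` is modified: each node on the path is
 * copied to a fresh allocation (the "new block") and only the copy changes.
 */
static struct toy_node *toy_cow_set(const struct toy_node *node,
                                    unsigned index, int value)
{
    struct toy_node *copy = malloc(sizeof(*copy));
    if (!copy)
        abort();
    memcpy(copy, node, sizeof(*copy));
    if (node->level == 0) {
        assert(index < TOY_FANOUT);
        copy->value[index] = value;
    } else {
        unsigned span = toy_span(node->level);
        unsigned slot = index / span;
        assert(slot < TOY_FANOUT);
        /* Only this one child pointer differs from the original node. */
        copy->child[slot] = toy_cow_set(node->child[slot], index % span, value);
    }
    return copy;
}

/* Read slot `index` starting from the given root. */
static int toy_get(const struct toy_node *node, unsigned index)
{
    while (node->level > 0) {
        unsigned span = toy_span(node->level);
        unsigned slot = index / span;
        assert(slot < TOY_FANOUT);
        index %= span;
        node = node->child[slot];
    }
    assert(index < TOY_FANOUT);
    return node->value[index];
}
```

Under these assumptions, calling toy_cow_set(root, 5, 42) on a tree built with toy_build(1) allocates one new leaf and one new root, while the other three leaves are shared by both roots. The sketch never frees anything; deciding when a shared node can be released is the job of the reference counts mentioned above.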
btrfs: The beginning
Chris Mason took these COW-friendly btrees and ran with them. Back in the day, Chris worked on Reiserfs, where he learned a lot about what to do and what not to do in a file system. Reiserfs had some cool features - small file packing, btrees for fast lookup, flexible layout - but the implementation tended to be haphazard and ad hoc. Code paths proliferated wildly, and along with them potential bugs.
Chris had an insight: What if everything in the file system - inodes, file data, directory entries, bitmaps, the works - was an item in a copy-on-write btree? All reads and writes to storage would go through the same code path, one that packed the items into btree nodes and leaves without knowing or caring about the item type. Then you only have to write the code once and you get checksums, reference counting (for snapshots), compression, fragmentation, etc., for anything in the file system.
Chris came up with the following basic structure for btrfs ("btrfs" comes from "btree file system"). Btrfs consists of three types of on-disk structures: block headers, keys, and items, currently defined as follows:
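struct btrfs_header {
    u8 csum[32];            /* checksum of this block */
    u8 fsid[16];            /* file system UUID */
    __le64 blocknr;         /* which block this header describes */
    __le64 flags;

    u8 chunk_tree_uid[16];
    __le64 generation;
    __le64 owner;
    __le32 nritems;         /* number of entries in this block */
    u8 level;               /* level of this block in the btree */
};

struct btrfs_disk_key {
    __le64 objectid;
    u8 type;
    __le64 offset;
};

struct btrfs_item {
    struct btrfs_disk_key key;
    __le32 offset;          /* where this item's data starts in the leaf */
    __le32 size;            /* length of this item's data */
};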
Inside the btree (that is, the "branches" of the tree, as opposed to the leaves at the bottom of the tree), nodes consist only of keys and block headers. The keys tell you where to go looking for the item you want, and the block headers tell you where the next node or leaf in the btree is located on disk.
The leaves of the btree contain items, which are a combination of keys and data. Similarly to reiserfs, the items and data are packed in an extremely space-efficient way: the item headers (that is, the item structure described above) are packed together starting at the beginning of the block, and the data associated with each item is packed together starting at the end of the block. So item headers and data grow towards each other, as shown in the diagram to the right.
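The two-ended packing can be sketched in C as follows. This is an illustration rather than the btrfs implementation: the 256-byte leaf, the toy_ names, and the key reduced to a single integer are assumptions, and the sketch only appends (it does not keep items sorted, split full leaves, or delete items).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_LEAF_SIZE 256u  /* hypothetical block size, tiny on purpose */

/* Simplified stand-in for an item header: key, then where the data lives. */
struct toy_item {
    uint64_t key;
    uint32_t offset;  /* byte offset of the data within the leaf block */
    uint32_t size;    /* length of the data in bytes */
};

struct toy_leaf {
    uint32_t nritems;     /* headers occupy the first nritems * sizeof(item) bytes */
    uint32_t data_start;  /* data occupies [data_start, TOY_LEAF_SIZE) */
    uint8_t block[TOY_LEAF_SIZE];
};

static void toy_leaf_init(struct toy_leaf *leaf)
{
    leaf->nritems = 0;
    leaf->data_start = TOY_LEAF_SIZE;
    memset(leaf->block, 0, sizeof(leaf->block));
}

/* Bytes left in the gap between the last header and the first data byte. */
static uint32_t toy_leaf_free_space(const struct toy_leaf *leaf)
{
    uint32_t headers_end = leaf->nritems * (uint32_t)sizeof(struct toy_item);
    assert(headers_end <= leaf->data_start);
    return leaf->data_start - headers_end;
}

/* Add one item: header goes at the front, data goes at the back.
 * Returns 0 on success, -1 if the gap in the middle is too small. */
static int toy_leaf_append(struct toy_leaf *leaf, uint64_t key,
                           const void *data, uint32_t size)
{
    struct toy_item item;

    if ((uint64_t)toy_leaf_free_space(leaf) <
        (uint64_t)sizeof(struct toy_item) + size)
        return -1;

    leaf->data_start -= size;  /* data grows downward from the end */
    memcpy(leaf->block + leaf->data_start, data, size);

    item.key = key;
    item.offset = leaf->data_start;
    item.size = size;
    /* headers grow upward from the start */
    memcpy(leaf->block + leaf->nritems * sizeof(item), &item, sizeof(item));
    leaf->nritems++;
    return 0;
}

/* Copy header `index` into *out and return a pointer to its data. */
static const uint8_t *toy_leaf_get(const struct toy_leaf *leaf, uint32_t index,
                                   struct toy_item *out)
{
    assert(index < leaf->nritems);
    memcpy(out, leaf->block + index * sizeof(*out), sizeof(*out));
    return leaf->block + out->offset;
}
```

Because neither region has a fixed size, a leaf in this sketch can hold many small items or a few large ones, and it is full only when the headers and the data meet in the middle.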
Besides being code efficient, this scheme is space and time efficient as well. Normally, file systems put only one kind of data - bitmaps, or inodes, or directory entries - in any given file system block. This wastes disk space, since unused space in one kind of block can't be used for any other purpose, and it wastes time, since getting to one particular piece of file data requires reading several different kinds of metadata, all located in different blocks in the file system. In btrfs, items are packed together (or pushed out to leaves) in arrangements that optimize both access time and disk space. You can see the difference in these (very schematic, very simplified) diagrams.
Old-school filesystems tend to organize data like this:

Btrfs, instead, creates a disk layout which looks more like:

In both diagrams, red blocks denote wasted disk space and red arrows denote seeks.
Each kind of metadata and data in the file system - a directory entry, an inode, an extended attribute, file data itself - is stored as a particular type of item. If we go back to the definition of an item, we see that its first element is a key:
struct btrfs_disk_key {
    __le64 objectid;
    u8 type;
    __le64 offset;
};
Let's start with the objectid field. Each object in the file system - generally an inode - has a unique objectid. This is fairly standard practice - it's the equivalent of inode numbers. What makes btrfs interesting is that the objectid makes up the most significant bits of the item key - what we use to look up an item in the btree - and the lower bits are different kinds of items related to that objectid. This results in grouping together all the information associated with a particular objectid. If you allocate adjacent objectids, then all the items from those objectids are also allocated close together. The <objectid, type> pair automatically groups related data close to each other regardless of the actual content of the data, as opposed to the classical file system approach, which writes separate optimized allocators for each kind of file system data.
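The ordering described here can be sketched as a comparison function. This is a hedged illustration, not the kernel's code: the toy_ names and the item type values are hypothetical, and comparing offset last is an assumption based on the field order in the structure.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical item types; the numeric values are made up for the example. */
enum {
    TOY_INODE_ITEM = 1,
    TOY_DIR_ITEM = 2,
    TOY_EXTENT_DATA = 3
};

/* In-memory stand-in for struct btrfs_disk_key. */
struct toy_key {
    uint64_t objectid;
    uint8_t type;
    uint64_t offset;
};

/* Order keys by objectid first, then type, then offset. */
static int toy_key_cmp(const void *pa, const void *pb)
{
    const struct toy_key *a = pa;
    const struct toy_key *b = pb;

    if (a->objectid != b->objectid)
        return a->objectid < b->objectid ? -1 : 1;
    if (a->type != b->type)
        return a->type < b->type ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/* Sort an array of keys into the order a btree would store them in. */
static void toy_sort_keys(struct toy_key *keys, size_t n)
{
    assert(keys != NULL || n == 0);
    qsort(keys, n, sizeof(*keys), toy_key_cmp);
}
```

Sorting a mixed batch of keys with toy_sort_keys places every item belonging to objectid 1 ahead of every item belonging to objectid 2, whatever their types, which is the grouping effect the paragraph above describes.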
The type field tells you what kind of data is stored in the item. Is it the inode? Is it a directory entry? Is it an extent telling you where the file data is on disk? Is it the file data itself? With the combination of objectid and the type, you can look up any file system data you need in the btree.
We should take a quick look at the structure of the btree nodes and leaves themselves. Each node and leaf is an extent in the btree - nodes are extents full of <key, block header> pairs, and leaves contain items. Large file data is stored outside of the btree leaves, with the item describing the extent kept in the leaf itself. (What constitutes a "large" file is tunable based on the workload.) Each extent describing part of the btree has a checksum and a reference count, which permits writable snapshots. Each extent also includes an explicit back reference to each of the extents that refer to it.
Back references give btrfs a major advantage over every other file system in its class because now we can quickly and efficiently migrate data, incrementally check and repair the file system, and check the correctness of reference counts during normal operation. The proof is that btrfs already supports fast, efficient device removal and shrinking of the available storage for a file system. Many other file systems list "shrink file system" as a feature, but it usually ends up implemented inefficiently and slowly and several years late - or not at all. For example, ext3/4 can shrink a file system - by traversing the entire file system searching for data located in the area of the device being removed. It's a slow, fraught, bug-prone process. ZFS still can't shrink a file system.
The result is beautifully generic and elegant: Everything on disk is a btree containing reference counted, checksummed extents of items, organized by <objectid, type> keys. A great deal of the btrfs code doesn't care at all what is stored in the items, it just knows how to add or remove them from the btree. Optimizing disk layout is simple: allocate things with similar keys close together.
btrfs: The politics
At the same time that Chris was figuring out the technical design of btrfs, he was also figuring out how to fund the development of btrfs in both the short and the long term. Chris had recently moved from SUSE to a special Linux group at Oracle, one that employs several high-level Linux storage developers, including Martin K. Petersen, Zach Brown, and Jens Axboe. Oracle funds a lot of Linux development, some of it obviously connected to the Oracle database (OCFS2, DIF/DIX), and some of it less so (generic block layer work, syslets). Here's how Chris put it in a recent interview with Amanda McPherson from the Linux Foundation:
Amanda: Why did you start this project? Why is Oracle supporting this project so prominently?
Chris: I started Btrfs soon after joining Oracle. I had a unique opportunity to take a detailed look at the features missing from Linux, and felt that Btrfs was the best way to solve them.
Linux is a very important platform for Oracle. We use it heavily for our internal operations, and it has a broad customer base for us. We want to keep Linux strong as a data center operating system, and innovating in storage is a natural way for Oracle to contribute.
In other words, Oracle likes having Linux as a platform, and is willing to invest development effort in it even if it's not directly related to Oracle database performance. Look at it this way: how many operating systems are written and funded in large part by your competitors? While it is tempting to have an operating system entirely under your control - like Solaris - it also means that you have to pay for most of the development on that platform. In the end, Oracle believes it is in its own interest to use its in-house expertise to help keep Linux strong.
After a few months of hacking and design discussions with Zach Brown and many others, Chris posted btrfs for review. From there on out, you can trace the history of btrfs like any other open source project through the mailing lists and source code history. Btrfs is now in the mainline kernel and developers from Red Hat, SUSE, Intel, IBM, HP, Fujitsu, etc. are all working on it. Btrfs is a true open source project - not just in the license, but also in the community.
btrfs: A brief comparison with ZFS
People often ask about the relationship between btrfs and ZFS. From one point of view, the two file systems are very similar: they are copy-on-write checksummed file systems with multi-device support and writable snapshots. From other points of view, they are wildly different: file system architecture, development model, maturity, license, and host operating system, among other things. Rather than answer individual questions, I'll give a short history of ZFS development and compare and contrast btrfs and ZFS on a few key items.
When ZFS first got started, the outlook for file systems in Solaris was rather dim as well. Logging UFS was already nearing the end of its rope in terms of file system size and performance. UFS was so far behind that many Solaris customers paid substantial sums of money to Veritas to run VxFS instead. Solaris needed a new file system, and it needed it soon.
Jeff Bonwick decided to solve the problem and started the ZFS project inside Sun. His organizing metaphor was that of the virtual memory subsystem - why can't disk be as easy to administer and use as memory? The central on-disk data structure was the slab - a chunk of disk divided up into the same size blocks, like that in the SLAB kernel memory allocator, which he also created. Instead of extents, ZFS would use one block pointer per block, but each object would use a different block size - e.g., 512 bytes, or 128KB - depending on the size of the object. Block addresses would be translated through a virtual-memory-like mechanism, so that blocks could be relocated without the knowledge of upper layers. All file system data and metadata would be kept in objects. And all changes to the file system would be described in terms of changes to objects, which would be written in a copy-on-write fashion.
In summary, btrfs organizes everything on disk into a btree of extents containing items and data. ZFS organizes everything on disk into a tree of block pointers, with different block sizes depending on the object size. btrfs checksums and reference-counts extents, ZFS checksums and reference-counts variable-sized blocks. Both file systems write out changes to disk using copy-on-write - extents or blocks in use are never overwritten in place, they are always copied somewhere else first.
So, while the feature list of the two file systems looks quite similar, the implementations are completely different. It's a bit like convergent evolution between marsupials and placental mammals - a marsupial mouse and a placental mouse look nearly identical on the outside, but their internal implementations are quite a bit different!
In my opinion, the basic architecture of btrfs is more suitable to storage than that of ZFS. One of the major problems with the ZFS approach - "slabs" of blocks of a particular size - is fragmentation. Each object can contain blocks of only one size, and each slab can only contain blocks of one size. You can easily end up with, for example, a file of 64K blocks that needs to grow one more block, but no 64K blocks are available, even if the file system is full of nearly empty slabs of 512 byte blocks, 4K blocks, 128K blocks, etc. To solve this problem, we (the ZFS developers) invented ways to create big blocks out of little blocks ("gang blocks") and other unpleasant workarounds. In our defense, at the time btrees and extents seemed fundamentally incompatible with copy-on-write, and the virtual memory metaphor served us well in many other respects.
In contrast, the items-in-a-btree approach is extremely space efficient and flexible. Defragmentation is an ongoing process - repacking the items efficiently is part of the normal code path preparing extents to be written to disk. Doing checksums, reference counting, and other assorted metadata busy-work on a per-extent basis reduces overhead and makes new features (such as fast reverse mapping from an extent to everything that references it) possible.
Now for some personal predictions (based purely on public information - I don't have any insider knowledge). Btrfs will be the default file system on Linux within two years. Btrfs as a project won't (and can't, at this point) be canceled by Oracle. If all the intellectual property issues are worked out (a big if), ZFS will be ported to Linux, but it will have less than a few percent of the installed base of btrfs. Check back in two years and see if I got any of these predictions right!
Btrfs: What's next?
Btrfs is heading for 1.0, a little more than 2 years since the first announcement. This is much faster than many file systems veterans - including myself - expected, especially given that during most of that time, btrfs had only one full-time developer. Btrfs is not ready for production use - that is, storing and serving data you would be upset about losing - but it is ready for widespread testing - e.g., on your backed-up-nightly laptop, or your experimental netbook that you reinstall every few weeks anyway.
Be aware that there was a recent flag day in the btrfs on-disk format: A commit shortly after the 2.6.30 release changed the on-disk format in a way that isn't compatible with older kernels. If you create your btrfs file system using the old, 2.6.30 or earlier kernel and tools, and boot into a newer kernel with the new format, you won't be able to use your file system with a 2.6.30 or older kernel any longer. Linus Torvalds found this out the hard way (http://www.mail-archive.com/linux-[email protected]/msg02500.html). But if this does happen to you, don't panic - you can find rescue images and other helpful information on the btrfs wiki.

http://lwn.net/Articles/342892/