Robert Love
Date: 2007-05-22 Source: norn_larry
Originally from: http://kerneltrap.org/node/336
HackerName:Robert Love
KernelTrap's first interview was with Robert Love in October of 2001. Since that time, his kernel preemption patch has been merged into the 2.5 development kernel and he's continued to be active on the Linux kernel development scene. He recently agreed to speak with us again.
In this interview, Robert discusses the status of Linux kernel preemption, talks about his recent involvement with the O(1) scheduler and explains his recent VM overcommit work. He also reflects upon Linus' use of BitKeeper, the future of Linux, and the recent Kernel Summit in Ottawa.
Presently in California working for MontaVista, Robert will return to college for his third year at the University of Florida in late August. Read on for the full interview.
Jeremy Andrews: Eight months have gone by since I last interviewed you, and in that time you've made some impressive contributions to the Linux kernel. Your preemptible kernel patch was merged into the 2.5 development tree several months ago, and you've also been quite involved with the O(1) scheduler, among other things. Outside of these efforts, how has life and school in Gainesville been treating you?
Robert Love: School and life are both good.
I finished up my second year at the University of Florida in April. I am enjoying most of my classes, especially my mathematics courses. I will start my third year in late August.
One of my hobbies is photography, and I bought a new camera -- a Canon Elan 7E. It is an SLR 35mm beast (sorry, not digital). I have been having a lot of fun taking lots and lots of pictures.
I started working at MontaVista, the embedded Linux company, back in January. I work in their real-time and performance group hacking on the kernel. MontaVista is very committed to open source and consequently I am getting to work on a lot of projects in the community. They have always been interested in the preemptible kernel, which has always been their project, and now they are supporting me to maintain it both in 2.4 and 2.5.
JA: It sounds like you've managed to get paid for exactly what it is you like to do?
Robert Love: I am pretty lucky.
JA: You mention that MontaVista is an embedded Linux company. Beyond their contributions with the preemptible kernel I must confess I'm not very familiar with them. Can you tell us more about what they do?
Robert Love: MontaVista provides an open-source Linux-based embedded solution for developers. When design teams are working on an embedded product (from cell phones to PDAs to satellites) Linux is a popular choice. We offer a product and various services to meet those needs.
JA: How are you spending your summer?
Robert Love: During the school year I work part-time, but for this summer I am out in the Silicon Valley area working full-time for MontaVista. It has been a very rewarding experience.
I have a car out here, so I have been taking advantage of the opportunity to travel. Northern California is really beautiful.
JA: Where are some places you've managed to drive and see since you've been there?
Robert Love: Well I live and work in the south bay, so I spend a lot of time around here. I go up to San Francisco a lot. I have been to Santa Cruz and hiking through the mountains around the valley.
I like the college towns - Palo Alto where Stanford is and Berkeley where (of all places) UC Berkeley is.
I visited Yosemite and loved it.
JA: Back to your kernel efforts, when did you learn that the preemptible kernel patch was going to be merged by Linus into the 2.5 tree?
Robert Love: When Linus said, "OK" ;-)
Gosh, I barely remember... we had talked about it before and obviously the thing was out in the community forever. I sent him a patch against 2.5.4-pre5. He raised a couple of issues which I addressed in a subsequent patch I sent him. And voila, 2.5.4-pre6 had a preemptible kernel!
JA: As a desktop Linux user I was very happy to hear it when your patch was merged. However, I remember earlier debates in which many did not think kernel preemption should be part of the mainline Linux kernel. What kind of reaction do you get from other kernel hackers?
Robert Love: It is mixed. Some appreciate it and are working on further enhancements, some dislike it, and many are neutral. The guy who matters most (Linus) likes it and I also have support from two of the (in my opinion) most talented kernel hackers, Andrew Morton and Ingo Molnar, which is appreciated.
JA: How much maintenance is now involved with the patch being merged into the mainline development tree?
Robert Love: Much less than when it was an external patch. You spend so much time just tracking the trees, keeping everything in sync, and releasing patches. Now I do not have to worry about any of that and in fact other patches have to worry about staying compatible with the preemptible kernel!
JA: What are some of the added complications introduced with a preemptible kernel?
Robert Love: There is the issue of per-CPU data. Now that the kernel is preemptible, we have a rule that kernel code cannot assume its current CPU will not change out from under it (like user-space) - this is a product of being preempted.
Thus, code that must operate on per-CPU data must make sure that preemption is disabled.
JA: What are some areas of the kernel that need to disable preemption for this reason?
Robert Love: An example, like a picture, is worth a thousand Jack Handey books:
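    extern int stats[NR_CPUS];    /* per-CPU statistics */
    int cpu = smp_processor_id();

    do_stuff(stats[cpu]);         /* may be preempted and migrated here... */
    more_stuff(stats[cpu]);       /* ...so cpu can be stale by now */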
In the above code, with kernel preemption, you can be preempted at any
point. When you are rescheduled you might be on a different CPU and
thus `cpu' no longer refers to your current CPU. Not good. The solution
is:
    extern int stats[NR_CPUS];    /* per-CPU statistics */
    int cpu = get_cpu();          /* disables preemption */

    do_stuff(stats[cpu]);
    more_stuff(stats[cpu]);
    put_cpu();                    /* re-enables preemption */
get_cpu() returns the current CPU and also disables preemption; put_cpu() re-enables it. Note you only need to do this when there are no locks held and interrupts are enabled - not every usage is a problem. We have already fixed what has been found.
Note the above can also be a problem on uniprocessor systems, since the
per-CPU data is "implicitly locked" - the author may not have protected
the data with a lock since it is impossible to reenter the code on the
same CPU. Preemption changes this and you may need to disable
preemption.
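[Illustrative sketch, not part of the interview. A minimal user-space model of the rule described above, assuming a single counter that forbids migration while it is non-zero. The function names mirror the kernel's, but the bodies are simplified stand-ins, and try_migrate() is a hypothetical helper invented for this example.]

```c
#include <assert.h>
#include <stdbool.h>

/* User-space model only: not the real kernel implementations. */
static int preempt_count;   /* > 0 means "do not preempt this task" */
static int current_cpu;     /* CPU the modelled task is running on */

static void preempt_disable(void)
{
    preempt_count++;
}

static void preempt_enable(void)
{
    assert(preempt_count > 0);   /* catch unbalanced enable calls */
    preempt_count--;
}

/* Return the current CPU and pin the task to it until put_cpu(). */
static int get_cpu(void)
{
    preempt_disable();
    return current_cpu;
}

/* Drop the pin taken by get_cpu(). */
static void put_cpu(void)
{
    preempt_enable();
}

/* Hypothetical helper: the scheduler tries to move the task to new_cpu.
 * The move is refused while preemption is disabled. */
static bool try_migrate(int new_cpu)
{
    if (preempt_count > 0)
        return false;
    current_cpu = new_cpu;
    return true;
}
```

[Between get_cpu() and put_cpu() the model refuses every migration, so an index returned by get_cpu() stays valid for the whole critical section.]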
JA: Is the entire preemption patch merged into 2.5?
Robert Love: Yes, and then some. 2.5 has many changes that the 2.4 patch has not incorporated. 2.5 also has kernel preemption support for most architectures now - which shows the sort of developer support the patch has generated.
JA: Which architectures are now supported in 2.5?
Robert Love: ARM, i386, PPC, sparc64, and x86-64 are supported.
MontaVista has made available 2.4 patches for SH and MIPS and those can easily be picked up and forward-ported, too.
JA: Aside from the extra architecture support, what is found in 2.5 kernel preemption that's not in the 2.4 patch?
Robert Love: We changed some of the model for how preemption works. The net result is the same, but the implementation is a bit improved.
Specifically, we moved some of the preempt_enable() macro into preempt_schedule() to move code out-of-line. Since preempt_enable() is inlined in every spinlock, this reduces code-size, especially on RISC architectures.
Next we inlined preempt_schedule() into the entry.S return-from-interrupt path. We can preemptively schedule two ways: off an interrupt (this is ideal since it would be the interrupt that woke up a task and set need_resched) or off an unlock. In the case of preempting off an interrupt, we have assembly in the return code in entry.S. This code does some checks like preempt_enable() and then calls preempt_schedule() if needed. preempt_schedule() then calls schedule(). We just stuck the logic for preempt_schedule() in the return path to bypass a level of indirection.
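[Illustrative sketch, not part of the interview. A simplified user-space model of the two preemption points described above, assuming a plain counter and flag in place of the kernel's per-task state. The real logic lives in macros and in architecture-specific assembly; return_from_interrupt() and schedule_calls are names invented for this example.]

```c
#include <stdbool.h>

static int  preempt_count;   /* nesting depth of preempt_disable() */
static bool need_resched;    /* set when another task should run */
static int  schedule_calls;  /* counts how often the modelled schedule() ran */

static void schedule(void)
{
    need_resched = false;
    schedule_calls++;
}

/* Out-of-line slow path: the remaining checks live here so the
 * fast path inlined at every unlock stays small. */
static void preempt_schedule(void)
{
    if (preempt_count != 0)
        return;              /* still inside a critical section */
    schedule();
}

static void preempt_disable(void)
{
    preempt_count++;
}

/* Preemption point 1: leaving a critical section (e.g. an unlock). */
static void preempt_enable(void)
{
    preempt_count--;
    if (need_resched)
        preempt_schedule();
}

/* Preemption point 2: return from interrupt. The check is done in place
 * instead of going through preempt_schedule(), saving one call. */
static void return_from_interrupt(void)
{
    if (preempt_count == 0 && need_resched)
        schedule();
}
```

[In this model a reschedule requested inside nested critical sections is deferred until the outermost preempt_enable().]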
There are also some other little optimizations...
JA: What are some of your future plans for improving upon Linux kernel preemption?
Robert Love: There are a few things either myself or others are working on...
Ingo Molnar, Dave Miller, Linus, and myself have been tossing around some patches that remove entirely the concept of the local_bh_count and local_irq_stat (basically, a count of whether bottom halves or interrupts are in progress) and fold them into the preempt_count. This will allow us to use the preempt_count as a sort of "are we atomic" count and clean up a lot of code.
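[Illustrative sketch, not part of the interview. One way to picture folding the bottom-half and interrupt counts into preempt_count is to give each its own bit field in a single word. The bit layout and the helper names below are assumptions made for this example, not the patches under discussion.]

```c
#include <stdbool.h>

/* Hypothetical bit layout, chosen only for this sketch. */
#define PREEMPT_SHIFT   0
#define SOFTIRQ_SHIFT   8
#define HARDIRQ_SHIFT   16

#define PREEMPT_OFFSET  (1u << PREEMPT_SHIFT)
#define SOFTIRQ_OFFSET  (1u << SOFTIRQ_SHIFT)
#define HARDIRQ_OFFSET  (1u << HARDIRQ_SHIFT)

#define SOFTIRQ_MASK    (0xffu << SOFTIRQ_SHIFT)
#define HARDIRQ_MASK    (0xffu << HARDIRQ_SHIFT)

static unsigned int preempt_count;   /* one counter replaces three */

static void preempt_disable(void) { preempt_count += PREEMPT_OFFSET; }
static void preempt_enable(void)  { preempt_count -= PREEMPT_OFFSET; }
static void bh_enter(void)        { preempt_count += SOFTIRQ_OFFSET; }
static void bh_exit(void)         { preempt_count -= SOFTIRQ_OFFSET; }
static void irq_enter(void)       { preempt_count += HARDIRQ_OFFSET; }
static void irq_exit(void)        { preempt_count -= HARDIRQ_OFFSET; }

/* Context queries become simple mask tests on the single word. */
static bool in_interrupt_handler(void)
{
    return (preempt_count & HARDIRQ_MASK) != 0;
}

static bool in_bottom_half(void)
{
    return (preempt_count & SOFTIRQ_MASK) != 0;
}

/* The "are we atomic" question: any non-zero field means yes. */
static bool in_atomic(void)
{
    return preempt_count != 0;
}
```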
I also think there are some code paths that can better take advantage of having a preemptible kernel... I have to see where the new perspective would be useful.
Finally, the usual: reducing lock-held times to reduce latency.
JA: How much feedback have you gotten on your scheduler hints patch?
Robert Love: Not too much and the most useful feedback was unfortunately off-list.
I am not overly convinced scheduler hints are a good thing. They seem to work well for Solaris, but to me they are a big flag that the scheduler needs fixing. In other words, they are a bandaid.
The one hint I did that I suspect has some use is HINT_TIME, which is what Solaris is successful with. The hard part is implementing fairness and judging whether the added complexity is worth it. Probably not.
JA: Can you describe some of the other current work you're doing with the O(1) scheduler?
Robert Love: When Ingo first posted his scheduler, I was very impressed at how well designed it was. I have an immense amount of respect for Ingo as a programmer and architect, but I was still surprised.
I started contributing to the discussions, giving insight, and fixing bugs. I got the preemptible kernel working on the new scheduler.
More recently, I did the task CPU affinity system calls, real-time enhancements, configurable priority levels, and wrote some documentation. Now Ingo and I are co-maintaining the scheduler.
I also back-ported the updated bits to Alan's 2.4-ac tree and maintain the O(1) scheduler patch against the stock 2.4 tree.
JA: The O(1) backport to Alan's 2.4-ac tree is one of the main reasons I'm using it. Combined with your preempt patch and rmap, things run great. To be honest, it's been so long since I've run anything else, I forget what it was like! Is there anything missing from the O(1) scheduler in Alan's 2.4-ac tree versus what's in the main 2.5 tree?
Robert Love: There is a lag in bits entering 2.5, proving reasonably stable, being backported to 2.4-ac, and Alan merging them... but nothing overly special is pending. 2.4-ac tends to stay as in-sync with 2.5 as reasonable.
JA: Are you happy with how the 2.5 development kernel is evolving?
Robert Love: Yes, it seems to be going well. I have been having success in getting all the bits I send to Linus accepted, and that is always a good thing.
JA: What are some of the outstanding patches against 2.5 that you're still hoping Linus will merge?
Robert Love: Unfortunately not much is pending that he has not taken. What I am more interested in is the things we can hopefully get done before the Halloween freeze.
Hopefully some sanity can be injected into the SCSI layer. I am also still praying for some work to go into the tty layer. The rest of Patrick Mochel's excellent device model and driverfs code will be welcome.
Finally, I hope Andrew Morton breaks up Rik's rmap VM into logical chunks and Linus takes them. Then hopefully all the VM guys can work on improving the VM. I know I'd like to hack on it.
JA: Have you spent any time working with the VM system yourself?
Robert Love: A little. The VM is one of the things that interest me (the others being scheduling, SMP scalability, and locking) but I have not contributed much code.
That said, I have done a little work against Rik's rmap VM and I am currently working on strict VM overcommit for both Rik's VM and the stock 2.4 VM.
The strict VM overcommit work is a fun challenge. Most of the work was done by Alan Cox for his 2.4-ac tree. Right now, Linux will gladly overcommit memory - that is, it will let an allocation succeed without there being memory to back it. This generally works since a system typically has plenty of free memory. Further, many allocations are often not fully utilized or are quickly freed. Thus overcommit reduces VM pressure and consequently we swap less. Currently we let an allocation succeed based on a simple heuristic which will stop wildly huge allocations from succeeding while still allowing enough overcommit to reduce swapping.
"Fixing" this implies two things. First, we need to do very strict accounting of the allocated memory. For example, a page shared copy-on-write among many processes is considered to use only a page of memory since that is all it physically consumes. But at any instant, a process can write to the page and it will no longer share it but have its own copy. Thus we need to account a shared writable page once for each process sharing it, and not just once.
Next, now that we have proper accounting, we need to formulate a strict rule. For example, "committed memory will not exceed swap".
With these changes, it should be impossible to hit an out-of-memory condition. All points of failure should be pushed to the allocation routines (fork, mmap, malloc, etc.). Memory access should no longer result in an OOM kill -- if the memory has been allocated, it should exist.
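[Illustrative sketch, not part of the interview. A minimal user-space model of strict commit accounting, assuming a single page counter checked against a fixed limit. The struct and function names are hypothetical and do not come from the 2.4-ac code.]

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical accounting state, counted in pages. */
struct vm_account {
    size_t commit_limit;   /* ceiling set by the strict rule, e.g. swap size */
    size_t committed;      /* pages promised to processes so far */
};

/* Charge 'pages' against the limit. Failure happens here, at allocation
 * time, rather than later when the memory is touched. */
static bool vm_charge(struct vm_account *vm, size_t pages)
{
    if (pages > vm->commit_limit - vm->committed)
        return false;      /* the caller would report out of memory */
    vm->committed += pages;
    return true;
}

/* Return pages to the pool when a mapping goes away. */
static void vm_uncharge(struct vm_account *vm, size_t pages)
{
    vm->committed -= pages;
}

/* fork(): writable private pages are shared copy-on-write, but the child
 * may write to every one of them, so each is charged again for the child. */
static bool vm_charge_fork(struct vm_account *vm, size_t writable_pages)
{
    return vm_charge(vm, writable_pages);
}
```

[With a limit of 100 pages and 60 already charged, a fork that would need another 60 fails up front instead of succeeding and risking an OOM kill later.]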
JA: After reading your explanation, I'm left wondering why the existing VM systems don't already employ a strict VM overcommit. It seems to be an obviously good thing. Are there any disadvantages?
Robert Love: Sure, two:
(a) Extra accounting (probably not noticeable - in fact, my work does the accounting whether you have strict overcommit on or off).
(b) By not allowing overcommit, VM pressure goes up and consequently swap activity may increase.
JA: You mentioned an interest in SMP scalability. How does the future look for Linux SMP scalability?
Robert Love: Depends what you want to scale to :-)
2-4 CPUs are a sweet spot... we may cleanly be able to extend a bit past
that. Continuing to scale to infinity, however, will turn Linux into a
twisty mess.
JA: You also mentioned an interest in locking. Such as?
Robert Love: You know, I do not know why I find locking interesting. Most people probably do not think twice about how their locking is implemented. But the first thing I do when I read about an OS is understand their locking primitives. I guess I am odd...
Anyhow, I am interested in the various primitives we implement (spinlocks and semaphores) and how they are used. We have a really nice lightweight spinlock implementation. At the kernel summit, I discussed implementing a new lightweight mutex lock - basically a binary semaphore with none of the "special features" that our semaphores have and perhaps some spin-then-sleep behavior.
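[Illustrative sketch, not part of the interview. A user-space approximation of a binary lock with spin-then-sleep behavior, using C11 atomics. The names and the spin limit are assumptions, and sched_yield() stands in for sleeping on a wait queue, which a kernel implementation would do instead.]

```c
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SPIN_LIMIT 100   /* hypothetical tuning value */

struct light_mutex {
    atomic_flag locked;  /* binary: held or free, nothing else */
};

static void light_mutex_init(struct light_mutex *m)
{
    atomic_flag_clear(&m->locked);
}

/* Try once; returns true if the lock was taken. */
static bool light_mutex_trylock(struct light_mutex *m)
{
    return !atomic_flag_test_and_set_explicit(&m->locked,
                                              memory_order_acquire);
}

static void light_mutex_lock(struct light_mutex *m)
{
    for (;;) {
        /* Phase 1: spin briefly, hoping the holder releases soon. */
        for (int i = 0; i < SPIN_LIMIT; i++)
            if (light_mutex_trylock(m))
                return;
        /* Phase 2: give up the CPU (stand-in for going to sleep). */
        sched_yield();
    }
}

static void light_mutex_unlock(struct light_mutex *m)
{
    atomic_flag_clear_explicit(&m->locked, memory_order_release);
}
```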
JA: How does 2.5 locking improve upon earlier kernels?
Robert Love: It is much finer-grained, especially in the VFS layer.
One of the first things I worked on in 2.5 was pushing the BKL out of the high-level llseek() method and into the individual llseek implementations. Now each method can decide which locks it requires. Consequently, the generic llseek used by usual disk operations now only acquires the inode semaphore.
This "pushing locks down" work has continued in the VFS layer with many of the VFS methods.
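[Illustrative sketch, not part of the interview. A toy user-space model of "pushing locks down": the high-level entry point takes no lock, and each implementation chooses its own. The lock type only records usage, and all names here, including legacy_llseek(), are hypothetical.]

```c
#include <stdbool.h>

/* Toy lock that only records usage; a real lock would also block. */
struct toy_lock {
    bool held;
    int  acquisitions;
};

static void toy_acquire(struct toy_lock *l) { l->held = true; l->acquisitions++; }
static void toy_release(struct toy_lock *l) { l->held = false; }

static struct toy_lock big_kernel_lock;   /* the single global lock */

struct file;
typedef long (*llseek_fn)(struct file *, long);

struct file {
    long            pos;
    struct toy_lock inode_sem;   /* per-inode lock */
    llseek_fn       llseek;      /* per-filesystem implementation */
};

/* Generic implementation: only this file's inode needs protecting. */
static long generic_llseek(struct file *f, long offset)
{
    toy_acquire(&f->inode_sem);
    f->pos = offset;
    toy_release(&f->inode_sem);
    return f->pos;
}

/* Hypothetical legacy implementation that still needs the global lock. */
static long legacy_llseek(struct file *f, long offset)
{
    toy_acquire(&big_kernel_lock);
    f->pos = offset;
    toy_release(&big_kernel_lock);
    return f->pos;
}

/* High-level entry point: it no longer takes any lock itself. */
static long vfs_llseek(struct file *f, long offset)
{
    return f->llseek(f, offset);
}
```

[Seeking on a file that uses generic_llseek() never touches big_kernel_lock in this model, so seeks on different files no longer contend with each other.]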
Other work has gone on in other areas, notably with removing the BKL. When rmap is merged, I know Bill Irwin has some patches to reduce lock contention. Andrew Morton has also done some work to reduce high contention in the VM.
JA: How does Linux locking compare to other operating systems, such as FreeBSD or Solaris?
Robert Love: Solaris is much, much finer grained than Linux. It continues to scale reasonably well past 64 CPUs. As a consequence, Solaris earned the reputation of compromising the low-end for the high-end.
FreeBSD has introduced some interesting locking constructs in their 5.0 tree but they are just beginning to finely grain their kernel. They seem to have their eye on the high-end of the scalability chart.
JA: Looking beyond 2.5, what work do you want to see in the next development kernel?
Robert Love: Rewrite of the tty layer (hah hah).
I cannot think that far ahead. I would love to just see things cleaned up. Old cruft removed and such. A release where we just refine things.
JA: Do you have any concerns over the future direction of the Linux kernel?
Robert Love: Yes, at times I am scared of what the high end work will do to the common case. You either scale to 128 CPUs or you run great on one or two. You either optimize for a few running tasks or you have a system with 1000 ready database threads.
I fit the former so you can guess what I care about. So do 99% of the rest of us.
The situation is worse given the evolutionary design of the Linux kernel. I am afraid we will just keep hacking and rehacking "problem" areas - whatever shows up as the biggest problem.
JA: When do you predict we'll see a 2.6 release?
Robert Love: 18 months. Place your bets.
JA: A year and a half seems like a long time, especially with a code freeze happening in just a few months. Why do you think it will take that long before we see 2.6?
Robert Love: I have seen some estimates as soon as early 2003. That gives only a few months of stabilization. Historically, the stabilization period takes much longer than the "flood of features" period.
I put a realistic date at summer 2003. This is a year from now. Factoring in history, I'll bump that to 18 months. Maybe slightly less... but certainly about a year from freezing.
I am much more concerned with actually freezing on 31 October 2002. If we meet that - and we can - we can take as long as it needs to stabilize the kernel. Whether it be a month or a year.
JA: Is there any recent or near-future development outside of the kernel that you find interesting?
Robert Love: GNOME2 is amazing!
Metacity (a new GNOME2-compliant window manager) is very slick and quite quick. Havoc is a genius!
JA: What is your take on Linus' decision to use BitKeeper, and the lengthy debates that followed?
Robert Love: If it makes him more productive and happy, then I love it. Do I wish it were open source? Sure. But it works and is the best solution right now, so I do not whine about it.
JA: Are you using BitKeeper to manage your own kernel work?
Robert Love: Nope. I use diff, patch, grep, find, and a whole lot of directories.
JA: That's a humorous description. Why have you chosen this strategy over a more formal configuration management system?
Robert Love: I think a lot of the kernel developers use this approach. More likely their storage is in CVS, but a handful of scripts is probably the most common source management tool.
I picked it because it works for me.
JA: You were recently in Ottawa at the kernel summit. Which presentations did you find the most interesting?
Robert Love: The Kernel Summit was very interesting. It was a two day invite-only conference with about 60 core kernel developers. Last time we talked, you asked me if I had met any kernel developers. While it was not at a petting zoo as I predicted, I can now say I have met just about all of them.
Some of the better talks were driverfs by Patrick Mochel and SCSI by James Bottomley. Patrick has plans for driverfs that include world domination and euphoria for all. SCSI needs a lot of work, and James is hopefully succeeding in leading an effort there.
I was hoping the VM debate would be contentious (and ultimately rewarding) but Linus made it pretty clear at the start he wants to at least give rmap a try in 2.5, which is fine by me.
The final formal talk was led by Ted Ts'o and focused on kernel release management. We decided on the now infamous 31 October 2002 freeze date. Dave Jones was also punished, err, given the wonderful opportunity, to help Linus with the feature freeze.
The most rewarding part of the summit, however, was the informal "hallway discussions" and chats in some of Ottawa's finer pubs. Doing the face-to-face thing is very rewarding and I think a lot of good has come from it.
JA: You also attended the Ottawa Linux Symposium which directly followed the kernel summit...
Robert Love: OLS was very interesting as well. It was a much larger group, supposedly around 500. The talks were still very kernel-centric which was great for me, although I would have liked to see a couple more on different topics. Havoc Pennington did give a good talk on GConf, though.
The "informal" part of OLS was rewarding as well. Since the Kernel Summit immediately preceded OLS, most of the kernel hackers were present so we were able to continue our discussions as well as meet a lot of new people.
JA: Is there anything else you'd like to add?
Robert Love: A laser is made of light, not sound. We need to get the word out on the streets.
JA: Thank you very much for once again taking the time to speak with me! Your contributions to the Linux kernel continue to be in areas that I find quite interesting and useful (and I'm far from alone in this). I will enjoy seeing where your kernel hacking interests lead you to in the future.
Robert Love: Thank you. It was again a pleasure. Take care.