When Linux Runs Out of Memory

Date: 2009-03-23 Source: sjhf

You may rarely face it, but once you do, you know what's wrong: a lack of free memory, or Out of Memory (OOM). The results are typical: you can no longer allocate more memory, and the kernel kills a task (usually the currently running one). Heavy swapping usually accompanies this situation, so both screen and disk activity reflect it. Underneath this problem lie other questions: how much memory do you want to allocate? How much does the operating system (OS) allocate for you? The basic reason for OOM is simple: you've asked for more than the available virtual memory space. I say "virtual" because RAM isn't the only place counted as free memory; any swap areas count as well.

Exploring OOM

To begin exploring OOM, first type and run this code snippet, which allocates huge blocks of memory (call it program A):

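#include <stdio.h>
#include <stdlib.h>

/* Parenthesized so the macro expands safely inside larger expressions */
#define MEGABYTE (1024*1024)

int main(int argc, char *argv[])
{
    void *myblock = NULL;
    int count = 0;

    while (1)
    {
        /* Request 1MB per iteration; the block is never touched or freed */
        myblock = malloc(MEGABYTE);
        if (!myblock) break;
        printf("Currently allocating %d MB\n", ++count);
    }

    exit(0);
}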

Compile the program, run it, and wait a moment. Sooner or later it will go OOM. Now compile the next program (program B), which allocates huge blocks and fills them with 1s:

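#include <stdio.h>
#include <stdlib.h>
#include <string.h>   /* memset() */

/* Parenthesized so the macro expands safely inside larger expressions */
#define MEGABYTE (1024*1024)

int main(int argc, char *argv[])
{
    void *myblock = NULL;
    int count = 0;

    while (1)
    {
        myblock = malloc(MEGABYTE);
        if (!myblock) break;
        /* Writing to the block forces the kernel to back it with real pages */
        memset(myblock, 1, MEGABYTE);
        printf("Currently allocating %d MB\n", ++count);
    }

    exit(0);
}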

Notice the difference? Most likely, program A allocates more memory blocks than program B does. You will also see the word "Killed" not long after executing program B. Both programs end for the same reason: there is no more space available. More specifically, program A ends gracefully because of a failed malloc(), while program B ends because of the Linux kernel's so-called OOM killer. The first fact to observe is the number of allocated blocks. Assume that you have 256MB of RAM and 888MB of swap (my current Linux settings). Program B ended at:

Currently allocating 1081 MB

On the other hand, program A ended at:

Currently allocating 3056 MB

Where did A get that extra 1975MB? Did I cheat? Of course not! If you look closely at both listings, you will find that program B fills the allocated memory space with 1s, while A merely allocates without doing anything with the blocks. B's 1081MB is close to the 1144MB of RAM plus swap, with the remainder presumably taken by the kernel and other tasks. A gets further because Linux employs deferred page allocation. In other words, physical allocation doesn't actually happen until the moment you really use the memory; for example, by writing data to the block. So, unless you touch the block, you can keep asking for more. The technical term for this is optimistic memory allocation. Checking /proc/<pid>/status for both programs reveals the facts. Here's program A:

$ cat /proc/<pid of program A>/status
VmPeak: 3141876 kB
VmSize: 3141876 kB
VmLck: 0 kB
VmHWM: 12556 kB
VmRSS: 12556 kB
VmData: 3140564 kB
VmStk: 88 kB
VmExe: 4 kB
VmLib: 1204 kB
VmPTE: 3072 kB

Here's program B, shortly before the OOM killer struck:

$ cat /proc/<pid of program B>/status 
VmPeak: 1072512 kB
VmSize: 1072512 kB
VmLck: 0 kB
VmHWM: 234636 kB
VmRSS: 204692 kB
VmData: 1071200 kB
VmStk: 88 kB
VmExe: 4 kB
VmLib: 1204 kB
VmPTE: 1064 kB

VmRSS deserves further explanation. RSS stands for "Resident Set Size." It indicates how much of the memory owned by the task currently resides in RAM. Also note that before B reaches OOM, swap usage is almost 100 percent (most of the 888MB), while A uses no swap at all. Clearly, malloc() itself does nothing more than reserve a memory area.

Another question arises: "Even without touching the pages, why is the allocation limit 3056MB?" This exposes an unseen limit. Every application on a 32-bit system has 4GB of address space available. The Linux kernel usually splits the linear address space to provide 0 to 3GB for user space and 3GB to 4GB for kernel space. User space is a room where a task can do anything it wants, while kernel space is solely for the kernel. If you try to cross this 3GB border, you will get a segmentation fault.

(Side note: There is a kernel patch that gives the whole 4GB to user space, at the cost of some context-switching overhead.)

The conclusion is that OOM happens for two technical reasons:

1. No more pages are available in the VM.
2. No more user address space is available.
3. Both #1 and #2.

Thus the strategies to prevent those circumstances are:

1. Know how large the user address space is.
2. Know how many pages are available.

When you ask for a memory block, usually by using malloc(), you're asking the runtime C library whether a preallocated block is available. This block's size must be at least as large as the request. If such a block is already available, malloc() assigns it to the caller and marks it as "used." Otherwise, malloc() must allocate more memory by extending the heap. All requested blocks go in an area called the heap. Do not confuse it with the stack, which stores local variables and function return addresses. These two sections have different jobs. Where is the heap located in the address space? The process address map can tell you exactly where:

$ cat /proc/self/maps
0039d000-003b2000 r-xp 00000000 16:41 1080084 /lib/ld-2.3.3.so
003b2000-003b3000 r-xp 00014000 16:41 1080084 /lib/ld-2.3.3.so
003b3000-003b4000 rwxp 00015000 16:41 1080084 /lib/ld-2.3.3.so
003b6000-004cb000 r-xp 00000000 16:41 1080085 /lib/tls/libc-2.3.3.so
004cb000-004cd000 r-xp 00115000 16:41 1080085 /lib/tls/libc-2.3.3.so
004cd000-004cf000 rwxp 00117000 16:41 1080085 /lib/tls/libc-2.3.3.so
004cf000-004d1000 rwxp 004cf000 00:00 0
08048000-0804c000 r-xp 00000000 16:41 130592 /bin/cat
0804c000-0804d000 rwxp 00003000 16:41 130592 /bin/cat
0804d000-0806e000 rwxp 0804d000 00:00 0 [heap]
b7d95000-b7f95000 r-xp 00000000 16:41 2239455 /usr/lib/locale/locale-archive
b7f95000-b7f96000 rwxp b7f95000 00:00 0
b7fa9000-b7faa000 r-xp b7fa9000 00:00 0 [vdso]
bfe96000-bfeab000 rw-p bfe96000 00:00 0 [stack]

The cat listing above is an actual address space layout, but you may get different results. It is up to the Linux kernel and the runtime C library to arrange these areas. Notice that recent Linux kernel versions (2.6.x) kindly label the memory areas, but don't rely completely on those labels. The heap is basically the free space not already given to program mappings and the stack; everything else that is mapped narrows down the address space left for it. It's not a full 3GB, but 3GB minus everything else that's mapped. The bigger your program's code segment is, the less space you have for the heap. The more dynamic libraries you link into your program, the less space you get for the heap. This is important to remember. How does the map for program A look when it can't allocate more memory blocks? With a trivial change to pause the program (see