| Pp. 153-164 of the Proceedings |
![]() |
A Study in Malloc: A Case of Excessive Minor Faults
Phillip Ezolt
Compaq Computer
Corporation
Abstract
GNU libc's default setting for malloc can cause a significant
performance penalty for applications that use it extensively, such as Compaq's
high performance extended math library, CXML.
The default malloc tuning can cause a significant number of minor page
faults, and result in application performance of only half of the true
potential. This paper describes how to remove the performance penalty using
environmental variables and the method used to discover the cause of the malloc
performance penalty.
1. Why?
When a performance problem is discovered the first question asked
is usually "How can it be fixed?". Although the solution to the
performance problem is valuable, the method used to diagnose and fix the
problem is also valuable. An explanation can teach the inexperienced engineer
the thought process of the experience engineer, and give the inexperienced
engineer a method for finding and fixing future performance problems.
This paper describes how the performance problem of GNU libc's
malloc was diagnosed and how a solution was discovered. The performance hunt is
documented to demonstrate the methods used to find and fix a performance
problem.
2. What?
A customer running a chemistry benchmark on an Alpha system
reported a radically different application run time between Tru64 UNIX and
Linux/Alpha on the same hardware. Since the hardware of the two test systems
was identical, the runtime difference had to be caused by software.
Fortunately, the customer used the Compaq Fortran compiler, the Compaq Portable
Math Library (CPML), and the Compaq Extended Math Library (CXML) on Tru64 UNIX and Linux/Alpha. This
meant that the compiler subsystem was also the same. The main difference was
the operating system.
The program was run on both systems, and the "time"
command showed the runtime of both. The customer reported that the user time
was roughly the same on both operating systems, but system time on Linux/Alpha
was much greater than on Tru64 UNIX.
|
|
User |
System |
Elapsed |
CPU |
|
Linux |
256.284u |
209.641s |
7:46.35 |
99.9% |
|
Tru64 UNIX |
257.027u |
3.176s |
4:29.85 |
96.4% |
This difference pointed to a possible performance problem in the Linux
operating system. To determine where in Linux the time was spent, DCPI[1]
(an alpha profiling system) and was used to extract the following data[2]:
|
cycles[3] |
dtbmiss |
Image |
|
8116120 |
1062 |
System Total |
|
6979695 |
0 |
/vmlinux |
|
6044706 |
0 |
cpu_idle |
|
386675 |
0 |
do_anonymous_page |
|
87264 |
0 |
__free_page |
|
79269 |
0 |
__get_free_pages |
|
60127 |
0 |
__copy_user |
|
45412 |
0 |
EntMM |
|
36035 |
0 |
do_page_fault |
|
1110642 |
1052 |
Xvcc |
|
226983 |
183 |
Dgemm_nt |
|
213498 |
86 |
Dgemm_nn |
|
187057 |
47 |
Icopy_ |
|
92613 |
153 |
Dgemm_tt |
This profile showed that a large amount of the non idle kernel cycles was spent
in the 'do_anonymous_page' kernel function.
It also showed that a large number of dtbmisses occurred in xvcc, the
customer’s chemistry code.
The function of ‘do_anonymous_page’ was not immediately clear, but
further investigation revealed that it was part of the Linux kernel's memory
management routines (in /usr/src/linux/mm/memory.c), and that all calls to it
ultimately began with the page fault handler ‘handle_pte_fault’. Therefore, if
‘do_anonymous_page’ was called a large number of times, the page fault handler
was also being called a large number of times.
In addition to DCPI, the "time" command was also used to
measure where time was spent. As a side effect, it revealed that a large amount
of minor page faults occurred.
..
(168major+23099385minor)pagefaults
..
It was unclear at this point what a minor page fault was, and
whether a high number of them could cause a performance problem. However, if
Linux displayed a high number of minor faults, and Tru64 UNIX did not, it could
have been an indication of the problem.
It was known that a minor fault was a type of pagefault, and when
a pagefault occurred a dtbmiss[4]
or itbmiss must also have occurred. The
high number of page faults that the "time" command reported
corresponded nicely with DCPI's report of a high number of dtbmisses.
To determine if the number of minor faults was different on the two Alpha operating systems, the customer ran the following script on both Linux/Alpha and Tru64 UNIX:
(findfault.sh)
#!/bin/sh
COMMAND=$1
#Print command with
headers.
ps -a -o
vsize,rss,minflt,majflt,cmd | grep -e $COMMAND -e CMD | grep -v grep | grep -v
$0
while (true)
do
sleep 1
#Print command without headers.
ps
-a -o vsize, rss, minflt, \
majflt, cmd | grep -e $COMMAND |\
grep -v grep |grep -v $0
done
This script showed the size of the virtual and resident set as
well as the number of major and minor page faults for a specified process.
The customer reported a significant number of minor faults on
Linux/Alpha, but nearly none on Tru64 UNIX.
The result of the test is reported in graphical form below.


(Notice the difference in the scale of the minor faults)
Linux/Alpha had an ever-increasing number of minor faults, while
Tru64 UNIX's fault count stayed nearly constant. Linux/Alpha's virtual set size
fluctuated, while Tru64 UNIX's stayed nearly constant.
3. Faults are at
fault
The high minor fault count on Linux/Alpha pointed to a significant
difference between Tru64 UNIX and Linux/Alpha. This was the first piece of the
puzzle. However, to understand what a high minor fault count meant, it was
necessary to understand what a minor fault was.
A google[5]
search of "minor fault" and "linux", revealed the following
information about the different types of page faults.
In Linux and Unix, page faults are either minor or major. A major
fault requires an I/O operation to complete such as a page swap from disk.
Minor faults can be handled without an I/O such as a Copy on Write (COW)
request or a request for a zeroed page.
A linux kernel website[6]gave
the following definitions:
Major fault
A major page fault occurs when an attempt
to access a page not currently present in physical memory was made. The page
must be swapped in to physical memory by the fault fix-up code.
Minor fault
A minor page fault occurs when an attempt
to access a page present in physical memory, but without the correct
permissions. An example is the first write to a second reference to a shared
page, when the kernel must perform the copy-on-write and allow the task to
update the copied page.
On Compaq’s OpenVMS[7],
a high number of minor faults usually indicated that a process's working set
was larger than its allowed working set. Every attempt to use a new page would
result in an old one being kicked out of its working set, and the program would
spend a significant amount of time faulting in new pages.
It was assumed that this was what was happening on Linux.
The resident set of the customer's program hovered around 131
megabytes of memory, which seemed suspiciously close to a 128 megabytes limit.
The Linux kernel code was searched for such a hard coded limit, but
unfortunately, it was a dead end.
By running the following program, it was determined that a program
could allocate 256 megabytes of memory, and touch every page without taking a
minor fault:
#include
<unistd.h>
#include
<malloc.h>
#include
<stdio.h>
#include
<stdlib.h>
int main(int argc,
char *argv[])
{
int num_byte;
char *buffer, *p;
num_byte = atoi(argv[1])*1024*1024;
buffer = malloc(num_byte);
while (1){
for (p = buffer;
p < (buffer+num_byte);
p += getpagesize())
{*p= 0;}
}
}
This lack of faults did not match the behavior of the customer's
program.
The author of this paper would have been puzzled had he not
remembered that a member of the Compaq Math library team reported a similar
problem months ago. The problem of the Math Library team member and that of the
customer appeared to be very similar. A message to the Math library team member
revealed that he had found more information about the problem, but had not
found a solution.
His message stated that:
"The problem involving minor page
faults in DGEMM on Linux/Alpha is caused by the way Linux does heap management
(i.e., malloc and free). Allocation of large buffers is done via mmap, and when
they are freed, they are unmapped via munmap. The buffer allocated by DGEMM
falls into this category. Thus, for each call to DGEMM, address space for the
buffer is created, buffer pages are faulted into the resident set and then the
buffer, and the address space, is deleted. "
This changed the focus of the search, and also allowed for the creation of a
smaller test program which showed similar behavior to the original chemistry
code: an ever increasing number of minor page faults on Linux, and a small
number of page faults on Tru64 UNIX.
For those not fortunate enough to have a colleague who experienced
a similar problem, the kernel's minor page fault handler could have been
instrumented to print the address of instructions that cause more than 1000
minor page faults. Using this to find the guilty instruction, one could then
use 'nm' and 'gdb' to determine which function or line of code caused the minor
faults. Although this would not be a general-purpose solution, the availability
and modifiability of Linux kernel source makes this instrumentation possible.
Since it appeared that memory allocation was the cause of the
problem, it could be tested independently of the customer's chemistry program.
This was fortunate because the customer's chemistry program had many modules
and a long compile time.
The following simple program could reproduce the high number of
minor faults; it allocates a piece of memory, and then immediately frees it.
#include
<malloc.h>
#include
<stdio.h>
#include
<stdlib.h>
int main(int argc,
char *argv[])
{
int number_of_meg, num_byte;
char *buffer;
num_byte = atoi(argv[1])*1024*1024;
while (1){
buffer = malloc(num_byte);
free(buffer);
}
}
An strace[8]
of the test on Linux/Alpha confirms what the math library engineer had said.
‘mmap’ is used when mallocing large amounts of memory. Linux/Alpha had an
ever-increasing amount of minor faults.
strace ./malloc_test 1
....
mmap(0, 1056768,
PROT_NONE,
0 /* MAP_??? */, 0, 0) = 0x20000456000
munmap(0x20000456000,
1056768) = 0
mmap(0, 1056768,
PROT_NONE,
0 /* MAP_??? */, 0, 0) = 0x20000456000
munmap(0x20000456000,
1056768) = 0
mmap(0, 1056768,
PROT_NONE,
0 /* MAP_??? */, 0, 0) = 0x20000456000
munmap(0x20000456000,
1056768) = 0
mmap(0, 1056768,
PROT_NONE,
0 /* MAP_??? */, 0, 0) = 0x20000456000
munmap(0x20000456000,
1056768) = 0
.... 
Tracing the same program on Tru64 UNIX yielded two interesting
facts:
Few minor page
faults occurred on Tru64 UNIX, and
Tru64 UNIX used obreak() (a system call
which increases the processes heap size) instead of mmap() (a system call which
allocates system wide resources to a program) to malloc memory.
...
obreak (0x140108000) =
0
obreak (0x14020a000) =
0
obreak (0x140108000) =
0
obreak (0x14020a000) =
0
....

5.3 Intel/Linux
An Intel/Linux system behaved the same as
an Linux/Alpha system, with a fluctuating mmap() value and an increasing number
of faults.
strace
./malloc_test_i386 1
....
old_mmap(NULL,
1052672, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS,
-1, 0) = 0x4024f000
munmap(0x4024f000,
1052672) = 0
old_mmap(NULL,
1052672,
PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS,
-1, 0) = 0x4024f000
munmap(0x4024f000,
1052672) = 0
old_mmap(NULL,
1052672,
PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS,
-1, 0) = 0x4024f000
munmap(0x4024f000,
1052672) = 0
.....

To test this malloc issue on another operating system, FreeBSD was
installed on VMware[9], a very fast
i386 virtual machine.
FreeBSD did not display the increasing
number of page faults. FreeBSD used break() to set memory limits, much like
Tru64 UNIX.
...
2814 pagefault CALL
break(0x4000)
2814 pagefault RET
break 0
2814 pagefault CALL
break(0x104000)
2814 pagefault RET
break 0
2814 pagefault CALL
break(0x14000)
2814 pagefault RET
break 0
2814 pagefault CALL
break(0x114000)
2814 pagefault RET
break 0
2814 pagefault CALL
break(0x14000)
2814 pagefault RET
break 0
2814 pagefault CALL
break(0x114000)
2814 pagefault RET
break 0
2814 pagefault CALL
break(0x14000)
2814 pagefault RET
break 0
....

|
|
Linux |
Tru64 UNIX |
Linux |
FreeBSD |
|
Architecture |
Alpha |
Alpha |
Intel |
Intel |
|
Allocation |
mmap |
obreak |
mmap |
Break |
|
Changing Allocation Amount |
Yes |
Yes |
Yes |
Yes |
|
Large # of Minor Faults |
Yes |
No |
Yes |
No |
It appeared as if this problem was unique to Linux, and possibly
mmap.
To understand why different system calls were used (mmap() &
break()) when the same lib memory allocation routine (malloc()) was called, it
is necessary to understand how memory management in Linux/Unix works.
A Linux/Unix process can have three types of memory allocated on
its behalf: stack, heap and mmaped memory.
Stack memory is managed by the operating system, and is not
generally managed by individual processes. Stack memory (or "the
stack") usually contains local variables, and information saved during a
function call.
Stack memory is automatically allocated by the operating system,
when a process needs more. Stack memory is a temporary storage space, which is
not guaranteed to remain allocated for the life of a process.
Heap and mmaped memory are more permanent areas of memory and
remain allocated for the life of a process. Normally, heap and mmapped memory
are managed through malloc, but they can also be managed independently. (A
process can call the memory allocation system calls directly to bypass malloc.)
Heap memory (or "the heap") is managed by the brk()
system call. The brk() system call takes one argument which sets the "end
of heap" for a process. If brk() is passed a value greater than the
process's current brk() value, the size of a process's heap grows to the new
value, and the operating system reserves more memory for the process. If the
value passed to brk() is less than the current brk() value, the size of a
process's heap shrinks to the new value, and the operating system will free
memory from the process. (break(), brk() and obrk() are different names for the
same system call ‘brk()’)
Mmaped memory is managed by the mmap() and munmap() system calls.
When a piece of mmapped memory is to be allocated, mmap() is called the with
size of the requested memory. A pointer to the memory is returned, which is
used by the process. When the memory is to be deallocated, the pointer is
passed to the munmap() system call, and the operating system deallocates the
memory.
Use of mmap()/munmap() is more flexible than brk(), but it has
more size restrictions and a higher overhead per allocation. If a piece of
memory allocated with brk() is not at the end of the heap when it is freed, it
can not be released back to the system as free memory, because the brk()
interface only allows the end of heap memory to be specified. mmap() &
munmap do not suffer this problem.
Some mallocs, GNU libc's in particular, use both heap memory and
stack memory to fulfill allocation. Which type of memory is used depends on the
size of the allocation request.
It was reasoned that at some point below one megabyte allocations,
malloc would start to behave more like a traditional malloc(), using brk()
instead of mmap(). As a result, a test program was rewritten to allow kilobytes
to be specified as an allocation amount instead of megabytes.
#include
<stdlib.h>
#include
<stdio.h>
int main(int argc,
char *argv[]){
char *buffer;
int num_byte;
num_byte = atoi(argv[1])*1024;
while(1)
{buffer=malloc(num_byte);
free(buffer);}
}
After
further investigation under linux, it appeared that 128k was an important
malloc threshold. Three memory allocations close to 128k in size (126k, 127k
and 128k) yielded very different results.
7.1 126k allocation
When malloc
was called with an allocation request of 126k, brk() was used to allocate the
memory. free() did not release the memory; once the end of the heap was set to
"0x8069000", it did not change. Minor page faults did not occur.
strace ./pagefault 126
....
brk(0) = 0x804965c
brk(0x8068e74) =
0x8068e74
brk(0x8069000) =
0x8069000
(Nothing further)

7.2 127k
allocation
When malloc was called with an allocation
request of 127k, brk() was used to allocate the memory. free() released the
memory; the end of the heap fluctuated between 0x806a000 and 0x804a000. A large
number of minor page faults occurred.
strace ./pagefault 127
....
brk(0) = 0x804965c
brk(0x8069274) =
0x8069274
brk(0x806a000) =
0x806a000
brk(0x804a000) =
0x804a000
brk(0x806a000) =
0x806a000
brk(0x804a000) =
0x804a000
...

7.3 128k
allocation
When malloc was called with an allocation
request of 128k, mmap was used to allocate the memory. free() released the
memory, as the repeated calls to mmap showed. A large number of minor page
faults occurred.
strace ./pagefault 128
old_mmap(NULL, 135168,
PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS,
-1, 0) = 0x40116000
munmap(0x40116000,
135168) = 0
old_mmap(NULL, 135168,
PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS,
-1, 0) = 0x40116000
munmap(0x40116000,
135168) = 0
old_mmap(NULL, 135168,
PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS,
-1, 0) = 0x40116000
munmap(0x40116000,
135168) = 0
old_mmap(NULL, 135168,
PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS,
-1, 0) = 0x40116000
munmap(0x40116000, 135168) = 0

|
Allocation
Size |
126k |
127k |
128k |
|
Linux
Allocation Method |
brk |
brk |