
Proceedings of the
Linux Symposium

Volume One

July 21st–24th, 2004
Ottawa, Ontario
Canada


Contents

TCP Connection Passing 9
    Werner Almesberger

Cooperative Linux 23
    Dan Aloni

Build your own Wireless Access Point 33
    Erik Andersen

Run-time testing of LSB Applications 41
    Stuart Anderson

Linux Block IO—present and future 51
    Jens Axboe

Linux AIO Performance and Robustness for Enterprise Workloads 63
    Suparna Bhattacharya

Methods to Improve Bootup Time in Linux 79
    Tim R. Bird

Linux on NUMA Systems 89
    Martin J. Bligh

Improving Kernel Performance by Unmapping the Page Cache 103
    James Bottomley

Linux Virtualization on IBM Power5 Systems 113
    Dave Boutcher

The State of ACPI in the Linux Kernel 121
    Len Brown

Scaling Linux to the Extreme 133
    Ray Bryant

Get More Device Drivers out of the Kernel! 149
    Peter Chubb

Big Servers—2.6 compared to 2.4 163
    Wim A. Coekaerts

Multi-processor and Frequency Scaling 167
    Paul Devriendt

Dynamic Kernel Module Support: From Theory to Practice 187
    Matt Domsch

e100 weight reduction program 203
    Scott Feldman

NFSv4 and rpcsec_gss for linux 207
    J. Bruce Fields

Comparing and Evaluating epoll, select, and poll Event Mechanisms 215
    Louay Gammo

The (Re)Architecture of the X Window System 227
    James Gettys

IA64-Linux perf tools for IO dorks 239
    Grant Grundler

Carrier Grade Server Features in the Linux Kernel 255
    Ibrahim Haddad

Demands, Solutions, and Improvements for Linux Filesystem Security 269
    Michael Austin Halcrow

Hotplug Memory and the Linux VM 287
    Dave Hansen


Conference Organizers

Andrew J. Hutton, Steamballoon, Inc.
Stephanie Donovan, Linux Symposium
C. Craig Ross, Linux Symposium

Review Committee

Jes Sorensen, Wild Open Source, Inc.
Matt Domsch, Dell
Gerrit Huizenga, IBM
Matthew Wilcox, Hewlett-Packard
Dirk Hohndel, Intel
Val Henson, Sun Microsystems
Jamal Hadi Salim, Znyx
Andrew Hutton, Steamballoon, Inc.

Proceedings Formatting Team

John W. Lockhart, Red Hat, Inc.

Authors retain copyright to all submitted papers, but have granted unlimited redistribution rights to all as a condition of submission.


8 • Linux Symposium 2004 • Volume One


TCP Connection Passing

Werner Almesberger
werner@almesberger.net

Abstract

tcpcp is an experimental mechanism that allows cooperating applications to pass ownership of TCP connection endpoints from one Linux host to another one. tcpcp can be used between hosts using different architectures and does not need the other endpoint of the connection to cooperate (or even to know what's going on).

1 Introduction

When designing systems for load-balancing, process migration, or fail-over, there is eventually the point where one would like to be able to "move" a socket from one machine to another one, without losing the connection on that socket, similar to file descriptor passing on a single host. Such a move operation usually involves at least three elements:

1. Moving any application space state related to the connection to the new owner. E.g. in the case of a Web server serving large static files, the application state could simply be the file name and the current position in the file.

2. Making sure that packets belonging to the connection are sent to the new owner of the socket. Normally this also means that the previous owner should no longer receive them.

3. Last but not least, creating compatible network state in the kernel of the new connection owner, such that it can resume the communication where the previous owner left off.

Figure 1: Passing one end of a TCP connection from one host to another.

Figure 1 illustrates this for the case of a client-server application, where one server passes ownership of a connection to another server. We shall call the host from which ownership of the connection endpoint is taken the origin, the host to which it is transferred the destination, and the host on the other end of the connection (which does not change) the peer.

Details of moving the application state are beyond the scope of this paper, and we will only sketch relatively simple examples. Similarly, we will mention a few ways in which the redirection in the network can be accomplished, but without going into too much detail.



The complexity of the kernel state of a network connection, and the difficulty of moving this state from one host to another, varies greatly with the transport protocol being used. Among the two major transport protocols of the Internet, UDP [1] and TCP [2], the latter clearly presents more of a challenge in this regard. Nevertheless, some issues also apply to UDP.

tcpcp (TCP Connection Passing) is a proof of concept implementation of a mechanism that allows applications to transport the kernel state of a TCP endpoint from one host to another, while the connection is established, and without requiring the peer to cooperate in any way. tcpcp is not a complete process migration or load-balancing solution, but rather a building block that can be integrated into such systems.

tcpcp consists of a kernel patch (at the time of writing for version 2.6.4 of the Linux kernel) that implements the operations for dumping and restoring the TCP connection endpoint, a library with wrapper functions (see Section 3), and a few applications for debugging and demonstration.

The project's home page is at http://tcpcp.sourceforge.net/

The remainder of this paper is organized as follows: this section continues with a description of the context in which connection passing exists. Section 2 explains the connection passing operation in detail. Section 3 introduces the APIs tcpcp provides. The information that defines a TCP connection and its state is described in Section 4. Sections 5 and 6 discuss congestion control and the limitations TCP imposes on checkpointing. Security implications of the availability and use of tcpcp are examined in Section 7. We conclude with an outlook on the future direction of the work on tcpcp in Section 8, and the conclusions in Section 9.

The excellent "TCP/IP Illustrated" [3] is recommended for readers who wish to refresh their memory of TCP/IP concepts and terminology.

1.1 There is more than one way to do it

tcpcp is only one of several possible methods for passing TCP connections among hosts. Here are some alternatives:

In some cases, the solution is to avoid passing the "live" TCP connection, but to terminate the connection between the origin and the peer, and rely on higher protocol layers to establish a new connection between the destination and the peer. Drawbacks of this approach include that those higher layers need to know that they have to re-establish the connection, and that they need to do this within an acceptable amount of time. Furthermore, they may only be able to do this at a few specific points during a communication.

The use of HTTP redirection [4] is a simple example of connection passing above the transport layer.

Another approach is to introduce an intermediate layer between the application and the kernel, for the purpose of handling such redirection. This approach is fairly common in process migration solutions, such as Mosix [5], MIGSOCK [6], etc. It requires that the peer be equipped with the same intermediate layer.

1.2 Transparency

The key feature of tcpcp is that the peer can be left completely unaware that the connection is passed from one host to another. In detail, this means:

• The peer's networking stack can be used "as is," without modification and without requiring non-standard functionality

• The connection is not interrupted

• The peer does not have to stop sending

• No contradictory information is sent to the peer

• These properties apply to all protocol layers visible to the peer

Furthermore, tcpcp allows the connection to be passed at any time, without needing to synchronize the data stream with the peer.

The kernels of the hosts between which the connection is passed both need to support tcpcp, and the application(s) on these hosts will typically have to be modified to perform the connection passing.

1.3 Various uses

Application scenarios in which the functionality provided by tcpcp could be useful include load balancing, process migration, and fail-over.

In the case of load balancing, an application can send connections (and whatever processing is associated with them) to another host if the local one gets overloaded. Or, one could have a host acting as a dispatcher that may perform an initial dialog and then assigns the connection to a machine in a farm.

For process migration, tcpcp would be invoked when moving a file descriptor linked to a socket. If process migration is implemented in the kernel, an interface would have to be added to tcpcp to allow calling it in this way.

Fail-over is trickier, because there is normally no prior indication when the origin will become unavailable. We discuss the issues arising from this in more detail in Section 6.

2 Passing the connection

Figure 2 illustrates the connection passing procedure in detail.

1. The application at the origin initiates the procedure by requesting retrieval of what we call the Internal Connection Information (ICI) of a socket. The ICI contains all the information the kernel needs to re-create a TCP connection endpoint.

2. As a side-effect of retrieving the ICI, tcpcp isolates the connection: all incoming packets are silently discarded, and no packets are sent. This is accomplished by setting up a per-socket filter, and by changing the output function. Isolating the socket ensures that the state of the connection being passed remains stable at either end.

3. The kernel copies all relevant variables, plus the contents of the out-of-order and send/retransmit buffers, to the ICI. The out-of-order buffer contains TCP segments that have not been acknowledged yet, because an earlier segment is still missing.

4. After retrieving the ICI, the application empties the receive buffer. It can either process this data directly, or send it along with the other information, for the destination to process.

5. The origin sends the ICI and any relevant application state to the destination. The application at the origin keeps the socket open, to ensure that it stays isolated.

6. The destination opens a new socket. It may then bind it to a new port (there are other choices, described below).



Figure 2: Passing a TCP connection endpoint in ten easy steps.

7. The application at the destination now sets the ICI on the socket. The kernel creates and populates the necessary data structures, but does not send any data yet. The current implementation makes no use of the out-of-order data.

8. Network traffic belonging to the connection is redirected from the origin to the destination host. Scenarios for this are described in more detail below. The application at the origin can now close the socket.

9. The application at the destination makes a call to activate the connection.

10. If there is data to transmit, the kernel will do so. If there is no data, an otherwise empty ACK segment (like a window probe) is sent to wake up the peer.

Note that, at the end of this procedure, the socket at the destination is a perfectly normal TCP endpoint. In particular, this endpoint can be passed to another host (or back to the original one) with tcpcp.
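The dump/restore idea behind these ten steps can be modeled in a few lines of user-space C. This is only an illustrative sketch: `struct toy_ici`, `get_ici`, `set_ici`, and `activate` are invented names, and the real ICI (described in Section 4) carries far more state and lives in the kernel behind the socket options introduced in Section 3.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for the Internal Connection Information (ICI).
 * The real ICI also carries fixed handshake parameters and the
 * out-of-order and send/retransmit buffers. */
struct toy_ici {
    uint32_t snd_nxt;   /* next sequence number to send */
    uint32_t rcv_nxt;   /* next sequence number expected */
    uint16_t snd_wnd;   /* window received from the peer */
};

struct toy_endpoint {
    int isolated;       /* while set, packets are dropped (step 2) */
    struct toy_ici st;
};

/* Steps 1 and 2: retrieve the ICI; isolation is a side-effect. */
static struct toy_ici get_ici(struct toy_endpoint *ep)
{
    ep->isolated = 1;
    return ep->st;
}

/* Step 7: re-create the endpoint state at the destination,
 * which stays quiet until it is activated. */
static void set_ici(struct toy_endpoint *ep, struct toy_ici ici)
{
    ep->st = ici;
    ep->isolated = 1;
}

/* Step 9: activate the connection; the endpoint may send again. */
static void activate(struct toy_endpoint *ep)
{
    ep->isolated = 0;
}
```

Note how the origin stays isolated after `get_ici`, mirroring the requirement that the origin's application keep the socket open (and silent) until traffic has been switched over.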

2.1 Local port selection

The local port at the destination can be selected in three ways:

• The destination can simply try to use the same port as the origin. This is necessary if no address translation is performed on the connection.

• The application can bind the socket before setting the ICI. In this case, the port in the ICI is ignored.

• The application can also clear the port information in the ICI, which will cause the socket to be bound to any available port. Compared to binding the socket before setting the ICI, this approach has the advantage of using the local port number space much more efficiently.

The choice of the port selection method depends on how the environment in which tcpcp operates is structured. Normally, either the first or the last method would be used.

2.2 Switching network traffic

There are countless ways of redirecting IP packets from one host to another, without help from the transport layer protocol. They include redirection at the link layer, ingenious modifications of how link and network layer interact [7], all kinds of tunnels, network address translation (NAT), etc.

Since many of the techniques are similar to network-based load balancing, the Linux Virtual Server Project [8] is a good starting point for exploring these issues.

While a comprehensive study of this topic is beyond the scope of this paper, we will briefly sketch an approach using a static route, because this is conceptually straightforward and relatively easy to implement.

Figure 3: Redirecting network traffic using a static route.

The scenario shown in Figure 3 consists of two servers A and B, with interfaces with the IP addresses ipA and ipB, respectively. Each server also has a virtual interface with the address ipX. ipA, ipB, and ipX are on the same subnet, and the gateway machine also has an interface on this subnet.

At the gateway, we create a static route as follows:

    route add ipX gw ipA

When the client connects to the address ipX, it reaches host A. We can now pass the connection to host B, as outlined in Section 2. In Step 8, we change the static route on the gateway as follows:

    route del ipX
    route add ipX gw ipB

One major limitation of this approach is of course that this routing change affects all connections to ipX, which is usually undesirable. Nevertheless, this simple setup can be used to demonstrate the operation of tcpcp.

3 APIs

The API for tcpcp consists of a low-level part that is based on getting and setting socket options, and a high-level library that provides convenient wrappers for the low-level API.

We mention only the most important aspects of both APIs here. They are described in more detail in the documentation that is included with tcpcp.

3.1 Low-level API

The ICI is retrieved by getting the TCP_ICI socket option. As a side-effect, the connection is isolated, as described in Section 2. The application can determine the maximum ICI size for the connection in question by getting the TCP_MAXICISIZE socket option.

Example:

    void *buf;
    int ici_size;
    socklen_t size = sizeof(int);

    getsockopt(s, SOL_TCP, TCP_MAXICISIZE, &ici_size, &size);
    buf = malloc(ici_size);
    size = ici_size;
    getsockopt(s, SOL_TCP, TCP_ICI, buf, &size);

The connection endpoint at the destination is created by setting the TCP_ICI socket option, and the connection is activated by "setting" the TCP_CP_FN socket option to the value TCPCP_ACTIVATE.¹

Example:

    int sub_function = TCPCP_ACTIVATE;

    setsockopt(s, SOL_TCP, TCP_ICI, buf, size);
    /* ... */
    setsockopt(s, SOL_TCP, TCP_CP_FN, &sub_function, sizeof(sub_function));

3.2 High-level API

These are the most important functions provided by the high-level API:

    void *tcpcp_get(int s);
    int tcpcp_size(const void *ici);
    int tcpcp_create(const void *ici);
    int tcpcp_activate(int s);

tcpcp_get allocates a buffer for the ICI, and retrieves that ICI (isolating the connection as a side-effect). The amount of data in the ICI can be queried by calling tcpcp_size on it. tcpcp_create sets an ICI on a socket, and tcpcp_activate activates the connection.

¹ The use of a multiplexed socket option is admittedly ugly, although convenient during development.

4 Describing a TCP endpoint

In this section, we describe the parameters that define a TCP connection and its state. tcpcp collects all the information it needs to re-create a TCP connection endpoint in a data structure we call Internal Connection Information (ICI). The ICI is portable among systems supporting tcpcp, irrespective of their CPU architecture.

Besides this data, the kernel maintains a large number of additional variables that can either be reset to default values at the destination (such as congestion control state), or that are only rarely used and not essential for correct operation of TCP (such as statistics).

4.1 Connection identifier

Each TCP connection in the global Internet or any private internet [9] is uniquely identified by the IP addresses of the source and destination host, and the port numbers used at both ends. tcpcp currently only supports IPv4, but can be extended to support IPv6, should the need arise.

4.2 Fixed data

A few parameters of a TCP connection are negotiated during the initial handshake, and remain unchanged during the lifetime of the connection. These parameters include whether window scaling, timestamps, or selective acknowledgments are used, the number of bits by which the window is shifted, and the maximum segment sizes (MSS).

These parameters are used mainly for sanity checks, and to determine whether the destination host is able to handle the connection. The received MSS continues of course to limit the segment size.

Connection identifier
    ip.v4.ip_src  IPv4 address of the host on which the ICI was recorded (source)
    ip.v4.ip_dst  IPv4 address of the peer (destination)
    tcp_sport     Port at the source host
    tcp_dport     Port at the destination host
Fixed at connection setup
    tcp_flags     TCP flags (window scale, SACK, ECN, etc.)
    snd_wscale    Send window scale
    rcv_wscale    Receive window scale
    snd_mss       Maximum Segment Size at the source host
    rcv_mss       MSS at the destination host
Connection state
    state         TCP connection state (e.g. ESTABLISHED)
Sequence numbers
    snd_nxt       Sequence number of next new byte to send
    rcv_nxt       Sequence number of next new byte expected to receive
Windows (flow-control)
    snd_wnd       Window received from peer
    rcv_wnd       Window advertised to peer
Timestamps
    ts_gen        Current value of the timestamp generator
    ts_recent     Most recently received timestamp

Table 1: TCP variables recorded in tcpcp's Internal Connection Information (ICI) structure.
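One such sanity check can be sketched as follows. The helper names are invented for illustration (tcpcp's actual checks are in the kernel patch); the arithmetic, however, follows the TCP header format: the wire window field is 16 bits, shifted left by the window scale fixed at connection setup.

```c
#include <assert.h>
#include <stdint.h>

/* Effective window: the 16-bit header field shifted left by the
 * receive window scale negotiated during the initial handshake. */
static uint32_t effective_window(uint16_t wire_wnd, uint8_t wscale)
{
    return (uint32_t)wire_wnd << wscale;
}

/* Illustrative sanity check: the destination must be able to
 * buffer at least the window the origin already advertised to
 * the peer, since an advertised window cannot be retracted. */
static int dest_can_accept(uint32_t advertised_wnd, uint32_t dest_rcvbuf)
{
    return dest_rcvbuf >= advertised_wnd;
}
```

A destination whose receive buffer is smaller than the origin's advertised window would have to shrink the window, which TCP forbids; rejecting the ICI up front avoids that.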

4.3 Sequence numbers

The sequence numbers are used to synchronize all aspects of a TCP connection. Only the sequence numbers we expect to see in the network, in either direction, are needed when re-creating the endpoint. The kernel uses several variables that are derived from these sequence numbers. The values of these variables either coincide with snd_nxt and rcv_nxt in the state we set up, or they can be calculated by examining the send buffer.
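As a hedged illustration of such a derived variable: the left edge of the send window (the oldest unacknowledged byte, commonly called snd_una) need not be stored in the ICI, because it can be recomputed from snd_nxt and the amount of data still sitting in the send/retransmit buffer, with all arithmetic modulo 2^32 as for TCP sequence numbers.

```c
#include <assert.h>
#include <stdint.h>

/* Recover the oldest unacknowledged sequence number from snd_nxt
 * and the number of bytes in the send/retransmit buffer.
 * uint32_t subtraction wraps modulo 2^32, exactly like TCP
 * sequence arithmetic, so this also works across a wrap. */
static uint32_t derive_snd_una(uint32_t snd_nxt, uint32_t unacked_bytes)
{
    return snd_nxt - unacked_bytes;
}
```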

4.4 Windows (flow-control)

The (flow-control) window determines how much more data can be sent or received without overrunning the receiver's buffer. The window the origin received from the peer is also the window we can use after re-creating the endpoint. The window the origin advertised to the peer defines the minimum receive buffer size at the destination.



4.5 Timestamps

TCP can use timestamps to detect old segments with wrapped sequence numbers [10]. This mechanism is called Protect Against Wrapped Sequence numbers (PAWS).

Linux uses a global counter (tcp_time_stamp) to generate local timestamps. If a moved connection were to use the counter at the new host, local round-trip-time calculation may be confused when receiving timestamp replies from the previous connection, and the peer's PAWS algorithm will discard segments if timestamps appear to have jumped back in time.

Just turning off timestamps when moving the connection is not an acceptable solution, even though [10] seems to allow TCP to just stop sending timestamps, because doing so would bring back the problem PAWS tries to solve in the first place, and it would also reduce the accuracy of round-trip-time estimates, possibly degrading the throughput of the connection.

A more satisfying solution is to synchronize the local timestamp generator. This is accomplished by introducing a per-connection timestamp offset that is added to the value of tcp_time_stamp. This calculation is hidden in the macro tp_time_stamp(tp), which just becomes tcp_time_stamp if the kernel is configured without tcpcp. The addition of the timestamp offset is the only major change tcpcp requires in the existing TCP/IP stack.
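A minimal user-space model of that offset, under stated assumptions: `local_ts_counter`, `struct conn`, and `sync_ts` are illustrative stand-ins, and in the real patch the offset lives in the kernel's per-connection state behind tp_time_stamp(tp). The offset is chosen so that the connection's timestamps continue from the origin's generator (ts_gen in the ICI) rather than jumping backwards.

```c
#include <assert.h>
#include <stdint.h>

/* Models the host-global tcp_time_stamp counter. */
static uint32_t local_ts_counter;

struct conn {
    uint32_t ts_offset;   /* per-connection offset, as added by tcpcp */
};

/* When the endpoint is re-created, pick the offset so that
 * generated timestamps continue from the origin's ts_gen.
 * uint32_t arithmetic wraps modulo 2^32, so this is safe even
 * when the two counters straddle a wrap. */
static void sync_ts(struct conn *c, uint32_t ici_ts_gen)
{
    c->ts_offset = ici_ts_gen - local_ts_counter;
}

/* Models the tp_time_stamp(tp) macro: global counter plus the
 * per-connection offset. */
static uint32_t tp_time_stamp(const struct conn *c)
{
    return local_ts_counter + c->ts_offset;
}
```

After `sync_ts`, timestamps keep advancing at the local tick rate but from the origin's value, which is what keeps both the peer's PAWS check and its round-trip-time estimation consistent.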

4.6 Receive buffers

There are two buffers at the receiving side: the buffer containing segments received out-of-order (see Section 2), and the buffer with data that is ready for retrieval by the application. tcpcp currently ignores both buffers: the out-of-order buffer is copied into the ICI, but not used when setting up the new socket. Any data in the receive buffer is left for the application to read and process.

4.7 Send buffer

The send and retransmit buffer contains data that is no longer accessible through the socket API, and that cannot be discarded. It is therefore placed in the ICI, and used to populate the send buffer at the destination.

4.8 Selective acknowledgments

In Section 5 of [11], the use of inbound SACK information is left optional. tcpcp takes advantage of this, and preserves neither the SACK information collected from inbound segments, nor the history of SACK information sent to the peer.

Outbound SACKs convey information about the receiver's out-of-order queue. Fortunately, [11] declares this information as purely advisory. In particular, if reception of data has been acknowledged with a SACK, this does not imply that the receiver has to remember having done so. First, it can request retransmission of this data, and second, when constructing new SACKs, the receiver is encouraged to include information from previous SACKs, but is under no obligation to do so. Therefore, while [11] discourages losing SACK information, doing so does not violate its requirements.

Losing SACK information may temporarily degrade the throughput of the TCP connection. This is currently of little concern, because tcpcp forces the connection into slow start, which has even more drastic performance implications. SACK recovery may need to be reconsidered once tcpcp implements more sophisticated congestion control.

4.9 Other data

The TCP connection state is currently always ESTABLISHED. It may be useful to also allow passing connections in earlier states, e.g. SYN_RCVD. This is for further study.

Congestion control data and statistics are currently omitted. The new connection starts with slow start, to allow TCP to discover the characteristics of the new path to the peer.


5 Congestion control

Most of the complexity of TCP is in its congestion control. tcpcp currently avoids touching congestion control almost entirely, by setting the destination to slow start.

This is a highly conservative approach that is appropriate if knowing the characteristics of the path between the origin and the peer does not give us any information on the characteristics of the path between the destination and the peer, as shown in the lower part of Figure 4. However, if the characteristics of the two paths can be expected to be very similar, e.g. if the hosts passing the connection are on the same LAN, better performance could be achieved by allowing tcpcp to resume the connection at or nearly at full speed.

Re-establishing congestion control state is for further study. To avoid abuse, such an operation can be made available only to sufficiently trusted applications.

Figure 4: Depending on the structure of the network, the congestion control state of the original connection may or may not be reused.
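For reference, restarting in slow start means the congestion window grows again from one segment, doubling each round trip until it reaches the slow-start threshold. The following is a schematic model of that growth, not the kernel's actual congestion control code, and it ignores losses and congestion avoidance beyond the threshold.

```c
#include <assert.h>
#include <stdint.h>

/* Schematic slow start: cwnd (in segments) starts at 1 and
 * doubles every round trip until it reaches ssthresh, where
 * growth would hand over to congestion avoidance. */
static uint32_t cwnd_after_rtts(uint32_t ssthresh, unsigned rtts)
{
    uint32_t cwnd = 1;              /* restart from one segment */
    while (rtts-- && cwnd < ssthresh) {
        cwnd *= 2;
        if (cwnd > ssthresh)
            cwnd = ssthresh;        /* clamp at the threshold */
    }
    return cwnd;
}
```

The exponential ramp explains the paper's observation: for a connection passed between hosts on the same LAN, several round trips of throttled throughput are wasted rediscovering a path whose capacity was already known.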

6 Checkpointing

tcpcp is primarily designed for scenarios where the old and the new connection owner are both functional during the process of connection passing.

A similar usage scenario would be if the node owning the connection occasionally retrieves ("checkpoints") the momentary state of the connection, and after failure of the connection owner, another node would then use the checkpoint data to resurrect the connection.

While apparently similar to connection passing, checkpointing presents several problems which we discuss in this section. Note that this is speculative and that the current implementation of tcpcp does not support any of the extensions discussed here.

We consider the send and receive flow of the TCP connection separately, and we assume that sequence numbers can be directly translated to application state (e.g. when transferring a file, application state consists only of the actual file position, which can be trivially mapped to and from TCP sequence numbers). Furthermore, we assume the connection to be in ESTABLISHED state at both ends.

6.1 Outbound data<br />

<strong>One</strong> or more of the following events may occur<br />

between the last checkpoint and the moment<br />

the connection is resurrected:<br />

• the sender may have enqueued more data<br />

• the receiver may have acknowledged<br />

more data<br />

• the receiver may have retrieved more data,<br />

thereby growing its window<br />

Assuming that no additional data has been received<br />

from the peer, the new sender can simply<br />

re-transmit the last segment. (Alternatively,<br />

tcp_xmit_probe_skb might be useful for<br />

the same purpose.) In this case, the following<br />

protocol violations can occur:<br />

• The sequence number may have wrapped. This can be avoided by making sure that a checkpoint is never older than the Maximum Segment Lifetime (MSL)², and that less than 2³¹ bytes are sent between checkpoints.<br />

• If using PAWS, the timestamp may be below the last timestamp sent by the old sender. The best solution for avoiding this is probably to tightly synchronize the clocks on the old and the new connection owner, and to make a conservative estimate of the number of ticks of the local timestamp clock that have passed since taking the checkpoint. This assumes that the timestamp clock ticks roughly in real time.<br />

² [2] specifies an MSL of two minutes.<br />

Since new data in the segment sent after resurrecting<br />

the connection cannot exceed the receiver’s<br />

window, the only possible outcomes<br />

are that the segment contains either new data,<br />

or only old data. In either case, the receiver<br />

will acknowledge the segment.<br />

Upon reception of an acknowledgment, either<br />

in response to the retransmitted segment, or<br />

from a packet in flight at the time when the connection<br />

was resurrected, the sender knows how<br />

far the connection state has advanced since the<br />

checkpoint was taken.<br />

If the sequence number from the acknowledgment<br />

is below snd_nxt, no special action<br />

is necessary. If the sequence number is<br />

above snd_nxt, the sender would exceptionally<br />

treat this as a valid acknowledgment.³<br />

As a possible performance improvement, the<br />

sender may notify the application once a new<br />

sequence number has been received, and the<br />

application could then skip over unnecessary<br />

data.<br />

6.2 Inbound data<br />

The main problem with checkpointing of incoming data is that TCP will acknowledge data that has not yet been retrieved by the application. Therefore, checkpointing would have to delay outbound acknowledgments until the application has actually retrieved the data and checkpointed the resulting state change.<br />

³ Note that this exceptional condition does not necessarily have to occur with the first acknowledgment received.<br />

To intercept all types of ACKs, tcp_transmit_skb would have to be changed to send tp->copied_seq instead of tp->rcv_nxt. Furthermore, a new API function would be needed to trigger an explicit acknowledgment after the data has been stored or processed.<br />

Putting acknowledgments under application control would change their timing. This may upset the peer's round-trip time estimation, and it may also cause the peer to falsely assume changes in the congestion level along the path.<br />

7 Security<br />

tcpcp bypasses various sets of access and consistency<br />

checks normally performed when setting<br />

up TCP connections. This section analyzes<br />

the overall security impact of tcpcp.<br />

7.1 Two lines of defense<br />

When setting TCP_ICI, the kernel has no<br />

means of verifying that the connection information<br />

actually originates from a compatible<br />

system. Users may therefore manipulate connection<br />

state, copy connection state from arbitrary<br />

other systems, or even synthesize connection<br />

state according to their wishes. tcpcp provides two mechanisms to protect against intentional or accidental misuse:<br />

1. tcpcp takes as little information as possible from the user, and re-generates as much of the state related to the TCP connection (such as neighbour and destination data) as possible from local information. Furthermore, it performs a number of sanity checks on the ICI to ensure its integrity and its compatibility with the constraints of the local system (such as buffer size limits and kernel capabilities).<br />

2. Many manipulations possible through<br />

tcpcp can be shown to be available<br />

through other means if the application has<br />

the CAP_NET_RAW capability. <strong>The</strong>refore,<br />

establishing a new TCP connection<br />

with tcpcp also requires this capability.<br />

This can be relaxed on a host-wide basis.<br />

7.2 Retrieval of sensitive kernel data<br />

Getting TCP_ICI may retrieve information<br />

from the kernel that one would like to hide<br />

from unprivileged applications, e.g. details<br />

about the state of the TCP ISN generator. Since<br />

the equally unprivileged TCP_INFO already<br />

gives access to most TCP connection metadata,<br />

tcpcp does not create any new vulnerabilities.<br />

7.3 Local denial of service<br />

Setting TCP_ICI could be used to introduce<br />

inconsistent data in the TCP stack, or the kernel<br />

in general. Preventing this relies on the correctness<br />

and completeness of the sanity checks<br />

mentioned before.<br />

tcpcp can be used to accumulate stale data in<br />

the kernel. However, this is not very different<br />

from e.g. creating a large number of unused<br />

sockets, or letting buffers fill up in TCP connections,<br />

and therefore poses no new security<br />

threat.<br />

tcpcp can be used to shut down connections belonging to third-party applications, provided that the usual access restrictions grant access to copies of their socket descriptors. This is similar to executing shutdown on such sockets, and is therefore believed to pose no new threat.



7.4 Restricted state transitions<br />

tcpcp could be used to advance TCP connection<br />

state past boundaries imposed by internal<br />

or external control mechanisms. In particular,<br />

conspiring applications may create TCP connections<br />

without ever exchanging SYN packets,<br />

bypassing SYN-filtering firewalls. Since such firewalls can already be avoided by privileged applications, sites depending on SYN filtering should use the default setting of tcpcp, which makes its use a privileged operation as well.<br />

7.5 Attacks on remote hosts<br />

The ability to set TCP_ICI makes it easy to commit all kinds of protocol violations. While tcpcp may simplify implementing such attacks, this type of abuse has always been possible for privileged users, and therefore tcpcp poses no new security threat to systems properly resistant against network attacks.<br />

However, at a site where only trusted users are able to communicate with otherwise shielded systems that have known remote TCP vulnerabilities, tcpcp could be used for attacks. Such sites should use the default setting, which makes setting TCP_ICI a privileged operation.<br />

7.6 Security summary<br />

To summarize, the author believes that the design<br />

of tcpcp does not open any new exploits if<br />

tcpcp is used in its default configuration.<br />

Obviously, some subtleties have probably been<br />

overlooked, and there may be bugs inadvertently<br />

leading to vulnerabilities. <strong>The</strong>refore,<br />

tcpcp should receive public scrutiny before being<br />

considered fit for regular use.<br />

8 Future work<br />

To allow faster connection passing among<br />

hosts that share the same, or a very similar path<br />

to the peer, tcpcp should try to avoid going to<br />

slow start. To do so, it will have to pass more<br />

congestion control information, and integrate it<br />

properly at the destination.<br />

Although not strictly part of tcpcp, the redirection<br />

apparatus for the network should be further<br />

extended, in particular to allow individual<br />

connections to be redirected at that point too,<br />

and to include some middleware that coordinates<br />

the redirecting with the changes at the<br />

hosts passing the connection.<br />

It would be very interesting if connection passing<br />

could also be used for checkpointing. <strong>The</strong><br />

analysis in Section 6 suggests that at least limited<br />

checkpointing capabilities should be feasible<br />

without interfering with regular TCP operation.<br />

<strong>The</strong> inner workings of TCP are complex and<br />

easily disturbed. It is therefore important to<br />

subject tcpcp to thorough testing, in particular<br />

in transient states, such as during recovery<br />

from lost segments. The umlsim simulator [12] makes it possible to generate such conditions in a deterministic way, and will be used for these tests.<br />

9 Conclusion<br />

tcpcp is a proof of concept implementation that<br />

successfully demonstrates that an endpoint of<br />

a TCP connection can be passed from one host<br />

to another without involving the host at the opposite<br />

end of the TCP connection. tcpcp also<br />

shows that this can be accomplished with a relatively<br />

small amount of kernel changes.<br />

tcpcp in its present form is suitable for experimental<br />

use as a building block for load balancing<br />

and process migration solutions. Future



work will focus on improving the performance<br />

of tcpcp, on validating its correctness, and on<br />

exploring checkpointing capabilities.<br />

References<br />

[1] RFC768; Postel, Jon. User Datagram Protocol, IETF, August 1980.<br />

[2] RFC793; Postel, Jon. Transmission Control Protocol, IETF, September 1981.<br />

[3] Stevens, W. Richard. TCP/IP Illustrated, Volume 1 – The Protocols, Addison-Wesley, 1994.<br />

[4] RFC2616; Fielding, Roy T.; Gettys, James; Mogul, Jeffrey C.; Frystyk Nielsen, Henrik; Masinter, Larry; Leach, Paul J.; Berners-Lee, Tim. Hypertext Transfer Protocol – HTTP/1.1, IETF, June 1999.<br />

[5] Bar, Moshe. OpenMosix, Proceedings of the 10th International Linux System Technology Conference (Linux-Kongress 2003), pp. 94–102, October 2003.<br />

[6] Kuntz, Bryan; Rajan, Karthik. MIGSOCK – Migratable TCP Socket in Linux, CMU, M.Sc. Thesis, February 2002. http://www-2.cs.cmu.edu/~softagents/migsock/MIGSOCK.pdf<br />

[7] Leite, Fábio Olivé. Load-Balancing HA Clusters with No Single Point of Failure, Proceedings of the 9th International Linux System Technology Conference (Linux-Kongress 2002), pp. 122–131, September 2002. http://www.linux-kongress.org/2002/papers/lk2002-leite.html<br />

[8] Linux Virtual Server Project, http://www.linuxvirtualserver.org/<br />

[9] RFC1918; Rekhter, Yakov; Moskowitz, Robert G.; Karrenberg, Daniel; de Groot, Geert Jan; Lear, Eliot. Address Allocation for Private Internets, IETF, February 1996.<br />

[10] RFC1323; Jacobson, Van; Braden, Bob; Borman, Dave. TCP Extensions for High Performance, IETF, May 1992.<br />

[11] RFC2018; Mathis, Matt; Mahdavi, Jamshid; Floyd, Sally; Romanow, Allyn. TCP Selective Acknowledgement Options, IETF, October 1996.<br />

[12] Almesberger, Werner. UML Simulator, Proceedings of the Ottawa Linux Symposium 2003, July 2003. http://archive.linuxsymposium.org/ols2003/Proceedings/All-Reprints/Reprint-Almesberger-OLS2003.pdf<br />




Cooperative <strong>Linux</strong><br />

Dan Aloni<br />

da-x@colinux.org<br />

Abstract<br />

In this paper I’ll describe Cooperative <strong>Linux</strong>, a<br />

port of the <strong>Linux</strong> kernel that allows it to run as<br />

an unprivileged lightweight virtual machine in<br />

kernel mode, on top of another OS kernel. It allows<br />

<strong>Linux</strong> to run under any operating system<br />

that supports loading drivers, such as Windows<br />

or <strong>Linux</strong>, after minimal porting efforts. <strong>The</strong> paper<br />

includes the present and future implementation<br />

details, its applications, and its comparison<br />

with other <strong>Linux</strong> virtualization methods.<br />

Among the technical details I’ll present the<br />

CPU-complete context switch code, hardware<br />

interrupt forwarding, the interface between the<br />

host OS and <strong>Linux</strong>, and the management of the<br />

VM’s pseudo physical RAM.<br />

1 Introduction<br />

Cooperative Linux utilizes the rather underused concept of a Cooperative Virtual Machine (CVM), in contrast to traditional VMs, which are unprivileged and under the complete control of the host machine.<br />

<strong>The</strong> term Cooperative is used to describe two<br />

entities working in parallel, e.g. coroutines [2].<br />

In that sense the most plain description of Cooperative<br />

<strong>Linux</strong> is turning two operating system<br />

kernels into two big coroutines. In that<br />

mode, each kernel has its own complete CPU<br />

context and address space, and each kernel decides<br />

when to give control back to its partner.<br />

However, only one of the two kernels has control over the physical hardware, while the other is provided only with a virtual hardware abstraction.<br />

From this point on in the paper I’ll refer<br />

to these two kernels as the host operating system,<br />

and the guest Linux VM, respectively. The host can be any OS kernel that exports the basic primitives that allow the Cooperative Linux portable driver to run in CPL0 mode (ring 0) and to allocate memory.<br />

<strong>The</strong> special CPL0 approach in Cooperative<br />

Linux makes it significantly different from<br />

traditional virtualization solutions such as<br />

VMware, plex86, Virtual PC, and other methods<br />

such as Xen. All of these approaches work<br />

by running the guest OS in a less privileged mode than that of the host kernel. This approach allowed for the extensive simplification of Cooperative Linux's design and for its short early-beta development cycle, which lasted only one month, starting from scratch by modifying the vanilla Linux 2.4.23-pre9 release until reaching the point where KDE could run.<br />

The only downsides to the CPL0 approach are stability and security. If it's unstable, it has the potential to crash the system. However, measures can be taken, such as cleanly shutting it down on the first internal Oops or panic. Another disadvantage is security. Acquiring root user access on a Cooperative Linux machine can potentially lead to root on the host machine if the attacker loads a specially crafted kernel module or, in case the Cooperative Linux kernel was compiled without module support, uses some very elaborate exploit.<br />



Most of the changes in the Cooperative <strong>Linux</strong><br />

patch are on the i386 tree—the only supported<br />

architecture for Cooperative Linux at the time of this<br />

writing. <strong>The</strong> other changes are mostly additions<br />

of virtual drivers: cobd (block device),<br />

conet (network), and cocon (console). Most of<br />

the changes in the i386 tree involve the initialization<br />

and setup code. It is a goal of the Cooperative<br />

<strong>Linux</strong> kernel design to remain as close<br />

as possible to the standalone i386 kernel, so all<br />

changes are localized and minimized as much<br />

as possible.<br />

2 Uses<br />

Cooperative <strong>Linux</strong> in its current early state<br />

can already provide some of the uses that<br />

User Mode <strong>Linux</strong>[1] provides, such as virtual<br />

hosting, kernel development environment,<br />

research, and testing of new distributions or<br />

buggy software. It also enabled new uses:<br />

• Relatively effortless migration path from Windows. In the process of switching to another OS, there is the choice between installing another computer, dual-booting, or using virtualization software. The first option costs money, the second is tiresome in terms of operation, but the third can be the quickest and easiest method—especially if it's free. This is where Cooperative Linux comes in. It is already used in workplaces to convert Windows users to Linux.<br />

• Adding Windows machines to Linux clusters. The Cooperative Linux patch is minimal and can easily be combined with others, such as the MOSIX or OpenMOSIX patches that add clustering capabilities to the kernel. This work in progress makes it possible to add Windows machines to super-computer clusters. One illustration: a secretary's workstation runs Cooperative Linux as a screen saver—when the secretary goes home at the end of the day and leaves the computer unattended, the office's cluster gets more CPU cycles for free.<br />

• Running an otherwise-dual-booted<br />

<strong>Linux</strong> system from the other OS. <strong>The</strong><br />

Windows port of Cooperative <strong>Linux</strong><br />

allows it to mount real disk partitions<br />

as block devices. Numerous people are<br />

using this in order to access, rescue, or<br />

just run their <strong>Linux</strong> system from their<br />

ext3 or reiserfs file systems.<br />

• Using <strong>Linux</strong> as a Windows firewall on<br />

the same machine. As a likely competitor<br />

to other out-of-the-box Windows firewalls,<br />

iptables along with a stripped-down<br />

Cooperative <strong>Linux</strong> system can potentially<br />

serve as a network firewall.<br />

• Linux kernel development / debugging / research and study on other operating systems. Digging inside a running Cooperative Linux kernel, you can hardly tell the difference between it and a standalone Linux. All virtual addresses are the same—Oops reports look familiar and the architecture-dependent code works in the same manner, except for some transparent conversions, which are described in the next section of this paper.<br />

• Development environment for porting<br />

to and from <strong>Linux</strong>.<br />

3 Design Overview<br />

In this section I’ll describe the basic methods<br />

behind Cooperative <strong>Linux</strong>, which include



complete context switches, handling of hardware<br />

interrupts by forwarding, physical address<br />

translation and the pseudo physical memory<br />

RAM.<br />

3.1 Minimum Changes<br />

To illustrate the minimal effect of the Cooperative<br />

<strong>Linux</strong> patch on the source tree, here is a<br />

diffstat listing of the patch on <strong>Linux</strong> 2.4.26 as<br />

of May 10, 2004:<br />

CREDITS | 6<br />

Documentation/devices.txt | 7<br />

Makefile | 8<br />

arch/i386/config.in | 30<br />

arch/i386/kernel/Makefile | 2<br />

arch/i386/kernel/cooperative.c | 181 +++++<br />

arch/i386/kernel/head.S | 4<br />

arch/i386/kernel/i387.c | 8<br />

arch/i386/kernel/i8259.c | 153 ++++<br />

arch/i386/kernel/ioport.c | 10<br />

arch/i386/kernel/process.c | 28<br />

arch/i386/kernel/setup.c | 61 +<br />

arch/i386/kernel/time.c | 104 +++<br />

arch/i386/kernel/traps.c | 9<br />

arch/i386/mm/fault.c | 4<br />

arch/i386/mm/init.c | 37 +<br />

arch/i386/vmlinux.lds | 82 +-<br />

drivers/block/Config.in | 4<br />

drivers/block/Makefile | 1<br />

drivers/block/cobd.c | 334 ++++++++++<br />

drivers/block/ll_rw_blk.c | 2<br />

drivers/char/Makefile | 4<br />

drivers/char/colx_keyb.c | 1221 +++++++++++++*<br />

drivers/char/mem.c | 8<br />

drivers/char/vt.c | 8<br />

drivers/net/Config.in | 4<br />

drivers/net/Makefile | 1<br />

drivers/net/conet.c | 205 ++++++<br />

drivers/video/Makefile | 4<br />

drivers/video/cocon.c | 484 +++++++++++++++<br />

include/asm-i386/cooperative.h | 175 +++++<br />

include/asm-i386/dma.h | 4<br />

include/asm-i386/io.h | 27<br />

include/asm-i386/irq.h | 6<br />

include/asm-i386/mc146818rtc.h | 7<br />

include/asm-i386/page.h | 30<br />

include/asm-i386/pgalloc.h | 7<br />

include/asm-i386/pgtable-2level.h | 8<br />

include/asm-i386/pgtable.h | 7<br />

include/asm-i386/processor.h | 12<br />

include/asm-i386/system.h | 8<br />

include/linux/console.h | 1<br />

include/linux/cooperative.h | 317 +++++++++<br />

include/linux/major.h | 1<br />

init/do_mounts.c | 3<br />

init/main.c | 9<br />

kernel/Makefile | 2<br />

kernel/cooperative.c | 254 +++++++<br />

kernel/panic.c | 4<br />

kernel/printk.c | 6<br />

50 files changed, 3828 insertions(+), 74 deletions(-)<br />

3.2 Device Driver<br />

<strong>The</strong> device driver port of Cooperative <strong>Linux</strong><br />

is used for accessing kernel mode and using<br />

the kernel primitives that are exported by the<br />

host OS kernel. Most of the driver is OS-independent code that interfaces with the OS-dependent primitives, which include page allocation, debug printing, and interfacing with user space.<br />

When a Cooperative <strong>Linux</strong> VM is created, the<br />

driver loads a kernel image from a vmlinux<br />

file that was compiled from the patched kernel<br />

with CONFIG_COOPERATIVE. <strong>The</strong> vmlinux<br />

file doesn’t need any cross platform tools in order<br />

to be generated, and the same vmlinux file<br />

can be used to run a Cooperative <strong>Linux</strong> VM on<br />

several OSes of the same architecture.<br />

The VM is associated with a per-process resource—a file descriptor in Linux, or a device handle in Windows. The purpose of this association is that if the process running the VM ends abnormally in any way, all resources are cleaned up automatically from a callback when the system frees the per-process resource.<br />

3.3 Pseudo Physical RAM<br />

In Cooperative <strong>Linux</strong>, we had to work around<br />

the <strong>Linux</strong> MM design assumption that the entire<br />

physical RAM is bestowed upon the kernel<br />

on startup, and instead, only give Cooperative<br />

<strong>Linux</strong> a fixed set of physical pages, and<br />

then only do the translations needed for it to<br />

work transparently in that set. All the memory<br />

which Cooperative <strong>Linux</strong> considers as physical<br />

is in that allocated set, which we call the<br />

Pseudo Physical RAM.<br />

The memory is allocated in the host OS using the appropriate kernel function—alloc_pages() in Linux and MmAllocatePagesForMdl() in Windows—so it is not mapped in any address space on the host, in order not to waste PTEs.<br />

<strong>The</strong> allocated pages are always resident and<br />

not freed until the VM is downed. Page tables



--- linux/include/asm-i386/pgtable-2level.h 2004-04-20 08:04:01.000000000 +0300<br />
+++ linux/include/asm-i386/pgtable-2level.h 2004-05-09 16:54:09.000000000 +0300<br />
@@ -58,8 +58,14 @@<br />
 }<br />
 #define ptep_get_and_clear(xp) __pte(xchg(&(xp)->pte_low, 0))<br />
 #define pte_same(a, b) ((a).pte_low == (b).pte_low)<br />
-#define pte_page(x) (mem_map+((unsigned long)(((x).pte_low >> PAGE_SHIFT))))<br />
 #define pte_none(x) (!(x).pte_low)<br />
+<br />
+#ifndef CONFIG_COOPERATIVE<br />
+#define pte_page(x) (mem_map+((unsigned long)(((x).pte_low >> PAGE_SHIFT))))<br />
 #define __mk_pte(page_nr,pgprot) __pte(((page_nr)



Each kernel uses the address space differently, such that one virtual address<br />

can contain a kernel mapped page in one<br />

OS and a user mapped page in another.<br />


In Cooperative <strong>Linux</strong> the problem was solved<br />

by using an intermediate address space during<br />

the switch (referred to as the ‘passage page,’<br />

see Figure 1). <strong>The</strong> intermediate address space<br />

is defined by specially created page tables in<br />

both the guest and host contexts and maps the<br />

same code that is used for the switch (passage<br />

code) at both of the virtual addresses that are<br />

involved. When a switch occurs, first CR3 is<br />

changed to point to the intermediate address<br />

space. <strong>The</strong>n, EIP is relocated to the other mapping<br />

of the passage code using a jump. Finally,<br />

CR3 is changed to point to the top page table<br />

directory of the other OS.<br />

The single MMU page that contains the passage page code also contains the saved state of one OS while the other is executing.<br />

beginning of a switch, interrupts are turned off,<br />

and a current state is saved to the passage page<br />

by the passage page code. <strong>The</strong> state includes<br />

all the general purpose registers, the segment registers, the interrupt descriptor table register (IDTR), the global descriptor table register (GDTR), the local descriptor table register (LDTR), the task register (TR), and the state of the FPU / MMX / SSE registers. In the middle of the passage<br />

page code, it restores the state of the other OS<br />

and interrupts are turned back on. This process<br />

is akin to a “normal” process to process context<br />

switch.<br />

Since control is returned to the host OS on every<br />

hardware interrupt (described in the following<br />

section), it is the responsibility of the host<br />

OS scheduler to give time slices to the Cooperative<br />

Linux VM just as if it were a regular process.<br />


Figure 1: Address space transition during an OS cooperative kernel switch, using an inter-mapped page<br />

3.5 Interrupt Handling and Forwarding<br />

Since a complete MMU context switch also involves<br />

the IDTR, Cooperative <strong>Linux</strong> must set<br />

an interrupt vector table in order to handle the<br />

hardware interrupts that occur in the system<br />

during its running state. However, Cooperative<br />

<strong>Linux</strong> only forwards the invocations of interrupts<br />

to the host OS, because the latter needs<br />

to know about these interrupts in order to keep<br />

functioning and support the colinux-daemon<br />

process itself, regardless to the fact that external<br />

hardware interrupts are meaningless to the<br />

Cooperative <strong>Linux</strong> virtual machine.<br />

<strong>The</strong> interrupt vectors for the internal processor<br />

exceptions (0x0–0x1f) and the system call vector<br />

(0x80) are kept as they are, so that Cooperative<br />

<strong>Linux</strong> handles its own page faults and<br />

other exceptions, but the other interrupt vectors<br />

point to special proxy ISRs (interrupt service<br />

routines). When such an ISR is invoked during<br />

the Cooperative <strong>Linux</strong> context by an external<br />

hardware interrupt, a context switch is made to<br />

the host OS using the passage code. On the



other side, the address of the relevant ISR of<br />

the host OS is determined by looking at its IDT.<br />

An interrupt call stack is forged and a jump occurs<br />

to that address. Between the invocation of<br />

the ISR in the <strong>Linux</strong> side and the handling of<br />

the interrupt in the host side, the interrupt flag<br />

is disabled.<br />

<strong>The</strong> operation adds a tiny latency to interrupt<br />

handling in the host OS, but it is negligible.<br />

Considering that this interrupt forwarding<br />

technique also involves the hardware<br />

timer interrupt, the host OS cannot detect that<br />

its CR3 was hijacked for a moment and therefore<br />

no exceptions in the host side would occur<br />

as a result of the context switch.<br />

To provide interrupts for the virtual device<br />

drivers of the guest <strong>Linux</strong>, the changes in the<br />

arch code include a virtual interrupt controller<br />

which receives messages from the host OS<br />

on the occasion of a switch and invokes do_IRQ() with a forged struct pt_regs. The interrupt numbers are virtual and allocated on a per-device basis.<br />

4 Benchmarks And Performance<br />

4.1 Dbench results<br />

This section shows a comparison between User<br />

Mode Linux and Cooperative Linux. The machine on which the following results were generated is a 2.8GHz Pentium 4 with HT enabled, 512MB RAM, and a 120GB SATA Maxtor hard-drive that hosts ext3 partitions. The<br />

comparison was performed using the dbench<br />

1.3-2 package of Debian on all setups.<br />

<strong>The</strong> host machine runs the <strong>Linux</strong> 2.6.6 kernel patched with SKAS support. <strong>The</strong> UML kernel is <strong>Linux</strong> 2.6.4, running with 32MB of RAM and configured to use SKAS mode. <strong>The</strong> Cooperative <strong>Linux</strong> kernel is a <strong>Linux</strong> 2.4.26 kernel, also configured to run with 32MB of RAM, the same as the UML system. <strong>The</strong> root file-system of both the UML and Cooperative <strong>Linux</strong> machines is the same host <strong>Linux</strong> file, containing an ext3 image of a 0.5GB minimized Debian system.<br />

<strong>The</strong> commands ‘dbench 1’, ‘dbench 3’, and ‘dbench 10’ were each run 3 consecutive times on the host <strong>Linux</strong>, UML, and Cooperative <strong>Linux</strong> setups. <strong>The</strong> results are shown in Table 2, Table 3, and Table 4.<br />

Table 2: output of dbench 10 (units are in MB/sec)<br />

System    Throughput  Netbench<br />
          43.813      54.766<br />
Host      50.117      62.647<br />
          44.128      55.160<br />
          10.418      13.022<br />
UML       9.408       11.760<br />
          9.309       11.636<br />
          10.418      13.023<br />
co<strong>Linux</strong>   12.574      15.718<br />
          12.075      15.094<br />

Table 3: output of dbench 3 (units are in MB/sec)<br />

System    Throughput  Netbench<br />
          43.287      54.109<br />
Host      41.383      51.729<br />
          59.965      74.956<br />
          11.857      14.821<br />
UML       15.143      18.929<br />
          14.602      18.252<br />
          24.095      30.119<br />
co<strong>Linux</strong>   32.527      40.659<br />
          36.423      45.528<br />

4.2 Understanding the results<br />

From the results in these runs, ‘dbench 10’, ‘dbench 3’, and ‘dbench 1’ show a 20%, 123%, and 303% increase respectively, compared to UML. <strong>The</strong>se numbers relate to the number of dbench threads, which is a result of the synchronous implementation of cobd 1 . Yet, even neglecting the different kernel versions compared, Cooperative <strong>Linux</strong> achieves much better results, probably because of its low overhead for context switching and page faulting in the guest <strong>Linux</strong> VM.<br />

<strong>Linux</strong> Symposium 2004 • Volume <strong>One</strong> • 29<br />

Table 4: output of dbench 1 (units are in MB/sec)<br />

System    Throughput  Netbench<br />
          158.205     197.756<br />
Host      182.191     227.739<br />
          179.047     223.809<br />
          15.351      19.189<br />
UML       16.691      20.864<br />
          16.180      20.226<br />
          45.592      56.990<br />
co<strong>Linux</strong>   72.452      90.565<br />
          106.952     133.691<br />

<strong>The</strong> current implementation of the cobd driver performs synchronous file reads and writes directly from the kernel of the host <strong>Linux</strong>—no user space of the host <strong>Linux</strong> is involved, and therefore there is less context switching and copying. Regarding copying, the specific implementation of cobd on the host <strong>Linux</strong> side benefits from the fact that filp->f_op->read() is called directly on the cobd driver’s request buffer after mapping it using kmap(). Reimplementing this driver asynchronously on both the host and guest sides could improve performance.<br />

Unlike UML, Cooperative <strong>Linux</strong> can benefit in terms of performance from the implementation of kernel-to-kernel driver bridges such as cobd. For example, virtual Ethernet in Cooperative <strong>Linux</strong> is currently done similarly to UML—i.e., using user space daemons with tuntap on the host. If instead we create a kernel-to-kernel implementation with no user space daemons in between, Cooperative <strong>Linux</strong> has the potential to achieve much better benchmark results.<br />

1 cobd is the Cooperative <strong>Linux</strong> equivalent of UML’s ubd driver<br />

5 Planned Features<br />

Since Cooperative <strong>Linux</strong> is a new project<br />

(2004–), most of its features are still waiting<br />

to be implemented.<br />

5.1 Suspension<br />

Software-suspending <strong>Linux</strong> is a challenge on standalone <strong>Linux</strong> systems, considering that the entire state of the hardware needs to be saved and restored, along with the space that needs to be found for storing the suspended image. On User Mode <strong>Linux</strong>, suspending [3] is easier—only the state of a few processes needs saving, and no hardware is involved.<br />

However, in Cooperative <strong>Linux</strong>, suspension will be even easier to implement, because it involves almost only its internal state. <strong>The</strong> procedure will involve serializing the pseudo physical RAM by enumerating all the page table entries that are used in Cooperative <strong>Linux</strong>, either by itself (for user space and vmalloc page tables) or for itself (the page tables of the pseudo physical RAM), and changing them to contain the pseudo value instead of the real value.<br />

<strong>The</strong> purpose of this suspension procedure is to ensure that no notion of the real physical memory is contained in any of the pages allocated for the Cooperative <strong>Linux</strong> VM, since Cooperative <strong>Linux</strong> will be given a different set of pages when it resumes at a later time. In the suspended state, the pages can be saved to a file and the VM can be resumed later. Resuming will involve loading that file, allocating the memory, and enumerating all the page tables again so that the values in the page table entries point to the newly allocated memory.<br />



Another implementation strategy would be to dump everything as it is on suspension, and on resume to enumerate all the page table entries and adjust the values from the old RPPFNs 2 to the new RPPFNs.<br />

Note that a suspended image could be created under one host OS and resumed under another host OS of the same architecture. <strong>One</strong> could carry a suspended <strong>Linux</strong> on a USB memory device and resume/suspend it on almost any computer.<br />
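<strong>The</strong> resume-time fix-up described above can be sketched as a single pass over the page table entries; the types, the 12-bit page shift, and the function names here are illustrative assumptions (i386-style 4KB pages), not Cooperative <strong>Linux</strong> code:<br />

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT 12
#define PTE_FLAGS_MASK ((uint32_t)((1u << PAGE_SHIFT) - 1))

/* Rewrite each page table entry so that its old real physical page
 * frame number (RPPFN) is replaced with the newly allocated frame.
 * map_old_to_new[i] gives the new RPPFN for old RPPFN i; the low
 * flag bits of each entry are preserved. */
void co_fixup_ptes(uint32_t *ptes, size_t n,
                   const uint32_t *map_old_to_new, size_t map_len)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t old_pfn = ptes[i] >> PAGE_SHIFT;
        if (old_pfn < map_len)
            ptes[i] = (map_old_to_new[old_pfn] << PAGE_SHIFT)
                      | (ptes[i] & PTE_FLAGS_MASK);
    }
}
```

The same pass works for either strategy: converting real values to pseudo values at suspend time, or adjusting old RPPFNs to new ones at resume time, depending on which mapping table is passed in.<br />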

5.2 User Mode <strong>Linux</strong>[1] inside Cooperative<br />

<strong>Linux</strong><br />

<strong>The</strong> possibility of running UML inside Cooperative <strong>Linux</strong> is not far from being realized. It will allow bringing UML, with all its glory, to operating systems that cannot support it otherwise because of their user space APIs. Combining UML and Cooperative <strong>Linux</strong> also cancels the security downside that running Cooperative <strong>Linux</strong> alone could incur.<br />

5.3 Live Cooperative Distributions<br />

Live-CD distributions like KNOPPIX can be used to boot on top of another operating system, and not only standalone, reaching a larger segment of computer users, considering that the host operating system is most often Windows NT/2000/XP.<br />

5.4 Integration with ReactOS<br />

ReactOS, the free Windows NT clone, will be<br />

incorporating Cooperative <strong>Linux</strong> as a POSIX<br />

subsystem.<br />

5.5 Miscellaneous<br />

• Virtual frame buffer support.<br />

• Incorporating features from User Mode <strong>Linux</strong>, e.g. humfs 3 .<br />

• Support for more host operating systems, such as FreeBSD.<br />

2 real physical page frame numbers<br />

6 Conclusions<br />

We have discussed how Cooperative <strong>Linux</strong> works and its benefits—apart from being a BSKH 4 , Cooperative <strong>Linux</strong> has the potential to become an alternative to User Mode <strong>Linux</strong> that improves on portability and performance, rather than on security.<br />

Moreover, the implications that Cooperative <strong>Linux</strong> has for what the media defines as ‘<strong>Linux</strong> on the Desktop’ are massive, as the world’s most dominant, albeit proprietary, desktop OS now supports running <strong>Linux</strong> distributions for free, as just another piece of software, with the aimed-for possibility that the <strong>Linux</strong> newbie will eventually switch to standalone <strong>Linux</strong>. As the user-friendliness of the Windows port improves, the exposure that <strong>Linux</strong> gets from the average computer user can increase tremendously.<br />

7 Thanks<br />

Muli Ben Yehuda, IBM<br />

Jun Okajima, Digital Infra<br />

Kuniyasu Suzaki, AIST<br />

References<br />

[1] Jeff Dike. User Mode <strong>Linux</strong>. http://user-mode-linux.sf.net.<br />

3 A recent addition to UML that provides a host FS implementation that uses files in order to store its VFS metadata<br />

4 Big Scary <strong>Kernel</strong> Hack



[2] Donald E. Knuth. <strong>The</strong> Art of Computer<br />

Programming, volume 1.<br />

Addison-Wesley, Reading, Massachusetts,<br />

1997. Describes coroutines in their pure<br />

sense.<br />

[3] Richard Potter. Scrapbook for User Mode <strong>Linux</strong>. http://sbuml.sourceforge.net/.<br />




Build your own Wireless Access Point<br />

Erik Andersen<br />

Codepoet Consulting<br />

andersen@codepoet.org<br />

Abstract<br />

This presentation will cover the software, tools,<br />

libraries, and configuration files needed to<br />

construct an embedded <strong>Linux</strong> wireless access<br />

point. Some of the software available for constructing<br />

embedded <strong>Linux</strong> systems will be discussed,<br />

and selection criteria for which tools to<br />

use for differing embedded applications will be<br />

presented. During the presentation, an embedded<br />

<strong>Linux</strong> wireless access point will be constructed<br />

using the <strong>Linux</strong> kernel, the uClibc C<br />

library, BusyBox, the syslinux bootloader, iptables,<br />

etc. Emphasis will be placed on the<br />

more generic aspects of building an embedded<br />

<strong>Linux</strong> system using BusyBox and uClibc.<br />

At the conclusion of the presentation, the presenter<br />

will (with luck) boot up the newly constructed<br />

wireless access point and demonstrate<br />

that it is working perfectly. Source code, build<br />

system, cross compilers, and detailed instructions<br />

will be made available.<br />

1 Introduction<br />

When I began working on embedded <strong>Linux</strong>,<br />

the question of whether or not <strong>Linux</strong> was small<br />

enough to fit inside a particular device was a<br />

difficult problem. <strong>Linux</strong> distributions 1 have<br />

1 <strong>The</strong> term “distribution” is used by the <strong>Linux</strong> community<br />

to refer to a collection of software, including<br />

the <strong>Linux</strong> kernel, application programs, and needed library<br />

code, which makes up a complete running system.<br />

Sometimes, the term “<strong>Linux</strong>” or “GNU/<strong>Linux</strong>” is also<br />

used to refer to this collection of software.<br />

historically been designed for server and desktop<br />

systems. As such, they deliver a full-featured,<br />

comprehensive set of tools for just<br />

about every purpose imaginable. Most <strong>Linux</strong><br />

distributions, such as Red Hat, Debian, or<br />

SuSE, provide hundreds of separate software<br />

packages adding up to several gigabytes of<br />

software. <strong>The</strong> goal of server or desktop <strong>Linux</strong><br />

distributions has been to provide as much value<br />

as possible to the user; therefore, the large<br />

size is quite understandable. However, this<br />

has caused the <strong>Linux</strong> operating system to be<br />

much larger than is desirable for building an<br />

embedded <strong>Linux</strong> system such as a wireless access<br />

point. Since embedded devices represent<br />

a fundamentally different target for <strong>Linux</strong>,<br />

it became apparent to me that embedded devices<br />

would need different software than what<br />

is commonly used on desktop systems. I knew<br />

that <strong>Linux</strong> has a number of strengths which<br />

make it extremely attractive for the next generation<br />

of embedded devices, yet I could see<br />

that developers would need new tools to take<br />

advantage of <strong>Linux</strong> within small, embedded<br />

spaces.<br />

I began working on embedded <strong>Linux</strong> in the<br />

middle of 1999. At the time, building an ‘embedded<br />

<strong>Linux</strong>’ system basically involved copying<br />

binaries from an existing <strong>Linux</strong> distribution<br />

to a target device. If the needed software did<br />

not fit into the required amount of flash memory,<br />

there was really nothing to be done about<br />

it except to add more flash or give up on the<br />

project. Very little effort had been made to<br />

develop smaller application programs and li-



braries designed for use in embedded <strong>Linux</strong>.<br />

As I began to analyze how I could save space,<br />

I decided that there were three main areas that<br />

could be attacked to shrink the footprint of an<br />

embedded <strong>Linux</strong> system: the kernel, the set of<br />

common application programs included in the<br />

system, and the shared libraries. Many people<br />

doing <strong>Linux</strong> kernel development were at least<br />

talking about shrinking the footprint of the kernel.<br />

For the past five years, I have focused on<br />

the latter two areas: shrinking the footprint of<br />

the application programs and libraries required<br />

to produce a working embedded <strong>Linux</strong> system.<br />

This paper will describe some of the software<br />

tools I’ve worked on and maintained, which are<br />

now available for building very small embedded<br />

<strong>Linux</strong> systems.<br />

2 <strong>The</strong> C Library<br />

Let’s take a look at an embedded <strong>Linux</strong> system,<br />

the <strong>Linux</strong> Router Project, which was available<br />

in 1999 (http://www.linuxrouter.org/).<br />

<strong>The</strong> <strong>Linux</strong> Router Project, begun by Dave<br />

Cinege, was and continues to be a very commonly<br />

used embedded <strong>Linux</strong> system. Its self-described<br />

tagline reads “A networking-centric<br />

micro-distribution of <strong>Linux</strong>” which is “small<br />

enough to fit on a single 1.44MB floppy disk,<br />

and makes building and maintaining routers,<br />

access servers, thin servers, thin clients,<br />

network appliances, and typically embedded<br />

systems next to trivial.” First, let’s download<br />

a copy of one of the <strong>Linux</strong> Router Project’s<br />

“idiot images.” I grabbed my copy from<br />

the mirror site at ftp://sunsite.unc.edu/pub/Linux/distributions/linux-router/dists/current/idiot-image_1440KB_FAT_2.9.8_Linux_2.2.gz.<br />

Opening up the idiot-image there are several<br />

very interesting things to be seen.<br />

# gunzip \<br />

idiot-image_1440KB_FAT_2.9.8_<strong>Linux</strong>_2.2.gz<br />

# mount \<br />

idiot-image_1440KB_FAT_2.9.8_<strong>Linux</strong>_2.2 \<br />

/mnt -o loop<br />

# du -ch /mnt/*<br />

34K /mnt/etc.lrp<br />

6.0K /mnt/ldlinux.sys<br />

512K /mnt/linux<br />

512 /mnt/local.lrp<br />

1.0K /mnt/log.lrp<br />

17K /mnt/modules.lrp<br />

809K /mnt/root.lrp<br />

512 /mnt/syslinux.cfg<br />

1.0K /mnt/syslinux.dpy<br />

1.4M total<br />

# mkdir test<br />

# cd test<br />

# tar -xzf /mnt/root.lrp<br />

# du -hs<br />

2.2M .<br />

2.2M total<br />

# du -ch bin root sbin usr var<br />

460K bin<br />

8.0K root<br />

264K sbin<br />

12K usr/bin<br />

304K usr/sbin<br />

36K usr/lib/ipmasqadm<br />

40K usr/lib<br />

360K usr<br />

56K var/lib/lrpkg<br />

60K var/lib<br />

4.0K var/spool/cron/crontabs<br />

8.0K var/spool/cron<br />

12K var/spool<br />

76K var<br />

1.2M total<br />

# du -ch lib<br />

24K lib/POSIXness<br />

1.1M lib<br />

1.1M total<br />

# du -h lib/libc-2.0.7.so<br />

644K lib/libc-2.0.7.so<br />

Taking a look at the software contained in<br />

this embedded <strong>Linux</strong> system, we quickly notice<br />

that in a software image totaling 2.2<br />

Megabytes, the libraries take up over half the<br />

space. If we look even closer at the set of<br />

libraries, we quickly find that the largest single<br />

component in the entire system is the GNU<br />

C library, in this case occupying nearly 650k.<br />

What is more, this is a very old version of<br />

the C library; newer versions of GNU glibc,



such as version 2.3.2, are over 1.2 Megabytes<br />

all by themselves! <strong>The</strong>re are tools available<br />

from <strong>Linux</strong> vendors and in the Open Source<br />

community which can reduce the footprint of<br />

the GNU C library considerably by stripping<br />

unwanted symbols; however, using such tools<br />

precludes adding additional software at a later<br />

date. Even when these tools are appropriate,<br />

there are limits to the amount of size which can<br />

be reclaimed from the GNU C library in this<br />

way.<br />

<strong>The</strong> prospect of shrinking a single library that<br />

takes up so much space certainly looked like<br />

low hanging fruit. In practice, however, replacing<br />

the GNU C library for embedded <strong>Linux</strong><br />

systems was not an easy task.<br />

3 <strong>The</strong> origins of uClibc<br />

As I despaired over the large size of the GNU<br />

C library, I decided that the best thing to do<br />

would be to find another C library for <strong>Linux</strong><br />

that would be better suited for embedded systems.<br />

I spent quite a bit of time looking around,<br />

and after carefully evaluating the various Open<br />

Source C libraries that I knew of 2 , I sadly<br />

found that none of them were suitable replacements<br />

for glibc. Of all the Open Source C libraries,<br />

the library closest to what I imagined<br />

an embedded C library should be was called<br />

uC-libc and was being used for uClinux systems.<br />

However, it also had many problems at<br />

the time—not the least of which was that uClibc<br />

had no central maintainer. <strong>The</strong> only mechanism<br />

being used to support multiple architec-<br />

2 <strong>The</strong> Open Source C libraries I evaluated at the time included Al’s Free C RunTime library (no longer on the Internet); dietlibc, available from http://www.fefe.de/dietlibc/; the minix C library, available from http://www.cs.vu.nl/cgi-bin/raw/pub/minix/; the newlib library, available from http://sources.redhat.com/newlib/; and the eCos C library, available from ftp://ecos.sourceware.org/pub/ecos/.<br />

tures was a complete source tree fork, and there<br />

had already been a few such forks with plenty<br />

of divergent code. In short, uC-libc was a mess<br />

of twisty versions, all different. After spending<br />

some time with the code, I decided to fix it, and<br />

in the process changed the name to uClibc<br />

(no hyphen).<br />

With the help of D. Jeff Dionne, one of the creators<br />

of uClinux 3 , I ported uClibc to run on<br />

Intel compatible x86 CPUs. I then grafted in<br />

the header files from glibc 2.1.3 to simplify<br />

software ports, and I cleaned up the resulting<br />

breakage. <strong>The</strong> header files were later updated<br />

again to generally match glibc 2.3.2. This effort<br />

has made porting software from glibc to<br />

uClibc extremely easy. <strong>The</strong>re were, however,<br />

many functions in uClibc that were either broken<br />

or missing and which had to be re-written<br />

or created from scratch. When appropriate, I<br />

sometimes grafted in bits of code from the current<br />

GNU C library and libc5. Once the core<br />

of the library was reasonably solid, I began<br />

adding a platform abstraction layer to allow<br />

uClibc to compile and run on different types of<br />

CPUs. Once I had both the ARM and x86 platforms<br />

basically running, I made a few small<br />

announcements to the <strong>Linux</strong> community. At<br />

that point, several people began to make regular<br />

contributions. Most notably was Manuel<br />

Novoa III, who began contributing at that time.<br />

He has continued working on uClibc and is<br />

responsible for significant portions of uClibc<br />

such as the stdio and internationalization code.<br />

After a great deal of effort, we were able to<br />

build the first shared library version of uClibc<br />

in January 2001. And earlier this year we were<br />

able to compile a Debian Woody system using<br />

uClibc 4 , demonstrating the library is now able<br />

3 uClinux is a port of <strong>Linux</strong> designed to run on microcontrollers which lack Memory Management Units (MMUs), such as the Motorola DragonBall or the ARM7TDMI. <strong>The</strong> uClinux web site is found at http://www.uclinux.org/.<br />

4 http://www.uclibc.org/dists/



to support a complete <strong>Linux</strong> distribution. People<br />

now use uClibc to build versions of Gentoo,<br />

Slackware, <strong>Linux</strong> from Scratch, rescue disks,<br />

and even live <strong>Linux</strong> CDs 5 . A number of commercial<br />

products have also been released using<br />

uClibc, such as wireless routers, network attached<br />

storage devices, DVD players, etc.<br />

4 Compiling uClibc<br />

Before we can compile uClibc, we must first<br />

grab a copy of the source code and unpack it<br />

so it is ready to use. For this paper, we will just<br />

grab a copy of the daily uClibc snapshot.<br />

# SITE=http://www.uclibc.org/downloads<br />

# wget -q $SITE/uClibc-snapshot.tar.bz2<br />

# tar -xjf uClibc-snapshot.tar.bz2<br />

# cd uClibc<br />

uClibc requires a configuration file, .config,<br />

that can be edited to change the way the library<br />

is compiled, such as to enable or disable<br />

features (i.e. whether debugging support<br />

is enabled or not), to select a cross-compiler,<br />

etc. <strong>The</strong> preferred method when starting from<br />

scratch is to run make defconfig followed<br />

by make menuconfig. Since we are going<br />

to be targeting a standard Intel compatible x86<br />

system, no changes to the default configuration<br />

file are necessary.<br />

5 <strong>The</strong> Origins of BusyBox<br />

As I mentioned earlier, the two components<br />

of an embedded <strong>Linux</strong> system that I chose to work towards reducing in size were the shared libraries and the set of common application programs.<br />

A typical <strong>Linux</strong> system contains a variety<br />

of command-line utilities from numerous<br />

5 Puppy <strong>Linux</strong>, available from http://www.goosee.com/puppy/, is a live <strong>Linux</strong> CD system built with uClibc that includes such favorites as XFree86 and Mozilla.<br />

different organizations and independent programmers.<br />

Among the most prominent of these<br />

utilities were GNU shellutils, fileutils, textutils<br />

(now combined to form GNU coreutils), and<br />

similar programs that can be run within a shell<br />

(commands such as sed, grep, ls, etc.).<br />

<strong>The</strong> GNU utilities are generally very high-quality<br />

programs, and are almost without exception<br />

very, very feature-rich. <strong>The</strong> large feature<br />

set comes at the cost of being quite large—<br />

prohibitively large for an embedded <strong>Linux</strong> system.<br />

After some investigation, I determined<br />

that it would be more efficient to replace them<br />

rather than try to strip them down, so I began<br />

looking at alternatives.<br />

Just as with alternative C libraries, there were<br />

several choices for small shell utilities: BSD<br />

has a number of utilities which could be used.<br />

<strong>The</strong> Minix operating system, which had recently<br />

been released under a free software license,<br />

also had many useful utilities. Sash, the standalone shell, was also a possibility. After quite<br />

a lot of research, the one that seemed to be<br />

the best fit was BusyBox. It also appealed to<br />

me because I was already familiar with Busy-<br />

Box from its use on the Debian boot floppies,<br />

and because I was acquainted with Bruce<br />

Perens, who was the maintainer. Starting approximately<br />

in October 1999, I began enhancing<br />

BusyBox and fixing the most obvious problems.<br />

Since Bruce was otherwise occupied and<br />

was no longer actively maintaining BusyBox,<br />

he eventually consented to let me take over<br />

maintainership.<br />

Since that time, BusyBox has gained a large<br />

following and attracted development talent<br />

from literally the whole world. It has been<br />

used in commercial products such as the IBM<br />

<strong>Linux</strong> wristwatch, the Sharp Zaurus PDA, and<br />

Linksys wireless routers such as the WRT54G,<br />

with many more products being released all the<br />

time. So many new features and applets have<br />

been added to BusyBox, that the biggest chal-



lenge I now face is simply keeping up with all<br />

of the patches that get submitted!<br />

6 So, How Does It Work?<br />

BusyBox is a multi-call binary that combines<br />

many common Unix utilities into a single executable.<br />

When it is run, BusyBox checks if it<br />

was invoked via a symbolic link (a symlink),<br />

and if the name of the symlink matches the<br />

name of an applet that was compiled into Busy-<br />

Box, it runs that applet. If BusyBox is invoked<br />

as busybox, then it will read the command<br />

line and try to execute the applet name passed<br />

as the first argument. For example:<br />

# ./busybox date<br />

Wed Jun 2 15:01:03 MDT 2004<br />

# ./busybox echo "hello there"<br />

hello there<br />

# ln -s ./busybox uname<br />

# ./uname<br />

<strong>Linux</strong><br />

BusyBox is designed such that the developer<br />

compiling it for an embedded system can select<br />

exactly which applets to include in the final binary.<br />

Thus, it is possible to strip out support for<br />

unneeded and unwanted functionality, resulting<br />

in a smaller binary with a carefully selected<br />

set of commands. <strong>The</strong> customization granularity<br />

for BusyBox even goes one step further:<br />

each applet may contain multiple features that<br />

can be turned on or off. Thus, for example, if<br />

you do not wish to include large file support,<br />

or you do not need to mount NFS filesystems,<br />

you can simply turn these features off, further<br />

reducing the size of the final BusyBox binary.<br />
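<strong>The</strong> argv[0] dispatch described above can be sketched in a few lines of C; the applet table and the run_applet() helper here are hypothetical simplifications, not BusyBox’s actual internals:<br />

```c
#include <string.h>

typedef int (*applet_main_t)(int argc, char **argv);

struct applet { const char *name; applet_main_t main; };

/* Two toy applets standing in for the real compiled-in applet set. */
static int true_main(int argc, char **argv)  { (void)argc; (void)argv; return 0; }
static int false_main(int argc, char **argv) { (void)argc; (void)argv; return 1; }

static const struct applet applets[] = {
    { "true",  true_main  },
    { "false", false_main },
};

/* Strip any leading path, as needed for symlinks like /bin/true. */
static const char *applet_name(const char *argv0)
{
    const char *slash = strrchr(argv0, '/');
    return slash ? slash + 1 : argv0;
}

/* Dispatch on the invocation name; when invoked as "busybox", shift
 * the arguments and dispatch on the first argument instead. */
int run_applet(int argc, char **argv)
{
    const char *name = applet_name(argv[0]);

    if (strcmp(name, "busybox") == 0 && argc > 1) {
        argv++; argc--;
        name = applet_name(argv[0]);
    }
    for (size_t i = 0; i < sizeof(applets) / sizeof(applets[0]); i++)
        if (strcmp(name, applets[i].name) == 0)
            return applets[i].main(argc, argv);
    return 127; /* applet not found */
}
```

The real BusyBox additionally compiles the applet table from the .config selections, which is what makes the per-applet and per-feature size trimming possible.<br />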

7 Compiling Busybox<br />

Let’s walk through a normal compile of Busy-<br />

Box. First, we must grab a copy of the Busy-<br />

Box source code and unpack it so it is ready to<br />

use. For this paper, we will just grab a copy of<br />

the daily BusyBox snapshot.<br />

# SITE=http://www.busybox.net/downloads<br />

# wget -q $SITE/busybox-snapshot.tar.bz2<br />

# tar -xjf busybox-snapshot.tar.bz2<br />

# cd busybox<br />

Now that we are in the BusyBox source directory<br />

we can configure BusyBox so that it<br />

meets the needs of our embedded <strong>Linux</strong> system.<br />

This is done by editing the file .config<br />

to change the set of applets that are compiled<br />

into BusyBox, to enable or disable features<br />

(i.e. whether debugging support is enabled or<br />

not), and to select a cross-compiler. <strong>The</strong> preferred<br />

method when starting from scratch is<br />

to run make defconfig followed by make<br />

menuconfig. Once BusyBox has been configured<br />

to taste, you just need to run make to<br />

compile it.<br />

8 Installing Busybox to a Target<br />

If you then want to install BusyBox onto a<br />

target device, this is most easily done by typing:<br />

make install. <strong>The</strong> installation script<br />

automatically creates all the required directories<br />

(such as /bin, /sbin, and the like) and<br />

creates appropriate symlinks in those directories<br />

for each applet that was compiled into the<br />

BusyBox binary.<br />

If we wanted to install BusyBox to the directory<br />

/mnt, we would simply run:<br />

# make PREFIX=/mnt install<br />

[--installation text omitted--]



9 Let’s build something that<br />

works!<br />

Now that I have certainly bored you to death,<br />

we finally get to the fun part, building our own<br />

embedded <strong>Linux</strong> system. For hardware, I will<br />

be using a Soekris 4521 system 6 with a 133 MHz AMD Elan CPU, 64 MB main memory,<br />

and a generic Intersil Prism based 802.11b card<br />

that can be driven using the hostap 7 driver.<br />

<strong>The</strong> root filesystem will be installed on a compact<br />

flash card.<br />

To begin with, we need to create a toolchain with<br />

which to compile the software for our wireless<br />

access point. This requires we first compile<br />

GNU binutils 8 , then compile the GNU<br />

compiler collection—gcc 9 , and then compile<br />

uClibc using the newly created gcc compiler.<br />

With all those steps completed, we must finally<br />

recompile gcc using the newly<br />

built uClibc library so that libgcc_s and<br />

libstdc++ can be linked with uClibc.<br />

Fortunately, the process of creating a uClibc<br />

toolchain can be automated. First we will go<br />

to the uClibc website and obtain a copy of the<br />

uClibc buildroot by going here:<br />

http://www.uclibc.org/cgi-bin/cvsweb/buildroot/<br />

and clicking on the “Download tarball” link 10 .<br />

This is a simple GNU make based build system<br />

which first builds a uClibc toolchain, and then<br />

builds a root filesystem using the newly built<br />

uClibc toolchain.<br />

For the root filesystem of our wireless access<br />

6 http://www.soekris.com/net4521.htm<br />

7 http://hostap.epitest.fi/<br />

8 http://sources.redhat.com/binutils/<br />

9 http://gcc.gnu.org/<br />

10 http://www.uclibc.org/cgi-bin/cvsweb/buildroot.tar.gz?view=tar<br />

point, we will need a <strong>Linux</strong> kernel, uClibc,<br />

BusyBox, pcmcia-cs, iptables, hostap, wtools,<br />

bridgeutils, and the dropbear ssh server. To<br />

compile these programs, we will first edit the<br />

buildroot Makefile to enable each of these<br />

items. Figure 1 shows the changes I made to<br />

the buildroot Makefile:<br />

Running make at this point will download the<br />

needed software packages, build a toolchain,<br />

and create a minimal root filesystem with the<br />

specified software installed.<br />

On my system, with all the software packages<br />

previously downloaded and cached locally, a<br />

complete build took 17 minutes, 19 seconds.<br />

Depending on the speed of your network connection<br />

and the speed of your build system,<br />

now might be an excellent time to take a lunch<br />

break, take a walk, or watch a movie.<br />

10 Checking out the new Root<br />

Filesystem<br />

We now have our root filesystem finished and<br />

ready to go. But we still need to do a little<br />

more work before we can boot up our newly<br />

built embedded <strong>Linux</strong> system. First, we need<br />

to compress our root filesystem so it can be<br />

loaded as an initrd.<br />

# gzip -9 root_fs_i386<br />

# ls -sh root_fs_i386.gz<br />

1.1M root_fs_i386.gz<br />

Now that our root filesystem has been compressed,<br />

it is ready to install on the boot media.<br />

To make things simple, I will install the Compact<br />

Flash boot media into a USB card reader<br />

device, and copy files using the card reader.<br />

# ms-sys -s /dev/sda<br />

Public domain master boot record<br />

successfully written to /dev/sda


<strong>Linux</strong> Symposium 2004 • Volume <strong>One</strong> • 39<br />

--- Makefile<br />

+++ Makefile<br />

@@ -140,6 +140,6 @@<br />

# Unless you want to build a kernel, I recommend just using<br />

# that...<br />

-TARGETS+=kernel-headers<br />

-#TARGETS+=linux<br />

+#TARGETS+=kernel-headers<br />

+TARGETS+=linux<br />

#TARGETS+=system-linux<br />

@@ -150,5 +150,5 @@<br />

#TARGETS+=zlib openssl openssh<br />

# Dropbear sshd is much smaller than openssl + openssh<br />

-#TARGETS+=dropbear_sshd<br />

+TARGETS+=dropbear_sshd<br />

# Everything needed to build a full uClibc development system!<br />

@@ -175,5 +175,5 @@<br />

# Some stuff for access points and firewalls<br />

-#TARGETS+=iptables hostap wtools dhcp_relay bridge<br />

+TARGETS+=iptables hostap wtools dhcp_relay bridge<br />

#TARGETS+=iproute2 netsnmp<br />

Figure 1: Changes to the buildroot Makefile<br />

# mkdosfs /dev/sda1<br />

mkdosfs 2.10 (22 Sep 2003)<br />

# syslinux /dev/sda1<br />
# cp syslinux.cfg /mnt<br />
# cp root_fs_i386.gz /mnt/root_fs.gz<br />
# cp build_i386/buildroot-kernel /mnt/linux<br />

So we now have a copy of our root filesystem<br />

and <strong>Linux</strong> kernel on the compact flash disk. Finally,<br />

we need to configure the bootloader. In<br />

case you missed it a few steps ago, we are using<br />

the syslinux bootloader for this example.<br />

I happen to have a ready-to-use syslinux configuration<br />

file, so I will now install that to the<br />

compact flash disk as well:<br />

# cat syslinux.cfg<br />
TIMEOUT 0<br />
PROMPT 0<br />
DEFAULT linux<br />
LABEL linux<br />
KERNEL linux<br />
APPEND initrd=root_fs.gz \<br />
console=ttyS0,57600 \<br />
root=/dev/ram0 boot=/dev/hda1,msdos rw<br />

And now, finally, we are done. Our embedded<br />

<strong>Linux</strong> system is complete and ready to boot.<br />

And you know what? It is very, very small.<br />

Take a look at Table 1.<br />

With a carefully optimized <strong>Linux</strong> kernel<br />

(which this kernel unfortunately isn’t) we<br />

could expect to have even more free space.<br />

And remember, every bit of space we save is<br />

money that embedded <strong>Linux</strong> developers don’t<br />

have to spend on expensive flash memory. So now comes the final test: booting from our compact flash disk. Here is what you should see.<br />

[----kernel boot messages snipped--]



# ll /mnt<br />

total 1.9M<br />

drwxr--r-- 2 root root 16K Jun 2 16:39 ./<br />
drwxr-xr-x 22 root root 4.0K Feb 6 07:40 ../<br />
-r-xr--r-- 1 root root 7.7K Jun 2 16:36 ldlinux.sys*<br />
-rwxr--r-- 1 root root 795K Jun 2 16:36 linux*<br />
-rwxr--r-- 1 root root 1.1M Jun 2 16:36 root_fs.gz*<br />
-rwxr--r-- 1 root root 170 Jun 2 16:39 syslinux.cfg*<br />

Table 1: Output of ls -lh /mnt.<br />

Freeing unused kernel memory: 64k freed<br />

Welcome to the Erik’s wireless access point.<br />


uclibc login: root<br />

BusyBox v1.00-pre10 (2004.06.02-21:54+0000)<br />

Built-in shell (ash)<br />

Enter ’help’ for a list of built-in commands.<br />

# du -h / | tail -n 1<br />

2.6M<br />

#<br />

And there you have it—your very own wireless<br />

access point. Some additional configuration<br />

will be necessary to start up the wireless<br />

interface, which will be demonstrated during<br />

my presentation.<br />

11 Conclusion<br />

<strong>The</strong> two largest components of a standard<br />

<strong>Linux</strong> system are the utilities and the libraries.<br />

By replacing these with smaller equivalents a<br />

much more compact system can be built. Using<br />

BusyBox and uClibc allows you to customize<br />

your embedded distribution by stripping<br />

out unneeded applets and features, thus<br />

further reducing the final image size. This<br />

space savings translates directly into decreased<br />

cost per unit as less flash memory will be required.<br />

Combine this with the cost savings of<br />

using <strong>Linux</strong>, rather than a more expensive proprietary<br />

OS, and the reasons for using <strong>Linux</strong><br />

become very compelling. The example wireless access point we created is a simple but useful example. There are thousands of other potential applications that are only waiting for you to create them.<br />


Run-time testing of LSB Applications<br />

Stuart Anderson<br />
Free Standards Group<br />
anderson@freestandards.org<br />
Abstract<br />

<strong>The</strong> dynamic application test tool is capable<br />

of checking API usage at run-time. <strong>The</strong> LSB<br />

defines only a subset of all possible parameter<br />

values to be valid. This tool is capable of<br />

checking these values while the application is<br />

running.<br />

This paper will explain how this tool works,<br />

and highlight some of the more interesting implementation<br />

details such as how we managed<br />

to generate most of the code automatically,<br />

based on the interface descriptions contained<br />

in the LSB database.<br />

Results to date will be presented, along with<br />

future plans and possible uses for this tool.<br />

1 Introduction<br />

<strong>The</strong> <strong>Linux</strong> Standard Base (LSB) Project began<br />

in 1998, when the <strong>Linux</strong> community came<br />

together and decided to take action to prevent<br />

GNU/<strong>Linux</strong> based operating systems from<br />

fragmenting in the same way UNIX operating<br />

systems did in the 1980s and 1990s. <strong>The</strong> LSB<br />

defines the Application Binary Interface (ABI)<br />

for the core part of a GNU/<strong>Linux</strong> system. As<br />

an ABI, the LSB defines the interface between<br />

the operating system and the applications. A<br />

complete set of tests for an ABI must be capable<br />

of measuring the interface from both sides.<br />

Matt Elder<br />
University of South Carolina<br />
happymutant@sc.rr.com<br />
Almost from the beginning, testing has been<br />

a cornerstone of the project. <strong>The</strong> LSB was<br />

originally organized around 3 components: the<br />

written specification, a sample implementation,<br />

and the test suites. <strong>The</strong> written specification<br />

is the ultimate definition of the LSB. Both<br />

the sample implementation, and the test suites<br />

yield to the authority of the written specification.<br />

<strong>The</strong> sample implementation (SI) is a minimal<br />

subset of a GNU/<strong>Linux</strong> system that provides a<br />

runtime that implements the LSB, and as little<br />

else as possible. <strong>The</strong> SI is neither intended to<br />

be a minimal distribution, nor the basis for a<br />

distribution. Instead, it is used as both a proof<br />

of concept and a testing tool. Applications<br />

which are seeking certification are required to<br />

prove they execute correctly using the SI and<br />

two other distributions. <strong>The</strong> SI is also used to<br />

validate the runtime test suites.<br />

<strong>The</strong> third component is testing. <strong>One</strong> of the<br />

things that strengthens the LSB is its ability to<br />

measure, and thus prove, conformance to the<br />

standard. Testing is achieved with an array of<br />

different test suites, each of which measures a<br />

different aspect of the specification.<br />

LSB Runtime<br />

• cmdchk<br />

This test suite is a simple existence test<br />

that ensures the required LSB commands<br />

and utilities are found on an LSB conforming<br />

system.



• libchk<br />

This test suite checks the libraries required<br />

by the LSB to ensure they contain<br />

the interfaces and symbol versions as<br />

specified by the LSB.<br />

• runtimetests<br />

This test suite measures the behavior of<br />

the interfaces provided by the GNU/<strong>Linux</strong><br />

system. This is the largest of the test<br />

suites, and is actually broken down into<br />

several components, which are referred to<br />

collectively as the runtime tests. <strong>The</strong>se<br />

tests are derived from the test suites used<br />

by the Open Group for UNIX branding.<br />

LSB Packaging<br />

• pkgchk<br />

This test examines an RPM format package<br />

to ensure it conforms to the LSB.<br />

• pkginstchk<br />

This test suite is used to ensure that the<br />

package management tool provided by a<br />

GNU/<strong>Linux</strong> system will correctly install<br />

LSB conforming packages. This suite is<br />

still in early stages of development.<br />

LSB Application<br />

• appchk<br />

This test performs a static analysis of an<br />

application to ensure that it only uses<br />

libraries and interfaces specified by the<br />

LSB.<br />

• dynchk<br />

This test is used to measure an application's<br />

use of the LSB interfaces during its<br />

execution, and is the subject of this paper.<br />

2 <strong>The</strong> database<br />

<strong>The</strong> LSB Specification contains over 6600 interfaces,<br />

each of which is associated with a library<br />

and a header file, and may have parameters.<br />

Because of the size and complexity of the<br />

data describing these interfaces, a database is<br />

used to maintain this information.<br />

It is impractical to try to keep the specification,<br />

test suites and development libraries and<br />

headers synchronized for this much data. Instead,<br />

portions of the specification and tests,<br />

and all of the development headers and libraries<br />

are generated from the database. This<br />

ensures that as changes are made to the<br />

database, the changes are propagated to the<br />

other parts of the project as well.<br />

Some of the relevant data components in this<br />

DB are Libraries, Headers, Interfaces, and<br />

Types. <strong>The</strong>re are also secondary components<br />

and relations between all of the components. A<br />

short description of some of these is needed before<br />

moving on to how the dynchk test is constructed.<br />

2.1 Library<br />

<strong>The</strong> LSB specifies 17 shared libraries, which<br />

contain the 6600 interfaces. The interfaces<br />

in each library are grouped into logical units<br />

called a LibGroup. <strong>The</strong> LibGroups help to organize<br />

the interfaces, which is very useful in<br />

the written specification, but isn’t used much<br />

elsewhere.<br />

2.2 Interface<br />

An Interface represents a globally visible symbol,<br />

such as a function, or piece of data. Interfaces<br />

have a Type, which is either the type of<br />

the global data or the return type of the function.<br />

If the Interface is a function, then it will<br />

have zero or more Parameters, which form a


<strong>Linux</strong> Symposium 2004 • Volume <strong>One</strong> • 43<br />

set of Types ordered by their position in the parameter list.<br />
Tid Ttype Tname Tbasetype<br />
1 Intrinsic int 0<br />
2 Pointer int * 1<br />
Table 1: Example of recursion in Type table for int *<br />
Figure 1: Relationship between Library, LibGroup and Interface<br />
Figure 2: Relationship between Interface, Type and Parameter<br />
2.3 Type<br />
As mentioned above, the database contains enough information to be able to generate header files which are a part of the LSB development tools. This means that the database must be able to represent C-language types. The Type and TypeMember tables provide this. These tables are used recursively: if a Type is defined in terms of another type, then it will have a base type that points to that other type.<br />
struct foo {<br />
int a;<br />
int *b;<br />
}<br />
Figure 3: Sample struct<br />
For structs and unions, the TypeMember table is used to hold the ordered list of members. Entries in the TypeMember table point back to the Type table to describe the type of each member. For enums, the TypeMember table is also used to hold the ordered list of values.<br />

Tid Ttype Tname Tbasetype<br />

1 Intrinsic int 0<br />

2 Pointer int * 1<br />

3 Struct foo 0<br />

Table 2: Contents of Type table<br />

<strong>The</strong> structure shown in Figure 3 is represented<br />

by the entries in the Type table in Table 2 and<br />

the TypeMember table in Table 3.<br />
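To make the recursion concrete, here is a small C sketch of how a Tbasetype chain can be resolved (a toy model of the rows in Tables 1 and 2; the struct layout and function names are invented for illustration and are not taken from the actual LSB database code):

```c
#include <stddef.h>

/* Toy model of a Type table row (columns as in Table 2). */
struct type_row {
    int tid;            /* Tid */
    const char *ttype;  /* Ttype: Intrinsic, Pointer, Struct, ... */
    const char *tname;  /* Tname */
    int tbasetype;      /* Tbasetype: Tid of the base type, 0 if none */
};

static const struct type_row type_table[] = {
    { 1, "Intrinsic", "int",   0 },
    { 2, "Pointer",   "int *", 1 },
    { 3, "Struct",    "foo",   0 },
};

static const struct type_row *find_type(int tid)
{
    for (size_t i = 0; i < sizeof type_table / sizeof type_table[0]; i++)
        if (type_table[i].tid == tid)
            return &type_table[i];
    return NULL;
}

/* Follow the Tbasetype chain until a row with no base type is reached. */
static const struct type_row *resolve_base(int tid)
{
    const struct type_row *row = find_type(tid);
    while (row && row->tbasetype != 0)
        row = find_type(row->tbasetype);
    return row;
}
```

Resolving Tid 2 (the `int *` row) walks the chain down to the intrinsic `int` row, which is exactly the recursion the Tbasetype column encodes.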

2.4 Header<br />

Headers, like Libraries, have their contents arranged<br />

into logical groupings known as HeaderGroups. Unlike Libraries, these HeaderGroups<br />

Groups. Unlike Libraries, these HeaderGroups<br />

are ordered so that the proper sequence of<br />

definitions within a header file can be maintained.<br />

HeaderGroups contain Constant definitions<br />

(i.e. #define statements) and Type definitions.<br />

If you examine a few well designed<br />

header files, you will notice a pattern of a comment<br />

followed by related constant definitions<br />

and type definitions. <strong>The</strong> entire header file can<br />

be viewed as a repeating sequence of this pattern. This pattern is the basis for the HeaderGroup concept.<br />
Tmid TMname TMtypeid TMposition TMmemberof<br />
10 a 1 0 3<br />
11 b 2 1 3<br />
Table 3: Contents of TypeMember<br />
Figure 4: Organization of Headers<br />

2.5 TypeType<br />

<strong>One</strong> last construct in our database should be<br />

mentioned. While we are able to represent<br />

a syntactic description of interfaces and<br />

types in the database, this is not enough to<br />

automatically generate meaningful test cases.<br />

We need to add some semantic information<br />

that better describes how the types in structures<br />

and parameters are used. As an example,<br />

struct sockaddr contains a member,<br />

sa_family, of type unsigned short. <strong>The</strong><br />

compiler will of course ensure that only values between 0 and 2^16 − 1 will be used, but only a few of those values have any meaning in this context. By adding the semantic information that this member holds a socket family value, the test generator can cause the value found in sa_family to be tested against the legal socket family values (AF_INET, AF_INET6, etc.), instead of just ensuring the value falls between 0 and 2^16 − 1, which is really just a no-op test.<br />
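A corresponding layer-3 validator might look roughly like this sketch (the name follows the validate_* convention of Figure 7, but the body and the reporting style are assumptions, and only a few representative AF_ constants are checked):

```c
#include <stdio.h>
#include <sys/socket.h>

/* Check that a value used as a sockaddr family is one of the socket
 * families that are meaningful in this context, rather than merely
 * any unsigned short in 0..2^16-1. Returns 1 if valid, 0 otherwise. */
static int validate_socketfamily(unsigned short family, const char *name)
{
    switch (family) {
    case AF_UNIX:
    case AF_INET:
    case AF_INET6:
        return 1;
    default:
        fprintf(stderr, "%s: unexpected socket family %u\n",
                name, (unsigned)family);
        return 0;
    }
}
```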

Example TypeType entries<br />

• RWaddress<br />

An address from the process space that<br />

must be both readable and writable.<br />

• Rdaddress<br />

An address from the process space that<br />

must be at least readable.<br />

• filedescriptor<br />

A small integer value greater than or equal<br />

to 0, and less than the maximum file descriptor<br />

for the process.<br />

• pathname<br />

<strong>The</strong> name of a file or directory that should<br />

be compared against the Filesystem Hierarchy<br />

Standard.<br />

2.6 Using this data<br />

As mentioned above, the data in the database is<br />

used to generate different portions of the LSB<br />

project. This strategy was adopted to ensure



these different parts would always be in sync,<br />

without having to depend on human intervention.<br />

<strong>The</strong> written specification contains tables of interfaces,<br />

and data definitions (constants and<br />

types). <strong>The</strong>se are all generated from the<br />

database.<br />

<strong>The</strong> LSB development environment 1 consists<br />

of stub libraries and header files that contain<br />

only the interfaces defined by the LSB. This<br />

development environment helps catch the use<br />

of non-LSB interfaces during the development<br />

or porting of an application instead of being<br />

surprised by later test results. Both the stub<br />

libraries and headers are produced by scripts<br />

pulling data from the database.<br />

Some of the test suites described previously<br />

have components which are generated from the<br />

database. Cmdchk and libchk have lists of<br />

commands and interfaces respectively which<br />

are extracted from the database. <strong>The</strong> static application<br />

test tool, appchk, also has a list of<br />

interfaces that comes from the database. <strong>The</strong><br />

dynamic application test tool, dynchk, has the<br />

majority of its code generated from information<br />

in the database.<br />

3 <strong>The</strong> Dynamic Checker<br />

<strong>The</strong> static application checker simply examines<br />

an executable file to determine if it is using<br />

interfaces beyond those allowed by the LSB.<br />

This is very useful to determine if an application<br />

has been built correctly. However, it is<br />

unable to determine if the interfaces are used<br />

correctly when the application is executed. A<br />

different kind of test is required to be able to<br />

perform this level of checking. This new test<br />

must interact with the application while it is running, without interfering with the execution of the application.<br />
1 See the May issue of Linux Journal for more information on the LSB Development Environment.<br />

This new test has two major components: a<br />

mechanism for hooking itself into an application,<br />

and a collection of functions to perform<br />

the tests for all of the interfaces. <strong>The</strong>se components<br />

can mostly be developed independently<br />

of each other.<br />

3.1 <strong>The</strong> Mechanism<br />

<strong>The</strong> mechanism for interacting with the application<br />

must be transparent to and non-interfering with the application. We considered the approach<br />

used by 3 different tools: abc, ltrace, and fakeroot.<br />

• abc—This tool was the inspiration for<br />

our new dynamic checker. abc was developed<br />

as part of the SVR4 ABI test<br />

tools. abc works by modifying the target<br />

application. <strong>The</strong> application’s executable<br />

is modified to load a different version<br />

of the shared libraries and to call a<br />

different version of each interface. This<br />

is accomplished by changing the strings<br />

in the symbol table and DT_NEEDED<br />

records. For example, libc.so.1 is<br />

changed to LiBc.So.1, and fread()<br />

is changed to FrEaD(). <strong>The</strong> test set<br />

is then located in /usr/lib/LiBc.<br />

So.1, which in turn loads the original<br />

/usr/lib/libc.so.1. This mechanism<br />

works, but the requirement to modify<br />

the executable file is undesirable.<br />

• ltrace—This tool is similar to<br />

strace, except that it traces calls<br />

into shared libraries instead of calls into<br />

the kernel. ltrace uses the ptrace<br />

interface to control the application’s<br />

process. With this approach, the test sets<br />

are located in a separate program and are<br />

invoked by stopping the application upon



entry to the interface being tested. This<br />

approach has two drawbacks: first, the<br />

code required to decode the process stack<br />

and extract the parameters is unique to<br />

each architecture, and second, the tests<br />

themselves are more complicated to write<br />

since the parameters have to be fetched<br />

from the application’s process.<br />

• fakeroot—This tool is used to create<br />

an environment where an unprivileged<br />

process appears to have root privileges.<br />

fakeroot uses LD_PRELOAD to load<br />

an additional shared library before any of<br />

the shared libraries specified by the DT_<br />

NEEDED records in the executable. This<br />

extra library contains a replacement function<br />

for each file manipulation function.<br />

<strong>The</strong> functions in this library will be selected<br />

by the dynamic linker instead of the<br />

normal functions found in the regular libraries.<br />

<strong>The</strong> test sets themselves will perform<br />

tests of the parameters, and then call<br />

the original version of the functions.<br />

We chose to use the LD_PRELOAD mechanism<br />

because we felt it was the simplest to use.<br />

Based on this mechanism, a sample test case<br />

looks like Figure 5.<br />

<strong>One</strong> problem that must be avoided when using<br />

this mechanism is recursion. If the above<br />

function just called read() at the end, it<br />

would end up calling itself again. Instead, the<br />

RTLD_NEXT flag passed to dlsym() tells the<br />

dynamic linker to look up the symbol on one<br />

of the libraries loaded after the current library.<br />

This will get the original version of the function.<br />
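Filling in the details that Figure 5 elides, a complete interposer built on this mechanism looks roughly like the following sketch (the validator is a stand-in for the generated layer-2/3 code; on glibc such a library would be compiled with -shared -fPIC, linked with -ldl where required, and activated via LD_PRELOAD):

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE         /* for RTLD_NEXT */
#endif
#include <dlfcn.h>
#include <stdio.h>
#include <unistd.h>

/* Cached pointer to the next read() in the lookup order (normally libc's). */
static ssize_t (*real_read)(int, void *, size_t);

/* Stand-in for the generated layer-2/3 validators. */
static void validate_filedescriptor(int fd, const char *fn)
{
    if (fd < 0)
        fprintf(stderr, "%s: negative file descriptor\n", fn);
}

/* The interposed read(): validate, then forward to the real function. */
ssize_t read(int fd, void *buf, size_t count)
{
    if (!real_read)
        /* RTLD_NEXT skips the current object, avoiding recursion. */
        real_read = (ssize_t (*)(int, void *, size_t))
                        dlsym(RTLD_NEXT, "read");
    validate_filedescriptor(fd, "read");
    return real_read(fd, buf, count);
}
```

Because the wrapper only consults dlsym() once and then calls through the cached pointer, the application pays the lookup cost a single time per interface.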

3.2 Test set organization<br />

<strong>The</strong> test set functions are organized into 3 layers.<br />

<strong>The</strong> top layer contains the functions that<br />

are test stubs for the LSB interfaces. <strong>The</strong>se<br />

functions are implemented by calling the functions<br />

in layers 2 and 3. An example of a function<br />

in the first layer was given in Figure 5.<br />

<strong>The</strong> second layer contains the functions that<br />

test data structures and types which are passed<br />

in as parameters. <strong>The</strong>se functions are also implemented<br />

by calling the functions in layer 3<br />

and other functions in layer 2. A function in<br />

the second layer looks like Figure 6.<br />

<strong>The</strong> third layer contains functions that test the<br />

types which have been annotated with additional<br />

semantic information. <strong>The</strong>se functions<br />

often have to perform nontrivial operations to<br />

test the assertion required for these supplemental<br />

types. Figure 7 is an example of a layer 3<br />

function.<br />

Presently, there are 3056 functions in layer 1<br />

(tests for libstdc++ are not yet being generated),<br />

106 functions in layer 2, and just a few<br />

in layer 3. We estimate that the total number of<br />

functions in layer 3 upon completion of the test<br />

tool will be on the order of several dozen. <strong>The</strong><br />

functions in the first two layers are automatically<br />

generated based on the information in the<br />

database. Functions in layer 3 are hand coded.<br />

3.3 Automatic generation of the tests<br />

Table 4 summarizes the size of the test<br />

tool so far. As work progresses, these numbers<br />

will only get larger. Most of the code in<br />

the test is very repetitive, and prone to errors<br />

when edited manually. <strong>The</strong> ability to automate<br />

the process of creating this code is highly desirable.<br />

Let's take another look at the sample function from layer 1. This time, however, let's replace some of the code with a description of the information it represents; see Figure 8 for this parameterized version. All of the occurrences of the string read are actually just the function name, and could have been replaced as well. The same thing can be done for the sample function from layer 2, as seen in Figure 9.<br />



static ssize_t (*funcptr)(int, void *, size_t);<br />
ssize_t read (int arg0, void *arg1, size_t arg2) {<br />

if (!funcptr)<br />

funcptr = dlsym(RTLD_NEXT, "read");<br />

validate_filedescriptor(arg0, "read");<br />

validate_RWaddress(arg1, "read");<br />

validate_size_t(arg2, "read");<br />

return funcptr(arg0, arg1, arg2);<br />

}<br />

Figure 5: Test case for read() function<br />

void validate_struct_sockaddr_in(struct sockaddr_in *input,<br />

char *name) {<br />

validate_socketfamily(input->sin_family,name);<br />

validate_socketport(input->sin_port,name);<br />

validate_IPv4Address((input->sin_addr), name);<br />

}<br />

Figure 6: Test case for validating struct sockaddr_in<br />

Module Files Lines of Code<br />

libc 752 19305<br />

libdl 5 125<br />

libgcc_s 13 262<br />

libGL 450 11046<br />

libICE 49 1135<br />

libm 281 6568<br />

libncurses 266 6609<br />

libpam 13 335<br />

libpthread 82 2060<br />

libSM 37 865<br />

libX11 668 16112<br />

libXext 113 2673<br />

libXt 288 7213<br />

libz 39 973<br />

structs 106 1581<br />

Table 4: Summary of generated code<br />

These two examples now represent templates that can be used to create the functions for layers 1 and 2. From the previous description of the database, you can see that there is enough information available to instantiate these templates for each interface and structure used by the LSB.<br />
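Although the real generators are perl scripts that emit one stub per interface, the shape of the layer-1 template can be mimicked with a C preprocessor macro (purely illustrative; the names and validators below are invented, not part of the actual tool):

```c
#include <stdio.h>

/* Stand-ins for the generated layer-2/3 validators. */
static void validate_fd(int fd, const char *fn)
{
    if (fd < 0)
        printf("%s: bad file descriptor\n", fn);
}
static void validate_count(unsigned long n, const char *fn)
{
    (void)n; (void)fn; /* any count is acceptable here */
}

/* The layer-1 template of Figure 8 as a macro: validate each
 * parameter, then forward the call to the real implementation. */
#define DEFINE_STUB2(ret, name, realfn, T0, V0, T1, V1) \
    ret name(T0 a0, T1 a1)                              \
    {                                                   \
        V0(a0, #name);                                  \
        V1(a1, #name);                                  \
        return realfn(a0, a1);                          \
    }

/* Instantiate one checked stub around a toy "real" function. */
static long fake_pread(int fd, unsigned long n) { return (long)(fd + n); }
DEFINE_STUB2(long, checked_pread, fake_pread, int, validate_fd,
             unsigned long, validate_count)
```

Each instantiation is one line per interface, which is essentially what the generator scripts produce from the database rows.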

<strong>The</strong> automation is implemented by 2 perl<br />

scripts: gen_lib.pl and gen_tests.pl.<br />

<strong>The</strong>se scripts generate the code for layers 1 and<br />

2 respectively.<br />

Overall, these scripts work well, but we have<br />

run into a few interesting situations along the<br />

way.<br />

3.4 Handling the exceptions<br />

So far, we have come up with an overall architecture<br />

for the test tool, selected a mechanism<br />

that allows us to hook the tests into the running<br />

application, discovered the pattern in the test<br />

functions so that we could create a template for



void validate_filedescriptor(const int fd, const char *name) {<br />

if (fd >= lsb_sysconf(_SC_OPEN_MAX))<br />

ERROR("fd too big");<br />

else if (fd < 0)<br />

ERROR("fd negative");<br />

}<br />

Figure 7: Test case for validating a filedescriptor<br />

return-type read (list of parameters) {<br />

if (!funcptr)<br />

funcptr = dlsym(RTLD_NEXT, "read");<br />

validate_parameter1 type(arg0, "read");<br />

validate_parameter2 type(arg1, "read");<br />

validate_parameter3 type(arg2, "read");<br />

return funcptr(arg0, arg1, arg2);<br />

}<br />

Figure 8: Parameterized test case for a function<br />

automatically generating the code, and implemented<br />

the scripts to generate all of the tests<br />

cases. <strong>The</strong> only problem is that now we run<br />

into the real world, where things don’t always<br />

follow the rules.<br />

Here are a few of the interesting situations we<br />

have encountered:<br />

• Variadic Functions<br />

Of the 725 functions in libc, 25 of them<br />

take a variable number of parameters.<br />

This causes problems in the generation of<br />

the code for the test case, but most importantly<br />

it affects our ability to know how<br />

to process the arguments. These functions<br />

have to be written by hand to handle<br />

the special needs of these functions.<br />

For the functions in the exec, printf<br />

and scanf families, the test cases can be<br />

implemented by calling the varargs form<br />

of the function (execl() can be implemented<br />

using execv()).<br />

• open()<br />

In addition to the problems of being a<br />

variadic function, the third parameter to<br />

open() and open64() is only valid<br />

if the O_CREAT flag is set in the second<br />

parameter to these functions. This<br />

simple exception requires a small amount<br />

of manual intervention, so these functions<br />

have to be maintained by hand.<br />

• memory allocation<br />

<strong>One</strong> of the recursion problems we ran into<br />

is that memory will be allocated within<br />

the dlsym() function call, so the implementation<br />

of one test case ends up invoking<br />

the test case for one of the memory<br />

allocation routines, which by default<br />

would call dlsym(), creating the recursion.<br />

This cycle had to be broken by having<br />

the test cases for these routines call<br />

libc private interfaces to memory allocation.<br />

• changing memory map



void validate_struct_structure name(struct structure name<br />

*input, char *name) {<br />

validate_type of member 1(input->name of member 1, name);<br />

validate_type of member 2(input->name of member 2, name);<br />

validate_type of member 3((input->name of member 3), name);<br />

}<br />

Figure 9: Parameterized test case for a struct<br />

Pointers are validated by making sure they<br />

contain an address that is valid for the process.<br />

/proc/self/maps is read to obtain<br />

the memory map of the current process.<br />

These results are cached for performance reasons, but the memory<br />

map of the process will change over time.<br />

Both the stack and the heap will grow,<br />

resulting in valid pointers being checked<br />

against a cached copy of the memory map.<br />

In the event a pointer is found to be invalid,<br />

the memory map is re-read, and the<br />

pointer checked again. <strong>The</strong> mmap() and<br />

munmap() test cases are also maintained<br />

by hand so that they can also cause the<br />

memory map to be re-read.<br />

• hidden ioctl()s<br />

By design, the LSB specifies interfaces<br />

at the highest possible level. <strong>One</strong> example<br />

of this, is the use of the termio functions,<br />

instead of specifying the underlying<br />

ioctl() interface. It turns out that<br />

this tool catches the underlying ioctl()<br />

calls anyway, and flags it as an error. <strong>The</strong><br />

solution is for the termio functions the set<br />

a flag indicating that the ioctl() test<br />

case should skip its tests.<br />

• Optionally NULL parameters<br />

Many interfaces have parameters which<br />

may be NULL. This triggered lots of<br />

warnings for many programs. <strong>The</strong> solution<br />

was to add a flag indicating that<br />

the parameter may be NULL, and to not<br />

try to validate the pointer or the data being<br />

pointed to.<br />

No doubt, there will be more interesting situations<br />

to deal with before this tool is<br />

completed.<br />

4 Results<br />

As of the deadline for this paper, results are<br />

preliminary, but encouraging. <strong>The</strong> tool is initially<br />

being tested against simple commands<br />

such as ls and vi, and some X Windows clients<br />

such as xclock and xterm. <strong>The</strong> tool is correctly<br />

inserting itself into the application under test,<br />

and we are getting some interesting results that<br />

will be examined more closely.<br />

<strong>One</strong> example is that vi passes a NULL to<br />

__strtol_internal several times during<br />

startup.<br />

<strong>The</strong> tool was designed to work across all architectures.<br />

At present, it has been built and tested<br />

on only the IA32 and IA64 architectures. No<br />

significant problems are anticipated on other architectures.<br />

Additional results and experience will be presented<br />

at the conference.



5 Future Work<br />

<strong>The</strong>re is still much work to be done. Some of<br />

the outstanding tasks are highlighted here.<br />

• Additional types<br />

Semantic information needs to be added<br />

for additional parameters and structures.<br />

<strong>The</strong> additional layer 3 tests that correspond<br />

to this information must also be implemented.<br />

• Architecture-specific interfaces<br />

As we found in the LSB, there are some<br />

interfaces, and types that are unique to one<br />

or more architectures. <strong>The</strong>se need to be<br />

handled properly so they are not part of<br />

the tests when built on an architecture for<br />

which they don’t apply.<br />

• Unions<br />

Although unions are represented in the<br />

database in the same way as structures,<br />

the database does not contain enough information<br />

to describe how to interpret or<br />

test the contents of a union. Test cases that<br />

involve unions may have to be written by<br />

hand.<br />

• Additional libraries<br />

<strong>The</strong> information in the database for the<br />

graphics libraries and for libstdc++ is<br />

incomplete; therefore, it is not possible to<br />

generate all of the test cases for those libraries.<br />

Once the data is complete, the test<br />

cases will also be complete.


<strong>Linux</strong> Block IO—present and future<br />

Jens Axboe<br />

SuSE<br />

axboe@suse.de<br />

Abstract<br />

<strong>One</strong> of the primary focus points of 2.5 was fixing<br />

up the bit rotting block layer, and as a result<br />

2.6 now sports a brand new implementation of<br />

basically anything that has to do with passing<br />

IO around in the kernel, from producer to disk<br />

driver. <strong>The</strong> talk will feature an in-depth look<br />

at the IO core system in 2.6 comparing to 2.4,<br />

looking at performance, flexibility, and added<br />

functionality. <strong>The</strong> rewrite of the IO scheduler<br />

API and the new IO schedulers will get a fair<br />

treatment as well.<br />

No 2.6 talk would be complete without 2.7<br />

speculations, so I shall try to predict what<br />

changes the future holds for the world of <strong>Linux</strong><br />

block I/O.<br />

1 2.4 Problems<br />

<strong>One</strong> of the most widely criticized pieces of<br />

code in the 2.4 kernels is, without a doubt, the<br />

block layer. It has bit-rotted heavily and lacks<br />

various features or facilities that modern hardware<br />

craves. This has led to many evils, ranging<br />

from code duplication in drivers to massive<br />

patching of block layer internals in vendor<br />

kernels. As a result, vendor trees can easily<br />

be considered forks of the 2.4 kernel with<br />

respect to the block layer code, with all of<br />

the problems that this fact brings with it: the 2.4<br />

block layer code base may as well be considered<br />

dead; no one develops against it. Hardware<br />

vendor drivers include many nasty hacks<br />

and #ifdef’s to work in all of the various<br />

2.4 kernels that are out there, which doesn’t exactly<br />

enhance code coverage or peer review.<br />

<strong>The</strong> block layer fork didn’t just happen for the<br />

fun of it, of course; it was a direct result of<br />

the various problems observed. Some of these<br />

are added features, others are deeper rewrites<br />

attempting to solve scalability problems with<br />

the block layer core or IO scheduler. In the<br />

next sections I will attempt to highlight specific<br />

problems in these areas.<br />

1.1 IO Scheduler<br />

<strong>The</strong> main 2.4 IO scheduler is called<br />

elevator_linus, named after the benevolent<br />

kernel dictator to credit him for some<br />

of the ideas used. elevator_linus is a<br />

one-way scan elevator that always scans in<br />

the direction of increasing LBA. It manages<br />

latency problems by assigning sequence<br />

numbers to new requests, denoting how many<br />

new requests (either merges or inserts) may<br />

pass this one. <strong>The</strong> latency value is dependent<br />

on data direction, smaller for reads than for<br />

writes. Internally, elevator_linus uses<br />

a double linked list structure (the kernels<br />

struct list_head) to manage the request<br />

structures. When queuing a new IO unit with<br />

the IO scheduler, the list is walked to find a<br />

suitable insertion (or merge) point yielding an<br />

O(N) runtime. That in itself is suboptimal in the<br />

presence of large amounts of IO, and to make<br />

matters even worse, we repeat this scan if the<br />

request free list was empty when we entered<br />



the IO scheduler. <strong>The</strong> latter is not an error<br />

condition; it will happen all the time for even<br />

moderate amounts of write back against a<br />

queue.<br />

1.2 struct buffer_head<br />

<strong>The</strong> main IO unit in the 2.4 kernel is the<br />

struct buffer_head. It’s a fairly unwieldy<br />

structure, used at various kernel layers for different<br />

things: caching entity, file system block,<br />

and IO unit. As a result, it’s suboptimal for all<br />

of them.<br />

From the block layer point of view, the two<br />

biggest problems are the size of the structure<br />

and the limitation on how big a data region it<br />

can describe. Being limited by the file system<br />

one block semantics, it can at most describe a<br />

PAGE_CACHE_SIZE amount of data. In <strong>Linux</strong><br />

on x86 hardware that means 4KiB of data. Often<br />

it can be even worse: raw IO typically uses<br />

the soft sector size of a queue (default 1KiB)<br />

for submitting IO, which means that queuing<br />

e.g. 32KiB of IO will enter the IO scheduler 32<br />

times. To work around this limitation and get<br />

at least to a page at a time, a 2.4 hack was<br />

introduced. This is called vary_io. A driver<br />

advertising this capability acknowledges that it<br />

can manage buffer_head’s of varying sizes<br />

at the same time. File system read-ahead, another<br />

frequent user of submitting larger sized<br />

io, has no option but to submit the read-ahead<br />

window in units of the page size.<br />

1.3 Scalability<br />

With the limit on buffer_head IO size and<br />

elevator_linus runtime, it doesn’t take a<br />

lot of thinking to discover obvious scalability<br />

problems in the <strong>Linux</strong> 2.4 IO path. To add insult<br />

to injury, the entire IO path is guarded by a<br />

single, global lock: io_request_lock. This<br />

lock is held during the entire IO queuing operation,<br />

and typically also from the other end<br />

when a driver extracts requests for IO submission.<br />

A single global lock is a big enough<br />

problem on its own (bigger SMP systems will<br />

suffer immensely because of cache line bouncing),<br />

but add to that long runtimes and you have<br />

a really huge IO scalability problem.<br />

<strong>Linux</strong> vendors have shipped lock scalability<br />

patches for quite some time to get around<br />

this problem. <strong>The</strong> adopted solution is typically<br />

to make the queue lock a pointer to a driver local<br />

lock, so the driver has full control of the<br />

granularity and scope of the lock. This solution<br />

was adopted from the 2.5 kernel, as we’ll<br />

see later. But this is another case where driver<br />

writers often need to differentiate between vendor<br />

and vanilla kernels.<br />

1.4 API problems<br />

Looking at the block layer as a whole (including<br />

both ends of the spectrum, the producers<br />

and consumers of the IO units going through<br />

the block layer), it is a typical example of code<br />

that has been hacked into existence without<br />

much thought to design. When things broke<br />

or new features were needed, they had been<br />

grafted into the existing mess. No well-defined<br />

interface exists between file system and<br />

block layer, except a few scattered functions.<br />

Controlling IO unit flow from IO scheduler<br />

to driver was impossible: 2.4 exposes the IO<br />

scheduler data structures (the ->queue_head<br />

linked list used for queuing) directly to the<br />

driver. This fact alone makes it virtually impossible<br />

to implement more clever IO scheduling<br />

in 2.4. Even the recently (in the 2.4.20s)<br />

added lower latency work was horrible to work<br />

with because of this lack of boundaries. Verifying<br />

correctness of the code is extremely difficult;<br />

peer review of the code likewise, since a<br />

reviewer must be intimate with the block layer<br />

structures to follow the code.<br />

Another example of the lack of clear direction is<br />



the partition remapping. In 2.4, it’s the driver’s<br />

responsibility to resolve partition mappings.<br />

A given request contains a device and sector<br />

offset (i.e. /dev/hda4, sector 128) and the<br />

driver must map this to an absolute device offset<br />

before sending it to the hardware. Not only<br />

does this cause duplicate code in the drivers,<br />

it also means the IO scheduler has no knowledge<br />

of the real device mapping of a particular<br />

request. This adversely impacts IO scheduling<br />

whenever partitions aren’t laid out in strict ascending<br />

disk order, since it causes the IO scheduler<br />

to make the wrong decisions when ordering<br />

IO.<br />

2 2.6 Block layer<br />

<strong>The</strong> above observations were the initial kick-off<br />

for the 2.5 block layer patches. To solve some<br />

of these issues the block layer needed to be<br />

turned inside out, breaking basically anything<br />

IO-related along the way.<br />

2.1 bio<br />

Given that struct buffer_head was one<br />

of the problems, it made sense to start from<br />

scratch with an IO unit that would be agreeable<br />

to the upper layers as well as the drivers.<br />

<strong>The</strong> main criteria for such an IO unit would be<br />

something along the lines of:<br />

1. Must be able to contain an arbitrary<br />

amount of data, as much as the hardware<br />

allows. Or at least as much as makes sense,<br />

with the option of easily pushing<br />

this boundary later.<br />

2. Must work equally well for pages that<br />

have a virtual mapping as well as ones that<br />

do not.<br />

3. When entering the IO scheduler and<br />

driver, IO unit must point to an absolute<br />

location on disk.<br />

4. Must be able to stack easily for IO stacks<br />

such as raid and device mappers. This includes<br />

full redirect stacking like in 2.4, as<br />

well as partial redirections.<br />

Once the primary goals for the IO structure<br />

were laid out, the struct bio was<br />

born. It was decided to base the layout<br />

on a scatter-gather type setup, with the bio<br />

containing a map of pages. If the map<br />

count was made flexible, items 1 and 2 on<br />

the above list were already solved. <strong>The</strong><br />

actual implementation involved splitting the<br />

data container from the bio itself into a<br />

struct bio_vec structure. This was mainly<br />

done to ease allocation of the structures so<br />

that sizeof(struct bio) was always constant.<br />

<strong>The</strong> bio_vec structure is simply a tuple<br />

of {page, length, offset}, and the<br />

bio can be allocated with room for anything<br />

from 1 to BIO_MAX_PAGES. Currently <strong>Linux</strong><br />

defines that as 256 pages, meaning we can support<br />

up to 1MiB of data in a single bio for<br />

a system with 4KiB page size. At the time<br />

of implementation, 1MiB was a good deal beyond<br />

the point where increasing the IO size further<br />

didn’t yield better performance or lower<br />

CPU usage. It also has the added bonus of<br />

making the bio_vec fit inside a single page,<br />

so we avoid higher order memory allocations<br />

(sizeof(struct bio_vec) == 12 on 32-bit,<br />

16 on 64-bit) in the IO path. This is an<br />

important point, as it eases the pressure on the<br />

memory allocator. For swapping or other low<br />

memory situations, we ideally want to stress<br />

the allocator as little as possible.<br />

Different hardware can support different sizes<br />

of io. Traditional parallel ATA can do a maximum<br />

of 128KiB per request, qlogicfc SCSI<br />

doesn’t like more than 32KiB, and lots of high<br />

end controllers don’t impose a significant limit<br />

on max IO size but may restrict the maximum<br />

number of segments that one IO may be composed<br />

of. Additionally, software raid or device<br />

mapper stacks may like special alignment<br />

of IO or the guarantee that IO won’t cross<br />

stripe boundaries. All of this knowledge is either<br />

impractical or impossible to statically advertise<br />

to submitters of io, so an easy interface<br />

for populating a bio with pages was essential<br />

if supporting large IO was to become<br />

practical. <strong>The</strong> current solution is int<br />

bio_add_page() which attempts to add a single<br />

page (full or partial) to a bio. It returns the<br />

amount of bytes successfully added. Typical<br />

users of this function continue adding pages<br />

to a bio until it fails—then it is submitted for<br />

IO through submit_bio(), a new bio is allocated<br />

and populated until all data has gone<br />

out. int bio_add_page() uses statically<br />

defined parameters inside the request queue to<br />

determine how many pages can be added, and<br />

attempts to query a registered<br />

merge_bvec_fn for dynamic limits that the block layer cannot<br />

know about.<br />

Drivers hooking into the block layer before the<br />

IO scheduler 1 deal with struct bio directly,<br />

as opposed to the struct request that are<br />

output after the IO scheduler. Even though the<br />

page addition API guarantees that they never<br />

need to be able to deal with a bio that is too<br />

big, they still have to manage local splits at<br />

sub-page granularity. <strong>The</strong> API was defined that<br />

way to make it easier for IO submitters to manage,<br />

so they don’t have to deal with sub-page<br />

splits. 2.6 block layer defines two ways to<br />

deal with this situation—the first is the general<br />

clone interface. bio_clone() returns a clone<br />

of a bio. A clone is defined as a private copy of<br />

the bio itself, but with a shared bio_vec page<br />

map list. Drivers can modify the cloned bio<br />

and submit it to a different device without duplicating<br />

the data. <strong>The</strong> second interface is tailored<br />

specifically to single page splits and was<br />

written by kernel raid maintainer Neil Brown.<br />

1 Also known as at make_request time.<br />

<strong>The</strong> main function is bio_split() which returns<br />

a struct bio_pair describing the two<br />

parts of the original bio. <strong>The</strong> two bio’s can<br />

then be submitted separately by the driver.<br />

2.2 Partition remapping<br />

Partition remapping is handled inside the IO<br />

stack before going to the driver, so that both<br />

drivers and IO schedulers have immediate full<br />

knowledge of precisely where data should end<br />

up. <strong>The</strong> device unfolding is done automatically<br />

by the same piece of code that resolves<br />

full bio redirects. <strong>The</strong> worker function is<br />

blk_partition_remap().<br />

2.3 Barriers<br />

Another feature that found its way to some vendor<br />

kernels is IO barriers. A barrier is defined<br />

as a piece of IO that is guaranteed to:<br />

• Be on platter (or safe storage at least)<br />

when completion is signaled.<br />

• Not precede any previously submitted IO.<br />

• Not be preceded by later submitted IO.<br />

<strong>The</strong> feature is handy for journalled file systems,<br />

fsync, and any sort of cache bypassing<br />

IO 2 where you want to provide guarantees on<br />

data order and correctness. <strong>The</strong> 2.6 code isn’t<br />

even complete yet or in the Linus kernels, but it<br />

has made its way to Andrew Morton’s -mm tree<br />

which is generally considered a staging area for<br />

features. This section describes the code so far.<br />

<strong>The</strong> first type of barrier supported is a soft<br />

barrier. It isn’t of much use for data integrity<br />

applications, since it merely implies<br />

ordering inside the IO scheduler. It is signaled<br />

with the REQ_SOFTBARRIER flag inside<br />

struct request. A stronger barrier is the<br />

2 Such types of IO include O_DIRECT or raw.



hard barrier. From the block layer and IO<br />

scheduler point of view, it is identical to the<br />

soft variant. Drivers need to know about it<br />

though, so they can take appropriate measures<br />

to correctly honor the barrier. So far the ide<br />

driver is the only one supporting a full, hard<br />

barrier. <strong>The</strong> issue was deemed most important<br />

for journalled desktop systems, where the<br />

lack of barriers and risk of crashes / power loss<br />

coupled with ide drives generally defaulting<br />

to write-back caching caused significant<br />

problems. Since the ATA command set<br />

isn’t very intelligent in this regard, the ide solution<br />

adopted was to issue pre- and post flushes<br />

when encountering a barrier.<br />

<strong>The</strong> hard and soft barrier share the feature that<br />

they are both tied to a piece of data (a bio,<br />

really) and cannot exist outside of data context.<br />

Certain applications of barriers would really<br />

like to issue a disk flush, where finding out<br />

which piece of data to attach it to is hard or<br />

impossible. To solve this problem, the 2.6 barrier<br />

code added the blkdev_issue_flush()<br />

function. <strong>The</strong> block layer part of the code is basically<br />

tied to a queue hook, so the driver issues<br />

the flush on its own. A helper function is provided<br />

for SCSI type devices, using the generic<br />

SCSI command transport that the block layer<br />

provides in 2.6 (more on this later). Unlike<br />

the queued data barriers, a barrier issued with<br />

blkdev_issue_flush() works on all interesting<br />

drivers in 2.6 (IDE, SCSI, SATA). <strong>The</strong><br />

only missing bits are drivers that don’t belong<br />

to one of these classes—things like CISS and<br />

DAC960.<br />

2.4 IO Schedulers<br />

As mentioned in section 1.1, there are a number<br />

of known problems with the default 2.4 IO<br />

scheduler and IO scheduler interface (or lack<br />

thereof). <strong>The</strong> idea to base latency on a unit of<br />

data (sectors) rather than a time-based unit is<br />

hard to tune, or requires auto-tuning at runtime,<br />

and this never really worked out. Fixing the<br />

runtime problems with elevator_linus is<br />

next to impossible due to the exposed data structure<br />

problem. So before being able to tackle<br />

any problems in that area, a neat API to the IO<br />

scheduler had to be defined.<br />

2.4.1 Defined API<br />

In the spirit of avoiding over-design 3 , the API<br />

was based on an initial adaptation of elevator_<br />

linus, but has since grown quite a bit as newer<br />

IO schedulers required more entry points to exploit<br />

their features.<br />

<strong>The</strong> core function of an IO scheduler is, naturally,<br />

insertion of new IO units and extraction of<br />

ditto from drivers. So the first two API functions<br />

are defined, next_req_fn and add_req_fn.<br />

If you recall from section 1.1, a new IO<br />

unit is first attempted merged into an existing<br />

request in the IO scheduler queue. And<br />

if this fails and the newly allocated request<br />

has raced with someone else adding an adjacent<br />

IO unit to the queue in the mean time,<br />

we also attempt to merge struct requests.<br />

So two more functions were added to cater to<br />

these needs, merge_fn and merge_req_fn.<br />

Cleaning up after a successful merge is done<br />

through merge_cleanup_fn. Finally, a defined<br />

IO scheduler can provide init and exit<br />

functions, should it need to perform any duties<br />

during queue init or shutdown.<br />

<strong>The</strong> above described the IO scheduler API<br />

as of 2.5.1, later on more functions were<br />

added to further abstract the IO scheduler<br />

away from the block layer core. More details<br />

may be found in the struct elevator_s in<br />

kernel include file.<br />

3 Some might, rightfully, claim that this is worse than<br />

no design.



2.4.2 deadline<br />

In kernel 2.5.39, elevator_linus was finally<br />

replaced by something more appropriate,<br />

the deadline IO scheduler. <strong>The</strong> principles behind<br />

it are pretty straightforward: new requests<br />

are assigned an expiry time in milliseconds,<br />

based on data direction. Internally, requests<br />

are managed on two different data structures.<br />

<strong>The</strong> sort list, used for inserts and front<br />

merge lookups, is based on a red-black tree.<br />

This provides O(log n) runtime for both insertion<br />

and lookups, clearly superior to the doubly<br />

linked list. Two FIFO lists exist for tracking<br />

request expiry times, using a doubly linked<br />

list. Since strict FIFO behavior is maintained<br />

on these two lists, they run in O(1) time. For<br />

back merges it is important to maintain good<br />

performance as well, as they dominate the total<br />

merge count due to the layout of files on<br />

disk. So deadline added a merge hash for<br />

back merges, ideally providing O(1) runtime<br />

for merges. Additionally, deadline adds a one-hit<br />

merge cache that is checked even before going<br />

to the hash. This gets surprisingly good hit<br />

rates, serving as much as 90% of the merges<br />

even for heavily threaded io.<br />

Implementation details aside, deadline continues<br />

to build on the fact that the fastest way to<br />

access a single drive, is by scanning in the direction<br />

of ascending sector. With its superior<br />

runtime performance, deadline is able to support<br />

very large queue depths without suffering<br />

a performance loss or spending large amounts<br />

of time in the kernel. It also doesn’t suffer from<br />

latency problems due to increased queue sizes.<br />

When a request expires in the FIFO, deadline<br />

jumps to that disk location and starts serving<br />

IO from there. To prevent accidental seek<br />

storms (which would further cause us to miss<br />

deadlines), deadline attempts to serve a number<br />

of requests from that location before jumping<br />

to the next expired request. This means that<br />

the assigned request deadlines are soft, not a<br />

specific hard target that must be met.<br />

2.4.3 Anticipatory IO scheduler<br />

While deadline works very well for most<br />

workloads, it fails to observe the natural dependencies<br />

that often exist between synchronous<br />

reads. Say you want to list the contents of<br />

a directory—that operation isn’t merely a single<br />

sync read, it consists of a number of reads<br />

where only the completion of the final request<br />

will give you the directory listing. With deadline,<br />

you could get decent performance from<br />

such a workload in presence of other IO activities<br />

by assigning very tight read deadlines. But<br />

that isn’t very optimal, since the disk will be<br />

serving other requests in between the dependent<br />

reads, causing a potentially disk-wide seek<br />

every time. On top of that, the tight deadlines<br />

will decrease performance on other IO streams<br />

in the system.<br />

Nick Piggin implemented an anticipatory IO<br />

scheduler [Iyer] during 2.5 to explore some interesting<br />

research in this area. <strong>The</strong> main idea<br />

behind the anticipatory IO scheduler is a concept<br />

called deceptive idleness. When a process<br />

issues a request and it completes, it might be<br />

ready to issue a new request (possibly close<br />

by) immediately. Take the directory listing example<br />

from above—it might require 3–4 IO<br />

operations to complete. When each of them<br />

completes, the process 4 is ready to issue the<br />

next one almost instantly. But the traditional<br />

IO scheduler doesn’t pay any attention to this<br />

fact, the new request must go through the IO<br />

scheduler and wait its turn. With deadline, you<br />

would have to typically wait 500 milliseconds<br />

for each read, if the queue is held busy by other<br />

processes. <strong>The</strong> result is poor interactive performance<br />

for each process, even though overall<br />

throughput might be acceptable or even good.<br />

4 Or the kernel, on behalf of the process.



Instead of moving on to the next request from<br />

an unrelated process immediately, the anticipatory<br />

IO scheduler (henceforth known as AS)<br />

opens a small window of opportunity for that<br />

process to submit a new IO request. If that happens,<br />

AS gives it a new chance and so on. Internally<br />

it keeps a decaying histogram of IO think<br />

times to help the anticipation be as accurate as<br />

possible.<br />

Internally, AS is quite like deadline. It uses the<br />

same data structures and algorithms for sorting,<br />

lookups, and FIFO. If the think time is set<br />

to 0, it is very close to deadline in behavior.<br />

<strong>The</strong> only differences are various optimizations<br />

that have been applied to either scheduler allowing<br />

them to diverge a little. If AS is able to<br />

reliably predict when waiting for a new request<br />

is worthwhile, it gets phenomenal performance<br />

with excellent interactiveness. Often the system<br />

throughput is sacrificed a little bit, so depending<br />

on the workload AS might not be the<br />

best choice always. <strong>The</strong> IO storage hardware<br />

used, also plays a role in this—a non-queuing<br />

ATA hard drive is a much better fit than a SCSI<br />

drive with a large queuing depth. <strong>The</strong> SCSI<br />

firmware reorders requests internally, thus often<br />

destroying any accounting that AS is trying<br />

to do.<br />

2.4.4 CFQ<br />

<strong>The</strong> third new IO scheduler in 2.6 is called<br />

CFQ. It’s loosely based on the ideas of<br />

stochastic fair queuing (SFQ [McKenney]).<br />

SFQ is fair as long as its hashing doesn’t collide,<br />

and to avoid that, it uses a continually<br />

changing hashing function. Collisions can’t be<br />

completely avoided though; their frequency will depend<br />

entirely on workload and timing. CFQ<br />

is an acronym for completely fair queuing, attempting<br />

to get around the collision problem<br />

that SFQ suffers from. To do so, CFQ does<br />

away with the fixed number of buckets that<br />

processes can be placed in, and by using a regular<br />

hashing technique to find the appropriate<br />

bucket in case of collisions, fatal collisions are<br />

avoided.<br />

CFQ deviates radically from the concepts that<br />

deadline and AS are based on. It doesn’t assign<br />

deadlines to incoming requests to maintain<br />

fairness; instead it attempts to divide<br />

bandwidth equally among classes of processes<br />

based on some correlation between them. <strong>The</strong><br />

default is to hash on thread group id, tgid.<br />

This means that an attempt is made to distribute<br />

bandwidth equally among the processes in the<br />

system. Each class has its own request sort<br />

and hash list, using red-black trees again for<br />

sorting and regular hashing for back merges.<br />

When dealing with writes, there is a little catch.<br />

A process will almost never be performing its<br />

own writes—data is marked dirty in context of<br />

the process, but write back usually takes place<br />

from the pdflush kernel threads. So CFQ is<br />

actually dividing read bandwidth among processes,<br />

while treating each pdflush thread as a<br />

separate process. Usually this has very minor<br />

impact on write back performance. Latency is<br />

much less of an issue with writes, and good<br />

throughput is very easy to achieve due to their<br />

inherent asynchronous nature.<br />

2.5 Request allocation<br />

Each block driver in the system has at least<br />

one request_queue_t request queue structure<br />

associated with it. <strong>The</strong> recommended<br />

setup is to assign a queue to each logical<br />

spindle. In turn, each request queue has<br />

a struct request_list embedded which<br />

holds free struct request structures used<br />

for queuing IO. 2.4 improved on the situation<br />

in 2.2, where a single global free list was<br />

available, by adding one per queue instead. This<br />

free list was split into two sections of equal<br />

size, for reads and writes, to prevent either



direction from starving the other 5 . 2.4 statically<br />

allocated a big chunk of requests for each<br />

queue, all residing in the precious low memory<br />

of a machine. <strong>The</strong> combination of O(N) runtime<br />

and statically allocated request structures<br />

firmly prevented any real world experimentation<br />

with large queue depths on 2.4 kernels.<br />

2.6 improves on this situation by dynamically<br />

allocating request structures on the fly instead.<br />

Each queue still maintains its request free list<br />

like in 2.4. However it’s also backed by a memory<br />

pool 6 to provide deadlock free allocations<br />

even during swapping. <strong>The</strong> more advanced<br />

io schedulers in 2.6 usually back each request<br />

by its own private request structure, further<br />

increasing the memory pressure of each request.<br />

Dynamic request allocation lifts some of<br />

this pressure as well by pushing that allocation<br />

inside two hooks in the IO scheduler API—<br />

set_req_fn and put_req_fn. <strong>The</strong> latter<br />

handles the later freeing of that data structure.<br />
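Conceptually, the reserve-pool technique behind mempool_t can be sketched in userspace C. This is only an illustration of the idea (the names and sizes here are invented for the sketch), not the kernel’s actual mempool API:<br />

```c
#include <stdlib.h>

#define POOL_MIN 4  /* guaranteed reserve, like a mempool's min_nr */

struct pool {
    void *reserve[POOL_MIN];   /* pre-allocated emergency objects */
    int nr_free;               /* unused reserve slots */
    size_t obj_size;
};

/* Pre-allocate the reserve up front, while memory is plentiful. */
static int pool_init(struct pool *p, size_t obj_size)
{
    p->obj_size = obj_size;
    for (p->nr_free = 0; p->nr_free < POOL_MIN; p->nr_free++) {
        p->reserve[p->nr_free] = malloc(obj_size);
        if (!p->reserve[p->nr_free])
            return -1;
    }
    return 0;
}

/* Try the normal allocator first; fall back to the reserve so an
   allocation can still succeed under memory pressure. */
static void *pool_get(struct pool *p)
{
    void *obj = malloc(p->obj_size);
    if (!obj && p->nr_free > 0)
        obj = p->reserve[--p->nr_free];
    return obj;
}

/* Returned objects refill the reserve before going back to the
   general allocator. */
static void pool_put(struct pool *p, void *obj)
{
    if (p->nr_free < POOL_MIN)
        p->reserve[p->nr_free++] = obj;
    else
        free(obj);
}
```

Because every free replenishes the reserve first, request allocation can always make forward progress, which is what makes the scheme safe to use even during swapping.<br />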

2.6 Plugging<br />

For the longest time, the <strong>Linux</strong> block layer has<br />

used a technique dubbed plugging to increase<br />

IO throughput. In its simplicity, plugging<br />

works sort of like the plug in your tub drain—<br />

when IO is queued on an initially empty queue,<br />

the queue is plugged. Only when someone asks<br />

for the completion of some of the queued IO is<br />

the plug yanked out, and io is allowed to drain<br />

from the queue. So instead of submitting the<br />

first request immediately to the driver, the block layer<br />

allows a small buildup of requests. <strong>The</strong>re’s<br />

nothing wrong with the principle of plugging,<br />

and it has been shown to work well for a number<br />

of workloads. However, the block layer<br />

maintains a global list of plugged queues inside<br />

the tq_disk task queue. <strong>The</strong>re are three<br />

main problems with this approach:<br />

5 In reality, to prevent writes from consuming all requests.<br />

6 mempool_t interface from Ingo Molnar.<br />

1. It’s impossible to go backwards from the<br />

file system and find the specific queue to<br />

unplug.<br />

2. Unplugging one queue through tq_disk<br />

unplugs all plugged queues.<br />

3. <strong>The</strong> act of plugging and unplugging<br />

touches a global lock.<br />

All of these adversely impact performance.<br />

<strong>The</strong>se problems weren’t really solved until late<br />

in 2.6, when Intel reported a huge scalability<br />

problem related to unplugging [Chen] on a 32<br />

processor system. 93% of system time was<br />

spent due to contention on blk_plug_lock,<br />

which is the 2.6 direct equivalent of the 2.4<br />

tq_disk embedded lock. <strong>The</strong> proposed solution<br />

was to move the plug lists to a<br />

per-CPU structure. While this would solve the<br />

contention problems, it still leaves the other two<br />

items on the above list unsolved.<br />

So work was started to find a solution that<br />

would fix all problems at once, and just generally<br />

Feel Right. 2.6 contains a link between<br />

the block layer and write out paths<br />

which is embedded inside the queue, a<br />

struct backing_dev_info. This structure<br />

holds information on read-ahead and queue<br />

congestion state. It’s also possible to go from<br />

a struct page to the backing device, which<br />

may or may not be a block device. So it<br />

would seem an obvious idea to move to a backing<br />

device unplugging scheme instead, getting<br />

rid of the global blk_run_queues() unplugging.<br />

That solution would fix all three issues at<br />

once—there would be no global way to unplug<br />

all devices, only target specific unplugs, and<br />

the backing device gives us a mapping from<br />

page to queue. <strong>The</strong> code was rewritten to do<br />

just that, and provide unplug functionality going<br />

from a specific struct block_device,<br />

page, or backing device. Code and interface<br />

was much superior to the existing code base,



and results were truly amazing. Jeremy Higdon<br />

tested on an 8-way IA64 box [Higdon] and<br />

got 75–80 thousand IOPS on the stock kernel<br />

at 100% CPU utilization, 110 thousand IOPS<br />

with the per-CPU Intel patch also at full CPU<br />

utilization, and finally 200 thousand IOPS at<br />

merely 65% CPU utilization with the backing<br />

device unplugging. So not only did the new<br />

code provide a huge speed increase on this<br />

machine, it also went from being CPU to IO<br />

bound.<br />

2.6 also contains some additional logic to<br />

unplug a given queue once it reaches the<br />

point where waiting longer doesn’t make much<br />

sense. So where 2.4 will always wait for an explicit<br />

unplug, 2.6 can trigger an unplug when<br />

one of two conditions is met:<br />

1. <strong>The</strong> number of queued requests reaches a<br />

certain limit, q->unplug_thresh. This<br />

is device tweakable and defaults to 4.<br />

2. When the queue has been idle for q-><br />

unplug_delay. Also device tweakable,<br />

and defaults to 3 milliseconds.<br />

<strong>The</strong> idea is that once a certain number of<br />

requests have accumulated in the queue, it<br />

doesn’t make much sense to continue waiting<br />

for more—there is already an adequate number<br />

available to keep the disk happy. <strong>The</strong> time limit<br />

is really a last resort, and should rarely trigger<br />

in real life. Observations on various<br />

workloads have verified this. More than a handful<br />

of timer unplugs per minute usually indicates<br />

a kernel bug.<br />
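The unplug decision can be modeled with a toy state machine (hypothetical names, purely illustrative; in the real block layer the idle case is handled by a kernel timer firing after q->unplug_delay):<br />

```c
/* Toy model of 2.6 auto-unplugging: the first request plugs an
   empty queue and arms an idle timer; reaching unplug_thresh
   queued requests unplugs immediately, without the timer. */
struct queue_model {
    int nrq;            /* requests currently queued */
    int unplug_thresh;  /* defaults to 4 in 2.6 */
    int plugged;
    int timer_armed;    /* stands in for the unplug_delay timer */
};

static void q_unplug(struct queue_model *q)
{
    q->plugged = 0;
    q->timer_armed = 0;
    /* the real code would now dispatch queued requests */
}

static void q_add_request(struct queue_model *q)
{
    if (!q->plugged && q->nrq == 0) {
        q->plugged = 1;      /* first request plugs the queue */
        q->timer_armed = 1;  /* last-resort unplug after a delay */
    }
    q->nrq++;
    if (q->plugged && q->nrq >= q->unplug_thresh)
        q_unplug(q);         /* enough requests: stop waiting */
}
```

With the default threshold of 4, the fourth queued request triggers the unplug and the timer never fires, matching the observation that timer unplugs should be rare.<br />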

2.7 SCSI command transport<br />

An annoying aspect of CD writing applications<br />

in 2.4 has been the need to use ide-scsi, necessitating<br />

the inclusion of the entire SCSI stack<br />

for only that application. With the clear majority<br />

of the market being ATAPI hardware, this<br />

becomes even more silly. ide-scsi isn’t without<br />

its own class of problems either—it lacks the<br />

ability to use DMA on certain writing types.<br />

CDDA audio ripping is another application that<br />

thrives with ide-scsi, since the native uniform<br />

cdrom layer interface is less than optimal (put<br />

mildly). It doesn’t have DMA capabilities at<br />

all.<br />

2.7.1 Enhancing struct request<br />

<strong>The</strong> problem with 2.4 was the lack of ability<br />

to generically send SCSI “like” commands<br />

to devices that understand them. Historically,<br />

only file system read/write requests could be<br />

submitted to a driver. Some drivers made up<br />

faked requests for other purposes themselves<br />

and put them on the queue for their own consumption,<br />

but no defined way of doing this existed.<br />

2.6 adds a new request type, marked by<br />

the REQ_BLOCK_PC bit. Such a request can be<br />

either backed by a bio like a file system request,<br />

or simply have data and length fields set.<br />

For both types, a SCSI command data block is<br />

filled inside the request. With this infrastructure<br />

in place and appropriate update to drivers<br />

to understand these requests, it’s a cinch to support<br />

a much better direct-to-device interface for<br />

burning.<br />

Most applications use the SCSI sg API for talking<br />

to devices. Some of them talk directly to<br />

the /dev/sg* special files, while (most) others<br />

use the SG_IO ioctl interface. <strong>The</strong> former<br />

requires an as-yet-unfinished driver to transform<br />

them into block layer requests, but the latter<br />

can be readily intercepted in the kernel and<br />

routed directly to the device instead of through<br />

the SCSI layer. Helper functions were added<br />

to make burning and ripping even faster, providing<br />

DMA for all applications and without<br />

copying data between kernel and user space at<br />

all. So the zero-copy DMA burning was possible,<br />

and this even without changing most applications.<br />
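For reference, the userspace side of SG_IO fills a struct sg_io_hdr and issues ioctl(fd, SG_IO, &hdr). The sketch below prepares a SCSI INQUIRY; the helper name is invented for illustration, and the ioctl call and error handling are left to the caller:<br />

```c
#include <string.h>
#include <scsi/sg.h>   /* struct sg_io_hdr, SG_IO, SG_DXFER_* */

/* Fill an sg_io_hdr for a 6-byte SCSI INQUIRY reading buf_len
   bytes from the device.  The caller then issues
   ioctl(fd, SG_IO, hdr) on the device node. */
void prepare_inquiry(struct sg_io_hdr *hdr, unsigned char cdb[6],
                     unsigned char *sense, int sense_len,
                     unsigned char *buf, int buf_len)
{
    memset(cdb, 0, 6);
    cdb[0] = 0x12;                    /* INQUIRY opcode */
    cdb[4] = (unsigned char)buf_len;  /* allocation length */

    memset(hdr, 0, sizeof(*hdr));
    hdr->interface_id = 'S';          /* mandatory magic for sg */
    hdr->cmd_len = 6;
    hdr->cmdp = cdb;
    hdr->dxfer_direction = SG_DXFER_FROM_DEV;
    hdr->dxferp = buf;
    hdr->dxfer_len = buf_len;
    hdr->sbp = sense;                 /* sense data on error */
    hdr->mx_sb_len = sense_len;
    hdr->timeout = 5000;              /* milliseconds */
}
```

In 2.6, the same ioctl issued against a block device node can be intercepted and routed to the device as a REQ_BLOCK_PC request, as described above.<br />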

3 <strong>Linux</strong>-2.7<br />

<strong>The</strong> 2.5 development cycle saw the most massively<br />

changed block layer in the history of<br />

<strong>Linux</strong>. Before 2.5 was opened, Linus had<br />

clearly expressed that one of the most important<br />

things that needed doing, was the block<br />

layer update. And indeed, the very first thing<br />

merged was the complete bio patch into 2.5.1-<br />

pre2. At that time, no more than a handful of<br />

drivers compiled (let alone worked). <strong>The</strong> 2.7<br />

changes will be nowhere as severe or drastic.<br />

A few of the possible directions will follow in<br />

the next few sections.<br />

3.1 IO Priorities<br />

Prioritized IO is a very interesting area that<br />

is sure to generate lots of discussion and development.<br />

It’s one of the missing pieces of<br />

the complete resource management puzzle that<br />

several groups of people would very much like<br />

to solve. People running systems with many<br />

users, or machines hosting virtual hosts (or<br />

completely virtualized environments) are dying<br />

to be able to provide some QOS guarantees.<br />

Some work was already done in this<br />

area, but so far nothing complete has materialized.<br />

<strong>The</strong> CKRM [CKRM] project spearheaded by<br />

IBM is an attempt to define global resource<br />

management, including io. <strong>The</strong>y applied a little<br />

work to the CFQ IO scheduler to provide<br />

equal bandwidth between resource management<br />

classes, but at no specific priorities. Currently<br />

I have a CFQ patch that is 99% complete<br />

that provides full priority support, using the IO<br />

contexts introduced by AS to manage fair sharing<br />

over the full time span that a process exists<br />

7 . This works well enough, but only works<br />

7 CFQ currently tears down a class structure as soon<br />

as it is empty; it doesn’t persist over the process lifetime.<br />

for that specific IO scheduler. A nicer solution<br />

would be to create a scheme that works independently<br />

of the io scheduler used. That would<br />

require a rethinking of the IO scheduler API.<br />

3.2 IO Scheduler switching<br />

Currently <strong>Linux</strong> provides no less than 4 IO<br />

schedulers—the 3 mentioned, plus a fourth<br />

dubbed noop. <strong>The</strong> latter is a simple IO scheduler<br />

that does no request reordering, no latency<br />

management, and always merges whenever it<br />

can. Its area of application is mainly highly<br />

intelligent hardware with huge queue depths,<br />

where regular request reordering doesn’t make<br />

sense. Selecting a specific IO scheduler can<br />

either be done by modifying the source of a<br />

driver and putting the appropriate calls in there<br />

at queue init time, or globally for any queue by<br />

passing the elevator=xxx boot parameter.<br />

This makes it impossible, or at least very impractical,<br />

to benchmark different IO schedulers<br />

without many reboots or recompiles. Some<br />

way to switch IO schedulers per queue and on<br />

the fly is desperately needed. Freezing a queue<br />

and letting IO drain from it until it’s empty<br />

(pinning new IO along the way), and then shutting<br />

down the old io scheduler and moving to<br />

the new scheduler would not be so hard to do.<br />

<strong>The</strong> queues expose various sysfs variables already,<br />

so the logical approach would simply be<br />

to:<br />

# echo deadline > \<br />

/sys/block/hda/queue/io_scheduler<br />

A simple but effective interface. At least two<br />

patches doing something like this were already<br />

proposed, but nothing was merged at that time.<br />

4 Final comments<br />

<strong>The</strong> block layer code in 2.6 has come a long<br />

way from the rotted 2.4 code. New features



bring it more up-to-date with modern hardware,<br />

and a core completely rewritten from scratch<br />

provides much better scalability, performance,<br />

and memory usage, benefiting any machine<br />

from small to really huge. Going back<br />

a few years, I heard constant complaints about<br />

the block layer and how much it sucked and<br />

how outdated it was. <strong>The</strong>se days I rarely<br />

hear anything about the current state of affairs,<br />

which usually means that it’s doing pretty well<br />

indeed. 2.7 work will mainly focus on feature<br />

additions and driver layer abstractions (our<br />

concept of IDE layer, SCSI layer etc will be<br />

severely shaken up). Nothing that will wreak<br />

havoc and turn everything inside out like 2.5<br />

did. Most of the 2.7 work mentioned above<br />

is pretty light, and could easily be backported<br />

to 2.6 once it has been completed and tested.<br />

Which is also a good sign that nothing really<br />

radical or risky is missing. So things are settling<br />

down, a sign of stability.<br />

References<br />

[Higdon] Jeremy Higdon, Re: [PATCH]<br />

per-backing dev unplugging #2, <strong>Linux</strong><br />

kernel mailing list,<br />

http://marc.theaimsgroup.com/?l=linux-kernel&m=107941470424309&w=2,<br />

2004<br />

[CKRM] IBM, Class-based <strong>Kernel</strong> Resource<br />

Management (CKRM),<br />

http://ckrm.sf.net, 2004<br />

[Bhattacharya] Suparna Bhattacharya, Notes<br />

on the Generic Block Layer Rewrite in<br />

<strong>Linux</strong> 2.5, General discussion,<br />

Documentation/block/biodoc.txt, 2002<br />

[Iyer] Sitaram Iyer and Peter Druschel,<br />

Anticipatory scheduling: A disk<br />

scheduling framework to overcome<br />

deceptive idleness in synchronous I/O,<br />

18th ACM Symposium on Operating<br />

Systems Principles,<br />

http://www.cs.rice.edu/~ssiyer/r/antsched/antsched.ps.gz,<br />

2001<br />

[McKenney] Paul E. McKenney, Stochastic<br />

Fairness Queuing, INFOCOM,<br />

http://rdrop.com/users/paulmck/paper/sfq.2002.06.04.pdf,<br />

1990<br />

[Chen] Kenneth W. Chen, per-cpu<br />

blk_plug_list, <strong>Linux</strong> kernel mailing list,<br />

http://www.ussg.iu.edu/hypermail/linux/kernel/0403.0/0179.html,<br />

2004<br />




<strong>Linux</strong> AIO Performance and Robustness for<br />

Enterprise Workloads<br />

Suparna Bhattacharya, IBM (suparna@in.ibm.com)<br />

John Tran, IBM (jbtran@ca.ibm.com)<br />

Mike Sullivan, IBM (mksully@us.ibm.com)<br />

Chris Mason, SUSE (mason@suse.com)<br />

1 Abstract<br />

In this paper we address some of the issues<br />

identified during the development and stabilization<br />

of Asynchronous I/O (AIO) on <strong>Linux</strong><br />

2.6.<br />

We start by describing improvements made to<br />

optimize the throughput of streaming buffered<br />

filesystem AIO for microbenchmark runs.<br />

Next, we discuss certain tricky issues in ensuring<br />

data integrity between AIO Direct I/O<br />

(DIO) and buffered I/O, and take a deeper look<br />

at synchronized I/O guarantees, concurrent<br />

I/O, write-ordering issues and the improvements<br />

resulting from radix-tree based writeback<br />

changes in the <strong>Linux</strong> VFS.<br />

We then investigate the results of using <strong>Linux</strong><br />

2.6 filesystem AIO on the performance metrics<br />

for certain enterprise database workloads<br />

which are expected to benefit from AIO, and<br />

mention a few tips on optimizing AIO for such<br />

workloads. Finally, we briefly discuss the issues<br />

around workloads that need to combine<br />

asynchronous disk I/O and network I/O.<br />

2 Introduction<br />

AIO enables a single application thread to<br />

overlap processing with I/O operations for better<br />

utilization of CPU and devices. AIO can<br />

improve the performance of certain kinds of<br />

I/O intensive applications like databases, webservers<br />

and streaming-content servers. <strong>The</strong><br />

use of AIO also tends to help such applications<br />

adapt and scale more smoothly to varying<br />

loads.<br />

2.1 Overview of kernel AIO in <strong>Linux</strong> 2.6<br />

<strong>The</strong> <strong>Linux</strong> 2.6 kernel implements in-kernel<br />

support for AIO. A low-level native AIO system<br />

call interface is provided that can be invoked<br />

directly by applications or used by library<br />

implementations to build POSIX/SUS<br />

semantics. All discussion hereafter in this paper<br />

pertains to the native kernel AIO interfaces.<br />

Applications can submit one or more<br />

I/O requests asynchronously using the<br />

io_submit() system call, and obtain<br />

completion notification using the<br />

io_getevents() system call. Each<br />

I/O request specifies the operation (typically<br />

read/write), the file descriptor and the parameters<br />

for the operation (e.g., file offset,<br />

buffer). I/O requests are associated with the<br />

completion queue (ioctx) they were submitted<br />

against. <strong>The</strong> results of I/O are reported as<br />

completion events on this queue, and reaped<br />

using io_getevents().<br />
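A minimal sketch of this submit-and-reap cycle, using raw system calls since glibc does not wrap the native AIO interface (the wrapper names are invented; real applications typically go through libaio instead):<br />

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/aio_abi.h>

/* Thin wrappers: glibc exposes no stubs for the native AIO calls. */
static long sys_io_setup(unsigned nr, aio_context_t *ctx)
{
    return syscall(SYS_io_setup, nr, ctx);
}
static long sys_io_submit(aio_context_t ctx, long n, struct iocb **iocbpp)
{
    return syscall(SYS_io_submit, ctx, n, iocbpp);
}
static long sys_io_getevents(aio_context_t ctx, long min_nr, long nr,
                             struct io_event *events, struct timespec *ts)
{
    return syscall(SYS_io_getevents, ctx, min_nr, nr, events, ts);
}
static long sys_io_destroy(aio_context_t ctx)
{
    return syscall(SYS_io_destroy, ctx);
}

/* Submit one asynchronous read with io_submit() and reap its
   completion event with io_getevents().  Returns the byte count,
   or -1 on failure. */
long aio_read_once(int fd, void *buf, size_t len, long long off)
{
    aio_context_t ctx = 0;
    struct iocb cb;
    struct iocb *cbs[1] = { &cb };
    struct io_event ev;
    long res = -1;

    if (sys_io_setup(8, &ctx) < 0)       /* create completion queue */
        return -1;

    memset(&cb, 0, sizeof(cb));
    cb.aio_lio_opcode = IOCB_CMD_PREAD;  /* positioned read */
    cb.aio_fildes = fd;
    cb.aio_buf = (uint64_t)(uintptr_t)buf;
    cb.aio_nbytes = len;
    cb.aio_offset = off;

    if (sys_io_submit(ctx, 1, cbs) == 1 &&
        sys_io_getevents(ctx, 1, 1, &ev, NULL) == 1)
        res = (long)ev.res;              /* bytes read or -errno */

    sys_io_destroy(ctx);
    return res;
}
```

Here io_getevents() blocks until the single event arrives; a server would instead keep many iocbs in flight and reap completions while doing other work.<br />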

<strong>The</strong> design of AIO for the <strong>Linux</strong> 2.6 kernel has<br />

been discussed in [1], including the motivation



behind certain architectural choices, for example:<br />

• Sharing a common code path for AIO and<br />

regular I/O<br />

• A retry-based model for AIO continuations<br />

across blocking points in the case of<br />

buffered filesystem AIO (currently implemented<br />

as a set of patches to the <strong>Linux</strong> 2.6<br />

kernel) where worker threads take on the<br />

caller’s address space for executing retries<br />

involving access to user-space buffers.<br />

2.2 Background on retry-based AIO<br />

<strong>The</strong> retry-based model allows an AIO request<br />

to be executed as a series of non-blocking iterations.<br />

Each iteration retries the remaining<br />

part of the request from where the last iteration<br />

left off, re-issuing the corresponding<br />

AIO filesystem operation with modified arguments<br />

representing the remaining I/O. <strong>The</strong> retries<br />

are “kicked” via a special AIO waitqueue<br />

callback routine, aio_wake_function(),<br />

which replaces the default waitqueue entry<br />

used for blocking waits.<br />

<strong>The</strong> high-level retry infrastructure is responsible<br />

for running the iterations in the address<br />

space context of the caller, and ensures that<br />

only one retry instance is active at a given time.<br />

This relieves the fops themselves from having<br />

to deal with potential races of that sort.<br />

2.3 Overview of the rest of the paper<br />

In subsequent sections of this paper, we describe<br />

our experiences in addressing several issues<br />

identified during the optimization and stabilization<br />

efforts related to the kernel AIO implementation<br />

for <strong>Linux</strong> 2.6, mainly in the area<br />

of disk- or filesystem-based AIO.<br />

We observe, for example, how I/O patterns<br />

generated by the common VFS code paths<br />

used by regular and retry-based AIO could<br />

be non-optimal for streaming AIO requests,<br />

and we describe the modifications that address<br />

this finding. A different set of problems<br />

that has seen some development activity<br />

are the races, exposures and potential<br />

data-integrity concerns between direct and<br />

buffered I/O, which become especially tricky<br />

in the presence of AIO. Some of these issues<br />

motivated Andrew Morton’s modified page-writeback<br />

design for the VFS using tagged<br />

radix-tree lookups, and we discuss the implications<br />

for the AIO O_SYNC write implementation.<br />

In general, disk-based filesystem AIO requirements<br />

for database workloads have been a<br />

guiding consideration in resolving some of the<br />

trade-offs encountered, and we present some<br />

initial performance results for such workloads.<br />

Lastly, we touch upon potential approaches to<br />

allow processing of disk-based AIO and communications<br />

I/O within a single event loop.<br />

3 Streaming AIO reads<br />

3.1 Basic retry pattern for single AIO read<br />

<strong>The</strong> retry-based design for buffered filesystem<br />

AIO read works by converting each blocking<br />

wait for read completion on a page into a retry<br />

exit. <strong>The</strong> design queues an asynchronous notification<br />

callback and returns the number of<br />

bytes for which the read has completed so far<br />

without blocking. <strong>The</strong>n, when the page becomes<br />

up-to-date, the callback kicks off a retry<br />

continuation in task context. This retry continuation<br />

invokes the same filesystem read operation<br />

again using the caller’s address space, but<br />

this time with arguments modified to reflect the<br />

remaining part of the read request.<br />

For example, given a 16KB read request starting<br />

at offset 0, where the first 4KB is already<br />

in cache, one might see the following sequence<br />

of retries (in the absence of readahead):



first time:<br />

fop->aio_read(fd, 0, 16384) = 4096<br />

and when read completes for the second page:<br />

fop->aio_read(fd, 4096, 12288) = 4096<br />

and when read completes for the third page:<br />

fop->aio_read(fd, 8192, 8192) = 4096<br />

and when read completes for the fourth page:<br />

fop->aio_read(fd, 12288, 4096) = 4096<br />
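This per-page progression can be modeled with a small toy function (illustrative only, not kernel code):<br />

```c
#define PAGE_SZ 4096

/* Toy model of one retry iteration: complete as many contiguous
   up-to-date 4KB pages as possible, then return the byte count
   reached before the first would-be blocking point.  The retry
   infrastructure re-issues the call with its arguments advanced
   past the bytes already done. */
static long model_aio_read(const int *page_uptodate, long off, long len)
{
    long done = 0;
    while (done < len && page_uptodate[(off + done) / PAGE_SZ])
        done += PAGE_SZ;
    return done > len ? len : done;
}
```

Replaying the trace above: with only the first page cached the call completes 4096 bytes, and each subsequent retry completes the next page as it becomes up to date.<br />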

3.2 Impact of readahead on single AIO read<br />

Usually, however, the readahead logic attempts<br />

to batch read requests in advance. Hence, more<br />

I/O would be seen to have completed at each<br />

retry. <strong>The</strong> logic attempts to predict the optimal<br />

readahead window based on state it maintains<br />

about the sequentiality of past read requests on<br />

the same file descriptor. Thus, given a maximum<br />

readahead window size of 128KB, the sequence<br />

of retries would appear to be more like<br />

the following example, which results in significantly<br />

improved throughput:<br />

first time:<br />

fop->aio_read(fd, 0, 16384) = 4096,<br />

after issuing readahead<br />

for 128KB/2 = 64KB<br />

and when read completes for the above I/O:<br />

fop->aio_read(fd, 4096, 12288) = 12288<br />

Notice that care is taken to ensure that readaheads<br />

are not repeated during retries.<br />

3.3 Impact of readahead on streaming AIO<br />

reads<br />

In the case of streaming AIO reads, a sequence<br />

of AIO read requests is issued on the same<br />

file descriptor, where subsequent reads are submitted<br />

without waiting for previous requests to<br />

complete (contrast this with a sequence of synchronous<br />

reads).<br />

Interestingly, we encountered a significant<br />

throughput degradation as a result of the interplay<br />

of readahead and streaming AIO reads.<br />

To see why, consider the retry sequence for<br />

streaming random AIO read requests of 16KB,<br />

where o1, o2, o3, ... refer to the random<br />

offsets where these reads are issued:<br />

first time:<br />

fop->aio_read(fd, o1, 16384) = -EIOCBRETRY,<br />

after issuing readahead for 64KB<br />

as the readahead logic sees the first page<br />

of the read<br />

fop->aio_read(fd, o2, 16384) = -EIOCBRETRY,<br />

after issuing readahead for 8KB (notice<br />

the shrinkage of the readahead window<br />

because of non-sequentiality seen by the<br />

readahead logic)<br />

fop->aio_read(fd, o3, 16384) = -EIOCBRETRY,<br />

after maximally shrinking the readahead<br />

window, turning off readahead and issuing<br />

4KB read in the slow path<br />

fop->aio_read(fd, o4, 16384) = -EIOCBRETRY,<br />

after issuing 4KB read in the slow path<br />

.<br />

.<br />

and when read completes for o1<br />

fop->aio_read(fd, o1, 16384) = 16384<br />

and when read completes for o2<br />

fop->aio_read(fd, o2, 16384) = 8192<br />

and when read completes for o3<br />

fop->aio_read(fd, o3, 16384) = 4096<br />

and when read completes for o4<br />

fop->aio_read(fd, o3, 16384) = 4096<br />

.<br />

.<br />

In steady state, this amounts to a maximally shrunk<br />

readahead window with 4KB reads at<br />

random offsets being issued serially one at a<br />

time on a slow path, causing seek storms and<br />

driving throughputs down severely.<br />

3.4 Upfront readahead for improved streaming<br />

AIO read throughputs<br />

To address this issue, we made the readahead<br />

logic aware of the sequentiality of all pages in a<br />

single read request upfront—before submitting<br />

the next read request. This resulted in a more<br />

desirable outcome as follows:<br />

fop->aio_read(fd, o1, 16384) = -EIOCBRETRY,<br />

after issuing readahead for 64KB<br />

as the readahead logic sees all the 4<br />

pages for the read<br />

fop->aio_read(fd, o2, 16384) = -EIOCBRETRY,<br />

after issuing readahead for 20KB, as the<br />

readahead logic sees all 4 pages of the<br />

read (the readahead window shrinks to<br />

4+1=5 pages)



fop->aio_read(fd, o3, 16384) = -EIOCBRETRY,<br />

after issuing readahead for 20KB, as the<br />

readahead logic sees all 4 pages of the<br />

read (the readahead window is maintained<br />

at 4+1=5 pages)<br />

.<br />

.<br />

and when read completes for o1<br />

fop->aio_read(fd, o1, 16384) = 16384<br />

and when read completes for o2<br />

fop->aio_read(fd, o2, 16384) = 16384<br />

and when read completes for o3<br />

fop->aio_read(fd, o3, 16384) = 16384<br />

.<br />

.<br />

3.5 Upfront readahead and sendfile regressions<br />

At first sight it appears that upfront readahead<br />

is a reasonable change for all situations, since<br />

it immediately passes to the readahead logic<br />

the entire size of the request. However, it has<br />

the unintended, potential side-effect of losing<br />

pipelining benefits for really large reads, or operations<br />

like sendfile which involve post-processing<br />

I/O on the contents just read. <strong>One</strong> way<br />

to address this is to clip the maximum size<br />

of upfront readahead to the maximum readahead<br />

setting for the device. To see why even<br />

that may not suffice for certain situations, let<br />

us take a look at the following sequence for<br />

a webserver that uses non-blocking sendfile to<br />

serve a large (2GB) file.<br />

sendfile(fd, 0, 2GB, fd2) = 8192,<br />

tells readahead about up to 128KB<br />

of the read<br />

sendfile(fd, 8192, 2GB - 8192, fd2) = 8192,<br />

tells readahead about 8KB - 132KB<br />

of the read<br />

sendfile(fd, 16384, 2GB - 16384, fd2) = 8192,<br />

tells readahead about 16KB-140KB<br />

of the read<br />

...<br />

This confuses the readahead logic about the<br />

I/O pattern which appears to be 0–128K, 8K–<br />

132K, 16K–140K instead of the clear sequentiality<br />

from 0–2GB that is actually present.<br />

To avoid such unanticipated issues, upfront<br />

readahead required a special case for AIO<br />

alone, limited to the maximum readahead setting<br />

for the device.<br />

3.6 Streaming AIO read microbenchmark<br />

comparisons<br />

We explored streaming AIO throughput improvements<br />

with the retry-based AIO implementation<br />

and optimizations discussed above,<br />

using a custom microbenchmark called aio-stress<br />

[2]. aio-stress issues a stream of AIO<br />

requests to one or more files, where one can<br />

vary several parameters including I/O unit size,<br />

total I/O size, depth of iocbs submitted at a<br />

time, number of concurrent threads, and type<br />

and pattern of I/O operations, and reports the<br />

overall throughput attained.<br />

<strong>The</strong> hardware included a 4-way 700MHz<br />

Pentium ® III machine with 512MB of RAM<br />

and a 1MB L2 cache. <strong>The</strong> disk subsystem<br />

used for the I/O tests consisted of an Adaptec<br />

AIC7896/97 Ultra2 SCSI controller connected<br />

to a disk enclosure with six 9GB disks, one<br />

of which was configured as an ext3 filesystem<br />

with a block size of 4KB for testing.<br />

<strong>The</strong> runs compared aio-stress throughputs for<br />

streaming random buffered I/O reads (i.e.,<br />

without O_DIRECT), with and without the<br />

previously described changes. All the runs<br />

were for the case where the file was not already<br />

cached in memory. Figure 1<br />

summarizes how the results varied across individual<br />

request sizes of 4KB to 64KB, where<br />

I/O was targeted to a single file of size 1GB,<br />

the depth of iocbs outstanding at a time being<br />

64. A third run was performed to find out<br />

how the results compared with equivalent runs<br />

using AIO-DIO.<br />

With the changes applied, the results showed<br />

an approximate 2x improvement across all<br />

block sizes, bringing throughputs to levels that<br />

match the corresponding results using AIO-<br />

DIO.



[Figure: “Streaming AIO read results with aio-stress”: aio-stress<br />

throughput (MB/s) vs. request size (KB), comparing FSAIO (non-cached)<br />

2.6.2 Vanilla, FSAIO (non-cached) 2.6.2 Patched, and AIO-DIO<br />

2.6.2 Vanilla.]<br />

Figure 1: Comparisons of streaming random<br />

AIO read throughputs<br />

4 AIO DIO vs cached I/O integrity<br />

issues<br />

4.1 DIO vs buffered races<br />

Stephen Tweedie discovered several races between<br />

DIO and buffered I/O to the same file<br />

[3]. <strong>The</strong>se races could lead to potential stale-data<br />

exposures and even data-integrity issues.<br />

Most instances were related to situations when<br />

in-core meta-data updates were visible before<br />

actual instantiation or resetting of corresponding<br />

data blocks on disk. Problems could also<br />

arise when meta-data updates were not visible<br />

to other code paths that could simultaneously<br />

update meta-data as well. <strong>The</strong> races mainly affected<br />

sparse files, due to the lack of atomicity<br />

between the file flush in the DIO paths and actual<br />

data block accesses.<br />

<strong>The</strong> solution that Stephen Tweedie came<br />

up with, and which Badari Pulavarty ported<br />

to <strong>Linux</strong> 2.6, involved protecting block<br />

lookups and meta-data updates with the inode<br />

semaphore (i_sem) in DIO paths for both read<br />

and write, atomically with the file flush. Overwriting<br />

of sparse blocks in the DIO write path<br />

was modified to fall back to buffered writes.<br />

Finally, an additional semaphore<br />

(i_alloc_sem) was introduced to lock out deallocation<br />

of blocks by a truncate while DIO was in<br />

progress. This semaphore is held in shared<br />

mode by DIO and in exclusive mode by truncate.<br />

Note that supporting the new locking rules (i.e.,<br />

lock ordering of i_sem first and then<br />

i_alloc_sem) while allowing for filesystem-specific<br />

implementations of the DIO and file-write<br />

interfaces required some care.<br />

4.2 AIO-DIO specific races<br />

<strong>The</strong> inclusion of AIO in <strong>Linux</strong> 2.6 added some<br />

tricky scenarios to the above-described problems<br />

because of the potential races inherent in<br />

returning without waiting for I/O completion.<br />

<strong>The</strong> interplay of AIO-DIO writes and truncate<br />

was a particular worry as it could lead to corruption<br />

of file data; for example, blocks could<br />

get deallocated and reallocated to a new file<br />

while an AIO-DIO write to the file was still in<br />

progress. To avoid this, AIO-DIO had to return<br />

with i_alloc_sem held, and only release it<br />

as part of I/O completion post-processing. Notice<br />

that this also had implications for AIO cancellation.<br />

File size updates for AIO-DIO file extends<br />

could expose unwritten blocks if they happened<br />

before I/O completed asynchronously.<br />

<strong>The</strong> case involving fallback to buffered I/O<br />

was particularly non-trivial if a single request<br />

spanned allocated and sparse regions of a<br />

file. Specifically, part of the I/O could have<br />

been initiated via DIO then continued asynchronously,<br />

while the fallback to buffered I/O<br />

occurred and signaled I/O completion to the<br />

application. <strong>The</strong> application may thus have<br />

reused its I/O buffer, overwriting it with other<br />

data and potentially causing file data corruption<br />

if writeout to disk had still been pending.<br />

It might appear that some of these problems


68 • <strong>Linux</strong> Symposium 2004 • Volume <strong>One</strong><br />

could be avoided if I/O schedulers guaranteed<br />

the ordering of I/O requests issued to the same<br />

disk block. However, this isn’t a simple proposition<br />

in the current architecture, especially in<br />

generalizing the design to all possible cases,<br />

including network block devices. <strong>The</strong> use of<br />

I/O barriers would be necessary and the costs<br />

may not be justified for these special-case situations.<br />

Instead, a pragmatic approach was taken to address this, based on the assumption that truly asynchronous behaviour is meaningful in practice mainly when performing I/O to already-allocated file blocks. For<br />

example, databases typically preallocate files<br />

at the time of creation, so that AIO writes<br />

during normal operation and in performance-critical<br />

paths do not extend the file or encounter<br />

sparse regions. Thus, for the sake of correctness,<br />

synchronous behaviour may be tolerable<br />

for AIO writes involving sparse regions or file<br />

extends. This compromise simplified the handling<br />

of the scenarios described earlier. AIO-<br />

DIO file extends now wait for I/O to complete<br />

and update the file size. AIO-DIO writes spanning<br />

allocated and sparse regions now wait for<br />

previously-issued DIO for that request to complete<br />

before falling back to buffered I/O.<br />

5 Concurrent I/O with synchronized<br />

write guarantees<br />

An application opts for synchronized writes<br />

(by using the O_SYNC option on file open)<br />

when the I/O must be committed to disk before<br />

the write request completes. In the case<br />

of DIO, writes directly go to disk anyway. For<br />

buffered I/O, data is first copied into the page<br />

cache and later written out to disk; if synchronized<br />

I/O is specified then the request returns<br />

only after the writeout is complete.<br />

An application might also choose to synchronize<br />

previously-issued writes to disk by invoking<br />

fsync(), which writes back data from the<br />

page cache to disk and waits for writeout to<br />

complete before returning.<br />

5.1 Concurrent DIO writes<br />

DIO writes formerly held the inode semaphore<br />

in exclusive mode until write completion. This<br />

helped ensure atomicity of DIO writes and<br />

protected against potential file data corruption<br />

races with truncate. However, it also meant that<br />

multiple threads or processes submitting parallel<br />

DIOs to different parts of the same file<br />

effectively became serialized synchronously.<br />

If the same behaviour were extended to AIO<br />

(i.e., having the i_sem held through I/O completion<br />

for AIO-DIO writes), it would significantly<br />

degrade throughput of streaming AIO<br />

writes as subsequent write submissions would<br />

block until completion of the previous request.<br />

With the fixes described in the previous section,<br />

such synchronous serialization is avoidable<br />

without loss of correctness, as the inode<br />

semaphore needs to be held only when looking<br />

up the blocks to write, and not while actual I/O<br />

is in progress on the data blocks. This could allow concurrent DIO writes on different parts of a file to proceed simultaneously, enabling efficient throughput for streaming AIO-DIO writes.<br />

5.2 Concurrent O_SYNC buffered writes<br />

In the original writeback design in the <strong>Linux</strong><br />

VFS, per-address space lists were maintained<br />

for dirty pages and pages under writeback for<br />

a given file. Synchronized write was implemented<br />

by traversing these lists to issue writeouts<br />

for the dirty pages and waiting for writeback<br />

to complete on the pages on the writeback<br />

list. <strong>The</strong> inode semaphore had to be held all<br />

through to avoid possibilities of livelocking on<br />

these lists as further writes streamed into the<br />

same file. While this helped maintain atomicity


<strong>Linux</strong> Symposium 2004 • Volume <strong>One</strong> • 69<br />

of writes, it meant that parallel O_SYNC writes<br />

to different parts of the file were effectively<br />

serialized synchronously. Further, dependence<br />

on i_sem-protected state in the address space<br />

lists across I/O waits made it difficult to retryenable<br />

this code path for AIO support.<br />

In order to allow concurrent O_SYNC writes to<br />

be active on a file, the range of pages to be<br />

written back and waited on could instead be<br />

obtained directly through a radix-tree lookup<br />

for the range of offsets in the file that was being<br />

written out by the request [4]. This would<br />

avoid traversal of the page lists and hence the<br />

need to hold i_sem across the I/O waits. Such<br />

an approach would also make it possible to<br />

complete O_SYNC writes as a sequence of nonblocking<br />

retry iterations across the range of<br />

bytes in a given request.<br />

5.3 Data-integrity guarantees<br />

Background writeout threads cannot block on the inode semaphore like O_SYNC/fsync writers. Hence, with the per-address space lists writeback model, some juggling involving movement across multiple lists was required to avoid livelocks. The implementation had to make sure that pages which by chance got picked up for processing by background writeouts didn't slip from consideration when waiting for writeback to complete for a synchronized write request. The latter was particularly relevant for ensuring the synchronized-write guarantees that affect data integrity for applications. However, as Daniel McNeil's analysis indicated [5], getting this right required the writeback code to write and wait upon I/O and dirty pages which were initiated by other processes, and that turned out to be fairly tricky.<br />

One solution that was explored was per-address space serialization of writeback, to ensure exclusivity to synchronous writers and shared mode for background writers. It involved navigating issues with busy-waits in background writers, and the code was beginning to get complicated and potentially fragile. This was one of the problems that finally prompted Andrew Morton to change the entire VFS writeback code to use radix-tree walks instead of the per-address space pagelists. The main advantage was that avoiding the need for movement across lists during state changes (e.g., when re-dirtying a page if its buffers were locked for I/O by another process) reduced the chances of pages getting missed from consideration, without the added serialization of entire writebacks.<br />

6 Tagged radix-tree based writeback<br />

For the radix-tree walk writeback design to perform<br />

as well as the address space lists-based<br />

approach, an efficient way to get to the pages<br />

of interest in the radix trees is required. This<br />

is especially so when there are many pages in<br />

the pagecache but only a few are dirty or under<br />

writeback. Andrew Morton solved this problem<br />

by implementing tagged radix-tree lookup<br />

support to enable lookup of dirty or writeback<br />

pages in O(log64(n)) time [6].<br />

This was achieved by adding tag bits for each<br />

slot to each radix-tree node. If a slot is tagged, then the corresponding slots in all the nodes above it in the tree are also tagged. Thus,<br />

to search for a particular tag, one would keep<br />

going down sub-trees under slots which have<br />

the tag bit set until the tagged leaf nodes are<br />

accessed. A tagged gang lookup function is<br />

used for in-order searches for dirty or writeback<br />

pages within a specified range. <strong>The</strong>se<br />

lookups are used to replace the per-addressspace<br />

page lists altogether.



To synchronize writes to disk, a tagged radixtree<br />

gang lookup of dirty pages in the byterange<br />

corresponding to the write request is performed<br />

and the resulting pages are written out.<br />

Next, pages under writeback in the byte-range<br />

are obtained through a tagged radix-tree gang<br />

lookup of writeback pages, and we wait for<br />

writeback to complete on these pages (without<br />

having to hold the inode semaphore across the<br />

waits). Observe how this logic lends itself to be<br />

broken up into a series of non-blocking retry iterations<br />

proceeding in-order through the range.<br />

<strong>The</strong> same logic can also be used for a whole<br />

file sync, by specifying a byte-range that spans<br />

the entire file.<br />

Background writers also use tagged radix-tree<br />

gang lookups of dirty pages. Instead of always<br />

scanning a file from its first dirty page, the index<br />

where the last batch of writeout terminated<br />

is tracked so the next batch of writeouts can be<br />

started after that point.<br />

7 Streaming AIO writes<br />

<strong>The</strong> tagged radix-tree walk writeback approach<br />

greatly simplifies the design of AIO support for<br />

synchronized writes, as mentioned in the previous<br />

section.<br />

7.1 Basic retry pattern for synchronized AIO<br />

writes<br />

<strong>The</strong> retry-based design for buffered AIO O_<br />

SYNC writes works by converting each blocking<br />

wait for writeback completion of a page<br />

into a retry exit. <strong>The</strong> conversion point queues<br />

an asynchronous notification callback and returns<br />

to the caller of the filesystem’s AIO<br />

write operation the number of bytes for which<br />

writeback has completed so far without blocking.<br />

<strong>The</strong>n, when writeback completes for that<br />

page, the callback kicks off a retry continuation<br />

in task context which invokes the same AIO<br />

write operation again using the caller’s address<br />

space, but this time with arguments modified to<br />

reflect the remaining part of the write request.<br />

As writeouts for the range would have already<br />

been issued the first time before the loop to<br />

wait for writeback completion, the implementation<br />

takes care not to re-dirty pages or reissue<br />

writeouts during subsequent retries of<br />

AIO write. Instead, when the code detects that<br />

it is being called in a retry context, it simply<br />

falls through directly to the step involving wait-on-writeback<br />

for the remaining range as specified<br />

by the modified arguments.<br />

7.2 Filtered waitqueues to avoid retry storms<br />

with hashed wait queues<br />

Code that is in a retry-exit path (i.e., the return<br />

path following a blocking point where a retry is<br />

queued) should in general take care not to call<br />

routines that could wakeup the newly-queued<br />

retry.<br />

<strong>One</strong> thing that we had to watch for was calls<br />

to unlock_page() in the retry-exit path.<br />

This could cause a redundant wakeup if an<br />

async wait-on-page writeback was just queued<br />

for that page. <strong>The</strong> redundant wakeup would<br />

arise if the kernel used the same waitqueue<br />

on unlock as well as writeback completion for<br />

a page, with the expectation that the waiter<br />

would check for the condition it was waiting<br />

for and go back to sleep if it hadn’t occurred. In<br />

the AIO case, however, a wakeup of the newly-queued<br />

callback in the same code path could<br />

potentially trigger a retry storm, as retries kept<br />

triggering themselves over and over again for<br />

the wrong condition.<br />

<strong>The</strong> interplay of unlock_page() and<br />

wait_on_page_writeback() with<br />

hashed waitqueues can get quite tricky for<br />

retries. For example, consider what happens<br />

when the following sequence in retryable code<br />

is executed at the same time for 2 pages, px



and py, which happen to hash to the same<br />

waitqueue (Table 1).<br />

lock_page(p)<br />

check condition and process<br />

unlock_page(p)<br />

if (wait_on_page_writeback_wq(p)<br />

== -EIOCBQUEUED)<br />

return bytes_done<br />

<strong>The</strong> above code could keep cycling between<br />

spurious retries on px and py until I/O is done,<br />

wasting precious CPU time!<br />

If we can ensure specificity of the wakeup with<br />

hashed waitqueues then this problem can be<br />

avoided. William Lee Irwin’s implementation<br />

of filtered wakeup support in the recent <strong>Linux</strong><br />

2.6 kernels [7] achieves just that. <strong>The</strong> wakeup<br />

routine specifies a key to match before invoking<br />

the wakeup function for an entry in the<br />

waitqueue, thereby limiting wakeups to those<br />

entries which have a matching key. For page<br />

waitqueues, the key is computed as a function<br />

of the page and the condition (unlock or writeback<br />

completion) for the wakeup.<br />

7.3 Streaming AIO write microbenchmark<br />

comparisons<br />

<strong>The</strong> following graph compares aio-stress<br />

throughputs for streaming random buffered<br />

I/O O_SYNC writes, with and without the<br />

previously-described changes. <strong>The</strong> comparison<br />

was performed on the same setup used for<br />

the streaming AIO read results discussed earlier.<br />

<strong>The</strong> graph summarizes how the results varied<br />

across individual request sizes of 4KB to<br />

64KB, where I/O was targeted to a single file<br />

of size 1GB and the depth of iocbs outstanding<br />

at a time was 64KB. A third run was performed<br />

to determine how the results compared<br />

with equivalent runs using AIO-DIO.<br />

With the changes applied, the results showed an approximate 2x improvement across all block sizes, bringing throughputs to levels that match the corresponding results using AIO-DIO.<br />

[Figure 2: Comparisons of streaming random AIO write throughputs with aio-stress (throughput in MB/s vs. request size in KB; series: FSAIO 2.6.2 Vanilla, FSAIO 2.6.2 Patched, AIO-DIO 2.6.2 Vanilla)]<br />

8 AIO performance analysis for<br />

database workloads<br />

Large database systems leveraging AIO can<br />

show marked performance improvements compared<br />

to those systems that use synchronous<br />

I/O alone. We use IBM® DB2® Universal<br />

Database V8 running an online transaction<br />

processing (OLTP) workload to illustrate the<br />

performance improvement of AIO on raw devices<br />

and on filesystems.<br />

8.1 DB2 page cleaners<br />

A DB2 page cleaner is a process responsible<br />

for flushing dirty buffer pool pages to disk.<br />

It simulates AIO by executing asynchronously<br />

with respect to the agent processes. <strong>The</strong> number<br />

of page cleaners and their behavior can be<br />

tuned according to the demands of the system.<br />

<strong>The</strong> agents, freed from cleaning pages themselves,<br />

can dedicate their resources (e.g., processor<br />

cycles) towards processing transactions,<br />

thereby improving throughput.



CPU1                                    CPU2<br />

lock_page(px)<br />

...                                     lock_page(py)<br />

unlock_page(px)                         ...<br />

wait_on_page_writeback_wq(px)           unlock_page(py) -> wakes up px,<br />

                                        triggering a spurious retry<br />

Table 1: Retry storm livelock with redundant wakeups on hashed wait queues<br />

8.2 AIO performance analysis for raw devices<br />

Two experiments were conducted to measure<br />

the performance benefits of AIO on raw devices<br />

for an update-intensive OLTP database<br />

workload. <strong>The</strong> workload used was derived<br />

from a TPC[8] benchmark, but is in no way<br />

comparable to any TPC results. For the first experiment,<br />

the database was configured with one<br />

page cleaner using the native <strong>Linux</strong> AIO interface.<br />

For the second experiment, the database<br />

was configured with 55 page cleaners all using<br />

the synchronous I/O interface. <strong>The</strong>se experiments<br />

showed that a database, properly configured<br />

in terms of the number of page cleaners<br />

with AIO, can out-perform a properly configured<br />

database using synchronous I/O page<br />

cleaning.<br />

For both experiments, the system configuration<br />

consisted of DB2 V8 running on a 2-way AMD<br />

Opteron system with <strong>Linux</strong> 2.6.1 installed. <strong>The</strong><br />

disk subsystem consisted of two FAStT 700<br />

storage servers, each with eight disk enclosures.<br />

<strong>The</strong> disks were configured as RAID-0<br />

arrays with a stripe size of 256KB.<br />

Table 2 shows the relative database performance<br />

with and without AIO. Higher numbers<br />

are better. <strong>The</strong> results show that the database<br />

performed 9% better when configured with one<br />

page cleaner using AIO, than when it was<br />

configured with 55 page cleaners using synchronous<br />

I/O.<br />

Configuration                     Relative Throughput<br />

1 page cleaner with AIO           133<br />

55 page cleaners without AIO      122<br />

Table 2: Database performance with and without AIO.<br />

Analyzing the I/O write patterns (see Table 3),<br />

we see that one page cleaner using AIO was<br />

sufficient to keep the buffer pools clean under<br />

a very heavy load, but that 55 page cleaners<br />

using synchronous I/O were not, as indicated<br />

by the 30% agent writes. This data<br />

suggests that more page cleaners should have<br />

been configured to improve the performance of<br />

the case with synchronous I/O. However, additional<br />

page cleaners consumed more memory,<br />

requiring a reduction in bufferpool size<br />

and thereby decreasing throughput. For the<br />

test configuration, 55 cleaners was the optimal<br />

number before memory constraints arose.<br />

8.3 AIO performance analysis for filesystems<br />

This section examines the performance improvements<br />

of AIO when used in conjunction<br />

with filesystems. This experiment was per-



Configuration                     Page cleaner writes (%)    Agent writes (%)<br />

1 page cleaner with AIO           100                        0<br />

55 page cleaners without AIO      70                         30<br />

Table 3: DB2 write patterns for raw device configurations.<br />

formed using the same OLTP benchmark as in<br />

the previous section.<br />

<strong>The</strong> test system consisted of two 1GHz AMD<br />

Opteron processors, 4GB of RAM and two<br />

QLogic 2310 FC controllers. Attached to the<br />

server was a single FAStT900 storage server<br />

and two disk enclosures with a total of 28 15K<br />

RPM 18GB drives. <strong>The</strong> <strong>Linux</strong> kernel used<br />

for the examination was 2.6.0+mm1, which includes<br />

the AIO filesystem support patches [9]<br />

discussed in this paper.<br />

<strong>The</strong> database tables were spread across multiple<br />

ext2 filesystem partitions. Database logs<br />

were stored on a single raw partition.<br />

Three separate tests were performed, utilizing<br />

different I/O methods for the database page<br />

cleaners.<br />

Test 1. Synchronous (Buffered) I/O.<br />

Test 2. Asynchronous (Buffered) I/O.<br />

Test 3. Direct I/O.<br />

<strong>The</strong> results are shown in Table 4 as relative<br />

commercial processing scores using synchronous<br />

I/O as the baseline (i.e., higher is better).<br />

Looking at the efficiency of the page cleaners<br />

(see Table 5), we see that the use of AIO<br />

is more successful in keeping the buffer pools<br />

clean. In the synchronous I/O and DIO cases,<br />

the agents needed to spend more time cleaning<br />

Configuration          Commercial Processing Scores<br />

Synchronous I/O        100<br />

AIO (Buffered)         113.7<br />

DIO                    111.9<br />

Table 4: Database performance on filesystems with and without AIO.<br />

buffer pool pages, resulting in less time processing<br />

transactions.<br />

Configuration          Page cleaner writes (%)    Agent writes (%)<br />

Synchronous I/O        37                         63<br />

AIO (buffered)         100                        0<br />

DIO                    49                         51<br />

Table 5: DB2 write patterns for filesystem configurations.<br />

8.4 Optimizing AIO for database workloads<br />

Databases typically use AIO for streaming<br />

batches of random, synchronized write requests<br />

to disk (where the writes are directed<br />

to preallocated disk blocks). This has been<br />

found to improve the performance of OLTP<br />

workloads, as it helps bring down the number<br />

of dedicated threads or processes needed<br />

for flushing updated pages, and results in reduced<br />

memory footprint and better CPU utilization<br />

and scaling.<br />

<strong>The</strong> size of individual write requests is determined<br />

by the page size used by the database.<br />

For example, a DB2 UDB installation might<br />

use a database page size of 8KB.<br />

As observed in previous sections, the use of<br />

AIO helps reduce the number of database page<br />

cleaner processes required to keep the bufferpool<br />

clean. To keep the disk queues maximally<br />

utilized and limit contention, it may be preferable<br />

to have requests to a given disk streamed<br />

out from a single page cleaner. Typically a<br />

set of of disks could be serviced by each page



cleaner if and when multiple page cleaners<br />

need to be used.<br />

Databases might also use AIO for reads, for example,<br />

for prefetching data to service queries.<br />

This usually helps improve the performance of<br />

decision support workloads. <strong>The</strong> I/O pattern<br />

generated in these cases is that of streaming<br />

batches of large AIO reads, with sizes typically<br />

determined by the file allocation extent size<br />

used by the database (e.g., a DB2 installation<br />

might use a database extent size of 256KB).<br />

For installations using buffered AIO reads, tuning<br />

the readahead setting for the corresponding<br />

devices to be more than the extent size would<br />

help improve performance of streaming AIO<br />

reads (recall the discussion in Section 3.5).<br />

9 Addressing AIO workloads involving<br />

both disk and communications<br />

I/O<br />

Certain applications need to handle both disk-based<br />

AIO and communications I/O. For communications<br />

I/O, the epoll interface—which<br />

provides support for efficient scalable event<br />

polling in <strong>Linux</strong> 2.6—could be used as appropriate,<br />

possibly in conjunction with O_<br />

NONBLOCK socket I/O. Disk-based AIO on<br />

the other hand, uses the native AIO API io_<br />

getevents for completion notification. This<br />

makes it difficult to combine both types of I/O<br />

processing within a single event loop, even<br />

when such a model is a natural way to program<br />

the application, as in implementations of the<br />

application on other operating systems.<br />

How do we address this issue? <strong>One</strong> option is to<br />

extend epoll to enable it to poll for notification<br />

of AIO completion events, so that AIO completion<br />

status can then be reaped in a non-blocking<br />

manner. This involves mixing both epoll and<br />

AIO API programming models, which is not<br />

ideal.<br />

9.1 AIO poll interface<br />

Another alternative is to add support for<br />

polling an event on a given file descriptor<br />

through the AIO interfaces. This function, referred<br />

to as AIO poll, can be issued through<br />

io_submit() just like other AIO operations,<br />

and specifies the file descriptor and<br />

the eventset to wait for. When the event<br />

occurs, notification is reported through io_<br />

getevents().<br />

<strong>The</strong> retry-based design of AIO poll works by<br />

converting the blocking wait for the event into<br />

a retry exit.<br />

<strong>The</strong> generic synchronous polling code fits<br />

nicely into the AIO retry design, so most of the<br />

original polling code can be used unchanged.<br />

<strong>The</strong> private data area of the iocb can be used<br />

to hold polling-specific data structures, and a<br />

few special cases can be added to the generic<br />

polling entry points. This allows the AIO poll<br />

case to proceed without additional memory allocations.<br />

9.2 AIO operations for communications I/O<br />

A third option is to add support for AIO operations<br />

for communications I/O. For example,<br />

AIO support for pipes has been implemented<br />

by converting the blocking wait for<br />

I/O on pipes to a retry exit. <strong>The</strong> generic pipe<br />

code was also structured such that conversion<br />

to AIO retries was quite simple; the only significant<br />

change was using the current io_wait<br />

context instead of a locally defined waitqueue,<br />

and returning early if no data was available.<br />

However, AIO pipe testing did show significantly<br />

more context switches than the 2.4 AIO<br />

pipe implementation, and this was coupled<br />

with much lower performance. <strong>The</strong> AIO core<br />

functions were relying on workqueues to do<br />

most of the retries, and this resulted in constant



switching between the workqueue threads and<br />

user processes.<br />

<strong>The</strong> solution was to change the AIO core<br />

to do retries in io_submit() and in io_<br />

getevents(). This allowed the process to<br />

do some of its own work while it is scheduled<br />

in. Also, retries were switched to a delayed<br />

workqueue, so that bursts of retries would trigger<br />

fewer context switches.<br />

While delayed wakeups helped with pipe<br />

workloads, they also caused I/O stalls in filesystem<br />

AIO workloads. This was because a delayed<br />

wakeup was being used even when a user<br />

process was waiting in io_getevents().<br />

When user processes are actively waiting for<br />

events, it proved best to trigger the worker<br />

thread immediately.<br />

General AIO support for network operations<br />

has been considered but not implemented so far<br />

because of lack of supporting study that predicts<br />

a significant benefit over what epoll and<br />

non-blocking I/O can provide, except for the<br />

scope for enabling potential zero-copy implementations.<br />

This is a potential area for future<br />

research.<br />

10 Conclusions<br />

Our experience over the last year with AIO development,<br />

stabilization and performance improvements<br />

brought us to design and implementation<br />

issues that went far beyond the initial<br />

concern of converting key I/O blocking<br />

points to be asynchronous.<br />

AIO uncovered scenarios and I/O patterns that<br />

were unlikely or less significant with synchronous<br />

I/O alone. For example, the issues we<br />

discussed around streaming AIO performance<br />

with readahead and concurrent synchronized<br />

writes, as well as DIO vs buffered I/O complexities<br />

in the presence of AIO. In retrospect,<br />

this was the hardest part of supporting AIO—<br />

modifying code that was originally designed<br />

only for synchronous I/O.<br />

Interestingly, this also meant that AIO appeared<br />

to magnify some problems early. For<br />

example, issues with hashed waitqueues that<br />

led to the filtered wakeup patches, and readahead<br />

window collapses with large random<br />

reads which precipitated improvements to the<br />

readahead code from Ramachandra Pai. Ultimately,<br />

many of the core improvements that<br />

helped AIO have had positive benefits in allowing<br />

improved concurrency for some of the<br />

synchronous I/O paths.<br />

In terms of benchmarking and optimizing<br />

<strong>Linux</strong> AIO performance, there is room for<br />

more exhaustive work. Requirements for AIO<br />

fsync support are currently under consideration.<br />

<strong>The</strong>re is also a need for more widely used<br />

AIO applications, especially those that take advantage<br />

of AIO support for buffered I/O or<br />

bring out additional requirements like network<br />

I/O beyond epoll or AIO poll. Finally, investigations<br />

into API changes to help enable more<br />

efficient POSIX AIO implementations based<br />

on kernel AIO support may be a worthwhile<br />

endeavor.<br />

11 Acknowledgements<br />

We would like to thank the many people<br />

on the linux-aio@kvack.org and<br />

linux-kernel@vger.kernel.org<br />

mailing lists who provided us with valuable<br />

comments and suggestions during our<br />

development efforts.<br />

We would especially like to acknowledge the<br />

important contributions of Andrew Morton,<br />

Daniel McNeil, Badari Pulavarty, Stephen<br />

Tweedie, and William Lee Irwin towards several<br />

pieces of work discussed in this paper.



This paper and the work it describes wouldn’t<br />
have been possible without the efforts of Janet<br />
Morgan in many different ways, from<br />
review, testing, and debugging feedback to burning<br />
the midnight oil with us on modifications<br />
and improvements to the text during the<br />
final stages of the paper.<br />

We also thank Brian Twitchell, Steve Pratt,<br />

Gerrit Huizenga, Wayne Young, and John<br />

Lumby from IBM for their help and discussions<br />

along the way.<br />

This work was a part of the <strong>Linux</strong> Scalability<br />

Effort (LSE) on SourceForge, and further<br />

information about <strong>Linux</strong> 2.6 AIO is available<br />

at the LSE AIO web page [10]. All the external<br />
AIO patches, including AIO support for<br />
buffered filesystem I/O, AIO poll, and AIO support<br />
for pipes, are available at [9].<br />

12 Legal Statement<br />

This work represents the view of the authors and<br />

does not necessarily represent the view of IBM.<br />

IBM, DB2 and DB2 Universal Database are registered<br />

trademarks of International Business Machines<br />

Corporation in the United States and/or other<br />

countries.<br />

<strong>Linux</strong> is a registered trademark of Linus Torvalds.<br />

Pentium is a trademark of Intel Corporation in the<br />

United States, other countries, or both.<br />

Other company, product, and service names may be<br />

trademarks or service marks of others.<br />

13 Disclaimer<br />

<strong>The</strong> benchmarks discussed in this paper were conducted<br />

for research purposes only, under laboratory<br />

conditions. Results will not be realized in all computing<br />

environments.<br />

References<br />

[1] Suparna Bhattacharya, Badari Pulavarty, Steven Pratt, and Janet Morgan. Asynchronous I/O support for Linux 2.5. In Proceedings of the Linux Symposium, Ottawa, July 2003. http://archive.linuxsymposium.org/ols2003/Proceedings/All-Reprints/Reprint-Pulavarty-OLS2003.pdf<br />

[2] Chris Mason. aio-stress microbenchmark. ftp://ftp.suse.com/pub/people/mason/utils/aio-stress.c<br />

[3] Stephen C. Tweedie. Posting on DIO races in 2.4. http://marc.theaimsgroup.com/?l=linux-fsdevel&m=105597840711609&w=2<br />

[4] Andrew Morton. O_SYNC speedup patch. http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.0/2.6.0-mm1/broken-out/O_SYNC-speedup-2.patch<br />

[5] Daniel McNeil. Posting on synchronized writeback races. http://marc.theaimsgroup.com/?l=linux-aio&m=107671729611002&w=2<br />

[6] Andrew Morton. Posting on in-order tagged radix tree walk based VFS writeback. http://marc.theaimsgroup.com/?l=bk-commits-head&m=108184544016117&w=2<br />

[7] William Lee Irwin. Filtered wakeup patch. http://marc.theaimsgroup.com/?l=bk-commits-head&m=108459430513660&w=2<br />

[8] Transaction Processing Performance Council. http://www.tpc.org<br />

[9] Suparna Bhattacharya (with contributions from Andrew Morton and Chris Mason). Additional 2.6 Linux Kernel Asynchronous I/O patches. http://www.kernel.org/pub/linux/kernel/people/suparna/aio<br />

[10] LSE team. Kernel Asynchronous I/O (AIO) Support for Linux. http://lse.sf.net/io/aio.html<br />




Methods to Improve Bootup Time in <strong>Linux</strong><br />

Tim R. Bird<br />

Sony Electronics<br />

tim.bird@am.sony.com<br />

Abstract<br />

This paper presents several techniques for reducing<br />

the bootup time of the <strong>Linux</strong> kernel, including<br />

Execute-In-Place (XIP), avoidance of<br />

calibrate_delay(), and reduced probing<br />

by certain drivers and subsystems. Using<br />

a variety of techniques, the <strong>Linux</strong> kernel can<br />

be booted on embedded hardware in under 500<br />

milliseconds. Current efforts and future directions<br />

of work to improve bootup time are described.<br />

1 Introduction<br />

Users of consumer electronics products expect<br />
their devices to be available for use very soon<br />
after being turned on. Configurations of Linux<br />
for desktop and server markets exhibit boot<br />
times in the range of 20 seconds to a few minutes,<br />
which is unacceptable for many consumer<br />
products.<br />

No single item is responsible for overall poor<br />
boot time performance. Therefore a number<br />
of techniques must be employed to reduce the<br />
boot up time of a Linux system. This paper<br />
presents several techniques which have been<br />
found to be useful for embedded configurations<br />
of Linux.<br />

2 Overview of Boot Process<br />

The entire boot process of Linux can be<br />
roughly divided into 3 main areas: firmware,<br />
kernel, and user space. The following is a list<br />
of events during a typical boot sequence:<br />

1. power on<br />
2. firmware (bootloader) starts<br />
3. kernel decompression starts<br />
4. kernel start<br />
5. user space start<br />
6. RC script start<br />
7. application start<br />
8. first available use<br />

This paper focuses on techniques for reducing<br />
the bootup time up until the start of user space.<br />
That is, techniques are described which reduce<br />
the firmware time and the kernel start time.<br />
This includes activities through the completion<br />
of event 4 in the list above.<br />

<strong>The</strong> actual kernel execution begins with<br />

the routine start_kernel(), in the file<br />

init/main.c.<br />

An overview of major steps in the initialization<br />

sequence of the kernel is as follows:



• start_kernel()<br />
  – init architecture<br />
  – init interrupts<br />
  – init memory<br />
  – start idle thread<br />
  – call rest_init()<br />
    * start ‘init’ kernel thread<br />

<strong>The</strong> init kernel thread performs a few<br />

other tasks, then calls do_basic_setup(),<br />

which calls do_initcalls(), to run<br />

through the array of initialization routines for<br />

drivers statically linked in the kernel. Finally,<br />

this thread switches to user space by execveing<br />

to the first user space program, usually<br />

/sbin/init.<br />

• init (kernel thread)<br />
  – call do_basic_setup()<br />
    * call do_initcalls()<br />
      · init buses and drivers<br />
  – prepare and mount root filesystem<br />
  – call run_init_process()<br />
    * call execve() to start user space process<br />

3 Typical Desktop Boot Time<br />

<strong>The</strong> boot times for a typical desktop system<br />

were measured and the results are presented<br />

below, to give an indication of the major areas<br />

in the kernel where time is spent. While the<br />

numbers in these tests differ somewhat from<br />

those for a typical embedded system, it is useful<br />

to see these to get an idea of where some of<br />

the trouble spots are for kernel booting.<br />

3.1 System<br />

An HP XW4100 <strong>Linux</strong> workstation system<br />

was used for these tests, with the following<br />

characteristics:<br />

• Pentium 4 HT processor, running at 3GHz<br />

• 512 MB RAM<br />

• Western Digital 40G hard drive on hda<br />

• Generic CDROM drive on hdc<br />

3.2 Measurement method<br />

<strong>The</strong> kernel used was 2.6.6, with the KFI patch<br />

applied. KFI stands for “<strong>Kernel</strong> Function Instrumentation”.<br />

This is an in-kernel system<br />

to measure the duration of each function executed<br />

during a particular profiling run. It<br />

uses the -finstrument-functions option<br />

of gcc to instrument kernel functions<br />

with callouts on each function entry and exit.<br />

This code was authored by developers at MontaVista<br />

Software, and a patch for 2.6.6 is available,<br />

although the code is not ready (as of the<br />

time of this writing) for general publication.<br />

Information about KFI and the patch are available<br />

at:<br />

http://tree.celinuxforum.org/pubwiki/moin.cgi/KernelFunctionInstrumentation<br />

3.3 Key delays<br />

<strong>The</strong> average time for kernel startup of the test<br />

system was about 7 seconds. This was the<br />

amount of time for just the kernel and NOT the<br />

firmware or user space. It corresponds to the<br />

period of time between events 4 and 5 in the<br />

boot sequence listed in Section 2.



Some key delays were found in the kernel<br />

startup on the test system. Table 1 shows<br />

some of the key routines where time was spent<br />

during bootup. <strong>The</strong>se are the low-level routines<br />

where significant time was spent inside<br />

the functions themselves, rather than in subroutines<br />

called by the functions.<br />

Kernel Function        No. of calls   Avg. call time   Total time<br />
delay_tsc                      5153              1           5537<br />
default_idle                    312              1            325<br />
get_cmos_time                     1            500            500<br />
psmouse_sendbyte                 44            2.4            109<br />
pci_bios_find_device             25            1.7             44<br />
atkbd_sendbyte                    7            3.7             26<br />
calibrate_delay                   1             24             24<br />
Note: Times are in milliseconds.<br />
Table 1: Functions consuming lots of time during a typical desktop Linux kernel startup.<br />

Note that over 80% of the total time of the<br />

bootup (almost 6 seconds out of 7) was spent<br />

busywaiting in delay_tsc() or spinning in<br />

the routine default_idle(). It appears<br />

that great reductions in total bootup time could<br />

be achieved if these delays could be reduced,<br />

or if it were possible to run some initialization<br />

tasks concurrently.<br />

Another interesting point is that the routine<br />

get_cmos_time() was extremely variable<br />

in the length of time it took. Measurements<br />

of its duration ranged from under 100 milliseconds<br />

to almost one second. This routine, and<br />

methods to avoid this delay and variability, are<br />

discussed in section 9.<br />

3.4 High-level delay areas<br />

Since delay_tsc() is used (via various<br />

delay mechanisms) for busywaiting by a<br />

number of different subsystems, it is helpful to<br />

identify the higher-level routines which end up<br />

invoking this function.<br />

Table 2 shows some high-level routines called<br />

during kernel initialization, and the amount of<br />

time they took to complete on the test machine.<br />

Duration times marked with a tilde denote<br />

functions which were highly variable in<br />

duration.<br />

Kernel Function        Duration time<br />
ide_init                    3327<br />
time_init                   ~500<br />
isapnp_init                  383<br />
i8042_init                   139<br />
prepare_namespace            ~50<br />
calibrate_delay               24<br />
Note: Times are in milliseconds.<br />
Table 2: High-level delays during a typical startup.<br />

For a few of these, it is interesting to examine<br />

the call sequences underneath the high-level<br />

routines. This shows the connection between<br />

the high-level routines that are taking a long<br />

time to complete and the functions where the<br />

time is actually being spent.<br />

Figures 1 and 2 show some call sequences for<br />

high-level calls which take a long time to complete.<br />

In each call tree, the number in parentheses is<br />

the number of times that the routine was called<br />

by the parent in this chain. Indentation shows<br />

the call nesting level.<br />

For example, in Figure 1, do_probe() is<br />

called a total of 31 times by probe_hwif(),<br />

and it calls ide_delay_50ms() 78 times<br />
and try_to_identify() 8 times.<br />

<strong>The</strong> timing data for the test system showed<br />

that IDE initialization was a significant contributor<br />

to overall bootup time. <strong>The</strong> call sequence<br />

underneath ide_init() shows that<br />

a large number of calls are made to the routine<br />

ide_delay_50ms(), which in turn calls



ide_init-><br />
  probe_for_hwifs(1)-><br />
    ide_scan_pcibus(1)-><br />
      ide_scan_pci_dev(2)-><br />
        piix_init_one(2)-><br />
          init_setup_piix(2)-><br />
            ide_setup_pci_device(2)-><br />
              probe_hwif_init(2)-><br />
                probe_hwif(4)-><br />
                  do_probe(31)-><br />
                    ide_delay_50ms(78)-><br />
                      __const_udelay(3900)-><br />
                        __delay(3900)-><br />
                          delay_tsc(3900)<br />
                    try_to_identify(8)-><br />
                      actual_try_to_identify(8)-><br />
                        ide_delay_50ms(24)-><br />
                          __const_udelay(1200)-><br />
                            __delay(1200)-><br />
                              delay_tsc(1200)<br />

Figure 1: IDE init call tree<br />

isapnp_init-><br />
  isapnp_isolate(1)-><br />
    isapnp_isolate_rdp_select(1)-><br />
      __const_udelay(25)-><br />
        __delay(25)-><br />
          delay_tsc(25)<br />
    isapnp_key(18)-><br />
      __const_udelay(18)-><br />
        __delay(18)-><br />
          delay_tsc(18)<br />

Figure 2: ISAPnP init call tree<br />

__const_udelay() very many times. <strong>The</strong><br />

busywaits in ide_delay_50ms() alone accounted<br />

for over 5 seconds, or about 70% of<br />

the total boot up time.<br />

Another significant area of delay was the initialization<br />

of the ISAPnP system. This took<br />

about 380 milliseconds on the test machine.<br />

Both the mouse and the keyboard drivers used<br />

crude busywaits to wait for acknowledgements<br />

from their respective hardware.<br />

Finally, the routine calibrate_delay()<br />

took about 25 milliseconds to run, to compute<br />

the value of loops_per_jiffy and print<br />

the (related) BogoMIPS value for the machine.<br />

<strong>The</strong> remaining sections of this paper discuss<br />

various specific methods for reducing bootup<br />

time for embedded and desktop systems. Some<br />

of these methods are directly related to some of<br />

the delay areas identified in this test configuration.<br />

4 <strong>Kernel</strong> Execute-In-Place<br />

A typical sequence of events during bootup is<br />

for the bootloader to load a compressed kernel<br />

image from either disk or Flash, placing it into<br />

RAM. <strong>The</strong> kernel is decompressed, either during<br />

or just after the copy operation. <strong>The</strong>n the<br />

kernel is executed by jumping to the function<br />

start_kernel().<br />

<strong>Kernel</strong> Execute-In-Place (XIP) is a mechanism<br />

where the kernel instructions are executed directly<br />

from ROM or Flash.<br />

In a kernel XIP configuration, the step of copying<br />

the kernel code segment into RAM is omitted,<br />

as well as any decompression step. Instead,<br />

the kernel image is stored uncompressed<br />

in ROM or Flash. <strong>The</strong> kernel data segments<br />

still need to be initialized in RAM, but by eliminating<br />

the text segment copy and decompression,<br />

the overall effect is a reduction in the time<br />

required for the firmware phase of the bootup.<br />

Table 3 shows the differences in time duration<br />

for various parts of the boot stage for a system<br />

booted with and without use of kernel XIP.<br />

<strong>The</strong> times in the table are shown in milliseconds.<br />

<strong>The</strong> table shows that using XIP in this<br />

configuration significantly reduced the time to<br />

copy the kernel to RAM (because only the data<br />

segments were copied), and completely eliminated<br />

the time to decompress the kernel (453<br />

milliseconds). However, the kernel initialization<br />

time increased slightly in the XIP configuration,<br />

for a net savings of 463 milliseconds.<br />

In order to support an Execute-In-Place con-



Boot Stage               Non-XIP time   XIP time<br />
Copy kernel to RAM                 85         12<br />
Decompress kernel                 453          0<br />
Kernel initialization             819        882<br />
Total kernel boot time           1357        894<br />
Note: Times are in milliseconds. Results are for a PowerPC 405LP at 266 MHz.<br />
Table 3: Comparison of Non-XIP vs. XIP bootup times<br />

figuration, the kernel must be compiled and<br />

linked so that the code is ready to be executed<br />

from a fixed memory location. <strong>The</strong>re<br />

are examples of XIP configurations for ARM,<br />

MIPS and SH platforms in the CE<strong>Linux</strong><br />

source tree, available at:<br />
http://tree.celinuxforum.org/<br />

4.1 XIP Design Tradeoffs<br />

<strong>The</strong>re are tradeoffs involved in the use of XIP.<br />

First, it is common for access times to flash<br />

memory to be greater than access times to<br />

RAM. Thus, a kernel executing from Flash<br />

usually runs a bit slower than a kernel executing<br />

from RAM. Table 4 shows some of the results<br />

from running the lmbench benchmark<br />

on a system, with the kernel executing in a standard<br />

non-XIP configuration versus an XIP configuration.<br />

Operation                          Non-XIP     XIP<br />
stat() syscall                        22.4    25.6<br />
fork a process                        4718    7106<br />
context switching for 16               932    1109<br />
  processes and 64k data size<br />
pipe communication                     248     548<br />
Note: Times are in microseconds. Results are for the lmbench benchmark run on an OMAP 1510 (ARM9 at 168 MHz) processor.<br />
Table 4: Comparison of Non-XIP and XIP performance<br />

Some of the operations in the benchmark took<br />

significantly longer with the kernel run in the<br />

XIP configuration. Most individual operations<br />

took about 20% to 30% longer. This performance<br />

penalty is incurred continuously while<br />

the kernel is running, and thus is a serious<br />

drawback to the use of XIP for reducing bootup<br />

time.<br />

A second tradeoff with kernel XIP is between<br />

the sizes of various types of memory in the<br />

system. In the XIP configuration the kernel<br />

must be stored uncompressed, so the amount<br />

of Flash required for the kernel increases, and<br />

is usually about doubled, versus a compressed<br />

kernel image used with a non-XIP configuration.<br />

However, the amount of RAM required<br />

for the kernel is decreased, since the kernel<br />

code segment is never copied to RAM. <strong>The</strong>refore,<br />

kernel XIP is also of interest for reducing<br />

the runtime RAM footprint for <strong>Linux</strong> in embedded<br />

systems.<br />

<strong>The</strong>re is additional research under way to investigate<br />

ways of reducing the performance<br />

impact of using XIP. <strong>One</strong> promising technique<br />

appears to be the use of “partial-XIP,” where a<br />

highly active subset of the kernel is loaded into<br />

RAM, but the majority of the kernel is executed<br />

in place from Flash.<br />

5 Delay Calibration Avoidance<br />

<strong>One</strong> time-consuming operation inside the kernel<br />

is the process of calibrating the value used<br />

for delay loops. <strong>One</strong> of the first routines in<br />

the kernel, calibrate_delay(), executes<br />

a series of delays in order to determine the correct<br />

value for a variable called loops_per_jiffy,<br />
which is then subsequently used to execute<br />

short delays in the kernel.<br />

<strong>The</strong> cost of performing this calibration is, interestingly,<br />

independent of processor speed.<br />

Rather, it is dependent on the number of iterations<br />
required to perform the calibration, and<br />

the length of each iteration. Each iteration requires<br />

1 jiffy, which is the length of time defined<br />

by the HZ variable.<br />

In 2.4 versions of the <strong>Linux</strong> kernel, most platforms<br />

defined HZ as 100, which makes the<br />

length of a jiffy 10 milliseconds. A typical<br />

number of iterations for the calibration operation<br />

is 20 to 25, making the total time required<br />

for this operation about 250 milliseconds.<br />

In 2.6 versions of the <strong>Linux</strong> kernel, a few platforms<br />

(notably i386) have changed HZ to 1000,<br />

making the length of a jiffy 1 millisecond. On<br />

those platforms, the typical cost of this calibration<br />

operation has decreased to about 25 milliseconds.<br />

Thus, the benefit of eliminating this<br />

operation on most standard desktop systems<br />

has been reduced. However, for many embedded<br />

systems, HZ is still defined as 100, which<br />

makes bypassing the calibration useful.<br />

It is easy to eliminate the calibration operation.<br />

You can directly edit the code in<br />
init/main.c:calibrate_delay() to hardcode a value<br />
for loops_per_jiffy, and avoid the calibration<br />

entirely. Alternatively, there is a patch available at:<br />
http://tree.celinuxforum.org/pubwiki/moin.cgi/PresetLPJ<br />
This patch allows you to use a kernel configuration<br />
option to specify a value for loops_per_jiffy<br />
at kernel compile time. The patch also allows you to use<br />
a kernel command line argument to specify a preset<br />
value for loops_per_jiffy at kernel boot time.<br />

6 Avoiding Probing During Bootup<br />

Another technique for reducing bootup time is<br />

to avoid probing during bootup. As a general<br />

technique, this can consist of identifying hardware<br />

which is known not to be present on one’s<br />

machine, and making sure the kernel is compiled<br />

without the drivers for that hardware.<br />

In the specific case of IDE, the kernel supports<br />

options at the command line to allow the<br />

user to avoid performing probing for specific<br />

interfaces and devices. To do this, you can<br />

use the IDE and harddrive noprobe options<br />

at the kernel command line. Please see the<br />

file Documentation/ide.txt in the kernel<br />

source tree for details on the syntax of using<br />

these options.<br />

On the test machine, IDE noprobe options<br />

were used to reduce the amount of probing during<br />

startup. <strong>The</strong> test machine had only a hard<br />

drive on hda (ide0 interface, first device) and<br />

a CD-ROM drive on hdc (ide1 interface, first<br />

device).<br />

In one test, noprobe options were specified<br />

to suppress probing of non-used interfaces and<br />

devices. Specifically, the following arguments<br />

were added to the kernel command line:<br />

hdb=none hdd=none ide2=noprobe<br />

<strong>The</strong> kernel was booted and the result was<br />

that the function ide_delay_50ms() was<br />

called only 68 times, and delay_tsc() was<br />

called only 3453 times. During a regular<br />

kernel boot without these options specified,<br />

the function ide_delay_50ms() is called<br />

102 times, and delay_tsc() is called 5153<br />

times. Each call to delay_tsc() takes<br />

about 1 millisecond, so the total time savings<br />

from using these options was 1700 milliseconds.<br />

<strong>The</strong>se IDE noprobe options have been available<br />

at least since the 2.4 kernel series, and are<br />

an easy way to reduce bootup time, without<br />

even having to recompile the kernel.



7 Reducing Probing Delays<br />

As was noted on the test machine, IDE initialization<br />

takes a significant percentage of<br />

the total bootup time. Almost all of this<br />

time is spent busywaiting in the routine<br />
ide_delay_50ms().<br />

It is trivial to modify the value of the timeout<br />

used in this routine. As an experiment,<br />

this code (located in the file drivers/ide/<br />

ide.c) was modified to only delay 5 milliseconds<br />

instead of 50 milliseconds.<br />

<strong>The</strong> results of this change were interesting.<br />

When a kernel with this change was run on<br />

the test machine, the total time for the<br />
ide_init() routine dropped from 3327 milliseconds<br />

to 339 milliseconds. <strong>The</strong> total time spent<br />

in all invocations of ide_delay_50ms()<br />

was reduced from 5471 milliseconds to 552<br />

milliseconds. <strong>The</strong> overall bootup time was reduced<br />

accordingly, by about 5 seconds.<br />

<strong>The</strong> ide devices were successfully detected,<br />

and the devices operated without problem on<br />

the test machine. However, this configuration<br />

was not tested exhaustively.<br />

Reducing the duration of the delay in the<br />
ide_delay_50ms() routine provides a substantial<br />

reduction in the overall bootup time for the<br />

kernel on a typical desktop system. It also has<br />

potential use in embedded systems where<br />
PCI-based IDE drives are used.<br />

However, there are several issues with this<br />

modification that need to be resolved. This<br />

change may not support legacy hardware<br />

which requires long delays for proper probing<br />

and initializing. <strong>The</strong> kernel code needs to be<br />

analyzed to determine if any callers of this routine<br />

really need the 50 milliseconds of delay<br />

that they are requesting. It should also be determined<br />
whether this call is used only in initialization<br />
context or whether it is also used during regular<br />
runtime operation of IDE devices.<br />

Also, it may be that 5 milliseconds does not<br />

represent the lowest possible value for this delay.<br />

It is possible that this value will need to<br />

be tuned to match the hardware for a particular<br />

machine. This type of tuning may be acceptable<br />

in the embedded space, where the hardware<br />

configuration of a product may be fixed.<br />

But it may be too risky to use in desktop configurations<br />

of <strong>Linux</strong>, where the hardware is not<br />

known ahead of time.<br />

More experimentation, testing and validation<br />

are required before this technique should be<br />

used.<br />

IMPORTANT NOTE: You should probably not<br />

experiment with this modification on production<br />

hardware unless you have evaluated the<br />

risks.<br />

8 Using the “quiet” Option<br />

<strong>One</strong> non-obvious method to reduce overhead<br />

during booting is to use the quiet option on<br />

the kernel command line. This option changes<br />

the loglevel to 4, which suppresses the output<br />

of regular (non-emergency) printk messages.<br />

Even though the messages are not printed to<br />

the system console, they are still placed in the<br />

kernel printk buffer, and can be retrieved after<br />

bootup using the dmesg command.<br />

When embedded systems boot with a serial<br />

console, the speed of printing the characters<br />

to the console is constrained by the speed of<br />

the serial output. Also, depending on the<br />

driver, some VGA console operations (such as<br />

scrolling the screen) may be performed in software.<br />

For slow processors, this may take a significant<br />

amount of time. In either case, the cost<br />

of performing output of printk messages during<br />

bootup may be high. But it is easily eliminated<br />

using the quiet command line option.



Table 5 shows the difference in bootup time of<br />

using the quiet option and not, for two different<br />

systems (one with a serial console and<br />

one with a VGA console).<br />

9 RTC Read Synchronization<br />

<strong>One</strong> routine that potentially takes a long time<br />

during kernel startup is get_cmos_time().<br />

This routine is used to read the value of the external<br />

real-time clock (RTC) when the kernel<br />

boots. Currently, this routine delays until the<br />

edge of the next second rollover, in order to ensure<br />

that the time value in the kernel is accurate<br />

with respect to the RTC.<br />

However, this operation can take up to one full<br />

second to complete, and thus introduces up to<br />

1 second of variability in the total bootup time.<br />

For systems where the target bootup time is under<br />

1 second, this variability is unacceptable.<br />

<strong>The</strong> synchronization in this routine is easy<br />

to remove. It can be eliminated by removing<br />

the first two loops in the function<br />

get_cmos_time(), which is located in<br />

include/asm-i386/mach-default/<br />

mach_time.h for the i386 architecture. Similar<br />

routines are present in the kernel source<br />

tree for other architectures.<br />

When the synchronization is removed, the routine<br />

completes very quickly.<br />

<strong>One</strong> tradeoff in making this modification is that<br />

the time stored by the <strong>Linux</strong> kernel is no longer<br />

completely synchronized (to the boundary of a<br />

second) with the time in the machine’s realtime<br />

clock hardware. Some systems save the system<br />

time back out to the hardware clock on system<br />

shutdown. After numerous bootups and shutdowns,<br />

this lack of synchronization will cause<br />

the realtime clock value to drift from the correct<br />

time value.<br />

Since the loss of synchronization can be up<br />

to a second per boot cycle, this drift can be<br />

significant. However, for some embedded applications,<br />

this drift is unimportant. Also, in<br />

some situations the system time may be synchronized<br />

with an external source anyway, so<br />

the drift, if any, is corrected under normal circumstances<br />

soon after booting.<br />

10 User space Work<br />

<strong>The</strong>re are a number of techniques currently<br />

available or under development for user space<br />

bootup time reductions. <strong>The</strong>se techniques are<br />

(mostly) outside the scope of kernel development,<br />

but may provide additional benefits for<br />

reducing overall bootup time for <strong>Linux</strong> systems.<br />

Some of these techniques are mentioned briefly<br />

in this section.<br />

10.1 Application XIP<br />

<strong>One</strong> technique for improving application<br />

startup speed is application XIP, which is similar<br />

to the kernel XIP discussed in this paper.<br />

To support application XIP the kernel must be<br />

compiled with a file system where files can be<br />

stored linearly (where the blocks for a file are<br />

stored contiguously) and uncompressed. <strong>One</strong><br />

file system which supports this is CRAMFS,<br />

with the LINEAR option turned on. This is a<br />

read-only file system.<br />

With application XIP, when a program is executed,<br />

the kernel program loader maps the<br />

text segments for applications directly from the<br />

flash memory of the file system. This saves the<br />

time required to load these segments into system<br />

RAM.



Platform              Speed     Console type   w/o quiet option   with quiet option   difference<br />
SH-4 SH7751R          240 MHz   VGA                        637                 461          176<br />
OMAP 1510 (ARM 9)     168 MHz   serial                     551                 280          271<br />
Note: Times are in milliseconds.<br />
Table 5: Bootup time with and without the quiet option<br />

10.2 RC Script improvements<br />

Also, there are a number of projects which<br />

strive to decrease total bootup time of a system<br />

by parallelizing the execution of the system<br />

run-command scripts (“RC scripts”). <strong>The</strong>re is<br />

a list of resources for some of these projects at<br />

the following web site:<br />

http://tree.celinuxforum.org/pubwiki/moin.cgi/BootupTimeWorkingGroup<br />

Also, there has been some research conducted<br />

in reducing the overhead of running RC scripts.<br />

This consists of modifying the multi-function<br />

program busybox to reduce the number and<br />

cost of forks during RC script processing, and<br />

to optimize the usage of functions builtin to the<br />

busybox program. Initial testing has shown a<br />

reduction from about 8 seconds to 5 seconds<br />

for a particular set of Debian RC scripts on an<br />

OMAP 1510 (ARM 9) processor, running at<br />

168 MHz.<br />

11 Results<br />

By using some of the techniques mentioned<br />
in this paper, as well as additional techniques,<br />

Sony was able to boot a 2.4.20-based<br />

<strong>Linux</strong> system, from power on to user space display<br />

of a greeting image and sound playback,<br />

in 1.2 seconds. <strong>The</strong> time from power on to the<br />

end of kernel initialization (first user space instruction)<br />

in this configuration was about 110<br />

milliseconds. <strong>The</strong> processor was a TI OMAP<br />

1510 processor, with an ARM9-based core,<br />

running at 168 MHz.<br />

Some of the techniques used for reducing the<br />

bootup time of embedded systems can also be<br />

used for desktop or server systems. Often, it<br />

is possible, with rather simple and small modifications,<br />

to decrease the bootup time of the<br />

<strong>Linux</strong> kernel to only a few seconds. In the<br />

desktop configuration of <strong>Linux</strong> presented here,<br />

techniques from this paper were used to reduce<br />

the total bootup time from around 7 seconds<br />

to around 1 second. This was with no<br />

loss of functionality that the author could detect<br />

(with limited testing).<br />

12 Further Research<br />

As stated in the beginning of the paper, numerous<br />

techniques can be employed to reduce the<br />

overall bootup time of <strong>Linux</strong> systems. Further<br />

work continues or is needed in a number of areas.<br />

12.1 Concurrent Driver Init<br />

<strong>One</strong> area of additional research that seems<br />

promising is to structure driver initializations<br />

in the kernel so that they can proceed in parallel.<br />

For some items, like IDE initialization,<br />

there are large delays as buses and devices are<br />

probed and initialized. <strong>The</strong> time spent in such<br />

busywaits could potentially be used to perform<br />

other startup tasks, concurrently with the initializations waiting for hardware events to occur or time out.<br />

<strong>The</strong> big problem to be addressed with concurrent<br />

initialization is to identify what kernel<br />

startup activities can be allowed to occur<br />

in parallel. <strong>The</strong> kernel init sequence is already<br />

a carefully ordered sequence of events to make<br />

sure that critical startup dependencies are observed.<br />

Any system of concurrent driver initialization<br />

will have to provide a mechanism<br />

to guarantee sequencing of initialization tasks<br />

which have order dependencies.<br />

12.2 Partial XIP<br />

Another possible area of further investigation,<br />

which has already been mentioned, is<br />

“partial XIP,” whereby the kernel is executed<br />

mostly in-place. Prototype code already exists<br />

which demonstrates the mechanisms necessary<br />

to move a subset of an XIP-configured kernel<br />

into RAM, for faster code execution. <strong>The</strong> key<br />

to making partial kernel XIP useful will be to<br />

ensure correct identification (either statically or<br />

dynamically) of the sections of kernel code that<br />

need to be moved to RAM. Also, experimentation<br />

and testing need to be performed to determine<br />

the appropriate tradeoff between the size<br />

of the RAM-based portion of the kernel, and<br />

the effect on bootup time and system runtime<br />

performance.<br />

12.3 Pre-linking and Lazy Linking<br />

Finally, research is needed into reducing the time required to fix up links between programs and their shared libraries. Two systems that have been proposed and experimented with are pre-linking and lazy linking. Pre-linking involves fixing the location in virtual memory of the shared libraries for a system, and performing fixups on the programs of the system ahead of time. Lazy linking consists of only performing fixups on demand as library routines are called by a running program. Additional research is needed with both of these techniques to determine if they can provide benefit for current Linux systems.<br />

13 Credits<br />

This paper is the result of work performed by the Bootup Time Working Group of the CE Linux Forum (of which the author is Chair). I would like to thank developers at some of CELF's member companies, including Hitachi, Intel, Mitsubishi, MontaVista, Panasonic, and Sony, who contributed information or code used in writing this paper.<br />


<strong>Linux</strong> on NUMA Systems<br />

Martin J. Bligh<br />

mbligh@aracnet.com<br />

Matt Dobson<br />

colpatch@us.ibm.com<br />

Darren Hart<br />

dvhltc@us.ibm.com<br />

Gerrit Huizenga<br />

gh@us.ibm.com<br />

Abstract<br />

NUMA is becoming more widespread in the<br />

marketplace, used on many systems, small or<br />

large, particularly with the advent of AMD<br />

Opteron systems. This paper will cover a summary<br />

of the current state of NUMA, and future<br />

developments, encompassing the VM subsystem,<br />

scheduler, topology (CPU, memory, I/O<br />

layouts including complex non-uniform layouts),<br />

userspace interface APIs, and network<br />

and disk I/O locality. It will take a broad-based<br />

approach, focusing on the challenges of creating<br />

subsystems that work for all machines (including<br />

AMD64, PPC64, IA-32, IA-64, etc.),<br />

rather than just one architecture.<br />

1 What is a NUMA machine?<br />

NUMA stands for non-uniform memory architecture.<br />

Typically this means that not all memory<br />

is the same “distance” from each CPU in<br />

the system, but also applies to other features<br />

such as I/O buses. <strong>The</strong> word “distance” in this<br />

context is generally used to refer to both latency<br />

and bandwidth. Typically, NUMA machines<br />

can access any resource in the system,<br />

just at different speeds.<br />

NUMA systems are sometimes measured with<br />

a simple “NUMA factor” ratio of N:1—<br />

meaning that the latency for a cache miss memory<br />

read from remote memory is N times the latency<br />

for that from local memory (for NUMA<br />

machines, N > 1). Whilst such a simple descriptor<br />

is attractive, it can also be highly misleading,<br />

as it describes latency only, not bandwidth,<br />

on an uncontended bus (which is not<br />

particularly relevant or interesting), and takes<br />

no account of inter-node caches.<br />

<strong>The</strong> term node is normally used to describe a<br />

grouping of resources—e.g., CPUs, memory,<br />

and I/O. On some systems, a node may contain<br />

only some types of resources (e.g., only<br />

memory, or only CPUs, or only I/O); on others<br />

it may contain all of them. <strong>The</strong> interconnect<br />

between nodes may take many different<br />

forms, but can be expected to be higher latency<br />

than the connection within a node, and typically<br />

lower bandwidth.<br />

Programming for NUMA machines generally<br />

implies focusing on locality—the use of resources<br />

close to the device in question, and<br />

trying to reduce traffic between nodes; this<br />

type of programming generally results in better<br />

application throughput. On some machines<br />

with high-speed cross-node interconnects, better performance may be derived under certain<br />

workloads by “striping” accesses across multiple<br />

nodes, rather than just using local resources,<br />

in order to increase bandwidth. Whilst<br />

it is easy to demonstrate a benchmark that<br />

shows improvement via this method, it is difficult<br />

to be sure that the concept is generally<br />

beneficial (i.e., with the machine under full<br />

load).<br />

2 Why use a NUMA architecture to<br />

build a machine?<br />

<strong>The</strong> intuitive approach to building a large machine,<br />

with many processors and banks of<br />

memory, would be simply to scale up the typical<br />

2–4 processor machine with all resources<br />

attached to a shared system bus. However, restrictions<br />

of electronics and physics dictate that<br />

accesses slow as the length of the bus grows,<br />

and the bus is shared amongst more devices.<br />

Rather than accept this global slowdown for a<br />

larger machine, designers have chosen to instead<br />

give fast access to a limited set of local<br />

resources, and reserve the slower access times<br />

for remote resources.<br />

Historically, NUMA architectures have only<br />

been used for larger machines (more than 4<br />

CPUs), but the advantages of NUMA have<br />

been brought into the commodity marketplace<br />

with the advent of AMD’s x86-64, which has<br />

one CPU per node, and local memory for each<br />

processor. <strong>Linux</strong> supports NUMA machines<br />

of every size from 2 CPUs upwards (e.g., SGI<br />

have machines with 512 processors).<br />

It might help to envision the machine as a<br />

group of standard SMP machines, connected<br />

by a very fast interconnect somewhat like a network<br />

connection, except that the transfers over<br />

that bus are transparent to the operating system.<br />

Indeed, some earlier systems were built<br />

exactly like that; the older Sequent NUMA-Q hardware uses a standard 450NX 4-processor<br />

chipset, with an SCI interconnect plugged<br />

into the system bus of each node to unify them,<br />

and pass traffic between them. <strong>The</strong> complex<br />

part of the implementation is to ensure cache coherency<br />

across the interconnect, and such<br />

machines are often referred to as CC-NUMA<br />

(cache coherent NUMA). As accesses over the<br />

interconnect are transparent, it is possible to<br />

program such machines as if they were standard<br />

SMP machines (though the performance<br />

will be poor). Indeed, this is exactly how the<br />

NUMA-Q machines were first bootstrapped.<br />

Often, we are asked why people do not use<br />

clusters of smaller machines, instead of a large<br />

NUMA machine, as clusters are cheaper, simpler,<br />

and have a better price:performance ratio.<br />

Unfortunately, it makes the programming<br />

of applications much harder; all of the intercommunication<br />

and load balancing now has to<br />

be more explicit. Some large applications (e.g.,<br />

database servers) do not split up across multiple<br />

cluster nodes easily—in those situations,<br />

people often use NUMA machines. In addition,<br />

the interconnect for NUMA boxes is normally<br />

very low latency, and very high bandwidth,<br />

yielding excellent performance. <strong>The</strong><br />

management of a single NUMA machine is<br />

also simpler than that of a whole cluster with<br />

multiple copies of the OS.<br />

We could either have the operating system<br />

make decisions about how to deal with the architecture<br />

of the machine on behalf of the user<br />

processes, or give the userspace application an<br />

API to specify how such decisions are to be<br />

made. It might seem, at first, that the userspace<br />

application is in a better position to make such<br />

decisions, but this has two major disadvantages:<br />

1. Every application must be changed to support<br />

NUMA machines, and may need to



be revised when a new hardware platform<br />

is released.<br />

2. Applications are not in a good position<br />

to make global holistic decisions about<br />

machine resources, coordinate themselves<br />

with other applications, and balance decisions<br />

between them.<br />

Thus decisions on process, memory and I/O<br />

placement are normally best left to the operating<br />

system, perhaps with some hints from<br />

userspace about which applications group together,<br />

or will use particular resources heavily.<br />

Details of hardware layout are put in one place,<br />

in the operating system, and tuning and modification<br />

of the necessary algorithms are done<br />

once in that central location, instead of in every<br />

application. In some circumstances, the<br />

application or system administrator will want<br />

to override these decisions with explicit APIs,<br />

but this should be the exception, rather than the<br />

norm.<br />

3 <strong>Linux</strong> NUMA Memory Support<br />

In order to manage memory, <strong>Linux</strong> requires<br />

a page descriptor structure (struct page)<br />

for each physical page of memory present in<br />

the system. This consumes approximately 1%<br />

of the memory managed (assuming 4K page<br />

size), and the structures are grouped into an array<br />

called mem_map. For NUMA machines,<br />

there is a separate array for each node, called<br />

lmem_map. <strong>The</strong> mem_map and lmem_map<br />

arrays are simple contiguous data structures accessed<br />

in a linear fashion by their offset from<br />

the beginning of the node. This means that the<br />

memory controlled by them is assumed to be<br />

physically contiguous.<br />

NUMA memory support is enabled by<br />

CONFIG_DISCONTIGMEM and CONFIG_<br />

NUMA. A node descriptor called a struct<br />

pg_data_t is created for each node. Currently<br />

we do not support discontiguous memory<br />

within a node (though large gaps in the<br />

physical address space are acceptable between<br />

nodes). Thus we must still create page descriptor<br />

structures for “holes” in memory within a<br />

node (and then mark them invalid), which will<br />

waste memory (potentially a problem for large<br />

holes).<br />

Dave McCracken has picked up Daniel<br />

Phillips’ earlier work on a better data structure<br />

for holding the page descriptors, called<br />

CONFIG_NONLINEAR. This will allow the<br />

mapping of discontiguous memory ranges inside<br />

each node, and greatly simplify the existing<br />

code for discontiguous memory on non-<br />

NUMA machines.<br />

CONFIG_NONLINEAR solves the problem by<br />

creating an artificial layer of linear addresses.<br />

It does this by dividing the physical address<br />

space into fixed size sections (akin to very<br />

large pages), then allocating an array to allow<br />

translations from linear physical address to true<br />

physical address. This added level of indirection<br />

allows memory with widely differing true<br />

physical addresses to appear adjacent to the<br />

page allocator and to be in the same zone, with<br />

a single struct page array to describe them. It<br />

also provides support for memory hotplug by<br />

allowing new physical memory to be added to<br />

an existing zone and struct page array.<br />

<strong>Linux</strong> normally allocates memory for a process<br />

on the local node, i.e., the node that the process<br />

is currently running on. alloc_pages<br />

will call alloc_pages_node for the current<br />

processor’s node, which will pass the relevant<br />

zonelist (pgdat->node_zonelists)<br />

to the core allocator (__alloc_pages). <strong>The</strong><br />

zonelists are built by build_zonelists,<br />

and are set up to allocate memory in a roundrobin<br />

fashion, starting from the local node (this<br />

creates a roughly even distribution of memory



pressure).<br />

In the interest of reducing cross-node traffic,<br />

and reducing memory access latency for frequently<br />

accessed data and text, it is desirable<br />

to replicate any such memory that is read-only<br />

to each node, and use the local copy on any accesses,<br />

rather than a remote copy. <strong>The</strong> obvious<br />

candidates for such replication are the kernel<br />

text itself, and the text of shared libraries such<br />

as libc. Of course, this faster access comes<br />

at the price of increased memory usage, but<br />

this is rarely a problem on large NUMA machines.<br />

Whilst it might be technically possible<br />

to replicate read/write mappings, this is complex,<br />

of dubious utility, and is unlikely to be<br />

implemented.<br />

<strong>Kernel</strong> text is assumed by the kernel itself to<br />

appear at a fixed virtual address, and to change<br />

this would be problematic. Hence the easiest<br />

way to replicate it is to change the virtual to<br />

physical mappings for each node to point at a<br />

different address. On IA-64, this is easy, since<br />

the CPU provides hardware assistance in the<br />

form of a pinned TLB entry.<br />

On other architectures this proves more difficult,<br />

and would depend on the structure of the<br />

pagetables. On IA-32 with PAE enabled, as<br />

long as the user-kernel split is aligned on a<br />

PMD boundary, we can have a separate kernel<br />

PMD for each node, and point the vmalloc<br />

area (which uses small page mappings) back to<br />

a globally shared set of PTE pages. <strong>The</strong> PMD<br />

entries for the ZONE_NORMAL areas normally<br />

never change, so this is not an issue, though<br />

there is an issue with ioremap_nocache<br />

that can change them (GART trips over this)<br />

and speculative execution means that we will<br />

have to deal with that (this can be a slow-path<br />

that updates all copies of the PMDs though).<br />

Dave Hansen has created a patch to replicate<br />

read only pagecache data, by adding a per-node<br />

data structure to each node of the pagecache<br />

radix tree. As soon as any mapping is opened<br />

for write, the replication is collapsed, making<br />

it safe. <strong>The</strong> patch gives a 5%–40% increase in<br />

performance, depending on the workload.<br />

In the 2.6 <strong>Linux</strong> kernel, we have a per-node<br />

LRU for page management and a per-node<br />

LRU lock, in place of the global structures<br />

and locks of 2.4. Not only does this reduce<br />

contention through finer grained locking, it<br />

also means we do not have to search other<br />

nodes’ page lists to free up pages on one node<br />

which is under memory pressure. Moreover,<br />

we get much better locality, as only the local<br />

kswapd process is accessing that node’s<br />

pages. Before splitting the LRU into per-node<br />

lists, we were spending 50% of the system time<br />

during a kernel compile just spinning waiting<br />

for pagemap_lru_lock (which was the<br />

biggest global VM lock at the time). Contention<br />

for the pagemap_lru_lock is now<br />

so small it is not measurable.<br />

4 Sched Domains—a Topology-aware Scheduler<br />

<strong>The</strong> previous <strong>Linux</strong> scheduler, the O(1) scheduler,<br />

provided some needed improvements to<br />

the 2.4 scheduler, but shows its age as more<br />

complex system topologies become more and<br />

more common. With technologies such as<br />

NUMA, Symmetric Multi-Threading (SMT),<br />

and variations and combinations of these, the<br />

need for a more flexible mechanism to model<br />

system topology is evident.<br />

4.1 Overview<br />

In answer to this concern, the mainline 2.6<br />

tree (linux-2.6.7-rc1 at the time of this writing)<br />

contains an updated scheduler with support for<br />

generic CPU topologies with a data structure,<br />

struct sched_domain, that models the<br />

architecture and defines scheduling policies.



Simply speaking, sched domains group CPUs<br />

together in a hierarchy that mimics that of the<br />

physical hardware. Since CPUs at the bottom<br />

of the hierarchy are most closely related<br />

(in terms of memory access), the new scheduler<br />

performs load balancing most often at the<br />

lower domains, with decreasing frequency at<br />

each higher level.<br />

Consider the case of a machine with two SMT<br />

CPUs. Each CPU contains a pair of virtual<br />

CPU siblings which share a cache and the core<br />

processor. <strong>The</strong> machine itself has two physical<br />

CPUs which share main memory. In such<br />

a situation, treating each of the four effective<br />

CPUs the same would not result in the best<br />

possible performance. With only two tasks,<br />

for example, the scheduler should place one<br />

on CPU0 and one on CPU2, and not on the<br />

two virtual CPUs of the same physical CPU.<br />

When running several tasks it seems natural to<br />

try to place newly ready tasks on the CPU they<br />

last ran on (hoping to take advantage of cache<br />

warmth). However, virtual CPU siblings share<br />

a cache; a task that was running on CPU0,<br />

then blocked, and became ready when CPU0<br />

was running another task and CPU1 was idle,<br />

would ideally be placed on CPU1. Sched domains<br />

provide the structures needed to realize<br />

these sorts of policies. With sched domains,<br />

each physical CPU represents a domain containing<br />

the pair of virtual siblings, each represented<br />

in a sched_group structure. <strong>The</strong>se<br />

two domains both point to a parent domain<br />

which contains all four effective processors in<br />

two sched_group structures, each containing<br />

a pair of virtual siblings. Figure 1 illustrates<br />

this hierarchy.<br />

Next consider a two-node NUMA machine<br />

with two processors per node. In this example<br />

there are no virtual sibling CPUs, and therefore<br />

no shared caches. When a task becomes<br />

ready and the processor it last ran on is busy,<br />

Figure 1: SMT Domains<br />

the scheduler needs to consider waiting until that CPU is available to take advantage of<br />

cache warmth. If the only available CPU is<br />

on another node, the scheduler must carefully<br />

weigh the costs of migrating that task to another<br />

node, where access to its memory will<br />

be slower. <strong>The</strong> lowest level sched domains in<br />

a machine like this will contain the two processors<br />

of each node. <strong>The</strong>se two CPU level<br />

domains each point to a parent domain which<br />

contains the two nodes. Figure 2 illustrates this<br />

hierarchy.<br />

Figure 2: NUMA Domains<br />

<strong>The</strong> next logical step is to consider an SMT<br />

NUMA machine. By combining the previous<br />

two examples, the resulting sched domain hierarchy<br />

has three levels, sibling domains, physical<br />

CPU domains, and the node domain. Figure<br />

3 illustrates this hierarchy.<br />

<strong>The</strong> unique AMD Opteron architecture warrants<br />

mentioning here as it creates a NUMA<br />

system on a single physical board. In this case,<br />

Figure 3: SMT NUMA Domains<br />

however, each NUMA node contains only one physical CPU. Without careful consideration<br />

of this property, a typical NUMA sched domains<br />

hierarchy would perform badly, trying<br />

to load balance single CPU nodes often (an obvious<br />

waste of cycles) and between node domains<br />

only rarely (also bad since these actually<br />

represent the physical CPUs).<br />

4.2 Sched Domains Implementation<br />

4.2.1 Structure<br />

<strong>The</strong> sched_domain structure stores policy<br />

parameters and flags and, along with<br />

the sched_group structure, is the primary<br />

building block in the domain hierarchy. Figure<br />

4 describes these structures. <strong>The</strong> sched_<br />

domain structure is constructed into an upwardly<br />

traversable tree via the parent pointer,<br />

the top level domain setting parent to NULL.<br />

The groups list is a circular list of sched_<br />

group structures which essentially define the<br />

CPUs in each child domain and the relative<br />

power of that group of CPUs (two physical<br />

CPUs are more powerful than one SMT CPU).<br />

<strong>The</strong> span member is simply a bit vector with a<br />

1 for every CPU encompassed by that domain<br />

and is always the union of the bit vector stored<br />

in each element of the groups list. <strong>The</strong> remaining<br />

fields define the scheduling policy to be followed<br />

while dealing with that domain, see Section<br />

4.2.2.<br />

While the hierarchy may seem simple, the details<br />

of its construction and resulting tree structures<br />

are not. For performance reasons, the<br />

domain hierarchy is built on a per-CPU basis,<br />

meaning each CPU has a unique instance of<br />

each domain in the path from the base domain<br />

to the highest level domain. <strong>The</strong>se duplicate<br />

structures do share the sched_group structures<br />

however. <strong>The</strong> resulting tree is difficult to<br />

diagram, but resembles Figure 5 for the machine<br />

with two SMT CPUs discussed earlier.<br />

In accordance with common practice, each<br />

architecture may specify the construction of<br />

the sched domains hierarchy and the parameters<br />

and flags defining the various policies.<br />

At the time of this writing, only i386<br />

and ppc64 defined custom construction routines.<br />

Both architectures provide for SMT<br />

processors and NUMA configurations. Without<br />

an architecture-specific routine, the kernel<br />

uses the default implementations in sched.c,<br />

which do take NUMA into account.



struct sched_domain {<br />
&nbsp;&nbsp;&nbsp;&nbsp;/* These fields must be setup */<br />
&nbsp;&nbsp;&nbsp;&nbsp;struct sched_domain *parent; /* top domain must be null terminated */<br />
&nbsp;&nbsp;&nbsp;&nbsp;struct sched_group *groups; /* the balancing groups of the domain */<br />
&nbsp;&nbsp;&nbsp;&nbsp;cpumask_t span; /* span of all CPUs in this domain */<br />
&nbsp;&nbsp;&nbsp;&nbsp;unsigned long min_interval; /* Minimum balance interval ms */<br />
&nbsp;&nbsp;&nbsp;&nbsp;unsigned long max_interval; /* Maximum balance interval ms */<br />
&nbsp;&nbsp;&nbsp;&nbsp;unsigned int busy_factor; /* less balancing by factor if busy */<br />
&nbsp;&nbsp;&nbsp;&nbsp;unsigned int imbalance_pct; /* No balance until over watermark */<br />
&nbsp;&nbsp;&nbsp;&nbsp;unsigned long long cache_hot_time; /* Task considered cache hot (ns) */<br />
&nbsp;&nbsp;&nbsp;&nbsp;unsigned int cache_nice_tries; /* Leave cache hot tasks for # tries */<br />
&nbsp;&nbsp;&nbsp;&nbsp;unsigned int per_cpu_gain; /* CPU % gained by adding domain cpus */<br />
&nbsp;&nbsp;&nbsp;&nbsp;int flags; /* See SD_* */<br />

&nbsp;&nbsp;&nbsp;&nbsp;/* Runtime fields. */<br />
&nbsp;&nbsp;&nbsp;&nbsp;unsigned long last_balance; /* init to jiffies. units in jiffies */<br />
&nbsp;&nbsp;&nbsp;&nbsp;unsigned int balance_interval; /* initialise to 1. units in ms. */<br />
&nbsp;&nbsp;&nbsp;&nbsp;unsigned int nr_balance_failed; /* initialise to 0 */<br />
};<br />

struct sched_group {<br />
&nbsp;&nbsp;&nbsp;&nbsp;struct sched_group *next; /* Must be a circular list */<br />
&nbsp;&nbsp;&nbsp;&nbsp;cpumask_t cpumask;<br />
&nbsp;&nbsp;&nbsp;&nbsp;unsigned long cpu_power;<br />
};<br />

Figure 4: Sched Domains Structures<br />

4.2.2 Policy<br />

<strong>The</strong> new scheduler attempts to keep the system<br />

load as balanced as possible by running rebalance<br />

code when tasks change state or make<br />

specific system calls (we will call this event balancing), and at specified intervals measured in jiffies (called active balancing). Tasks must<br />

do something for event balancing to take place,<br />

while active balancing occurs independent of<br />

any task.<br />

Event balance policy is defined in each<br />

sched_domain structure by setting a combination<br />

of the #defines of figure 6 in the flags<br />

member.<br />

To define the policy outlined for the dual SMT<br />

processor machine in Section 4.1, the lowest<br />

level domains would set SD_BALANCE_<br />

NEWIDLE and SD_WAKE_IDLE (as there is<br />

no cache penalty for running on a different<br />

sibling within the same physical CPU),<br />

SD_SHARE_CPUPOWER to indicate to the<br />

scheduler that this is an SMT processor (the<br />

scheduler will give full physical CPU access<br />

to a high priority task by idling the<br />

virtual sibling CPU), and a few common<br />

flags SD_BALANCE_EXEC, SD_BALANCE_<br />

CLONE, and SD_WAKE_AFFINE. <strong>The</strong> next<br />

level domain represents the physical CPUs<br />

and will not set SD_WAKE_IDLE since cache<br />

warmth is a concern when balancing across<br />

physical CPUs, nor SD_SHARE_CPUPOWER.<br />

This domain adds the SD_WAKE_BALANCE<br />

flag to compensate for the removal of SD_<br />

WAKE_IDLE. As discussed earlier, an SMT<br />

NUMA system will have these two domains<br />

and another node-level domain. This domain<br />

removes the SD_BALANCE_NEWIDLE<br />

and SD_WAKE_AFFINE flags, resulting in<br />

far less balancing across nodes than within<br />

nodes. When one of these events occurs, the<br />

scheduler searches up the domain hierarchy and<br />

performs the load balancing at the highest level<br />

domain with the corresponding flag set.<br />

Active balancing is fairly straightforward and<br />

aids in preventing CPU-hungry tasks from hogging<br />

a processor, since these tasks may only


96 • <strong>Linux</strong> Symposium 2004 • Volume <strong>One</strong><br />

rarely trigger event balancing. At each rebalance tick, the scheduler starts at the lowest level domain and works its way up, checking the balance_interval and last_balance fields to determine if that domain should be balanced. If the domain is already busy, the balance_interval is adjusted using the busy_factor field. Other fields define how out of balance a node must be before rebalancing can occur, as well as some sane limits on cache hot time and minimum and maximum balancing intervals. As with the flags for event balancing, the active balancing parameters are defined to perform less balancing at higher domains in the hierarchy.<br />

#define SD_BALANCE_NEWIDLE 1 /* Balance when about to become idle */<br />
#define SD_BALANCE_EXEC 2 /* Balance on exec */<br />
#define SD_BALANCE_CLONE 4 /* Balance on clone */<br />
#define SD_WAKE_IDLE 8 /* Wake to idle CPU on task wakeup */<br />
#define SD_WAKE_AFFINE 16 /* Wake task to waking CPU */<br />
#define SD_WAKE_BALANCE 32 /* Perform balancing at task wakeup */<br />
#define SD_SHARE_CPUPOWER 64 /* Domain members share cpu power */<br />

Figure 6: Sched Domains Policies<br />

Figure 5: Per CPU Domains<br />

4.3 Conclusions and Future Work<br />

Figure 7: Kernbench Results<br />

To compare the O(1) scheduler of mainline<br />

with the sched domains implementation in the<br />

mm tree, we ran kernbench (with make&rsquo;s -j option set to 8, 16, and 32) on a 16 CPU SMT<br />

machine (32 virtual CPUs) on linux-2.6.6 and<br />

linux-2.6.6-mm3 (the latest tree with sched domains<br />

at the time of the benchmark) with and<br />

without CONFIG_SCHED_SMT enabled. <strong>The</strong><br />

results are displayed in Figure 7. <strong>The</strong> O(1)<br />

scheduler evenly distributed compile tasks across<br />

virtual CPUs, forcing tasks to share cache<br />

and computational units between virtual sibling<br />

CPUs. <strong>The</strong> sched domains implementation<br />

with CONFIG_SCHED_SMT enabled balanced<br />

the load across physical CPUs, making<br />

far better use of CPU resources when running<br />

fewer tasks than CPUs (as in the j8 case) since<br />

each compile task would have exclusive access<br />

to the physical CPU. Surprisingly, sched domains<br />

(which would seem to have more overhead<br />

than the mainline scheduler) even showed<br />

improvement for the j32 case, where it doesn’t



benefit from balancing across physical CPUs<br />

before virtual CPUs as there are more tasks<br />

than virtual CPUs. Considering the sched domains<br />

implementation has not been heavily<br />

tested or tweaked for performance, some fine<br />

tuning is sure to further improve performance.<br />

<strong>The</strong> sched domains structures replace the expanding<br />

set of #ifdefs of the O(1) scheduler,<br />

which should improve readability and<br />

maintainability. Unfortunately, the per CPU<br />

nature of the domain construction results in a<br />

non-intuitive structure that is difficult to work<br />

with. For example, it is natural to discuss the<br />

policy defined at “the” top level domain; unfortunately<br />

there are NR_CPUS top level domains<br />

and, since they are self-adjusting, each<br />

one could conceivably have a different set of<br />

flags and parameters. Depending on which<br />

CPU the scheduler was running on, it could behave<br />

radically differently. As an extension of<br />

this research, an effort to analyze the impact of<br />

a unified sched domains hierarchy is needed,<br />

one which only creates one instance of each<br />

domain.<br />

Sched domains provides a needed structural<br />

change to the way the <strong>Linux</strong> scheduler views<br />

modern architectures, and provides the parameters<br />

needed to create complex scheduling<br />

policies that cater to the strengths and weaknesses<br />

of these systems. Currently only i386<br />

and ppc64 machines benefit from arch specific<br />

construction routines; others must now step<br />

forward and fill in the construction and parameter<br />

setting routines for their architecture of<br />

choice. <strong>The</strong>re is still plenty of fine tuning and<br />

performance tweaking to be done.<br />

5 NUMA API<br />

5.1 Introduction<br />

<strong>One</strong> of the biggest impediments to the acceptance<br />

of a NUMA API for <strong>Linux</strong> was a<br />

lack of understanding of what its potential uses<br />

and users would be. <strong>The</strong>re are two schools<br />

of thought when it comes to writing NUMA<br />

code. <strong>One</strong> says that the OS should take care<br />

of all the NUMA details, hide the NUMAness<br />

of the underlying hardware in the kernel<br />

and allow userspace applications to pretend<br />

that it’s a regular SMP machine. <strong>Linux</strong><br />

does this by having a process scheduler and<br />

a VMM that make intelligent decisions based<br />

on the hardware topology presented by arch-specific code. <strong>The</strong> other way to handle NUMA<br />

programming is to provide as much detail as<br />

possible about the system to userspace and<br />

allow applications to exploit the hardware to<br />

the fullest by giving scheduling hints, memory<br />

placement directives, etc.; the NUMA API for <strong>Linux</strong> takes this approach. Many applications,<br />

particularly larger applications with many concurrent<br />

threads of execution, cannot fully utilize<br />

a NUMA machine with the default scheduler<br />

and VM behavior. Take, for example, a<br />

database application that uses a large region of<br />

shared memory and many threads. This application<br />

may have a startup thread that initializes<br />

the environment, sets up the shared memory<br />

region, and forks off the worker threads. <strong>The</strong><br />

default behavior of <strong>Linux</strong>’s VM for NUMA is<br />

to bring pages into memory on the node that<br />

faulted them in. This behavior for our hypothetical<br />

app would mean that many pages<br />

would get faulted in by the startup thread on<br />

the node it is executing on, not necessarily on<br />

the node containing the processes that will actually<br />

use these pages. Also, the forked worker<br />

threads would get spread around by the scheduler<br />

to be balanced across all the nodes and<br />

their CPUs, but with no guarantees as to which



threads would be associated with which nodes.<br />

<strong>The</strong> NUMA API and scheduler affinity syscalls<br />

allow this application to specify that its threads<br />

be pinned to particular CPUs and that its memory<br />

be placed on particular nodes. <strong>The</strong> application<br />

knows which threads will be working<br />

with which regions of memory, and is better<br />

equipped than the kernel to make those decisions.<br />

<strong>The</strong> <strong>Linux</strong> NUMA API allows applications<br />

to give regions of their own virtual memory<br />

space specific allocation behaviors, called policies.<br />

Currently there are four supported policies:<br />

PREFERRED, BIND, INTERLEAVE,<br />

and DEFAULT. <strong>The</strong> DEFAULT policy is the<br />

simplest, and tells the VMM to do what it<br />

would normally do (i.e., pre-NUMA API) for<br />

pages in the policied region, and fault them<br />

in from the local node. This policy applies<br />

to all regions, but is overridden if an application<br />

requests a different policy. <strong>The</strong> PRE-<br />

FERRED policy allows an application to specify<br />

one node that all pages in the policied region<br />

should come from. However, if the specified<br />

node has no available pages, the PRE-<br />

FERRED policy allows allocation to fall back<br />

to any other node in the system. <strong>The</strong> BIND<br />

policy allows applications to pass in a nodemask,<br />

a bitmap of nodes, that the VM is required<br />

to use when faulting in pages from a region.<br />

<strong>The</strong> fourth policy type, INTERLEAVE,<br />

again requires applications to pass in a nodemask,<br />

but with the INTERLEAVE policy, the<br />

nodemask is used to ensure pages are faulted<br />

in in a round-robin fashion from the nodes<br />

in the nodemask. As with the PREFERRED<br />

policy, the INTERLEAVE policy allows page<br />

allocation to fall back to other nodes if necessary.<br />

In addition to allowing a process to<br />

policy a specific region of its VM space, the<br />

NUMA API also allows a process to policy<br />

its entire VM space with a process-wide policy,<br />

which is set with a different syscall: set_<br />

mempolicy(). Note that process-wide policies are not persistent over swapping; per-VMA policies, however, are. Please also note that<br />

none of the policies will migrate existing (already<br />

allocated) pages to match the binding.<br />

<strong>The</strong> actual implementation of the in-kernel<br />

policies uses a struct mempolicy that is<br />

hung off the struct vm_area_struct.<br />

This choice involves some tradeoffs. <strong>The</strong> first<br />

is that, previous to the NUMA API, the per-<br />

VMA structure was exactly 32 bytes on 32-<br />

bit architectures, meaning that multiple vm_<br />

area_structs would fit conveniently in a<br />

single cacheline. <strong>The</strong> structure is now a little<br />

larger, but this allowed us to achieve a per-<br />

VMA granularity to policied regions. This is<br />

important in that it is flexible enough to bind<br />

a single page, a whole library, or a whole process’<br />

memory. This choice did lead to a second<br />

obstacle, however, which was for shared<br />

memory regions. For shared memory regions,<br />

we really want the policy to be shared amongst<br />

all processes sharing the memory, but VMAs<br />

are not shared across separate tasks. <strong>The</strong> solution<br />

that was implemented to work around this<br />

was to create a red-black tree of “shared policy<br />

nodes” for shared memory regions. Due<br />

to this, calls were added to the vm_ops structure<br />

which allow the kernel to check if a shared<br />

region has any policies and to easily retrieve<br />

these shared policies.<br />

5.2 Syscall Entry Points<br />

1. sys_mbind(unsigned long start, unsigned<br />

long len, unsigned long mode, unsigned<br />

long *nmask, unsigned long maxnode,<br />

unsigned flags);<br />

Bind the region of memory [start,<br />

start+len) according to mode and<br />

flags on the nodes enumerated in<br />

nmask and having a maximum possible<br />

node number of maxnode.<br />

2. sys_set_mempolicy(int mode, unsigned



long *nmask, unsigned long maxnode);<br />

Bind the entire address space of the current<br />

process according to mode on the<br />

nodes enumerated in nmask and having<br />

a maximum possible node number of<br />

maxnode.<br />

3. sys_get_mempolicy(int *policy, unsigned<br />

long *nmask, unsigned long maxnode,<br />

unsigned long addr, unsigned long flags);<br />

Return the current binding’s mode in<br />

policy and node enumeration in<br />

nmask based on the maxnode, addr,<br />

and flags passed in.<br />

5.4 At Page Fault Time<br />

<strong>The</strong>re are now several new and different<br />

flavors of alloc_pages() style functions.<br />

Previous to the NUMA API, there<br />

existed alloc_page(), alloc_pages()<br />

and alloc_pages_node(). Without going<br />

into too much detail, alloc_page()<br />

and alloc_pages() both called alloc_<br />

pages_node() with the current node id as<br />

an argument. alloc_pages_node() allocated<br />

2^order pages from a specific node, and<br />

was the only caller to the real page allocator,<br />

__alloc_pages().<br />

In addition to the raw syscalls discussed above,<br />

there is a user-level library called “libnuma”<br />

that attempts to present a more cohesive interface<br />

to the NUMA API, topology, and scheduler<br />

affinity functionality. This, however, is<br />

documented elsewhere.<br />


5.3 At mbind() Time<br />

After argument validation, the passed-in list of<br />

nodes is checked to make sure they are all online.<br />

If the node list is ok, a new memory policy<br />

structure is allocated and populated with the<br />

binding details. Next, the given address range<br />

is checked to make sure the vma’s for the region<br />

are present and correct. If the region is ok,<br />

we proceed to actually install the new policy<br />

into all the vma’s in that range. For most types<br />

of virtual memory regions, this involves simply<br />

pointing the vma->vm_policy to the newly<br />

allocated memory policy structure. For shared<br />

memory, hugetlbfs, and tmpfs, however, it’s<br />

not quite this simple. In the case of a memory<br />

policy for a shared segment, a red-black tree<br />

root node is created, if it doesn’t already exist,<br />

to represent the shared memory segment and<br />

is populated with “shared policy nodes.” This<br />

allows a user to bind a single shared memory<br />

segment with multiple different bindings.<br />

Figure 8: old alloc_pages (alloc_page() and alloc_pages() call alloc_pages_node(), which calls __alloc_pages())<br />

With the introduction of the NUMA API, non-<br />

NUMA kernels still retain the old alloc_<br />

page*() routines, but the NUMA allocators<br />

have changed. alloc_pages_node() and __alloc_pages(), the core routines, remain untouched, but all calls to alloc_<br />

page()/alloc_pages() now end up calling<br />

alloc_pages_current(), a new<br />

function.



<strong>The</strong>re has also been the addition<br />

of two new page allocation functions:<br />

alloc_page_vma() and<br />

alloc_page_interleave().<br />

alloc_pages_current() checks that the<br />

system is not currently in_interrupt(),<br />

and if it isn’t, uses the current process’s<br />

process policy for allocation. If<br />

the system is currently in interrupt context,<br />

alloc_pages_current() falls<br />

back to the old default allocation scheme.<br />

alloc_page_interleave() allocates<br />

pages from regions that are bound with an<br />

interleave policy, and is broken out separately<br />

because there are some statistics kept for<br />

interleaved regions. alloc_page_vma()<br />

is a new allocator that allocates only single<br />

pages based on a per-vma policy. <strong>The</strong><br />

alloc_page_vma() function is the only<br />

one of the new allocator functions that must be<br />

called explicitly, so you will notice that some<br />

calls to alloc_pages() have been replaced<br />

by calls to alloc_page_vma() throughout<br />

the kernel, as necessary.<br />

6 Legal statement<br />

This work represents the view of the authors, and<br />

does not necessarily represent the view of IBM.<br />

IBM, NUMA-Q, and Sequent are registered trademarks of International Business Machines Corporation in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of others.<br />

References<br />

[LWN] LWN Editor, “Scheduling Domains,” http://lwn.net/Articles/80911/<br />
[MM2] <strong>Linux</strong> 2.6.6-rc2/mm2 source, http://www.kernel.org<br />

5.5 Problems/Future Work<br />

<strong>The</strong>re is no checking that the nodes requested<br />

are online at page fault time, so interactions<br />

with hotpluggable CPUs/memory<br />

will be tricky. <strong>The</strong>re is an asymmetry between<br />

how you bind a memory region and<br />

a whole process’s memory: <strong>One</strong> call takes<br />

a flags argument, and one doesn’t. Also, the maxnode argument is a bit strange: the get/set_affinity calls take a number of bytes to be read/written instead of a maximum CPU number. <strong>The</strong> alloc_page_<br />

interleave() function could be dropped if<br />

we were willing to forgo the statistics that are<br />

kept for interleaved regions. Again, a lack of<br />

symmetry exists because other types of policies<br />

aren’t tracked in any way.



Figure 9: new alloc_pages (diagram labels: “Both UP/SMP &amp; NUMA,” “UP/SMP only,” “NUMA only”; functions shown: alloc_page(), alloc_pages(), alloc_page_vma(), alloc_pages_current(), alloc_pages_node(), alloc_page_interleave(), __alloc_pages())<br />




Improving <strong>Kernel</strong> Performance by Unmapping the<br />

Page Cache<br />

James Bottomley<br />

SteelEye Technology, Inc.<br />

James.Bottomley@SteelEye.com<br />

Abstract<br />

<strong>The</strong> current DMA API is written on the founding<br />

assumption that coherency is maintained between the device and kernel virtual addresses.<br />

We have a different API for coherency<br />

between the kernel and userspace. <strong>The</strong> upshot<br />

is that every process I/O must be flushed twice:<br />

Once to make the user coherent with the kernel<br />

and once to make the kernel coherent with the<br />

device. Additionally, having to map all pages<br />

for I/O places considerable resource pressure<br />

on x86 (where any highmem page must be separately<br />

mapped).<br />

We present a different paradigm: Assume that<br />

by and large, read/write data is only required<br />

by a single entity (the major consumers of large<br />

multiply shared mappings are libraries, which<br />

are read only) and optimise the I/O path for this<br />

case. This means that any other shared consumers<br />

of the data (including the kernel) must<br />

separately map it themselves. <strong>The</strong> DMA API<br />

would be changed to perform coherence to the<br />

preferred address space (which could be the<br />

kernel). This is a slight paradigm shift, because<br />

now devices that need to peek at the data may<br />

have to map it first. Further, to free up more<br />

space for this mapping, we would break the assumption<br />

that any page in ZONE_NORMAL<br />

is automatically mapped into kernel space.<br />

<strong>The</strong> benefits are that I/O goes straight from<br />

the device into the user space (for processors<br />

that have virtually indexed caches) and the kernel<br />

has quite a large unmapped area for use in<br />

kmapping highmem pages (for x86).<br />

1 Introduction<br />

In the <strong>Linux</strong> kernel 1 there are two addressing<br />

spaces: memory physical which is the location<br />

in the actual memory subsystem and CPU virtual,<br />

which is an address the CPU’s Memory<br />

Management Unit (MMU) translates to a memory<br />

physical address internally. <strong>The</strong> <strong>Linux</strong> kernel<br />

operates completely in CPU virtual space,<br />

keeping separate virtual spaces for the kernel<br />

and each of the current user processes. However,<br />

the kernel also has to manage the mappings<br />

between physical and virtual spaces, and<br />

to do that it keeps track of where the physical<br />

pages of memory currently are.<br />

In the <strong>Linux</strong> kernel, memory is split into zones<br />

in memory physical space:<br />

• ZONE_DMA: A historical region where<br />

ISA DMAable memory is allocated from.<br />

On x86 this is all memory under 16MB.<br />

• ZONE_NORMAL: This is where normally<br />

allocated kernel memory goes. Where<br />

1 This is not quite true; there are kernels for processors without memory management units, but these are very specialised and won’t be considered further.



this zone ends depends on the architecture.<br />

However, all memory in this zone<br />

is mapped in kernel space (visible to the<br />

kernel).<br />

• ZONE_HIGHMEM: This is where the rest<br />

of the memory goes. Its characteristic is<br />

that it is not mapped in kernel space (thus<br />

the kernel cannot access it without first<br />

mapping it).<br />

1.1 <strong>The</strong> x86 and Highmem<br />

<strong>The</strong> main reason for the existence of ZONE_<br />

HIGHMEM is a peculiar quirk on the x86 processor<br />

which makes it rather expensive to have<br />

different page table mappings between the kernel<br />

and user space. <strong>The</strong> root of the problem<br />

is that the x86 can only keep one set of physical<br />

to virtual mappings on-hand at once. Since<br />

the kernel and the processes occupy different<br />

virtual mappings, the TLB context would have<br />

to be switched not only when the processor<br />

changes current user tasks, but also when the<br />

current user task calls on the kernel to perform<br />

an operation on its behalf. <strong>The</strong> time taken<br />

to change mappings, called the TLB flushing<br />

penalty, contributes to a degradation in process<br />

performance and has been measured at around<br />

30%[1]. To avoid this penalty, the <strong>Kernel</strong> and<br />

user spaces share a partitioned virtual address<br />

space so that the kernel is actually mapped into<br />

user space (although protected from user access)<br />

and vice versa.<br />

<strong>The</strong> upshot of this is that the x86 address space is divided 3GB/1GB, with the virtual address<br />

range 0x00000000-0xbfffffff<br />

being available for the user process and<br />

0xc0000000-0xffffffff being reserved<br />

for the kernel.<br />

<strong>The</strong> problem, for the kernel, is that it now only<br />

has 1GB of virtual address to play with including<br />

all memory mapped I/O regions. <strong>The</strong> result<br />

being that ZONE_NORMAL actually ends<br />

at around 850MB on most x86 boxes. Since<br />

the kernel must also manage the mappings for<br />

every user process (and these mappings must<br />

be memory resident), the larger the physical<br />

memory of the kernel becomes, the less of<br />

ZONE_NORMAL becomes available to the kernel.<br />

On a 64GB x86 box, the usable memory<br />

becomes minuscule and has led to the<br />

proposal[2] to use a 4G/4G split and just accept<br />

the TLB flushing penalty.<br />

1.2 Non-x86 and Virtual Indexing<br />

Most other architectures are rather better implemented<br />

and are able to cope easily with separate<br />

virtual spaces for the user and the kernel<br />

without imposing a performance penalty<br />

transitioning from one virtual address space to<br />

another. However, there are other problems<br />

the kernel’s penchant for keeping all memory<br />

mapped causes, notably with Virtual Indexing.<br />

Virtual Indexing[3] (VI) means that the CPU<br />

cache keeps its data indexed by virtual address<br />

(rather than by physical address like the x86<br />

does). <strong>The</strong> problem this causes is that if multiple<br />

virtual address spaces have the same physical<br />

address mapped, but at different virtual addresses<br />

then the cache may contain duplicate<br />

entries, called aliases. Managing these aliases<br />

becomes impossible if there are multiple ones<br />

that become dirty.<br />

Most VI architectures find a solution to the<br />

multiple cache line problem by having a “congruence<br />

modulus” meaning that if two virtual<br />

addresses are equal modulo this congruence<br />

(usually a value around 4MB) then the cache<br />

will detect the aliasing and keep only a single<br />

copy of the data that will be seen by all the virtual<br />

addresses.<br />

<strong>The</strong> problems arise because, although architectures<br />

go to great lengths to make sure all<br />

user mappings are congruent, because the ker-



nel memory is always mapped, it is highly unlikely<br />

that any given kernel page would be congruent<br />

to a user page.<br />

1.3 <strong>The</strong> solution: Unmapping ZONE_NORMAL<br />

It has already been pointed out[4] that x86<br />

could recover some of its precious ZONE_<br />

NORMAL space simply by moving page table<br />

entries into unmapped highmem space. However,<br />

the penalty of having to map and unmap<br />

the page table entries to modify them turned<br />

out to be unacceptable.<br />

<strong>The</strong> solution, though, remains valid. <strong>The</strong>re<br />

are many pages of data currently in ZONE_<br />

NORMAL that the kernel doesn’t ordinarily use.<br />

If these could be unmapped and their virtual<br />

address space given up then the x86 kernel<br />

wouldn’t be facing quite such a memory<br />

crunch.<br />

For VI architectures, the problems stem from<br />

having unallocated kernel memory already<br />

mapped. If we could keep the majority of kernel<br />

memory unmapped, and map it only when<br />

we really need to use it, then we would stand<br />

a very good chance of being able to map the<br />

memory congruently even in kernel space.<br />

<strong>The</strong> solution this paper will explore is that of<br />

keeping the majority of kernel memory unmapped,<br />

mapping it only when it is used.<br />

2 A closer look at Virtual Indexing<br />

As well as the aliasing problem, VI architectures<br />

also have issues with I/O coherency on<br />

DMA. <strong>The</strong> essence of the problem stems from<br />

the fact that in order to make a device access<br />

to physical memory coherent, any cache<br />

lines that the processor is holding need to be<br />

flushed/invalidated as part of the DMA transaction.<br />

In order to do DMA, a device simply<br />

presents a physical address to the system with<br />

a request to read or write. However, if the processor<br />

indexes the caches virtually, it will have<br />

no idea whether it is caching this physical address<br />

or not. <strong>The</strong>refore, in order to give the<br />

processor an idea of where in the cache the data<br />

might be, the DMA engines on VI architectures<br />

also present a virtual index (called the “coherence<br />

index”) along with the physical address.<br />

2.1 Coherence Indices and DMA<br />

<strong>The</strong> Coherence Index is computed by the processor<br />

on a per page basis, and is used to identify<br />

the line in the cache belonging to the physical<br />

address the DMA is using.<br />

<strong>One</strong> will notice that this means the coherence<br />

index must be computed on every DMA transaction<br />

for a particular address space (although,<br />

if all the addresses are congruent, one may simply<br />

pick any one). Since, at the time the DMA mapping is done, the only virtual address the<br />

kernel knows about is the kernel virtual address,<br />

it means that DMA is always done coherently<br />

with the kernel.<br />

In turn, since the kernel address is pretty much<br />

not congruent with any user address, before the<br />

DMA is signalled as being completed to the<br />

user process, the kernel mapping and the user<br />

mappings must likewise be made coherent (using<br />

the flush_dcache_page() function).<br />

However, since the majority of DMA transactions<br />

occur on user data in which the kernel has<br />

no interest, the extra flush is simply an unnecessary<br />

performance penalty.<br />

This performance penalty would be eliminated<br />

if either we knew that the designated kernel address<br />

was congruent to all the user addresses<br />

or we didn’t bother to map the DMA region<br />

into kernel space and simply computed the coherence<br />

index from a given user process. <strong>The</strong><br />

latter would be preferable from a performance<br />

point of view since it eliminates an unneces-



sary map and unmap.<br />

2.2 Other Issues with Non-Congruence<br />

On the parisc architecture, there is an architectural<br />

requirement that we don’t simultaneously<br />

enable multiple read and write translations of<br />

a non-congruent address. We can either enable<br />

a single write translation or multiple read (but<br />

no write) translations. With the current manner<br />

of kernel operation, this is almost impossible<br />

to satisfy without going to enormous lengths in<br />

our page translation and fault routines to work<br />

around the issues.<br />

Previously, we were able to get away with<br />

ignoring this restriction because the machine<br />

would only detect it if we allowed multiple<br />

aliases to become dirty (something <strong>Linux</strong> never<br />

does). However, in the next generation systems,<br />

this condition will be detected when it<br />

occurs. Thus, addressing it has become critical<br />

to providing a bootable kernel on these new<br />

machines.<br />

Thus, as well as being a simple performance<br />

enhancement, removing non-congruence becomes<br />

vital to keeping the kernel booting on<br />

next generation machines.<br />

2.3 VIPT vs VIVT<br />

This topic is covered comprehensively in [3].<br />

However, there is a problem in VIPT caches,<br />

namely that if we are reusing the virtual address<br />

in kernel space, we must flush the processor’s<br />

cache for that page on this re-use otherwise<br />

it may fall victim to stale cache references<br />

that were left over from a prior use.<br />

Flushing a VIPT cache is easier said than done,<br />

since in order to flush, a valid translation must<br />

exist for the virtual address in order for the<br />

flush to be effective. This causes particular<br />

problems for pages that were mapped to a user<br />

space process, since the address translations<br />

are destroyed before the page is finally freed.<br />

3 <strong>Kernel</strong> Virtual Space<br />

Although the kernel is nominally mapped in<br />

the same way the user process is (and can theoretically<br />

be fragmented in physical space), in<br />

fact it is usually offset mapped. This means<br />

there is a simple mathematical relation between<br />

the physical and virtual addresses:<br />

virtual = physical + __PAGE_OFFSET<br />

where __PAGE_OFFSET is an architecture<br />

defined quantity. This type of mapping makes<br />

it very easy to calculate virtual addresses from<br />

physical ones and vice versa without having to<br />

go to all the bother (and CPU time) of having<br />

to look them up in the kernel page tables.<br />

3.1 Moving away from Offset Mapping<br />

<strong>The</strong>re’s another wrinkle on some architectures<br />

in that if an interruption occurs, the CPU<br />

turns off virtual addressing to begin processing<br />

it. This means that the kernel needs to<br />

save the various registers and turn virtual addressing<br />

back on, all in physical space. If<br />

it’s no longer a simple matter of subtracting<br />

__PAGE_OFFSET to get the kernel stack for<br />

the process, then extra time will be consumed<br />

in the critical path doing potentially cache cold<br />

page table lookups.<br />

3.2 Keeping track of Mapped pages<br />

In general, when mapping a page we will either<br />

require that it goes in the first available<br />

slot (for x86), or that it goes at the first available<br />

slot congruent with a given address (for VI<br />

architectures). All we really require is a simple<br />

mechanism for finding the first free page



virtual address given some specific constraints.<br />

However, since the constraints are architecture<br />

specific, the specifics of this tracking are also<br />

implemented in architectures (see section 5.2<br />

for details on parisc).<br />

3.3 Determining Physical address from Virtual<br />

and Vice-Versa<br />

In the <strong>Linux</strong> kernel, the simple macros<br />

__pa() and __va() are used to do physical<br />

to virtual translation. Since we are now filling<br />

the mappings in randomly, this is no longer a<br />

simple offset calculation.<br />

<strong>The</strong> kernel does have help for finding the virtual<br />

address of a given page. <strong>The</strong>re is an optional virtual entry in struct page which is turned on<br />

and populated with the page’s current virtual<br />

address when the architecture defines WANT_<br />

PAGE_VIRTUAL. <strong>The</strong> __va() macro can be<br />

programmed simply to do this lookup.<br />

To find the physical address, the best method is<br />

probably to look the page up in the kernel page<br />

table mappings. This is obviously less efficient<br />

than a simple subtraction.<br />
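A toy model of the lookup-based direction (field name and sizes simplified; the real field is struct page's virtual member, enabled by WANT_PAGE_VIRTUAL) makes the cost difference concrete:<br />

```c
#include <stdint.h>

/* Sketch of WANT_PAGE_VIRTUAL-style lookup.  Once pages are no longer
 * offset mapped, the kernel caches each mapped page's virtual address in
 * its struct page rather than computing it arithmetically. */
struct toy_page {
    uintptr_t virtual_addr;   /* 0 when the page is unmapped */
};

#define NFRAMES 16
static struct toy_page mem_map[NFRAMES];   /* toy mem_map */

/* __va() direction: a simple table read. */
static uintptr_t page_address(unsigned long pfn)
{
    return mem_map[pfn].virtual_addr;
}

/* __pa() direction: without offset mapping there is no arithmetic
 * shortcut; here we scan, standing in for a page-table walk. */
static long virt_to_pfn(uintptr_t va)
{
    for (unsigned long pfn = 0; pfn < NFRAMES; pfn++)
        if (mem_map[pfn].virtual_addr == va)
            return (long)pfn;
    return -1;   /* unmapped */
}
```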

4 Implementing the unmapping of<br />

ZONE_NORMAL<br />

Given that the entire kernel is designed to operate with ZONE_NORMAL mapped, it is perhaps surprising that unmapping it turns out to be fairly easy. <strong>The</strong> primary reason for<br />

this is the existence of highmem. Since pages<br />

in ZONE_HIGHMEM are always unmapped and<br />

since they are usually assigned to user processes,<br />

the kernel must proceed on the assumption<br />

that it potentially has to map into its address<br />

space any page from a user process that<br />

it wishes to touch.<br />

4.1 Booting<br />

<strong>The</strong> kernel has an entire bootmem API whose<br />

sole job is to cope with memory allocations<br />

while the system is booting and before paging<br />

has been initialised to the point where normal<br />

memory allocations may proceed. On parisc,<br />

we simply get the available page ranges from<br />

the firmware, map them all and turn them over<br />

lock stock and barrel to bootmem.<br />

<strong>The</strong>n, when we’re ready to begin paging, we<br />

simply release all the unallocated bootmem<br />

pages for the kernel to use from its mem_map 2<br />

array of pages.<br />

We can implement the unmapping idea simply<br />

by covering all our page ranges with an offset<br />

map for bootmem, but then unmapping all the<br />

unreserved pages that bootmem releases to the<br />

mem_map array.<br />

This leaves us with the kernel text and data sections contiguously offset mapped, and all other boot-time reservations still mapped.<br />

4.2 Pages Coming From User Space<br />

<strong>The</strong> standard mechanisms for mapping potential<br />

highmem pages from user space for the<br />

kernel to see are kmap, kunmap, kmap_<br />

atomic, and kmap_atomic_to_page.<br />

Simply hijacking them and divorcing their implementation<br />

from CONFIG_HIGHMEM is sufficient<br />

to solve all user to kernel problems<br />

that arise because of the unmapping of ZONE_<br />

NORMAL.<br />

4.3 In <strong>Kernel</strong> Problems: Memory Allocation<br />

Since now every free page in the system will<br />

be unmapped, they will have to be mapped<br />

2 This global array would be a set of per-zone arrays<br />

on NUMA



before the kernel can use them (pages allocated<br />

for use in user space have no need to<br />

be mapped additionally in kernel space at allocation<br />

time). <strong>The</strong> engine for doing this is a<br />

single point in __alloc_pages() which is<br />

the central routine for allocating every page in<br />

the system. At the single point of successful page return, the page is mapped for kernel use if __GFP_HIGH is not set—this simple test is sufficient to ensure that only kernel pages are mapped here.<br />

<strong>The</strong> unmapping is done in two separate routines:<br />

__free_pages_ok() for freeing bulk<br />

pages (accumulations of contiguous pages) and<br />

free_hot_cold_page() for freeing single<br />

pages. Here, since we don’t know the gfp mask<br />

the page was allocated with, we simply check<br />

to see if the page is currently mapped, and unmap<br />

it if it is before freeing it. <strong>The</strong>re is another<br />

side benefit to this: the routine that transfers all<br />

the unreserved bootmem to the mem_map array<br />

does this via __free_pages(). Thus,<br />

we additionally achieve the unmapping of all<br />

the free pages in the system after booting with<br />

virtually no additional effort.<br />
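The map-on-allocate/unmap-on-free policy can be sketched as follows (hook names, the flag value, and the page structure are illustrative, not the real __alloc_pages()/__free_pages_ok() code):<br />

```c
/* Illustrative flag value only. */
#define DEMO_GFP_HIGH 0x20u

struct toy_alloc_page { int mapped; };

/* Allocation side: map the page for kernel use unless the caller asked
 * for a (highmem-style) page it will map itself later. */
static void alloc_page_hook(struct toy_alloc_page *p, unsigned gfp)
{
    if (!(gfp & DEMO_GFP_HIGH))
        p->mapped = 1;
}

/* Free side: the gfp mask is unknown here, so unmap only if the page is
 * currently mapped — exactly the check the text describes. */
static void free_page_hook(struct toy_alloc_page *p)
{
    if (p->mapped)
        p->mapped = 0;
}
```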

4.4 Other Benefits: Variable size pages<br />

Although it wasn’t the design goal of this structure to provide variable size pages, one benefit of this approach is that pages are now mapped as they are allocated. Since pages<br />

in the kernel are allocated with a specified order<br />

(the power of two of the number of contiguous<br />

pages), it becomes possible to cover<br />

them with a TLB entry that is larger than the<br />

usual page size (as long as the architecture supports<br />

this). Thus, we can take the order argument<br />

to __alloc_pages() and work out<br />

the smallest number of TLB entries that we<br />

need to allocate to cover it.<br />
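For instance, assuming an architecture whose TLB entries cover power-of-four multiples of the base page size (PA-RISC-like; alignment constraints ignored for simplicity), the minimal entry count for a given order can be computed greedily:<br />

```c
/* Sketch: minimum number of variable-size TLB entries needed to cover a
 * 2^order page allocation, assuming supported entry sizes are powers of
 * four base pages.  Illustrative only. */
static int tlb_entries_for_order(int order)
{
    unsigned long pages = 1UL << order;
    int entries = 0;

    while (pages) {
        unsigned long chunk = 1;          /* largest 4^k chunk that fits */
        while (chunk * 4 <= pages)
            chunk *= 4;
        pages -= chunk;
        entries++;
    }
    return entries;
}
```

Under this assumption an even order needs a single entry and an odd order needs two.<br />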

Implementation of variable size pages is actually<br />

transparent to the system; as far as <strong>Linux</strong><br />

is concerned, the page table entries it deals with<br />

describe 4k pages. However, we add additional<br />

flags to the pte to tell the software TLB routine<br />

that actually we’d like to use a larger size TLB<br />

to access this region.<br />

As a further optimisation, in the architecture<br />

specific routines that free the boot mem, we can<br />

remap the kernel text and data sections with the<br />

smallest number of TLB entries that will entirely<br />

cover each of them.<br />

5 Achieving <strong>The</strong> VI architecture<br />

Goal: Fully Congruent Aliasing<br />

<strong>The</strong> system possesses every attribute it now<br />

needs to implement this. We no longer map<br />

any user pages into kernel space unless the kernel<br />

actually needs to touch them. Thus, the<br />

pages will have congruent user addresses allocated<br />

to them in user space before we try to<br />

map them in kernel space. All we then have to do is walk up the free address list in increments of the congruence modulus until we find<br />

an empty place to map the page congruently.<br />

5.1 Wrinkles in the I/O Subsystem<br />

<strong>The</strong> I/O subsystem is designed to operate without<br />

mapping pages into the kernel at all. This<br />

becomes problematic for VI architectures because<br />

we have to know the user virtual address<br />

to compute the coherence index for the I/O.<br />

If the page is unmapped in kernel space, we<br />

can no longer make it coherent with the kernel<br />

mapping and, unfortunately, the information in<br />

the BIO is insufficient to tell us the user virtual<br />

address.<br />

<strong>The</strong> proposal for solving this is to add an architecture<br />

defined set of elements to struct<br />

bio_vec and an architecture specific function<br />

for populating this (possibly empty) set of<br />

elements as the biovec is created. In parisc,



we need to add an extra unsigned long for<br />

the coherence index, which we compute from<br />

a pointer to the mm and the user virtual address.<br />

<strong>The</strong> architecture defined components are<br />

pulled into struct scatterlist by yet<br />

another callout when the request is mapped for<br />

DMA.<br />

5.2 Tracking the Mappings in ZONE_DMA<br />

<strong>The</strong> tracking requirements vary by architecture: x86 merely needs to find the first free pte in which to place a page, whereas VI architectures must find the first free pte satisfying their (architecture-specific) congruence requirements. <strong>The</strong> actual mechanism for finding a free pte for the mapping therefore needs to be architecture specific.<br />

On parisc, all of this can be done in kmap_<br />

kernel() which merely uses rmap[5] to determine<br />

if the page is mapped in user space<br />

and find the congruent address if it is. We<br />

use a simple hash table based bitmap with one<br />

bucket representing the set of available congruent<br />

pages. Thus, finding a page congruent to<br />

any given virtual address is the simple computation<br />

of finding the first set bit in the congruence<br />

bucket. To find an arbitrary page, we keep<br />

a global bucket counter, allocating a page from<br />

that bucket and then incrementing the counter 3 .<br />
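A minimal sketch of such a congruence-bucket bitmap (bucket count and widths are invented, and the real parisc code handles concurrency with the lockless atomic increment described in the footnote):<br />

```c
#include <stdint.h>

#define NBUCKETS 16          /* one bucket per congruence class (made up) */

static uint32_t bucket[NBUCKETS];   /* set bit => that congruent page is free */
static unsigned rotor;              /* global counter for "any page" requests */

/* Congruent allocation: find-first-set in the class's bucket. */
static int alloc_congruent(unsigned vaddr_class)
{
    uint32_t *b = &bucket[vaddr_class % NBUCKETS];
    for (int bit = 0; bit < 32; bit++)
        if (*b & (1u << bit)) {
            *b &= ~(1u << bit);     /* claim the slot */
            return bit;             /* index within the class */
        }
    return -1;
}

/* Arbitrary allocation: take the next bucket indicated by the rotor. */
static int alloc_any(unsigned *class_out)
{
    unsigned c = rotor++ % NBUCKETS;
    *class_out = c;
    return alloc_congruent(c);
}
```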

6 Implementation Details on PA-<br />

RISC<br />

Since the whole thrust of this project was to improve<br />

the kernel on PA-RISC (and bring it back<br />

into architectural compliance), it is appropriate<br />

to investigate some of the other problems that<br />

turned up during the implementation.<br />

3 This can all be done locklessly with atomic increments,<br />

since it doesn’t really matter if we get two allocations<br />

from the same bucket because of race conditions.<br />

6.1 Equivalent Mapping<br />

<strong>The</strong> PA architecture has a software TLB, meaning<br />

that in Virtual mode, if the CPU accesses<br />

an address that isn’t in the CPU’s TLB cache,<br />

it will take a TLB fault so the software routine<br />

can locate the TLB entry (by walking the page<br />

tables) and insert it into the CPU’s TLB. Obviously,<br />

this type of interruption must be handled<br />

purely by referencing physical addresses.<br />

In fact, the PA CPU is designed to have fast and<br />

slow paths for faults and interruptions. <strong>The</strong> fast<br />

paths (since they cannot take another interruption,<br />

i.e. not a TLB miss fault) must all operate<br />

on physical addresses. To assist with this, the<br />

PA CPU even turns off virtual addressing when<br />

it takes an interruption.<br />

When the CPU turns off virtual address translation,<br />

it is said to be operating in absolute<br />

mode. All address accesses in this mode are<br />

physical. However, all accesses in this mode<br />

also go through the CPU cache (which means<br />

that for this particular mode the cache is actually<br />

Physically Indexed). Unfortunately, this<br />

can also set up unwanted aliasing between the<br />

physical address and its virtual translation. <strong>The</strong><br />

fix for this is to obey the architectural definition<br />

for “equivalent mapping.” Equivalent mapping<br />

is defined as virtual and physical addresses being<br />

equal; however, we benefit from the obvious<br />

loophole in that the physical and virtual addresses<br />

don’t have to be exactly equal, merely<br />

equal modulo the congruence modulus.<br />

All of this means that when a page is allocated<br />

for use by the kernel, we must determine if it<br />

will ever be used in absolute mode, and make it<br />

equivalently mapped if it will be. At the time of<br />

writing, this was simply implemented by making<br />

all kernel allocated pages equivalent. However,<br />

really all that needs to be equivalently<br />

mapped is<br />

1. the page tables (pgd, pmd and pte),



2. the task structure and<br />

3. the kernel stacks.<br />

6.2 Physical to Virtual address Translation<br />

In the interruption slow path, where we save<br />

all the registers and transition to virtual mode,<br />

there is a point where execution must be<br />

switched (and hence pointers moved from<br />

physical to virtual). Currently, with offset<br />

mapping, this is simply done by an addition<br />

of __PAGE_OFFSET. However, in the new<br />

scheme we cannot do this, nor can we call<br />

the address translation functions when in absolute<br />

mode. <strong>The</strong>refore, we had to reorganise<br />

the interruption paths in the PA code so<br />

that both the physical and virtual addresses were<br />

available. Currently parisc uses a control register<br />

(%cr30) to store the virtual address of<br />

the struct thread_info. We altered all<br />

paths to change %cr30 to contain the physical<br />

address of struct thread_info and<br />

also added to the thread info a physical address pointer to the struct task_struct.<br />

This is sufficient to perform all the necessary<br />

register saves in absolute addressing mode.<br />

6.3 Flushing on Page Freeing<br />

As was documented in section 2.3, we need to<br />

find a way of flushing a user virtual address after<br />

its translation is gone. Actually, this turns<br />

out to be quite easy on PARISC. We already<br />

have an area of memory (called the tmpalias<br />

space) that we use when copying to prime the user<br />

cache (it is simply a 4MB memory area we dynamically<br />

program to map to the page). <strong>The</strong>refore,<br />

as long as we know the user virtual address,<br />

we can simply flush the page through<br />

the tmpalias space. In order to confound any<br />

attempted kernel use of this page, we reserve<br />

a separate 4MB virtual area that produces a<br />

page fault if referenced, and point the page’s<br />

virtual address into this when it is removed<br />

from process mappings (so that any kernel attempt<br />

to use the page produces an immediate<br />

fault). <strong>The</strong>n, when the page is freed, if its<br />

virtual pointer is within this range, we convert<br />

it to a tmpalias address and flush it using<br />

the tmpalias mechanism.<br />
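The address arithmetic behind this can be sketched as follows (all base addresses and the window size are invented; the real tmpalias and poison windows are set up by the parisc mapping code):<br />

```c
#include <stdint.h>

#define TMPALIAS_BASE 0xF0000000UL   /* flushable 4 MB alias window (made up) */
#define POISON_BASE   0xF0400000UL   /* faulting 4 MB window pages are parked in */
#define WINDOW_SIZE   0x00400000UL
#define WINDOW_MASK   (WINDOW_SIZE - 1)

/* A parked page keeps its offset (hence cache colour) within the window,
 * so a flush through the tmpalias alias hits the same cache lines as the
 * now-removed user mapping. */
static uintptr_t tmpalias_addr(uintptr_t parked_vaddr)
{
    return TMPALIAS_BASE + (parked_vaddr & WINDOW_MASK);
}

/* At free time: only pages parked in the poison window still need the
 * tmpalias flush. */
static int needs_tmpalias_flush(uintptr_t vaddr)
{
    return vaddr >= POISON_BASE && vaddr < POISON_BASE + WINDOW_SIZE;
}
```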

7 Results and Conclusion<br />

<strong>The</strong> best result is that on a parisc machine, the<br />

total amount of memory the operational kernel<br />

keeps mapped is around 10MB (although this<br />

alters depending on conditions).<br />

<strong>The</strong> current implementation makes all pages<br />

congruent or equivalent, but the allocation routine<br />

contains BUG_ON() asserts to detect if we<br />

run out of equivalent addresses. So far, under<br />

fairly heavy stress, none of these has tripped.<br />

Although the primary reason for the unmapping<br />

was to move parisc back within its architectural<br />

requirements, it also produces a knock-on effect of speeding up I/O by eliminating the<br />

cache flushing from kernel to user space. At<br />

the time of writing, the effects of this were still<br />

unmeasured, but expected to be around 6% or<br />

so.<br />

As a final side effect, the flush-on-free mechanism releases parisc from a very stringent “flush<br />

the entire cache on process death or exec” requirement<br />

that was producing horrible latencies<br />

in the parisc fork/exec. With this code in<br />

place, we see a vast (50%) improvement in the<br />

fork/exec figures.<br />

References<br />

[1] Andrea Arcangeli. 3:1 4:4 100HZ 1000HZ comparison with the HINT benchmark. 7 April 2004. http://www.kernel.org/pub/linux/kernel/people/andrea/misc/31-44-100-1000/31-44-100-1000.html<br />

[2] Ingo Molnar. [announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support. 8 July 2003. http://marc.theaimsgroup.com/?t=105770467300001<br />

[3] James E.J. Bottomley. Understanding Caching. <strong>Linux</strong> Journal, January 2004, Issue 117, p. 58.<br />

[4] Ingo Molnar. [patch] simpler ‘highpte’ design. 18 February 2002. http://marc.theaimsgroup.com/?l=linux-kernel&m=101406121032371<br />

[5] Rik van Riel. Re: Rmap code? 22 August 2001. http://marc.theaimsgroup.com/?l=linux-mm&m=99849912207578<br />




<strong>Linux</strong> Virtualization on IBM POWER5 Systems<br />

Dave Boutcher, IBM — boutcher@us.ibm.com<br />

Dave Engebretsen, IBM — engebret@us.ibm.com<br />

Abstract<br />

In 2004 IBM® is releasing new systems based on the POWER5 processor. <strong>The</strong>re is new support in both the hardware and firmware for virtualization of multiple operating systems on a single platform. This includes the ability to have multiple operating systems share a processor. Additionally, a hypervisor firmware layer supports virtualization of I/O devices such as SCSI, LAN, and console, allowing limited physical resources in a system to be shared.<br />

At its extreme, these new systems allow 10 <strong>Linux</strong> images per physical processor to run concurrently, contending for and sharing the system’s physical resources. All changes to support these new functions are in the 2.4 and 2.6 <strong>Linux</strong> kernels.<br />

This paper discusses the virtualization capabilities of the processor and firmware, as well as the changes made to the PPC64 kernel to take advantage of them.<br />

1 Introduction<br />

IBM’s new POWER5 ∗∗ processor is being used in both IBM iSeries® and pSeries® systems capable of running any combination of <strong>Linux</strong>, AIX®, and OS/400® in logical partitions. <strong>The</strong> hardware and firmware, including a hypervisor [AAN00], in these systems provide the ability to create “virtual” system images with virtual hardware. <strong>The</strong> virtualization technique used on<br />

POWER hardware is known as paravirtualization,<br />

where the operating system is modified<br />

in select areas to make calls into the hypervisor.<br />

PPC64 <strong>Linux</strong> has been enhanced to make<br />

use of these virtualization interfaces. Note that<br />

the same PPC64 <strong>Linux</strong> kernel binary works<br />

on both virtualized systems and previous “bare<br />

metal” pSeries systems that did not offer a hypervisor.<br />

All changes related to virtualization have been<br />

made in the kernel, and almost exclusively in<br />

the PPC64 portion of the code. <strong>One</strong> challenge<br />

has been keeping as much code common<br />

as possible between POWER5 portions of the<br />

code and other portions, such as those supporting<br />

the Apple G5.<br />

Like previous generations of POWER processors<br />

such as the RS64 and POWER4 families,<br />

POWER5 includes hardware enablement<br />

for logical partitioning. This includes features<br />

such as a hypervisor state which is more privileged<br />

than supervisor state. This higher privilege<br />

state is used to restrict access to system<br />

resources, such as the hardware page table, to<br />

hypervisor only access. All current systems<br />

based on POWER5 run in a hypervised environment,<br />

even if only one partition is active on<br />

the system.



[Figure 1: POWER5 Partitioned System — <strong>Linux</strong>, OS/400, <strong>Linux</strong>, and AIX partitions with fractional processor allocations (1.50, 1.0, 0.50, and 1.0 processing units) multiplexed by the hypervisor across four physical CPUs.]<br />

2 Processor Virtualization<br />

2.1 Virtual Processors<br />

When running in a partition, the operating<br />

system is allocated virtual processors (VP’s),<br />

where each VP can be configured in either<br />

shared or dedicated mode of operation. In<br />

shared mode, as little as 10%, or 10 processing<br />

units, of a physical processor can be allocated<br />

to a partition and the hypervisor layer<br />

timeslices between the partitions. In dedicated<br />

mode, 100% of the processor is given to the<br />

partition such that its capacity is never multiplexed<br />

with another partition.<br />

It is possible to create more virtual processors<br />

in the partition than there are physical processors<br />

on the system. For example, a partition allocated<br />

100 processing units (the equivalent of<br />

1 processor) of capacity could be configured to<br />

have 10 virtual processors, where each VP has<br />

10% of a physical processor’s time. While not<br />

generally valuable, this extreme configuration<br />

can be used to help test SMP configurations on<br />

small systems.<br />

On POWER5 systems with multiple logical<br />

partitions, an important requirement is to be<br />

able to move processors (either shared or dedicated)<br />

from one logical partition to another.<br />

In the case of dedicated processors, this truly<br />

means moving a CPU from one logical partition<br />

to another. In the case of shared processors,<br />

it means adjusting the number of processors<br />

used by <strong>Linux</strong> on the fly.<br />

This “hotplug CPU” capability is far more interesting<br />

in this environment than in the case where the covers are removed from a real system and a CPU is physically added. <strong>The</strong><br />

goal of virtualization on these systems is to dynamically<br />

create and adjust operating system<br />

images as required. Much work has been done,<br />

particularly by Rusty Russell, to get the architecture<br />

independent changes into the mainline<br />

kernel to support hotplug CPU.<br />

Hypervisor interfaces exist that help the operating<br />

system optimize its use of the physical processor<br />

resources. <strong>The</strong> following sections describe<br />

some of these mechanisms.<br />

2.2 Virtual Processor Area<br />

Each virtual processor in the partition can create<br />

a virtual processor area (VPA), which is a<br />

small (one page) data structure shared between<br />

the hypervisor and the operating system. Its<br />

primary use is to communicate information between<br />

the two software layers. Examples of<br />

the information that can be communicated in<br />

the VPA include whether the OS is in the idle<br />

loop, if floating point and performance counter<br />

register state must be saved by the hypervisor<br />

between operating system dispatches, and<br />

whether the VP is running in the partition’s operating<br />

system.<br />

2.3 Spinlocks<br />

<strong>The</strong> hypervisor provides an interface that helps<br />

minimize wasted cycles in the operating system<br />

when a lock is held. Rather than simply<br />

spin on the held lock in the OS, a new hypervisor call, h_confer, has been provided. This<br />

interface is used to confer any remaining virtual<br />

processor cycles from the lock requester<br />

to the lock holder.<br />

<strong>The</strong> PPC64 spinlocks were changed to identify<br />

the logical processor number of the lock<br />

holder, examine that processor’s VPA yield<br />

count field to determine if it is not running in<br />

the OS (even values indicate the VP is running<br />

in the OS), and to make the h_confer call<br />

to the hypervisor to give any cycles remaining<br />

in the virtual processor’s timeslice to the lock<br />

holder. Obviously, this more expensive leg of<br />

spinlock processing is only taken if the spinlock<br />

cannot be immediately acquired. In cases<br />

where the lock is available, no additional pathlength<br />

is incurred.<br />
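The slow-path decision can be sketched as follows (a mock, not the real PPC64 spinlock code; h_confer here is a stand-in for the hypervisor call):<br />

```c
/* Per the VPA convention described above, an even yield count means the
 * holder's virtual processor is running in the OS (so spinning may
 * succeed soon), while an odd count means it was preempted by the
 * hypervisor, so the waiter should confer its timeslice. */
static int confer_calls;                       /* counts mock h_confer hcalls */

static void h_confer(int holder_vp)            /* mock hypervisor call */
{
    (void)holder_vp;
    confer_calls++;
}

static void spin_wait_step(int holder_vp, unsigned holder_yield_count)
{
    if (holder_yield_count & 1)                /* holder not running in the OS */
        h_confer(holder_vp);
    /* else: holder is running; just keep spinning */
}
```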

2.4 Idle<br />

When the operating system no longer has active<br />

tasks to run and enters its idle loop, the<br />

h_cede interface is used to indicate to the hypervisor<br />

that the processor is available for other<br />

work. <strong>The</strong> operating system simply sets the<br />

VPA idle bit and calls h_cede. Under this<br />

call, the hypervisor is free to allocate the processor<br />

resources to another partition, or even to<br />

another virtual processor within the same partition.<br />

<strong>The</strong> processor is returned to the operating<br />

system if an external, decrementer (timer),<br />

or interprocessor interrupt occurs. As an alternative<br />

to sending an IPI, the ceded processor<br />

can be awoken by another processor calling the<br />

h_prod interface, which has slightly less overhead<br />

in this environment.<br />

Making use of the cede interface is especially<br />

important on systems where partitions configured<br />

to run uncapped exist. In uncapped mode,<br />

any physical processor cycles not used by other<br />

partitions can be allocated by the hypervisor to<br />

a non-idle partition, even if that partition has<br />

already consumed its defined quantity of processor<br />

units. For example, a partition that is<br />

defined as uncapped, 2 virtual processors, and<br />

20 processing units could consume 2 full processors<br />

(200 processing units), if all other partitions<br />

are idle.<br />
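The cede protocol reduces to a simple sequence, mocked here (h_cede below is a stand-in; the real hypervisor call does not return until an interrupt or h_prod wakes the virtual processor):<br />

```c
/* Toy VPA with just the idle bit the text describes. */
struct toy_vpa { int idle; };

static int ceded;                 /* counts mock h_cede invocations */
static void h_cede(void) { ceded++; }

static void idle_loop_once(struct toy_vpa *v)
{
    v->idle = 1;                  /* tell the hypervisor we have no work */
    h_cede();                     /* donate the processor */
    v->idle = 0;                  /* woken by interrupt or h_prod */
}
```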

2.5 SMT<br />

<strong>The</strong> POWER5 processor provides symmetric<br />

multithreading (SMT) capabilities that allow<br />

two threads of execution to simultaneously execute<br />

on one physical processor. This results<br />

in twice as many processor contexts being<br />

presented to the operating system as there<br />

are physical processors. Like other processor<br />

threading mechanisms found in POWER RS64<br />

and Intel® processors, the goal of SMT is to<br />

enable higher processor utilization.<br />

At <strong>Linux</strong> boot, each processor thread is discovered<br />

in the open firmware device tree<br />

and a logical processor is created for <strong>Linux</strong>.<br />

A command line option, smt-enabled =<br />

[on, off, dynamic], has been added to allow<br />

the <strong>Linux</strong> partition to config SMT in one<br />

of three states. <strong>The</strong> on and off modes indicate<br />

that the processor always runs with SMT<br />

either on or off. <strong>The</strong> dynamic mode allows<br />

the operating system and firmware to dynamically<br />

configure the processor to switch between<br />

threaded (SMT) and a single threaded<br />

(ST) mode where one of the processor threads<br />

is dormant. <strong>The</strong> hardware implementation is<br />

such that running in ST mode can provide additional<br />

performance when only a single task is<br />

executing.<br />

<strong>Linux</strong> can cause the processor to switch between<br />

SMT and ST modes via the h_cede hypervisor<br />

call interface. When entering its idle<br />

loop, <strong>Linux</strong> sets the VPA idle state bit, and after<br />

a selectable delay, calls h_cede. Under<br />

this interface, the hypervisor layer determines<br />

if only one thread is idle, and if so, switches<br />

the processor into ST mode. If both threads are



idle (as indicated by the VPA idle bit), then the<br />

hypervisor keeps the processor in SMT mode<br />

and returns to the operating system.<br />

<strong>The</strong> processor switches back to SMT mode<br />

if an external or decrementer interrupt is presented,<br />

or if another processor calls the h_<br />

prod interface against the dormant thread.<br />

3 Memory Virtualization<br />

Memory is virtualized only to the extent that all<br />

partitions on the system are presented a contiguous<br />

range of logical addresses that start<br />

at zero. <strong>Linux</strong> sees these logical addresses<br />

as its real storage. <strong>The</strong> actual real memory<br />

is allocated by the hypervisor from any available<br />

space throughout the system, managing<br />

the storage in logical memory blocks (LMB’s).<br />

Each LMB is presented to the partition via<br />

a memory node in the open firmware device<br />

tree. When <strong>Linux</strong> creates a mapping in the<br />

hardware page table for effective addresses, it<br />

makes a call to the hypervisor (h_enter) indicating<br />

the effective and partition logical address.<br />

<strong>The</strong> hypervisor translates the logical address<br />

to the corresponding real address and inserts<br />

the mapping into the hardware page table.<br />
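A toy model of this h_enter flow (the LMB table, sizes, and page-table shape are all invented; only the division of labour — the hypervisor, not the OS, turns partition-logical addresses into real ones — matches the text):<br />

```c
#include <stdint.h>

#define DEMO_LMB_SIZE 0x1000000UL     /* pretend LMBs are 16 MB */

/* Logical LMB index -> real base address (hypervisor-private table). */
static const uintptr_t lmb_real_base[] = {
    0x40000000UL, 0x70000000UL, 0x20000000UL,
};

static uintptr_t hw_pte[8];           /* toy hardware page table */

/* Mock of h_enter: the OS supplies a logical address; the hypervisor
 * translates it and writes the hardware PTE itself. */
static void demo_h_enter(int slot, uintptr_t logical)
{
    uintptr_t real = lmb_real_base[logical / DEMO_LMB_SIZE]
                   + (logical % DEMO_LMB_SIZE);
    hw_pte[slot] = real;
}
```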

<strong>One</strong> additional layer of memory virtualization<br />

managed by the hypervisor is a real mode offset<br />

(RMO) region. This is a 128 or 256 MB region<br />

of memory covering the first portion of the<br />

logical address space within a partition. It can<br />

be accessed by <strong>Linux</strong> when address relocation<br />

is off, for example after an exception occurs.<br />

When a partition is running relocation off and<br />

accesses addresses within the RMO region, a<br />

simple offset is added by the hardware to generate<br />

the actual storage access. In this manner,<br />

each partition has what it considers logical address<br />

zero.<br />
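The RMO addressing rule is a one-line computation (base and size below are invented; real partitions get a 128 or 256 MB region at a hypervisor-chosen real base):<br />

```c
#include <stdint.h>

#define DEMO_RMO_BASE 0x10000000UL   /* made-up partition start in real storage */
#define DEMO_RMO_SIZE 0x08000000UL   /* 128 MB RMO region */

/* With relocation off, hardware adds the partition's real mode offset to
 * every access inside the RMO region, so each partition sees logical
 * address zero.  Returns -1 for addresses outside the RMO. */
static intptr_t rmo_translate(uintptr_t logical)
{
    if (logical >= DEMO_RMO_SIZE)
        return -1;
    return (intptr_t)(DEMO_RMO_BASE + logical);
}
```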

4 I/O Virtualization<br />

Once CPU and memory have been virtualized,<br />

a key requirement is to provide virtualized I/O.<br />

<strong>The</strong> goal of the POWER5 systems is to have,<br />

for example, 10 <strong>Linux</strong> images running on a<br />

small system with a single CPU, 1GB of memory,<br />

and a single SCSI adapter and Ethernet<br />

adapter.<br />

<strong>The</strong> approach taken to virtualize I/O is a cooperative<br />

implementation between the hypervisor<br />

and the operating system images. <strong>One</strong> operating<br />

system image always “owns” physical<br />

adapters and manages all I/O to those adapters<br />

(DMA, interrupts, etc.).<br />

<strong>The</strong> hypervisor and Open Firmware then provide<br />

“virtual” adapters to any operating systems<br />

that require them. Creation of virtual<br />

adapters is done by the system administrator<br />

as part of logically partitioning the system. A<br />

key concept is that these virtual adapters do not<br />

interact in any way with the physical adapters.<br />

<strong>The</strong> virtual adapters interact with other operating<br />

systems in other logical partitions, which<br />

may choose to make use of physical adapters.<br />

Virtual adapters are presented to the operating<br />

system in the Open Firmware device tree just<br />

as physical adapters are. <strong>The</strong>y have very similar<br />

attributes as physical adapters, including<br />

DMA windows and interrupts.<br />

<strong>The</strong> adapters currently supported by the hypervisor<br />

are virtual SCSI adapters, virtual Ethernet<br />

adapters, and virtual TTY adapters.<br />

4.1 Virtual Bus<br />

Virtual adapters, of course, exist on a virtual<br />

bus. <strong>The</strong> bus has slots into which virtual<br />

adapters are configured. <strong>The</strong> number of slots<br />

available on the virtual bus is configured by<br />

the system administrator. <strong>The</strong> goal is to make



the behavior of virtual adapters consistent with<br />

physical adapters. <strong>The</strong> virtual bus is not presented<br />

as a PCI bus, but rather as its own bus<br />

type.<br />

4.2 Virtual LAN<br />

Virtual LAN adapters are conceptually the simplest<br />

kind of virtual adapter. <strong>The</strong> hypervisor<br />

implements a switch, which supports 802.1Q<br />

semantics for having multiple VLANs share<br />

a physical switch. Adapters can be marked<br />

as 802.1Q aware, in which case the hypervisor<br />

expects the operating system to handle the<br />

802.1Q VLAN headers, or 802.1Q unaware, in<br />

which case the hypervisor connects the adapter<br />

to a single VLAN. Multiple virtual Ethernet<br />

adapters can be created for a given partition.<br />

Virtual Ethernet adapters have an additional attribute<br />

called “Trunk Adapter.” An adapter<br />

marked as a Trunk Adapter will be delivered<br />

all frames that don’t match any MAC address<br />

on the virtual Ethernet. This is similar, but<br />

not identical, to promiscuous mode on a real<br />

adapter.<br />
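
The frame-delivery rule just described can be sketched in a few lines of C. This is purely an illustrative model: the structures and the deliver() helper are our own invention for this sketch, not the hypervisor's actual implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the virtual Ethernet switch rule: a frame is
 * delivered to the adapter whose MAC matches its destination; frames
 * matching no adapter fall through to the Trunk Adapter, if any. */
struct vadapter {
    uint8_t mac[6];
    int     is_trunk;
};

static struct vadapter *deliver(struct vadapter *ports, size_t n,
                                const uint8_t dst[6])
{
    struct vadapter *trunk = NULL;
    for (size_t i = 0; i < n; i++) {
        if (memcmp(ports[i].mac, dst, 6) == 0)
            return &ports[i];          /* exact MAC match wins */
        if (ports[i].is_trunk)
            trunk = &ports[i];         /* remember the trunk port */
    }
    return trunk;                      /* unmatched frames go to the trunk */
}
```

Unlike true promiscuous mode, the trunk port sees only the frames that matched nobody else, which is exactly what a bridging or routing partition needs.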

For a logical partition to have network connectivity<br />

to the outside world, the partition owning<br />

a “real” network adapter generally has both<br />

the real Ethernet adapter and a virtual Ethernet<br />

adapter marked as a Trunk adapter. That<br />

partition then performs either routing or bridging<br />

between the real adapter and the virtual<br />

adapter. <strong>The</strong> <strong>Linux</strong> bridge-utils package works<br />

well to bridge the two kinds of networks.<br />

Note that there is no architected link between<br />

the real and virtual adapters; it is the responsibility<br />

of some operating system to route traffic<br />

between them.<br />

<strong>The</strong> implementation of the virtual Ethernet<br />

adapters involves a number of hypervisor interfaces.<br />

Some of the more significant interfaces<br />

are h_register_logical_lan to establish<br />

the initial link between a device driver and<br />

a virtual Ethernet device, h_send_logical_<br />

lan to send a frame, and h_add_logical_<br />

lan_buffer to tell the hypervisor about a<br />

data buffer into which a received frame is to be<br />

placed. <strong>The</strong> hypervisor interfaces then support<br />

either polled or interrupt driven notification of<br />

new frames arriving.<br />

For additional information on the virtual Ethernet<br />

implementation, the code is the documentation<br />

(drivers/net/ibmveth.c).<br />

4.3 Virtual SCSI<br />

Unlike virtual Ethernet adapters, virtual SCSI<br />

adapters come in two flavors. A “client” virtual<br />

SCSI adapter behaves just as a regular<br />

SCSI host bus adapter and is implemented<br />

within the SCSI framework of the <strong>Linux</strong> kernel.<br />

<strong>The</strong> SCSI mid-layer issues standard SCSI<br />

commands such as Inquiry to determine devices<br />

connected to the adapter, and issues regular<br />

SCSI operations to those devices.<br />

A “server” virtual SCSI adapter, generally in a<br />

different partition than the client, receives all<br />

the SCSI commands from the client and is responsible<br />

for handling them. <strong>The</strong> hypervisor<br />

is not involved in what the server does with<br />

the commands. <strong>The</strong>re is no requirement for<br />

the server to link a virtual SCSI adapter to any<br />

kind of real adapter. <strong>The</strong> server can process<br />

and return SCSI responses in any fashion it<br />

likes. If it happens to issue I/O operations to a<br />

real adapter as part of satisfying those requests,<br />

that is an implementation detail of the operating<br />

system containing the server adapter.<br />

<strong>The</strong> hypervisor provides two very primitive<br />

interpartition communication mechanisms on<br />

which the virtual SCSI implementation is built.<br />

<strong>The</strong>re is a queue of 16-byte messages referred<br />

to as a “Command/Response Queue” (CRQ).<br />

Each partition provides the hypervisor with a


page of memory where its receive queue resides,<br />

and a partition wishing to send a message<br />

to its partner’s queue issues an h_send_crq<br />

hypervisor call. When a message is received<br />

on the queue, an interrupt is (optionally) generated<br />

in the receiving partition.<br />
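
A toy model of the CRQ mechanism is sketched below. The queue layout, the valid-byte convention, and the crq_receive() helper are assumptions made for illustration; only the h_send_crq name comes from the text, and the real firmware interface differs in detail.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CRQ_MSG_SIZE  16
#define CRQ_NUM_SLOTS (4096 / CRQ_MSG_SIZE)  /* one page of 16-byte slots */

/* Hypothetical layout: the first byte of each message doubles as a
 * "valid" flag; a zero first byte means the slot is free. */
struct crq_msg {
    uint8_t valid;
    uint8_t data[CRQ_MSG_SIZE - 1];
};

struct crq {
    struct crq_msg slot[CRQ_NUM_SLOTS];
    unsigned int head;                   /* next slot the receiver reads */
    unsigned int tail;                   /* next slot a sender fills */
};

/* Stand-in for the h_send_crq hypervisor call: deliver one 16-byte
 * message into the partner partition's receive queue. */
static int h_send_crq(struct crq *partner, const uint8_t msg[CRQ_MSG_SIZE])
{
    struct crq_msg *s = &partner->slot[partner->tail];
    if (s->valid)
        return -1;                       /* partner's queue is full */
    memcpy(s->data, msg + 1, CRQ_MSG_SIZE - 1);
    s->valid = msg[0];                   /* published last: marks slot ready */
    partner->tail = (partner->tail + 1) % CRQ_NUM_SLOTS;
    return 0;
}

/* Receive side: drain one message, as the interrupt handler might. */
static int crq_receive(struct crq *q, uint8_t msg[CRQ_MSG_SIZE])
{
    struct crq_msg *s = &q->slot[q->head];
    if (!s->valid)
        return -1;                       /* nothing pending */
    msg[0] = s->valid;
    memcpy(msg + 1, s->data, CRQ_MSG_SIZE - 1);
    s->valid = 0;                        /* free the slot */
    q->head = (q->head + 1) % CRQ_NUM_SLOTS;
    return 0;
}
```

In the real system the queue page is registered with the hypervisor ahead of time, and arrival of a message can optionally raise an interrupt in the receiving partition.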

<strong>The</strong> second hypervisor mechanism is a facility<br />

for issuing DMA operations between partitions.<br />

<strong>The</strong> h_copy_rdma call is used to<br />

DMA a block of memory from the memory<br />

space of one logical partition to the memory<br />

space of another.<br />
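
The semantics of such an interpartition copy can be modeled trivially. In the sketch below each "partition" is just a byte array and the addresses are offsets; this is an assumption for illustration, since the real h_copy_rdma call operates on DMA-mapped addresses and returns hypervisor status codes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for the h_copy_rdma hypervisor call: copy `len` bytes
 * from one partition's memory space to another's, with a bounds check
 * standing in for the hypervisor's validation of DMA windows. */
static int h_copy_rdma(size_t len,
                       const uint8_t *src_part, size_t src_off,
                       uint8_t *dst_part, size_t dst_off,
                       size_t part_size)
{
    if (src_off + len > part_size || dst_off + len > part_size)
        return -1;                    /* refuse out-of-bounds transfers */
    memcpy(dst_part + dst_off, src_part + src_off, len);
    return 0;
}
```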

<strong>The</strong> virtual SCSI interpartition protocol is<br />

implemented using the ANSI “SCSI RDMA<br />

Protocol” (SRP) (available at http://www.<br />

t10.org). When the client wishes to issue a<br />

SCSI operation, it builds an SRP frame, and<br />

sends the address of the frame in a 16-byte<br />

CRQ message. <strong>The</strong> server DMAs the SRP<br />

frame from the client, and processes it. <strong>The</strong><br />

SRP frame may itself contain DMA addresses<br />

required for data transfer (read or write buffers,<br />

for example) which may require additional interpartition<br />

DMA operations. When the operation<br />

is complete, the server DMAs the SRP<br />

response back to the same location as the SRP<br />

command came from, and sends a 16-byte CRQ<br />

message back indicating that the SCSI command<br />

has completed.<br />

<strong>The</strong> current <strong>Linux</strong> virtual SCSI server decodes<br />

incoming SCSI commands and issues<br />

block layer commands (generic_make_<br />

request). This allows the SCSI server to<br />

share any block device (e.g., /dev/sdb6 or<br />

/dev/loop0) with client partitions as a virtual<br />

SCSI device.<br />

Note that consideration was given to using protocols<br />

such as iSCSI for device sharing between<br />

partitions. <strong>The</strong> virtual SCSI SRP design<br />

above, however, is much simpler, as it<br />

does not rely on riding above an existing<br />

IP stack. Additionally, the ability to use DMA<br />

operations between partitions fits much better<br />

into the SRP model than an iSCSI model.<br />

<strong>The</strong> <strong>Linux</strong> virtual SCSI client (drivers/<br />

scsi/ibmvscsi/ibmvscsi.c) is close, at<br />

the time of writing, to being accepted into the<br />

<strong>Linux</strong> mainline. <strong>The</strong> <strong>Linux</strong> virtual SCSI server<br />

is sufficiently unlike existing SCSI drivers that<br />

it will require much more mailing list “discussion.”<br />

4.4 Virtual TTY<br />

In addition to virtual Ethernet and SCSI<br />

adapters, the hypervisor supports virtual serial<br />

(TTY) adapters. As with SCSI adapters, these<br />

can be configured as “client” and “server”<br />

adapters and connected between partitions.<br />

<strong>The</strong> first virtual TTY adapter is used as<br />

the system console, and is treated specially by<br />

the hypervisor. It is automatically connected to<br />

the partition console on the Hardware Management<br />

Console.<br />

To date, multiple concurrent “consoles” have<br />

not been implemented, but they could be. Similarly,<br />

this interface could be used for kernel<br />

debugging as with any serial port, but such an<br />

implementation has not been done.<br />

5 Dynamic Resource Movement<br />

As mentioned for processors, the logical partition<br />

environment lends itself to moving resources<br />

(processors, memory, I/O) between<br />

partitions. In a perfect world, such movement<br />

should be done dynamically while the operating<br />

system is running. Dynamic movement of<br />

processors is currently being implemented, and<br />

dynamic movement of I/O devices (including<br />

dynamically adding and removing virtual I/O<br />

devices) is included in the kernel mainline.<br />

<strong>The</strong> one area for future work in <strong>Linux</strong> is the dynamic<br />

movement of memory into and out of an


active partition. This function is already supported<br />

on other POWER5 operating systems,<br />

so there is an opportunity for <strong>Linux</strong> to catch<br />

up.<br />

6 Multiple Operating Systems<br />

A key feature of the POWER5 systems is the<br />

ability to run different operating systems in<br />

different logical partitions on the same physical<br />

system. <strong>The</strong> operating systems currently<br />

supported on the POWER5 hardware are AIX,<br />

OS/400, and <strong>Linux</strong>.<br />

While running multiple operating systems, all<br />

of the functions for interpartition interaction described<br />

above must work between operating<br />

systems. For example, idle cycles from an AIX<br />

partition can be given to <strong>Linux</strong>. A processor<br />

can be moved from OS/400 to <strong>Linux</strong> while<br />

both operating systems are active.<br />

For I/O, multiple operating systems must be<br />

able to communicate over the virtual Ethernet,<br />

and SCSI devices must be sharable from (say)<br />

an AIX virtual SCSI server to a <strong>Linux</strong> virtual<br />

SCSI client.<br />

<strong>The</strong>se requirements, along with the architected<br />

hypervisor interfaces, limit the ability to<br />

change implementations just to fit <strong>Linux</strong><br />

kernel-internal behavior.<br />

7 Conclusions<br />

While many of the basic virtualization technologies<br />

described in this paper existed in the<br />

<strong>Linux</strong> implementation provided on POWER<br />

RS64 and POWER4 iSeries systems [Bou01],<br />

they have been significantly enhanced for<br />

POWER5 to better use the firmware provided<br />

interfaces.<br />

<strong>The</strong> introduction of POWER5-based systems<br />

converged all of the virtualization interfaces<br />

provided by firmware on legacy iSeries and<br />

pSeries systems to a model in line with the<br />

legacy pSeries partitioned system architecture.<br />

As a result, much of the PPC64 <strong>Linux</strong> virtualization<br />

code was updated to use these new virtualization<br />

interface definitions.<br />

8 Acknowledgments<br />

<strong>The</strong> authors would like to thank the entire<br />

<strong>Linux</strong>/PPC64 team for the work that went into<br />

the POWER5 virtualization effort. In particular<br />

Anton Blanchard, Paul Mackerras, Rusty<br />

Russell, Hollis Blanchard, Santiago Leon, Ryan<br />

Arnold, Will Schmidt, Colin Devilbiss, Kyle<br />

Lucke, Mike Corrigan, Jeff Scheel, and David<br />

Larson.<br />

9 Legal Statement<br />

This paper represents the view of the authors, and<br />

does not necessarily represent the view of IBM.<br />

IBM, AIX, iSeries, OS/400, POWER, POWER4,<br />

POWER5, and pSeries are trademarks or registered<br />

trademarks of International Business Machines<br />

Corporation in the United States, other countries,<br />

or both.<br />

Other company, product or service names may be<br />

trademarks or service marks of others.<br />

References<br />

[AAN00] Bill Armstrong, Troy Armstrong,<br />

Naresh Nayar, Ron Peterson, Tom Sand,<br />

and Jeff Scheel. Logical Partitioning,<br />

http://www-1.ibm.com/servers/<br />

eserver/iseries/beyondtech/<br />

lpar.htm.


[Bou01] David Boutcher, <strong>The</strong> iSeries <strong>Linux</strong><br />

<strong>Kernel</strong> 2001 <strong>Linux</strong> Symposium, (July<br />

2001).


<strong>The</strong> State of ACPI in the <strong>Linux</strong> <strong>Kernel</strong><br />

A. Leonard Brown<br />

Intel<br />

len.brown@intel.com<br />

Abstract<br />

ACPI puts <strong>Linux</strong> in control of configuration<br />

and power management. It abstracts the platform<br />

BIOS and hardware so <strong>Linux</strong> and the<br />

platform can interoperate while evolving independently.<br />

This paper starts with some background on the<br />

ACPI specification, followed by the state of<br />

ACPI deployment on <strong>Linux</strong>.<br />

It describes the implementation architecture of<br />

ACPI on <strong>Linux</strong>, followed by details on the configuration<br />

and power management features.<br />

It closes with a summary of ACPI bugzilla activity,<br />

and a list of what is next for ACPI in<br />

<strong>Linux</strong>.<br />

1 ACPI Specification Background<br />

“ACPI (Advanced Configuration and<br />

Power Interface) is an open industry<br />

specification co-developed by<br />

Hewlett-Packard, Intel, Microsoft,<br />

Phoenix, and Toshiba.<br />

ACPI establishes industry-standard<br />

interfaces for OS-directed configuration<br />

and power management on laptops,<br />

desktops, and servers.<br />

ACPI evolves the existing collection<br />

of power management BIOS<br />

code, Advanced Power Management<br />

(APM) application programming<br />

interfaces (APIs), PNPBIOS<br />

APIs, Multiprocessor Specification<br />

(MPS) tables and so on into a well-defined<br />

power management and configuration<br />

interface specification.” 1<br />

ACPI 1.0 was published in 1996. ACPI 2.0 added<br />

64-bit support in 2000. ACPI 3.0 is expected<br />

in summer 2004.<br />

2 <strong>Linux</strong> ACPI Deployment<br />

<strong>Linux</strong> supports ACPI on three architectures:<br />

ia64, i386, and x86_64.<br />

2.1 ia64 <strong>Linux</strong>/ACPI support<br />

Most ia64 platforms require ACPI support,<br />

as they do not have the legacy configuration<br />

methods seen on i386. All the <strong>Linux</strong> distributions<br />

that support ia64 include ACPI support,<br />

whether they’re based on <strong>Linux</strong>-2.4 or <strong>Linux</strong>-<br />

2.6.<br />

2.2 i386 <strong>Linux</strong>/ACPI support<br />

Not all <strong>Linux</strong>-2.4 distributions enabled ACPI<br />

by default on i386. Often they used<br />

just enough table parsing to enable Hyper-<br />

Threading (HT), ala acpi=ht below, and relied<br />

on MPS and PIRQ routers to configure the<br />

1 http://www.acpi.info


setup_arch()<br />

dmi_scan_machine()<br />

Scan DMI blacklist<br />

BIOS Date vs Jan 1, 2001<br />

acpi_boot_init()<br />

acpi_table_init()<br />

locate and checksum all ACPI tables<br />

print table headers to console<br />

acpi_blacklisted()<br />

ACPI table headers vs. blacklist<br />

parse(BOOT) /* Simple Boot Flags */<br />

parse(FADT) /* PM timer address */<br />

parse(MADT) /* LAPIC, IOAPIC */<br />

parse(HPET) /* HiPrecision Timer */<br />

parse(MCFG) /* PCI Express base */<br />

Figure 1: Early ACPI init on i386<br />

machine. Some included ACPI support by default,<br />

but required the user to add acpi=on to<br />

the cmdline to enable it.<br />

So far, the major <strong>Linux</strong> 2.6 distributions all<br />

support ACPI enabled by default on i386.<br />

Several methods are used to make it more practical<br />

to deploy ACPI onto the i386 installed base.<br />

Figure 1 shows the early ACPI startup on the<br />

i386 and where these methods hook in.<br />

1. Most modern system BIOSes support DMI,<br />

which exports the date of the BIOS. <strong>The</strong> <strong>Linux</strong><br />

DMI scan on i386 disables ACPI on platforms<br />

with a BIOS older than January 1, 2001.<br />

<strong>The</strong>re is nothing magic about this date, except<br />

that it allowed developers to focus on recent<br />

platforms without getting distracted debugging<br />

issues on very old platforms that:<br />

(a) had been running <strong>Linux</strong> without ACPI<br />

support for years.<br />

(b) had virtually no chance of a BIOS<br />

update from the OEM.<br />

<strong>The</strong> boot parameter acpi=force is available<br />

to enable ACPI on platforms older than the<br />

cutoff date.<br />

2. DMI also exports the hardware manufacturer,<br />

baseboard name, BIOS version, etc., which<br />

you can observe with dmidecode. 2<br />

dmi_scan.c has a general-purpose blacklist<br />

that keys off this information and invokes<br />

various platform-specific workarounds.<br />

acpi=off is the most severe: it disables all<br />

ACPI support, even the simple table parsing<br />

needed to enable Hyper-Threading (HT).<br />

acpi=ht does the same, except it parses<br />

enough tables to enable HT. pci=noacpi<br />

disables ACPI for PCI enumeration and interrupt<br />

configuration. And acpi=noirq disables<br />

ACPI just for interrupt configuration.<br />

3. <strong>The</strong> ACPI tables also contain header<br />

information, which you see near the top of the<br />

kernel messages. ACPI maintains a blacklist<br />

based on the table headers, but this blacklist<br />

is somewhat primitive. When an entry matches<br />

the system, it either prints warnings or<br />

invokes acpi=off.<br />

All three of these methods share the problem<br />

that if they are successful, they tend to hide<br />

root-cause issues in <strong>Linux</strong> that should be fixed.<br />

For this reason, adding to the blacklists is discouraged<br />

in the upstream kernel. <strong>The</strong>ir main<br />

value is to allow <strong>Linux</strong> distributors to quickly<br />

react to deployment issues when they need to<br />

support deviant platforms.<br />

2.3 x86_64 <strong>Linux</strong>/ACPI support<br />

All x86_64 platforms I’ve seen include ACPI<br />

support. <strong>The</strong> major x86_64 <strong>Linux</strong> distributions,<br />

whether <strong>Linux</strong>-2.4 or <strong>Linux</strong>-2.6 based,<br />

all support ACPI.<br />

2 http://www.nongnu.org/dmidecode


3 Implementation Overview<br />

<strong>The</strong> ACPI specification describes platform registers,<br />

ACPI tables, and operation of the ACPI<br />

BIOS. Figure 2 shows these ACPI components<br />

logically as a layer above the platform specific<br />

hardware and firmware.<br />

<strong>The</strong> ACPI kernel support centers around the<br />

ACPICA (ACPI Component Architecture 3 )<br />

core. ACPICA includes the AML 4 interpreter<br />

that implements ACPI’s hardware abstraction.<br />

ACPICA also implements other OS-agnostic<br />

parts of the ACPI specification. <strong>The</strong> ACPICA<br />

code does not implement any policy, that is the<br />

realm of the <strong>Linux</strong>-specific code. A single file,<br />

osl.c, glues ACPICA to the <strong>Linux</strong>-specific<br />

functions it requires.<br />

<strong>The</strong> box in Figure 2 labeled “<strong>Linux</strong>/ACPI” represents<br />

the <strong>Linux</strong>-specific ACPI code, including<br />

boot-time configuration.<br />

Optional “ACPI drivers,” such as Button, Battery,<br />

Processor, etc. are (optionally loadable)<br />

modules that implement policy related to those<br />

specific features and devices.<br />

3.1 Events<br />

ACPI registers for a “System Control Interrupt”<br />

(SCI) and all ACPI events come through<br />

that interrupt.<br />

<strong>The</strong> kernel interrupt handler de-multiplexes the<br />

possible events using ACPI constructs. In<br />

some cases, it then delivers events to a userspace<br />

application such as acpid via /proc/<br />

acpi/events.<br />

3 http://www.intel.com/technology/<br />

iapc/acpi<br />

4 AML, ACPI Machine Language.<br />

Figure 2: Implementation Architecture<br />

4 ACPI Configuration<br />

Interrupt configuration on i386 dominated the<br />

ACPI bug fixing activity over the last year.<br />

<strong>The</strong> algorithm to configure interrupts on an<br />

i386 system with an IOAPIC is shown in Figure<br />

3. ACPI mandates that all PIC mode IRQs<br />

be identity mapped to IOAPIC pins. Exceptions<br />

are specified in MADT 5 interrupt source<br />

override entries.<br />

Overrides are often used, for example, to specify<br />

that the 8254 timer on IRQ0 in PIC mode<br />

does not use pin0 on the IOAPIC, but uses<br />

pin2. Overrides also often move the ACPI SCI<br />

to a different pin in IOAPIC mode than it had<br />

in PIC mode, or change its polarity or trigger<br />

from the default.<br />

5 MADT, Multiple APIC Description Table.


setup_arch()<br />

acpi_boot_init()<br />

parse(MADT);<br />

parse(LAPIC); /* processors */<br />

parse(IOAPIC)<br />

parse(INT_SRC_OVERRIDE);<br />

add_identity_legacy_mappings();<br />

/* mp_irqs[] initialized */<br />

init()<br />

smp_boot_cpus()<br />

setup_IO_APIC()<br />

enable_IO_APIC();<br />

setup_IO_APIC_irqs(); /* mp_irqs[] */<br />

do_initcalls()<br />

acpi_init()<br />

"ACPI: Subsystem revision 20040326"<br />

acpi_initialize_subsystem();<br />

/* AML interpreter */<br />

acpi_load_tables(); /* DSDT */<br />

acpi_enable_subsystem();<br />

/* HW into ACPI mode */<br />

"ACPI: Interpreter enabled"<br />

acpi_bus_init_irq();<br />

AML(_PIC, PIC | IOAPIC | IOSAPIC);<br />

acpi_pci_link_init()<br />

for(every PCI Link in DSDT)<br />

acpi_pci_link_add(Link)<br />

AML(_PRS, Link);<br />

AML(_CRS, Link);<br />

"... Link [LNKA] (IRQs 9 10 *11)"<br />

pci_acpi_init()<br />

"PCI: Using ACPI for IRQ routing"<br />

acpi_irq_penalty_init();<br />

for (PCI devices)<br />

acpi_pci_irq_enable(device)<br />

acpi_pci_irq_lookup()<br />

find _PRT entry<br />

if (Link) {<br />

acpi_pci_link_get_irq()<br />

acpi_pci_link_allocate()<br />

examine possible & current IRQs<br />

AML(_SRS, Link)<br />

} else {<br />

use hard-coded IRQ in _PRT entry<br />

}<br />

acpi_register_gsi()<br />

mp_register_gsi()<br />

io_apic_set_pci_routing()<br />

"PCI: PCI interrupt 00:06.0[A] -><br />

GSI 26 (level, low) -> IRQ 26"<br />

Figure 3: Interrupt Initialization<br />

So after identifying that the system will be in<br />

IOAPIC mode, the first step is to record all the<br />

Interrupt Source Overrides in mp_irqs[].<br />

<strong>The</strong> second step is to add the legacy identity<br />

mappings where pins and IRQs have not been<br />

consumed by the over-rides.<br />

Step three is to digest mp_irqs[] in<br />

setup_IO_APIC_irqs(), just like it<br />

would be if the system were running in legacy<br />

MPS mode.<br />

But that is just the start of interrupt configuration<br />

in ACPI mode. <strong>The</strong> system still needs<br />

to enable the mappings for PCI devices, which<br />

are stored in the DSDT 6 _PRT 7 entries. Further,<br />

the _PRT can contain both static entries,<br />

analogous to MPS table entries, or it can contain<br />

dynamic _PRT entries that use PCI Interrupt<br />

Link Devices.<br />

So <strong>Linux</strong> enables the AML interpreter and informs<br />

the ACPI BIOS that it plans to run the<br />

system in IOAPIC mode.<br />

Next the PCI Interrupt Link Devices are<br />

parsed. <strong>The</strong>se “links” are abstract versions of<br />

what used to be called PIRQ-routers, though<br />

they are more general. acpi_pci_link_<br />

init() searches the DSDT for Link Devices<br />

and queries each about the IRQs it can be set<br />

to (_PRS) 8 and the IRQ that it is already set to<br />

(_CRS). 9<br />

A penalty table is used to help decide how<br />

to program the PCI Interrupt Link Devices.<br />

Weights are statically compiled into the table<br />

to avoid programming the links to well<br />

known legacy IRQs. acpi_irq_penalty_<br />

init() updates the table to add penalties to<br />

the IRQs where the Links have possible set-<br />

6 DSDT, Differentiated Services Description Table,<br />

written in AML<br />

7 _PRT, PCI Routing Table<br />

8 PRS, Possible Resource Settings.<br />

9 CRS, Current Resource Settings.


tings. <strong>The</strong> idea is to minimize IRQ sharing,<br />

while not conflicting with legacy IRQ use.<br />

While it works reasonably well in practice, this<br />

heuristic is inherently flawed because it assumes<br />

the legacy IRQs rather than asking the<br />

DSDT what legacy IRQs are actually in use. 10<br />
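
A toy version of this penalty scheme might look like the following. The table weights and the choose_link_irq() helper name are invented for illustration and do not match the kernel's actual code.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Sketch of the penalty-table idea: given the IRQs a PCI Interrupt
 * Link Device reports as possible (_PRS), pick the one with the
 * lowest accumulated penalty, then raise that IRQ's penalty so the
 * next link tends to land elsewhere. */
#define MAX_IRQ 16

static int irq_penalty[MAX_IRQ];       /* pre-weighted against legacy IRQs */

static int choose_link_irq(const int *possible, size_t n)
{
    int best = -1, best_penalty = INT_MAX;
    for (size_t i = 0; i < n; i++) {
        int irq = possible[i];
        if (irq_penalty[irq] < best_penalty) {
            best = irq;
            best_penalty = irq_penalty[irq];
        }
    }
    if (best >= 0)
        irq_penalty[best]++;           /* discourage sharing next time */
    return best;
}
```

The flaw the text describes is visible here: the static weights encode an assumption about legacy IRQ usage instead of asking the DSDT which legacy IRQs are actually reserved.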

<strong>The</strong> PCI sub-system calls acpi_pci_irq_<br />

enable() for every device. ACPI looks up<br />

the device in the _PRT by device-id and if it<br />

a simple static entry, programs the IOAPIC.<br />

If it is a dynamic entry, acpi_pci_link_<br />

allocate() chooses an IRQ for the link and<br />

programs the link via AML (_SRS). 11 <strong>The</strong>n the<br />

associated IOAPIC entry is programmed.<br />

Later, the drivers initialize and call request_<br />

irq(IRQ) with the IRQ the PCI sub-system<br />

told them to request.<br />

<strong>One</strong> issue we have with this scheme is that it<br />

can’t automatically recover when the heuristic<br />

balancing act fails. For example, when the<br />

parallel port grabs IRQ7 and a PCI Interrupt<br />

Link gets programmed to the same IRQ, then<br />

request_irq(IRQ) correctly fails to put<br />

ISA and PCI interrupts on the same pin. But<br />

the system doesn’t realize that one of the contenders<br />

could actually be re-programmed to a<br />

different IRQ.<br />

<strong>The</strong> fix for this issue will be to delete the<br />

heuristic weights from the IRQ penalty table.<br />

Instead the kernel should scan the DSDT to<br />

enumerate exactly what legacy devices reserve<br />

exactly what IRQs. 12<br />

10 In PIC mode, the default is to keep the BIOS provided<br />

current IRQ setting, unless cmdline acpi_irq_<br />

balance is used. Balancing is always enabled in<br />

IOAPIC mode.<br />

11 SRS, Set Resource Setting<br />

12 bugzilla 2733<br />

4.1 Issues With PCI Interrupt Link Devices<br />

Most of the issues have been with PCI Interrupt<br />

Link Devices, an ACPI mechanism primarily<br />

used to replace the chip-set-specific Legacy<br />

PIRQ code.<br />

• <strong>The</strong> status (_STA) returned by a PCI Interrupt<br />

Link Device does not matter. Some<br />

systems mark the ones we should use as<br />

enabled, some do not.<br />

• <strong>The</strong> status set by <strong>Linux</strong> on a link is important<br />

on some chip sets. If we do<br />

not explicitly disable some unused links,<br />

they result in tying together IRQs and can<br />

cause spurious interrupts.<br />

• <strong>The</strong> current setting returned by a link<br />

(_CRS) can not always be trusted. Some<br />

systems return invalid settings always.<br />

<strong>Linux</strong> must assume that when it sets a<br />

link, the setting was successful.<br />

• Some systems return a current setting that<br />

is outside the list of possible settings. Per<br />

above, this must be ignored and a new setting<br />

selected from the possible-list.<br />

4.2 Issues With ACPI SCI Configuration<br />

Another area that was ironed out this year<br />

was the ACPI SCI (System Control Interrupt).<br />

Originally, the SCI was always configured as<br />

level/low, but SCI failures didn’t stop until<br />

we implemented the algorithm in Figure 4.<br />

During debugging, the kernel gained the cmdline<br />

option that applies to either PIC or IOAPIC<br />

mode: acpi_sci={level,edge,high,<br />

low}, but production systems seem to be working<br />

properly and this has seen use recently only<br />

to work around prototype BIOS bugs.


if (PIC mode) {<br />

set ELCR to level trigger();<br />

} else { /* IOAPIC mode */<br />

if (Interrupt Source Override) {<br />

Use IRQ specified in override<br />

if(trigger edge or level)<br />

use edge or level<br />

else (compatible trigger)<br />

use level<br />

}<br />

if (polarity high or low)<br />

use high or low<br />

else<br />

use low<br />

} else { /* no Override */<br />

use level-trigger<br />

use low-polarity<br />

}<br />

Figure 4: SCI configuration algorithm<br />

4.3 Unresolved: Local APIC Timer Issue<br />

<strong>The</strong> most troublesome configuration issue today<br />

is that many systems with no IO-APIC will<br />

hang during boot unless their LOCAL-APIC<br />

has been disabled, e.g., by booting nolapic.<br />

While this issue has gone away on several systems<br />

with BIOS upgrades, entire product lines<br />

from high-volume OEMs appear to be subject<br />

to this failure. <strong>The</strong> current workaround is to disable<br />

the LAPIC timer for the duration of the<br />

SMI-CMD update that enables ACPI mode. 13<br />

4.4 Wanted: Generic <strong>Linux</strong> Driver Manager<br />

<strong>The</strong> ACPI DSDT enumerates motherboard devices<br />

via PNP identifiers. This method is used<br />

to load the ACPI-specific devices today, e.g.,<br />

battery, button, fan, thermal, etc., as well as<br />

8250_acpi. PCI devices are enumerated via<br />

PCI-ids from PCI config space. Legacy devices<br />

probe out using hard-coded address values.<br />

But a device driver should not have to know or<br />

13 http://bugzilla.kernel.org 1269<br />

Figure 5: ACPI Global, CPU, and Sleep states.<br />

care how it is enumerated by its parent bus. An<br />

8250 driver should worry about the 8250 and<br />

not if it is being discovered by legacy means,<br />

ACPI enumeration, or PCI.<br />

<strong>One</strong> fix would be to abstract the PCI-ids,<br />

PNP-ids, and perhaps even some hard-coded<br />

values into a generic device manager directory<br />

that maps them to device drivers.<br />

This would simply add a veneer to the PCI<br />

device configuration, simplifying a very small<br />

number of drivers that can be configured by<br />

PCI or ACPI. However, it would also fix the<br />

real issue that the configuration information in<br />

the ACPI DSDT for most motherboard devices<br />

is currently not parsed and not communicated<br />

to any <strong>Linux</strong> drivers.<br />
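
Such a mapping directory could be as simple as a table from enumeration IDs to driver names, so that a driver need not care whether it was discovered via ACPI, PCI, or legacy probing. The sketch below is hypothetical; the structure, the lookup_driver() helper, and the specific bindings are examples, not a proposed kernel interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative generic device-manager table: map a PNP id or a
 * "vendor:device" PCI id to the single driver that handles it. */
struct dev_binding {
    const char *id;
    const char *driver;
};

static const struct dev_binding bindings[] = {
    { "PNP0501", "serial8250" },   /* 16550-compatible UART via ACPI */
    { "PNP0C0A", "battery" },      /* ACPI control-method battery */
    { "8086:24cb", "ata_piix" },   /* example PCI id binding */
};

static const char *lookup_driver(const char *id)
{
    for (size_t i = 0; i < sizeof(bindings) / sizeof(bindings[0]); i++)
        if (strcmp(bindings[i].id, id) == 0)
            return bindings[i].driver;
    return NULL;                   /* no driver claims this device */
}
```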

<strong>The</strong> device driver manager would also be<br />

able to tell the power management sub-system<br />

which methods are used to power-manage the<br />

device, e.g., PCI or ACPI.<br />

5 ACPI Power Management<br />

<strong>The</strong> Global System States defined by ACPI are<br />

illustrated in Figure 5. G0 is the working state,<br />

G1 is sleeping, G2 is soft-off and G3 is mechanical<br />

off. <strong>The</strong> “Legacy” state illustrates<br />

where the system is not in ACPI mode.


5.1 P-states<br />

In the context of G0 – Global Working State,<br />

and C0 – CPU Executing State, P-states (Performance<br />

states) are available to reduce power<br />

of the running processor. P-states simultaneously<br />

modulate both the MHz and the voltage.<br />

As power varies with the square of the voltage, P-states<br />

are extremely effective at saving power.<br />
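
The voltage-squared relationship is easy to see numerically. Dynamic CPU power scales roughly as P proportional to V² times f; the toy function below, with made-up example ratios, shows why dropping voltage and frequency together saves so much power.

```c
#include <assert.h>

/* Relative dynamic power after scaling frequency and voltage:
 * P_new / P_old = (f_new/f_old) * (V_new/V_old)^2.
 * Capacitance cancels out of the ratio. */
static double relative_power(double freq_ratio, double volt_ratio)
{
    return freq_ratio * volt_ratio * volt_ratio;
}
```

For example, a P-state that cuts frequency to 75% and voltage to 90% of nominal leaves only about 61% of the original dynamic power, a far better trade than the linear savings of throttling alone.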

While P-states are extremely important, the<br />

cpufreq sub-system handles P-states on a<br />

number of different platforms, and the topic is<br />

best addressed in that larger context.<br />

5.2 Throttling<br />

In the context of the G0-Working, C0-<br />

Executing state, Throttling states are defined to<br />

modulate the frequency of the running processor.<br />

Power varies (almost) directly with MHz, so<br />

when the MHz is cut in half, so is the power.<br />

Unfortunately, so is the performance.<br />

<strong>Linux</strong> currently uses Throttling only in response<br />

to thermal events where the processor<br />

is too hot. However, in the future, <strong>Linux</strong> could<br />

add throttling when the processor is already in<br />

the lowest P-state to save additional power.<br />

Note that most processors also include a<br />

backup <strong>The</strong>rmal Monitor throttling mechanism<br />

in hardware, set with higher temperature<br />

thresholds than ACPI throttling. Most processors<br />

also have in hardware a thermal emergency<br />

shutdown mechanism.<br />

5.3 C-states<br />

In the context of G0 Working system state, C-<br />

state (CPU-state) C0 is used to refer to the executing<br />

state. Higher number C-states are entered<br />

to save successively more power when<br />

the processor is idle. No instructions are executed<br />

when in C1, C2, or C3.<br />

ACPI replaces the default idle loop so it can<br />

enter C1, C2 or C3. <strong>The</strong> deeper the C-state,<br />

the more power savings, but the higher the latency<br />

to enter/exit the C-state. You can observe<br />

the C-states supported by the system and<br />

the success at using them in /proc/acpi/<br />

processor/CPU0/power<br />

C1 is included in every processor and has<br />

negligible latency. C1 is implemented with<br />

the HALT or MONITOR/MWAIT instructions.<br />

Any interrupt will automatically wake the processor<br />

from C1.<br />

C2 has higher latency (though always under<br />

100 usec) and higher power savings than C1.<br />

It is entered through writes to ACPI registers<br />

and exits automatically with any interrupt.<br />

C3 has higher latency (though always under<br />

1000 usec) and higher power savings than C2.<br />

It is entered through writes to ACPI registers<br />

and exits automatically with any interrupt or<br />

bus master activity. <strong>The</strong> processor does not<br />

snoop its cache when in C3, which is why bus-master<br />

(DMA) activity will wake it up. <strong>Linux</strong><br />

sees several implementation issues with C3 today:<br />

1. C3 is enabled even if the latency is up to<br />

1000 usec. This compares with the <strong>Linux</strong><br />

2.6 clock tick rate of 1000Hz = 1ms =<br />

1000usec. So when a clock tick causes<br />

C3 to exit, it may take all the way to the<br />

next clock tick to execute the next kernel<br />

instruction. So the benefit of C3 is lost<br />

because the system effectively pays C3 latency<br />

and gets negligible C3 residency to<br />

save power.<br />

2. Some devices do not tolerate the DMA<br />

latency introduced by C3. <strong>The</strong>ir device<br />

buffers underrun or overflow. This is currently<br />

an issue with the ipw2100 WLAN<br />

NIC.<br />

3. Some platforms can lie about C3 latency<br />

and transparently put the system into a<br />

higher latency C4 when we ask for C3—<br />

particularly when running on batteries.<br />

4. Many processors halt their local APIC<br />

timer (a.k.a. TSC – Time Stamp Counter)<br />

when in C3. You can observe this<br />

by watching LOC fall behind IRQ0 in<br />

/proc/interrupts.<br />

5. USB makes it virtually impossible to enter<br />

C3 because of constant bus master activity.<br />

<strong>The</strong> workaround at the moment is<br />

to unplug your USB devices when idle.<br />

Longer term, it will take enhancements<br />

to the USB sub-system to address this issue.<br />

I.e., USB software needs to recognize<br />

when devices are present but idle, and reduce<br />

the frequency of bus master activity.<br />
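The clock-tick arithmetic behind issue 1 in the list above is easy to check. The following is back-of-the-envelope modeling only, not kernel code:<br />

```c
/* Back-of-the-envelope model of issue 1: with a 1000 Hz tick, the
 * idle window between ticks is at most ~1000 usec.  If C3 entry/exit
 * latency can also reach 1000 usec, essentially no time is left for
 * actual low-power residency. */
#include <assert.h>

/* Idle time actually spent in the C-state once entry/exit latency is
 * paid; zero means the deep C-state bought nothing. */
long effective_residency_us(long idle_window_us, long cstate_latency_us)
{
    long r = idle_window_us - cstate_latency_us;
    return r > 0 ? r : 0;
}
```

With a 1000 usec idle window and a 1000 usec C3 latency the effective residency is zero, which is exactly the "benefit of C3 is lost" case described above.<br />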

<strong>Linux</strong> decides which C-state to enter on idle<br />

based on a promotion/demotion algorithm.<br />

<strong>The</strong> current algorithm measures the residency<br />

in the current C-state. If it meets a threshold<br />

the processor is promoted to the deeper C-state<br />

on re-entrance into idle. If it was too short, then<br />

the processor is demoted to a lower-numbered<br />

C-state.<br />

Unfortunately, the demotion rules are overly<br />

simplistic, as <strong>Linux</strong> tracks only its previous<br />

success at being idle, and doesn’t yet account<br />

for the load on the system.<br />
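The promotion/demotion mechanism described above can be sketched in a few lines. This is a toy model for illustration only; the actual kernel algorithm's thresholds and bookkeeping differ:<br />

```c
/* Toy promotion/demotion: if the last idle residency met the
 * promotion threshold, go one C-state deeper next time; if it fell
 * below the demotion threshold, back off one level. */
#include <assert.h>

#define MAX_CSTATE 3   /* C1..C3 */

int next_cstate(int current, long residency_us,
                long promote_us, long demote_us)
{
    if (residency_us >= promote_us && current < MAX_CSTATE)
        return current + 1;
    if (residency_us < demote_us && current > 1)
        return current - 1;
    return current;
}
```

Note that, like the algorithm criticized above, this tracks only the previous residency and takes no account of system load.<br />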

Support for deeper C-states via the _CST<br />

method is currently in prototype. Hopefully<br />

this method will also give the OS more accurate<br />

data than the FADT about the latency associated<br />

with C3. If it does not, then we may<br />

need to consider discarding the table-provided<br />

latencies and measuring the actual latency at<br />

boot time.<br />

5.4 Sleep States<br />

ACPI names sleep states S0 – S5. S0 is the<br />

non-sleep state, synonymous with G0. S1 is<br />

standby; it halts the processor and turns off the<br />

display. Of course turning off the display on an<br />

idle system saves the same amount of power<br />

without taking the system off line, so S1 isn’t<br />

worth much. S2 is deprecated. S3 is suspend to<br />

RAM. S4 is hibernate to disk. S5 is soft-power<br />

off, AKA G2.<br />

Sleep states are unreliable enough on <strong>Linux</strong> today<br />

that they’re best considered “experimental.”<br />

Suspend/Resume suffers from (at least)<br />

two systematic problems:<br />

• Using __init and __initdata on items<br />

that may be referenced after boot, say,<br />

during resume, is a bad idea.<br />

• PCI configuration space is not uniformly<br />

saved and restored either for devices or<br />

for PCI bridges. This can be observed<br />

by using lspci before and after a suspend/resume<br />

cycle. Sometimes setpci<br />

can be used to repair this damage from<br />

user-space.<br />

5.5 Device States<br />

Not shown on the diagram, ACPI defines<br />

power saving states for devices: D0 – D3. D0<br />

is on, D3 is off, D1 and D2 are intermediate.<br />

Higher device states have<br />

1. more power savings,<br />

2. less device context saved by hardware,<br />

3. more device driver state restoring,<br />

4. higher restore latency.



ACPI defines semantics for each device state in<br />

each device class. In practice, D1 and D2 are<br />

often optional, as many devices support only<br />

on and off, either because they are low-latency<br />

or because they are simple.<br />

<strong>Linux</strong>-2.6 includes an updated device driver<br />

model to accommodate power management. 14<br />

This model is highly compatible with PCI and<br />

ACPI. However, this vision is not yet fully realized.<br />

To do so, <strong>Linux</strong> needs a global power<br />

policy manager.<br />

5.6 Wanted: Generic <strong>Linux</strong> Run-time Power<br />

Policy Manager<br />

PCI device drivers today call pci_set_<br />

power_state() to enter D-states. This uses<br />

the power management capabilities in the PCI<br />

power management specification.<br />
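A typical suspend path through this interface looks roughly like the sketch below. Here struct pci_dev and pci_set_power_state() are minimal stand-ins defined so the fragment is self-contained; they are not the real kernel definitions:<br />

```c
/* Sketch of a driver entering D3 via the PCI layer.  The struct and
 * the pci_set_power_state() stub below merely record the requested
 * D-state so this compiles outside the kernel. */
#include <assert.h>

struct pci_dev {
    int power_state;               /* 0 = D0 ... 3 = D3 */
};

/* Stand-in for the kernel's pci_set_power_state(). */
int pci_set_power_state(struct pci_dev *dev, int state)
{
    dev->power_state = state;
    return 0;
}

/* What a driver's suspend hook typically does: quiesce the device,
 * save any state the hardware will lose, then drop to D3. */
int mydev_suspend(struct pci_dev *dev)
{
    /* ... stop DMA, save device context ... */
    return pci_set_power_state(dev, 3);
}
```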

<strong>The</strong> ACPI DSDT supplies methods for ACPI<br />

enumerated devices to access ACPI D-states.<br />

However, no driver calls into ACPI to enter D-<br />

states today. 15<br />

Drivers shouldn’t have to care if they are power<br />

managed by PCI or by ACPI. Drivers should be<br />

able to up-call to a generic run-time power policy<br />

manager. That manager should know about<br />

calling the PCI layer or the ACPI layer as appropriate.<br />

<strong>The</strong> power manager should also put those requests<br />

in the context of user-specified power<br />

policy. E.g., does the user want maximum performance,<br />

or maximum battery life? Currently<br />

there is no method to specify the detailed policy,<br />

and the kernel wouldn’t know how to handle<br />

it anyway.<br />

In a related point, it appears that devices currently<br />

only suspend upon system suspend. This<br />

is probably not the path to industry-leading battery life.<br />

Device drivers should recognize when their device<br />

has gone idle. <strong>The</strong>y should invoke a suspend<br />

up-call to a power manager layer which<br />

will decide if it really is a good idea to grant<br />

that request now, and if so, how. In this case by<br />

calling the PCI or ACPI layer as appropriate.<br />
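The up-call argued for here might look something like the sketch below. pm_request_suspend() and the policy enum are invented names for illustration only; no such interface exists:<br />

```c
/* Hypothetical up-call: a driver reports its device idle, and a
 * policy manager decides whether to grant the D-state transition
 * based on the user-specified policy.  All names are invented. */
#include <assert.h>

enum power_policy { POLICY_PERFORMANCE, POLICY_BATTERY };

/* Grant D-state requests only when the user has asked for battery
 * life; a performance policy keeps the device in D0. */
int pm_request_suspend(enum power_policy policy, int requested_dstate)
{
    if (policy == POLICY_BATTERY)
        return requested_dstate;   /* granted: enter Dn */
    return 0;                      /* denied: stay in D0 */
}
```

The point of the indirection is that the driver never needs to know whether the grant is ultimately carried out by the PCI layer or by ACPI methods.<br />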

6 ACPI as seen by bugzilla<br />

Over the last year the ACPI developers have<br />

made heavy use of bugzilla 16 to help prioritize<br />

and track 460 bugs. 300 bugs are closed or resolved,<br />

160 are open. 17<br />

We cc: acpi-bugzilla@lists.<br />

sourceforge.net on these bugs, and<br />

we encourage the community to add that alias<br />

to ACPI-specific bugs in other bugzillas so that<br />

the team can help out wherever the problems<br />

are found.<br />

We haven’t really used the bugzilla priority<br />

field. Instead we’ve split the bugs into categories<br />

and have addressed the configuration issues<br />

first. This explains why most of the interrupt<br />

bugs are resolved, and most of the suspend/resume<br />

bugs are unresolved.<br />

We’ve seen an incoming bug rate of 10<br />

bugs/week for many months, but the new reports<br />

favor the power management features<br />

over configuration, so we’re hopeful that the<br />

torrent of configuration issues is behind us.<br />


14 Patrick Mochel, <strong>Linux</strong> <strong>Kernel</strong> Power Management,<br />

OLS 2003.<br />

15 Actually, the ACPI hot-plug driver invokes D-states,<br />

but that is the only exception.<br />

16 http://bugzilla.kernel.org/<br />

17 <strong>The</strong> resolved state indicates that a patch is available<br />

for testing, but that it is not yet checked into the kernel.org<br />

kernel.




Figure 6: ACPI bug profile<br />

7 Future Work<br />

7.1 <strong>Linux</strong> 2.4<br />

Going forward, I expect to back-port only critical<br />

configuration related fixes to <strong>Linux</strong>-2.4.<br />

For the latest power management code, users<br />

need to migrate to <strong>Linux</strong>-2.6.<br />

7.2 <strong>Linux</strong> 2.6<br />

<strong>Linux</strong>-2.6 is a “stable” release, so it is not<br />

appropriate to integrate significant new features.<br />

However, the power management side<br />

of ACPI is widely used in 2.6 and there will be<br />

plenty of bug-fixes necessary. <strong>The</strong> most visible<br />

will probably be anything that makes Suspend/Resume<br />

work on more platforms.<br />

7.3 <strong>Linux</strong> 2.7<br />

<strong>The</strong>se feature gaps will not be addressed in<br />

<strong>Linux</strong> 2.6, and so are candidates for <strong>Linux</strong> 2.7:<br />

• Device enumeration is not abstracted in<br />

a generic device driver manager that can<br />

shield drivers from knowing if they’re<br />

enumerated by ACPI, PCI, or other.<br />

• Motherboard devices enumerated by<br />

ACPI in the DSDT are ignored, and<br />

probed instead via legacy methods. This<br />

can lead to resource conflicts.<br />

• Device power states are not abstracted in<br />

a generic device power manager that can<br />

shield drivers from knowing whether to<br />

call ACPI or PCI to handle D-states.<br />

• <strong>The</strong>re is no power policy manager to<br />

translate the user-requested power policy<br />

into kernel policy.<br />

• No devices invoke ACPI methods to enter<br />

D-states.<br />

• Devices do not detect that they are idle<br />

and request of a power manager whether<br />

they should enter power saving device<br />

states.<br />

• <strong>The</strong>re is no MP/SMT coordination of P-<br />

states. Today, P-states are disabled on<br />

SMP systems. Coordination needs to account<br />

for multiple threads and multiple<br />

cores per package.<br />

• Coordinate P-states and T-states. Throttling<br />

should be used only after the system<br />

is put in the lowest P-state.<br />

• Idle states above C1 are disabled on SMP.<br />

• Enable Suspend in PAE mode. 18<br />

18 PAE, Physical Address Extended—MMU mode to<br />

handle > 4GB RAM—optional on i386, always used<br />

on x86_64.



• Enable Suspend on SMP.<br />

• Tick timer modulation for idle power savings.<br />

• Video control extensions. Video is a large<br />

power consumer. <strong>The</strong> ACPI spec Video<br />

extensions are currently in prototype.<br />

• Docking Station support is completely absent<br />

from <strong>Linux</strong>.<br />

• ACPI 3.0 features. TBD after the specification<br />

is published.<br />

7.4 ACPI 3.0<br />

Although ACPI 3.0 has not yet been published,<br />

two ACPI 3.0 tidbits are already in <strong>Linux</strong>.<br />

• PCI Express table scanning. This is the<br />

basic PCI Express support, there will be<br />

more coming. Those in the PCI SIG<br />

can read all about it in the PCI Express<br />

Firmware Specification.<br />

• Several clarifications to the ACPI 2.0b<br />

spec resulted directly from open source<br />

development, 19 and the text of ACPI 3.0<br />

has been updated accordingly. For example,<br />

some subtleties of SCI interrupt configuration<br />

and device enumeration.<br />

When the ACPI 3.0 specification is published<br />

there will instantly be multiple additions to<br />

the ACPI/<strong>Linux</strong> feature to-do list.<br />

7.5 Tougher Issues<br />

• Battery Life on <strong>Linux</strong> is not yet competitive.<br />

This single metric is the sum of all<br />

the power savings features in the platform,<br />

and if any of them are not working properly,<br />

it comes out on this bottom line.<br />

19 FreeBSD deserves kudos in addition to <strong>Linux</strong><br />

• Laptop Hot Keys are used to control<br />

things such as video brightness, etc. ACPI<br />

does not specify Hot Keys. But when they<br />

work in APM mode and don’t work in<br />

ACPI mode, ACPI gets blamed. <strong>The</strong>re are<br />

4 ways to implement hot keys:<br />

1. SMI 20 handler, the BIOS handles<br />

interrupts from the keys, and controls<br />

the device directly. This acts<br />

like “hardware” control as the OS<br />

doesn’t know it is happening. But<br />

on many systems this SMI method is<br />

disabled as soon as the system transitions<br />

into ACPI mode. Thus the<br />

complaint “the button works in APM<br />

mode, but doesn’t work in ACPI<br />

mode.”<br />

But ACPI doesn’t specify how hot<br />

keys work, so in ACPI mode one of<br />

the other methods listed here needs<br />

to handle the keys.<br />

2. Keyboard Extension driver, such as<br />

i8k. Here the keys return scan<br />

codes like any other keys on the keyboard,<br />

and the keyboard driver needs<br />

to understand those scan codes. This<br />

is independent of ACPI, and generally<br />

OEM specific.<br />

3. OEM-specific ACPI hot key driver.<br />

Some OEMs enumerate the hot<br />

keys as OEM-specific devices in the<br />

ACPI tables. While the device is<br />

described in AML, such devices are<br />

not described in the ACPI spec so<br />

we can’t build generic ACPI support<br />

for them. <strong>The</strong> OEM must supply<br />

the appropriate hot-key driver since<br />

only they know how it is supposed<br />

to work.<br />

4. Platform-specific “ACPI” driver. Today<br />

<strong>Linux</strong> includes Toshiba and<br />

20 SMI, System Management Interrupt; invisible to the<br />

OS, handled by the BIOS, generally considered evil.



Asus platform specific extension<br />

drivers to ACPI. <strong>The</strong>y do not use<br />

portable ACPI compliant methods to<br />

recognize and talk to the hot keys,<br />

but generally use the methods above.<br />

<strong>The</strong> correct solution to the Hot Key issue<br />

on <strong>Linux</strong> will require direct support<br />

from the OEMs, either by supplying documentation,<br />

or code to the community.<br />

8 Summary<br />

This past year has seen great strides in the configuration<br />

aspects of ACPI. Multiple <strong>Linux</strong> distributors<br />

now enable ACPI on multiple architectures.<br />

This sets the foundation for the next era of<br />

ACPI on <strong>Linux</strong> where we can evolve the more<br />

advanced ACPI features to meet the expectations<br />

of the community.<br />

9 Resources<br />

<strong>The</strong> ACPI specification is published at http:<br />

//www.acpi.info.<br />

<strong>The</strong> home page for the <strong>Linux</strong> ACPI development<br />

community is here: http://<br />

acpi.sourceforge.net/. It contains numerous<br />

useful pointers, including one to the<br />

acpi-devel mailing list.<br />

<strong>The</strong> latest ACPI code can be found against various<br />

recent releases in the BitKeeper repositories:<br />

http://linux-acpi.bkbits.<br />

net/<br />

Plain patches are available on kernel.<br />

org. 21 Note that Andrew Morton currently<br />

includes the latest ACPI test tree in the -mm<br />

patch, so you can test the latest ACPI code<br />

combined with other recent updates there. 22<br />

10 Acknowledgments<br />

Many thanks to the following people whose direct<br />

contributions have significantly improved<br />

the quality of the ACPI code in the last<br />

year: Jesse Barnes, John Belmonte, Dominik<br />

Brodowski, Bruno Ducrot, Bjorn Helgaas,<br />

Nitin Kamble, Andi Kleen, Karol Kozimor,<br />

Pavel Machek, Andrew Morton, Jun Nakajima,<br />

Venkatesh Pallipadi, Nate Lawson, David<br />

Shaohua Li, Suresh Siddha, Jes Sorensen, Andrew<br />

de Quincey, Arjan van de Ven, Matt<br />

Wilcox, and Luming Yu. Thanks also to all<br />

the bug submitters, and the enthusiasts on<br />

acpi-devel.<br />

Special thanks to Intel’s Mobile Platforms<br />

Group, which created ACPICA, particularly<br />

Bob Moore and Andy Grover.<br />

<strong>Linux</strong> is a trademark of Linus Torvalds. BitKeeper<br />

is a trademark of BitMover, Inc.<br />

21 http://ftp.kernel.org/pub/linux/<br />

kernel/people/lenb/acpi/patches/<br />

22 http://ftp.kernel.org/pub/linux/<br />

kernel/people/akpm/patches/


Scaling <strong>Linux</strong>® to the Extreme<br />

From 64 to 512 Processors<br />

Ray Bryant<br />

raybry@sgi.com<br />

Jesse Barnes<br />

jbarnes@sgi.com<br />

John Hawkes<br />

hawkes@sgi.com<br />

Jeremy Higdon<br />

jeremy@sgi.com<br />

Silicon Graphics, Inc.<br />

Jack Steiner<br />

steiner@sgi.com<br />

Abstract<br />

In January 2003, SGI announced the SGI® Altix®<br />

3000 family of servers. As announced,<br />

the SGI Altix 3000 system supported up to<br />

64 Intel® Itanium® 2 processors and 512 GB<br />

of main memory in a single <strong>Linux</strong>® image.<br />

Altix now supports up to 256 processors in<br />

a single <strong>Linux</strong> system, and we have a few<br />

early-adopter customers who are running 512<br />

processors in a single <strong>Linux</strong> system; others<br />

are running with as much as 4 terabytes of<br />

memory. This paper continues the work reported<br />

on in our 2003 OLS paper by describing<br />

the changes necessary to get <strong>Linux</strong> to efficiently<br />

run high-performance computing workloads<br />

on such large systems.<br />

Introduction<br />

At OLS 2003 [1], we discussed changes to<br />

<strong>Linux</strong> that allowed us to make <strong>Linux</strong> scale to<br />

64 processors for our high-performance computing<br />

(HPC) workloads. Since then, we have<br />

continued our scalability work, and we now<br />

support up to 256 processors in a single <strong>Linux</strong><br />

image, and we have a few early-adopter customers<br />

who are running 512 processors in a<br />

single-system image; other customers are running<br />

with as much as 4 terabytes of memory.<br />

As can be imagined, the type of changes necessary<br />

to get a single <strong>Linux</strong> system to scale on a<br />

512 processor system or to support 4 terabytes<br />

of memory are of a different nature than those<br />

necessary to get <strong>Linux</strong> to scale up to a 64 processor<br />

system, and the majority of this paper<br />

will describe such changes.<br />

While much of this work has been done in<br />

the context of a <strong>Linux</strong> 2.4 kernel, Altix is<br />

now a supported platform in the <strong>Linux</strong> 2.6 series<br />

(www.kernel.org versions of <strong>Linux</strong><br />

2.6 boot and run well on many small to moderate<br />

sized Altix systems), and our plan is to<br />

port many of these changes to <strong>Linux</strong> 2.6 and<br />

propose them as enhancements to the community<br />

kernel. While some of these changes will<br />

be unique to the <strong>Linux</strong> kernel for Altix, many<br />

of the changes we propose will also improve<br />

performance on smaller SMP and NUMA systems,<br />

so should be of general interest to the<br />

<strong>Linux</strong> scalability community.<br />

In the rest of this paper, we will first provide<br />

a brief review of the SGI Altix 3000 hardware.<br />

Next we will describe why we believe<br />

that very large single-system image, shared-memory<br />

machines can be more effective tools<br />

for HPC than similar sized non-shared memory<br />

clusters. We will then discuss changes that<br />

we made to <strong>Linux</strong> for Altix in order to make



that system a more effective system for HPC<br />

on systems with as many as 512 processors.<br />

A second large topic of discussion will be the<br />

changes to support high-performance I/O on<br />

Altix and some of the hardware underpinnings<br />

for that support. We believe that the latter set<br />

of problems are general in the sense that they<br />

apply to any large scale NUMA system and the<br />

solutions we have adopted should be of general<br />

interest for this reason.<br />

Even though this paper is focused on the<br />

changes that we have made to <strong>Linux</strong> to effectively<br />

support very large Altix platforms, it<br />

should be remembered that the total number of<br />

such changes is small in relation to the overall<br />

size of the <strong>Linux</strong> kernel and its supporting<br />

software. SGI is committed to supporting<br />

the <strong>Linux</strong> community and continues to support<br />

<strong>Linux</strong> for Altix as a member of the <strong>Linux</strong><br />

family of kernels, and in general to support binary<br />

compatibility between <strong>Linux</strong> for Altix and<br />

<strong>Linux</strong> on other Itanium Processor Family platforms.<br />

In many cases, the scaling changes described in<br />

this paper have already been submitted to the<br />

community for consideration for inclusion in<br />

<strong>Linux</strong> 2.6. In other cases, the changes are under<br />

evaluation to determine if they need to be<br />

added to <strong>Linux</strong> 2.6, or whether they are fixes<br />

for problems in <strong>Linux</strong> 2.4.21 (the current product<br />

base for <strong>Linux</strong> for Altix) that are no longer<br />

present in <strong>Linux</strong> 2.6.<br />

Finally, this paper contains forward-looking<br />

statements regarding SGI® technologies and<br />

third-party technologies that are subject to<br />

risks and uncertainties. <strong>The</strong> reader is cautioned<br />

not to rely unduly on these forward-looking<br />

statements, which are not a guarantee of future<br />

or current performance, nor are they a guarantee<br />

that features described herein will or will<br />

not be available in future SGI products.<br />

<strong>The</strong> SGI Altix Hardware<br />

This section is condensed from [1]; the reader<br />

should refer to that paper for additional details.<br />

An Altix system consists of a configurable<br />

number of rack-mounted units, each of which<br />

SGI refers to as a brick. <strong>The</strong> most common<br />

type of brick is the C-brick (or compute brick).<br />

A fully configured C-brick consists of two separate<br />

dual-processor Intel Itanium 2 systems,<br />

each of which is a bus-connected multiprocessor<br />

or node.<br />

In addition to the two processors on the bus,<br />

there is also a SHUB chip on each bus. <strong>The</strong><br />

SHUB is a proprietary ASIC that (1) acts as<br />

a memory controller for the local memory,<br />

(2) provides the interface to the interconnection<br />

network, (3) manages the global cache coherency<br />

protocol, and (4) performs other functions<br />

as discussed in [1].<br />

Memory accesses in an Altix system are either<br />

local (i.e., the reference is to memory in the<br />

same node as the processor) or remote. <strong>The</strong><br />

SHUB detects whether a reference is local, in<br />

which case it directs the request to the memory<br />

on the node, or remote, in which case it<br />

forwards the request across the interconnection<br />

network to the SHUB chip where the memory<br />

reference will be serviced.<br />

Local memory references have lower latency;<br />

the Altix system is thus a NUMA (non-uniform<br />

memory access) system. <strong>The</strong> ratio of remote to<br />

local memory access times on an Altix system<br />

varies from 1.9 to 3.5, depending on the size<br />

of the system and the relative locations of the<br />

processor and memory module involved in the<br />

transfer.<br />

<strong>The</strong> cache-coherency policy in the Altix system<br />

can be divided into two levels: local<br />

and global. <strong>The</strong> local cache-coherency protocol<br />

is defined by the processors on the local



bus and is used to maintain cache-coherency<br />

between the Itanium processors on the bus.<br />

<strong>The</strong> global cache-coherency protocol is implemented<br />

by the SHUB chip. <strong>The</strong> global protocol<br />

is directory-based and is a refinement of the<br />

protocol originally developed for DASH [2].<br />

<strong>The</strong> Altix system interconnection network uses<br />

routing bricks to provide connectivity in system<br />

sizes larger than 16 processors. In systems<br />

with 128 or more processors a second layer<br />

of routing bricks is used to forward requests<br />

among subgroups of 32 processors each. <strong>The</strong><br />

routing topology is a fat-tree topology with additional<br />

“express” links being inserted to improve<br />

performance.<br />

Why Big SSI?<br />

In this section we discuss the rationale for<br />

building such a large single-system image<br />

(SSI) box as an Altix system with 512 CPUs<br />

and (potentially) several TB of main memory:<br />

(1) Shared memory systems are more flexible<br />

and easier to manage than a cluster. <strong>One</strong> can<br />

simulate message passing on shared memory,<br />

but not the other way around. Software for<br />

cluster management and system maintenance<br />

exists, but can be expensive or complex to use.<br />

(2) Shared memory style programming is generally<br />

simpler and more easily understood than<br />

message passing. Debugging of code is often<br />

simpler on a SSI system than on a cluster.<br />

(3) It is generally easier to port or write<br />

codes from scratch using the shared memory<br />

paradigm. Additionally it is often possible to<br />

simply ignore large sections of the code (e.g.<br />

those devoted to data input and output) and<br />

only parallelize the part that matters.<br />

(4) A shared memory system supports easier<br />

load balancing within a computation. <strong>The</strong><br />

mapping of grid points to a node determines<br />

the computational load on the node. Some grid<br />

points may be located near more rapidly changing<br />

parts of computation, resulting in higher<br />

computational load. Balancing this over time<br />

requires moving grid points from node to node<br />

in a cluster, whereas in a shared memory system<br />

such re-balancing is typically simpler.<br />

(5) Access to large global data sets is simplified.<br />

Often, the parallel computation depends<br />

on a large data set describing, for example, the<br />

precise dimensions and characteristics of the<br />

physical object that is being modeled. This<br />

data set can be too large to fit into the node<br />

memories available on a clustered machine, but<br />

it can readily be loaded into memory on a large<br />

shared memory machine.<br />

(6) Not everything fits into the cluster model.<br />

While many production codes have been converted<br />

to message passing, the overall computation<br />

may still contain one or more phases that<br />

are better performed using a large shared memory<br />

system. Or, there may be a subset of users<br />

of the system who would prefer a shared memory<br />

paradigm to a message passing one. This<br />

can be a particularly important consideration in<br />

large data-center environments.<br />

<strong>Kernel</strong> Changes<br />

In this section we describe the most significant<br />

kernel problems we have encountered in running<br />

<strong>Linux</strong> on a 512 processor Altix system.<br />

Cache line and TLB Conflicts<br />

Cache line conflicts occur in every cache-coherent<br />

multiprocessor system, to one extent<br />

or another, and whether or not the conflict exhibits<br />

itself as a performance problem is dependent<br />

on the rate at which the conflict occurs and<br />

the time required by the hardware to resolve



the conflict. <strong>The</strong> latter time is typically proportional<br />

to the number of processors involved in<br />

the conflict. On Altix systems with 256 processors<br />

or more, we have encountered some cache<br />

line conflicts that can effectively halt forward<br />

progress of the machine. Typically, these conflicts<br />

involve global variables that are updated<br />

at each timer tick (or faster) by every processor<br />

in the system.<br />

<strong>One</strong> example of this kind of problem is the default<br />

kernel profiler. When we first enabled<br />

the default kernel profiler on a 512 CPU system,<br />

the system would not boot. <strong>The</strong> reason<br />

was that once per timer tick, each processor<br />

in the system was trying to update the profiler<br />

bin corresponding to the CPU idle routine.<br />

A work around to this problem was to initialize<br />

prof_cpu_mask to CPU_MASK_NONE<br />

instead of the default. This disables profiling<br />

on all processors until the user sets the<br />

prof_cpu_mask.<br />

Another example of this kind of problem was<br />

when we imported some timer code from<br />

Red Hat® AS 3.0. <strong>The</strong> timer code included<br />

a global variable that was used to account for<br />

differences between HZ (typically a power of<br />

2) and the number of microseconds in a second<br />

(nominally 1,000,000). This global variable<br />

was updated by each processor on each<br />

timer tick. <strong>The</strong> result was that on Altix systems<br />

larger than about 384 processors, forward<br />

progress could not be made with this version<br />

of the code. To fix this problem, we made this<br />

global variable a per processor variable. <strong>The</strong><br />

result was that the adjustment for the difference<br />

between HZ and microseconds is done on<br />

a per processor rather than on a global basis,<br />

and now the system will boot.<br />
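
The per-processor conversion described above can be sketched in ordinary C; the CPU count, names, and error values here are illustrative, not the actual imported timer code:<br />

```c
#include <assert.h>

#define NR_CPUS 8   /* hypothetical CPU count */

/* Before the fix: one global accumulator that every CPU's timer tick
   dirtied, bouncing a single cache line across hundreds of processors.
   After: one accumulator per CPU; only the rare reader pays to sum them. */
static long usec_adjust[NR_CPUS];

/* Hot path, called from each CPU's timer tick: touches only local data. */
void account_tick(int cpu, long usec_error)
{
    usec_adjust[cpu] += usec_error;
}

/* Rare read path: sum the per-CPU parts when the total is needed. */
long total_adjust(void)
{
    long sum = 0;
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        sum += usec_adjust[cpu];
    return sum;
}
```

The write path never crosses CPUs, so the cache line containing a given counter stays in its owner's cache.<br />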

Still other cache line conflicts were remedied<br />

by identifying cases of false cache line sharing,<br />

i.e., those cache lines that inadvertently contain<br />

a field that is frequently written by one CPU<br />

and another field (or fields) that are frequently<br />

read by other CPUs.<br />

Another significant bottleneck is the ia64<br />

do_gettimeofday() with its use of<br />

cmpxchg. That operation is expensive on<br />

most architectures, and concurrent cmpxchg<br />

operations on a common memory location<br />

scale worse than concurrent simple writes from<br />

multiple CPUs. On Altix, four concurrent user<br />

gettimeofday() system calls complete in<br />

almost an order of magnitude more time than a<br />

single gettimeofday(); eight are 20 times<br />

slower than one; and the scaling deteriorates<br />

nonlinearly to the point where 32 concurrent<br />

system calls are 100 times slower than one. At<br />

the present time, we are still exploring a way to<br />

improve this scaling problem in <strong>Linux</strong> 2.6 for<br />

Altix.<br />

While moving data to per-processor storage is<br />

often a solution to the kind of scaling problems<br />

we have discussed here, it is not a panacea,<br />

particularly as the number of processors becomes<br />

large. Often, the system will want to<br />

inspect some data item in the per-processor<br />

storage of each processor in the system. For<br />

small numbers of processors this is not a problem.<br />

But when there are hundreds of processors<br />

involved, such loops can cause a TLB miss<br />

each time through the loop as well as a couple<br />

of cache-line misses, with the result that<br />

the loop may run quite slowly. (A TLB miss<br />

is caused because the per-processor storage areas<br />

are typically isolated from one another in<br />

the kernel’s virtual address space.)<br />

If such loops turn out to be bottlenecks, then<br />

what one must often do is to move the fields<br />

that such loops inspect out of the per-processor<br />

storage areas, and move them into a global<br />

static array with one entry per CPU.<br />
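
Such a global static array can be sketched as follows, with padding so that each CPU's frequently written field stays on its own cache line; the line size, field names, and CPU count are hypothetical:<br />

```c
#include <assert.h>

#define NR_CPUS 16
#define CACHE_LINE 128   /* assumed cache line size; platform dependent */

/* Fields that cross-CPU loops inspect live in one contiguous global
   array (so a scan walks a single mapping, not one isolated
   per-processor area per CPU).  Padding prevents false sharing. */
struct cpu_stats {
    long nr_running;                      /* written by its owner CPU   */
    char pad[CACHE_LINE - sizeof(long)];  /* keep writers on own lines  */
};

static struct cpu_stats stats[NR_CPUS];

/* The "quick peek at every CPU" loop: sequential and prefetchable. */
long total_running(void)
{
    long sum = 0;
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        sum += stats[cpu].nr_running;
    return sum;
}
```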

An example of this kind of problem in <strong>Linux</strong><br />

2.6 for Altix is the current allocation scheme<br />

of the per-CPU run queue structures. Each<br />

per-CPU structure on an Altix system requires<br />

a unique TLB to address it, and each structure<br />

begins at the same virtual offset in a page,<br />

which for a virtually indexed cache means that<br />

the same fields will collide at the same index.<br />

Thus, a CPU scheduler that wishes to<br />

do a quick peek at every other CPU’s nr_<br />

running or cpu_load will not only suffer a<br />

TLB miss on every access, but will also likely<br />

suffer a cache miss because these same virtual<br />

offsets will collide in the cache. Cache coloring<br />

of these addresses would be one way to<br />

solve this problem; we are still exploring ways<br />

to fix this problem in <strong>Linux</strong> 2.6 for Altix.<br />

Lock Conflicts<br />

A cousin of cache line conflicts is the lock<br />

conflict. Indeed, the root mechanism of the<br />

lock bottleneck is a cache line conflict. For<br />

a spinlock_t the conflict is the cmpxchg<br />

operation on the word that signifies whether or<br />

not the lock is owned. For a rwlock_t the<br />

conflict is the cmpxchg or fetch-and-add operation<br />

on the count of the number of readers<br />

or the bit signifying whether or not the<br />

lock is owned exclusively by a writer. For a<br />

seqlock_t the conflict is the increment of<br />

the sequence number.<br />
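
A minimal single-writer seqlock sketch makes the conflict concrete: the writer's two increments of the sequence word are exactly the stores that bounce the lock's cache line between CPUs. This is illustrative userspace C11 code, not the kernel's seqlock_t:<br />

```c
#include <stdatomic.h>

static atomic_uint seq;        /* odd while a write is in progress */
static long protected_value;   /* datum the seqlock protects       */

/* Single writer: bracket the update with two sequence increments. */
void write_value(long v)
{
    atomic_fetch_add(&seq, 1);   /* sequence becomes odd: write begins */
    protected_value = v;
    atomic_fetch_add(&seq, 1);   /* sequence becomes even: write ends  */
}

/* Readers retry if a write overlapped their read. */
long read_value(void)
{
    unsigned start;
    long v;
    do {
        start = atomic_load(&seq);
        v = protected_value;
    } while ((start & 1) || atomic_load(&seq) != start);
    return v;
}
```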

For some lock conflicts, such as the rcu_<br />

ctrlblk.mutex, the remedy is to make the<br />

spinlock more fine-grained, e.g., by making it<br />

hierarchical or per-CPU. For other lock conflicts,<br />

the most effective remedy is to reduce<br />

the use of the lock.<br />

<strong>The</strong> O(1) CPU scheduler replaced the global<br />

runqueue_lock with per-CPU run queue<br />

locks, and replaced the global run queue with<br />

per-CPU run queues. While this did substantially<br />

decrease the CPU scheduling bottleneck<br />

for CPU counts in the 8 to 32 range, additional<br />

effort has been necessary to remedy additional<br />

bottlenecks that appear with even larger configurations.<br />

For example, we discovered that at 256 processors<br />

and above, we encountered a livelock<br />

early in system boot because hundreds of idle<br />

CPUs are load-balancing and contending for the<br />

run queues of one or a few busy CPUs. <strong>The</strong> contention<br />

is so severe that the busy CPU’s scheduler<br />

cannot itself acquire its own run queue<br />

lock, and thus the system livelocks.<br />

A remedy we applied in our Altix 2.4-based<br />

kernel was to introduce a progressively longer<br />

backoff between successive load-balancing attempts,<br />

if the load-balancing CPU continues<br />

to be unsuccessful in finding a task to pull and migrate.<br />

Perhaps all the busiest CPU’s tasks<br />

are pinned to that CPU, or perhaps all the<br />

tasks are still cache-hot. Regardless of the<br />

reason, a load-balancing failure results in that<br />

CPU delaying the next load-balance attempt<br />

by another incremental increase in time. This<br />

algorithm effectively solved the livelock, as<br />

well as improved other high-contention conflicts<br />

on a busiest CPU’s run queue lock (e.g.,<br />

always finding pinned tasks that can never be<br />

migrated).<br />
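
The backoff scheme can be sketched as follows; the constants, names, and structure are illustrative rather than the shipped Altix code:<br />

```c
/* Incremental load-balance backoff: each failed attempt to pull a task
   pushes the next attempt further out; a successful pull resets the
   interval.  All constants are illustrative. */
#define BALANCE_MIN_INTERVAL   1    /* ticks */
#define BALANCE_MAX_INTERVAL 256
#define BALANCE_STEP           8    /* added per consecutive failure */

struct balancer {
    int next_balance;   /* earliest tick for the next attempt */
    int interval;       /* current backoff interval           */
};

/* Record the outcome of one load-balance attempt at time `now`. */
void balance_result(struct balancer *b, int now, int pulled_task)
{
    if (pulled_task) {
        b->interval = BALANCE_MIN_INTERVAL;   /* reset on success */
    } else {
        b->interval += BALANCE_STEP;          /* incremental backoff */
        if (b->interval > BALANCE_MAX_INTERVAL)
            b->interval = BALANCE_MAX_INTERVAL;
    }
    b->next_balance = now + b->interval;
}

int should_balance(const struct balancer *b, int now)
{
    return now >= b->next_balance;
}
```

A CPU whose pulls keep failing (all tasks pinned or cache-hot on the busy CPU) thus hammers the busy run queue lock less and less often.<br />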

This load-balance backoff algorithm did not<br />

get accepted into the early 2.6 kernels. <strong>The</strong> latest<br />

2.6.7 CPU scheduler, as developed by Nick<br />

Piggin, incorporates a similar backoff algorithm.<br />

However, this algorithm (at least as it<br />

appears in 2.6.7-rc2) continues to cause a boot-time<br />

livelock at 512 processors on Altix, so we<br />

are continuing to investigate this matter.<br />

Page Cache<br />

Managing the page cache in Altix has been a<br />

challenging problem. <strong>The</strong> reason is that while<br />

a large Altix system may have a lot of memory,<br />

each node in the system only has a relatively<br />

small fraction of that memory available as local<br />

memory. For example, on a 512 CPU system,<br />

if the entire system has 512 GB of memory,<br />

each node on the system has only 2 GB of<br />

local memory; less than 0.4% of the available<br />

memory on the system is local. When you consider<br />

that it is quite common on such systems<br />

to deal with files that are tens of GB in size, it<br />

is easy to understand how the page cache could<br />

consume all of the memory on several nodes in<br />

the system just doing normal, buffered-file I/O.<br />

Stated another way, this is the challenge of a<br />

large NUMA system: all memory is addressable,<br />

but only a tiny fraction of that memory<br />

is local. Users of NUMA systems need to<br />

place their most frequently accessed data in local<br />

memory; this is crucial to obtain the maximum<br />

performance possible from the system.<br />

Typically this is done by allocating pages on a<br />

first-touch basis; that is, we attempt to allocate<br />

a page on the node where it is first referenced.<br />

If all of the local memory on a node is consumed<br />

by the page cache, then these local storage<br />

allocations will spill over to other (remote)<br />

nodes, the result being a potentially significant<br />

impact on program performance.<br />

Similarly, it is important that the amount of<br />

free memory be balanced across idle nodes in<br />

the system. An imbalance could lead to some<br />

components of a parallel computation running<br />

slower than others because not all components<br />

of the computation were able to allocate their<br />

memory entirely out of local storage. Since the<br />

overall speed of parallel computation is determined<br />

by the execution of its slowest component,<br />

the performance of the entire application<br />

can be impacted by a non-local storage allocation<br />

on only a few nodes.<br />

<strong>One</strong> might think that bdflush or kupdated<br />

(in a <strong>Linux</strong> 2.4 system) would be responsible<br />

for cleaning up unused page-cache pages.<br />

As the OLS reader knows, these daemons<br />

are responsible not for deallocating page-cache<br />

pages, but cleaning them. It is the swap daemon<br />

kswapd that is responsible for causing<br />

page-cache pages to be deallocated. However,<br />

in many situations we have encountered, even<br />

though multiple nodes of the system would be<br />

completely out of local memory, there would<br />

still be lots of free memory elsewhere in the<br />

system. As a result, kswapd would never start.<br />

Once the system gets into such a state, the<br />

local memory on those nodes can remain allocated<br />

entirely to page-cache pages for very<br />

long stretches of time since as far as the kernel<br />

is concerned there is no memory “pressure”.<br />

To get around this problem, particularly<br />

for benchmarking studies, users have often<br />

resorted to programs that allocate and touch<br />

all of the memory on the system, thus causing<br />

kswapd to wake up and free unneeded buffer<br />

cache pages.<br />

We have dealt with this problem in a number<br />

of ways, but the first approach was to change<br />

page_cache_alloc() so that instead<br />

of allocating the page on the local node, we<br />

spread allocations across all nodes in the<br />

system. To do this, we added a new GFP<br />

flag: GFP_ROUND_ROBIN and a new procedure:<br />

alloc_pages_round_robin().<br />

alloc_pages_round_robin() maintains<br />

a counter in per-CPU storage; the<br />

counter is incremented on each call to<br />

page_cache_alloc(). <strong>The</strong> value of the<br />

counter, modulo the number of nodes in<br />

the system, is used to select the zonelist<br />

passed to __alloc_pages(). Like other<br />

NUMA implementations, in <strong>Linux</strong> for Altix<br />

there is a zonelist for each node, and the<br />

zonelists are sorted in nearest neighbor<br />

order with the zone for the local node as the<br />

first entry of the zonelist. <strong>The</strong> result is that<br />

each time page_cache_alloc() is called,<br />

the returned page is allocated on the next node<br />

in sequence, or as close as possible to that<br />

node.<br />
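
The placement decision reduces to a per-CPU counter taken modulo the node count, with the result selecting a nearest-first zonelist; a minimal sketch, with illustrative names and sizes:<br />

```c
#define NR_NODES 8
#define NR_CPUS 64

/* One counter per CPU, so the hot increment never bounces between
   processors; counter % NR_NODES picks the zonelist (which is sorted
   nearest-neighbor-first starting at the chosen node). */
static unsigned rr_counter[NR_CPUS];

/* Called on each page-cache allocation from `cpu`; returns the node
   whose zonelist should be passed to the page allocator. */
int pick_node(int cpu)
{
    return rr_counter[cpu]++ % NR_NODES;
}
```

Successive allocations from one CPU therefore walk the nodes in sequence, spreading the page-cache "tax" evenly.<br />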

<strong>The</strong> rationale for allocating page-cache pages<br />

in this way is that while pages are local resources,<br />

the page cache is a global resource, usable<br />

by all processes on the system. Thus, even<br />

if a process is bound to a particular node, in<br />

general it does not make sense to allocate page-cache<br />

pages just on that node, since some other<br />

process in the system may be reading that same<br />

file and hence sharing the pages. So instead of<br />

flooding the current node with the page-cache<br />

pages for files that processes on that node have<br />

opened, we “tax” every node in the system with<br />

a fraction of the page-cache pages. In this<br />

way, we try to conserve a scarce resource (local<br />

memory) by spreading page-cache allocations<br />

over all nodes in the system.<br />

However, even this step was not enough to keep<br />

local storage usage balanced among nodes in<br />

the system. After reading a 10 GB file, for<br />

example, we found that the node where the<br />

reading process was running would have up to<br />

40,000 pages more storage allocated than other<br />

nodes in the system. It turned out the reason for<br />

this was that buffer heads for the read operation<br />

were being allocated locally. To solve this<br />

problem in our <strong>Linux</strong> 2.4.21 kernel for Altix,<br />

we modified kmem_cache_grow() so that<br />

it would pass the GFP_ROUND_ROBIN flag to<br />

kmem_getpages() with the result that the<br />

slab caches on our systems are now also allocated<br />

out of round-robin storage. Of course,<br />

this is not a perfect solution, since there are situations<br />

where it makes perfect sense to allocate<br />

a slab cache entry locally; but this was an expedient<br />

solution appropriate for our product. For<br />

<strong>Linux</strong> 2.6 for Altix we would like to see the<br />

slab allocator be made NUMA aware. (Manfred<br />

Spraul has created some patches to do this<br />

and we are currently evaluating these changes.)<br />

<strong>The</strong> previous two changes solved many of the<br />

cases where a local storage could be exhausted<br />

by allocation of page-cache pages. However,<br />

they still did not solve the problem of local allocations<br />

spilling off node, particularly in those<br />

cases where storage allocation was tight across<br />

the entire system. In such situations, the system<br />

would often start running the synchronous<br />

swapping code even though most (if not all) of<br />

the page-cache pages on the system were clean<br />

and unreferenced outside of the page-cache.<br />

With the very-large memory sizes typical of<br />

our larger Altix customers, entering the synchronous<br />

swapping code needs to be avoided<br />

if at all possible since this tends to freeze the<br />

system for 10s of seconds. Additionally, the<br />

round robin allocation fixes did not solve the<br />

problem of poor and unrepeatable performance<br />

on benchmarks due to the existence of significant<br />

amounts of page-cache storage left over<br />

from previous executions.<br />

To solve these problems, we introduced a routine<br />

called toss_buffer_cache_pages_<br />

node() (referred to here as toss(), for<br />

brevity). In a related change, we made the<br />

active and inactive lists per node rather than<br />

global. toss() first scans the inactive list<br />

(on a particular node) looking for idle page-cache<br />

pages to release back to the free page<br />

pool. If not enough such pages are found<br />

on the inactive list, then the active list is<br />

also scanned. Finally, if toss() has not<br />

called shrink_slab_caches() recently,<br />

that routine is also invoked in order to more<br />

aggressively free unused slab-cache entries.<br />

toss() was patterned after the main loop<br />

of shrink_caches() except that it would<br />

never call swap_out(), and if it encountered<br />

a page that did not look easily freeable, it<br />

would just skip that page and go on to the next<br />

page.<br />

A call to toss() was added in __alloc_<br />

pages() in such a way that if allocation on<br />

the current node fails, then before trying to allocate<br />

from some other node (i.e., spilling<br />

to another node), the system will first see if<br />

it can free enough page-cache pages from the<br />

current node so that the current node allocation can succeed. In subsequent allocation<br />

passes, toss() is also called to free page-cache<br />

pages on nodes other than the current<br />

one. <strong>The</strong> result of this change is that clean<br />

page-cache pages are effectively treated as free<br />

memory by the page allocator.<br />

At the same time that the toss() code<br />

was added, we added a new user command<br />

bcfree that could be used to free all<br />

idle page-cache pages. (On the __alloc_<br />

pages() path, toss() would only try to<br />

free 32 pages per node.) <strong>The</strong> bcfree command<br />

was intended to be used only for resetting<br />

the state of the page cache before running a<br />

benchmark, and in lieu of rebooting the system<br />

in order to get a clean system state. However,<br />

our customers found that this command could<br />

be used to reduce the size of the page cache<br />

and to avoid situations where large amounts<br />

of buffered-file I/O could force the system to<br />

begin swapping. Since bcfree kills the entire<br />

page-cache, however, this was regarded<br />

as a substandard solution that could also hurt<br />

read performance of cached data and we began<br />

looking for another way to solve this “BIGIO”<br />

problem.<br />

Just to be specific, the BIGIO problem we were<br />

trying to solve was based on the behavior of our<br />

<strong>Linux</strong> 2.4.21 kernel for Altix. A customer reported<br />

that on a 256 GB Altix system, if 200<br />

GB were allocated and 50 GB free, that if the<br />

user program then tried to write 100 GB of data<br />

out to disk, the system would start to swap,<br />

and then in many cases fill up the swap space.<br />

At that point our out-of-memory (OOM) killer<br />

would wake up and kill the user program! (See<br />

the next section for discussion of our OOM<br />

killer changes.)<br />

Initially we were able to work around this<br />

problem by increasing the amount of swap<br />

space on the system. Our experiments showed<br />

that with an amount of swap space equal to<br />

one-quarter the main memory size, the 256 GB<br />

example discussed above would continue to<br />

completion without the OOM killer being invoked.<br />

I/O performance during this phase was<br />

typically one-half of what the hardware could<br />

deliver, since two I/O operations often had to<br />

be completed: one to read the data in from<br />

the swap device, and one to write the data to<br />

the output file. Additionally, while the swap<br />

scan was active, the system was very sluggish.<br />

<strong>The</strong>se problems led us to search for another solution.<br />

Eventually what we developed is an aggressive<br />

method of trimming the page cache when it<br />

started to grow too big. This solution involved<br />

several steps:<br />

(1) We first added a new page list, the<br />

reclaim_list. This increased the size of<br />

struct page by another 16 bytes. On our<br />

system, struct page is allocated on cache-aligned<br />

boundaries anyway, so this really did<br />

not cause an increase in storage, since the current<br />

struct page size was less than 112<br />

bytes. Pages were added to the reclaim list<br />

when they were inserted into the page cache.<br />

<strong>The</strong> reclaim list is per node, with per node<br />

locking. Pages were removed from the reclaim<br />

list when they were no longer reclaimable; that<br />

is, they were removed from the reclaim list<br />

when they were marked as dirty due to buffer<br />

file-I/O or when they were mapped into an address<br />

space.<br />

(2) We rewrote toss() to scan the reclaim list<br />

instead of the inactive and active lists. Herein<br />

we will refer to the new version of toss() as<br />

toss_fast().<br />

(3) We introduced a variant of page_cache_<br />

alloc() called page_cache_alloc_<br />

limited(). Associated with this new<br />

routine were two control variables settable<br />

via sysctl(): page_cache_limit and<br />

page_cache_limit_threshold.<br />

(4) We modified the generic_file_<br />

write() path to call page_cache_<br />

alloc_limited() instead of page_<br />

cache_alloc(). page_cache_alloc_<br />

limited() examines the size of the page<br />

cache. If the total amount of free memory<br />

in the system is less than page_cache_<br />

limit_threshold and the size of the page<br />

cache is larger than page_cache_limit,<br />

then page_cache_alloc_limited()<br />

calls page_cache_reduce() to free<br />

enough page-cache pages on the system to<br />

bring the page cache size down below page_<br />

cache_limit. If this succeeds, then page_<br />

cache_alloc_limited() calls page_<br />

cache_alloc to allocate the page. If not,<br />

then we wakeup bdflush and the current<br />

thread is put to sleep for 30ms (a tunable<br />

parameter).<br />
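
The decision logic of steps (3) and (4) can be sketched as a pure function; the parameters stand in for the sysctl variables and trimming outcome, and the names are illustrative:<br />

```c
/* Only when free memory is scarce AND the page cache is over its limit
   do we try to trim; if trimming fails, the writer is throttled (wake
   bdflush, sleep) rather than letting the system start swapping. */
enum action { ALLOC, TRIM_THEN_ALLOC, THROTTLE };

enum action limited_alloc(long free_pages, long cache_pages,
                          long limit_threshold, long limit,
                          int trim_succeeded)
{
    if (free_pages >= limit_threshold || cache_pages <= limit)
        return ALLOC;               /* plenty of room: no trimming     */
    if (trim_succeeded)
        return TRIM_THEN_ALLOC;     /* cache trimmed below the limit   */
    return THROTTLE;                /* wake bdflush, sleep ~30 ms      */
}
```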

<strong>The</strong> rationale for the reclaim_list and<br />

toss_fast() was that when we needed to<br />

trim the page cache, practically all pages in<br />

the system would typically be on the inactive<br />

list. <strong>The</strong> existing toss() routine scanned<br />

the inactive list and thus was too slow to call<br />

from generic_file_write(). Moreover,<br />

most of the pages on the inactive list were<br />

not reclaimable anyway. Most of the pages<br />

on the reclaim_list are reclaimable. As<br />

a result toss_fast() runs much faster and<br />

is more efficient at releasing idle page-cache<br />

pages than the old routine.<br />

<strong>The</strong> rationale for the page_cache_limit_<br />

threshold in addition to the page_<br />

cache_limit is that if there is lots of free<br />

memory then there is no reason to trim the page<br />

cache. <strong>One</strong> might think that because we only<br />

trim the page cache on the file write path,<br />

this approach would still let the page cache<br />

grow arbitrarily due to file reads. Unfortunately,<br />

this is not the case, since the <strong>Linux</strong> kernel<br />

in normal multiuser operation is constantly<br />

writing something to the disk. So, a page cache<br />

limit enforced at file write time is also an effective<br />

limit on the size of the page cache due to<br />

file reads.<br />

Finally, the rationale for delaying the calling<br />

task when page_cache_reduce() fails is<br />

that we do not want the system to start swapping<br />

to make space for new buffered I/O pages,<br />

since that will reduce I/O bandwidth by as<br />

much as one-half anyway, as well as take a lot<br />

of CPU time to figure out which pages to swap<br />

out. So it is better to reduce the I/O bandwidth<br />

directly, by limiting the rate of requested I/O,<br />

instead of allowing that I/O to proceed at a rate<br />

that causes the system to be overrun by page-cache<br />

pages.<br />

Thus far, we have had good experience with<br />

this algorithm. File I/O rates are not substantially<br />

reduced from what the hardware can provide,<br />

the system does not start swapping, and<br />

the system remains responsive and usable during<br />

the period of time when the BIGIO is running.<br />

Of course, this entire discussion is specific to<br />

<strong>Linux</strong> 2.4.21. For <strong>Linux</strong> 2.6, we have plans to<br />

evaluate whether this is a problem in the system<br />

at all. In particular, we want to see whether<br />

setting vm_swappiness to<br />

zero can eliminate the “BIGIO causes swapping”<br />

problem. We also are interested in evaluating<br />

the recent set of VM patches that Nick<br />

Piggin [6] has assembled to see if they eliminate<br />

this problem for systems of the size of a<br />

large Altix.<br />

VM and Memory Allocation Fixes<br />

In addition to the page-cache changes described<br />

in the last section, we have made a<br />

number of smaller changes related to virtual<br />

memory and paging performance.<br />

<strong>One</strong> set of such changes increased the parallelism<br />

of page-fault handling for anonymous<br />

pages in multi-threaded applications. <strong>The</strong>se<br />

applications allocate space using routines that<br />

eventually call mmap(); the result is that<br />

when the application touches the data area for<br />

the first time, it causes a minor page fault.<br />

<strong>The</strong>se faults are serviced while holding the<br />

address space’s page_table_lock. If the<br />

address space is large and there are a large<br />

number of threads executing in the address<br />

space, this spinlock can be an initialization-time<br />

bottleneck for the application. Examination<br />

of the handle_mm_fault() path for<br />

this case shows that the page_table_lock<br />

is acquired unconditionally but then released as<br />

soon as we have determined that this is a not-present<br />

fault for an anonymous page. So, we<br />

reordered the code checks in handle_mm_<br />

fault() to determine in advance whether or<br />

not this was the case we were in, and if so, to<br />

skip acquiring the lock altogether.<br />

<strong>The</strong> second place the page_table_lock<br />

was used on this path was in<br />

do_anonymous_page(). Here, the<br />

lock was re-acquired to make sure that the<br />

process of allocating a page frame and filling<br />

in the pte is atomic. On Itanium, stores to<br />

page-table entries are normal stores (that is,<br />

the set_pte macro evaluates to a simple<br />

store). Thus, we can use cmpxchg to update<br />

the pte and make sure that only one thread<br />

allocates the page and fills in the pte. <strong>The</strong><br />

compare and exchange effectively lets us lock<br />

on each individual pte. So, for Altix, we<br />

have been able to completely eliminate the<br />

page_table_lock from this particular<br />

page-fault path.<br />
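
The per-pte locking idea can be sketched with a C11 compare-and-swap; the types and names here are illustrative stand-ins for the kernel's pte handling, not the Altix patch itself:<br />

```c
#include <stdatomic.h>

/* Racing threads each allocate a candidate page frame, then a single
   compare-and-swap on the not-present pte decides the winner; losers
   free their frame and use the winner's.  The cmpxchg acts as a lock
   on that one pte, so no page_table_lock is needed on this path. */
typedef _Atomic unsigned long pte_t;

/* Returns 1 if this caller installed its page, 0 if another thread won. */
int install_pte(pte_t *pte, unsigned long new_page)
{
    unsigned long expected = 0;   /* pte_none: mapping not yet present */
    return atomic_compare_exchange_strong(pte, &expected, new_page);
}
```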

<strong>The</strong> performance improvement from this<br />

change is shown in Figure 1. Here we show the<br />

time required to initially touch 96 GB of data.<br />

As additional processors are added to the problem,<br />

the time required for both the baseline<br />

<strong>Linux</strong> and <strong>Linux</strong> for Altix versions decreases<br />

until around 16 processors. At that point the<br />

page_table_lock starts to become a significant<br />

bottleneck. For the largest number of<br />

processors, even the time for the <strong>Linux</strong> for Altix<br />

case is starting to increase again. We believe<br />

that this is due to contention for the address<br />

space’s mmap semaphore.<br />

[Figure: plot of time to touch data (seconds) versus number of processors (1 to 100, log scale), comparing baseline 2.4 against <strong>Linux</strong> 2.4 for Altix.]<br />

Figure 1: Time to initially touch 96 GB of data.<br />

This is particularly important for HPC applications<br />

since OpenMP[5], a common parallel<br />

programming model for FORTRAN, is implemented<br />

using a single-address-space, multiple-thread<br />

programming model. <strong>The</strong> optimization<br />

described here is one of the reasons that Altix<br />

has recently set new performance records<br />

for the SPEC® SPEComp® L2001 benchmark<br />

[7].<br />

While the above measurements were taken using<br />

<strong>Linux</strong> 2.4.21 for Altix, a similar problem<br />

exists in <strong>Linux</strong> 2.6. For many other architectures,<br />

this same kind of change can be made;<br />

i386 is one of the exceptions to this statement.<br />

We are planning on porting our <strong>Linux</strong> 2.4.21<br />

based changes to <strong>Linux</strong> 2.6 and submitting the<br />

changes to the <strong>Linux</strong> community for inclusion<br />

in <strong>Linux</strong> 2.6. This may require moving part<br />

of do_anonymous_page() to architecture-dependent<br />

code to allow for the fact that not<br />

all architectures can use the compare and exchange<br />

approach to eliminate the use of the<br />

page_table_lock in do_anonymous_<br />

page(). However, the performance improvement<br />

shown in Figure 1 is significant for Altix<br />

so we would like to explore some<br />

way of incorporating this code into the mainline<br />

kernel.<br />

We have encountered similar scalability limitations<br />

for other kinds of page-fault behavior.<br />

Figure 2 shows the number of page faults<br />

per second of wall clock time measured for<br />

multiple processes running simultaneously and<br />

faulting in a 1 GB /dev/zero mapping. Unlike<br />

the previous case described here, in this<br />

case each process has its own private mapping.<br />

(Here the number of processes is equal to the<br />

number of CPUs.) <strong>The</strong> dramatic difference between<br />

the baseline 2.4 and 2.6 cases and <strong>Linux</strong><br />

for Altix is due to elimination of a lock in the<br />

superblock for /dev/zero.<br />

[Figure: plot of page faults/second (wall clock) versus CPUs (1 to 100, log scale), comparing 2.4 baseline, <strong>Linux</strong> 2.4 for Altix, and 2.6 baseline.]<br />

Figure 2: Page Faults per Second of Wall Clock Time.<br />

<strong>The</strong> lock in the superblock protects two<br />

counts: <strong>One</strong> count limits the maximum number<br />

of /dev/zero mappings to 2<sup>63</sup>; the second<br />

count limits the number of pages assigned<br />

to a /dev/zero mapping to 2<sup>63</sup>. Neither<br />

one of these counts is particularly useful for<br />

a /dev/zero mapping. We eliminated this<br />

lock and obtained a dramatic performance improvement<br />

for this micro-benchmark (at 512<br />

CPUs the improvement was in excess of 800x).<br />

This optimization is important in decreasing<br />

startup time for large message-passing applications<br />

on the Altix system.<br />

A related change is to distribute the count of<br />

pages in the page cache from a single global<br />

variable to a per node variable. Because every<br />

processor in the system needs to update<br />

the page-cache count when adding or removing<br />

pages from the page cache, contention for<br />

the cache line containing this global variable<br />

becomes significant. We changed this global<br />

count to a per-node count. When a page is inserted<br />

into (or removed from) the page cache,<br />

we update the page cache-count on the same<br />

node as the page itself. When we need the<br />

total number of pages in the page cache (for<br />

example if someone reads /proc/meminfo)<br />

we run a loop that sums the per node counts.<br />

However, since the latter operation is much less<br />

frequent than insertions and deletions from the<br />

page cache, this optimization is an overall performance<br />

improvement.<br />
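
A sketch of the per-node counter scheme; the node count and names are illustrative:<br />

```c
#define NR_NODES 8

/* Per-node page-cache counters: insertion and removal touch only the
   counter on the page's own node, so the hot paths no longer fight
   over one global cache line.  The rare /proc/meminfo-style read sums
   all the per-node parts. */
static long pagecache_count[NR_NODES];

void page_added(int node)   { pagecache_count[node]++; }
void page_removed(int node) { pagecache_count[node]--; }

/* Infrequent read path: total pages currently in the page cache. */
long total_pagecache(void)
{
    long sum = 0;
    for (int n = 0; n < NR_NODES; n++)
        sum += pagecache_count[n];
    return sum;
}
```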

Another change we have made in the VM<br />

subsystem is in the out-of-memory (OOM)<br />

killer for Altix. In <strong>Linux</strong> 2.4.21, the<br />

OOM killer is called from the top of<br />

memory-free and swap-out call chain. oom_<br />

kill() is called from try_to_free_<br />

pages_zone() when calls to shrink_<br />

caches() at memory priority levels 6<br />

through 0 have all failed. Inside oom_kill()<br />

a number of checks are performed, and if any<br />

of these checks succeed, the system is declared<br />

to not be out-of-memory. <strong>One</strong> of those checks<br />

is “if it has been more than 5 seconds since oom_kill() was last called, then we are not OOM.”<br />

144 • <strong>Linux</strong> Symposium 2004 • Volume <strong>One</strong><br />

On a large-memory Altix system, it can<br />

easily take much longer than that to complete<br />

the necessary calls to shrink_caches().<br />

<strong>The</strong> result is that an Altix system never goes<br />

OOM in spite of the fact that swap space is full<br />

and there is no memory to be allocated.<br />

It seemed to us that part of the problem here<br />

is the amount of time it can take for a swap-full condition (readily detectable in try_to_swap_out()) to bubble all the way up to the top level in try_to_free_pages_zone(), especially on a large-memory machine.<br />

To solve this problem on Altix, we<br />

decided to drive the OOM killer directly off<br />

of detection of swap-space-full condition provided<br />

that the system also continues to try to<br />

swap out additional pages. A count of the<br />

number of successful swaps and unsuccessful<br />

swap attempts is maintained in try_to_swap_out(). If, in a 10-second interval, the number of successful swap-outs is less than one percent of the number of attempted swap-outs, and the total number of swap-out attempts exceeds a specified threshold, then try_to_swap_out() will directly wake the OOM killer thread (also new in our implementation).<br />

This thread will wait another 10 seconds, and<br />

if the out-of-swap condition persists, it will invoke<br />

oom_kill() to select a victim and kill<br />

it. <strong>The</strong> OOM killer thread will repeat this sleep<br />

and kill cycle until it appears that swap space<br />

is no longer full or the number of attempts to<br />

swap out new pages (since the thread went to<br />

sleep) falls below the threshold.<br />

In our experience, this has made invocation of<br />

the OOM killer much more reliable than it was<br />

before, at least on Altix. Once again, this implementation<br />

was for Linux 2.4.21; we are currently evaluating this problem and the associated fix on Linux 2.6.<br />

Another fix we have made to the VM system<br />

in <strong>Linux</strong> 2.4.21 for Altix is in handling<br />

of HUGETLB pages. <strong>The</strong> existing implementation<br />

in <strong>Linux</strong> 2.4.21 allocates HUGETLB<br />

pages to an address space at mmap() time (see<br />

hugetlb_prefault()); it also zeroes the<br />

pages at this time. This processing is done by<br />

the thread that makes the mmap() call. In<br />

particular, this means that zeroing of the allocated<br />

HUGETLB pages is done by a single<br />

processor. On a machine with 4 TB of<br />

memory and with as much memory allocated<br />

to HUGETLB pages as possible, our measurements<br />

have shown that it can take as long as<br />

5,000 seconds to allocate and zero all available<br />

HUGETLB pages. Worse yet, the thread that<br />

does this operation holds the address space’s<br />

mmap_sem and the page_table_lock for<br />

the entire 5,000 seconds. Unfortunately, many<br />

commands that query system state (such as ps<br />

and w) also wish to acquire one of these locks.<br />

<strong>The</strong> result is that the system appears to be hung<br />

for the entire 5,000 seconds.<br />

We solved this problem on Altix by changing<br />

the implementation of HUGETLB page allocation<br />

from prefault to allocate on fault. Many<br />

others have created similar patches; our patch<br />

was unique in that it also allowed zeroing of<br />

pages to occur in parallel if the HUGETLB<br />

page faults occurred on different processors.<br />

This was crucial to allow a large HUGETLB<br />

page region to be faulted into an address space<br />

in parallel, using as many processors as possible.<br />

For example, we have observed speedups<br />

of 25x using 16 processors to touch O(100 GB)<br />

of HUGETLB pages. (The speedup is superlinear because if you use just one processor<br />

it has to zero many remote pages, whereas if<br />

you use more processors, at least some of the<br />

pages you are zeroing are local or on nearby<br />

nodes.) Assuming we can achieve the same<br />

kind of speedup on a 4 TB system, we would<br />

reduce the 5,000 second time stated above to<br />

200 seconds.<br />

Recently, we have worked with Kenneth Chen



to get a similar set of changes proposed for<br />

<strong>Linux</strong> 2.6 [3]. Once this set of changes is accepted<br />

into the mainline this particular problem<br />

will be solved for <strong>Linux</strong> 2.6. <strong>The</strong>se changes are<br />

also necessary for Andi Kleen’s NUMA placement<br />

algorithms [4] to apply to HUGETLB<br />

pages, since otherwise pages are placed at<br />

hugetlb_prefault() time.<br />

A final set of changes is related to large kernel<br />

tables. As previously mentioned, on an Altix<br />

system with 512 processors, less than 0.4% of<br />

the available memory is local. Certain tables in<br />

the <strong>Linux</strong> kernel are sized to be on the order of<br />

one percent of available memory. (An example<br />

of this is the TCP/IP hash table.) Allocating<br />

a table of this size can use all of the local<br />

memory on a node, resulting in exactly the kind<br />

of storage-allocation imbalance we developed<br />

the page-cache changes to solve. To avoid this<br />

problem, we also implement round-robin allocation<br />

of these large tables. Our current technique<br />

uses vm_alloc() to do this. Unfortunately,<br />

this is not portable across all architectures,<br />

since certain architectures have limited<br />

amounts of space that can be allocated by<br />

vm_alloc(). Nonetheless, this is a change<br />

that we need to make; we are still exploring<br />

ways of making this change acceptable to the<br />

<strong>Linux</strong> community.<br />

Once we have solved the initial allocation<br />

problem for these tables, there is still the problem<br />

of getting them appropriately sized for an<br />

Altix system. Clearly if there are 4 TB of main<br />

memory, it does not make much sense to allocate<br />

a TCP/IP hash table of 40 GB, particularly<br />

since the TCP/IP traffic into an Altix system<br />

does not increase with memory size the way<br />

one might expect it to scale with a traditional<br />

<strong>Linux</strong> server. We have seen cases where system<br />

performance is significantly hampered due<br />

to lookups in these overly large tables. At the<br />

moment, we are still exploring a solution acceptable<br />

to the community to solve this particular<br />

problem.<br />

I/O Changes for Altix<br />

<strong>One</strong> of the design goals for the Altix system<br />

is that it support standard PCI devices and<br />

their associated <strong>Linux</strong> drivers as much as possible.<br />

In this section we discuss the performance<br />

improvements built into the Altix hardware<br />

and supported through new driver interfaces<br />

in <strong>Linux</strong> that help us to meet this goal<br />

with excellent performance even on very large<br />

Altix systems.<br />

According to the PCI specification, DMA<br />

writes and PIO read responses are strongly ordered.<br />

On large NUMA systems, however,<br />

DMA writes can take a long time to complete.<br />

Since most PIO reads do not imply completion<br />

of a previous DMA write, relaxing the ordering<br />

rules of DMA writes and PIO read responses<br />

can greatly improve system performance.<br />

Another large system issue relates to initiating<br />

PIO writes from multiple CPUs. PIO writes<br />

from two different CPUs may arrive out of order<br />

at a device. <strong>The</strong> usual way to ensure ordering<br />

is through a combination of locking and a<br />

PIO read (see Documentation/io_ordering.txt).<br />

On large systems, however, doing this read can<br />

be very expensive, particularly if it must be ordered<br />

with respect to unrelated DMA writes.<br />

Finally, the NUMA nature of large machines<br />

make some optimizations obvious and desirable.<br />

Many devices use so-called consistent<br />

system memory for retrieving commands<br />

and storing status information; allocating that<br />

memory close to its associated device makes<br />

sense.<br />

Making non–dependent PIO reads fast<br />

In its I/O chipsets, SGI chose to relax the ordering<br />

between DMAs and PIOs, instead adding



a barrier attribute to certain DMA writes (to<br />

consistent PCI allocations on Altix) and to interrupts.<br />

This works well with controllers that<br />

use DMA writes to indicate command completions<br />

(for example a SCSI controller with a<br />

response queue, where the response queue is<br />

allocated using pci_alloc_consistent,<br />

so that writes to the response queue have the<br />

barrier attribute). When we ported <strong>Linux</strong> to<br />

Altix, this behavior became a problem, because<br />

many <strong>Linux</strong> PCI drivers use PIO read responses<br />

to imply a status of a DMA write. For<br />

example, on an IDE controller, a status register bit read is performed to find out if a command<br />

is complete (command complete status implies<br />

that DMA writes of that command’s data are<br />

completed). As a result, SGI had to implement<br />

a rather heavyweight mechanism to guarantee<br />

ordering of DMA writes and PIO reads. This<br />

mechanism involves doing an explicit flush of<br />

DMA write data after each PIO read.<br />

For the cases in which strong ordering of PIO read responses and DMA writes is not necessary, a new API was needed so that drivers could communicate that a given PIO read response could use relaxed ordering with respect to prior DMA writes. The read_relaxed API [8] was added early in the 2.6 series for this purpose, and mirrors the normal read routines, which have variants for various sized reads.<br />

<strong>The</strong> results below show how expensive a normal<br />

PIO read transaction can be, especially on<br />

a system doing a lot of I/O (and thus DMA).<br />

Type of PIO         Time (ns)<br />
normal PIO read          3875<br />
relaxed PIO read         1299<br />

Table 1: Normal vs. relaxed PIO reads on an idle system<br />

Type of PIO         Time (ns)<br />
normal PIO read          4889<br />
relaxed PIO read         1646<br />

Table 2: Normal vs. relaxed PIO reads on a busy system<br />

It remains to be seen whether this API will also apply to the newly added RO bit in the PCI-X specification—the author is hopeful! Either way, it does give hardware vendors who want to support Linux some additional flexibility in their design.<br />

Ordering posted writes efficiently<br />

On many platforms, PIO writes from different<br />

CPUs will not necessarily arrive in order (i.e.,<br />

they may be intermixed) even when locking is<br />

used. Since the platform has no way of knowing<br />

whether a given PIO read depends on preceding<br />

writes, it has to guarantee that all writes<br />

have completed before allowing a read transaction<br />

to complete. So performing a read prior<br />

to releasing a lock protecting a region doing<br />

writes is sufficient to guarantee that the writes<br />

arrive in the correct order.<br />

However, performing PIO reads can be an expensive<br />

operation, especially if the device is on<br />

a distant node. SGI chipset designers foresaw<br />

this problem, however, and provided a way to<br />

ensure ordering by simply reading a register<br />

from the chipset on the local node. When the<br />

register indicates that all PIO writes are complete,<br />

it means they have arrived at the chipset<br />

attached to the device, and so are guaranteed<br />

to arrive at the device in the intended order.<br />

<strong>The</strong> SGI sn2 specific portion of the <strong>Linux</strong> ia64<br />

port (sn2 is the architecture name for Altix in<br />

the <strong>Linux</strong> kernel source tree) provides a small<br />

function, sn_mmiob() (for memory-mapped<br />

I/O barrier, analogous to the mb() macro), to<br />

do just that. It can be used in place of reads<br />

that are intended to deal with posted writes and<br />

provides some benefit:



Type of flush                 Time (ns)<br />
regular PIO read                   5940<br />
relaxed PIO read                   2619<br />
sn_mmiob()                         1610<br />
(local chipset read alone)          399<br />

Table 3: Normal vs. fast flushing of 5 PIO writes<br />

Adding this API to Linux (i.e., making it non-sn2-specific)<br />

was discussed some time ago [9],<br />

and may need to be raised again, since it does<br />

appear to be useful on Altix, and is probably<br />

similarly useful on other platforms.<br />

Local allocation of consistent DMA mappings<br />

Consistent DMA mappings are used frequently<br />

by drivers to store command and status buffers.<br />

<strong>The</strong>y are frequently read and written by the<br />

device that owns them, so making sure they<br />

can be accessed quickly is important. <strong>The</strong> table<br />

below shows the difference in the number<br />

of operations per second that can be<br />

achieved using local versus remote allocation<br />

of consistent DMA buffers. Local allocations<br />

were guaranteed by changing the pci_<br />

alloc_consistent function so that it calls<br />

alloc_pages_node using the node closest<br />

to the PCI device in question.<br />

Type                      I/Os per second<br />
Local consistent buffer             46231<br />
Remote consistent buffer            41295<br />

Table 4: Local vs. remote DMA buffer allocation<br />

Although this change is platform specific, it<br />

can be made generic if a pci_to_node or<br />

pci_to_nodemask routine is added to the<br />

<strong>Linux</strong> topology API.<br />

Concluding Remarks<br />

Today, our <strong>Linux</strong> 2.4.21 kernel for Altix provides<br />

a productive platform for our high-performance-computing users who desire to<br />

exploit the features of the SGI Altix 3000 hardware.<br />

To achieve this goal, we have made a<br />

number of changes to our <strong>Linux</strong> for Altix kernel.<br />

We are now in the process of either moving<br />

those changes forward to <strong>Linux</strong> 2.6 for Altix,<br />

or of evaluating the <strong>Linux</strong> 2.6 kernel on Altix<br />

in order to determine if these changes are indeed<br />

needed at all. Our goal is to develop a<br />

version of the <strong>Linux</strong> 2.6 kernel for Altix that<br />

not only supports our HPC customers as well as our existing Linux 2.4.21 kernel does, but also consists as much as possible of community-supported code.<br />

References<br />

[1] Ray Bryant and John Hawkes, <strong>Linux</strong><br />

Scalability for Large NUMA Systems,<br />

Proceedings of the 2003 Ottawa <strong>Linux</strong><br />

Symposium, Ottawa, Ontario, Canada,<br />

(July 2003).<br />

[2] Daniel Lenoski, James Laudon, Truman<br />

Joe, David Nakahira, Luis Stevens,<br />

Anoop Gupta, and John Hennessy, The<br />

DASH prototype: Logic overhead and<br />

performance, IEEE Transactions on<br />

Parallel and Distributed Systems,<br />

4(1):41-61, January 1993.<br />

[3] Kenneth Chen, “hugetlb demand paging<br />

patch part [0/3],”<br />

linux-kernel@vger.kernel.org,<br />

2004-04-13 23:17:04,<br />

http://marc.theaimsgroup.<br />

com/?l=linux-kernel&m=<br />

108189860419356&w=2<br />

[4] Andi Kleen, “Patch: NUMA API for<br />

<strong>Linux</strong>,” linux-kernel@vger.kernel.org,



Tue, 6 Apr 2004 15:33:22 +0200,<br />

http:<br />

//lwn.net/Articles/79100/<br />

[5] http://www.openmp.org<br />

[6] Nick Piggin, “MM patches,”<br />

http://www.kerneltrap.org/<br />

~npiggin/nickvm-267r1m1.gz<br />

[7] http://www.spec.org/omp/<br />

results/ompl2001.html<br />

[8] http://linux.bkbits.net:<br />

8080/linux-2.5/cset%<br />

4040213ca0d3eIznHTPAR_<br />

kLCsMZI9VQ?nav=index.html|<br />

ChangeSet@-1d<br />

[9] http://www.cs.helsinki.fi/<br />

linux/linux-kernel/2002-01/<br />

1540.html<br />

© 2004 Silicon Graphics, Inc. Permission to redistribute<br />

in accordance with Ottawa <strong>Linux</strong> Symposium<br />

paper submission guidelines is granted; all<br />

other rights reserved. Silicon Graphics, SGI and<br />

Altix are registered trademarks and OpenMP is a<br />

trademark of Silicon Graphics, Inc., in the U.S.<br />

and/or other countries worldwide. <strong>Linux</strong> is a registered<br />

trademark of Linus Torvalds in several countries.<br />

Intel and Itanium are trademarks or registered<br />

trademarks of Intel Corporation or its subsidiaries<br />

in the United States and other countries. Red Hat<br />

and all Red Hat-based trademarks are trademarks<br />

or registered trademarks of Red Hat, Inc. in the<br />

United States and other countries. All other trademarks<br />

mentioned herein are the property of their<br />

respective owners.


Get More Device Drivers out of the <strong>Kernel</strong>!<br />

Peter Chubb ∗<br />

National ICT Australia<br />

and<br />

<strong>The</strong> University of New South Wales<br />

peterc@gelato.unsw.edu.au<br />

Abstract<br />

Now that <strong>Linux</strong> has fast system calls, good<br />

(and getting better) threading, and cheap context<br />

switches, it’s possible to write device<br />

drivers that live in user space for whole new<br />

classes of devices. Of course, some device<br />

drivers (Xfree, in particular) have always run<br />

in user space, with a little bit of kernel support.<br />

With a little bit more kernel support (a way to<br />

set up and tear down DMA safely, and a generalised<br />

way to be informed of and control interrupts)<br />

almost any PCI bus-mastering device<br />

could have a user-mode device driver.<br />

I shall talk about the benefits and drawbacks<br />

of device drivers being in user space or kernel<br />

space, and show that performance concerns<br />

are not really an issue—in fact, on some platforms,<br />

our user-mode IDE driver out-performs<br />

the in-kernel one. I shall also present profiling<br />

and benchmark results that show where time is<br />

spent in in-kernel and user-space drivers, and<br />

describe the infrastructure I’ve added to the<br />

<strong>Linux</strong> kernel to allow portable, efficient userspace<br />

drivers to be written.<br />

∗ This work was funded by HP, National ICT Australia,<br />

the ARC, and the University of NSW through the<br />

Gelato programme (http://www.gelato.unsw.<br />

edu.au)<br />

1 Introduction<br />

Normal device drivers in <strong>Linux</strong> run in the kernel’s<br />

address space with kernel privilege. This<br />

is not the only place they can run—see Figure<br />

1.<br />

Figure 1: Where a Device Driver can Live (a diagram plotting address space—kernel, own, or client—against privilege level—kernel or user—with points A through D marking the four driver configurations discussed in the text)<br />

Point A is the normal <strong>Linux</strong> device driver,<br />

linked with the kernel, running in the kernel<br />

address space with kernel privilege.<br />

Device drivers can also be linked directly with<br />

the applications that use them (Point B)—<br />

the so-called ‘in-process’ device drivers proposed<br />

by [Keedy, 1979]—or run in a separate<br />

process, and be talked to by an IPC mechanism<br />

(for example, an X server, point D).<br />

<strong>The</strong>y can also run with kernel privilege, but<br />

with a separate kernel address space (Point



C) (as in the Nooks system described by<br />

[Swift et al., 2002]).<br />

2 Motivation<br />

Traditionally, device drivers have been developed<br />

as part of the kernel source. As such, they<br />

have to be written in the C language, and they<br />

have to conform to the (rapidly changing) interfaces<br />

and conventions used by kernel code.<br />

Even though drivers can be written as modules<br />

(obviating the need to reboot to try out<br />

a new version of the driver¹), in-kernel driver<br />

code has access to all of kernel memory, and<br />

runs with privileges that give it access to all instructions<br />

(not just unprivileged ones) and to<br />

all I/O space. As such, bugs in drivers can easily<br />

cause kernel lockups or panics. And various<br />

studies (e.g., [Chou et al., 2001]) estimate that<br />

more than 85% of the bugs in an operating system<br />

are driver bugs.<br />

Device drivers that run as user code, however,<br />

can use any language, can be developed<br />

using any IDE, and can use whatever internal<br />

threading, memory management, etc., techniques<br />

are most appropriate. When the infrastructure<br />

for supporting user-mode drivers is adequate,<br />

the processes implementing the driver<br />

can be killed and restarted almost with impunity<br />

as far as the rest of the operating system<br />

goes.<br />

Drivers that run in the kernel have to be updated<br />

regularly to match in-kernel interface<br />

changes. Third party drivers are therefore usually<br />

shipped as source code (or with a compilable<br />

stub encapsulating the interface) that has<br />

to be compiled against the kernel the driver is<br />

to be installed into.<br />

This means that everyone who wants to run a third-party driver also has to have a toolchain and kernel source on his or her system, or obtain a binary for their own kernel from a trusted third party.<br />

¹ except that many drivers currently cannot be unloaded<br />

Drivers for uncommon devices (or devices that<br />

the mainline kernel developers do not use regularly)<br />

tend to lag behind. For example, in the<br />

2.6.6 kernel, there are 81 drivers known to be<br />

broken because they have not been updated to<br />

match the current APIs, and a number more<br />

that are still using APIs that have been deprecated.<br />

User/kernel interfaces tend to change much<br />

more slowly than in-kernel ones; thus a<br />

user-mode driver has much more chance of<br />

not needing to be changed when the kernel<br />

changes. Moreover, user mode drivers can be<br />

distributed under licences other than the GPL,<br />

which may make them more attractive to some<br />

people².<br />

User-mode drivers can be either closely or<br />

loosely coupled with the applications that use<br />

them. Two obvious examples are the X server<br />

(XFree86) which uses a socket to communicate<br />

with its clients and so has isolation from kernel<br />

and client address spaces and can be very<br />

complex; and the Myrinet drivers, which are<br />

usually linked into their clients to gain performance<br />

by eliminating context switch overhead<br />

on packet reception.<br />

<strong>The</strong> Nooks work [Swift et al., 2002] showed<br />

that by isolating drivers from the kernel address<br />

space, the most common programming<br />

errors could be made recoverable. In Nooks,<br />

drivers are insulated from the rest of the kernel<br />

by running each in a separate address space,<br />

and replacing the driver ↔ kernel interface<br />

with a new one that uses cross-domain procedure<br />

calls to replace any procedure calls in<br />

the ABI, and that creates shadow copies of any shared variables in the protected address space of the driver.<br />

² for example, the ongoing problems with the Nvidia graphics card driver could possibly be avoided.<br />

This approach provides isolation, but also has<br />

problems: as the driver model changes, there<br />

is quite a lot of wrapper code that has to be<br />

changed to accommodate the changed APIs.<br />

Also, the value of any shared variable is frozen<br />

for the duration of a driver ABI call. <strong>The</strong><br />

Nooks work is uniprocessor only; locking issues<br />

therefore have not yet been addressed.<br />

Windriver [Jungo, 2003] allows development<br />

of user mode device drivers. It loads a proprietary<br />

device module /dev/windrv6; user<br />

code can interact with this device to setup and<br />

teardown DMA, catch interrupts, etc.<br />

Even from user space, of course, it is possible<br />

to make your machine unusable. Device<br />

drivers have to be trusted to a certain extent to<br />

do what they are advertised to do; this means<br />

that they can program their devices, and possibly<br />

corrupt or spy on the data that they transfer<br />

between their devices and their clients. Moving<br />

a driver to user space does not change this.<br />

It does, however, make it less likely that a fault in a driver will affect anything other than its clients.<br />

3 Existing Support<br />

<strong>Linux</strong> has good support for user-mode drivers<br />

that do not need DMA or interrupt handling—<br />

see, e.g., [Nakatani, 2002].<br />

<strong>The</strong> ioperm() and iopl() system calls allow<br />

access to the first 65536 I/O ports; and,<br />

with a patch from Albert Cahalan³ one can<br />

map the appropriate parts of /proc/bus/pci/... to<br />

gain access to memory-mapped registers. Or<br />

on some architectures it is safe to mmap()<br />

/dev/mem.<br />

³ http://lkml.org/lkml/2003/7/13/258<br />

It is usually best to use MMIO if it is available,<br />

because on many 64-bit platforms there<br />

are more than 65536 ports (the PCI specification says that there are 2<sup>32</sup> ports available), and on many architectures the ports are emulated by mapping memory anyway.<br />

For particular devices—USB input devices,<br />

SCSI devices, devices that hang off the parallel<br />

port, and video drivers such as XFree86—<br />

there is explicit kernel support. By opening a<br />

file in /dev, a user-mode driver can talk through<br />

the USB hub, SCSI controller, AGP controller,<br />

etc., to the device. In addition, the input handler<br />

allows input events to be queued back into<br />

the kernel, to allow normal event handling to<br />

proceed.<br />

libpci allows access to the PCI configuration<br />

space, so that a driver can determine what interrupt,<br />

IO ports and memory locations are being<br />

used (and to determine whether the device<br />

is present or not).<br />

Other recent changes—an improved scheduler,<br />

better and faster thread creation and synchronisation,<br />

a fully preemptive kernel, and faster<br />

system calls—mean that it is possible to write<br />

a driver that operates in user space that is almost<br />

as fast as an in-kernel driver.<br />

4 Implementing the Missing Bits<br />

<strong>The</strong> parts that are missing are:<br />

1. the ability to claim a device from user<br />

space so that other drivers do not try to<br />

handle it;<br />

2. <strong>The</strong> ability to deliver an interrupt from a<br />

device to user space,<br />

3. <strong>The</strong> ability to set up and tear-down DMA<br />

between a device and some process’s<br />

memory, and



4. the ability to loop a device driver’s control<br />

and data interfaces into the appropriate<br />

part of the kernel (so that, for example,<br />

an IDE driver can appear as a standard<br />

block device), preferably without having<br />

to copy any payload data.<br />

<strong>The</strong> work at UNSW covers only PCI devices,<br />

as that is the only bus available on all of the<br />

architectures we have access to (IA64, X86,<br />

MIPS, PPC, alpha and arm).<br />

4.1 PCI interface<br />

Each device should have only a single driver.<br />

<strong>The</strong>refore one needs a way to associate a driver<br />

with a device, and to remove that association<br />

automatically when the driver exits. This has<br />

to be implemented in the kernel, as it is only<br />

the kernel that can be relied upon to clean up<br />

after a failed process. <strong>The</strong> simplest way to<br />

keep the association and to clean it up in <strong>Linux</strong><br />

is to implement a new filesystem, using the<br />

PCI namespace. Open files are automatically<br />

closed when a process exits, so cleanup also<br />

happens automatically.<br />

A new system call, usr_pci_open(int bus, int slot, int fn), returns a file descriptor. Internally, it calls pci_enable_device() and pci_set_master() to set up the PCI device after doing the standard filesystem boilerplate to set up a vnode and a struct file.<br />

Attempts to open an already-opened PCI device<br />

will fail with -EBUSY.<br />

When the file descriptor is finally closed, the<br />

PCI device is released, and any DMA mappings<br />

removed. All files are closed when a process<br />

dies, so if there is a bug in the driver that<br />

causes it to crash, the system recovers ready for<br />

the driver to be restarted.<br />

4.2 DMA handling<br />

On low-end systems, it’s common for the PCI<br />

bus to be connected directly to the memory<br />

bus, so setting up a DMA transfer means<br />

merely pinning the appropriate bit of memory<br />

(so that the VM system can neither swap it out<br />

nor relocate it) and then converting virtual addresses<br />

to physical addresses.<br />

<strong>The</strong>re are, in general, two kinds of DMA, and<br />

this has to be reflected in the kernel interface:<br />

1. Bi-directional DMA, for holding scatter-gather lists, etc., for communication with<br />

the device. Both the CPU and the device<br />

read and write to a shared memory area.<br />

Typically such memory is uncached, and<br />

on some architectures it has to be allocated<br />

from particular physical areas. This<br />

kind of mapping is called PCI-consistent;<br />

there is an internal kernel ABI function to<br />

allocate and deallocate appropriate memory.<br />

2. Streaming DMA, where, once the device<br />

has either read or written the area, it has<br />

no further immediate use for it.<br />

I implemented a new system call⁴, usr_pci_map(), that does one of three things:<br />

1. Allocates an area of memory suitable for a<br />

PCI-consistent mapping, and maps it into<br />

the current process’s address space; or<br />

2. Converts a region of the current process’s<br />

virtual address space into a scatterlist in<br />

terms of virtual addresses (one entry per<br />

page), pins the memory, and converts the<br />

⁴ Although multiplexing system calls are in general<br />

deprecated in Linux, they are extremely useful while developing,<br />

because it is not necessary to change every<br />

architecture-dependent entry.S when adding new functionality.


<strong>Linux</strong> Symposium 2004 • Volume <strong>One</strong> • 153<br />

scatterlist into a list of addresses suitable<br />

for DMA (by calling pci_map_sg(),<br />

which sets up the IOMMU if appropriate),<br />

or<br />

3. Undoes the mapping in point 2.<br />

<strong>The</strong> file descriptor returned from usr_pci_<br />

open() is an argument to usr_pci_<br />

map(). Mappings are tracked as part of the<br />

private data for that open file descriptor, so that<br />

they can be undone if the device is closed (or<br />

the driver dies).<br />
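A minimal sketch of that cleanup guarantee, in Python rather than the paper's kernel C, with all names (UsrPciDevice, map_region, and the device name) invented for illustration: mappings are recorded as part of the open device's private data, so closing it — including the implicit close when a crashed driver exits — undoes them all.<br />

```python
# Illustrative model (not the paper's kernel code) of the cleanup
# guarantee: DMA mappings are tracked per open device, so closing
# the device -- or the driver dying -- releases every mapping.
class UsrPciDevice:
    def __init__(self, name):
        self.name = name
        self.mappings = {}          # handle -> region, per-fd private data
        self.next_handle = 0
        self.open_flag = True

    def map_region(self, region):
        """Model usr_pci_map(): record the mapping so it can be undone."""
        assert self.open_flag, "device is closed"
        handle = self.next_handle
        self.next_handle += 1
        self.mappings[handle] = region
        return handle

    def unmap(self, handle):
        """Model the explicit 'undo the mapping' operation."""
        del self.mappings[handle]

    def close(self):
        """Closing the fd releases the device and all remaining mappings."""
        self.mappings.clear()       # the kernel would unmap each entry here
        self.open_flag = False

dev = UsrPciDevice("03:00.0")       # hypothetical device name
h = dev.map_region(("vaddr", 4096))
dev.close()                          # driver exit or crash
assert dev.mappings == {}            # no mappings leak
```

The point of the model is the invariant, not the mechanics: no path out of the driver, orderly or not, leaves a pinned mapping behind.<br />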

Underlying usr_pci_map() are the kernel<br />

routines pci_map_sg() and pci_unmap_<br />

sg(), and the kernel routine pci_alloc_<br />

consistent().<br />

Different PCI cards can address different<br />

amounts of DMA address space. In the kernel<br />

there is an interface to request that the DMA addresses<br />

supplied are within the range addressable<br />

by the card. <strong>The</strong> current implementation<br />

assumes 32-bit addressing, but it would be possible<br />

to provide an interface to allow the real<br />

capabilities of the device to be communicated<br />

to the kernel.<br />

4.2.1 <strong>The</strong> IOMMU<br />

Many modern architectures have an IO memory<br />

management unit (see Figure 2), to convert<br />

from physical to I/O bus addresses—in much<br />

the same way that the processor’s MMU converts<br />

virtual to physical addresses—allowing<br />

even thirty-two bit cards to do single-cycle<br />

DMA to anywhere in the sixty-four bit memory<br />

address space.<br />

On such systems, after the memory has been<br />

pinned, the IOMMU has to be set up to translate<br />

from bus to physical addresses; and then<br />

after the DMA is complete, the translation can<br />

be removed from the IOMMU.<br />

Figure 2: The IO MMU (devices on the PCI bus reach main memory through the IOMMU)<br />
<strong>The</strong> processor’s MMU also protects one virtual<br />

address space from another. Currently shipping<br />

IOMMU hardware does not do this: all<br />

mappings are visible to all PCI devices, and<br />

moreover for some physical addresses on some<br />

architectures the IOMMU is bypassed.<br />

For fully secure user-space drivers, one would<br />

want this capability to be turned off, and also<br />

to be able to associate a range of PCI bus addresses<br />

with a particular card, and disallow access<br />

by that card to other addresses. Only thus<br />

could one ensure that a card could perform<br />

DMA only into memory areas explicitly allocated<br />

to it.<br />

4.3 Interrupt Handling<br />

<strong>The</strong>re are essentially two ways that interrupts<br />

can be passed to user level.<br />

They can be mapped onto signals, and sent<br />

asynchronously, or a synchronous ‘wait-for-signal’<br />

mechanism can be used.<br />

A signal is a good intuitive match for what an<br />

interrupt is, but has other problems:<br />

1. <strong>One</strong> is fairly restricted in what one can do<br />

in a signal handler, so a driver will usually



have to take extra context switches to respond<br />

to an interrupt (into and out of the<br />

signal handler, and then perhaps the interrupt<br />

handler thread wakes up)<br />

2. Signals can be slow to deliver on busy systems,<br />

as they require the process table to<br />

be locked. It would be possible to short<br />

circuit this to some extent.<br />

3. <strong>One</strong> needs an extra mechanism for registering<br />

interest in an interrupt, and for tearing<br />

down the registration when the driver<br />

dies.<br />

For these reasons I decided to map interrupts<br />

onto file descriptors. /proc already has a directory<br />

for each interrupt (containing a file that<br />

can be written to in order to adjust interrupt routing to<br />

processors); I added a new file to each such directory.<br />

Suitably privileged processes can open<br />

and read these files. <strong>The</strong> files have open-once<br />

semantics; attempts to open them while they<br />

are open return −1 with EBUSY.<br />

When an interrupt occurs, the in-kernel interrupt<br />

handler masks just that interrupt in the interrupt<br />

controller, and then does an up() operation<br />

on a semaphore (well, actually, the implementation<br />

now uses a wait queue, but the<br />

effect is the same).<br />

When a process reads from the file, the kernel<br />

enables the interrupt, then calls down() on the<br />

semaphore, which blocks until an interrupt<br />

arrives.<br />

<strong>The</strong> actual data transferred is immaterial, and<br />

in fact none ever is transferred; the read()<br />

operation is used merely as a synchronisation<br />

mechanism.<br />

poll() is also implemented, so a user process<br />

is not forced into the ‘wait for interrupt’<br />

model that we use.<br />
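The read()-as-wait scheme can be imitated with ordinary pipes. This Python toy is an analogy only — real interrupts arrive via the per-IRQ /proc file, not a pipe, and no data is actually transferred there — but it shows the blocking behaviour:<br />

```python
# Toy simulation (ordinary pipes, not the real /proc interrupt files) of
# the synchronisation scheme: read() blocks until an "interrupt" arrives;
# the data itself is immaterial.
import os
import threading

irq_r, irq_w = os.pipe()            # stands in for the per-IRQ /proc file

def device(n):
    """Model the in-kernel handler: wake the reader once per interrupt."""
    for _ in range(n):
        os.write(irq_w, b"\0")      # the up()/wake-queue operation

handled = []

def interrupt_thread():
    for _ in range(3):
        os.read(irq_r, 1)           # blocks, like down() on the semaphore
        handled.append("irq")       # the driver's handler would run here

t = threading.Thread(target=interrupt_thread)
t.start()
device(3)
t.join()
assert handled == ["irq", "irq", "irq"]
```

Because the underlying object is a file descriptor, poll()-style multiplexing falls out for free, exactly as in the text.<br />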

Obviously, one cannot share interrupts between<br />

devices if there is a user process involved.<br />

<strong>The</strong> in-kernel driver merely passes<br />

the interrupt onto the user-mode process; as it<br />

knows nothing about the underlying hardware,<br />

it cannot tell if the interrupt is really for this<br />

driver or not. As such it always reports the interrupt<br />

as ‘handled.’<br />

This scheme works only for level-triggered interrupts.<br />

Fortunately, all PCI interrupts are<br />

level triggered.<br />

If one really wants a signal when an interrupt<br />

happens, one can arrange for a SIGIO using<br />

fcntl().<br />

It may be possible, by more extensive rearrangement<br />

of the interrupt handling code, to<br />

delay the end-of-interrupt to the interrupt controller<br />

until the user process is ready to get an<br />

interrupt. As masking and unmasking interrupts<br />

is slow if it has to go off-chip, delaying<br />

the EOI should be significantly faster than<br />

the current code. However, interrupt delivery<br />

to userspace turns out not to be a bottleneck,<br />

so there’s not a lot of point in this optimisation<br />

(profiles show less than 0.5% of the time<br />

is spent in the kernel interrupt handler and delivery<br />

even for heavy interrupt load—around<br />

1000 cycles per interrupt).<br />

5 Driver Structure<br />

<strong>The</strong> user-mode drivers developed at UNSW are<br />

structured as a preamble, an interrupt thread,<br />

and a control thread (see Figure 3).<br />

<strong>The</strong> preamble:<br />

1. Uses libpci.a to find the device or devices<br />

it is meant to drive,<br />

2. Calls usr_pci_open() to claim the<br />

device, and<br />

3. Spawns the interrupt thread, then



Figure 3: Architecture of a User-Mode Device Driver (the user-level driver, linked with libpci and the client code, sits above a small in-kernel usrdrv stub, the generic IRQ handler, and architecture-dependent DMA support, communicating via read(), pci_read_config(), and the pci_map()/pci_unmap()/pci_map_sg()/pci_unmap_sg() calls)<br />

4. Goes into a loop collecting client requests.<br />

The interrupt thread:<br />

1. Opens /proc/irq/&lt;irq&gt;/irq<br />

2. Loops calling read() on the resulting file descriptor and then calling the driver proper to handle the interrupt.<br />

3. The driver handles the interrupt, calls out to the control thread(s) to say that work is completed or that there has been an error, queues any more work to the device, and then repeats from step 2.<br />

For the lowest latency, the interrupt thread can be run as a real-time thread. For our benchmarks, however, this was not done.<br />

The control thread queues work to the driver then sleeps on a semaphore. When the driver, running in the interrupt thread, determines that a request is complete, it signals the semaphore so that the control thread can continue. (The semaphore is implemented as a pthreads mutex.)<br />

The driver relies on system calls and threading, so the fast system call support now available in Linux, and the NPTL, are very important to get good performance. Each physical I/O involves at least three system calls, plus whatever is necessary for client communication: a read() on the interrupt FD, calls to set up and tear down DMA, and maybe a futex() operation to wake the client.<br />

The system call overhead could be reduced by combining DMA setup and teardown into a single system call.<br />
6 Looping the Drivers<br />

An operating system has two functions with regard<br />

to devices: firstly to drive them, and secondly<br />

to abstract them, so that all devices of the<br />

same class have the same interface. While a<br />

standalone user-level driver is interesting in its<br />

own right (and could be used, for example, to<br />

test hardware, or could be linked into an application<br />

that doesn’t like sharing the device with<br />

anyone), it is much more useful if the driver<br />

can be used like any other device.<br />

For the network interface, that’s easy: use<br />

the tun/tap interface and copy frames between<br />

the driver and /dev/net/tun. Having to copy<br />

slows things down; others on the team here are<br />

planning to develop a zero-copy equivalent of<br />

tun/tap.<br />

For the IDE device, there’s no standard <strong>Linux</strong><br />

way to have a user-level block device, so I implemented<br />

one. It is a filesystem that has pairs<br />

of directories: a master and a slave. When<br />

the filesystem is mounted, creating a file in the<br />

master directory creates a set of block device<br />

special files, one for each potential partition, in



the slave directory. <strong>The</strong> file in the master directory<br />

can then be used to communicate via<br />

a very simple protocol between a user level<br />

block device and the kernel’s block layer. <strong>The</strong><br />

block device special files in the slave directory<br />

can then be opened, closed, read, written or<br />

mounted, just as any other block device.<br />
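A toy model of that master/slave layout, using plain dictionaries instead of a filesystem (UserBlockFS, the device name, and the partition count are all invented for illustration):<br />

```python
# Toy model (plain dicts, not a real filesystem) of the master/slave
# layout: creating a name in the master directory materialises a set of
# per-partition block-device nodes in the slave directory.
class UserBlockFS:
    NPART = 4                       # illustrative partition count

    def __init__(self):
        self.master = {}            # name -> control channel to user level
        self.slave = {}             # device node name -> (disc, partition)

    def create(self, name):
        """Creating a master file creates the slave device nodes."""
        self.master[name] = "control-channel"
        for p in range(self.NPART + 1):       # whole disc + partitions
            node = name if p == 0 else f"{name}{p}"
            self.slave[node] = (name, p)

fs = UserBlockFS()
fs.create("ud0")                    # hypothetical device name
assert "ud0" in fs.master
assert {"ud0", "ud01", "ud02", "ud03", "ud04"} <= set(fs.slave)
```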

<strong>The</strong> main reason for using a mounted filesystem<br />

was to allow easy use of dynamic major<br />

numbers.<br />

I didn’t bother implementing ioctl; it was not<br />

necessary for our performance tests, and when<br />

the driver runs at user level, there are cleaner<br />

ways to communicate out-of-band data with<br />

the driver, anyway.<br />

7 Results<br />

Device drivers were coded up by<br />

[Leslie and Heiser, 2003] for a CMD680<br />

IDE disc controller, and by another PhD<br />

student (Daniel Potts) for a DP83820 Gigabit<br />

ethernet controller. Daniel also designed and<br />

implemented the tuntap interface.<br />

7.1 IDE driver<br />

<strong>The</strong> disc driver was linked into a program that<br />

read 64 Megabytes of data from a Maxtor 80G<br />

disc into a buffer, using varying read sizes.<br />

Measurements were also made using <strong>Linux</strong>’s<br />

in-kernel driver, and a program that read 64M<br />

of data from the same on-disc location using<br />

O_DIRECT and the same read sizes.<br />

We also measured write performance, but the<br />

results are sufficiently similar that they are not<br />

reproduced here.<br />

At the same time as the tests, a low-priority<br />

process attempted to increment a 64-bit<br />

counter as fast as possible. The number of<br />

increments was calibrated to processor time on<br />

an otherwise idle system; reading the counter<br />

before and after a test thus gives an indication<br />

of how much processor time is available to processes<br />

other than the test process.<br />

<strong>The</strong> initial results were disappointing; the<br />

user-mode drivers spent far too much time<br />

in the kernel. This was tracked down to<br />

kmalloc(); so the usr_pci_map() function<br />

was changed to maintain a small cache<br />

of free mapping structures instead of calling<br />

kmalloc() and kfree() each time (we<br />

could have used the slab allocator, but it’s easier<br />

to ensure that the same cache-hot descriptor<br />

is reused by coding a small cache ourselves).<br />
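The shape of that fix is a small LIFO free list. A hedged Python sketch (the real code is kernel C; DescriptorCache and the 16-entry limit are illustrative):<br />

```python
# Minimal free-list cache in the spirit of the fix: keep released mapping
# descriptors on a small LIFO so the same cache-hot object is handed back,
# instead of allocating afresh (kmalloc/kfree) on every request.
class DescriptorCache:
    def __init__(self, limit=16):
        self.free = []              # LIFO: last released is first reused
        self.limit = limit

    def alloc(self):
        return self.free.pop() if self.free else {"sg": None}

    def release(self, desc):
        if len(self.free) < self.limit:
            self.free.append(desc)  # cache it; beyond the limit, drop it

cache = DescriptorCache()
d1 = cache.alloc()
cache.release(d1)
d2 = cache.alloc()
assert d2 is d1                     # cache-hot descriptor reused
```

A LIFO is the right discipline here precisely because the most recently released descriptor is the one most likely still in cache.<br />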

This resulted in the performance graphs in Figure<br />

4.<br />

<strong>The</strong> two drivers compared are the new<br />

CMD680 driver running in user space, and<br />

<strong>Linux</strong>’s in-kernel SIS680 driver. As can be<br />

seen, there is very little to choose between<br />

them.<br />

<strong>The</strong> graphs show average of ten runs; the standard<br />

deviations were calculated, but are negligible.<br />

Each transfer request takes five system calls to<br />

do, in the current design. <strong>The</strong> client queues<br />

work to the driver, which then sets up DMA for<br />

the transfer (system call one), starts the transfer,<br />

then returns to the client, which then sleeps<br />

on a semaphore (system call two). <strong>The</strong> interrupt<br />

thread has been sleeping in read();<br />

when the controller finishes its DMA, it causes<br />

an interrupt, which wakes the interrupt thread<br />

(half of system call three). <strong>The</strong> interrupt thread<br />

then tears down the DMA (system call four),<br />

and starts any queued and waiting activity, then<br />

signals the semaphore (system call five) and<br />

goes back to read the interrupt FD again (the<br />

other half of system call three).<br />

When the transfer is above 128k, the IDE controller can no longer do a single DMA operation, so has to generate multiple transfers. The Linux kernel splits DMA requests above 64k, thus increasing the overhead.<br />

Figure 4: Throughput and CPU usage for the user-mode IDE driver on Itanium-2, reading from a disk (kernel read vs. user read; throughput in MiB/s and CPU % against transfer size in KiB)<br />

<strong>The</strong> time spent in this driver is divided as<br />

shown in Figure 5.<br />

Figure 5: Timeline (in µseconds), from hardware IRQ through the kernel stub and scheduler latency to the user-mode handler, which queues new work, signals the client, and starts the next DMA<br />

7.2 Gigabit Ethernet<br />

The Gigabit driver results are more interesting. We tested these using [ipbench, 2004] with four clients, all with pause control turned off. We ran three tests:<br />

1. Packet receive performance, where packets were dropped and counted at the layer immediately above the driver<br />

2. Packet transmit performance, where packets were generated and fed to the driver, and<br />

3. Ethernet-layer packet echoing, where the protocol layer swapped source and destination MAC-addresses, and fed received packets back into the driver.<br />

We did not want to start comparing IP stacks, so none of these tests actually use higher level protocols.<br />

We measured three different configurations: a<br />

standalone application linked with the driver,<br />

the driver looped back into /dev/net/tap and<br />

the standard in-kernel driver, all with interrupt



holdoff set to 0, 1, or 2. (By default, the normal<br />

kernel driver sets the interrupt holdoff to 300<br />

µseconds, which led to too many packets being<br />

dropped because of FIFO overflow.) Not all<br />

tests were run in all configurations—for example,<br />

the Linux in-kernel packet generator is sufficiently<br />

different from ours that no fair comparison<br />

could be made.<br />

For the tests that had the driver residing in or<br />

feeding into the kernel, we implemented a new<br />

protocol module to count and either echo or<br />

drop packets, depending on the benchmark.<br />

In all cases, we used the amount of work<br />

achieved by a low priority process to measure<br />

time available for other work while the test was<br />

going on.<br />

<strong>The</strong> throughput graphs in all cases are the<br />

same. <strong>The</strong> maximum possible speed on the<br />

wire is given for raw ethernet by 10⁹ × p/(p + 38)<br />

bits per second (the parameter 38 is the<br />

ethernet header size (14 octets), plus a 4 octet<br />

frame check sequence, plus a 7 octet preamble,<br />

plus a 1 octet start frame delimiter plus<br />

the minimum 12 octet interframe gap; p is the<br />

packet size in octets). For large packets the performance<br />

in all cases was the same as the theoretical<br />

maximum. For small packet sizes, the<br />

throughput is limited by the PCI bus; you’ll notice<br />

that the slope of the throughput curve when<br />

echoing packets is around half the slope when<br />

discarding packets, because the driver has to do<br />

twice as many DMA operations per packet.<br />
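As a quick check of the wire-rate formula, assuming only the constants given in the text (38 octets of per-packet overhead on a 10⁹ b/s wire):<br />

```python
# The wire-rate formula from the text: per-packet overhead on raw
# Ethernet is 38 octets (14 header + 4 FCS + 7 preamble + 1 SFD +
# 12 minimum interframe gap).
def wire_max_bps(p):
    """Maximum raw-Ethernet throughput, in bits/s, for p-octet packets."""
    return 1e9 * p / (p + 38)

# Large packets approach line rate; small packets are overhead-dominated.
assert round(wire_max_bps(1500) / 1e6) == 975
assert round(wire_max_bps(64) / 1e6) == 627
```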

<strong>The</strong> user-mode driver (‘<strong>Linux</strong> user’ on the<br />

graph) outperforms the in-kernel driver<br />

(‘<strong>Linux</strong> orig’)—not in terms of throughput,<br />

where all the drivers perform identically, but<br />

in using much less processing time.<br />

This result was so surprising that we repeated<br />

the tests using an EEpro1000, purportedly a<br />

card with a much better driver, but saw the<br />

same effect—in fact the achieved echo performance<br />

is worse than for the in-kernel ns83820<br />

driver for some packet sizes.<br />

<strong>The</strong> reason appears to be that our driver has<br />

a fixed number of receive buffers, which are<br />

reused when the client is finished with them—<br />

they are allocated only once. This is to provide<br />

congestion control at the lowest possible<br />

level—the card drops packets when the upper<br />

layers cannot keep up.<br />
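The difference can be modelled in a few lines of Python (FixedPoolNic and its counters are invented for illustration; the real card simply has no free descriptor to DMA into):<br />

```python
# Model of the congestion-control difference: a fixed pool of receive
# buffers makes the "card" drop packets as early as possible when the
# upper layers fall behind, instead of doing per-packet allocation and
# DMA-setup work that is thrown away later.
class FixedPoolNic:
    def __init__(self, nbufs):
        self.free = nbufs
        self.dropped = 0
        self.delivered = []

    def receive(self, pkt):
        if self.free == 0:
            self.dropped += 1       # dropped by the card, no driver work
            return
        self.free -= 1
        self.delivered.append(pkt)

    def buffer_done(self):
        self.free += 1              # client finished; buffer is reused

nic = FixedPoolNic(nbufs=2)
for p in range(4):                  # burst of 4 packets, client stalled
    nic.receive(p)
assert nic.delivered == [0, 1] and nic.dropped == 2
nic.buffer_done()                   # client catches up
nic.receive(4)
assert nic.delivered == [0, 1, 4]
```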

<strong>The</strong> <strong>Linux</strong> kernel drivers have an essentially<br />

unlimited supply of receive buffers. Overhead<br />

involved in allocating and setting up DMA for<br />

these buffers is excessive, and if the upper layers<br />

cannot keep up, congestion is detected and<br />

the packets dropped in the protocol layer—<br />

after significant work has been done in the<br />

driver.<br />

<strong>One</strong> sees the same problem with the user mode<br />

driver feeding the tuntap interface, as there is<br />

no feedback to throttle the driver. Of course,<br />

here there is an extra copy for each packet,<br />

which also reduces performance.<br />

7.3 Reliability and Failure Modes<br />

In general the user-mode drivers are very reliable.<br />

Bugs in the drivers that would cause<br />

the kernel to crash (for example, a null pointer<br />

reference inside an interrupt handler) cause the<br />

driver to crash, but the kernel continues. <strong>The</strong><br />

driver can then be fixed and restarted.<br />

8 Future Work<br />

<strong>The</strong> main foci of our work now lie in:<br />

1. Reducing the need for context switches<br />

and system calls by merging system calls,<br />

and by trying new driver structures.<br />

2. A zero-copy implementation of tun/tap.



Figure 6: Receive Throughput and CPU usage for Gigabit Ethernet drivers on Itanium-2 (throughput in b/s and CPU % against packet size in octets; series: Theoretical Max, Kernel EEPRO1000 driver, User mode driver with 100 µs holdoff, Kernel NS83820 driver with 100 µs holdoff)<br />

Figure 7: Transmit Throughput and CPU usage for Gigabit Ethernet drivers on Itanium-2 (throughput in b/s and CPU % against packet size in octets; series: Theoretical Max, and the user mode driver with 200, 100, and 0 µs interrupt holdoff)<br />



Figure 8: MAC-layer Echo Throughput and CPU usage for Gigabit Ethernet drivers on Itanium-2 (throughput and CPU % against packet size; series: Theoretical Max, user mode driver, in-kernel EEPRO1000 driver, normal kernel driver, and user-mode driver feeding /dev/tun/tap0)<br />

3. Improving robustness and reliability of<br />

the user-mode drivers, by experimenting<br />

with the IOMMU on the ZX1 chipset of<br />

our Itanium-2 machines.<br />

4. Measuring the reliability enhancements,<br />

by using artificial fault injection to see<br />

what problems that cause the kernel to<br />

crash are recoverable in user space.<br />

5. User-mode filesystems.<br />

In addition there are some housekeeping tasks<br />

to do before this infrastructure is ready for inclusion<br />

in a 2.7 kernel:<br />

1. Replace the ad-hoc memory cache with a<br />

proper slab allocator.<br />

2. Clean up the system call interface.<br />

9 Where d’ya Get It?<br />

Patches against the 2.6 kernel are sent to the<br />

Linux kernel mailing list, and are on http://<br />

www.gelato.unsw.edu.au/patches<br />

Sample drivers will be made available from the<br />

same website.<br />
10 Acknowledgements<br />

Other people on the team here did much work<br />

on the actual implementation of the user level<br />

drivers and on the benchmarking infrastructure.<br />

Prominent among them were Ben Leslie<br />

(IDE driver, port of our dp83820 into the kernel),<br />

Daniel Potts (DP83820 driver, tuntap interface),<br />

and Luke McPherson and Ian Wienand<br />

(IPbench).



References<br />

[Chou et al., 2001] Chou, A., Yang, J., Chelf,<br />

B., Hallem, S., and Engler, D. R. (2001).<br />

An empirical study of operating systems<br />

errors. In Symposium on Operating<br />

Systems Principles, pages 73–88.<br />

http://citeseer.nj.nec.com/<br />

article/chou01empirical.html.<br />

[ipbench, 2004] ipbench (2004). ipbench — a<br />

distributed framework for network<br />

benchmarking.<br />

http://ipbench.sf.net/.<br />

[Jungo, 2003] Jungo (2003). Windriver.<br />

http://www.jungo.com/<br />

windriver.html.<br />

[Keedy, 1979] Keedy, J. L. (1979). A<br />

comparison of two process structuring<br />

models. MONADS Report 4, Dept.<br />

Computer Science, Monash University.<br />

[Leslie and Heiser, 2003] Leslie, B. and<br />

Heiser, G. (2003). Towards untrusted<br />

device drivers. Technical Report<br />

UNSW-CSE-TR-0303, Operating Systems<br />

and Distributed Systems Group, School of<br />

Computer Science and Engineering, <strong>The</strong><br />

University of NSW. CSE techreports<br />

website,<br />

ftp://ftp.cse.unsw.edu.au/<br />

pub/doc/papers/UNSW/0303.pdf.<br />

[Nakatani, 2002] Nakatani, B. (2002).<br />

ELJOnline: User mode drivers.<br />

http://www.linuxdevices.com/<br />

articles/AT5731658926.html.<br />

[Swift et al., 2002] Swift, M., Martin, S.,<br />

Levy, H. M., and Eggers, S. J. (2002).<br />

Nooks: an architecture for reliable device<br />

drivers. In Proceedings of the Tenth ACM<br />

SIGOPS European Workshop,<br />

Saint-Emilion, France.




Big Servers—2.6 compared to 2.4<br />

Wim A. Coekaerts<br />

Oracle Corporation<br />

wim.coekaerts@oracle.com<br />

Abstract<br />

Linux 2.4 has been around in production environments<br />

at companies for a few years now, and<br />

we have been able to gather some good data<br />

on how well (or not) things scale up: number<br />

of CPUs, amount of memory, number of processes,<br />

IO throughput, etc.<br />

Most of the deployments in production today,<br />

are on relatively small systems, 4- to 8-ways,<br />

8–16GB of memory, in a few cases 32GB.<br />

<strong>The</strong> architecture of choice has also been IA32.<br />

64-bit systems are picking up in popularity<br />

rapidly, however.<br />

Now with 2.6, a lot of the barriers are supposed<br />

to be gone. So, are they really? How much<br />

memory can be used now? How is CPU scaling<br />

these days? How good is IO throughput with<br />

multiple controllers in 2.6?<br />

A lot of people have the assumption that 2.6<br />

resolves all of this. We will go into detail on<br />

what we have found out, what we have tested<br />

and some of the conclusions on how good the<br />

move to 2.6 will really be.<br />

1 Introduction<br />

<strong>The</strong> comparison between the 2.4 and 2.6 kernel<br />

trees is not solely based on performance.<br />

A large part of the test suites are performance<br />

benchmarks; however, as you will see, they<br />

have also been used to measure stability. There<br />

are a number of features added which improve<br />

stability of the kernel under heavy workloads.<br />

<strong>The</strong> goal of comparing the two kernel releases<br />

was more to show how well the 2.6 kernel will<br />

be able to hold up in a real world production<br />

environment. Many companies which have deployed<br />

<strong>Linux</strong> over the last two years are looking<br />

forward to rolling out 2.6 and it is important<br />

to show the benefits of doing such a move.<br />

It will take a few releases before the required<br />

stability is there however it’s clear so far that<br />

the 2.6 kernel has been remarkably solid, so<br />

early on.<br />

Most of the 2.4 based tests have been run on<br />

Red Hat Enterprise <strong>Linux</strong> 3, based on <strong>Linux</strong><br />

2.4.21. This is the enterprise release of Red<br />

Hat’s OS distribution; it contains a large number<br />

of patches on top of the <strong>Linux</strong> 2.4 kernel<br />

tree. Some of the tests have been run on the<br />

kernel.org mainstream 2.4 kernel, to show<br />

the benefit of having extra functionality. However<br />

it is difficult to even just boot up the mainstream<br />

kernel on the test hardware due to lack<br />

of support for drivers, or lack of stability to<br />

complete the testsuite. <strong>The</strong> interesting thing to<br />

keep in mind is that with the current <strong>Linux</strong> 2.6<br />

mainstream kernel, most of the test suites ran<br />

through completion. A number of test runs on<br />

<strong>Linux</strong> 2.6 have been on Novell/SuSE SLES9<br />

beta release.


164 • <strong>Linux</strong> Symposium 2004 • Volume <strong>One</strong><br />

2 Test Suites<br />

<strong>The</strong> test suites used to compare the various kernels<br />

are based on an IO simulator for Oracle,<br />

called OraSim and a TPC-C like workload generator<br />

called OAST.<br />

Oracle Simulator (OraSim) is a stand-alone<br />

tool designed to emulate the platform-critical<br />

activities of the Oracle database kernel. Oracle<br />

designed Oracle Simulator to test and characterize<br />

the input and output (I/O) software stack,<br />

the storage system, memory management, and<br />

cluster management of Oracle single instances<br />

and clusters. Oracle Simulator supports both<br />

pass-fail testing for validation, and analytical<br />

testing for debugging and tuning. It runs multiple<br />

processes, with each process representing<br />

the parameters of a particular type of system<br />

load similar to the Oracle database kernel.<br />

OraSim is a relatively straightforward IO<br />

stress-test utility, similar to IOzone or tiobench;<br />

however, it is built to be very flexible and configurable.<br />

It has its own script language which allows one<br />

to build very complex IO patterns. <strong>The</strong> tool is<br />

not released under any open source license today<br />

because it has some code linked in which is<br />

part of the RDBMS itself. <strong>The</strong> jobfiles used for<br />

the testing are available online http://oss.<br />

oracle.com/external/ols/jobfiles/.<br />

<strong>The</strong> advantage of using OraSim over a real<br />

database benchmark is mainly the simplicity.<br />

It does not require large amounts of memory or<br />

large installed software components. <strong>The</strong>re is<br />

one executable which is started with the jobfile<br />

as a parameter. The jobfiles used can be easily<br />

modified to turn on certain filesystem features,<br />

such as asynchronous IO.<br />

OraSim jobfiles were created to simulate a relatively<br />

small database. 10 files are defined as<br />

actual database datafiles and two files are used<br />

to simulate database journals.<br />

OAST on the other hand is a complete database<br />

stress test kit, based on the TPC-C benchmark<br />

workloads. It requires a full installation of<br />

the database software and relies on an actual<br />

database environment to be created. TPC-C<br />

is an on-line transaction workload. <strong>The</strong> numbers<br />

represented during the testruns are not actual<br />

TPC-C benchmark results and cannot or<br />

should not be used as a measure of TPC-C<br />

performance—they are TPC-C-like; however,<br />

not the same.<br />

<strong>The</strong> database engine which runs the OAST<br />

benchmark allocates a large shared memory<br />

segment which contains the database caches<br />

for SQL and for data blocks (shared pool and<br />

buffer cache). Every client connection can run<br />

on the same server or the connection can be<br />

over TCP. In case of a local connection, for<br />

each client, 2 processes are spawned on the<br />

system. <strong>One</strong> process is a dedicated database<br />

process and the other is the client code which<br />

communicates with the database server process<br />

through IPC calls. Test run parameters include<br />

run time length in seconds and number of<br />

client connections. As you can see in the result<br />

pages, both remote and local connections have<br />

been tested.<br />

3 Hardware

A number of hardware configurations have been used. We tried to include various CPU architectures, as well as local SCSI disk versus network storage (NAS) and fibre channel (SAN).

Configuration 1 consists of an 8-way IA32 Xeon 2 GHz with 32GB RAM attached to an EMC CX300 Clariion array with 30 147GB disks using a QLA2300 fibre channel HBA. The network cards are BCM5701 Broadcom Gigabit Ethernet.

Linux Symposium 2004 • Volume One • 165

Configuration 2 consists of an 8-way Itanium 2 1.3 GHz with 8GB RAM attached to a JBOD fibre channel array with 8 36GB disks using a QLA2300 fibre channel HBA. The network cards are BCM5701 Broadcom Gigabit Ethernet.

Configuration 3 consists of a 2-way AMD64 2 GHz (Opteron 246) with 6GB RAM attached to local SCSI disk (LSI Logic 53c1030).

4 Operating System

The Linux 2.4 test cases were created using Red Hat Enterprise Linux 3 on all architectures. Linux 2.6 testing was done with SuSE SLES9 on all architectures; however, in a number of tests the kernel was replaced by the mainstream 2.6 kernel for comparison.

The test suites and benchmarks did not have to be recompiled to run on either RHEL3 or SLES9. Of course, different executables were used on the three CPU architectures.

5 Test Results

At the time of writing, many changes were still happening in the 2.6 kernel. As such, the actual spreadsheets with benchmark data have been published on a website; the data is kept up-to-date with the current kernel tree and can be found at: http://oss.oracle.com/external/ols/results/

5.1 IO

If you want to build a huge database server which can handle thousands of users, it is important to be able to attach a large number of disks. A very big shortcoming of Linux 2.4 was the fact that it could only handle 128 or 256 disks.

With some patches, SuSE got to around 3700 disks in SLES8; however, that meant stealing major numbers from other components. Really large database setups which also require very high IO throughput usually have disks attached ranging from a few hundred to a few thousand.

With the 64-bit dev_t in 2.6, it is now possible to attach plenty of disks. Without modifications it can easily handle tens of thousands of attached devices. This opens the world to really large scale datawarehouses with tens of terabytes of storage.

Another important change is the block IO layer: the BIO code is much more efficient when it comes to large IOs being submitted down from the running application. In 2.4, every IO was broken down into small chunks, sometimes causing bottlenecks in allocating accounting structures. Some of the tests compared 1MB read() and write() calls in 2.4 and 2.6.
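As a rough sketch of the access pattern these tests exercised, the following issues sequential 1MB write() calls against a scratch file. This is illustrative only, not the published test harness; the file name and chunk count are arbitrary:

```python
import os
import tempfile

CHUNK = 1024 * 1024  # 1MB per call, as in the tests described above

def write_chunks(path, chunks):
    """Issue `chunks` sequential 1MB write() calls; return bytes written.

    In Linux 2.4 each large write was split into small chunks inside
    the kernel's block layer; in 2.6 the BIO code can pass large
    requests down to the driver with far less overhead.
    """
    buf = b"\xab" * CHUNK
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    total = 0
    try:
        for _ in range(chunks):
            total += os.write(fd, buf)
    finally:
        os.close(fd)
    return total

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as f:
        print(write_chunks(f.name, 8) // CHUNK, "MB written")
```

Whether such a call reaches the disk as one large request or many small ones is entirely the kernel's decision, which is exactly the behavior the 2.4 versus 2.6 comparison measured.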

5.2 Asynchronous IO and DirectIO

If there is one feature that has always been at the top of the must-have list for large database vendors, it must be async IO. Asynchronous IO allows processes to submit batches of IO operations and continue doing different tasks in the meantime. It improves CPU utilization and can keep devices busier. The enterprise distributions based on Linux 2.4 all ship with the async IO patch applied on top of the mainline kernel.

Linux 2.6 has async IO out of the box. It is implemented a little differently from Linux 2.4; however, combined with support for direct IO it is very performant. Direct IO is very useful as it eliminates copying userspace buffers into kernel space. On systems that are constantly overloaded, there is a nice performance improvement to be gained by doing direct IO. Linux 2.4 did not have direct IO and async IO combined. As you can see in the performance graph on AIO+DIO, it provides a significant reduction in CPU utilization.
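A minimal sketch of the direct IO side of this, assuming a filesystem that supports O_DIRECT (tmpfs, for example, does not, hence the fallback); the mmap-backed buffer guarantees the alignment direct IO requires. This is not the benchmark code itself:

```python
import mmap
import os

CHUNK = 1024 * 1024  # a multiple of 512 bytes, as O_DIRECT requires

O_DIRECT = getattr(os, "O_DIRECT", 0)  # not exposed on every platform

def direct_write(path, length=CHUNK):
    """Write one page-aligned chunk, bypassing the page cache if possible.

    Returns (bytes_written, used_o_direct). The fallback exists because
    some filesystems reject O_DIRECT outright.
    """
    buf = mmap.mmap(-1, length)          # anonymous mapping: page-aligned
    buf.write(b"\xcd" * length)
    base = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    used_direct = bool(O_DIRECT)
    try:
        fd = os.open(path, base | O_DIRECT, 0o600)
        try:
            # With O_DIRECT in effect, the kernel transfers straight
            # from this buffer with no copy into the page cache.
            n = os.write(fd, buf)
        except OSError:                  # direct IO refused at write time
            os.close(fd)
            raise
    except OSError:
        used_direct = False              # fall back to buffered IO
        fd = os.open(path, base, 0o600)
        n = os.write(fd, buf)
    os.close(fd)
    buf.close()
    return n, used_direct
```

The 2.6 AIO interface then lets many such requests be in flight at once; combining the two is what produced the CPU utilization reduction shown in the graph.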

5.3 Virtual Memory

There has been another major VM overhaul in Linux 2.6; in fact, even after 2.6.0 was released a large portion was re-written. This was due to large scale testing showing weaknesses related to the number of users that could be handled on a system. As you can see in the test results, we were able to go from around 3000 users to over 7000 users. In particular on 32-bit systems, the VM had been pretty much a disaster when it came to deploying a system with more than 16GB of RAM. With the latest VM changes it is now possible to push a 32GB, or even 48GB, system pretty reliably.

Support for large pages has also been a big winner. HUGETLBFS reduces TLB misses by a decent percentage; in some of the tests it provides up to a 3% performance gain. In our tests HUGETLBFS was used to allocate the shared memory segment.
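The tests used HUGETLBFS-backed shared memory; the sketch below uses the analogous MAP_HUGETLB mmap flag instead of a System V segment, and falls back to normal pages when no huge pages are reserved. The constant value 0x40000 is an assumption for Linux platforms where the mmap module does not expose the flag:

```python
import mmap

# MAP_HUGETLB is not exposed by the mmap module in all Python
# versions; 0x40000 is its value on Linux (an assumption -- check
# <sys/mman.h> on your platform).
MAP_HUGETLB = getattr(mmap, "MAP_HUGETLB", 0x40000)

def alloc_shared_segment(size):
    """Allocate an anonymous shared segment, preferring huge pages.

    Returns (mapping, used_huge_pages). Falls back to normal pages
    when no huge pages are reserved (vm.nr_hugepages is 0) or the
    kernel refuses the flag; a database engine allocating its shared
    memory this way needs the same fallback. `size` should be a
    multiple of the huge page size (commonly 2MB) for the huge-page
    attempt to succeed.
    """
    flags = mmap.MAP_SHARED | mmap.MAP_ANONYMOUS
    try:
        return mmap.mmap(-1, size, flags=flags | MAP_HUGETLB), True
    except (OSError, ValueError):
        return mmap.mmap(-1, size, flags=flags), False
```

When the huge-page path succeeds, a 2MB page covers what would otherwise be 512 ordinary 4KB TLB entries, which is where the TLB-miss reduction comes from.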

5.4 NUMA

Linux 2.6 is the first Linux kernel with real NUMA support. As we see high-end customers looking at deploying large SMP boxes running Linux, this became a real requirement. In fact, even with the AMD64 design, NUMA support becomes important for performance even when looking at just a dual-CPU system.

NUMA support has two components. One is that the kernel VM allocates memory for processes in a more efficient way. The other is that applications can use the NUMA API to tell the OS where memory should be allocated and how. Oracle has an extension for Itanium 2 to support the libnuma API from Andi Kleen. Making use of this extension showed a significant improvement, up to about 20%. It allows the database engine to be smart about memory allocations, resulting in a significant performance gain.

6 Conclusion

It is very clear that many of the features that were requested by the larger corporations providing enterprise applications actually help a huge amount. The advantage of having asynchronous IO or NUMA support in the mainstream kernel is obvious. It takes a lot of effort for distribution vendors to maintain patches on top of the mainline kernel, and when functionality makes sense it helps to have it included in mainline. Micro-optimizations are still being done, and in particular the VM subsystem can improve quite a bit. Most of the stability issues are around 32-bit, where the LowMem versus HighMem split wreaks havoc quite frequently. At least with some of the features now in the 2.6 kernel it is possible to run servers with more than 16GB of memory and scale up.

The biggest surprise was the stability. It was very nice to see a new stable tree be so solid out of the box, in contrast to earlier stable kernel trees, where it took quite a few iterations to get to the same point.

The major benefit of 2.6 is being able to run on really large SMP boxes: 32-way Itanium 2 or Power4 systems with large amounts of memory. This was the last stronghold of the traditional Unices, and now Linux can play alongside them even there. Very exciting times.


Multi-processor and Frequency Scaling

Making Your Server Behave Like a Laptop

Paul Devriendt
AMD Software Research and Development
paul.devriendt@amd.com

Copyright © 2004 Advanced Micro Devices, Inc.

Abstract

This paper will explore a multi-processor implementation of frequency management, using an AMD Opteron processor 4-way server as a test vehicle.

Topics will include:

• the benefits of doing this, and why server customers are asking for it,

• the hardware, for the case of the AMD Opteron processor,

• the various software components that make this work,

• the issues that arise, and

• some areas of exploration for follow-on work.

1 Introduction

Processor frequency management is common on laptops, primarily as a mechanism for improving battery life. Other benefits include a cooler processor and reduced fan noise. Fans also use a non-trivial amount of power.

This technology is spreading to desktop machines, driven both by a desire to reduce power consumption and to reduce fan noise.

Servers and other multiprocessor machines can equally benefit. The multiprocessor frequency management scenario offers more complexity (no surprise there). This paper discusses these complexities, based upon a test implementation on an AMD Opteron processor 4-way server. Details within this paper are AMD processor specific, but the concepts are applicable to other architectures.

The author of this paper would like to make it clear that he is just the maintainer of the AMD frequency driver, supporting the AMD Athlon 64 and AMD Opteron processors. This frequency driver fits into, and is totally dependent on, the CPUFreq support. The author has gratefully received much assistance and support from the CPUFreq maintainer (Dominik Brodowski).

2 Abbreviations

BKDG: The BIOS and Kernel Developer's Guide. A document published by AMD containing information needed by system software developers. See the references section, entry 4.

MSR: Model Specific Register. Processor registers, accessible only from kernel space, used for various control functions. These registers are expected to change across processor families. These registers are described in the BKDG[4].

VRM: Voltage Regulator Module. Hardware external to the processor that controls the voltage supplied to the processor. The VRM has to be capable of supplying different voltages on command. Note that for multiprocessor systems, it is expected that each processor will have its own independent VRM, allowing each processor to change voltage independently. For systems where more than one processor shares a VRM, the processors have to be managed as a group. The current frequency driver does not have this support.

fid: Frequency Identifier. The value written to the control MSR to select a core frequency. These identifiers are processor family specific. Currently, these are six-bit codes, allowing the selection of frequencies from 800 MHz to 5 GHz. See the BKDG[4] for the mappings from fid to frequency. Note that the frequency driver does need to "understand" the mapping of fid to frequency, as frequencies are exposed to other software components.

vid: Voltage Identifier. The value written to the control MSR to select a voltage. This value is then driven to the VRM by processor logic to achieve control of the voltage. These identifiers are processor model specific. Currently these identifiers are five-bit codes, of which there are two sets: a standard set and a low-voltage mobile set. The frequency driver does not need to be able to "understand" the mapping of vid to voltage, other than perhaps for debug prints.

VST: Voltage Stabilization Time. The length of time before the voltage has increased and is stable at a newly increased voltage. The driver has to wait for this time period when stepping the voltage up. The voltage has to be stable at the new level before applying a further step up in voltage, or before transitioning to a new frequency that requires the higher voltage.

MVS: Maximum Voltage Step. The maximum voltage step that can be taken when increasing the voltage. The driver has to step up voltage in multiple steps of this value when increasing the voltage. (When decreasing voltage it is not necessary to step; the driver can merely jump to the correct voltage.) A typical MVS value would be 25mV.

RVO: Ramp Voltage Offset. When transitioning frequencies, it is necessary to temporarily increase the nominal voltage by this amount during the frequency transition. A typical RVO value would be 50mV.

IRT: Isochronous Relief Time. During frequency transitions, busmasters briefly lose access to system memory. When making multiple frequency changes, the processor driver must delay the next transition for this time period to allow busmasters access to system memory. The typical value used is 80us.

PLL: Phase Locked Loop. An electronic circuit that controls an oscillator to maintain a constant phase angle relative to a reference signal. Used to synthesize new frequencies which are a multiple of a reference frequency.

PLL Lock Time: The length of time, in microseconds, for the PLL to lock.

pstate: Performance State. A frequency/voltage combination that is supported for the operation of the processor. A processor will typically have several pstates available, with higher frequencies needing higher voltages. The processor clock cannot be set to any arbitrary frequency; it may only be set to one of a limited set of frequencies. For a given frequency, there is a minimum voltage needed to operate reliably at that frequency, and this is the correct voltage, thus forming the frequency/voltage pair.

ACPI: Advanced Configuration and Power Interface Specification. An industry specification, initially developed by Intel, Microsoft, Phoenix, and Toshiba. See the references section, entry 5.

_PSS: Performance Supported States. ACPI object that defines the performance states valid for a processor.

_PPC: Performance Present Capabilities. ACPI object that defines which of the _PSS states are currently available, due to current platform limitations.

PSB: Performance State Block. A BIOS-provided data structure used to pass information to the driver concerning the pstates available on the processor. The PSB does not support multiprocessor systems (which use the ACPI _PSS object instead) and is being deprecated. The format of the PSB is defined in the BKDG.
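The VST and MVS definitions above imply a simple upward ramp algorithm. The sketch below is a back-of-envelope plan of that ramp, not the driver code; voltages are plain millivolts rather than vid codes, and the VST value used is illustrative, not a quoted specification:

```python
def step_up_voltage(current_mv, target_mv, mvs_mv=25, vst_us=100):
    """Plan the upward voltage ramp the driver must perform.

    Voltage rises in steps of at most MVS millivolts, waiting VST
    after each step for the VRM output to stabilize; returns
    (list_of_step_voltages, total_wait_us). Stepping *down* needs
    no intermediate steps, so the plan for a decrease is empty.
    """
    steps = []
    v = current_mv
    while v < target_mv:
        v = min(v + mvs_mv, target_mv)
        steps.append(v)
    return steps, len(steps) * vst_us
```

For example, raising 1.10V to 1.50V with a 25mV MVS takes 16 steps, each followed by a VST wait, which is a large part of why upward transitions are slower than downward ones (Section 7.6).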

3 Why Does Frequency Management Affect Power Consumption?

Higher frequency requires higher voltage. As an example, here is data for part number ADA3200AEP4AX:

2.2 GHz @ 1.50 volts, 58 amps max: 89 watts
2.0 GHz @ 1.40 volts, 48 amps max: 69 watts
1.8 GHz @ 1.30 volts, 37 amps max: 50 watts
1.0 GHz @ 1.10 volts, 18 amps max: 22 watts

These figures are worst case current/power figures, at maximum case temperature, and include I/O power of 2.2W.

Actual power usage is determined by:

• the code currently executing (idle blocks in the processor consume less power),

• activity from other processors (cache coherency, memory accesses, pass-through traffic on the HyperTransport connections),

• processor temperature (current increases with temperature, at constant workload and voltage),

• processor voltage.

Increasing the voltage allows operation at higher frequencies, at the cost of higher power consumption and higher heat generation. Note that the relationship between frequency and power consumption is not linear: a 10% frequency increase will cost more than 10% in power consumption (30% or more).

Total system power usage depends on other devices in the system, such as whether disk drives are spinning or stopped, and on the efficiency of power supplies.
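These figures line up well with the standard first-order CMOS approximation that dynamic power scales as frequency times voltage squared (the approximation is a textbook rule of thumb, not a claim made by this paper):

```python
def predicted_ratio(f1_ghz, v1, f2_ghz, v2):
    """Ratio of dynamic power between two pstates under P ~ f * V^2."""
    return (f1_ghz * v1 ** 2) / (f2_ghz * v2 ** 2)

# Published worst-case figures: 89W at 2.2GHz/1.50V, 22W at 1.0GHz/1.10V.
pred = predicted_ratio(2.2, 1.50, 1.0, 1.10)   # about 4.09
actual = 89 / 22                                # about 4.05
```

The predicted 4.09x ratio between the top and bottom pstates is within a few percent of the 89W/22W quoted above, which also illustrates why a frequency increase alone understates the power cost: the voltage rise that accompanies it enters quadratically.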

4 Why Should Your Server Behave Like A Laptop?

• Save power. It is the right thing to do for the environment. Note that power consumed is largely converted into heat, which then becomes a load on the air conditioning in the server room.

• Save money. Power costs money. The power savings for a single server are typically regarded as trivial in terms of a corporate budget. However, many large organizations have racks of many thousands of servers. The power bill is then far from trivial.

• Cooler components last longer, and this translates into improved server reliability.

• Government regulation.



5 Interesting Scenarios

These are real world scenarios where the application of the technology is appropriate.

5.1 Save power in an idle cluster

A cluster would typically be kept running at all times, allowing remote access on demand. During the periods when the cluster is idle, reducing the CPU frequency is a good way to reduce power consumption (and therefore also air conditioning load), yet be able to quickly transition back up to full speed.

Dense packing of servers (the most demanding case is a blade server) aggravates the cooling problem, as the neighboring boxes are also generating heat.

6 System Power Budget

The processors are only part of the system. We therefore need to understand the power consumption of the entire system to see how significant processor frequency management is to the power consumption of the whole system.

A system power budget is obviously platform specific. This sample DC (direct current) power budget is for a 4-processor AMD Opteron processor based system. The system has three 500W power supplies, of which one is redundant. Analysis shows that for many operating scenarios, the system could run on a single power supply.

This analysis is of DC power. For the system in question, the efficiency of the power supplies is approximately linear across varying loads, and thus the DC power figures expressed as percentages are meaningful as predictors of the AC (alternating current) power consumption. For systems with power supplies that are not linearly efficient across varying loads, the calculations obviously have to be factored to take account of power supply efficiency.

System components:

• 4 processors @ 89W = 356W in the maximum pstate; 4 @ 22W = 88W in the minimum pstate. These are worst case figures, at maximum case temperature, with the worst case instruction mix. The figures in Table 1 are reduced from these maximums by approximately 10% to account for a reduced case temperature and for a workload that does not keep all of the processors' internal units busy.

• Two disk drives (Western Digital 250 GByte SATA): 16W read/write, 10W idle (spinning), 1.3W sleep (not spinning). Note that SCSI drives typically consume more power.

• DVD drive: 10W read, 1W idle/sleep.

• PCI 2.2 slots: an absolute maximum of 25W per slot; the system will have a total power budget that may not account for maximum power in all slots. Estimate 2 slots occupied at a total of 20W.

• VGA video card in a PCI slot: 5W. (AGP would be more like 15W+.)

• DDR DRAM: 10W max per DIMM, 40W for 4 GBytes configured as 4 DIMMs.

• Network (built in): 5W.

• Motherboard and components: 30W.

• 10 fans @ 6W each: 60W.

• Keyboard + mouse: 3W.

See Table 1 for the sample power budget under busy and light loads. The light load without any frequency reduction is baselined as 100%. The power consumption is shown for the same light load with frequency reduction enabled, and again where the idle loop incorporates the hlt instruction.

Using frequency management, the power consumption drops to 43%, and adding the use of the hlt instruction (assuming 50% time halted), the power consumption drops further to 33%. These are significant power savings for systems that are under light load conditions at times. The percentage of time that the system is running under reduced load has to be known to predict actual power savings.


system load                            cpus       disks  dvd  pci  vga  dram  net  planar  fans  kbd+mou  total
busy                                   320 (90%)     32   10   20    5    40    5      30    60        3  525W
light load                             310 (87%)     22    1   15    5    38    5      20    60        3  479W (100%)
light load, using frequency reduction   79 (90%)     22    1   15    5    38    5      20    20        3  208W (43%)
light load, using frequency reduction
  and using hlt 50% of the time         32 (40%)     22    1   15    5    38    5      20    15        3  156W (33%)

Table 1: Sample System Power Budget (DC), in watts
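The rows of Table 1 can be cross-checked against the component figures; the 43% and 33% quoted in the text follow directly from the row totals and the light-load baseline:

```python
ROWS = {  # component watts in Table 1 order: cpus, disks, dvd, pci,
          # vga, dram, net, planar, fans, keyboard+mouse
    "busy":           [320, 32, 10, 20, 5, 40, 5, 30, 60, 3],
    "light":          [310, 22,  1, 15, 5, 38, 5, 20, 60, 3],
    "light+freq":     [ 79, 22,  1, 15, 5, 38, 5, 20, 20, 3],
    "light+freq+hlt": [ 32, 22,  1, 15, 5, 38, 5, 20, 15, 3],
}

totals = {name: sum(parts) for name, parts in ROWS.items()}
baseline = totals["light"]  # light load without frequency reduction = 100%
percent = {name: round(100 * t / baseline) for name, t in totals.items()}
```

Note that the non-CPU components barely move between rows; nearly all of the 479W-to-208W drop comes from the processors and the fans, which is why processor frequency management dominates the light-load budget.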

7 Hardware—AMD Opteron

7.1 Software Interface To The Hardware

There are two MSRs, the FIDVID_STATUS MSR and the FIDVID_CONTROL MSR, that are used for frequency and voltage transitions. These MSRs are the same for the single processor AMD Athlon 64 processors and for the MP-capable AMD Opteron processors. These registers are not compatible with the previous generation of AMD Athlon processors, and will not be compatible with the next generation of processors.

The CPU frequency driver for AMD processors therefore has to change across processor revisions, as do the ACPI _PSS objects that describe pstates.

The status register reports the current fid and vid, as well as the maximum fid, the start fid, the maximum vid, and the start vid of the particular processor.

These registers are documented in the BKDG[4].

As MSRs can only be accessed by executing code (the read MSR or write MSR instructions) on the target processor, the frequency driver has to use the processor affinity support to force execution on the correct processor.
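The fid encoding itself is defined in the BKDG[4]; the contemporary powernow-k8 driver decodes it as a linear 100 MHz grid starting at 800 MHz. The mapping below reflects that reading of the driver source and should be verified against the BKDG for a given processor revision:

```python
def fid_to_mhz(fid):
    """Decode a fid (six-bit code) to a core frequency in MHz.

    Follows the powernow-k8 convention of 800 MHz + fid * 100 MHz,
    covering the 800 MHz to 5 GHz range mentioned in Section 2.
    """
    if not 0 <= fid <= 0x3F:
        raise ValueError("fid is a six-bit code")
    return 800 + fid * 100

def mhz_to_fid(mhz):
    """Inverse mapping, for frequencies on the 100 MHz grid."""
    if mhz < 800 or mhz % 100:
        raise ValueError("not an encodable frequency")
    return (mhz - 800) // 100
```

The driver needs exactly this kind of decode because CPUFreq exposes frequencies, not fid codes, to governors and to userspace.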

7.2 Multiple Memory Controllers

In PC architectures, the memory controller is a component of the northbridge, which is traditionally a separate component from the processor. With AMD Opteron processors, the northbridge is built into the processor. Thus, in a multi-processor system there are multiple memory controllers.

See Figure 1 for a block diagram of a two processor system.

If a processor is accessing DRAM that is physically attached to a different processor, the DRAM access (and any cache coherency traffic) crosses the coherent HyperTransport interprocessor links. There is a small performance penalty in this case. This penalty is of the order of a DRAM page hit versus a DRAM page miss, about 1.7 times slower than a local access. This penalty is minimized by the processor caches, where data/code residing in remote DRAM is locally cached. It is also minimized by Linux's NUMA support.

Note that a single threaded application that is memory bandwidth constrained may benefit from multiple memory controllers, due to the increase in memory bandwidth.

When the remote processor is transitioned to a lower frequency, this performance penalty is worse. An upper bound on the penalty may be calculated as proportional to the frequency slowdown, i.e., taking the remote processor from 2.2 GHz to 1.0 GHz would take the 1.7 factor from above to a factor of 2.56. Note that this is an absolute worst case, an upper bound on the factor. Actual impact is workload dependent. A worst case scenario would be a memory bound task, doing memory reads at addresses that are pathologically the worst case for the caches, with all accesses being to remote memory. A more typical scenario would see this penalty alleviated by:

• processor caches, where 64 bytes will be read and cached for a single access, so applications that walk linearly through memory will only see the penalty on 64 byte boundaries,

• memory writes, which do not take a penalty (as processor execution continues without waiting for a write to complete),

• memory interleaving,

• kernel NUMA optimizations for non-interleaved memory (which allocate memory local to the processor when possible to avoid this penalty).
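The 2.56 figure above can be approximately reproduced by scaling only the remote portion of the 1.7x penalty with the frequency slowdown; this decomposition is a reconstruction by the editor, not spelled out in the text:

```python
def remote_penalty_bound(base_factor, remote_ghz_before, remote_ghz_after):
    """Worst-case remote access factor after slowing the remote CPU.

    Splits the base penalty into a local portion (1.0) plus a
    remote-side portion (base_factor - 1) and scales only the latter
    by the slowdown. For the 1.7x penalty and a 2.2 GHz to 1.0 GHz
    slowdown this gives about 2.54, close to the 2.56 quoted above.
    """
    slowdown = remote_ghz_before / remote_ghz_after
    return 1.0 + (base_factor - 1.0) * slowdown
```

As the text notes, this is an upper bound; caching, write posting, and NUMA-aware allocation keep the typical impact well below it.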

7.3 DRAM Interface Speed

The DRAM interface speed is impacted by the core clock frequency. A full table is published in the processor data sheet; Table 2 shows a sample of actual DRAM frequencies for the common specified DRAM frequencies, across a range of core frequencies.

This table shows that certain DRAM speed / core speed combinations are suboptimal.

Effective memory performance is influenced by many factors:

• cache hit rates,

• effectiveness of NUMA memory allocation routines,

• load on the memory controller,

• size of the penalty for remote memory accesses,

• memory speed,

• other hardware related items, such as types of DRAM accesses.

It is therefore necessary to benchmark the actual workload to get meaningful data for that workload.

7.4 UMA

During frequency transitions, and when HyperTransport LDTSTOP is asserted, DRAM is placed into self refresh mode. UMA graphics devices therefore cannot access DRAM. UMA systems therefore need to limit the time that DRAM is in self refresh mode. Time constraints are bandwidth dependent, with high resolution displays needing higher memory bandwidth. This is handled by the IRT delay time during frequency transitions. When transitioning multiple steps, the driver waits an appropriate length of time to allow external devices to access memory.


Figure 1: Two Processor System. (Block diagram: two AMD Opteron processors, each with local DDR DRAM, linked by coherent HyperTransport (cHT); non-coherent HyperTransport (ncHT) links connect to the AMD-8151 graphics tunnel (8X AGP), the AMD-8131 PCI-X tunnel, and the AMD-8111 I/O hub serving legacy PCI, USB, LPC, AC '97, and EIDE.)

Processor     100MHz    133MHz    166MHz    200MHz
Core          DRAM      DRAM      DRAM      DRAM
Frequency     spec      spec      spec      spec
800MHz        100.00    133.33    160.00    160.00
1000MHz       100.00    125.00    166.66    200.00
2000MHz       100.00    133.33    166.66    200.00
2200MHz       100.00    129.41    157.14    200.00

Table 2: DRAM Frequencies For A Range Of Processor Core Frequencies



7.5 TSC Varying

The Time Stamp Counter (TSC) register is a register that increments with the processor clock. Multiple reads of the register will see increasing values. This register increments on each core clock cycle in the current generation of processors. Thus, the rate of increase of the TSC when compared with "wall clock time" varies as the frequency varies. This causes problems in code that calibrates the TSC increments against an external time source, and then attempts to use the TSC to measure time.

The Linux kernel uses the TSC for such timings, for example when a driver calls udelay(). In this case it is not a disaster if the udelay() call waits for too long, as the call is defined to allow this behavior. The case of the udelay() call returning too quickly can be fatal, and this has been demonstrated during experimentation with this code.

This particular problem is resolved by the cpufreq driver correcting the kernel TSC calibration whenever the frequency changes.

This issue may impact other code that uses the TSC register directly. It is interesting to note that it is hard to define a correct behavior. Code that calibrates the TSC against an external clock will be thrown off if the rate of increment of the TSC changes. However, other code may expect a certain code sequence to consistently execute in approximately the same number of cycles, as measured by the TSC, and this code will be thrown off if the behavior of the TSC changes relative to the processor speed.
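The correction amounts to rescaling the calibrated delay constants in proportion to the frequency change (the kernel's cpufreq core performs this with fixed-point helpers; the sketch below is simplified to plain integer arithmetic):

```python
def scale_calibration(loops_per_jiffy, old_khz, new_khz):
    """Rescale a TSC-derived delay constant after a frequency change.

    udelay() spins for a loop count calibrated against the old core
    clock; after a transition the constant must shrink or grow with
    the frequency, otherwise delays come out too long (harmless, per
    udelay's contract) or too short (potentially fatal, as noted
    above).
    """
    return loops_per_jiffy * new_khz // old_khz
```

Halving the core clock halves the constant, so a delay loop calibrated at 2.2 GHz still spins for the requested wall-clock time at 1.1 GHz.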

7.6 Measurement Of Frequency Transition Times

The time required to perform a transition is a combination of the software time to execute the required code and the hardware time to perform the transition.

Examples of hardware wait time are:

• waiting for the VRM to be stable at a new voltage,

• waiting for the PLL to lock at the new frequency,

• waiting for DRAM to be placed into and then taken out of self refresh mode around a frequency transition.

The time taken to transition between two states depends on both the initial state and the target state. This is due to:

• multiple steps being required in some cases,

• certain operations being lengthier (for example, voltage is stepped up in multiple stages, but stepped down in a single step),

• differences in code execution time dependent on processor speed (although this is minor).

Measurements, taken by calibrating the frequency driver, show that frequency transitions for a processor take less than 0.015 seconds. Further experimentation with multiple processors showed a worst case transition time of less than 0.08 seconds to transition all 4 processors from minimum to maximum frequency, and slightly faster to transition from maximum to minimum frequency.

Note that there is a driver optimization under consideration that would approximately halve these transition times.


176 • Linux Symposium 2004 • Volume One

7.7 Use of Hardware Enforced Throttling

The southbridge (I/O hub; for example, the AMD-8111 HyperTransport I/O hub) is capable of initiating throttling via the HyperTransport stopclock message, which will ramp down the CPU grid by the programmed amount. This may be initiated by the southbridge for thermal throttling or for other reasons.

This throttling is transparent to software, other than the performance impact. It is of greatest value in the lowest pstate, due to the reduced voltage.

Hardware enforced throttling is generally not of relevance to the software management of processor frequencies. However, a system designer would need to take care to ensure that the optimal scenario occurs: a transition to a lower frequency/voltage in preference to hardware throttling in high pstates. The BIOS configurations are documented in the BKDG[4].

For maximum power savings, the southbridge would be configured to initiate throttling when the processor executes the hlt instruction.

8 Software

The AMD frequency driver is a small part of the software involved. The frequency driver fits into the CPUFreq architecture, which is part of the 2.6 kernel. It is also available as a patch for the 2.4 kernel, and many distributions do include it.

The CPUFreq architecture includes kernel support, the CPUFreq driver itself (drivers/cpufreq), an architecture-specific driver to control the hardware (powernow-k8.ko in this case), and /sys file system code for userland access.

The kernel support code (linux/kernel/cpufreq.c) handles timing changes, such as updating the kernel constant loops_per_jiffy, as well as notifiers (system components that need to be notified of a frequency change).

8.1 History Of The AMD Frequency Driver

The CPU frequency driver for the AMD Athlon (the previous generation of processors) was developed by Dave Jones. That driver supports single-processor transitions only, as the pstate transition capability was only enabled in mobile processors. It used the PSB mechanism to determine valid pstates for the processor, and has subsequently been enhanced to add ACPI support.

The initial AMD Athlon 64 and AMD Opteron driver (developed by me, based upon Dave's earlier work, and with much input from Dominik and others) was also PSB based. This was followed by a version of the driver that added ACPI support.

The next release is intended to add a built-in table of pstates that will allow checking of BIOS-supplied data, and also allow an override capability to provide pstate data when it is not supplied by the BIOS.

8.2 User Interface

The deprecated /proc/cpufreq (and /proc/sys) file system interface offers control over all processors or individual processors. By echoing values into this file, the root user can change policies and change the limits on available frequencies.

Examples:

Constrain all processors to frequencies between 1.0 GHz and 1.6 GHz, with the performance policy (effectively chooses 1.6 GHz):



echo -n "1000000:1600000:performance" > /proc/cpufreq

Constrain processor 2 to run at only 2.0 GHz:

echo -n "2:2000000:2000000:performance" > /proc/cpufreq

“Performance” refers to a policy; the other policy available is “powersave.” These policies simply force the frequency to the appropriate extreme of the available range. With the 2.6 kernel, the choice is normally the “userspace” governor, which allows the root user, or any user space code running with root privilege, to dynamically control the frequency.

With the 2.6 kernel, a new interface in the /sys filesystem is available to the root user, deprecating the /proc/cpufreq method. The control and status files exist under /sys/devices/system/cpu/cpuN/cpufreq, where N varies from 0 upwards, dependent on which processors are online. Among the other files in each processor's directory, scaling_min_freq and scaling_max_freq control the minimum and maximum of the range in which the frequency may vary. The scaling_governor file is used to control the choice of governor. See linux/Documentation/cpu-freq/userguide.txt for more information.

Examples:

Constrain processor 2 to run only in the range 1.6 GHz to 2.0 GHz:

cd /sys/devices/system/cpu
cd cpu2/cpufreq
echo 1600000 > scaling_min_freq
echo 2000000 > scaling_max_freq

8.3 Control From User Space And User Daemons

The interface to the /sys filesystem allows userland control and query functionality. Some form of automation of the policy would normally be part of the desired complete implementation.

This automation is dependent on the reason for using frequency management. As an example, for the case of transitioning to a lower pstate when running on a UPS, a daemon will be notified of the failure of mains power, and that daemon will trigger the frequency change by writing to the control files in the /sys filesystem.

The CPUFreq architecture has thus split the implementation into multiple parts:

1. user space policy

2. kernel space driver for common functionality

3. kernel space driver for the processor specific implementation.

There are multiple user space automation implementations, not all of which currently support multiprocessor systems. One that does, and that has been used in this project, is cpufreqd version 1.1.2 (http://sourceforge.net/projects/cpufreqd). This daemon is controlled by a configuration file. Other than making changes to the configuration file, the author of this paper has not been involved in any of the development work on cpufreqd, and is a mere user of this tool.

The configuration file specifies profiles and rules. A profile is a description of the system settings in that state, and my configuration file is set up to map the profiles to the processor pstates. Rules are used to dynamically choose which profile to use, and my rules are set up to transition profiles based on total processor load.

My simple configuration file to change processor frequency dependent on system load is:

[General]
pidfile=/var/run/cpufreqd.pid
poll_interval=2
pm_type=acpi

# 2.2 GHz processor speed
[Profile]
name=hi_boost
minfreq=95%
maxfreq=100%
policy=performance

# 2.0 GHz processor speed
[Profile]
name=medium_boost
minfreq=90%
maxfreq=93%
policy=performance

# 1.0 GHz processor speed
[Profile]
name=lo_boost
minfreq=40%
maxfreq=50%
policy=powersave

[Profile]
name=lo_power
minfreq=40%
maxfreq=50%
policy=powersave

[Rule]
# not busy 0%-40%
name=conservative
ac=on
battery_interval=0-100
cpu_interval=0-40
profile=lo_boost

# medium busy 30%-80%
[Rule]
name=lo_cpu_boost
ac=on
battery_interval=0-100
cpu_interval=30-80
profile=medium_boost

# really busy 70%-100%
[Rule]
name=hi_cpu_boost
ac=on
battery_interval=50-100
cpu_interval=70-100
profile=hi_boost

This approach actually works very well for multiple small tasks, transitioning the frequencies of all the processors together based on a collective loading statistic.

For a long-running, single-threaded task, this approach does not work well, as the load is only high on a single processor, with the others being idle. The average load is thus low, and all processors are kept at a slow speed. Such a workload scenario would require an implementation that looks at the loading of individual processors, rather than the average. See the section below on future work.
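A per-processor implementation could read the individual cpuN lines of /proc/stat rather than the aggregate. A sketch of the per-line arithmetic (real daemons difference two samples of these cumulative counters; a single snapshot is shown for brevity):

```c
#include <assert.h>
#include <stdio.h>

/* Derive one processor's busy percentage from a "cpuN ..." line of
 * /proc/stat.  The first four fields are cumulative jiffies spent in
 * user, nice, system, and idle time.  Returns -1 on parse failure. */
static int cpu_busy_percent(const char *stat_line)
{
    unsigned long user, nice, sys, idle, total;
    if (sscanf(stat_line, "cpu%*d %lu %lu %lu %lu",
               &user, &nice, &sys, &idle) != 4)
        return -1;
    total = user + nice + sys + idle;
    if (total == 0)
        return -1;
    return (int)(100 * (total - idle) / total);
}
```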

8.4 The Drivers Involved

powernow-k8.ko: arch/i386/kernel/cpu/cpufreq/powernow-k8.c (the same source code is built as a 32-bit driver in the i386 tree and as a 64-bit driver in the x86_64 tree)

drivers/acpi

drivers/cpufreq



The Test Driver

Note that the powernow-k8.ko driver does not export any read, write, or ioctl interfaces. For test purposes, a second driver exists with an ioctl interface for test application use. The test driver was a big part of the test effort on powernow-k8.ko prior to release.

8.5 Frequency Driver Entry Points

powernowk8_init()

Driver late_initcall. Initialization is late as the ACPI driver needs to be initialized first. Verifies that all processors in the system are capable of frequency transitions and that all processors are supported processors. Builds a data structure with the addresses of the four entry points for cpufreq use (listed below), and calls cpufreq_register_driver().

powernowk8_exit()

Called when the driver is to be unloaded. Calls cpufreq_unregister_driver().

8.6 Frequency Driver Entry Points For Use By The CPUFreq Driver

powernowk8_cpu_init()

This is a per-processor initialization routine. As we are not guaranteed to be executing on the processor in question, and as the driver needs access to MSRs, the driver needs to force itself to run on the correct processor by using set_cpus_allowed().

This per-processor initialization allows for processors to be taken offline or brought online dynamically. I.e., this is part of the software support that would be needed for processor hotplug, although this is not supported in the hardware.

This routine finds the ACPI pstate data for this processor, and extracts the (proprietary) data from the ACPI _PSS objects. This data is verified as far as is reasonable. Per-processor data tables for use during frequency transitions are constructed from this information.

powernowk8_cpu_exit()

Per-processor cleanup routine.

powernowk8_verify()

When the root user (or an application running on behalf of the root user) requests a change to the minimum/maximum frequencies, or to the policy or governor, the frequency driver's verification routine is called to verify (and correct if necessary) the input values. For example, if the maximum speed of the processor is 2.4 GHz and the user requests that the maximum range be set to 3.0 GHz, the verify routine will correct the maximum value to a value that is actually possible. The user can, however, choose a value that is less than the hardware maximum, for example 2.0 GHz in this case.

As this routine just needs to access the per-processor data, and not any MSRs, it does not matter which processor executes this code.
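The clamping itself is simple. A sketch of the correction logic (hardware limits as parameters, frequencies in kHz; the real routine also validates against the pstate table):

```c
#include <assert.h>

/* Correct a requested min/max frequency range so it lies within the
 * hardware-supported limits, as a cpufreq verify routine must. */
static void verify_limits(unsigned int *min, unsigned int *max,
                          unsigned int hw_min, unsigned int hw_max)
{
    if (*max > hw_max) *max = hw_max;
    if (*min < hw_min) *min = hw_min;
    if (*min > *max)   *min = *max;
}
```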

powernowk8_target()

This is the driver entry point that actually performs a transition to a new frequency/voltage. This entry point is called for each processor that needs to transition to a new frequency. There is therefore an optimization possible by enhancing the interface between the frequency driver and the CPUFreq driver for the case where all processors are to be transitioned to a new, common frequency. However, it is not clear that such an optimization is worth the complexity, as the functionality to transition a single processor would still be needed.

This routine is invoked with the processor number as a parameter, and there is no guarantee

as to which processor we are currently executing on. As the mechanism for changing the frequency involves accessing MSRs, it is necessary to execute on the target processor, and the driver forces its execution onto the target processor by using set_cpus_allowed().

The CPUFreq helpers are then used to determine the correct target frequency. Once a chosen target fid and vid are identified:

• the cpufreq driver is called to warn that a transition is about to occur,

• the actual transition code within powernow-k8 is called, and then

• the cpufreq driver is called again to confirm that the transition was successful.

The actual transition is protected by a semaphore that is used across all processors. This prevents transitions on one processor from interfering with transitions on other processors, which could otherwise occur due to the inter-processor communication that happens at a hardware level when a frequency transition occurs.

8.7 CPUFreq Interface

The CPUFreq interface provides entry points that are required to make the system function. It also provides helper functions, which need not be used, but are there to provide common functionality across the set of all architecture-specific drivers. Elimination of duplicated code is a good thing! An architecture-specific driver can build a table of available frequencies and pass this table to the CPUFreq driver. The helper functions then simplify the architecture driver code by manipulating this table.

cpufreq_register_driver()

Registers the frequency driver as being the driver capable of performing frequency transitions on this platform. Only one driver may be registered.

cpufreq_unregister_driver()

Unregisters the driver when it is being unloaded.

cpufreq_notify_transition()

Used to notify the CPUFreq driver, and thus the kernel, that a frequency transition is occurring, triggering recalibration of timing-specific code.

cpufreq_frequency_table_target()

Helper function to find an appropriate table entry for a given target frequency. Used in the driver's target function.

cpufreq_frequency_table_verify()

Helper function to verify that an input frequency is valid. This helper is effectively a complete implementation of the driver's verify function.

cpufreq_frequency_table_cpuinfo()

Supplies the frequency table data that is used on subsequent helper function calls. Also aids with providing information as to the capabilities of the processors.
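The table-target helper can be illustrated with a simplified selection rule (the real helper also honors a relation argument; here the highest table frequency not above the target is chosen, falling back to the lowest entry):

```c
#include <assert.h>

/* Pick a frequency (kHz) from a zero-terminated table: the highest
 * entry not exceeding the target, or the lowest entry if none fits. */
static unsigned int table_target(const unsigned int *table,
                                 unsigned int target)
{
    unsigned int best = 0, lowest = 0;
    for (; *table; table++) {
        if (!lowest || *table < lowest)
            lowest = *table;
        if (*table <= target && *table > best)
            best = *table;
    }
    return best ? best : lowest;
}
```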

8.8 Calls To The ACPI Driver

acpi_processor_register_performance()
acpi_processor_unregister_performance()

Helper functions used at per-processor initialization time to gain access to the data from the _PSS object for that processor. This is a preferable solution to the frequency driver having to walk the ACPI namespace itself.



8.9 The Single Processor Solution

Many of the kernel system calls collapse to constants when the kernel is built without multiprocessor support. For example, num_online_cpus() becomes a macro with the value 1. By careful use of the definitions in smp.h, the same driver code handles both multiprocessor and single processor machines without the use of conditional compilation. The multiprocessor support obviously adds complexity to the code run on a single processor system, but this cost is negligible in the case of transitioning frequencies. The driver initialization and termination code is made more complex and lengthy, but this is not frequently executed code. There is also a small penalty in terms of code space.

The author does not feel that the penalty of the multiple processor support code is noticeable on a single processor system, but this is obviously debatable. The current choice is to have a single driver that supports both single processor and multiple processor systems.

As the primary performance cost is in terms of additional code space, it is true that a single processor machine with highly constrained memory may benefit from a simplified driver without the additional multi-processor support code. However, such a machine would see greater benefit by eliminating other code that would not be necessary on a chosen platform. For example, the PSB support code could be removed from a memory-constrained single processor machine that was using ACPI.

This approach of removing code unnecessary for a particular platform is not a wonderful approach when it leads to multiple variants of the driver, all of which have to be supported and enhanced, and which makes Kconfig even more complex.

8.10 Stages Of Development, Test And Debug Of The Driver

The algorithm for transitioning to a new frequency is complex. See the BKDG[4] for a good description of the steps required, including flowcharts. In order to test and debug the frequency/voltage transition code thoroughly, the author first wrote a simple simulation of the processor. This simulation maintained a state machine, verified that fid/vid MSR control activity was legal, provided fid/vid status MSR results, and wrote a log file of all activity. The core driver code was then written as an application and linked with this simulation code to allow testing of all combinations.

The driver was then developed as a skeleton, using printk to develop and test the BIOS/ACPI interfaces without having the frequency/voltage transition code present. This is because attempts to actually transition to an invalid pstate often result in total system lockups that offer no debug output; if the processor voltage is too low for the frequency, successful code execution ceases.

When the skeleton was working correctly, the actual transition code was dropped into place and tested on real hardware, both single processor and multiple processor. (The single processor driver was released many months before the multi-processor capable driver, as multiprocessor capable hardware was not available in the marketplace.) The functional driver was tested using printk to trace activity, using external hardware to track power usage, and using a test driver to independently verify register settings.

The functional driver was then made available to various people in the community for their feedback. The author is grateful for the extensive feedback received, which included changed code to implement suggestions. The driver as it exists today is considerably improved from the initial release, due to this feedback mechanism.

9 How To Determine Valid PStates For A Given Processor

AMD defines pstates for each processor. A performance state is a frequency/voltage pair that is valid for operation of that processor. These are specified as fid/vid (frequency identifier / voltage identifier) pairs, and are documented in the Processor Thermal and Data Sheets (see references). The worst-case processor power consumption for each pstate is also characterized. The BKDG[4] contains tables for mapping fid to frequency and vid to voltage.
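For this processor family the fid mapping is linear, so the table reduces to simple arithmetic (a sketch following the BKDG table as mirrored in the driver source; the vid-to-voltage mapping is a similar lookup, omitted here):

```c
#include <assert.h>

/* Map a frequency identifier (fid) to a core frequency in MHz:
 * an 800 MHz base plus 100 MHz per fid step. */
static unsigned int fid_to_mhz(unsigned int fid)
{
    return 800 + fid * 100;
}
```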

Pstates are processor specific. I.e., 2.0 GHz at 1.45V may be correct for one model/revision of processor, but is not necessarily correct for a different model/revision of processor.

Code can determine whether a processor supports or does not support pstate transitions by executing the cpuid instruction (for details, see the BKDG[4] or the source code for the Linux frequency driver). This needs to be done for each processor in an MP system. Each processor in an MP system could theoretically have different pstates.
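The cpuid check tests the frequency-ID and voltage-ID control bits returned in EDX by cpuid function 0x80000007. The decoder is split from the cpuid execution here so it can be exercised on sample register values; the bit positions follow the driver source:

```c
#include <assert.h>

#define FID_CONTROL (1u << 1)  /* frequency-ID control supported */
#define VID_CONTROL (1u << 2)  /* voltage-ID control supported */

/* A processor is pstate-transition capable only if it can control
 * both its fid and its vid. */
static int pstate_capable(unsigned int edx)
{
    return (edx & (FID_CONTROL | VID_CONTROL))
        == (FID_CONTROL | VID_CONTROL);
}
```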

Ideally, the processor frequency driver would not contain hardcoded pstate tables, as the driver would then need to be revised for new processor revisions. The chosen solution is to have the BIOS provide the tables of pstates, and have the driver retrieve the pstate data from the BIOS. There are two such tables defined for use by BIOSs on AMD systems:

1. PSB, AMD's original proprietary mechanism, which does not support MP. This mechanism is being deprecated.

2. ACPI _PSS objects. Whereas the ACPI specification is a standard, the data within the _PSS objects is AMD specific (and, in fact, processor family specific), and thus there is still a proprietary nature to this solution.

The current AMD frequency driver obtains data from the ACPI objects. ACPI does introduce some limitations, which are discussed later. Experimentation is ongoing with a built-in database approach to the problem, in an attempt to bypass these issues, and also to allow checking of the validity of the ACPI provided data.

10 ACPI And Frequency Restrictions

ACPI[5] provides the _PPC object, which is used to constrain the pstates available. This object is dynamic, and can therefore be used in platforms for purposes such as:

• forcing frequency restrictions when operating on battery power,

• forcing frequency restrictions due to thermal conditions.

For battery / mains power transitions, an ACPI-compliant GPE (General Purpose Event) input to the chipset (I/O hub) is dedicated to asserting an SCI (System Control Interrupt) when the power source changes. The ACPI driver will then execute the ACPI control method (see the _PSR power source ACPI object), which issues a notify to the _CPUn object, which triggers the ACPI driver to re-evaluate the _PPC object. If the current pstate exceeds that allowed by this new evaluation of the _PPC object, the CPU frequency driver will be called to transition to a lower pstate.



11 ACPI Issues

ACPI as a standard is not perfect. There is variation among different implementations, and Linux ACPI support does not work on all machines. ACPI does introduce some overhead, and some users are not willing to enable ACPI.

ACPI requires that pstates be of equivalent power usage and frequency across all processors. In a system with processors that are capable of different maximum frequencies (for example, one processor capable of 2.0 GHz and a second processor capable of 2.2 GHz), compliance with the ACPI specification means that the faster processor(s) will be restricted to the maximum speed of the slowest processor. Also, if one processor has 5 available pstates, the presence of a processor with only 4 available pstates will restrict all processors to 4 pstates.
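The effect on a mixed system can be sketched as taking the minimum across processors:

```c
#include <assert.h>

/* Under the ACPI model, the usable pstate count for the whole system
 * is the smallest count any processor offers. */
static unsigned int usable_pstates(const unsigned int *counts,
                                   unsigned int ncpus)
{
    unsigned int i, min = counts[0];
    for (i = 1; i < ncpus; i++)
        if (counts[i] < min)
            min = counts[i];
    return min;
}
```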

12 What Is There Today?

AMD is shipping pstate capable AMD Opteron processors (revision CG). Server processors prior to revision CG were not pstate capable. All AMD Athlon 64 processors for mobile and desktop are pstate capable.

BKDG[4] enhancements to describe the capability are in progress. AMD internal BIOSs have the enhancements. These enhancements are rolling out to the publicly available BIOSs along with the BKDG enhancements.

The multi-processor capable Linux frequency driver has been released under the GPL.

The cpufreqd user-mode daemon, available for download from http://sourceforge.net/projects/cpufreqd, supports multiple processors.

13 Other Software-directed Power Saving Mechanisms

13.1 Use Of The HLT Instruction

The hlt instruction is normally used when the operating system has no code for the processor to execute. This is the ACPI C1 state. Execution of instructions ceases until the processor is restarted with an interrupt. The power savings are maximized when the hlt state is entered in the minimum pstate, due to the lower voltage. The alternative to the use of the hlt instruction is a do-nothing loop.

13.2 Use of Power Managed Chipset Drivers

Devices on the planar board, such as a PCI-X bridge or an AGP tunnel, may have the capability to operate in lower power modes. Entering and leaving the lower power modes is under the control of the driver for that device.

Note that HyperTransport attached devices can transition themselves to lower power modes when certain messages are seen on the bus. However, this functionality is typically configurable, so a chipset driver (or the system BIOS during bootup) would need to enable this capability.

14 Items For Future Exploration

14.1 A Built-in Database

The theory is that the driver could have a built-in database of processors and the pstates that they support. The driver could then use this database to obtain the pstate data without dependencies on ACPI, or use it for enhanced checking of the ACPI provided data. The disadvantage of this is the need to update the database for new processor revisions. The advantages are the ability to overcome the ACPI-imposed restrictions, and also to allow the use of the technology on systems where ACPI support is not enabled.

14.2 Kernel Scheduler—CPU Power

An enhanced scheduler for the 2.6 kernel (2.6.6-bk1) is aware of groups of processors with different processing power. The power rating of each CPU group should be dynamically adjusted, using a cpufreq transition notifier, as the processor frequencies are changed. See http://lwn.net/Articles/80601/ for a detailed account of the scheduler changes.

14.3 Thermal Management, ACPI Thermal Zones

Publicly available BIOSs for AMD machines do not implement thermal zones. Obviously this is one way to provide the input control for frequency management based on thermal conditions.

14.4 Thermal Management, Service Processor

Servers typically have a service processor, which may be compliant to the IPMI specification. This service processor is able to accurately monitor temperature at different locations within the chassis. The 2.6 kernel includes an IPMI driver. User space code may use these thermal readings to control fan speeds and generate administrator alerts. It may make sense to also use these accurate thermal readings to trigger frequency transitions. The interaction between thermal events from the service processor and ACPI thermal zones may be a problem.

Hiding Thermal Conditions

One concern with the use of CPU frequency manipulation to avoid overheating is that hardware problems may not be noticed. Over-temperature conditions would normally cause administrator alerts, but if the processor is first taken to a lower frequency to hold temperature down, then the alert may not be generated. A failing fan (not spinning at full speed) could therefore be missed. Some hardware components fail gradually, and early warning of imminent failures is needed to perform planned maintenance. Losing this data would be badness.

15 Legal Information

Copyright © 2004 Advanced Micro Devices, Inc.

Permission to redistribute in accordance with Linux Symposium submission guidelines is granted; all other rights reserved.

AMD, the AMD Arrow logo, AMD Opteron, AMD Athlon and combinations thereof, AMD-8111, AMD-8131, and AMD-8151 are trademarks of Advanced Micro Devices, Inc.

Linux is a registered trademark of Linus Torvalds.

HyperTransport is a licensed trademark of the HyperTransport Technology Consortium.

Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

16 References

1. AMD Opteron Processor Data Sheet, publication 23932, available from www.amd.com.

2. AMD Opteron Processor Power And Thermal Data Sheet, publication 30417, available from www.amd.com.

3. AMD Athlon 64 Processor Power And Thermal Data Sheet, publication 30430, available from www.amd.com.

4. BIOS and Kernel Developer’s Guide (the BKDG) for AMD Athlon 64 and AMD Opteron Processors, publication 26094, available from www.amd.com. Chapter 9 covers frequency management.

5. ACPI 2.0b Specification, from www.acpi.info.

6. Text documentation files in the kernel linux/Documentation/cpu-freq/ directory:

• index.txt
• user-guide.txt
• core.txt
• cpu-drivers.txt
• governors.txt

Linux Symposium 2004 • Volume One • 185




Dynamic Kernel Module Support: From Theory to Practice

Matt Domsch & Gary Lerhaupt
Dell Linux Engineering
Matt_Domsch@dell.com, Gary_Lerhaupt@dell.com

Abstract<br />

DKMS is a framework which allows individual<br />

kernel modules to be upgraded without changing<br />

your whole kernel. Its primary audience<br />

is fourfold: system administrators who want<br />

to update a single device driver rather than<br />

wait for a new kernel from elsewhere with it<br />

included; distribution maintainers, who want<br />

to release a single targeted bugfix in between<br />

larger scheduled updates; system manufacturers<br />

who need single modules changed to support<br />

new hardware or to fix bugs, but do not<br />

wish to test whole new kernels; and driver<br />

developers, who must provide updated device<br />

drivers for testing and general use on a wide<br />

variety of kernels, as well as submit drivers to<br />

kernel.org.<br />

Since OLS2003, DKMS has gone from a good<br />

idea to deployed and used. Based on end user<br />

feedback, additional features have been added:<br />

precompiled module tarball support to speed<br />

factory installation; driver disks for Red Hat<br />

distributions; 2.6 kernel support; SuSE kernel<br />

support. Planned features include cross-architecture build support and additional distribution

driver disk methods.<br />

In addition to overviewing DKMS and its features,<br />

we explain how to create a dkms.conf file<br />

to DKMS-ify your kernel module source.<br />

1 History<br />

Historically, Linux distributions have bundled device

drivers into essentially one large kernel package,<br />

for several primary reasons:<br />

• Completeness: The Linux kernel as distributed

on kernel.org includes all the device<br />

drivers packaged neatly together in<br />

the same kernel tarball. Distro kernels follow<br />

kernel.org in this respect.<br />

• Maintainer simplicity: With over 4000<br />

files in the kernel drivers/ directory,<br />

each possibly separately versioned, it<br />

would be impractical for the kernel maintainer(s)<br />

to provide a separate package for<br />

each driver.<br />

• Quality Assurance / Support organization<br />

simplicity: It is easiest to ask a user “what<br />

kernel version are you running,” and to<br />

compare this against the list of approved<br />

kernel versions released by the QA team,<br />

rather than requiring the customer to provide<br />

a long and extensive list of package<br />

versions, possibly one per module.<br />

• End user install experience: End users<br />

don’t care about which of the 4000 possible<br />

drivers they need to install, they just<br />

want it to work.<br />

This works well as long as you are able to make<br />

the “top of the tree” contain the most current



and most stable device driver, and you are able<br />

to convince your end users to always run the<br />

“top of the tree.” The kernel.org development

processes tend to follow this model with<br />

great success.<br />

But widely used distros cannot ask their users<br />

to always update to the top of the kernel.org<br />

tree. Instead, they start their products from the<br />

top of the kernel.org tree at some point in time,<br />

essentially freezing at that point, to begin their test cycles. The duration of these test cycles can

be as short as a few weeks, and as long as a<br />

few years, but 3-6 months is not unusual. During<br />

this time, the kernel.org kernels march forward,<br />

and some (but not all) of these changes<br />

are backported into the distro’s kernel. They

then apply the minimal patches necessary for<br />

them to declare the product finished, and move<br />

the project into the sustaining phase, where<br />

changes are very closely scrutinized before releasing<br />

them to the end users.<br />

1.1 Backporting<br />

It is this sustaining phase that DKMS targets.<br />

DKMS can be used to backport newer device<br />

driver versions from the “top of the tree” kernels<br />

where most development takes place to the<br />

now-historical kernels of released products.<br />

The PATCH_MATCH mechanism was specifically

designed to allow the application of<br />

patches to a “top of the tree” device driver to<br />

make it work with older kernels. This allows<br />

driver developers to continue to focus their efforts<br />

on keeping kernel.org up to date, while allowing<br />

that same effort to be used on existing<br />

products with minimal changes. See Section 6<br />

for a further explanation of this feature.<br />
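Concretely, this is expressed in a module's dkms.conf. The sketch below uses DKMS's PATCH/PATCH_MATCH directive conventions; the patch file name and the kernel-version pattern are hypothetical examples, not taken from a real driver:

```sh
# Keep the driver source identical to the "top of the tree" version, and
# apply a backport patch only when building for a matching older kernel.
PATCH[0]="megaraid2-2.4.21-backport.patch"
PATCH_MATCH[0]="^2\.4\.21"
```

When the kernel being built for does not match the pattern, the patch is simply not applied, so the same source tree serves both current and historical kernels.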

1.2 Driver developers’ packaging<br />

Driver developers have recognized for a long<br />

time that they needed to provide backported<br />

versions of their drivers to match their end<br />

users’ needs. Often these requirements are<br />

imposed on them by system vendors such<br />

as Dell in support of a given distro release.<br />

However, each driver developer was free to<br />

provide the backport mechanism in any way<br />

they chose. Some provided architecture-specific RPMs which contained only precompiled

modules. Some provided source RPMs<br />

which could be rebuilt for the running kernel.<br />

Some provided driver disks with precompiled<br />

modules. Some provided just source code<br />

patches, and expected the end user to rebuild<br />

the kernel themselves to obtain the desired device<br />

driver version. All provided their own<br />

Makefiles rather than use the kernel-provided<br />

build system.<br />

As a result, different problems were encountered<br />

with each developers’ solution. Some<br />

developers had not included their drivers in<br />

the kernel.org tree for so long that there

were discrepancies, e.g. CONFIG_SMP vs<br />

__SMP__, CONFIG_2G vs. CONFIG_3G,<br />

and compiler option differences which went<br />

unnoticed and resulted in hard-to-debug issues.<br />

Needless to say, with so many different mechanisms,<br />

all done differently, and all with different<br />

problems, it was a nightmare for end users.<br />

A new mechanism was needed to cleanly handle<br />

applying updated device drivers onto an<br />

end user’s system. Hence DKMS was created<br />

as the one module update mechanism to replace<br />

all previous methods.<br />

2 Goals<br />

DKMS has several design goals.<br />

• Implement only mechanism, not policy.<br />

• Allow system administrators to easily<br />

know what modules, what versions, for



what kernels, and in what state, they have<br />

on their system.<br />

• Keep module source as it would be found<br />

in the “top of the tree” on kernel.org. Apply<br />

patches to backport the modules to<br />

earlier kernels as necessary.<br />

• Use the kernel-provided build mechanism.<br />

This reduces the Makefile magic<br />

that driver developers need to know, thus<br />

the likelihood of getting it wrong.<br />

• Keep additional DKMS knowledge a<br />

driver developer must have to a minimum.<br />

Only a small per-driver dkms.conf file is<br />

needed.<br />

• Allow multiple versions of any one module<br />

to be present on the system, with only<br />

one active at any given time.<br />

• Allow DKMS-aware drivers to be<br />

packaged in the Linux Standard Base-conformant RPM format.

• Ease of use by multiple audiences: driver<br />

developers, system administrators, Linux

distros, and system vendors.<br />

We discuss DKMS as it applies to each of these<br />

four audiences.<br />

3 Distributions<br />

All present Linux distributions distribute device

drivers bundled into essentially one large<br />

kernel package, for reasons outlined in Section<br />

1. It makes the most sense, most of the<br />

time.<br />

However, there are cases where it does not<br />

make sense.<br />

• Severity 1 bugs are discovered in a single<br />

device driver between larger scheduled<br />

updates. Ideally you’d like your affected<br />

users to be able to get the single<br />

module update without having to release<br />

and Q/A a whole new kernel. Only customers<br />

who are affected by the particular<br />

bug need to update “off-cycle.”<br />

• Solutions vendors, for change control reasons,<br />

often certify their solution on a particular<br />

distribution, scheduled update release,<br />

and sometimes specific kernel version.<br />

The latter, combined with releasing

device driver bug fixes as whole new kernels,<br />

puts the customer in the untenable<br />

position of either updating to the new kernel<br />

(and losing the certification of the solution<br />

vendor), or forgoing the bug fix and<br />

possibly putting their data at risk.<br />

• Some device drivers are not (yet) included<br />

in kernel.org nor a distro kernel, however<br />

one may be required for a functional software<br />

solution. The current support models

require that the add-on driver “taint”<br />

the kernel or in some way flag to the support<br />

organization that the user is running<br />

an unsupported kernel module. Tainting,<br />

while valid, only has three dimensions<br />

to it at present: Proprietary—non-GPL<br />

licensed; Forced—loaded via insmod<br />

-f; and Unsafe SMP—for some CPUs<br />

which are not designed to be SMP-capable.

A GPL-licensed device driver<br />

which is not yet in kernel.org or provided<br />

by the distribution may trigger none of<br />

these taints, yet the support organization<br />

needs to be aware of this module’s presence.<br />

To avoid this, we expect to see<br />

the distros begin to cryptographically sign<br />

kernel modules that they produce, and<br />

taint on load of an unsigned module. This<br />

would help reduce the support organization’s<br />

work for calls about “unsupported”



configurations. With DKMS in use, there<br />

is less of a need for such methods, as it’s easy

to see which modules have been changed.<br />

Note: this is not to suggest that driver authors<br />

should not submit their drivers to<br />

kernel.org—absolutely they should.<br />

• The distro QA team would like to test updates

to specific drivers without waiting<br />

for the kernel maintenance team to rebuild<br />

the kernel package (which can take many<br />

hours in some cases). Likewise, individual<br />

end users may be willing (and often be<br />

required, e.g. if the distro QA team can’t<br />

reproduce the user’s hardware and software

environment exactly) to show that a<br />

particular bug is fixed in a driver, prior<br />

to releasing the fix to all of that distro’s<br />

users.<br />

• New hardware support via driver disks:<br />

Hardware vendors release new hardware<br />

asynchronously to any software vendor<br />

schedule, no matter how hard companies<br />

may try to synchronize releases. OS distributions<br />

provide install methods which<br />

use driver diskettes to enable new hardware<br />

for previously-released versions of<br />

the OS. Generating driver disks has always<br />

been a difficult and error-prone procedure,<br />

different for each OS distribution,<br />

not something that the casual end-user<br />

would dare attempt.<br />

DKMS was designed to address all of these<br />

concerns.<br />

DKMS aims to provide a clear separation between<br />

mechanism (how one updates individual<br />

kernel modules and tracks such activity) and<br />

policy (when should one update individual kernel<br />

modules).<br />

3.1 Mechanism<br />

DKMS provides only the mechanism for updating<br />

individual kernel modules, not policy.<br />

As such, it can be used by distributions (per<br />

their policy) for updating individual device<br />

drivers for individual users affected by Severity<br />

1 bugs, without releasing a whole new kernel.<br />

The first mechanism critical to a system administrator

or support organization is the status<br />

command, which reports the name, version,<br />

and state of each kernel module under DKMS<br />

control. By querying DKMS for this information,<br />

system administrators and distribution<br />

support organizations may quickly understand<br />

when an updated device driver is in use to<br />

speed resolution when issues are seen.<br />

DKMS’s ability to generate driver diskettes<br />

gives control to both novice and seasoned system<br />

administrators alike, as they can now perform<br />

work which otherwise they would have<br />

to wait for a support organization to do for<br />

them. They can get their new hardware systems

up-and-running quickly by themselves,<br />

leaving the support organizations with time to<br />

do other more interesting value-added work.<br />

3.2 Policy<br />

Suggested policy items include:<br />

• Updates must pass QA. This seems obvious,<br />

but it reduces the chance that broken updates (intended to fix other problems) will be released.

• Updates must be submitted, and ideally be<br />

included already, upstream. For this we<br />

expect kernel.org and the OS distribution<br />

to include the update in their next larger<br />

scheduled update. This ensures that when<br />

the next kernel.org kernel or distro update



comes out, the short-term fix provided via<br />

DKMS is incorporated already.<br />

• The AUTOINSTALL mechanism is set to

NO for all modules which are shipped with<br />

the target distro’s kernel. This prevents<br />

the DKMS autoinstaller from installing<br />

a (possibly older) kernel module onto a<br />

newer kernel without being explicitly told<br />

to do so by the system administrator. This<br />

follows from the “all DKMS updates must<br />

be in the next larger release” rule above.<br />

• All issues for which DKMS is used are<br />

tracked in the appropriate bug tracking<br />

databases until they are included upstream,<br />

and are reviewed regularly.<br />

• All DKMS packages are provided as<br />

DKMS-enabled RPMs for easy installation<br />

and removal, per the Linux Standard

Base specification.<br />

• All DKMS packages are posted to the distro’s<br />

support web site for download by<br />

system administrators affected by the particular

issue.<br />

4 System Vendors<br />

DKMS is useful to System Vendors such as<br />

Dell for many of the same reasons it’s useful<br />

to the <strong>Linux</strong> distributions. In addition, system<br />

vendors face additional issues:<br />

• Critical bug fixes for distro-provided<br />

drivers: While we hope to never need<br />

such, and we test extensively with distro-provided drivers, occasionally we have

discovered a critical bug after the distribution<br />

has cut their gold CDs. We use<br />

DKMS to update just the affected device<br />

drivers.<br />

• Alternate drivers: Dell occasionally needs<br />

to provide an alternate driver for a piece of<br />

hardware rather than that provided by the<br />

distribution natively. For example, Dell<br />

provides the Intel iANS network channel<br />

bonding and failover driver for customers<br />

who have used iANS in the past, and wish<br />

to continue using it rather than upgrading<br />

to the native channel bonding driver resident<br />

in the distribution.<br />

• Factory installation: Dell installs various<br />

OS distribution releases onto new hardware<br />

in its factories. We try not to update<br />

from the gold release of a distribution<br />

version to any of the scheduled updates,<br />

as customers expect to receive gold. We<br />

use DKMS to enable newer device drivers<br />

to handle newer hardware than was supported<br />

natively in the gold release, while<br />

keeping the gold kernel the same.<br />

We briefly describe the policy Dell uses, in addition<br />

to the above rules suggested to OS distributions:<br />

• Prebuilt DKMS tarballs are required for<br />

factory installation use, for all kernels<br />

used in the factory install process. This<br />

prevents the need for the compiler to be<br />

run, saving time through the factories.<br />

Dell rarely changes the factory install images<br />

for a given OS release, so this is not<br />

a huge burden on the DKMS packager.<br />

• All DKMS packages are posted to support.dell.com<br />

for download by system administrators<br />

purchasing systems without<br />

Linux factory-installed.



Figure 1: DKMS state diagram.<br />

5 System Administrators<br />

5.1 Understanding the DKMS Life Cycle<br />

Before diving into using DKMS to manage kernel<br />

modules, it is helpful to understand the life<br />

cycle by which DKMS maintains your kernel<br />

modules. In Figure 1, each rectangle represents<br />

a state your module can be in and each<br />

italicized word represents a DKMS action that<br />

can be used to switch between the various DKMS

states. In the following section we will look<br />

further into each of these DKMS actions and<br />

then continue on to discuss auxiliary DKMS<br />

functionality that extends and improves upon<br />

your ability to utilize these basic commands.<br />
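In outline, the states in Figure 1 map onto DKMS commands as follows. This is a sketch; the module name, version, and kernel are taken from the megaraid2/2.10.3 example used throughout the rest of this section, and each command is explained individually below:

```sh
dkms add       -m megaraid2 -v 2.10.3                      # Not in tree -> Added
dkms build     -m megaraid2 -v 2.10.3 -k 2.4.21-4.ELsmp   # Added -> Built
dkms install   -m megaraid2 -v 2.10.3 -k 2.4.21-4.ELsmp   # Built -> Installed
dkms uninstall -m megaraid2 -v 2.10.3 -k 2.4.21-4.ELsmp   # Installed -> Built
dkms remove    -m megaraid2 -v 2.10.3 --all               # Built -> Not in tree
```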

5.2 RPM and DKMS

DKMS was designed to work well with the Red Hat Package Manager (RPM). Many times, using DKMS to install a kernel module is as easy as installing a DKMS-enabled module RPM. Internally, these RPMs use DKMS to add, build, and install a module. By wrapping DKMS commands inside of an RPM, you get the benefits of RPM (package versioning, security, dependency resolution, and package distribution methodologies) while DKMS handles the work RPM does not: versioning and building of individual kernel modules. For reference, a sample DKMS-enabled RPM specfile can be found in the DKMS package.

5.3 Using DKMS

5.3.1 Add

DKMS manages kernel module versions at the source code level. The first requirement of using DKMS is that the module source be located on the build system, in the directory /usr/src/<module>-<module-version>/. DKMS also requires a dkms.conf file with appropriately formatted directives that tell DKMS such things as where to install the module and how to build it. Once these two requirements have been met and DKMS has been installed on your system, you can begin using DKMS by adding a module/module-version to the DKMS tree. For example:

# dkms add -m megaraid2 -v 2.10.3

This example add command would add<br />

megaraid2/2.10.3 to the already existent<br />

/var/dkms tree, leaving it in the Added<br />

state.<br />
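For illustration, a minimal dkms.conf for this module might look like the following sketch. The directive names follow DKMS conventions; the MAKE command and destination directory shown here are hypothetical and depend on the driver's own build system:

```sh
# dkms.conf -- illustrative sketch for megaraid2 2.10.3 (values hypothetical)
PACKAGE_NAME="megaraid2"
PACKAGE_VERSION="2.10.3"

# How DKMS should build the module; $kernelver is substituted by DKMS.
MAKE[0]="make KERNELDIR=/lib/modules/$kernelver/build"
CLEAN="make clean"

# The resulting binary and where it is installed under /lib/modules.
BUILT_MODULE_NAME[0]="megaraid2"
DEST_MODULE_LOCATION[0]="/kernel/drivers/scsi"

# Left empty so the autoinstaller does not act on this module (Section 3.2).
AUTOINSTALL=""
```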

5.3.2 Build<br />

Once in the Added state, the module is ready<br />

to be built. This occurs through the DKMS<br />

build command and requires that the proper<br />

kernel sources are located on the system, as found from the /lib/modules/<kernel-version>/build symlink. The make command that is



used to compile the module is specified in the<br />

dkms.conf configuration file. Continuing with<br />

the megaraid2/2.10.3 example:<br />

# dkms build -m megaraid2<br />

-v 2.10.3 -k 2.4.21-4.ELsmp<br />

The build command compiles the module

but stops short of installing it. As can be seen<br />

in the above example, build expects a kernel-version parameter. If this kernel name is left

out, it assumes the currently running kernel.<br />

However, it functions perfectly well to build<br />

modules for kernels that are not currently running.<br />

This functionality is assured through use<br />

of a kernel preparation subroutine that runs before<br />

any module build is performed in order<br />

to ensure that the module being built is linked<br />

against the proper kernel symbols.<br />

Successful completion of a build creates, for<br />

this example, the /var/dkms/megaraid2/<br />

2.10.3/2.4.21-4.ELsmp/ directory as<br />

well as the log and module subdirectories<br />

within this directory. The log directory holds

a log file of the module make and the module<br />

directory holds copies of the resultant binaries.<br />

5.3.3 Install<br />

With the completion of a build, the module<br />

can now be installed on the kernel for<br />

which it was built. Installation copies the compiled<br />

module binary to the correct location in<br />

the /lib/modules/ tree as specified in the<br />

dkms.conf file. If a module by that name is<br />

already found in that location, DKMS saves it<br />

in its tree as an original module so at a later<br />

time it can be put back into place if the newer<br />

module is uninstalled. An example install<br />

command:<br />

# dkms install -m megaraid2<br />

-v 2.10.3 -k 2.4.21-4.ELsmp<br />

If a module by the same name is already installed, DKMS saves a copy in its tree, in the /var/dkms/<module>/original_module/<kernel-version>/ directory. In this case, it would be saved to /var/dkms/megaraid2/original_module/2.4.21-4.ELsmp/.

5.3.4 Uninstall and Remove<br />

To complete the DKMS cycle, you can also<br />

uninstall or remove your module from the<br />

tree. <strong>The</strong> uninstall command deletes from<br />

/lib/modules the module you installed<br />

and, if applicable, replaces it with its original<br />

module. In scenarios where multiple versions<br />

of a module are located within the DKMS tree,<br />

when one version is uninstalled, DKMS does<br />

not try to understand or assume which of these<br />

other versions to put in its place. Instead, if<br />

a true “original_module” was saved from the<br />

very first DKMS installation, it will be put back<br />

into the kernel and all of the other module versions<br />

for that module will be left in the Built<br />

state. An example uninstall would be:<br />

# dkms uninstall -m megaraid2<br />

-v 2.10.3 -k 2.4.21-4.ELsmp<br />

Again, if the kernel version parameter is unset, the currently running kernel is assumed; the same default does not apply to the remove command. The remove and

uninstall are very similar in that remove<br />

will do all of the same steps as uninstall.<br />

However, when remove is employed, if the<br />

module-version being removed is the last instance<br />

of that module-version for all kernels<br />

on your system, after the uninstall portion of<br />

the remove completes, it will delete all traces<br />

of that module from the DKMS tree. To put it<br />

another way, when an uninstall command<br />

completes, your modules are left in the Built



state. However, when a remove completes,<br />

you would be left in the Not in Tree state. Here<br />

are two sample remove commands:<br />

# dkms remove -m megaraid2<br />

-v 2.10.3 -k 2.4.21-4.ELsmp<br />

# dkms remove -m megaraid2<br />

-v 2.10.3 --all<br />

With the first example remove command,<br />

your module would be uninstalled and if this<br />

module/module-version were not installed on<br />

any other kernel, all traces of it would be removed<br />

from the DKMS tree all together. If,<br />

say, megaraid2/2.10.3 was also installed on the<br />

2.4.21-4.ELhugemem kernel, the first remove<br />

command would leave it alone and it would remain<br />

intact in the DKMS tree. In the second<br />

example, that would not be the case. It would<br />

uninstall all versions of the megaraid2/2.10.3<br />

module from all kernels and then completely<br />

expunge all references of megaraid2/2.10.3<br />

from the DKMS tree. Thus, remove is what<br />

cleans your DKMS tree.<br />

5.4 Miscellaneous DKMS Commands<br />

5.4.1 Status<br />

DKMS also comes with a fully functional status<br />

command that returns information about<br />

what is currently located in your tree. If no<br />

parameters are set, it will return all information<br />

found. Logically, the specificity of information<br />

returned depends on which parameters<br />

are passed to your status command. Each status<br />

entry returned will be of the state: “added,”<br />

“built,” or “installed,” and if an original module<br />

has been saved, this information will also<br />

be displayed. Some example status commands<br />

include:<br />

# dkms status<br />

# dkms status -m megaraid2<br />

# dkms status -m megaraid2 -v 2.10.3<br />

# dkms status -k 2.4.21-4.ELsmp<br />

# dkms status -m megaraid2<br />

-v 2.10.3 -k 2.4.21-4.ELsmp<br />
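Because each status line has a regular shape (module, version, kernel: state), simple scripts can build on it. As an illustration (this helper is not part of DKMS itself; the function name is hypothetical), a small shell filter that lists the kernels for which a given module/version is in the “built” state might look like:

```shell
# Read `dkms status`-style lines on stdin and print the kernels for which
# the given module/version is currently in the "built" state.
dkms_built_kernels() {
  # $1 = module name, $2 = module version
  awk -F', *' -v m="$1" -v v="$2" \
    '$1 == m && $2 == v && $3 ~ /built/ { sub(/:.*/, "", $3); print $3 }'
}

# Demo on sample lines in the same format as the examples above:
printf '%s\n' \
  'megaraid2, 2.10.3, 2.4.20-9: built' \
  'megaraid2, 2.10.3, 2.4.21-4.ELsmp: installed' \
  'megaraid2, 2.10.3, 2.4.21-4.EL: built' \
  | dkms_built_kernels megaraid2 2.10.3
# prints: 2.4.20-9 and 2.4.21-4.EL (the "installed" entry is excluded)
```

In practice the input would come from `dkms status` itself rather than printf.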

5.4.2 Match<br />

Another major feature of DKMS is the match<br />

command. The match command takes the configuration

of DKMS installed modules for one<br />

kernel and applies this same configuration to<br />

some other kernel. When the match completes,<br />

the same module/module-versions that were<br />

installed for one kernel are also then installed<br />

on the other kernel. This is helpful when you<br />

are upgrading from one kernel to the next, but<br />

would like to keep the same DKMS modules in<br />

place for the new kernel. Here is an example:<br />

# dkms match<br />

--templatekernel 2.4.21-4.ELsmp<br />

-k 2.4.21-5.ELsmp<br />

As can be seen in the example, the --templatekernel is the “match-er” kernel from which the configuration is taken, while the -k kernel is the “match-ee” onto which that configuration is applied.

5.4.3 dkms_autoinstaller<br />

Similar in nature to the match command is<br />

the dkms_autoinstaller service. This service<br />

gets installed as part of the DKMS RPM<br />

in the /etc/init.d directory. Depending on<br />

whether an AUTOINSTALL directive is set<br />

within a module’s dkms.conf configuration<br />

file, the dkms_autoinstaller will automatically<br />

build and install that module as you boot your<br />

system into new kernels which do not already<br />

have this module installed.<br />

5.4.4 mkdriverdisk<br />

The last miscellaneous DKMS command is

mkdriverdisk. As can be inferred from its<br />

name, mkdriverdisk will take the proper



sources in your DKMS tree and create a driver<br />

disk image for use in providing updated drivers<br />

to Linux distribution installations. A sample mkdriverdisk invocation might look like:

# dkms mkdriverdisk -d redhat -m megaraid2 -v 2.10.3 -k 2.4.21-4.ELBOOT

Currently, the only supported distribution driver disk format is Red Hat. For more information on the extra files required, and their formats, for DKMS to create Red Hat driver disks, see http://people.redhat.com/dledford. These files should be placed in your module source directory.

5.5 Systems Management with DKMS Tarballs

As we have seen, DKMS provides a simple mechanism to build, install, and track device driver updates. So far, all these actions have related to a single machine. But what if you have many similar machines under your administrative control? What if you have a compiler and kernel source on only one system (your master build system), but you need to deploy your newly built driver to all your other systems? DKMS provides a solution to this as well, in the mktarball and ldtarball commands.

The mktarball command rolls up copies of each device driver module file which you have built using DKMS into a compressed tarball. You may then copy this tarball to each of your target systems, and use the DKMS ldtarball command to load those modules into your DKMS tree, leaving each module in the Built state, ready to be installed. This avoids the need for both kernel source and compilers to be on every target system.

For example, suppose you have built the megaraid2 device driver, version 2.10.3, for two different kernel families (here 2.4.20-9 and 2.4.21-4.EL), on your master build system:

# dkms status<br />

megaraid2, 2.10.3, 2.4.20-9: built<br />

megaraid2, 2.10.3, 2.4.20-9bigmem: built<br />

megaraid2, 2.10.3, 2.4.20-9BOOT: built<br />

megaraid2, 2.10.3, 2.4.20-9smp: built<br />

megaraid2, 2.10.3, 2.4.21-4.EL: built<br />

megaraid2, 2.10.3, 2.4.21-4.ELBOOT: built<br />

megaraid2, 2.10.3, 2.4.21-4.ELhugemem: built<br />

megaraid2, 2.10.3, 2.4.21-4.ELsmp: built<br />

You wish to deploy this version of the<br />

driver to several systems, without rebuilding<br />

from source each time. You can use the<br />

mktarball command to generate one tarball<br />

for each kernel family:<br />

# dkms mktarball -m megaraid2<br />

-v 2.10.3<br />

-k 2.4.21-4.EL,2.4.21-4.ELsmp,<br />

2.4.21-4.ELBOOT,2.4.21-4.ELhugemem<br />

Marking /usr/src/megaraid2-2.10.3 for archiving...<br />

Marking kernel 2.4.21-4.EL for archiving...<br />

Marking kernel 2.4.21-4.ELBOOT for archiving...<br />

Marking kernel 2.4.21-4.ELhugemem for archiving...<br />

Marking kernel 2.4.21-4.ELsmp for archiving...<br />

Tarball location:<br />

/var/dkms/megaraid2/2.10.3/tarball/<br />

megaraid2-2.10.3-manykernels.tgz<br />

Done.<br />

You can make one big tarball containing modules<br />

for both families by omitting the -k argument<br />

and kernel list; DKMS will include a<br />

module for every kernel version found.<br />
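Concretely, the all-kernels form of the command would look like the following sketch, reusing the module and version from the example above; the RUN=echo guard keeps it a dry run, since mktarball only makes sense on the master build system.

```shell
# Sketch: omitting -k makes DKMS archive a module for every kernel
# version it has built. RUN=echo prints the command instead of running it.
RUN=${RUN:-echo}
$RUN dkms mktarball -m megaraid2 -v 2.10.3
```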

You may then copy the tarball (renaming it if<br />

you wish) to each of your target systems using<br />

any mechanism you wish, and load the modules<br />

in. First, see that the target DKMS tree<br />

does not contain the modules you’re loading:<br />

# dkms status<br />

Nothing found within the DKMS tree for<br />

this status command. If your modules were<br />

not installed with DKMS, they will not show<br />

up here.<br />

<strong>The</strong>n, load the tarball on your target system:


# dkms ldtarball<br />

--archive=megaraid2-2.10.3-manykernels.tgz<br />

Loading tarball for module:<br />

megaraid2 / version: 2.10.3<br />

Loading /usr/src/megaraid2-2.10.3...<br />

Loading /var/dkms/megaraid2/2.10.3/2.4.21-4.EL...<br />

Loading /var/dkms/megaraid2/2.10.3/2.4.21-4.ELBOOT...<br />

Loading /var/dkms/megaraid2/2.10.3/2.4.21-4.ELhugemem...<br />

Loading /var/dkms/megaraid2/2.10.3/2.4.21-4.ELsmp...<br />

Creating /var/dkms/megaraid2/2.10.3/source symlink...<br />

Finally, verify the modules are present, and in<br />

the Built state:<br />

# dkms status<br />

megaraid2, 2.10.3, 2.4.21-4.EL: built<br />

megaraid2, 2.10.3, 2.4.21-4.ELBOOT: built<br />

megaraid2, 2.10.3, 2.4.21-4.ELhugemem: built<br />

megaraid2, 2.10.3, 2.4.21-4.ELsmp: built<br />

DKMS ldtarball leaves the modules in the<br />

Built state, not the Installed state. For each kernel<br />

version you want your modules to be installed<br />

into, follow the install steps as above.<br />
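Putting Section 5.5 together, the copy/load/install cycle can be scripted. The following is only a sketch: the hostnames are made up, the RUN=echo guard turns it into a dry run, and it assumes root ssh access to each target.

```shell
#!/bin/sh
# Hypothetical deployment sketch: push the DKMS tarball to each target,
# load it into the DKMS tree, and install it for the running kernel.
RUN=${RUN:-echo}       # set RUN= (empty) to actually execute
MODULE=megaraid2
VERSION=2.10.3
TARBALL=$MODULE-$VERSION-manykernels.tgz

for host in web1 web2 db1; do   # illustrative hostnames
    $RUN scp "$TARBALL" "root@$host:/tmp/$TARBALL"
    $RUN ssh "root@$host" "dkms ldtarball --archive /tmp/$TARBALL"
    # ldtarball leaves the modules Built; install makes them Installed
    $RUN ssh "root@$host" "dkms install -m $MODULE -v $VERSION -k \$(uname -r)"
done
```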

6 Driver Developers<br />

As the maintainer of a kernel module, the only<br />

thing you need to do to get DKMS interoperability<br />

is place a small dkms.conf file in your<br />

driver source tarball. Once this has been done,<br />

any user of DKMS can simply do:<br />

dkms ldtarball --archive /path/to/foo-1.0.tgz<br />

That’s it. We could discuss at length (but<br />
will not rehash in this paper) the best methods<br />
of utilizing DKMS within a DKMS-enabled<br />
module RPM, but for simple DKMS usability,<br />

the buck stops here. With the dkms.conf file<br />

in place, you have now positioned your source<br />

tarball to be usable by all manner and skill level<br />

of <strong>Linux</strong> users utilizing your driver. Effectively,<br />

you have greatly increased your testing<br />

base without having to wade into package management<br />

or pre-compiled binaries. DKMS will<br />

handle this all for you. Along the same line,<br />

by leveraging DKMS you can now easily allow<br />

more widespread testing of your driver. Since<br />

driver versions can now be cleanly tracked outside<br />

of the kernel tree, you no longer need to wait<br />

for the next kernel release in order for the community<br />

to register the necessary debugging cycles<br />

against your code. Instead, DKMS can be<br />

counted on to manage various versions of your<br />

kernel module such that any catastrophic errors<br />

in your code can be easily mitigated by a singular<br />

dkms uninstall command.<br />
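Such a rollback is a one-liner. The sketch below uses the hypothetical foo-1.0 package from the ldtarball example above, with a RUN=echo guard keeping it a dry run:

```shell
# Hypothetical rollback sketch: uninstall a bad module version from the
# running kernel, reverting to the previously saved module, if any.
RUN=${RUN:-echo}
$RUN dkms uninstall -m foo -v 1.0 -k "$(uname -r)"
```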

This leaves the composition of the dkms.conf<br />

as the only interesting piece left to discuss<br />

for the driver developer audience. With that<br />

in mind, we will now walk through two<br />
dkms.conf examples, ranging from the<br />
minimal required configuration (Figure 2) to a<br />
maximally elaborate one (Figure 3).<br />

6.1 Minimal dkms.conf for 2.4 kernels<br />

Referring to Figure 2, the first thing to note<br />
is the definition of the package version and of<br />
the make command to be used<br />
to compile your module. This is only necessary<br />

for 2.4-based kernels, and lets the developer<br />

specify their desired make incantation.<br />

Reviewing the rest of the dkms.conf,<br />

PACKAGE_NAME and BUILT_MODULE_<br />

NAME[0] appear to duplicate each other,<br />

but this is only the case for a package which<br />

contains only one kernel module within it.<br />

Had this example been for something like<br />

ALSA, the name of the package would be<br />

“alsa,” but the BUILT_MODULE_NAME array<br />

would instead be populated with the names of<br />

the kernel modules within the ALSA package.<br />

<strong>The</strong> final required piece of this minimal example<br />

is the DEST_MODULE_LOCATION array.<br />

This simply tells DKMS where in the<br />

/lib/modules tree it should install your module.


PACKAGE_NAME="megaraid2"<br />

PACKAGE_VERSION="2.10.3"<br />

MAKE[0]="make -C ${kernel_source_dir}<br />

SUBDIRS=${dkms_tree}/${PACKAGE_NAME}/${PACKAGE_VERSION}/build modules"<br />

BUILT_MODULE_NAME[0]="megaraid2"<br />

DEST_MODULE_LOCATION[0]="/kernel/drivers/scsi/"<br />

Figure 2: A minimal dkms.conf<br />

6.2 Minimal dkms.conf for 2.6 kernels<br />

In the current version of DKMS, for 2.6 kernels<br />

the MAKE command listed in the dkms.conf<br />

is wholly ignored, and instead DKMS will always<br />

use:<br />

make -C /lib/modules/$kernel_version/build \<br />

M=$dkms_tree/$module/$module_version/build<br />

This jibes with the new external module build<br />

infrastructure supported by Sam Ravnborg’s<br />

kernel Makefile improvements, as DKMS will<br />

always build your module in a build subdirectory<br />

it creates for each version you have<br />

installed. Similarly, an impending future<br />

version of DKMS will also begin to ignore<br />

the PACKAGE_VERSION as specified in<br />

dkms.conf in favor of the new modinfo-provided<br />

information as implemented by Rusty<br />

Russell.<br />

With regard to removing the requirement for<br />

DEST_MODULE_LOCATION for 2.6 kernels,<br />

given that similar information should be located<br />

in the install target of the Makefile provided<br />

with your package, it is theoretically possible<br />

that DKMS could one day glean such<br />

information from the Makefile instead. In<br />

fact, in a simple scenario as this example, it<br />

is further theoretically possible that the name<br />

of the package and of the built module could<br />

also be determined from the package Makefile.<br />

In effect, this would completely remove<br />

any need for a dkms.conf whatsoever, thus enabling<br />

all simple module tarballs to be automatically<br />

DKMS enabled.<br />

However, as these features have not been explored,<br />

and as package maintainers would<br />

likely want to use some of the other dkms.conf<br />

directive features which are about to be elaborated<br />

upon, it is likely that requiring a<br />

dkms.conf will continue for the foreseeable future.<br />

6.3 Optional dkms.conf directives<br />

In the real-world version of Dell’s DKMS-enabled<br />
megaraid2 package, we also specify<br />

the optional directives:<br />

MODULES_CONF_ALIAS_TYPE[0]=<br />

"scsi_hostadapter"<br />

MODULES_CONF_OBSOLETES[0]=<br />

"megaraid,megaraid_2002"<br />

REMAKE_INITRD="yes"<br />

<strong>The</strong>se directives tell DKMS to remake the kernel’s<br />

initial ramdisk after every DKMS install<br />

or uninstall of this module. <strong>The</strong>y further specify<br />

that before this happens, /etc/modules.conf<br />

(or /etc/sysconfig/kernel) should be edited intelligently<br />

so that the initrd is properly assembled.<br />

In this case, if /etc/modules.conf already<br />

contains a reference to either “megaraid” or<br />

“megaraid_2002,” these will be switched to<br />

“megaraid2.” If no such references are found,


then a new “scsi_hostadapter” entry will be<br />

added as the highest-numbered scsi_hostadapter entry.<br />

On the other hand, if it had also included:<br />

MODULES_CONF_OBSOLETES_ONLY="yes"<br />

then had no obsolete references been found,<br />

a new “scsi_hostadapter” line would not have<br />

been added. This would be useful in scenarios<br />

where you instead want to rely on something<br />

like Red Hat’s kudzu program for adding references<br />

for your kernel modules.<br />

One could also hypothetically specify<br />

within the dkms.conf:<br />

DEST_MODULE_NAME[0]="megaraid"<br />

This would cause the resultant megaraid2 kernel<br />

module to be renamed to “megaraid” before<br />

being installed. Rather than having to<br />

propagate various one-off naming mechanisms<br />

which include the version as part of the module<br />

name in /lib/modules, as has previously been<br />

common practice, DKMS could instead be relied<br />

upon to manage all module versioning to<br />

avoid such clutter. Was megaraid_2002 a version<br />

or just a special year in the hearts of the<br />

megaraid developers? While you and I might<br />

know the answer to this, it certainly confused<br />

Dell’s customers.<br />

Continuing with hypothetical additions to the<br />

dkms.conf in Figure 2, one could also include:<br />

BUILD_EXCLUSIVE_KERNEL="^2\.4.*"<br />

BUILD_EXCLUSIVE_ARCH="i.86"<br />

In the event that you know the code you produced<br />

is not portable, this is how you can tell<br />

DKMS to keep people from trying to build it<br />

elsewhere. <strong>The</strong> above restrictions would only<br />

allow the kernel module to be built on 2.4 kernels<br />

on x86 architectures.<br />
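The two values are ordinary regular expressions; as a quick demonstration, one can try them by hand with grep standing in for DKMS's own matching (an assumption, but the patterns themselves are the ones above):

```shell
# Try the BUILD_EXCLUSIVE patterns from the text against sample values.
kernel_ok() { echo "$1" | grep -q '^2\.4.*'; }
arch_ok()   { echo "$1" | grep -q 'i.86'; }

kernel_ok 2.4.21-4.EL && echo "2.4.21-4.EL: build allowed"
kernel_ok 2.6.4       || echo "2.6.4: build refused"
arch_ok   i686        && echo "i686: build allowed"
arch_ok   x86_64      || echo "x86_64: build refused"
```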

Continuing with optional dkms.conf directives,<br />

the ALSA example in Figure 3 is taken<br />

directly from a DKMS-enabled package that<br />

Dell released to address sound issues on the<br />

Precision 360 workstation. It is slightly<br />

abridged, as the alsa-driver as delivered actually<br />
installs 13 separate kernel modules; for the<br />
sake of this example, only 10 are shown.<br />

In this example, we have:<br />

AUTOINSTALL="yes"<br />

This tells the boot-time service<br />

dkms_autoinstaller that this package should be<br />

built and installed as you boot into a new kernel<br />
upon which DKMS has not already installed this<br />
package. By general policy, Dell only<br />

allows AUTOINSTALL to be set if the kernel<br />

modules are not already natively included<br />

with the kernel. This is to avoid the scenario<br />

where DKMS might automatically install<br />

over a newer version of the kernel module as<br />

provided by some newer version of the kernel.<br />

However, given the 2.6 modinfo changes,<br />

DKMS can now be modified to intelligently<br />

check the version of a native kernel module<br />

before clobbering it with some older version.<br />

This will likely result in a future policy change<br />

within Dell with regard to this feature.<br />

In this example, we also have:<br />

PATCH[0]="adriver.h.patch"<br />

PATCH_MATCH[0]="2.4.[2-9][2-9]"<br />

<strong>The</strong>se two directives indicate to DKMS that<br />

if the kernel that the kernel module is being<br />

built for is >=2.4.22 (but still of the 2.4 family),<br />

the included adriver.h.patch should first be


PACKAGE_NAME="alsa-driver"<br />

PACKAGE_VERSION="0.9.0rc6"<br />

MAKE="sh configure --with-cards=intel8x0 --with-sequencer=yes \<br />

--with-kernel=/lib/modules/$kernelver/build \<br />

--with-moddir=/lib/modules/$kernelver/kernel/sound > /dev/null; make"<br />

AUTOINSTALL="yes"<br />

PATCH[0]="adriver.h.patch"<br />

PATCH_MATCH[0]="2.4.[2-9][2-9]"<br />

POST_INSTALL="alsa-driver-dkms-post.sh"<br />

MODULES_CONF[0]="alias char-major-116 snd"<br />

MODULES_CONF[1]="alias snd-card-0 snd-intel8x0"<br />

MODULES_CONF[2]="alias char-major-14 soundcore"<br />

MODULES_CONF[3]="alias sound-slot-0 snd-card-0"<br />

MODULES_CONF[4]="alias sound-service-0-0 snd-mixer-oss"<br />

MODULES_CONF[5]="alias sound-service-0-1 snd-seq-oss"<br />

MODULES_CONF[6]="alias sound-service-0-3 snd-pcm-oss"<br />

MODULES_CONF[7]="alias sound-service-0-8 snd-seq-oss"<br />

MODULES_CONF[8]="alias sound-service-0-12 snd-pcm-oss"<br />

MODULES_CONF[9]="post-install snd-card-0 /usr/sbin/alsactl restore >/dev/null 2>&1 || :"<br />

MODULES_CONF[10]="pre-remove snd-card-0 /usr/sbin/alsactl store >/dev/null 2>&1 || :"<br />

BUILT_MODULE_NAME[0]="snd-pcm"<br />

BUILT_MODULE_LOCATION[0]="acore"<br />

DEST_MODULE_LOCATION[0]="/kernel/sound/acore"<br />

BUILT_MODULE_NAME[1]="snd-rawmidi"<br />

BUILT_MODULE_LOCATION[1]="acore"<br />

DEST_MODULE_LOCATION[1]="/kernel/sound/acore"<br />

BUILT_MODULE_NAME[2]="snd-timer"<br />

BUILT_MODULE_LOCATION[2]="acore"<br />

DEST_MODULE_LOCATION[2]="/kernel/sound/acore"<br />

BUILT_MODULE_NAME[3]="snd"<br />

BUILT_MODULE_LOCATION[3]="acore"<br />

DEST_MODULE_LOCATION[3]="/kernel/sound/acore"<br />

BUILT_MODULE_NAME[4]="snd-mixer-oss"<br />

BUILT_MODULE_LOCATION[4]="acore/oss"<br />

DEST_MODULE_LOCATION[4]="/kernel/sound/acore/oss"<br />

BUILT_MODULE_NAME[5]="snd-pcm-oss"<br />

BUILT_MODULE_LOCATION[5]="acore/oss"<br />

DEST_MODULE_LOCATION[5]="/kernel/sound/acore/oss"<br />

BUILT_MODULE_NAME[6]="snd-seq-device"<br />

BUILT_MODULE_LOCATION[6]="acore/seq"<br />

DEST_MODULE_LOCATION[6]="/kernel/sound/acore/seq"<br />

BUILT_MODULE_NAME[7]="snd-seq-midi-event"<br />

BUILT_MODULE_LOCATION[7]="acore/seq"<br />

DEST_MODULE_LOCATION[7]="/kernel/sound/acore/seq"<br />

BUILT_MODULE_NAME[8]="snd-seq-midi"<br />

BUILT_MODULE_LOCATION[8]="acore/seq"<br />

DEST_MODULE_LOCATION[8]="/kernel/sound/acore/seq"<br />

BUILT_MODULE_NAME[9]="snd-seq"<br />

BUILT_MODULE_LOCATION[9]="acore/seq"<br />

DEST_MODULE_LOCATION[9]="/kernel/sound/acore/seq"<br />

Figure 3: An elaborate dkms.conf


applied to the module source before a module<br />

build occurs. In this way, by including various<br />

patches needed for various kernel versions,<br />

you can distribute one source tarball and ensure<br />

it will always properly build regardless of<br />

the end user target kernel. If no corresponding<br />

PATCH_MATCH[0] entry were specified for<br />

PATCH[0], then the adriver.h.patch would always<br />

get applied before a module build. As<br />

DKMS always starts off each module build<br />

with pristine module source, you can always<br />

ensure the right patches are being applied.<br />
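One can check which kernels a PATCH_MATCH pattern selects by trying it directly; here grep stands in for DKMS's matching (an assumption, but the pattern itself is the one from the example):

```shell
# The PATCH_MATCH[0] pattern from the example, tried against versions.
patch_applies() { echo "$1" | grep -q '2.4.[2-9][2-9]'; }

patch_applies 2.4.22      && echo "2.4.22: patch applied"
patch_applies 2.4.20-9    || echo "2.4.20-9: patch skipped"
patch_applies 2.4.21-4.EL || echo "2.4.21-4.EL: patch skipped"
```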

Also seen in this example is:<br />

MODULES_CONF[0]=<br />

"alias char-major-116 snd"<br />

MODULES_CONF[1]=<br />

"alias snd-card-0 snd-intel8x0"<br />

Unlike the previous discussion of<br />

/etc/modules.conf changes, any entries<br />

placed into the MODULES_CONF array are<br />

automatically added into /etc/modules.conf<br />

during a module install. <strong>The</strong>se are later only<br />

removed during the final module uninstall.<br />

Lastly, we have:<br />

POST_INSTALL="alsa-driver-dkms-post.sh"<br />

In the event that you have other scripts that<br />

must be run during various DKMS events,<br />

DKMS includes POST_ADD, POST_BUILD,<br />

POST_INSTALL and POST_REMOVE functionality.<br />

7 Future<br />

As you can tell from the above, DKMS is very<br />

much ready for deployment now. However, as<br />

with all software projects, there’s room for improvement.<br />

7.1 Cross-Architecture Builds<br />

DKMS today has no concept of a platform architecture<br />

such as i386, x86_64, ia64, sparc,<br />

and the like. It expects that it is building kernel<br />

modules with a native compiler, not a cross<br />

compiler, and that the target architecture is the<br />

native architecture. While this works in practice,<br />

it would be convenient if DKMS were able<br />

to be used to build kernel modules for nonnative<br />

architectures.<br />

Today DKMS handles the cross-architecture<br />

build process by having separate /var/dkms directory<br />

trees for each architecture, and using<br />

the --dkmstree option to specify a different<br />
tree, and the --config option to specify<br />
a different kernel configuration file.<br />

Going forward, we plan to add an --arch<br />
option to DKMS, or have DKMS glean the architecture<br />
from the kernel config file and act accordingly.<br />

7.2 Additional distribution driver disks<br />

DKMS today supports generating driver disks<br />

in the Red Hat format only. We recognize that<br />

other distributions accomplish the same goal<br />

using other driver disk formats. This should<br />

be relatively simple to add once we understand<br />

what the additional formats are.<br />

8 Conclusion<br />

DKMS provides a simple and unified mechanism<br />

for driver authors, <strong>Linux</strong> distributions,<br />

system vendors, and system administrators to<br />

update the device drivers on a target system<br />

without updating the whole kernel. It allows<br />

driver developers to keep their work aimed at<br />

the “top of the tree,” and to backport that work<br />

to older kernels painlessly. It allows <strong>Linux</strong> distributions<br />

to provide updates to single device<br />

drivers asynchronous to the release of a larger


scheduled update, and to know what drivers<br />

have been updated. It lets system vendors<br />

ship newer hardware than was supported in a<br />

distribution’s “gold” release without invalidating<br />

any test or certification work done on the<br />

“gold” release. It lets system administrators<br />

update individual drivers to match their environment<br />

and their needs, regardless of whose<br />

kernel they are running. It lets end users track<br />

which module versions have been added to<br />

their system.<br />

We believe DKMS is a project whose time has<br />

come, and encourage everyone to use it.<br />

9 References<br />

DKMS is licensed under the GNU General<br />

Public License. It is available at<br />

http://linux.dell.com/dkms/,<br />

and has a mailing list, dkms-devel@lists.us.dell.com,<br />
to which you may subscribe at<br />
http://lists.us.dell.com/.



e100 Weight Reduction Program<br />

Writing for Maintainability<br />

Scott Feldman<br />

Intel Corporation<br />

scott.feldman@intel.com<br />

Abstract<br />

Corporate-authored device drivers are<br />

bloated/buggy with dead code, HW and<br />

OS abstraction layers, non-standard user<br />

controls, and support for complicated HW<br />

features that provide little or no value. e100<br />

in 2.6.4 has been rewritten to address these<br />

issues and in the process lost 75% of the lines<br />

of code, with no loss of functionality. This<br />

paper gives guidelines to other corporate driver<br />

authors.<br />

Introduction<br />

This paper gives some basic guidelines to corporate<br />

device driver maintainers based on experiences<br />

I had while re-writing the e100 network<br />

device driver for Intel’s PRO/100+ Ethernet<br />

controllers. By corporate maintainer, I<br />

mean someone employed by a corporation to<br />

provide <strong>Linux</strong> driver support for that corporation’s<br />

device. Of course, these guidelines may<br />

apply to non-corporate individuals as well, but<br />

the intended audience is the corporate driver<br />

author.<br />

<strong>The</strong> assumption behind these guidelines is that<br />

the device driver is intended for inclusion in<br />

the <strong>Linux</strong> kernel. For a driver to be accepted<br />

into the <strong>Linux</strong> kernel, it must meet both technical<br />

and non-technical requirements. This paper<br />

focuses on the non-technical requirements,<br />

specifically maintainability.<br />

Guideline #1: Maintainability over<br />

Everything Else<br />

Corporate marketing requirements documents<br />

specify priority order to features and performance<br />

and schedule (time-to-market), but<br />

rarely specify maintainability. However, maintainability<br />

is the most important requirement<br />

for <strong>Linux</strong> kernel drivers.<br />

Why?<br />

• You will not be the long-term driver maintainer.<br />

• Your company will not be the long-term<br />

driver maintainer.<br />

• Your driver will out-live your interest in it.<br />

Driver code should be written so a like-skilled<br />

kernel maintainer can fix a problem in a reasonable<br />

amount of time without you or your resources.<br />

Here are a few items to keep in mind<br />

to improve maintainability.<br />

• Use kernel coding style over corporate<br />

coding style<br />

• Document how the driver/device works, at<br />

a high level, in a “<strong>The</strong>ory of Operation”<br />

comment section


• Document hardware workarounds<br />

old driver v2               new driver v3<br />
VLAN tagging/stripping      use SW VLAN support in kernel<br />
Tx/Rx checksum offloading   use SW checksum support in kernel<br />
interrupt moderation        use NAPI support in kernel<br />

Table 1: Feature migration in e100<br />

Guideline #2: Don’t Add Features<br />

for Feature’s Sake<br />

Consider the code complexity to support the<br />

feature versus the user’s benefit. Is the device<br />

still usable without the feature? Is the device<br />

performing reasonably for the 80% use-case<br />

without the feature? Is the hardware offload<br />

feature working against ever increasing<br />

CPU/memory/IO speeds? Is there a software<br />

equivalent to the feature already provided in<br />

the OS?<br />

If the answer is yes to any of these questions, it<br />

is better to not implement the feature, keeping<br />

the complexity in the driver low and maintainability<br />

high.<br />

Table 1 shows features removed from the driver<br />

during the re-write of e100 because the OS already<br />

provides software equivalents.<br />

Guideline #3: Limit User-Controls—<br />

Use What’s Built into the OS<br />

Most users will use the default settings, so before<br />

adding a user-control, consider:<br />

1. If the driver model for your device class<br />

already provides a mechanism for the<br />

user-control, enable that support in the<br />

driver rather than adding a custom user-control.<br />

2. If the driver model doesn’t provide a user-control,<br />
but the user-control is potentially<br />
useful to other drivers, extend the driver<br />
model to include the user-control.<br />

3. If the user-control is to enable/disable a<br />
workaround, enable the workaround without<br />
the use of a user-control. (Solve<br />
the problem without requiring a decision<br />
from the user.)<br />

4. If the user-control is to tune performance,<br />
tune the driver for the 80% use-case and<br />
remove the user-control.<br />

old driver v2       new driver v3<br />
BundleMax           not needed – NAPI<br />
BundleSmallFr       not needed – NAPI<br />
IntDelay            not needed – NAPI<br />
ucode               not needed – NAPI<br />
RxDescriptors       ethtool -G<br />
TxDescriptors       ethtool -G<br />
XsumRX              not needed – checksum in OS<br />
IFS                 always enabled<br />
e100_speed_duplex   ethtool -s<br />

Table 2: User-control migration in e100<br />

Table 2 shows user-controls (implemented as<br />

module parameters) removed from the driver<br />

during the re-write of e100 because the OS<br />

already provides built-in user-controls, or the<br />

user-control was no longer needed.<br />

Guideline #4: Don’t Write Code<br />

that’s Already in the <strong>Kernel</strong><br />

Look for library code that’s already used by<br />

other drivers and adapt that to your driver.<br />

Common hardware is often used between vendors’<br />

devices, so shared code will work for all<br />

(and be debugged by all).


For example, e100 has a highly MDI-compliant<br />

PHY interface, so use mii.c for<br />

standard PHY access and remove custom code<br />

from the driver.<br />

For another example, e100 v2 used<br />
/proc/net/IntelPROAdapter to report driver<br />

information. This functionality was replaced<br />

with ethtool, sysfs, lspci, etc.<br />

Look for opportunities to move code out of the<br />

driver into generic code.<br />

Guideline #5: Don’t Use OS-Abstraction<br />

Layers<br />

A common corporate design goal is to reuse<br />

driver code as much as possible between OSes.<br />

This allows a driver to be brought up on one OS<br />

and “ported” to another OS with little work.<br />

After all, the hardware interface to the device<br />

didn’t change from one OS to the next, so<br />

all that is required is an OS-abstraction layer<br />

that wraps the OS’s native driver model with a<br />

generic driver model. <strong>The</strong> driver is then written<br />

to the generic driver model and it’s just a matter<br />

of porting the OS-abstraction layer to each<br />

target OS.<br />

<strong>The</strong>re are problems when doing this with<br />

<strong>Linux</strong>:<br />

1. <strong>The</strong> OS-abstraction wrapper code means<br />

nothing to an outside <strong>Linux</strong> maintainer<br />

and just obfuscates the real meaning behind<br />

the code. This makes your code<br />

harder to follow and therefore harder to<br />

maintain.<br />

2. <strong>The</strong> generic driver model may not map 1:1<br />

with the native driver model, leaving gaps<br />

in compatibility that you’ll need to fix up<br />

with OS-specific code.<br />

3. Limits your ability to back-port contributions<br />

given under GPL to non-GPL OSes.<br />

Guideline #6: Use kcompat Techniques<br />

to Move Legacy <strong>Kernel</strong> Support<br />

out of the Driver (and <strong>Kernel</strong>)<br />

Users may not be able to move to the latest<br />

kernel.org kernel, so there is a need<br />

to provide updated device drivers that can be<br />

installed against legacy kernels. <strong>The</strong> need is<br />

driven by 1) bug fixes, 2) new hardware support<br />

that wasn’t included in the driver when the<br />

driver was included in the legacy kernel.<br />

<strong>The</strong> best strategy is to:<br />

1. Maintain your driver code to work against<br />

the latest kernel.org development<br />

kernel API. This will make it easier to<br />

keep the driver in the kernel.org kernel<br />

synchronized with your code base as<br />

changes (patches) are almost always in<br />

reference to the latest kernel.org kernel.<br />

2. Provide a kernel-compat-layer (kcompat)<br />

to translate the latest API to the supported<br />

legacy kernel API. <strong>The</strong> driver code is void<br />

of any ifdef code for legacy kernel support.<br />

All of the ifdef logic moves to the<br />

kcompat layer. <strong>The</strong> kcompat layer is not<br />

included in the latest kernel.org kernel<br />

(by definition).<br />

Here is an example with e100.<br />

In driver code, use the latest API:<br />

s = pci_name(pdev);<br />

...<br />

free_netdev(netdev);


In kcompat code, translate to legacy kernel<br />

API:<br />

#if ( LINUX_VERSION_CODE < \<br />

KERNEL_VERSION(2,4,22) )<br />

#define pci_name(x) ((x)->slot_name)<br />

#endif<br />

#ifndef HAVE_FREE_NETDEV<br />

#define free_netdev(x) kfree(x)<br />

#endif<br />

Guideline #7: Plan to Re-write the<br />

Driver at Least Once<br />

You will not get it right the first time. Plan on<br />

rewriting the driver from scratch at least once.<br />

This will cleanse the code, removing dead code<br />

and organizing/consolidating functionality.<br />

For example, the last e100 re-write reduced the<br />

driver size by 75% without loss of functionality.<br />

Conclusion<br />

Following these guidelines will result in more<br />

maintainable device drivers with better acceptance<br />

into the <strong>Linux</strong> kernel tree. <strong>The</strong> basic<br />

idea is to remove as much as possible from the<br />

driver without loss of functionality.<br />

References<br />

• <strong>The</strong> latest e100 driver code is available at<br />

linux/drivers/net/e100.c (2.6.4<br />

kernel or higher).<br />

• An example of kcompat is here:<br />

http://sf.net/projects/gkernel


NFSv4 and rpcsec_gss for linux<br />

J. Bruce Fields<br />

University of Michigan<br />

bfields@umich.edu<br />

Abstract<br />

<strong>The</strong> 2.6 <strong>Linux</strong> kernels now include support for<br />

version 4 of NFS. In addition to built-in locking<br />

and ACL support, and features designed to<br />

improve performance over the Internet, NFSv4<br />

also mandates the implementation of strong<br />

cryptographic security. This security is provided<br />

by rpcsec_gss, a standard, widely implemented<br />

protocol that operates at the rpc level,<br />

and hence can also provide security for NFS<br />

versions 2 and 3.<br />

1 <strong>The</strong> rpcsec_gss protocol<br />

<strong>The</strong> rpc protocol, which all versions of NFS<br />

and related protocols are built upon, includes<br />

generic support for authentication mechanisms:<br />

each rpc call has two fields, the credential<br />

and the verifier, each consisting of a<br />

32-bit integer, designating a “security flavor,”<br />

followed by up to 400 bytes of opaque data whose<br />

structure depends on the specified flavor. Similarly,<br />

each reply includes a single “verifier.”<br />

Until recently, the only widely implemented<br />

security flavor has been the auth_unix flavor,<br />

which uses the credential to pass uid’s and<br />

gid’s and simply asks the server to trust them.<br />

This may be satisfactory given physical security<br />

over the clients and the network, but for<br />

many situations (including use over the Internet),<br />

it is inadequate.<br />

Thus rfc 2203 defines the rpcsec_gss protocol,<br />

which uses rpc’s opaque security fields to carry<br />

cryptographically secure tokens. <strong>The</strong> cryptographic<br />

services are provided by the GSS-API<br />

(“Generic Security Service Application Program<br />

Interface,” defined by rfc 2743), allowing<br />

the use of a wide variety of security mechanisms,<br />

including, for example, Kerberos.<br />

Three levels of security are provided by rpcsec_gss:<br />

1. Authentication only: <strong>The</strong> rpc header of<br />

each request and response is signed.<br />

2. Integrity: <strong>The</strong> header and body of each request<br />

and response are signed.<br />

3. Privacy: <strong>The</strong> header of each request is<br />

signed, and the body is encrypted.<br />

<strong>The</strong> combination of a security level with a<br />

GSS-API mechanism can be designated by a<br />

32-bit “pseudoflavor.” <strong>The</strong> mount protocol<br />

used with NFS versions 2 and 3 uses a list<br />

of pseudoflavors to communicate the security<br />

capabilities of a server. NFSv4 does not use<br />

pseudoflavors on the wire, but they are still useful<br />

in internal interfaces.<br />
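As an illustration, the three service levels and the Kerberos v5 pseudoflavors (numeric values as registered in RFCs 2203 and 2623) might be represented like this; the struct itself is invented for the sketch:<br />

```c
#include <stdint.h>

/* The three rpcsec_gss service levels (values from RFC 2203). */
enum rpc_gss_service {
    RPC_GSS_SVC_NONE      = 1,  /* authentication only: header signed */
    RPC_GSS_SVC_INTEGRITY = 2,  /* header and body signed */
    RPC_GSS_SVC_PRIVACY   = 3   /* header signed, body encrypted */
};

/* A pseudoflavor packs a (mechanism, service) pair into one 32-bit id.
 * The krb5 values below are the ones registered in RFC 2623. */
struct pseudoflavor {
    uint32_t flavor;
    const char *name;
    enum rpc_gss_service service;
};

static const struct pseudoflavor krb5_flavors[] = {
    { 390003, "krb5",  RPC_GSS_SVC_NONE      },
    { 390004, "krb5i", RPC_GSS_SVC_INTEGRITY },
    { 390005, "krb5p", RPC_GSS_SVC_PRIVACY   },
};
```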

Security protocols generally require some initial<br />

negotiation, to determine the capabilities<br />

of the systems involved and to choose session<br />

keys. <strong>The</strong> rpcsec_gss protocol uses calls with<br />

procedure number 0 for this purpose. Normally<br />

such a call is a simple “ping” with no<br />

side-effects, useful for measuring round-trip


208 • <strong>Linux</strong> Symposium 2004 • Volume <strong>One</strong><br />

latency or testing whether a certain service is<br />

running. However a call with procedure number<br />

0, if made with authentication flavor rpcsec_gss,<br />

may use certain fields in the credential<br />

to indicate that it is part of a context-initiation<br />

exchange.<br />

2 <strong>Linux</strong> implementation of rpcsec_gss<br />

<strong>The</strong> <strong>Linux</strong> implementation of rpcsec_gss consists<br />

of several pieces:<br />

1. Mechanism-specific code, currently for<br />

two mechanisms: krb5 and spkm3.<br />

2. A stripped-down in-kernel version of the<br />

GSS-API interface, with an interface that<br />

allows mechanism-specific code to register<br />

support for various pseudoflavors.<br />

3. Client and server code which uses the<br />

GSS-API interface to encode and decode<br />

rpc calls and replies.<br />

4. A userland daemon, gssd, which performs<br />

context initiation.<br />

2.1 Mechanism-specific code<br />

<strong>The</strong> NFSv4 RFC mandates the implementation<br />

(though not the use) of three GSS-API mechanisms:<br />

krb5, spkm3, and lipkey.<br />

Our krb5 implementation supports three<br />

pseudoflavors: krb5, krb5i, and krb5p, providing<br />

authentication only, integrity, and<br />

privacy, respectively. <strong>The</strong> code is derived from<br />

MIT’s Kerberos implementation, somewhat<br />

simplified, and not currently supporting the<br />

variety of encryption algorithms that MIT’s<br />

does. <strong>The</strong> krb5 mechanism is also supported<br />

by NFS implementations from Sun, Network<br />

Appliance, and others, which it interoperates<br />

with.<br />

<strong>The</strong> Low Infrastructure Public Key Mechanism<br />

(“lipkey,” specified by rfc 2847), is a public key<br />

mechanism built on top of the Simple Public<br />

Key Mechanism (spkm), which provides functionality<br />

similar to that of TLS, allowing a secure<br />

channel to be established using a server-side<br />

certificate and a client-side password.<br />

We have a preliminary implementation of<br />

spkm3 (without privacy), but none yet of lipkey.<br />

Other NFS implementors have not yet<br />

implemented either of these mechanisms, but<br />

there appears to be sufficient interest from the<br />

grid community for us to continue implementation<br />

even if it is <strong>Linux</strong>-only for now.<br />

2.2 GSS-API<br />

<strong>The</strong> GSS-API interface as specified is very<br />

complex. Fortunately, rpcsec_gss only requires<br />

a subset of the GSS-API, and even less is required<br />

for per-packet processing.<br />

Our implementation is derived from the one<br />

in MIT Kerberos, and initially<br />

stayed fairly close to the GSS-API specification;<br />

but over time we have pared it down to<br />

something quite a bit simpler.<br />

<strong>The</strong> kernel gss interface also provides APIs<br />

by which code implementing particular mechanisms<br />

can register itself with the gss-api code<br />

and hence can be safely provided by modules<br />

loaded at runtime.<br />

2.3 RPC code<br />

<strong>The</strong> RPC code has been enhanced by the addition<br />

of a new rpcsec_gss mechanism which authenticates<br />

calls and replies and which wraps<br />

and unwraps rpc bodies in the case of integrity<br />

and privacy.



This is relatively straightforward, though<br />

somewhat complicated by the need to handle<br />

discontiguous buffers containing page data.<br />

Caches for session state are also required on<br />

both client and server; on the client a preexisting<br />

rpc credentials cache is used, and on the<br />

server we use the same caching infrastructure<br />

used for caching of client and export information.<br />

2.4 Userland daemon<br />

We had no desire to put a complete implementation<br />

of Kerberos version 5 or the other mechanisms<br />

into the kernel. Fortunately, the work<br />

performed by the various GSS-API mechanisms<br />

can be divided neatly into context initiation<br />

and per-packet processing. <strong>The</strong> former<br />

is complex and is performed only once per session,<br />

while the latter is simple by comparison<br />

and needs to be performed on every packet.<br />

<strong>The</strong>refore it makes sense to put the packet processing<br />

in the kernel, and have the context initiation<br />

performed in userspace.<br />

Since it is the kernel that knows when context<br />

initiation is necessary, we require a mechanism<br />

allowing the kernel to pass the necessary parameters<br />

to a userspace daemon whenever context<br />

initiation is needed, and allowing the daemon<br />

to respond with the completed security<br />

context.<br />

This problem was solved in different ways<br />

on the client and server, but both use special<br />

files (the former in a dedicated filesystem,<br />

rpc_pipefs, and the latter in the proc filesystem),<br />

which our userspace daemon, gssd, can<br />

poll for requests and then write responses back<br />

to.<br />

In the case of Kerberos, the sequence of events<br />

will be something like this:<br />

1. <strong>The</strong> user gets Kerberos credentials using<br />

kinit, which are cached on a local filesystem.<br />

2. <strong>The</strong> user attempts to perform an operation<br />

on an NFS filesystem mounted with krb5<br />

security.<br />

3. <strong>The</strong> kernel rpc client looks for a security<br />

context for the user in its cache; not<br />

finding any, it does an upcall to gssd to request<br />

one.<br />

4. Gssd, on receiving the upcall, reads the<br />

user’s Kerberos credentials from the local<br />

filesystem and uses them to construct<br />

a null rpc request which it sends to the<br />

server.<br />

5. <strong>The</strong> server kernel makes an upcall which<br />

passes the null request to its gssd.<br />

6. At this point, the server gssd has all it<br />

needs to construct a security context for<br />

this session, consisting mainly of a session<br />

key. It passes this context down to<br />

the kernel rpc server, which stores it in its<br />

context cache.<br />

7. <strong>The</strong> server’s gssd then constructs the null<br />

rpc reply, which it gives to the kernel to<br />

return to the client gssd.<br />

8. <strong>The</strong> client gssd uses this reply to construct<br />

its own security context, and passes this<br />

context to the kernel rpc client.<br />

9. <strong>The</strong> kernel rpc client then uses this context<br />

to send the first real rpc request to the<br />

server.<br />

10. <strong>The</strong> server uses the new context in its<br />

cache to verify the rpc request, and to<br />

compose its reply.



3 <strong>The</strong> NFSv4 protocol<br />

While rpcsec_gss works equally well on all existing<br />

versions of NFS, much of the work on<br />

rpcsec_gss has been motivated by NFS version<br />

4, which is the first version of NFS to make<br />

rpcsec_gss mandatory to implement.<br />

This new version of NFS is specified by rfc<br />

3530, which says:<br />

“Unlike earlier versions, the NFS version 4<br />

protocol supports traditional file access while<br />

integrating support for file locking and the<br />

mount protocol. In addition, support for strong<br />

security (and its negotiation), compound operations,<br />

client caching, and internationalization<br />

have been added. Of course, attention has been<br />

applied to making NFS version 4 operate well<br />

in an Internet environment.”<br />

Descriptions of some of these features follow,<br />

with some notes about their implementation in<br />

<strong>Linux</strong>.<br />

3.1 Compound operations<br />

Each rpc request includes a procedure number,<br />

which describes the operation to be performed.<br />

<strong>The</strong> format of the body of the rpc request (the<br />

arguments) and of the reply depend on the program<br />

number. Procedure 0 is reserved as a noop<br />

(except when it is used for rpcsec_gss context<br />

initiation, as described above).<br />

<strong>The</strong> NFSv4 protocol only supports one nonzero<br />

procedure, procedure 1, the compound<br />

procedure.<br />

<strong>The</strong> body of a compound is a list of operations,<br />

each with its own arguments. For example,<br />

a compound request performing a lookup<br />

might consist of four operations: a PUTFH, with<br />

a filehandle, which sets the “current filehandle”<br />

to the provided filehandle; a LOOKUP, with a<br />

name, which looks up the name in the directory<br />

given by the current filehandle and then modifies<br />

the current filehandle to be the filehandle of<br />

the result; a GETFH, with no arguments, which<br />

returns the new value of the current filehandle;<br />

and a GETATTR, with a bitmask specifying a<br />

set of attributes to return for the looked-up file.<br />

<strong>The</strong> server processes these operations in order,<br />

but with no guarantee of atomicity. On encountering<br />

any error, it stops and returns the results<br />

of the operations up to and including the operation<br />

that failed.<br />
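The in-order, stop-at-first-error processing can be sketched as below. The operation numbers are the ones assigned by RFC 3530; the handler table and toy handlers are invented for illustration:<br />

```c
#include <stddef.h>

/* A few NFSv4 operation numbers from RFC 3530. */
enum nfs_opnum4 { OP_GETATTR = 9, OP_GETFH = 10, OP_LOOKUP = 15, OP_PUTFH = 22 };

#define NFS4_OK 0

struct nfs_op {
    enum nfs_opnum4 op;
    int (*handler)(void);   /* stand-in for per-op decode + execute */
};

/* Toy handlers for demonstration. */
static int op_ok(void)   { return NFS4_OK; }
static int op_fail(void) { return 1; }      /* any non-zero NFSv4 status */

/* Run the ops in order, stop at the first failure, and report how many
 * results (up to and including the failing op) would be returned. */
static int process_compound(const struct nfs_op *ops, size_t n,
                            size_t *nresults)
{
    int status = NFS4_OK;
    size_t i;
    for (i = 0; i < n; i++) {
        status = ops[i].handler();
        if (status != NFS4_OK) {
            i++;                    /* the failing op's result is included */
            break;
        }
    }
    *nresults = i;
    return status;
}
```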

In theory complex operations could therefore<br />

be done by long compounds which perform<br />

complex series of operations.<br />

In practice, the compounds sent by the <strong>Linux</strong><br />

client correspond very closely to NFSv2/v3<br />

procedures—the VFS and the POSIX filesystem<br />

API make it difficult to do otherwise—and<br />

our server, like most NFSv4 servers we know<br />

of, rejects overly long or complex compounds.<br />

3.2 Well-known port for NFS<br />

RPC allows services to be run on different<br />

ports, using the “portmap” service to map program<br />

numbers to ports. While flexible, this<br />

system complicates firewall management; so<br />

NFSv4 recommends the use of port 2049.<br />

In addition, the use of sideband protocols for<br />

mounting, locking, etc. also complicates firewall<br />

management, as multiple connections to<br />

multiple ports are required for a single NFS<br />

mount. NFSv4 eliminates these extra protocols,<br />

allowing all traffic to pass over a single<br />

connection using one protocol.<br />

3.3 No more mount protocol<br />

Earlier versions of NFS use a separate protocol<br />

for mount. <strong>The</strong> mount protocol exists primarily<br />

to map path names, presented to the server as



strings, to filehandles, which may then be used<br />

in the NFS protocol.<br />

NFSv4 instead uses a single operation, PUTROOTFH,<br />

that returns a filehandle; clients can<br />

then use ordinary lookups to traverse to the<br />

filesystem they wish to mount. This changes<br />

the behavior of NFS in a few subtle ways: for<br />

example, the special status of mounts in the old<br />

protocol meant that mounting /usr and then<br />

looking up local might get you a different<br />

object than would mounting /usr/local;<br />

under NFSv4 this can no longer happen.<br />

A server that exports multiple filesystems must<br />

knit them together using a single “pseudofilesystem”<br />

which links them to a common<br />

root.<br />

On <strong>Linux</strong>’s nfsd the pseudofilesystem is a<br />

real filesystem, marked by the export option<br />

“fsid=0”. An administrator who is content to<br />

export a single filesystem can export it with<br />

“fsid=0”, and clients will find it just by mounting<br />

the path “/”.<br />

<strong>The</strong> expected use for “fsid=0”, however, is to<br />

designate a filesystem that is used just as a collection<br />

of empty directories used as mountpoints<br />

for exported filesystems, which are mounted<br />

using mount --bind; thus an administrator<br />

could export /home and /bin by:<br />

mkdir -p /exports/home<br />

mkdir -p /exports/bin<br />

mount --bind /home /exports/home<br />

mount --bind /bin /exports/bin<br />

and then using an exports file something like:<br />

/exports *.foo.com(fsid=0,crossmnt)<br />

/exports/home *.foo.com<br />

/exports/bin *.foo.com<br />

Clients in foo.com can then mount<br />

server.foo.com:/bin or server.<br />

foo.com:/home. However the relationship<br />

between the original mountpoint on the server<br />

and the mountpoint under /exports (which<br />

determines the path seen by the client) is<br />

arbitrary, so the administrator could just as<br />

well export /home as /some/other/path<br />

if desired.<br />

This gives maximum flexibility at the expense<br />

of some confusion for administrators used to<br />

earlier NFS versions.<br />

3.4 No more lock protocol<br />

Locking has also been absorbed into the<br />

NFSv4 protocol. In addition to advantages<br />

enumerated above, this allows servers to support<br />

mandatory locking if desired. Previously<br />

this was impossible because there was no way<br />

to tell whether a given read or write<br />

should be ordered before or after a lock request.<br />

NFSv4 enforces such sequencing by<br />

providing a stateid field on each read or write<br />

which identifies the locking state that the operation<br />

was performed under; thus for example a<br />

write that occurred while a lock was held, but<br />

that appeared on the server to have occurred after<br />

an unlock, can be identified as belonging to<br />

a previous locking context, and can therefore<br />

be correctly rejected.<br />

<strong>The</strong> additional state required to manage locking<br />

is the source of much of the additional complexity<br />

in NFSv4.<br />

3.5 String representations of user and group<br />

names<br />

Previous versions of NFS use integers to represent<br />

users and groups; while simple to handle,<br />

they can make NFS installations difficult to<br />

manage, particularly across administrative domains.<br />

Version 4, therefore, uses string names<br />

of the form user@domain.<br />

This poses some challenges for the kernel<br />

implementation. In particular, while the protocol<br />

may use string names, the kernel still needs to<br />

deal with uid’s, so it must map between NFSv4<br />

string names and integers.<br />

As with rpcsec_gss context initiation, we solve<br />

this problem by making upcalls to a userspace<br />

daemon; with the mapping in userspace, it is<br />

easy to use mechanisms such as NIS or LDAP<br />

to do the actual mapping without introducing<br />

large amounts of code into the kernel. So as not<br />

to degrade performance by requiring a context<br />

switch every time we process a packet carrying<br />

a name, we cache the results of this mapping in<br />

the kernel.<br />
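A minimal sketch of that idea follows, with the cache structure and the upcall hook invented for illustration (the real kernel code is considerably more involved):<br />

```c
#include <stdint.h>
#include <string.h>

#define IDMAP_CACHE_SZ 64
#define IDMAP_NAMESZ   128

struct idmap_entry {
    char     name[IDMAP_NAMESZ];   /* "user@domain" */
    uint32_t id;                   /* local uid */
    int      valid;
};

static struct idmap_entry idmap_cache[IDMAP_CACHE_SZ];

/* Stand-in for the upcall to the userspace daemon, which may consult
 * NIS, LDAP, or any other directory service. */
static uint32_t (*idmap_upcall)(const char *name);

static int upcall_count;                      /* for demonstration only */
static uint32_t toy_upcall(const char *name)
{
    (void)name;
    upcall_count++;
    return 1000;
}

static uint32_t idmap_name_to_id(const char *name)
{
    unsigned h = 0;
    for (const char *p = name; *p; p++)
        h = h * 31 + (unsigned char)*p;
    struct idmap_entry *e = &idmap_cache[h % IDMAP_CACHE_SZ];

    /* Hit: answer without a context switch into userspace. */
    if (e->valid && strcmp(e->name, name) == 0)
        return e->id;

    /* Miss: ask the daemon once, then remember the answer. */
    e->id = idmap_upcall(name);
    strncpy(e->name, name, IDMAP_NAMESZ - 1);
    e->name[IDMAP_NAMESZ - 1] = '\0';
    e->valid = 1;
    return e->id;
}
```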

3.6 Delegations<br />

NFSv4, like previous versions of NFS, does<br />

not attempt to provide full cache consistency.<br />

Instead, all that is guaranteed is that if an open<br />

follows a close of the same file, then data read<br />

after the open will reflect any modifications<br />

performed before the close. This makes both<br />

open and close potentially high latency operations,<br />

since they must wait for at least one<br />

round trip before returning—in the close case,<br />

to flush out any pending writes, and in the<br />

open case, to check the attributes of the file in<br />

question to determine whether the local cache<br />

should be invalidated.<br />

Locks provide similar semantics—writes are<br />

flushed on unlock, and cache consistency is<br />

verified on lock—and hence lock operations<br />

are also prone to high latencies.<br />

To mitigate these concerns, and to encourage<br />

the use of NFS’s locking features, delegations<br />

have been added to NFSv4. Delegations are<br />

granted or denied by the server in response to<br />

open calls, and give the client the right to perform<br />

later locks and opens locally, without the<br />

need to contact the server. A set of callbacks<br />

is provided so that the server can notify the<br />

client when another client requests an open that<br />

would conflict with the open originally obtained<br />

by the client.<br />

Thus locks and opens may be performed<br />

quickly by the client in the common case when<br />

files are not being shared, but callbacks ensure<br />

that correct close-to-open (and unlock-to-lock)<br />

semantics may be enforced when there is contention.<br />

To allow other clients to proceed when a client<br />

holding a delegation reboots, clients are required<br />

to periodically send a “renew” operation<br />

to the server, indicating that they are still alive;<br />

a client that fails to send a renew operation<br />

within a given lease time (established when the<br />

client first contacts the server) may have all of<br />

its delegations and other locking state revoked.<br />

Most implementations of NFSv4 delegations,<br />

including <strong>Linux</strong>’s, are still young, and we<br />

haven’t yet gathered good data on the performance<br />

impact.<br />

Nevertheless, further extensions, including<br />

delegations over directories, are under consideration<br />

for future versions of the protocol.<br />

3.7 ACLs<br />

ACL support is integrated into the protocol,<br />

with ACLs that are more similar to those found<br />

in NT than to the POSIX ACLs supported by<br />

<strong>Linux</strong>.<br />

Thus while it is possible to translate an arbitrary<br />

<strong>Linux</strong> ACL to an NFS4 ACL with nearly<br />

identical meaning, most NFS ACLs have no<br />

reasonable representation as <strong>Linux</strong> ACLs.<br />

Marius Eriksen has written a draft describing<br />

the POSIX to NFS4 ACL translation. Currently<br />

the <strong>Linux</strong> implementation uses this mapping,<br />

and rejects any NFS4 ACL that isn’t exactly<br />

in the image of this mapping. This ensures<br />

userland support from all tools that currently<br />

support POSIX ACLs, and simplifies<br />

ACL management when an exported filesystem<br />

is also used by local users, since both nfsd<br />

and the local users can use the backend filesystem’s<br />

POSIX ACL implementation. However<br />

it makes it difficult to interoperate with NFSv4<br />

implementations that support the full ACL protocol.<br />

For that reason we will eventually also<br />

want to add support for NFSv4 ACLs.<br />

4 Acknowledgements and Further<br />

Information<br />

This work has been sponsored by Sun Microsystems,<br />

Network Appliance, and the<br />

Accelerated Strategic Computing Initiative<br />

(ASCI). For further information, see<br />

www.citi.umich.edu/projects/nfsv4/.




Comparing and Evaluating epoll, select, and poll<br />

Event Mechanisms<br />

Louay Gammo, Tim Brecht, Amol Shukla, and David Pariag<br />

University of Waterloo<br />

{lgammo,brecht,ashukla,db2pariag}@cs.uwaterloo.ca<br />

Abstract<br />

This paper uses a high-performance, event-driven<br />

HTTP server (the µserver) to compare<br />

the performance of the select, poll, and epoll<br />

event mechanisms. We subject the µserver to<br />

a variety of workloads that allow us to expose<br />

the relative strengths and weaknesses of each<br />

event mechanism.<br />

Interestingly, initial results show that the select<br />

and poll event mechanisms perform comparably<br />

to the epoll event mechanism in the<br />

absence of idle connections. Profiling data<br />

shows a significant amount of time spent in executing<br />

a large number of epoll_ctl system<br />

calls. As a result, we examine a variety<br />

of techniques for reducing epoll_ctl overhead<br />

including edge-triggered notification, and<br />

introducing a new system call (epoll_ctlv)<br />

that aggregates several epoll_ctl calls into<br />

a single call. Our experiments indicate that although<br />

these techniques are successful at reducing<br />

epoll_ctl overhead, they only improve<br />

performance slightly.<br />

1 Introduction<br />

<strong>The</strong> Internet is expanding in size, number of<br />

users, and in volume of content, thus it is imperative<br />

to be able to support these changes<br />

with faster and more efficient HTTP servers.<br />

A common problem in HTTP server scalability<br />

is how to ensure that the server handles<br />

a large number of connections simultaneously<br />

without degrading the performance. An<br />

event-driven approach is often implemented in<br />

high-performance network servers [14] to multiplex<br />

a large number of concurrent connections<br />

over a few server processes. In event-driven<br />

servers it is important that the server<br />

focuses on connections that can be serviced<br />

without blocking its main process. An event<br />

dispatch mechanism such as select is used<br />

to determine the connections on which forward<br />

progress can be made without invoking<br />

a blocking system call. Many different<br />

event dispatch mechanisms have been used<br />

and studied in the context of network applications.<br />

<strong>The</strong>se mechanisms range from select,<br />

poll, /dev/poll, RT signals, and epoll<br />

[2, 3, 15, 6, 18, 10, 12, 4].<br />

<strong>The</strong> epoll event mechanism [18, 10, 12] is designed<br />

to scale to larger numbers of connections<br />

than select and poll. <strong>One</strong> of the<br />

problems with select and poll is that in<br />

a single call they must both inform the kernel<br />

of all of the events of interest and obtain new<br />

events. This can result in large overheads, particularly<br />

in environments with large numbers<br />

of connections and relatively few new events<br />

occurring. In a fashion similar to that described<br />

by Banga et al. [3] epoll separates mechanisms<br />

for obtaining events (epoll_wait)<br />

from those used to declare and control interest



in events (epoll_ctl).<br />

Further reductions in the number of generated<br />

events can be obtained by using edge-triggered<br />

epoll semantics. In this mode events are only<br />

provided when there is a change in the state of<br />

the socket descriptor of interest. For compatibility<br />

with the semantics offered by select<br />

and poll, epoll also provides level-triggered<br />

event mechanisms.<br />

To compare the performance of epoll with<br />

select and poll, we use the µserver [4, 7]<br />

web server. <strong>The</strong> µserver facilitates comparative<br />

analysis of different event dispatch mechanisms<br />

within the same code base through<br />

command-line parameters. Recently, a highly<br />

tuned version of the single-process event-driven<br />

µserver using select has shown promising<br />

results that rival the performance of the in-kernel<br />

TUX web server [4].<br />

Interestingly, in this paper, we found that for<br />

some of the workloads considered select<br />

and poll perform as well as or slightly better<br />

than epoll. <strong>One</strong> such result is shown in<br />

Figure 1. This motivated further investigation<br />

with the goal of obtaining a better understanding<br />

of epoll’s behaviour. In this paper, we describe<br />

our experience in trying to determine<br />

how to best use epoll, and examine techniques<br />

designed to improve its performance.<br />

<strong>The</strong> rest of the paper is organized as follows:<br />

In Section 2 we summarize some existing work<br />

that led to the development of epoll as a scalable<br />

replacement for select. In Section 3 we<br />

describe the techniques we have tried to improve<br />

epoll’s performance. In Section 4 we describe<br />

our experimental methodology, including<br />

the workloads used in the evaluation. In<br />

Section 5 we describe and analyze the results<br />

of our experiments. In Section 6 we summarize<br />

our findings and outline some ideas for future<br />

work.<br />

2 Background and Related Work<br />

Event-notification mechanisms have a long<br />

history in operating systems research and development,<br />

and have been a central issue in<br />

many performance studies. <strong>The</strong>se studies have<br />

sought to improve mechanisms and interfaces<br />

for obtaining information about the state of<br />

socket and file descriptors from the operating<br />

system [2, 1, 3, 13, 15, 6, 18, 10, 12]. Some<br />

of these studies have developed improvements<br />

to select, poll and sigwaitinfo by reducing<br />

the amount of data copied between the<br />

application and kernel. Other studies have reduced<br />

the number of events delivered by the<br />

kernel, for example, the signal-per-fd scheme<br />

proposed by Chandra et al. [6]. Much of the<br />

aforementioned work is tracked and discussed<br />

on the web site, “<strong>The</strong> C10K Problem” [8].<br />

Early work by Banga and Mogul [2] found<br />

that despite performing well under laboratory<br />

conditions, popular event-driven servers performed<br />

poorly under real-world conditions.<br />

<strong>The</strong>y demonstrated that the discrepancy is due<br />

to the inability of the select system call to<br />

scale to the large number of simultaneous connections<br />

that are found in WAN environments.<br />

Subsequent work by Banga et al. [3] sought to<br />

improve on select’s performance by (among<br />

other things) separating the declaration of interest<br />

in events from the retrieval of events on<br />

that interest set. Event mechanisms like select<br />

and poll have traditionally combined these<br />

tasks into a single system call. However, this<br />

amalgamation requires the server to re-declare<br />

its interest set every time it wishes to retrieve<br />

events, since the kernel does not remember the<br />

interest sets from previous calls. This results in<br />

unnecessary data copying between the application<br />

and the kernel.<br />

<strong>The</strong> /dev/poll mechanism was adapted<br />

from Sun Solaris to <strong>Linux</strong> by Provos et al. [15],



and improved on poll’s performance by introducing<br />

a new interface that separated the declaration<br />

of interest in events from retrieval. <strong>The</strong>ir<br />

/dev/poll mechanism further reduced data<br />

copying by using a shared memory region to<br />

return events to the application.<br />

<strong>The</strong> kqueue event mechanism [9] addressed<br />

many of the deficiencies of select and poll<br />

for FreeBSD systems. In addition to separating<br />

the declaration of interest from retrieval,<br />

kqueue allows an application to retrieve<br />

events from a variety of sources including<br />

file/socket descriptors, signals, AIO completions,<br />

file system changes, and changes in<br />

process state.<br />

<strong>The</strong> epoll event mechanism [18, 10, 12] investigated<br />

in this paper also separates the declaration<br />

of interest in events from their retrieval.<br />

<strong>The</strong> epoll_create system call instructs<br />

the kernel to create an event data structure<br />

that can be used to track events on a number<br />

of descriptors. <strong>The</strong>reafter, the epoll_ctl<br />

call is used to modify interest sets, while the<br />

epoll_wait call is used to retrieve events.<br />
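A minimal sketch of this calling pattern, using a pipe instead of a socket so the fragment is self-contained:<br />

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Register one descriptor, make it ready, and report how many events
 * epoll_wait returns. */
static int epoll_demo(void)
{
    int pipefd[2];
    if (pipe(pipefd) < 0)
        return -1;

    int epfd = epoll_create(1);   /* the size argument is only a hint */
    if (epfd < 0)
        return -1;

    /* Declare interest once with epoll_ctl ... */
    struct epoll_event ev = { .events = EPOLLIN };
    ev.data.fd = pipefd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) < 0)
        return -1;

    if (write(pipefd[1], "x", 1) != 1)   /* make the read end ready */
        return -1;

    /* ... then retrieve events separately, as often as needed. */
    struct epoll_event out[8];
    int n = epoll_wait(epfd, out, 8, 1000);

    close(epfd);
    close(pipefd[0]);
    close(pipefd[1]);
    return n;
}
```

Unlike select and poll, nothing about the interest set is re-copied on the epoll_wait call.<br />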

Another drawback of select and poll is<br />

that they perform work that depends on the<br />

size of the interest set, rather than the number<br />

of events returned. This leads to poor performance<br />

when the interest set is much larger than<br />

the active set. <strong>The</strong> epoll mechanisms avoid this<br />

pitfall and provide performance that is largely<br />

independent of the size of the interest set.<br />

3 Improving epoll Performance<br />

Figure 1 in Section 5 shows the throughput<br />

obtained when using the µserver with the select,<br />

poll, and level-triggered epoll (epoll-LT)<br />

mechanisms. In this graph the x-axis shows<br />

increasing request rates and the y-axis shows<br />

the reply rate as measured by the clients that<br />

are inducing the load. This graph shows results<br />

for the one-byte workload. <strong>The</strong>se results<br />

demonstrate that the µserver with level-triggered<br />

epoll does not perform as well as<br />

select under conditions that stress the event<br />

mechanisms. This led us to more closely examine<br />

these results. Using gprof, we observed<br />

that epoll_ctl was responsible for a<br />

large percentage of the run-time. As can be<br />

seen in Table 1 in Section 5, over 16% of the<br />

time is spent in epoll_ctl. <strong>The</strong> gprof output<br />

also indicates (not shown in the table) that<br />

epoll_ctl was being called a large number<br />

of times because it is called for every state<br />

change for each socket descriptor. We examine<br />

several approaches designed to reduce the<br />

number of epoll_ctl calls. <strong>The</strong>se are outlined<br />

in the following paragraphs.<br />

<strong>The</strong> first method uses epoll in an edge-triggered<br />

fashion, which requires the µserver<br />

to keep track of the current state of the socket<br />

descriptor. This is required because with the<br />

edge-triggered semantics, events are only received<br />

for transitions on the socket descriptor<br />

state. For example, once the server reads data<br />

from a socket, it needs to keep track of whether<br />

or not that socket is still readable, or if it will<br />

get another event from epoll_wait indicating<br />

that the socket is readable. Similar state<br />

information is maintained by the server regarding<br />

whether or not the socket can be written.<br />

This method is referred to in our graphs and<br />

the rest of the paper as epoll-ET.<br />

<strong>The</strong> second method, which we refer to as<br />

epoll2, simply calls epoll_ctl twice per<br />

socket descriptor. The first call registers with the<br />

kernel that the server is interested in read and<br />

write events on the socket. <strong>The</strong> second call occurs<br />

when the socket is closed. It is used to<br />

tell epoll that we are no longer interested in<br />

events on that socket. All events are handled<br />

in a level-triggered fashion. Although this approach<br />

will reduce the number of epoll_ctl<br />

calls, it does have potential disadvantages.


218 • <strong>Linux</strong> Symposium 2004 • Volume <strong>One</strong><br />

<strong>One</strong> disadvantage of the epoll2 method is that<br />

because many of the sockets will continue to be<br />

readable or writable epoll_wait will return<br />

sooner, possibly with events that are currently<br />

not of interest to the server. For example, if the<br />

server is waiting for a read event on a socket it<br />

will not be interested in the fact that the socket<br />

is writable until later. Another disadvantage is<br />

that these calls return sooner, with fewer events<br />

being returned per call, resulting in a larger<br />

number of calls. Lastly, because many of the<br />

events will not be of interest to the server, the<br />

server must spend time determining whether it<br />

is interested in each event and discarding<br />

those that are not.<br />

<strong>The</strong> third method uses a new system call named<br />

epoll_ctlv. This system call is designed to<br />

reduce the overhead of multiple epoll_ctl<br />

system calls by aggregating several calls to<br />

epoll_ctl into one call to epoll_ctlv.<br />

This is achieved by passing an array of epoll<br />

events structures to epoll_ctlv, which then<br />

calls epoll_ctl for each element of the array.<br />

Events are generated in level-triggered<br />

fashion. This method is referred to in the figures<br />

and the remainder of the paper as epoll-ctlv.<br />

We use epoll_ctlv to add socket descriptors<br />

to the interest set, and for modifying<br />

the interest sets for existing socket descriptors.<br />

However, removal of socket descriptors<br />

from the interest set is done by explicitly calling<br />

epoll_ctl just before the descriptor is<br />

closed. We do not aggregate deletion operations<br />

because by the time epoll_ctlv is<br />

invoked, the µserver has closed the descriptor<br />

and the epoll_ctl invoked on that descriptor<br />

will fail.<br />

<strong>The</strong> µserver does not attempt to batch the closing<br />

of descriptors because it can run out of<br />

available file descriptors. Hence, the epoll-ctlv<br />

method uses both the epoll_ctlv and<br />

the epoll_ctl system calls. Alternatively,<br />

we could rely on the close system call to<br />

remove the socket descriptor from the interest<br />

set (and we did try this). However, this<br />

increases the time spent by the µserver in<br />

close, and does not alter performance. We<br />

verified this empirically and decided to explicitly<br />

call epoll_ctl to perform the deletion<br />

of descriptors from the epoll interest set.<br />

4 Experimental Environment<br />

<strong>The</strong> experimental environment consists of a<br />

single server and eight clients. <strong>The</strong> server contains<br />

dual 2.4 GHz Xeon processors, 1 GB of<br />

RAM, a 10,000 rpm SCSI disk, and two<br />

one-Gigabit Ethernet cards. The clients are identical<br />

to the server with the exception of their<br />

disks which are EIDE. <strong>The</strong> server and clients<br />

are connected with a 24-port Gigabit switch.<br />

To avoid network bottlenecks, the first four<br />

clients communicate with the server’s first Ethernet<br />

card, while the remaining four use a different<br />

IP address linked to the second Ethernet<br />

card. <strong>The</strong> server machine runs a slightly modified<br />

version of the 2.6.5 <strong>Linux</strong> kernel in uniprocessor<br />

mode.<br />

4.1 Workloads<br />

This section describes the workloads that we<br />

used to evaluate performance of the µserver<br />

with the different event notification mechanisms.<br />

In all experiments, we generate HTTP<br />

loads using httperf [11], an open-loop workload<br />

generator that uses connection timeouts to<br />

generate loads that can exceed the capacity of<br />

the server.<br />

Our first workload is based on the widely used<br />

SPECweb99 benchmarking suite [17]. We use<br />

httperf in conjunction with a SPECweb99 file<br />

set and synthetic HTTP traces. Our traces<br />

have been carefully generated to recreate the



file classes, access patterns, and number of requests<br />

issued per (HTTP 1.1) connection that<br />

are used in the static portion of SPECweb99.<br />

<strong>The</strong> file set and server caches are sized so that<br />

the entire file set fits in the server’s cache. This<br />

ensures that differences in cache hit rates do<br />

not affect performance.<br />

Our second workload is called the one-byte<br />

workload. In this workload, the clients repeatedly<br />

request the same one byte file from the<br />

server’s cache. We believe that this workload<br />

stresses the event dispatch mechanism by minimizing<br />

the amount of work that needs to be<br />

done by the server in completing a particular<br />

request. By reducing the effect of system calls<br />

such as read and write, this workload isolates<br />

the differences due to the event dispatch<br />

mechanisms.<br />

To study the scalability of the event dispatch<br />

mechanisms as the number of socket descriptors<br />

(connections) is increased, we use idleconn,<br />

a program that comes as part of the<br />

httperf suite. This program maintains a steady<br />

number of idle connections to the server (in addition<br />

to the active connections maintained by<br />

httperf). If any of these connections are closed<br />

idleconn immediately re-establishes them. We<br />

first examine the behaviour of the event dispatch<br />

mechanisms without any idle connections<br />

to study scenarios where all of the connections<br />

present in a server are active. We then<br />

pre-load the server with a number of idle connections<br />

and then run experiments. <strong>The</strong> idle<br />

connections are used to increase the number<br />

of simultaneous connections in order to simulate<br />

a WAN environment. In this paper we<br />

present experiments using 10,000 idle connections,<br />

our findings with other numbers of idle<br />

connections were similar and they are not presented<br />

here.<br />

4.2 Server Configuration<br />

For all of our experiments, the µserver is run<br />

with the same set of configuration parameters<br />

except for the event dispatch mechanism. <strong>The</strong><br />

µserver is configured to use sendfile to take<br />

advantage of zero-copy socket I/O while writing<br />

replies. We use TCP_CORK in conjunction<br />

with sendfile. <strong>The</strong> same server options<br />

are used for all experiments even though<br />

the use of TCP_CORK and sendfile may<br />

not provide benefits for the one-byte workload<br />

when compared with simply using writev.<br />

4.3 Experimental Methodology<br />

We measure the throughput of the µserver using<br />

different event dispatch mechanisms. In<br />

our graphs, each data point is the result of a<br />

two minute experiment. Trial and error revealed<br />

that two minutes is sufficient for the<br />

server to achieve a stable state of operation. A<br />

two minute delay is used between consecutive<br />

experiments, which allows the TIME_WAIT<br />

state on all sockets to be cleared before the subsequent<br />

run. All non-essential services are terminated<br />

prior to running any experiment.<br />

5 Experimental Results<br />

In this section we first compare the throughput<br />

achieved when using level-triggered epoll with<br />

that observed when using select and poll<br />

under both the one-byte and SPECweb99-<br />

like workloads with no idle connections. We<br />

then examine the effectiveness of the different<br />

methods described for reducing the number<br />

of epoll_ctl calls under these same<br />

workloads. This is followed by a comparison<br />

of the performance of the event dispatch<br />

mechanisms when the server is pre-loaded with<br />

10,000 idle connections. Finally, we describe<br />

the results of experiments in which we tune the



accept strategy used in conjunction with epoll-<br />

LT and epoll-ctlv to further improve their performance.<br />

We initially ran the one byte and the<br />

SPECweb99-like workloads to compare the<br />

performance of the select, poll, and level-triggered<br />

epoll mechanisms.<br />

As shown in Figure 1 and Figure 2, for both<br />

of these workloads select and poll perform as<br />

well as epoll-LT. It is important to note that because<br />

there are no idle connections for these<br />

experiments the number of socket descriptors<br />

tracked by each mechanism is not very high.<br />

As expected, the gap between epoll-LT and select<br />

is more pronounced for the one byte workload<br />

because it places more stress on the event<br />

dispatch mechanism.<br />

We tried to improve the performance of the<br />

server by exploring different techniques for using<br />

epoll as described in Section 3. <strong>The</strong> effect<br />

of these techniques on the one-byte workload<br />

is shown in Figure 3. <strong>The</strong> graphs in this figure<br />

show that for this workload the techniques used<br />

to reduce the number of epoll_ctl calls do<br />

not provide significant benefits when compared<br />

with their level-triggered counterpart (epoll-LT).<br />

Additionally, the performance of select<br />

and poll is equal to or slightly better than each<br />

of the epoll techniques. Note that we omit the<br />

line for poll from Figures 3 and 4 because it is<br />

nearly identical to the select line.<br />

[Figure 1: µserver performance on one byte workload using select, poll, and epoll-LT; reply rate vs. request rate, 0–30,000 requests/s.]<br />

[Figure 2: µserver performance on SPECweb99-like workload using select, poll, and epoll-LT.]<br />

[Figure 3: µserver performance on one byte workload with no idle connections; series: select, epoll-LT, epoll-ET, epoll-ctlv, epoll2.]<br />

We further analyze the results from Figure 3<br />

by profiling the µserver using gprof at the request<br />

rate of 22,000 requests per second. Table<br />

1 shows the percentage of time spent in system<br />

calls (rows) under the various event dispatch<br />

methods (columns). <strong>The</strong> output for system<br />

calls and µserver functions which do not<br />

contribute significantly to the total run-time is<br />

left out of the table for clarity.<br />

If we compare the select and poll columns<br />

we see that they have a similar breakdown including<br />

spending about 13% of their time indicating<br />

to the kernel events of interest and<br />

obtaining events. In contrast the epoll-LT,<br />

epoll-ctlv, and epoll2 approaches spend about<br />

21 – 23% of their time on their equivalent<br />

functions (epoll_ctl, epoll_ctlv and<br />

epoll_wait). Despite these extra overheads<br />

the throughputs obtained using the epoll techniques<br />

compare favourably with those obtained



select epoll-LT epoll-ctlv epoll2 epoll-ET poll<br />

read 21.51 20.95 21.41 20.08 22.19 20.97<br />

close 14.90 14.05 14.90 13.02 14.14 14.79<br />

select 13.33 - - - - -<br />

poll - - - - - 13.32<br />

epoll_ctl - 16.34 5.98 10.27 11.06 -<br />

epoll_wait - 7.15 6.01 12.56 6.52 -<br />

epoll_ctlv - - 9.28 - - -<br />

setsockopt 11.17 9.13 9.13 7.57 9.08 10.68<br />

accept 10.08 9.51 9.76 9.05 9.30 10.20<br />

write 5.98 5.06 5.10 4.13 5.31 5.70<br />

fcntl 3.66 3.34 3.37 3.14 3.34 3.61<br />

sendfile 3.43 2.70 2.71 3.00 3.91 3.43<br />

Table 1: gprof profile data for the µserver under the one-byte workload at 22,000 requests/sec<br />

using select and poll. We note that when<br />

using select and poll the application requires<br />

extra manipulation, copying, and event<br />

scanning code that is not required in the epoll<br />

case (and does not appear in the gprof data).<br />

<strong>The</strong> results in Table 1 also show that the<br />

overhead due to epoll_ctl calls is reduced<br />

in epoll-ctlv, epoll2 and epoll-ET, when<br />

compared with epoll-LT. However, in each<br />

case these improvements are offset by increased<br />

costs in other portions of the code.<br />

<strong>The</strong> epoll2 technique spends twice as much<br />

time in epoll_wait when compared with<br />

epoll-LT. With epoll2 the number of calls<br />

to epoll_wait is significantly higher, the<br />

average number of descriptors returned is<br />

lower, and only a very small proportion of<br />

the calls (less than 1%) return events that<br />

need to be acted upon by the server. On the<br />

other hand, when compared with epoll-LT the<br />

epoll2 technique spends about 6% less time<br />

on epoll_ctl calls so the total amount of<br />

time spent dealing with events is comparable<br />

with that of epoll-LT. Despite the significant<br />

epoll_wait overheads epoll2 performance<br />

compares favourably with the other methods<br />

on this workload.<br />

Using the epoll-ctlv technique, gprof indicates<br />

that epoll_ctlv and epoll_ctl combine<br />

for a total of 1,949,404 calls compared<br />

with 3,947,769 epoll_ctl calls when using<br />

epoll-LT. While epoll-ctlv helps to reduce<br />

the number of user-kernel boundary crossings,<br />

the net result is no better than epoll-LT.<br />

The amount of time taken by epoll-ctlv<br />

in epoll_ctlv and epoll_ctl system<br />

calls is about the same (around 16%) as<br />

that spent by level-triggered epoll in invoking<br />

epoll_ctl.<br />

When comparing the percentage of time epoll-LT<br />

and epoll-ET spend in epoll_ctl we see<br />

that it has been reduced using epoll-ET from<br />

16% to 11%. Although the epoll_ctl time<br />

has been reduced it does not result in an appreciable<br />

improvement in throughput. We also<br />

note that about 2% of the run-time (which is<br />

not shown in the table) is also spent in the<br />

epoll-ET case checking, and tracking the state<br />

of the request (i.e., whether the server should<br />

be reading or writing) and the state of the<br />

socket (i.e., whether it is readable or writable).<br />

We expect that this can be reduced but that it<br />

wouldn’t noticeably impact performance.<br />

Results for the SPECweb99-like workload are



shown in Figure 4. Here the graph shows that<br />

all techniques produce very similar results with<br />

a very slight performance advantage going to<br />

epoll-ET after the saturation point is reached.<br />

<strong>The</strong> results for the SPECweb99-like workload<br />

with 10,000 idle connections are shown in Figure<br />

6. In this case each of the event mechanisms<br />

is impacted in a manner similar to that<br />

in which they are impacted by idle connections<br />

in the one-byte workload case.<br />

[Figure 4: µserver performance on SPECweb99-like workload with no idle connections; series: select, epoll-LT, epoll-ET, epoll2.]<br />

5.1 Results With Idle Connections<br />

We now compare the performance of the event<br />

mechanisms with 10,000 idle connections. <strong>The</strong><br />

idle connections are intended to simulate the<br />

presence of larger numbers of simultaneous<br />

connections (as might occur in a WAN environment).<br />

Thus, the event dispatch mechanism<br />

has to keep track of a large number of descriptors<br />

even though only a very small portion of<br />

them are active.<br />

By comparing results in Figures 3 and 5 one<br />

can see that the performance of select and poll<br />

degrade by up to 79% when the 10,000 idle<br />

connections are added. <strong>The</strong> performance of<br />

epoll2 with idle connections suffers similarly<br />

to select and poll. In this case, epoll2 suffers<br />

from the overheads incurred by making a large<br />

number of epoll_wait calls the vast majority<br />

of which return events that are not of current<br />

interest to the server. Throughput with<br />

level-triggered epoll is slightly reduced with<br />

the addition of the idle connections while edge-triggered<br />

epoll is not impacted.<br />

[Figure 5: µserver performance on one byte workload and 10,000 idle connections; series: select, poll, epoll-ET, epoll-LT, epoll2.]<br />

[Figure 6: µserver performance on SPECweb99-like workload and 10,000 idle connections.]<br />

5.2 Tuning Accept Strategy for epoll<br />

<strong>The</strong> µserver’s accept strategy has been tuned<br />

for use with select. <strong>The</strong> µserver includes a<br />

parameter that controls the number of connections<br />

that are accepted consecutively. We call<br />

this parameter the accept-limit. Parameter values<br />

range from one to infinity (Inf). A value of<br />

one limits the server to accepting at most one<br />

connection when notified of a pending connection<br />

request, while Inf causes the server to consecutively<br />

accept all currently pending connections.<br />

To this point we have used the accept strategy<br />

that was shown to be effective for select by



Brecht et al. [4] (i.e., accept-limit is Inf). In<br />

order to verify whether the same strategy performs<br />

well with the epoll-based methods we<br />

explored their performance under different accept<br />

strategies.<br />

Figure 7 examines the performance of level-triggered<br />

epoll after the accept-limit has been<br />

tuned for the one-byte workload (other values<br />

were explored but only the best values<br />

are shown). Level-triggered epoll with an accept<br />

limit of 10 shows a marked improvement<br />

over the previous accept-limit of Inf,<br />

and now matches the performance of select<br />

on this workload. <strong>The</strong> accept-limit of 10 also<br />

improves peak throughput for the epoll-ctlv<br />

model by 7%. This gap widens to 32% at<br />

21,000 requests/sec. In fact the best accept<br />

strategy for epoll-ctlv fares slightly better than<br />

the best accept strategy for select.<br />

[Figure 7: µserver performance on one byte workload with different accept strategies and no idle connections; series: select accept=Inf, epoll-LT accept=Inf, epoll-LT accept=10, epoll-ctlv accept=Inf, epoll-ctlv accept=10.]<br />

Varying the accept-limit did not improve the<br />

performance of the edge-triggered epoll technique<br />

under this workload and it is not shown<br />

in the graph. However, we believe that the effects<br />

of the accept strategy on the various epoll<br />

techniques warrants further study as the efficacy<br />

of the strategy may be workload dependent.<br />

6 Discussion<br />

In this paper we use a high-performance event-driven<br />

HTTP server, the µserver, to compare<br />

and evaluate the performance of select, poll,<br />

and epoll event mechanisms. Interestingly,<br />

we observe that under some of the workloads<br />

examined the throughput obtained using<br />

select and poll is as good or slightly better<br />

than that obtained with epoll. While these<br />

workloads may not utilize representative numbers<br />

of simultaneous connections they do stress<br />

the event mechanisms being tested.<br />

Our results also show that a main source of<br />

overhead when using level-triggered epoll is<br />

the large number of epoll_ctl calls. We<br />

explore techniques which significantly reduce<br />

the number of epoll_ctl calls, including<br />

the use of edge-triggered events and a system<br />

call, epoll_ctlv, which allows the µserver<br />

to aggregate large numbers of epoll_ctl<br />

calls into a single system call. While these<br />

techniques are successful in reducing the number<br />

of epoll_ctl calls they do not appear<br />

to provide appreciable improvements in performance.<br />

As expected, the introduction of idle connections<br />

results in dramatic performance degradation<br />

when using select and poll, while not<br />

noticeably impacting the performance when<br />

using epoll. Although it is not clear that<br />

the use of idle connections to simulate larger<br />

numbers of connections is representative of<br />

real workloads, we find that the addition of<br />

idle connections does not significantly alter<br />

the performance of the edge-triggered and<br />

level-triggered epoll mechanisms. <strong>The</strong> edgetriggered<br />

epoll mechanism performs best with<br />

the level-triggered epoll mechanism offering<br />

performance that is very close to edgetriggered.<br />

In the future we plan to re-evaluate some of



the mechanisms explored in this paper under<br />

more representative workloads that include<br />

more representative wide area network conditions.<br />

<strong>The</strong> problem with the technique of using<br />

idle connections is that the idle connections<br />

simply inflate the number of connections without<br />

doing any useful work. We plan to explore<br />

tools similar to Dummynet [16] and NIST Net<br />

[5] in order to more accurately simulate traffic<br />

delays, packet loss, and other wide area network<br />

traffic characteristics, and to re-examine<br />

the performance of Internet servers using different<br />

event dispatch mechanisms and a wider<br />

variety of workloads.<br />

7 Acknowledgments<br />

We gratefully acknowledge Hewlett Packard,<br />

the Ontario Research and Development Challenge<br />

Fund, and the National Sciences and Engineering<br />

Research Council of Canada for financial<br />

support for this project.<br />

References<br />

[1] G. Banga, P. Druschel, and J.C. Mogul.<br />

Resource containers: A new facility for<br />

resource management in server systems.<br />

In Operating Systems Design and<br />

Implementation, pages 45–58, 1999.<br />

[2] G. Banga and J.C. Mogul. Scalable<br />

kernel performance for Internet servers<br />

under realistic loads. In Proceedings of<br />

the 1998 USENIX Annual Technical<br />

Conference, New Orleans, LA, 1998.<br />

[3] G. Banga, J.C. Mogul, and P. Druschel.<br />

A scalable and explicit event delivery<br />

mechanism for UNIX. In Proceedings of<br />

the 1999 USENIX Annual Technical<br />

Conference, Monterey, CA, June 1999.<br />

[4] Tim Brecht, David Pariag, and Louay<br />

Gammo. accept()able strategies for<br />

improving web server performance. In<br />

Proceedings of the 2004 USENIX Annual<br />

Technical Conference (to appear), June<br />

2004.<br />

[5] M. Carson and D. Santay. NIST Net – a<br />

<strong>Linux</strong>-based network emulation tool.<br />

Computer Communication Review, to<br />

appear.<br />

[6] A. Chandra and D. Mosberger.<br />

Scalability of <strong>Linux</strong> event-dispatch<br />

mechanisms. In Proceedings of the 2001<br />

USENIX Annual Technical Conference,<br />

Boston, 2001.<br />

[7] HP Labs. <strong>The</strong> userver home page, 2004.<br />

Available at http://hpl.hp.com/<br />

research/linux/userver.<br />

[8] Dan Kegel. <strong>The</strong> C10K problem, 2004.<br />

Available at http:<br />

//www.kegel.com/c10k.html.<br />

[9] Jonathon Lemon. Kqueue—a generic<br />

and scalable event notification facility. In<br />

Proceedings of the USENIX Annual<br />

Technical Conference, FREENIX Track,<br />

2001.<br />

[10] Davide Libenzi. Improving (network)<br />

I/O performance. Available at<br />

http://www.xmailserver.org/<br />

linux-patches/nio-improve.<br />

html.<br />

[11] D. Mosberger and T. Jin. httperf: A tool<br />

for measuring web server performance.<br />

In <strong>The</strong> First Workshop on Internet Server<br />

Performance, pages 59–67, Madison,<br />

WI, June 1998.<br />

[12] Shailabh Nagar, Paul Larson, Hanna<br />

Linder, and David Stevens. epoll<br />

scalability web page. Available at<br />

http://lse.sourceforge.net/<br />

epoll/index.html.



[13] M. Ostrowski. A mechanism for scalable<br />

event notification and delivery in <strong>Linux</strong>.<br />

Master’s thesis, Department of Computer<br />

Science, University of Waterloo,<br />

November 2000.<br />

[14] Vivek S. Pai, Peter Druschel, and Willy<br />

Zwaenepoel. Flash: An efficient and<br />

portable Web server. In Proceedings of<br />

the USENIX 1999 Annual Technical<br />

Conference, Monterey, CA, June 1999.<br />

http://citeseer.nj.nec.com/<br />

article/pai99flash.html.<br />

[15] N. Provos and C. Lever. Scalable<br />

network I/O in <strong>Linux</strong>. In Proceedings of<br />

the USENIX Annual Technical<br />

Conference, FREENIX Track, June 2000.<br />

[16] Luigi Rizzo. Dummynet: a simple<br />

approach to the evaluation of network<br />

protocols. ACM Computer<br />

Communication Review, 27(1):31–41,<br />

1997.<br />

http://citeseer.ist.psu.<br />

edu/rizzo97dummynet.html.<br />

[17] Standard Performance Evaluation<br />

Corporation. SPECWeb99 Benchmark,<br />

1999. Available at http://www.<br />

specbench.org/osg/web99.<br />

[18] David Weekly. /dev/epoll – a high-speed<br />

Linux kernel patch. Available at<br />

http://epoll.hackerdojo.com.




<strong>The</strong> (Re)Architecture of the X Window System<br />

James Gettys<br />

jim.gettys@hp.com<br />

Keith Packard<br />

keithp@keithp.com<br />

HP Cambridge Research Laboratory<br />

Abstract<br />

<strong>The</strong> X Window System, Version 11, is the standard<br />

window system on <strong>Linux</strong> and UNIX systems.<br />

X11, designed in 1987, was “state of<br />

the art” at that time. From its inception, X has<br />

been a network transparent window system in<br />

which X client applications can run on any machine<br />

in a network using an X server running<br />

on any display. While there have been some<br />

significant extensions to X over its history (e.g.<br />

OpenGL support), X’s design lay fallow over<br />

much of the 1990’s. With the increasing interest<br />

in open source systems, it was no longer<br />

sufficient for modern applications and a significant<br />

overhaul is now well underway. This<br />

paper describes revisions to the architecture of<br />

the window system used in a growing fraction<br />

of desktops and embedded systems<br />

1 Introduction<br />

While part of this work on the X window system<br />

[SG92] is “good citizenship” required by<br />

open source, some of the architectural problems<br />

solved ease the ability of open source applications<br />

to print their results, and some of<br />

the techniques developed are believed to be in<br />

advance of the commercial computer industry.<br />

<strong>The</strong> challenges being faced include:<br />

• X’s fundamentally flawed font architecture<br />

made it difficult to implement good<br />

WYSIWYG systems<br />

• Inadequate 2D graphics, which had always<br />

been intended to be augmented<br />

and/or replaced<br />

• Developers are loathe to adopt any new<br />

technology that limits the distribution of<br />

their applications<br />

• Legal requirements for accessibility for<br />

screen magnifiers are difficult to implement<br />

• Users desire modern user interface eye<br />

candy, which sport translucent graphics<br />

and windows, drop shadows, etc.<br />

• Full integration of applications into 3 D<br />

environments<br />

• Collaborative shared use of X (e.g. multiple<br />

simultaneous use of projector walls or<br />

other shared applications)<br />

While some of this work has been published<br />

elsewhere, there has never been any overview<br />

paper describing this work as an integrated<br />

whole, and the compositing manager work described<br />

below is novel as of fall 2003. This<br />

work represents a long term effort that started<br />

in 1999, and will continue for several years<br />

more.



2 Text and Graphics<br />

X’s obsolete 2D bit-blit based text and graphics<br />

system problems were most urgent. <strong>The</strong> development<br />

of the Gnome and KDE GUI environments<br />

in the period 1997-2000 had shown<br />

X11’s fundamental soundness, but confirmed<br />

the authors’ belief that the rendering system in<br />

X was woefully inadequate. <strong>One</strong> of us participated<br />

in the original X11 design meetings<br />

where the intent was to augment the rendering<br />

design at a later date; but the “GUI Wars” of the<br />

late 1980’s doomed effort in this area. Good<br />

printing support has been particularly difficult<br />

to implement in X applications, as fonts were<br />

opaque X server-side objects not directly<br />

accessible by applications.<br />

Most applications now composite images in sophisticated ways, whether overtly in Flash media players or subtly as part of anti-aliased characters. Bit-blit is not sufficient for these applications, and these modern applications were (if only by their use of modern toolkits) all resorting to pixel-based image manipulation. The screen pixels are retrieved from the window system, composited in clients, and then restored to the screen, rather than directly composited in hardware, resulting in poor performance. Inspired by the model first implemented in the Plan 9 window system, a graphics model based on Porter/Duff [PD84] image compositing was chosen. This work resulted in the X Render extension [Pac01a].
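To make the Porter/Duff model concrete, the sketch below shows the OVER operator on premultiplied 8-bit ARGB pixels, the representation Render uses; the fixed-point rounding choice (`div255`) and the helper names are ours, not part of the extension.

```c
#include <stdint.h>

/* One channel of the Porter/Duff OVER operator on premultiplied
 * 8-bit values: result = src + (1 - src_alpha) * dst.
 * div255 rounds the 8.8 fixed-point product back into 0..255. */
static uint8_t div255(uint16_t v) { return (uint8_t)((v + 127) / 255); }

static uint8_t over_channel(uint8_t src, uint8_t src_a, uint8_t dst)
{
    return src + div255((uint16_t)(255 - src_a) * dst);
}

/* Composite one premultiplied ARGB pixel over another; the alpha
 * channel is combined with the same formula as the color channels. */
static uint32_t over_argb(uint32_t src, uint32_t dst)
{
    uint8_t sa = src >> 24;
    uint32_t out = 0;
    for (int shift = 0; shift <= 24; shift += 8) {
        uint8_t s = (src >> shift) & 0xff;
        uint8_t d = (dst >> shift) & 0xff;
        out |= (uint32_t)over_channel(s, sa, d) << shift;
    }
    return out;
}
```

An opaque source replaces the destination outright, while a fully transparent source leaves it untouched; everything in between blends, which is exactly what bit-blit could not express.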

X11’s core graphics exposed fonts as a server-side abstraction. This font model was, at best, marginally adequate by 1987 standards; even WYSIWYG systems of that era found it insufficient. Much additional information embedded in fonts (e.g., kerning tables) was not available from X whatsoever. Current competitive systems implement anti-aliased outline fonts. Discovering the Unicode coverage of a font, required by current toolkits for internationalization, was causing major performance problems. Deploying new server-side font technology is slow, as X is a distributed system, and many X servers are seldom (or never) updated.

Therefore, a more fundamental change in X’s architecture was undertaken: to no longer use server-side fonts at all, but to allow applications direct access to font files and have the window system cache and composite glyphs onto the screen.

The first implementation of the new font system [Pac01b] taught a vital lesson. Xft1 provided anti-aliased text and proper font naming/substitution support, but reverted to the core X11 bitmap fonts if the Render extension was not present. Xft1 included the first implementation of what is called “subpixel decimation,” which provides higher quality subpixel-based rendering than Microsoft’s ClearType [Pla00] technology in a completely general algorithm.

Despite these advances, Xft1 received at best a lukewarm reception. If application developers wanted anti-aliased text universally, Xft1 did not help them, since it relied on the Render extension, which had not yet been widely deployed; instead, developers would be faced with two implementations, and higher maintenance costs. This (in retrospect obvious) rational behavior of application developers shows the high importance of backwards compatibility: X extensions intended for application developers’ use must be designed in a downward-compatible form whenever possible, and should enable a complete conversion to the new facility, so that multiple code paths in applications do not need testing and maintenance. These principles have guided later development.

The font installation, naming, substitution, and internationalization problems were separated from Xft into a library named Fontconfig [Pac02] (since some printer-only applications need this functionality independent of the window system). Fontconfig provides internationalization features in advance of those in commercial systems such as Windows or OS X, and enables trivial font installation with good performance even when using thousands of fonts. Xft2 was also modified to operate against legacy X servers lacking the Render extension.

Because Xft2 and Fontconfig solved several major problems and posed no deployment barriers, they won rapid acceptance and deployment in the open source community, seeing almost universal use and uptake in less than one calendar year. They have been widely deployed on Linux systems since the end of 2002. They also “future proof” open source systems against coming improvements in font systems (e.g., OpenType), as the window system is no longer a gating item for font technology.

Sun Microsystems implemented a server-side font extension over the last several years; for the reasons outlined in this section, it has not been adopted by open source developers.

While Xft2 and Fontconfig finally freed application developers from the tyranny of X11’s core font system, improved performance [PG03], and at a stroke simplified their printing problems, they still left a substantial burden on applications. The X11 core graphics, even augmented by the Render extension, lack convenient facilities for many applications, even for simple primitives like splines, tasteful wide lines, and stroked paths, much less simple ways for applications to print the results on paper.

3 Cairo

The Cairo library [WP03], developed by one of the authors in conjunction with Carl Worth of ISI, is designed to solve this problem. Cairo provides a stateful user-level API with support for the PDF 1.4 imaging model. Cairo provides operations including stroking and filling Bézier cubic splines, transforming and compositing translucent images, and anti-aliased text rendering. The PostScript drawing model has been adapted for use within applications, and extensions needed to support much of the PDF 1.4 imaging operations have been included. This integration of the familiar PostScript operational model within native application language environments provides a simple and powerful new tool for graphics application development.

Cairo’s rendering algorithms use work done in the 1980’s by Guibas, Ramshaw, and Stolfi [GRS83], along with work by John Hobby [Hob85], which has never been exploited in PostScript or in Windows. The implementation is fast, precise, and numerically stable, supports hardware acceleration, and is in advance of commercial systems.

Of particular note is the current development of Glitz [NR04], an OpenGL backend for Cairo, being developed by a pair of master’s students in Sweden. Not only is it showing that a high-speed implementation of Cairo is possible, it implements an interface very similar to the X Render extension’s interface. More about this in the OpenGL section below.

Cairo is in the late stages of development and is being widely adopted in the open source community. It includes the ability to render to PostScript, and a PDF back end is planned, which should greatly improve applications’ printing support. Work to incorporate Cairo in the Gnome and KDE desktop environments is well underway, as are ports to Windows and Apple’s Macintosh, and it is being used by the Mono project. As with Xft2, Cairo works with all X servers, even those without the Render extension.

4 Accessibility and Eye-Candy

Several years ago, one of us implemented a prototype X system that used image compositing as the fundamental primitive for constructing the screen representation of the window hierarchy’s contents. Child window contents were composited to their parent windows, which were incrementally composited to their parents until the final screen image was formed, enabling translucent windows. The problem with this simplistic model was twofold: first, a naïve implementation consumed enormous resources, as each window required two complete off-screen buffers (one for the window contents themselves, and one for the window contents composited with the children), and took huge amounts of time to build the final screen image as it recursively composited windows together. Secondly, the policy governing the compositing was hardwired into the X server. An architecture for exposing the same semantics with less overhead seemed almost possible, and pieces of it were implemented (miext/layer). However, no complete system was fielded, and every copy of the code was tracked down and destroyed to prevent its escape into the wild.

Both Mac OS X and DirectFB [Hun04] perform window-level compositing by creating off-screen buffers for each top-level window (in OS X, the window system is not nested, so there are only top-level windows). The screen image is then formed by taking the resulting images and blending them together on the screen. Without handling the nested window case, both of these systems provide the desired functionality with a simple implementation. This simple approach is inadequate for X, as some desktop environments nest the whole system inside a single top-level window to allow panning, and X’s long history has shown the value of separating mechanism from policy (Gnome and KDE were developed over 10 years after X11’s design). The fix is pretty easy: allow applications to select which pieces of the window hierarchy are to be stored off-screen and which are to be drawn to their parents’ storage.

With window hierarchy contents stored in off-screen buffers, an external application can now control how the screen contents are constructed from the constituent sub-windows and whatever other graphical elements are desired. This eliminated the complexities surrounding precisely what semantics would be offered in window-level compositing within the X server and the design of the underlying X extensions. They were replaced by some concerns over the performance implications of using an external agent (the “Compositing Manager”) to execute the requests needed to present the screen image. Note that every visible pixel is under the control of the compositing manager, so screen updates are limited to how fast that application can get the bits painted to the screen.

The architecture is split across three new extensions:

• Composite, which controls which sub-hierarchies within the window tree are rendered to separate buffers.

• Damage, which tracks modified areas within windows, informing the Compositing Manager which areas of the off-screen hierarchy components have changed.

• Xfixes, which includes new Region objects permitting all of the above computation to be performed indirectly within the X server, avoiding round trips.
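The repaint loop these extensions enable can be pictured with a small toy model: damage reports accumulate into a region, and the compositing manager repaints only that region once per frame. The names and the bounding-box simplification below are ours; the real Damage extension reports richer region data.

```c
/* Toy model of a Damage-driven repaint loop.  A Region here is just a
 * bounding box; a region with x2 <= x1 is treated as empty. */
typedef struct { int x1, y1, x2, y2; } Region;

static Region region_union(Region a, Region b)
{
    if (a.x2 <= a.x1) return b;          /* a is empty */
    if (b.x2 <= b.x1) return a;          /* b is empty */
    Region r = {
        a.x1 < b.x1 ? a.x1 : b.x1, a.y1 < b.y1 ? a.y1 : b.y1,
        a.x2 > b.x2 ? a.x2 : b.x2, a.y2 > b.y2 ? a.y2 : b.y2,
    };
    return r;
}

/* Per-frame work: fold all damage reports into one region, so the
 * compositing manager recomposites just that area of the screen. */
static Region collect_damage(const Region *events, int n)
{
    Region acc = { 0, 0, 0, 0 };
    for (int i = 0; i < n; i++)
        acc = region_union(acc, events[i]);
    return acc;
}
```

Keeping this accumulation (and the real Region arithmetic) inside the X server, as Xfixes does, is what avoids a round trip per damage event.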

Multiple applications can take advantage of the off-screen window contents, allowing thumbnail or screen-magnifier applications to be included in the desktop environment.

To allow applications other than the compositing manager to present alpha-blended content to the screen, a new X Visual was added to the server. At 32 bits deep, it provides 8 bits each of red, green, and blue, along with 8 bits of alpha value. Applications can create windows using this visual, and the compositing manager can composite them onto the screen.
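A pixel in this 32-bit visual packs alpha, red, green, and blue into one word; since Render composites premultiplied data, a client building such a pixel from plain components premultiplies each channel first. The rounding choice below is an assumption of this sketch, not mandated by the visual.

```c
#include <stdint.h>

/* Pack non-premultiplied 8-bit components into a premultiplied
 * ARGB32 pixel as used with the depth-32 visual: alpha in the top
 * byte, then red, green, blue. */
static uint32_t pack_argb(uint8_t a, uint8_t r, uint8_t g, uint8_t b)
{
    uint32_t pr = (r * a + 127) / 255;   /* premultiply each channel */
    uint32_t pg = (g * a + 127) / 255;
    uint32_t pb = (b * a + 127) / 255;
    return ((uint32_t)a << 24) | (pr << 16) | (pg << 8) | pb;
}
```

Note that a fully transparent pixel is all zeros regardless of its nominal color, a direct consequence of premultiplication.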

Nothing in this fundamental design indicates that it is used for constructing translucent windows; redirection of window contents and notification of window content change seem pretty far removed from one of the final goals. But note that the compositing manager can use whatever X requests it likes to paint the combined image, including requests from the Render extension, which does know how to blend translucent images together. The final image is constructed programmatically, so the possible presentation on the screen is limited only by the fertile imagination of the numerous eye-candy developers, and not restricted to any policy imposed by the base window system. And, vital to rapid deployment, most applications can be completely oblivious to this background legerdemain.

In this design, such sophisticated effects need only be applied at frame update rates, and only on modified sections of the screen, rather than at the rate applications perform graphics; this constant behavior is highly desirable in systems.

There is very strong “pull” from both commercial and non-commercial users of X for this work, and the current early version will likely be shipped as part of the next X.Org Foundation X Window System release, sometime this summer. Since there has not been sufficient exposure through widespread use, further changes will certainly be required as further experience with the facilities is gained by a much larger audience; as these can be made without affecting existing applications, immediate deployment is both possible and extremely desirable.

The mechanisms described above realize a fundamentally more interesting architecture than either Windows or Mac OS X, where the compositing policy is hardwired into the window system. We expect a fertile explosion of experimentation, experience (both good and bad), and a winnowing of ideas as these facilities gain wider exposure.

5 Input Transformation

In the “naïve,” eye-candy use of the new compositing functions, no transformation of input events is required, as input to windows remains at the same geometric position on the screen, even though the windows are first rendered off-screen. More sophisticated use, for example by screen readers or immersive environments such as Croquet [SRRK02] or Sun’s Looking Glass [KJ04], requires transformation of input events from where they first occur on the visible screen to the actual position in the windows (being rendered from off-screen), since the window’s contents may have been arbitrarily transformed or even texture-mapped onto shapes on the screen.

As part of Sun Microsystems’ award-winning work on accessibility in open source for screen readers, Sun has developed the XEvIE extension [Kre], which allows external clients to transform input events. This looks like a good starting point for the somewhat more general problem that 3D systems pose, and with some modification can serve both the accessibility needs and those of more sophisticated applications.

6 Synchronization

Synchronization is probably the largest remaining challenge posed by compositing. While Composite has eliminated much flashing of the screen, since window exposure is eliminated, this does not solve the challenge of the compositing manager happening to copy an application’s window to the frame buffer in the middle of the application painting a sequence of updates. No “tearing” of single graphics operations takes place, since the X server is single-threaded and all graphics operations are run to completion.

The X Synchronization extension (XSync) [GCGW92], widely available but to date seldom used, provides a general set of mechanisms for applications to synchronize with each other, with real time, and potentially with other system-provided counters. XSync’s original design intended system-provided counters for vertical retrace interrupts, audio sample clocks, and similar system facilities, enabling very tight synchronization of graphics operations with these time bases. Work has begun on Linux to provide these counters at long last, when available, to flesh out the design originally put in place and tested in the early 1990’s.
in the early 1990’s.<br />

A possible design for solving the application<br />

synchronization problem at low overhead may<br />

be to mark sections of requests with increments<br />

of XSync counters: if the count is odd<br />

(or even) the window would be unstable/stable.<br />

<strong>The</strong> compositing manager might then copy the<br />

window only if the window is in a stable state.<br />

Some details and possibly extensions to XSync<br />

will need to be worked out, if this approach is<br />

pursued.<br />

7 Next Steps

We believe we are slightly more than halfway through the process of re-architecting and re-implementing the X Window System. The existing prototype needs to become a production system, requiring significant infrastructure work as described in this section.

7.1 OpenGL-based X

Current X-based systems that support OpenGL do so by encapsulating the OpenGL environment within X windows. As such, an OpenGL application cannot manipulate X objects with OpenGL drawing commands. Using OpenGL as the basis for the X server itself will place X objects such as pixmaps and off-screen window contents inside OpenGL objects, allowing applications to use the full OpenGL command set to manipulate them.

A “proof of concept” implementation of the X Render extension is being done as part of the Glitz back-end for Cairo, which is showing very good performance for Render-based applications. Whether the “core” X graphics will require any OpenGL extensions is still somewhat of an open question.

In concert with the new compositing extensions, conventional X applications can then be integrated into 3D environments such as Croquet or Sun’s Looking Glass. X application contents can be used as textures and mapped onto any surface desired in those environments. This work is underway, but not demonstrable at this date.



7.2 Kernel support for graphics cards

In current open source systems, graphics cards are supported in a manner totally unlike that of any other operating system, and unlike previous device drivers for the X Window System on commercial UNIX systems. There is no single central kernel driver responsible for managing access to the hardware. Instead, a large set of cooperating user- and kernel-mode systems are involved in mutual support of the hardware, including the X server (for 2D graphics), the direct-rendering infrastructure (DRI) (for accelerated 3D graphics), the kernel frame buffer driver (for text console emulation), the General ATI TV and Overlay Software (GATOS) (for video input and output), and alternate 2D graphics systems like DirectFB.

Two of these systems, the kernel frame buffer driver and the X server, both include code to configure the graphics card’s “video mode”: the settings needed to send the correct video signals to monitors connected to the card. Three of these systems, DRI, the X server, and GATOS, all include code for managing the memory space within the graphics card. All of these systems directly manipulate hardware registers without any coordination among them.

The X server has no kernel component for 2D graphics. Long-latency operations cannot use interrupts; instead, the X server spins while polling status registers. DMA is difficult or impossible to configure in this environment. Perhaps the most egregious problem is that the X server reconfigures the PCI bus to correct BIOS mapping errors without informing the operating system kernel. Kernel access to devices while this remapping is going on may find the related devices mismapped.

To rationalize this situation, various groups and vendors are coordinating efforts to create a single kernel-level entity responsible for basic device management, but this effort has just begun.

7.3 Housecleaning, Latency Elimination, and Latency Hiding

Serious attempts were made in the early 1990’s to multi-thread the X server itself, with the discovery that the threading overhead in the X server is a net performance loss [Smi92].

Applications, however, often need to be multi-threaded. The primary C binding to the X protocol is called Xlib, and its current implementation, by one of us, dates from 1987. While it was partially developed on a Firefly multiprocessor workstation of that era (something almost unheard of at that date), and some consideration of multi-threaded applications was taken in its implementation, its internal transport facilities were never expected or intended to be preserved when serious multi-threaded operating systems became available. Unfortunately, rather than the full rewrite one of us expected, multi-threaded support was debugged into existence using the original code base, and the resulting code is very bug-prone and hard to maintain. Additionally, over the years, Xlib became a “kitchen sink” library, including functionality well beyond its primary use as a binding to the X protocol. We have both seriously regretted the precedents we set by introducing extraneous functionality into Xlib, causing it to be one of the largest libraries on UNIX/Linux systems. Due to better facilities in modern toolkits and system libraries, more than half of Xlib’s current footprint is obsolete code or data.

While serious work was done in X11’s design to mitigate latency, X’s performance, particularly over low-speed networks, is often limited by round-trip latency, and in retrospect much more can be done [PG03]. As this work shows, client-side fonts have made a significant improvement in startup latency, and work has already been completed in toolkits to mitigate some of the other hot spots. Much of the latency can be recovered by some simple techniques already underway, but recovering the rest requires more sophisticated techniques that the current Xlib implementation is not capable of. Potentially 90% of the latency as of 2003 can be recovered by various techniques. The XCB library [MS01] by Bart Massey and Jamey Sharp is both carefully engineered to be multi-threaded and to expose interfaces that will allow for latency hiding.
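The latency-hiding idea can be illustrated with a toy cost model (the numbers and function names are ours, not the real XCB interface): when each request blocks for its reply, every request pays a full network round trip; when requests are issued first and replies collected afterwards, XCB-style, the round-trip cost is paid roughly once.

```c
/* Toy cost model of round-trip latency over a slow link. */
enum { RTT_MS = 50, N_REQUESTS = 10 };

/* Xlib style: send a request, block for its reply, repeat. */
static int synchronous_cost_ms(void)
{
    int total = 0;
    for (int i = 0; i < N_REQUESTS; i++)
        total += RTT_MS;                 /* one full round trip each */
    return total;
}

/* XCB style: issue all requests (receiving "cookies"), then read the
 * replies as they stream back after a single round-trip delay. */
static int pipelined_cost_ms(void)
{
    return RTT_MS;                       /* requests and replies overlap */
}
```

In real use the pipelined case still pays per-request transmission time, but on high-latency links the round trips dominate, which is why interfaces that separate request issue from reply retrieval matter so much.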

Since libraries linked against different basic X transport systems would cause havoc in the same address space, an Xlib compatibility layer (XCL) has been developed that provides the “traditional” X library API, using the original Xlib stubs, but replacing the internal transport and locking system, which will allow for much more useful latency-hiding interfaces. The XCB/XCL version of Xlib is now able to run essentially all applications, and after a shakedown period should be able to replace the existing Xlib transport soon. Other bindings than the traditional Xlib bindings then become possible in the same address space, and we may see toolkits adopt those bindings at substantial savings in space.

7.4 Mobility, Collaboration, and Other Topics

X’s original intended environment included highly mobile students, and a hope, never generally realized for X, was the migration of applications between X servers.

Users should be able to travel between systems running X and retrieve their running applications (with suitable authentication and authorization). They should be able to log out and “park” applications somewhere for later retrieval, either on the same display or elsewhere. Users should be able to replicate an application’s display on a wall projector for presentation. Applications should be able to easily survive the loss of the X server (most commonly caused by the loss of the underlying TCP connection, when running remotely).

Toolkit implementers typically did not understand or share this poorly enunciated vision and were primarily driven by pressing immediate needs, and X’s design and implementation made migration or replication difficult to implement as an afterthought. As a result, migration (and replication) was seldom implemented, and early toolkits such as Xt made it even more difficult. Emacs is the only widespread application capable of both migration and replication, and it avoided using any toolkit. A more detailed description of this vision is available in [Get02].

Recent work in some of the modern toolkits (e.g., GTK+) and evolution of X itself make much of this vision demonstrable in current applications. Some work in the X infrastructure (Xlib) is underway to enable the prototype in GTK+ to be finished.

Similarly, input devices need to become full-fledged network data sources, to enable much looser coupling of keyboards, mice, game consoles, projectors, and displays; the challenge here will be the authentication, authorization, and security issues this will raise. The HAL and DBUS projects hosted at freedesktop.org are working on at least part of the solutions for the user-interface challenges posed by hotplug of input devices.

7.5 Color Management

The existing color management facilities in X are over 10 years old, have never seen widespread use, and do not meet current needs. This area is ripe for revisiting. Marti Maria Saguer’s LittleCMS [Mar] may be of use here. For the first time, we have the opportunity to “get it right” from one end to the other, if we choose to make the investment.

7.6 Security and Authentication

Transport security has become a burning issue; X is network-transparent (applications can run on any system in a network, using remote displays), yet we dare no longer use X over the network directly due to password-grabbing kits in the hands of script kiddies. SSH [BS01] provides such facilities via port forwarding and is being used as a temporary stopgap. Urgent work on something better is vital to enable scaling and avoid the performance and latency issues introduced by transit of extra processes, particularly on Linux Terminal Server Project (LTSP) [McQ02] servers, which are beginning to break out of their initial use in schools and other non-security-sensitive environments into very sensitive commercial environments.

Another aspect of security arises between applications sharing a display. In the early and mid 1990’s, efforts were made as a result of the compartmented mode workstation projects to make it much more difficult for applications to share or steal data from each other on an X display. These facilities are very inflexible and have gone almost unused.

As projectors and other shared displays become common over the next five years, applications from multiple users sharing a display will become commonplace. In such environments, different people may be using the same display at the same time and would like some level of assurance that their applications’ data is not being grabbed by the other users’ applications.

Eamon Walsh has, as part of the SELinux project [Wal04], been working to replace the existing X Security extension with an extension that, as in SELinux, will allow multiple different security policies to be developed external to the X server. This should allow multiple different policies to be available to suit the varied uses: normal workstations, secure workstations, shared displays in conference rooms, etc.
etc.<br />

7.7 Compression and Image Transport<br />

Many/most modern applications and desktops,<br />

including the most commonly used application<br />

(a web browser) are now intensive users of synthetic<br />

and natural images. <strong>The</strong> previous attempt<br />

(XIE [SSF + 96]) to provide compressed<br />

image transport failed due to excessive complexity<br />

and over ambition of the designers, has<br />

never been significantly used, and is now in<br />

fact not even shipped as part of current X distributions.<br />

Today, many images are read from disk or the network in compressed form, uncompressed into memory in the X client, and moved to the X server (where they often occupy another copy of the uncompressed data). If we add general data compression to X (or run X over ssh with compression enabled), the data would be both compressed and uncompressed on its way to the X server. A simple replacement for XIE (if the complexity slippery slope can be avoided in a second attempt) would be worthwhile, along with other general compression of the X protocol.
of the X protocol.<br />

Results in our 2003 Usenix X Network Performance paper show that, in real application workloads (the startup of a Gnome desktop), even simple GZIP-style [Gai93] compression can make a tremendous difference in a network environment, with a factor of 300(!) savings in bandwidth. Apparently the synthetic images used in many current UIs are extremely good candidates for<br />


236 • <strong>Linux</strong> Symposium 2004 • Volume <strong>One</strong><br />

compression. A simple X extension that could encapsulate one or more X requests into the extension request would avoid multiple compression/decompression of the same data in a system where an image transport extension was also present. The basic X protocol framework is actually very byte-efficient relative to most conventional RPC systems, with a basic X request occupying only 4 bytes (contrast this with HTTP or CORBA, in which a simple request is more than 100 bytes).<br />

With the great recent interest in LTSP in commercial environments, work here would be extremely well spent, saving memory, CPU, and network bandwidth.<br />

We are more than happy to hear from anyone<br />

interested in helping in this effort to bring X<br />

into the new millennium.<br />

References<br />

[BS01] Daniel J. Barrett and Richard Silverman. SSH, The Secure Shell: The Definitive Guide. O’Reilly &amp; Associates, Inc., 2001.<br />

[Gai93] Jean-Loup Gailly. Gzip: The Data Compression Program. iUniverse.com, 1.2.4 edition, 1993.<br />

[GCGW92] Tim Glauert, Dave Carver, James Gettys, and David Wiggins. X Synchronization Extension Protocol, Version 3.0. X Consortium standard, 1992.<br />

[Get02] James Gettys. The Future is Coming, Where the X Window System Should Go. In FREENIX Track, 2002 Usenix Annual Technical Conference, Monterey, CA, June 2002. USENIX.<br />

[GRS83] Leo Guibas, Lyle Ramshaw, and Jorge Stolfi. A kinetic framework for computational geometry. In Proceedings of the IEEE 1983 24th Annual Symposium on the Foundations of Computer Science, pages 100–111. IEEE Computer Society Press, 1983.<br />

[Hob85] John D. Hobby. Digitized Brush Trajectories. PhD thesis, Stanford University, 1985. Also Stanford Report STAN-CS-85-1070.<br />

[Hun04] A. Hundt. DirectFB Overview (v0.2 for DirectFB 0.9.21), February 2004. http://www.directfb.org/documentation.<br />

[KJ04] H. Kawahara and D. Johnson. Project Looking Glass: 3D Desktop Exploration. In X Developers Conference, Cambridge, MA, April 2004.<br />

[Kre] S. Kreitman. XEvIE - X Event Interception Extension. http://freedesktop.org/~stukreit/xevie.html.<br />

[Mar] M. Maria. Little CMS Engine 1.12 API Definition. Technical report. http://www.littlecms.com/lcmsapi.txt.<br />

[McQ02] Jim McQuillan. LTSP - Linux Terminal Server Project, Version 3.0. Technical report, March 2002. http://www.ltsp.org/documentation/ltsp-3.0-4-en.html.<br />

[MS01] Bart Massey and Jamey Sharp. XCB: An X protocol C binding. In XFree86 Technical Conference, Oakland, CA, November 2001. USENIX.<br />

[NR04] Peter Nilsson and David Reveman. Glitz: Hardware Accelerated Image Compositing using OpenGL. In FREENIX Track, 2004 Usenix Annual Technical Conference, Boston, MA, July 2004. USENIX.<br />

[Pac01a] Keith Packard. Design and Implementation of the X Rendering Extension. In FREENIX Track, 2001 Usenix Annual Technical Conference, Boston, MA, June 2001. USENIX.<br />

[Pac01b] Keith Packard. The Xft Font Library: Architecture and Users Guide. In XFree86 Technical Conference, Oakland, CA, November 2001. USENIX.<br />

[Pac02] Keith Packard. Font Configuration and Customization for Open Source Systems. In 2002 Gnome User’s and Developers European Conference, Seville, Spain, April 2002. Gnome.<br />

[PD84] Thomas Porter and Tom Duff. Compositing Digital Images. Computer Graphics, 18(3):253–259, July 1984.<br />

[PG03] Keith Packard and James Gettys. X Window System Network Performance. In FREENIX Track, 2003 Usenix Annual Technical Conference, San Antonio, TX, June 2003. USENIX.<br />

[Pla00] J. Platt. Optimal filtering for patterned displays. IEEE Signal Processing Letters, 7(7):179–180, 2000.<br />

[SG92] Robert W. Scheifler and James Gettys. X Window System. Digital Press, third edition, 1992.<br />

[Smi92] John Smith. The Multi-Threaded X Server. The X Resource, 1:73–89, Winter 1992.<br />

[SRRK02] D. Smith, A. Raab, D. Reed, and A. Kay. Croquet: The Users Manual, October 2002. http://glab.cs.uni-magdeburg.de/~croquet/downloads/Croquet0.1.pdf.<br />

[SSF+96] Robert N.C. Shelley, Robert W. Scheifler, Ben Fahy, Jim Fulton, Keith Packard, Joe Mauro, Richard Hennessy, and Tom Vaughn. X Image Extension Protocol Version 5.02. X Consortium standard, 1996.<br />

[Wal04] Eamon Walsh. Integrating XFree86 With Security-Enhanced Linux. In X Developers Conference, Cambridge, MA, April 2004. http://freedesktop.org/Software/XDevConf/x-security-walsh.pdf.<br />

[WP03] Carl Worth and Keith Packard. Xr: Cross-device Rendering for Vector Graphics. In Proceedings of the Ottawa Linux Symposium, Ottawa, ON, July 2003. OLS.<br />




IA64-<strong>Linux</strong> perf tools for IO dorks<br />

Examples of IA-64 PMU usage<br />

Grant Grundler<br />

Hewlett-Packard<br />

iod00d@hp.com<br />

grundler@parisc-linux.org<br />

Abstract<br />

Itanium processors have very sophisticated<br />

performance monitoring tools integrated into<br />

the CPU. McKinley and Madison Itanium<br />

CPUs have over three hundred different types<br />

of events they can filter, trigger on, and count.<br />

The restrictions on which combinations of triggers are allowed are daunting and vary across CPU implementations. Fortunately, the tools<br />

hide this complicated mess. While the tools<br />

prevent us from shooting ourselves in the foot,<br />

it’s not obvious how to use those tools for measuring<br />

kernel device driver behaviors.<br />

IO driver writers can use pfmon to measure two<br />

key areas generally not obvious from the code:<br />

MMIO read and write frequency, and the precise<br />

addresses of instructions regularly causing L3<br />

data cache misses. Measuring MMIO reads has<br />

some nuances related to instruction execution<br />

which are relevant to understanding ia64 and<br />

likely ia32 platforms. Similarly, the ability to<br />

pinpoint exactly which data is being accessed<br />

by drivers enables driver writers to either modify<br />

the algorithms or add prefetching directives<br />

where feasible. I include some examples of<br />

how I used pfmon to measure NIC drivers and<br />

give some guidelines on use.<br />

q-syscollect is a “gprof without the pain” kind<br />

of tool. While q-syscollect uses the same kernel<br />

perfmon subsystem as pfmon, the former<br />

works at a higher level. With some knowledge<br />

about how the kernel operates, q-syscollect can<br />

collect call-graphs, function call counts, and<br />

percentage of time spent in particular routines.<br />

In other words, pfmon can tell us how much<br />

time the CPU spends stalled on d-cache misses<br />

and q-syscollect can give us the call-graph for<br />

the worst offenders.<br />

Updated versions of this paper will be available from http://iou.parisc-linux.org/ols2004/<br />

1 Introduction<br />

Improving the performance of IO drivers is really<br />

not that easy. It usually goes something<br />

like:<br />

1. Determine which workload is relevant<br />

2. Set up the test environment<br />

3. Collect metrics<br />

4. Analyze the metrics<br />

5. Change the code based on theories about<br />

the metrics<br />

6. Iterate from step 3 (collect metrics)<br />

This paper attempts to make the collect-analyze-change loop more efficient for three<br />



obvious things: MMIO reads, MMIO writes,<br />

and cache line misses.<br />

MMIO reads and writes are easier to locate in Linux code than in other OSs that support memory-mapped IO—just search for readl() and writel() calls. But pfmon [1] can provide statistics of actual behavior, not just where in the code MMIO space is touched.<br />

Cache line misses are hard to detect. None<br />

of the regular performance tools I’ve used<br />

can precisely tell where CPU stalls are taking<br />

place. We can guess some of them based on<br />

data usage—like spin locks ping-ponging between<br />

CPUs. This requires a level of understanding<br />

that most of us mere mortals don’t<br />

possess. Again, pfmon can help out here.<br />

Lastly, getting an overview of system performance and a run-time call graph usually requires compiler support that gcc doesn’t provide. q-tools [4] can provide that information.<br />

Driver writers can then manually adjust the<br />

code knowing where the “hot spots” are.<br />

1.1 pfmon<br />

<strong>The</strong> author of pfmon, Stephane Eranian [2],<br />

describes pfmon as “the performance tool<br />

for IA64-<strong>Linux</strong> which exploits all the features<br />

of the IA-64 Performance Monitoring Unit<br />

(PMU).” pfmon uses a command line interface<br />

and does not require any special privilege<br />

to run. pfmon can monitor a single process, a multi-threaded process, multi-process workloads, and the entire system.<br />

pfmon is the user command line interface to<br />

the kernel perfmon subsystem. perfmon does<br />

the ugly work of programming the PMU. perfmon is versioned separately from the pfmon command.<br />

When in doubt, use the perfmon in the<br />

latest 2.6 kernel.<br />

<strong>The</strong>re are two major types of measurements:<br />

counting and sampling. For counting, pfmon<br />

simply reports the number of occurrences of<br />

the desired events during the monitoring period.<br />

pfmon can also be configured to sample<br />

at certain intervals information about the execution<br />

of a command or for the entire system.<br />

It is possible to sample any events provided by<br />

the underlying PMU.<br />

<strong>The</strong> information recorded by the PMU depends<br />

on what the user wants. pfmon contains a few<br />

preset measurements but for the most part the<br />

user is free to set up custom measurements.<br />

On Itanium2, pfmon provides access to all the<br />

PMU advanced features such as opcode matching,<br />

range restrictions, the Event Address Registers<br />

(EAR) and the Branch Trace Buffer.<br />

1.2 pfmon command line options<br />

Here is a summary of command line options<br />

used in the examples later in this paper:<br />

--us-c use the US-style comma separator for large numbers.<br />

--cpu-list=0 bind pfmon to CPU 0 and only count on CPU 0.<br />

--pin-command bind the command at the end of the command line to the same CPU as pfmon.<br />

--resolve-addr look up addresses and print the symbols.<br />

--long-smpl-periods=2000 take a sample every 2000th event.<br />

--smpl-periods-random=0xfff:10 randomize the sampling period. This is necessary to avoid bias when sampling repetitive behaviors. The first value is the mask of bits to randomize (e.g., 0xfff) and the second value is the initial seed (e.g., 10).<br />

-k kernel only.<br />



--system-wide measure the entire system (all processes and the kernel).<br />

Parameters only available on a to-be-released<br />

pfmon v3.1:<br />

--smpl-module=dear-hist-itanium2 This particular module is to be used ONLY in conjunction with the Data EAR (Event Address Registers) and presents the recorded samples as histograms of cache misses. By default the information is presented in the instruction view, but it is also possible to get the data view of the misses.<br />

-e data_ear_cache_lat64 pseudo-event for memory loads with latency ≥ 64 cycles. The real event is DATA_EAR_EVENT (it counts the number of times the Data EAR has recorded something) and the pseudo-event expresses the latency filter for the event. Use “pfmon -ldata_ear_cache*” to list all valid values. Valid values on the McKinley CPU are powers of two (4–4096).<br />

1.3 q-tools<br />

<strong>The</strong> author of q-tools, David Mosberger [5],<br />

has described q-tools as “gprof without the<br />

pain.”<br />

The q-tools package contains q-syscollect, q-view, qprof, and q-dot. q-syscollect collects profile information using the kernel perfmon subsystem to sample the PMU. q-view presents the collected data in both flat-profile and call-graph form. q-dot displays the call graph in graphical form. Please see the qprof [6] website for details on qprof.<br />

q-syscollect depends on the kernel perfmon subsystem, which is included in all 2.6 Linux kernels. Because q-syscollect uses<br />

the PMU, it has the following advantages over<br />

other tools:<br />

• no special kernel support needed (besides<br />

perfmon subsystem).<br />

• provides call-graph of kernel functions<br />

• can collect call-graphs of the kernel while<br />

interrupts are blocked.<br />

• measures multi-threaded applications<br />

• data is collected per-CPU and can be<br />

merged<br />

• instruction level granularity (not bundles)<br />

2 Measuring MMIO Reads<br />

Nearly every driver uses MMIO reads to either<br />

flush MMIO writes, flush in-flight DMA,<br />

or (most obviously) collect status data from the<br />

IO device directly. While MMIO reads are necessary in most cases, they should be avoided where possible.<br />

2.1 Why worry about MMIO Reads?<br />

MMIO reads are expensive—how expensive<br />

depends on the speed of the IO bus, the number of bridges the read (and its corresponding read return)<br />

has to cross, how “busy” each bus is, and<br />

finally how quickly the device responds to the<br />

read request. On most architectures, one can<br />

precisely measure the cost by measuring a loop<br />

of MMIO reads and calling get_cycles()<br />

before/after the loop.<br />

I’ve measured anywhere from 1µs to 2µs per<br />

read. In practical terms:<br />

• ∼ 500–600 cycles on an otherwise-idle<br />

400 MHz PA-RISC machine.



• ∼ 1000 cycles on a 450 MHz Pentium machine<br />

which included crossing a PCI-PCI<br />

bridge.<br />

• ∼ 900–1000 cycles on a 800 MHz IA64<br />

HP ZX1 machine.<br />

And for those who still don’t believe me, try<br />

watching a DVD movie after turning DMA off<br />

for an IDE DVD player:<br />

hdparm -d 0 /dev/cdrom<br />

By switching the IDE controller to PIO (Programmed I/O) mode, all data is transferred to/from host memory under CPU control, a byte (or word) at a time. pfmon can measure this, and pfmon looks broken when it displays three- and four-digit “Average Cycles Per Instruction” (CPI) output.<br />

2.2 Eh? Memory Reads don’t stall?<br />

<strong>The</strong>y do. But the CPU and PMU don’t “realize”<br />

the stall until the next memory reference.<br />

<strong>The</strong> CPU continues execution until memory order<br />

is enforced by the acquire semantics in the<br />

MMIO read. This means the Data Event Address<br />

Registers record the next stalled memory<br />

reference due to memory ordering constraints,<br />

not the MMIO read. <strong>One</strong> has to look<br />

at the instruction stream carefully to determine<br />

which instruction actually caused the stall.<br />

This also means the following sequence<br />

doesn’t work exactly like we expect:<br />

writel(CMD,addr);<br />

readl(addr);<br />

udelay(1);<br />

y = buf->member;<br />

The problem is that the value returned by readl(addr) is never consumed. Memory ordering imposes no constraint on non-load/store instructions. Hence udelay(1)<br />

begins before the CPU stalls. <strong>The</strong> CPU will<br />

stall on buf->member because of memory<br />

ordering restrictions if the udelay(1) completes<br />

before readl(addr) is retired. Drop the<br />

udelay(1) call and pfmon will always see<br />

the stall caused by MMIO reads on the next<br />

memory reference.<br />

Unfortunately, the IA32 Software Developer’s<br />

Manual[3] Volume 3, Chapter 7.2 “MEMORY<br />

ORDERING” is silent on the issue of how<br />

MMIO (uncached accesses) will (or will not)<br />

stall the instruction stream. This document<br />

is very clear on how “IO Operations” (e.g.,<br />

IN/OUT) will stall the instruction pipeline until<br />

the read return arrives at the CPU. A direct response<br />

from Intel® indicated that readl() does<br />

not stall like IN or OUT do and IA32 has the<br />

same problem. <strong>The</strong> Intel® architect who responded<br />

did hedge the above statement claiming<br />

a “udelay(10) will be as close as expected”<br />

for an example similar to mine. Anyone who<br />

has access to a frontside bus analyzer can verify<br />

the above statement by measuring timing<br />

loops between uncached accesses. I’m not that<br />

privileged and have to trust Intel® in this case.<br />

For IA64, we considered putting an extra burden<br />

on udelay to stall the instruction stream<br />

until previous memory references were retired.<br />

We could use dummy loads/stores before and<br />

after the actual delay loop so memory ordering<br />

could be used to stall the instruction pipeline.<br />

That seemed excessive for something that we<br />

didn’t have a bug report for.<br />

Consensus was that adding an mf.a (memory fence) instruction to readl() should be sufficient.<br />

<strong>The</strong> architecture only requires mf.a serve as<br />

an ordering token and need not cause any delays<br />

of its own. In other words, the implementation<br />

is platform specific. mf.a has not<br />

been added to readl() yet because everything has worked without it so far.<br />

2.3 pfmon -e uc_loads_retired<br />

IO accesses are generally the only uncached<br />

references made on IA64-linux and normally<br />

will represent MMIO reads. <strong>The</strong> basic measurement<br />

will tell us roughly how many cycles<br />

the CPU stalls for MMIO reads. Get the number<br />

of MMIO reads per sample period and then<br />

multiply by the actual cycle count a MMIO read takes for the given device. One needs to measure the MMIO read cost by using a CPU-internal cycle counter and hacking the kernel to read a harmless address from the target device a few thousand times.<br />

In order to make statements about per-transaction or per-interrupt costs, we need to know<br />

the cumulative number of transactions or<br />

interrupts processed for the sample period.<br />

pktgen is straightforward in this regard since<br />

pktgen will print transaction statistics when<br />

a run is terminated. And one can record<br />

/proc/interrupts contents before and<br />

after each pfmon run to collect interrupt<br />

events as well.<br />

The drawback to the above is that it assumes a homogeneous driver environment; i.e., only one type of driver is under load during the test. I think that’s a fair assumption for development in most cases. Bridges (e.g., routing traffic across different interconnects) are probably the one case where it’s not true. One has to work a bit harder to figure out what the counts mean in that case.<br />

For other benchmarks, like SpecWeb, we want<br />

to grab /proc/interrupt and networking<br />

stats before/after pfmon runs.<br />

2.4 tg3 Memory Reads<br />

In summary, Figure 1 shows tg3 is doing<br />

2749675/(1834959 − 918505) ≈ 3<br />

MMIO reads per interrupt and averaging about<br />

5000000/(1834959 − 918505) ≈ 5 packets<br />

per interrupt. This is with the BCM5701 chip<br />

running in PCI mode at 66MHz:64-bit.<br />

Based on code inspection, here is a break down<br />

of where the MMIO reads occur in temporal<br />

order:<br />

1. tg3_interrupt() flushes MMIO<br />

write to MAILBOX_INTERRUPT_0<br />

2. tg3_poll() → tg3_enable_<br />

ints() → tw32(TG3PCI_MISC_<br />

HOST_CTRL)<br />

3. tg3_enable_ints() flushes MMIO<br />

write to MAILBOX_INTERRUPT_0<br />

It’s obvious when inspecting tw32() that the BCM5701 chip has a serious bug. Every call<br />

to tw32() on BCM5701 requires a MMIO<br />

read to follow the MMIO write. Only writes to<br />

mailbox registers don’t require this and a different<br />

routine is used for mailbox writes.<br />

Given the NIC was designed for zero MMIO<br />

reads, this is pretty poor performance. Using<br />

a BCM5703 or BCM5704 would avoid the<br />

MMIO read in tw32().<br />

I’ve exchanged email with David Miller and<br />

Jeff Garzik (tg3 driver maintainers). <strong>The</strong>y have<br />

valid concerns with portability. We agree tg3<br />

could be reduced to one MMIO read after the<br />

last MMIO write (to guarantee interrupts get<br />

re-enabled).<br />

<strong>One</strong> would need to use the “tag” field in the<br />

status block when writing the mailbox register<br />

to indicate which “tag” the CPU most recently



gsyprf3:~# pfmon -e uc_loads_retired -k --system-wide \<br />

-- /usr/src/pktgen-testing/pktgen-single-tg3<br />

Adding devices to run.<br />

Configuring devices<br />

Running... ctrl^C to stop<br />

57: 918505 0 IO-SAPIC-level eth1<br />

Result: OK: 7613693(c7613006+d687) usec, 5000000 (64byte) 656771pps 320Mb/sec<br />

(336266752bps) errors: 0<br />

57: 1834959 0 IO-SAPIC-level eth1<br />

CPU0<br />

2749675 UC_LOADS_RETIRED<br />

CPU1<br />

1175 UC_LOADS_RETIRED<br />


Figure 1: tg3 v3.6 MMIO reads with pktgen/IRQ on same CPU<br />

saw. Using Message Signaled Interrupts (MSI)<br />

instead of Line based IRQs would guarantee<br />

the most recent status block update (transferred<br />

via DMA writes) would be visible to the CPU<br />

before tg3_interrupt() gets called.<br />

<strong>The</strong> protocol would allow correct operation<br />

without using MSI, too.<br />

2.5 Benchmarking, pfmon, and CPU bindings<br />

<strong>The</strong> purpose of binding pktgen to CPU1 is<br />

to verify the transmit code path is NOT doing<br />

any MMIO reads. We split the transmit code<br />

path and interrupt handler across CPUs to narrow<br />

down which code path is performing the<br />

MMIO reads. This change is not obvious from<br />

Figure 2 output since tg3 only performs MMIO<br />

reads from CPU 0 (tg3_interrupt()).<br />

But in Figure 2, performance goes up 30%!<br />

Offhand, I don’t know if this is due to CPU<br />

utilization (pktgen and tg3_interrupt()<br />

contending for CPU cycles) or if DMA is more<br />

efficient because of cache-line flows. When I<br />

don’t have any deadlines looming, I’d like to<br />

determine the difference.<br />

2.6 e1000 Memory Reads<br />

e1000 version 5.2.52-k4 has a more efficient<br />

implementation than the tg3 driver. In a nutshell, MMIO reads are pretty much irrelevant to the pktgen workload with the e1000 driver using default values.<br />

Figure 3 shows e1000 performs<br />

173315/(703829 − 622143) ≈ 2 MMIO<br />

reads per interrupt and 5000000/(703829 −<br />

622143) ≈ 61 packets per interrupt.<br />

Being the curious soul I am, I tracked down<br />

the two MMIO reads anyway. One is in the interrupt handler and the second occurs when interrupts<br />

are re-enabled. It looks like e1000 will always<br />

need at least 2 MMIO reads per interrupt.<br />

3 Measuring MMIO Writes<br />

3.1 Why worry about MMIO Writes?<br />

MMIO writes are clearly not as significant as<br />

MMIO reads. Nonetheless, every time a driver<br />

writes to MMIO space, some subtle things happen.<br />

<strong>The</strong>re are four minor issues to think about:<br />

memory ordering, PCI bus utilization, filling<br />

outbound write queues, and stalling MMIO<br />

reads longer than necessary.



gsyprf3:~# pfmon -e uc_loads_retired -k --system-wide \<br />

-- /usr/src/pktgen-testing/pktgen-single-tg3<br />

Adding devices to run.<br />

Configuring devices<br />

Running... ctrl^C to stop<br />

57: 5809687 0 IO-SAPIC-level eth1<br />

Result: OK: 5914889(c5843865+d71024) usec, 5000000 (64byte) 845451pps 412Mb/se<br />

c (432870912bps) errors: 0<br />

57: 6427969 0 IO-SAPIC-level eth1<br />

CPU0<br />

1855253 UC_LOADS_RETIRED<br />

CPU1<br />

950 UC_LOADS_RETIRED<br />

Figure 2: tg3 v3.6 MMIO reads with pktgen/IRQ on diff CPU<br />

gsyprf3:~# pfmon -e uc_loads_retired -k --system-wide \<br />

-- /usr/src/pktgen-testing/pktgen-single-e1000<br />

Configuring devices<br />

Running... ctrl^C to stop<br />

59: 622143 0 IO-SAPIC-level eth3<br />

Result: OK: 10228738(c9990105+d238633) usec, 5000000 (64byte) 488854pps 238Mb/<br />

sec (250293248bps) errors: 81669<br />

59: 703829 0 IO-SAPIC-level eth3<br />

CPU0<br />

173315 UC_LOADS_RETIRED<br />

CPU1<br />

1422 UC_LOADS_RETIRED<br />

Figure 3: MMIO reads for e1000 v5.2.52-k4



First, memory ordering is enforced since PCI<br />

requires strong ordering of MMIO writes. This<br />

means the MMIO write will push all previous<br />

regular memory writes ahead. This is not a serious<br />

issue but it can make a MMIO write take<br />

longer.<br />

MMIO writes are short transactions (i.e., much<br />

less than a cache-line). <strong>The</strong> PCI bus setup time<br />

to select the device, send the target address and<br />

data, and disconnect measurably reduces PCI<br />

bus utilization. It typically results in six or<br />

more PCI bus cycles to send four (or eight)<br />

bytes of data. On systems which strongly order<br />

DMA Read Returns and MMIO Writes, the<br />

latter will also interfere with DMA flows by interrupting<br />

in-flight, outbound DMA.<br />

If the IO bridge (e.g., PCI Bus controller) nearest<br />

the CPU has a full write queue, the CPU<br />

will stall. <strong>The</strong> bridge would normally queue<br />

the MMIO write and then tell the CPU it’s<br />

done. <strong>The</strong> chip designers normally make the<br />

write queue deep enough so the CPU never<br />

needs to stall. But drivers that perform many<br />

MMIO writes (e.g., use doorbells) and burst many MMIO writes at a time could run into a worst case.<br />

<strong>The</strong> last concern, stalling MMIO reads longer<br />

than normal, exists because of PCI ordering<br />

rules. MMIO reads and MMIO writes are<br />

strongly ordered. E.g., if four MMIO writes<br />

are queued before a MMIO read, the read will<br />

wait until all four MMIO write transactions<br />

have completed. So instead of say 1000 CPU<br />

cycles, the MMIO read might take more than<br />

2000 CPU cycles on current platforms.<br />

3.2 pfmon -e uc_stores_retired<br />

pfmon counts MMIO Writes with no surprises.<br />

3.3 tg3 Memory Writes<br />

Figure 4 shows tg3 does about 10M MMIO<br />

writes to send 5M packets. However, we<br />

can break the MMIO writes down into base<br />

level (feed packets onto transmit queue) and<br />

tg3_interrupt which handles TX (and<br />

RX) completions. Knowing which code path<br />

the MMIO writes are in helps track down usage<br />

in the source code.<br />

Output in Figure 5 is after hacking the<br />

pktgen-single-tg3 script to bind<br />

the pktgen kernel thread to CPU 1 while eth1 directs interrupts to CPU 0.<br />

<strong>The</strong> distribution between TX queue setup<br />

and interrupt handling is obvious now.<br />

CPU 0 is handling interrupts and performs<br />

3013580/(5803789 − 5201193) ≈ 5 MMIO<br />

writes per interrupt. CPU 1 is handling TX<br />

setup and performs 5000376/5000000 ≈ 1<br />

MMIO write per packet.<br />

Again, as noted in Section 2.5, binding the pktgen thread to one CPU and interrupts to another changes the performance dramatically.<br />

3.4 e1000 Memory Writes<br />

Figure 6 shows 248891/(991082 −<br />

908366) ≈ 3 MMIO writes per interrupt<br />

and 5001303/5000000 ≈ 1 MMIO write<br />

per packet. In other words, slightly better than<br />

tg3 driver. Nonetheless, the hardware can’t<br />

push as many packets. <strong>One</strong> difference is the<br />

e1000 driver is pushing data to a NIC behind a<br />

PCI-PCI Bridge.<br />

Figure 7 shows a ≈40% improvement in throughput¹ for pktgen without a PCI-PCI<br />

Bridge in the way. Note the ratios of MMIO<br />

writes per interrupt and MMIO writes per<br />

¹ This demonstrates how the distance between the IO device and the CPU (and memory) directly translates into latency and performance.<br />



gsyprf3:~# pfmon -e uc_stores_retired -k --system-wide -- /usr/src/pktgen-test<br />

ing/pktgen-single-tg3<br />

Adding devices to run.<br />

Configuring devices<br />

Running... ctrl^C to stop<br />

57: 4284466 0 IO-SAPIC-level eth1<br />

Result: OK: 7611689(c7610900+d789) usec, 5000000 (64byte) 656943pps 320Mb/sec<br />

(336354816bps) errors: 0<br />

57: 5198436 0 IO-SAPIC-level eth1<br />

CPU0<br />

9570269 UC_STORES_RETIRED<br />

CPU1<br />

445 UC_STORES_RETIRED<br />

Figure 4: tg3 v3.6 MMIO writes<br />

gsyprf3:~# pfmon -e uc_stores_retired -k --system-wide -- /usr/src/<br />

pktgen-testing/pktgen-single-tg3<br />

Adding devices to run.<br />

Configuring devices<br />

Running... ctrl^C to stop<br />

57: 5201193 0 IO-SAPIC-level eth1<br />

Result: OK: 5880249(c5811180+d69069) usec, 5000000 (64byte) 850340pps 415Mb<br />

/sec (435374080bps) errors: 0<br />

57: 5803789 0 IO-SAPIC-level eth1<br />

CPU0<br />

3013580 UC_STORES_RETIRED<br />

CPU1<br />

5000376 UC_STORES_RETIRED<br />

Figure 5: tg3 v3.6 MMIO writes with pktgen/IRQ split across CPUs<br />

gsyprf3:~# pfmon -e uc_stores_retired -k --system-wide -- /usr/src/pktgen-testing/pktgen-single-e1000<br />

Running... ctrl^C to stop<br />

59: 908366 0 IO-SAPIC-level eth3<br />

Result: OK: 10340222(c10104719+d235503) usec, 5000000 (64byte) 483558pps 236Mb/sec (247581696bps) errors: 82675<br />

59: 991082 0 IO-SAPIC-level eth3<br />

CPU0<br />

248891 UC_STORES_RETIRED<br />

CPU1<br />

5001303 UC_STORES_RETIRED<br />

Figure 6: MMIO writes for e1000 v5.2.52-k4<br />

gsyprf3:~# pfmon -e uc_stores_retired -k --system-wide -- /usr/src/pktgen-testing/pktgen-single-e1000<br />

Running... ctrl^C to stop<br />

71: 3 0 IO-SAPIC-level eth7<br />

Result: OK: 7491358(c7342756+d148602) usec, 5000000 (64byte) 667467pps 325Mb/sec (341743104bps) errors: 59870<br />

71: 59907 0 IO-SAPIC-level eth7<br />

CPU0<br />

180406 UC_STORES_RETIRED<br />

CPU1<br />

5000939 UC_STORES_RETIRED<br />

Figure 7: e1000 v5.2.52-k4 MMIO writes without PCI-PCI Bridge



packet are the same. I doubt the MMIO<br />

reads and MMIO writes are the limiting factors.<br />

More likely DMA access to memory<br />

(and thus TX/RX descriptor rings) limits NIC<br />

packet processing.<br />

4 Measuring Cache-line Misses<br />

<strong>The</strong> Event Address Registers 2 (EAR) can only<br />

record one event at a time. What is so interesting<br />

about them is that they record precise information<br />

about data cache misses. For instance,<br />

for a data cache miss, you get the:<br />

• address of the instruction, likely a load<br />

• address of the target data<br />

• latency in cycles to resolve the miss<br />

<strong>The</strong> information pinpoints the source of the<br />

miss, not the consequence (i.e., the stall).<br />

<strong>The</strong> Data EAR (DEAR) can also tell us about<br />

MMIO reads via sampling. <strong>The</strong> DEAR can<br />

only record loads that miss, not stores. Of<br />

course, MMIO reads always miss because they<br />

are uncached. This is interesting if we want to<br />

track down which MMIO addresses are “hot.”<br />

It’s usually easier to track down usage in source<br />

code knowing which MMIO address is referenced.<br />

Collecting with DEAR sampling requires that<br />

two parameters be tweaked to improve the<br />

statistical quality of the samples. <strong>One</strong> is the frequency at which<br />

Data Addresses are recorded and the other is<br />

the threshold (how many CPU cycles latency).<br />

Because we know the latency to L3 is about<br />

21 cycles, setting the EAR threshold to a value<br />

higher (e.g., 64 cycles) ensures only the load<br />

2 pfmon v3.1 is the first version to support EAR<br />

and is expected to be available in August, 2004.<br />

misses accessing main memory will be captured.<br />

This is how to select which level of<br />

cacheline misses one samples.<br />

While high thresholds (e.g., 64 cycles) will<br />

show us where the longest delays occur, they will<br />

not show us the worst offenders. Doing a second<br />

run with a lower threshold (e.g., 4 cycles)<br />

shows all L1, L2, and L3 cache misses and provides<br />

a much broader picture of cache utilization.<br />

When sampling events with low thresholds,<br />

we will get saturated with events and need to<br />

reduce the number of events actually sampled<br />

to every 5000th. <strong>The</strong> appropriate value will<br />

depend on the workload and how patient one<br />

is. <strong>The</strong> workload needs to be run long enough<br />

to be statistically significant and the sampling<br />

period needs to be high enough to not significantly<br />

perturb the workload.<br />
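One back-of-the-envelope check on the chosen sampling period (my own arithmetic, not from the paper): each recorded sample stands in for roughly one sampling period's worth of qualifying events, so the event totals behind a run can be estimated from the sample count and period:<br />

```c
/* Back-of-the-envelope estimate: each recorded sample represents
 * roughly `period` qualifying events, so total events ~= samples * period. */
unsigned long events_from_samples(unsigned long samples, unsigned long period)
{
    return samples * period;
}

/* Figure 8: 672 samples at period 500   -> ~336000 misses over 64 cycles
 * Figure 9: 795 samples at period 5000  -> ~3975000 misses over 4 cycles */
```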

4.1 tg3 Data Cache misses > 64 cycles<br />

For the output in Figure 8, I’ve iteratively decreased<br />

the smpl-periods until I noticed the total<br />

pktgen throughput starting to drop. Figure<br />

8 output only shows the tg3 interrupt code<br />

path since pfmon is bound to CPU 0. Normally,<br />

it would be useful to run this again with<br />

cpu-list=1. We could then see what the<br />

TX code path and pktgen are doing.<br />

Also, the pin-command option in<br />

this example doesn’t do anything since<br />

pktgen-single-tg3 directs a pktgen<br />

kernel thread bound to CPU 1 to do the real<br />

work. I’ve included the option only to make<br />

people aware of it.<br />

4.2 tg3 Data Cache misses > 4 cycles<br />

Figure 9 puts the lat64 output in Figure 8<br />

into better perspective. It shows tg3 is spending<br />

more time for L1 and L2 misses than L3 misses



gsyprf3:~# pfmon31 --us-c --cpu-list=0 --pin-command --resolve-addr \<br />

--smpl-module=dear-hist-itanium2 \<br />

-e data_ear_cache_lat64 --long-smpl-periods=500 \<br />

--smpl-periods-random=0xfff:10 --system-wide \<br />

-k -- /usr/src/pktgen-testing/pktgen-single-tg3<br />

added event set 0<br />

only kernel symbols are resolved in system-wide mode<br />

Adding devices to run.<br />

Configuring devices<br />

Running... ctrl^C to stop<br />

57: 7209769 0 IO-SAPIC-level eth1<br />

Result: OK: 5915877(c5845032+d70845) usec, 5000000 (64byte) 845308pps 412Mb/sec<br />

(432797696bps) errors: 0<br />

57: 7827812 0 IO-SAPIC-level eth1<br />

# total_samples 672<br />

# instruction addr view<br />

# sorted by count<br />

# showing per per distinct value<br />

# %L2 : percentage of L1 misses that hit L2<br />

# %L3 : percentage of L1 misses that hit L3<br />

# %RAM : percentage of L1 misses that hit memory<br />

# L2 : 5 cycles load latency<br />

# L3 : 12 cycles load latency<br />

# sampling period: 500<br />

#count %self %cum %L2 %L3 %RAM instruction addr<br />

38 5.65% 5.65% 0.00% 0.00% 100.00% 0xa000000100009141 ia64_spinlock_contention<br />

+0x21<br />

36 5.36% 11.01% 0.00% 0.00% 100.00% 0xa00000020003e580 tg3_interrupt[tg3]+0xe0<br />

32 4.76% 15.77% 0.00% 0.00% 100.00% 0xa000000200034770 tg3_write_indirect_reg32[tg3]<br />

+0x90<br />

32 4.76% 20.54% 0.00% 0.00% 100.00% 0xa00000020003e640 tg3_interrupt[tg3]+0x1a0<br />

30 4.46% 25.00% 0.00% 0.00% 100.00% 0xa000000200034e91 tg3_enable_ints[tg3]+0x91<br />

29 4.32% 29.32% 0.00% 0.00% 100.00% 0xa00000020003e510 tg3_interrupt[tg3]+0x70<br />

28 4.17% 33.48% 0.00% 0.00% 100.00% 0xa00000020003d1a0 tg3_tx[tg3]+0x2e0<br />

27 4.02% 37.50% 0.00% 0.00% 100.00% 0xa00000020003cfa0 tg3_tx[tg3]+0xe0<br />

24 3.57% 41.07% 0.00% 0.00% 100.00% 0xa00000020003cfd1 tg3_tx[tg3]+0x111<br />

21 3.12% 44.20% 0.00% 0.00% 100.00% 0xa000000200034e60 tg3_enable_ints[tg3]+0x60<br />

.<br />

.<br />

.<br />

# level 0 : counts=0 avg_cycles=0.0ms 0.00%<br />

# level 1 : counts=0 avg_cycles=0.0ms 0.00%<br />

# level 2 : counts=672 avg_cycles=0.0ms 100.00%<br />

approx cost: 0.0s<br />

Figure 8: tg3 v3.6 lat64 output



gsyprf3:~# pfmon31 --us-c --cpu-list=0 --resolve-addr --smpl-module=dear-hist-itanium2 \<br />

-e data_ear_cache_lat4 --long-smpl-periods=5000 --smpl-periods-random=0xfff:10 \<br />

--system-wide -k -- /usr/src/pktgen-testing/pktgen-single-tg3<br />

added event set 0<br />

only kernel symbols are resolved in system-wide mode<br />

Adding devices to run.<br />

Configuring devices<br />

Running... ctrl^C to stop<br />

57: 8484552 0 IO-SAPIC-level eth1<br />

Result: OK: 5938001(c5866437+d71564) usec, 5000000 (64byte) 842034pps 411Mb/sec<br />

(431121408bps) errors: 0<br />

57: 9093642 0 IO-SAPIC-level eth1<br />

# total_samples 795<br />

# instruction addr view<br />

# sorted by count<br />

# showing per per distinct value<br />

# %L2 : percentage of L1 misses that hit L2<br />

# %L3 : percentage of L1 misses that hit L3<br />

# %RAM : percentage of L1 misses that hit memory<br />

# L2 : 5 cycles load latency<br />

# L3 : 12 cycles load latency<br />

# sampling period: 5000<br />

# #count %self %cum %L2 %L3 %RAM instruction addr<br />

95 11.95% 11.95% 0.00% 98.95% 1.05% 0xa00000020003d150 tg3_tx[tg3]+0x290<br />

83 10.44% 22.39% 93.98% 4.82% 1.20% 0xa00000020003d030 tg3_tx[tg3]+0x170<br />

21 2.64% 25.03% 0.00% 95.24% 4.76% 0xa0000001000180f0 ia64_handle_irq+0x170<br />

20 2.52% 27.55% 5.00% 80.00% 15.00% 0xa00000020003d040 tg3_tx[tg3]+0x180<br />

18 2.26% 29.81% 50.00% 11.11% 38.89% 0xa00000020003cfa0 tg3_tx[tg3]+0xe0<br />

17 2.14% 31.95% 0.00% 0.00% 100.00% 0xa00000020003e671 tg3_interrupt[tg3]<br />

+0x1d1<br />

17 2.14% 34.09% 0.00% 100.00% 0.00% 0xa00000020003e700 tg3_interrupt[tg3]<br />

+0x260<br />

16 2.01% 36.10% 56.25% 43.75% 0.00% 0xa000000100012160 ia64_leave_kernel<br />

+0x180<br />

16 2.01% 38.11% 62.50% 0.00% 37.50% 0xa00000020003cf60 tg3_tx[tg3]+0xa0<br />

15 1.89% 40.00% 86.67% 6.67% 6.67% 0xa00000020003cfd0 tg3_tx[tg3]+0x110<br />

15 1.89% 41.89% 0.00% 0.00% 100.00% 0xa000000100016041 do_IRQ+0x1a1<br />

15 1.89% 43.77% 0.00% 53.33% 46.67% 0xa00000020003e370 tg3_poll[tg3]+0x350<br />

.<br />

.<br />

.<br />

# level 0 : counts=226 avg_cycles=0.0ms 28.43%<br />

# level 1 : counts=264 avg_cycles=0.0ms 33.21%<br />

# level 2 : counts=305 avg_cycles=0.0ms 38.36%<br />

approx cost: 0.0s<br />

Figure 9: tg3 v3.6 lat4 output



and in only two locations. Adding one prefetch<br />

to pull data from L3 into L2 would help the<br />

top offender. <strong>One</strong> needs to figure out which bit<br />

of data each recorded access refers to and determine<br />

how early one can prefetch that data.<br />

We can also rule out MMIO accesses as the top<br />

culprit. tg3_interrupt+0x1d1 could be<br />

an MMIO read but it doesn’t show up in Figure<br />

8 like tg3_write_indirect_reg32<br />

does.<br />

Note smpl-periods is 10x higher in Figure<br />

9 than in Figure 8. Collecting 10x more<br />

samples with lat4 definitely disturbs the<br />

workload.<br />

5 q-tools<br />

q-syscollect and q-view are trivial to<br />

use. An example and brief explanation for kernel<br />

usage follow.<br />

Please remember that most applications spend most<br />

of their time in user space and not in the kernel.<br />

q-tools is especially good in user space.<br />

5.1 q-syscollect<br />

q-syscollect -c 5000 -C 5000 -t 20 -k<br />

This will collect system-wide kernel data during<br />

the 20-second period. Twenty to thirty seconds<br />

is usually long enough to get sufficient accuracy<br />

3 . However, if the workload generates<br />

a very wide call graph with even distribution,<br />

one will likely need to sample for longer periods<br />

to get accuracy in the ±1% range. When<br />

in doubt, try sampling for longer periods to see<br />

if the call-counts change significantly.<br />

3 See page 7 of David Mosberger’s Gelato talk<br />

[4] for a nice graph on accuracy that applies only to<br />

his example.<br />

<strong>The</strong> -c and -C options set the call sample rate and<br />

code sample rate, respectively. <strong>The</strong> call sample<br />

rate is used to collect function call counts.<br />

This is one of the key differences compared to<br />

traditional profiling tools: q-syscollect obtains<br />

call-counts in a statistical fashion, just as has<br />

been done traditionally for the execution-time<br />

profile. <strong>The</strong> code sample rate is used to collect<br />

a flat profile (CPU_CYCLES by default).<br />

<strong>The</strong> -e option allows one to change the event<br />

used to sample for the flat profile. <strong>The</strong> default<br />

is to sample the CPU_CYCLES event. This provides<br />

traditional execution time in the flat profile.<br />

<strong>The</strong> data is stored in the current directory under<br />

the .q/ directory. <strong>The</strong> next section demonstrates<br />

how q-view displays the data.<br />

5.2 q-view<br />

I was running the netperf [7] TCP_RR test in<br />

the background to another server when I collected<br />

the following data. As Figure 10 shows,<br />

this particular TCP_RR test isn’t costing many<br />

cycles in the tg3 driver, or at least none that I can<br />

measure.<br />

tg3_interrupt() shows up in the flat profile<br />

with 0.314 seconds time associated with<br />

it. <strong>The</strong> time measurement is only possible<br />

because handle_IRQ_event() re-enables<br />

interrupts if the IRQ handler is not registered<br />

with SA_INTERRUPT (which indicates a<br />

latency-sensitive IRQ handler). do_IRQ() and<br />

other functions in that same call graph do NOT<br />

have any time measurements because interrupts<br />

are disabled. As noted before, the callgraph<br />

is sampled using a different part of the<br />

PMU than the part which samples the flat profile.<br />

Lastly, I’ve omitted the trailing output of<br />

q-view which explains the fields and<br />

columns more completely. Read that first be-



gsyprf3:~# q-view .q/kernel-cpu0.info | more<br />

Flat profile of CPU_CYCLES in kernel-cpu0.hist#0:<br />

Each histogram sample counts as 200.510u seconds<br />

% time self cumul calls self/call tot/call name<br />

68.88 13.41 13.41 215k 62.5u 62.5u default_idle<br />

2.90 0.56 13.97 431k 1.31u 1.31u finish_task_switch<br />

2.50 0.49 14.46 233k 2.09u 4.89u tg3_poll<br />

1.77 0.35 14.80 1.38M 251n 268n ipt_do_table<br />

1.61 0.31 15.12 240k 1.31u 1.31u tg3_interrupt<br />

1.51 0.29 15.41 240k 1.22u 5.95u net_rx_action<br />

.<br />

.<br />

.<br />

Call-graph table:<br />

index %time self children called name<br />

<br />

[176] 69.4 30.5m 13.4 - cpu_idle<br />

29.5m 0.285 231k/457k schedule [164]<br />

10.0m 0.00 244k/244k check_pgt_cache [178]<br />

13.4 0.00 215k/215k default_idle [177]<br />

----------------------------------------------------<br />

.<br />

.<br />

.<br />

----------------------------------------------------<br />

0.293 1.14 240k __do_softirq [40]<br />

[56] 7.4 0.293 1.14 240k net_rx_action<br />

0.487 0.649 233k/233k tg3_poll [57]<br />

----------------------------------------------------<br />

0.487 0.649 233k net_rx_action [56]<br />

[57] 5.9 0.487 0.649 233k tg3_poll<br />

- 0.00 229k/229k tg3_enable_ints [133]<br />

97.7m 0.552 225k/225k tg3_rx [61]<br />

- 0.00 227k/227k tg3_tx [58]<br />

----------------------------------------------------<br />

.<br />

.<br />

.<br />

----------------------------------------------------<br />

- 1.88 348k ia64_leave_kernel [10]<br />

[11] 9.7 - 1.88 348k ia64_handle_irq<br />

- 1.52 239k/240k do_softirq [39]<br />

- 0.367 356k/356k do_IRQ [12]<br />

----------------------------------------------------<br />

.<br />

.<br />

.<br />

Figure 10: q-view output for TCP_RR over tg3 v3.6



fore going through the rest of the output.<br />

6 Conclusion<br />

6.1 More pfmon examples<br />

CPU L2 cache misses in one kernel function<br />

pfmon --verb -k \<br />

--irange=sba_alloc_range \<br />

-el2_misses --system-wide \<br />

--session-timeout=10<br />

Show all L2 cache misses in<br />

sba_alloc_range. This is interesting<br />

since sba_alloc_range() walks<br />

a bitmap to look for “free” resources.<br />

<strong>One</strong> can instead specify -el3_misses<br />

since L3 cache misses are much more<br />

expensive.<br />

CPU 1 memory loads<br />

pfmon --us-c \<br />

--cpu-list=1 \<br />

-e loads_retired \<br />

-k --system-wide \<br />

-- /tmp/pktgen-single<br />

Only count memory loads on CPU 1. This is<br />

useful for when we can bind the interrupt<br />

to CPU 1 and the workload to a different<br />

CPU. This lets us separate interrupt path<br />

from base level code, i.e., when is the<br />

load happening (before or after DMA<br />

occurred) and which code path should<br />

one be looking more closely at.<br />

List EAR events supported pfmon -lear<br />

List all EAR types supported by pfmon 4 .<br />

More info on Event pfmon -i DATA_EAR_TLB_ALL<br />

pfmon can provide more info<br />

on particular events it supports.<br />

4 EAR isn’t supported until pfmon v3.1<br />

6.2 And thanks to. . .<br />

Special thanks to Stephane Eranian [2] for dedicating<br />

so much time to the perfmon kernel<br />

driver and associated tools. People might think<br />

the PMU does it all—but only with a lot of SW<br />

driving it. His review of this paper caught some<br />

good bloopers. This talk only happened because<br />

I sit across the aisle from him and could<br />

pester him regularly.<br />

Thanks to David Mosberger [5] for putting together<br />

q-tools and making it so trivial to use.<br />

In addition, in no particular order:<br />

Christophe de Dinechin, Bjorn Helgaas,<br />

Matthew Wilcox, Andrew Patterson, Al Stone,<br />

Asit Mallick, and James Bottomley for reviewing<br />

this document or providing technical guidance.<br />

Thanks also to the OLS staff for making this<br />

event happen every year.<br />

My apologies if I omitted other contributors.<br />

References<br />

[1] perfmon homepage, http://www.hpl.hp.com/research/linux/perfmon/<br />

[2] Stephane Eranian, http://www.gelato.org/community/gelato_meeting.php?id=CU2004#talk22<br />

[3] <strong>The</strong> IA-32 Intel(R) Architecture Software Developer’s Manuals, http://www.intel.com/design/pentium4/manuals/253668.htm<br />

[4] q-tools homepage, http://www.hpl.hp.com/research/linux/q-tools/<br />

[5] David Mosberger, http://www.gelato.org/community/gelato_meeting.php?id=CU2004#talk19<br />

[6] qprof homepage, http://www.hpl.hp.com/research/linux/qprof/<br />

[7] netperf homepage, http://www.netperf.org/<br />


Carrier Grade Server Features in the <strong>Linux</strong> <strong>Kernel</strong><br />

Towards <strong>Linux</strong>-based Telecom Platforms<br />

Ibrahim Haddad<br />

Ericsson Research<br />

ibrahim.haddad@ericsson.com<br />

Abstract<br />

Traditionally, communications and data service<br />

networks were built on proprietary platforms<br />

that had to meet very specific availability,<br />

reliability, performance, and service response<br />

time requirements. Today, communication<br />

service providers are challenged to cost-effectively<br />

meet their needs for new architectures,<br />

new services, and increased bandwidth,<br />

with highly available, scalable, secure, and<br />

reliable systems that have predictable performance<br />

and that are easy to maintain and upgrade.<br />

This paper presents the technological<br />

trend of migrating from proprietary to open<br />

platforms based on software and hardware<br />

building blocks. It also focuses on the ongoing<br />

work by the Carrier Grade <strong>Linux</strong> working<br />

group at the Open Source Development Labs,<br />

examines the CGL architecture, the requirements<br />

from the latest specification release, and<br />

presents some of the needed kernel features<br />

that are not currently supported by <strong>Linux</strong> such<br />

as a <strong>Linux</strong> cluster communication mechanism,<br />

a low-level kernel mechanism for improved reliability<br />

and soft-realtime performance, support<br />

for multi-FIB, and support for additional<br />

security mechanisms.<br />

1 Open platforms<br />

<strong>The</strong> demand for rich media and enhanced<br />

communication services is rapidly leading to<br />

significant changes in the communication industry,<br />

such as the convergence of data and<br />

voice technologies. <strong>The</strong> transition to packet-based,<br />

converged, multi-service IP networks<br />

requires a carrier grade infrastructure based on<br />

interoperable hardware and software building<br />

blocks, management middleware, and applications,<br />

implemented with standard interfaces.<br />

<strong>The</strong> communication industry is witnessing a<br />

technology trend moving away from proprietary<br />

systems toward open and standardized<br />

systems that are built using modular and flexible<br />

hardware and software (operating system<br />

and middleware) common off-the-shelf components.<br />

<strong>The</strong> trend is to proceed forward delivering<br />

next generation and multimedia communication<br />

services, using open standard carrier<br />

grade platforms. This trend is motivated<br />

by the expectations that open platforms are going<br />

to reduce the cost and risk of developing<br />

and delivering rich communications services.<br />

Also, they will enable faster time to market and<br />

ensure portability and interoperability between<br />

various components from different providers.<br />

<strong>One</strong> frequently asked question is: ’How can we<br />

meet tomorrow’s requirements using existing<br />

infrastructures and technologies?’ Proprietary<br />

platforms are closed systems, expensive to develop,<br />

and often lack support of the current<br />

and upcoming standards. Using such closed<br />

platforms to meet tomorrow’s requirements for<br />

new architectures and services is almost impossible.<br />

A uniform open software environment<br />

with the characteristics demanded by telecom



applications, combined with commercial off-the-shelf<br />

software and hardware components<br />

is a necessary part of these new architectures.<br />

<strong>The</strong> following key industry consortia are defining<br />

hardware and software high availability<br />

specifications that are directly related to telecom<br />

platforms:<br />

1. <strong>The</strong> PCI Industrial Computer Manufacturers<br />

Group [1] (PICMG) defines standards<br />

for high availability (HA) hardware.<br />

2. <strong>The</strong> Open Source Development Labs [2]<br />

(OSDL) Carrier Grade <strong>Linux</strong> [3] (CGL)<br />

working group was established in January<br />

2002 with the goal of enhancing the<br />

<strong>Linux</strong> operating system, to achieve an<br />

Open Source platform that is highly available,<br />

secure, scalable and easily maintained,<br />

suitable for carrier grade systems.<br />

3. <strong>The</strong> Service Availability Forum [4] (SA<br />

Forum) defines the interfaces of HA middleware<br />

and focusing on APIs for hardware<br />

platform management and for application<br />

failover in the application API. SA<br />

compliant middleware will provide services<br />

to an application that needs to be HA<br />

in a portable way.<br />

2 <strong>The</strong> term Carrier Grade<br />

In this paper, we refer to the term Carrier Grade<br />

on many occasions. Carrier grade is a term<br />

for public network telecommunications products<br />

that require reliability of up to 5<br />

or 6 nines of uptime.<br />

• 5 nines refers to 99.999% of uptime per<br />

year (i.e., 5 minutes of downtime per<br />

year). This level of availability is usually<br />

associated with Carrier Grade servers.<br />

• 6 nines refers to 99.9999% of uptime per<br />

year (i.e., 30 seconds of downtime per<br />

year). This level of availability is usually<br />

associated with Carrier Grade switches.<br />
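The downtime figures above follow directly from the availability percentages; the illustrative calculation below (not from the CGL specifications) shows the arithmetic:<br />

```c
/* Downtime per year implied by an availability figure, taking a year
 * as 365.25 days (525960 minutes). Illustrative arithmetic only. */
double downtime_minutes_per_year(double availability)
{
    return (1.0 - availability) * 365.25 * 24.0 * 60.0;
}

/* downtime_minutes_per_year(0.99999)  -> ~5.26 minutes ("5 nines")
 * downtime_minutes_per_year(0.999999) -> ~0.53 minutes, about 32 seconds */
```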

3 <strong>Linux</strong> versus proprietary operating<br />

systems<br />

This section describes briefly the motivating<br />

reasons in favor of using <strong>Linux</strong> on Carrier<br />

Grade systems, versus continuing with proprietary<br />

operating systems. <strong>The</strong>se motivations include:<br />

• Cost: <strong>Linux</strong> is available free of charge in<br />

the form of a downloadable package from<br />

the Internet.<br />

• Source code availability: With <strong>Linux</strong>, you<br />

gain full access to the source code allowing<br />

you to tailor the kernel to your needs.<br />

Figure 1: From Proprietary to Open Solutions<br />

<strong>The</strong> operating system is a core component in<br />

such architectures. In the remainder of this paper,<br />

we will be focusing on CGL, its architecture<br />

and specifications.<br />

• Open development process (Figure 2):<br />

<strong>The</strong> development process of the kernel is<br />

open to anyone to participate and contribute.<br />

<strong>The</strong> process is based on the concept<br />

of "release early, release often."<br />

• Peer review and testing resources: With<br />

access to the source code, people using a



wide variety of platform, operating system,<br />

and compiler combinations can<br />

compile, link, and run the code on their<br />

systems to test for portability, compatibility,<br />

and bugs.<br />

• Vendor independence: With <strong>Linux</strong>, you are<br />

no longer locked into a specific<br />

vendor. <strong>Linux</strong> is supported on multiple<br />

platforms.<br />

• High innovation rate: New features are<br />

usually implemented on <strong>Linux</strong> before they<br />

are available on commercial or proprietary<br />

systems.<br />

Figure 2: Open development process of the<br />

<strong>Linux</strong> kernel<br />

Other contributing factors include <strong>Linux</strong>’ support<br />

for a broad range of processors and<br />

peripherals, commercial support availability,<br />

high performance networking, and the proven<br />

record of being a stable and reliable server<br />

platform.<br />

4 Carrier Grade <strong>Linux</strong><br />

<strong>The</strong> <strong>Linux</strong> kernel is missing several features<br />

that are needed in a telecom environment. It<br />

is not adapted to meet telecom requirements<br />

in various areas such as reliability, security,<br />

and scalability. To help the advancement of<br />

<strong>Linux</strong> in the telecom space, OSDL established<br />

the CGL working group. <strong>The</strong> group specifies<br />

and helps implement an Open Source platform<br />

targeted for the communication industry that<br />

is highly available, secure, scalable and easily<br />

maintained. <strong>The</strong> CGL working group is composed<br />

of several members from network equipment<br />

providers, system integrators, platform<br />

providers, and <strong>Linux</strong> distributors. <strong>The</strong>y all<br />

contribute to the requirement definition of Carrier<br />

Grade <strong>Linux</strong>, help Open Source projects<br />

to meet these requirements, and in some cases<br />

start new Open Source projects. Many of<br />

the CGL member companies have contributed<br />

pieces of technologies to Open Source in order<br />

to make the <strong>Linux</strong> <strong>Kernel</strong> a more viable option<br />

for telecom platforms. For instance, the Open<br />

Systems Lab [5] from Ericsson Research has<br />

contributed three key technologies: the Transparent<br />

IPC [6], the Asynchronous Event Mechanism<br />

[7], and the Distributed Security Infrastructure<br />

[8]. <strong>The</strong>re are already <strong>Linux</strong> distributions,<br />

MontaVista [9] for instance, that are<br />

providing CGL distributions based on the CGL<br />

requirement definition. Many companies are<br />

also either deploying CGL, or at least evaluating<br />

and experimenting with it.<br />

Consequently, CGL activities are giving much<br />

momentum to <strong>Linux</strong> in the telecom space,<br />

allowing it to be a viable alternative to proprietary<br />

operating systems. Member companies of<br />

CGL are releasing code to Open Source and<br />

are making some of their proprietary technologies<br />

open, which drives the move from<br />

closed platforms to open platforms that use<br />

CGL <strong>Linux</strong>.<br />

5 Target CGL applications<br />

<strong>The</strong> CGL Working Group has identified three<br />

main categories of application areas into which<br />

they expect the majority of applications implemented<br />

on CGL platforms to fall. <strong>The</strong>se appli-



cation areas are gateways, signaling, and management<br />

servers.<br />

• Gateways are bridges between two different<br />

technologies or administration domains.<br />

For example, a media gateway performs<br />

the critical function of converting<br />

voice messages from a native telecommunications<br />

time-division-multiplexed network,<br />

to an Internet protocol packetswitched<br />

network. A gateway processes a<br />

large number of small messages received<br />

and transmitted over a large number of<br />

physical interfaces. Gateways must perform<br />

in a timely manner, very close to hard<br />

real time. <strong>The</strong>y are implemented on dedicated<br />

platforms with replicated (rather<br />

than clustered) systems used for redundancy.<br />

• Signaling servers handle call control, session<br />

control, and radio resource control.<br />

A signaling server handles the routing and<br />

maintains the status of calls over the network.<br />

It takes the request of user agents<br />

who want to connect to other user agents<br />

and routes it to the appropriate signaling.<br />

Signaling servers require soft real-time response<br />

capabilities of less than 80 milliseconds,<br />

and may manage tens of thousands<br />

of simultaneous connections. A signaling<br />

server application is context switch and<br />

memory intensive due to requirements for<br />

quick switching and a capacity to manage<br />

large numbers of connections.<br />

• Management servers handle traditional<br />

network management operations, as well<br />

as service and customer management.<br />

<strong>The</strong>se servers provide services such as: a<br />

Home Location Register and Visitor Location<br />

Register (for wireless networks)<br />

or customer information (such as personal<br />

preferences including features the<br />

customer is authorized to use). Typically,<br />

management applications are data<br />

and communication intensive. <strong>The</strong>ir response<br />

time requirements are less stringent<br />

by several orders of magnitude, compared<br />

to those of signaling and gateway<br />

applications.<br />

6 Overview of the CGL working<br />

group<br />

<strong>The</strong> CGL working group has the vision that<br />

next-generation and multimedia communication<br />

services can be delivered using <strong>Linux</strong><br />

based open standards platforms for carrier<br />

grade infrastructure equipment. To achieve this<br />

vision, the working group has setup a strategy<br />

to define the requirements and architecture<br />

for the Carrier Grade <strong>Linux</strong> platform, develop<br />

a roadmap for the platform, and promote the<br />

development of a stable platform upon which<br />

commercial components and services can be<br />

deployed.<br />

In the course of achieving this strategy, the<br />

OSDL CGL working group is creating the requirement<br />

definitions, and identifying existing<br />

Open Source projects that support the roadmap<br />

to implement the required components and interfaces<br />

of the platform. When an Open Source<br />

project does not exist to support a certain requirement,<br />

OSDL CGL is launching (or support<br />

the launch of) new Open Source projects<br />

to implement missing components and interfaces<br />

of the platform.<br />

<strong>The</strong> CGL working group consists of three distinct<br />

sub-groups that work together. <strong>The</strong>se subgroups<br />

are: specification, proof-of-concept,<br />

and validation. Responsibilities of each subgroup<br />

are as follows:<br />

1. Specifications: <strong>The</strong> specifications subgroup<br />

is responsible for defining a set of


<strong>Linux</strong> Symposium 2004 • Volume <strong>One</strong> • 259<br />

requirements that lead to enhancements in<br />

the Linux kernel that are useful for carrier<br />

grade implementations and applications.<br />

<strong>The</strong> group collects, categorizes, and<br />

prioritizes the requirements from participants<br />

to allow reasonable work to proceed<br />

on implementations. <strong>The</strong> group also interacts<br />

with other standard defining bodies,<br />

open source communities, developers<br />

and distributions to ensure that the requirements<br />

identify useful enhancements<br />

in such a way that they can be adopted<br />

into the base <strong>Linux</strong> kernel.<br />

2. Proof-of-Concept: This sub-group generates<br />

documents covering the design, features,<br />

and technology relevant to CGL. It<br />

drives the implementation and integration<br />

of core Carrier Grade enhancements to<br />

<strong>Linux</strong> as identified and prioritized by the<br />

requirement document. <strong>The</strong> group is also<br />

responsible for ensuring the integrated enhancements<br />

pass the CGL validation test<br />

suite and for establishing and leading an<br />

open source umbrella project to coordinate<br />

implementation and integration activities<br />

for CGL enhancements.<br />

3. Validation: This sub-group defines standard<br />

test environments for developing validation<br />

suites. It is responsible for coordinating<br />

the development of validation<br />

suites, to ensure that all of the CGL requirements<br />

are covered. This group is<br />

also responsible for the development of<br />

an Open Source CGL validation suite<br />

project.<br />

7 CGL architecture<br />

Figure 3 presents the scope of the CGL Working<br />

Group, which covers two areas:<br />

Figure 3: CGL architecture and scope<br />

• Carrier Grade Linux: Various requirements<br />

such as availability and scalability<br />

are related to the CGL enhancements to<br />

the operating system. Enhancements may<br />

also be made to hardware interfaces, interfaces<br />

to the user level or application code<br />

and interfaces to development and debugging<br />

tools. In some cases, to access the<br />

kernel services, user level library changes<br />

will be needed.<br />

• Software Development Tools: <strong>The</strong>se tools<br />

will include debuggers and analyzers.<br />

On October 9, 2003, OSDL announced<br />

the availability of the OSDL Carrier<br />

Grade <strong>Linux</strong> Requirements Definition,<br />

Version 2.0 (CGL 2.0). This latest requirement<br />

definition for next-generation<br />

carrier grade <strong>Linux</strong> offers major advances<br />

in security, high availability, and clustering.<br />

8 CGL requirements<br />

<strong>The</strong> requirement definition document of CGL<br />

version 2.0 introduced new and enhanced features<br />

to support <strong>Linux</strong> as a carrier grade platform.<br />

<strong>The</strong> CGL requirement definition divides<br />

the requirements into main categories, described<br />

briefly below:



8.1 Clustering<br />

<strong>The</strong>se requirements support the use of multiple<br />

carrier server systems to provide higher levels<br />

of service availability through redundant resources<br />

and recovery capabilities, and to provide<br />

a horizontally scaled environment supporting<br />

increased throughput.<br />

8.2 Security<br />

<strong>The</strong> security requirements are aimed at maintaining<br />

a certain level of security while not endangering<br />

the goals of high availability, performance,<br />

and scalability. <strong>The</strong> requirements support<br />

the use of additional security mechanisms<br />

to protect the systems against attacks from both<br />

the Internet and intranets, and provide special<br />

mechanisms at kernel level to be used by telecom<br />

applications.<br />

8.3 Standards<br />

CGL specifies standards with which carrier<br />

grade server systems must comply.<br />

Examples of these standards include:<br />

• <strong>Linux</strong> Standard Base<br />

• POSIX Timer Interface<br />

• POSIX Signal Interface<br />

• POSIX Message Queue Interface<br />

• POSIX Semaphore Interface<br />

• IPv6 RFCs compliance<br />

• IPsecv6 RFCs compliance<br />

• MIPv6 RFCs compliance<br />

• SNMP support<br />

• POSIX threads<br />

8.4 Platform<br />

OSDL CGL specifies requirements that support<br />

interactions with the hardware platforms<br />

making up carrier server systems. Platform capabilities<br />

are not tied to a particular vendor’s<br />

implementation. Examples of the platform requirements<br />

include:<br />

• Hot insert: supports hot-swap insertion of<br />

hardware components<br />

• Hot remove: supports hot-swap removal<br />

of hardware components<br />

• Remote boot support: supports remote booting functionality<br />

• Boot cycle detection: supports detecting<br />

reboot cycles due to recurring failures.<br />

If the system experiences a problem that<br />

causes it to reboot repeatedly, the system<br />

will go offline. This is to prevent additional<br />

difficulties from occurring as a result<br />

of the repeated reboots<br />

• Diskless systems: Provide support for<br />

diskless systems loading their kernel/application<br />

over the network<br />

• Support remote booting across common<br />

LAN and WAN communication media<br />
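The boot cycle detection requirement above amounts to a small policy check at startup; a sketch of the logic follows (the threshold, window, and persisted boot history are assumptions for illustration, not part of the requirement):

```python
# Illustrative boot-cycle detection logic: if too many boots fall
# within a short window, assume a reboot loop and stay offline
# instead of booting again. Threshold values are invented here;
# a real system would make them configurable policy.
MAX_BOOTS = 3        # boots allowed...
WINDOW = 600         # ...within this many seconds

def should_go_offline(boot_times, now):
    """boot_times: timestamps of previous boots, persisted across reboots."""
    recent = [t for t in boot_times if now - t < WINDOW]
    return len(recent) >= MAX_BOOTS

# A crash-looping system: three boots within two minutes of "now"
history = [1000.0, 1040.0, 1090.0]
print(should_go_offline(history, now=1100.0))   # True: stay offline
```

The decision depends only on the persisted boot history, so it can run very early in the boot sequence, before the fault that causes the loop is reached.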

8.5 Availability<br />

<strong>The</strong> availability requirements support heightened<br />

availability of carrier server systems, such<br />

as by improving the robustness of software components<br />

or by supporting recovery from failure<br />

of hardware or software. Examples of these requirements<br />

include:<br />

• RAID 1: support for RAID 1 offers mirroring<br />

to provide duplicate sets of all data<br />

on separate hard disks



• Watchdog timer interface: support for<br />

watchdog timers to perform certain specified<br />

operations when timeouts occur<br />

• Support for Disk and volume management:<br />

to allow grouping of disks into volumes<br />

• Ethernet link aggregation and link<br />

failover: support bonding of multiple NICs<br />

for bandwidth aggregation and provide<br />

automatic failover of IP addresses from<br />

one interface to another<br />

• Support for application heartbeat monitor:<br />

monitors application availability and<br />

functionality.<br />
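The application heartbeat monitor in the last bullet can be sketched as follows (a user-space toy with an invented timeout, not any particular CGL implementation):

```python
import time

# Toy application heartbeat monitor: an application is considered
# failed when it has not "beaten" within the timeout. A real
# monitor would trigger recovery (restart, failover) instead of
# merely reporting. The 5-second timeout is an example value.
class HeartbeatMonitor:
    def __init__(self, timeout=5.0):
        self.timeout = timeout
        self.last_beat = {}            # app name -> last heartbeat time

    def beat(self, app, now=None):
        self.last_beat[app] = time.monotonic() if now is None else now

    def failed_apps(self, now=None):
        now = time.monotonic() if now is None else now
        return [app for app, t in self.last_beat.items()
                if now - t > self.timeout]

mon = HeartbeatMonitor(timeout=5.0)
mon.beat("signaling-server", now=100.0)
mon.beat("billing", now=103.0)
print(mon.failed_apps(now=107.0))      # ['signaling-server']
```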

8.6 Serviceability<br />

<strong>The</strong> serviceability requirements support servicing<br />

and managing hardware and software on<br />

carrier server systems. These are a wide-ranging<br />

set of requirements that, taken together, help support the<br />

availability of applications and the operating<br />

system. Examples of these requirements include:<br />

• Support for producing and storing kernel<br />

dumps<br />

• Support for dynamic debug to allow the dynamic<br />

insertion of software instrumentation<br />

into a running system in the<br />

kernel or applications<br />

• Support for a platform signal handler<br />

infrastructure to allow interrupts<br />

generated by hardware errors to be logged<br />

using the event logging mechanism<br />

• Support for remote access to event log information<br />

8.7 Performance<br />

OSDL CGL specifies the requirements that<br />

support performance levels necessary for the<br />

environments expected to be encountered by<br />

carrier server systems. Examples of these requirements<br />

include:<br />

• Support for application (pre) loading.<br />

• Support for soft real time performance<br />

through configuring the scheduler to provide<br />

soft real time support with latency of<br />

10 ms.<br />

• Support for kernel preemption.<br />

• RAID 0 support: RAID Level 0 provides<br />

"disk striping" support to enhance<br />

performance for request-rate-intensive or<br />

transfer-rate-intensive environments<br />
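The "disk striping" of RAID 0 is just a deterministic block-to-disk mapping; a simplified sketch (stripe size and disk count are example values, not from the requirement):

```python
# Simplified RAID 0 block mapping: logical blocks are striped
# round-robin across disks in fixed-size stripe units, so
# sequential I/O engages all spindles in parallel.
STRIPE_BLOCKS = 4   # blocks per stripe unit (example value)
NUM_DISKS = 3       # disks in the array (example value)

def locate(logical_block):
    """Return (disk index, block offset on that disk)."""
    stripe_unit = logical_block // STRIPE_BLOCKS
    disk = stripe_unit % NUM_DISKS
    offset = (stripe_unit // NUM_DISKS) * STRIPE_BLOCKS \
             + logical_block % STRIPE_BLOCKS
    return disk, offset

# Three consecutive stripe units land on three different disks:
print(locate(0), locate(4), locate(8))   # (0, 0) (1, 0) (2, 0)
```

Note that there is no redundancy in this mapping: losing any one disk loses part of every large file, which is why RAID 0 appears under performance rather than availability.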

8.8 Scalability<br />

<strong>The</strong>se requirements support vertical and horizontal<br />

scaling of carrier server systems, such that<br />

the addition of hardware resources results in<br />

acceptable increases in capacity.<br />

8.9 Tools<br />

<strong>The</strong> tools requirements provide capabilities to<br />

facilitate diagnosis. Examples of these requirements<br />

include:<br />

• Support the usage of a kernel debugger.<br />

• Support for <strong>Kernel</strong> dump analysis.<br />

• Support for debugging multi-threaded<br />

programs



9 CGL 3.0<br />

<strong>The</strong> work on the next version of the OSDL<br />

CGL requirements, version 3.0, started in January<br />

2004 with focus on advanced requirement<br />

areas such as manageability, serviceability,<br />

tools, security, standards, performance,<br />

hardware, clustering and availability. With the<br />

success of CGL’s first two requirement documents,<br />

the OSDL CGL working group anticipates<br />

that their third version will be quite beneficial<br />

to the Carrier Grade ecosystem. Official release<br />

of the CGL requirement document Version<br />

3.0 is expected in October 2004.<br />

10 CGL implementations<br />

There are several enhancements to the Linux kernel that are required by the communication industry to help adopt Linux on carrier grade platforms and support telecom applications. These enhancements (Figure 4) fall into the following categories: availability, security, serviceability, performance, scalability, reliability, standards, and clustering.<br />

Figure 4: CGL enhancements areas<br />

The implementations providing these enhancements are Open Source projects, planned for integration with the Linux kernel once the implementations are mature and ready for merging with the kernel code. In some cases, bringing a project to the required maturity level takes a considerable amount of time before its integration into the Linux kernel can be requested. Nevertheless, some of the enhancements are targeted for inclusion in kernel version 2.7; other enhancements will follow in later kernel releases. Meanwhile, all enhancements, in the form of packages, kernel modules, and patches, are available from their respective project web sites. The CGL 2.0 requirements are in line with the Linux development community: the purpose of the project is to act as a catalyst, capturing common requirements from end users for a CGL distribution. With a common set of requirements from the major Network Equipment Providers, developers can be much more productive and efficient within development projects. Many individuals within the CGL initiative are also active participants and contributors in the Open Source development community.<br />

11 Examples of needed features in the Linux Kernel<br />

In this section, we provide some examples of features and mechanisms missing from the Linux kernel that are necessary in a telecom environment.<br />

11.1 Transparent Inter-Process and Inter-Processor Communication Protocol for Linux Clusters<br />

Today’s telecommunication environments are<br />

increasingly adopting clustered servers to gain<br />

benefits in performance, availability, and scalability.<br />

<strong>The</strong> resulting benefits of a cluster<br />

are greater or more cost-efficient than what a<br />

single server can provide. Furthermore, the<br />

telecommunications industry interest in clustering<br />

originates from the fact that clusters<br />

address carrier grade characteristics such as<br />

guaranteed service availability, reliability and



scaled performance, using cost-effective hardware<br />

and software. Without being absolute<br />

about these requirements, they can be divided<br />

into three categories: short failure detection<br />

and failure recovery, guaranteed availability of<br />

service, and short response times. <strong>The</strong> most<br />

widely adopted clustering technique is the use of<br />

multiple interconnected loosely coupled nodes<br />

to create a single highly available system.<br />

<strong>One</strong> missing feature from the <strong>Linux</strong> kernel in<br />

this area is a reliable, efficient, and transparent<br />

inter-process and inter-processor communication<br />

protocol. Transparent Inter Process<br />

Communication (TIPC) [6] is a suitable Open<br />

Source implementation that fills this gap and<br />

provides an efficient cluster communication<br />

protocol. It leverages the particular conditions<br />

present within loosely coupled clusters.<br />

It runs on <strong>Linux</strong> and is provided as a portable<br />

source code package implementing a loadable<br />

kernel module.<br />

TIPC is unique because there seems to be no<br />

other protocol providing a comparable combination<br />

of versatility and performance. It<br />

includes some original innovations such as<br />

the functional addressing, the topology subscription<br />

services, and the reactive connection<br />

concept. Other important TIPC features<br />

include full location transparency, support<br />

for lightweight connections, reliable multicast,<br />

signaling link protocol, topology subscription<br />

services and more.<br />
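The functional addressing mentioned above can be illustrated with a toy user-space model: servers bind a (service type, instance) address, and clients send to that address without ever naming a node. This is a simplification of the concept only, not TIPC's actual API; the class and names are invented:

```python
# Toy model of functional (service) addressing: messages are sent
# to a (service_type, instance) address, and the cluster resolves
# that address to whichever node currently binds it. Location
# transparency means client code never names a node.
class Cluster:
    def __init__(self):
        self.bindings = {}   # (service_type, instance) -> node name

    def bind(self, node, service_type, instance):
        self.bindings[(service_type, instance)] = node

    def send(self, service_type, instance, message):
        node = self.bindings.get((service_type, instance))
        if node is None:
            raise LookupError("no server bound to this address")
        return f"{message!r} delivered to {node}"

cluster = Cluster()
cluster.bind("node-7", service_type=42, instance=1)   # server side
print(cluster.send(42, 1, "hello"))   # client never names node-7
```

If the service later rebinds on another node (for example after a failover), clients keep sending to the same functional address and are unaffected, which is the property that makes this style attractive for highly available clusters.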

TIPC should be regarded as a useful toolbox<br />

for anyone wanting to develop or use Carrier<br />

Grade or Highly Available <strong>Linux</strong> clusters. It<br />

provides the necessary infrastructure for cluster,<br />

network and software management functionality,<br />

as well as a good support for designing<br />

site-independent, scalable, distributed,<br />

high-availability and high-performance applications.<br />

It is also worthwhile to mention that the<br />

ForCES (Forwarding and Control Element<br />

WG) [11] working group within IETF has<br />

agreed that their router internal protocol (the<br />

ForCES protocol) must be possible to carry<br />

over different types of transport protocols.<br />

There is consensus that TCP is the protocol<br />

to be used when ForCES messages are<br />

transported over the Internet, while TIPC is<br />

the protocol to be used in closed environments<br />

(LANs), where special characteristics such as<br />

high performance and multicast support are desirable.<br />

Other protocols may also be added as<br />

options.<br />

TIPC is a contribution from Ericsson [5] to<br />

the Open Source community. TIPC was announced<br />

on LKML on June 28, 2004; it is licensed<br />

under a dual GPL and BSD license.<br />

11.2 IPv4, IPv6, MIPv6 forwarding tables fast<br />

access and compact memory with multiple<br />

FIB support<br />

Routers are core elements of modern telecom networks. They propagate and direct billions of data packets from their source to their destination over the air or through high-speed links. They must operate as fast as the medium in order to deliver the best quality of service and have a negligible effect on communications. To give some figures, it is common for routers to manage between 10,000 and 500,000 routes. In these situations, good performance is achievable by handling around 2,000 routes/sec. The current implementation of the IP stack in Linux works fine for home or small business routers. However, given the high expectations of telecom operators and the new capabilities of telecom hardware, it is barely possible to use Linux as an efficient forwarding and routing element of a high-end router for large networks (core/border/access routers) or a high-end server with routing capabilities.<br />



One problem with the networking stack in Linux is the lack of support for multiple forwarding information bases (multi-FIB) with overlapping interface IP addresses, and the lack of appropriate interfaces for addressing a FIB. Another problem with the current implementation is the limited scalability of the routing table.<br />

The solution to these problems is to provide support for multi-FIB with overlapping IP addresses. As such, we can have independent networks on different VLANs or on different physical interfaces within the same Linux box. For example, we can have two HTTP servers serving two different networks with potentially the same IP address: one HTTP server will serve network/FIB 10, and the other will serve network/FIB 20. The advantage gained is to have one Linux box serving two different customers using the same IP address. ISPs adopt this approach by providing services for multiple customers sharing the same server (server partitioning), instead of using a server per customer.<br />

The way to achieve this is to have an ID (an identifier that identifies the customer or user of the service) to completely separate the routing tables in memory. Two approaches exist: the first is to have separate routing tables, where each routing table is looked up by its ID and, within that table, the lookup is done on the prefix. The second approach is to have one table, where the lookup is done on the combined key = prefix + ID.<br />
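The combined-key approach can be sketched in user space as follows (the class and its linear longest-prefix scan are illustrative assumptions, not a proposed kernel design, which would use a radix or Patricia tree as discussed below):

```python
import ipaddress

# Toy lookup table for the "combined key = prefix + ID" approach:
# routes from different FIBs coexist in one table because the FIB
# ID is part of the key. Longest-prefix match is done per FIB ID.
class MultiFibTable:
    def __init__(self):
        self.routes = {}   # (fib_id, network) -> next hop

    def add(self, fib_id, prefix, nexthop):
        self.routes[(fib_id, ipaddress.ip_network(prefix))] = nexthop

    def lookup(self, fib_id, addr):
        addr = ipaddress.ip_address(addr)
        best = None
        for (fid, net), nh in self.routes.items():
            if fid == fib_id and addr in net:
                if best is None or net.prefixlen > best[0].prefixlen:
                    best = (net, nh)
        return best[1] if best else None

fib = MultiFibTable()
# The same 10.0.0.0/8 prefix exists independently in FIB 10 and 20:
fib.add(10, "10.0.0.0/8", "via eth0")
fib.add(20, "10.0.0.0/8", "via eth1")
print(fib.lookup(10, "10.1.2.3"))   # via eth0
print(fib.lookup(20, "10.1.2.3"))   # via eth1
```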

A different kind of problem arises from unpredictable access times caused by chaining in the hash table of the routing cache (and FIB). This problem is of particular interest in an environment that requires predictable performance.<br />

Another aspect of the problem is that the route cache and the routing table are not kept synchronized most of the time (path MTU, to name just one case). The route cache is flushed regularly; therefore, any updates on the cache are lost. For example, after a routing cache flush, you have to rebuild every route you are currently talking to by going through every route in the hash/trie table and rebuilding the information. First, you look in the routing cache, and on a miss you must go to the hash/trie table. This process is very slow and unpredictable, since the hash/trie table is implemented with linked lists and there is high potential for collisions when a large number of routes are present. This design is suitable for a home PC with a few routes, but it is not scalable for a large server.<br />
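The lookup path just described (cache first, full table on a miss, a flush discarding all cached state) can be modeled in a few lines; this toy stands in for the kernel data structures and is invented for illustration:

```python
# Toy model of the route cache / routing table interaction: lookups
# hit the cache first; a flush discards all cached entries (including
# any per-route updates such as path MTU), forcing slow full-table
# lookups until the cache is repopulated.
class Router:
    def __init__(self, table):
        self.table = table     # full routing table (slow lookup)
        self.cache = {}        # route cache (fast lookup)
        self.slow_lookups = 0

    def route(self, dst):
        if dst in self.cache:
            return self.cache[dst]
        self.slow_lookups += 1           # cache miss: walk the table
        entry = self.table[dst]
        self.cache[dst] = entry
        return entry

    def flush_cache(self):
        self.cache.clear()               # cached updates are lost

r = Router({"10.1.2.3": "eth0"})
r.route("10.1.2.3"); r.route("10.1.2.3")
print(r.slow_lookups)    # 1: second lookup was served from the cache
r.flush_cache()
r.route("10.1.2.3")
print(r.slow_lookups)    # 2: the flush forced another slow lookup
```

With thousands of active routes, every flush repeats this slow path once per destination, which is exactly the unpredictability the text objects to.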

To support the various routing requirements<br />

of server nodes operating in high performance<br />

and mission critical environments,<br />

<strong>Linux</strong> should support the following:<br />

• Implementation of multi-FIB using tree<br />

(radix, patricia, etc.): It is very important<br />

to have predictable performance in insert/delete/lookup<br />

from 10,000 to 500,000<br />

routes. In addition, it is favourable to have<br />

the same data structure for both IPv4 and<br />

IPv6.<br />

• Socket and ioctl interfaces for addressing<br />

multi-FIB.<br />

• Multi-FIB support for neighbors (ARP).<br />

Providing these implementations in <strong>Linux</strong> will<br />

affect a large part of net/core, net/ipv4 and<br />

net/ipv6; these subsystems (mostly network<br />

layer) will need to be re-written. Other areas<br />

will have minimal impact at the source code<br />

level, mostly at the transport layer (socket,<br />

TCP, UDP, RAW, NAT, IPIP, IGMP, etc.).<br />

As for the availability of an Open Source<br />

project that can provide these functionalities,



there exists a project called "<strong>Linux</strong> Virtual<br />

Routing and Forwarding" [12]. This project<br />

aims to implement a flexible and scalable<br />

mechanism for providing multiple routing instances<br />

within the <strong>Linux</strong> kernel. <strong>The</strong> project<br />

has some potential in providing the needed<br />

functionalities; however, no progress has been<br />

made since 2002 and the project seems to be<br />

inactive.<br />

11.3 Run-time Authenticity Verification for Binaries<br />

<strong>Linux</strong> has generally been considered immune<br />

to the spread of viruses, backdoors and Trojan<br />

programs on the Internet. However, with<br />

the increasing popularity of <strong>Linux</strong> as a desktop<br />

platform, the risk of seeing viruses or Trojans<br />

developed for this platform is rapidly<br />

growing. To alleviate this problem, the system<br />

should prevent, at run time, the execution<br />

of untrusted software. One solution is<br />

to digitally sign the trusted binaries and have<br />

the system check the digital signature of binaries<br />

before running them. <strong>The</strong>refore, untrusted<br />

(not signed) binaries are denied execution.<br />

This can improve the security of the system<br />

by avoiding a wide range of malicious binaries<br />

like viruses, worms, Trojan programs and<br />

backdoors from running on the system.<br />

DigSig [13] is a <strong>Linux</strong> kernel module that<br />

checks the signature of a binary before running<br />

it. It inserts digital signatures inside the ELF<br />

binary and verifies this signature before loading<br />

the binary. It is based on the <strong>Linux</strong> Security<br />

Module (LSM) hooks; LSM has been integrated<br />

into the Linux kernel since the 2.5 development series.<br />

Typically, in this approach, vendors do not sign<br />

binaries; the control of the system remains with<br />

the local administrator, who is<br />

responsible for signing all binaries they trust with<br />

their private key. <strong>The</strong>refore, DigSig guarantees<br />

two things: (1) if you signed a binary, nobody<br />

other than yourself can modify that binary<br />

without being detected. (2) Nobody can run a<br />

binary which is not signed or badly signed.<br />
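The two guarantees can be illustrated with a toy scheme in which an HMAC over the file contents stands in for DigSig's actual embedded public-key signatures (the key and helper names are invented for the example; DigSig itself stores real signatures in ELF sections and verifies them from an LSM hook):

```python
import hashlib
import hmac

# Toy model of run-time binary authentication. An HMAC over the
# file contents stands in for the signature; the key name is
# invented for this example.
ADMIN_KEY = b"local-admin-signing-key"      # stand-in for the private key

def sign(binary):
    return hmac.new(ADMIN_KEY, binary, hashlib.sha256).digest()

def may_execute(binary, signature):
    """Deny execution of unsigned or tampered binaries."""
    if signature is None:
        return False                        # guarantee (2): unsigned
    return hmac.compare_digest(signature, sign(binary))  # guarantee (1)

program = b"\x7fELF...original program bytes"
sig = sign(program)
print(may_execute(program, sig))              # True
print(may_execute(program + b"trojan", sig))  # False: binary modified
print(may_execute(b"unsigned binary", None))  # False: not signed
```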

There have already been several initiatives in<br />

this domain, such as Tripwire [14], BSign [15],<br />

Cryptomark [16], but we believe the DigSig<br />

project is the first to be both easily accessible to<br />

all (available on SourceForge, under the GPL<br />

license) and to operate at kernel level on run<br />

time. <strong>The</strong> run time is very important for Carrier<br />

Grade <strong>Linux</strong> as this takes into account the<br />

high availability aspects of the system.<br />

<strong>The</strong> DigSig approach has been using existing<br />

solutions like GnuPG [17] and BSign (a<br />

Debian package) rather than reinventing the<br />

wheel. However, in order to reduce the overhead<br />

in the kernel, the DigSig project only took<br />

the minimum code necessary from GnuPG.<br />

This helped greatly to reduce the amount of<br />

imported code (only 1/10 of the original GnuPG 1.2.2<br />

source code has been imported to the kernel<br />

module).<br />

DigSig is a contribution from Ericsson [5] to<br />

the Open Source community. It was released<br />

under the GPL license and it is available from<br />

[8].<br />

DigSig has been announced on LKML [18] but<br />

it is not yet integrated into the Linux kernel.<br />

11.4 Efficient Low-Level Asynchronous Event<br />

Mechanism<br />

Carrier grade systems must provide a 5-nines<br />

availability, a maximum of five minutes per<br />

year of downtime, which includes hardware,<br />

operating system, software upgrade and maintenance.<br />

Operating systems for such systems<br />

must ensure that they can deliver a high response<br />

rate with minimum downtime. In addition,<br />

carrier-grade systems must take into<br />

account characteristics such as scalability,<br />

high availability, and performance. In carrier<br />

grade systems, thousands of requests must<br />

be handled concurrently without affecting the<br />

overall system’s performance, even under extremely<br />

high loads. Subscribers can expect<br />

some latency time when issuing a request, but<br />

they are not willing to accept an unbounded<br />

response time. Such transactions are not handled<br />

instantaneously for many reasons, and it<br />

can take some milliseconds or seconds to reply.<br />

Waiting for an answer reduces applications<br />

abilities to handle other transactions.<br />
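The five-nines figure quoted above translates directly into the stated downtime budget; a quick arithmetic check:

```python
# Downtime budget implied by "five nines" (99.999%) availability,
# counting a year as 365.25 days.
availability = 0.99999
minutes_per_year = 365.25 * 24 * 60          # 525,960 minutes
downtime_minutes = (1 - availability) * minutes_per_year
print(round(downtime_minutes, 2))            # 5.26
```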

Many different solutions have been envisaged<br />

to improve <strong>Linux</strong>’s capabilities in this area using<br />

different types of software organization,<br />

such as multithreaded architectures, implementing<br />

efficient POSIX interfaces, or improving<br />

the scalability of existing kernel routines.<br />

<strong>One</strong> possible solution that is adequate for carrier<br />

grade servers is the Asynchronous Event<br />

Mechanism (AEM), which provides asynchronous<br />

execution of processes in the <strong>Linux</strong><br />

kernel. AEM implements native support<br />

for asynchronous events in the <strong>Linux</strong> kernel<br />

and aims to bring carrier-grade characteristics<br />

to <strong>Linux</strong> in areas of scalability and soft realtime<br />

responsiveness. In addition, AEM offers<br />

an event-based development framework, scalability,<br />

flexibility, and extensibility.<br />
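As a rough user-space analogy of the asynchronous, event-driven style AEM enables (AEM itself is a kernel mechanism and its API differs; everything here is invented for illustration):

```python
import queue
import threading

# Toy analogy of an asynchronous event mechanism: handlers are
# registered per event, and raising an event never blocks the
# caller; a worker thread delivers events later. This decoupling
# is what lets a server keep handling new requests while earlier
# ones are still in flight.
class EventLoop:
    def __init__(self):
        self.handlers = {}            # event name -> list of callbacks
        self.pending = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def register(self, event, callback):
        self.handlers.setdefault(event, []).append(callback)

    def raise_event(self, event, data=None):
        self.pending.put((event, data))   # returns immediately

    def _run(self):
        while True:
            event, data = self.pending.get()
            for cb in self.handlers.get(event, []):
                cb(data)
            self.pending.task_done()

loop = EventLoop()
loop.register("incoming-call", lambda d: print("handled:", d))
loop.raise_event("incoming-call", {"peer": "10.0.0.1"})
loop.pending.join()                   # wait until delivery completes
```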

Ericsson [5] released AEM to Open Source in<br />

February 2003 under the GPL license. AEM<br />

was announced on the <strong>Linux</strong> <strong>Kernel</strong> Mailing<br />

List (LKML) [20], and received feedback that<br />

resulted in some changes to the design and implementation.<br />

AEM is not yet integrated with<br />

the <strong>Linux</strong> kernel.<br />

12 Conclusion<br />

<strong>The</strong>re are many challenges accompanying the<br />

migration from proprietary to open platforms.<br />

The main challenge remains the availability<br />

of the various kernel features and mechanisms<br />

needed for telecom platforms and integrating<br />

these features in the <strong>Linux</strong> kernel.<br />

References<br />

[1] PCI Industrial Computer Manufacturers<br />

Group,<br />

http://www.picmg.org<br />

[2] Open Source Development Labs,<br />

http://www.osdl.org<br />

[3] Carrier Grade <strong>Linux</strong>,<br />

http://osdl.org/lab_activities<br />

[4] Service Availability Forum,<br />

http://www.saforum.org<br />

[5] Open System Lab,<br />

http://www.linux.ericsson.ca<br />

[6] Transparent IPC,<br />

http://tipc.sf.net<br />

[7] Asynchronous Event Mechanism,<br />

http://aem.sf.net<br />

[8] Distributed Security Infrastructure,<br />

http://disec.sf.net<br />

[9] MontaVista Carrier Grade Edition,<br />

http://www.mvista.com/cge<br />

[10] Make Clustering Easy with TIPC,<br />

<strong>Linux</strong>World Magazine, April 2004<br />

[11] IETF ForCES working group,<br />

http://www.sstanamera.com/~forces<br />

[12] <strong>Linux</strong> Virtual Routing and Forwarding<br />

project,<br />

http://linux-vrf.sf.net<br />

[13] Stop Malicious Code Execution at<br />

<strong>Kernel</strong> Level, <strong>Linux</strong>World Magazine,<br />

January 2004



[14] Tripwire,<br />

http://www.tripwire.com<br />

[15] Bsign,<br />

http://packages.debian.org/bsign<br />

[16] Cryptomark,<br />

http://immunix.org/cryptomark.html<br />

[17] GnuPG,<br />

http://www.gnupg.org<br />

[18] DigSig announcement on LKML,<br />

http://lwn.net/Articles/51007<br />

[19] An Event Mechanism for <strong>Linux</strong>, <strong>Linux</strong><br />

Journal, July 2003<br />

[20] AEM announcement on LKML,<br />

http://lwn.net/Articles/45633<br />

Acknowledgments<br />

Thank you to Ludovic Beliveau, Mathieu<br />

Giguere, Magnus Karlson, Jon Maloy, Mats<br />

Naslund, Makan Pourzandi, and Frederic<br />

Rossi, for their valuable contributions and reviews.




Demands, Solutions, and Improvements for <strong>Linux</strong><br />

Filesystem Security<br />

Michael Austin Halcrow<br />

International Business Machines, Inc.<br />

mike@halcrow.us<br />

Abstract

Securing file resources under Linux is a team effort. No one library, application, or kernel feature can stand alone in providing robust security. Current Linux access control mechanisms work in concert to provide a certain level of security, but they depend upon the integrity of the machine itself to protect that data. Once the data leaves that machine, or if the machine itself is physically compromised, those access control mechanisms can no longer protect the data in the filesystem. At that point, data privacy must be enforced via encryption.

As Linux makes inroads in the desktop market, the need for transparent and effective data encryption increases. To be practically deployable, the encryption/decryption process must be secure, unobtrusive, consistent, flexible, reliable, and efficient. Most encryption mechanisms that run under Linux today fail in one or more of these categories. In this paper, we discuss solutions to many of these issues via the integration of encryption into the Linux filesystem. This will provide access control enforcement on data that is not necessarily under the control of the operating environment. We also explore how stackable filesystems, Extended Attributes, PAM, the GnuPG web of trust, supporting libraries, and applications (such as GNOME/KDE) can all be orchestrated to provide robust encryption-based access control over filesystem content.

1 Development Efforts

This paper is motivated by an effort on the part of the IBM Linux Technology Center to enhance Linux filesystem security through better integration of encryption technology. The author of this paper is working together with the external community and several members of the LTC in the design and development of a transparent cryptographic filesystem layer in the Linux kernel. The "we" in this paper refers to immediate members of the author's development team who are working together on this project, although many others outside that development team have thus far had a significant part in this development effort.

2 Filesystem Security

2.1 Threat Model

Computer users tend to be overly concerned about protecting their credit card numbers from being sniffed as they are transmitted over the Internet. At the same time, many do not think twice when sending equally sensitive information in the clear via an email message. A thief who steals a removable device, laptop, or server can also read the confidential files on those devices if they are left unprotected. Nevertheless, far too many users neglect to take the necessary steps to protect their files from such an event. Your liability limit for unauthorized charges to your credit card is $50 (and most credit card companies waive that liability for victims of fraud); on the other hand, confidentiality cannot be restored once lost.

Today, we see countless examples of neglecting to use encryption to protect the integrity and the confidentiality of sensitive data. Those who are trusted with sensitive information routinely send that information as unencrypted email attachments. They also store that information in clear text on disks, USB keychain drives, backup tapes, and other removable media. GnuPG[7] and OpenSSL[8] provide all the encryption tools necessary to protect this information, but these tools are not used nearly as often as they ought to be.

If required to go through tedious encryption or decryption steps every time they need to work with a file or share it, people will select insecure passwords, transmit passwords in an insecure manner, fail to consider or use public key encryption options, or simply stop encrypting their files altogether. If security is overly obstructive, people will remove it, work around it, or misuse it (thus rendering it less effective). As Linux gains adoption in the desktop market, we need integrated file integrity and confidentiality that is seamless, transparent, easy to use, and effective.

2.2 Integration of File Encryption into the Filesystem

Several solutions exist that solve separate pieces of the problem. In one example highlighting transparency, employees within an organization that uses IBM Lotus Notes [9] for its email will not even notice the complex PKI or the encryption process that is integrated into the product. Encryption and decryption of sensitive email messages are seamless to the end user; they involve checking an "Encrypt" box, specifying a recipient, and sending the message. This effectively addresses a significant file in-transit confidentiality problem. If the local replicated mailbox database is also encrypted, then it also addresses confidentiality on the local storage device, but the protection is lost once the data leaves the domain of Notes (for example, if an attached file is saved to disk). The process must be seamlessly integrated into all relevant aspects of the user's operating environment.

In Section 4, we discuss filesystem security in general under Linux, with an emphasis on confidentiality and integrity enforcement via cryptographic technologies. In Section 6, we propose a mechanism to integrate encryption of files at the filesystem level, including integration of the GnuPG[7] web of trust, PAM[10], a stackable filesystem model[2], Extended Attributes[6], and libraries and applications, in order to make the entire process as transparent as possible to the end user.

3 A Team Effort

Filesystem security encompasses more than just the filesystem itself. It is a team effort, involving the kernel, the shells, the login processes, the filesystems, the applications, the administrators, and the users. When we speak of "filesystem security," we refer to the security of the files in a filesystem, no matter what ends up providing that security.

For any filesystem security problem that exists, there are usually several different ways of solving it. Solutions that involve modifications in the kernel tend to introduce less overhead, since context switches and copying of data between kernel and user memory are reduced. However, changes in the kernel may reduce the efficiency of the kernel's VFS while making it both harder to maintain and more bug-prone. As notable exceptions, Erez Zadok's stackable filesystem framework, FiST[3], and Loop-AES require no changes to the current Linux kernel VFS. Solutions that exist entirely in userspace do not complicate the kernel, but they tend to have more overhead and may be limited in the functionality they are able to provide, as they are limited by the interface to the kernel from userspace. Since they are in userspace, they are also more prone to attack.

4 Aspects of Filesystem Security

Computer security can be decomposed into several areas:

• Identifying who you are and having the machine recognize that identification (authentication).

• Determining whether or not you should be granted access to a resource such as a sensitive file (authorization). This is often based on the permissions associated with the resource by its owner or an administrator (access control).

• Transforming your data into an encrypted format in order to make it prohibitively costly for unauthorized users to decrypt and view (confidentiality).

• Performing checksums, keyed hashes, and/or signing of your data to make unauthorized modifications of your data detectable (integrity).

4.1 Filesystem Integrity

When people consider filesystem security, they traditionally think about access control (file permissions) and confidentiality (encryption). File integrity, however, can be just as important as confidentiality, if not more so. If a script that performs an administrative task is altered in an unauthorized fashion, the script may perform actions that violate the system's security policies. For example, many rootkits modify system startup and shutdown scripts to facilitate the attacker's attempts to record the user's keystrokes, sniff network traffic, or otherwise infiltrate the system.

More often than not, the value of the data stored in files is greater than that of the machine that hosts the files. For example, if an attacker manages to insert false data into a financial report, the alteration to the report may go unnoticed until substantial damage has been done; jobs could be at stake, and in more extreme cases even criminal charges against the user could result. If trojan code sneaks into the source repository for a major project, the public release of that project may contain a backdoor.¹

Many security professionals foresee a nightmare scenario wherein a widely propagated Internet worm quietly alters the contents of word processing and spreadsheet documents. Without any sort of integrity mechanism in place in the vast majority of the desktop machines in the world, nobody would know if any data that traversed vulnerable machines could be trusted. This threat could be very effectively addressed with a combination of a kernel-level mandatory access control (MAC)[11] protection profile and a filesystem that provides integrity and auditing capabilities. Such a combination would be resistant to damage done by a root compromise, especially if aided by a Trusted Platform Module (TPM)[13] using attestation.

¹ A high-profile example of an attempt to do this occurred with the Linux kernel last year. Fortunately, the source code management process used by the kernel developers allowed them to catch the attempted insertion of the trojan code before it made it into the actual kernel.



One can approach filesystem integrity from two angles. The first is to have strong authentication and authorization mechanisms in place that employ sufficiently flexible policy languages. The second is to have an auditing mechanism, to detect unauthorized attempts at modifying the contents of a filesystem.

4.1.1 Authentication and Authorization

The filesystem must contain support for the kernel's security structure, which requires stateful security attributes on each file. Most GNU/Linux applications today use PAM[10] (see Section 4.1.2 below) for authentication and process credentials to represent their authorization; the policy language is limited to what can be expressed using the file owner and group, along with the owner/group/world read/write/execute attributes of the file. The administrator and the current owner have the authority to set the owner of the file or the read/write/execute policies for that file. In many filesystems, files may also contain additional security flags, such as an immutable or append-only flag.

POSIX Access Control Lists (ACLs)[6] provide for more stringent delegations of access authority on a per-file basis. In an ACL, individual read/write/execute permissions can be assigned to the owner, the owning group, individual users, or groups. Masks can also be applied that indicate the maximum effective permissions for a class.

For those who require even more flexible access control, SE Linux[15] uses a powerful policy language that can express a wide variety of access control policies for files and filesystem operations. In fact, Linux Security Module (LSM)[14] hooks (see Section 4.1.3 below) exist for most of the security-relevant filesystem operations, which makes it easier to implement custom filesystem-agnostic security models. Authentication and authorization are pretty well covered with a combination of existing filesystem, kernel, and user-space solutions that are part of most GNU/Linux distributions. Many distributions could, however, do a better job of aiding both the administrator and the user in understanding and using all the tools that they have available to them.

Policies that safeguard sensitive data should include timeouts, whereby the user must periodically re-authenticate in order to continue to access the data. In the event that the authorized users neglect to lock down the machine before leaving work for the day, timeouts help to keep the custodial staff from accessing the data when they come in at night to clean the office. As usual, this must be implemented in such a way as to be unobtrusive to the user. If a user finds a security mechanism overly imposing or inconvenient, he will usually disable or circumvent it.

4.1.2 PAM

Pluggable Authentication Modules (PAM)[10] implement authentication-related security policies. PAM offers discretionary access control (DAC)[12]; applications must defer to PAM in order to authenticate a user. If the authenticating PAM function that is called returns an affirmative answer, then the application can use that response to authorize the action, and vice versa. The exact mechanism that the PAM function uses to evaluate the authentication depends on the module called.²

In the case of filesystem security and encryption, PAM can be employed to obtain and forward keys to a filesystem encryption layer in kernel space. This would allow seamless integration with any key retrieval mechanism that can be coded as a Pluggable Authentication Module.

² This is parameterizable in the configuration files found under /etc/pam.d/.

4.1.3 LSM

Linux Security Modules (LSM) can provide customized security models. One possible use of LSM is to allow decryption of certain files only when a physical device is connected to the machine. This could be, for example, a USB keychain device, a Smartcard, or an RFID device. Some devices of these classes can also be used to house the encryption keys (retrievable via PAM, as previously discussed).

4.1.4 Auditing<br />

<strong>The</strong> second angle to filesystem integrity is auditing.<br />

Auditing should only fill in where authentication<br />

and authorization mechanisms fall<br />

short. In a utopian world, where security systems<br />

are perfect and trusted people always act<br />

trustworthily, auditing does not have much of<br />

a use. In reality, code that implements security<br />

has defects and vulnerabilities. Passwords can<br />

be compromised, and authorized people can<br />

act in an untrustworthy manner. Auditing can<br />

involve keeping a log of all changes made to<br />

the attributes of the file or to the file data itself.<br />

It can also involve taking snapshots of the attributes<br />

and/or contents of the file and comparing<br />

the current state of the file with what was<br />

recorded in a prior snapshot.<br />

Intrusion detection systems (IDS), such as Tripwire[16], AIDE[17], or Samhain[18], perform auditing functions. As an example, Tripwire periodically scans the contents of the filesystem, checking file attributes such as the size, the modification time, and the cryptographic hash of each file. If any attributes for the files being checked are found to be altered, Tripwire will report it. This approach can work fairly well in cases where the files are not expected to change very often, as is the case with most system scripts, shared libraries, executables, or configuration files. However, care must be taken to assure that the attacker cannot also modify Tripwire's database when he modifies a system file; the integrity of the IDS itself must also be assured.
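The snapshot-and-compare approach that these tools take can be sketched in a few lines of userspace Python. This is an illustrative sketch, not Tripwire's actual implementation; the function names are our own, and, as noted above, a real IDS must also protect the integrity of its own database.

```python
import hashlib
import os

def snapshot(paths):
    """Record the size, modification time, and SHA-1 digest of each file."""
    db = {}
    for path in paths:
        st = os.stat(path)
        with open(path, "rb") as f:
            digest = hashlib.sha1(f.read()).hexdigest()
        db[path] = {"size": st.st_size, "mtime": st.st_mtime, "sha1": digest}
    return db

def changed(old, new):
    """Report files whose recorded attributes no longer match the snapshot."""
    return sorted(p for p in old if old[p] != new.get(p))
```

A periodic job could compare a fresh snapshot against one stored on read-only or remote media and report any mismatches.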

In cases where a file changes often, such as a database file or a spreadsheet file in an active project, we see a need for a more dynamic auditing solution, one that is perhaps more closely integrated with the filesystem itself. In many cases, the simple fact that the file has changed does not imply a security violation. We must also know who made the change. More robust security requirements also demand that we know what parts of the file were changed and when the changes were made. One could even imagine scenarios where the context of the change must also be taken into consideration (i.e., who was logged in, which processes were running, or what network activity was taking place at the time the change was made).

File integrity, particularly in the area of auditing, is perhaps the security aspect of Linux filesystems that could use the most improvement. Most efforts in secure filesystem development have focused on confidentiality more so than integrity, and integrity has been relegated to the domain of userland utilities that must periodically scan the entire filesystem. Sometimes, just knowing that a file has been changed is insufficient. Administrators would like to know exactly how the attacker made the changes and under what circumstances they were made.

Cryptographic hashes are often used. These can detect unauthorized circumvention of the filesystem itself, as long as the attacker forgets (or is unable) to update the hashes when making unauthorized changes to the files. Some auditing solutions, such as the Linux Auditing System (LAuS)³ that is part of SuSE Linux Enterprise Server, can track system calls that affect the filesystem. Another recent addition to the 2.6 Linux kernel is the Light-weight Auditing Framework written by Rik Faith[28]. These are implemented independently of the filesystem itself, and the level of detail in the records is largely limited to the system call parameters and return codes. It is advisable to keep your log files on a separate machine from the one being audited, since the attacker could modify the audit logs themselves once he has compromised the machine's security.

4.1.5 Improvements on Integrity<br />

Extended Attributes provide for a convenient<br />

way to attach metadata relating to a file to the<br />

file itself. On the premise that possession of<br />

a secret equates to authentication, every time<br />

an authenticated subject makes an authorized<br />

write to a file, a hash over the concatenation of<br />

that secret to the file contents (keyed hashing;<br />

HMAC is one popular standard) can be written<br />

as an Extended Attribute on that file. Since<br />

this action would be performed on the filesystem<br />

level, the user would not have to conscientiously<br />

re-run userspace tools to perform such<br />

an operation every time he wants to generate<br />

an integrity verifier on the file.<br />

This is an expensive operation to perform over large files, and so it would be a good idea to define extent sizes over which keyed hashes are formed, with the Extended Attributes including extent descriptors along with the keyed hashes. That way, a small change in the middle of a large file would only require the keyed hash to be re-generated over the extent in which the change occurs. A keyed hash over the sequential set of the extent hashes would also keep an attacker from swapping around extents undetected.

³ Note that LAuS is covered in more detail in the 2004 Ottawa Linux Symposium by Doc Shankar, Emily Ratliff, and Olaf Kirch as part of their presentation regarding CAPP/EAL3+ Certification.
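A userspace sketch of this per-extent keyed hashing, using Python's hmac module. The extent size, the choice of SHA-1, and the function names are illustrative assumptions, not the proposed in-kernel implementation:

```python
import hashlib
import hmac

EXTENT_SIZE = 4096  # assumed extent size: one page

def extent_hashes(secret, data):
    """Keyed hash (HMAC-SHA1) over each fixed-size extent of the file."""
    return [hmac.new(secret, data[off:off + EXTENT_SIZE],
                     hashlib.sha1).digest()
            for off in range(0, len(data), EXTENT_SIZE)]

def root_hash(secret, hashes):
    """Keyed hash over the ordered extent hashes; detects extent swapping."""
    return hmac.new(secret, b"".join(hashes), hashlib.sha1).digest()
```

A write that touches only one extent requires recomputing only that extent's HMAC plus the inexpensive root hash, rather than rehashing the whole file.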

4.2 File Confidentiality

Confidentiality means that only authorized users can read the contents of a file. Sometimes the names of the files themselves or a directory structure can be sensitive. In other cases, the sizes of the files or the modification times can betray more information than one might want to be known. Even the security policies protecting the files can reveal sensitive information. For example, "Only employees of Novell and SuSE can read this file" would imply that Novell and SuSE are collaborating on something, and neither of them may want this fact to be public knowledge as of yet. Many interesting protocols have been developed that can address these sorts of issues; some of them are easier to implement than others.

When approaching the question of confidentiality, we assume that the block device that contains the file is vulnerable to physical compromise. For example, a laptop that contains sensitive material might be lost, or a database server might be stolen in a burglary. In either event, the data on the hard drive must not be readable by an unauthorized individual. If any individual must be authenticated before he is able to access the data, then the data is protected against unauthorized access.

Surprisingly, many users surrender their own data's confidentiality (and more often than not they do so unwittingly). It has been my personal observation that most people do not fully understand the lack of confidentiality afforded their data when they send it over the Internet. To compound this problem, comprehending and even using most encryption tools takes considerable time and effort on the part of most users. If sensitive files could be encrypted by default, only to be decrypted by those authorized at the time of access, then the user would not have to expend so much effort toward protecting the data's confidentiality.

By putting the encryption at the filesystem layer, this model becomes possible without any modifications to the applications or libraries. A policy at that layer can dictate that certain processes, such as the mail client, are to receive the encrypted version of any files that are read from disk.

4.2.1 Encryption

File confidentiality is most commonly accomplished through encryption. For performance reasons, secure filesystems use symmetric key cryptography, like AES or Triple-DES, although an asymmetric public/private keypair may be used to encrypt the symmetric key in some key management schemes. This hybrid approach is in common use in the SSL and PGP encryption protocols.

One of our proposals to extend Cryptfs is to mirror the techniques used in GnuPG encryption. If the symmetric key that protects the contents of a file is encrypted with the public key of the intended recipient of the file and stored as an Extended Attribute of the file, then that file can be transmitted in multiple ways (e.g., on a physical device such as removable storage); as long as the Extended Attributes of the file are preserved across filesystem transfers, then the recipient with the corresponding private key has all the information that his Cryptfs layer needs to transparently decrypt the contents of the file.

4.2.2 Key Management

Key management will make or break a cryptographic filesystem.[5] If the key can be easily compromised, then even the strongest cipher will provide weak protection. If your key is accessible in an unencrypted file or in an unprotected region of memory, or if it is ever transmitted over the network in the clear, a rogue user can capture that key and use it later. Most passwords have poor entropy, which means that an attacker can have pretty good success with a brute force attack against the password. Thus the weakest link in the chain for password-based encryption is usually the password itself. The Cryptographic Filesystem (CFS)[22] mandates that the user choose a password with a length of at least 16 characters.⁴
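The entropy gap is easy to quantify. The back-of-the-envelope estimate below assumes characters chosen uniformly at random, which is generous: real user-chosen passwords are far less random.

```python
import math

def entropy_bits(alphabet_size, length):
    """Bits of entropy in a password drawn uniformly from the alphabet."""
    return length * math.log2(alphabet_size)

# A typical 8-character all-lowercase password versus a 16-character
# password over mixed-case letters and digits (CFS's minimum length).
weak = entropy_bits(26, 8)          # about 37.6 bits
long_mixed = entropy_bits(62, 16)   # about 95.3 bits
```

Each additional ten bits or so multiplies the brute-force work by roughly a thousand, which is why CFS insists on long passphrases.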

Ideally, the key would be kept in password-encrypted form on a removable device (like a USB keychain drive) that is stored separately from the files that the key is used to encrypt. That way, an attacker would have to both compromise the password and gain physical access to the removable device before he could decrypt your files.

Filesystem encryption is one of the most exciting applications for the Trusted Computing Platform. Given that the attacker has physical access to a machine with a Trusted Platform Module, it is significantly more difficult to compromise the key. By using secret sharing (otherwise known as key splitting)[4], the actual key used to decrypt a file on the filesystem can be split between the user's key and the machine's key (as contained in the TPM). In order to decrypt the files, an attacker must not only compromise the user key, but he must also have access to the machine on which the TPM chip is installed. This "binds" the encrypted files to the machine. This is especially useful for protecting files on removable backup media.

⁴ The subject of secure password selection, although an important one, is beyond the scope of this article. Recommended reading on this subject is at http://www.alw.nih.gov/Security/Docs/passwd.html.

4.2.3 Cryptanalysis

All block ciphers and most stream ciphers are, to various degrees, vulnerable to successful cryptanalysis. If a cipher is used improperly, then it may become even easier to discover the plaintext and/or the key. For example, with certain ciphers operating in certain modes, an attacker could discover information that aids in cryptanalysis by getting the filesystem to re-encrypt an already encrypted block of data. Other times, a cryptanalyst can deduce information about the type of data in the encrypted file when that data has predictable segments of data, like a common header or footer (thus allowing for a known-plaintext attack).

4.2.4 Cipher Modes

A block encryption mode that is resistant to cryptanalysis can involve dependencies among chains of bytes or blocks of data. Cipher-block chaining (CBC) mode, for example, provides adequate encryption in many circumstances. In CBC mode, a change to one block of data will require that all subsequent blocks of data be re-encrypted. One can see how this would impact performance for large files, as a modification to data near the beginning of the file would require that all subsequent blocks be read, decrypted, re-encrypted, and written out again.

This particular inefficiency can be effectively addressed by defining chaining extents. By limiting the regions of the file that encompass chained blocks, it is feasible to decrypt and re-encrypt only the smaller segments. For example, if the block size for a cipher is 64 bits (8 bytes), and the sector size, which is (we assume) the minimum unit of data that the block device driver can transfer at a time, is 512 bytes, then one could limit the number of blocks in any extent to 64 cipher blocks (one extent per sector). Depending on the plaintext (and other factors), this may be too few to effectively counter cryptanalysis, and so the extent size could be set to a small multiple of the page size without severely impacting overall performance. The optimal extent size largely depends on the access patterns and data patterns for the file in question; we plan on benchmarking against varying extent lengths under varying access patterns.
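The effect of restarting the chain at each extent boundary can be demonstrated with a toy chained cipher. The sketch below is purely illustrative: the hash-based "cipher" stands in for a real block cipher (the chaining here is CFB-like rather than strict CBC), and a real implementation would also need a distinct initialization vector per extent.

```python
import hashlib

BLOCK = 8            # 64-bit cipher blocks, as in the example above
EXTENT = 64 * BLOCK  # 64 blocks per chaining extent = one 512-byte sector

def _keystream(key, prev):
    # Toy block "cipher": a keyed hash of the previous ciphertext block.
    return hashlib.sha256(key + prev).digest()[:BLOCK]

def encrypt_extent(key, iv, extent):
    out, prev = bytearray(), iv
    for off in range(0, len(extent), BLOCK):
        block = bytes(p ^ k for p, k in
                      zip(extent[off:off + BLOCK], _keystream(key, prev)))
        out += block
        prev = block  # each block depends on the previous ciphertext block
    return bytes(out)

def encrypt(key, data):
    # The chain restarts at every extent boundary, so modifying one extent
    # never forces re-encryption of the extents that follow it.
    return [encrypt_extent(key, bytes(BLOCK), data[off:off + EXTENT])
            for off in range(0, len(data), EXTENT)]
```

Within an extent the chaining dependency still holds (a change to an early block changes all later ciphertext in that extent), but a change confined to one extent leaves every other extent's ciphertext untouched.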

4.2.5 Key Escrow

The proverbial question, "What if the sysadmin gets hit by a bus?" is one that no organization should ever stop asking. In fact, sometimes no one person should alone have independent access to the sensitive data; multiple passwords may be required before the data is decrypted. Shareholders should demand that no single person in the company have full access to certain valuable data, in order to mitigate the damage to the company that could be done by a single corrupt administrator or executive. Methods for secret sharing can be employed to assure that multiple keys be required for file access, and (m,n)-threshold schemes[4] can ensure that the data is retrievable, even if a certain number of the keys are lost. Secret sharing would be easily implementable as part of any of the existing cryptographic filesystems.
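As a simple illustration, the sketch below implements the degenerate (n,n) case via XOR-based key splitting: all n shares are required, and any subset of fewer shares reveals nothing about the key. A true (m,n)-threshold scheme, which tolerates lost shares, requires something like Shamir's polynomial secret sharing; the function names here are our own.

```python
import secrets
from functools import reduce

def _xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def split_key(key, n):
    """Split a key into n shares; all n are needed to reconstruct it."""
    shares = [secrets.token_bytes(len(key)) for _ in range(n - 1)]
    # The final share is the key XORed with all the random shares.
    return shares + [reduce(_xor, shares, key)]

def join_key(shares):
    """Recombine the shares; any missing share leaves the key fully hidden."""
    return reduce(_xor, shares)
```

Because n-1 of the shares are uniformly random, each share in isolation is statistically independent of the key, which is what makes distributing them among multiple custodians safe.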

4.3 File Resilience

The loss of a file can be just as devastating as the compromise of a file. There are many well-established solutions for performing backups of your filesystem, but some cryptographic filesystems preclude the ability to use them efficiently and/or securely. Backup tapes tend to be easier to steal than secure computer systems are, and if unencrypted versions of secure files exist on the tapes, that constitutes an often-overlooked vulnerability.

The Linux 2.6 kernel cryptoloop device⁵ filesystem is an all-or-nothing approach. Most backup utilities must be given free rein on the unencrypted directory listings in order to perform incremental backups. Most other encrypted filesystems keep sets of encrypted files in directories in the underlying filesystem, which makes incremental backups possible without giving the backup tools access to the unencrypted content of the files.

The backup utilities must, however, maintain backups of the metadata in the directories containing the encrypted files in addition to the files themselves. On the other hand, when the filesystem takes the approach of storing the cryptographic metadata as Extended Attributes for each file, then backup utilities need only worry about copying just the file in question to the backup medium (preserving the Extended Attributes, of course).
Attributes, of course).<br />

4.4 Advantages of FS-Level, EA-Guided Encryption<br />

Most encrypted filesystem solutions either operate<br />

on the entire block device or operate on<br />

entire directories. <strong>The</strong>re are several advantages<br />

to implementing filesystem encryption at the<br />

filesystem level and storing encryption metadata<br />

in the Extended Attributes of each file:<br />

• Granularity: Keys can be mapped to individual<br />

files, rather than entire block de-<br />

5 Note that this is deprecated and is in the process of<br />

being replaced with the Device Mapper crypto target.<br />

vices or entire directories.<br />

• Backup Utilities: Incremental backup<br />

tools can correctly operate without having<br />

to have access to the decrypted content of<br />

the files it is backing up.<br />

• Performance: In most cases, only certain<br />

files need to be encrypted. System<br />

libraries and executables, in general, do<br />

not need to be encrypted. By limiting the<br />

actual encryption and decryption to only<br />

those files that really need it, system resources<br />

will not be taxed as much.<br />

• Transparent Operation: Individual encrypted<br />

files can be easily transfered off of<br />

the block device without any extra transformation,<br />

and others with authorization<br />

will be able to decrypt those files. <strong>The</strong><br />

userspace applications and libraries do not<br />

need to be modified and recompiled to<br />

support this transparency.<br />

Since all the information necessary to decrypt a file is contained in the Extended Attributes of the file, it is possible for a user on a machine that is not running Cryptfs to use userland utilities to access the contents of the file. This also applies to other security-related operations, like verifying keyed hashes. This addresses compatibility issues with machines that are not running the encrypted filesystem layer.

5 Survey of Linux Encrypted Filesystems

5.1 Encrypted Loopback Filesystems

5.1.1 Loop-aes

The most well-known method of encrypting a filesystem is to use a loopback encrypted filesystem. (Note that Loop-aes is being deprecated in favor of Device Mapper (DM) Crypt, which also performs encryption at the block device layer.) Loop-aes[20] is part of the 2.6 Linux kernel (CONFIG_BLK_DEV_CRYPTOLOOP). It performs encryption at the block device level. With Loop-aes, the administrator can choose whatever cipher he wishes to use with the filesystem. The mount package on most popular GNU/Linux distributions contains the losetup utility, which can be used to set up the encrypted loopback mount (you can choose any cipher that the kernel supports; we use Blowfish in this example):

root# modprobe cryptoloop
root# modprobe blowfish
root# dd if=/dev/urandom of=encrypted.img \
        bs=4k count=1000
root# losetup -e blowfish /dev/loop0 \
        encrypted.img
root# mkfs.ext3 /dev/loop0
root# mkdir /mnt/unencrypted-view
root# mount /dev/loop0 /mnt/unencrypted-view

The loopback encrypted filesystem falls short in that it is an all-or-nothing solution. It is impossible for most standard backup utilities to perform incremental backups on sets of encrypted files without being given access to the unencrypted files. In addition, remote users will need to use IPSec or some other network encryption layer when accessing the files, which must be exported from the unencrypted mount point on the server. Loop-aes is, however, the best-performing encrypted filesystem that is freely available and integrated with most GNU/Linux distributions. It is an adequate solution for many who require little more than basic encryption of their entire filesystems.

5.1.2 BestCrypt

BestCrypt[23] is a non-free product that uses a loopback approach, similar to Loop-aes.

5.1.3 PPDD

PPDD[21] is a block device driver that encrypts and decrypts data as it goes to and comes from another block device. It works very much like Loop-aes; in fact, in the 2.4 kernel, it uses the loopback device, as Loop-aes does. PPDD has not been ported to the 2.6 kernel. Loop-aes takes the same approach, and Loop-aes ships with the 2.6 kernel itself.

5.2 CFS

The Cryptographic Filesystem (CFS)[22] by Matt Blaze is a well-established transparent encrypted filesystem, originally written for BSD platforms. CFS is implemented entirely in userspace and operates similarly to NFS. A userspace daemon, cfsd, acts as a pseudo-NFS server, and the kernel makes RPC calls to the daemon. The CFS daemon performs transparent encryption and decryption when writing and reading data. Just as NFS can export a directory from any exportable filesystem, CFS can do the same, while managing the encryption on top of that filesystem.

In the background, CFS stores the metadata necessary to encrypt and decrypt files alongside the files being encrypted or decrypted on the filesystem. If you were to look at those directories directly, you would see a set of files with encrypted values for filenames, with a handful of metadata files mixed in. When accessed through CFS, those metadata files are hidden, and the files are transparently encrypted and decrypted so that user applications (with the proper credentials) can freely work with the data.

While CFS is capable of acting as a remote NFS server, this is not recommended, for reasons including performance and security issues with plaintext passwords and unencrypted data being transmitted over the network. From a security perspective (and perhaps also performance, depending on the number of clients), you would be better off using a regular NFS server to handle remote mounts of the encrypted directories, with local CFS mounts on top of the NFS mounts.

Perhaps the most attractive attribute of CFS is that it does not require any modifications to the standard Linux kernel. The source code for CFS is freely obtainable. It is packaged in the Debian repositories and is also available in RPM form. Using apt, CFS is perhaps the easiest encrypted filesystem for a user to set up and start using:

root# apt-get install cfs
user# cmkdir encrypted-data
user# cattach encrypted-data unencrypted-view

The user will be prompted for his password at the requisite stages. At this point, anything the user writes to or reads from /crypt/unencrypted-view will be transparently encrypted to and decrypted from files in encrypted-data. Note that any user on the system can make a new encrypted directory and attach it. It is not necessary to initialize and mount an entire block device, as is the case with Loop-aes.

5.3 TCFS

TCFS[24] is a variation on CFS that includes secure integrated remote access and file integrity features. TCFS assumes the client’s workstation is trusted, and the server cannot necessarily be trusted. Everything sent to and from the server is encrypted. Encryption and decryption take place on the client side.

Note that this behavior can be mimicked with a CFS mount on top of an NFS mount. However, because TCFS works within the kernel (thus requiring a patch) and does not necessitate two levels of mounting, it is faster than an NFS+CFS combination.

TCFS is no longer an actively maintained project. The last release was made three years ago, for the 2.0 kernel.

5.4 Cryptfs

As a proof of concept for the FiST stackable filesystem framework, Erez Zadok et al. developed Cryptfs[1]. Under Cryptfs, symmetric keys are associated with groups of files within a single directory. The key is generated from a password that is entered at the time that the filesystem is mounted. The Cryptfs mount point provides an unencrypted view of the directory that contains the encrypted files.

The authors of this paper are currently working on extending Cryptfs to provide seamless integration into the user’s desktop environment (see Section 6).

5.5 Userspace Encrypted Filesystems

EncFS[25] utilizes the Filesystem in Userspace (FUSE) library and kernel module to implement an encrypted filesystem in userspace. Like CFS, EncFS encrypts on a per-file basis.

CryptoFS[26] is similar to EncFS, except that it uses the Linux Userland Filesystem (LUFS) library instead of FUSE.

SSHFS[27], like CryptoFS, uses the LUFS kernel module and userspace daemon. It limits itself to encrypting the files via SFTP as they are transferred over a network; the files stored on disk are unencrypted. From the user’s perspective, all file accesses take place as though they were being performed on any regular filesystem (opens, reads, writes, etc.). SSHFS transfers the files back and forth via SFTP with the file server as these operations occur.


5.6 Reiser4

ReiserFS version 4 (Reiser4)[29], while still in the development stage, features pluggable security modules. There are currently proposed modules for Reiser4 that will perform encryption and auditing.

5.7 Network Filesystem Security

Much research has taken place in the domain of network filesystem security. CIFS, NFSv4, and other network filesystems face special challenges in relation to user identification, access control, and data secrecy. The NFSv4 protocol definition in RFC 3010 contains descriptions of security mechanisms in section 3[30].

6 Proposed Extensions to Cryptfs

Our proposal is to place file encryption metadata into the Extended Attributes (EAs) of the file itself. Extended Attributes are a generic interface for attaching metadata to files. The Cryptfs layer will be extended to extract that information and use it to direct the encryption and decryption of the contents of the file. In the event that the filesystem does not support Extended Attributes, another filesystem layer can provide that functionality. The stackable framework effectively allows Cryptfs to operate on top of any filesystem.

The encryption process is very similar to that of GnuPG and other public key cryptography programs that use a hybrid approach to encrypting data. By integrating the process into the filesystem, we can achieve a greater degree of transparency, without requiring any changes to userspace applications or libraries.

Under our proposed design, when a new file is created as an encrypted file, the Cryptfs layer generates a new symmetric key K_s for the encryption of the data that will be written. File creation policy enacted by Cryptfs can be dictated by directory attributes or globally defined behavior. The owner of the file is automatically authorized to access the file, and so the symmetric key is encrypted with the public key of the owner of the file, K_u, which was passed into the Cryptfs layer by a Pluggable Authentication Module linked against libcryptfs at the time that the user logged in. The encrypted symmetric key is then added to the Extended Attribute set of the file:

{K_s}K_u

Suppose that the user at this point wants to grant Alice access to the file. Alice’s public key, K_a, is in the user’s GnuPG keyring. He can run a utility that selects Alice’s key, extracts it from the GnuPG keyring, and passes it to the Cryptfs layer, with instructions to add Alice as an authorized user for the file. The new key list in the Extended Attribute set for the file then contains two copies of the symmetric key, each encrypted with a different public key:

{K_s}K_u
{K_s}K_a
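The per-file key list described above can be modeled as follows. This is a simplified sketch, not the proposed Cryptfs implementation: the FileKeyList class and the xor_wrap stand-in are hypothetical, and a real system would wrap K_s with genuine public-key encryption (e.g., via GnuPG keys) rather than XOR:

```python
import os

def xor_wrap(data, key):
    # Stand-in for public-key encryption of the symmetric key; a real
    # implementation would use the owner's or Alice's actual public key.
    # XOR is symmetric, so the same function both wraps and unwraps.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

class FileKeyList:
    """Hypothetical model of a per-file list of {K_s}K_u records."""

    def __init__(self, session_key):
        self._k_s = session_key
        self.records = {}                  # user name -> wrapped K_s

    def grant(self, user, user_key):
        """Add one {K_s}K_user record to the (modeled) EA set."""
        self.records[user] = xor_wrap(self._k_s, user_key)

    def unwrap(self, user, user_key):
        """Recover K_s given the matching user key."""
        return xor_wrap(self.records[user], user_key)

k_s = os.urandom(16)                       # per-file symmetric key K_s
owner_key, alice_key = os.urandom(16), os.urandom(16)

keylist = FileKeyList(k_s)
keylist.grant("owner", owner_key)          # {K_s}K_u
keylist.grant("alice", alice_key)          # {K_s}K_a
assert keylist.unwrap("alice", alice_key) == k_s
```

Granting access to a new user only appends one wrapped copy of K_s; the file data itself is never re-encrypted.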

Note that this is not an access control directive; rather, it is a confidentiality enforcement mechanism that extends beyond the local machine’s access control. Without either the user’s or Alice’s private key, no entity will be able to access the decrypted contents of the file. The machine that harbors such keys will enact its own access control over the decrypted file, based on standard UNIX file permissions and/or ACLs. When that file is copied to removable media or attached to an email, as long as the Extended Attributes are preserved, Alice will have all the information that she needs in order to retrieve the symmetric key for the file and decrypt it. If Alice is also running Cryptfs, when she launches an application that accesses the file, the decryption process is entirely transparent to her, since her Cryptfs layer received her private key from PAM at the time that she logged in.

Figure 1: Overview of proposed extended Cryptfs architecture (kernel/userspace diagram: VFS syscalls reach the Cryptfs layer and its keystore; the libcryptfs library, used by login/GNOME/KDE and by PAM modules for authentication and key retrieval, changes file encryption attributes and supplies keys, which may come from a USB keychain device, smartcard, TPM, GnuPG keyring, etc.)

If the user requires the ability to encrypt a file for access by a group of users, then the user can associate sets of public keys with groups and refer to the groups when granting access. The userspace application that links against libcryptfs can then pass the public keys of each member of the group into Cryptfs and instruct Cryptfs to add the associated key records to the Extended Attributes. Thus no special support for groups is needed within the Cryptfs layer itself.

6.1 Kernel-level Changes

No modifications to the 2.6 kernel itself are necessary to support the stackable Cryptfs layer. The Cryptfs module’s logical divisions include a sysfs interface, a keystore, and the VFS operation routines that perform the encryption and the decryption on reads and writes.

By working with a userspace daemon, it would be possible for Cryptfs to export public key cryptographic operations to userspace. In order to avoid the need for such a daemon while using public key cryptography, the kernel cryptographic API must be extended to support it.

6.2 PAM

At login, the user’s public and private keys need to find their way into the kernel Cryptfs layer. This can be accomplished by writing a Pluggable Authentication Module, pam_cryptfs.so. This module will link against libcryptfs and will extract keys from the user’s GnuPG keystore. The libcryptfs library will use the sysfs interface to pass the user’s keys into the Cryptfs layer.

6.3 libcryptfs

The libcryptfs library works with the Cryptfs sysfs interface. Userspace utilities, such as pam_cryptfs.so, GNOME/KDE, or stand-alone utilities, will link against this library and use it to communicate with the kernel Cryptfs layer.

Figure 2: Structure of Cryptfs layer in kernel (VFS syscalls drive the Cryptfs layer, whose crypto calls to the kernel Crypto API are parameterized by the file security attributes; each file’s Extended Attributes are parsed into the layer’s file attribute structure, keys are retrieved from the keystore, where symmetric file keys may need decrypting with the authorized user’s private key, and userspace sets private/public keys through sysfs via libcryptfs)

6.4 User Interface

Desktop environments such as GNOME or KDE can link against libcryptfs to provide users with a convenient interface through which to work with the files. For example, by right-clicking on an icon representing the file and selecting “Security”, the user will be presented with a window that can be used to control the encryption status of the file. Such options will include whether or not the file is encrypted; which users should be able to encrypt and decrypt the file (identified by their public keys from the user’s GnuPG keyring); what cipher and what key length are used; an optional password that encrypts the symmetric key; whether or not to use keyed hashing over extents of the file for integrity, and which hash algorithm to use; whether accesses to the file when no key is available should result in an error or in the encrypted blocks being returned (perhaps associated with UIDs; useful for backup utilities); and other properties that are controlled by the Cryptfs layer.

6.5 Example Walkthrough

When a file’s encryption attribute is set, the first thing that the Cryptfs layer will do is generate a new symmetric key, which will be used for all encryption and decryption of the file in question. Any data in that file is then immediately encrypted with that key. When using public key-enforced access control, that key will be encrypted with the process owner’s public key and stored as an EA of the file.

When the process owner wishes to allow others to access the file, he encrypts the symmetric key with their public keys. From the user’s perspective, this can be done by right-clicking on an icon representing the file, selecting “Security→Add Authorized User Key”, and specifying the authorized user while PAM retrieves the public key for that user.

When using password-enforced access control, the symmetric key is instead encrypted using a key generated from a password. The user can then share that password with everyone whom he has authorized to access the file. In either case (public key-enforced or password-enforced access control), revocation of access to future versions of the file will necessitate regeneration and re-encryption of the symmetric key.

Suppose the encrypted file is then copied to a removable device and delivered to an authorized user. When that user logged into his machine, his private key was retrieved by the key retrieval Pluggable Authentication Module and sent to the Cryptfs keystore. When that user launches any arbitrary application and attempts to access the encrypted file from the removable media, Cryptfs retrieves the encrypted symmetric key corresponding to that user’s public key, uses the authenticated user’s private key to decrypt the symmetric key, associates that symmetric key with the file, and then proceeds to use it for reading and writing the file. This is done in an entirely transparent manner from the perspective of the user, and the file maintains its encrypted status on the removable media throughout the entire process. No modification to the application or applications accessing the file is necessary to implement such functionality.

In the case where a file’s symmetric key is encrypted with a password, it will be necessary for the user to launch a daemon that listens for password queries from the kernel Cryptfs layer. Without such a daemon, the user’s initial attempt to access the file will be denied, and the user will have to use a password-set utility to send the password to the Cryptfs layer in the kernel.

6.6 Other Considerations

Sparse files present a challenge to encrypted filesystems. Under traditional UNIX semantics, when a user seeks more than a block beyond the end of a file and writes, the intervening space is not stored on the block device at all. These missing blocks are known as “holes.”

When holes are later read, the kernel simply fills the memory with zeros without actually reading any zeros from disk (recall that they do not exist on the disk at all; the filesystem “fakes it”). From the point of view of whatever is asking for the data from the filesystem, the section of the file being read appears to be all zeros. This presents a problem when the file is supposed to be encrypted. Without taking sparse files into consideration, the encryption layer will naïvely assume that the zeros being passed to it from the underlying filesystem are actually encrypted data, and it will attempt to decrypt them. Obviously, this will result in something other than zeros being presented above the encryption layer, thus violating UNIX sparse file semantics.

One solution to this problem is to abandon the concept of “holes” altogether at the Cryptfs layer. Whenever we seek past the end of the file and write, we can actually encrypt blocks of zeros and write them out to the underlying filesystem. While this allows Cryptfs to adhere to UNIX semantics, it is much less efficient. Another possible solution is to store a “hole bitmap” as an Extended Attribute of the file. Each bit would correspond to a block of the file; a “1” might indicate that the block is a “hole” and should be zeroed out rather than decrypted, and a “0” might indicate that the block should be decrypted normally.
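The hole-bitmap read path can be sketched as follows. The block size, function names, and the toy XOR cipher are illustrative assumptions of ours, not part of the proposed design:

```python
BLOCK = 4096  # assumed block size

def read_block(stored, hole_bitmap, idx, decrypt):
    """Return plaintext for block idx: synthesized zeros if the hole
    bitmap marks it as a hole, otherwise the decrypted stored block."""
    if hole_bitmap.get(idx, False):
        return b"\0" * BLOCK               # hole: nothing stored on disk
    return decrypt(stored[idx])

# Toy reversible "cipher" for illustration only: XOR with a constant.
toy_crypt = lambda blk: bytes(b ^ 0x5A for b in blk)

stored = {0: toy_crypt(b"x" * BLOCK), 2: toy_crypt(b"y" * BLOCK)}
holes = {1: True}                          # block 1 was seeked over

assert read_block(stored, holes, 1, toy_crypt) == b"\0" * BLOCK
assert read_block(stored, holes, 2, toy_crypt) == b"y" * BLOCK
```

Because hole blocks are never run through the cipher, the reader sees zeros, preserving sparse-file semantics without encrypting padding blocks.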

Our proposed extensions to Cryptfs do not currently address the issues of directory structure and file size secrecy. We recognize that this type of confidentiality is important to many, and we plan to explore ways to integrate such features into Cryptfs, possibly by employing extra filesystem layers to aid in the process.

Extended Attribute content can also be sensitive. Technically, only enough information to retrieve the symmetric decryption key need be accessible to authorized individuals; all other attributes can be encrypted with that key, just as the contents of the file are encrypted.

Processes that are not authorized to access the decrypted content will either be denied access to the file or will receive the encrypted content, depending on how the Cryptfs layer is parameterized. This behavior permits incremental backup utilities to function properly, without requiring access to the unencrypted content of the files they are backing up.

At some point, we would like to include file integrity information in the Extended Attributes. As previously mentioned, this can be accomplished via sets of keyed hashes over extents within the file:

H_0 = H{O_0, D_0, K_s}
H_1 = H{O_1, D_1, K_s}
...
H_n = H{O_n, D_n, K_s}
H_f = H{H_0, H_1, ..., H_n, n, s, K_s}

where n is the number of extents in the file, s is the extent size (also stored as another EA), O_i is offset number i within the file, D_i is the data from offset O_i to O_i + s, K_s is the key that one must possess in order to make authorized changes to the file, and H_f is the hash of the hashes, the number of extents, the extent size, and the secret key, which helps detect when an attacker swaps extents around or alters the extent size.
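Assuming HMAC-SHA256 as the keyed hash H (the scheme above does not mandate a particular algorithm) and a 4 KB extent size, the computation might be sketched as:

```python
import hmac, hashlib

EXTENT = 4096  # assumed extent size s, stored as another EA

def extent_hashes(data, k_s, extent=EXTENT):
    """Compute H_i = H{O_i, D_i, K_s} per extent, plus the top-level
    H_f = H{H_0..H_n, n, s, K_s} binding extent order, count, and size."""
    hashes = []
    for off in range(0, len(data), extent):
        h = hmac.new(k_s, digestmod=hashlib.sha256)
        h.update(off.to_bytes(8, "little"))         # O_i, the extent offset
        h.update(data[off:off + extent])            # D_i, the extent data
        hashes.append(h.digest())
    top = hmac.new(k_s, digestmod=hashlib.sha256)
    for h in hashes:
        top.update(h)
    top.update(len(hashes).to_bytes(8, "little"))   # n, number of extents
    top.update(extent.to_bytes(8, "little"))        # s, extent size
    return hashes, top.digest()

k_s = b"per-file symmetric key"
data = bytes(i % 251 for i in range(10000))
hashes, h_f = extent_hashes(data, k_s)

# Swapping two extents changes H_f even though every byte is still present.
swapped = data[EXTENT:2 * EXTENT] + data[:EXTENT] + data[2 * EXTENT:]
assert extent_hashes(swapped, k_s)[1] != h_f
```

Binding the offset O_i into each H_i, and n and s into H_f, is what defeats extent reordering and extent-size tampering.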

Keyed hashes prove that whoever modified the data had access to the shared secret, which is, in this case, the symmetric key. Digital signatures can also be incorporated into Cryptfs. Executables downloaded over the Internet can often be of questionable origin or integrity. If you trust the person who signed the executable, then you can have a higher degree of certainty that the executable is safe to run if the digital signature is verifiable. The verification of the digital signature can be performed dynamically at the time of execution.

As previously mentioned, in addition to the extensions to the Cryptfs stackable layer, this effort requires the development of a cryptfs library, a set of PAM modules, hooks into GNOME and KDE, and some utilities for managing file encryption. Applications that copy files with Extended Attributes must take steps to make sure that they preserve the Extended Attributes (see http://www.suse.de/~agruen/ea-acl-copy/).

7 Conclusion

Linux currently has a comprehensive framework for managing filesystem security. Standard file security attributes, process credentials, ACLs, PAM, LSM, Device Mapper (DM) Crypt, and other features together provide good security in a contained environment. To extend access control enforcement over individual files beyond the local environment, you must use encryption in a way that can be easily applied to individual files. The currently employed processes of encrypting and decrypting files, however, are inconvenient and often obstructive.

By integrating the encryption and decryption of individual files into the filesystem itself, and associating encryption metadata with the individual files, we can extend Linux security to provide seamless encryption-enforced access control and integrity auditing.

8 Recognitions

We would like to express our appreciation for the contributions and input of all those who have laid the groundwork for an effort toward transparent filesystem encryption. This includes contributors to FiST and Cryptfs, GnuPG, PAM, and many other projects on which we are basing our development efforts, as well as several members of the kernel development community.

9 Legal Statement

This work represents the view of the author and does not necessarily represent the view of IBM. IBM and Lotus Notes are registered trademarks of International Business Machines Corporation in the United States, other countries, or both. Other company, product, and service names may be trademarks or service marks of others.

References

[1] E. Zadok, L. Badulescu, and A. Shender. Cryptfs: A stackable vnode level encryption file system. Technical Report CUCS-021-98, Computer Science Department, Columbia University, 1998.

[2] J.S. Heidemann and G.J. Popek. File system development with stackable layers. ACM Transactions on Computer Systems, 12(1):58–89, February 1994.

[3] E. Zadok and J. Nieh. FiST: A Language for Stackable File Systems. Proceedings of the Annual USENIX Technical Conference, pp. 55–70, San Diego, June 2000.

[4] S.C. Kothari. Generalized Linear Threshold Scheme. Advances in Cryptology: Proceedings of CRYPTO 84, Springer-Verlag, 1985, pp. 231–241.

[5] Matt Blaze. “Key Management in an Encrypting File System,” Proc. Summer ’94 USENIX Tech. Conference, Boston, MA, June 1994.

[6] For more information on Extended Attributes (EAs) and Access Control Lists (ACLs), see http://acl.bestbits.at/ or http://www.suse.de/~agruen/acl/chapter/fs_acl-en.pdf

[7] For more information on GnuPG, see http://www.gnupg.org/

[8] For more information on OpenSSL, see http://www.openssl.org/

[9] For more information on IBM Lotus Notes, see http://www-306.ibm.com/software/lotus/. Information on Notes security can be obtained from http://www-10.lotus.com/ldd/today.nsf/f01245ebfc115aaf8525661a006b86b9/232e604b847d2cad88256ab90074e298?OpenDocument

[10] For more information on Pluggable Authentication Modules (PAM), see http://www.kernel.org/pub/linux/libs/pam/


286 • <strong>Linux</strong> Symposium 2004 • Volume <strong>One</strong><br />



Hotplug Memory and the <strong>Linux</strong> VM<br />

Dave Hansen, Mike Kravetz, with Brad Christiansen<br />

IBM <strong>Linux</strong> Technology Center<br />

haveblue@us.ibm.com, kravetz@us.ibm.com, bradc1@us.ibm.com<br />

Matt Tolentino<br />

Intel<br />

matthew.e.tolentino@intel.com<br />

Abstract<br />

This paper will describe the changes needed to<br />

the <strong>Linux</strong> memory management system to cope<br />

with adding or removing RAM from a running<br />

system. In addition to support for physically<br />

adding or removing DIMMs, there is an ever-increasing<br />

number of virtualized environments<br />

such as UML or the IBM pSeries Hypervisor<br />

which can transition RAM between virtual<br />

system images, based on need. This paper will<br />

describe techniques common to all supported<br />

platforms, as well as challenges for specific architectures.<br />

1 Introduction<br />

As Free Software Operating Systems continue<br />

to expand their scope of use, so do the demands<br />

placed upon them. <strong>One</strong> area of continuing<br />

growth for <strong>Linux</strong> is the adaptation to<br />

incessantly changing hardware configurations<br />

at runtime. While initially confined to commonly<br />

removed devices such as keyboards,<br />

digital cameras or hard disks, <strong>Linux</strong> has recently<br />

begun to grow to include the capability<br />

to hot-plug integral system components. This<br />

paper describes the changes necessary to enable<br />

<strong>Linux</strong> to adapt to dynamic changes in one<br />

of the most critical system resources—system<br />

RAM.<br />

2 Motivation<br />

<strong>The</strong> underlying reason for wanting to change<br />

the amount of RAM is very simple: availability.<br />

<strong>The</strong> systems that support memory hot-plug<br />

operations are designed to fulfill mission-critical<br />

roles, significant enough that the cost of<br />

a reboot cycle for the sole purpose of adding<br />

or replacing system RAM is simply too expensive.<br />

For example, some large ppc64 machines<br />

have been reported to take well over thirty minutes<br />

for a simple reboot. <strong>The</strong>refore, the downtime<br />

necessary for an upgrade may compromise<br />

the five-nines uptime requirement critical<br />

to high-end system customers [1].<br />

However, memory hotplug is not just important<br />

for big-iron. <strong>The</strong> availability of high<br />

speed, commodity hardware has prompted a<br />

resurgence of research into virtual machine<br />

monitors—layers of software such as Xen<br />

[2], VMWare [3], and conceptually even User<br />

Mode <strong>Linux</strong> that allow for multiple operating<br />

system instances to be run in isolated, virtual<br />

domains. As computing hardware density has<br />

increased, so has the possibility of splitting up<br />

that computing power into more manageable<br />

pieces. <strong>The</strong> capability for an operating system<br />

to expand or contract the range of physical


288 • <strong>Linux</strong> Symposium 2004 • Volume <strong>One</strong><br />

memory resources available presents the possibility<br />

for virtual machine implementations to<br />

balance memory requirements and improve the<br />

management of memory availability between<br />

domains¹. This author currently leases a small<br />

User Mode <strong>Linux</strong> partition for small Internet<br />

tasks such as DNS and low-traffic web serving.<br />

Similar configurations with an approximately<br />

100 MHz processor and 64 MB of RAM are<br />

not uncommon. Imagine, in the case of an accidental<br />

Slashdotting, how useful radically growing<br />

such a machine could be.<br />

3 <strong>Linux</strong>’s Hotplug Shortcomings<br />

Before being able to handle the full wrath of<br />

Slashdot, we have to consider <strong>Linux</strong>’s current<br />

design. <strong>Linux</strong> only has two data structures<br />

that absolutely limit the amount of RAM that<br />

<strong>Linux</strong> can handle: the page allocator bitmaps,<br />

and mem_map[] (on contiguous memory systems).<br />

<strong>The</strong> page allocator bitmaps are very<br />

simple in concept: a bit is set one way when<br />

a page is available, and the opposite when it<br />

has been allocated. Since there needs to be one<br />

bit available for each page, it obviously has to<br />

scale with the size of the system’s total RAM.<br />

<strong>The</strong> bitmap memory consumption is approximately<br />

1 bit of memory for each page of system<br />

RAM.<br />
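
To make that scaling concrete, here is a small sketch (illustrative C, not kernel code; a 4096-byte page is assumed):<br />

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096ULL

/* One bit per page of system RAM, rounded up to whole bytes. */
static uint64_t bitmap_bytes(uint64_t ram_bytes)
{
    uint64_t pages = ram_bytes / PAGE_SIZE;
    return (pages + 7) / 8;
}
```

For 4GB of RAM this is 1,048,576 pages and a 128KB bitmap; even 64GB needs only 2MB, which is why the bitmaps remain manageable.<br />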

4 Resizing mem_map[]<br />

<strong>The</strong> mem_map[] structure is a bit more complicated.<br />

Conceptually, it is an array, with one<br />

struct page for each physical page which<br />

the system contains. <strong>The</strong>se structures contain<br />

bookkeeping information such as flags indicating<br />

page usage and locking structures. <strong>The</strong><br />

complexity with the struct pages is associated<br />

with their size. <strong>The</strong>y have a size of<br />

40 bytes each on i386 (in the 2.6.5 kernel).<br />

On a system with 4096 byte hardware pages,<br />

this implies that about 1% of the total system<br />

memory will be consumed by struct<br />

pages alone. This use of 1% of the system<br />

memory is not a problem in and of itself. But,<br />

it does cause other problems.<br />

¹ err, I could write a lot about this, so I won’t go any further.<br />

<strong>The</strong> <strong>Linux</strong> page allocator has a limitation on<br />

the maximum amount of memory that it can<br />

allocate to a single request. On i386, this<br />

is 4MB, while on ppc64, it is 16MB. It is<br />

easy to calculate that anything larger than a<br />

4GB i386 system will be unable to allocate<br />

its mem_map[] with the normal page allocator.<br />

Normally, this problem with mem_map is<br />

avoided by using a boot-time allocator which<br />

does not have the same restrictions as the allocator<br />

used at runtime. However, memory hotplug<br />

requires the ability to grow the amount of<br />

mem_map[] used at runtime. It is not feasible<br />

to use the same approach as the page allocator<br />

bitmaps because, in contrast, they are kept to<br />

small-enough sizes to not impinge on the maximum<br />

size allocation limits.<br />
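
The arithmetic behind that limit can be sketched as follows (illustrative only; the 40-byte struct page and the 4MB i386 allocation ceiling are taken from the text):<br />

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE        4096ULL
#define STRUCT_PAGE_SIZE 40ULL          /* i386, 2.6.5 kernel */
#define MAX_ALLOC_I386   (4ULL << 20)   /* largest single runtime allocation */

/* Bytes of mem_map[] needed to describe `ram_bytes` of RAM. */
static uint64_t mem_map_bytes(uint64_t ram_bytes)
{
    return (ram_bytes / PAGE_SIZE) * STRUCT_PAGE_SIZE;
}
```

A 4GB machine already needs roughly a 40MB mem_map[], an order of magnitude beyond the 4MB ceiling, so only the boot-time allocator can provide it in one piece.<br />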

4.1 mem_map[] preallocation<br />

A very simple way around the runtime allocator<br />

limitations might be to allocate sufficient<br />

memory for mem_map[] at boot-time to account<br />

for any amount of RAM that could possibly<br />

be added to the system. But, this approach<br />

quickly breaks down in at least one important<br />

case. <strong>The</strong> mem_map[] must be allocated<br />

in low memory, an area on i386 which<br />

is approximately 896MB in total size. This<br />

is very important memory which is commonly<br />

exhausted [4],[5],[6]. Consider an 8GB system<br />

which could be expanded to 64GB in the future.<br />

Its normal mem_map[] use would be<br />

around 84MB, an acceptable 10% use of low<br />

memory. However, had mem_map[] been<br />

preallocated to handle a total capacity of 64GB<br />

of system memory, it would use an astound-


<strong>Linux</strong> Symposium 2004 • Volume <strong>One</strong> • 289<br />

ing 71% of low memory, giving any 8GB system<br />

all of the low memory problems associated<br />

with much larger systems.<br />
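
Those percentages can be reproduced with a short sketch (a flat 40-byte struct page and an 896MB low-memory zone are assumed; the text’s ~84MB figure is slightly higher than the flat-40-byte result of 80MB, presumably due to extra per-page overhead):<br />

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE        4096ULL
#define STRUCT_PAGE_SIZE 40ULL
#define LOWMEM_BYTES     (896ULL << 20)   /* i386 low memory */

/* Percent of low memory consumed by a mem_map[] preallocated
 * to cover `max_ram_bytes` of potential RAM. */
static unsigned lowmem_percent(uint64_t max_ram_bytes)
{
    uint64_t map = (max_ram_bytes / PAGE_SIZE) * STRUCT_PAGE_SIZE;
    return (unsigned)(map * 100 / LOWMEM_BYTES);
}
```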

Preallocation also has the disadvantage of imposing<br />

limitations, possibly making the user<br />

decide how large they expect the system to<br />

be, either when the kernel is compiled, or<br />

when it is booted. Perhaps the administrator<br />

of the above 8GB machine knows that it<br />

will never get any larger than 16GB. Does that<br />

make the low memory usage more acceptable?<br />

It would likely solve the immediate problem,<br />

however, such limitations and user intervention<br />

are becoming increasingly unacceptable<br />

to <strong>Linux</strong> vendors, as they drastically increase<br />

possible user configurations, and support costs<br />

along with it.<br />

4.2 Breaking mem_map[] up<br />

Instead of preallocation, another solution is<br />

to break up mem_map[]. Instead of needing<br />

massive amounts of memory, smaller ones<br />

could be used to piece together mem_map[]<br />

from more manageable allocations. Interestingly,<br />

there is already precedent in the <strong>Linux</strong><br />

kernel for such an approach. <strong>The</strong> discontiguous<br />

memory support code tries to solve a different<br />

problem (large holes in the physical address<br />

space), but a similar solution was needed.<br />

In fact, there has been code released to use the<br />

current discontigmem support in <strong>Linux</strong> to implement<br />

memory hotplug. But, this has several<br />

disadvantages. Most importantly, it requires<br />

hijacking the NUMA code for use with<br />

memory hotplug. This would exclude the use<br />

of NUMA and memory hotplug on the same<br />

system, which is likely an unacceptable compromise<br />

due to the vast performance benefits<br />

demonstrated from using the <strong>Linux</strong> NUMA<br />

code for its intended use [6].<br />

Using the NUMA code for memory hotplug is<br />

a very tempting proposition because in addition<br />

to splitting up mem_map[] the NUMA<br />

support also handles discontiguous memory.<br />

Discontiguous memory simply means that the<br />

system does not lay out all of its physical memory<br />

in a single block, rather there are holes.<br />

Handling these holes with memory hotplug is<br />

very important, otherwise the only memory<br />

that could be added or removed would be on<br />

the end.<br />

Although an approach similar to this “node hotplug”<br />

approach will be needed when adding or<br />

removing entire NUMA nodes, using it on a<br />

regular SMP hotplug system could be disastrous.<br />

Each discontiguous area is represented<br />

by several data structures but each has at least<br />

one struct zone. This structure is the basic<br />

unit which <strong>Linux</strong> uses to pool memory. When<br />

the amounts of memory reach certain low levels,<br />

<strong>Linux</strong> will respond by trying to free or<br />

swap memory. Artificially creating too many<br />

zones causes these events to be triggered much<br />

too early, degrading system performance and<br />

under-utilizing available RAM.<br />

5 CONFIG_NONLINEAR<br />

<strong>The</strong> solution to both the mem_map[] and discontiguous<br />

memory problems comes in a single<br />

package: nonlinear memory. First implemented<br />

by Daniel Phillips in April of 2002 as<br />

an alternative to discontiguous memory, nonlinear<br />

solves a similar set of problems.<br />

Laying out mem_map[] as an array has several<br />

advantages. <strong>One</strong> of the most important<br />

is the ability to quickly determine the physical<br />

address of any arbitrary struct page.<br />

Since mem_map[N] represents the Nth page<br />

of physical memory, the physical address of the<br />

memory represented by that struct page<br />

can be determined by simple pointer arithmetic:<br />

Once mem_map[] is broken up, these simple



physical_address = (&mem_map[N] - &mem_map[0]) * PAGE_SIZE<br />

struct page *page_N = &mem_map[physical_address / PAGE_SIZE]<br />

Figure 1: Physical Address Calculations<br />

calculations are no longer possible, thus another<br />

approach is required. <strong>The</strong> nonlinear approach<br />

is to use a set of two lookup tables, each<br />

one complementing the above operations: one<br />

for converting struct page to physical addresses,<br />

the other for doing the opposite. While<br />

it would be possible to have a table with an entry<br />

for every single page, that approach wastes<br />

far too much memory. As a result, nonlinear<br />

handles pages in uniformly sized sections, each<br />

of which has its own mem_map[] and an associated<br />

physical address range. <strong>Linux</strong> has some<br />

interesting conventions about how addresses<br />

are represented, and this has serious implications<br />

for how the nonlinear code functions.<br />
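
The section idea can be sketched like this (an illustrative model, not the kernel’s actual structures; the section size and table bound are invented):<br />

```c
#include <assert.h>

#define SECTION_SHIFT 12     /* pages per section: 4096 (hypothetical) */
#define MAX_SECTIONS  1024   /* hypothetical table size */

struct page { unsigned long flags; };   /* stand-in for the real struct */

/* Each section carries its own piece of mem_map[] plus its base pfn. */
struct mem_section {
    struct page  *map;
    unsigned long base_pfn;
};

static struct mem_section sections[MAX_SECTIONS];

/* pfn -> struct page: one table lookup replaces the array index. */
static struct page *pfn_to_page(unsigned long pfn)
{
    struct mem_section *s = &sections[pfn >> SECTION_SHIFT];
    return s->map + (pfn - s->base_pfn);
}

/* struct page -> pfn: the section is passed explicitly to keep the
 * sketch short; real code would recover it from the page itself. */
static unsigned long page_to_pfn(struct mem_section *s, struct page *p)
{
    return s->base_pfn + (unsigned long)(p - s->map);
}
```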

5.1 Physical Address Representations<br />

<strong>The</strong>re are, in fact, at least three different ways<br />

to represent a physical address in <strong>Linux</strong>: a<br />

physical address, a struct page, and a<br />

page frame number (pfn). A pfn is traditionally<br />

just the physical address divided by the size<br />

of a physical page (the N in Figure<br />

1). Many parts of the kernel prefer to use<br />

a pfn as opposed to a struct page pointer<br />

to keep track of pages because pfn’s are easier<br />

to work with, being conceptually just array<br />

indexes. <strong>The</strong> page allocator bitmaps discussed<br />

above are just such a part of the kernel. To allocate<br />

or free a page, the page allocator toggles<br />

a bit at an index in one of the bitmaps. That<br />

index is based on a pfn, not a struct page<br />

or a physical address.<br />

Being so easily transposed, that decision does<br />

not seem horribly important. But it does cause<br />

a serious problem for memory hotplug. Consider<br />

a system with 100 1GB DIMM slots<br />

that support hotplug. When the system is first<br />

booted, only one of these DIMM slots is populated.<br />

Later on, the owner decides to hotplug<br />

another DIMM, but puts it in slot 100 instead<br />

of slot 2. Now, nonlinear has a bit of a problem:<br />

the new DIMM happens to appear at a physical<br />

address 100 times higher than the first<br />

DIMM. <strong>The</strong> mem_map[] for the new DIMM<br />

is split up properly, but the allocator bitmap’s<br />

length is directly tied to the pfn, and thus the<br />

physical address of the memory.<br />

Having already stated that the allocator bitmap<br />

stays at manageable sizes, this still does not<br />

seem like much of an issue. However, the<br />

physical address of that new memory could<br />

have an even greater range than 100 GB; it has<br />

the capability to have many, many terabytes of<br />

range, based on the hardware. Keeping allocator<br />

bitmaps for terabytes of memory could<br />

conceivably consume all system memory on a<br />

small machine, which is quite unacceptable.<br />

Nonlinear offers a solution to this by introducing<br />

a new way to represent a physical address:<br />

a fourth addressing scheme. With three<br />

addressing schemes already existing, a fourth<br />

seems almost comical, until its small scope is<br />

considered. <strong>The</strong> new scheme is isolated to use<br />

inside of a small set of core allocator functions<br />

and a single place in the memory hotplug code itself.<br />

A simple lookup table converts these new<br />

“linear” pfns into the more familiar physical<br />

pfns.
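
A sketch of that table (invented sizes and helper names; the point is that the allocator bitmaps can be sized by the dense linear range rather than the sparse physical one):<br />

```c
#include <assert.h>

#define MAX_LINEAR 16   /* hypothetical table size */

/* linear section index -> physical section number */
static unsigned long linear_to_phys[MAX_LINEAR];
static unsigned long nr_linear;

/* A hot-added DIMM lands at whatever physical section the hardware
 * gave it, but is appended to the end of the dense linear range. */
static unsigned long hotadd_section(unsigned long phys_section)
{
    linear_to_phys[nr_linear] = phys_section;
    return nr_linear++;
}
```

In the 100-slot example above, the DIMM in slot 100 becomes linear section 1, so the bitmaps cover two sections rather than one hundred and one.<br />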



5.2 Issues with CONFIG_NONLINEAR<br />

Although it greatly simplifies several issues,<br />

nonlinear is not without its problems. Firstly,<br />

it does require the consultation of a small number<br />

of lookup tables during critical sections of<br />

code. Random access of these tables is likely to<br />

cause cache overhead. <strong>The</strong> more finely grained<br />

the units of hotplug, the larger these tables will<br />

grow, and the worse the cache effects.<br />

Another concern arises with the size of the<br />

nonlinear tables themselves. While they allow<br />

pfns and mem_map[] to have nonlinear relationships,<br />

the nonlinear structures themselves<br />

remain normal, everyday, linear arrays. If<br />

hardware is encountered with sufficiently small<br />

hotplug units, and sufficiently large ranges of<br />

physical addresses, an alternate scheme to the<br />

arrays may be required. However, it is the authors’<br />

desire to keep the implementation simple,<br />

until such a need is actually demonstrated.<br />

6 Memory Removal<br />

While memory addition is a relatively black-and-white<br />

problem, memory removal has many<br />

more shades of gray. <strong>The</strong>re are many different<br />

ways to use memory, and each of them has<br />

specific challenges for unusing it. We will first<br />

discuss the kinds of memory that <strong>Linux</strong> has<br />

which are relevant to memory removal, along<br />

with strategies to go about unusing them.<br />

6.1 “Easy” User Memory<br />

Unusing memory is a matter of either moving<br />

data or simply throwing it away. <strong>The</strong> easiest,<br />

most straightforward kind of memory to<br />

remove is that whose contents can just be discarded.<br />

<strong>The</strong> two most common manifestations<br />

of this are clean page cache pages and swapped<br />

pages. Page cache pages are either dirty (containing<br />

information which has not been written<br />

to disk) or clean pages, which are simply a<br />

copy of something that is present on the disk.<br />

Memory removal logic that encounters a clean<br />

page cache page is free to have it discarded,<br />

just as the low memory reclaim code does today.<br />

<strong>The</strong> same is true of swapped pages; a page<br />

of RAM which has been written to disk is safe<br />

to discard. (Note: there is usually a brief period<br />

between when a page is written to disk,<br />

and when it is actually removed from memory.)<br />

Any page that can be swapped is also an easy<br />

candidate for memory removal, because it can<br />

easily be turned into a swapped page with existing<br />

code.<br />
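
The rules above amount to a small classifier, sketched here with hypothetical flags (the real decision reads page flags and consults the swap cache):<br />

```c
#include <assert.h>

struct rpage {
    int in_page_cache;   /* a cached copy of file data   */
    int dirty;           /* not yet written back to disk */
    int in_swap;         /* an identical copy is on swap */
};

/* A page whose contents already exist on disk may simply be discarded. */
static int can_discard(const struct rpage *p)
{
    if (p->in_page_cache && !p->dirty)
        return 1;        /* clean page cache page */
    if (p->in_swap)
        return 1;        /* swapped page          */
    return 0;            /* must be written out or moved first */
}
```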

6.2 Swappable User Memory<br />

Another type of memory which is very similar<br />

to the two types above is something which<br />

is only used by user programs, but is for<br />

some reason not a candidate for swapping.<br />

This at least includes pages which have been<br />

mlock()’d (which is a system call to prevent<br />

swapping). Instead of discarding these pages<br />

out of RAM, they must instead be moved. <strong>The</strong><br />

algorithm to accomplish this should be very<br />

similar to the algorithm for a complete page<br />

swapping: freeze writes to the page, move the<br />

page’s contents to another place in memory,<br />

change all references to the page, and re-enable<br />

writing. Notice that this is the same process as<br />

a complete swap cycle except that the writes to<br />

the disk are removed.<br />
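
That move-instead-of-swap cycle can be sketched as follows (hypothetical types; the hard kernel work is in safely freezing writers and finding every reference):<br />

```c
#include <assert.h>
#include <string.h>

#define PAGE_BYTES 4096

struct vpage {
    char data[PAGE_BYTES];
    int  write_frozen;
};

/* Move an unswappable (e.g. mlock()'d) page from `from` to `to`. */
static void migrate_page(struct vpage *from, struct vpage *to)
{
    from->write_frozen = 1;                     /* 1. freeze writes     */
    memcpy(to->data, from->data, PAGE_BYTES);   /* 2. copy the contents */
    /* 3. ...all mappings are rewritten to reference `to` here...       */
    to->write_frozen = 0;                       /* 4. re-enable writing */
}
```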

6.3 <strong>Kernel</strong> Memory<br />

Now comes the hard part. Up until now, we<br />

have discussed memory which is being used<br />

by user programs. <strong>The</strong>re is also memory that<br />

<strong>Linux</strong> sets aside for its own use and this comes<br />

in many more varieties than that used by user<br />

programs. <strong>The</strong> techniques for dealing with this<br />

memory are largely still theoretical, and do not<br />

have existing implementations.



Remember how the <strong>Linux</strong> page allocator can<br />

only keep track of pages in powers of two? <strong>The</strong><br />

<strong>Linux</strong> slab cache was designed to make up for<br />

that [6], [7]. It has the ability to take those powers<br />

of two pages, and chop them up into smaller<br />

pieces. <strong>The</strong>re are some fixed-size groups for<br />

common allocations like 1024, 1532, or 8192<br />

bytes, but there are also caches for certain<br />

kinds of data structures. Some of these caches<br />

have the ability to attempt to shrink themselves<br />

when the system needs some memory back, but<br />

even that is relatively worthless for memory<br />

hotplug.<br />

6.4 Removing Slab Cache Pages<br />

<strong>The</strong> problem is that the slab cache’s shrinking<br />

mechanism does not concentrate on shrinking<br />

any particular memory, it just concentrates on<br />

shrinking, period. Plus, there’s currently no<br />

mechanism to tell which slab a particular page<br />

belongs to. It could just as easily be a simply<br />

discarded dcache entry as it could be a completely<br />

immovable entry like a pte_chain.<br />

<strong>Linux</strong> will need mechanisms to allow the slab<br />

cache shrinking to be much more surgical.<br />

However, there will always be slab cache memory<br />

which is not covered by any of the shrinking<br />

code, like for generic kmalloc() allocations.<br />

<strong>The</strong> slab cache could also make efforts<br />

to keep these “mystery” allocations away from<br />

those that it knows how to handle.<br />

While the record-keeping for some slab-cache<br />

pages is sparse, there is memory with even<br />

more mysterious origins. Some is allocated<br />

early in the boot process, while other uses pull<br />

pages directly out of the allocator never to be<br />

seen again. If hot-removal of these areas is required,<br />

then a different approach must be employed:<br />

direct replacement. Instead of simply<br />

reducing the usage of an area of memory until<br />

it is unused, a one-to-one replacement of this<br />

memory is required. With the judicious use of<br />

page tables, the best that can be done is to preserve<br />

the virtual address of these areas. While<br />

this is acceptable for most use, it is not without<br />

its pitfalls.<br />

6.5 Removing DMA Memory<br />

<strong>One</strong> unacceptable place to change the physical<br />

address of some data is for a device’s<br />

DMA buffer. Modern disk controllers and network<br />

devices can transfer their data directly<br />

into the system’s memory without the CPU’s<br />

direct involvement. However, since the CPU<br />

is not involved, the devices lack access to the<br />

CPU’s virtual memory architecture. For this<br />

reason, all DMA-capable devices’ transfers are<br />

based on the physical address of the memory<br />

to which they are transferring. Every user of<br />

DMA in <strong>Linux</strong> will either need to be guaranteed<br />

to not be affected by memory replacement,<br />

or to be notified of such a replacement<br />

so that it can take corrective action. It should<br />

be noted, however, that the virtualization layer<br />

on ppc64 can properly handle this remapping<br />

in its IOMMU. Other architectures with IOM-<br />

MUs should be able to employ similar techniques.<br />

6.6 Removal and the Page Allocator<br />

<strong>The</strong> <strong>Linux</strong> page allocator works by keeping<br />

lists of groups of pages in sizes that are powers<br />

of two times the size of a page. It keeps a<br />

list of groups that are available for each power<br />

of two. However, when a request for a page<br />

is made, the only real information provided is<br />

for the size required; there is no way of<br />

specifying which particular memory<br />

is required.<br />

<strong>The</strong> first thing to consider before removing<br />

memory is to make sure that no other part<br />

of the system is using that piece of memory.<br />

Thankfully, that’s exactly what a normal allocation<br />

does: make sure that it is alone in



its use of the page. So, making the page allocator<br />

support memory removal will simply<br />

involve walking the same lists that store the<br />

page groups. But, instead of simply taking the<br />

first available pages, it will be more finicky,<br />

only “allocating” pages that are among those<br />

about to be removed. In addition, the allocator<br />

should have checks in the free_pages()<br />

path to look for pages which were selected for<br />

removal.<br />

1. Inform allocator to catch any pages in the<br />

area being removed.<br />

2. Go into allocator, and remove any pages<br />

in that area.<br />

3. Trigger page reclaim mechanisms to force<br />

free()s, and hopefully unuse all target<br />

pages.<br />

4. If not complete, goto 3.<br />
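
The four steps above can be sketched as a loop over toy state (hypothetical helpers; in the kernel, free_pages() itself would perform the capture of step 1):<br />

```c
#include <assert.h>

#define SECTION_PAGES 8

static int page_free[SECTION_PAGES];       /* 1 = currently unused        */
static int page_captured[SECTION_PAGES];   /* pulled out of the allocator */

/* Step 2: grab every already-free page in the target area. */
static void capture_free_pages(void)
{
    for (int i = 0; i < SECTION_PAGES; i++)
        if (page_free[i] && !page_captured[i])
            page_captured[i] = 1;
}

/* Step 3 stand-in: pretend reclaim frees one in-use page per pass. */
static int reclaim_one(void)
{
    for (int i = 0; i < SECTION_PAGES; i++)
        if (!page_free[i]) {
            page_free[i] = 1;
            return 1;
        }
    return 0;
}

/* Steps 1-4: returns 1 once every page in the area has been captured. */
static int remove_area(void)
{
    capture_free_pages();
    while (reclaim_one())          /* step 4: repeat until nothing left */
        capture_free_pages();
    for (int i = 0; i < SECTION_PAGES; i++)
        if (!page_captured[i])
            return 0;              /* an unfreeable page blocked removal */
    return 1;
}
```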

6.7 Page Groupings<br />

As described above, the page allocator is the<br />

basis for all memory allocations. However,<br />

when it comes time to remove memory, a fixed-<br />

size block of memory is what is removed.<br />

<strong>The</strong>se blocks correspond to the sections defined<br />

in the implementation of nonlinear<br />

memory. When removing a section of memory,<br />

the code performing the remove operation<br />

will first try to essentially allocate all the<br />

pages in the section. To remove the section,<br />

all pages within the section must be made free<br />

of use by some mechanism as described above.<br />

However, it should be noted that some pages<br />

will not be able to be made available for removal.<br />

For example, pages in use for kernel<br />

allocations, DMA, or the slab cache. Since<br />

the page allocator makes no attempt to group<br />

pages based on usage, it is possible in a worst<br />

case situation that every section contains one<br />

in-use page that can not be removed. Ideally,<br />

we would like to group pages based on their usage<br />

to allow the maximum number of sections<br />

to be removed.<br />

Currently, the definition of zones provides<br />

some level of grouping on specific architectures.<br />

For example, on i386, three zones are<br />

defined: DMA, NORMAL and HIGHMEM.<br />

With such definitions, one would expect most<br />

non-removable pages to be allocated out of the<br />

DMA and NORMAL zones. In addition, one<br />

would expect most HIGHMEM allocations to<br />

be associated with userspace pages and thus<br />

removable. Of course, when the page allocator<br />

is under memory pressure it is possible<br />

that zone preferences will be ignored and allocations<br />

may come from an alternate zone. It<br />

should also be noted that on some architectures,<br />

such as ppc64, only one zone (DMA) is<br />

defined. Hence, zones can not provide grouping<br />

of pages on every architecture. It appears<br />

that zones do provide some level of page<br />

grouping, but possibly not sufficient for memory<br />

hotplug.<br />

Ideally, we would like to experiment with<br />

teaching the page allocator about the use of<br />

pages it is handing out. A simple thought<br />

would be to introduce the concept of sections<br />

to the allocator. Allocations of a specific type<br />

are directed to a section that is primarily used<br />

for allocations of that same type. For example,<br />

when allocations for use within the kernel are<br />

needed the allocator will attempt to allocate the<br />

page from a section that contains other internal<br />

kernel allocations. If no such pages can be<br />

found, then a new section is marked for internal<br />

kernel allocations. In this way pages which can<br />

not be easily freed are grouped together rather<br />

than spread throughout the system. In this way<br />

the page allocator’s use of sections would be<br />

analogous to the slab cache’s use of pages.
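
One possible shape for such usage-aware placement (purely a sketch of this thought experiment, not existing allocator code):<br />

```c
#include <assert.h>

enum alloc_type { ALLOC_USER = 0, ALLOC_KERNEL = 1 };

#define NR_SECTIONS  4
#define SECTION_FREE (-1)

static int section_type[NR_SECTIONS] = {
    SECTION_FREE, SECTION_FREE, SECTION_FREE, SECTION_FREE
};

/* Prefer a section already used for this type of allocation; otherwise
 * claim a fresh one, so hard-to-free kernel pages stay grouped. */
static int pick_section(enum alloc_type type)
{
    for (int i = 0; i < NR_SECTIONS; i++)
        if (section_type[i] == (int)type)
            return i;
    for (int i = 0; i < NR_SECTIONS; i++)
        if (section_type[i] == SECTION_FREE) {
            section_type[i] = (int)type;
            return i;
        }
    return -1;   /* under pressure, types end up mixed after all */
}
```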



7 Conclusion<br />

<strong>The</strong> prevalence of hotplug-capable <strong>Linux</strong> systems<br />

is only expanding. Support for these systems<br />

will make <strong>Linux</strong> more flexible and will<br />

make additional capabilities available to other<br />

parts of the system.<br />

Legal Statement<br />

This work represents the view of the authors and<br />

does not necessarily represent the view of IBM or<br />

Intel.<br />

IBM is a trademark or registered trademark of International<br />

Business Machines Corporation in the<br />

United States and/or other countries.<br />

Intel and i386 are trademarks or registered trademarks<br />

of Intel Corporation in the United States,<br />

other countries, or both.<br />

<strong>Linux</strong> is a registered trademark of Linus Torvalds.<br />

VMware is a trademark of VMware, Inc.<br />

References<br />

[1] Five Nine at the IP Edge. http://www.iec.org/online/tutorials/five-nines<br />

[2] Barham, Paul, et al. Xen and the Art of Virtualization. Proceedings of the ACM Symposium on Operating System Principles (SOSP), October 2003.<br />

[3] Waldspurger, Carl. Memory Resource Management in VMware ESX Server. Proceedings of the USENIX Association Symposium on Operating System Design and Implementation, 2002. pp 181–194.<br />

[4] Dobson, Matthew and Gaughen, Patricia and Hohnbaum, Michael. <strong>Linux</strong> Support for NUMA Hardware. Proceedings of the Ottawa <strong>Linux</strong> Symposium. July 2003. pp 181–196.<br />

[5] Gorman, Mel. Understanding the <strong>Linux</strong> Virtual Memory Manager. Prentice Hall, NJ. 2004.<br />

[6] Martin Bligh and Dave Hansen. <strong>Linux</strong> Memory Management on Larger Machines. Proceedings of the Ottawa <strong>Linux</strong> Symposium 2003. pp 53–88.<br />

[7] Bonwick, Jeff. <strong>The</strong> Slab Allocator: An Object-Caching <strong>Kernel</strong> Memory Allocator. Proceedings of USENIX Summer 1994 Technical Conference. http://www.usenix.org/publications/library/proceedings/bos94/bonwick.html<br />
