Proceedings of the
Linux Symposium

Volume One

July 21st–24th, 2004
Ottawa, Ontario
Canada
Contents

TCP Connection Passing 9
Werner Almesberger

Cooperative Linux 23
Dan Aloni

Build your own Wireless Access Point 33
Erik Andersen

Run-time testing of LSB Applications 41
Stuart Anderson

Linux Block IO—present and future 51
Jens Axboe

Linux AIO Performance and Robustness for Enterprise Workloads 63
Suparna Bhattacharya

Methods to Improve Bootup Time in Linux 79
Tim R. Bird

Linux on NUMA Systems 89
Martin J. Bligh

Improving Kernel Performance by Unmapping the Page Cache 103
James Bottomley

Linux Virtualization on IBM Power5 Systems 113
Dave Boutcher

The State of ACPI in the Linux Kernel 121
Len Brown

Scaling Linux to the Extreme 133
Ray Bryant

Get More Device Drivers out of the Kernel! 149
Peter Chubb

Big Servers—2.6 compared to 2.4 163
Wim A. Coekaerts

Multi-processor and Frequency Scaling 167
Paul Devriendt

Dynamic Kernel Module Support: From Theory to Practice 187
Matt Domsch

e100 weight reduction program 203
Scott Feldman

NFSv4 and rpcsec_gss for linux 207
J. Bruce Fields

Comparing and Evaluating epoll, select, and poll Event Mechanisms 215
Louay Gammo

The (Re)Architecture of the X Window System 227
James Gettys

IA64-Linux perf tools for IO dorks 239
Grant Grundler

Carrier Grade Server Features in the Linux Kernel 255
Ibrahim Haddad

Demands, Solutions, and Improvements for Linux Filesystem Security 269
Michael Austin Halcrow

Hotplug Memory and the Linux VM 287
Dave Hansen
Conference Organizers

Andrew J. Hutton, Steamballoon, Inc.
Stephanie Donovan, Linux Symposium
C. Craig Ross, Linux Symposium

Review Committee

Jes Sorensen, Wild Open Source, Inc.
Matt Domsch, Dell
Gerrit Huizenga, IBM
Matthew Wilcox, Hewlett-Packard
Dirk Hohndel, Intel
Val Henson, Sun Microsystems
Jamal Hadi Salim, Znyx
Andrew Hutton, Steamballoon, Inc.

Proceedings Formatting Team

John W. Lockhart, Red Hat, Inc.

Authors retain copyright to all submitted papers, but have granted unlimited redistribution rights to all as a condition of submission.
TCP Connection Passing

Werner Almesberger
werner@almesberger.net

Abstract

tcpcp is an experimental mechanism that allows cooperating applications to pass ownership of TCP connection endpoints from one Linux host to another one. tcpcp can be used between hosts using different architectures and does not need the other endpoint of the connection to cooperate (or even to know what's going on).
1 Introduction

When designing systems for load-balancing, process migration, or fail-over, there is eventually the point where one would like to be able to "move" a socket from one machine to another one, without losing the connection on that socket, similar to file descriptor passing on a single host. Such a move operation usually involves at least three elements:

1. Moving any application space state related to the connection to the new owner. E.g., in the case of a Web server serving large static files, the application state could simply be the file name and the current position in the file.

2. Making sure that packets belonging to the connection are sent to the new owner of the socket. Normally this also means that the previous owner should no longer receive them.

3. Last but not least, creating compatible network state in the kernel of the new connection owner, such that it can resume the communication where the previous owner left off.
Figure 1: Passing one end of a TCP connection from one host to another.
Figure 1 illustrates this for the case of a client-server application, where one server passes ownership of a connection to another server. We shall call the host from which ownership of the connection endpoint is taken the origin, the host to which it is transferred the destination, and the host on the other end of the connection (which does not change) the peer.

Details of moving the application state are beyond the scope of this paper, and we will only sketch relatively simple examples. Similarly, we will mention a few ways in which the redirection in the network can be accomplished, but without going into too much detail.
The complexity of the kernel state of a network connection, and the difficulty of moving this state from one host to another, varies greatly with the transport protocol being used. Among the two major transport protocols of the Internet, UDP [1] and TCP [2], the latter clearly presents more of a challenge in this regard. Nevertheless, some issues also apply to UDP.

tcpcp (TCP Connection Passing) is a proof of concept implementation of a mechanism that allows applications to transport the kernel state of a TCP endpoint from one host to another, while the connection is established, and without requiring the peer to cooperate in any way. tcpcp is not a complete process migration or load-balancing solution, but rather a building block that can be integrated into such systems.

tcpcp consists of a kernel patch (at the time of writing for version 2.6.4 of the Linux kernel) that implements the operations for dumping and restoring the TCP connection endpoint, a library with wrapper functions (see Section 3), and a few applications for debugging and demonstration.

The project's home page is at http://tcpcp.sourceforge.net/
The remainder of this paper is organized as follows: this section continues with a description of the context in which connection passing exists. Section 2 explains the connection passing operation in detail. Section 3 introduces the APIs tcpcp provides. The information that defines a TCP connection and its state is described in Section 4. Sections 5 and 6 discuss congestion control and the limitations TCP imposes on checkpointing. Security implications of the availability and use of tcpcp are examined in Section 7. We conclude with an outlook on the future directions the work on tcpcp will take in Section 8, and the conclusions in Section 9.

The excellent "TCP/IP Illustrated" [3] is recommended for readers who wish to refresh their memory of TCP/IP concepts and terminology.
1.1 There is more than one way to do it

tcpcp is only one of several possible methods for passing TCP connections among hosts. Here are some alternatives:

In some cases, the solution is to avoid passing the "live" TCP connection, and instead to terminate the connection between the origin and the peer, and rely on higher protocol layers to establish a new connection between the destination and the peer. Drawbacks of this approach include that those higher layers need to know that they have to re-establish the connection, and that they need to do this within an acceptable amount of time. Furthermore, they may only be able to do this at a few specific points during a communication.

The use of HTTP redirection [4] is a simple example of connection passing above the transport layer.

Another approach is to introduce an intermediate layer between the application and the kernel, for the purpose of handling such redirection. This approach is fairly common in process migration solutions, such as Mosix [5], MIGSOCK [6], etc. It requires that the peer be equipped with the same intermediate layer.
1.2 Transparency

The key feature of tcpcp is that the peer can be left completely unaware that the connection is passed from one host to another. In detail, this means:

• The peer's networking stack can be used "as is," without modification and without requiring non-standard functionality

• The connection is not interrupted

• The peer does not have to stop sending

• No contradictory information is sent to the peer

• These properties apply to all protocol layers visible to the peer

Furthermore, tcpcp allows the connection to be passed at any time, without needing to synchronize the data stream with the peer.

The kernels of the hosts between which the connection is passed both need to support tcpcp, and the application(s) on these hosts will typically have to be modified to perform the connection passing.

1.3 Various uses

Application scenarios in which the functionality provided by tcpcp could be useful include load balancing, process migration, and fail-over.

In the case of load balancing, an application can send connections (and whatever processing is associated with them) to another host if the local one gets overloaded. Or, one could have a host acting as a dispatcher that performs an initial dialog and then assigns the connection to a machine in a farm.

For process migration, tcpcp would be invoked when moving a file descriptor linked to a socket. If process migration is implemented in the kernel, an interface would have to be added to tcpcp to allow calling it in this way.

Fail-over is trickier, because there is normally no prior indication of when the origin will become unavailable. We discuss the issues arising from this in more detail in Section 6.

2 Passing the connection

Figure 2 illustrates the connection passing procedure in detail.
1. The application at the origin initiates the procedure by requesting retrieval of what we call the Internal Connection Information (ICI) of a socket. The ICI contains all the information the kernel needs to re-create a TCP connection endpoint.

2. As a side-effect of retrieving the ICI, tcpcp isolates the connection: all incoming packets are silently discarded, and no packets are sent. This is accomplished by setting up a per-socket filter, and by changing the output function. Isolating the socket ensures that the state of the connection being passed remains stable at either end.

3. The kernel copies all relevant variables, plus the contents of the out-of-order and send/retransmit buffers, to the ICI. The out-of-order buffer contains TCP segments that have not been acknowledged yet, because an earlier segment is still missing.

4. After retrieving the ICI, the application empties the receive buffer. It can either process this data directly, or send it along with the other information, for the destination to process.

5. The origin sends the ICI and any relevant application state to the destination. The application at the origin keeps the socket open, to ensure that it stays isolated.

6. The destination opens a new socket. It may then bind it to a new port (there are other choices, described below).
Figure 2: Passing a TCP connection endpoint in ten easy steps.
7. The application at the destination now sets the ICI on the socket. The kernel creates and populates the necessary data structures, but does not send any data yet. The current implementation makes no use of the out-of-order data.

8. Network traffic belonging to the connection is redirected from the origin to the destination host. Scenarios for this are described in more detail below. The application at the origin can now close the socket.

9. The application at the destination makes a call to activate the connection.

10. If there is data to transmit, the kernel will do so. If there is no data, an otherwise empty ACK segment (like a window probe) is sent to wake up the peer.

Note that, at the end of this procedure, the socket at the destination is a perfectly normal TCP endpoint. In particular, this endpoint can be passed to another host (or back to the original one) with tcpcp.
2.1 Local port selection

The local port at the destination can be selected in three ways:

• The destination can simply try to use the same port as the origin. This is necessary if no address translation is performed on the connection.

• The application can bind the socket before setting the ICI. In this case, the port in the ICI is ignored (see the sketch below).

• The application can also clear the port information in the ICI, which will cause the socket to be bound to any available port. Compared to binding the socket before setting the ICI, this approach has the advantage of using the local port number space much more efficiently.

The choice of the port selection method depends on how the environment in which tcpcp operates is structured. Normally, either the first or the last method would be used.
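To illustrate the second option, the destination application can bind the new socket to an explicit local address before setting the ICI. The following fragment is only a sketch under stated assumptions: it reuses the buf/size ICI buffer from the examples in Section 3, omits all error handling and header includes, and the chosen port number is an arbitrary placeholder; the interaction with TCP_ICI is as described in the text, not verified against the tcpcp sources.

/* Sketch: destination-side port selection by binding before TCP_ICI is set.
 * Assumes "buf"/"size" hold an ICI received from the origin (see Section 3). */
int s = socket(PF_INET, SOCK_STREAM, 0);
struct sockaddr_in local;

memset(&local, 0, sizeof(local));
local.sin_family = AF_INET;
local.sin_addr.s_addr = htonl(INADDR_ANY);
local.sin_port = htons(8080);   /* explicit local port; the port in the ICI is then ignored */

bind(s, (struct sockaddr *)&local, sizeof(local));
setsockopt(s, SOL_TCP, TCP_ICI, buf, size);   /* re-create the endpoint */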
2.2 Switching network traffic

There are countless ways of redirecting IP packets from one host to another without help from the transport layer protocol. They include redirection at the link layer, ingenious modifications of how link and network layer interact [7], all kinds of tunnels, network address translation (NAT), etc.

Since many of the techniques are similar to network-based load balancing, the Linux Virtual Server Project [8] is a good starting point for exploring these issues.

While a comprehensive study of this topic is beyond the scope of this paper, we will briefly sketch an approach using a static route, because this is conceptually straightforward and relatively easy to implement.

Figure 3: Redirecting network traffic using a static route.

The scenario shown in Figure 3 consists of two servers A and B, with interfaces with the IP addresses ipA and ipB, respectively. Each server also has a virtual interface with the address ipX. ipA, ipB, and ipX are on the same subnet, and the gateway machine also has an interface on this subnet.

At the gateway, we create a static route as follows:

route add ipX gw ipA

When the client connects to the address ipX, it reaches host A. We can now pass the connection to host B, as outlined in Section 2. In Step 8, we change the static route on the gateway as follows:

route del ipX
route add ipX gw ipB

One major limitation of this approach is of course that this routing change affects all connections to ipX, which is usually undesirable. Nevertheless, this simple setup can be used to demonstrate the operation of tcpcp.
3 APIs

The API for tcpcp consists of a low-level part that is based on getting and setting socket options, and a high-level library that provides convenient wrappers for the low-level API.

We mention only the most important aspects of both APIs here. They are described in more detail in the documentation that is included with tcpcp.

3.1 Low-level API

The ICI is retrieved by getting the TCP_ICI socket option. As a side-effect, the connection is isolated, as described in Section 2. The application can determine the maximum ICI size for the connection in question by getting the TCP_MAXICISIZE socket option.

Example:

void *buf;
int ici_size;
size_t size = sizeof(int);

getsockopt(s, SOL_TCP, TCP_MAXICISIZE, &ici_size, &size);
buf = malloc(ici_size);
size = ici_size;
getsockopt(s, SOL_TCP, TCP_ICI, buf, &size);
The connection endpoint at the destination is created by setting the TCP_ICI socket option, and the connection is activated by "setting" the TCP_CP_FN socket option to the value TCPCP_ACTIVATE. (The use of a multiplexed socket option is admittedly ugly, although convenient during development.)

Example:

int sub_function = TCPCP_ACTIVATE;

setsockopt(s, SOL_TCP, TCP_ICI, buf, size);
/* ... */
setsockopt(s, SOL_TCP, TCP_CP_FN, &sub_function, sizeof(sub_function));

3.2 High-level API

These are the most important functions provided by the high-level API:

void *tcpcp_get(int s);
int tcpcp_size(const void *ici);
int tcpcp_create(const void *ici);
int tcpcp_activate(int s);

tcpcp_get allocates a buffer for the ICI, and retrieves that ICI (isolating the connection as a side-effect). The amount of data in the ICI can be queried by calling tcpcp_size on it. tcpcp_create sets an ICI on a socket, and tcpcp_activate activates the connection.
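The following sketch shows how the two halves of a connection pass might use these wrappers. It is illustrative only: the names orig_sock, ctrl_to_dest, ctrl_from_origin, send_all, and recv_all are hypothetical helpers for the application's own control channel, and it assumes (based on the signatures above) that tcpcp_create returns the descriptor of the newly created socket; consult the tcpcp documentation for authoritative usage.

/* Origin side (sketch): freeze the connection and ship its state. */
void *ici = tcpcp_get(orig_sock);            /* isolates the connection as a side-effect */
int len = tcpcp_size(ici);
send_all(ctrl_to_dest, &len, sizeof(len));   /* hypothetical control-channel helper */
send_all(ctrl_to_dest, ici, len);
/* ... send application state, wait for the destination to take over, then close(orig_sock) ... */

/* Destination side (sketch): rebuild and resume the endpoint. */
int len2;
recv_all(ctrl_from_origin, &len2, sizeof(len2));
void *ici2 = malloc(len2);
recv_all(ctrl_from_origin, ici2, len2);
int new_sock = tcpcp_create(ici2);           /* assumed to return the re-created socket */
/* ... network traffic is switched to this host (Step 8) ... */
tcpcp_activate(new_sock);                    /* Step 9: resume the connection */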
4 Describing a TCP endpoint

In this section, we describe the parameters that define a TCP connection and its state. tcpcp collects all the information it needs to re-create a TCP connection endpoint in a data structure we call Internal Connection Information (ICI). The ICI is portable among systems supporting tcpcp, irrespective of their CPU architecture.

Besides this data, the kernel maintains a large number of additional variables that can either be reset to default values at the destination (such as congestion control state), or that are only rarely used and not essential for correct operation of TCP (such as statistics).

4.1 Connection identifier

Each TCP connection in the global Internet or any private internet [9] is uniquely identified by the IP addresses of the source and destination host, and the port numbers used at both ends.

tcpcp currently only supports IPv4, but can be extended to support IPv6, should the need arise.
4.2 Fixed data

A few parameters of a TCP connection are negotiated during the initial handshake, and remain unchanged during the lifetime of the connection. These parameters include whether window scaling, timestamps, or selective acknowledgments are used, the number of bits by which the window is shifted, and the maximum segment sizes (MSS).

These parameters are used mainly for sanity checks, and to determine whether the destination host is able to handle the connection. The received MSS continues of course to limit the segment size.

Connection identifier
  ip.v4.ip_src   IPv4 address of the host on which the ICI was recorded (source)
  ip.v4.ip_dst   IPv4 address of the peer (destination)
  tcp_sport      Port at the source host
  tcp_dport      Port at the destination host

Fixed at connection setup
  tcp_flags      TCP flags (window scale, SACK, ECN, etc.)
  snd_wscale     Send window scale
  rcv_wscale     Receive window scale
  snd_mss        Maximum Segment Size at the source host
  rcv_mss        MSS at the destination host

Connection state
  state          TCP connection state (e.g. ESTABLISHED)

Sequence numbers
  snd_nxt        Sequence number of next new byte to send
  rcv_nxt        Sequence number of next new byte expected to receive

Windows (flow-control)
  snd_wnd        Window received from peer
  rcv_wnd        Window advertised to peer

Timestamps
  ts_gen         Current value of the timestamp generator
  ts_recent      Most recently received timestamp

Table 1: TCP variables recorded in tcpcp's Internal Connection Information (ICI) structure.
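For readers who prefer code to tables, the fields of Table 1 could be pictured roughly as the C structure below. This is a hypothetical rendering for illustration only: the grouping, types, and byte-order conventions are assumptions, and the real tcpcp ICI additionally carries the variable-length send/retransmit and out-of-order buffers described in Sections 4.6 and 4.7.

/* Hypothetical sketch of the fixed part of an ICI, derived from Table 1.
 * Not the actual tcpcp layout; types and ordering are illustrative. */
struct ici_sketch {
    /* connection identifier */
    uint32_t ip_src, ip_dst;        /* IPv4 addresses (recording host and peer) */
    uint16_t tcp_sport, tcp_dport;  /* ports at both ends */

    /* fixed at connection setup */
    uint16_t tcp_flags;             /* window scale, SACK, ECN, ... */
    uint8_t  snd_wscale, rcv_wscale;
    uint16_t snd_mss, rcv_mss;

    /* connection state */
    uint8_t  state;                 /* e.g. ESTABLISHED */

    /* sequence numbers */
    uint32_t snd_nxt, rcv_nxt;

    /* windows (flow control) */
    uint32_t snd_wnd, rcv_wnd;

    /* timestamps */
    uint32_t ts_gen, ts_recent;

    /* followed by the send/retransmit and out-of-order buffers (variable length) */
};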
4.3 Sequence numbers

The sequence numbers are used to synchronize all aspects of a TCP connection.

Only the sequence numbers we expect to see in the network, in either direction, are needed when re-creating the endpoint. The kernel uses several variables that are derived from these sequence numbers. The values of these variables either coincide with snd_nxt and rcv_nxt in the state we set up, or they can be calculated by examining the send buffer.
4.4 Windows (flow-control)

The (flow-control) window determines how much more data can be sent or received without overrunning the receiver's buffer.

The window the origin received from the peer is also the window we can use after re-creating the endpoint.

The window the origin advertised to the peer defines the minimum receive buffer size at the destination.
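One practical consequence of the last point is that the destination application may want to ensure its receive buffer is at least as large as the window the origin advertised before it sets the ICI. The sketch below assumes the application has learned that window size through its own control channel (the variable advertised_wnd is hypothetical); whether tcpcp also enforces this as part of its own sanity checks is not specified here.

/* Sketch: make the destination's receive buffer at least the advertised window.
 * "advertised_wnd" is a hypothetical value communicated by the origin application. */
int rcvbuf = advertised_wnd;
setsockopt(s, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));
/* ...then set TCP_ICI and activate as shown in Section 3... */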
4.5 Timestamps

TCP can use timestamps to detect old segments with wrapped sequence numbers [10]. This mechanism is called Protect Against Wrapped Sequence numbers (PAWS).

Linux uses a global counter (tcp_time_stamp) to generate local timestamps. If a moved connection were to use the counter at the new host, local round-trip-time calculation may be confused when receiving timestamp replies from the previous connection, and the peer's PAWS algorithm will discard segments if timestamps appear to have jumped back in time.

Just turning off timestamps when moving the connection is not an acceptable solution, even though [10] seems to allow TCP to just stop sending timestamps, because doing so would bring back the problem PAWS tries to solve in the first place, and it would also reduce the accuracy of round-trip-time estimates, possibly degrading the throughput of the connection.

A more satisfying solution is to synchronize the local timestamp generator. This is accomplished by introducing a per-connection timestamp offset that is added to the value of tcp_time_stamp. This calculation is hidden in the macro tp_time_stamp(tp), which just becomes tcp_time_stamp if the kernel is configured without tcpcp.

The addition of the timestamp offset is the only major change tcpcp requires in the existing TCP/IP stack.
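A minimal sketch of what such a macro could look like is shown below. The name of the per-connection offset field (ts_offset) and the configuration symbol are assumptions made for illustration; only the macro name tp_time_stamp and its fallback to tcp_time_stamp are taken from the text.

/* Sketch of the per-connection timestamp offset described above.
 * "ts_offset" and CONFIG_TCPCP are placeholder names, not the actual patch. */
#ifdef CONFIG_TCPCP
#define tp_time_stamp(tp)   (tcp_time_stamp + (tp)->ts_offset)
#else
#define tp_time_stamp(tp)   tcp_time_stamp
#endif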
4.6 Receive buffers

There are two buffers at the receiving side: the buffer containing segments received out-of-order (see Section 2), and the buffer with data that is ready for retrieval by the application.

tcpcp currently ignores both buffers: the out-of-order buffer is copied into the ICI, but not used when setting up the new socket. Any data in the receive buffer is left for the application to read and process.

4.7 Send buffer

The send and retransmit buffer contains data that is no longer accessible through the socket API, and that cannot be discarded. It is therefore placed in the ICI, and used to populate the send buffer at the destination.
4.8 Selective acknowledgments

In Section 5 of [11], the use of inbound SACK information is left optional. tcpcp takes advantage of this, and preserves neither the SACK information collected from inbound segments, nor the history of SACK information sent to the peer.

Outbound SACKs convey information about the receiver's out-of-order queue. Fortunately, [11] declares this information as purely advisory. In particular, if reception of data has been acknowledged with a SACK, this does not imply that the receiver has to remember having done so. First, it can request retransmission of this data, and second, when constructing new SACKs, the receiver is encouraged to include information from previous SACKs, but is under no obligation to do so.

Therefore, while [11] discourages losing SACK information, doing so does not violate its requirements.

Losing SACK information may temporarily degrade the throughput of the TCP connection. This is currently of little concern, because tcpcp forces the connection into slow start, which has even more drastic performance implications.
SACK recovery may need to be reconsidered once tcpcp implements more sophisticated congestion control.
4.9 Other data

The TCP connection state is currently always ESTABLISHED. It may be useful to also allow passing connections in earlier states, e.g. SYN_RCVD. This is for further study.

Congestion control data and statistics are currently omitted. The new connection starts with slow start, to allow TCP to discover the characteristics of the new path to the peer.
5 Congestion control

Most of the complexity of TCP is in its congestion control. tcpcp currently avoids touching congestion control almost entirely, by setting the destination to slow start.

This is a highly conservative approach that is appropriate if knowing the characteristics of the path between the origin and the peer does not give us any information on the characteristics of the path between the destination and the peer, as shown in the lower part of Figure 4. However, if the characteristics of the two paths can be expected to be very similar, e.g. if the hosts passing the connection are on the same LAN, better performance could be achieved by allowing tcpcp to resume the connection at or nearly at full speed.

Re-establishing congestion control state is for further study. To avoid abuse, such an operation can be made available only to sufficiently trusted applications.

Figure 4: Depending on the structure of the network, the congestion control state of the original connection may or may not be reused.
6 Checkpointing

tcpcp is primarily designed for scenarios where the old and the new connection owner are both functional during the process of connection passing.

A similar usage scenario would be if the node owning the connection occasionally retrieves ("checkpoints") the momentary state of the connection, and, after a failure of the connection owner, another node then uses the checkpoint data to resurrect the connection.

While apparently similar to connection passing, checkpointing presents several problems which we discuss in this section. Note that this is speculative and that the current implementation of tcpcp does not support any of the extensions discussed here.

We consider the send and receive flow of the TCP connection separately, and we assume that sequence numbers can be directly translated to application state (e.g. when transferring a file, application state consists only of the actual file position, which can be trivially mapped to and from TCP sequence numbers). Furthermore, we assume the connection to be in ESTABLISHED state at both ends.
6.1 Outbound data

One or more of the following events may occur between the last checkpoint and the moment the connection is resurrected:

• the sender may have enqueued more data

• the receiver may have acknowledged more data

• the receiver may have retrieved more data, thereby growing its window

Assuming that no additional data has been received from the peer, the new sender can simply re-transmit the last segment. (Alternatively, tcp_xmit_probe_skb might be useful for the same purpose.) In this case, the following protocol violations can occur:

• The sequence number may have wrapped. This can be avoided by making sure that a checkpoint is never older than the Maximum Segment Lifetime (MSL), which [2] specifies as two minutes, and that less than 2^31 bytes are sent between checkpoints (a rough numeric example follows this list).

• If using PAWS, the timestamp may be below the last timestamp sent by the old sender. The best solution for avoiding this is probably to tightly synchronize the clocks on the old and the new connection owner, and to make a conservative estimate of the number of ticks of the local timestamp clock that have passed since taking the checkpoint. This assumes that the timestamp clock ticks roughly in real time.
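To give a rough sense of scale for the first constraint: 2^31 bytes is about 2.1 GB. At 100 Mbit/s, sending that much data takes roughly three minutes, so the two-minute MSL bound is reached first; at 1 Gbit/s it takes only about 17 seconds, so on fast links the byte limit, rather than the MSL, dictates how frequently checkpoints must be taken. (These figures are only illustrative back-of-the-envelope estimates.)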
Since new data in the segment sent after resurrecting the connection cannot exceed the receiver's window, the only possible outcomes are that the segment contains either new data, or only old data. In either case, the receiver will acknowledge the segment.

Upon reception of an acknowledgment, either in response to the retransmitted segment, or from a packet in flight at the time when the connection was resurrected, the sender knows how far the connection state has advanced since the checkpoint was taken.
If the sequence number from the acknowledgment is below snd_nxt, no special action is necessary. If the sequence number is above snd_nxt, the sender would exceptionally treat this as a valid acknowledgment. (Note that this exceptional condition does not necessarily have to occur with the first acknowledgment received.)

As a possible performance improvement, the sender may notify the application once a new sequence number has been received, and the application could then skip over unnecessary data.

6.2 Inbound data

The main problem with checkpointing of incoming data is that TCP will acknowledge data that has not yet been retrieved by the application. Therefore, checkpointing would have to delay outbound acknowledgments until the application has actually retrieved the data, and has checkpointed the resulting state change.
To intercept all types of ACKs, tcp_transmit_skb would have to be changed to send tp->copied_seq instead of tp->rcv_nxt. Furthermore, a new API function would be needed to trigger an explicit acknowledgment after the data has been stored or processed.
Putting acknowledgments under application control would change their timing. This may upset the round-trip time estimation of the peer, and it may also cause it to falsely assume changes in the congestion level along the path.
7 Security

tcpcp bypasses various sets of access and consistency checks normally performed when setting up TCP connections. This section analyzes the overall security impact of tcpcp.

7.1 Two lines of defense

When setting TCP_ICI, the kernel has no means of verifying that the connection information actually originates from a compatible system. Users may therefore manipulate connection state, copy connection state from arbitrary other systems, or even synthesize connection state according to their wishes. tcpcp provides two mechanisms to protect against intentional or accidental misuse:

1. tcpcp takes as little information as possible from the user, and re-generates as much of the state related to the TCP connection (such as neighbour and destination data) as possible from local information. Furthermore, it performs a number of sanity checks on the ICI, to ensure its integrity, and compatibility with constraints of the local system (such as buffer size limits and kernel capabilities).

2. Many manipulations possible through tcpcp can be shown to be available through other means if the application has the CAP_NET_RAW capability. Therefore, establishing a new TCP connection with tcpcp also requires this capability. This can be relaxed on a host-wide basis.
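Inside the kernel, this kind of privilege test is conventionally expressed with the capable() helper; a sketch of how the check described in point 2 might look is shown below. The toggle tcpcp_unrestricted is a made-up name standing in for whatever host-wide switch tcpcp actually provides.

/* Sketch of the privilege check described above (names are illustrative). */
if (!tcpcp_unrestricted && !capable(CAP_NET_RAW))
        return -EPERM;   /* setting TCP_ICI is a privileged operation by default */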
7.2 Retrieval of sensitive kernel data

Getting TCP_ICI may retrieve information from the kernel that one would like to hide from unprivileged applications, e.g. details about the state of the TCP ISN generator. Since the equally unprivileged TCP_INFO already gives access to most TCP connection metadata, tcpcp does not create any new vulnerabilities.

7.3 Local denial of service

Setting TCP_ICI could be used to introduce inconsistent data into the TCP stack, or the kernel in general. Preventing this relies on the correctness and completeness of the sanity checks mentioned before.

tcpcp can be used to accumulate stale data in the kernel. However, this is not very different from, e.g., creating a large number of unused sockets, or letting buffers fill up in TCP connections, and therefore poses no new security threat.

tcpcp can be used to shut down connections belonging to third-party applications, provided that the usual access restrictions grant access to copies of their socket descriptors. This is similar to executing shutdown on such sockets, and is therefore believed to pose no new threat.
7.4 Restricted state transitions

tcpcp could be used to advance TCP connection state past boundaries imposed by internal or external control mechanisms. In particular, conspiring applications may create TCP connections without ever exchanging SYN packets, bypassing SYN-filtering firewalls. Since SYN-filtering firewalls can already be circumvented by privileged applications, sites depending on SYN-filtering firewalls should use the default setting of tcpcp, which makes its use a privileged operation as well.

7.5 Attacks on remote hosts

The ability to set TCP_ICI makes it easy to commit all kinds of protocol violations. While tcpcp may simplify implementing such attacks, this type of abuse has always been possible for privileged users, and therefore tcpcp poses no new security threat to systems properly resistant against network attacks.

However, if a site has systems where only trusted users are able to communicate with otherwise shielded systems with known remote TCP vulnerabilities, tcpcp could be used for attacks. Such sites should use the default setting, which makes setting TCP_ICI a privileged operation.
7.6 Security summary

To summarize, the author believes that the design of tcpcp does not open any new exploits if tcpcp is used in its default configuration.

Obviously, some subtleties have probably been overlooked, and there may be bugs inadvertently leading to vulnerabilities. Therefore, tcpcp should receive public scrutiny before being considered fit for regular use.
8 Future work

To allow faster connection passing among hosts that share the same, or a very similar, path to the peer, tcpcp should try to avoid going to slow start. To do so, it will have to pass more congestion control information, and integrate it properly at the destination.

Although not strictly part of tcpcp, the redirection apparatus for the network should be further extended, in particular to allow individual connections to be redirected there too, and to include some middleware that coordinates the redirecting with the changes at the hosts passing the connection.

It would be very interesting if connection passing could also be used for checkpointing. The analysis in Section 6 suggests that at least limited checkpointing capabilities should be feasible without interfering with regular TCP operation.

The inner workings of TCP are complex and easily disturbed. It is therefore important to subject tcpcp to thorough testing, in particular in transient states, such as during recovery from lost segments. The umlsim simulator [12] makes it possible to generate such conditions in a deterministic way, and will be used for these tests.
9 Conclusion

tcpcp is a proof of concept implementation that successfully demonstrates that an endpoint of a TCP connection can be passed from one host to another without involving the host at the opposite end of the TCP connection. tcpcp also shows that this can be accomplished with a relatively small amount of kernel changes.

tcpcp in its present form is suitable for experimental use as a building block for load balancing and process migration solutions. Future work will focus on improving the performance of tcpcp, on validating its correctness, and on exploring checkpointing capabilities.
References

[1] RFC768; Postel, Jon. User Datagram Protocol, IETF, August 1980.

[2] RFC793; Postel, Jon. Transmission Control Protocol, IETF, September 1981.

[3] Stevens, W. Richard. TCP/IP Illustrated, Volume 1 – The Protocols, Addison-Wesley, 1994.

[4] RFC2616; Fielding, Roy T.; Gettys, James; Mogul, Jeffrey C.; Frystyk Nielsen, Henrik; Masinter, Larry; Leach, Paul J.; Berners-Lee, Tim. Hypertext Transfer Protocol – HTTP/1.1, IETF, June 1999.

[5] Bar, Moshe. OpenMosix, Proceedings of the 10th International Linux System Technology Conference (Linux-Kongress 2003), pp. 94–102, October 2003.

[6] Kuntz, Bryan; Rajan, Karthik. MIGSOCK – Migratable TCP Socket in Linux, CMU, M.Sc. Thesis, February 2002. http://www-2.cs.cmu.edu/~softagents/migsock/MIGSOCK.pdf

[7] Leite, Fábio Olivé. Load-Balancing HA Clusters with No Single Point of Failure, Proceedings of the 9th International Linux System Technology Conference (Linux-Kongress 2002), pp. 122–131, September 2002. http://www.linux-kongress.org/2002/papers/lk2002-leite.html

[8] Linux Virtual Server Project, http://www.linuxvirtualserver.org/

[9] RFC1918; Rekhter, Yakov; Moskowitz, Robert G.; Karrenberg, Daniel; de Groot, Geert Jan; Lear, Eliot. Address Allocation for Private Internets, IETF, February 1996.

[10] RFC1323; Jacobson, Van; Braden, Bob; Borman, Dave. TCP Extensions for High Performance, IETF, May 1992.

[11] RFC2018; Mathis, Matt; Mahdavi, Jamshid; Floyd, Sally; Romanow, Allyn. TCP Selective Acknowledgement Options, IETF, October 1996.

[12] Almesberger, Werner. UML Simulator, Proceedings of the Ottawa Linux Symposium 2003, July 2003. http://archive.linuxsymposium.org/ols2003/Proceedings/All-Reprints/Reprint-Almesberger-OLS2003.pdf
Cooperative Linux

Dan Aloni
da-x@colinux.org

Abstract

In this paper I'll describe Cooperative Linux, a port of the Linux kernel that allows it to run as an unprivileged lightweight virtual machine in kernel mode, on top of another OS kernel. It allows Linux to run under any operating system that supports loading drivers, such as Windows or Linux, after minimal porting efforts. The paper includes the present and future implementation details, its applications, and its comparison with other Linux virtualization methods. Among the technical details I'll present the CPU-complete context switch code, hardware interrupt forwarding, the interface between the host OS and Linux, and the management of the VM's pseudo physical RAM.
1 Introduction

Cooperative Linux utilizes the rather underused concept of a Cooperative Virtual Machine (CVM), in contrast to traditional VMs that are unprivileged and under the complete control of the host machine.

The term Cooperative is used to describe two entities working in parallel, e.g. coroutines [2]. In that sense the most plain description of Cooperative Linux is turning two operating system kernels into two big coroutines. In that mode, each kernel has its own complete CPU context and address space, and each kernel decides when to give control back to its partner. However, only one of the two kernels has control over the physical hardware; the other is provided only with a virtual hardware abstraction. From this point on in the paper I'll refer to these two kernels as the host operating system and the guest Linux VM, respectively. The host can be any OS kernel that exports the basic primitives that allow the Cooperative Linux portable driver to run in CPL0 mode (ring 0) and allocate memory.

The special CPL0 approach in Cooperative Linux makes it significantly different from traditional virtualization solutions such as VMware, plex86, Virtual PC, and other methods such as Xen. All of these approaches work by running the guest OS in a less privileged mode than that of the host kernel. The CPL0 approach allowed for the extensive simplification of Cooperative Linux's design and its short early-beta development cycle, which lasted only one month, starting from scratch by modifying the vanilla Linux 2.4.23-pre9 release until reaching the point where KDE could run.

The only downsides to the CPL0 approach are stability and security. If it's unstable, it has the potential to crash the system. However, measures can be taken, such as cleanly shutting it down on the first internal Oops or panic. Another disadvantage is security. Acquiring root user access on a Cooperative Linux machine can potentially lead to root on the host machine if the attacker loads a specially crafted kernel module or, in case the Cooperative Linux kernel was compiled without module support, uses some very elaborate exploit.
Most of the changes in the Cooperative Linux patch are in the i386 tree—the only architecture supported by Cooperative Linux at the time of this writing. The other changes are mostly additions of virtual drivers: cobd (block device), conet (network), and cocon (console). Most of the changes in the i386 tree involve the initialization and setup code. It is a goal of the Cooperative Linux kernel design to remain as close as possible to the standalone i386 kernel, so all changes are localized and minimized as much as possible.
2 Uses

Cooperative Linux in its current early state can already provide some of the uses that User Mode Linux [1] provides, such as virtual hosting, kernel development environment, research, and testing of new distributions or buggy software. It also enables new uses:

• Relatively effortless migration path from Windows. In the process of switching to another OS, there is the choice between installing another computer, dual-booting, or using virtualization software. The first option costs money, the second is tiresome in terms of operation, but the third can be the quickest and easiest method—especially if it's free. This is where Cooperative Linux comes in. It is already used in workplaces to convert Windows users to Linux.

• Adding Windows machines to Linux clusters. The Cooperative Linux patch is minimal and can be easily combined with others such as the MOSIX or OpenMosix patches that add clustering capabilities to the kernel. This work in progress makes it possible to add Windows machines to super-computer clusters; one illustration would be a secretary's workstation that runs Cooperative Linux as a screen saver—when the secretary goes home at the end of the day and leaves the computer unattended, the office's cluster gets more CPU cycles for free.

• Running an otherwise-dual-booted Linux system from the other OS. The Windows port of Cooperative Linux allows it to mount real disk partitions as block devices. Numerous people are using this in order to access, rescue, or just run their Linux system from their ext3 or reiserfs file systems.

• Using Linux as a Windows firewall on the same machine. As a likely competitor to other out-of-the-box Windows firewalls, iptables along with a stripped-down Cooperative Linux system can potentially serve as a network firewall.

• Linux kernel development / debugging / research and study on another operating system. Digging inside a running Cooperative Linux kernel, you can hardly tell the difference between it and a standalone Linux. All virtual addresses are the same—Oops reports look familiar and the architecture-dependent code works in the same manner, except for some transparent conversions, which are described in the next section of this paper.

• Development environment for porting to and from Linux.
3 Design Overview

In this section I'll describe the basic methods behind Cooperative Linux, which include complete context switches, handling of hardware interrupts by forwarding, physical address translation, and the pseudo physical RAM.

3.1 Minimum Changes

To illustrate the minimal effect of the Cooperative Linux patch on the source tree, here is a diffstat listing of the patch on Linux 2.4.26 as of May 10, 2004:
CREDITS | 6<br />
Documentation/devices.txt | 7<br />
Makefile | 8<br />
arch/i386/config.in | 30<br />
arch/i386/kernel/Makefile | 2<br />
arch/i386/kernel/cooperative.c | 181 +++++<br />
arch/i386/kernel/head.S | 4<br />
arch/i386/kernel/i387.c | 8<br />
arch/i386/kernel/i8259.c | 153 ++++<br />
arch/i386/kernel/ioport.c | 10<br />
arch/i386/kernel/process.c | 28<br />
arch/i386/kernel/setup.c | 61 +<br />
arch/i386/kernel/time.c | 104 +++<br />
arch/i386/kernel/traps.c | 9<br />
arch/i386/mm/fault.c | 4<br />
arch/i386/mm/init.c | 37 +<br />
arch/i386/vmlinux.lds | 82 +-<br />
drivers/block/Config.in | 4<br />
drivers/block/Makefile | 1<br />
drivers/block/cobd.c | 334 ++++++++++<br />
drivers/block/ll_rw_blk.c | 2<br />
drivers/char/Makefile | 4<br />
drivers/char/colx_keyb.c | 1221 +++++++++++++*<br />
drivers/char/mem.c | 8<br />
drivers/char/vt.c | 8<br />
drivers/net/Config.in | 4<br />
drivers/net/Makefile | 1<br />
drivers/net/conet.c | 205 ++++++<br />
drivers/video/Makefile | 4<br />
drivers/video/cocon.c | 484 +++++++++++++++<br />
include/asm-i386/cooperative.h | 175 +++++<br />
include/asm-i386/dma.h | 4<br />
include/asm-i386/io.h | 27<br />
include/asm-i386/irq.h | 6<br />
include/asm-i386/mc146818rtc.h | 7<br />
include/asm-i386/page.h | 30<br />
include/asm-i386/pgalloc.h | 7<br />
include/asm-i386/pgtable-2level.h | 8<br />
include/asm-i386/pgtable.h | 7<br />
include/asm-i386/processor.h | 12<br />
include/asm-i386/system.h | 8<br />
include/linux/console.h | 1<br />
include/linux/cooperative.h | 317 +++++++++<br />
include/linux/major.h | 1<br />
init/do_mounts.c | 3<br />
init/main.c | 9<br />
kernel/Makefile | 2<br />
kernel/cooperative.c | 254 +++++++<br />
kernel/panic.c | 4<br />
kernel/printk.c | 6<br />
50 files changed, 3828 insertions(+), 74 deletions(-)<br />
3.2 Device Driver

The device driver port of Cooperative Linux is used for accessing kernel mode and using the kernel primitives that are exported by the host OS kernel. Most of the driver is OS-independent code that interfaces with the OS-dependent primitives, which include page allocations, debug printing, and interfacing with user space.

When a Cooperative Linux VM is created, the driver loads a kernel image from a vmlinux file that was compiled from the patched kernel with CONFIG_COOPERATIVE. The vmlinux file doesn't need any cross-platform tools in order to be generated, and the same vmlinux file can be used to run a Cooperative Linux VM on several OSes of the same architecture.

The VM is associated with a per-process resource—a file descriptor in Linux, or a device handle in Windows. This association has a purpose: if the process running the VM ends abnormally in any way, all resources are cleaned up automatically from a callback when the system frees the per-process resource.
3.3 Pseudo Physical RAM

In Cooperative Linux, we had to work around the Linux MM design assumption that the entire physical RAM is bestowed upon the kernel on startup, and instead only give Cooperative Linux a fixed set of physical pages, and then only do the translations needed for it to work transparently in that set. All the memory which Cooperative Linux considers as physical is in that allocated set, which we call the Pseudo Physical RAM.

The memory is allocated in the host OS using the appropriate kernel function—alloc_pages() in Linux and MmAllocatePagesForMdl() in Windows—so it is not mapped in any address space on the host, in order not to waste PTEs (a sketch of such an allocation hook follows below).
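The lines below sketch how a host-independent allocation hook of this kind might dispatch to the two functions named above. It is an illustration only: the wrapper names, their signatures, and the configuration symbols are made up, and the real coLinux driver code is organized differently.

/* Hypothetical sketch of a host-independent page-allocation hook; the real
 * coLinux driver is structured differently. */
#ifdef HOST_OS_LINUX
/* Linux host: allocate one resident page; it is never mapped into a host
 * address space, so no PTEs are spent on it. */
static struct page *co_alloc_host_page(void)
{
        return alloc_pages(GFP_KERNEL, 0);
}
#else /* Windows host */
/* Windows host: MmAllocatePagesForMdl returns an MDL describing resident
 * pages, again without mapping them anywhere. */
static PMDL co_alloc_host_pages(SIZE_T bytes)
{
        PHYSICAL_ADDRESS low, high, skip;

        low.QuadPart = 0;
        high.QuadPart = -1;     /* any physical address */
        skip.QuadPart = 0;
        return MmAllocatePagesForMdl(low, high, skip, bytes);
}
#endif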
The allocated pages are always resident and not freed until the VM is downed. Page tables
--- linux/include/asm-i386/pgtable-2level.h	2004-04-20 08:04:01.000000000 +0300
+++ linux/include/asm-i386/pgtable-2level.h	2004-05-09 16:54:09.000000000 +0300
@@ -58,8 +58,14 @@
 }
 #define ptep_get_and_clear(xp)	__pte(xchg(&(xp)->pte_low, 0))
 #define pte_same(a, b)		((a).pte_low == (b).pte_low)
-#define pte_page(x)		(mem_map+((unsigned long)(((x).pte_low >> PAGE_SHIFT))))
 #define pte_none(x)		(!(x).pte_low)
+
+#ifndef CONFIG_COOPERATIVE
+#define pte_page(x)		(mem_map+((unsigned long)(((x).pte_low >> PAGE_SHIFT))))
 #define __mk_pte(page_nr,pgprot) __pte(((page_nr)
dress space differently, such that one virtual address<br />
can contain a kernel mapped page in one<br />
OS and a user mapped page in another.<br />
In Cooperative Linux the problem was solved by using an intermediate address space during the switch (referred to as the ‘passage page,’ see Figure 1). The intermediate address space is defined by specially created page tables in both the guest and host contexts, and it maps the code used for the switch (the passage code) at both of the virtual addresses involved. When a switch occurs, CR3 is first changed to point to the intermediate address space. Then, EIP is relocated to the other mapping of the passage code using a jump. Finally, CR3 is changed to point to the top page table directory of the other OS.
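The three steps can be sketched roughly as follows for i386. This is an illustration only, not the actual passage code: it assumes interrupts are already disabled, that the stack stays reachable in all three address spaces, and that delta is the distance between the two virtual mappings of the code.

static void __attribute__((noinline))
passage_switch(unsigned long intermediate_cr3,
               unsigned long target_cr3, long delta)
{
        __asm__ __volatile__(
                /* 1: switch to the intermediate address space; the passage
                 *    code is mapped at both virtual addresses there. */
                "movl %[mid], %%cr3\n\t"
                /* 2: relocate EIP to the other mapping with a jump. */
                "movl $1f, %%eax\n\t"
                "addl %[delta], %%eax\n\t"
                "jmp  *%%eax\n"
                "1:\n\t"
                /* 3: switch to the top page table directory of the other OS. */
                "movl %[dst], %%cr3\n\t"
                :
                : [mid] "r" (intermediate_cr3),
                  [dst] "r" (target_cr3),
                  [delta] "r" (delta)
                : "eax", "cc", "memory");
}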
The single MMU page that contains the passage page code also contains the saved state of one OS while the other is executing. At the beginning of a switch, interrupts are turned off and the current state is saved to the passage page by the passage page code. The state includes all the general purpose registers, the segment registers, the interrupt descriptor table register (IDTR), the global descriptor table register (GDTR), the local descriptor table register (LDTR), the task register (TR), and the state of the FPU / MMX / SSE registers. In the middle of the passage page code, the state of the other OS is restored and interrupts are turned back on. This process is akin to a "normal" process-to-process context switch.
Since control is returned to the host OS on every<br />
hardware interrupt (described in the following<br />
section), it is the responsibility of the host<br />
OS scheduler to give time slices to the Cooperative<br />
<strong>Linux</strong> VM just as if it was a regular process.<br />
Figure 1: Address space transition during an<br />
OS cooperative kernel switch, using an intermapped<br />
page<br />
3.5 Interrupt Handling and Forwarding<br />
Since a complete MMU context switch also involves the IDTR, Cooperative Linux must set up its own interrupt vector table in order to handle the hardware interrupts that occur in the system while it is running. However, Cooperative Linux only forwards the invocations of interrupts to the host OS, because the latter needs to know about these interrupts in order to keep functioning and to support the colinux-daemon process itself, regardless of the fact that external hardware interrupts are meaningless to the Cooperative Linux virtual machine.
<strong>The</strong> interrupt vectors for the internal processor<br />
exceptions (0x0–0x1f) and the system call vector<br />
(0x80) are kept as they are so that Cooperative
<strong>Linux</strong> handles its own page faults and<br />
other exceptions, but the other interrupt vectors<br />
point to special proxy ISRs (interrupt service<br />
routines). When such an ISR is invoked during<br />
the Cooperative <strong>Linux</strong> context by an external<br />
hardware interrupt, a context switch is made to<br />
the host OS using the passage code. On the
other side, the address of the relevant ISR of the host OS is determined by looking at its IDT. An interrupt call stack is forged and a jump is made to that address. Between the invocation of the ISR on the Linux side and the handling of the interrupt on the host side, the interrupt flag is kept disabled.
The operation adds a tiny latency to interrupt handling in the host OS, but it is negligible. Since this interrupt forwarding technique also covers the hardware timer interrupt, the host OS cannot detect that its CR3 was hijacked for a moment, and therefore no exceptions occur on the host side as a result of the context switch.
To provide interrupts for the virtual device drivers of the guest Linux, the changes in the arch code include a virtual interrupt controller which receives messages from the host OS whenever a switch occurs and invokes do_IRQ() with a forged struct pt_regs. The interrupt numbers are virtual and allocated on a per-device basis.
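The dispatch can be pictured roughly as below. This is only a sketch: co_get_pending_irq() is a hypothetical name for draining the host's message queue, the do_IRQ() prototype shown is the 2.6-style pointer form, and the exact pt_regs encoding of the vector differs between kernel versions.

#include <linux/string.h>
#include <asm/ptrace.h>

unsigned int do_IRQ(struct pt_regs *regs);   /* 2.6-style prototype, for clarity */
int co_get_pending_irq(void);                /* hypothetical: next queued host IRQ, or -1 */

static void co_deliver_virtual_irqs(void)
{
        int irq;

        while ((irq = co_get_pending_irq()) >= 0) {
                struct pt_regs regs;

                memset(&regs, 0, sizeof(regs));
                regs.orig_eax = irq;   /* illustrative; the real encoding is version-specific */
                do_IRQ(&regs);         /* hand the virtual interrupt to the guest kernel */
        }
}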
4 Benchmarks And Performance<br />
4.1 Dbench results<br />
This section shows a comparison between User Mode Linux and Cooperative Linux. The machine on which the following results were generated is a 2.8GHz Pentium 4 with HT enabled, 512MB RAM, and a 120GB SATA Maxtor hard-drive that hosts ext3 partitions. The comparison was performed using the dbench 1.3-2 package of Debian on all setups.
<strong>The</strong> host machine runs the <strong>Linux</strong> 2.6.6 kernel<br />
patched with SKAS support. <strong>The</strong> UML kernel<br />
is <strong>Linux</strong> 2.6.4 that runs with 32MB of RAM,<br />
and is configured to use SKAS mode. <strong>The</strong> Cooperative<br />
<strong>Linux</strong> kernel is a <strong>Linux</strong> 2.4.26 kernel<br />
and it is configured to run with 32MB of RAM,<br />
same as the UML system. <strong>The</strong> root file-system<br />
of both UML and Cooperative <strong>Linux</strong> machines<br />
is the same host <strong>Linux</strong> file that contains an ext3<br />
image of a 0.5GB minimized Debian system.<br />
The commands ‘dbench 1’, ‘dbench 3’, and ‘dbench 10’ were each run 3 consecutive times on the host Linux, UML, and Cooperative Linux setups. The results are shown in Table 2, Table 3, and Table 4.
System    Throughput   Netbench
            43.813      54.766
Host        50.117      62.647
            44.128      55.160
            10.418      13.022
UML          9.408      11.760
             9.309      11.636
            10.418      13.023
coLinux     12.574      15.718
            12.075      15.094

Table 2: output of dbench 10 (units are in MB/sec)

System    Throughput   Netbench
            43.287      54.109
Host        41.383      51.729
            59.965      74.956
            11.857      14.821
UML         15.143      18.929
            14.602      18.252
            24.095      30.119
coLinux     32.527      40.659
            36.423      45.528

Table 3: output of dbench 3 (units are in MB/sec)
4.2 Understanding the results<br />
From the results of these runs, ‘dbench 10’, ‘dbench 3’, and ‘dbench 1’ show a 20%, 123%, and 303% increase respectively, compared to UML. The improvement relative to UML grows as the number of dbench threads shrinks, which is a result of the synchronous implementation of cobd 1.
System    Throughput   Netbench
           158.205     197.756
Host       182.191     227.739
           179.047     223.809
            15.351      19.189
UML         16.691      20.864
            16.180      20.226
            45.592      56.990
coLinux     72.452      90.565
           106.952     133.691

Table 4: output of dbench 1 (units are in MB/sec)
Yet, setting aside the different versions of the kernels compared, Cooperative Linux achieves much better results, probably because of its low overhead for context switching and page faulting in the guest Linux VM.
The current implementation of the cobd driver performs synchronous file reads and writes directly from the kernel of the host Linux—no host user space is involved, so there is less context switching and copying. As for copying, the cobd implementation on the host Linux side benefits from the fact that filp->f_op->read() is called directly on the cobd driver's request buffer after mapping it with kmap(). Reimplementing this driver asynchronously on both the host and the guest could improve performance further.
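As a rough illustration of that read path (assuming a 2.4/2.6-era kernel; cobd_read_page() and the surrounding request handling are invented for this sketch and are not taken from the coLinux sources):

#include <linux/fs.h>
#include <linux/highmem.h>
#include <asm/uaccess.h>

/* Read one page of the backing file synchronously into a request page,
 * calling the host file system directly from kernel context. */
static ssize_t cobd_read_page(struct file *filp, struct page *page,
                              loff_t pos, size_t len)
{
        char *buf = kmap(page);            /* map the request buffer */
        mm_segment_t old_fs = get_fs();
        ssize_t ret;

        set_fs(KERNEL_DS);                 /* permit a kernel buffer in ->read() */
        ret = filp->f_op->read(filp, buf, len, &pos);
        set_fs(old_fs);

        kunmap(page);
        return ret;
}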
Unlike UML, Cooperative Linux can benefit in terms of performance from the implementation of kernel-to-kernel driver bridges such as cobd. For example, virtual Ethernet in Cooperative Linux is currently done similarly to UML—i.e., using user space daemons with tuntap on the host. If we instead create a kernel-to-kernel implementation with no user space daemons in between, Cooperative Linux has the potential to achieve much better benchmark results.
1 The UML equivalent of cobd is ubd.
5 Planned Features<br />
Since Cooperative <strong>Linux</strong> is a new project<br />
(2004–), most of its features are still waiting<br />
to be implemented.<br />
5.1 Suspension<br />
Software-suspending Linux is a challenge on standalone Linux systems, considering that the entire state of the hardware needs to be saved and restored, along with finding space for storing the suspended image. On User Mode Linux, suspending [3] is easier—only the state of a few processes needs saving, and no hardware is involved.
However, in Cooperative Linux it will be even easier to implement suspension, because it involves almost only the VM's internal state. The procedure will involve serializing the pseudo physical RAM by enumerating all the page table entries that are used in Cooperative Linux—either by itself (the user space and vmalloc page tables) or for itself (the page tables of the pseudo physical RAM)—and changing them to contain the pseudo value instead of the real value.
The purpose of this suspension procedure is to ensure that no notion of the real physical memory remains in any of the pages allocated for the Cooperative Linux VM, since Cooperative Linux will be given a different set of pages when it resumes at a later time. In the suspended state, the pages can be saved to a file and the VM can be resumed later. Resuming will involve loading that file, allocating the memory, and re-enumerating all the page tables so that the values in the page table entries point to the newly allocated memory.
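Under this strategy, the resume-time fix-up could look roughly like the sketch below; new_real_pfn and fixup_pte_page() are invented names used only to illustrate the idea.

#include <linux/mm.h>
#include <asm/pgtable.h>

/* new_real_pfn[i] is the host page frame newly allocated for pseudo PFN i;
 * the array and the function below are illustrative names only. */
extern unsigned long *new_real_pfn;

/* Rewrite one page table page on resume: each present entry still holds the
 * pseudo value stored at suspend time, and is pointed back at the freshly
 * allocated real page that now backs it. */
static void fixup_pte_page(pte_t *ptes)
{
        int i;

        for (i = 0; i < PTRS_PER_PTE; i++) {
                pte_t pte = ptes[i];

                if (!pte_present(pte))
                        continue;
                set_pte(&ptes[i],
                        pfn_pte(new_real_pfn[pte_pfn(pte)],
                                __pgprot(pte_val(pte) & ~PAGE_MASK)));
        }
}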
Another implementation strategy would be to dump everything as-is on suspension, and on resume to enumerate all the page table entries and adjust them from the values of the old RPPFNs 2 to the new RPPFNs.
2 Real physical page frame numbers
Note that a suspended image could be created<br />
under one host OS and be resumed in another<br />
host OS of the same architecture. <strong>One</strong> could<br />
carry a suspended <strong>Linux</strong> on a USB memory device<br />
and resume/suspend it on almost any computer.<br />
5.2 User Mode <strong>Linux</strong>[1] inside Cooperative<br />
<strong>Linux</strong><br />
The possibility of running UML inside Cooperative Linux is not far from being realized. It would bring UML, with all its glory, to operating systems that cannot support it otherwise because of their user space APIs. Combining UML and Cooperative Linux also cancels out the security downside that running Cooperative Linux by itself could incur.
5.3 Live Cooperative Distributions<br />
Live-CD distributions like KNOPPIX could be booted on top of another operating system, and not only standalone, reaching a larger share of computer users, considering that the host operating system would be Windows NT/2000/XP.
5.4 Integration with ReactOS<br />
ReactOS, the free Windows NT clone, will be<br />
incorporating Cooperative <strong>Linux</strong> as a POSIX<br />
subsystem.<br />
5.5 Miscellaneous
• Virtual frame buffer support.
• Incorporating features from User Mode Linux, e.g. humfs 3.
• Support for more host operating systems such as FreeBSD.
6 Conclusions<br />
We have discussed how Cooperative Linux works and its benefits. Apart from being a BSKH 4, Cooperative Linux has the potential to become an alternative to User Mode Linux that improves on portability and performance rather than on security.
Moreover, the implications that Cooperative Linux has for what the media defines as ‘Linux on the Desktop’ are considerable, as the world's most dominant, albeit proprietary, desktop OS can now run Linux distributions for free, as just another application, with the aimed-for possibility that the Linux newbie will eventually switch to standalone Linux. As the user-friendliness of the Windows port improves, the exposure that Linux gets from the average computer user can increase tremendously.
7 Thanks<br />
Muli Ben Yehuda, IBM<br />
Jun Okajima, Digital Infra<br />
Kuniyasu Suzaki, AIST<br />
References<br />
[1] Jeff Dike. User Mode <strong>Linux</strong>. http:<br />
//user-mode-linux.sf.net.<br />
3 A recent addition to UML that provides a host FS implementation that uses files in order to store its VFS metadata
4 Big Scary <strong>Kernel</strong> Hack
[2] Donald E. Knuth. <strong>The</strong> Art of Computer<br />
Programming, volume 1.<br />
Addison-Wesley, Reading, Massachusetts,<br />
1997. Describes coroutines in their pure<br />
sense.<br />
[3] Richard Potter. Scrapbook for User Mode<br />
<strong>Linux</strong>. http:<br />
//sbuml.sourceforge.net/.
Build your own Wireless Access Point<br />
Erik Andersen<br />
Codepoet Consulting<br />
andersen@codepoet.org<br />
Abstract<br />
This presentation will cover the software, tools,<br />
libraries, and configuration files needed to<br />
construct an embedded <strong>Linux</strong> wireless access<br />
point. Some of the software available for constructing<br />
embedded <strong>Linux</strong> systems will be discussed,<br />
and selection criteria for which tools to<br />
use for differing embedded applications will be<br />
presented. During the presentation, an embedded<br />
<strong>Linux</strong> wireless access point will be constructed<br />
using the <strong>Linux</strong> kernel, the uClibc C<br />
library, BusyBox, the syslinux bootloader, iptables,<br />
etc. Emphasis will be placed on the<br />
more generic aspects of building an embedded<br />
<strong>Linux</strong> system using BusyBox and uClibc.<br />
At the conclusion of the presentation, the presenter<br />
will (with luck) boot up the newly constructed<br />
wireless access point and demonstrate<br />
that it is working perfectly. Source code, build<br />
system, cross compilers, and detailed instructions<br />
will be made available.<br />
1 Introduction<br />
When I began working on embedded <strong>Linux</strong>,<br />
the question of whether or not <strong>Linux</strong> was small<br />
enough to fit inside a particular device was a<br />
difficult problem. <strong>Linux</strong> distributions 1 have<br />
1 <strong>The</strong> term “distribution” is used by the <strong>Linux</strong> community<br />
to refer to a collection of software, including<br />
the <strong>Linux</strong> kernel, application programs, and needed library<br />
code, which makes up a complete running system.<br />
Sometimes, the term “<strong>Linux</strong>” or “GNU/<strong>Linux</strong>” is also<br />
used to refer to this collection of software.<br />
historically been designed for server and desktop<br />
systems. As such, they deliver a fullfeatured,<br />
comprehensive set of tools for just<br />
about every purpose imaginable. Most <strong>Linux</strong><br />
distributions, such as Red Hat, Debian, or<br />
SuSE, provide hundreds of separate software<br />
packages adding up to several gigabytes of<br />
software. <strong>The</strong> goal of server or desktop <strong>Linux</strong><br />
distributions has been to provide as much value<br />
as possible to the user; therefore, the large<br />
size is quite understandable. However, this<br />
has caused the <strong>Linux</strong> operating system to be<br />
much larger than is desirable for building an
embedded <strong>Linux</strong> system such as a wireless access<br />
point. Since embedded devices represent<br />
a fundamentally different target for <strong>Linux</strong>,<br />
it became apparent to me that embedded devices<br />
would need different software than what<br />
is commonly used on desktop systems. I knew<br />
that <strong>Linux</strong> has a number of strengths which<br />
make it extremely attractive for the next generation<br />
of embedded devices, yet I could see<br />
that developers would need new tools to take<br />
advantage of <strong>Linux</strong> within small, embedded<br />
spaces.<br />
I began working on embedded <strong>Linux</strong> in the<br />
middle of 1999. At the time, building an ‘embedded<br />
<strong>Linux</strong>’ system basically involved copying<br />
binaries from an existing <strong>Linux</strong> distribution<br />
to a target device. If the needed software did<br />
not fit into the required amount of flash memory,<br />
there was really nothing to be done about<br />
it except to add more flash or give up on the<br />
project. Very little effort had been made to<br />
develop smaller application programs and li-
braries designed for use in embedded <strong>Linux</strong>.<br />
As I began to analyze how I could save space,<br />
I decided that there were three main areas that<br />
could be attacked to shrink the footprint of an<br />
embedded <strong>Linux</strong> system: the kernel, the set of<br />
common application programs included in the<br />
system, and the shared libraries. Many people<br />
doing <strong>Linux</strong> kernel development were at least<br />
talking about shrinking the footprint of the kernel.<br />
For the past five years, I have focused on<br />
the latter two areas: shrinking the footprint of<br />
the application programs and libraries required<br />
to produce a working embedded <strong>Linux</strong> system.<br />
This paper will describe some of the software<br />
tools I’ve worked on and maintained, which are<br />
now available for building very small embedded<br />
<strong>Linux</strong> systems.<br />
2 <strong>The</strong> C Library<br />
Let’s take a look at an embedded <strong>Linux</strong> system,<br />
the <strong>Linux</strong> Router Project, which was available<br />
in 1999. http://www.linuxrouter.org/<br />
<strong>The</strong> <strong>Linux</strong> Router Project, begun by Dave<br />
Cinege, was and continues to be a very commonly<br />
used embedded Linux system. Its self-described
tagline reads “A networking-centric<br />
micro-distribution of <strong>Linux</strong>” which is “small<br />
enough to fit on a single 1.44MB floppy disk,<br />
and makes building and maintaining routers,<br />
access servers, thin servers, thin clients,<br />
network appliances, and typically embedded<br />
systems next to trivial.” First, let’s download<br />
a copy of one of the <strong>Linux</strong> Router Project’s<br />
“idiot images.” I grabbed my copy from<br />
the mirror site at ftp://sunsite.unc.edu/<br />
pub/<strong>Linux</strong>/distributions/linux-router/<br />
dists/current/idiot-image_1440KB_FAT_<br />
2.9.8_<strong>Linux</strong>_2.2.gz.<br />
Opening up the idiot-image there are several<br />
very interesting things to be seen.<br />
# gunzip \<br />
idiot-image_1440KB_FAT_2.9.8_<strong>Linux</strong>_2.2.gz<br />
# mount \<br />
idiot-image_1440KB_FAT_2.9.8_<strong>Linux</strong>_2.2 \<br />
/mnt -o loop<br />
# du -ch /mnt/*<br />
34K /mnt/etc.lrp<br />
6.0K /mnt/ldlinux.sys<br />
512K /mnt/linux<br />
512 /mnt/local.lrp<br />
1.0K /mnt/log.lrp<br />
17K /mnt/modules.lrp<br />
809K /mnt/root.lrp<br />
512 /mnt/syslinux.cfg<br />
1.0K /mnt/syslinux.dpy<br />
1.4M total<br />
# mkdir test<br />
# cd test<br />
# tar -xzf /mnt/root.lrp<br />
# du -hs<br />
2.2M .<br />
2.2M total<br />
# du -ch bin root sbin usr var<br />
460K bin<br />
8.0K root<br />
264K sbin<br />
12K usr/bin<br />
304K usr/sbin<br />
36K usr/lib/ipmasqadm<br />
40K usr/lib<br />
360K usr<br />
56K var/lib/lrpkg<br />
60K var/lib<br />
4.0K var/spool/cron/crontabs<br />
8.0K var/spool/cron<br />
12K var/spool<br />
76K var<br />
1.2M total<br />
# du -ch lib<br />
24K lib/POSIXness<br />
1.1M lib<br />
1.1M total<br />
# du -h lib/libc-2.0.7.so<br />
644K lib/libc-2.0.7.so<br />
Taking a look at the software contained in<br />
this embedded <strong>Linux</strong> system, we quickly notice<br />
that in a software image totaling 2.2<br />
Megabytes, the libraries take up over half the<br />
space. If we look even closer at the set of<br />
libraries, we quickly find that the largest single<br />
component in the entire system is the GNU<br />
C library, in this case occupying nearly 650k.<br />
What is more, this is a very old version of<br />
the C library; newer versions of GNU glibc,
such as version 2.3.2, are over 1.2 Megabytes<br />
all by themselves! <strong>The</strong>re are tools available<br />
from <strong>Linux</strong> vendors and in the Open Source<br />
community which can reduce the footprint of<br />
the GNU C library considerably by stripping<br />
unwanted symbols; however, using such tools<br />
precludes adding additional software at a later<br />
date. Even when these tools are appropriate,<br />
there are limits to the amount of size which can<br />
be reclaimed from the GNU C library in this<br />
way.<br />
<strong>The</strong> prospect of shrinking a single library that<br />
takes up so much space certainly looked like<br />
low hanging fruit. In practice, however, replacing<br />
the GNU C library for embedded <strong>Linux</strong><br />
systems was not an easy task.
3 <strong>The</strong> origins of uClibc<br />
As I despaired over the large size of the GNU<br />
C library, I decided that the best thing to do<br />
would be to find another C library for <strong>Linux</strong><br />
that would be better suited for embedded systems.<br />
I spent quite a bit of time looking around,<br />
and after carefully evaluating the various Open<br />
Source C libraries that I knew of 2 , I sadly<br />
found that none of them were suitable replacements<br />
for glibc. Of all the Open Source C libraries,<br />
the library closest to what I imagined<br />
an embedded C library should be was called<br />
uC-libc and was being used for uClinux systems.<br />
However, it also had many problems at the time—not the least of which was that uC-libc had no central maintainer. The only mechanism
being used to support multiple architec-<br />
2 <strong>The</strong> Open Source C libraries I evaluated at<br />
the time included Al’s Free C RunTime library<br />
(no longer on the Internet); dietlibc available from<br />
http://www.fefe.de/dietlibc/; the minix C<br />
library available from http://www.cs.vu.nl/<br />
cgi-bin/raw/pub/minix/; the newlib library<br />
available from http://sources.redhat.com/<br />
newlib/; and the eCos C library available from ftp:<br />
//ecos.sourceware.org/pub/ecos/.<br />
tures was a complete source tree fork, and there<br />
had already been a few such forks with plenty<br />
of divergent code. In short, uC-libc was a mess
of twisty versions, all different. After spending<br />
some time with the code, I decided to fix it, and<br />
in the process changed the name to uClibc<br />
(no hyphen).<br />
With the help of D. Jeff Dionne, one of the creators<br />
of uClinux 3 , I ported uClibc to run on<br />
Intel compatible x86 CPUs. I then grafted in<br />
the header files from glibc 2.1.3 to simplify<br />
software ports, and I cleaned up the resulting<br />
breakage. <strong>The</strong> header files were later updated<br />
again to generally match glibc 2.3.2. This effort<br />
has made porting software from glibc to<br />
uClibc extremely easy. <strong>The</strong>re were, however,<br />
many functions in uClibc that were either broken<br />
or missing and which had to be re-written<br />
or created from scratch. When appropriate, I<br />
sometimes grafted in bits of code from the current<br />
GNU C library and libc5. Once the core<br />
of the library was reasonably solid, I began<br />
adding a platform abstraction layer to allow<br />
uClibc to compile and run on different types of<br />
CPUs. Once I had both the ARM and x86 platforms<br />
basically running, I made a few small<br />
announcements to the <strong>Linux</strong> community. At<br />
that point, several people began to make regular<br />
contributions. Most notable was Manuel Novoa III, who began contributing at that time.
He has continued working on uClibc and is<br />
responsible for significant portions of uClibc<br />
such as the stdio and internationalization code.<br />
After a great deal of effort, we were able to<br />
build the first shared library version of uClibc<br />
in January 2001. And earlier this year we were<br />
able to compile a Debian Woody system using<br />
uClibc 4 , demonstrating the library is now able<br />
3 uClinux is a port of <strong>Linux</strong> designed to run on microcontrollers<br />
which lack Memory Management Units<br />
(MMUs) such as the Motorolla DragonBall or the<br />
ARM7TDMI. <strong>The</strong> uClinux web site is found at http:<br />
//www.uclinux.org/.<br />
4 http://www.uclibc.org/dists/
to support a complete <strong>Linux</strong> distribution. People<br />
now use uClibc to build versions of Gentoo,<br />
Slackware, <strong>Linux</strong> from Scratch, rescue disks,<br />
and even live <strong>Linux</strong> CDs 5 . A number of commercial<br />
products have also been released using<br />
uClibc, such as wireless routers, network attached<br />
storage devices, DVD players, etc.<br />
4 Compiling uClibc<br />
Before we can compile uClibc, we must first<br />
grab a copy of the source code and unpack it<br />
so it is ready to use. For this paper, we will just<br />
grab a copy of the daily uClibc snapshot.<br />
# SITE=http://www.uclibc.org/downloads<br />
# wget -q $SITE/uClibc-snapshot.tar.bz2<br />
# tar -xjf uClibc-snapshot.tar.bz2<br />
# cd uClibc<br />
uClibc requires a configuration file, .config,<br />
that can be edited to change the way the library<br />
is compiled, such as to enable or disable<br />
features (i.e. whether debugging support<br />
is enabled or not), to select a cross-compiler,<br />
etc. <strong>The</strong> preferred method when starting from<br />
scratch is to run make defconfig followed<br />
by make menuconfig. Since we are going<br />
to be targeting a standard Intel compatible x86<br />
system, no changes to the default configuration<br />
file are necessary.<br />
5 <strong>The</strong> Origins of BusyBox<br />
As I mentioned earlier, the two components<br />
of an embedded <strong>Linux</strong> that I chose to work<br />
towards reducing in size were the shared libraries<br />
and the set of common application programs.
A typical <strong>Linux</strong> system contains a variety<br />
of command-line utilities from numerous<br />
5 Puppy <strong>Linux</strong> available from http://www.<br />
goosee.com/puppy/ is a live linux CD system built<br />
with uClibc that includes such favorites as XFree86 and<br />
Mozilla.<br />
different organizations and independent programmers.<br />
Among the most prominent of these<br />
utilities were GNU shellutils, fileutils, textutils<br />
(now combined to form GNU coreutils), and<br />
similar programs that can be run within a shell<br />
(commands such as sed, grep, ls, etc.).<br />
The GNU utilities are generally very high-quality
programs, and are almost without exception<br />
very, very feature-rich. <strong>The</strong> large feature<br />
set comes at the cost of being quite large—<br />
prohibitively large for an embedded <strong>Linux</strong> system.<br />
After some investigation, I determined<br />
that it would be more efficient to replace them<br />
rather than try to strip them down, so I began<br />
looking at alternatives.<br />
Just as with alternative C libraries, there were<br />
several choices for small shell utilities: BSD<br />
has a number of utilities which could be used.<br />
The Minix operating system, which had recently been released under a free software license, also had many useful utilities. Sash, the stand-alone shell, was also a possibility. After quite
a lot of research, the one that seemed to be<br />
the best fit was BusyBox. It also appealed to<br />
me because I was already familiar with Busy-<br />
Box from its use on the Debian boot floppies,<br />
and because I was acquainted with Bruce<br />
Perens, who was the maintainer. Starting approximately<br />
in October 1999, I began enhancing<br />
BusyBox and fixing the most obvious problems.<br />
Since Bruce was otherwise occupied and<br />
was no longer actively maintaining BusyBox,<br />
he eventually consented to let me take over
maintainership.<br />
Since that time, BusyBox has gained a large<br />
following and attracted development talent<br />
from literally the whole world. It has been<br />
used in commercial products such as the IBM<br />
<strong>Linux</strong> wristwatch, the Sharp Zaurus PDA, and<br />
Linksys wireless routers such as the WRT54G,<br />
with many more products being released all the<br />
time. So many new features and applets have<br />
been added to BusyBox, that the biggest chal-
lenge I now face is simply keeping up with all<br />
of the patches that get submitted!<br />
6 So, How Does It Work?<br />
BusyBox is a multi-call binary that combines<br />
many common Unix utilities into a single executable.<br />
When it is run, BusyBox checks if it<br />
was invoked via a symbolic link (a symlink),<br />
and if the name of the symlink matches the<br />
name of an applet that was compiled into Busy-<br />
Box, it runs that applet. If BusyBox is invoked<br />
as busybox, then it will read the command<br />
line and try to execute the applet name passed<br />
as the first argument. For example:<br />
# ./busybox date<br />
Wed Jun 2 15:01:03 MDT 2004<br />
# ./busybox echo "hello there"<br />
hello there<br />
# ln -s ./busybox uname<br />
# ./uname<br />
<strong>Linux</strong><br />
BusyBox is designed such that the developer<br />
compiling it for an embedded system can select<br />
exactly which applets to include in the final binary.<br />
Thus, it is possible to strip out support for<br />
unneeded and unwanted functionality, resulting<br />
in a smaller binary with a carefully selected<br />
set of commands. <strong>The</strong> customization granularity<br />
for BusyBox even goes one step further:<br />
each applet may contain multiple features that<br />
can be turned on or off. Thus, for example, if<br />
you do not wish to include large file support,<br />
or you do not need to mount NFS filesystems,<br />
you can simply turn these features off, further<br />
reducing the size of the final BusyBox binary.<br />
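The dispatch logic can be pictured with a toy multi-call binary; the applet table, the stub applets, and the lookup below are simplified illustrations, not BusyBox source.

#include <libgen.h>
#include <stdio.h>
#include <string.h>

/* Two toy applets standing in for the real ones. */
static int true_main(int argc, char **argv) { (void)argc; (void)argv; return 0; }
static int echo_main(int argc, char **argv)
{
        for (int i = 1; i < argc; i++)
                printf("%s%c", argv[i], i + 1 < argc ? ' ' : '\n');
        return 0;
}

static const struct { const char *name; int (*main)(int, char **); } applets[] = {
        { "true", true_main },
        { "echo", echo_main },
};

int main(int argc, char **argv)
{
        const char *name = basename(argv[0]);   /* how were we invoked? */

        /* "busybox <applet> ..." form: shift and dispatch on the next word. */
        if (strcmp(name, "busybox") == 0 && argc > 1) {
                argv++; argc--;
                name = argv[0];
        }

        for (size_t i = 0; i < sizeof(applets) / sizeof(applets[0]); i++)
                if (strcmp(name, applets[i].name) == 0)
                        return applets[i].main(argc, argv);

        fprintf(stderr, "%s: applet not found\n", name);
        return 1;
}

Creating symlinks named true and echo that point at the binary then selects the corresponding applet, just as described above.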
7 Compiling Busybox<br />
Let’s walk through a normal compile of Busy-<br />
Box. First, we must grab a copy of the Busy-<br />
Box source code and unpack it so it is ready to<br />
use. For this paper, we will just grab a copy of<br />
the daily BusyBox snapshot.<br />
# SITE=http://www.busybox.net/downloads<br />
# wget -q $SITE/busybox-snapshot.tar.bz2<br />
# tar -xjf busybox-snapshot.tar.bz2<br />
# cd busybox<br />
Now that we are in the BusyBox source directory<br />
we can configure BusyBox so that it<br />
meets the needs of our embedded <strong>Linux</strong> system.<br />
This is done by editing the file .config<br />
to change the set of applets that are compiled<br />
into BusyBox, to enable or disable features<br />
(i.e. whether debugging support is enabled or<br />
not), and to select a cross-compiler. <strong>The</strong> preferred<br />
method when starting from scratch is<br />
to run make defconfig followed by make<br />
menuconfig. Once BusyBox has been configured<br />
to taste, you just need to run make to<br />
compile it.<br />
8 Installing Busybox to a Target<br />
If you then want to install BusyBox onto a<br />
target device, this is most easily done by typing:<br />
make install. <strong>The</strong> installation script<br />
automatically creates all the required directories<br />
(such as /bin, /sbin, and the like) and<br />
creates appropriate symlinks in those directories<br />
for each applet that was compiled into the<br />
BusyBox binary.<br />
If we wanted to install BusyBox to the directory<br />
/mnt, we would simply run:<br />
# make PREFIX=/mnt install<br />
[--installation text omitted--]
9 Let’s build something that<br />
works!<br />
Now that I have certainly bored you to death,<br />
we finally get to the fun part, building our own<br />
embedded <strong>Linux</strong> system. For hardware, I will<br />
be using a Soekris 4521 system 6 with a 133 MHz AMD Elan CPU, 64 MB main memory,
and a generic Intersil Prism based 802.11b card<br />
that can be driven using the hostap 7 driver.<br />
<strong>The</strong> root filesystem will be installed on a compact<br />
flash card.<br />
To begin with, we need to create a toolchain with
which to compile the software for our wireless<br />
access point. This requires we first compile<br />
GNU binutils 8 , then compile the GNU<br />
compiler collection—gcc 9 , and then compile<br />
uClibc using the newly created gcc compiler.<br />
With all those steps completed, we must finally<br />
recompile gcc using the newly
built uClibc library so that libgcc_s and<br />
libstdc++ can be linked with uClibc.<br />
Fortunately, the process of creating a uClibc<br />
toolchain can be automated. First we will go<br />
to the uClibc website and obtain a copy of the<br />
uClibc buildroot by going here:<br />
http://www.uclibc.org/cgi-bin/<br />
cvsweb/buildroot/<br />
and clicking on the “Download tarball” link 10 .<br />
This is a simple GNU make based build system<br />
which first builds a uClibc toolchain, and then<br />
builds a root filesystem using the newly built<br />
uClibc toolchain.<br />
For the root filesystem of our wireless access<br />
6 http://www.soekris.com/net4521.htm<br />
7 http://hostap.epitest.fi/<br />
8 http://sources.redhat.com/<br />
binutils/<br />
9 http://gcc.gnu.org/<br />
10 http://www.uclibc.org/cgi-bin/<br />
cvsweb/buildroot.tar.gz?view=tar<br />
point, we will need a <strong>Linux</strong> kernel, uClibc,<br />
BusyBox, pcmcia-cs, iptables, hostap, wtools,<br />
bridgeutils, and the dropbear ssh server. To<br />
compile these programs, we will first edit the<br />
buildroot Makefile to enable each of these<br />
items. Figure 1 shows the changes I made to<br />
the buildroot Makefile:<br />
Running make at this point will download the<br />
needed software packages, build a toolchain,<br />
and create a minimal root filesystem with the<br />
specified software installed.<br />
On my system, with all the software packages<br />
previously downloaded and cached locally, a<br />
complete build took 17 minutes, 19 seconds.<br />
Depending on the speed of your network connection<br />
and the speed of your build system,<br />
now might be an excellent time to take a lunch<br />
break, take a walk, or watch a movie.<br />
10 Checking out the new Root<br />
Filesystem<br />
We now have our root filesystem finished and<br />
ready to go. But we still need to do a little<br />
more work before we can boot up our newly<br />
built embedded <strong>Linux</strong> system. First, we need<br />
to compress our root filesystem so it can be<br />
loaded as an initrd.<br />
# gzip -9 root_fs_i386<br />
# ls -sh root_fs_i386.gz<br />
1.1M root_fs_i386.gz<br />
Now that our root filesystem has been compressed,<br />
it is ready to install on the boot media.<br />
To make things simple, I will install the Compact<br />
Flash boot media into a USB card reader<br />
device, and copy files using the card reader.<br />
# ms-sys -s /dev/sda<br />
Public domain master boot record<br />
successfully written to /dev/sda
--- Makefile<br />
+++ Makefile<br />
@@ -140,6 +140,6 @@<br />
# Unless you want to build a kernel, I recommend just using<br />
# that...<br />
-TARGETS+=kernel-headers<br />
-#TARGETS+=linux<br />
+#TARGETS+=kernel-headers<br />
+TARGETS+=linux<br />
#TARGETS+=system-linux<br />
@@ -150,5 +150,5 @@<br />
#TARGETS+=zlib openssl openssh<br />
# Dropbear sshd is much smaller than openssl + openssh<br />
-#TARGETS+=dropbear_sshd<br />
+TARGETS+=dropbear_sshd<br />
# Everything needed to build a full uClibc development system!<br />
@@ -175,5 +175,5 @@<br />
# Some stuff for access points and firewalls<br />
-#TARGETS+=iptables hostap wtools dhcp_relay bridge<br />
+TARGETS+=iptables hostap wtools dhcp_relay bridge<br />
#TARGETS+=iproute2 netsnmp<br />
Figure 1: Changes to the buildroot Makefile<br />
# mkdosfs /dev/sda1<br />
mkdosfs 2.10 (22 Sep 2003)<br />
# syslinux /dev/sda1
# cp syslinux.cfg /mnt
# cp root_fs_i386.gz /mnt/root_fs.gz<br />
# cp build_i386/buildroot-kernel /mnt/linux<br />
So we now have a copy of our root filesystem<br />
and <strong>Linux</strong> kernel on the compact flash disk. Finally,<br />
we need to configure the bootloader. In<br />
case you missed it a few steps ago, we are using<br />
the syslinux bootloader for this example.<br />
I happen to have a ready to use syslinux configuration<br />
file, so I will now install that to the<br />
compact flash disk as well:<br />
# cat syslinux.cfg<br />
TIMEOUT 0<br />
PROMPT 0<br />
DEFAULT linux<br />
LABEL linux<br />
KERNEL linux
APPEND initrd=root_fs.gz \
  console=ttyS0,57600 \
  root=/dev/ram0 boot=/dev/hda1,msdos rw
And now, finally, we are done. Our embedded<br />
<strong>Linux</strong> system is complete and ready to boot.<br />
And you know what? It is very, very small.<br />
Take a look at Table 1.<br />
With a carefully optimized <strong>Linux</strong> kernel<br />
(which this kernel unfortunately isn’t) we<br />
could expect to have even more free space.<br />
And remember, every bit of space we save is<br />
money that embedded <strong>Linux</strong> developers don’t<br />
have to spend on expensive flash memory. So<br />
now comes the final test; it is now time to boot<br />
from our compact flash disk. Here is what you<br />
should see.<br />
[----kernel boot messages snipped--]
# ll /mnt
total 1.9M
drwxr--r--   2 root root  16K Jun  2 16:39 ./
drwxr-xr-x  22 root root 4.0K Feb  6 07:40 ../
-r-xr--r--   1 root root 7.7K Jun  2 16:36 ldlinux.sys*
-rwxr--r--   1 root root 795K Jun  2 16:36 linux*
-rwxr--r--   1 root root 1.1M Jun  2 16:36 root_fs.gz*
-rwxr--r--   1 root root  170 Jun  2 16:39 syslinux.cfg*
Table 1: Output of ls -lh /mnt.<br />
Freeing unused kernel memory: 64k freed<br />
Welcome to the Erik’s wireless access point.<br />
uclibc login: root<br />
BusyBox v1.00-pre10 (2004.06.02-21:54+0000)<br />
Built-in shell (ash)<br />
Enter ’help’ for a list of built-in commands.<br />
# du -h / | tail -n 1<br />
2.6M<br />
#<br />
And there you have it—your very own wireless<br />
access point. Some additional configuration<br />
will be necessary to start up the wireless<br />
interface, which will be demonstrated during<br />
my presentation.<br />
11 Conclusion<br />
<strong>The</strong> two largest components of a standard<br />
<strong>Linux</strong> system are the utilities and the libraries.<br />
By replacing these with smaller equivalents a<br />
much more compact system can be built. Using<br />
BusyBox and uClibc allows you to customize<br />
your embedded distribution by stripping<br />
out unneeded applets and features, thus<br />
further reducing the final image size. This<br />
space savings translates directly into decreased<br />
cost per unit as less flash memory will be required.<br />
Combine this with the cost savings of<br />
using <strong>Linux</strong>, rather than a more expensive proprietary<br />
OS, and the reasons for using <strong>Linux</strong><br />
become very compelling. <strong>The</strong> example Wireless<br />
Access point we created is a simple but useful example. There are thousands of other potential applications that are only waiting for you to create them.
Run-time testing of LSB Applications<br />
Abstract<br />
Stuart Anderson<br />
Free Standards Group<br />
anderson@freestandards.org<br />
<strong>The</strong> dynamic application test tool is capable<br />
of checking API usage at run-time. <strong>The</strong> LSB<br />
defines only a subset of all possible parameter<br />
values to be valid. This tool is capable of<br />
checking these values while the application is
running.<br />
This paper will explain how this tool works,<br />
and highlight some of the more interesting implementation<br />
details such as how we managed<br />
to generate most of the code automatically,<br />
based on the interface descriptions contained<br />
in the LSB database.<br />
Results to date will be presented, along with<br />
future plans and possible uses for this tool.<br />
1 Introduction<br />
<strong>The</strong> <strong>Linux</strong> Standard Base (LSB) Project began<br />
in 1998, when the <strong>Linux</strong> community came<br />
together and decided to take action to prevent<br />
GNU/<strong>Linux</strong> based operating systems from<br />
fragmenting in the same way UNIX operating<br />
systems did in the 1980s and 1990s. <strong>The</strong> LSB<br />
defines the Application Binary Interface (ABI)<br />
for the core part of a GNU/<strong>Linux</strong> system. As<br />
an ABI, the LSB defines the interface between<br />
the operating system and the applications. A<br />
complete set of tests for an ABI must be capable<br />
of measuring the interface from both sides.<br />
Almost from the beginning, testing has been<br />
Matt Elder<br />
University of South Carolina
happymutant@sc.rr.com<br />
a cornerstone of the project. <strong>The</strong> LSB was<br />
originally organized around 3 components: the<br />
written specification, a sample implementation,<br />
and the test suites. <strong>The</strong> written specification<br />
is the ultimate definition of the LSB. Both<br />
the sample implementation, and the test suites<br />
yield to the authority of the written specification.<br />
<strong>The</strong> sample implementation (SI) is a minimal<br />
subset of a GNU/<strong>Linux</strong> system that provides a<br />
runtime that implements the LSB, and as little<br />
else as possible. <strong>The</strong> SI is neither intended to<br />
be a minimal distribution, nor the basis for a<br />
distribution. Instead, it is used as both a proof<br />
of concept and a testing tool. Applications<br />
which are seeking certification are required to<br />
prove they execute correctly using the SI and<br />
two other distributions. <strong>The</strong> SI is also used to<br />
validate the runtime test suites.<br />
<strong>The</strong> third component is testing. <strong>One</strong> of the<br />
things that strengthens the LSB is its ability to<br />
measure, and thus prove, conformance to the<br />
standard. Testing is achieved with an array of<br />
different test suites, each of which measures a<br />
different aspect of the specification.<br />
LSB Runtime<br />
• cmdchk<br />
This test suite is a simple existence test<br />
that ensures the required LSB commands<br />
and utilities are found on an LSB conforming<br />
system.
• libchk<br />
This test suite checks the libraries required<br />
by the LSB to ensure they contain<br />
the interfaces and symbol versions as<br />
specified by the LSB.<br />
• runtimetests<br />
This test suite measures the behavior of<br />
the interfaces provided by the GNU/<strong>Linux</strong><br />
system. This is the largest of the test<br />
suites, and is actually broken down into<br />
several components, which are referred to<br />
collectively as the runtime tests. <strong>The</strong>se<br />
tests are derived from the test suites used<br />
by the Open Group for UNIX branding.<br />
LSB Packaging<br />
• pkgchk<br />
This test examines an RPM format package<br />
to ensure it conforms to the LSB.<br />
• pkginstchk<br />
This test suite is used to ensure that the<br />
package management tool provided by a<br />
GNU/<strong>Linux</strong> system will correctly install<br />
LSB conforming packages. This suite is<br />
still in early stages of development.<br />
LSB Application<br />
• appchk<br />
This test performs a static analysis of an<br />
application to ensure that it only uses<br />
libraries and interfaces specified by the<br />
LSB.<br />
• dynchk<br />
This test is used to measure an application's
use of the LSB interfaces during its<br />
execution, and is the subject of this paper.<br />
2 <strong>The</strong> database<br />
<strong>The</strong> LSB Specification contains over 6600 interfaces,<br />
each of which is associated with a library<br />
and a header file, and may have parameters.<br />
Because of the size and complexity of the<br />
data describing these interfaces, a database is<br />
used to maintain this information.<br />
It is impractical to try and keep the specification,<br />
test suites and development libraries and<br />
headers synchronized for this much data. Instead,<br />
portions of the specification and tests,<br />
and all of the development headers and libraries<br />
are generated from the database. This<br />
ensures that as changes are made to the<br />
database, the changes are propagated to the<br />
other parts of the project as well.<br />
Some of the relevant data components in this<br />
DB are Libraries, Headers, Interfaces, and<br />
Types. <strong>The</strong>re are also secondary components<br />
and relations between all of the components. A<br />
short description of some of these is needed before<br />
moving on to how the dynchk test is constructed.<br />
2.1 Library<br />
<strong>The</strong> LSB specifies 17 shared libraries, which<br />
contain the 6600 interfaces. The interfaces
in each library are grouped into logical units<br />
called a LibGroup. <strong>The</strong> LibGroups help to organize<br />
the interfaces, which is very useful in<br />
the written specification, but isn’t used much<br />
elsewhere.<br />
2.2 Interface<br />
An Interface represents a globally visible symbol,<br />
such as a function, or piece of data. Interfaces<br />
have a Type, which is either the type of<br />
the global data or the return type of the function.<br />
If the Interface is a function, then it will<br />
have zero or more Parameters, which form a
set of Types ordered by their position in the parameter list.

Figure 1: Relationship between Library, LibGroup and Interface

Figure 2: Relationship between Interface, Type and Parameter

struct foo {
    int a;
    int *b;
}

Figure 3: Sample struct

Tid  Ttype      Tname  Tbasetype
1    Intrinsic  int    0
2    Pointer    int *  1

Table 1: Example of recursion in Type table for int *
2.3 Type<br />
As mentioned above, the database contains<br />
enough information to be able to generate<br />
header files which are a part of the LSB development<br />
tools. This means that the database<br />
must be able to represent C-language types. The
Type and TypeMember tables provide these.<br />
<strong>The</strong>se tables are used recursively. If a Type is<br />
defined in terms of another type, then it will<br />
have a base type that points to that other type.<br />
For structs and unions, the TypeMember table is used to hold the ordered list of members. Entries
in the TypeMember table point back to the<br />
Type table to describe the type of each member.<br />
For enums, the TypeMember table is also used<br />
to hold the ordered list of values.<br />
Tid Ttype Tname Tbasetype<br />
1 Intrinsic int 0<br />
2 Pointer int * 1<br />
3 Struct foo 0<br />
Table 2: Contents of Type table<br />
<strong>The</strong> structure shown in Figure 3 is represented<br />
by the entries in the Type table in Table 2 and<br />
the TypeMember table in Table 3.<br />
2.4 Header<br />
Headers, like Libraries, have their contents arranged<br />
into logical groupings known as Header-
Groups. Unlike Libraries, these HeaderGroups<br />
are ordered so that the proper sequence of<br />
definitions within a header file can be maintained.<br />
HeaderGroups contain Constant definitions<br />
(i.e. #define statements) and Type definitions.<br />
If you examine a few well designed<br />
header files, you will notice a pattern of a comment<br />
followed by related constant definitions<br />
and type definitions. <strong>The</strong> entire header file can<br />
be viewed as a repeating sequence of this pat-
Tmid TMname TMtypeid TMposition TMmemberof<br />
10 a 1 0 3<br />
11 b 2 1 3<br />
Table 3: Contents of TypeMember<br />
Figure 4: Organization of Headers (a sequence of HeaderGroups, each containing Constants and Types, followed by Function Declarations)
tern. This pattern is the basis for the Header-<br />
Group concept.<br />
2.5 TypeType<br />
<strong>One</strong> last construct in our database should be<br />
mentioned. While we are able to represent<br />
a syntactic description of interfaces and<br />
types in the database, this is not enough to<br />
automatically generate meaningful test cases.<br />
We need to add some semantic information<br />
that better describes how the types in structures<br />
and parameters are used. As an example,<br />
struct sockaddr contains a member,<br />
sa_family, of type unsigned short. The compiler will of course ensure that only values between 0 and 2^16 − 1 will be used, but only a few of those values have any meaning in this context. By adding the semantic information that this member holds a socket family value, the test generator can cause the value found in sa_family to be tested against the legal socket family values (AF_INET, AF_INET6, etc.), instead of just ensuring the value falls between 0 and 2^16 − 1, which is really just a no-op test.
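As an illustration, a generated check for a member annotated as a socket family could look roughly like the function below; the function name, the set of accepted families, and the reporting convention are invented for this sketch and are not the actual dynchk code.

#include <stdio.h>
#include <sys/socket.h>

/* Validate a value annotated with the "socket family" semantic type. */
static void validate_socket_family(unsigned short family, const char *where)
{
        switch (family) {
        case AF_UNIX:
        case AF_INET:
        case AF_INET6:
        case AF_PACKET:
                break;                    /* one of the families this example accepts */
        default:
                fprintf(stderr, "dynchk: %s: unexpected socket family %u\n",
                        where, family);
        }
}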
Example TypeType entries<br />
• RWaddress<br />
An address from the process space that<br />
must be both readable and writable.<br />
• Rdaddress<br />
An address from the process space that<br />
must be at least readable.<br />
• filedescriptor<br />
A small integer value greater than or equal<br />
to 0, and less than the maximum file descriptor<br />
for the process.<br />
• pathname<br />
<strong>The</strong> name of a file or directory that should<br />
be compared against the Filesystem Hierarchy<br />
Standard.<br />
2.6 Using this data<br />
As mentioned above, the data in the database is<br />
used to generate different portions of the LSB<br />
project. This strategy was adopted to ensure
these different parts would always be in sync,<br />
without having to depend on human intervention.<br />
<strong>The</strong> written specification contains tables of interfaces,<br />
and data definitions (constants and<br />
types). <strong>The</strong>se are all generated from the<br />
database.<br />
<strong>The</strong> LSB development environment 1 consists<br />
of stub libraries and header files that contain<br />
only the interfaces defined by the LSB. This<br />
development environment helps catch the use<br />
of non-LSB interfaces during the development<br />
or porting of an application instead of being<br />
surprised by later test results. Both the stub<br />
libraries and headers are produced by scripts<br />
pulling data from the database.<br />
Some of the test suites described previously<br />
have components which are generated from the<br />
database. Cmdchk and libchk have lists of<br />
commands and interfaces respectively which<br />
are extracted from the database. <strong>The</strong> static application<br />
test tool, appchk, also has a list of<br />
interfaces that comes from the database. <strong>The</strong><br />
dynamic application test tool, dynchk, has the<br />
majority of its code generated from information<br />
in the database.<br />
3 <strong>The</strong> Dynamic Checker<br />
<strong>The</strong> static application checker simply examines<br />
an executable file to determine if it is using<br />
interfaces beyond those allowed by the LSB.<br />
This is very useful to determine if an application<br />
has been built correctly. However, is<br />
unable to determine if the interfaces are used<br />
correctly when the application is executed. A<br />
different kind of test is required to be able to<br />
perform this level of checking. This new test<br />
must interact with the application while it is<br />
1 See the May Issue of <strong>Linux</strong> Journal for more information<br />
on the LSB Development Environment.<br />
running, without interfering with the execution<br />
of the application.<br />
This new test has two major components: a<br />
mechanism for hooking itself into an application,<br />
and a collection of functions to perform<br />
the tests for all of the interfaces. <strong>The</strong>se components<br />
can mostly be developed independently<br />
of each other.<br />
3.1 <strong>The</strong> Mechanism<br />
<strong>The</strong> mechanism for interacting with the application<br />
must be transparent and noninterfering<br />
to the application. We considered the approach<br />
used by 3 different tools: abc, ltrace, and fakeroot.<br />
• abc—This tool was the inspiration for<br />
our new dynamic checker. abc was developed<br />
as part of the SVR4 ABI test<br />
tools. abc works by modifying the target<br />
application. <strong>The</strong> application’s executable<br />
is modified to load a different version<br />
of the shared libraries and to call a<br />
different version of each interface. This<br />
is accomplished by changing the strings<br />
in the symbol table and DT_NEEDED<br />
records. For example, libc.so.1 is<br />
changed to LiBc.So.1, and fread()<br />
is changed to FrEaD(). <strong>The</strong> test set<br />
is then located in /usr/lib/LiBc.So.1, which in turn loads the original /usr/lib/libc.so.1. This mechanism<br />
works, but the requirement to modify<br />
the executable file is undesirable.<br />
• ltrace—This tool is similar to<br />
strace, except that it traces calls<br />
into shared libraries instead of calls into<br />
the kernel. ltrace uses the ptrace<br />
interface to control the application’s<br />
process. With this approach, the test sets<br />
are located in a separate program and are<br />
invoked by stopping the application upon
entry to the interface being tested. This<br />
approach has two drawbacks: first, the<br />
code required to decode the process stack<br />
and extract the parameters is unique to<br />
each architecture, and second, the tests<br />
themselves are more complicated to write<br />
since the parameters have to be fetched<br />
from the application’s process.<br />
• fakeroot—This tool is used to create<br />
an environment where an unprivileged<br />
process appears to have root privileges.<br />
fakeroot uses LD_PRELOAD to load<br />
an additional shared library before any of<br />
the shared libraries specified by the DT_<br />
NEEDED records in the executable. This<br />
extra library contains a replacement function<br />
for each file manipulation function.<br />
<strong>The</strong> functions in this library will be selected<br />
by the dynamic linker instead of the<br />
normal functions found in the regular libraries.<br />
<strong>The</strong> test sets themselves will perform<br />
tests of the parameters, and then call<br />
the original version of the functions.<br />
We chose to use the LD_PRELOAD mechanism<br />
because we felt it was the simplest to use.<br />
Based on this mechanism, a sample test case<br />
looks like Figure 5.<br />
<strong>One</strong> problem that must be avoided when using<br />
this mechanism is recursion. If the above<br />
function just called read() at the end, it<br />
would end up calling itself again. Instead, the<br />
RTLD_NEXT flag passed to dlsym() tells the<br />
dynamic linker to look up the symbol in one of the libraries loaded after the current library, which yields the original version of the function.<br />
3.2 Test set organization<br />
<strong>The</strong> test set functions are organized into 3 layers.<br />
<strong>The</strong> top layer contains the functions that<br />
are test stubs for the LSB interfaces. <strong>The</strong>se<br />
functions are implemented by calling the functions<br />
in layers 2 and 3. An example of a function<br />
in the first layer was given in Figure 5.<br />
<strong>The</strong> second layer contains the functions that<br />
test data structures and types which are passed<br />
in as parameters. <strong>The</strong>se functions are also implemented<br />
by calling the functions in layer 3<br />
and other functions in layer 2. A function in<br />
the second layer looks like Figure 6.<br />
<strong>The</strong> third layer contains functions that test the<br />
types which have been annotated with additional<br />
semantic information. <strong>The</strong>se functions<br />
often have to perform nontrivial operations to<br />
test the assertion required for these supplemental<br />
types. Figure 7 is an example of a layer 3<br />
function.<br />
Presently, there are 3056 functions in layer 1<br />
(tests for libstdc++ are not yet being generated),<br />
106 functions in layer 2, and just a few<br />
in layer 3. We estimate that the total number of<br />
functions in layer 3 upon completion of the test<br />
tool will be on the order of several dozen. <strong>The</strong><br />
functions in the first two layers are automatically<br />
generated based on the information in the<br />
database. Functions in layer 3 are hand coded.<br />
3.3 Automatic generation of the tests<br />
Table 4 summarizes the size of the test<br />
tool so far. As work progresses, these numbers<br />
will only get larger. Most of the code in<br />
the test is very repetitive, and prone to errors<br />
when edited manually. <strong>The</strong> ability to automate<br />
the process of creating this code is highly desirable.<br />
Let’s take another look at the sample function<br />
from layer 1. This time, however, let's replace<br />
some of the code with a description of the information<br />
it represents. See Figure 8 for this<br />
parameterized version.<br />
All of the occurrences of the string read are actually just the function name, and could have been replaced as well. The same thing can be done for the sample function from layer 2, as seen in Figure 9.<br />
ssize_t read (int arg0, void *arg1, size_t arg2) {<br />
if (!funcptr)<br />
funcptr = dlsym(RTLD_NEXT, "read");<br />
validate_filedescriptor(arg0, "read");<br />
validate_RWaddress(arg1, "read");<br />
validate_size_t(arg2, "read");<br />
return funcptr(arg0, arg1, arg2);<br />
}<br />
Figure 5: Test case for read() function<br />
void validate_struct_sockaddr_in(struct sockaddr_in *input,<br />
char *name) {<br />
validate_socketfamily(input->sin_family,name);<br />
validate_socketport(input->sin_port,name);<br />
validate_IPv4Address((input->sin_addr), name);<br />
}<br />
Figure 6: Test case for validating struct sockaddr_in<br />
Module Files Lines of Code<br />
libc 752 19305<br />
libdl 5 125<br />
libgcc_s 13 262<br />
libGL 450 11046<br />
libICE 49 1135<br />
libm 281 6568<br />
libncurses 266 6609<br />
libpam 13 335<br />
libpthread 82 2060<br />
libSM 37 865<br />
libX11 668 16112<br />
libXext 113 2673<br />
libXt 288 7213<br />
libz 39 973<br />
structs 106 1581<br />
Table 4: Summary of generated code<br />
These two examples now represent templates<br />
that can be used to create the functions for layers<br />
1 and 2. From the previous description of<br />
the database, you can see that there is enough<br />
information available to be able to instantiate<br />
these templates for each interface and structure<br />
used by the LSB.<br />
<strong>The</strong> automation is implemented by 2 perl<br />
scripts: gen_lib.pl and gen_tests.pl.<br />
<strong>The</strong>se scripts generate the code for layers 1 and<br />
2 respectively.<br />
Overall, these scripts work well, but we have<br />
run into a few interesting situations along the<br />
way.<br />
3.4 Handling the exceptions<br />
So far, we have come up with an overall architecture<br />
for the test tool, selected a mechanism<br />
that allows us to hook the tests into the running<br />
application, discovered the pattern in the test<br />
functions so that we could create a template for
void validate_filedescriptor(const int fd, const char *name) {<br />
if (fd >= lsb_sysconf(_SC_OPEN_MAX))<br />
ERROR("fd too big");<br />
else if (fd < 0)<br />
ERROR("fd negative");<br />
}<br />
Figure 7: Test case for validating a filedescriptor<br />
return-type read (list of parameters) {<br />
if (!funcptr)<br />
funcptr = dlsym(RTLD_NEXT, "read");<br />
validate_parameter1 type(arg0, "read");<br />
validate_parameter2 type(arg1, "read");<br />
validate_parameter3 type(arg2, "read");<br />
return funcptr(arg0, arg1, arg2);<br />
}<br />
Figure 8: Parameterized test case for a function<br />
automatically generating the code, and implemented<br />
the scripts to generate all of the test cases. The only problem is that now we run<br />
into the real world, where things don’t always<br />
follow the rules.<br />
Here are a few of the interesting situations we<br />
have encountered:<br />
• Variadic Functions<br />
Of the 725 functions in libc, 25 of them<br />
take a variable number of parameters.<br />
This causes problems in the generation of<br />
the code for the test case, but most importantly<br />
it affects our ability to know how<br />
to process the arguments. These functions<br />
have to be written by hand to handle<br />
the special needs of these functions.<br />
For the functions in the exec, printf<br />
and scanf families, the test cases can be<br />
implemented by calling the varargs form<br />
of the function (execl() can be implemented<br />
using execv()).<br />
• open()<br />
In addition to the problems of being a<br />
variadic function, the third parameter to<br />
open() and open64() is only valid<br />
if the O_CREAT flag is set in the second<br />
parameter to these functions. This<br />
simple exception requires a small amount<br />
of manual intervention, so these functions have to be maintained by hand. A sketch of such a hand-written wrapper appears after this list.<br />
• memory allocation<br />
<strong>One</strong> of the recursion problems we ran into<br />
is that memory will be allocated within<br />
the dlsym() function call, so the implementation<br />
of one test case ends up invoking<br />
the test case for one of the memory<br />
allocation routines, which by default<br />
would call dlsym(), creating the recursion.<br />
This cycle had to be broken by having<br />
the test cases for these routines call<br />
libc private interfaces to memory allocation.<br />
• changing memory map
void validate_struct_structure name(struct structure name<br />
*input, char *name) {<br />
validate_type of member 1(input->name of member 1, name);<br />
validate_type of member 2(input->name of member 2, name);<br />
validate_type of member 3((input->name of member 3), name);<br />
}<br />
Figure 9: Parameterized test case for a struct<br />
Pointers are validated by making sure they<br />
contain an address that is valid for the process.<br />
/proc/self/maps is read to obtain<br />
the memory map of the current process.<br />
<strong>The</strong>se results are cached, for performance<br />
reasons, but usually, the memory<br />
map of the process will change over time.<br />
Both the stack and the heap will grow,<br />
resulting in valid pointers being checked<br />
against a cached copy of the memory map.<br />
In the event a pointer is found to be invalid,<br />
the memory map is re-read, and the<br />
pointer checked again. <strong>The</strong> mmap() and<br />
munmap() test cases are also maintained<br />
by hand so that they can also cause the<br />
memory map to be re-read.<br />
• hidden ioctl()s<br />
By design, the LSB specifies interfaces<br />
at the highest possible level. <strong>One</strong> example<br />
of this, is the use of the termio functions,<br />
instead of specifying the underlying<br />
ioctl() interface. It turns out that<br />
this tool catches the underlying ioctl()<br />
calls anyway, and flags them as errors. The solution is for the termio functions to set<br />
a flag indicating that the ioctl() test<br />
case should skip its tests.<br />
• Optionally NULL parameters<br />
Many interfaces have parameters which<br />
may be NULL. This triggered lots of warnings for many programs. The solution was to add a flag indicating that the parameter may be NULL, and to not<br />
try to validate the pointer, or the data being<br />
pointed to.<br />
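For illustration, a hand-maintained wrapper for open() might look roughly like the following. The validate_* helpers here are placeholders for the kind of layer 2 and layer 3 checks described earlier, not the tool's actual function names, and error handling is kept minimal.<br />
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <sys/types.h>

static int (*funcptr)(const char *, int, ...);

int open(const char *pathname, int flags, ...)
{
    mode_t mode = 0;

    if (!funcptr)
        funcptr = dlsym(RTLD_NEXT, "open");

    /* Placeholder checks standing in for the generated/hand-written
       validation layers; the real helper names may differ. */
    validate_filename(pathname, "open");
    validate_openflags(flags, "open");

    /* The third argument exists (and is checked) only when O_CREAT is
       set in the second argument. */
    if (flags & O_CREAT) {
        va_list ap;
        va_start(ap, flags);
        mode = va_arg(ap, mode_t);
        va_end(ap);
        validate_mode_t(mode, "open");
        return funcptr(pathname, flags, mode);
    }

    return funcptr(pathname, flags);
}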
No doubt, there will be more interesting situations<br />
to deal with before this tool is<br />
completed.<br />
4 Results<br />
As of the deadline for this paper, results are<br />
preliminary, but encouraging. <strong>The</strong> tool is initially<br />
being tested against simple commands<br />
such as ls and vi, and some X Window System clients<br />
such as xclock and xterm. <strong>The</strong> tool is correctly<br />
inserting itself into the application under test,<br />
and we are getting some interesting results that<br />
will be examined more closely.<br />
One example: vi passes a NULL to<br />
__strtol_internal several times during<br />
startup.<br />
<strong>The</strong> tool was designed to work across all architectures.<br />
At present, it has been built and tested<br />
on only the IA32 and IA64 architectures. No<br />
significant problems are anticipated on other architectures.<br />
Additional results and experience will be presented<br />
at the conference.
5 Future Work<br />
<strong>The</strong>re is still much work to be done. Some of<br />
the outstanding tasks are highlighted here.<br />
• Additional TypeTypes<br />
Semantic information needs to be added<br />
for additional parameters and structures.<br />
<strong>The</strong> additional layer 3 tests that correspond<br />
to this information must also be implemented.<br />
• Architecture-specific interfaces<br />
As we found in the LSB, there are some<br />
interfaces, and types that are unique to one<br />
or more architectures. <strong>The</strong>se need to be<br />
handled properly so they are not part of<br />
the tests when built on an architecture for<br />
which they don’t apply.<br />
• Unions<br />
Although Unions are represented in the<br />
database in the same way as structures,<br />
the database does not contain enough information<br />
to describe how to interpret or<br />
test the contents of a union. Test cases that<br />
involve unions may have to be written by<br />
hand.<br />
• Additional libraries<br />
<strong>The</strong> information in the database for the<br />
graphics libraries and for libstdc++ is<br />
incomplete; therefore, it is not possible to<br />
generate all of the test cases for those libraries.<br />
Once the data is complete, the test<br />
cases will also be complete.
<strong>Linux</strong> Block IO—present and future<br />
Jens Axboe<br />
SuSE<br />
axboe@suse.de<br />
Abstract<br />
<strong>One</strong> of the primary focus points of 2.5 was fixing<br />
up the bit rotting block layer, and as a result<br />
2.6 now sports a brand new implementation of<br />
basically anything that has to do with passing<br />
IO around in the kernel, from producer to disk<br />
driver. <strong>The</strong> talk will feature an in-depth look<br />
at the IO core system in 2.6 compared to 2.4,<br />
looking at performance, flexibility, and added<br />
functionality. <strong>The</strong> rewrite of the IO scheduler<br />
API and the new IO schedulers will get a fair<br />
treatment as well.<br />
No 2.6 talk would be complete without 2.7<br />
speculations, so I shall try to predict what<br />
changes the future holds for the world of <strong>Linux</strong><br />
block I/O.<br />
1 2.4 Problems<br />
<strong>One</strong> of the most widely criticized pieces of<br />
code in the 2.4 kernels is, without a doubt, the<br />
block layer. It has bit-rotted heavily and lacks<br />
various features or facilities that modern hardware<br />
craves. This has led to many evils, ranging<br />
from code duplication in drivers to massive<br />
patching of block layer internals in vendor<br />
kernels. As a result, vendor trees can easily<br />
be considered forks of the 2.4 kernel with<br />
respect to the block layer code, with all of<br />
the problems that this fact brings with it: the 2.4 block layer code base may as well be considered dead, since no one develops against it. Hardware<br />
vendor drivers include many nasty hacks<br />
and #ifdef’s to work in all of the various<br />
2.4 kernels that are out there, which doesn’t exactly<br />
enhance code coverage or peer review.<br />
<strong>The</strong> block layer fork didn’t just happen for the<br />
fun of it, of course; it was a direct result of the various problems observed. Some of these<br />
are added features, others are deeper rewrites<br />
attempting to solve scalability problems with<br />
the block layer core or IO scheduler. In the<br />
next sections I will attempt to highlight specific<br />
problems in these areas.<br />
1.1 IO Scheduler<br />
<strong>The</strong> main 2.4 IO scheduler is called<br />
elevator_linus, named after the benevolent<br />
kernel dictator to credit him for some<br />
of the ideas used. elevator_linus is a<br />
one-way scan elevator that always scans in<br />
the direction of increasing LBA. It manages<br />
latency problems by assigning sequence<br />
numbers to new requests, denoting how many<br />
new requests (either merges or inserts) may<br />
pass this one. <strong>The</strong> latency value is dependent<br />
on data direction, smaller for reads than for<br />
writes. Internally, elevator_linus uses<br />
a double linked list structure (the kernels<br />
struct list_head) to manage the request<br />
structures. When queuing a new IO unit with<br />
the IO scheduler, the list is walked to find a<br />
suitable insertion (or merge) point yielding an<br />
O(N) runtime. That in itself is suboptimal in<br />
the presence of large amounts of IO, and to make<br />
matters even worse, we repeat this scan if the<br />
request free list was empty when we entered
the IO scheduler. <strong>The</strong> latter is not an error<br />
condition; it will happen all the time for even<br />
moderate amounts of write back against a<br />
queue.<br />
1.2 struct buffer_head<br />
<strong>The</strong> main IO unit in the 2.4 kernel is the<br />
struct buffer_head. It’s a fairly unwieldy<br />
structure, used at various kernel layers for different<br />
things: caching entity, file system block,<br />
and IO unit. As a result, it's suboptimal for any of them.<br />
From the block layer point of view, the two<br />
biggest problems are the size of the structure<br />
and the limitation in how big a data region it<br />
can describe. Being limited by the file system<br />
one-block semantics, it can at most describe a<br />
PAGE_CACHE_SIZE amount of data. In <strong>Linux</strong><br />
on x86 hardware that means 4KiB of data. Often<br />
it can be even worse: raw io typically uses<br />
the soft sector size of a queue (default 1KiB)<br />
for submitting io, which means that queuing<br />
e.g., 32KiB of IO will enter the io scheduler 32<br />
times. To work around this limitation and get<br />
at least to a page at the time, a 2.4 hack was<br />
introduced. This is called vary_io. A driver<br />
advertising this capability acknowledges that it<br />
can manage buffer_head’s of varying sizes<br />
at the same time. File system read-ahead, another<br />
frequent user of submitting larger sized<br />
io, has no option but to submit the read-ahead<br />
window in units of the page size.<br />
1.3 Scalability<br />
With the limit on buffer_head IO size and<br />
elevator_linus runtime, it doesn’t take a<br />
lot of thinking to discover obvious scalability<br />
problems in the <strong>Linux</strong> 2.4 IO path. To add insult<br />
to injury, the entire IO path is guarded by a<br />
single, global lock: io_request_lock. This<br />
lock is held during the entire IO queuing operation,<br />
and typically also from the other end<br />
when a driver removes requests for IO submission.<br />
A single global lock is a big enough<br />
problem on its own (bigger SMP systems will<br />
suffer immensely because of cache line bouncing),<br />
but add to that long runtimes and you have<br />
a really huge IO scalability problem.<br />
<strong>Linux</strong> vendors have long shipped lock scalability<br />
patches to get around<br />
this problem. <strong>The</strong> adopted solution is typically<br />
to make the queue lock a pointer to a driver local<br />
lock, so the driver has full control of the<br />
granularity and scope of the lock. This solution<br />
was adopted from the 2.5 kernel, as we’ll<br />
see later. But this is another case where driver<br />
writers often need to differentiate between vendor<br />
and vanilla kernels.<br />
1.4 API problems<br />
Looking at the block layer as a whole (including<br />
both ends of the spectrum, the producers<br />
and consumers of the IO units going through<br />
the block layer), it is a typical example of code<br />
that has been hacked into existence without<br />
much thought to design. When things broke<br />
or new features were needed, they had been<br />
grafted into the existing mess. No well defined<br />
interface exists between file system and<br />
block layer, except a few scattered functions.<br />
Controlling IO unit flow from IO scheduler<br />
to driver was impossible: 2.4 exposes the IO<br />
scheduler data structures (the ->queue_head<br />
linked list used for queuing) directly to the<br />
driver. This fact alone makes it virtually impossible<br />
to implement more clever IO scheduling<br />
in 2.4. Even the recently (in the 2.4.20’s)<br />
added lower latency work was horrible to work<br />
with because of this lack of boundaries. Verifying<br />
correctness of the code is extremely difficult;<br />
peer review of the code likewise, since a<br />
reviewer must be intimate with the block layer<br />
structures to follow the code.<br />
Another example of the lack of clear direction is<br />
the partition remapping. In 2.4, it’s the driver’s<br />
responsibility to resolve partition mappings.<br />
A given request contains a device and sector<br />
offset (e.g., /dev/hda4, sector 128) and the<br />
driver must map this to an absolute device offset<br />
before sending it to the hardware. Not only<br />
does this cause duplicate code in the drivers,<br />
it also means the IO scheduler has no knowledge<br />
of the real device mapping of a particular<br />
request. This adversely impacts IO scheduling<br />
whenever partitions aren’t laid out in strict ascending<br />
disk order, since it causes the io scheduler<br />
to make the wrong decisions when ordering<br />
io.<br />
2 2.6 Block layer<br />
<strong>The</strong> above observations were the initial kick off<br />
for the 2.5 block layer patches. To solve some<br />
of these issues the block layer needed to be<br />
turned inside out, breaking basically anything that does IO<br />
along the way.<br />
2.1 bio<br />
Given that struct buffer_head was one<br />
of the problems, it made sense to start from<br />
scratch with an IO unit that would be agreeable<br />
to the upper layers as well as the drivers.<br />
<strong>The</strong> main criteria for such an IO unit would be<br />
something along the lines of:<br />
1. Must be able to contain an arbitrary<br />
amount of data, as much as the hardware<br />
allows. Or as much that makes sense at<br />
least, with the option of easily pushing<br />
this boundary later.<br />
2. Must work equally well for pages that<br />
have a virtual mapping as well as ones that<br />
do not.<br />
3. When entering the IO scheduler and<br />
driver, IO unit must point to an absolute<br />
location on disk.<br />
4. Must be able to stack easily for IO stacks<br />
such as raid and device mappers. This includes<br />
full redirect stacking like in 2.4, as<br />
well as partial redirections.<br />
Once the primary goals for the IO structure<br />
were laid out, the struct bio was<br />
born. It was decided to base the layout<br />
on a scatter-gather type setup, with the bio<br />
containing a map of pages. If the map<br />
count was made flexible, items 1 and 2 on<br />
the above list were already solved. <strong>The</strong><br />
actual implementation involved splitting the<br />
data container from the bio itself into a<br />
struct bio_vec structure. This was mainly<br />
done to ease allocation of the structures so<br />
that sizeof(struct bio) was always constant.<br />
<strong>The</strong> bio_vec structure is simply a tuple<br />
of {page, length, offset}, and the<br />
bio can be allocated with room for anything<br />
from 1 to BIO_MAX_PAGES. Currently <strong>Linux</strong><br />
defines that as 256 pages, meaning we can support<br />
up to 1MiB of data in a single bio for<br />
a system with 4KiB page size. At the time<br />
of implementation, 1MiB was a good deal beyond<br />
the point where increasing the IO size further<br />
didn’t yield better performance or lower<br />
CPU usage. It also has the added bonus of<br />
making the bio_vec fit inside a single page,<br />
so we avoid higher order memory allocations<br />
(sizeof(struct bio_vec) == 12 on 32-<br />
bit, 16 on 64-bit) in the IO path. This is an<br />
important point, as it eases the pressure on the<br />
memory allocator. For swapping or other low<br />
memory situations, we ideally want to stress<br />
the allocator as little as possible.<br />
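To make the tuple concrete, the definitions are essentially the following (abridged from include/linux/bio.h; only the fields relevant to this discussion are shown, and the exact layout varies slightly between 2.6 releases):<br />
struct bio_vec {
        struct page     *bv_page;       /* the page holding the data */
        unsigned int    bv_len;         /* bytes of data in this segment */
        unsigned int    bv_offset;      /* offset of the data within the page */
};

struct bio {
        sector_t                bi_sector;      /* absolute start sector on the device */
        struct bio              *bi_next;       /* chain link inside a request */
        struct block_device     *bi_bdev;       /* target device */
        unsigned long           bi_rw;          /* READ or WRITE, plus flags */
        unsigned short          bi_vcnt;        /* number of bio_vecs in the map */
        unsigned short          bi_idx;         /* current index into bi_io_vec */
        unsigned int            bi_size;        /* total data size in bytes */
        struct bio_vec          *bi_io_vec;     /* the page map itself */
        bio_end_io_t            *bi_end_io;     /* completion callback */
        void                    *bi_private;    /* owner-private data */
        /* ... */
};
On 32-bit systems the three bio_vec members occupy 4 + 4 + 4 = 12 bytes, and on 64-bit systems the page pointer doubles, giving 16 bytes—which is where the sizes quoted above come from.<br />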
Different hardware can support different sizes<br />
of io. Traditional parallel ATA can do a maximum<br />
of 128KiB per request, qlogicfc SCSI<br />
doesn’t like more than 32KiB, and lots of high<br />
end controllers don’t impose a significant limit<br />
on max IO size but may restrict the maximum<br />
number of segments that one IO may be composed<br />
of. Additionally, software raid or device mapper stacks may like special alignment<br />
of IO or the guarantee that IO won’t cross<br />
stripe boundaries. All of this knowledge is either<br />
impractical or impossible to statically advertise<br />
to submitters of io, so an easy interface<br />
for populating a bio with pages was essential<br />
if supporting large IO was to become<br />
practical. <strong>The</strong> current solution is int bio_<br />
add_page() which attempts to add a single<br />
page (full or partial) to a bio. It returns the<br />
number of bytes successfully added. Typical<br />
users of this function continue adding pages<br />
to a bio until it fails—then it is submitted for<br />
IO through submit_bio(), a new bio is allocated<br />
and populated until all data has gone<br />
out. int bio_add_page() uses statically<br />
defined parameters inside the request queue to<br />
determine how many pages can be added, and<br />
attempts to query a registered merge_bvec_<br />
fn for dynamic limits that the block layer cannot<br />
know about.<br />
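A rough sketch of that usage pattern follows (not lifted from any particular driver; my_end_io is a made-up completion callback and error handling is omitted):<br />
#include <linux/kernel.h>
#include <linux/bio.h>
#include <linux/blkdev.h>

/* Keep adding pages until bio_add_page() refuses, submit what has been
 * built so far, then start over with a fresh bio. */
static void submit_pages(struct block_device *bdev, sector_t sector,
                         struct page **pages, int nr_pages)
{
        struct bio *bio = NULL;
        int i;

        for (i = 0; i < nr_pages; i++) {
                if (!bio) {
                        bio = bio_alloc(GFP_NOIO,
                                        min_t(int, nr_pages - i, BIO_MAX_PAGES));
                        bio->bi_bdev = bdev;
                        bio->bi_sector = sector;
                        bio->bi_end_io = my_end_io;     /* caller-supplied */
                }
                if (bio_add_page(bio, pages[i], PAGE_SIZE, 0) < PAGE_SIZE) {
                        /* Queue limits reached: send this bio and retry the
                         * page with a new one. */
                        sector += bio->bi_size >> 9;
                        submit_bio(WRITE, bio);
                        bio = NULL;
                        i--;
                }
        }
        if (bio)
                submit_bio(WRITE, bio);
}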
Drivers hooking into the block layer before the IO scheduler (i.e., at make_request time) deal with struct bio directly,<br />
as opposed to the struct request that are<br />
output after the IO scheduler. Even though the<br />
page addition API guarantees that they never<br />
need to be able to deal with a bio that is too<br />
big, they still have to manage local splits at<br />
sub-page granularity. <strong>The</strong> API was defined that<br />
way to make it easier for IO submitters to manage,<br />
so they don’t have to deal with sub-page<br />
splits. 2.6 block layer defines two ways to<br />
deal with this situation—the first is the general<br />
clone interface. bio_clone() returns a clone<br />
of a bio. A clone is defined as a private copy of<br />
the bio itself, but with a shared bio_vec page<br />
map list. Drivers can modify the cloned bio<br />
and submit it to a different device without duplicating<br />
the data. <strong>The</strong> second interface is tailored<br />
specifically to single page splits and was<br />
written by kernel raid maintainer Neil Brown.<br />
The main function is bio_split(), which returns a struct bio_pair describing the two<br />
parts of the original bio. <strong>The</strong> two bio’s can<br />
then be submitted separately by the driver.<br />
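As a rough illustration of the clone interface (not taken from a real driver; lower_bdev, my_offset, and my_clone_end_io are placeholders):<br />
#include <linux/bio.h>
#include <linux/blkdev.h>

static int my_make_request(request_queue_t *q, struct bio *bio)
{
        /* Private copy of the bio, sharing the original bio_vec page list. */
        struct bio *clone = bio_clone(bio, GFP_NOIO);

        clone->bi_bdev = lower_bdev;                   /* the real backing device */
        clone->bi_sector = bio->bi_sector + my_offset; /* remap the location */
        clone->bi_end_io = my_clone_end_io;            /* completes the original bio */
        clone->bi_private = bio;

        generic_make_request(clone);
        return 0;
}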
2.2 Partition remapping<br />
Partition remapping is handled inside the IO<br />
stack before going to the driver, so that both<br />
drivers and IO schedulers have immediate full<br />
knowledge of precisely where data should end<br />
up. <strong>The</strong> device unfolding is done automatically<br />
by the same piece of code that resolves<br />
full bio redirects. <strong>The</strong> worker function is<br />
blk_partition_remap().<br />
2.3 Barriers<br />
Another feature that found its way to some vendor<br />
kernels is IO barriers. A barrier is defined<br />
as a piece of IO that is guaranteed to:<br />
• Be on platter (or safe storage at least)<br />
when completion is signaled.<br />
• Not precede any previously submitted io.<br />
• Not be preceded by later submitted io.<br />
<strong>The</strong> feature is handy for journalled file systems,<br />
fsync, and any sort of cache-bypassing IO (such as O_DIRECT or raw IO) where you want to provide guarantees on<br />
data order and correctness. <strong>The</strong> 2.6 code isn’t<br />
complete yet, nor is it in Linus' kernels, but it<br />
has made its way to Andrew Morton’s -mm tree<br />
which is generally considered a staging area for<br />
features. This section describes the code so far.<br />
<strong>The</strong> first type of barrier supported is a soft<br />
barrier. It isn’t of much use for data integrity<br />
applications, since it merely implies<br />
ordering inside the IO scheduler. It is signaled<br />
with the REQ_SOFTBARRIER flag inside<br />
struct request. A stronger barrier is the<br />
hard barrier. From the block layer and IO<br />
scheduler point of view, it is identical to the<br />
soft variant. Drivers need to know about it<br />
though, so they can take appropriate measures<br />
to correctly honor the barrier. So far the ide<br />
driver is the only one supporting a full, hard<br />
barrier. <strong>The</strong> issue was deemed most important<br />
for journalled desktop systems, where the<br />
lack of barriers and risk of crashes / power loss<br />
coupled with ide drives generally always defaulting<br />
to write back caching caused significant<br />
problems. Since the ATA command set<br />
isn’t very intelligent in this regard, the ide solution<br />
adopted was to issue pre- and post-flushes<br />
when encountering a barrier.<br />
<strong>The</strong> hard and soft barrier share the feature that<br />
they are both tied to a piece of data (a bio,<br />
really) and cannot exist outside of data context.<br />
Certain applications of barriers would really<br />
like to issue a disk flush, where finding out<br />
which piece of data to attach it to is hard or<br />
impossible. To solve this problem, the 2.6 barrier<br />
code added the blkdev_issue_flush()<br />
function. <strong>The</strong> block layer part of the code is basically<br />
tied to a queue hook, so the driver issues<br />
the flush on its own. A helper function is provided<br />
for SCSI type devices, using the generic<br />
SCSI command transport that the block layer<br />
provides in 2.6 (more on this later). Unlike<br />
the queued data barriers, a barrier issued with<br />
blkdev_issue_flush() works on all interesting<br />
drivers in 2.6 (IDE, SCSI, SATA). <strong>The</strong><br />
only missing bits are drivers that don’t belong<br />
to one of these classes—things like CISS and<br />
DAC960.<br />
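A minimal sketch of how a caller might request such a data-less flush, assuming the prototype described here (a block device plus an optional error-sector return); the surrounding function is made up for illustration:<br />
#include <linux/blkdev.h>

/* Ask the device behind `bdev' to flush its write-back cache.  If the
 * flush fails, the driver may report the offending sector through the
 * second argument. */
static int flush_device_cache(struct block_device *bdev)
{
        sector_t error_sector;

        return blkdev_issue_flush(bdev, &error_sector);
}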
2.4 IO Schedulers<br />
As mentioned in section 1.1, there are a number<br />
of known problems with the default 2.4 IO<br />
scheduler and IO scheduler interface (or lack<br />
thereof). <strong>The</strong> idea to base latency on a unit of<br />
data (sectors) rather than a time-based unit is hard to tune, or requires auto-tuning at runtime, and this never really worked out. Fixing the<br />
runtime problems with elevator_linus is<br />
next to impossible due to the exposed data structure<br />
problem. So before being able to tackle<br />
any problems in that area, a neat API to the IO<br />
scheduler had to be defined.<br />
2.4.1 Defined API<br />
In the spirit of avoiding over-design 3 , the API<br />
was based on an initial adaptation of elevator_<br />
linus, but has since grown quite a bit as newer<br />
IO schedulers required more entry points to exploit<br />
their features.<br />
<strong>The</strong> core function of an IO scheduler is, naturally,<br />
insertion of new io units and extraction of<br />
ditto from drivers. So the first 2 API functions<br />
are defined, next_req_fn and add_req_fn.<br />
If you recall from section 1.1, a new IO<br />
unit is first attempted merged into an existing<br />
request in the IO scheduler queue. And<br />
if this fails and the newly allocated request<br />
has raced with someone else adding an adjacent<br />
IO unit to the queue in the mean time,<br />
we also attempt to merge struct requests.<br />
So 2 more functions were added to cater to<br />
these needs, merge_fn and merge_req_fn.<br />
Cleaning up after a successful merge is done<br />
through merge_cleanup_fn. Finally, a defined<br />
IO scheduler can provide init and exit<br />
functions, should it need to perform any duties<br />
during queue init or shutdown.<br />
<strong>The</strong> above described the IO scheduler API<br />
as of 2.5.1, later on more functions were<br />
added to further abstract the IO scheduler<br />
away from the block layer core. More details<br />
may be found in the struct elevator_s in<br />
the kernel include files.<br />
3 Some might, rightfully, claim that this is worse than<br />
no design.<br />
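For orientation, the entry points named above can be pictured roughly as follows. This is a sketch in the spirit of struct elevator_s, not its literal definition; the real members carry longer names and the exact signatures vary between 2.6 releases.<br />
#include <linux/blkdev.h>

struct io_sched_ops_sketch {
        int  (*merge_fn)(request_queue_t *q, struct request **req, struct bio *bio);
        void (*merge_req_fn)(request_queue_t *q, struct request *req, struct request *next);
        void (*merge_cleanup_fn)(request_queue_t *q, struct request *req, int count);
        void (*add_req_fn)(request_queue_t *q, struct request *req, int where);
        struct request *(*next_req_fn)(request_queue_t *q);
        int  (*set_req_fn)(request_queue_t *q, struct request *req, int gfp_mask);
        void (*put_req_fn)(request_queue_t *q, struct request *req);
        int  (*init_fn)(request_queue_t *q, struct elevator_s *e);
        void (*exit_fn)(request_queue_t *q, struct elevator_s *e);
};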
2.4.2 deadline<br />
In kernel 2.5.39, elevator_linus was finally<br />
replaced by something more appropriate,<br />
the deadline IO scheduler. <strong>The</strong> principles behind<br />
it are pretty straightforward—new requests<br />
are assigned an expiry time in milliseconds,<br />
based on data direction. Internally, requests<br />
are managed on two different data structures.<br />
<strong>The</strong> sort list, used for inserts and front<br />
merge lookups, is based on a red-black tree.<br />
This provides O(log n) runtime for both insertion<br />
and lookups, clearly superior to the doubly<br />
linked list. Two FIFO lists exist for tracking<br />
request expiry times, using a double linked<br />
list. Since strict FIFO behavior is maintained<br />
on these two lists, they run in O(1) time. For<br />
back merges it is important to maintain good<br />
performance as well, as they dominate the total<br />
merge count due to the layout of files on<br />
disk. So deadline added a merge hash for<br />
back merges, ideally providing O(1) runtime<br />
for merges. Additionally, deadline adds a one-hit<br />
merge cache that is checked even before going<br />
to the hash. This gets surprisingly good hit<br />
rates, serving as much as 90% of the merges<br />
even for heavily threaded io.<br />
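Putting those pieces together, the per-queue bookkeeping can be pictured roughly like this (a sketch only; the real struct deadline_data in drivers/block/deadline-iosched.c has additional fields and somewhat different names):<br />
#include <linux/blkdev.h>
#include <linux/rbtree.h>
#include <linux/list.h>

struct deadline_sketch {
        struct rb_root sort_list[2];    /* sector-sorted red-black tree, per direction */
        struct list_head fifo_list[2];  /* expiry-ordered FIFOs for reads and writes */
        struct list_head *hash;         /* hash table used for back-merge lookups */
        struct request *last_merge;     /* one-hit merge cache, checked before the hash */
        unsigned long fifo_expire[2];   /* read/write expiry times, in jiffies */
};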
Implementation details aside, deadline continues<br />
to build on the fact that the fastest way to<br />
access a single drive is by scanning in the direction<br />
of ascending sector. With its superior<br />
runtime performance, deadline is able to support<br />
very large queue depths without suffering<br />
a performance loss or spending large amounts<br />
of time in the kernel. It also doesn’t suffer from<br />
latency problems due to increased queue sizes.<br />
When a request expires in the FIFO, deadline<br />
jumps to that disk location and starts serving<br />
IO from there. To prevent accidental seek<br />
storms (which would further cause us to miss<br />
deadlines), deadline attempts to serve a number<br />
of requests from that location before jumping<br />
to the next expired request. This means that<br />
the assigned request deadlines are soft, not a specific hard target that must be met.<br />
2.4.3 Anticipatory IO scheduler<br />
While deadline works very well for most<br />
workloads, it fails to observe the natural dependencies<br />
that often exist between synchronous<br />
reads. Say you want to list the contents of<br />
a directory—that operation isn’t merely a single<br />
sync read, it consists of a number of reads<br />
where only the completion of the final request<br />
will give you the directory listing. With deadline,<br />
you could get decent performance from<br />
such a workload in presence of other IO activities<br />
by assigning very tight read deadlines. But<br />
that isn’t very optimal, since the disk will be<br />
serving other requests in between the dependent<br />
reads causing a potentially disk wide seek<br />
every time. On top of that, the tight deadlines<br />
will decrease performance on other io streams<br />
in the system.<br />
Nick Piggin implemented an anticipatory IO<br />
scheduler [Iyer] during 2.5 to explore some interesting<br />
research in this area. <strong>The</strong> main idea<br />
behind the anticipatory IO scheduler is a concept<br />
called deceptive idleness. When a process<br />
issues a request and it completes, it might be<br />
ready to issue a new request (possibly close<br />
by) immediately. Take the directory listing example<br />
from above—it might require 3–4 IO<br />
operations to complete. When each of them<br />
completes, the process 4 is ready to issue the<br />
next one almost instantly. But the traditional<br />
io scheduler doesn’t pay any attention to this<br />
fact; the new request must go through the IO<br />
scheduler and wait its turn. With deadline, you<br />
would have to typically wait 500 milliseconds<br />
for each read, if the queue is held busy by other<br />
processes. <strong>The</strong> result is poor interactive performance<br />
for each process, even though overall<br />
throughput might be acceptable or even good.<br />
4 Or the kernel, on behalf of the process.
Instead of moving on to the next request from<br />
an unrelated process immediately, the anticipatory<br />
IO scheduler (hence forth known as AS)<br />
opens a small window of opportunity for that<br />
process to submit a new IO request. If that happens,<br />
AS gives it a new chance and so on. Internally<br />
it keeps a decaying histogram of IO think<br />
times to help the anticipation be as accurate as<br />
possible.<br />
Internally, AS is quite like deadline. It uses the<br />
same data structures and algorithms for sorting,<br />
lookups, and FIFO. If the think time is set<br />
to 0, it is very close to deadline in behavior.<br />
<strong>The</strong> only differences are various optimizations<br />
that have been applied to either scheduler allowing<br />
them to diverge a little. If AS is able to<br />
reliably predict when waiting for a new request<br />
is worthwhile, it gets phenomenal performance<br />
with excellent interactiveness. Often the system<br />
throughput is sacrificed a little bit, so depending<br />
on the workload AS might not be the<br />
best choice always. <strong>The</strong> IO storage hardware<br />
used, also plays a role in this—a non-queuing<br />
ATA hard drive is a much better fit than a SCSI<br />
drive with a large queuing depth. <strong>The</strong> SCSI<br />
firmware reorders requests internally, thus often<br />
destroying any accounting that AS is trying<br />
to do.<br />
2.4.4 CFQ<br />
The third new IO scheduler in 2.6 is called CFQ. It's loosely based on the ideas of stochastic fair queuing (SFQ [McKenney]). SFQ is fair as long as its hashing doesn't collide, and to avoid that, it uses a continually changing hashing function. Collisions can't be completely avoided though; their frequency will depend entirely on workload and timing. CFQ is an acronym for completely fair queuing, attempting to get around the collision problem that SFQ suffers from. To do so, CFQ does away with the fixed number of buckets that processes can be placed in. And by using a regular hashing technique to find the appropriate bucket in case of collisions, fatal collisions are avoided.<br />
CFQ deviates radically from the concepts that<br />
deadline and AS are based on. It doesn't assign<br />
deadlines to incoming requests to maintain<br />
fairness, instead it attempts to divide<br />
bandwidth equally among classes of processes<br />
based on some correlation between them. <strong>The</strong><br />
default is to hash on thread group id, tgid.<br />
This means that the scheduler attempts to distribute bandwidth<br />
equally among the processes in the<br />
system. Each class has its own request sort<br />
and hash list, using red-black trees again for<br />
sorting and regular hashing for back merges.<br />
When dealing with writes, there is a little catch.<br />
A process will almost never be performing its<br />
own writes—data is marked dirty in context of<br />
the process, but write back usually takes place<br />
from the pdflush kernel threads. So CFQ is<br />
actually dividing read bandwidth among processes,<br />
while treating each pdflush thread as a<br />
separate process. Usually this has very minor<br />
impact on write back performance. Latency is<br />
much less of an issue with writes, and good<br />
throughput is very easy to achieve due to their<br />
inherent asynchronous nature.<br />
2.5 Request allocation<br />
Each block driver in the system has at least<br />
one request_queue_t request queue structure<br />
associated with it. <strong>The</strong> recommended<br />
setup is to assign a queue to each logical<br />
spindle. In turn, each request queue has<br />
a struct request_list embedded which<br />
holds free struct request structures used<br />
for queuing io. 2.4 improved on the situation in 2.2, where a single global free list was available, by adding one per queue instead. This<br />
free list was split into two sections of equal<br />
size, for reads and writes, to prevent either
direction from starving the other 5 . 2.4 statically<br />
allocated a big chunk of requests for each<br />
queue, all residing in the precious low memory<br />
of a machine. <strong>The</strong> combination of O(N) runtime<br />
and statically allocated request structures<br />
firmly prevented any real world experimentation<br />
with large queue depths on 2.4 kernels.<br />
2.6 improves on this situation by dynamically<br />
allocating request structures on the fly instead.<br />
Each queue still maintains its request free list<br />
like in 2.4. However it’s also backed by a memory<br />
pool 6 to provide deadlock free allocations<br />
even during swapping. <strong>The</strong> more advanced<br />
io schedulers in 2.6 usually back each request<br />
by its own private request structure, further<br />
increasing the memory pressure of each request.<br />
Dynamic request allocation lifts some of<br />
this pressure as well by pushing that allocation<br />
inside two hooks in the IO scheduler API—<br />
set_req_fn and put_req_fn. <strong>The</strong> latter<br />
handles the later freeing of that data structure.<br />
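The mempool_t pattern referred to above looks roughly like the following; the structure and cache names are made up for illustration, and this is not the block layer's actual request pool code:<br />
#include <linux/init.h>
#include <linux/list.h>
#include <linux/mempool.h>
#include <linux/slab.h>

/* A slab-backed pool that always keeps a few objects in reserve, so
 * allocations can make progress even while the system is swapping. */
struct my_req {
        struct list_head queuelist;
        /* ... */
};

static kmem_cache_t *my_req_cache;
static mempool_t *my_req_pool;

static int __init my_req_pool_init(void)
{
        my_req_cache = kmem_cache_create("my_reqs", sizeof(struct my_req),
                                         0, 0, NULL, NULL);
        if (!my_req_cache)
                return -ENOMEM;

        /* Keep at least four pre-allocated objects around at all times. */
        my_req_pool = mempool_create(4, mempool_alloc_slab,
                                     mempool_free_slab, my_req_cache);
        return my_req_pool ? 0 : -ENOMEM;
}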
2.6 Plugging<br />
For the longest time, the <strong>Linux</strong> block layer has<br />
used a technique dubbed plugging to increase<br />
IO throughput. In its simplicity, plugging<br />
works sort of like the plug in your tub drain—<br />
when IO is queued on an initially empty queue,<br />
the queue is plugged. Only when someone asks<br />
for the completion of some of the queued IO is<br />
the plug yanked out, and io is allowed to drain<br />
from the queue. So instead of submitting the<br />
first request immediately to the driver, the block layer<br />
allows a small buildup of requests. <strong>The</strong>re’s<br />
nothing wrong with the principle of plugging,<br />
and it has been shown to work well for a number<br />
of workloads. However, the block layer<br />
maintains a global list of plugged queues inside<br />
the tq_disk task queue. <strong>The</strong>re are three<br />
main problems with this approach:<br />
5 In reality, to prevent writes from consuming all requests.<br />
6 mempool_t interface from Ingo Molnar.<br />
1. It’s impossible to go backwards from the<br />
file system and find the specific queue to<br />
unplug.<br />
2. Unplugging one queue through tq_disk<br />
unplugs all plugged queues.<br />
3. <strong>The</strong> act of plugging and unplugging<br />
touches a global lock.<br />
All of these adversely impact performance.<br />
<strong>The</strong>se problems weren’t really solved until late<br />
in 2.6, when Intel reported a huge scalability<br />
problem related to unplugging [Chen] on a 32<br />
processor system. 93% of system time was<br />
spent due to contention on blk_plug_lock,<br />
which is the 2.6 direct equivalent of the 2.4<br />
tq_disk embedded lock. <strong>The</strong> proposed solution<br />
was to move the plug lists to a per-<br />
CPU structure. While this would solve the<br />
contention problems, it still leaves the other 2<br />
items on the above list unsolved.<br />
So work was started to find a solution that<br />
would fix all problems at once, and just generally<br />
Feel Right. 2.6 contains a link between<br />
the block layer and write out paths<br />
which is embedded inside the queue, a<br />
struct backing_dev_info. This structure<br />
holds information on read-ahead and queue<br />
congestion state. It’s also possible to go from<br />
a struct page to the backing device, which<br />
may or may not be a block device. So it<br />
would seem an obvious idea to move to a backing<br />
device unplugging scheme instead, getting<br />
rid of the global blk_run_queues() unplugging.<br />
That solution would fix all three issues at<br />
once—there would be no global way to unplug<br />
all devices, only target specific unplugs, and<br />
the backing device gives us a mapping from<br />
page to queue. <strong>The</strong> code was rewritten to do<br />
just that, and provide unplug functionality going<br />
from a specific struct block_device,<br />
page, or backing device. Code and interface<br />
was much superior to the existing code base,
and results were truly amazing. Jeremy Higdon<br />
tested on an 8-way IA64 box [Higdon] and<br />
got 75–80 thousand IOPS on the stock kernel<br />
at 100% CPU utilization, 110 thousand IOPS<br />
with the per-CPU Intel patch also at full CPU<br />
utilization, and finally 200 thousand IOPS at<br />
merely 65% CPU utilization with the backing<br />
device unplugging. So not only did the new<br />
code provide a huge speed increase on this<br />
machine, it also went from being CPU to IO<br />
bound.<br />
2.6 also contains some additional logic to<br />
unplug a given queue once it reaches the<br />
point where waiting longer doesn’t make much<br />
sense. So where 2.4 will always wait for an explicit<br />
unplug, 2.6 can trigger an unplug when<br />
one of two conditions is met:<br />
1. The number of queued requests reaches a<br />
certain limit, q->unplug_thresh. This<br />
is tweakable per device and defaults to 4.<br />
2. When the queue has been idle for q-><br />
unplug_delay. Also tweakable per device,<br />
and defaults to 3 milliseconds.<br />
<strong>The</strong> idea is that once a certain number of<br />
requests have accumulated in the queue, it<br />
doesn’t make much sense to continue waiting<br />
for more—there is already an adequate number<br />
available to keep the disk happy. <strong>The</strong> time limit<br />
is really a last resort, and should rarely trigger<br />
in real life. Observations on various workloads have verified this. More than a handful or<br />
two timer unplugs per minute usually indicates<br />
a kernel bug.<br />
2.7 SCSI command transport<br />
An annoying aspect of CD writing applications<br />
in 2.4 has been the need to use ide-scsi, necessitating<br />
the inclusion of the entire SCSI stack<br />
for only that application. With the clear majority<br />
of the market being ATAPI hardware, this<br />
becomes even more silly. ide-scsi isn’t without<br />
its own class of problems either—it lacks the<br />
ability to use DMA on certain writing types.<br />
CDDA audio ripping is another application that<br />
thrives with ide-scsi, since the native uniform<br />
cdrom layer interface is less than optimal (put<br />
mildly). It doesn’t have DMA capabilities at<br />
all.<br />
2.7.1 Enhancing struct request<br />
<strong>The</strong> problem with 2.4 was the lack of ability<br />
to generically send SCSI “like” commands<br />
to devices that understand them. Historically,<br />
only file system read/write requests could be<br />
submitted to a driver. Some drivers made up<br />
faked requests for other purposes themselves<br />
and put them on the queue for their own consumption,<br />
but no defined way of doing this existed.<br />
2.6 adds a new request type, marked by<br />
the REQ_BLOCK_PC bit. Such a request can be<br />
either backed by a bio like a file system request,<br />
or simply have a data and length field set.<br />
For both types, a SCSI command data block is<br />
filled inside the request. With this infrastructure<br />
in place and appropriate update to drivers<br />
to understand these requests, it’s a cinch to support<br />
a much better direct-to-device interface for<br />
burning.<br />
Most applications use the SCSI sg API for talking<br />
to devices. Some of them talk directly to<br />
the /dev/sg* special files, while (most) others<br />
use the SG_IO ioctl interface. <strong>The</strong> former<br />
requires a yet unfinished driver to transform<br />
them into block layer requests, but the latter<br />
can be readily intercepted in the kernel and<br />
routed directly to the device instead of through<br />
the SCSI layer. Helper functions were added<br />
to make burning and ripping even faster, providing<br />
DMA for all applications and without<br />
copying data between kernel and user space at<br />
all. So the zero-copy DMA burning was possible,<br />
and this even without changing most applications.<br />
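For reference, the SG_IO usage that 2.6 can now service directly at the block device looks roughly like this from user space; the device name is arbitrary and error handling is abbreviated:<br />
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

int main(void)
{
        unsigned char cdb[6] = { 0x12, 0, 0, 0, 96, 0 };  /* SCSI INQUIRY */
        unsigned char buf[96], sense[32];
        struct sg_io_hdr hdr;
        int fd = open("/dev/hdc", O_RDONLY | O_NONBLOCK);

        memset(&hdr, 0, sizeof(hdr));
        hdr.interface_id = 'S';
        hdr.dxfer_direction = SG_DXFER_FROM_DEV;
        hdr.cmdp = cdb;
        hdr.cmd_len = sizeof(cdb);
        hdr.dxferp = buf;
        hdr.dxfer_len = sizeof(buf);
        hdr.sbp = sense;
        hdr.mx_sb_len = sizeof(sense);
        hdr.timeout = 5000;             /* milliseconds */

        if (fd < 0 || ioctl(fd, SG_IO, &hdr) < 0)
                perror("SG_IO");
        else
                printf("vendor: %.8s\n", buf + 8);
        return 0;
}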
3 <strong>Linux</strong>-2.7<br />
<strong>The</strong> 2.5 development cycle saw the most massively<br />
changed block layer in the history of<br />
<strong>Linux</strong>. Before 2.5 was opened, Linus had<br />
clearly expressed that one of the most important<br />
things that needed doing, was the block<br />
layer update. And indeed, the very first thing<br />
merged was the complete bio patch into 2.5.1-<br />
pre2. At that time, no more than a handful<br />
of drivers compiled (let alone worked). The 2.7<br />
changes will be nowhere as severe or drastic.<br />
A few of the possible directions will follow in<br />
the next few sections.<br />
3.1 IO Priorities<br />
Prioritized IO is a very interesting area that<br />
is sure to generate lots of discussion and development.<br />
It’s one of the missing pieces of<br />
the complete resource management puzzle that<br />
several groups of people would very much like<br />
to solve. People running systems with many<br />
users, or machines hosting virtual hosts (or<br />
completely virtualized environments) are dying<br />
to be able to provide some QOS guarantees.<br />
Some work was already done in this<br />
area, but so far nothing complete has materialized.<br />
The CKRM [CKRM] project spearheaded by<br />
IBM is an attempt to define global resource<br />
management, including io. <strong>The</strong>y applied a little<br />
work to the CFQ IO scheduler to provide<br />
equal bandwidth between resource management<br />
classes, but without specific priorities. Currently<br />
I have a CFQ patch that is 99% complete<br />
that provides full priority support, using the IO<br />
contexts introduced by AS to manage fair sharing<br />
over the full time span that a process exists<br />
7 . This works well enough, but only works<br />
for that specific IO scheduler. A nicer solution<br />
would be to create a scheme that works independently<br />
of the io scheduler used. That would<br />
require a rethinking of the IO scheduler API.<br />
7 CFQ currently tears down class structures as soon as they are empty; they don't persist over process life time.<br />
3.2 IO Scheduler switching<br />
Currently <strong>Linux</strong> provides no less than 4 IO<br />
schedulers—the 3 mentioned, plus a fourth<br />
dubbed noop. <strong>The</strong> latter is a simple IO scheduler<br />
that does no request reordering, no latency<br />
management, and always merges whenever it<br />
can. Its area of application is mainly highly<br />
intelligent hardware with huge queue depths,<br />
where regular request reordering doesn’t make<br />
sense. Selecting a specific IO scheduler can<br />
either be done by modifying the source of a<br />
driver and putting the appropriate calls in there<br />
at queue init time, or globally for any queue by<br />
passing the elevator=xxx boot parameter.<br />
This makes it impossible, or at least very impractical,<br />
to benchmark different IO schedulers<br />
without many reboots or recompiles. Some<br />
way to switch IO schedulers per queue and on<br />
the fly is desperately needed. Freezing a queue<br />
and letting IO drain from it until it’s empty<br />
(pinning new IO along the way), and then shutting<br />
down the old io scheduler and moving to<br />
the new scheduler would not be so hard to do.<br />
<strong>The</strong> queues expose various sysfs variables already,<br />
so the logical approach would simply be<br />
to:<br />
# echo deadline > \<br />
/sys/block/hda/queue/io_scheduler<br />
A simple but effective interface. At least two<br />
patches doing something like this were already<br />
proposed, but nothing was merged at that time.<br />
4 Final comments<br />
<strong>The</strong> block layer code in 2.6 has come a long<br />
way from the rotted 2.4 code. New features
bring it more up-to-date with modern hardware,<br />
and the completely rewritten core provides much better scalability, performance, and memory usage, benefiting any machine<br />
from small to really huge. Going back<br />
a few years, I heard constant complaints about<br />
the block layer and how much it sucked and<br />
how outdated it was. <strong>The</strong>se days I rarely<br />
hear anything about the current state of affairs,<br />
which usually means that it’s doing pretty well<br />
indeed. 2.7 work will mainly focus on feature<br />
additions and driver layer abstractions (our<br />
concept of IDE layer, SCSI layer etc will be<br />
severely shaken up). Nothing that will wreak<br />
havoc and turn everything inside out like 2.5<br />
did. Most of the 2.7 work mentioned above<br />
is pretty light, and could easily be backported<br />
to 2.6 once it has been completed and tested.<br />
Which is also a good sign that nothing really<br />
radical or risky is missing. So things are settling<br />
down, a sign of stability.<br />
References<br />
[Higdon] Jeremy Higdon, Re: [PATCH]<br />
per-backing dev unplugging #2, <strong>Linux</strong><br />
kernel mailing list<br />
http://marc.theaimsgroup.<br />
com/?l=linux-kernel&m=<br />
107941470424309&w=2, 2004<br />
[CKRM] IBM, Class-based <strong>Kernel</strong> Resource<br />
Management (CKRM),<br />
http://ckrm.sf.net, 2004<br />
[Bhattacharya] Suparna Bhattacharya, Notes<br />
on the Generic Block Layer Rewrite in<br />
<strong>Linux</strong> 2.5, General discussion,<br />
Documentation/block/biodoc.<br />
txt, 2002<br />
[Iyer] Sitaram Iyer and Peter Druschel,<br />
Anticipatory scheduling: A disk<br />
scheduling framework to overcome<br />
deceptive idleness in synchronous I/O,<br />
18th ACM Symposium on Operating<br />
Systems Principles, http:<br />
//www.cs.rice.edu/~ssiyer/<br />
r/antsched/antsched.ps.gz,<br />
2001<br />
[McKenney] Paul E. McKenney, Stochastic<br />
Fairness Queuing, INFOCOM http:<br />
//rdrop.com/users/paulmck/<br />
paper/sfq.2002.06.04.pdf,<br />
1990<br />
[Chen] Kenneth W. Chen, per-cpu<br />
blk_plug_list, <strong>Linux</strong> kernel mailing list<br />
http://www.ussg.iu.edu/<br />
hypermail/linux/kernel/<br />
0403.0/0179.html, 2004
<strong>Linux</strong> AIO Performance and Robustness for<br />
Enterprise Workloads<br />
Suparna Bhattacharya, IBM (suparna@in.ibm.com)<br />
John Tran, IBM (jbtran@ca.ibm.com)<br />
Mike Sullivan, IBM (mksully@us.ibm.com)<br />
Chris Mason, SUSE (mason@suse.com)<br />
1 Abstract<br />
In this paper we address some of the issues<br />
identified during the development and stabilization<br />
of Asynchronous I/O (AIO) on <strong>Linux</strong><br />
2.6.<br />
We start by describing improvements made to<br />
optimize the throughput of streaming buffered<br />
filesystem AIO for microbenchmark runs.<br />
Next, we discuss certain tricky issues in ensuring<br />
data integrity between AIO Direct I/O<br />
(DIO) and buffered I/O, and take a deeper look<br />
at synchronized I/O guarantees, concurrent<br />
I/O, write-ordering issues and the improvements<br />
resulting from radix-tree based writeback<br />
changes in the <strong>Linux</strong> VFS.<br />
We then investigate the results of using <strong>Linux</strong><br />
2.6 filesystem AIO on the performance metrics<br />
for certain enterprise database workloads<br />
which are expected to benefit from AIO, and<br />
mention a few tips on optimizing AIO for such<br />
workloads. Finally, we briefly discuss the issues<br />
around workloads that need to combine<br />
asynchronous disk I/O and network I/O.<br />
2 Introduction<br />
AIO enables a single application thread to<br />
overlap processing with I/O operations for better<br />
utilization of CPU and devices. AIO can<br />
improve the performance of certain kinds of<br />
I/O intensive applications like databases, webservers<br />
and streaming-content servers. <strong>The</strong><br />
use of AIO also tends to help such applications<br />
adapt and scale more smoothly to varying<br />
loads.<br />
2.1 Overview of kernel AIO in <strong>Linux</strong> 2.6<br />
<strong>The</strong> <strong>Linux</strong> 2.6 kernel implements in-kernel<br />
support for AIO. A low-level native AIO system<br />
call interface is provided that can be invoked<br />
directly by applications or used by library<br />
implementations to build POSIX/SUS<br />
semantics. All discussion hereafter in this paper<br />
pertains to the native kernel AIO interfaces.<br />
Applications can submit one or more<br />
I/O requests asynchronously using the<br />
io_submit() system call, and obtain<br />
completion notification using the<br />
io_getevents() system call. Each<br />
I/O request specifies the operation (typically<br />
read/write), the file descriptor and the parameters<br />
for the operation (e.g., file offset,<br />
buffer). I/O requests are associated with the<br />
completion queue (ioctx) they were submitted<br />
against. <strong>The</strong> results of I/O are reported as<br />
completion events on this queue, and reaped<br />
using io_getevents().<br />
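As a concrete illustration of this submit/reap cycle, the following minimal user-space sketch uses the libaio wrapper around the native system calls. It is not taken from the paper: the file name, buffer size, and queue depth are arbitrary, and error handling is abbreviated (link with -laio).

#include <libaio.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUF_SIZE 4096

int main(void)
{
	io_context_t ctx;
	struct iocb cb, *cbs[1];
	struct io_event events[1];
	void *buf;
	int fd, ret;

	fd = open("datafile", O_RDONLY);             /* arbitrary test file */
	if (fd < 0) { perror("open"); return 1; }

	memset(&ctx, 0, sizeof(ctx));
	if (io_setup(32, &ctx) < 0) {                /* create the completion queue (ioctx) */
		perror("io_setup"); return 1;
	}
	if (posix_memalign(&buf, 512, BUF_SIZE))     /* aligned buffer (also suitable for DIO) */
		return 1;

	io_prep_pread(&cb, fd, buf, BUF_SIZE, 0);    /* async read of BUF_SIZE bytes at offset 0 */
	cbs[0] = &cb;

	ret = io_submit(ctx, 1, cbs);                /* submit without waiting for completion */
	if (ret != 1) { fprintf(stderr, "io_submit: %d\n", ret); return 1; }

	ret = io_getevents(ctx, 1, 1, events, NULL); /* reap one completion event */
	printf("reaped %d event(s), res=%ld\n", ret, (long)events[0].res);

	io_destroy(ctx);
	close(fd);
	return 0;
}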
<strong>The</strong> design of AIO for the <strong>Linux</strong> 2.6 kernel has<br />
been discussed in [1], including the motivation
behind certain architectural choices, for example:<br />
• Sharing a common code path for AIO and<br />
regular I/O<br />
• A retry-based model for AIO continuations<br />
across blocking points in the case of<br />
buffered filesystem AIO (currently implemented<br />
as a set of patches to the <strong>Linux</strong> 2.6<br />
kernel) where worker threads take on the<br />
caller’s address space for executing retries<br />
involving access to user-space buffers.<br />
2.2 Background on retry-based AIO<br />
<strong>The</strong> retry-based model allows an AIO request<br />
to be executed as a series of non-blocking iterations.<br />
Each iteration retries the remaining<br />
part of the request from where the last iteration<br />
left off, re-issuing the corresponding<br />
AIO filesystem operation with modified arguments<br />
representing the remaining I/O. <strong>The</strong> retries<br />
are “kicked” via a special AIO waitqueue<br />
callback routine, aio_wake_function(),<br />
which replaces the default waitqueue entry<br />
used for blocking waits.<br />
<strong>The</strong> high-level retry infrastructure is responsible<br />
for running the iterations in the address<br />
space context of the caller, and ensures that<br />
only one retry instance is active at a given time.<br />
This relieves the filesystem operations (fops) themselves of having
to deal with potential races of that sort.<br />
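To make the mechanism more concrete, here is a hedged sketch of how a custom waitqueue callback can replace the default waitqueue entry so that a wakeup "kicks" a retry instead of waking a sleeping task. The structure layout and the kick_retry() helper are illustrative only, not the actual code in the AIO patches; the waitqueue primitives (init_waitqueue_func_entry(), add_wait_queue()) are standard 2.6 kernel APIs.

#include <linux/kernel.h>
#include <linux/wait.h>
#include <linux/list.h>
#include <linux/aio.h>
#include <linux/errno.h>

extern void kick_retry(struct kiocb *iocb);   /* illustrative helper, not a real kernel API */

struct aio_retry_wait {
	wait_queue_t wait;           /* embedded waitqueue entry */
	struct kiocb *iocb;          /* request to retry when the event occurs */
};

/* installed in place of default_wake_function(); runs when the waited-for event fires */
static int aio_retry_wake_function(wait_queue_t *wait, unsigned mode,
				   int sync, void *key)
{
	struct aio_retry_wait *w = container_of(wait, struct aio_retry_wait, wait);

	list_del_init(&wait->task_list);   /* take ourselves off the waitqueue */
	kick_retry(w->iocb);               /* illustrative: schedule the retry to run
					      in task context (e.g., via a workqueue) */
	return 1;
}

/* instead of prepare_to_wait()/schedule(), the retry model queues the
 * callback and unwinds without blocking */
static long queue_retry_and_return(wait_queue_head_t *wq, struct aio_retry_wait *w)
{
	init_waitqueue_func_entry(&w->wait, aio_retry_wake_function);
	add_wait_queue(wq, &w->wait);
	return -EIOCBRETRY;                /* caller returns bytes done so far */
}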
2.3 Overview of the rest of the paper<br />
In subsequent sections of this paper, we describe<br />
our experiences in addressing several issues<br />
identified during the optimization and stabilization<br />
efforts related to the kernel AIO implementation<br />
for <strong>Linux</strong> 2.6, mainly in the area<br />
of disk- or filesystem-based AIO.<br />
We observe, for example, how I/O patterns<br />
generated by the common VFS code paths<br />
used by regular and retry-based AIO could<br />
be non-optimal for streaming AIO requests,<br />
and we describe the modifications that address<br />
this finding. A different set of problems that has seen some development activity comprises the races, exposures, and potential data-integrity concerns between direct and buffered I/O, which become especially tricky in the presence of AIO. Some of these issues motivated Andrew Morton’s modified page-writeback design for the VFS using tagged radix-tree lookups, and we discuss the implications for the AIO O_SYNC write implementation.
In general, disk-based filesystem AIO requirements<br />
for database workloads have been a<br />
guiding consideration in resolving some of the<br />
trade-offs encountered, and we present some<br />
initial performance results for such workloads.<br />
Lastly, we touch upon potential approaches to<br />
allow processing of disk-based AIO and communications<br />
I/O within a single event loop.<br />
3 Streaming AIO reads<br />
3.1 Basic retry pattern for single AIO read<br />
<strong>The</strong> retry-based design for buffered filesystem<br />
AIO read works by converting each blocking<br />
wait for read completion on a page into a retry<br />
exit. <strong>The</strong> design queues an asynchronous notification<br />
callback and returns the number of<br />
bytes for which the read has completed so far<br />
without blocking. <strong>The</strong>n, when the page becomes<br />
up-to-date, the callback kicks off a retry<br />
continuation in task context. This retry continuation<br />
invokes the same filesystem read operation<br />
again using the caller’s address space, but<br />
this time with arguments modified to reflect the<br />
remaining part of the read request.<br />
For example, given a 16KB read request starting<br />
at offset 0, where the first 4KB is already<br />
in cache, one might see the following sequence<br />
of retries (in the absence of readahead):
first time:<br />
fop->aio_read(fd, 0, 16384) = 4096<br />
and when read completes for the second page:<br />
fop->aio_read(fd, 4096, 12288) = 4096<br />
and when read completes for the third page:<br />
fop->aio_read(fd, 8192, 8192) = 4096<br />
and when read completes for the fourth page:<br />
fop->aio_read(fd, 12288, 4096) = 4096<br />
3.2 Impact of readahead on single AIO read<br />
Usually, however, the readahead logic attempts<br />
to batch read requests in advance. Hence, more<br />
I/O would be seen to have completed at each<br />
retry. <strong>The</strong> logic attempts to predict the optimal<br />
readahead window based on state it maintains<br />
about the sequentiality of past read requests on<br />
the same file descriptor. Thus, given a maximum<br />
readahead window size of 128KB, the sequence<br />
of retries would appear to be more like<br />
the following example, which results in significantly<br />
improved throughput:<br />
first time:<br />
fop->aio_read(fd, 0, 16384) = 4096,<br />
after issuing readahead<br />
for 128KB/2 = 64KB<br />
and when read completes for the above I/O:<br />
fop->aio_read(fd, 4096, 12288) = 12288<br />
Notice that care is taken to ensure that readaheads<br />
are not repeated during retries.<br />
3.3 Impact of readahead on streaming AIO<br />
reads<br />
In the case of streaming AIO reads, a sequence<br />
of AIO read requests is issued on the same<br />
file descriptor, where subsequent reads are submitted<br />
without waiting for previous requests to<br />
complete (contrast this with a sequence of synchronous<br />
reads).<br />
Interestingly, we encountered a significant<br />
throughput degradation as a result of the interplay<br />
of readahead and streaming AIO reads.<br />
To see why, consider the retry sequence for<br />
streaming random AIO read requests of 16KB,<br />
where o1, o2, o3, ... refer to the random<br />
offsets where these reads are issued:<br />
first time:<br />
fop->aio_read(fd, o1, 16384) = -EIOCBRETRY,<br />
after issuing readahead for 64KB<br />
as the readahead logic sees the first page<br />
of the read<br />
fop->aio_read(fd, o2, 16384) = -EIOCBRETRY,<br />
after issuing readahead for 8KB (notice<br />
the shrinkage of the readahead window<br />
because of non-sequentiality seen by the<br />
readahead logic)<br />
fop->aio_read(fd, o3, 16384) = -EIOCBRETRY,<br />
after maximally shrinking the readahead<br />
window, turning off readahead and issuing<br />
4KB read in the slow path<br />
fop->aio_read(fd, o4, 16384) = -EIOCBRETRY,<br />
after issuing 4KB read in the slow path<br />
.<br />
.<br />
and when read completes for o1<br />
fop->aio_read(fd, o1, 16384) = 16384<br />
and when read completes for o2<br />
fop->aio_read(fd, o2, 16384) = 8192<br />
and when read completes for o3<br />
fop->aio_read(fd, o3, 16384) = 4096<br />
and when read completes for o4<br />
fop->aio_read(fd, o4, 16384) = 4096
.<br />
.<br />
In steady state, this amounts to a maximally shrunk
readahead window with 4KB reads at<br />
random offsets being issued serially one at a<br />
time on a slow path, causing seek storms and<br />
driving throughputs down severely.<br />
3.4 Upfront readahead for improved streaming<br />
AIO read throughputs<br />
To address this issue, we made the readahead<br />
logic aware of the sequentiality of all pages in a<br />
single read request upfront—before submitting<br />
the next read request. This resulted in a more<br />
desirable outcome as follows:<br />
fop->aio_read(fd, o1, 16384) = -EIOCBRETRY,<br />
after issuing readahead for 64KB<br />
as the readahead logic sees all the 4<br />
pages for the read<br />
fop->aio_read(fd, o2, 16384) = -EIOCBRETRY,<br />
after issuing readahead for 20KB, as the<br />
readahead logic sees all 4 pages of the<br />
read (the readahead window shrinks to<br />
4+1=5 pages)
fop->aio_read(fd, o3, 16384) = -EIOCBRETRY,<br />
after issuing readahead for 20KB, as the<br />
readahead logic sees all 4 pages of the<br />
read (the readahead window is maintained<br />
at 4+1=5 pages)<br />
.<br />
.<br />
and when read completes for o1<br />
fop->aio_read(fd, o1, 16384) = 16384<br />
and when read completes for o2<br />
fop->aio_read(fd, o2, 16384) = 16384<br />
and when read completes for o3<br />
fop->aio_read(fd, o3, 16384) = 16384<br />
.<br />
.<br />
3.5 Upfront readahead and sendfile regressions<br />
At first sight it appears that upfront readahead<br />
is a reasonable change for all situations, since<br />
it immediately passes to the readahead logic<br />
the entire size of the request. However, it has<br />
the unintended, potential side-effect of losing<br />
pipelining benefits for really large reads, or operations<br />
like sendfile which involve post processing<br />
I/O on the contents just read. <strong>One</strong> way<br />
to address this is to clip the maximum size<br />
of upfront readahead to the maximum readahead<br />
setting for the device. To see why even<br />
that may not suffice for certain situations, let<br />
us take a look at the following sequence for<br />
a webserver that uses non-blocking sendfile to<br />
serve a large (2GB) file.<br />
sendfile(fd, 0, 2GB, fd2) = 8192,<br />
tells readahead about up to 128KB<br />
of the read<br />
sendfile(fd, 8192, 2GB - 8192, fd2) = 8192,<br />
tells readahead about 8KB - 132KB<br />
of the read<br />
sendfile(fd, 16384, 2GB - 16384, fd2) = 8192,<br />
tells readahead about 16KB-140KB<br />
of the read<br />
...<br />
This confuses the readahead logic about the<br />
I/O pattern, which appears to be 0–128K, 8K–132K, 16K–140K instead of the clear sequential pattern from 0–2GB that is actually present.
To avoid such unanticipated issues, upfront<br />
readahead required a special case for AIO<br />
alone, limited to the maximum readahead setting<br />
for the device.<br />
3.6 Streaming AIO read microbenchmark<br />
comparisons<br />
We explored streaming AIO throughput improvements<br />
with the retry-based AIO implementation<br />
and optimizations discussed above,<br />
using a custom microbenchmark called aio-stress
[2]. aio-stress issues a stream of AIO<br />
requests to one or more files, where one can<br />
vary several parameters including I/O unit size,<br />
total I/O size, depth of iocbs submitted at a<br />
time, number of concurrent threads, and type<br />
and pattern of I/O operations, and reports the<br />
overall throughput attained.<br />
<strong>The</strong> hardware included a 4-way 700MHz<br />
Pentium ® III machine with 512MB of RAM<br />
and a 1MB L2 cache. <strong>The</strong> disk subsystem<br />
used for the I/O tests consisted of an Adaptec<br />
AIC7896/97 Ultra2 SCSI controller connected<br />
to a disk enclosure with six 9GB disks, one<br />
of which was configured as an ext3 filesystem<br />
with a block size of 4KB for testing.<br />
<strong>The</strong> runs compared aio-stress throughputs for<br />
streaming random buffered I/O reads (i.e.,<br />
without O_DIRECT), with and without the<br />
previously described changes. All the runs<br />
were for the case where the file was not already<br />
cached in memory. Figure 1
summarizes how the results varied across individual<br />
request sizes of 4KB to 64KB, where<br />
I/O was targeted to a single file of size 1GB,<br />
the depth of iocbs outstanding at a time being<br />
64KB. A third run was performed to find out<br />
how the results compared with equivalent runs<br />
using AIO-DIO.<br />
With the changes applied, the results showed<br />
an approximate 2x improvement across all<br />
block sizes, bringing throughputs to levels that<br />
match the corresponding results using AIO-<br />
DIO.
Figure 1: Comparisons of streaming random AIO read throughputs (streaming AIO read results with aio-stress: throughput in MB/s versus request size in KB, for FSAIO (non-cached) 2.6.2 Vanilla, FSAIO (non-cached) 2.6.2 Patched, and AIO-DIO 2.6.2 Vanilla).
4 AIO DIO vs cached I/O integrity issues

4.1 DIO vs buffered races

Stephen Tweedie discovered several races between DIO and buffered I/O to the same file [3]. These races could lead to potential stale-data exposures and even data-integrity issues. Most instances were related to situations when in-core meta-data updates were visible before actual instantiation or resetting of corresponding data blocks on disk. Problems could also arise when meta-data updates were not visible to other code paths that could simultaneously update meta-data as well. The races mainly affected sparse files due to the lack of atomicity between the file flush in the DIO paths and actual data block accesses.

The solution that Stephen Tweedie came up with, and which Badari Pulavarty ported to Linux 2.6, involved protecting block lookups and meta-data updates with the inode semaphore (i_sem) in DIO paths for both read and write, atomically with the file flush. Overwriting of sparse blocks in the DIO write path was modified to fall back to buffered writes. Finally, an additional semaphore (i_alloc_sem) was introduced to lock out deallocation of blocks by a truncate while DIO was in progress. The semaphore is held in shared mode by DIO and in exclusive mode by truncate.

Note that the new locking rules (i.e., lock ordering of i_sem first and then i_alloc_sem), while allowing for filesystem-specific implementations of the DIO and file-write interfaces, had to be handled with some care.
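A hedged sketch of the resulting lock ordering in a DIO path follows; it is meant only to illustrate the rules just described (i_sem before i_alloc_sem), not to reproduce the actual fs/direct-io.c code.

#include <linux/fs.h>

/* illustrative only: lock ordering for a direct I/O request */
static void dio_locking_sketch(struct inode *inode)
{
	down(&inode->i_sem);            /* taken first: serializes with buffered I/O
					   and meta-data updates */
	down_read(&inode->i_alloc_sem); /* shared: blocks truncate from deallocating
					   blocks while the DIO is in flight */

	/* ... flush dirty pages, look up blocks, submit the direct I/O ... */

	up(&inode->i_sem);              /* may be dropped once block lookups are done */

	/* i_alloc_sem is released only when the I/O completes; for AIO-DIO that
	 * happens in the completion path (see Section 4.2) */
	up_read(&inode->i_alloc_sem);
}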
4.2 AIO-DIO specific races

The inclusion of AIO in Linux 2.6 added some
tricky scenarios to the above-described problems<br />
because of the potential races inherent in<br />
returning without waiting for I/O completion.<br />
<strong>The</strong> interplay of AIO-DIO writes and truncate<br />
was a particular worry as it could lead to corruption<br />
of file data; for example, blocks could<br />
get deallocated and reallocated to a new file<br />
while an AIO-DIO write to the file was still in<br />
progress. To avoid this, AIO-DIO had to return<br />
with i_alloc_sem held, and only release it<br />
as part of I/O completion post-processing. Notice<br />
that this also had implications for AIO cancellation.<br />
File size updates for AIO-DIO file extends<br />
could expose unwritten blocks if they happened<br />
before I/O completed asynchronously.<br />
<strong>The</strong> case involving fallback to buffered I/O<br />
was particularly non-trivial if a single request<br />
spanned allocated and sparse regions of a<br />
file. Specifically, part of the I/O could have<br />
been initiated via DIO then continued asynchronously,<br />
while the fallback to buffered I/O<br />
occurred and signaled I/O completion to the<br />
application. <strong>The</strong> application may thus have<br />
reused its I/O buffer, overwriting it with other<br />
data and potentially causing file data corruption<br />
if writeout to disk had still been pending.<br />
It might appear that some of these problems
could be avoided if I/O schedulers guaranteed<br />
the ordering of I/O requests issued to the same<br />
disk block. However, this isn’t a simple proposition<br />
in the current architecture, especially in<br />
generalizing the design to all possible cases,<br />
including network block devices. <strong>The</strong> use of<br />
I/O barriers would be necessary and the costs<br />
may not be justified for these special-case situations.<br />
Instead, a pragmatic approach was taken to address this, based on the assumption that truly asynchronous behaviour is really meaningful in practice mainly when performing
I/O to already-allocated file blocks. For<br />
example, databases typically preallocate files<br />
at the time of creation, so that AIO writes<br />
during normal operation and in performance-critical
paths do not extend the file or encounter<br />
sparse regions. Thus, for the sake of correctness,<br />
synchronous behaviour may be tolerable<br />
for AIO writes involving sparse regions or file<br />
extends. This compromise simplified the handling<br />
of the scenarios described earlier. AIO-<br />
DIO file extends now wait for I/O to complete<br />
and update the file size. AIO-DIO writes spanning<br />
allocated and sparse regions now wait for<br />
previously-issued DIO for that request to complete
before falling back to buffered I/O.<br />
5 Concurrent I/O with synchronized<br />
write guarantees<br />
An application opts for synchronized writes<br />
(by using the O_SYNC option on file open)<br />
when the I/O must be committed to disk before<br />
the write request completes. In the case<br />
of DIO, writes directly go to disk anyway. For<br />
buffered I/O, data is first copied into the page<br />
cache and later written out to disk; if synchronized<br />
I/O is specified then the request returns<br />
only after the writeout is complete.<br />
An application might also choose to synchronize<br />
previously-issued writes to disk by invoking<br />
fsync(), which writes back data from the<br />
page cache to disk and waits for writeout to<br />
complete before returning.<br />
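For reference, the two user-visible styles of synchronized writes look like the sketch below in application code; this is a plain illustration of the semantics just described, with file names arbitrary and error checking omitted.

#include <fcntl.h>
#include <unistd.h>
#include <string.h>

int main(void)
{
	char buf[4096];
	int fd;

	memset(buf, 'x', sizeof(buf));

	/* Style 1: O_SYNC -- each write() returns only after the data has
	 * been committed to disk. */
	fd = open("sync-file", O_WRONLY | O_CREAT | O_SYNC, 0644);
	write(fd, buf, sizeof(buf));
	close(fd);

	/* Style 2: ordinary buffered writes, synchronized afterwards with
	 * fsync(), which writes back the dirty page-cache data and waits
	 * for the writeout to complete. */
	fd = open("buffered-file", O_WRONLY | O_CREAT, 0644);
	write(fd, buf, sizeof(buf));
	fsync(fd);
	close(fd);
	return 0;
}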
5.1 Concurrent DIO writes<br />
DIO writes formerly held the inode semaphore<br />
in exclusive mode until write completion. This<br />
helped ensure atomicity of DIO writes and<br />
protected against potential file data corruption<br />
races with truncate. However, it also meant that<br />
multiple threads or processes submitting parallel<br />
DIOs to different parts of the same file<br />
effectively became serialized synchronously.<br />
If the same behaviour were extended to AIO<br />
(i.e., having the i_sem held through I/O completion<br />
for AIO-DIO writes), it would significantly<br />
degrade throughput of streaming AIO<br />
writes as subsequent write submissions would<br />
block until completion of the previous request.<br />
With the fixes described in the previous section,<br />
such synchronous serialization is avoidable<br />
without loss of correctness, as the inode<br />
semaphore needs to be held only when looking<br />
up the blocks to write, and not while actual I/O<br />
is in progress on the data blocks. This could allow<br />
concurrent DIO writes on different parts of<br />
a file to proceed simultaneously, and efficient<br />
throughputs for streaming AIO-DIO writes.<br />
5.2 Concurrent O_SYNC buffered writes<br />
In the original writeback design in the <strong>Linux</strong><br />
VFS, per-address space lists were maintained<br />
for dirty pages and pages under writeback for<br />
a given file. Synchronized write was implemented<br />
by traversing these lists to issue writeouts<br />
for the dirty pages and waiting for writeback<br />
to complete on the pages on the writeback<br />
list. <strong>The</strong> inode semaphore had to be held all<br />
through to avoid possibilities of livelocking on<br />
these lists as further writes streamed into the<br />
same file. While this helped maintain atomicity
of writes, it meant that parallel O_SYNC writes<br />
to different parts of the file were effectively<br />
serialized synchronously. Further, dependence<br />
on i_sem-protected state in the address space<br />
lists across I/O waits made it difficult to retry-enable
this code path for AIO support.<br />
In order to allow concurrent O_SYNC writes to<br />
be active on a file, the range of pages to be<br />
written back and waited on could instead be<br />
obtained directly through a radix-tree lookup<br />
for the range of offsets in the file that was being<br />
written out by the request [4]. This would<br />
avoid traversal of the page lists and hence the<br />
need to hold i_sem across the I/O waits. Such<br />
an approach would also make it possible to<br />
complete O_SYNC writes as a sequence of non-blocking
retry iterations across the range of<br />
bytes in a given request.<br />
5.3 Data-integrity guarantees<br />
Background writeout threads cannot block on the inode semaphore like O_SYNC/fsync writers. Hence, with the per-address-space lists writeback model, some juggling involving movement across multiple lists was required to avoid livelocks. The implementation had to make sure that pages which by chance got picked up for processing by background writeouts didn’t slip from consideration when waiting for writeback to complete for a synchronized write request. The latter would be particularly relevant for ensuring synchronized-write guarantees that impacted data integrity for applications. However, as Daniel McNeil’s analysis would indicate [5], getting this right required the writeback code to write and wait upon I/O and dirty pages which were initiated by other processes, and that turned out to be fairly tricky.

One solution that was explored was per-address-space serialization of writeback to ensure exclusivity to synchronous writers and shared mode for background writers. It involved navigating issues with busy-waits in background writers, and the code was beginning to get complicated and potentially fragile. This was one of the problems that finally prompted Andrew Morton to change the entire VFS writeback code to use radix-tree walks instead of the per-address-space pagelists. The main advantage was that avoiding the need for movement across lists during state changes (e.g., when re-dirtying a page if its buffers were locked for I/O by another process) reduced the chances of pages getting missed from consideration without the added serialization of entire writebacks.

6 Tagged radix-tree based writeback
For the radix-tree walk writeback design to perform<br />
as well as the address space lists-based<br />
approach, an efficient way to get to the pages<br />
of interest in the radix trees is required. This<br />
is especially so when there are many pages in<br />
the pagecache but only a few are dirty or under<br />
writeback. Andrew Morton solved this problem<br />
by implementing tagged radix-tree lookup<br />
support to enable lookup of dirty or writeback<br />
pages in O(log64(n)) time [6].<br />
This was achieved by adding tag bits for each<br />
slot to each radix-tree node. If a node is<br />
tagged, then the corresponding slots on all the<br />
nodes above it in the tree are tagged. Thus,<br />
to search for a particular tag, one would keep<br />
going down sub-trees under slots which have<br />
the tag bit set until the tagged leaf nodes are<br />
accessed. A tagged gang lookup function is<br />
used for in-order searches for dirty or writeback<br />
pages within a specified range. <strong>The</strong>se<br />
lookups are used to replace the per-address-space
page lists altogether.
To synchronize writes to disk, a tagged radix-tree gang lookup of dirty pages in the byte-range
corresponding to the write request is performed<br />
and the resulting pages are written out.<br />
Next, pages under writeback in the byte-range<br />
are obtained through a tagged radix-tree gang<br />
lookup of writeback pages, and we wait for<br />
writeback to complete on these pages (without<br />
having to hold the inode semaphore across the<br />
waits). Observe how this logic lends itself to be<br />
broken up into a series of non-blocking retry iterations<br />
proceeding in-order through the range.<br />
<strong>The</strong> same logic can also be used for a whole<br />
file sync, by specifying a byte-range that spans<br />
the entire file.<br />
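A hedged sketch of the waiting half of this logic is shown below. It is loosely modeled on the kind of range-based helper introduced by the O_SYNC speedup work [4]; the function name is illustrative, and the tagged radix-tree gang lookup (and its locking) is performed inside pagevec_lookup_tag().

#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/pagevec.h>

/* wait for writeback to complete on all pages in [start, end] of a file */
static void wait_on_writeback_range_sketch(struct address_space *mapping,
					   pgoff_t start, pgoff_t end)
{
	struct pagevec pvec;
	pgoff_t index = start;
	unsigned int i, nr;

	pagevec_init(&pvec, 0);
	while (index <= end &&
	       (nr = pagevec_lookup_tag(&pvec, mapping, &index,
					PAGECACHE_TAG_WRITEBACK, PAGEVEC_SIZE))) {
		for (i = 0; i < nr; i++) {
			struct page *page = pvec.pages[i];

			if (page->index > end)      /* past the requested range */
				break;
			/* a synchronous caller blocks here; the AIO retry
			 * version would queue its callback and return the
			 * bytes completed so far instead */
			wait_on_page_writeback(page);
		}
		pagevec_release(&pvec);
	}
}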
Background writers also use tagged radix-tree<br />
gang lookups of dirty pages. Instead of always<br />
scanning a file from its first dirty page, the index<br />
where the last batch of writeout terminated<br />
is tracked so the next batch of writeouts can be<br />
started after that point.<br />
7 Streaming AIO writes<br />
<strong>The</strong> tagged radix-tree walk writeback approach<br />
greatly simplifies the design of AIO support for<br />
synchronized writes, as mentioned in the previous<br />
section.
7.1 Basic retry pattern for synchronized AIO<br />
writes<br />
<strong>The</strong> retry-based design for buffered AIO O_<br />
SYNC writes works by converting each blocking<br />
wait for writeback completion of a page<br />
into a retry exit. <strong>The</strong> conversion point queues<br />
an asynchronous notification callback and returns<br />
to the caller of the filesystem’s AIO<br />
write operation the number of bytes for which<br />
writeback has completed so far without blocking.<br />
<strong>The</strong>n, when writeback completes for that<br />
page, the callback kicks off a retry continuation<br />
in task context which invokes the same AIO<br />
write operation again using the caller’s address<br />
space, but this time with arguments modified to<br />
reflect the remaining part of the write request.<br />
As writeouts for the range would have already<br />
been issued the first time before the loop to<br />
wait for writeback completion, the implementation<br />
takes care not to re-dirty pages or reissue<br />
writeouts during subsequent retries of<br />
AIO write. Instead, when the code detects that<br />
it is being called in a retry context, it simply<br />
falls through directly to the step involving wait-on-writeback
for the remaining range as specified<br />
by the modified arguments.<br />
7.2 Filtered waitqueues to avoid retry storms<br />
with hashed wait queues<br />
Code that is in a retry-exit path (i.e., the return<br />
path following a blocking point where a retry is<br />
queued) should in general take care not to call<br />
routines that could wake up the newly-queued
retry.<br />
<strong>One</strong> thing that we had to watch for was calls<br />
to unlock_page() in the retry-exit path.<br />
This could cause a redundant wakeup if an<br />
async wait-on-page writeback was just queued<br />
for that page. <strong>The</strong> redundant wakeup would<br />
arise if the kernel used the same waitqueue<br />
on unlock as well as writeback completion for<br />
a page, with the expectation that the waiter<br />
would check for the condition it was waiting<br />
for and go back to sleep if it hadn’t occurred. In<br />
the AIO case, however, a wakeup of the newly-queued
callback in the same code path could<br />
potentially trigger a retry storm, as retries kept<br />
triggering themselves over and over again for<br />
the wrong condition.<br />
<strong>The</strong> interplay of unlock_page() and<br />
wait_on_page_writeback() with<br />
hashed waitqueues can get quite tricky for<br />
retries. For example, consider what happens<br />
when the following sequence in retryable code<br />
is executed at the same time for 2 pages, px
and py, which happen to hash to the same<br />
waitqueue (Table 1).<br />
lock_page(p)<br />
check condition and process<br />
unlock_page(p)<br />
if (wait_on_page_writeback_wq(p)<br />
== -EIOCBQUEUED)<br />
return bytes_done<br />
<strong>The</strong> above code could keep cycling between<br />
spurious retries on px and py until I/O is done,<br />
wasting precious CPU time!<br />
If we can ensure specificity of the wakeup with<br />
hashed waitqueues then this problem can be<br />
avoided. William Lee Irwin’s implementation<br />
of filtered wakeup support in the recent <strong>Linux</strong><br />
2.6 kernels [7] achieves just that. <strong>The</strong> wakeup<br />
routine specifies a key to match before invoking<br />
the wakeup function for an entry in the<br />
waitqueue, thereby limiting wakeups to those<br />
entries which have a matching key. For page<br />
waitqueues, the key is computed as a function<br />
of the page and the condition (unlock or writeback<br />
completion) for the wakeup.<br />
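A hedged sketch of the idea: each waiter carries a key identifying the page and the condition it is waiting for, and the wake function simply ignores wakeups whose key does not match. The structures below are simplified stand-ins for the actual wait-bit machinery of the filtered wakeup patches [7]; autoremove_wake_function() is the stock 2.6 helper.

#include <linux/kernel.h>
#include <linux/wait.h>
#include <linux/mm.h>

struct page_wait_key {
	struct page *page;
	int bit_nr;                 /* e.g. PG_locked or PG_writeback */
};

struct filtered_wait {
	wait_queue_t wait;
	struct page_wait_key key;
};

static int filtered_wake_function(wait_queue_t *wait, unsigned mode,
				  int sync, void *arg)
{
	struct page_wait_key *wake_key = arg;
	struct filtered_wait *w = container_of(wait, struct filtered_wait, wait);

	/* ignore wakeups meant for a different page or a different condition */
	if (!wake_key || wake_key->page != w->key.page ||
	    wake_key->bit_nr != w->key.bit_nr)
		return 0;

	return autoremove_wake_function(wait, mode, sync, arg);
}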
7.3 Streaming AIO write microbenchmark<br />
comparisons<br />
<strong>The</strong> following graph compares aio-stress<br />
throughputs for streaming random buffered<br />
I/O O_SYNC writes, with and without the<br />
previously-described changes. <strong>The</strong> comparison<br />
was performed on the same setup used for<br />
the streaming AIO read results discussed earlier.<br />
<strong>The</strong> graph summarizes how the results varied<br />
across individual request sizes of 4KB to<br />
64KB, where I/O was targeted to a single file<br />
of size 1GB and the depth of iocbs outstanding<br />
at a time was 64KB. A third run was performed<br />
to determine how the results compared<br />
with equivalent runs using AIO-DIO.<br />
Figure 2: Comparisons of streaming random AIO write throughputs (streaming AIO O_SYNC write results with aio-stress: throughput in MB/s versus request size in KB, for FSAIO 2.6.2 Vanilla, FSAIO 2.6.2 Patched, and AIO-DIO 2.6.2 Vanilla).

With the changes applied, the results showed an approximate 2x improvement across all
block sizes, bringing throughputs to levels that<br />
match the corresponding results using AIO-<br />
DIO.<br />
8 AIO performance analysis for<br />
database workloads<br />
Large database systems leveraging AIO can<br />
show marked performance improvements compared<br />
to those systems that use synchronous<br />
I/O alone. We use IBM ® DB2 ® Universal<br />
Database V8 running an online transaction<br />
processing (OLTP) workload to illustrate the<br />
performance improvement of AIO on raw devices<br />
and on filesystems.<br />
8.1 DB2 page cleaners<br />
A DB2 page cleaner is a process responsible<br />
for flushing dirty buffer pool pages to disk.<br />
It simulates AIO by executing asynchronously<br />
with respect to the agent processes. <strong>The</strong> number<br />
of page cleaners and their behavior can be<br />
tuned according to the demands of the system.<br />
<strong>The</strong> agents, freed from cleaning pages themselves,<br />
can dedicate their resources (e.g., processor<br />
cycles) towards processing transactions,<br />
thereby improving throughput.
CPU1                                   CPU2
lock_page(px)                          lock_page(py)
...                                    ...
unlock_page(px)
wait_on_page_writeback_wq(px)          unlock_page(py) -> wakes up px,
                                       triggering a spurious retry

Table 1: Retry storm livelock with redundant wakeups on hashed wait queues
8.2 AIO performance analysis for raw devices<br />
Two experiments were conducted to measure<br />
the performance benefits of AIO on raw devices<br />
for an update-intensive OLTP database<br />
workload. <strong>The</strong> workload used was derived<br />
from a TPC[8] benchmark, but is in no way<br />
comparable to any TPC results. For the first experiment,<br />
the database was configured with one<br />
page cleaner using the native <strong>Linux</strong> AIO interface.<br />
For the second experiment, the database<br />
was configured with 55 page cleaners all using<br />
the synchronous I/O interface. <strong>The</strong>se experiments<br />
showed that a database, properly configured<br />
in terms of the number of page cleaners<br />
with AIO, can out-perform a properly configured<br />
database using synchronous I/O page<br />
cleaning.<br />
For both experiments, the system configuration<br />
consisted of DB2 V8 running on a 2-way AMD<br />
Opteron system with <strong>Linux</strong> 2.6.1 installed. <strong>The</strong><br />
disk subsystem consisted of two FAStT 700<br />
storage servers, each with eight disk enclosures.<br />
<strong>The</strong> disks were configured as RAID-0<br />
arrays with a stripe size of 256KB.<br />
Table 2 shows the relative database performance<br />
with and without AIO. Higher numbers<br />
are better. <strong>The</strong> results show that the database<br />
performed 9% better when configured with one<br />
page cleaner using AIO, than when it was<br />
configured with 55 page cleaners using synchronous<br />
I/O.<br />
Configuration                     Relative Throughput
1 page cleaner with AIO           133
55 page cleaners without AIO      122
Table 2: Database performance with and without<br />
AIO.<br />
Analyzing the I/O write patterns (see Table 3),<br />
we see that one page cleaner using AIO was<br />
sufficient to keep the buffer pools clean under<br />
a very heavy load, but that 55 page cleaners<br />
using synchronous I/O were not, as indicated<br />
by the 30% agent writes. This data<br />
suggests that more page cleaners should have<br />
been configured to improve the performance of<br />
the case with synchronous I/O. However, additional<br />
page cleaners consumed more memory,<br />
requiring a reduction in bufferpool size<br />
and thereby decreasing throughput. For the<br />
test configuration, 55 cleaners was the optimal<br />
number before memory constraints arose.<br />
8.3 AIO performance analysis for filesystems<br />
This section examines the performance improvements<br />
of AIO when used in conjunction<br />
with filesystems.
Configuration                     Page cleaner writes (%)   Agent writes (%)
1 page cleaner with AIO           100                       0
55 page cleaners without AIO      70                        30
Table 3: DB2 write patterns for raw device<br />
configurations.<br />
This experiment was performed using the same OLTP benchmark as in the previous section.
<strong>The</strong> test system consisted of two 1GHz AMD<br />
Opteron processors, 4GB of RAM and two<br />
QLogic 2310 FC controllers. Attached to the<br />
server was a single FAStT900 storage server<br />
and two disk enclosures with a total of 28 15K<br />
RPM 18GB drives. <strong>The</strong> <strong>Linux</strong> kernel used<br />
for the examination was 2.6.0+mm1, which includes<br />
the AIO filesystem support patches [9]<br />
discussed in this paper.<br />
<strong>The</strong> database tables were spread across multiple<br />
ext2 filesystem partitions. Database logs<br />
were stored on a single raw partition.<br />
Three separate tests were performed, utilizing<br />
different I/O methods for the database page<br />
cleaners.<br />
Test 1. Synchronous (Buffered) I/O.<br />
Test 2. Asynchronous (Buffered) I/O.<br />
Test 3. Direct I/O.<br />
<strong>The</strong> results are shown in Table 4 as relative<br />
commercial processing scores using synchronous<br />
I/O as the baseline (i.e., higher is better).<br />
Looking at the efficiency of the page cleaners<br />
(see Table 5), we see that the use of AIO<br />
is more successful in keeping the buffer pools<br />
clean. In the synchronous I/O and DIO cases,<br />
the agents needed to spend more time cleaning<br />
Configuration          Commercial Processing Scores
Synchronous I/O        100
AIO (Buffered)         113.7
DIO                    111.9
Table 4: Database performance on filesystems<br />
with and without AIO.<br />
buffer pool pages, resulting in less time processing<br />
transactions.<br />
Configuration          Page cleaner writes (%)   Agent writes (%)
Synchronous I/O        37                        63
AIO (buffered)         100                       0
DIO                    49                        51
Table 5: DB2 write patterns for filesystem configurations.<br />
8.4 Optimizing AIO for database workloads<br />
Databases typically use AIO for streaming<br />
batches of random, synchronized write requests<br />
to disk (where the writes are directed<br />
to preallocated disk blocks). This has been<br />
found to improve the performance of OLTP<br />
workloads, as it helps bring down the number<br />
of dedicated threads or processes needed<br />
for flushing updated pages, and results in reduced<br />
memory footprint and better CPU utilization<br />
and scaling.<br />
<strong>The</strong> size of individual write requests is determined<br />
by the page size used by the database.<br />
For example, a DB2 UDB installation might<br />
use a database page size of 8KB.<br />
As observed in previous sections, the use of<br />
AIO helps reduce the number of database page<br />
cleaner processes required to keep the bufferpool<br />
clean. To keep the disk queues maximally<br />
utilized and limit contention, it may be preferable<br />
to have requests to a given disk streamed<br />
out from a single page cleaner. Typically a<br />
set of disks could be serviced by each page
cleaner if and when multiple page cleaners<br />
need to be used.<br />
Databases might also use AIO for reads, for example,<br />
for prefetching data to service queries.<br />
This usually helps improve the performance of<br />
decision support workloads. <strong>The</strong> I/O pattern<br />
generated in these cases is that of streaming<br />
batches of large AIO reads, with sizes typically<br />
determined by the file allocation extent size<br />
used by the database (e.g., a DB2 installation<br />
might use a database extent size of 256KB).<br />
For installations using buffered AIO reads, tuning<br />
the readahead setting for the corresponding<br />
devices to be more than the extent size would<br />
help improve performance of streaming AIO<br />
reads (recall the discussion in Section 3.5).<br />
9 Addressing AIO workloads involving<br />
both disk and communications<br />
I/O<br />
Certain applications need to handle both disk-based
AIO and communications I/O. For communications<br />
I/O, the epoll interface—which<br />
provides support for efficient scalable event<br />
polling in <strong>Linux</strong> 2.6—could be used as appropriate,<br />
possibly in conjunction with O_<br />
NONBLOCK socket I/O. Disk-based AIO, on
the other hand, uses the native AIO API io_<br />
getevents for completion notification. This<br />
makes it difficult to combine both types of I/O<br />
processing within a single event loop, even<br />
when such a model is a natural way to program<br />
the application, as in implementations of the<br />
application on other operating systems.<br />
How do we address this issue? <strong>One</strong> option is to<br />
extend epoll to enable it to poll for notification<br />
of AIO completion events, so that AIO completion<br />
status can then be reaped in a non-blocking<br />
manner. This involves mixing both epoll and<br />
AIO API programming models, which is not<br />
ideal.<br />
9.1 AIO poll interface<br />
Another alternative is to add support for<br />
polling an event on a given file descriptor<br />
through the AIO interfaces. This function, referred<br />
to as AIO poll, can be issued through<br />
io_submit() just like other AIO operations,<br />
and specifies the file descriptor and<br />
the eventset to wait for. When the event<br />
occurs, notification is reported through io_<br />
getevents().<br />
<strong>The</strong> retry-based design of AIO poll works by<br />
converting the blocking wait for the event into<br />
a retry exit.<br />
<strong>The</strong> generic synchronous polling code fits<br />
nicely into the AIO retry design, so most of the<br />
original polling code can be used unchanged.<br />
<strong>The</strong> private data area of the iocb can be used<br />
to hold polling-specific data structures, and a<br />
few special cases can be added to the generic<br />
polling entry points. This allows the AIO poll<br />
case to proceed without additional memory allocations.<br />
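As an illustration, submitting such a poll request from user space could look like the sketch below, assuming a kernel with the AIO poll patch applied and a libaio that exposes io_prep_poll(); on a kernel without the patch, io_submit() would simply fail for this opcode.

#include <libaio.h>
#include <poll.h>

/* queue an "is this fd readable?" event on the same completion queue
 * used for disk AIO, so one io_getevents() loop sees both kinds of events */
int submit_aio_poll(io_context_t ctx, int sock_fd, struct iocb *cb)
{
	struct iocb *cbs[1] = { cb };

	io_prep_poll(cb, sock_fd, POLLIN);   /* wait for readable data */
	return io_submit(ctx, 1, cbs);       /* completion is later reaped
						through io_getevents() */
}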
9.2 AIO operations for communications I/O<br />
A third option is to add support for AIO operations<br />
for communications I/O. For example,<br />
AIO support for pipes has been implemented<br />
by converting the blocking wait for<br />
I/O on pipes to a retry exit. <strong>The</strong> generic pipe<br />
code was also structured such that conversion<br />
to AIO retries was quite simple; the only significant
change was using the current io_wait<br />
context instead of a locally defined waitqueue,<br />
and returning early if no data was available.<br />
However, AIO pipe testing did show significantly<br />
more context switches than the 2.4 AIO
pipe implementation, and this was coupled<br />
with much lower performance. <strong>The</strong> AIO core<br />
functions were relying on workqueues to do<br />
most of the retries, and this resulted in constant
switching between the workqueue threads and<br />
user processes.<br />
<strong>The</strong> solution was to change the AIO core<br />
to do retries in io_submit() and in io_<br />
getevents(). This allowed the process to<br />
do some of its own work while it is scheduled<br />
in. Also, retries were switched to a delayed<br />
workqueue, so that bursts of retries would trigger<br />
fewer context switches.<br />
While delayed wakeups helped with pipe<br />
workloads, it also caused I/O stalls in filesystem<br />
AIO workloads. This was because a delayed<br />
wakeup was being used even when a user<br />
process was waiting in io_getevents().<br />
When user processes are actively waiting for<br />
events, it proved best to trigger the worker<br />
thread immediately.<br />
General AIO support for network operations<br />
has been considered but not implemented so far<br />
because of the lack of a supporting study that predicts
a significant benefit over what epoll and<br />
non-blocking I/O can provide, except for the<br />
scope for enabling potential zero-copy implementations.<br />
This is a potential area for future<br />
research.<br />
10 Conclusions<br />
Our experience over the last year with AIO development,<br />
stabilization and performance improvements<br />
brought us to design and implementation<br />
issues that went far beyond the initial<br />
concern of converting key I/O blocking<br />
points to be asynchronous.<br />
AIO uncovered scenarios and I/O patterns that<br />
were unlikely or less significant with synchronous<br />
I/O alone. Examples include the issues we discussed around streaming AIO performance with readahead and concurrent synchronized writes, as well as the DIO vs buffered I/O complexities in the presence of AIO. In retrospect, this was the hardest part of supporting AIO: modifying code that was originally designed only for synchronous I/O.
Interestingly, this also meant that AIO appeared<br />
to magnify some problems early. For example, issues with hashed waitqueues led to the filtered wakeup patches, and readahead window collapses with large random reads precipitated improvements to the readahead code from Ramachandra Pai. Ultimately,
many of the core improvements that<br />
helped AIO have had positive benefits in allowing<br />
improved concurrency for some of the<br />
synchronous I/O paths.<br />
In terms of benchmarking and optimizing<br />
<strong>Linux</strong> AIO performance, there is room for<br />
more exhaustive work. Requirements for AIO<br />
fsync support are currently under consideration.<br />
<strong>The</strong>re is also a need for more widely used<br />
AIO applications, especially those that take advantage
of AIO support for buffered I/O or<br />
bring out additional requirements like network<br />
I/O beyond epoll or AIO poll. Finally, investigations<br />
into API changes to help enable more<br />
efficient POSIX AIO implementations based<br />
on kernel AIO support may be a worthwhile<br />
endeavor.<br />
11 Acknowledgements<br />
We would like to thank the many people<br />
on the linux-aio@kvack.org and<br />
linux-kernel@vger.kernel.org<br />
mailing lists who provided us with valuable<br />
comments and suggestions during our<br />
development efforts.<br />
We would especially like to acknowledge the<br />
important contributions of Andrew Morton,<br />
Daniel McNeil, Badari Pulavarty, Stephen<br />
Tweedie, and William Lee Irwin towards several<br />
pieces of work discussed in this paper.
This paper and the work it describes wouldn’t<br />
have been possible without the efforts of Janet<br />
Morgan in many different ways, starting from<br />
review, test and debugging feedback to joining<br />
the midnight oil camp to help with modifications<br />
and improvements to the text during the<br />
final stages of the paper.<br />
We also thank Brian Twitchell, Steve Pratt,<br />
Gerrit Huizenga, Wayne Young, and John<br />
Lumby from IBM for their help and discussions<br />
along the way.<br />
This work was a part of the <strong>Linux</strong> Scalability<br />
Effort (LSE) on SourceForge, and further<br />
information about <strong>Linux</strong> 2.6 AIO is available<br />
at the LSE AIO web page [10]. All the external<br />
AIO patches including AIO support for<br />
buffered filesystem I/O, AIO poll and AIO support<br />
for pipes are available at [9].<br />
12 Legal Statement<br />
This work represents the view of the authors and<br />
does not necessarily represent the view of IBM.<br />
IBM, DB2 and DB2 Universal Database are registered<br />
trademarks of International Business Machines<br />
Corporation in the United States and/or other<br />
countries.<br />
<strong>Linux</strong> is a registered trademark of Linus Torvalds.<br />
Pentium is a trademark of Intel Corporation in the<br />
United States, other countries, or both.<br />
Other company, product, and service names may be<br />
trademarks or service marks of others.<br />
13 Disclaimer<br />
<strong>The</strong> benchmarks discussed in this paper were conducted<br />
for research purposes only, under laboratory<br />
conditions. Results will not be realized in all computing<br />
environments.<br />
References

[1] Suparna Bhattacharya, Badari Pulavarty, Steven Pratt, and Janet Morgan. Asynchronous I/O Support for Linux 2.5. In Proceedings of the Linux Symposium, Ottawa, July 2003. http://archive.linuxsymposium.org/ols2003/Proceedings/All-Reprints/Reprint-Pulavarty-OLS2003.pdf

[2] Chris Mason. aio-stress microbenchmark. ftp://ftp.suse.com/pub/people/mason/utils/aio-stress.c

[3] Stephen C. Tweedie. Posting on DIO races in 2.4. http://marc.theaimsgroup.com/?l=linux-fsdevel&m=105597840711609&w=2

[4] Andrew Morton. O_SYNC speedup patch. http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.0/2.6.0-mm1/broken-out/O_SYNC-speedup-2.patch

[5] Daniel McNeil. Posting on synchronized writeback races. http://marc.theaimsgroup.com/?l=linux-aio&m=107671729611002&w=2

[6] Andrew Morton. Posting on in-order tagged radix-tree walk based VFS writeback. http://marc.theaimsgroup.com/?l=bk-commits-head&m=108184544016117&w=2

[7] William Lee Irwin. Filtered wakeup patch. http://marc.theaimsgroup.com/?l=bk-commits-head&m=108459430513660&w=2

[8] Transaction Processing Performance Council. http://www.tpc.org

[9] Suparna Bhattacharya (with contributions from Andrew Morton and Chris Mason). Additional 2.6 Linux Kernel Asynchronous I/O patches. http://www.kernel.org/pub/linux/kernel/people/suparna/aio

[10] LSE team. Kernel Asynchronous I/O (AIO) Support for Linux. http://lse.sf.net/io/aio.html
Methods to Improve Bootup Time in <strong>Linux</strong><br />
Tim R. Bird<br />
Sony Electronics<br />
tim.bird@am.sony.com<br />
Abstract<br />
This paper presents several techniques for reducing<br />
the bootup time of the <strong>Linux</strong> kernel, including<br />
Execute-In-Place (XIP), avoidance of<br />
calibrate_delay(), and reduced probing<br />
by certain drivers and subsystems. Using<br />
a variety of techniques, the <strong>Linux</strong> kernel can<br />
be booted on embedded hardware in under 500<br />
milliseconds. Current efforts and future directions<br />
of work to improve bootup time are described.<br />
1 Introduction

Users of consumer electronics products expect their devices to be available for use very soon after being turned on. Configurations of Linux for desktop and server markets exhibit boot times in the range of 20 seconds to a few minutes, which is unacceptable for many consumer products.

No single item is responsible for overall poor boot time performance. Therefore a number of techniques must be employed to reduce the boot up time of a Linux system. This paper presents several techniques which have been found to be useful for embedded configurations of Linux.

2 Overview of Boot Process

The entire boot process of Linux can be roughly divided into 3 main areas: firmware, kernel, and user space. The following is a list of events during a typical boot sequence:

1. power on
2. firmware (bootloader) starts
3. kernel decompression starts
4. kernel start
5. user space start
6. RC script start
7. application start
8. first available use

This paper focuses on techniques for reducing the bootup time up until the start of user space. That is, techniques are described which reduce the firmware time and the kernel start time. This includes activities through the completion of event 4 in the list above.
<strong>The</strong> actual kernel execution begins with<br />
the routine start_kernel(), in the file<br />
init/main.c.<br />
An overview of major steps in the initialization<br />
sequence of the kernel is as follows:
• start_kernel()<br />
– init architecture<br />
– init interrupts<br />
– init memory<br />
– start idle thread<br />
– call rest_init()<br />
* start ‘init’ kernel thread
<strong>The</strong> init kernel thread performs a few<br />
other tasks, then calls do_basic_setup(),<br />
which calls do_initcalls(), to run<br />
through the array of initialization routines for<br />
drivers statically linked in the kernel. Finally,<br />
this thread switches to user space by calling execve() on the first user space program, usually /sbin/init.
• init (kernel thread)
– call do_basic_setup()
* call do_initcalls()
· init buses and drivers
– prepare and mount root filesystem
– call run_init_process()
* call execve() to start user space process
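The final hand-off to user space is small; it is essentially the following, paraphrased from init/main.c in the 2.6 kernel.

static void run_init_process(char *init_filename)
{
	argv_init[0] = init_filename;
	execve(init_filename, argv_init, envp_init);
}

The init thread tries /sbin/init first (or the program named by the init= boot parameter), then falls back to /etc/init, /bin/init, and finally /bin/sh, panicking if none of these can be executed.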
3 Typical Desktop Boot Time<br />
<strong>The</strong> boot times for a typical desktop system<br />
were measured and the results are presented<br />
below, to give an indication of the major areas<br />
in the kernel where time is spent. While the<br />
numbers in these tests differ somewhat from<br />
those for a typical embedded system, it is useful<br />
to see these to get an idea of where some of<br />
the trouble spots are for kernel booting.<br />
3.1 System<br />
An HP XW4100 <strong>Linux</strong> workstation system<br />
was used for these tests, with the following<br />
characteristics:<br />
• Pentium 4 HT processor, running at 3GHz<br />
• 512 MB RAM<br />
• Western Digital 40G hard drive on hda<br />
• Generic CDROM drive on hdc<br />
3.2 Measurement method<br />
The kernel used was 2.6.6, with the KFI patch applied. KFI stands for “Kernel Function Instrumentation.” This is an in-kernel system to measure the duration of each function executed during a particular profiling run. It uses the -finstrument-functions option of gcc to instrument kernel functions with callouts on each function entry and exit. This code was authored by developers at MontaVista Software, and a patch for 2.6.6 is available, although the code is not ready (as of the time of this writing) for general publication. Information about KFI and the patch is available at:
http://tree.celinuxforum.org/pubwiki/moin.cgi/KernelFunctionInstrumentation
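To give a sense of how such instrumentation works, the following is a minimal sketch of the entry/exit hooks that gcc's -finstrument-functions option calls; it is not the actual KFI implementation, and the trace-recording details are hypothetical.

/* Hooks emitted by gcc for every instrumented function.  The
 * no_instrument_function attribute keeps the hooks themselves
 * from being instrumented. */
void __attribute__((no_instrument_function))
__cyg_profile_func_enter(void *func, void *call_site)
{
	/* hypothetical: record func, call_site, and an entry
	 * timestamp in a trace buffer */
}

void __attribute__((no_instrument_function))
__cyg_profile_func_exit(void *func, void *call_site)
{
	/* hypothetical: record the exit timestamp; the difference
	 * from the entry timestamp gives the function's duration */
}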
3.3 Key delays<br />
<strong>The</strong> average time for kernel startup of the test<br />
system was about 7 seconds. This was the<br />
amount of time for just the kernel and NOT the<br />
firmware or user space. It corresponds to the<br />
period of time between events 4 and 5 in the<br />
boot sequence listed in Section 2.
Some key delays were found in the kernel<br />
startup on the test system. Table 1 shows<br />
some of the key routines where time was spent<br />
during bootup. <strong>The</strong>se are the low-level routines<br />
where significant time was spent inside<br />
the functions themselves, rather than in subroutines<br />
called by the functions.<br />
Kernel Function          No. of calls   Avg. call time   Total time
delay_tsc                        5153              1          5537
default_idle                      312              1           325
get_cmos_time                       1            500           500
psmouse_sendbyte                   44            2.4           109
pci_bios_find_device               25            1.7            44
atkbd_sendbyte                      7            3.7            26
calibrate_delay                     1             24            24
Note: Times are in milliseconds.

Table 1: Functions consuming lots of time during a typical desktop Linux kernel startup.
Note that over 80% of the total time of the<br />
bootup (almost 6 seconds out of 7) was spent<br />
busywaiting in delay_tsc() or spinning in<br />
the routine default_idle(). It appears<br />
that great reductions in total bootup time could<br />
be achieved if these delays could be reduced,<br />
or if it were possible to run some initialization<br />
tasks concurrently.<br />
Another interesting point is that the routine<br />
get_cmos_time() was extremely variable<br />
in the length of time it took. Measurements<br />
of its duration ranged from under 100 milliseconds<br />
to almost one second. This routine, and<br />
methods to avoid this delay and variability, are<br />
discussed in Section 9.
3.4 High-level delay areas<br />
Since delay_tsc() is used (via various<br />
delay mechanisms) for busywaiting by a<br />
number of different subsystems, it is helpful to<br />
identify the higher-level routines which end up<br />
invoking this function.<br />
Table 2 shows some high-level routines called<br />
during kernel initialization, and the amount of<br />
time they took to complete on the test machine.<br />
Duration times marked with a tilde denote<br />
functions which were highly variable in<br />
duration.<br />
Kernel Function      Duration time
ide_init                      3327
time_init                     ~500
isapnp_init                    383
i8042_init                     139
prepare_namespace              ~50
calibrate_delay                 24
Note: Times are in milliseconds.

Table 2: High-level delays during a typical startup.
For a few of these, it is interesting to examine<br />
the call sequences underneath the high-level<br />
routines. This shows the connection between<br />
the high-level routines that are taking a long<br />
time to complete and the functions where the<br />
time is actually being spent.<br />
Figures 1 and 2 show some call sequences for<br />
high-level calls which take a long time to complete.<br />
In each call tree, the number in parentheses is<br />
the number of times that the routine was called<br />
by the parent in this chain. Indentation shows<br />
the call nesting level.<br />
For example, in Figure 1, do_probe() is<br />
called a total of 31 times by probe_hwif(),<br />
and it calls ide_delay_50ms() 78 times,<br />
and try_to_identify() 8 times.<br />
<strong>The</strong> timing data for the test system showed<br />
that IDE initialization was a significant contributor<br />
to overall bootup time. <strong>The</strong> call sequence<br />
underneath ide_init() shows that<br />
a large number of calls are made to the routine<br />
ide_delay_50ms(), which in turn calls
__const_udelay() very many times. The busywaits in ide_delay_50ms() alone accounted for over 5 seconds, or about 70% of the total boot up time.

ide_init->
  probe_for_hwifs(1)->
    ide_scan_pcibus(1)->
      ide_scan_pci_dev(2)->
        piix_init_one(2)->
          init_setup_piix(2)->
            ide_setup_pci_device(2)->
              probe_hwif_init(2)->
                probe_hwif(4)->
                  do_probe(31)->
                    ide_delay_50ms(78)->
                      __const_udelay(3900)->
                        __delay(3900)->
                          delay_tsc(3900)
                    try_to_identify(8)->
                      actual_try_to_identify(8)->
                        ide_delay_50ms(24)->
                          __const_udelay(1200)->
                            __delay(1200)->
                              delay_tsc(1200)

Figure 1: IDE init call tree

isapnp_init->
  isapnp_isolate(1)->
    isapnp_isolate_rdp_select(1)->
      __const_udelay(25)->
        __delay(25)->
          delay_tsc(25)
    isapnp_key(18)->
      __const_udelay(18)->
        __delay(18)->
          delay_tsc(18)

Figure 2: ISAPnP init call tree
Another significant area of delay was the initialization<br />
of the ISAPnP system. This took<br />
about 380 milliseconds on the test machine.<br />
Both the mouse and the keyboard drivers used<br />
crude busywaits to wait for acknowledgements<br />
from their respective hardware.<br />
Finally, the routine calibrate_delay()<br />
took about 25 milliseconds to run, to compute<br />
the value of loops_per_jiffy and print<br />
(the related) BogoMips for the machine.<br />
<strong>The</strong> remaining sections of this paper discuss<br />
various specific methods for reducing bootup<br />
time for embedded and desktop systems. Some<br />
of these methods are directly related to some of<br />
the delay areas identified in this test configuration.<br />
4 <strong>Kernel</strong> Execute-In-Place<br />
A typical sequence of events during bootup is<br />
for the bootloader to load a compressed kernel<br />
image from either disk or Flash, placing it into<br />
RAM. <strong>The</strong> kernel is decompressed, either during<br />
or just after the copy operation. <strong>The</strong>n the<br />
kernel is executed by jumping to the function<br />
start_kernel().<br />
<strong>Kernel</strong> Execute-In-Place (XIP) is a mechanism<br />
where the kernel instructions are executed directly<br />
from ROM or Flash.<br />
In a kernel XIP configuration, the step of copying<br />
the kernel code segment into RAM is omitted,<br />
as well as any decompression step. Instead,<br />
the kernel image is stored uncompressed<br />
in ROM or Flash. <strong>The</strong> kernel data segments<br />
still need to be initialized in RAM, but by eliminating<br />
the text segment copy and decompression,<br />
the overall effect is a reduction in the time<br />
required for the firmware phase of the bootup.<br />
Table 3 shows the differences in time duration<br />
for various parts of the boot stage for a system<br />
booted with and without use of kernel XIP.<br />
<strong>The</strong> times in the table are shown in milliseconds.<br />
<strong>The</strong> table shows that using XIP in this<br />
configuration significantly reduced the time to<br />
copy the kernel to RAM (because only the data<br />
segments were copied), and completely eliminated<br />
the time to decompress the kernel (453<br />
milliseconds). However, the kernel initialization<br />
time increased slightly in the XIP configuration,<br />
for a net savings of 463 milliseconds.<br />
Boot Stage               Non-XIP time   XIP time
Copy kernel to RAM                 85         12
Decompress kernel                 453          0
Kernel initialization             819        882
Total kernel boot time           1357        894
Note: Times are in milliseconds. Results are for PowerPC 405 LP at 266 MHz.

Table 3: Comparison of Non-XIP vs. XIP bootup times

In order to support an Execute-In-Place configuration, the kernel must be compiled and
linked so that the code is ready to be executed<br />
from a fixed memory location. <strong>The</strong>re<br />
are examples of XIP configurations for ARM,<br />
MIPS and SH platforms in the CELinux source tree, available at: http://tree.celinuxforum.org/
4.1 XIP Design Tradeoffs<br />
<strong>The</strong>re are tradeoffs involved in the use of XIP.<br />
First, it is common for access times to flash<br />
memory to be greater than access times to<br />
RAM. Thus, a kernel executing from Flash<br />
usually runs a bit slower than a kernel executing<br />
from RAM. Table 4 shows some of the results<br />
from running the lmbench benchmark<br />
on the system, with the kernel executing in a standard
non-XIP configuration versus an XIP configuration.<br />
Operation                                              Non-XIP    XIP
stat() syscall                                            22.4   25.6
fork a process                                            4718   7106
context switching for 16 processes and 64k data size      932   1109
pipe communication                                         248    548
Note: Times are in microseconds. Results are for lmbench benchmark run on OMAP 1510 (ARM9 at 168 MHz) processor.

Table 4: Comparison of Non-XIP and XIP performance
Some of the operations in the benchmark took<br />
significantly longer with the kernel run in the<br />
XIP configuration. Most individual operations<br />
took about 20% to 30% longer. This performance<br />
penalty is suffered permanently while<br />
the kernel is running, and thus is a serious<br />
drawback to the use of XIP for reducing bootup<br />
time.<br />
A second tradeoff with kernel XIP is between<br />
the sizes of various types of memory in the<br />
system. In the XIP configuration the kernel<br />
must be stored uncompressed, so the amount<br />
of Flash required for the kernel increases, and<br />
is usually about doubled, versus a compressed<br />
kernel image used with a non-XIP configuration.<br />
However, the amount of RAM required<br />
for the kernel is decreased, since the kernel<br />
code segment is never copied to RAM. <strong>The</strong>refore,<br />
kernel XIP is also of interest for reducing<br />
the runtime RAM footprint for <strong>Linux</strong> in embedded<br />
systems.<br />
<strong>The</strong>re is additional research under way to investigate<br />
ways of reducing the performance<br />
impact of using XIP. <strong>One</strong> promising technique<br />
appears to be the use of “partial-XIP,” where a<br />
highly active subset of the kernel is loaded into<br />
RAM, but the majority of the kernel is executed<br />
in place from Flash.<br />
5 Delay Calibration Avoidance<br />
<strong>One</strong> time-consuming operation inside the kernel<br />
is the process of calibrating the value used<br />
for delay loops. <strong>One</strong> of the first routines in<br />
the kernel, calibrate_delay(), executes<br />
a series of delays in order to determine the correct<br />
value for a variable called loops_per_<br />
jiffy, which is then subsequently used to execute<br />
short delays in the kernel.<br />
<strong>The</strong> cost of performing this calibration is, interestingly,<br />
independent of processor speed.<br />
Rather, it is dependent on the number of iterations required to perform the calibration, and
the length of each iteration. Each iteration requires<br />
1 jiffy, whose length is defined by the HZ variable (a jiffy is 1/HZ seconds).
In 2.4 versions of the <strong>Linux</strong> kernel, most platforms<br />
defined HZ as 100, which makes the<br />
length of a jiffy 10 milliseconds. A typical<br />
number of iterations for the calibration operation<br />
is 20 to 25, making the total time required<br />
for this operation about 250 milliseconds.<br />
In 2.6 versions of the <strong>Linux</strong> kernel, a few platforms<br />
(notably i386) have changed HZ to 1000,<br />
making the length of a jiffy 1 millisecond. On<br />
those platforms, the typical cost of this calibration<br />
operation has decreased to about 25 milliseconds.<br />
Thus, the benefit of eliminating this<br />
operation on most standard desktop systems<br />
has been reduced. However, for many embedded<br />
systems, HZ is still defined as 100, which<br />
makes bypassing the calibration useful.<br />
It is easy to eliminate the calibration operation.<br />
You can directly edit the code in init/main.c:calibrate_delay() to hardcode a value for loops_per_jiffy, and avoid the calibration entirely. Alternatively, there is a patch available at:
http://tree.celinuxforum.org/pubwiki/moin.cgi/PresetLPJ
This patch allows you to use a kernel configuration<br />
option to specify a value for loops_<br />
per_jiffy at kernel compile time. Alternatively,<br />
the patch also allows you to use a kernel<br />
command line argument to specify a preset<br />
value for loops_per_jiffy at kernel boot<br />
time.<br />
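As a concrete illustration, hardcoding the value might look something like the sketch below. The preset number is purely hypothetical (it must be measured once on the target hardware), and the real calibrate_delay() in init/main.c contains the measurement loops that this version omits.

/* Sketch only: preset loops_per_jiffy and skip the calibration.
 * 499712 is an illustrative value, not a recommendation. */
void __init calibrate_delay(void)
{
	loops_per_jiffy = 499712;
	printk("Calibrating delay loop... skipped, "
	       "using preset loops_per_jiffy=%lu\n", loops_per_jiffy);
}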
6 Avoiding Probing During Bootup<br />
Another technique for reducing bootup time is<br />
to avoid probing during bootup. As a general<br />
technique, this can consist of identifying hardware<br />
which is known not to be present on one’s<br />
machine, and making sure the kernel is compiled<br />
without the drivers for that hardware.<br />
In the specific case of IDE, the kernel supports<br />
options at the command line to allow the<br />
user to avoid performing probing for specific<br />
interfaces and devices. To do this, you can<br />
use the IDE and harddrive noprobe options<br />
at the kernel command line. Please see the<br />
file Documentation/ide.txt in the kernel<br />
source tree for details on the syntax of using<br />
these options.<br />
On the test machine, IDE noprobe options<br />
were used to reduce the amount of probing during<br />
startup. <strong>The</strong> test machine had only a hard<br />
drive on hda (ide0 interface, first device) and<br />
a CD-ROM drive on hdc (ide1 interface, first<br />
device).<br />
In one test, noprobe options were specified<br />
to suppress probing of non-used interfaces and<br />
devices. Specifically, the following arguments<br />
were added to the kernel command line:<br />
hdb=none hdd=none ide2=noprobe<br />
<strong>The</strong> kernel was booted and the result was<br />
that the function ide_delay_50ms() was<br />
called only 68 times, and delay_tsc() was<br />
called only 3453 times. During a regular<br />
kernel boot without these options specified,<br />
the function ide_delay_50ms() is called<br />
102 times, and delay_tsc() is called 5153<br />
times. Each call to delay_tsc() takes<br />
about 1 millisecond, so the total time savings<br />
from using these options was 1700 milliseconds.<br />
<strong>The</strong>se IDE noprobe options have been available<br />
at least since the 2.4 kernel series, and are<br />
an easy way to reduce bootup time, without<br />
even having to recompile the kernel.
7 Reducing Probing Delays<br />
As was noted on the test machine, IDE initialization<br />
takes a significant percentage of<br />
the total bootup time. Almost all of this<br />
time is spent busywaiting in the routine ide_<br />
delay_50ms().<br />
It is trivial to modify the value of the timeout<br />
used in this routine. As an experiment,<br />
this code (located in the file drivers/ide/ide.c) was modified to only delay 5 milliseconds
instead of 50 milliseconds.<br />
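The sketch below is a hypothetical illustration of the kind of change involved; the actual ide_delay_50ms() in 2.6.6 may differ in detail, but the call trees in Figures 1 and 2 show the delay ultimately busywaiting in delay_tsc() via the mdelay()/udelay() machinery.

#include <linux/delay.h>

/* Hypothetical sketch of the experiment: reduce the probe delay
 * from 50 ms to 5 ms.  mdelay() busywaits, which is why this
 * routine shows up under delay_tsc() in the profiles above. */
void ide_delay_50ms(void)
{
	mdelay(5);	/* was mdelay(50) */
}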
<strong>The</strong> results of this change were interesting.<br />
When a kernel with this change was run on<br />
the test machine, the total time for the ide_<br />
init() routine dropped from 3327 milliseconds<br />
to 339 milliseconds. <strong>The</strong> total time spent<br />
in all invocations of ide_delay_50ms()<br />
was reduced from 5471 milliseconds to 552<br />
milliseconds. <strong>The</strong> overall bootup time was reduced<br />
accordingly, by about 5 seconds.<br />
<strong>The</strong> ide devices were successfully detected,<br />
and the devices operated without problem on<br />
the test machine. However, this configuration<br />
was not tested exhaustively.<br />
Reducing the duration of the delay in the ide_<br />
delay_50ms() routine provides a substantial<br />
reduction in the overall bootup time for the<br />
kernel on a typical desktop system. It also has<br />
potential use in embedded systems where PCI-based
IDE drives are used.<br />
However, there are several issues with this<br />
modification that need to be resolved. This<br />
change may not support legacy hardware<br />
which requires long delays for proper probing<br />
and initializing. <strong>The</strong> kernel code needs to be<br />
analyzed to determine if any callers of this routine<br />
really need the 50 milliseconds of delay<br />
that they are requesting. Also, it should be determined<br />
whether this call is used only in initialization<br />
context or if it is used during regular<br />
runtime use of IDE devices also.<br />
Also, it may be that 5 milliseconds does not<br />
represent the lowest possible value for this delay.<br />
It is possible that this value will need to<br />
be tuned to match the hardware for a particular<br />
machine. This type of tuning may be acceptable<br />
in the embedded space, where the hardware<br />
configuration of a product may be fixed.<br />
But it may be too risky to use in desktop configurations<br />
of <strong>Linux</strong>, where the hardware is not<br />
known ahead of time.<br />
More experimentation, testing and validation<br />
are required before this technique should be<br />
used.<br />
IMPORTANT NOTE: You should probably not<br />
experiment with this modification on production<br />
hardware unless you have evaluated the<br />
risks.<br />
8 Using the “quiet” Option<br />
<strong>One</strong> non-obvious method to reduce overhead<br />
during booting is to use the quiet option on<br />
the kernel command line. This option changes<br />
the loglevel to 4, which suppresses the output<br />
of regular (non-emergency) printk messages.<br />
Even though the messages are not printed to<br />
the system console, they are still placed in the<br />
kernel printk buffer, and can be retrieved after<br />
bootup using the dmesg command.<br />
When embedded systems boot with a serial<br />
console, the speed of printing the characters<br />
to the console is constrained by the speed of<br />
the serial output. Also, depending on the<br />
driver, some VGA console operations (such as<br />
scrolling the screen) may be performed in software.<br />
For slow processors, this may take a significant<br />
amount of time. In either case, the cost<br />
of performing output of printk messages during<br />
bootup may be high. But it is easily eliminated<br />
using the quiet command line option.
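For example, a boot loader might pass a command line like the following, combining quiet with the IDE noprobe options from Section 6 (the console, root device, and drive assignments here are illustrative only and must match the target system):

console=ttyS0,115200 root=/dev/hda1 quiet hdb=none hdd=none ide2=noprobe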
Table 5 shows the difference in bootup time of<br />
using the quiet option and not, for two different<br />
systems (one with a serial console and<br />
one with a VGA console).<br />
9 RTC Read Synchronization<br />
<strong>One</strong> routine that potentially takes a long time<br />
during kernel startup is get_cmos_time().<br />
This routine is used to read the value of the external<br />
real-time clock (RTC) when the kernel<br />
boots. Currently, this routine delays until the<br />
edge of the next second rollover, in order to ensure<br />
that the time value in the kernel is accurate<br />
with respect to the RTC.<br />
However, this operation can take up to one full<br />
second to complete, and thus introduces up to<br />
1 second of variability in the total bootup time.<br />
For systems where the target bootup time is under<br />
1 second, this variability is unacceptable.<br />
<strong>The</strong> synchronization in this routine is easy<br />
to remove. It can be eliminated by removing<br />
the first two loops in the function<br />
get_cmos_time(), which is located in<br />
include/asm-i386/mach-default/mach_time.h for the i386 architecture. Similar
routines are present in the kernel source<br />
tree for other architectures.<br />
When the synchronization is removed, the routine<br />
completes very quickly.<br />
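For reference, the synchronization being removed looks roughly like the two polling loops sketched below, paraphrased from the i386 mach_time.h code; CMOS_READ() and the RTC_* constants come from <linux/mc146818rtc.h>, and the exact loop bounds and comments may differ.

/* Fragment paraphrasing the top of get_cmos_time(). */
int i;

/* Wait for the RTC "update in progress" flag to rise... */
for (i = 0; i < 1000000; i++)
	if (CMOS_READ(RTC_FREQ_SELECT) & RTC_UIP)
		break;

/* ...and then to fall, so the registers are read just after a
 * second rollover.  Removing both loops eliminates the delay
 * (and the up-to-one-second variability). */
for (i = 0; i < 1000000; i++)
	if (!(CMOS_READ(RTC_FREQ_SELECT) & RTC_UIP))
		break;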
<strong>One</strong> tradeoff in making this modification is that<br />
the time stored by the <strong>Linux</strong> kernel is no longer<br />
completely synchronized (to the boundary of a<br />
second) with the time in the machine’s realtime<br />
clock hardware. Some systems save the system<br />
time back out to the hardware clock on system<br />
shutdown. After numerous bootups and shutdowns,<br />
this lack of synchronization will cause<br />
the realtime clock value to drift from the correct<br />
time value.<br />
Since the loss of synchronization can be up to a second per boot cycle, this drift can be
significant. However, for some embedded applications,<br />
this drift is unimportant. Also, in<br />
some situations the system time may be synchronized<br />
with an external source anyway, so<br />
the drift, if any, is corrected under normal circumstances<br />
soon after booting.<br />
10 User space Work<br />
<strong>The</strong>re are a number of techniques currently<br />
available or under development for user space<br />
bootup time reductions. <strong>The</strong>se techniques are<br />
(mostly) outside the scope of kernel development,<br />
but may provide additional benefits for<br />
reducing overall bootup time for <strong>Linux</strong> systems.<br />
Some of these techniques are mentioned briefly<br />
in this section.<br />
10.1 Application XIP<br />
<strong>One</strong> technique for improving application<br />
startup speed is application XIP, which is similar<br />
to the kernel XIP discussed in this paper.<br />
To support application XIP the kernel must be<br />
compiled with a file system where files can be<br />
stored linearly (where the blocks for a file are<br />
stored contiguously) and uncompressed. <strong>One</strong><br />
file system which supports this is CRAMFS,<br />
with the LINEAR option turned on. This is a<br />
read-only file system.<br />
With application XIP, when a program is executed,<br />
the kernel program loader maps the<br />
text segments for applications directly from the<br />
flash memory of the file system. This saves the<br />
time required to load these segments into system<br />
RAM.
Platform             Speed     Console type   w/o quiet option   with quiet option   Difference
SH-4 SH7751R         240 MHz   VGA                         637                 461          176
OMAP 1510 (ARM 9)    168 MHz   serial                      551                 280          271
Note: Times are in milliseconds.

Table 5: Bootup time with and without the quiet option
10.2 RC Script improvements<br />
Also, there are a number of projects which<br />
strive to decrease total bootup time of a system<br />
by parallelizing the execution of the system<br />
run-command scripts (“RC scripts”). <strong>The</strong>re is<br />
a list of resources for some of these projects at<br />
the following web site:<br />
http://tree.celinuxforum.org/pubwiki/moin.cgi/BootupTimeWorkingGroup
Also, there has been some research conducted<br />
in reducing the overhead of running RC scripts.<br />
This consists of modifying the multi-function<br />
program busybox to reduce the number and<br />
cost of forks during RC script processing, and<br />
to optimize the usage of functions built into the
busybox program. Initial testing has shown a<br />
reduction from about 8 seconds to 5 seconds<br />
for a particular set of Debian RC scripts on an<br />
OMAP 1510 (ARM 9) processor, running at<br />
168 MHz.<br />
11 Results<br />
By using some of the techniques mentioned
in this paper, as well as additional techniques,<br />
Sony was able to boot a 2.4.20-based<br />
<strong>Linux</strong> system, from power on to user space display<br />
of a greeting image and sound playback,<br />
in 1.2 seconds. <strong>The</strong> time from power on to the<br />
end of kernel initialization (first user space instruction)<br />
in this configuration was about 110<br />
milliseconds. <strong>The</strong> processor was a TI OMAP<br />
1510 processor, with an ARM9-based core,<br />
running at 168 MHz.<br />
Some of the techniques used for reducing the<br />
bootup time of embedded systems can also be<br />
used for desktop or server systems. Often, it<br />
is possible, with rather simple and small modifications,<br />
to decrease the bootup time of the<br />
<strong>Linux</strong> kernel to only a few seconds. In the<br />
desktop configuration of <strong>Linux</strong> presented here,<br />
techniques from this paper were used to reduce
the total bootup time from around 7 seconds<br />
to around 1 second. This was with no<br />
loss of functionality that the author could detect<br />
(with limited testing).<br />
12 Further Research<br />
As stated in the beginning of the paper, numerous<br />
techniques can be employed to reduce the<br />
overall bootup time of <strong>Linux</strong> systems. Further<br />
work continues or is needed in a number of areas.<br />
12.1 Concurrent Driver Init<br />
<strong>One</strong> area of additional research that seems<br />
promising is to structure driver initializations<br />
in the kernel so that they can proceed in parallel.<br />
For some items, like IDE initialization,<br />
there are large delays as buses and devices are<br />
probed and initialized. <strong>The</strong> time spent in such<br />
busywaits could potentially be used to perform<br />
other startup tasks, concurrently with the initializations waiting for hardware events to occur
or time out.<br />
<strong>The</strong> big problem to be addressed with concurrent<br />
initialization is to identify what kernel<br />
startup activities can be allowed to occur<br />
in parallel. <strong>The</strong> kernel init sequence is already<br />
a carefully ordered sequence of events to make<br />
sure that critical startup dependencies are observed.<br />
Any system of concurrent driver initialization<br />
will have to provide a mechanism<br />
to guarantee sequencing of initialization tasks<br />
which have order dependencies.<br />
12.2 Partial XIP<br />
Another possible area of further investigation,<br />
which has already been mentioned, is<br />
“partial XIP,” whereby the kernel is executed<br />
mostly in-place. Prototype code already exists<br />
which demonstrates the mechanisms necessary<br />
to move a subset of an XIP-configured kernel<br />
into RAM, for faster code execution. <strong>The</strong> key<br />
to making partial kernel XIP useful will be to<br />
ensure correct identification (either statically or<br />
dynamically) of the sections of kernel code that<br />
need to be moved to RAM. Also, experimentation<br />
and testing need to be performed to determine<br />
the appropriate tradeoff between the size<br />
of the RAM-based portion of the kernel, and<br />
the effect on bootup time and system runtime<br />
performance.<br />
12.3 Pre-linking and Lazy Linking

Finally, research is needed into reducing the time required to fixup links between programs and their shared libraries.

Two systems that have been proposed and experimented with are pre-linking and lazy linking. Pre-linking involves fixing the location in virtual memory of the shared libraries for a system, and performing fixups on the programs of the system ahead of time. Lazy linking consists of only performing fixups on demand as library routines are called by a running program.

Additional research is needed with both of these techniques to determine if they can provide benefit for current Linux systems.

13 Credits

This paper is the result of work performed by the Bootup Time Working Group of the CE Linux forum (of which the author is Chair). I would like to thank developers at some of CELF’s member companies, including Hitachi, Intel, Mitsubishi, MontaVista, Panasonic, and Sony, who contributed information or code used in writing this paper.
<strong>Linux</strong> on NUMA Systems<br />
Martin J. Bligh<br />
mbligh@aracnet.com<br />
Matt Dobson<br />
colpatch@us.ibm.com<br />
Darren Hart<br />
dvhltc@us.ibm.com<br />
Gerrit Huizenga<br />
gh@us.ibm.com<br />
Abstract<br />
NUMA is becoming more widespread in the<br />
marketplace, used on many systems, small or<br />
large, particularly with the advent of AMD<br />
Opteron systems. This paper will cover a summary<br />
of the current state of NUMA, and future<br />
developments, encompassing the VM subsystem,<br />
scheduler, topology (CPU, memory, I/O<br />
layouts including complex non-uniform layouts),<br />
userspace interface APIs, and network<br />
and disk I/O locality. It will take a broad-based<br />
approach, focusing on the challenges of creating<br />
subsystems that work for all machines (including<br />
AMD64, PPC64, IA-32, IA-64, etc.),<br />
rather than just one architecture.<br />
1 What is a NUMA machine?<br />
NUMA stands for non-uniform memory architecture.<br />
Typically this means that not all memory<br />
is the same “distance” from each CPU in<br />
the system, but also applies to other features<br />
such as I/O buses. <strong>The</strong> word “distance” in this<br />
context is generally used to refer to both latency<br />
and bandwidth. Typically, NUMA machines<br />
can access any resource in the system,<br />
just at different speeds.<br />
NUMA systems are sometimes measured with<br />
a simple “NUMA factor” ratio of N:1—<br />
meaning that the latency for a cache miss memory<br />
read from remote memory is N times the latency<br />
for that from local memory (for NUMA<br />
machines, N > 1). Whilst such a simple descriptor<br />
is attractive, it can also be highly misleading,<br />
as it describes latency only, not bandwidth,<br />
on an uncontended bus (which is not<br />
particularly relevant or interesting), and takes<br />
no account of inter-node caches.<br />
<strong>The</strong> term node is normally used to describe a<br />
grouping of resources—e.g., CPUs, memory,<br />
and I/O. On some systems, a node may contain<br />
only some types of resources (e.g., only<br />
memory, or only CPUs, or only I/O); on others<br />
it may contain all of them. <strong>The</strong> interconnect<br />
between nodes may take many different<br />
forms, but can be expected to be higher latency<br />
than the connection within a node, and typically<br />
lower bandwidth.<br />
Programming for NUMA machines generally<br />
implies focusing on locality—the use of resources<br />
close to the device in question, and<br />
trying to reduce traffic between nodes; this<br />
type of programming generally results in better<br />
application throughput. On some machines<br />
with high-speed cross-node interconnects, better performance may be derived under certain
workloads by “striping” accesses across multiple<br />
nodes, rather than just using local resources,<br />
in order to increase bandwidth. Whilst<br />
it is easy to demonstrate a benchmark that<br />
shows improvement via this method, it is difficult<br />
to be sure that the concept is generally<br />
beneficial (i.e., with the machine under full
load).<br />
2 Why use a NUMA architecture to<br />
build a machine?<br />
<strong>The</strong> intuitive approach to building a large machine,<br />
with many processors and banks of<br />
memory, would be simply to scale up the typical<br />
2–4 processor machine with all resources<br />
attached to a shared system bus. However, restrictions<br />
of electronics and physics dictate that<br />
accesses slow as the length of the bus grows,<br />
and the bus is shared amongst more devices.<br />
Rather than accept this global slowdown for a<br />
larger machine, designers have chosen to instead<br />
give fast access to a limited set of local<br />
resources, and reserve the slower access times<br />
for remote resources.<br />
Historically, NUMA architectures have only<br />
been used for larger machines (more than 4<br />
CPUs), but the advantages of NUMA have<br />
been brought into the commodity marketplace<br />
with the advent of AMD’s x86-64, which has<br />
one CPU per node, and local memory for each<br />
processor. <strong>Linux</strong> supports NUMA machines<br />
of every size from 2 CPUs upwards (e.g., SGI<br />
have machines with 512 processors).<br />
It might help to envision the machine as a<br />
group of standard SMP machines, connected<br />
by a very fast interconnect somewhat like a network<br />
connection, except that the transfers over<br />
that bus are transparent to the operating system.<br />
Indeed, some earlier systems were built<br />
exactly like that; the older Sequent NUMA-Q hardware uses a standard 450NX 4 processor
chipset, with an SCI interconnect plugged<br />
into the system bus of each node to unify them,<br />
and pass traffic between them. <strong>The</strong> complex<br />
part of the implementation is to ensure cache coherency
across the interconnect, and such<br />
machines are often referred to as CC-NUMA<br />
(cache coherent NUMA). As accesses over the<br />
interconnect are transparent, it is possible to<br />
program such machines as if they were standard<br />
SMP machines (though the performance<br />
will be poor). Indeed, this is exactly how the<br />
NUMA-Q machines were first bootstrapped.<br />
Often, we are asked why people do not use<br />
clusters of smaller machines, instead of a large<br />
NUMA machine, as clusters are cheaper, simpler,<br />
and have a better price:performance ratio.<br />
Unfortunately, it makes the programming<br />
of applications much harder; all of the intercommunication<br />
and load balancing now has to<br />
be more explicit. Some large applications (e.g.,<br />
database servers) do not split up across multiple<br />
cluster nodes easily—in those situations,<br />
people often use NUMA machines. In addition,<br />
the interconnect for NUMA boxes is normally<br />
very low latency, and very high bandwidth,<br />
yielding excellent performance. <strong>The</strong><br />
management of a single NUMA machine is<br />
also simpler than that of a whole cluster with<br />
multiple copies of the OS.<br />
We could either have the operating system<br />
make decisions about how to deal with the architecture<br />
of the machine on behalf of the user<br />
processes, or give the userspace application an<br />
API to specify how such decisions are to be<br />
made. It might seem, at first, that the userspace<br />
application is in a better position to make such<br />
decisions, but this has two major disadvantages:<br />
1. Every application must be changed to support<br />
NUMA machines, and may need to
be revised when a new hardware platform<br />
is released.<br />
2. Applications are not in a good position<br />
to make global holistic decisions about<br />
machine resources, coordinate themselves<br />
with other applications, and balance decisions<br />
between them.<br />
Thus decisions on process, memory and I/O<br />
placement are normally best left to the operating<br />
system, perhaps with some hints from<br />
userspace about which applications group together,<br />
or will use particular resources heavily.<br />
Details of hardware layout are put in one place,<br />
in the operating system, and tuning and modification<br />
of the necessary algorithms are done<br />
once in that central location, instead of in every<br />
application. In some circumstances, the<br />
application or system administrator will want<br />
to override these decisions with explicit APIs,<br />
but this should be the exception, rather than the<br />
norm.<br />
3 <strong>Linux</strong> NUMA Memory Support<br />
In order to manage memory, <strong>Linux</strong> requires<br />
a page descriptor structure (struct page)<br />
for each physical page of memory present in<br />
the system. This consumes approximately 1%<br />
of the memory managed (assuming 4K page<br />
size), and the structures are grouped into an array<br />
called mem_map. For NUMA machines,<br />
there is a separate array for each node, called<br />
lmem_map. <strong>The</strong> mem_map and lmem_map<br />
arrays are simple contiguous data structures accessed<br />
in a linear fashion by their offset from<br />
the beginning of the node. This means that the<br />
memory controlled by them is assumed to be<br />
physically contiguous.<br />
NUMA memory support is enabled by<br />
CONFIG_DISCONTIGMEM and CONFIG_<br />
NUMA. A node descriptor called pg_data_t is created for each node. Currently
we do not support discontiguous memory<br />
within a node (though large gaps in the<br />
physical address space are acceptable between<br />
nodes). Thus we must still create page descriptor<br />
structures for “holes” in memory within a<br />
node (and then mark them invalid), which will<br />
waste memory (potentially a problem for large<br />
holes).<br />
Dave McCracken has picked up Daniel<br />
Phillips’ earlier work on a better data structure<br />
for holding the page descriptors, called<br />
CONFIG_NONLINEAR. This will allow the<br />
mapping of discontiguous memory ranges inside
each node, and greatly simplify the existing<br />
code for discontiguous memory on non-<br />
NUMA machines.<br />
CONFIG_NONLINEAR solves the problem by<br />
creating an artificial layer of linear addresses.<br />
It does this by dividing the physical address<br />
space into fixed size sections (akin to very<br />
large pages), then allocating an array to allow<br />
translations from linear physical address to true<br />
physical address. This added level of indirection<br />
allows memory with widely differing true<br />
physical addresses to appear adjacent to the<br />
page allocator and to be in the same zone, with<br />
a single struct page array to describe them. It<br />
also provides support for memory hotplug by<br />
allowing new physical memory to be added to<br />
an existing zone and struct page array.<br />
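As a rough illustration of the idea (a conceptual sketch only, not the actual CONFIG_NONLINEAR code; the section size, table size, and helper name are hypothetical), the indirection might look like:

/* Conceptual sketch of section-based address indirection. */
#define SECTION_SHIFT	28			/* hypothetical 256 MB sections */
#define SECTION_SIZE	(1UL << SECTION_SHIFT)
#define MAX_SECTIONS	64

/* Filled in at boot: true physical base of each linear section. */
static unsigned long section_to_phys[MAX_SECTIONS];

static inline unsigned long linear_to_phys(unsigned long linear)
{
	unsigned long idx = linear >> SECTION_SHIFT;
	unsigned long off = linear & (SECTION_SIZE - 1);

	return section_to_phys[idx] + off;
}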
<strong>Linux</strong> normally allocates memory for a process<br />
on the local node, i.e., the node that the process<br />
is currently running on. alloc_pages<br />
will call alloc_pages_node for the current<br />
processor’s node, which will pass the relevant<br />
zonelist (pgdat->node_zonelists)<br />
to the core allocator (__alloc_pages). <strong>The</strong><br />
zonelists are built by build_zonelists,<br />
and are set up to allocate memory in a round-robin
fashion, starting from the local node (this<br />
creates a roughly even distribution of memory
pressure).<br />
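A minimal sketch of what this looks like from kernel code (assuming the usual 2.6 allocator interfaces; error handling omitted):

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/topology.h>

/* Sketch: explicitly request a page from the current node's zonelist.
 * On NUMA kernels, alloc_pages(gfp, order) typically expands to
 * alloc_pages_node(numa_node_id(), gfp, order), so this mirrors the
 * default local-first, round-robin-fallback behavior described above. */
static struct page *alloc_local_page(void)
{
	return alloc_pages_node(numa_node_id(), GFP_KERNEL, 0);
}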
In the interest of reducing cross-node traffic,<br />
and reducing memory access latency for frequently<br />
accessed data and text, it is desirable<br />
to replicate any such memory that is read-only<br />
to each node, and use the local copy on any accesses,<br />
rather than a remote copy. <strong>The</strong> obvious<br />
candidates for such replication are the kernel<br />
text itself, and the text of shared libraries such<br />
as libc. Of course, this faster access comes<br />
at the price of increased memory usage, but<br />
this is rarely a problem on large NUMA machines.<br />
Whilst it might be technically possible<br />
to replicate read/write mappings, this is complex,<br />
of dubious utility, and is unlikely to be<br />
implemented.<br />
<strong>Kernel</strong> text is assumed by the kernel itself to<br />
appear at a fixed virtual address, and to change<br />
this would be problematic. Hence the easiest<br />
way to replicate it is to change the virtual to<br />
physical mappings for each node to point at a<br />
different address. On IA-64, this is easy, since<br />
the CPU provides hardware assistance in the<br />
form of a pinned TLB entry.<br />
On other architectures this proves more difficult,<br />
and would depend on the structure of the<br />
pagetables. On IA-32 with PAE enabled, as<br />
long as the user-kernel split is aligned on a<br />
PMD boundary, we can have a separate kernel<br />
PMD for each node, and point the vmalloc<br />
area (which uses small page mappings) back to<br />
a globally shared set of PTE pages. <strong>The</strong> PMD<br />
entries for the ZONE_NORMAL areas normally<br />
never change, so this is not an issue, though<br />
there is an issue with ioremap_nocache<br />
that can change them (GART trips over this)<br />
and speculative execution means that we will<br />
have to deal with that (this can be a slow-path<br />
that updates all copies of the PMDs though).<br />
Dave Hansen has created a patch to replicate<br />
read only pagecache data, by adding a per-node<br />
data structure to each node of the pagecache<br />
radix tree. As soon as any mapping is opened<br />
for write, the replication is collapsed, making<br />
it safe. <strong>The</strong> patch gives a 5%–40% increase in<br />
performance, depending on the workload.<br />
In the 2.6 <strong>Linux</strong> kernel, we have a per-node<br />
LRU for page management and a per-node<br />
LRU lock, in place of the global structures<br />
and locks of 2.4. Not only does this reduce<br />
contention through finer grained locking, it<br />
also means we do not have to search other<br />
nodes’ page lists to free up pages on one node<br />
which is under memory pressure. Moreover,<br />
we get much better locality, as only the local<br />
kswapd process is accessing that node’s<br />
pages. Before splitting the LRU into per-node<br />
lists, we were spending 50% of the system time<br />
during a kernel compile just spinning waiting<br />
for pagemap_lru_lock (which was the<br />
biggest global VM lock at the time). Contention<br />
for the pagemap_lru_lock is now<br />
so small it is not measurable.<br />
4 Sched Domains—a Topology-aware Scheduler
<strong>The</strong> previous <strong>Linux</strong> scheduler, the O(1) scheduler,<br />
provided some needed improvements to<br />
the 2.4 scheduler, but shows its age as more<br />
complex system topologies become more and<br />
more common. With technologies such as<br />
NUMA, Symmetric Multi-Threading (SMT),<br />
and variations and combinations of these, the<br />
need for a more flexible mechanism to model<br />
system topology is evident.<br />
4.1 Overview<br />
In answer to this concern, the mainline 2.6<br />
tree (linux-2.6.7-rc1 at the time of this writing)<br />
contains an updated scheduler with support for<br />
generic CPU topologies with a data structure,<br />
struct sched_domain, that models the<br />
architecture and defines scheduling policies.
Simply speaking, sched domains group CPUs<br />
together in a hierarchy that mimics that of the<br />
physical hardware. Since CPUs at the bottom<br />
of the hierarchy are most closely related<br />
(in terms of memory access), the new scheduler<br />
performs load balancing most often at the<br />
lower domains, with decreasing frequency at<br />
each higher level.<br />
Consider the case of a machine with two SMT<br />
CPUs. Each CPU contains a pair of virtual<br />
CPU siblings which share a cache and the core<br />
processor. <strong>The</strong> machine itself has two physical<br />
CPUs which share main memory. In such<br />
a situation, treating each of the four effective<br />
CPUs the same would not result in the best<br />
possible performance. With only two tasks,<br />
for example, the scheduler should place one<br />
on CPU0 and one on CPU2, and not on the<br />
two virtual CPUs of the same physical CPU.<br />
When running several tasks it seems natural to<br />
try to place newly ready tasks on the CPU they<br />
last ran on (hoping to take advantage of cache<br />
warmth). However, virtual CPU siblings share<br />
a cache; a task that was running on CPU0,<br />
then blocked, and became ready when CPU0<br />
was running another task and CPU1 was idle,<br />
would ideally be placed on CPU1. Sched domains<br />
provide the structures needed to realize<br />
these sorts of policies. With sched domains,<br />
each physical CPU represents a domain containing<br />
the pair of virtual siblings, each represented<br />
in a sched_group structure. <strong>The</strong>se<br />
two domains both point to a parent domain<br />
which contains all four effective processors in<br />
two sched_group structures, each containing<br />
a pair of virtual siblings. Figure 1 illustrates<br />
this hierarchy.

Figure 1: SMT Domains
Next consider a two-node NUMA machine<br />
with two processors per node. In this example<br />
there are no virtual sibling CPUs, and therefore<br />
no shared caches. When a task becomes<br />
ready and the processor it last ran on is busy,<br />
the scheduler needs to consider waiting until that CPU is available to take advantage of
cache warmth. If the only available CPU is<br />
on another node, the scheduler must carefully<br />
weigh the costs of migrating that task to another<br />
node, where access to its memory will<br />
be slower. <strong>The</strong> lowest level sched domains in<br />
a machine like this will contain the two processors<br />
of each node. <strong>The</strong>se two CPU level<br />
domains each point to a parent domain which<br />
contains the two nodes. Figure 2 illustrates this<br />
hierarchy.<br />
Figure 2: NUMA Domains<br />
<strong>The</strong> next logical step is to consider an SMT<br />
NUMA machine. By combining the previous<br />
two examples, the resulting sched domain hierarchy<br />
has three levels, sibling domains, physical<br />
CPU domains, and the node domain. Figure<br />
3 illustrates this hierarchy.

Figure 3: SMT NUMA Domains
<strong>The</strong> unique AMD Opteron architecture warrants<br />
mentioning here as it creates a NUMA<br />
system on a single physical board. In this case,<br />
however, each NUMA node contains only one
physical CPU. Without careful consideration<br />
of this property, a typical NUMA sched domains<br />
hierarchy would perform badly, trying<br />
to load balance single CPU nodes often (an obvious<br />
waste of cycles) and between node domains<br />
only rarely (also bad since these actually<br />
represent the physical CPUs).<br />
4.2 Sched Domains Implementation<br />
4.2.1 Structure<br />
<strong>The</strong> sched_domain structure stores policy<br />
parameters and flags and, along with<br />
the sched_group structure, is the primary<br />
building block in the domain hierarchy. Figure<br />
4 describes these structures. <strong>The</strong> sched_<br />
domain structure is constructed into an upwardly<br />
traversable tree via the parent pointer,<br />
the top level domain setting parent to NULL.<br />
The groups list is a circular list of sched_
group structures which essentially define the<br />
CPUs in each child domain and the relative<br />
power of that group of CPUs (two physical<br />
CPUs are more powerful than one SMT CPU).<br />
<strong>The</strong> span member is simply a bit vector with a<br />
1 for every CPU encompassed by that domain<br />
and is always the union of the bit vector stored<br />
in each element of the groups list. <strong>The</strong> remaining<br />
fields define the scheduling policy to be followed<br />
while dealing with that domain, see Section<br />
4.2.2.<br />
While the hierarchy may seem simple, the details<br />
of its construction and resulting tree structures<br />
are not. For performance reasons, the<br />
domain hierarchy is built on a per-CPU basis,<br />
meaning each CPU has a unique instance of<br />
each domain in the path from the base domain<br />
to the highest level domain. <strong>The</strong>se duplicate<br />
structures do share the sched_group structures<br />
however. <strong>The</strong> resulting tree is difficult to<br />
diagram, but resembles Figure 5 for the machine<br />
with two SMT CPUs discussed earlier.

Figure 5: Per CPU Domains
In accordance with common practice, each<br />
architecture may specify the construction of<br />
the sched domains hierarchy and the parameters<br />
and flags defining the various policies.<br />
At the time of this writing, only i386<br />
and ppc64 defined custom construction routines.<br />
Both architectures provide for SMT<br />
processors and NUMA configurations. Without<br />
an architecture-specific routine, the kernel<br />
uses the default implementations in sched.c,<br />
which do take NUMA into account.
struct sched_domain {
	/* These fields must be setup */
	struct sched_domain *parent;	/* top domain must be null terminated */
	struct sched_group *groups;	/* the balancing groups of the domain */
	cpumask_t span;			/* span of all CPUs in this domain */
	unsigned long min_interval;	/* Minimum balance interval ms */
	unsigned long max_interval;	/* Maximum balance interval ms */
	unsigned int busy_factor;	/* less balancing by factor if busy */
	unsigned int imbalance_pct;	/* No balance until over watermark */
	unsigned long long cache_hot_time; /* Task considered cache hot (ns) */
	unsigned int cache_nice_tries;	/* Leave cache hot tasks for # tries */
	unsigned int per_cpu_gain;	/* CPU % gained by adding domain cpus */
	int flags;			/* See SD_* */

	/* Runtime fields. */
	unsigned long last_balance;	/* init to jiffies. units in jiffies */
	unsigned int balance_interval;	/* initialise to 1. units in ms. */
	unsigned int nr_balance_failed;	/* initialise to 0 */
};

struct sched_group {
	struct sched_group *next;	/* Must be a circular list */
	cpumask_t cpumask;
	unsigned long cpu_power;
};

Figure 4: Sched Domains Structures
4.2.2 Policy<br />
<strong>The</strong> new scheduler attempts to keep the system<br />
load as balanced as possible by running rebalance code when tasks change state or make specific system calls (we will call this event balancing), and at specified intervals measured in jiffies (called active balancing). Tasks must
do something for event balancing to take place,<br />
while active balancing occurs independent of<br />
any task.<br />
Event balance policy is defined in each<br />
sched_domain structure by setting a combination<br />
of the #defines of figure 6 in the flags<br />
member.<br />
To define the policy outlined for the dual SMT<br />
processor machine in Section 4.1, the lowest<br />
level domains would set SD_BALANCE_<br />
NEWIDLE and SD_WAKE_IDLE (as there is<br />
no cache penalty for running on a different<br />
sibling within the same physical CPU),<br />
SD_SHARE_CPUPOWER to indicate to the<br />
scheduler that this is an SMT processor (the<br />
scheduler will give full physical CPU access<br />
to a high priority task by idling the<br />
virtual sibling CPU), and a few common<br />
flags SD_BALANCE_EXEC, SD_BALANCE_<br />
CLONE, and SD_WAKE_AFFINE. <strong>The</strong> next<br />
level domain represents the physical CPUs<br />
and will not set SD_WAKE_IDLE since cache<br />
warmth is a concern when balancing across<br />
physical CPUs, nor SD_SHARE_CPUPOWER.<br />
This domain adds the SD_WAKE_BALANCE<br />
flag to compensate for the removal of SD_<br />
WAKE_IDLE. As discussed earlier, an SMT<br />
NUMA system will have these two domains<br />
and another node-level domain. This domain<br />
removes the SD_BALANCE_NEWIDLE<br />
and SD_WAKE_AFFINE flags, resulting in<br />
far fewer balancing across nodes than within<br />
nodes. When one of these events occurs, the<br />
scheduler searches up the domain hierarchy and
performs the load balancing at the highest level<br />
domain with the corresponding flag set.<br />
#define SD_BALANCE_NEWIDLE 1 /* Balance when about to become idle */<br />
#define SD_BALANCE_EXEC 2 /* Balance on exec */<br />
#define SD_BALANCE_CLONE 4 /* Balance on clone */<br />
#define SD_WAKE_IDLE 8 /* Wake to idle CPU on task wakeup */<br />
#define SD_WAKE_AFFINE 16 /* Wake task to waking CPU */<br />
#define SD_WAKE_BALANCE 32 /* Perform balancing at task wakeup */<br />
#define SD_SHARE_CPUPOWER 64 /* Domain members share cpu power */<br />
Figure 6: Sched Domains Policies<br />
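As an illustration of how these pieces fit together, a sibling-level (SMT) domain might be set up roughly as below; this is a sketch using the structures of Figure 4 and the flags of Figure 6, with hypothetical interval and percentage values, not the kernel's actual initializer.

/* Sketch: flags and parameters an SMT sibling-level domain could use.
 * Numeric values are illustrative only. */
static struct sched_domain smt_domain_template = {
	.min_interval	= 1,	/* balance often... */
	.max_interval	= 2,	/* ...siblings share a cache */
	.busy_factor	= 8,
	.imbalance_pct	= 110,
	.cache_hot_time	= 0,	/* no cache penalty between siblings */
	.per_cpu_gain	= 25,	/* a sibling adds only partial CPU power */
	.flags		= SD_BALANCE_NEWIDLE | SD_BALANCE_EXEC
			| SD_BALANCE_CLONE | SD_WAKE_IDLE
			| SD_WAKE_AFFINE | SD_SHARE_CPUPOWER,
};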
Active balancing is fairly straightforward and aids in preventing CPU-hungry tasks from hogging a processor, since these tasks may only rarely trigger event balancing. At each rebalance
tick, the scheduler starts at the lowest<br />
level domain and works its way up, checking<br />
the balance_interval and last_<br />
balance fields to determine if that domain<br />
should be balanced. If the domain is already<br />
busy, the balance_interval is adjusted<br />
using the busy_factor field. Other fields<br />
define how out of balance a node must be before<br />
rebalancing can occur, as well as some<br />
sane limits on cache hot time and min and max<br />
balancing intervals. As with the flags for event<br />
balancing, the active balancing parameters are<br />
defined to perform less balancing at higher domains<br />
in the hierarchy.

4.3 Conclusions and Future Work
To compare the O(1) scheduler of mainline<br />
with the sched domains implementation in the<br />
mm tree, we ran kernbench (with the -j option<br />
to make set to 8, 16, and 32) on a 16 CPU SMT<br />
machine (32 virtual CPUs) on linux-2.6.6 and<br />
linux-2.6.6-mm3 (the latest tree with sched domains<br />
at the time of the benchmark) with and<br />
without CONFIG_SCHED_SMT enabled. <strong>The</strong><br />
results are displayed in Figure 7. <strong>The</strong> O(1)<br />
scheduler evenly distributed compile tasks across
virtual CPUs, forcing tasks to share cache<br />
and computational units between virtual sibling<br />
CPUs. <strong>The</strong> sched domains implementation<br />
with CONFIG_SCHED_SMT enabled balanced<br />
the load across physical CPUs, making
far better use of CPU resources when running<br />
fewer tasks than CPUs (as in the j8 case) since<br />
each compile task would have exclusive access<br />
to the physical CPU. Surprisingly, sched domains<br />
(which would seem to have more overhead<br />
than the mainline scheduler) even showed<br />
improvement for the j32 case, where it doesn’t
benefit from balancing across physical CPUs<br />
before virtual CPUs as there are more tasks<br />
than virtual CPUs. Considering the sched domains<br />
implementation has not been heavily<br />
tested or tweaked for performance, some fine<br />
tuning is sure to further improve performance.

Figure 7: Kernbench Results
<strong>The</strong> sched domains structures replace the expanding<br />
set of #ifdefs of the O(1) scheduler,<br />
which should improve readability and<br />
maintainability. Unfortunately, the per CPU<br />
nature of the domain construction results in a<br />
non-intuitive structure that is difficult to work<br />
with. For example, it is natural to discuss the<br />
policy defined at “the” top level domain; unfortunately<br />
there are NR_CPUS top level domains<br />
and, since they are self-adjusting, each<br />
one could conceivably have a different set of<br />
flags and parameters. Depending on which<br />
CPU the scheduler was running on, it could behave<br />
radically differently. As an extension of<br />
this research, an effort to analyze the impact of<br />
a unified sched domains hierarchy is needed,<br />
one which only creates one instance of each<br />
domain.<br />
Sched domains provides a needed structural<br />
change to the way the <strong>Linux</strong> scheduler views<br />
modern architectures, and provides the parameters<br />
needed to create complex scheduling<br />
policies that cater to the strengths and weaknesses<br />
of these systems. Currently only i386<br />
and ppc64 machines benefit from arch specific<br />
construction routines; others must now step<br />
forward and fill in the construction and parameter<br />
setting routines for their architecture of<br />
choice. <strong>The</strong>re is still plenty of fine tuning and<br />
performance tweaking to be done.<br />
5 NUMA API<br />
5.1 Introduction<br />
<strong>One</strong> of the biggest impediments to the acceptance<br />
of a NUMA API for <strong>Linux</strong> was a<br />
lack of understanding of what its potential uses<br />
and users would be. <strong>The</strong>re are two schools<br />
of thought when it comes to writing NUMA<br />
code. <strong>One</strong> says that the OS should take care<br />
of all the NUMA details, hide the NUMAness<br />
of the underlying hardware in the kernel<br />
and allow userspace applications to pretend<br />
that it’s a regular SMP machine. <strong>Linux</strong><br />
does this by having a process scheduler and<br />
a VMM that make intelligent decisions based<br />
on the hardware topology presented by archspecific<br />
code. <strong>The</strong> other way to handle NUMA<br />
programming is to provide as much detail as<br />
possible about the system to userspace and<br />
allow applications to exploit the hardware to<br />
the fullest by giving scheduling hints, memory<br />
placement directives, etc., and the NUMA<br />
API for <strong>Linux</strong> handles this. Many applications,<br />
particularly larger applications with many concurrent<br />
threads of execution, cannot fully utilize<br />
a NUMA machine with the default scheduler<br />
and VM behavior. Take, for example, a<br />
database application that uses a large region of<br />
shared memory and many threads. This application<br />
may have a startup thread that initializes<br />
the environment, sets up the shared memory<br />
region, and forks off the worker threads. <strong>The</strong><br />
default behavior of <strong>Linux</strong>’s VM for NUMA is<br />
to bring pages into memory on the node that<br />
faulted them in. This behavior for our hypothetical<br />
app would mean that many pages<br />
would get faulted in by the startup thread on<br />
the node it is executing on, not necessarily on<br />
the node containing the processes that will actually<br />
use these pages. Also, the forked worker<br />
threads would get spread around by the scheduler<br />
to be balanced across all the nodes and<br />
their CPUs, but with no guarantees as to which
threads would be associated with which nodes.<br />
<strong>The</strong> NUMA API and scheduler affinity syscalls<br />
allow this application to specify that its threads<br />
be pinned to particular CPUs and that its memory<br />
be placed on particular nodes. <strong>The</strong> application<br />
knows which threads will be working<br />
with which regions of memory, and is better<br />
equipped than the kernel to make those decisions.<br />
<strong>The</strong> <strong>Linux</strong> NUMA API allows applications<br />
to give regions of their own virtual memory<br />
space specific allocation behaviors, called policies.<br />
Currently there are four supported policies:<br />
PREFERRED, BIND, INTERLEAVE,<br />
and DEFAULT. <strong>The</strong> DEFAULT policy is the<br />
simplest, and tells the VMM to do what it<br />
would normally do (i.e., pre-NUMA API) for
pages in the policied region, and fault them<br />
in from the local node. This policy applies<br />
to all regions, but is overridden if an application<br />
requests a different policy. <strong>The</strong> PRE-<br />
FERRED policy allows an application to specify<br />
one node that all pages in the policied region<br />
should come from. However, if the specified<br />
node has no available pages, the PRE-<br />
FERRED policy allows allocation to fall back<br />
to any other node in the system. <strong>The</strong> BIND<br />
policy allows applications to pass in a nodemask,<br />
a bitmap of nodes, that the VM is required<br />
to use when faulting in pages from a region.<br />
<strong>The</strong> fourth policy type, INTERLEAVE,<br />
again requires applications to pass in a nodemask,<br />
but with the INTERLEAVE policy, the<br />
nodemask is used to ensure pages are faulted<br />
in in a round-robin fashion from the nodes<br />
in the nodemask. As with the PREFERRED<br />
policy, the INTERLEAVE policy allows page<br />
allocation to fall back to other nodes if necessary.<br />
In addition to allowing a process to<br />
policy a specific region of its VM space, the<br />
NUMA API also allows a process to policy<br />
its entire VM space with a process-wide policy,<br />
which is set with a different syscall: set_<br />
mempolicy(). Note that process-wide policies are not persistent over swapping, whereas per-VMA policies are. Please also note that
none of the policies will migrate existing (already<br />
allocated) pages to match the binding.<br />
<strong>The</strong> actual implementation of the in-kernel<br />
policies uses a struct mempolicy that is<br />
hung off the struct vm_area_struct.<br />
This choice involves some tradeoffs. <strong>The</strong> first<br />
is that, previous to the NUMA API, the per-<br />
VMA structure was exactly 32 bytes on 32-<br />
bit architectures, meaning that multiple vm_<br />
area_structs would fit conveniently in a<br />
single cacheline. <strong>The</strong> structure is now a little<br />
larger, but this allowed us to achieve a per-<br />
VMA granularity to policied regions. This is<br />
important in that it is flexible enough to bind<br />
a single page, a whole library, or a whole process’<br />
memory. This choice did lead to a second<br />
obstacle, however, which was for shared<br />
memory regions. For shared memory regions,<br />
we really want the policy to be shared amongst<br />
all processes sharing the memory, but VMAs<br />
are not shared across separate tasks. <strong>The</strong> solution<br />
that was implemented to work around this<br />
was to create a red-black tree of “shared policy<br />
nodes” for shared memory regions. Due<br />
to this, calls were added to the vm_ops structure<br />
which allow the kernel to check if a shared<br />
region has any policies and to easily retrieve<br />
these shared policies.<br />
5.2 Syscall Entry Points<br />
1. sys_mbind(unsigned long start, unsigned<br />
long len, unsigned long mode, unsigned<br />
long *nmask, unsigned long maxnode,<br />
unsigned flags);<br />
Bind the region of memory [start,<br />
start+len) according to mode and<br />
flags on the nodes enumerated in<br />
nmask and having a maximum possible<br />
node number of maxnode.<br />
2. sys_set_mempolicy(int mode, unsigned
long *nmask, unsigned long maxnode);<br />
Bind the entire address space of the current<br />
process according to mode on the<br />
nodes enumerated in nmask and having<br />
a maximum possible node number of<br />
maxnode.<br />
3. sys_get_mempolicy(int *policy, unsigned<br />
long *nmask, unsigned long maxnode,<br />
unsigned long addr, unsigned long flags);<br />
Return the current binding’s mode in<br />
policy and node enumeration in<br />
nmask based on the maxnode, addr,<br />
and flags passed in.<br />
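As a hedged usage sketch of these syscalls (assuming the libnuma development headers, whose numaif.h wraps the raw syscalls, and linking with -lnuma), an application might bind one mmap()ed region to node 0 and interleave the rest of its allocations across two nodes:

/* Sketch: per-region MPOL_BIND via mbind() plus a process-wide
 * MPOL_INTERLEAVE policy via set_mempolicy(). Assumes a system with
 * at least two NUMA nodes; error handling is minimal.               */
#define _GNU_SOURCE
#include <numaif.h>     /* mbind(), set_mempolicy(), MPOL_* (libnuma)  */
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
    unsigned long nodemask;
    size_t len = 16 * 4096;
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Require pages of this region to be faulted in from node 0 only. */
    nodemask = 1UL << 0;
    if (mbind(buf, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0) != 0)
        perror("mbind");

    /* Interleave all other allocations of this process across nodes 0 and 1. */
    nodemask = (1UL << 0) | (1UL << 1);
    if (set_mempolicy(MPOL_INTERLEAVE, &nodemask, sizeof(nodemask) * 8) != 0)
        perror("set_mempolicy");

    return 0;
}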
In addition to the raw syscalls discussed above, there is a user-level library called "libnuma" that attempts to present a more cohesive interface to the NUMA API, topology, and scheduler affinity functionality. This, however, is documented elsewhere.
5.3 At mbind() Time
After argument validation, the passed-in list of nodes is checked to make sure they are all online. If the node list is ok, a new memory policy structure is allocated and populated with the binding details. Next, the given address range is checked to make sure the VMAs for the region are present and correct. If the region is ok, we proceed to actually install the new policy into all the VMAs in that range. For most types of virtual memory regions, this involves simply pointing vma->vm_policy to the newly allocated memory policy structure. For shared memory, hugetlbfs, and tmpfs, however, it's not quite this simple. In the case of a memory policy for a shared segment, a red-black tree root node is created, if it doesn't already exist, to represent the shared memory segment and is populated with "shared policy nodes." This allows a user to bind a single shared memory segment with multiple different bindings.
5.4 At Page Fault Time
There are now several new and different flavors of alloc_pages() style functions. Previous to the NUMA API, there existed alloc_page(), alloc_pages(), and alloc_pages_node(). Without going into too much detail, alloc_page() and alloc_pages() both called alloc_pages_node() with the current node id as an argument. alloc_pages_node() allocated 2^order pages from a specific node, and was the only caller to the real page allocator, __alloc_pages().
Figure 8: old alloc_pages (alloc_page() and alloc_pages() call alloc_pages_node(), which in turn calls __alloc_pages()).
With the introduction of the NUMA API, non-<br />
NUMA kernels still retain the old alloc_<br />
page*() routines, but the NUMA allocators<br />
have changed. alloc_pages_node()<br />
and __alloc_pages(), the core routines<br />
remain untouched, but all calls to alloc_<br />
page()/alloc_pages() now end up calling<br />
alloc_pages_current(), a new<br />
function.
<strong>The</strong>re has also been the addition<br />
of two new page allocation functions:<br />
alloc_page_vma() and<br />
alloc_page_interleave().<br />
alloc_pages_current() checks that the<br />
system is not currently in_interrupt(),<br />
and if it isn’t, uses the current process’s<br />
process policy for allocation. If<br />
the system is currently in interrupt context,<br />
alloc_pages_current() falls<br />
back to the old default allocation scheme.<br />
alloc_page_interleave() allocates<br />
pages from regions that are bound with an<br />
interleave policy, and is broken out separately<br />
because there are some statistics kept for<br />
interleaved regions. alloc_page_vma()<br />
is a new allocator that allocates only single<br />
pages based on a per-vma policy. <strong>The</strong><br />
alloc_page_vma() function is the only<br />
one of the new allocator functions that must be<br />
called explicity, so you will notice that some<br />
calls to alloc_pages() have been replaced<br />
by calls to alloc_page_vma() throughout<br />
the kernel, as necessary.<br />
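The following self-contained sketch mimics the dispatch just described; every type and helper in it is a simplified stand-in rather than the kernel's real code, and current_policy plays the role of current->mempolicy.

/* Hedged sketch of the allocation-time policy dispatch described above. */
#include <stdio.h>
#include <stdlib.h>

enum { MPOL_DEFAULT, MPOL_PREFERRED, MPOL_BIND, MPOL_INTERLEAVE };

struct mempolicy { int mode; int next_node; int nr_nodes; };
struct page { int node; };

static int in_interrupt_flag;                /* stand-in for in_interrupt()     */
static struct mempolicy *current_policy;     /* stand-in for the process policy */

static struct page *alloc_pages_node_stub(int node)
{
    struct page *p = malloc(sizeof(*p));
    p->node = node;
    return p;
}

/* Interleaved allocations rotate round-robin through the policy's nodes;
 * the kernel also keeps per-node statistics here, which we omit.         */
static struct page *alloc_page_interleave_stub(struct mempolicy *pol)
{
    int node = pol->next_node;
    pol->next_node = (pol->next_node + 1) % pol->nr_nodes;
    return alloc_pages_node_stub(node);
}

/* Rough shape of alloc_pages_current(): fall back to default local
 * allocation in interrupt context, otherwise honour the process policy.  */
static struct page *alloc_pages_current_sketch(int local_node)
{
    struct mempolicy *pol = current_policy;

    if (in_interrupt_flag || !pol || pol->mode == MPOL_DEFAULT)
        return alloc_pages_node_stub(local_node);
    if (pol->mode == MPOL_INTERLEAVE)
        return alloc_page_interleave_stub(pol);
    return alloc_pages_node_stub(pol->next_node);   /* PREFERRED/BIND: a permitted node */
}

int main(void)
{
    struct mempolicy pol = { MPOL_INTERLEAVE, 0, 2 };

    current_policy = &pol;
    for (int i = 0; i < 4; i++)
        printf("page allocated on node %d\n", alloc_pages_current_sketch(0)->node);
    return 0;
}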
Figure 9: new alloc_pages. The call graph relates alloc_page(), alloc_pages(), and the new alloc_page_vma(), alloc_pages_current(), and alloc_page_interleave() entry points to alloc_pages_node() and __alloc_pages(); the new entry points exist on NUMA kernels only, while alloc_pages_node() and __alloc_pages() are common to both UP/SMP and NUMA kernels.
5.5 Problems/Future Work
There is no checking that the requested nodes are online at page fault time, so interactions with hotpluggable CPUs/memory will be tricky. There is an asymmetry between how a memory region and a whole process's memory are bound: one call takes a flags argument, and one doesn't. The maxnode argument is also a bit strange; the get/set_affinity calls take a number of bytes to be read/written instead of a maximum CPU number. The alloc_page_interleave() function could be dropped if we were willing to forgo the statistics that are kept for interleaved regions. Again, a lack of symmetry exists because other types of policies aren't tracked in any way.
6 Legal statement
This work represents the view of the authors, and does not necessarily represent the view of IBM. IBM, NUMA-Q, and Sequent are registered trademarks of International Business Machines Corporation in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of others.
Improving <strong>Kernel</strong> Performance by Unmapping the<br />
Page Cache<br />
James Bottomley<br />
SteelEye Technology, Inc.<br />
James.Bottomley@SteelEye.com<br />
Abstract<br />
<strong>The</strong> current DMA API is written on the founding<br />
assumption that coherency is maintained between the device and kernel virtual addresses.
We have a different API for coherency<br />
between the kernel and userspace. <strong>The</strong> upshot<br />
is that every Process I/O must be flushed twice:<br />
Once to make the user coherent with the kernel<br />
and once to make the kernel coherent with the<br />
device. Additionally, having to map all pages<br />
for I/O places considerable resource pressure<br />
on x86 (where any highmem page must be separately<br />
mapped).<br />
We present a different paradigm: Assume that<br />
by and large, read/write data is only required<br />
by a single entity (the major consumers of large<br />
multiply shared mappings are libraries, which<br />
are read only) and optimise the I/O path for this<br />
case. This means that any other shared consumers<br />
of the data (including the kernel) must<br />
separately map it themselves. <strong>The</strong> DMA API<br />
would be changed to perform coherence to the<br />
preferred address space (which could be the<br />
kernel). This is a slight paradigm shift, because<br />
now devices that need to peek at the data may<br />
have to map it first. Further, to free up more<br />
space for this mapping, we would break the assumption<br />
that any page in ZONE_NORMAL<br />
is automatically mapped into kernel space.<br />
<strong>The</strong> benefits are that I/O goes straight from<br />
the device into the user space (for processors<br />
that have virtually indexed caches) and the kernel<br />
has quite a large unmapped area for use in<br />
kmapping highmem pages (for x86).<br />
1 Introduction<br />
In the Linux kernel there are two addressing spaces: memory physical, which is the location in the actual memory subsystem, and CPU virtual, which is an address the CPU's Memory Management Unit (MMU) translates to a memory physical address internally. (This is not quite true: there are kernels for processors without memory management units, but these are very specialised and won't be considered further.) The Linux kernel
operates completely in CPU virtual space,<br />
keeping separate virtual spaces for the kernel<br />
and each of the current user processes. However,<br />
the kernel also has to manage the mappings<br />
between physical and virtual spaces, and<br />
to do that it keeps track of where the physical<br />
pages of memory currently are.<br />
In the <strong>Linux</strong> kernel, memory is split into zones<br />
in memory physical space:<br />
• ZONE_DMA: A historical region where<br />
ISA DMAable memory is allocated from.<br />
On x86 this is all memory under 16MB.<br />
• ZONE_NORMAL: This is where normally<br />
allocated kernel memory goes. Where<br />
this zone ends depends on the architecture.<br />
However, all memory in this zone<br />
is mapped in kernel space (visible to the<br />
kernel).<br />
• ZONE_HIGHMEM: This is where the rest<br />
of the memory goes. Its characteristic is<br />
that it is not mapped in kernel space (thus<br />
the kernel cannot access it without first<br />
mapping it).<br />
1.1 <strong>The</strong> x86 and Highmem<br />
<strong>The</strong> main reason for the existence of ZONE_<br />
HIGHMEM is a peculiar quirk on the x86 processor<br />
which makes it rather expensive to have<br />
different page table mappings between the kernel<br />
and user space. <strong>The</strong> root of the problem<br />
is that the x86 can only keep one set of physical<br />
to virtual mappings on-hand at once. Since<br />
the kernel and the processes occupy different<br />
virtual mappings, the TLB context would have<br />
to be switched not only when the processor<br />
changes current user tasks, but also when the<br />
current user task calls on the kernel to perform<br />
an operation on its behalf. <strong>The</strong> time taken<br />
to change mappings, called the TLB flushing<br />
penalty, contributes to a degradation in process<br />
performance and has been measured at around<br />
30%[1]. To avoid this penalty, the <strong>Kernel</strong> and<br />
user spaces share a partitioned virtual address<br />
space so that the kernel is actually mapped into<br />
user space (although protected from user access)<br />
and vice versa.<br />
<strong>The</strong> upshot of this is that the x86 userspace<br />
is divided 3GB/1GB with the virtual address<br />
range 0x00000000-0xbfffffff<br />
being available for the user process and<br />
0xc0000000-0xffffffff being reserved<br />
for the kernel.<br />
The problem, for the kernel, is that it now only has 1GB of virtual address space to play with, including all memory mapped I/O regions. The result is that ZONE_NORMAL actually ends at around 850MB on most x86 boxes. Since
the kernel must also manage the mappings for<br />
every user process (and these mappings must<br />
be memory resident), the larger the system's physical memory becomes, the less of ZONE_NORMAL remains available to the kernel.
On a 64GB x86 box, the usable memory<br />
becomes minuscule and has lead to the<br />
proposal[2] to use a 4G/4G split and just accept<br />
the TLB flushing penalty.<br />
1.2 Non-x86 and Virtual Indexing<br />
Most other architectures are rather better implemented<br />
and are able to cope easily with separate<br />
virtual spaces for the user and the kernel<br />
without imposing a performance penalty<br />
transitioning from one virtual address space to<br />
another. However, there are other problems<br />
the kernel’s penchant for keeping all memory<br />
mapped causes, notably with Virtual Indexing.<br />
Virtual Indexing[3] (VI) means that the CPU<br />
cache keeps its data indexed by virtual address<br />
(rather than by physical address like the x86<br />
does). <strong>The</strong> problem this causes is that if multiple<br />
virtual address spaces have the same physical<br />
address mapped, but at different virtual addresses<br />
then the cache may contain duplicate<br />
entries, called aliases. Managing these aliases<br />
becomes impossible if there are multiple ones<br />
that become dirty.<br />
Most VI architectures find a solution to the<br />
multiple cache line problem by having a “congruence<br />
modulus” meaning that if two virtual<br />
addresses are equal modulo this congruence<br />
(usually a value around 4MB) then the cache<br />
will detect the aliasing and keep only a single<br />
copy of the data that will be seen by all the virtual<br />
addresses.<br />
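A tiny illustration of the congruence test follows; the 4MB congruence modulus is assumed purely for the example, since the real value is CPU specific.

/* Two virtual addresses can alias in a virtually indexed cache only if
 * they are congruent, i.e. equal modulo the congruence modulus.        */
#include <stdio.h>
#include <stdint.h>

#define CONGRUENCE_MODULUS (4UL << 20)   /* 4MB, illustrative */

static int congruent(uintptr_t va1, uintptr_t va2)
{
    return (va1 % CONGRUENCE_MODULUS) == (va2 % CONGRUENCE_MODULUS);
}

int main(void)
{
    printf("0x00401000 vs 0x10001000: %d\n",
           congruent(0x00401000u, 0x10001000u));   /* congruent     */
    printf("0x00401000 vs 0xc0526000: %d\n",
           congruent(0x00401000u, 0xc0526000u));   /* not congruent */
    return 0;
}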
The problems arise because, although architectures go to great lengths to make sure all user mappings are congruent, kernel memory is always mapped, so it is highly unlikely that any given kernel page would be congruent to a user page.
1.3 <strong>The</strong> solution: Unmapping ZONE_NORMAL<br />
It has already been pointed out[4] that x86<br />
could recover some of its precious ZONE_<br />
NORMAL space simply by moving page table<br />
entries into unmapped highmem space. However,<br />
the penalty of having to map and unmap<br />
the page table entries to modify them turned<br />
out to be unacceptable.<br />
<strong>The</strong> solution, though, remains valid. <strong>The</strong>re<br />
are many pages of data currently in ZONE_<br />
NORMAL that the kernel doesn’t ordinarily use.<br />
If these could be unmapped and their virtual<br />
address space given up then the x86 kernel<br />
wouldn’t be facing quite such a memory<br />
crunch.<br />
For VI architectures, the problems stem from<br />
having unallocated kernel memory already<br />
mapped. If we could keep the majority of kernel<br />
memory unmapped, and map it only when<br />
we really need to use it, then we would stand<br />
a very good chance of being able to map the<br />
memory congruently even in kernel space.<br />
<strong>The</strong> solution this paper will explore is that of<br />
keeping the majority of kernel memory unmapped,<br />
mapping it only when it is used.<br />
2 A closer look at Virtual Indexing<br />
As well as the aliasing problem, VI architectures<br />
also have issues with I/O coherency on<br />
DMA. <strong>The</strong> essence of the problem stems from<br />
the fact that in order to make a device access<br />
to physical memory coherent, any cache<br />
lines that the processor is holding need to be<br />
flushed/invalidated as part of the DMA transaction.
In order to do DMA, a device simply<br />
presents a physical address to the system with<br />
a request to read or write. However, if the processor<br />
indexes the caches virtually, it will have<br />
no idea whether it is caching this physical address<br />
or not. <strong>The</strong>refore, in order to give the<br />
processor an idea of where in the cache the data<br />
might be, the DMA engines on VI architectures<br />
also present a virtual index (called the “coherence<br />
index”) along with the physical address.<br />
2.1 Coherence Indices and DMA<br />
<strong>The</strong> Coherence Index is computed by the processor<br />
on a per page basis, and is used to identify<br />
the line in the cache belonging to the physical<br />
address the DMA is using.<br />
<strong>One</strong> will notice that this means the coherence<br />
index must be computed on every DMA transaction<br />
for a particular address space (although,<br />
if all the addresses are congruent, one may simply<br />
pick any one). Since, at the time the dma<br />
mapping is done, the only virtual address the<br />
kernel knows about is the kernel virtual address,<br />
it means that DMA is always done coherently<br />
with the kernel.<br />
In turn, since the kernel address is pretty much<br />
not congruent with any user address, before the<br />
DMA is signalled as being completed to the<br />
user process, the kernel mapping and the user<br />
mappings must likewise be made coherent (using<br />
the flush_dcache_page() function).<br />
However, since the majority of DMA transactions<br />
occur on user data in which the kernel has<br />
no interest, the extra flush is simply an unnecessary<br />
performance penalty.<br />
This performance penalty would be eliminated<br />
if either we knew that the designated kernel address<br />
was congruent to all the user addresses<br />
or we didn’t bother to map the DMA region<br />
into kernel space and simply computed the coherence<br />
index from a given user process. <strong>The</strong><br />
latter would be preferable from a performance<br />
point of view since it eliminates an unneces-
sary map and unmap.<br />
2.2 Other Issues with Non-Congruence<br />
On the parisc architecture, there is an architectural<br />
requirement that we don’t simultaneously<br />
enable multiple read and write translations of<br />
a non-congruent address. We can either enable<br />
a single write translation or multiple read (but<br />
no write) translations. With the current manner<br />
of kernel operation, this is almost impossible<br />
to satisfy without going to enormous lengths in<br />
our page translation and fault routines to work<br />
around the issues.<br />
Previously, we were able to get away with<br />
ignoring this restriction because the machine<br />
would only detect it if we allowed multiple<br />
aliases to become dirty (something <strong>Linux</strong> never<br />
does). However, in the next generation systems,<br />
this condition will be detected when it<br />
occurs. Thus, addressing it has become critical<br />
to providing a bootable kernel on these new<br />
machines.<br />
Thus, as well as being a simple performance<br />
enhancement, removing non-congruence becomes<br />
vital to keeping the kernel booting on<br />
next generation machines.<br />
2.3 VIPT vs VIVT<br />
This topic is covered comprehensively in [3].<br />
However, there is a problem in VIPT caches,<br />
namely that if we are reusing the virtual address<br />
in kernel space, we must flush the processor’s<br />
cache for that page on this re-use otherwise<br />
it may fall victim to stale cache references<br />
that were left over from a prior use.<br />
Flushing a VIPT cache is easier said than done,<br />
since in order to flush, a valid translation must<br />
exist for the virtual address in order for the<br />
flush to be effective. This causes particular<br />
problems for pages that were mapped to a user<br />
space process, since the address translations<br />
are destroyed before the page is finally freed.<br />
3 <strong>Kernel</strong> Virtual Space<br />
Although the kernel is nominally mapped in<br />
the same way the user process is (and can theoretically<br />
be fragmented in physical space), in<br />
fact it is usually offset mapped. This means<br />
there is a simple mathematical relation between<br />
the physical and virtual addresses:<br />
virtual = physical + __PAGE_OFFSET<br />
where __PAGE_OFFSET is an architecture<br />
defined quantity. This type of mapping makes<br />
it very easy to calculate virtual addresses from<br />
physical ones and vice versa without having to<br />
go to all the bother (and CPU time) of having<br />
to look them up in the kernel page tables.<br />
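For illustration, the offset-mapping translation can be expressed as the following user-compilable sketch, using the classic x86 3GB/1GB value of __PAGE_OFFSET purely as an example:

/* Sketch of offset-mapped physical/virtual translation as described
 * above; the __PAGE_OFFSET value here is illustrative.              */
#include <stdio.h>

#define __PAGE_OFFSET 0xc0000000UL
#define __va(paddr) ((unsigned long)(paddr) + __PAGE_OFFSET)
#define __pa(vaddr) ((unsigned long)(vaddr) - __PAGE_OFFSET)

int main(void)
{
    unsigned long phys = 0x00123000UL;

    printf("phys 0x%lx -> virt 0x%lx -> phys 0x%lx\n",
           phys, __va(phys), __pa(__va(phys)));
    return 0;
}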
3.1 Moving away from Offset Mapping<br />
<strong>The</strong>re’s another wrinkle on some architectures<br />
in that if an interruption occurs, the CPU<br />
turns off virtual addressing to begin processing<br />
it. This means that the kernel needs to<br />
save the various registers and turn virtual addressing<br />
back on, all in physical space. If<br />
it’s no longer a simple matter of subtracting<br />
__PAGE_OFFSET to get the kernel stack for<br />
the process, then extra time will be consumed<br />
in the critical path doing potentially cache cold<br />
page table lookups.<br />
3.2 Keeping track of Mapped pages<br />
In general, when mapping a page we will either<br />
require that it goes in the first available<br />
slot (for x86), or that it goes at the first available<br />
slot congruent with a given address (for VI<br />
architectures). All we really require is a simple<br />
mechanism for finding the first free page
virtual address given some specific constraints.<br />
However, since the constraints are architecture<br />
specific, the specifics of this tracking are also<br />
implemented in architectures (see section 5.2<br />
for details on parisc).<br />
3.3 Determining Physical address from Virtual<br />
and Vice-Versa<br />
In the <strong>Linux</strong> kernel, the simple macros<br />
__pa() and __va() are used to do physical<br />
to virtual translation. Since we are now filling<br />
the mappings in randomly, this is no longer a<br />
simple offset calculation.<br />
<strong>The</strong> kernel does have help for finding the virtual<br />
address of a given page. <strong>The</strong>re is an<br />
optional virtual entry which is turned on<br />
and populated with the page’s current virtual<br />
address when the architecture defines WANT_<br />
PAGE_VIRTUAL. <strong>The</strong> __va() macro can be<br />
programmed simply to do this lookup.<br />
To find the physical address, the best method is<br />
probably to look the page up in the kernel page<br />
table mappings. This is obviously less efficient<br />
than a simple subtraction.<br />
4 Implementing the unmapping of<br />
ZONE_NORMAL<br />
Given that the entire kernel is designed to operate with ZONE_NORMAL mapped, it is perhaps surprising that unmapping it turns out to be fairly easy. The primary reason for
this is the existence of highmem. Since pages<br />
in ZONE_HIGHMEM are always unmapped and<br />
since they are usually assigned to user processes,<br />
the kernel must proceed on the assumption<br />
that it potentially has to map into its address<br />
space any page from a user process that<br />
it wishes to touch.<br />
4.1 Booting<br />
<strong>The</strong> kernel has an entire bootmem API whose<br />
sole job is to cope with memory allocations<br />
while the system is booting and before paging<br />
has been initialised to the point where normal<br />
memory allocations may proceed. On parisc,<br />
we simply get the available page ranges from<br />
the firmware, map them all and turn them over<br />
lock stock and barrel to bootmem.<br />
Then, when we're ready to begin paging, we simply release all the unallocated bootmem pages for the kernel to use from its mem_map array of pages (on NUMA this global array would be a set of per-zone arrays).
We can implement the unmapping idea simply<br />
by covering all our page ranges with an offset<br />
map for bootmem, but then unmapping all the<br />
unreserved pages that bootmem releases to the<br />
mem_map array.<br />
This leaves us with the kernel text and data sections contiguously offset mapped, along with any other boot-time allocations that remain reserved.
4.2 Pages Coming From User Space<br />
<strong>The</strong> standard mechanisms for mapping potential<br />
highmem pages from user space for the<br />
kernel to see are kmap, kunmap, kmap_<br />
atomic, and kmap_atomic_to_page.<br />
Simply hijacking them and divorcing their implementation<br />
from CONFIG_HIGHMEM is sufficient<br />
to solve all user to kernel problems<br />
that arise because of the unmapping of ZONE_<br />
NORMAL.<br />
4.3 In <strong>Kernel</strong> Problems: Memory Allocation<br />
Since now every free page in the system will<br />
be unmapped, they will have to be mapped<br />
before the kernel can use them (pages allocated<br />
for use in user space have no need to<br />
be mapped additionally in kernel space at allocation<br />
time). <strong>The</strong> engine for doing this is a<br />
single point in __alloc_pages() which is<br />
the central routine for allocating every page in<br />
the system. In the single successful page return,<br />
the page is mapped for the kernel to use it<br />
if __GFP_HIGH is not set—this simple test is<br />
sufficient to ensure that kernel pages only are<br />
mapped here.<br />
<strong>The</strong> unmapping is done in two separate routines:<br />
__free_pages_ok() for freeing bulk<br />
pages (accumulations of contiguous pages) and<br />
free_hot_cold_page() for freeing single<br />
pages. Here, since we don’t know the gfp mask<br />
the page was allocated with, we simply check<br />
to see if the page is currently mapped, and unmap<br />
it if it is before freeing it. <strong>The</strong>re is another<br />
side benefit to this: the routine that transfers all<br />
the unreserved bootmem to the mem_map array<br />
does this via __free_pages(). Thus,<br />
we additionally achieve the unmapping of all<br />
the free pages in the system after booting with<br />
virtually no additional effort.<br />
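A self-contained sketch of this map-on-allocate/unmap-on-free scheme is shown below; the page structure, the flag value, and the kmap_kernel()/kunmap_kernel() helpers are simplified stand-ins invented for the example, not kernel interfaces.

/* Hedged sketch of map-on-allocate / unmap-on-free as described above. */
#include <stdio.h>
#include <stdlib.h>

#define __GFP_HIGH 0x20u            /* illustrative flag value */

struct page { int mapped; };

static void kmap_kernel(struct page *p)   { p->mapped = 1; }
static void kunmap_kernel(struct page *p) { p->mapped = 0; }

/* Single successful-return point of the allocator: map the page for the
 * kernel only when the caller did not ask for a user-destined high page. */
static struct page *alloc_pages_sketch(unsigned int gfp_mask)
{
    struct page *page = calloc(1, sizeof(*page));

    if (page && !(gfp_mask & __GFP_HIGH))
        kmap_kernel(page);
    return page;
}

/* Freeing path: the gfp mask is unknown here, so test the mapping instead. */
static void free_pages_sketch(struct page *page)
{
    if (page->mapped)
        kunmap_kernel(page);
    free(page);
}

int main(void)
{
    struct page *kernel_page = alloc_pages_sketch(0);
    struct page *user_page   = alloc_pages_sketch(__GFP_HIGH);

    printf("kernel page mapped: %d, user page mapped: %d\n",
           kernel_page->mapped, user_page->mapped);
    free_pages_sketch(kernel_page);
    free_pages_sketch(user_page);
    return 0;
}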
4.4 Other Benefits: Variable size pages<br />
Although this structure was not designed to provide variable size pages, one of the benefits of the approach is that pages are now mapped as they are allocated. Since pages
in the kernel are allocated with a specified order<br />
(the power of two of the number of contiguous<br />
pages), it becomes possible to cover<br />
them with a TLB entry that is larger than the<br />
usual page size (as long as the architecture supports<br />
this). Thus, we can take the order argument<br />
to __alloc_pages() and work out<br />
the smallest number of TLB entries that we<br />
need to allocate to cover it.<br />
Implementation of variable size pages is actually<br />
transparent to the system; as far as <strong>Linux</strong><br />
is concerned, the page table entries it deals with
describe 4k pages. However, we add additional<br />
flags to the pte to tell the software TLB routine<br />
that actually we’d like to use a larger size TLB<br />
to access this region.<br />
As a further optimisation, in the architecture<br />
specific routines that free the boot mem, we can<br />
remap the kernel text and data sections with the<br />
smallest number of TLB entries that will entirely<br />
cover each of them.<br />
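As an illustration of working out the entry count, the following sketch greedily covers a 2^order allocation with the fewest entries drawn from an assumed set of supported TLB entry sizes; the size list is illustrative, not a real CPU's.

/* Cover a 2^order base-page allocation with the fewest large-TLB entries. */
#include <stdio.h>

/* Supported entry sizes as powers of two of 4kB pages:
 * 16MB, 4MB, 1MB, 256kB, 64kB, 16kB, 4kB (illustrative). */
static const unsigned int entry_orders[] = { 12, 10, 8, 6, 4, 2, 0 };

static unsigned int tlb_entries_for_order(unsigned int order)
{
    unsigned int remaining = 1u << order, count = 0;

    for (unsigned int i = 0;
         i < sizeof(entry_orders) / sizeof(entry_orders[0]); i++) {
        unsigned int size = 1u << entry_orders[i];

        count += remaining / size;
        remaining %= size;
    }
    return count;
}

int main(void)
{
    for (unsigned int order = 0; order <= 13; order++)
        printf("order %2u -> %u TLB entries\n", order, tlb_entries_for_order(order));
    return 0;
}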
5 Achieving <strong>The</strong> VI architecture<br />
Goal: Fully Congruent Aliasing<br />
<strong>The</strong> system possesses every attribute it now<br />
needs to implement this. We no longer map
any user pages into kernel space unless the kernel<br />
actually needs to touch them. Thus, the<br />
pages will have congruent user addresses allocated<br />
to them in user space before we try to<br />
map them in kernel space. Thus, all we have<br />
to do is track up the free address list in increments<br />
of the congruence modulus until we find<br />
an empty place to map the page congruently.<br />
5.1 Wrinkles in the I/O Subsystem<br />
<strong>The</strong> I/O subsystem is designed to operate without<br />
mapping pages into the kernel at all. This<br />
becomes problematic for VI architectures because<br />
we have to know the user virtual address<br />
to compute the coherence index for the I/O.<br />
If the page is unmapped in kernel space, we<br />
can no longer make it coherent with the kernel<br />
mapping and, unfortunately, the information in<br />
the BIO is insufficient to tell us the user virtual<br />
address.<br />
<strong>The</strong> proposal for solving this is to add an architecture<br />
defined set of elements to struct<br />
bio_vec and an architecture specific function<br />
for populating this (possibly empty) set of<br />
elements as the biovec is created. In parisc,
we need to add an extra unsigned long for<br />
the coherence index, which we compute from<br />
a pointer to the mm and the user virtual address.<br />
<strong>The</strong> architecture defined components are<br />
pulled into struct scatterlist by yet<br />
another callout when the request is mapped for<br />
DMA.<br />
5.2 Tracking the Mappings in ZONE_DMA<br />
Since the tracking requirements vary depending<br />
on architectures: x86 will merely wish to<br />
find the first free pte to place a page into; however<br />
VI architectures will need to find the first<br />
free pte satisfying the congruence requirements<br />
(which vary by architecture), the actual mechanism<br />
for finding a free pte for the mapping<br />
needs to be architecture specific.<br />
On parisc, all of this can be done in kmap_<br />
kernel() which merely uses rmap[5] to determine<br />
if the page is mapped in user space<br />
and find the congruent address if it is. We<br />
use a simple hash table based bitmap with one<br />
bucket representing the set of available congruent<br />
pages. Thus, finding a page congruent to<br />
any given virtual address is the simple computation<br />
of finding the first set bit in the congruence<br />
bucket. To find an arbitrary page, we keep<br />
a global bucket counter, allocating a page from<br />
that bucket and then incrementing the counter. (This can all be done locklessly with atomic increments, since it doesn't really matter if we get two allocations from the same bucket because of race conditions.)
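A rough userspace sketch of this bucket-per-congruence-class bookkeeping follows; the sizes and layout are illustrative, assuming a 4MB congruence modulus and 4kB pages.

/* Track free kernel virtual slots with one bitmap bucket per congruence class. */
#include <stdio.h>
#include <string.h>

#define NR_BUCKETS       1024            /* 4MB congruence modulus / 4kB pages */
#define SLOTS_PER_BUCKET 256             /* free slots tracked per class       */

static unsigned char free_map[NR_BUCKETS][SLOTS_PER_BUCKET / 8];

/* Find (and claim) a free slot congruent to vaddr; return -1 if none. */
static int find_congruent_slot(unsigned long vaddr)
{
    unsigned int bucket = (vaddr >> 12) % NR_BUCKETS;   /* which congruence class */

    for (int i = 0; i < SLOTS_PER_BUCKET; i++) {
        if (free_map[bucket][i / 8] & (1u << (i % 8))) {
            free_map[bucket][i / 8] &= (unsigned char)~(1u << (i % 8));
            return i;
        }
    }
    return -1;
}

int main(void)
{
    memset(free_map, 0xff, sizeof(free_map));           /* everything starts free */
    printf("claimed slot %d for a page congruent to 0x00412000\n",
           find_congruent_slot(0x00412000UL));
    return 0;
}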
6 Implementation Details on PA-<br />
RISC<br />
Since the whole thrust of this project was to improve<br />
the kernel on PA-RISC (and bring it back<br />
into architectural compliance), it is appropriate<br />
to investigate some of the other problems that<br />
turned up during the implementation.<br />
6.1 Equivalent Mapping<br />
<strong>The</strong> PA architecture has a software TLB meaning<br />
that in Virtual mode, if the CPU accesses<br />
an address that isn’t in the CPU’s TLB cache,<br />
it will take a TLB fault so the software routine<br />
can locate the TLB entry (by walking the page<br />
tables) and insert it into the CPU’s TLB. Obviously,<br />
this type of interruption must be handled<br />
purely by referencing physical addresses.<br />
In fact, the PA CPU is designed to have fast and<br />
slow paths for faults and interruptions. <strong>The</strong> fast<br />
paths (since they cannot take another interruption,<br />
i.e. not a TLB miss fault) must all operate<br />
on physical addresses. To assist with this, the<br />
PA CPU even turns off virtual addressing when<br />
it takes an interruption.<br />
When the CPU turns off virtual address translation,<br />
it is said to be operating in absolute<br />
mode. All address accesses in this mode are<br />
physical. However, all accesses in this mode<br />
also go through the CPU cache (which means<br />
that for this particular mode the cache is actually<br />
Physically Indexed). Unfortunately, this<br />
can also set up unwanted aliasing between the<br />
physical address and its virtual translation. <strong>The</strong><br />
fix for this is to obey the architectural definition<br />
for “equivalent mapping.” Equivalent mapping<br />
is defined as virtual and physical addresses being<br />
equal; however, we benefit from the obvious<br />
loophole in that the physical and virtual addresses<br />
don’t have to be exactly equal, merely<br />
equal modulo the congruent modulus.<br />
All of this means that when a page is allocated<br />
for use by the kernel, we must determine if it<br />
will ever be used in absolute mode, and make it<br />
equivalently mapped if it will be. At the time of<br />
writing, this was simply implemented by making<br />
all kernel allocated pages equivalent. However,<br />
really all that needs to be equivalently<br />
mapped is<br />
1. the page tables (pgd, pmd and pte),
2. the task structure and<br />
3. the kernel stacks.<br />
6.2 Physical to Virtual address Translation<br />
In the interruption slow path, where we save<br />
all the registers and transition to virtual mode,<br />
there is a point where execution must be<br />
switched (and hence pointers moved from<br />
physical to virtual). Currently, with offset<br />
mapping, this is simply done by and addition<br />
of __PAGE_OFFSET. However, in the new<br />
scheme we cannot do this, nor can we call<br />
the address translation functions when in absolute<br />
mode. <strong>The</strong>refore, we had to reorganise<br />
the interruption paths in the PA code so<br />
that both the physical and virtual addresses were
available. Currently parisc uses a control register<br />
(%cr30) to store the virtual address of<br />
the struct thread_info. We altered all<br />
paths to change %cr30 to contain the physical<br />
address of struct thread_info and<br />
also added a physical address pointer to the<br />
struct task_struct to the thread info.<br />
This is sufficient to perform all the necessary<br />
register saves in absolute addressing mode.<br />
6.3 Flushing on Page Freeing<br />
As was documented in section 2.3, we need to
find a way of flushing a user virtual address after<br />
its translation is gone. Actually, this turns<br />
out to be quite easy on PARISC. We already<br />
have an area of memory (called the tmpalias<br />
space) that we use when copying to prime the user
cache (it is simply a 4MB memory area we dynamically<br />
program to map to the page). <strong>The</strong>refore,<br />
as long as we know the user virtual address,<br />
we can simply flush the page through<br />
the tmpalias space. In order to confound any<br />
attempted kernel use of this page, we reserve<br />
a separate 4MB virtual area that produces a<br />
page fault if referenced, and point the page’s<br />
virtual address into this when it is removed<br />
from process mappings (so that any kernel attempt<br />
to use the page produces an immediate<br />
fault). <strong>The</strong>n, when the page is freed, if its<br />
virtual pointer is within this range, we convert<br />
it to a tmpalias address and flush it using<br />
the tmpalias mechanism.<br />
7 Results and Conclusion<br />
<strong>The</strong> best result is that on a parisc machine, the<br />
total amount of memory the operational kernel<br />
keeps mapped is around 10MB (although this<br />
alters depending on conditions).<br />
<strong>The</strong> current implementation makes all pages<br />
congruent or equivalent, but the allocation routine<br />
contains BUG_ON() asserts to detect if we<br />
run out of equivalent addresses. So far, under<br />
fairly heavy stress, none of these has tripped.<br />
Although the primary reason for the unmapping<br />
was to move parisc back within its architectural<br />
requirements, it also produces a knock<br />
on effect of speeding up I/O by eliminating the<br />
cache flushing from kernel to user space. At<br />
the time of writing, the effects of this were still<br />
unmeasured, but expected to be around 6% or<br />
so.<br />
As a final side effect, the flush on free necessity<br />
releases the parisc from a very stringent “flush<br />
the entire cache on process death or exec” requirement<br />
that was producing horrible latencies<br />
in the parisc fork/exec. With this code in<br />
place, we see a vast (50%) improvement in the<br />
fork/exec figures.<br />
References<br />
[1] Andrea Arcangeli, "3:1 4:4 100HZ 1000HZ comparison with the HINT benchmark," 7 April 2004. http://www.kernel.org/pub/linux/kernel/people/andrea/misc/31-44-100-1000/31-44-100-1000.html
[2] Ingo Molnar, "[announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support," 8 July 2003. http://marc.theaimsgroup.com/?t=105770467300001
[3] James E.J. Bottomley, "Understanding Caching," Linux Journal, Issue 117, January 2004, p. 58.
[4] Ingo Molnar, "[patch] simpler 'highpte' design," 18 February 2002. http://marc.theaimsgroup.com/?l=linux-kernel&m=101406121032371
[5] Rik van Riel, "Re: Rmap code?" 22 August 2001. http://marc.theaimsgroup.com/?l=linux-mm&m=99849912207578
<strong>Linux</strong> Virtualization on IBM POWER5 Systems<br />
Dave Boutcher
IBM
boutcher@us.ibm.com
Dave Engebretsen
IBM
engebret@us.ibm.com
Abstract
In 2004 IBM® is releasing new systems based<br />
on the POWER5 processor. <strong>The</strong>re is new<br />
support in both the hardware and firmware for<br />
virtualization of multiple operating systems on<br />
a single platform. This includes the ability to<br />
have multiple operating systems share a processor.<br />
Additionally, a hypervisor firmware<br />
layer supports virtualization of I/O devices<br />
such as SCSI, LAN, and console, allowing<br />
limited physical resources in a system to be<br />
shared.<br />
At its extreme, these new systems allow 10<br />
<strong>Linux</strong> images per physical processor to run<br />
concurrently, contending for and sharing the<br />
system’s physical resources. All changes to<br />
support these new functions are in the 2.4 and<br />
2.6 <strong>Linux</strong> kernels.<br />
This paper discusses the virtualization capabilities<br />
of the processor and firmware, as well as<br />
the changes made to the PPC64 kernel to take<br />
advantage of them.<br />
1 Introduction<br />
IBM's new POWER5 processor is being used
in both IBM iSeries® and pSeries® systems<br />
capable of running any combination of <strong>Linux</strong>,<br />
AIX®, and OS/400® in logical partitions. <strong>The</strong><br />
hardware and firmware, including a hypervisor<br />
[AAN00], in these systems provide the ability<br />
to create “virtual” system images with virtual<br />
hardware. <strong>The</strong> virtualization technique used on<br />
POWER hardware is known as paravirtualization,<br />
where the operating system is modified<br />
in select areas to make calls into the hypervisor.<br />
PPC64 <strong>Linux</strong> has been enhanced to make<br />
use of these virtualization interfaces. Note that<br />
the same PPC64 <strong>Linux</strong> kernel binary works<br />
on both virtualized systems and previous “bare<br />
metal” pSeries systems that did not offer a hypervisor.<br />
All changes related to virtualization have been<br />
made in the kernel, and almost exclusively in<br />
the PPC64 portion of the code. <strong>One</strong> challenge<br />
has been keeping as much code common<br />
as possible between POWER5 portions of the<br />
code and other portions, such as those supporting<br />
the Apple G5.<br />
Like previous generations of POWER processors<br />
such as the RS64 and POWER4 families,<br />
POWER5 includes hardware enablement<br />
for logical partitioning. This includes features<br />
such as a hypervisor state which is more privileged<br />
than supervisor state. This higher privilege<br />
state is used to restrict access to system<br />
resources, such as the hardware page table, to<br />
hypervisor only access. All current systems<br />
based on POWER5 run in a hypervised environment,<br />
even if only one partition is active on<br />
the system.
Figure 1: POWER5 Partitioned System. Four partitions (Linux, OS/400, Linux, and AIX) with allocations of 1.50, 1.0, 0.50, and 1.0 processing units run on top of the hypervisor, which multiplexes the four physical CPUs (0 through 3).
2 Processor Virtualization<br />
2.1 Virtual Processors<br />
When running in a partition, the operating<br />
system is allocated virtual processors (VP’s),<br />
where each VP can be configured in either<br />
shared or dedicated mode of operation. In<br />
shared mode, as little as 10%, or 10 processing<br />
units, of a physical processor can be allocated<br />
to a partition and the hypervisor layer<br />
timeslices between the partitions. In dedicated<br />
mode, 100% of the processor is given to the<br />
partition such that its capacity is never multiplexed<br />
with another partition.<br />
It is possible to create more virtual processors<br />
in the partition than there are physical processors<br />
on the system. For example, a partition allocated<br />
100 processing units (the equivalent of<br />
1 processor) of capacity could be configured to<br />
have 10 virtual processors, where each VP has<br />
10% of a physical processor’s time. While not<br />
generally valuable, this extreme configuration<br />
can be used to help test SMP configurations on<br />
small systems.<br />
On POWER5 systems with multiple logical<br />
partitions, an important requirement is to be<br />
able to move processors (either shared or dedicated)<br />
from one logical partition to another.<br />
In the case of dedicated processors, this truly<br />
means moving a CPU from one logical partition<br />
to another. In the case of shared processors,<br />
it means adjusting the number of processors<br />
used by <strong>Linux</strong> on the fly.<br />
This “hotplug CPU” capability is far more interesting<br />
in this environment than in the case<br />
that the covers are going to be removed from a<br />
real system and a CPU physically added. <strong>The</strong><br />
goal of virtualization on these systems is to dynamically<br />
create and adjust operating system<br />
images as required. Much work has been done,<br />
particularly by Rusty Russell, to get the architecture<br />
independent changes into the mainline<br />
kernel to support hotplug CPU.<br />
Hypervisor interfaces exist that help the operating<br />
system optimize its use of the physical processor<br />
resources. <strong>The</strong> following sections describe<br />
some of these mechanisms.<br />
2.2 Virtual Processor Area<br />
Each virtual processor in the partition can create<br />
a virtual processor area (VPA), which is a<br />
small (one page) data structure shared between<br />
the hypervisor and the operating system. Its<br />
primary use is to communicate information between<br />
the two software layers. Examples of<br />
the information that can be communicated in<br />
the VPA include whether the OS is in the idle<br />
loop, if floating point and performance counter<br />
register state must be saved by the hypervisor<br />
between operating system dispatches, and<br />
whether the VP is running in the partition’s operating<br />
system.<br />
2.3 Spinlocks<br />
<strong>The</strong> hypervisor provides an interface that helps<br />
minimize wasted cycles in the operating system<br />
when a lock is held. Rather than simply<br />
spin on the held lock in the OS, a new hypervi-
<strong>Linux</strong> Symposium 2004 • Volume <strong>One</strong> • 115<br />
sor call, h_confer, has been provided. This<br />
interface is used to confer any remaining virtual<br />
processor cycles from the lock requester<br />
to the lock holder.<br />
<strong>The</strong> PPC64 spinlocks were changed to identify<br />
the logical processor number of the lock<br />
holder, examine that processor’s VPA yield<br />
count field to determine if it is not running in<br />
the OS (even values indicate the VP is running<br />
in the OS), and to make the h_confer call<br />
to the hypervisor to give any cycles remaining<br />
in the virtual processor’s timeslice to the lock<br />
holder. Obviously, this more expensive leg of<br />
spinlock processing is only taken if the spinlock<br />
cannot be immediately acquired. In cases<br />
where the lock is available, no additional pathlength<br />
is incurred.<br />
2.4 Idle<br />
When the operating system no longer has active<br />
tasks to run and enters its idle loop, the<br />
h_cede interface is used to indicate to the hypervisor<br />
that the processor is available for other<br />
work. <strong>The</strong> operating system simply sets the<br />
VPA idle bit and calls h_cede. Under this<br />
call, the hypervisor is free to allocate the processor<br />
resources to another partition, or even to<br />
another virtual processor within the same partition.<br />
<strong>The</strong> processor is returned to the operating<br />
system if an external, decrementer (timer),<br />
or interprocessor interrupt occurs. As an alternative<br />
to sending an IPI, the ceded processor<br />
can be awoken by another processor calling the<br />
h_prod interface, which has slightly less overhead<br />
in this environment.<br />
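A minimal sketch of such a cede-based idle loop follows; the VPA, h_cede(), and need_resched() used here are simple stand-ins, not the real kernel or hypervisor interfaces.

/* Advertise idleness in the VPA, then cede the virtual processor until
 * an interrupt or an h_prod from another processor wakes it up.        */
#include <stdio.h>

struct vpa_sketch { int idle; };

static struct vpa_sketch my_vpa;
static int wakeups;

static int need_resched(void) { return wakeups >= 3; }   /* pretend work arrives later */

static void h_cede(void)
{
    /* Real code gives the physical processor back to the hypervisor here;
     * it returns when an external, decrementer, or IPI/h_prod event occurs. */
    wakeups++;
    printf("ceded; woken up (%d)\n", wakeups);
}

static void idle_loop_sketch(void)
{
    while (!need_resched()) {
        my_vpa.idle = 1;          /* tell the hypervisor we have nothing to run */
        h_cede();
        my_vpa.idle = 0;
    }
    printf("leaving idle loop to run tasks\n");
}

int main(void)
{
    idle_loop_sketch();
    return 0;
}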
Making use of the cede interface is especially<br />
important on systems where partitions configured<br />
to run uncapped exist. In uncapped mode,<br />
any physical processor cycles not used by other<br />
partitions can be allocated by the hypervisor to<br />
a non-idle partition, even if that partition has<br />
already consumed its defined quantity of processor<br />
units. For example, a partition that is<br />
defined as uncapped, 2 virtual processors, and<br />
20 processing units could consume 2 full processors<br />
(200 processing units), if all other partitions<br />
are idle.<br />
2.5 SMT<br />
<strong>The</strong> POWER5 processor provides symmetric<br />
multithreading (SMT) capabilities that allow<br />
two threads of execution to simultaneously execute<br />
on one physical processor. This results<br />
in twice as many processor contexts being<br />
presented to the operating system as there<br />
are physical processors. Like other processor<br />
threading mechanisms found in POWER RS64<br />
and Intel® processors, the goal of SMT is to<br />
enable higher processor utilization.<br />
At <strong>Linux</strong> boot, each processor thread is discovered<br />
in the open firmware device tree<br />
and a logical processor is created for <strong>Linux</strong>.<br />
A command line option, smt-enabled =<br />
[on, off, dynamic], has been added to allow<br />
the Linux partition to configure SMT in one of three states. The on and off modes indicate
that the processor always runs with SMT<br />
either on or off. <strong>The</strong> dynamic mode allows<br />
the operating system and firmware to dynamically<br />
configure the processor to switch between<br />
threaded (SMT) and a single threaded<br />
(ST) mode where one of the processor threads<br />
is dormant. <strong>The</strong> hardware implementation is<br />
such that running in ST mode can provide additional<br />
performance when only a single task is<br />
executing.<br />
<strong>Linux</strong> can cause the processor to switch between<br />
SMT and ST modes via the h_cede hypervisor<br />
call interface. When entering its idle<br />
loop, <strong>Linux</strong> sets the VPA idle state bit, and after<br />
a selectable delay, calls h_cede. Under<br />
this interface, the hypervisor layer determines<br />
if only one thread is idle, and if so, switches<br />
the processor into ST mode. If both threads are
idle (as indicated by the VPA idle bit), then the<br />
hypervisor keeps the processor in SMT mode<br />
and returns to the operating system.<br />
<strong>The</strong> processor switches back to SMT mode<br />
if an external or decrementer interrupt is presented,<br />
or if another processor calls the h_<br />
prod interface against the dormant thread.<br />
3 Memory Virtualization<br />
Memory is virtualized only to the extent that all<br />
partitions on the system are presented a contiguous<br />
range of logical addresses that start<br />
at zero. <strong>Linux</strong> sees these logical addresses<br />
as its real storage. <strong>The</strong> actual real memory<br />
is allocated by the hypervisor from any available<br />
space throughout the system, managing<br />
the storage in logical memory blocks (LMB’s).<br />
Each LMB is presented to the partition via<br />
a memory node in the open firmware device<br />
tree. When <strong>Linux</strong> creates a mapping in the<br />
hardware page table for effective addresses, it<br />
makes a call to the hypervisor (h_enter) indicating<br />
the effective and partition logical address.<br />
<strong>The</strong> hypervisor translates the logical address<br />
to the corresponding real address and inserts<br />
the mapping into the hardware page table.<br />
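As a rough illustration of that flow, the sketch below shows a page-table insertion routed through the hypervisor; the plpar_hcall() prototype and the H_ENTER opcode value are stand-ins rather than the actual PPC64 definitions.

/* Illustrative only: opcode value and calling convention are placeholders. */
#define H_ENTER 0x08UL

extern long plpar_hcall(unsigned long opcode, unsigned long arg1,
                        unsigned long arg2, unsigned long arg3,
                        unsigned long arg4);

/* Insert one mapping: Linux supplies the pre-built page-table entry words,
 * which reference the partition-logical address; the hypervisor translates
 * logical to machine-real and writes the hardware page-table entry itself. */
static long map_page(unsigned long flags, unsigned long pte_group,
                     unsigned long pte_hi, unsigned long pte_lo)
{
        return plpar_hcall(H_ENTER, flags, pte_group, pte_hi, pte_lo);
}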
<strong>One</strong> additional layer of memory virtualization<br />
managed by the hypervisor is a real mode offset<br />
(RMO) region. This is a 128 or 256 MB region<br />
of memory covering the first portion of the<br />
logical address space within a partition. It can<br />
be accessed by <strong>Linux</strong> when address relocation<br />
is off, for example after an exception occurs.<br />
When a partition is running relocation off and<br />
accesses addresses within the RMO region, a<br />
simple offset is added by the hardware to generate<br />
the actual storage access. In this manner,<br />
each partition has what it considers logical address<br />
zero.<br />
4 I/O Virtualization<br />
Once CPU and memory have been virtualized,<br />
a key requirement is to provide virtualized I/O.<br />
<strong>The</strong> goal of the POWER5 systems is to have,<br />
for example, 10 <strong>Linux</strong> images running on a<br />
small system with a single CPU, 1GB of memory,<br />
and a single SCSI adapter and Ethernet<br />
adapter.<br />
<strong>The</strong> approach taken to virtualize I/O is a cooperative<br />
implementation between the hypervisor<br />
and the operating system images. <strong>One</strong> operating<br />
system image always “owns” physical<br />
adapters and manages all I/O to those adapters<br />
(DMA, interrupts, etc.).
<strong>The</strong> hypervisor and Open Firmware then provide<br />
“virtual” adapters to any operating systems<br />
that require them. Creation of virtual<br />
adapters is done by the system administrator<br />
as part of logically partitioning the system. A<br />
key concept is that these virtual adapters do not<br />
interact in any way with the physical adapters.<br />
<strong>The</strong> virtual adapters interact with other operating<br />
systems in other logical partitions, which<br />
may choose to make use of physical adapters.<br />
Virtual adapters are presented to the operating<br />
system in the Open Firmware device tree just<br />
as physical adapters are. <strong>The</strong>y have very similar<br />
attributes to physical adapters, including
DMA windows and interrupts.<br />
<strong>The</strong> adapters currently supported by the hypervisor<br />
are virtual SCSI adapters, virtual Ethernet<br />
adapters, and virtual TTY adapters.<br />
4.1 Virtual Bus<br />
Virtual adapters, of course, exist on a virtual<br />
bus. <strong>The</strong> bus has slots into which virtual<br />
adapters are configured. <strong>The</strong> number of slots<br />
available on the virtual bus is configured by<br />
the system administrator. <strong>The</strong> goal is to make
the behavior of virtual adapters consistent with<br />
physical adapters. <strong>The</strong> virtual bus is not presented<br />
as a PCI bus, but rather as its own bus<br />
type.<br />
4.2 Virtual LAN<br />
Virtual LAN adapters are conceptually the simplest<br />
kind of virtual adapter. <strong>The</strong> hypervisor<br />
implements a switch, which supports 802.1Q<br />
semantics for having multiple VLANs share<br />
a physical switch. Adapters can be marked<br />
as 802.1Q aware, in which case the hypervisor<br />
expects the operating system to handle the<br />
802.1Q VLAN headers, or 802.1Q unaware, in<br />
which case the hypervisor connects the adapter<br />
to a single VLAN. Multiple virtual Ethernet<br />
adapters can be created for a given partition.<br />
Virtual Ethernet adapters have an additional attribute<br />
called “Trunk Adapter.” An adapter<br />
marked as a Trunk Adapter will be delivered<br />
all frames that don’t match any MAC address<br />
on the virtual Ethernet. This is similar, but<br />
not identical, to promiscuous mode on a real<br />
adapter.<br />
For a logical partition to have network connectivity<br />
to the outside world, the partition owning<br />
a “real” network adapter generally has both<br />
the real Ethernet adapter and a virtual Ethernet<br />
adapter marked as a Trunk adapter. That<br />
partition then performs either routing or bridging<br />
between the real adapter and the virtual<br />
adapter. <strong>The</strong> <strong>Linux</strong> bridge-utils package works<br />
well to bridge the two kinds of networks.<br />
Note that there is no architected link between the real and virtual adapters; it is the responsibility of some operating system to route traffic between them.
<strong>The</strong> implementation of the virtual Ethernet<br />
adapters involves a number of hypervisor interfaces.<br />
Some of the more significant interfaces<br />
are h_register_logical_lan to establish<br />
the initial link between a device driver and<br />
a virtual Ethernet device, h_send_logical_<br />
lan to send a frame, and h_add_logical_<br />
lan_buffer to tell the hypervisor about a<br />
data buffer into which a received frame is to be<br />
placed. <strong>The</strong> hypervisor interfaces then support<br />
either polled or interrupt driven notification of<br />
new frames arriving.<br />
For additional information on the virtual Ethernet<br />
implementation, the code is the documentation<br />
(drivers/net/ibmveth.c).<br />
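A hedged sketch of the transmit side is shown below; the descriptor layout, valid-bit position, and wrapper signature are assumptions for illustration only, and drivers/net/ibmveth.c remains the authoritative reference.

/* Descriptor layout and wrapper signature are assumptions for illustration. */
struct veth_desc {
        unsigned long long len_and_flags;   /* buffer length plus a valid bit   */
        unsigned long long dma_addr;        /* partition-logical buffer address */
};
#define VETH_DESC_VALID (1ULL << 63)        /* illustrative valid-bit position */

extern long h_send_logical_lan(unsigned long unit_addr, struct veth_desc *descs);

static int veth_xmit(unsigned long unit_addr,
                     unsigned long long frame_dma, unsigned long frame_len)
{
        struct veth_desc desc[6] = { { 0, 0 } };    /* one fragment used here */

        desc[0].dma_addr      = frame_dma;
        desc[0].len_and_flags = frame_len | VETH_DESC_VALID;

        /* The hypervisor switch delivers the frame to the destination
         * virtual adapter; a Trunk Adapter partition may then bridge or
         * route it onto a physical Ethernet. */
        return h_send_logical_lan(unit_addr, desc) == 0 ? 0 : -1;
}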
4.3 Virtual SCSI<br />
Unlike virtual Ethernet adapters, virtual SCSI<br />
adapters come in two flavors. A “client” virtual<br />
SCSI adapter behaves just as a regular<br />
SCSI host bus adapter and is implemented<br />
within the SCSI framework of the <strong>Linux</strong> kernel.<br />
<strong>The</strong> SCSI mid-layer issues standard SCSI<br />
commands such as Inquiry to determine devices<br />
connected to the adapter, and issues regular<br />
SCSI operations to those devices.<br />
A “server” virtual SCSI adapter, generally in a<br />
different partition than the client, receives all<br />
the SCSI commands from the client and is responsible<br />
for handling them. <strong>The</strong> hypervisor<br />
is not involved in what the server does with<br />
the commands. <strong>The</strong>re is no requirement for<br />
the server to link a virtual SCSI adapter to any<br />
kind of real adapter. <strong>The</strong> server can process<br />
and return SCSI responses in any fashion it<br />
likes. If it happens to issue I/O operations to a<br />
real adapter as part of satisfying those requests,<br />
that is an implementation detail of the operating<br />
system containing the server adapter.<br />
<strong>The</strong> hypervisor provides two very primitive<br />
interpartition communication mechanisms on<br />
which the virtual SCSI implementation is built.<br />
<strong>The</strong>re is a queue of 16 byte messages referred<br />
to as a “Command/Response Queue” (CRQ).<br />
Each partition provides the hypervisor with a
page of memory where its receive queue resides,<br />
and a partition wishing to send a message<br />
to its partner’s queue issues an h_send_crq<br />
hypervisor call. When a message is received<br />
on the queue, an interrupt is (optionally) generated<br />
in the receiving partition.<br />
<strong>The</strong> second hypervisor mechanism is a facility<br />
for issuing DMA operations between partitions.<br />
<strong>The</strong> h_copy_rdma call is used to<br />
DMA a block of memory from the memory<br />
space of one logical partition to the memory<br />
space of another.<br />
<strong>The</strong> virtual SCSI interpartition protocol is<br />
implemented using the ANSI “SCSI RDMA<br />
Protocol” (SRP) (available at http://www.<br />
t10.org). When the client wishes to issue a<br />
SCSI operation, it builds an SRP frame, and<br />
sends the address of the frame in a 16 byte<br />
CRQ message. <strong>The</strong> server DMA’s the SRP<br />
frame from the client, and processes it. <strong>The</strong><br />
SRP frame may itself contain DMA addresses<br />
required for data transfer (read or write buffers,<br />
for example) which may require additional interpartition<br />
DMA operations. When the operation<br />
is complete, the server DMA’s the SRP<br />
response back to the same location as the SRP<br />
command came from and sends a 16 byte CRQ<br />
message back indicating that the SCSI command<br />
has completed.<br />
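The client side of that exchange might look roughly like the sketch below; the CRQ message layout, the valid/format values, and the h_send_crq() wrapper signature are illustrative assumptions.

#include <string.h>

struct crq_msg {                        /* one 16-byte CRQ element */
        unsigned char valid;            /* illustrative: 0x80 = valid     */
        unsigned char format;           /* illustrative: 1 = SRP command  */
        unsigned char reserved[6];
        unsigned long long srp_frame_addr;  /* DMA address of the SRP frame */
};

extern long h_send_crq(unsigned long unit_addr,
                       unsigned long word0, unsigned long word1);

static int send_srp_command(unsigned long unit_addr,
                            unsigned long long srp_frame_dma)
{
        struct crq_msg msg = { .valid = 0x80, .format = 1 };
        unsigned long words[2];         /* assumes 64-bit longs (PPC64) */

        msg.srp_frame_addr = srp_frame_dma;
        memcpy(words, &msg, sizeof(words));

        /* The server partition will h_copy_rdma() the SRP frame across,
         * perform the SCSI operation, and answer with its own CRQ message. */
        return h_send_crq(unit_addr, words[0], words[1]) == 0 ? 0 : -1;
}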
<strong>The</strong> current <strong>Linux</strong> virtual SCSI server decodes<br />
incoming SCSI commands and issues<br />
block layer commands (generic_make_<br />
request). This allows the SCSI server to<br />
share any block device (e.g., /dev/sdb6 or<br />
/dev/loop0) with client partitions as a virtual<br />
SCSI device.<br />
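A schematic example of that hand-off, assuming a 2.6-era block-layer API and ignoring error handling and completion plumbing, could look like this (it is not the actual ibmvscsi server code):

#include <linux/bio.h>
#include <linux/blkdev.h>

/* Read one page from a shared block device on behalf of a client request;
 * the completion callback would DMA the data back and send the CRQ reply. */
static int vscsi_read_one_page(struct block_device *bdev, sector_t sector,
                               struct page *page,
                               bio_end_io_t *done, void *private)
{
        struct bio *bio = bio_alloc(GFP_KERNEL, 1);

        if (!bio)
                return -ENOMEM;

        bio->bi_bdev    = bdev;         /* e.g. /dev/sdb6 or /dev/loop0 */
        bio->bi_sector  = sector;
        bio->bi_end_io  = done;
        bio->bi_private = private;
        bio_add_page(bio, page, PAGE_SIZE, 0);

        generic_make_request(bio);      /* hand the request to the block layer */
        return 0;
}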
Note that consideration was given to using protocols<br />
such as iSCSI for device sharing between<br />
partitions. <strong>The</strong> virtual SCSI SRP design<br />
above, however, is a much simpler design<br />
that does not rely on riding above an existing<br />
IP stack. Additionally, the ability to use DMA<br />
operations between partitions fits much better<br />
into the SRP model than an iSCSI model.<br />
<strong>The</strong> <strong>Linux</strong> virtual SCSI client (drivers/<br />
scsi/ibmvscsi/ibmvscsi.c) is close, at<br />
the time of writing, to being accepted into the<br />
<strong>Linux</strong> mainline. <strong>The</strong> <strong>Linux</strong> virtual SCSI server<br />
is sufficiently unlike existing SCSI drivers that<br />
it will require much more mailing list “discussion.”<br />
4.4 Virtual TTY<br />
In addition to virtual Ethernet and SCSI<br />
adapters, the hypervisor supports virtual serial<br />
(TTY) adapters. As with SCSI adapters, these can be configured as “client” and “server” adapters and connected between partitions.
<strong>The</strong> first virtual TTY adapter is used as<br />
the system console, and is treated specially by<br />
the hypervisor. It is automatically connected to<br />
the partition console on the Hardware Management<br />
Console.<br />
To date, multiple concurrent “consoles” have<br />
not been implemented, but they could be. Similarly,<br />
this interface could be used for kernel<br />
debugging as with any serial port, but such an<br />
implementation has not been done.<br />
5 Dynamic Resource Movement<br />
As mentioned for processors, the logical partition<br />
environment lends itself to moving resources<br />
(processors, memory, I/O) between<br />
partitions. In a perfect world, such movement<br />
should be done dynamically while the operating<br />
system is running. Dynamic movement of<br />
processors is currently being implemented, and<br />
dynamic movement of I/O devices (including<br />
dynamically adding and removing virtual I/O<br />
devices) is included in the kernel mainline.<br />
<strong>The</strong> one area for future work in <strong>Linux</strong> is the dynamic<br />
movement of memory into and out of an
active partition. This function is already supported<br />
on other POWER5 operating systems,<br />
so there is an opportunity for <strong>Linux</strong> to catch<br />
up.<br />
6 Multiple Operating Systems<br />
A key feature of the POWER5 systems is the<br />
ability to run different operating systems in<br />
different logical partitions on the same physical<br />
system. <strong>The</strong> operating systems currently<br />
supported on the POWER5 hardware are AIX,<br />
OS/400, and <strong>Linux</strong>.<br />
While running multiple operating systems, all<br />
of the functions for interpartition interaction described
above must work between operating<br />
systems. For example, idle cycles from an AIX<br />
partition can be given to <strong>Linux</strong>. A processor<br />
can be moved from OS/400 to <strong>Linux</strong> while<br />
both operating systems are active.<br />
For I/O, multiple operating systems must be<br />
able to communicate over the virtual Ethernet,<br />
and SCSI devices must be sharable from (say)<br />
an AIX virtual SCSI server to a <strong>Linux</strong> virtual<br />
SCSI client.<br />
<strong>The</strong>se requirements, along with the architected<br />
hypervisor interfaces, limit the ability to<br />
change implementations just to fit Linux kernel-internal behavior.
7 Conclusions<br />
While many of the basic virtualization technologies<br />
described in this paper existed in the<br />
<strong>Linux</strong> implementation provided on POWER<br />
RS64 and POWER4 iSeries systems [Bou01],<br />
they have been significantly enhanced for<br />
POWER5 to better use the firmware provided<br />
interfaces.<br />
<strong>The</strong> introduction of POWER5-based systems<br />
converged all of the virtualization interfaces<br />
provided by firmware on legacy iSeries and<br />
pSeries systems to a model in line with the<br />
legacy pSeries partitioned system architecture.<br />
As a result much of the PPC64 <strong>Linux</strong> virtualization<br />
code was updated to use these new virtualization<br />
interface definitions.<br />
8 Acknowledgments<br />
<strong>The</strong> authors would like to thank the entire<br />
<strong>Linux</strong>/PPC64 team for the work that went into<br />
the POWER5 virtualization effort. In particular<br />
Anton Blanchard, Paul Mackerras, Rusty<br />
Russell, Hollis Blanchard, Santiago Leon, Ryan
Arnold, Will Schmidt, Colin Devilbiss, Kyle<br />
Lucke, Mike Corrigan, Jeff Scheel, and David<br />
Larson.<br />
9 Legal Statement<br />
This paper represents the view of the authors, and<br />
does not necessarily represent the view of IBM.<br />
IBM, AIX, iSeries, OS/400, POWER, POWER4,<br />
POWER5, and pSeries are trademarks or registered<br />
trademarks of International Business Machines<br />
Corporation in the United States, other countries,<br />
or both.<br />
Other company, product or service names may be<br />
trademarks or service marks of others.
References<br />
[AAN00] Bill Armstrong, Troy Armstrong,<br />
Naresh Nayar, Ron Peterson, Tom Sand,<br />
and Jeff Scheel. Logical Partitioning,<br />
http://www-1.ibm.com/servers/<br />
eserver/iseries/beyondtech/<br />
lpar.htm.
[Bou01] David Boutcher, The iSeries Linux Kernel, 2001 Linux Symposium (July 2001).
<strong>The</strong> State of ACPI in the <strong>Linux</strong> <strong>Kernel</strong><br />
A. Leonard Brown<br />
Intel<br />
len.brown@intel.com<br />
Abstract<br />
ACPI puts <strong>Linux</strong> in control of configuration<br />
and power management. It abstracts the platform<br />
BIOS and hardware so <strong>Linux</strong> and the<br />
platform can interoperate while evolving independently.<br />
This paper starts with some background on the<br />
ACPI specification, followed by the state of<br />
ACPI deployment on <strong>Linux</strong>.<br />
It describes the implementation architecture of<br />
ACPI on <strong>Linux</strong>, followed by details on the configuration<br />
and power management features.<br />
It closes with a summary of ACPI bugzilla activity,<br />
and a list of what is next for ACPI in<br />
<strong>Linux</strong>.<br />
1 ACPI Specification Background<br />
“ACPI (Advanced Configuration and<br />
Power Interface) is an open industry<br />
specification co-developed by<br />
Hewlett-Packard, Intel, Microsoft,<br />
Phoenix, and Toshiba.<br />
ACPI establishes industry-standard<br />
interfaces for OS-directed configuration<br />
and power management on laptops,<br />
desktops, and servers.<br />
ACPI evolves the existing collection<br />
of power management BIOS<br />
code, Advanced Power Management<br />
(APM) application programming<br />
interfaces (APIs), PNPBIOS APIs, Multiprocessor Specification (MPS) tables and so on into a well-defined
power management and configuration<br />
interface specification.” 1<br />
ACPI 1.0 was published in 1996. 2.0 added<br />
64-bit support in 2000. ACPI 3.0 is expected<br />
in summer 2004.<br />
2 <strong>Linux</strong> ACPI Deployment<br />
<strong>Linux</strong> supports ACPI on three architectures:<br />
ia64, i386, and x86_64.<br />
2.1 ia64 <strong>Linux</strong>/ACPI support<br />
Most ia64 platforms require ACPI support,<br />
as they do not have the legacy configuration<br />
methods seen on i386. All the <strong>Linux</strong> distributions<br />
that support ia64 include ACPI support,<br />
whether they’re based on <strong>Linux</strong>-2.4 or <strong>Linux</strong>-<br />
2.6.<br />
2.2 i386 <strong>Linux</strong>/ACPI support<br />
Not all <strong>Linux</strong>-2.4 distributions enabled ACPI<br />
by default on i386. Often they used<br />
just enough table parsing to enable Hyper-<br />
Threading (HT), à la acpi=ht below, and relied
on MPS and PIRQ routers to configure the<br />
1 http://www.acpi.info
setup_arch()<br />
dmi_scan_machine()<br />
Scan DMI blacklist<br />
BIOS Date vs Jan 1, 2001<br />
acpi_boot_init()<br />
acpi_table_init()<br />
locate and checksum all ACPI tables<br />
print table headers to console<br />
acpi_blacklisted()<br />
ACPI table headers vs. blacklist<br />
parse(BOOT) /* Simple Boot Flags */<br />
parse(FADT) /* PM timer address */<br />
parse(MADT) /* LAPIC, IOAPIC */<br />
parse(HPET) /* HiPrecision Timer */<br />
parse(MCFG) /* PCI Express base */<br />
Figure 1: Early ACPI init on i386<br />
machine. Some included ACPI support by default,<br />
but required the user to add acpi=on to<br />
the cmdline to enable it.<br />
So far, the major <strong>Linux</strong> 2.6 distributions all<br />
support ACPI enabled by default on i386.<br />
Several methods are used to make it more practical to deploy ACPI onto the i386 installed base. Figure 1 shows the early ACPI startup on i386 and where these methods hook in.

1. Most modern system BIOSes support DMI, which exports the date of the BIOS. The Linux DMI scan on i386 disables ACPI on platforms with a BIOS older than January 1, 2001. There is nothing magic about this date, except that it allowed developers to focus on recent platforms without getting distracted debugging issues on very old platforms that:

(a) had been running Linux without ACPI support for years.

(b) had virtually no chance of a BIOS update from the OEM.

The boot parameter acpi=force is available to enable ACPI on platforms older than the cutoff date.

2. DMI also exports the hardware manufacturer, baseboard name, BIOS version, etc., which you can observe with dmidecode. 2 dmi_scan.c has a general-purpose blacklist that keys off this information and invokes various platform-specific workarounds. acpi=off is the most severe, disabling all ACPI support, even the simple table parsing needed to enable Hyper-Threading (HT). acpi=ht does the same, except it parses enough tables to enable HT. pci=noacpi disables ACPI for PCI enumeration and interrupt configuration. And acpi=noirq disables ACPI just for interrupt configuration.

3. The ACPI tables also contain header information, which you see near the top of the kernel messages. ACPI maintains a blacklist based on the table headers, but this blacklist is somewhat primitive. When an entry matches the system, it either prints warnings or invokes acpi=off.
All three of these methods share the problem<br />
that if they are successful, they tend to hide<br />
root-cause issues in <strong>Linux</strong> that should be fixed.<br />
For this reason, adding to the blacklists is discouraged<br />
in the upstream kernel. <strong>The</strong>ir main<br />
value is to allow <strong>Linux</strong> distributors to quickly<br />
react to deployment issues when they need to<br />
support deviant platforms.<br />
2.3 x86_64 <strong>Linux</strong>/ACPI support<br />
All x86_64 platforms I’ve seen include ACPI<br />
support. <strong>The</strong> major x86_64 <strong>Linux</strong> distributions,<br />
whether <strong>Linux</strong>-2.4 or <strong>Linux</strong>-2.6 based,<br />
all support ACPI.<br />
2 http://www.nongnu.org/dmidecode
3 Implementation Overview<br />
<strong>The</strong> ACPI specification describes platform registers,<br />
ACPI tables, and operation of the ACPI<br />
BIOS. Figure 2 shows these ACPI components<br />
logically as a layer above the platform specific<br />
hardware and firmware.<br />
<strong>The</strong> ACPI kernel support centers around the<br />
ACPICA (ACPI Component Architecture 3 )<br />
core. ACPICA includes the AML 4 interpreter<br />
that implements ACPI’s hardware abstraction.<br />
ACPICA also implements other OS-agnostic<br />
parts of the ACPI specification. <strong>The</strong> ACPICA<br />
code does not implement any policy; that is the
realm of the <strong>Linux</strong>-specific code. A single file,<br />
osl.c, glues ACPICA to the <strong>Linux</strong>-specific<br />
functions it requires.<br />
<strong>The</strong> box in Figure 2 labeled “<strong>Linux</strong>/ACPI” represents<br />
the <strong>Linux</strong>-specific ACPI code, including<br />
boot-time configuration.<br />
Optional “ACPI drivers,” such as Button, Battery,<br />
Processor, etc. are (optionally loadable)<br />
modules that implement policy related to those<br />
specific features and devices.<br />
3.1 Events<br />
ACPI registers for a “System Control Interrupt”<br />
(SCI) and all ACPI events come through<br />
that interrupt.<br />
<strong>The</strong> kernel interrupt handler de-multiplexes the<br />
possible events using ACPI constructs. In<br />
some cases, it then delivers events to a userspace<br />
application such as acpid via /proc/<br />
acpi/events.<br />
3 http://www.intel.com/technology/<br />
iapc/acpi<br />
4 AML, ACPI Machine Language.<br />
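For illustration, a minimal user-space consumer of these events, in the style of acpid, might look like the following; the exact proc path and the event line format vary with kernel version.

#include <stdio.h>

int main(void)
{
        char line[256];
        FILE *f = fopen("/proc/acpi/event", "r");   /* path varies by kernel */

        if (!f) {
                perror("open /proc/acpi/event");
                return 1;
        }
        /* Each line describes one event (device class, bus id, type, data),
         * e.g. a power-button press. */
        while (fgets(line, sizeof(line), f))
                printf("ACPI event: %s", line);
        fclose(f);
        return 0;
}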
!"<br />
"! !"<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
#<br />
$<br />
!<br />
&<br />
"!!<br />
'<br />
%<br />
<br />
<br />
<br />
<br />
<br />
<br />
Figure 2: Implementation Architecture<br />
4 ACPI Configuration<br />
Interrupt configuration on i386 dominated the<br />
ACPI bug fixing activity over the last year.<br />
<strong>The</strong> algorithm to configure interrupts on an<br />
i386 system with an IOAPIC is shown in Figure<br />
3. ACPI mandates that all PIC mode IRQs<br />
be identity mapped to IOAPIC pins. Exceptions<br />
are specified in MADT 5 interrupt source<br />
override entries.<br />
Over-rides are often used, for example, to specify<br />
that the 8254 timer on IRQ0 in PIC mode<br />
does not use pin0 on the IOAPIC, but uses<br />
pin2. Over-rides also often move the ACPI SCI<br />
to a different pin in IOAPIC mode than it had<br />
in PIC mode, or change its polarity or trigger<br />
from the default.<br />
5 MADT, Multiple APIC Description Table.
setup_arch()<br />
acpi_boot_init()<br />
parse(MADT);<br />
parse(LAPIC); /* processors */<br />
parse(IOAPIC)<br />
parse(INT_SRC_OVERRIDE);<br />
add_identity_legacy_mappings();<br />
/* mp_irqs[] initialized */<br />
init()<br />
smp_boot_cpus()<br />
setup_IO_APIC()<br />
enable_IO_APIC();<br />
setup_IO_APIC_irqs(); /* mp_irqs[] */<br />
do_initcalls()<br />
acpi_init()<br />
"ACPI: Subsystem revision 20040326"<br />
acpi_initialize_subsystem();<br />
/* AML interpreter */<br />
acpi_load_tables(); /* DSDT */<br />
acpi_enable_subsystem();<br />
/* HW into ACPI mode */<br />
"ACPI: Interpreter enabled"<br />
acpi_bus_init_irq();<br />
AML(_PIC, PIC | IOAPIC | IOSAPIC);<br />
acpi_pci_link_init()<br />
for(every PCI Link in DSDT)<br />
acpi_pci_link_add(Link)<br />
AML(_PRS, Link);<br />
AML(_CRS, Link);<br />
"... Link [LNKA] (IRQs 9 10 *11)"<br />
pci_acpi_init()<br />
"PCI: Using ACPI for IRQ routing"<br />
acpi_irq_penalty_init();<br />
for (PCI devices)<br />
acpi_pci_irq_enable(device)<br />
acpi_pci_irq_lookup()<br />
find _PRT entry<br />
if (Link) {<br />
acpi_pci_link_get_irq()<br />
acpi_pci_link_allocate()<br />
examine possible & current IRQs<br />
AML(_SRS, Link)<br />
} else {<br />
use hard-coded IRQ in _PRT entry<br />
}<br />
acpi_register_gsi()<br />
mp_register_gsi()<br />
io_apic_set_pci_routing()<br />
"PCI: PCI interrupt 00:06.0[A] -><br />
GSI 26 (level, low) -> IRQ 26"<br />
Figure 3: Interrupt Initialization<br />
So after identifying that the system will be in<br />
IOAPIC mode, the first step is to record all the
Interrupt Source Overrides in mp_irqs[].<br />
<strong>The</strong> second step is to add the legacy identity<br />
mappings where pins and IRQs have not been<br />
consumed by the over-rides.<br />
Step three is to digest mp_irqs[] in<br />
setup_IO_APIC_irqs(), just like it<br />
would be if the system were running in legacy<br />
MPS mode.<br />
But that is just the start of interrupt configuration<br />
in ACPI mode. <strong>The</strong> system still needs<br />
to enable the mappings for PCI devices, which<br />
are stored in the DSDT 6 _PRT 7 entries. Further,<br />
the _PRT can contain either static entries, analogous to MPS table entries, or dynamic entries that use PCI Interrupt Link Devices.
So <strong>Linux</strong> enables the AML interpreter and informs<br />
the ACPI BIOS that it plans to run the<br />
system in IOAPIC mode.<br />
Next the PCI Interrupt Link Devices are<br />
parsed. <strong>The</strong>se “links” are abstract versions of<br />
what used to be called PIRQ-routers, though<br />
they are more general. acpi_pci_link_<br />
init() searches the DSDT for Link Devices<br />
and queries each about the IRQs it can be set<br />
to (_PRS) 8 and the IRQ that it is already set to<br />
(_CRS). 9
A penalty table is used to help decide how<br />
to program the PCI Interrupt Link Devices.<br />
Weights are statically compiled into the table<br />
to avoid programming the links to well<br />
known legacy IRQs. acpi_irq_penalty_<br />
init() updates the table to add penalties to<br />
the IRQs where the Links have possible set-<br />
6 DSDT, Differentiated Services Description Table,<br />
written in AML<br />
7 _PRT, PCI Routing Table<br />
8 PRS, Possible Resource Settings.<br />
9 CRS, Current Resource Settings.
tings. <strong>The</strong> idea is to minimize IRQ sharing,<br />
while not conflicting with legacy IRQ use.<br />
While it works reasonably well in practice, this<br />
heuristic is inherently flawed because it assumes<br />
the legacy IRQs rather than asking the<br />
DSDT what legacy IRQs are actually in use. 10<br />
<strong>The</strong> PCI sub-system calls acpi_pci_irq_<br />
enable() for every device. ACPI looks up<br />
the device in the _PRT by device-id and, if it is a simple static entry, programs the IOAPIC.
If it is a dynamic entry, acpi_pci_link_<br />
allocate() chooses an IRQ for the link and<br />
programs the link via AML (_SRS). 11 <strong>The</strong>n the<br />
associated IOAPIC entry is programmed.<br />
Later, the drivers initialize and call request_<br />
irq(IRQ) with the IRQ the PCI sub-system<br />
told it to request.<br />
<strong>One</strong> issue we have with this scheme is that it<br />
can’t automatically recover when the heuristic<br />
balancing act fails. For example, when the parallel port grabs IRQ7 and a PCI Interrupt Link gets programmed to the same IRQ, then
request_irq(IRQ) correctly fails to put<br />
ISA and PCI interrupts on the same pin. But<br />
the system doesn’t realize that one of the contenders<br />
could actually be re-programmed to a<br />
different IRQ.<br />
<strong>The</strong> fix for this issue will be to delete the<br />
heuristic weights from the IRQ penalty table.<br />
Instead the kernel should scan the DSDT to<br />
enumerate exactly what legacy devices reserve<br />
exactly what IRQs. 12<br />
10 In PIC mode, the default is to keep the BIOS provided<br />
current IRQ setting, unless cmdline acpi_irq_<br />
balance is used. Balancing is always enabled in<br />
IOAPIC mode.<br />
11 SRS, Set Resource Setting<br />
12 bugzilla 2733<br />
4.1 Issues With PCI Interrupt Link Devices<br />
Most of the issues have been with PCI Interrupt<br />
Link Devices, an ACPI mechanism primarily<br />
used to replace the chip-set-specific Legacy<br />
PIRQ code.<br />
• <strong>The</strong> status (_STA) returned by a PCI Interrupt<br />
Link Device does not matter. Some<br />
systems mark the ones we should use as<br />
enabled, some do not.<br />
• <strong>The</strong> status set by <strong>Linux</strong> on a link is important<br />
on some chip sets. If we do<br />
not explicitly disable some unused links,<br />
they result in tying together IRQs and can<br />
cause spurious interrupts.<br />
• <strong>The</strong> current setting returned by a link<br />
(_CRS) cannot always be trusted. Some systems always return invalid settings.
<strong>Linux</strong> must assume that when it sets a<br />
link, the setting was successful.<br />
• Some systems return a current setting that<br />
is outside the list of possible settings. Per<br />
above, this must be ignored and a new setting<br />
selected from the possible-list.<br />
4.2 Issues With ACPI SCI Configuration<br />
Another area that was ironed out this year<br />
was the ACPI SCI (System Control Interrupt).<br />
Originally, the SCI was always configured as<br />
level/low, but SCI failures didn’t stop until<br />
we implemented the algorithm in Figure 4.<br />
During debugging, the kernel gained the cmdline<br />
option that applies to either PIC or IOAPIC<br />
mode: acpi_sci={level,edge,high,<br />
low}, but production systems seem to be working properly, and recently this option has seen use only to work around prototype BIOS bugs.
if (PIC mode) {
    set ELCR to level trigger;
} else { /* IOAPIC mode */
    if (Interrupt Source Override) {
        use IRQ specified in override;
        if (trigger edge or level)
            use edge or level;
        else /* compatible trigger */
            use level;
        if (polarity high or low)
            use high or low;
        else
            use low;
    } else { /* no Override */
        use level trigger;
        use low polarity;
    }
}
Figure 4: SCI configuration algorithm
4.3 Unresolved: Local APIC Timer Issue<br />
<strong>The</strong> most troublesome configuration issue today<br />
is that many systems with no IO-APIC will<br />
hang during boot unless their LOCAL-APIC<br />
has been disabled, e.g., by booting nolapic.
While this issue has gone away on several systems<br />
with BIOS upgrades, entire product lines<br />
from high-volume OEMs appear to be subject to this failure. The current workaround is to disable the LAPIC timer for the duration of the SMI-CMD update that enables ACPI mode. 13
4.4 Wanted: Generic <strong>Linux</strong> Driver Manager<br />
<strong>The</strong> ACPI DSDT enumerates motherboard devices<br />
via PNP identifiers. This method is used<br />
to load the ACPI-specific devices today, e.g., battery, button, fan, thermal, etc., as well as 8250_acpi. PCI devices are enumerated via
PCI-ids from PCI config space. Legacy devices<br />
probe out using hard-coded address values.<br />
But a device driver should not have to know or<br />
13 http://bugzilla.kernel.org 1269<br />
Figure 5: ACPI Global, CPU, and Sleep states.
care how it is enumerated by its parent bus. An<br />
8250 driver should worry about the 8250 and<br />
not if it is being discovered by legacy means,<br />
ACPI enumeration, or PCI.<br />
One fix would be to abstract the PCI-ids,
PNP-ids, and perhaps even some hard-coded<br />
values into a generic device manager directory<br />
that maps them to device drivers.<br />
This would simply add a veneer to the PCI<br />
device configuration, simplifying a very small<br />
number of drivers that can be configured by<br />
PCI or ACPI. However, it would also fix the<br />
real issue that the configuration information in<br />
the ACPI DSDT for most motherboard devices<br />
is currently not parsed and not communicated<br />
to any <strong>Linux</strong> drivers.<br />
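A hypothetical sketch of such a mapping table is shown below; the structure, the PCI IDs chosen, and the driver names are illustrative only.

/* Hypothetical mapping from enumeration-specific IDs to one driver. */
struct devmgr_entry {
        const char  *pnp_id;    /* from the ACPI DSDT, e.g. "PNP0501"   */
        unsigned int vendor;    /* PCI vendor ID, or 0 if none          */
        unsigned int device;    /* PCI device ID, or 0 if none          */
        const char  *driver;    /* driver that should claim the device  */
};

static const struct devmgr_entry devmgr_table[] = {
        { "PNP0501", 0x0000, 0x0000, "serial8250" }, /* 16550-compatible UART */
        { NULL,      0x8086, 0x1229, "e100"       }, /* illustrative PCI entry */
        { NULL,      0x0000, 0x0000, NULL         }  /* terminator */
};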
The device driver manager would also be able to tell the power management sub-system which methods are used to power-manage the device, e.g., PCI or ACPI.
5 ACPI Power Management<br />
<strong>The</strong> Global System States defined by ACPI are<br />
illustrated in Figure 5. G0 is the working state,<br />
G1 is sleeping, G2 is soft-off and G3 is mechanical<br />
off. <strong>The</strong> “Legacy” state illustrates<br />
where the system is not in ACPI mode.
5.1 P-states<br />
In the context of G0 – Global Working State,<br />
and C0 – CPU Executing State, P-states (Performance<br />
states) are available to reduce power<br />
of the running processor. P-states simultaneously<br />
modulate both the MHz and the voltage.<br />
As power varies by voltage squared, P-states<br />
are extremely effective at saving power.<br />
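That quadratic dependence can be made explicit with the standard first-order model for dynamic CMOS power (a textbook approximation, not taken from this paper):

P_{\mathrm{dyn}} \approx C \cdot V^{2} \cdot f

Under this model, a P-state that lowers both voltage and frequency by 20% yields P'/P ≈ 0.8² × 0.8 = 0.512, roughly half the power, whereas throttling alone (Section 5.2) scales only with f.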
While P-states are extremely important, the<br />
cpufreq sub-system handles P-states on a<br />
number of different platforms, and the topic is<br />
best addressed in that larger context.<br />
5.2 Throttling<br />
In the context of the G0-Working, C0-<br />
Executing state, Throttling states are defined to<br />
modulate the frequency of the running processor.<br />
Power varies (almost) directly with MHz, so<br />
when the MHz is cut in half, so is the power.
Unfortunately, so is the performance.<br />
<strong>Linux</strong> currently uses Throttling only in response<br />
to thermal events where the processor<br />
is too hot. However, in the future, <strong>Linux</strong> could<br />
add throttling when the processor is already in<br />
the lowest P-state to save additional power.<br />
Note that most processors also include a<br />
backup <strong>The</strong>rmal Monitor throttling mechanism<br />
in hardware, set with higher temperature<br />
thresholds than ACPI throttling. Most processors<br />
also have in hardware a thermal emergency
shutdown mechanism.<br />
5.3 C-states<br />
In the context of G0 Working system state, C-<br />
state (CPU-state) C0 is used to refer to the executing<br />
state. Higher number C-states are entered<br />
to save successively more power when<br />
the processor is idle. No instructions are executed<br />
when in C1, C2, or C3.<br />
ACPI replaces the default idle loop so it can<br />
enter C1, C2 or C3. <strong>The</strong> deeper the C-state,<br />
the more power savings, but the higher the latency<br />
to enter/exit the C-state. You can observe<br />
the C-states supported by the system and<br />
the success at using them in /proc/acpi/<br />
processor/CPU0/power.
C1 is included in every processor and has<br />
negligible latency. C1 is implemented with<br />
the HALT or MONITOR/MWAIT instructions.<br />
Any interrupt will automatically wake the processor<br />
from C1.<br />
C2 has higher latency (though always under<br />
100 usec) and higher power savings than C1.<br />
It is entered through writes to ACPI registers<br />
and exits automatically with any interrupt.<br />
C3 has higher latency (though always under<br />
1000 usec) and higher power savings than C2.<br />
It is entered through writes to ACPI registers<br />
and exits automatically with any interrupt or<br />
bus master activity. <strong>The</strong> processor does not<br />
snoop its cache when in C3, which is why busmaster<br />
(DMA) activity will wake it up. <strong>Linux</strong><br />
sees several implementation issues with C3 today:<br />
1. C3 is enabled even if the latency is up to<br />
1000 usec. This compares with the <strong>Linux</strong><br />
2.6 clock tick rate of 1000 Hz, i.e., a tick period of 1 ms = 1000 usec. So when a clock tick causes
C3 to exit, it may take all the way to the<br />
next clock tick to execute the next kernel<br />
instruction. So the benefit of C3 is lost<br />
because the system effectively pays C3 latency<br />
and gets negligible C3 residency to<br />
save power.<br />
2. Some devices do not tolerate the DMA<br />
latency introduced by C3. <strong>The</strong>ir device<br />
buffers underrun or overflow. This is cur-
rently an issue with the ipw2100 WLAN<br />
NIC.<br />
3. Some platforms can lie about C3 latency<br />
and transparently put the system into a<br />
higher latency C4 when we ask for C3—<br />
particularly when running on batteries.<br />
4. Many processors halt their local APIC<br />
timer (a.k.a. TSC – Timer Stamp Counter)<br />
when in C3. You can observe this<br />
by watching LOC fall behind IRQ0 in<br />
/proc/interrupts.<br />
5. USB makes it virtually impossible to enter<br />
C3 because of constant bus master activity.<br />
<strong>The</strong> workaround at the moment is<br />
to unplug your USB devices when idle.<br />
Longer term, it will take enhancements<br />
to the USB sub-system to address this issue.<br />
Ie. USB software needs to recognize<br />
when devices are present but idle, and reduce<br />
the frequency of bus master activity.<br />
<strong>Linux</strong> decides which C-state to enter on idle<br />
based on a promotion/demotion algorithm.<br />
<strong>The</strong> current algorithm measures the residency<br />
in the current C-state. If it meets a threshold,
the processor is promoted to the deeper C-state<br />
on re-entrance into idle. If it was too short, then<br />
the processor is demoted to a lower-numbered<br />
C-state.<br />
Unfortunately, the demotion rules are overly<br />
simplistic, as <strong>Linux</strong> tracks only its previous<br />
success at being idle, and doesn’t yet account<br />
for the load on the system.<br />
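In schematic form (thresholds, field names, and the exact bookkeeping are illustrative, not the actual processor-idle code), the policy amounts to something like:

/* Thresholds and field names are illustrative, not the real processor code. */
struct cstate {
        unsigned int promote_threshold_us;  /* stayed longer than this: go deeper */
        unsigned int demote_threshold_us;   /* woken sooner than this: back off   */
};

static int pick_next_cstate(const struct cstate *cs, int cur, int deepest,
                            unsigned int last_residency_us)
{
        if (last_residency_us > cs[cur].promote_threshold_us && cur < deepest)
                return cur + 1;     /* idle periods are long enough: promote */
        if (last_residency_us < cs[cur].demote_threshold_us && cur > 1)
                return cur - 1;     /* woken too quickly: demote */
        return cur;                 /* otherwise stay in the current C-state */
}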
Support for deeper C-states via the _CST<br />
method is currently in prototype. Hopefully<br />
this method will also give the OS more accurate<br />
data than the FADT about the latency associated<br />
with C3. If it does not, then we may<br />
need to consider discarding the table-provided<br />
latencies and measuring the actual latency at<br />
boot time.<br />
5.4 Sleep States<br />
ACPI names sleep states S0 – S5. S0 is the
non-sleep state, synonymous with G0. S1 is<br />
standby; it halts the processor and turns off the
display. Of course turning off the display on an<br />
idle system saves the same amount of power<br />
without taking the system off line, so S1 isn’t<br />
worth much. S2 is deprecated. S3 is suspend to<br />
RAM. S4 is hibernate to disk. S5 is soft-power<br />
off, AKA G2.<br />
Sleep states are unreliable enough on <strong>Linux</strong> today<br />
that they’re best considered “experimental.”<br />
Suspend/Resume suffers from (at least)<br />
two systematic problems:<br />
• Using __init and __initdata on items that may be referenced after boot (say, during resume) is a bad idea.
• PCI configuration space is not uniformly<br />
saved and restored either for devices or<br />
for PCI bridges. This can be observed<br />
by using lspci before and after a suspend/resume<br />
cycle. Sometimes setpci<br />
can be used to repair this damage from<br />
user-space.<br />
5.5 Device States<br />
Not shown on the diagram, ACPI defines<br />
power saving states for devices: D0 – D3. D0<br />
is on, D3 is off, D1 and D2 are intermediate.<br />
Higher device states have<br />
1. more power savings,<br />
2. less device context saved by hardware,<br />
3. more device driver state restoring,<br />
4. higher restore latency.
ACPI defines semantics for each device state in<br />
each device class. In practice, D1 and D2 are<br />
often optional, as many devices support only on and off, either because they are low-latency or because they are simple.
<strong>Linux</strong>-2.6 includes an updated device driver<br />
model to accommodate power management. 14<br />
This model is highly compatible with PCI and<br />
ACPI. However, this vision is not yet fully realized.<br />
To do so, <strong>Linux</strong> needs a global power<br />
policy manager.<br />
5.6 Wanted: Generic <strong>Linux</strong> Run-time Power<br />
Policy Manager<br />
PCI device drivers today call pci_set_<br />
power_state() to enter D-states. This uses<br />
the power management capabilities in the PCI<br />
power management specification.<br />
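For example, a 2.6-era driver suspend/resume pair using this interface might look roughly like the following sketch (the driver name and hooks are hypothetical, and real drivers also quiesce the device and save state):

#include <linux/pci.h>

static int mydrv_suspend(struct pci_dev *pdev, u32 state)
{
        /* Ask the PCI layer to put the function into D3hot. */
        pci_set_power_state(pdev, 3);
        return 0;
}

static int mydrv_resume(struct pci_dev *pdev)
{
        pci_set_power_state(pdev, 0);   /* back to D0, fully on */
        return 0;
}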
<strong>The</strong> ACPI DSDT supplies methods for ACPI<br />
enumerated devices to access ACPI D-states.<br />
However, no driver calls into ACPI to enter D-<br />
states today. 15<br />
Drivers shouldn’t have to care if they are power<br />
managed by PCI or by ACPI. Drivers should be<br />
able to up-call to a generic run-time power policy<br />
manager. That manager should know about<br />
calling the PCI layer or the ACPI layer as appropriate.<br />
<strong>The</strong> power manager should also put those requests<br />
in the context of user-specified power<br />
policy. For example, does the user want maximum performance,
or maximum battery life? Currently<br />
there is no method to specify the detailed policy,<br />
and the kernel wouldn’t know how to handle<br />
it anyway.<br />
In a related point, it appears that devices currently only suspend upon system suspend. This is probably not the path to industry-leading battery life.
Device drivers should recognize when their device<br />
has gone idle. <strong>The</strong>y should invoke a suspend<br />
up-call to a power manager layer which<br />
will decide if it really is a good idea to grant<br />
that request now, and if so, how; in this case, by calling the PCI or ACPI layer as appropriate.
6 ACPI as seen by bugzilla<br />
Over the last year the ACPI developers have<br />
made heavy use of bugzilla 16 to help prioritize<br />
and track 460 bugs. 300 bugs are closed or resolved;
160 are open. 17<br />
We cc: acpi-bugzilla@lists.<br />
sourceforge.net on these bugs, and<br />
we encourage the community to add that alias<br />
to ACPI-specific bugs in other bugzillas so that<br />
the team can help out wherever the problems<br />
are found.<br />
We haven’t really used the bugzilla priority<br />
field. Instead we’ve split the bugs into categories<br />
and have addressed the configuration issues<br />
first. This explains why most of the interrupt<br />
bugs are resolved, and most of the suspend/resume<br />
bugs are unresolved.<br />
We’ve seen an incoming bug rate of 10 bugs/week for many months, but the new reports
favor the power management features<br />
over configuration, so we’re hopeful that the<br />
torrent of configuration issues is behind us.<br />
14 Patrick Mochel, <strong>Linux</strong> <strong>Kernel</strong> Power Management,<br />
OLS 2003.<br />
15 Actually, the ACPI hot-plug driver invokes D-states,<br />
but that is the only exception.<br />
16 http://bugzilla.kernel.org/<br />
17 <strong>The</strong> resolved state indicates that a patch is available<br />
for testing, but that it is not yet checked into the kernel.org<br />
kernel.
Figure 6: ACPI bug profile
7 Future Work<br />
7.1 <strong>Linux</strong> 2.4<br />
Going forward, I expect to back-port only critical<br />
configuration-related fixes to Linux-2.4.
For the latest power management code, users<br />
need to migrate to <strong>Linux</strong>-2.6.<br />
7.2 <strong>Linux</strong> 2.6<br />
<strong>Linux</strong>-2.6 is a “stable” release, so it is not<br />
appropriate to integrate significant new features.<br />
However, the power management side<br />
of ACPI is widely used in 2.6 and there will be<br />
plenty of bug-fixes necessary. <strong>The</strong> most visible<br />
will probably be anything that makes Suspend/Resume<br />
work on more platforms.<br />
7.3 <strong>Linux</strong> 2.7<br />
<strong>The</strong>se feature gaps will not be addressed in<br />
<strong>Linux</strong> 2.6, and so are candidates for <strong>Linux</strong> 2.7:<br />
• Device enumeration is not abstracted in<br />
a generic device driver manager that can<br />
shield drivers from knowing if they’re<br />
enumerated by ACPI, PCI, or other.<br />
• Motherboard devices enumerated by<br />
ACPI in the DSDT are ignored, and<br />
probed instead via legacy methods. This<br />
can lead to resource conflicts.<br />
• Device power states are not abstracted in<br />
a generic device power manager that can<br />
shield drivers from knowing whether to<br />
call ACPI or PCI to handle D-states.<br />
• <strong>The</strong>re is no power policy manager to<br />
translate the user-requested power policy<br />
into kernel policy.<br />
• No devices invoke ACPI methods to enter<br />
D-states.<br />
• Devices do not detect that they are idle<br />
and request of a power manager whether<br />
they should enter power saving device<br />
states.<br />
• <strong>The</strong>re is no MP/SMT coordination of P-<br />
states. Today, P-states are disabled on<br />
SMP systems. Coordination needs to account<br />
for multiple threads and multiple<br />
cores per package.<br />
• Coordinate P-states and T-states. Throttling<br />
should be used only after the system<br />
is put in the lowest P-state.<br />
• Idle states above C1 are disabled on SMP.<br />
• Enable Suspend in PAE mode. 18<br />
18 PAE, Physical Address Extended—MMU mode to<br />
handle > 4GB RAM—optional on i386, always used<br />
on x86_64.
• Enable Suspend on SMP.<br />
• Tick timer modulation for idle power savings.<br />
• Video control extensions. Video is a large<br />
power consumer. <strong>The</strong> ACPI spec Video<br />
extensions are currently in prototype.<br />
• Docking Station support is completely absent<br />
from <strong>Linux</strong>.<br />
• ACPI 3.0 features. TBD after the specification<br />
is published.<br />
7.4 ACPI 3.0<br />
Although ACPI 3.0 has not yet been published,<br />
two ACPI 3.0 tidbits are already in <strong>Linux</strong>.<br />
• PCI Express table scanning. This is the<br />
basic PCI Express support, there will be<br />
more coming. Those in the PCI SIG<br />
can read all about it in the PCI Express<br />
Firmware Specification.<br />
• Several clarifications to the ACPI 2.0b<br />
spec resulted directly from open source<br />
development, 19 and the text of ACPI 3.0<br />
has been updated accordingly. For example,<br />
some subtleties of SCI interrupt configuration<br />
and device enumeration.<br />
When the ACPI 3.0 specification is published<br />
there will instantly be multiple additions to
the ACPI/<strong>Linux</strong> feature to-do list.<br />
7.5 Tougher Issues<br />
• Battery Life on <strong>Linux</strong> is not yet competitive.<br />
This single metric is the sum of all<br />
the power savings features in the platform,<br />
and if any of them are not working properly,<br />
it comes out on this bottom line.<br />
19 FreeBSD deserves kudos in addition to <strong>Linux</strong><br />
• Laptop Hot Keys are used to control<br />
things such as video brightness, etc. ACPI<br />
does not specify Hot Keys. But when they<br />
work in APM mode and don’t work in<br />
ACPI mode, ACPI gets blamed. <strong>The</strong>re are<br />
4 ways to implement hot keys:<br />
1. SMI 20 handler: the BIOS handles interrupts from the keys and controls the device directly. This acts
like “hardware” control as the OS<br />
doesn’t know it is happening. But<br />
on many systems this SMI method is<br />
disabled as soon as the system transitions<br />
into ACPI mode. Thus the<br />
complaint “the button works in APM<br />
mode, but doesn’t work in ACPI<br />
mode.”<br />
But ACPI doesn’t specify how hot<br />
keys work, so in ACPI mode one of<br />
the other methods listed here needs<br />
to handle the keys.<br />
2. Keyboard Extension driver, such as<br />
i8k. Here the keys return scan<br />
codes like any other keys on the keyboard,<br />
and the keyboard driver needs<br />
to understand those scan codes. This
is independent of ACPI, and generally<br />
OEM specific.<br />
3. OEM-specific ACPI hot key driver.<br />
Some OEMs enumerate the hot<br />
keys as OEM-specific devices in the<br />
ACPI tables. While the device is<br />
described in AML, such devices are<br />
not described in the ACPI spec so<br />
we can’t build generic ACPI support<br />
for them. <strong>The</strong> OEM must supply<br />
the appropriate hot-key driver since<br />
only they know how it is supposed<br />
to work.<br />
4. Platform-specific “ACPI” driver. Today<br />
<strong>Linux</strong> includes Toshiba and<br />
20 SMI, System Management Interrupt; invisible to the<br />
OS, handled by the BIOS, generally considered evil.
Asus platform-specific extension drivers to ACPI. They do not use portable ACPI-compliant methods to
recognize and talk to the hot keys,<br />
but generally use the methods above.<br />
The correct solution to the Hot Key issue on Linux will require direct support from the OEMs, either by supplying documentation or code to the community.
8 Summary<br />
This past year has seen great strides in the configuration<br />
aspects of ACPI. Multiple <strong>Linux</strong> distributors<br />
now enable ACPI on multiple architectures.<br />
This sets the foundation for the next era of<br />
ACPI on <strong>Linux</strong> where we can evolve the more<br />
advanced ACPI features to meet the expectations<br />
of the community.<br />
9 Resources<br />
The ACPI specification is published at http://www.acpi.info.

The home page for the Linux ACPI development community is http://acpi.sourceforge.net/. It contains numerous useful pointers, including one to the acpi-devel mailing list.

The latest ACPI code can be found against various recent releases in the BitKeeper repositories: http://linux-acpi.bkbits.net/

Plain patches are available on kernel.org. 21 Note that Andrew Morton currently includes the latest ACPI test tree in the -mm patch, so you can test the latest ACPI code combined with other recent updates there. 22
10 Acknowledgments<br />
Many thanks to the following people whose direct<br />
contributions have significantly improved<br />
the quality of the ACPI code in the last<br />
year: Jesse Barnes, John Belmonte, Dominik<br />
Brodowski, Bruno Ducrot, Bjorn Helgaas,<br />
Nitin Kamble, Andi Kleen, Karol Kozimor,
Pavel Machek, Andrew Morton, Jun Nakajima,<br />
Venkatesh Pallipadi, Nate Lawson, David<br />
Shaohua Li, Suresh Siddha, Jes Sorensen, Andrew<br />
de Quincey, Arjan van de Ven, Matt<br />
Wilcox, and Luming Yu. Thanks also to all<br />
the bug submitters, and the enthusiasts on<br />
acpi-devel.<br />
Special thanks to Intel’s Mobile Platforms<br />
Group, which created ACPICA, particularly<br />
Bob Moore and Andy Grover.<br />
<strong>Linux</strong> is a trademark of Linus Torvalds. Bit-<br />
Keeper is a trademark of BitMover, Inc.<br />
21 http://ftp.kernel.org/pub/linux/<br />
kernel/people/lenb/acpi/patches/<br />
22 http://ftp.kernel.org/pub/linux/<br />
kernel/people/akpm/patches/
Scaling <strong>Linux</strong>® to the Extreme<br />
From 64 to 512 Processors<br />
Ray Bryant<br />
raybry@sgi.com<br />
Jesse Barnes<br />
jbarnes@sgi.com<br />
John Hawkes<br />
hawkes@sgi.com<br />
Jeremy Higdon<br />
jeremy@sgi.com<br />
Silicon Graphics, Inc.<br />
Jack Steiner<br />
steiner@sgi.com<br />
Abstract<br />
In January 2003, SGI announced the SGI® Altix®<br />
3000 family of servers. As announced,<br />
the SGI Altix 3000 system supported up to<br />
64 Intel® Itanium® 2 processors and 512 GB<br />
of main memory in a single <strong>Linux</strong>® image.<br />
Altix now supports up to 256 processors in<br />
a single <strong>Linux</strong> system, and we have a few<br />
early-adopter customers who are running 512<br />
processors in a single <strong>Linux</strong> system; others<br />
are running with as much as 4 terabytes of<br />
memory. This paper continues the work reported<br />
on in our 2003 OLS paper by describing<br />
the changes necessary to get <strong>Linux</strong> to efficiently<br />
run high-performance computing workloads<br />
on such large systems.<br />
Introduction<br />
At OLS 2003 [1], we discussed changes to<br />
<strong>Linux</strong> that allowed us to make <strong>Linux</strong> scale to<br />
64 processors for our high-performance computing<br />
(HPC) workloads. Since then, we have<br />
continued our scalability work, and we now<br />
support up to 256 processors in a single <strong>Linux</strong><br />
image, and we have a few early-adopter customers<br />
who are running 512 processors in a<br />
single-system image; other customers are running<br />
with as much as 4 terabytes of memory.<br />
As can be imagined, the types of changes necessary
to get a single <strong>Linux</strong> system to scale on a<br />
512 processor system or to support 4 terabytes<br />
of memory are of a different nature than those<br />
necessary to get <strong>Linux</strong> to scale up to a 64 processor<br />
system, and the majority of this paper<br />
will describe such changes.<br />
While much of this work has been done in<br />
the context of a <strong>Linux</strong> 2.4 kernel, Altix is<br />
now a supported platform in the <strong>Linux</strong> 2.6 series<br />
(www.kernel.org versions of <strong>Linux</strong><br />
2.6 boot and run well on many small to moderate<br />
sized Altix systems), and our plan is to<br />
port many of these changes to <strong>Linux</strong> 2.6 and<br />
propose them as enhancements to the community<br />
kernel. While some of these changes will<br />
be unique to the <strong>Linux</strong> kernel for Altix, many<br />
of the changes we propose will also improve<br />
performance on smaller SMP and NUMA systems,<br />
so should be of general interest to the<br />
<strong>Linux</strong> scalability community.<br />
In the rest of this paper, we will first provide<br />
a brief review of the SGI Altix 3000 hardware.<br />
Next we will describe why we believe<br />
that very large single-system-image, shared-memory machines can be more effective tools for HPC than similarly sized non-shared-memory
clusters. We will then discuss changes that<br />
we made to <strong>Linux</strong> for Altix in order to make
that system a more effective system for HPC<br />
on systems with as many as 512 processors.<br />
A second large topic of discussion will be the<br />
changes to support high-performance I/O on<br />
Altix and some of the hardware underpinnings<br />
for that support. We believe that the latter set<br />
of problems are general in the sense that they<br />
apply to any large scale NUMA system and the<br />
solutions we have adopted should be of general<br />
interest for this reason.<br />
Even though this paper is focused on the<br />
changes that we have made to <strong>Linux</strong> to effectively<br />
support very large Altix platforms, it<br />
should be remembered that the total number of<br />
such changes is small in relation to the overall<br />
size of the <strong>Linux</strong> kernel and its supporting<br />
software. SGI is committed to supporting<br />
the <strong>Linux</strong> community and continues to support<br />
<strong>Linux</strong> for Altix as a member of the <strong>Linux</strong><br />
family of kernels, and in general to support binary<br />
compatibility between <strong>Linux</strong> for Altix and<br />
<strong>Linux</strong> on other Itanium Processor Family platforms.<br />
In many cases, the scaling changes described in<br />
this paper have already been submitted to the<br />
community for consideration for inclusion in<br />
<strong>Linux</strong> 2.6. In other cases, the changes are under<br />
evaluation to determine if they need to be<br />
added to <strong>Linux</strong> 2.6, or whether they are fixes<br />
for problems in <strong>Linux</strong> 2.4.21 (the current product<br />
base for <strong>Linux</strong> for Altix) that are no longer<br />
present in <strong>Linux</strong> 2.6.<br />
Finally, this paper contains forward-looking<br />
statements regarding SGI® technologies and<br />
third-party technologies that are subject to<br />
risks and uncertainties. <strong>The</strong> reader is cautioned<br />
not to rely unduly on these forward-looking<br />
statements, which are not a guarantee of future<br />
or current performance, nor are they a guarantee<br />
that features described herein will or will<br />
not be available in future SGI products.<br />
<strong>The</strong> SGI Altix Hardware<br />
This section is condensed from [1]; the reader<br />
should refer to that paper for additional details.<br />
An Altix system consists of a configurable<br />
number of rack-mounted units, each of which<br />
SGI refers to as a brick. <strong>The</strong> most common<br />
type of brick is the C-brick (or compute brick).<br />
A fully configured C-brick consists of two separate<br />
dual-processor Intel Itanium 2 systems,<br />
each of which is a bus-connected multiprocessor<br />
or node.<br />
In addition to the two processors on the bus,<br />
there is also a SHUB chip on each bus. <strong>The</strong><br />
SHUB is a proprietary ASIC that (1) acts as<br />
a memory controller for the local memory,<br />
(2) provides the interface to the interconnection<br />
network, (3) manages the global cache coherency<br />
protocol, and (4) performs some other functions<br />
as discussed in [1].<br />
Memory accesses in an Altix system are either<br />
local (i.e., the reference is to memory in the<br />
same node as the processor) or remote. <strong>The</strong><br />
SHUB detects whether a reference is local, in<br />
which case it directs the request to the memory<br />
on the node, or remote, in which case it<br />
forwards the request across the interconnection<br />
network to the SHUB chip where the memory<br />
reference will be serviced.<br />
Local memory references have lower latency;<br />
the Altix system is thus a NUMA (non-uniform<br />
memory access) system. <strong>The</strong> ratio of remote to<br />
local memory access times on an Altix system<br />
varies from 1.9 to 3.5, depending on the size<br />
of the system and the relative locations of the<br />
processor and memory module involved in the<br />
transfer.<br />
<strong>The</strong> cache-coherency policy in the Altix system<br />
can be divided into two levels: local<br />
and global. <strong>The</strong> local cache-coherency protocol<br />
is defined by the processors on the local
bus and is used to maintain cache-coherency<br />
between the Itanium processors on the bus.<br />
<strong>The</strong> global cache-coherency protocol is implemented<br />
by the SHUB chip. <strong>The</strong> global protocol<br />
is directory-based and is a refinement of the<br />
protocol originally developed for DASH [2].<br />
<strong>The</strong> Altix system interconnection network uses<br />
routing bricks to provide connectivity in system<br />
sizes larger than 16 processors. In systems<br />
with 128 or more processors a second layer<br />
of routing bricks is used to forward requests<br />
among subgroups of 32 processors each. <strong>The</strong><br />
routing topology is a fat-tree topology with additional<br />
“express” links being inserted to improve<br />
performance.<br />
Why Big SSI?<br />
In this section we discuss the rationale for<br />
building such a large single-system image<br />
(SSI) box as an Altix system with 512 CPUs<br />
and (potentially) several TB of main memory:<br />
(1) Shared memory systems are more flexible<br />
and easier to manage than a cluster. <strong>One</strong> can<br />
simulate message passing on shared memory,<br />
but not the other way around. Software for<br />
cluster management and system maintenance<br />
exists, but can be expensive or complex to use.<br />
(2) Shared memory style programming is generally<br />
simpler and more easily understood than<br />
message passing. Debugging of code is often<br />
simpler on an SSI system than on a cluster.<br />
(3) It is generally easier to port or write<br />
codes from scratch using the shared memory<br />
paradigm. Additionally it is often possible to<br />
simply ignore large sections of the code (e.g.<br />
those devoted to data input and output) and<br />
only parallelize the part that matters.<br />
(4) A shared memory system supports easier<br />
load balancing within a computation. <strong>The</strong><br />
mapping of grid points to a node determines<br />
the computational load on the node. Some grid<br />
points may be located near more rapidly changing<br />
parts of the computation, resulting in higher<br />
computational load. Balancing this over time<br />
requires moving grid points from node to node<br />
in a cluster, whereas in a shared memory system<br />
such re-balancing is typically simpler.<br />
(5) Access to large global data sets is simplified.<br />
Often, the parallel computation depends<br />
on a large data set describing, for example, the<br />
precise dimensions and characteristics of the<br />
physical object that is being modeled. This<br />
data set can be too large to fit into the node<br />
memories available on a clustered machine, but<br />
it can readily be loaded into memory on a large<br />
shared memory machine.<br />
(6) Not everything fits into the cluster model.<br />
While many production codes have been converted<br />
to message passing, the overall computation<br />
may still contain one or more phases that<br />
are better performed using a large shared memory<br />
system. Or, there may be a subset of users<br />
of the system who would prefer a shared memory<br />
paradigm to a message passing one. This<br />
can be a particularly important consideration in<br />
large data-center environments.<br />
<strong>Kernel</strong> Changes<br />
In this section we describe the most significant<br />
kernel problems we have encountered in running<br />
<strong>Linux</strong> on a 512 processor Altix system.<br />
Cache line and TLB Conflicts<br />
Cache line conflicts occur in every cache-coherent<br />
multiprocessor system, to one extent<br />
or another, and whether or not the conflict exhibits<br />
itself as a performance problem is dependent<br />
on the rate at which the conflict occurs and<br />
the time required by the hardware to resolve
the conflict. <strong>The</strong> latter time is typically proportional<br />
to the number of processors involved in<br />
the conflict. On Altix systems with 256 processors<br />
or more, we have encountered some cache<br />
line conflicts that can effectively halt forward<br />
progress of the machine. Typically, these conflicts<br />
involve global variables that are updated<br />
at each timer tick (or faster) by every processor<br />
in the system.<br />
<strong>One</strong> example of this kind of problem is the default<br />
kernel profiler. When we first enabled<br />
the default kernel profiler on a 512 CPU system,<br />
the system would not boot. <strong>The</strong> reason<br />
was that once per timer tick, each processor<br />
in the system was trying to update the profiler<br />
bin corresponding to the CPU idle routine.<br />
A workaround for this problem was to initialize<br />
prof_cpu_mask to CPU_MASK_NONE<br />
instead of the default. This disables profiling<br />
on all processors until the user sets the<br />
prof_cpu_mask.<br />
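A minimal sketch of that workaround, assuming 2.6-style cpumask definitions (the exact declaration differs between kernel versions):<br />
```c
/* Sketch of the workaround (2.6-style cpumask types assumed).
 * Starting with an empty mask keeps 512 CPUs from all updating the
 * profile bucket for the idle loop on every tick; an administrator
 * can later write a mask of CPUs to profile into prof_cpu_mask.
 */
cpumask_t prof_cpu_mask = CPU_MASK_NONE;	/* default was "all CPUs" */
```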
Another example of this kind of problem was<br />
when we imported some timer code from<br />
Red Hat® AS 3.0. <strong>The</strong> timer code included<br />
a global variable that was used to account for<br />
differences between HZ (typically a power of<br />
2) and the number of microseconds in a second<br />
(nominally 1,000,000). This global variable<br />
was updated by each processor on each<br />
timer tick. <strong>The</strong> result was that on Altix systems<br />
larger than about 384 processors, forward<br />
progress could not be made with this version<br />
of the code. To fix this problem, we made this<br />
global variable a per processor variable. <strong>The</strong><br />
result was that the adjustment for the difference<br />
between HZ and microseconds is done on<br />
a per processor rather than on a global basis,<br />
and now the system will boot.<br />
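A sketch of the shape of that fix follows; the structure and helper names are illustrative rather than the actual patch:<br />
```c
/* Sketch only: turn a tick-time global into per-CPU data so that each
 * processor updates a cache line it owns.  Names are illustrative.
 */
struct tick_adjust {
	long usec_slack;		/* HZ-vs-microseconds remainder */
} ____cacheline_aligned;

static struct tick_adjust tick_adjust[NR_CPUS];

static inline void account_tick_slack(void)
{
	/* was: a single global written by every CPU on every tick */
	tick_adjust[smp_processor_id()].usec_slack += 1000000 % HZ;
}
```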
Still other cache line conflicts were remedied<br />
by identifying cases of false cache line sharing,<br />
i.e., those cache lines that inadvertently contain<br />
a field that is frequently written by one CPU<br />
and another field (or fields) that are frequently<br />
read by other CPUs.<br />
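As an illustration (invented structure, not actual kernel code), such a conflict is typically fixed by giving the hot written field its own cache line:<br />
```c
/* Illustrative example of fixing false sharing: the frequently
 * written counter gets its own cache line, so readers of the other
 * fields no longer take a coherence miss on every update.
 */
struct node_stats {
	unsigned long ticks ____cacheline_aligned;	/* written each tick */
	unsigned long capacity ____cacheline_aligned;	/* mostly read-only */
	unsigned long node_id;
};
```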
Another significant bottleneck is the ia64<br />
do_gettimeofday() with its use of<br />
cmpxchg. That operation is expensive on<br />
most architectures, and concurrent cmpxchg<br />
operations on a common memory location<br />
scale worse than concurrent simple writes from<br />
multiple CPUs. On Altix, four concurrent user<br />
gettimeofday() system calls complete in<br />
almost an order of magnitude more time than a<br />
single gettimeofday(); eight are 20 times<br />
slower than one; and the scaling deteriorates<br />
nonlinearly to the point where 32 concurrent<br />
system calls are 100 times slower than one. At<br />
the present time, we are still exploring a way to<br />
improve this scaling problem in <strong>Linux</strong> 2.6 for<br />
Altix.<br />
While moving data to per-processor storage is<br />
often a solution to the kind of scaling problems<br />
we have discussed here, it is not a panacea,<br />
particularly as the number of processors becomes<br />
large. Often, the system will want to<br />
inspect some data item in the per-processor<br />
storage of each processor in the system. For<br />
small numbers of processors this is not a problem.<br />
But when there are hundreds of processors<br />
involved, such loops can cause a TLB miss<br />
each time through the loop as well as a couple<br />
of cache-line misses, with the result that<br />
the loop may run quite slowly. (A TLB miss<br />
is caused because the per-processor storage areas<br />
are typically isolated from one another in<br />
the kernel’s virtual address space.)<br />
If such loops turn out to be bottlenecks, then<br />
what one must often do is to move the fields<br />
that such loops inspect out of the per-processor<br />
storage areas, and move them into a global<br />
static array with one entry per CPU.<br />
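The following sketch (invented names) shows the idea of mirroring the frequently scanned fields into one flat static array:<br />
```c
/* Sketch (invented names): fields that "for each CPU" loops read
 * frequently are mirrored into one flat static array, so the scan
 * walks a single kernel mapping instead of taking a TLB miss per
 * per-CPU area.
 */
struct cpu_peek {
	unsigned long nr_running;
	unsigned long load;
} ____cacheline_aligned;

static struct cpu_peek cpu_peek[NR_CPUS];

static unsigned long total_running_tasks(void)
{
	unsigned long sum = 0;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu_online(cpu))
			sum += cpu_peek[cpu].nr_running;
	return sum;
}
```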
An example of this kind of problem in <strong>Linux</strong><br />
2.6 for Altix is the current allocation scheme<br />
of the per-CPU run queue structures. Each
per-CPU structure on an Altix system requires<br />
a unique TLB to address it, and each structure<br />
begins at the same virtual offset in a page,<br />
which for a virtually indexed cache means that<br />
the same fields will collide at the same index.<br />
Thus, a CPU scheduler that wishes to<br />
do a quick peek at every other CPU’s nr_<br />
running or cpu_load will not only suffer a<br />
TLB miss on every access, but will also likely<br />
suffer a cache miss because these same virtual<br />
offsets will collide in the cache. Cache coloring<br />
of these addresses would be one way to<br />
solve this problem; we are still exploring ways<br />
to fix this problem in <strong>Linux</strong> 2.6 for Altix.<br />
Lock Conflicts<br />
A cousin of cache line conflicts is the lock<br />
conflicts. Indeed, the root mechanism of the<br />
lock bottleneck is a cache line conflict. For<br />
a spinlock_t the conflict is the cmpxchg<br />
operation on the word that signifies whether or<br />
not the lock is owned. For a rwlock_t the<br />
conflict is the cmpxchg or fetch-and-add operation<br />
on the count of the number of readers<br />
or the bit signifying whether or not the<br />
lock is owned exclusively by a writer. For a<br />
seqlock_t the conflict is the increment of<br />
the sequence number.<br />
For some lock conflicts, such as the rcu_<br />
ctrlblk.mutex, the remedy is to make the<br />
spinlock more fine-grained, e.g., by making it<br />
hierarchical or per-CPU. For other lock conflicts,<br />
the most effective remedy is to reduce<br />
the use of the lock.<br />
<strong>The</strong> O(1) CPU scheduler replaced the global<br />
runqueue_lock with per-CPU run queue<br />
locks, and replaced the global run queue with<br />
per-CPU run queues. While this did substantially<br />
decrease the CPU scheduling bottleneck<br />
for CPU counts in the 8 to 32 range, additional<br />
effort has been necessary to remedy further<br />
bottlenecks that appear with even larger configurations.<br />
For example, we discovered that at 256 processors<br />
and above, we encountered a livelock<br />
early in system boot because hundreds of idle<br />
CPUs are load-balancing and contending for<br />
one or a few busy CPUs. <strong>The</strong> contention<br />
is so severe that the busy CPU’s scheduler<br />
cannot itself acquire its own run queue<br />
lock, and thus the system livelocks.<br />
A remedy we applied in our Altix 2.4-based<br />
kernel was to introduce a progressively longer<br />
back-off between successive load-balancing attempts<br />
if the load-balancing CPU continues<br />
to be unsuccessful in finding a task to pull-migrate.<br />
Perhaps all the busiest CPU’s tasks<br />
are pinned to that CPU, or perhaps all the<br />
tasks are still cache-hot. Regardless of the<br />
reason, a load-balancing failure results in that<br />
CPU delaying the next load-balance attempt<br />
by another incremental increase in time. This<br />
algorithm effectively solved the livelock, as<br />
well as improved other high-contention conflicts<br />
on a busy CPU’s run queue lock (e.g.,<br />
always finding pinned tasks that can never be<br />
migrated).<br />
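Roughly, the back-off logic looks like the sketch below; the doubling policy, limits, and the helper and run-queue fields shown are illustrative, not the shipped Altix scheduler code:<br />
```c
/* Rough sketch of the progressive load-balance back-off (2.4-era
 * Altix scheduler).  The fields balance_interval/next_balance, the
 * helper pull_task_from_busiest(), and the doubling policy are all
 * illustrative only.
 */
#define BALANCE_MIN	(HZ / 100)
#define BALANCE_MAX	HZ

static void idle_balance_tick(runqueue_t *rq, int this_cpu)
{
	if (time_before(jiffies, rq->next_balance))
		return;

	if (pull_task_from_busiest(rq, this_cpu)) {
		rq->balance_interval = BALANCE_MIN;	/* success: stay eager */
	} else {
		/* failure (tasks pinned, cache-hot, lock contended):
		 * back off so idle CPUs stop hammering the busy CPU's
		 * run queue lock. */
		rq->balance_interval = min(rq->balance_interval * 2,
					   (unsigned long)BALANCE_MAX);
	}
	rq->next_balance = jiffies + rq->balance_interval;
}
```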
This load-balance back-off algorithm did not<br />
get accepted into the early 2.6 kernels. <strong>The</strong> latest<br />
2.6.7 CPU scheduler, as developed by Nick<br />
Piggin, incorporates a similar back-off algorithm.<br />
However, this algorithm (at least as it<br />
appears in 2.6.7-rc2) continues to cause a boot-time<br />
livelock at 512 processors on Altix, so we<br />
are continuing to investigate this matter.<br />
Page Cache<br />
Managing the page cache in Altix has been a<br />
challenging problem. <strong>The</strong> reason is that while<br />
a large Altix system may have a lot of memory,<br />
each node in the system only has a relatively<br />
small fraction of that memory available as local<br />
memory. For example, on a 512 CPU sys-
tem, if the entire system has 512 GB of memory,<br />
each node on the system has only 2 GB of<br />
local memory; less than 0.4% of the available<br />
memory on the system is local. When you consider<br />
that it is quite common on such systems<br />
to deal with files that are tens of GB in size, it<br />
is easy to understand how the page cache could<br />
consume all of the memory on several nodes in<br />
the system just doing normal, buffered-file I/O.<br />
Stated another way, this is the challenge of a<br />
large NUMA system: all memory is addressable,<br />
but only a tiny fraction of that memory<br />
is local. Users of NUMA systems need to<br />
place their most frequently accessed data in local<br />
memory; this is crucial to obtain the maximum<br />
performance possible from the system.<br />
Typically this is done by allocating pages on a<br />
first-touch basis; that is, we attempt to allocate<br />
a page on the node where it is first referenced.<br />
If all of the local memory on a node is consumed<br />
by the page cache, then these local storage<br />
allocations will spill over to other (remote)<br />
nodes, the result being a potentially significant<br />
impact on program performance.<br />
Similarly, it is important that the amount of<br />
free memory be balanced across idle nodes in<br />
the system. An imbalance could lead to some<br />
components of a parallel computation running<br />
slower than others because not all components<br />
of the computation were able to allocate their<br />
memory entirely out of local storage. Since the<br />
overall speed of parallel computation is determined<br />
by the execution of its slowest component,<br />
the performance of the entire application<br />
can be impacted by a non-local storage allocation<br />
on only a few nodes.<br />
<strong>One</strong> might think that bdflush or kupdated<br />
(in a <strong>Linux</strong> 2.4 system) would be responsible<br />
for cleaning up unused page-cache pages.<br />
As the OLS reader knows, these daemons<br />
are responsible not for deallocating page-cache<br />
pages, but for cleaning them. It is the swap daemon<br />
kswapd that is responsible for causing<br />
page-cache pages to be deallocated. However,<br />
in many situations we have encountered, even<br />
though multiple nodes of the system would be<br />
completely out of local memory, there would<br />
still be lots of free memory elsewhere in the<br />
system. As a result, kswapd will never start.<br />
Once the system gets into such a state, the<br />
local memory on those nodes can remain allocated<br />
entirely to page-cache pages for very<br />
long stretches of time since as far as the kernel<br />
is concerned there is no memory “pressure”.<br />
To get around this problem, particularly<br />
for benchmarking studies, users have often<br />
resorted to programs that allocate and touch<br />
all of the memory on the system, thus causing<br />
kswapd to wake up and free unneeded buffer<br />
cache pages.<br />
We have dealt with this problem in a number<br />
of ways, but the first approach was to change<br />
page_cache_alloc() so that instead<br />
of allocating the page on the local node, we<br />
spread allocations across all nodes in the<br />
system. To do this, we added a new GFP<br />
flag: GFP_ROUND_ROBIN and a new procedure:<br />
alloc_pages_round_robin().<br />
alloc_pages_round_robin() maintains<br />
a counter in per-CPU storage; the<br />
counter is incremented on each call to<br />
page_cache_alloc(). <strong>The</strong> value of the<br />
counter, modulus the number of nodes in<br />
the system, is used to select the zonelist<br />
passed to __alloc_pages(). Like other<br />
NUMA implementations, in <strong>Linux</strong> for Altix<br />
there is a zonelist for each node, and the<br />
zonelists are sorted in nearest neighbor<br />
order with the zone for the local node as the<br />
first entry of the zonelist. <strong>The</strong> result is that<br />
each time page_cache_alloc() is called,<br />
the returned page is allocated on the next node<br />
in sequence, or as close as possible to that<br />
node.<br />
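A sketch of the mechanism is shown below; the function names follow the paper, but the bodies are illustrative and use alloc_pages_node() in place of explicit zonelist selection:<br />
```c
/* Sketch of round-robin page-cache allocation (2.4-era interfaces
 * assumed; function names follow the paper, bodies are illustrative).
 * The paper's new GFP_ROUND_ROBIN flag lets other allocation paths,
 * such as the slab code, request the same behaviour.
 */
static int rr_next[NR_CPUS] ____cacheline_aligned;

static struct page *alloc_pages_round_robin(unsigned int gfp_mask,
					    unsigned int order)
{
	/* pick the next node in sequence, regardless of where we run;
	 * each node's zonelist is sorted nearest-first, so if that node
	 * is full the allocation spills to its closest neighbours */
	int node = rr_next[smp_processor_id()]++ % numnodes;

	return alloc_pages_node(node, gfp_mask, order);
}

static struct page *page_cache_alloc(struct address_space *mapping)
{
	return alloc_pages_round_robin(mapping->gfp_mask, 0);
}
```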
<strong>The</strong> rationale for allocating page-cache pages
in this way is that while pages are local resources,<br />
the page cache is a global resource, usable<br />
by all processes on the system. Thus, even<br />
if a process is bound to a particular node, in<br />
general it does not make sense to allocate page-cache<br />
pages just on that node, since some other<br />
process in the system may be reading that same<br />
file and hence sharing the pages. So instead of<br />
flooding the current node with the page-cache<br />
pages for files that processes on that node have<br />
opened, we “tax” every node in the system with<br />
a fraction of the page-cache pages. In this<br />
way, we try to conserve a scarce resource (local<br />
memory) by spreading page-cache allocations<br />
over all nodes in the system.<br />
However, even this step was not enough to keep<br />
local storage usage balanced among nodes in<br />
the system. After reading a 10 GB file, for<br />
example, we found that the node where the<br />
reading process was running would have up to<br />
40,000 pages more storage allocated than other<br />
nodes in the system. It turned out the reason for<br />
this was that buffer heads for the read operation<br />
were being allocated locally. To solve this<br />
problem in our <strong>Linux</strong> 2.4.21 kernel for Altix,<br />
we modified kmem_cache_grow() so that<br />
it would pass the GFP_ROUND_ROBIN flag to<br />
kmem_getpages() with the result that the<br />
slab caches on our systems are now also allocated<br />
out of round-robin storage. Of course,<br />
this is not a perfect solution, since there are situations<br />
where it makes perfect sense to allocate<br />
a slab cache entry locally; but this was an expedient<br />
solution appropriate for our product. For<br />
<strong>Linux</strong> 2.6 for Altix we would like to see the<br />
slab allocator be made NUMA aware. (Manfred<br />
Spraul has created some patches to do this<br />
and we are currently evaluating these changes.)<br />
<strong>The</strong> previous two changes solved many of the<br />
cases where a local storage could be exhausted<br />
by allocation of page-cache pages. However,<br />
they still did not solve the problem of local allocations<br />
spilling off node, particularly in those<br />
cases where storage allocation was tight across<br />
the entire system. In such situations, the system<br />
would often start running the synchronous<br />
swapping code even though most (if not all) of<br />
the page-cache pages on the system were clean<br />
and unreferenced outside of the page-cache.<br />
With the very-large memory sizes typical of<br />
our larger Altix customers, entering the synchronous<br />
swapping code needs to be avoided<br />
if at all possible since this tends to freeze the<br />
system for tens of seconds. Additionally, the<br />
round robin allocation fixes did not solve the<br />
problem of poor and unrepeatable performance<br />
on benchmarks due to the existence of significant<br />
amounts of page-cache storage left over<br />
from previous executions.<br />
To solve these problems, we introduced a routine<br />
called toss_buffer_cache_pages_<br />
node() (referred to here as toss(), for<br />
brevity). In a related change, we made the<br />
active and inactive lists per node rather than<br />
global. toss() first scans the inactive list<br />
(on a particular node) looking for idle page-cache<br />
pages to release back to the free page<br />
pool. If not enough such pages are found<br />
on the inactive list, then the active list is<br />
also scanned. Finally, if toss() has not<br />
called shrink_slab_caches() recently,<br />
that routine is also invoked in order to more<br />
aggressively free unused slab-cache entries.<br />
toss() was patterned after the main loop<br />
of shrink_caches() except that it would<br />
never call swap_out() and if it encountered<br />
a page that didn’t look to be easily freeable, it<br />
would just skip that page and go on to the next<br />
page.<br />
A call to toss() was added in __alloc_<br />
pages() in such a way that if allocation on<br />
the current node fails, then before trying to allocate<br />
from some other node (i.e., spilling<br />
to another node), the system will first see if<br />
it can free enough page-cache pages from the<br />
current node so that the current node alloca-
tion can succeed. In subsequent allocation<br />
passes, toss() is also called to free page-cache<br />
pages on nodes other than the current<br />
one. <strong>The</strong> result of this change is that clean<br />
page-cache pages are effectively treated as free<br />
memory by the page allocator.<br />
At the same time that the toss() code<br />
was added, we added a new user command<br />
bcfree that could be used to free all<br />
idle page-cache pages. (On the __alloc_<br />
pages() path, toss() would only try to<br />
free 32 pages per node.) <strong>The</strong> bcfree command<br />
was intended to be used only for resetting<br />
the state of the page cache before running a<br />
benchmark, and in lieu of rebooting the system<br />
in order to get a clean system state. However,<br />
our customers found that this command could<br />
be used to reduce the size of the page cache<br />
and to avoid situations where large amounts<br />
of buffered-file I/O could force the system to<br />
begin swapping. Since bcfree kills the entire<br />
page-cache, however, this was regarded<br />
as a substandard solution that could also hurt<br />
read performance of cached data and we began<br />
looking for another way to solve this “BIGIO”<br />
problem.<br />
Just to be specific, the BIGIO problem we were<br />
trying to solve was based on the behavior of our<br />
<strong>Linux</strong> 2.4.21 kernel for Altix. A customer reported<br />
that on a 256 GB Altix system, if 200<br />
GB were allocated and 50 GB free, that if the<br />
user program then tried to write 100 GB of data<br />
out to disk, the system would start to swap,<br />
and then in many cases fill up the swap space.<br />
At that point our Out-of-memory (OOM) killer<br />
would wake up and kill the user program! (See<br />
the next section for discussion of our OOM<br />
killer changes.)<br />
Initially we were able to work around this<br />
problem by increasing the amount of swap<br />
space on the system. Our experiments showed<br />
that with an amount of swap space equal to<br />
one-quarter the main memory size, the 256 GB<br />
example discussed above would continue to<br />
completion without the OOM killer being invoked.<br />
I/O performance during this phase was<br />
typically one-half of what the hardware could<br />
deliver, since two I/O operations often had to<br />
be completed: one to read the data in from<br />
the swap device, and one to write the data to<br />
the output file. Additionally, while the swap<br />
scan was active, the system was very sluggish.<br />
<strong>The</strong>se problems led us to search for another solution.<br />
Eventually what we developed is an aggressive<br />
method of trimming the page cache when it<br />
started to grow too big. This solution involved<br />
several steps:<br />
(1) We first added a new page list, the<br />
reclaim_list. This increased the size of<br />
struct page by another 16 bytes. On our<br />
system, struct page is allocated on cache-aligned<br />
boundaries anyway, so this really did<br />
not cause an increase in storage, since the current<br />
struct page size was less than 112<br />
bytes. Pages were added to the reclaim list<br />
when they were inserted into the page cache.<br />
<strong>The</strong> reclaim list is per node, with per node<br />
locking. Pages were removed from the reclaim<br />
list when they were no longer reclaimable; that<br />
is, they were removed from the reclaim list<br />
when they were marked as dirty due to buffer<br />
file-I/O or when they were mapped into an address<br />
space.<br />
(2) We rewrote toss() to scan the reclaim list<br />
instead of the inactive and active lists. Herein<br />
we will refer to the new version of toss() as<br />
toss_fast().<br />
(3) We introduced a variant of page_cache_<br />
alloc() called page_cache_alloc_<br />
limited(). Associated with this new<br />
routine were two control variables settable<br />
via sysctl(): page_cache_limit and<br />
page_cache_limit_threshold.
(4) We modified the generic_file_<br />
write() path to call page_cache_<br />
alloc_limited() instead of page_<br />
cache_alloc(). page_cache_alloc_<br />
limited() examines the size of the page<br />
cache. If the total amount of free memory<br />
in the system is less than page_cache_<br />
limit_threshold and the size of the page<br />
cache is larger than page_cache_limit,<br />
then page_cache_alloc_limited()<br />
calls page_cache_reduce() to free<br />
enough page-cache pages on the system to<br />
bring the page cache size down below page_<br />
cache_limit. If this succeeds, then page_<br />
cache_alloc_limited() calls page_<br />
cache_alloc to allocate the page. If not,<br />
then we wake up bdflush and the current<br />
thread is put to sleep for 30 ms (a tunable<br />
parameter).<br />
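Roughly, the new write-path allocation looks like the following sketch (control-variable names from the paper, everything else illustrative):<br />
```c
/* Sketch of the limited allocation path.  page_cache_limit and
 * page_cache_limit_threshold are the sysctl knobs described above;
 * page_cache_sleep_ticks is an invented name for the ~30 ms delay,
 * and the bodies are illustrative 2.4-style code.
 */
static struct page *page_cache_alloc_limited(struct address_space *mapping)
{
	if (nr_free_pages() < page_cache_limit_threshold &&
	    page_cache_size > page_cache_limit &&
	    !page_cache_reduce(page_cache_size - page_cache_limit)) {
		/* could not trim the cache below the limit: throttle this
		 * writer rather than let buffered writes push the system
		 * into swap */
		wakeup_bdflush();
		set_current_state(TASK_INTERRUPTIBLE);
		schedule_timeout(page_cache_sleep_ticks);
	}
	return page_cache_alloc(mapping);
}
```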
<strong>The</strong> rationale for the reclaim_list and<br />
toss_fast() was that when we needed to<br />
trim the page cache, practically all pages in<br />
the system would typically be on the inactive<br />
list. <strong>The</strong> existing toss() routine scanned<br />
the inactive list and thus was too slow to call<br />
from generic_file_write. Moreover,<br />
most of the pages on the inactive list were<br />
not reclaimable anyway. Most of the pages<br />
on the reclaim_list are reclaimable. As<br />
a result toss_fast() runs much faster and<br />
is more efficient at releasing idle page-cache<br />
pages than the old routine.<br />
<strong>The</strong> rationale for the page_cache_limit_<br />
threshold in addition to the page_<br />
cache_limit is that if there is lots of free<br />
memory then there is no reason to trim the page<br />
cache. <strong>One</strong> might think that because we only<br />
trim the page cache on the file write path, this<br />
approach would still let the page cache<br />
grow arbitrarily due to file reads. Unfortunately,<br />
this is not the case, since the <strong>Linux</strong> kernel<br />
in normal multiuser operation is constantly<br />
writing something to the disk. So, a page cache<br />
limit enforced at file write time is also an effective<br />
limit on the size of the page cache due to<br />
file reads.<br />
Finally, the rationale for delaying the calling<br />
task when page_cache_reduce() fails is<br />
that we do not want the system to start swapping<br />
to make space for new buffered I/O pages,<br />
since that will reduce I/O bandwidth by as<br />
much as one-half anyway, as well as take a lot<br />
of CPU time to figure out which pages to swap<br />
out. So it is better to reduce the I/O bandwidth<br />
directly, by limiting the rate of requested I/O,<br />
instead of allowing that I/O to proceed at a rate<br />
that causes the system to be overrun by page-cache<br />
pages.<br />
Thus far, we have had good experience with<br />
this algorithm. File I/O rates are not substantially<br />
reduced from what the hardware can provide,<br />
the system does not start swapping, and<br />
the system remains responsive and usable during<br />
the period of time when the BIGIO is running.<br />
Of course, this entire discussion is specific to<br />
<strong>Linux</strong> 2.4.21. For <strong>Linux</strong> 2.6, we have plans to<br />
evaluate whether this is a problem in the system<br />
at all. In particular, we want to see whether<br />
setting vm_swappiness to<br />
zero can eliminate the “BIGIO causes swapping”<br />
problem. We also are interested in evaluating<br />
the recent set of VM patches that Nick<br />
Piggin [6] has assembled to see if they eliminate<br />
this problem for systems of the size of a<br />
large Altix.<br />
VM and Memory Allocation Fixes<br />
In addition to the page-cache changes described<br />
in the last section, we have made a<br />
number of smaller changes related to virtual<br />
memory and paging performance.<br />
<strong>One</strong> set of such changes increased the parallelism<br />
of page-fault handling for anonymous
pages in multi-threaded applications. <strong>The</strong>se<br />
applications allocate space using routines that<br />
eventually call mmap(); the result is that<br />
when the application touches the data area for<br />
the first time, it causes a minor page fault.<br />
<strong>The</strong>se faults are serviced while holding the<br />
address space’s page_table_lock. If the<br />
address space is large and there are a large<br />
number of threads executing in the address<br />
space, this spinlock can be an initialization-time<br />
bottleneck for the application. Examination<br />
of the handle_mm_fault() path for<br />
this case shows that the page_table_lock<br />
is acquired unconditionally but then released as<br />
soon as we have determined that this is a not-present<br />
fault for an anonymous page. So, we<br />
reordered the code checks in handle_mm_<br />
fault() to determine in advance whether or<br />
not this was the case we were in, and if so, to<br />
skip acquiring the lock altogether.<br />
<strong>The</strong> second place the page_table_lock<br />
was used on this path was in<br />
do_anonymous_page(). Here, the<br />
lock was re-acquired to make sure that the<br />
process of allocating a page frame and filling<br />
in the pte is atomic. On Itanium, stores to<br />
page-table entries are normal stores (that is,<br />
the set_pte macro evaluates to a simple<br />
store). Thus, we can use cmpxchg to update<br />
the pte and make sure that only one thread<br />
allocates the page and fills in the pte. <strong>The</strong><br />
compare and exchange effectively lets us lock<br />
on each individual pte. So, for Altix, we<br />
have been able to completely eliminate the<br />
page_table_lock from this particular<br />
page-fault path.<br />
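In outline, the lock-free install looks like the sketch below; it is heavily simplified (write-protect handling and accounting details are glossed over) and is meant only to illustrate the cmpxchg idea, not the actual Altix code:<br />
```c
/* Illustration of installing an anonymous page without taking
 * page_table_lock (ia64, where set_pte is a plain store).  Heavily
 * simplified; not the actual Altix do_anonymous_page() code.
 */
static int fill_anon_pte(struct mm_struct *mm, struct vm_area_struct *vma,
			 pte_t *ptep, pte_t orig_pte, unsigned long address)
{
	struct page *page = alloc_page(GFP_HIGHUSER);
	pte_t new_pte;

	if (!page)
		return -ENOMEM;
	clear_user_highpage(page, address);
	new_pte = pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));

	/* install the new pte only if the slot still holds the
	 * not-present value we faulted on; exactly one thread wins */
	if (cmpxchg(&pte_val(*ptep), pte_val(orig_pte),
		    pte_val(new_pte)) != pte_val(orig_pte)) {
		__free_page(page);	/* lost the race: use the winner's page */
		return 0;
	}
	mm->rss++;
	update_mmu_cache(vma, address, new_pte);
	return 0;
}
```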
<strong>The</strong> performance improvement from this<br />
change is shown in Figure 1. Here we show the<br />
time required to initially touch 96 GB of data.<br />
As additional processors are added to the problem,<br />
the time required for both the baseline-<br />
<strong>Linux</strong> and <strong>Linux</strong> for Altix versions decreases<br />
until around 16 processors. At that point the<br />
page_table_lock starts to become a significant<br />
bottleneck. For the largest number of<br />
processors, even the time for the <strong>Linux</strong> for Altix<br />
case is starting to increase again. We believe<br />
that this is due to contention for the address<br />
space’s mmap semaphore.<br />
[Figure 1: plot of time to touch data (seconds) versus number of processors, comparing baseline 2.4 with <strong>Linux</strong> 2.4 for Altix.]<br />
Figure 1: Time to initially touch 96 GB of data.<br />
This is particularly important for HPC applications<br />
since OpenMP [5], a common parallel<br />
programming model for FORTRAN, is implemented<br />
using a single-address-space, multiple-thread<br />
programming model. <strong>The</strong> optimization<br />
described here is one of the reasons that Altix<br />
has recently set new performance records<br />
for the SPEC® SPEComp® L2001 benchmark<br />
[7].<br />
While the above measurements were taken using<br />
<strong>Linux</strong> 2.4.21 for Altix, a similar problem<br />
exists in <strong>Linux</strong> 2.6. For many other architectures,<br />
this same kind of change can be made;<br />
i386 is one of the exceptions to this statement.<br />
We are planning on porting our <strong>Linux</strong> 2.4.21<br />
based changes to <strong>Linux</strong> 2.6 and submitting the<br />
changes to the <strong>Linux</strong> community for inclusion<br />
in <strong>Linux</strong> 2.6. This may require moving part<br />
of do_anonymous_page() to architecture
dependent code to allow for the fact that not<br />
all architectures can use the compare and exchange<br />
approach to eliminate the use of the<br />
page_table_lock in do_anonymous_<br />
page(). However, the performance improvement<br />
shown in Figure 1 is significant for Altix,<br />
so we would like to explore some<br />
way of incorporating this code into the mainline<br />
kernel.<br />
We have encountered similar scalability limitations<br />
for other kinds of page-fault behavior.<br />
Figure 2 shows the number of page faults<br />
per second of wall clock time measured for<br />
multiple processes running simultaneously and<br />
faulting in a 1 GB /dev/zero mapping. Unlike<br />
the previous case described here, in this<br />
case each process has its own private mapping.<br />
(Here the number of processes is equal to the<br />
number of CPUs.) <strong>The</strong> dramatic difference between<br />
the baseline 2.4 and 2.6 cases and <strong>Linux</strong><br />
for Altix is due to elimination of a lock in the<br />
super block for /dev/zero.<br />
[Figure 2: log-log plot of page faults per second of wall-clock time versus number of CPUs, comparing the 2.4 baseline, <strong>Linux</strong> 2.4 for Altix, and the 2.6 baseline.]<br />
Figure 2: Page Faults per Second of Wall Clock<br />
Time.<br />
<strong>The</strong> lock in the super block protects two<br />
counts: <strong>One</strong> count limits the maximum number<br />
of /dev/zero mappings to 2^63; the second<br />
count limits the number of pages assigned<br />
to a /dev/zero mapping to 2^63. Neither<br />
one of these counts is particularly useful for<br />
a /dev/zero mapping. We eliminated this<br />
lock and obtained a dramatic performance improvement<br />
for this micro-benchmark (at 512<br />
CPUs the improvement was in excess of 800x).<br />
This optimization is important in decreasing<br />
startup time for large message-passing applications<br />
on the Altix system.<br />
A related change is to distribute the count of<br />
pages in the page cache from a single global<br />
variable to a per node variable. Because every<br />
processor in the system needs to update<br />
the page-cache count when adding or removing<br />
pages from the page cache, contention for<br />
the cache line containing this global variable<br />
becomes significant. We changed this global<br />
count to a per-node count. When a page is inserted<br />
into (or removed from) the page cache,<br />
we update the page-cache count on the same<br />
node as the page itself. When we need the<br />
total number of pages in the page cache (for<br />
example if someone reads /proc/meminfo)<br />
we run a loop that sums the per node counts.<br />
However, since the latter operation is much less<br />
frequent than insertions and deletions from the<br />
page cache, this optimization is an overall performance<br />
improvement.<br />
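The shape of that change is sketched below with invented names; the real counters hook into the page-cache insertion and removal paths (page_to_nid() is the 2.6 helper for a page's node):<br />
```c
/* Sketch of the distributed page-cache page count (names invented).
 * Insert/remove touch only the counter on the page's own node; the
 * total is summed only for infrequent readers such as /proc/meminfo.
 */
static struct {
	atomic_t pages;
} ____cacheline_aligned node_pgcache[MAX_NUMNODES];

static inline void page_cache_count_inc(struct page *page)
{
	atomic_inc(&node_pgcache[page_to_nid(page)].pages);
}

static inline void page_cache_count_dec(struct page *page)
{
	atomic_dec(&node_pgcache[page_to_nid(page)].pages);
}

static unsigned long page_cache_count_total(void)
{
	unsigned long total = 0;
	int nid;

	for (nid = 0; nid < MAX_NUMNODES; nid++)
		total += atomic_read(&node_pgcache[nid].pages);
	return total;
}
```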
Another change we have made in the VM<br />
subsystem is in the out-of-memory (OOM)<br />
killer for Altix. In <strong>Linux</strong> 2.4.21, the<br />
OOM killer is called from the top of the<br />
memory-free and swap-out call chain. oom_<br />
kill() is called from try_to_free_<br />
pages_zone() when calls to shrink_<br />
caches() at memory priority levels 6<br />
through 0 have all failed. Inside oom_kill()<br />
a number of checks are performed, and if any<br />
of these checks succeed, the system is declared<br />
to not be out-of-memory. <strong>One</strong> of those checks<br />
is “if it has been more than 5 seconds since<br />
oom_kill() was last called, then we are not
OOM.” On a large-memory Altix system, it can<br />
easily take much longer than that to complete<br />
the necessary calls to shrink_caches().<br />
<strong>The</strong> result is that an Altix system never goes<br />
OOM in spite of the fact that swap space is full<br />
and there is no memory to be allocated.<br />
It seemed to us that part of the problem here<br />
is the amount of time it can take for a swap<br />
full condition (readily detectable in try_<br />
to_swap_out() to bubble all the way up<br />
to the top level in try_to_free_pages_<br />
zone(), especially on a large memory machine.<br />
To solve this problem on Altix, we<br />
decided to drive the OOM killer directly off<br />
of detection of swap-space-full condition provided<br />
that the system also continues to try to<br />
swap out additional pages. A count of the<br />
number of successful swaps and unsuccessful<br />
swap attempts is maintained in try_to_<br />
swap_out(). If, in a 10 second interval, the<br />
number of successful swap outs is less than<br />
one percent of the number of attempted swap<br />
outs, and the total number of swap out attempts<br />
exceeds a specified threshold, then try_to_<br />
swap_out() will directly wake the OOM<br />
killer thread (also new in our implementation).<br />
This thread will wait another 10 seconds, and<br />
if the out-of-swap condition persists, it will invoke<br />
oom_kill() to select a victim and kill<br />
it. <strong>The</strong> OOM killer thread will repeat this sleep<br />
and kill cycle until it appears that swap space<br />
is no longer full or the number of attempts to<br />
swap out new pages (since the thread went to<br />
sleep) falls below the threshold.<br />
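Schematically, the heuristic and the killer thread look like the sketch below; thresholds, window handling, and helper names are illustrative, not the shipped 2.4.21 code:<br />
```c
/* Schematic version of the swap-full OOM heuristic.  The counters are
 * maintained in try_to_swap_out(); swap_attempt_threshold and the
 * wakeup helper are invented names.
 */
static unsigned long swaps_ok, swaps_tried;	/* reset every 10 s window */

static int swap_looks_full(void)
{
	/* "full": plenty of attempts this window, but fewer than one
	 * percent of them actually swapped a page out */
	return swaps_tried > swap_attempt_threshold &&
	       swaps_ok * 100 < swaps_tried;
}

static void oom_sleep(int seconds)
{
	set_current_state(TASK_INTERRUPTIBLE);
	schedule_timeout(seconds * HZ);
}

static int oom_killer_thread(void *unused)
{
	for (;;) {
		wait_for_wakeup_from_try_to_swap_out();	/* illustrative */
		oom_sleep(10);		/* give swap a chance to recover */
		while (swap_looks_full()) {
			oom_kill();	/* pick a victim and kill it */
			oom_sleep(10);
		}
	}
	return 0;
}
```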
In our experience, this has made invocation of<br />
the OOM killer much more reliable than it was<br />
before, at least on Altix. Once again, this implementation<br />
was for <strong>Linux</strong> 2.4.21; we are in<br />
the process of evaluating this problem and the<br />
associated fix on <strong>Linux</strong> 2.6 at the present time.<br />
Another fix we have made to the VM system<br />
in <strong>Linux</strong> 2.4.21 for Altix is in handling<br />
of HUGETLB pages. <strong>The</strong> existing implementation<br />
in <strong>Linux</strong> 2.4.21 allocates HUGETLB<br />
pages to an address space at mmap() time (see<br />
hugetlb_prefault()); it also zeroes the<br />
pages at this time. This processing is done by<br />
the thread that makes the mmap() call. In<br />
particular, this means that zeroing of the allocated<br />
HUGETLB pages is done by a single<br />
processor. On a machine with 4 TB of<br />
memory and with as much memory allocated<br />
to HUGETLB pages as possible, our measurements<br />
have shown that it can take as long as<br />
5,000 seconds to allocate and zero all available<br />
HUGETLB pages. Worse yet, the thread that<br />
does this operation holds the address space’s<br />
mmap_sem and the page_table_lock for<br />
the entire 5,000 seconds. Unfortunately, many<br />
commands that query system state (such as ps<br />
and w) also wish to acquire one of these locks.<br />
<strong>The</strong> result is that the system appears to be hung<br />
for the entire 5,000 seconds.<br />
We solved this problem on Altix by changing<br />
the implementation of HUGETLB page allocation<br />
from prefault to allocate on fault. Many<br />
others have created similar patches; our patch<br />
was unique in that it also allowed zeroing of<br />
pages to occur in parallel if the HUGETLB<br />
page faults occurred on different processors.<br />
This was crucial to allow a large HUGETLB<br />
page region to be faulted into an address space<br />
in parallel, using as many processors as possible.<br />
For example, we have observed speedups<br />
of 25x using 16 processors to touch O(100 GB)<br />
of HUGETLB pages. (<strong>The</strong> speedup is superlinear<br />
because if you use just one processor<br />
it has to zero many remote pages, whereas if<br />
you use more processors, at least some of the<br />
pages you are zeroing are local or on nearby<br />
nodes.) Assuming we can achieve the same<br />
kind of speedup on a 4 TB system, we would<br />
reduce the 5,000 second time stated above to<br />
200 seconds.<br />
Recently, we have worked with Kenneth Chen
to get a similar set of changes proposed for<br />
<strong>Linux</strong> 2.6 [3]. Once this set of changes is accepted<br />
into the mainline this particular problem<br />
will be solved for <strong>Linux</strong> 2.6. <strong>The</strong>se changes are<br />
also necessary for Andi Kleen’s NUMA placement<br />
algorithms [4] to apply to HUGETLB<br />
pages, since otherwise pages are placed at<br />
hugetlb_prefault() time.<br />
A final set of changes is related to large kernel<br />
tables. As previously mentioned, on an Altix<br />
system with 512 processors, less than 0.4% of<br />
the available memory is local. Certain tables in<br />
the <strong>Linux</strong> kernel are sized to be on the order of<br />
one percent of available memory. (An example<br />
of this is the TCP/IP hash table.) Allocating<br />
a table of this size can use all of the local<br />
memory on a node, resulting in exactly the kind<br />
of storage-allocation imbalance we developed<br />
the page-cache changes to solve. To avoid this<br />
problem, we also implement round-robin allocation<br />
of these large tables. Our current technique<br />
uses vm_alloc() to do this. Unfortunately,<br />
this is not portable across all architectures,<br />
since certain architectures have limited<br />
amounts of space that can be allocated by<br />
vmalloc(). Nonetheless, this is a change<br />
that we need to make; we are still exploring<br />
ways of making this change acceptable to the<br />
<strong>Linux</strong> community.<br />
Once we have solved the initial allocation<br />
problem for these tables, there is still the problem<br />
of getting them appropriately sized for an<br />
Altix system. Clearly if there are 4 TB of main<br />
memory, it does not make much sense to allocate<br />
a TCP/IP hash table of 40 GB, particularly<br />
since the TCP/IP traffic into an Altix system<br />
does not increase with memory size the way<br />
one might expect it to scale with a traditional<br />
<strong>Linux</strong> server. We have seen cases where system<br />
performance is significantly hampered due<br />
to lookups in these overly large tables. At the<br />
moment, we are still exploring a solution acceptable<br />
to the community to solve this particular<br />
problem.<br />
I/O Changes for Altix<br />
<strong>One</strong> of the design goals for the Altix system<br />
is that it support standard PCI devices and<br />
their associated <strong>Linux</strong> drivers as much as possible.<br />
In this section we discuss the performance<br />
improvements built into the Altix hardware<br />
and supported through new driver interfaces<br />
in <strong>Linux</strong> that help us to meet this goal<br />
with excellent performance even on very large<br />
Altix systems.<br />
According to the PCI specification, DMA<br />
writes and PIO read responses are strongly ordered.<br />
On large NUMA systems, however,<br />
DMA writes can take a long time to complete.<br />
Since most PIO reads do not imply completion<br />
of a previous DMA write, relaxing the ordering<br />
rules of DMA writes and PIO read responses<br />
can greatly improve system performance.<br />
Another large system issue relates to initiating<br />
PIO writes from multiple CPUs. PIO writes<br />
from two different CPUs may arrive out of order<br />
at a device. <strong>The</strong> usual way to ensure ordering<br />
is through a combination of locking and a<br />
PIO read (see Documentation/io_ordering.txt).<br />
On large systems, however, doing this read can<br />
be very expensive, particularly if it must be ordered<br />
with respect to unrelated DMA writes.<br />
Finally, the NUMA nature of large machines<br />
makes some optimizations obvious and desirable.<br />
Many devices use so-called consistent<br />
system memory for retrieving commands<br />
and storing status information; allocating that<br />
memory close to its associated device makes<br />
sense.<br />
Making non-dependent PIO reads fast<br />
In its I/O chipsets, SGI chose to relax the ordering<br />
between DMAs and PIOs, instead adding
a barrier attribute to certain DMA writes (to<br />
consistent PCI allocations on Altix) and to interrupts.<br />
This works well with controllers that<br />
use DMA writes to indicate command completions<br />
(for example a SCSI controller with a<br />
response queue, where the response queue is<br />
allocated using pci_alloc_consistent,<br />
so that writes to the response queue have the<br />
barrier attribute). When we ported <strong>Linux</strong> to<br />
Altix, this behavior became a problem, because<br />
many <strong>Linux</strong> PCI drivers use PIO read responses<br />
to imply the status of a DMA write. For<br />
example, on an IDE controller, a bit status register<br />
read is performed to find out if a command<br />
is complete (command complete status implies<br />
that DMA writes of that command’s data are<br />
completed). As a result, SGI had to implement<br />
a rather heavyweight mechanism to guarantee<br />
ordering of DMA writes and PIO reads. This<br />
mechanism involves doing an explicit flush of<br />
DMA write data after each PIO read.<br />
For the cases in which strong ordering of PIO<br />
read responses and DMA writes is not necessary,<br />
a new API was needed so that drivers<br />
could communicate that a given PIO read response<br />
could use relaxed ordering with respect<br />
to prior DMA writes. <strong>The</strong> read_<br />
relaxed API [8] was added early in the 2.6<br />
series for this purpose, and mirrors the normal<br />
read routines, which have variants for various<br />
sized reads.<br />
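For example, a driver might keep the ordinary read where it really does imply DMA completion and use the relaxed variant elsewhere (the device and registers below are invented; only readl() and readl_relaxed() are real interfaces):<br />
```c
#include <asm/io.h>

#define MYDEV_STATUS	0x10	/* invented register offset */
#define MYDEV_TX_BUSY	0x01	/* invented status bits */
#define MYDEV_CMD_DONE	0x02

static int mydev_tx_ring_busy(void __iomem *regs)
{
	/* nothing is inferred about DMA data here, so the relaxed read
	 * avoids the expensive DMA flush an ordinary readl() forces */
	return readl_relaxed(regs + MYDEV_STATUS) & MYDEV_TX_BUSY;
}

static int mydev_command_done(void __iomem *regs)
{
	/* this read *is* used to conclude that the command's DMA data
	 * has landed in memory, so keep the strongly ordered variant */
	return readl(regs + MYDEV_STATUS) & MYDEV_CMD_DONE;
}
```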
<strong>The</strong> results below show how expensive a normal<br />
PIO read transaction can be, especially on<br />
a system doing a lot of I/O (and thus DMA).<br />
Type of PIO Time (ns)<br />
normal PIO read 3875<br />
relaxed PIO read 1299<br />
Table 1: Normal vs. relaxed PIO reads on an<br />
idle system<br />
Type of PIO Time (ns)<br />
normal PIO read 4889<br />
relaxed PIO read 1646<br />
Table 2: Normal vs. relaxed PIO reads on a<br />
busy system<br />
It remains to be seen whether this API will also<br />
apply to the newly added RO bit in the PCI-X<br />
specification—the author is hopeful! Either<br />
way, it does give hardware vendors who want<br />
to support <strong>Linux</strong> some additional flexibility in<br />
their design.<br />
Ordering posted writes efficiently<br />
On many platforms, PIO writes from different<br />
CPUs will not necessarily arrive in order (i.e.,<br />
they may be intermixed) even when locking is<br />
used. Since the platform has no way of knowing<br />
whether a given PIO read depends on preceding<br />
writes, it has to guarantee that all writes<br />
have completed before allowing a read transaction<br />
to complete. So performing a read prior<br />
to releasing a lock protecting a region doing<br />
writes is sufficient to guarantee that the writes<br />
arrive in the correct order.<br />
However, performing PIO reads can be an expensive<br />
operation, especially if the device is on<br />
a distant node. SGI chipset designers foresaw<br />
this problem, however, and provided a way to<br />
ensure ordering by simply reading a register<br />
from the chipset on the local node. When the<br />
register indicates that all PIO writes are complete,<br />
it means they have arrived at the chipset<br />
attached to the device, and so are guaranteed<br />
to arrive at the device in the intended order.<br />
<strong>The</strong> SGI sn2 specific portion of the <strong>Linux</strong> ia64<br />
port (sn2 is the architecture name for Altix in<br />
the <strong>Linux</strong> kernel source tree) provides a small<br />
function, sn_mmiob() (for memory–mapped<br />
I/O barrier, analogous to the mb() macro), to<br />
do just that. It can be used in place of reads<br />
that are intended to deal with posted writes and<br />
provides some benefit:
Type of flush Time (ns)<br />
regular PIO read 5940<br />
relaxed PIO read 2619<br />
sn_mmiob() 1610<br />
(local chipset read alone) 399<br />
Table 3: Normal vs. fast flushing of 5 PIO<br />
writes<br />
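As an illustration (device, registers, and lock invented; sn_mmiob() is the real sn2 helper), the usual flushing read before unlock can be replaced as follows:<br />
```c
struct mydev {			/* invented device structure */
	spinlock_t reg_lock;
	void __iomem *regs;
};

#define MYDEV_RING_HEAD	0x00	/* invented register offsets */
#define MYDEV_DOORBELL	0x04
#define MYDEV_GO	0x01

static void mydev_kick(struct mydev *dev, u32 head)
{
	unsigned long flags;

	spin_lock_irqsave(&dev->reg_lock, flags);
	writel(head, dev->regs + MYDEV_RING_HEAD);
	writel(MYDEV_GO, dev->regs + MYDEV_DOORBELL);
	/* ensure this CPU's posted writes have reached the device-side
	 * chipset before another CPU can start its own sequence; cheaper
	 * than a flushing readl() of a possibly distant register */
	sn_mmiob();
	spin_unlock_irqrestore(&dev->reg_lock, flags);
}
```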
Adding this API to <strong>Linux</strong> (i.e., making it non-sn2-specific)<br />
was discussed some time ago [9],<br />
and may need to be raised again, since it does<br />
appear to be useful on Altix, and is probably<br />
similarly useful on other platforms.<br />
Local allocation of consistent DMA mappings<br />
Consistent DMA mappings are used frequently<br />
by drivers to store command and status buffers.<br />
<strong>The</strong>y are frequently read and written by the<br />
device that owns them, so making sure they<br />
can be accessed quickly is important. <strong>The</strong> table<br />
below shows the difference in the number<br />
of operations per second that can be<br />
achieved using local versus remote allocation<br />
of consistent DMA buffers. Local allocations<br />
were guaranteed by changing the pci_<br />
alloc_consistent function so that it calls<br />
alloc_pages_node using the node closest<br />
to the PCI device in question.<br />
Type I/Os per second<br />
Local consistent buffer 46231<br />
Remote consistent buffer 41295<br />
Table 4: Local vs. remote DMA buffer allocation<br />
Although this change is platform specific, it<br />
can be made generic if a pci_to_node or<br />
pci_to_nodemask routine is added to the<br />
<strong>Linux</strong> topology API.<br />
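A sketch of the modified allocation is shown below; pci_to_node() is the proposed topology helper mentioned above rather than an existing interface, and the bus-address translation is only schematic:<br />
```c
/* Sketch only: node-local allocation of a consistent DMA buffer.
 * pci_to_node() is the *proposed* topology helper, not an existing
 * API; the dma_handle translation is schematic.
 */
void *pci_alloc_consistent(struct pci_dev *hwdev, size_t size,
			   dma_addr_t *dma_handle)
{
	int node = hwdev ? pci_to_node(hwdev) : numa_node_id();
	struct page *page;

	page = alloc_pages_node(node, GFP_ATOMIC, get_order(size));
	if (!page)
		return NULL;

	/* platform-specific in reality: map the pages for DMA and report
	 * the bus address the device should use */
	*dma_handle = virt_to_bus(page_address(page));
	return page_address(page);
}
```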
Concluding Remarks<br />
Today, our <strong>Linux</strong> 2.4.21 kernel for Altix provides<br />
a productive platform for our highperformance-computing<br />
users who desire to<br />
exploit the features of the SGI Altix 3000 hardware.<br />
To achieve this goal, we have made a<br />
number of changes to our <strong>Linux</strong> for Altix kernel.<br />
We are now in the process of either moving<br />
those changes forward to <strong>Linux</strong> 2.6 for Altix,<br />
or of evaluating the <strong>Linux</strong> 2.6 kernel on Altix<br />
in order to determine if these changes are indeed<br />
needed at all. Our goal is to develop a<br />
version of the <strong>Linux</strong> 2.6 kernel for Altix that<br />
not only supports our HPC customers as<br />
well as our existing <strong>Linux</strong> 2.4.21 kernel does, but<br />
also consists as much as possible of community-<br />
supported code.<br />
References<br />
[1] Ray Bryant and John Hawkes, <strong>Linux</strong><br />
Scalability for Large NUMA Systems,<br />
Proceedings of the 2003 Ottawa <strong>Linux</strong><br />
Symposium, Ottawa, Ontario, Canada,<br />
(July 2003).<br />
[2] Daniel Lenoski, James Laudon, Truman<br />
Joe, David Nakahira, Luis Stevens,<br />
Anoop Gupta, and John Hennesy, <strong>The</strong><br />
DASH prototype: Logic overhead and<br />
performance, IEEE Transactions on<br />
Parallel and Distributed Systems,<br />
4(1):41-61, January 1993.<br />
[3] Kenneth Chen, “hugetlb demand paging<br />
patch part [0/3],”<br />
linux-kernel@vger.kernel.org,<br />
2004-04-13 23:17:04,<br />
http://marc.theaimsgroup.<br />
com/?l=linux-kernel&m=<br />
108189860419356&w=2<br />
[4] Andi Kleen, “Patch: NUMA API for<br />
<strong>Linux</strong>,” linux-kernel@vger.kernel.org,
Tue, 6 Apr 2004 15:33:22 +0200,<br />
http:<br />
//lwn.net/Articles/79100/<br />
[5] http://www.openmp.org<br />
[6] Nick Piggin, “MM patches,”<br />
http://www.kerneltrap.org/<br />
~npiggin/nickvm-267r1m1.gz<br />
[7] http://www.spec.org/omp/<br />
results/ompl2001.html<br />
[8] http://linux.bkbits.net:<br />
8080/linux-2.5/cset%<br />
4040213ca0d3eIznHTPAR_<br />
kLCsMZI9VQ?nav=index.html|<br />
ChangeSet@-1d<br />
[9] http://www.cs.helsinki.fi/<br />
linux/linux-kernel/2002-01/<br />
1540.html<br />
© 2004 Silicon Graphics, Inc. Permission to redistribute<br />
in accordance with Ottawa <strong>Linux</strong> Symposium<br />
paper submission guidelines is granted; all<br />
other rights reserved. Silicon Graphics, SGI and<br />
Altix are registered trademarks and OpenMP is a<br />
trademark of Silicon Graphics, Inc., in the U.S.<br />
and/or other countries worldwide. <strong>Linux</strong> is a registered<br />
trademark of Linus Torvalds in several countries.<br />
Intel and Itanium are trademarks or registered<br />
trademarks of Intel Corporation or its subsidiaries<br />
in the United States and other countries. Red Hat<br />
and all Red Hat-based trademarks are trademarks<br />
or registered trademarks of Red Hat, Inc. in the<br />
United States and other countries. All other trademarks<br />
mentioned herein are the property of their<br />
respective owners.
Get More Device Drivers out of the <strong>Kernel</strong>!<br />
Peter Chubb ∗<br />
National ICT Australia<br />
and<br />
<strong>The</strong> University of New South Wales<br />
peterc@gelato.unsw.edu.au<br />
Abstract<br />
Now that <strong>Linux</strong> has fast system calls, good<br />
(and getting better) threading, and cheap context<br />
switches, it’s possible to write device<br />
drivers that live in user space for whole new<br />
classes of devices. Of course, some device<br />
drivers (Xfree, in particular) have always run<br />
in user space, with a little bit of kernel support.<br />
With a little bit more kernel support (a way to<br />
set up and tear down DMA safely, and a generalised<br />
way to be informed of and control interrupts)<br />
almost any PCI bus-mastering device<br />
could have a user-mode device driver.<br />
I shall talk about the benefits and drawbacks<br />
of device drivers being in user space or kernel<br />
space, and show that performance concerns<br />
are not really an issue—in fact, on some platforms,<br />
our user-mode IDE driver out-performs<br />
the in-kernel one. I shall also present profiling<br />
and benchmark results that show where time is<br />
spent in in-kernel and user-space drivers, and<br />
describe the infrastructure I’ve added to the<br />
<strong>Linux</strong> kernel to allow portable, efficient userspace<br />
drivers to be written.<br />
∗ This work was funded by HP, National ICT Australia,<br />
the ARC, and the University of NSW through the<br />
Gelato programme (http://www.gelato.unsw.<br />
edu.au)<br />
1 Introduction<br />
Normal device drivers in <strong>Linux</strong> run in the kernel’s<br />
address space with kernel privilege. This<br />
is not the only place they can run—see Figure<br />
1.<br />
Figure 1: Where a Device Driver can Live (diagram plotting address space (kernel, own process, or client process) against privilege (kernel or user), with points A to D marking the combinations discussed below)
Point A is the normal <strong>Linux</strong> device driver,<br />
linked with the kernel, running in the kernel<br />
address space with kernel privilege.<br />
Device drivers can also be linked directly with<br />
the applications that use them (Point B)—<br />
the so-called ‘in-process’ device drivers proposed<br />
by [Keedy, 1979]—or run in a separate<br />
process, and be talked to by an IPC mechanism<br />
(for example, an X server, point D).<br />
<strong>The</strong>y can also run with kernel privilege, but<br />
with a separate kernel address space (Point
C) (as in the Nooks system described by<br />
[Swift et al., 2002]).<br />
2 Motivation<br />
Traditionally, device drivers have been developed<br />
as part of the kernel source. As such, they<br />
have to be written in the C language, and they<br />
have to conform to the (rapidly changing) interfaces<br />
and conventions used by kernel code.<br />
Even though drivers can be written as modules (obviating the need to reboot to try out a new version of the driver, except that many drivers currently cannot be unloaded), in-kernel driver
code has access to all of kernel memory, and<br />
runs with privileges that give it access to all instructions<br />
(not just unprivileged ones) and to<br />
all I/O space. As such, bugs in drivers can easily<br />
cause kernel lockups or panics. And various<br />
studies (e.g., [Chou et al., 2001]) estimate that<br />
more than 85% of the bugs in an operating system<br />
are driver bugs.<br />
Device drivers that run as user code, however,<br />
can use any language, can be developed<br />
using any IDE, and can use whatever internal<br />
threading, memory management, etc., techniques<br />
are most appropriate. When the infrastructure<br />
for supporting user-mode drivers is adequate,<br />
the processes implementing the driver<br />
can be killed and restarted almost with impunity<br />
as far as the rest of the operating system<br />
goes.<br />
Drivers that run in the kernel have to be updated<br />
regularly to match in-kernel interface<br />
changes. Third party drivers are therefore usually<br />
shipped as source code (or with a compilable<br />
stub encapsulating the interface) that has<br />
to be compiled against the kernel the driver is<br />
to be installed into.<br />
This means that everyone who wants to run a<br />
third-party driver also has to have a toolchain<br />
and kernel source on his or her system, or obtain<br />
a binary for their own kernel from a trusted<br />
third party.<br />
Drivers for uncommon devices (or devices that<br />
the mainline kernel developers do not use regularly)<br />
tend to lag behind. For example, in the<br />
2.6.6 kernel, there are 81 drivers known to be<br />
broken because they have not been updated to<br />
match the current APIs, and a number more<br />
that are still using APIs that have been deprecated.<br />
User/kernel interfaces tend to change much<br />
more slowly than in-kernel ones; thus a<br />
user-mode driver has much more chance of<br />
not needing to be changed when the kernel<br />
changes. Moreover, user mode drivers can be distributed under licences other than the GPL, which may make them more attractive to some people (for example, the ongoing problems with the Nvidia graphics card driver could possibly be avoided).
User-mode drivers can be either closely or<br />
loosely coupled with the applications that use<br />
them. Two obvious examples are the X server<br />
(XFree86) which uses a socket to communicate<br />
with its clients and so has isolation from kernel<br />
and client address spaces and can be very<br />
complex; and the Myrinet drivers, which are<br />
usually linked into their clients to gain performance<br />
by eliminating context switch overhead<br />
on packet reception.<br />
<strong>The</strong> Nooks work [Swift et al., 2002] showed<br />
that by isolating drivers from the kernel address<br />
space, the most common programming<br />
errors could be made recoverable. In Nooks,<br />
drivers are insulated from the rest of the kernel<br />
by running each in a separate address space,<br />
and replacing the driver ↔ kernel interface<br />
with a new one that uses cross-domain procedure<br />
calls to replace any procedure calls in<br />
the ABI, and that creates shadow copies of any shared variables in the protected address space of the driver.
This approach provides isolation, but also has<br />
problems: as the driver model changes, there<br />
is quite a lot of wrapper code that has to be<br />
changed to accommodate the changed APIs.<br />
Also, the value of any shared variable is frozen<br />
for the duration of a driver ABI call. <strong>The</strong><br />
Nooks work is uniprocessor only; locking issues<br />
therefore have not yet been addressed.<br />
Windriver [Jungo, 2003] allows development<br />
of user mode device drivers. It loads a proprietary<br />
device module /dev/windrv6; user<br />
code can interact with this device to setup and<br />
teardown DMA, catch interrupts, etc.<br />
Even from user space, of course, it is possible<br />
to make your machine unusable. Device<br />
drivers have to be trusted to a certain extent to<br />
do what they are advertised to do; this means<br />
that they can program their devices, and possibly<br />
corrupt or spy on the data that they transfer<br />
between their devices and their clients. Moving<br />
a driver to user space does not change this.<br />
It does however make it less likely that a fault<br />
in a driver will affect anything other than its<br />
clients.
3 Existing Support<br />
<strong>Linux</strong> has good support for user-mode drivers<br />
that do not need DMA or interrupt handling—<br />
see, e.g., [Nakatani, 2002].<br />
The ioperm() and iopl() system calls allow access to the first 65536 I/O ports; and, with a patch from Albert Calahan (http://lkml.org/lkml/2003/7/13/258), one can map the appropriate parts of /proc/bus/pci/... to gain access to memory-mapped registers. Or on some architectures it is safe to mmap() /dev/mem.
It is usually best to use MMIO if it is available, because on many 64-bit platforms there are more than 65536 ports (the PCI specification says that there are 2^32 ports available), and on many architectures the ports are emulated by mapping memory anyway.
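For example, a user-space driver can reach a device's memory-mapped registers roughly as in the sketch below. The physical BAR address and length would come from the PCI configuration space (e.g. via libpci), and whether the resulting mapping is uncached depends on the architecture:

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

volatile uint32_t *map_registers(off_t bar_phys, size_t bar_len)
{
        int fd = open("/dev/mem", O_RDWR | O_SYNC);
        void *p;

        if (fd < 0)
                return NULL;
        p = mmap(NULL, bar_len, PROT_READ | PROT_WRITE, MAP_SHARED,
                 fd, bar_phys);
        close(fd);                     /* the mapping survives the close */
        return (p == MAP_FAILED) ? NULL : (volatile uint32_t *)p;
}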
For particular devices—USB input devices,<br />
SCSI devices, devices that hang off the parallel<br />
port, and video drivers such as XFree86—<br />
there is explicit kernel support. By opening a<br />
file in /dev, a user-mode driver can talk through<br />
the USB hub, SCSI controller, AGP controller,<br />
etc., to the device. In addition, the input handler<br />
allows input events to be queued back into<br />
the kernel, to allow normal event handling to<br />
proceed.<br />
libpci allows access to the PCI configuration<br />
space, so that a driver can determine what interrupt,<br />
IO ports and memory locations are being<br />
used (and to determine whether the device<br />
is present or not).<br />
Other recent changes—an improved scheduler,<br />
better and faster thread creation and synchronisation,<br />
a fully preemptive kernel, and faster<br />
system calls—mean that it is possible to write<br />
a driver that operates in user space that is almost<br />
as fast as an in-kernel driver.<br />
4 Implementing the Missing Bits<br />
<strong>The</strong> parts that are missing are:<br />
1. the ability to claim a device from user<br />
space so that other drivers do not try to<br />
handle it;<br />
2. <strong>The</strong> ability to deliver an interrupt from a<br />
device to user space,<br />
3. <strong>The</strong> ability to set up and tear-down DMA<br />
between a device and some process’s<br />
memory, and
4. the ability to loop a device driver’s control<br />
and data interfaces into the appropriate<br />
part of the kernel (so that, for example,<br />
an IDE driver can appear as a standard<br />
block device), preferably without having<br />
to copy any payload data.<br />
<strong>The</strong> work at UNSW covers only PCI devices,<br />
as that is the only bus available on all of the<br />
architectures we have access to (IA64, X86,<br />
MIPS, PPC, alpha and arm).<br />
4.1 PCI interface<br />
Each device should have only a single driver.<br />
<strong>The</strong>refore one needs a way to associate a driver<br />
with a device, and to remove that association<br />
automatically when the driver exits. This has<br />
to be implemented in the kernel, as it is only<br />
the kernel that can be relied upon to clean up<br />
after a failed process. <strong>The</strong> simplest way to<br />
keep the association and to clean it up in <strong>Linux</strong><br />
is to implement a new filesystem, using the<br />
PCI namespace. Open files are automatically<br />
closed when a process exits, so cleanup also<br />
happens automatically.<br />
A new system call, usr_pci_open(int<br />
bus, int slot, int fn) returns a file<br />
descriptor. Internally, it calls pci_enable_<br />
device() and pci_set_master() to set<br />
up the PCI device after doing the standard<br />
filesystem boilerplate to set up a vnode and a<br />
struct file.<br />
Attempts to open an already-opened PCI device<br />
will fail with -EBUSY.<br />
When the file descriptor is finally closed, the<br />
PCI device is released, and any DMA mappings<br />
removed. All files are closed when a process<br />
dies, so if there is a bug in the driver that<br />
causes it to crash, the system recovers ready for<br />
the driver to be restarted.<br />
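A sketch of the corresponding driver code: usr_pci_open() is the new system call described above, but since it is not in mainline there is no glibc wrapper, so the prototype below is assumed to come from the patch, and the bus/slot/function numbers would normally come from libpci.

extern int usr_pci_open(int bus, int slot, int fn);   /* from the patch */

int claim_device(int bus, int slot, int fn)
{
        int fd = usr_pci_open(bus, slot, fn);

        if (fd < 0)
                return fd;   /* -EBUSY: another driver owns the device */

        /*
         * The device is now enabled and set up for bus mastering.
         * Closing fd (or the driver process dying) releases the device
         * and tears down any DMA mappings made through it.
         */
        return fd;
}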
4.2 DMA handling<br />
On low-end systems, it’s common for the PCI<br />
bus to be connected directly to the memory<br />
bus, so setting up a DMA transfer means<br />
merely pinning the appropriate bit of memory<br />
(so that the VM system can neither swap it out<br />
nor relocate it) and then converting virtual addresses<br />
to physical addresses.<br />
<strong>The</strong>re are, in general, two kinds of DMA, and<br />
this has to be reflected in the kernel interface:<br />
1. Bi-directional DMA, for holding scattergather<br />
lists, etc., for communication with<br />
the device. Both the CPU and the device<br />
read and write to a shared memory area.<br />
Typically such memory is uncached, and<br />
on some architectures it has to be allocated<br />
from particular physical areas. This<br />
kind of mapping is called PCI-consistent;<br />
there is an internal kernel ABI function to<br />
allocate and deallocate appropriate memory.<br />
2. Streaming DMA, where, once the device<br />
has either read or written the area, it has<br />
no further immediate use for it.<br />
I implemented a new system call, usr_pci_map(). (Although multiplexing system calls are in general deprecated in Linux, they are extremely useful while developing, because it is not necessary to change every architecture-dependent entry.S when adding new functionality.) It does one of three things:

1. Allocates an area of memory suitable for a PCI-consistent mapping, and maps it into the current process's address space; or

2. Converts a region of the current process's virtual address space into a scatterlist in terms of virtual addresses (one entry per page), pins the memory, and converts the scatterlist into a list of addresses suitable for DMA (by calling pci_map_sg(), which sets up the IOMMU if appropriate); or

3. Undoes the mapping in point 2.
<strong>The</strong> file descriptor returned from usr_pci_<br />
open() is an argument to usr_pci_<br />
map(). Mappings are tracked as part of the<br />
private data for that open file descriptor, so that<br />
they can be undone if the device is closed (or<br />
the driver dies).<br />
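The text gives the kernel-side behaviour but not the exact user-space signature of usr_pci_map(), so the following sketch of a streaming-DMA transfer invents the wrapper's arguments, the scatterlist layout, and the operation codes; only the overall map / program device / unmap flow is taken from the description above.

struct sg_entry {                 /* hypothetical scatterlist entry */
        unsigned long bus_addr;
        unsigned int  len;
};

/* hypothetical wrapper around the usr_pci_map() system call */
extern int usr_pci_map(int pci_fd, int op, void *buf, size_t len,
                       struct sg_entry *sg);

int do_transfer(int pci_fd, void *buf, size_t len, struct sg_entry *sg)
{
        /* Pin the buffer, build the scatterlist, set up the IOMMU. */
        int nents = usr_pci_map(pci_fd, MAP_STREAMING /* invented */, buf, len, sg);
        if (nents < 0)
                return nents;

        /* ... program the card with the bus addresses in sg[0..nents-1],
         *     then wait for the interrupt thread to report completion ... */

        /* Unpin the memory and drop the IOMMU entries again. */
        return usr_pci_map(pci_fd, UNMAP_STREAMING /* invented */, buf, len, sg);
}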
Underlying usr_pci_map() are the kernel<br />
routines pci_map_sg() and pci_unmap_<br />
sg(), and the kernel routine pci_alloc_<br />
consistent().<br />
Different PCI cards can address different<br />
amounts of DMA address space. In the kernel<br />
there is an interface to request that the dma addresses<br />
supplied are within the range addressable<br />
by the card. <strong>The</strong> current implementation<br />
assumes 32-bit addressing, but it would be possible<br />
to provide an interface to allow the real<br />
capabilities of the device to be communicated<br />
to the kernel.<br />
4.2.1 <strong>The</strong> IOMMU<br />
Many modern architectures have an IO memory<br />
management unit (see Figure 2), to convert<br />
from physical to I/O bus addresses—in much<br />
the same way that the processor’s MMU converts<br />
virtual to physical addresses—allowing<br />
even thirty-two bit cards to do single-cycle<br />
DMA to anywhere in the sixty-four bit memory<br />
address space.<br />
On such systems, after the memory has been<br />
pinned, the IOMMU has to be set up to translate<br />
from bus to physical addresses; and then<br />
after the DMA is complete, the translation can<br />
be removed from the IOMMU.<br />
Figure 2: The IO MMU (diagram: several devices on the PCI bus reach main memory through the IOMMU)
<strong>The</strong> processor’s MMU also protects one virtual<br />
address space from another. Currently shipping<br />
IOMMU hardware does not do this: all<br />
mappings are visible to all PCI devices, and<br />
moreover for some physical addresses on some<br />
architectures the IOMMU is bypassed.<br />
For fully secure user-space drivers, one would<br />
want this capability to be turned off, and also<br />
to be able to associate a range of PCI bus addresses<br />
with a particular card, and disallow access<br />
by that card to other addresses. Only thus<br />
could one ensure that a card could perform<br />
DMA only into memory areas explicitly allocated<br />
to it.<br />
4.3 Interrupt Handling<br />
<strong>The</strong>re are essentially two ways that interrupts<br />
can be passed to user level.<br />
<strong>The</strong>y can be mapped onto signals, and sent<br />
asynchronously, or a synchronous ‘wait-for-signal’
mechanism can be used.<br />
A signal is a good intuitive match for what an<br />
interrupt is, but has other problems:<br />
1. <strong>One</strong> is fairly restricted in what one can do<br />
in a signal handler, so a driver will usually
have to take extra context switches to respond<br />
to an interrupt (into and out of the<br />
signal handler, and then perhaps the interrupt<br />
handler thread wakes up)<br />
2. Signals can be slow to deliver on busy systems,<br />
as they require the process table to<br />
be locked. It would be possible to short<br />
circuit this to some extent.<br />
3. <strong>One</strong> needs an extra mechanism for registering<br />
interest in an interrupt, and for tearing<br />
down the registration when the driver<br />
dies.<br />
For these reasons I decided to map interrupts<br />
onto file descriptors. /proc already has a directory<br />
for each interrupt (containing a file that<br />
can be written to to adjust interrupt routing to<br />
processors); I added a new file to each such directory.<br />
Suitably privileged processes can open<br />
and read these files. <strong>The</strong> files have open-once<br />
semantics; attempts to open them while they<br />
are open return −1 with EBUSY.<br />
When an interrupt occurs, the in-kernel interrupt<br />
handler masks just that interrupt in the interrupt<br />
controller, and then does an up() operation<br />
on a semaphore (well, actually, the implementation<br />
now uses a wait queue, but the<br />
effect is the same).<br />
When a process reads from the file, the kernel
enables the interrupt, then calls down() on a<br />
semaphore, which will block until an interrupt<br />
arrives.<br />
<strong>The</strong> actual data transferred is immaterial, and<br />
in fact none ever is transferred; the read()<br />
operation is used merely as a synchronisation<br />
mechanism.<br />
poll() is also implemented, so a user process<br />
is not forced into the ‘wait for interrupt’<br />
model that we use.<br />
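An interrupt thread built on this interface would look roughly like the sketch below. The exact name of the new per-interrupt file is assumed here (only the /proc/irq/<irq>/ directory is given in the text), and handle_interrupt() stands in for the real driver code:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

extern void handle_interrupt(void);   /* the driver's own handler */

void interrupt_loop(int irq)
{
        char path[64];
        char dummy;
        int fd;

        snprintf(path, sizeof(path), "/proc/irq/%d/irq", irq);
        fd = open(path, O_RDONLY);     /* open-once: fails with EBUSY */
        if (fd < 0) {
                perror(path);
                return;
        }
        for (;;) {
                /* read() re-enables the interrupt and blocks until it
                 * fires; no data is actually transferred. */
                if (read(fd, &dummy, sizeof(dummy)) < 0)
                        break;
                handle_interrupt();    /* driver-specific processing */
        }
        close(fd);
}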
Obviously, one cannot share interrupts between<br />
devices if there is a user process involved.<br />
<strong>The</strong> in-kernel driver merely passes<br />
the interrupt onto the user-mode process; as it<br />
knows nothing about the underlying hardware,<br />
it cannot tell if the interrupt is really for this<br />
driver or not. As such it always reports the interrupt<br />
as ‘handled.’<br />
This scheme works only for level-triggered interrupts.<br />
Fortunately, all PCI interrupts are<br />
level triggered.<br />
If one really wants a signal when an interrupt<br />
happens, one can arrange for a SIGIO using<br />
fcntl().<br />
It may be possible, by more extensive rearrangement<br />
of the interrupt handling code, to<br />
delay the end-of-interrupt to the interrupt controller<br />
until the user process is ready to get an<br />
interrupt. As masking and unmasking interrupts<br />
is slow if it has to go off-chip, delaying<br />
the EOI should be significantly faster than<br />
the current code. However, interrupt delivery<br />
to userspace turns out not to be a bottleneck,<br />
so there’s not a lot of point in this optimisation<br />
(profiles show less than 0.5% of the time<br />
is spent in the kernel interrupt handler and delivery<br />
even for heavy interrupt load—around<br />
1000 cycles per interrupt).<br />
5 Driver Structure<br />
<strong>The</strong> user-mode drivers developed at UNSW are<br />
structured as a preamble, an interrupt thread,<br />
and a control thread (see Figure 3).<br />
<strong>The</strong> preamble:<br />
1. Uses libpci.a to find the device or devices<br />
it is meant to drive,<br />
2. Calls usr_pci_open() to claim the<br />
device, and<br />
3. Spawns the interrupt thread, then
Figure 3: Architecture of a User-Mode Device Driver (diagram: the client and the driver, with libpci and usrdrv, run in user space and communicate by IPC or function calls; pci_read_config(), read() on the interrupt file, and pci_map()/pci_unmap() lead into the kernel's generic IRQ handler and architecture-dependent DMA support, which uses pci_map_sg()/pci_unmap_sg())
4. Goes into a loop collecting client requests.<br />
<strong>The</strong> interrupt thread:<br />
1. Opens /proc/irq/<irq>/irq
2. Loops calling read() on the resulting<br />
file descriptor and then calling the driver<br />
proper to handle the interrupt.<br />
3. <strong>The</strong> driver handles the interrupt, calls out<br />
to the control thread(s) to say that work is<br />
completed or that there has been an error,<br />
queues any more work to the device, and<br />
then repeats from step 2.<br />
For the lowest latency, the interrupt thread can<br />
be run as a real time thread. For our benchmarks,<br />
however, this was not done.<br />
The control thread queues work to the driver then sleeps on a semaphore. When the driver, running in the interrupt thread, determines that a request is complete, it signals the semaphore so that the control thread can continue. (The semaphore is implemented as a pthreads mutex.)

The driver relies on system calls and threading, so the fast system call support now available in Linux, and the NPTL, are very important to get good performance. Each physical I/O involves at least three system calls, plus whatever is necessary for client communication: a read() on the interrupt FD, calls to set up and tear down DMA, and maybe a futex() operation to wake the client.

The system call overhead could be reduced by combining DMA setup and teardown into a single system call.
6 Looping the Drivers<br />
An operating system has two functions with regard<br />
to devices: firstly to drive them, and secondly<br />
to abstract them, so that all devices of the<br />
same class have the same interface. While a<br />
standalone user-level driver is interesting in its<br />
own right (and could be used, for example, to<br />
test hardware, or could be linked into an application<br />
that doesn’t like sharing the device with<br />
anyone), it is much more useful if the driver<br />
can be used like any other device.<br />
For the network interface, that’s easy: use<br />
the tun/tap interface and copy frames between<br />
the driver and /dev/net/tun. Having to copy<br />
slows things down; others on the team here are<br />
planning to develop a zero-copy equivalent of<br />
tun/tap.<br />
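For reference, the copying loop-back uses only the standard tun/tap interface; a minimal sketch follows, where the returned descriptor is then read and written by the user-mode driver (frame handling itself is omitted):

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/if.h>
#include <linux/if_tun.h>

int open_tap(const char *name)
{
        struct ifreq ifr;
        int fd = open("/dev/net/tun", O_RDWR);

        if (fd < 0)
                return -1;
        memset(&ifr, 0, sizeof(ifr));
        ifr.ifr_flags = IFF_TAP | IFF_NO_PI;    /* raw ethernet frames */
        strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
        if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
                close(fd);
                return -1;
        }
        /* write() pushes received frames up into the kernel stack;
         * read() returns frames the stack wants transmitted. */
        return fd;
}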
For the IDE device, there’s no standard <strong>Linux</strong><br />
way to have a user-level block device, so I implemented<br />
one. It is a filesystem that has pairs<br />
of directories: a master and a slave. When<br />
the filesystem is mounted, creating a file in the<br />
master directory creates a set of block device<br />
special files, one for each potential partition, in
the slave directory. <strong>The</strong> file in the master directory<br />
can then be used to communicate via<br />
a very simple protocol between a user level<br />
block device and the kernel’s block layer. <strong>The</strong><br />
block device special files in the slave directory<br />
can then be opened, closed, read, written or<br />
mounted, just as any other block device.<br />
<strong>The</strong> main reason for using a mounted filesystem<br />
was to allow easy use of dynamic major<br />
numbers.<br />
I didn’t bother implementing ioctl; it was not<br />
necessary for our performance tests, and when<br />
the driver runs at user level, there are cleaner<br />
ways to communicate out-of-band data with<br />
the driver, anyway.<br />
7 Results<br />
Device drivers were coded up by<br />
[Leslie and Heiser, 2003] for a CMD680<br />
IDE disc controller, and by another PhD<br />
student (Daniel Potts) for a DP83820 Gigabit<br />
ethernet controller. Daniel also designed and<br />
implemented the tuntap interface.<br />
7.1 IDE driver<br />
<strong>The</strong> disc driver was linked into a program that<br />
read 64 Megabytes of data from a Maxtor 80G<br />
disc into a buffer, using varying read sizes.<br />
Measurements were also made using <strong>Linux</strong>’s<br />
in-kernel driver, and a program that read 64M<br />
of data from the same on-disc location using<br />
O_DIRECT and the same read sizes.<br />
We also measured write performance, but the<br />
results are sufficiently similar that they are not<br />
reproduced here.<br />
At the same time as the tests, a lowpriority<br />
process attempted to increment a 64-<br />
bit counter as fast as possible. <strong>The</strong> number of<br />
increments was calibrated to processor time on<br />
an otherwise idle system; reading the counter<br />
before and after a test thus gives an indication<br />
of how much processor time is available to processes<br />
other than the test process.<br />
<strong>The</strong> initial results were disappointing; the<br />
user-mode drivers spent far too much time<br />
in the kernel. This was tracked down to<br />
kmalloc(); so the usr_pci_map() function<br />
was changed to maintain a small cache<br />
of free mapping structures instead of calling<br />
kmalloc() and kfree() each time (we<br />
could have used the slab allocator, but it’s easier<br />
to ensure that the same cache-hot descriptor<br />
is reused by coding a small cache ourselves).<br />
This resulted in the performance graphs in Figure<br />
4.<br />
<strong>The</strong> two drivers compared are the new<br />
CMD680 driver running in user space, and<br />
<strong>Linux</strong>’s in-kernel SIS680 driver. As can be<br />
seen, there is very little to choose between<br />
them.<br />
<strong>The</strong> graphs show average of ten runs; the standard<br />
deviations were calculated, but are negligible.<br />
Each transfer request takes five system calls to<br />
do, in the current design. <strong>The</strong> client queues<br />
work to the driver, which then sets up DMA for<br />
the transfer (system call one), starts the transfer,<br />
then returns to the client, which then sleeps<br />
on a semaphore (system call two). <strong>The</strong> interrupt<br />
thread has been sleeping in read(),<br />
when the controller finishes its DMA, it cause<br />
an interrupt, which wakes the interrupt thread<br />
(half of system call three). <strong>The</strong> interrupt thread<br />
then tears down the DMA (system call four),<br />
and starts any queued and waiting activity, then<br />
signals the semaphore (system call five) and<br />
goes back to read the interrupt FD again (the<br />
other half of system call three).<br />
Figure 4: Throughput and CPU usage for the user-mode IDE driver on Itanium-2, reading from a disk (throughput in MiB/s and available CPU time against transfer size, kernel read vs. user read)

When the transfer is above 128k, the IDE controller can no longer do a single DMA operation, so has to generate multiple transfers. The Linux kernel splits DMA requests above 64k, thus increasing the overhead.
<strong>The</strong> time spent in this driver is divided as<br />
shown in Figure 5.<br />
Figure 5: Timeline (in µseconds) showing hardware IRQ, scheduler latency, kernel stub, and the user-mode handler queueing new work, signalling the client, and starting DMA

7.2 Gigabit Ethernet

The Gigabit driver results are more interesting. We tested these using [ipbench, 2004] with four clients, all with pause control turned off. We ran three tests:

1. Packet receive performance, where packets were dropped and counted at the layer immediately above the driver;

2. Packet transmit performance, where packets were generated and fed to the driver; and

3. Ethernet-layer packet echoing, where the protocol layer swapped source and destination MAC-addresses, and fed received packets back into the driver.
We did not want to start comparing IP stacks,<br />
so none of these tests actually use higher level<br />
protocols.<br />
We measured three different configurations: a<br />
standalone application linked with the driver,<br />
the driver looped back into /dev/net/tap and<br />
the standard in-kernel driver, all with interrupt
holdoff set to 0, 1, or 2. (By default, the normal<br />
kernel driver sets the interrupt holdoff to 300<br />
µseconds, which led to too many packets being<br />
dropped because of FIFO overflow.) Not all
tests were run in all configurations—for example<br />
the linux in-kernel packet generator is sufficiently<br />
different from ours that no fair comparison<br />
could be made.<br />
For the tests that had the driver residing in or<br />
feeding into the kernel, we implemented a new<br />
protocol module to count and either echo or<br />
drop packets, depending on the benchmark.<br />
In all cases, we used the amount of work<br />
achieved by a low priority process to measure<br />
time available for other work while the test was<br />
going on.<br />
<strong>The</strong> throughput graphs in all cases are the<br />
same. <strong>The</strong> maximum possible speed on the<br />
wire is given for raw ethernet by 10^9 × p/(p + 38) bits per second (the parameter 38 is the
ethernet header size (14 octets), plus a 4 octet<br />
frame check sequence, plus a 7 octet preamble,<br />
plus a 1 octet start frame delimiter plus<br />
the minimum 12 octet interframe gap; p is the<br />
packet size in octets). For large packets the performance<br />
in all cases was the same as the theoretical<br />
maximum. For small packet sizes, the<br />
throughput is limited by the PCI bus; you’ll notice<br />
that the slope of the throughput curve when<br />
echoing packets is around half the slope when<br />
discarding packets, because the driver has to do<br />
twice as many DMA operations per packet.<br />
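Working the formula through: for p = 1500 octets the theoretical maximum is 10^9 × 1500/1538 ≈ 975 Mb/s, while for p = 64 octets it is only 10^9 × 64/102 ≈ 627 Mb/s, even before the PCI-bus limit described above comes into play.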
<strong>The</strong> user-mode driver (‘<strong>Linux</strong> user’ on the<br />
graph) outperforms the in-kernel driver<br />
(‘<strong>Linux</strong> orig’)—not in terms of throughput,<br />
where all the drivers perform identically, but<br />
in using much less processing time.<br />
This result was so surprising that we repeated<br />
the tests using an EEpro1000, purportedly a<br />
card with a much better driver, but saw the<br />
same effect—in fact the achieved echo performance<br />
is worse than for the in-kernel ns83820<br />
driver for some packet sizes.<br />
<strong>The</strong> reason appears to be that our driver has<br />
a fixed number of receive buffers, which are<br />
reused when the client is finished with them—<br />
they are allocated only once. This is to provide<br />
congestion control at the lowest possible<br />
level—the card drops packets when the upper<br />
layers cannot keep up.<br />
<strong>The</strong> <strong>Linux</strong> kernel drivers have an essentially<br />
unlimited supply of receive buffers. Overhead<br />
involved in allocating and setting up DMA for<br />
these buffers is excessive, and if the upper layers<br />
cannot keep up, congestion is detected and<br />
the packets dropped in the protocol layer—<br />
after significant work has been done in the<br />
driver.<br />
<strong>One</strong> sees the same problem with the user mode<br />
driver feeding the tuntap interface, as there is<br />
no feedback to throttle the driver. Of course,<br />
here there is an extra copy for each packet,<br />
which also reduces performance.<br />
7.3 Reliability and Failure Modes<br />
In general the user-mode drivers are very reliable.<br />
Bugs in the drivers that would cause<br />
the kernel to crash (for example, a null pointer<br />
reference inside an interrupt handler) cause the<br />
driver to crash, but the kernel continues. <strong>The</strong><br />
driver can then be fixed and restarted.<br />
8 Future Work<br />
<strong>The</strong> main foci of our work now lie in:<br />
1. Reducing the need for context switches<br />
and system calls by merging system calls,<br />
and by trying new driver structures.<br />
2. A zero-copy implementation of tun/tap.
Figure 6: Receive Throughput and CPU usage for Gigabit Ethernet drivers on Itanium-2 (throughput in b/s and available CPU time against packet size; theoretical maximum, kernel EEPRO1000 driver, user-mode driver at 100 µs holdoff, kernel NS83820 driver at 100 µs holdoff)

Figure 7: Transmit Throughput and CPU usage for Gigabit Ethernet drivers on Itanium-2 (theoretical maximum and the user-mode driver at 200, 100, and 0 µs interrupt holdoff)

Figure 8: MAC-layer Echo Throughput and CPU usage for Gigabit Ethernet drivers on Itanium-2 (theoretical maximum, user-mode driver, in-kernel EEPRO1000 driver, normal kernel driver, and the user-mode driver feeding /dev/tun/tap0)
3. Improving robustness and reliability of<br />
the user-mode drivers, by experimenting<br />
with the IOMMU on the ZX1 chipset of<br />
our Itanium-2 machines.<br />
4. Measuring the reliability enhancements,<br />
by using artificial fault injection to see<br />
what problems that cause the kernel to<br />
crash are recoverable in user space.<br />
5. User-mode filesystems.

In addition there are some housekeeping tasks to do before this infrastructure is ready for inclusion in a 2.7 kernel:

1. Replace the ad-hoc memory cache with a proper slab allocator.

2. Clean up the system call interface.

9 Where d’ya Get It?

Patches against the 2.6 kernel are sent to the Linux kernel mailing list, and are on http://www.gelato.unsw.edu.au/patches

Sample drivers will be made available from the same website.
10 Acknowledgements<br />
Other people on the team here did much work<br />
on the actual implementation of the user level<br />
drivers and on the benchmarking infrastructure.<br />
Prominent among them were Ben Leslie<br />
(IDE driver, port of our dp83820 into the kernel),<br />
Daniel Potts (DP83820 driver, tuntap interface),<br />
and Luke McPherson and Ian Wienand<br />
(IPbench).
References<br />
[Chou et al., 2001] Chou, A., Yang, J., Chelf,<br />
B., Hallem, S., and Engler, D. R. (2001).<br />
An empirical study of operating systems<br />
errors. In Symposium on Operating<br />
Systems Principles, pages 73–88.<br />
http://citeseer.nj.nec.com/<br />
article/chou01empirical.html.<br />
[ipbench, 2004] ipbench (2004). ipbench — a<br />
distributed framework for network<br />
benchmarking.<br />
http://ipbench.sf.net/.<br />
[Jungo, 2003] Jungo (2003). Windriver.<br />
http://www.jungo.com/<br />
windriver.html.<br />
[Keedy, 1979] Keedy, J. L. (1979). A<br />
comparison of two process structuring<br />
models. MONADS Report 4, Dept.<br />
Computer Science, Monash University.<br />
[Leslie and Heiser, 2003] Leslie, B. and<br />
Heiser, G. (2003). Towards untrusted<br />
device drivers. Technical Report<br />
UNSW-CSE-TR-0303, Operating Systems<br />
and Distributed Systems Group, School of<br />
Computer Science and Engineering, <strong>The</strong><br />
University of NSW. CSE techreports<br />
website,<br />
ftp://ftp.cse.unsw.edu.au/<br />
pub/doc/papers/UNSW/0303.pdf.<br />
[Nakatani, 2002] Nakatani, B. (2002).<br />
ELJOnline: User mode drivers.<br />
http://www.linuxdevices.com/<br />
articles/AT5731658926.html.<br />
[Swift et al., 2002] Swift, M., Martin, S.,<br />
Levy, H. M., and Eggers, S. J. (2002).
Nooks: an architecture for reliable device<br />
drivers. In Proceedings of the Tenth ACM<br />
SIGOPS European Workshop,<br />
Saint-Emilion, France.
Big Servers—2.6 compared to 2.4<br />
Wim A. Coekaerts<br />
Oracle Corporation<br />
wim.coekaerts@oracle.com<br />
Abstract<br />
<strong>Linux</strong> 2.4 has been around in production environments<br />
at companies for a few years now,<br />
we have been able to gather some good data<br />
on how well (or not) things scale up. Number<br />
of CPU’s, amount of memory, number of processes,<br />
IO throughput, etc.<br />
Most of the deployments in production today,<br />
are on relatively small systems, 4- to 8-ways,<br />
8–16GB of memory, in a few cases 32GB.<br />
<strong>The</strong> architecture of choice has also been IA32.<br />
64-bit systems are picking up in popularity<br />
rapidly, however.<br />
Now with 2.6, a lot of the barriers are supposed<br />
to be gone. So, have they really? How much<br />
memory can be used now, how is cpu scaling<br />
these days, how good is IO throughput with<br />
multiple controllers in 2.6.<br />
A lot of people have the assumption that 2.6<br />
resolves all of this. We will go into detail on<br />
what we have found out, what we have tested<br />
and some of the conclusions on how good the<br />
move to 2.6 will really be.<br />
1 Introduction<br />
<strong>The</strong> comparison between the 2.4 and 2.6 kernel<br />
trees are not solely based on performance.<br />
A large part of the testsuites are performance<br />
benchmarks however, as you will see, they<br />
have been used to also measure stability. <strong>The</strong>re<br />
are a number of features added which improve<br />
stability of the kernel under heavy workloads.<br />
<strong>The</strong> goal of comparing the two kernel releases<br />
was more to show how well the 2.6 kernel will<br />
be able to hold up in a real world production<br />
environment. Many companies which have deployed<br />
<strong>Linux</strong> over the last two years are looking<br />
forward to rolling out 2.6 and it is important<br />
to show the benefits of doing such a move.<br />
It will take a few releases before the required<br />
stability is there however it’s clear so far that<br />
the 2.6 kernel has been remarkably solid, so<br />
early on.<br />
Most of the 2.4 based tests have been run on<br />
Red Hat Enterprise <strong>Linux</strong> 3, based on <strong>Linux</strong><br />
2.4.21. This is the enterprise release of Red<br />
Hat’s OS distribution; it contains a large number<br />
of patches on top of the <strong>Linux</strong> 2.4 kernel<br />
tree. Some of the tests have been run on the<br />
kernel.org mainstream 2.4 kernel, to show<br />
the benefit of having extra functionality. However<br />
it is difficult to even just boot up the mainstream<br />
kernel on the test hardware due to lack<br />
of support for drivers, or lack of stability to<br />
complete the testsuite. <strong>The</strong> interesting thing to<br />
keep in mind is that with the current <strong>Linux</strong> 2.6<br />
mainstream kernel, most of the testsuites ran through completion. A number of test runs on
<strong>Linux</strong> 2.6 have been on Novell/SuSE SLES9<br />
beta release.
2 Test Suites<br />
<strong>The</strong> test suites used to compare the various kernels<br />
are based on an IO simulator for Oracle,<br />
called OraSim and a TPC-C like workload generator<br />
called OAST.<br />
Oracle Simulator (OraSim) is a stand-alone<br />
tool designed to emulate the platform-critical<br />
activities of the Oracle database kernel. Oracle<br />
designed Oracle Simulator to test and characterize<br />
the input and output (I/O) software stack,<br />
the storage system, memory management, and<br />
cluster management of Oracle single instances<br />
and clusters. Oracle Simulator supports both<br />
pass-fail testing for validation, and analytical<br />
testing for debugging and tuning. It runs multiple<br />
processes, with each process representing<br />
the parameters of a particular type of system<br />
load similar to the Oracle database kernel.<br />
OraSim is a relatively straightforward IO<br />
stresstest utility, similar to IOzone or tiobench,<br />
however it is built to be very flexible and configurable.<br />
It has its own script language which allows one<br />
to build very complex IO patterns. <strong>The</strong> tool is<br />
not released under any open source license today<br />
because it has some code linked in which is<br />
part of the RDBMS itself. <strong>The</strong> jobfiles used for<br />
the testing are available online http://oss.<br />
oracle.com/external/ols/jobfiles/.<br />
<strong>The</strong> advantage of using OraSim over a real<br />
database benchmark is mainly the simplicity.<br />
It does not require large amounts of memory or<br />
large installed software components. <strong>The</strong>re is<br />
one executable which is started with the jobfile<br />
as a parameter. The jobfiles used can be easily
modified to turn on certain filesystem features,<br />
such as asynchronous IO.<br />
OraSim jobfiles were created to simulate a relatively<br />
small database. 10 files are defined as<br />
actual database datafiles and two files are used<br />
to simulate database journals.<br />
OAST on the other hand is a complete database<br />
stress test kit, based on the TPC-C benchmark<br />
workloads. It requires a full installation of<br />
the database software and relies on an actual<br />
database environment to be created. TPC-C<br />
is an on-line transaction workload. <strong>The</strong> numbers<br />
represented during the testruns are not actual<br />
TPC-C benchmarks results and cannot or<br />
should not be used as a measure of TPC-C<br />
performance—they are TPC-C-like; however,<br />
not the same.<br />
<strong>The</strong> database engine which runs the OAST<br />
benchmark allocates a large shared memory<br />
segment which contains the database caches<br />
for SQL and for data blocks (shared pool and<br />
buffer cache). Every client connection can run<br />
on the same server or the connection can be<br />
over TCP. In case of a local connection, for<br />
each client, 2 processes are spawned on the<br />
system. <strong>One</strong> process is a dedicated database<br />
process and the other is the client code which<br />
communicates with the database server process<br />
through IPC calls. Test run parameters include<br />
run time length in seconds and number of<br />
client connections. As you can see in the result<br />
pages, both remote and local connections have<br />
been tested.<br />
3 Hardware<br />
A number of hardware configurations have<br />
been used. We tried to include various CPU<br />
architectures as well as local SCSI disk versus<br />
network storage (NAS) and fibre channel<br />
(SAN).<br />
Configuration 1 consists of an 8-way IA32<br />
Xeon 2 GHz with 32GB RAM attached to an<br />
EMC CX300 Clariion array with 30 147GB<br />
disks using a QLA2300 fibre channel HBA.<br />
<strong>The</strong> network cards are BCM5701 Broadcom<br />
Gigabit Ethernet.
Configuration 2 consists of an 8-way Itanium 2<br />
1.3 GHz with 8GB RAM attached to a JBOD<br />
fibre channel array with 8 36GB disks using<br />
a QLA2300 fibre channel HBA. <strong>The</strong> network<br />
cards are BCM5701 Broadcom Gigabit Ethernet.<br />
Configuration 3 consists of a 2-way AMD64 2<br />
GHz (Opteron 246) with 6GB RAM attached<br />
to local SCSI disk (LSI Logic 53c1030).<br />
4 Operating System<br />
<strong>The</strong> <strong>Linux</strong> 2.4 test cases were created using<br />
Red Hat Enterprise <strong>Linux</strong> 3 on all architectures.<br />
<strong>Linux</strong> 2.6 was done with SuSE SLES9<br />
on all architectures; however, in a number of<br />
tests the kernel was replaced by the 2.6 mainstream<br />
kernel for comparison.<br />
<strong>The</strong> test suites and benchmarks did not have<br />
to be recompiled to run on either RHEL3 or<br />
SLES9. Of course different executables were<br />
used on the three CPU architectures.<br />
5 Test Results<br />
At the time of writing a lot of changes were<br />
still happening on the 2.6 kernel. As such,<br />
the actual spreadsheets with benchmark data<br />
has been published on a website, the data is<br />
up-to-date with the current kernel tree and can<br />
be found here: http://oss.oracle.com/<br />
external/ols/results/<br />
5.1 IO<br />
If you want to build a huge database server,<br />
which can handle thousands of users, it is important<br />
to be able to attach a large number of<br />
disks. A very big shortcoming in <strong>Linux</strong> 2.4<br />
was the fact that it could only handle 128 or<br />
256 disks.
With some patches SuSE got to around 3700<br />
disks in SLES8, however that meant stealing<br />
major numbers from other components. Really<br />
large database setups which also require<br />
very high IO throughput, usually have disks attached<br />
ranging from a few hundred to a few<br />
thousand.<br />
With the 64-bit dev_t in 2.6, it’s now possible<br />
to attach plenty of disks. Without modifications
it can easily handle tens of thousands of devices<br />
attached. This opens the world to really<br />
large scale datawarehouses, tens of terabytes of<br />
storage.<br />
Another important change is the block IO<br />
layer, the BIO code is much more efficient<br />
when it comes to large IOs being submitted<br />
down from the running application. In 2.4,<br />
every IO got broken down into small chunks,<br />
sometimes causing bottlenecks on allocating<br />
accounting structures. Some of the tests compared<br />
1MB read() and write() calls in<br />
2.4 and 2.6.<br />
5.2 Asynchronous IO and DirectIO<br />
If there is one feature that has always been on<br />
top of the Must Have list for large database<br />
vendors, it must be async IO. Asynchronous IO<br />
allows processes to submit batches of IO operations<br />
and continue on doing different tasks in<br />
the meantime. It improves CPU utilization and<br />
can keep devices more busy. <strong>The</strong> Enterprise<br />
distributions based on <strong>Linux</strong> 2.4 all ship with<br />
the async IO patch applied on top of the mainline<br />
kernel.<br />
<strong>Linux</strong> 2.6 has async IO out of the box. It is<br />
implemented a little different from <strong>Linux</strong> 2.4<br />
however combined with support for direct IO it<br />
is very performant. Direct IO is very useful as<br />
it eliminates copying the userspace buffers into<br />
kernel space. On systems that are constantly<br />
overloaded, there is a nice performance improvement to be gained doing direct IO. Linux
2.4 did not have direct IO and async IO combined.<br />
As you can see in the performance<br />
graph on AIO+DIO, it provides a significant<br />
reduction in CPU utilization.<br />
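As an illustration of the combination, the sketch below submits a single direct, asynchronous read using the libaio interface to the 2.6 kernel's io_submit() path; the file name, block size, and single-request depth are arbitrary, and real code would keep many requests in flight:

#define _GNU_SOURCE           /* for O_DIRECT */
#include <fcntl.h>
#include <libaio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLK 4096              /* must satisfy the device alignment rules */

long read_one_block(const char *path)
{
        io_context_t ctx = 0;
        struct iocb cb, *cbs[1] = { &cb };
        struct io_event ev;
        void *buf;
        int fd;

        if (posix_memalign(&buf, BLK, BLK))
                return -1;
        fd = open(path, O_RDONLY | O_DIRECT);   /* no page-cache copy  */
        if (fd < 0 || io_setup(1, &ctx) < 0)
                return -1;

        io_prep_pread(&cb, fd, buf, BLK, 0);    /* describe the read   */
        if (io_submit(ctx, 1, cbs) != 1)        /* returns immediately */
                return -1;

        /* ... the process is free to do other work here ... */

        io_getevents(ctx, 1, 1, &ev, NULL);     /* reap the completion */
        io_destroy(ctx);
        close(fd);
        free(buf);
        return ev.res;                          /* bytes transferred   */
}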
5.3 Virtual Memory<br />
<strong>The</strong>re has been another major VM overhaul in<br />
<strong>Linux</strong> 2.6, in fact, even after 2.6.0 was released<br />
a large portion has been re-written. This was<br />
due to large scale testing showing weaknesses<br />
as it relates to number of users that could be<br />
handled on a system. As you can see in the test<br />
results, we were able to go from around 3000<br />
users to over 7000 users. In particular on 32-<br />
bit systems, the VM has been pretty much a<br />
disaster when it comes to deploying a system<br />
with more than 16GB of RAM. With the latest<br />
VM changes it is now possible to push a 32GB<br />
even up to 48GB system pretty reliably.<br />
Support for large pages has also been a big<br />
winner. HUGETLBFS reduces TLB misses by<br />
a decent percentage. In some of the tests it<br />
provides up to a 3% performance gain. In our<br />
tests HUGETLBFS would be used to allocate<br />
the shared memory segment.<br />
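A sketch of how such a segment is obtained on 2.6 follows; the size is arbitrary, and enough huge pages must have been reserved beforehand through /proc/sys/vm/nr_hugepages:

#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000   /* 2.6 flag: back the segment with huge pages */
#endif

void *alloc_huge_segment(size_t bytes)
{
        int id = shmget(IPC_PRIVATE, bytes,
                        SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W);
        void *p;

        if (id < 0) {
                perror("shmget(SHM_HUGETLB)");
                return NULL;
        }
        p = shmat(id, NULL, 0);
        return (p == (void *) -1) ? NULL : p;
}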
5.4 NUMA<br />
<strong>Linux</strong> 2.6 is the first <strong>Linux</strong> kernel with real<br />
NUMA support. As we see high-end customers<br />
looking at deploying large SMP boxes<br />
running <strong>Linux</strong>, this became a real requirement.<br />
In fact even with the AMD64 design, NUMA<br />
support becomes important for performance<br />
even when looking at just a dual-CPU system.<br />
NUMA support has two components. One is that the kernel VM allocates
memory for processes in a more efficient way.<br />
On the other hand, it is possible for applications<br />
to use the NUMA API and tell the OS<br />
where memory should be allocated and how.<br />
Oracle has an extension for Itanium2 to support the libnuma API from Andi Kleen. Making use of this extension showed a significant improvement, up to about 20%. It allows the database
engine to be smart about memory allocations<br />
resulting in a significant performance gain.<br />
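The sort of call the extension makes is illustrated below with Andi Kleen's libnuma; the node choice here is arbitrary, whereas the database would pick it to match where the corresponding server processes run:

#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

void *alloc_on_node(int node, size_t bytes)
{
        if (numa_available() < 0) {
                fprintf(stderr, "NUMA policy not supported\n");
                return NULL;
        }
        numa_run_on_node(node);                /* keep this thread on the node */
        return numa_alloc_onnode(bytes, node); /* and take its memory there    */
}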
6 Conclusion<br />
It is very clear that many of the features that<br />
were requested by the larger corporations providing<br />
enterprise applications actually help a<br />
huge amount. <strong>The</strong> advantage of having Asynchronous<br />
IO or NUMA support in the mainstream<br />
kernel is obvious. It takes a lot of effort<br />
for distribution vendors to maintain patches on<br />
top of the mainline kernel and when functionality<br />
makes sense it helps to have it be included<br />
in mainline. Micro-optimizations are still being<br />
done and in particular the VM subsystem<br />
can improve quite a bit. Most of the stability<br />
issues are around 32-bit, where the LowMem<br />
versus HighMem split wreaks havoc quite frequently.<br />
At least with some of the features now<br />
in the 2.6 kernel it is possible to run servers<br />
with more than 16GB of memory and scale up.<br />
<strong>The</strong> biggest surprise was the stability. It was<br />
very nice to see a new stable tree be so solid<br />
out of the box, this in contrast to earlier stable<br />
kernel trees where it took quite a few iterations<br />
to get to the same point.<br />
<strong>The</strong> major benefit of 2.6 is being able to run on<br />
really large SMP boxes: 32-way Itanium2 or<br />
Power4 systems with large amounts of memory.<br />
This was the last stronghold of the traditional<br />
Unices and now <strong>Linux</strong> can play alongside<br />
with them even there. Very exciting times.
Multi-processor and Frequency Scaling<br />
Making Your Server Behave Like a Laptop<br />
Paul Devriendt<br />
AMD Software Research and Development<br />
paul.devriendt@amd.com<br />
Copyright © 2004 Advanced Micro Devices, Inc.<br />
Abstract<br />
This paper will explore a multi-processor implementation<br />
of frequency management, using<br />
an AMD Opteron processor 4-way server as<br />
a test vehicle.<br />
Topics will include:<br />
• the benefits of doing this, and why server<br />
customers are asking for this,<br />
• the hardware, for the case of the AMD
Opteron processor,<br />
• the various software components that<br />
make this work,<br />
• the issues that arise, and<br />
• some areas of exploration for follow on<br />
work.<br />
1 Introduction<br />
Processor frequency management is common<br />
on laptops, primarily as a mechanism for improving<br />
battery life. Other benefits include a<br />
cooler processor and reduced fan noise. Fans<br />
also use a non-trivial amount of power.<br />
This technology is spreading to desktop machines,<br />
driven both by a desire to reduce power<br />
consumption and to reduce fan noise.<br />
Servers and other multiprocessor machines can<br />
equally benefit. <strong>The</strong> multiprocessor frequency<br />
management scenario offers more complexity<br />
(no surprise there). This paper discusses<br />
these complexities, based upon a test implementation<br />
on an AMD Opteron processor 4-<br />
way server. Details within this paper are AMD<br />
processor specific, but the concepts are applicable<br />
to other architectures.<br />
<strong>The</strong> author of this paper would like to make<br />
it clear that he is just the maintainer of the<br />
AMD frequency driver, supporting the AMD<br />
Athlon 64 and AMD Opteron processors.<br />
This frequency driver fits into, and is totally dependent on, the CPUFreq support. The author
has gratefully received much assistance and<br />
support from the CPUFreq maintainer (Dominik<br />
Brodowski).<br />
2 Abbreviations<br />
BKDG: <strong>The</strong> BIOS and <strong>Kernel</strong> Developer’s<br />
Guide. Document published by AMD containing<br />
information needed by system software developers.<br />
See the references section, entry 4.<br />
MSR: Model Specific Register. Processor registers, accessible only from kernel space, used
for various control functions. <strong>The</strong>se registers<br />
are expected to change across processor<br />
families. <strong>The</strong>se registers are described in the
BKDG[4].<br />
VRM: Voltage Regulator Module. Hardware<br />
external to the processor that controls the voltage<br />
supplied to the processor. <strong>The</strong> VRM has to<br />
be capable of supplying different voltages on<br />
command. Note that for multiprocessor systems,<br />
it is expected that each processor will<br />
have its own independent VRM, allowing each<br />
processor to change voltage independently. For<br />
systems where more than one processor shares<br />
a VRM, the processors have to be managed as<br />
a group. <strong>The</strong> current frequency driver does not<br />
have this support.<br />
fid: Frequency Identifier. <strong>The</strong> values written<br />
to the control MSR to select a core frequency.<br />
<strong>The</strong>se identifiers are processor family<br />
specific. Currently, these are six bit codes, allowing<br />
the selection of frequencies from 800<br />
MHz to 5 GHz. See the BKDG[4] for the mappings<br />
from fid to frequency. Note that the frequency<br />
driver does need to “understand” the<br />
mapping of fid to frequency, as frequencies are<br />
exposed to other software components.<br />
vid: Voltage Identifier. <strong>The</strong> values written to<br />
the control MSR to select a voltage. <strong>The</strong>se values<br />
are then driven to the VRM by processor<br />
logic to achieve control of the voltage. <strong>The</strong>se<br />
identifiers are processor model specific. Currently<br />
these identifiers are five bit codes, of<br />
which there are two sets—a standard set and<br />
a low-voltage mobile set. <strong>The</strong> frequency driver<br />
does not need to be able to “understand” the<br />
mapping of vid to voltage, other than perhaps<br />
for debug prints.<br />
VST: Voltage Stabilization Time. <strong>The</strong> length<br />
of time before the voltage has increased and is<br />
stable at a newly increased voltage. <strong>The</strong> driver<br />
has to wait for this time period when stepping<br />
the voltage up. <strong>The</strong> voltage has to be stable<br />
at the new level before applying a further step<br />
up in voltage, or before transitioning to a new<br />
frequency that requires the higher voltage.<br />
MVS: Maximum Voltage Step. <strong>The</strong> maximum<br />
voltage step that can be taken when increasing<br />
the voltage. <strong>The</strong> driver has to step up voltage<br />
in multiple steps of this value when increasing<br />
the voltage. (When decreasing voltage it is not<br />
necessary to step, the driver can merely jump<br />
to the correct voltage.) A typical MVS value<br />
would be 25mV. (A short illustrative sketch of this<br />
stepping rule appears at the end of this section.)<br />
RVO: Ramp Voltage Offset. When transitioning<br />
frequencies, it is necessary to temporarily<br />
increase the nominal voltage by this amount<br />
during the frequency transition. A typical RVO<br />
value would be 50mV.<br />
IRT: Isochronous Relief Time. During frequency<br />
transitions, busmasters briefly lose access<br />
to system memory. When making multiple<br />
frequency changes, the processor driver<br />
must delay the next transition for this time<br />
period to allow busmasters access to system<br />
memory. <strong>The</strong> typical value used is 80us.<br />
PLL: Phase Locked Loop. Electronic circuit<br />
that controls an oscillator to maintain a constant<br />
phase angle relative to a reference signal.<br />
Used to synthesize new frequencies which are<br />
a multiple of a reference frequency.<br />
PLL Lock Time: <strong>The</strong> length of time, in microseconds,<br />
for the PLL to lock.<br />
pstate: Performance State. A combination of<br />
frequency/voltage that is supported for the operation<br />
of the processor. A processor will typically<br />
have several pstates available, with higher<br />
frequencies needing higher voltages. <strong>The</strong> processor<br />
clock cannot be set to any arbitrary frequency;<br />
it may only be set to one of a limited<br />
set of frequencies. For a given frequency, there<br />
is a minimum voltage needed to operate reliably<br />
at that frequency, and this is the correct<br />
voltage, thus forming the frequency/voltage<br />
pair.<br />
ACPI: Advanced Configuration and Power<br />
Interface Specification. An industry specification,<br />
initially developed by Intel, Microsoft,<br />
Phoenix and Toshiba. See the reference section,<br />
entry 5.<br />
_PSS: Performance Supported States. ACPI<br />
object that defines the performance states valid<br />
for a processor.<br />
_PPC: Performance Present Capabilities.<br />
ACPI object that defines which of the _PSS<br />
states are currently available, due to current<br />
platform limitations.<br />
PSB: Performance State Block. BIOS provided<br />
data structure used to pass information, to the<br />
driver, concerning the pstates available on the<br />
processor. <strong>The</strong> PSB does not support multiprocessor<br />
systems (which use the ACPI _PSS<br />
object instead) and is being deprecated. <strong>The</strong><br />
format of the PSB is defined in the BKDG.<br />
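To make the MVS and VST definitions above concrete, the following is a minimal, self-contained sketch (it is not driver code) of the stepping rule: when raising the voltage, move in increments no larger than MVS and wait VST after each increment; when lowering it, jump directly to the target. The values passed in main() are simply the typical figures quoted above, and the MSR write and the VST wait are reduced to printf() calls.

#include <stdio.h>

/* Illustrative only: simulate stepping the supply voltage up to
 * target_mv in increments of at most mvs_mv (the Maximum Voltage
 * Step), pausing for VST after each step so the VRM output is
 * stable before the next one.  Stepping down needs no intermediate
 * steps. */
static void ramp_voltage(unsigned cur_mv, unsigned target_mv, unsigned mvs_mv)
{
	if (target_mv <= cur_mv) {
		printf("jump down directly to %u mV\n", target_mv);
		return;
	}
	while (cur_mv < target_mv) {
		unsigned step = target_mv - cur_mv;
		if (step > mvs_mv)
			step = mvs_mv;	/* never exceed MVS in one step */
		cur_mv += step;
		printf("step up to %u mV, then wait VST\n", cur_mv);
	}
}

int main(void)
{
	ramp_voltage(1100, 1500, 25);	/* e.g. 1.10 V to 1.50 V, MVS = 25 mV */
	return 0;
}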
3 Why Does Frequency Management<br />
Affect Power Consumption?<br />
Higher frequency requires higher voltage.<br />
As an example, data for part number<br />
ADA3200AEP4AX:<br />
2.2 GHz @ 1.50 volts, 58 amps max – 89 watts<br />
2.0 GHz @ 1.40 volts, 48 amps max – 69 watts<br />
1.8 GHz @ 1.30 volts, 37 amps max – 50 watts<br />
1.0 GHz @ 1.10 volts, 18 amps max – 22 watts<br />
<strong>The</strong>se figures are worst case current/power figures,<br />
at maximum case temperature, and include<br />
I/O power of 2.2W.<br />
Actual power usage is determined by:<br />
• code currently executing (idle blocks in<br />
the processor consume less power),<br />
• activity from other processors (cache coherency,<br />
memory accesses, pass-through<br />
traffic on the HyperTransport connections),<br />
• processor temperature (current increases<br />
with temperature, at constant workload<br />
and voltage),<br />
• processor voltage.<br />
Increasing the voltage allows operation at<br />
higher frequencies, at the cost of higher power<br />
consumption and higher heat generation. Note<br />
that the relationship between frequency and power<br />
consumption is not a linear relationship—a<br />
10% frequency increase will cost more than<br />
10% in power consumption (30% or more).<br />
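As a rough check on these figures (this calculation is not from the data sheet itself), dynamic CMOS power scales approximately with f × V², ignoring leakage. Going from 2.2 GHz at 1.50 V to 2.0 GHz at 1.40 V gives a factor of (2.0/2.2) × (1.40/1.50)² ≈ 0.79; applied to the roughly 87 W of core power in the 89 W figure (after subtracting the 2.2 W of I/O power), this predicts about 71 W, close to the 69 W quoted. The same scaling for 1.0 GHz at 1.10 V gives a factor of roughly 0.24, or about 23 W, again in line with the 22 W figure.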
Total system power usage depends on other devices<br />
in the system, such as whether disk drives<br />
are spinning or stopped, and on the efficiency<br />
of power supplies.<br />
4 Why Should Your Server Behave<br />
Like A Laptop?<br />
• Save power. It is the right thing to do<br />
for the environment. Note that power<br />
consumed is largely converted into heat,<br />
which then becomes a load on the air conditioning<br />
in the server room.<br />
• Save money. Power costs money. <strong>The</strong><br />
power savings for a single server are typically<br />
regarded as trivial in terms of a corporate<br />
budget. However, many large organizations<br />
have racks of many thousands<br />
of servers. <strong>The</strong> power bill is then far from<br />
trivial.<br />
• Cooler components last longer, and this<br />
translates into improved server reliability.<br />
• Government Regulation.
5 Interesting Scenarios<br />
<strong>The</strong>se are real world scenarios, where the application<br />
of the technology is appropriate.<br />
5.1 Save power in an idle cluster<br />
A cluster would typically be kept running at<br />
all times, allowing remote access on demand.<br />
During the periods when the cluster is idle, reducing<br />
the CPU frequency is a good way to<br />
reduce power consumption (and therefore also<br />
air conditioning load), yet be able to quickly<br />
transition back up to full speed (
demanding case is a blade server) aggravates<br />
the cooling problem as the neighboring boxes<br />
are also generating heat.<br />
6 System Power Budget<br />
<strong>The</strong> processors are only part of the system. We<br />
therefore need to understand the power consumption<br />
of the entire system to see how significant<br />
processor frequency management is on<br />
the power consumption of the whole system.<br />
A system power budget is obviously platform<br />
specific. This sample DC (direct current)<br />
power budget is for a 4-processor AMD<br />
Opteron processor based system. <strong>The</strong> system<br />
has three 500W power supplies, of which one<br />
is redundant. Analysis shows that for many<br />
operating scenarios, the system could run on<br />
a single power supply.<br />
This analysis is of DC power. For the system<br />
in question, the efficiency of the power supplies<br />
are approximately linear across varying<br />
loads, and thus the DC power figures expressed<br />
as percentages are meaningful as predictors of<br />
the AC (alternating current) power consumption.<br />
For systems with power supplies that are<br />
not linearly efficient across varying loads, the<br />
calculations obviously have to be factored to<br />
take account of power supply efficiency.<br />
System components:<br />
• 4 processors @ 89W = 356W in the maximum<br />
pstate, 4 @ 22W = 88W in the minimum<br />
pstate. <strong>The</strong>se are worst case figures,<br />
at maximum case temperature, with the<br />
worst case instruction mix. <strong>The</strong> figures in<br />
Table 1 are reduced from these maximums<br />
by approximately 10% to account for a reduced<br />
case temperature and for a workload<br />
that does not keep all of the processors’<br />
internal units busy.<br />
• Two disk drives (Western Digital 250<br />
GByte SATA), 16W read/write, 10W idle<br />
(spinning), 1.3W sleep (not spinning).<br />
Note SCSI drives typically consume more<br />
power.<br />
• DVD Drive, 10W read, 1W idle/sleep.<br />
• PCI 2.2 Slots – absolute max of 25W per<br />
slot, system will have a total power budget<br />
that may not account for maximum power<br />
in all slots. Estimate 2 slots occupied at a<br />
total of 20W.<br />
• VGA video card in a PCI slot. 5W. (AGP<br />
would be more like 15W+).<br />
• DDR DRAM, 10W max per DIMM, 40W<br />
for 4 GBytes configured as 4 DIMMs.<br />
• Network (built in) 5W.<br />
• Motherboard and components 30W.<br />
• 10 fans @ 6W each. 60W.<br />
• Keyboard + Mouse 3W<br />
See Table 1 for the sample power budget under<br />
busy and light loads.<br />
<strong>The</strong> light load without any frequency reduction<br />
is baselined as 100%.<br />
<strong>The</strong> power consumption is shown for the same<br />
light load with frequency reduction enabled,<br />
and again where the idle loop incorporates the<br />
hlt instruction.<br />
Using frequency management, the power consumption<br />
drops to 43%, and adding the use of<br />
the hlt instruction (assuming 50% time halted),<br />
the power consumption drops further to 33%.<br />
<strong>The</strong>se are significant power savings, for systems<br />
that are under light load conditions at<br />
times. <strong>The</strong> percentage of time that the system<br />
is running under reduced load has to be known<br />
to predict actual power savings.
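As a purely hypothetical illustration (the duty cycle here is invented for the example, not measured): a system that is busy 40% of the time at 525 W and lightly loaded the remaining 60% would average about 0.4 × 525 + 0.6 × 479 ≈ 497 W without frequency management, versus 0.4 × 525 + 0.6 × 156 ≈ 304 W with frequency reduction and hlt, a saving of roughly 39%.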
system load                  4 cpus  2 disks  dvd  pci  vga  dram  net  planar  fans  kbd+mou  total
busy                            320       32   10   20    5    40    5      30    60        3   525W
                                90%
light load                      310       22    1   15    5    38    5      20    60        3   479W
                                87%                                                             100%
light load, using                79       22    1   15    5    38    5      20    20        3   208W
frequency reduction             90%                                                              43%
light load, using                32       22    1   15    5    38    5      20    15        3   156W
frequency reduction             40%                                                              33%
and using hlt 50%
of the time

Table 1: Sample System Power Budget (DC), in watts<br />
7 Hardware—AMD Opteron<br />
7.1 Software Interface To <strong>The</strong> Hardware<br />
<strong>The</strong>re are two MSRs, the FIDVID_STATUS<br />
MSR and the FIDVID_CONTROL MSR, that<br />
are used for frequency voltage transitions.<br />
<strong>The</strong>se MSRs are the same for the single processor<br />
AMD Athlon 64 processors and for the<br />
AMD Opteron MP capable processors. <strong>The</strong>se<br />
registers are not compatible with the previous<br />
generation of AMD Athlon processors, and<br />
will not be compatible with the next generation<br />
of processors.<br />
<strong>The</strong> CPU frequency driver for AMD processors<br />
therefore has to change across processor<br />
revisions, as do the ACPI _PSS objects that describe<br />
pstates.<br />
<strong>The</strong> status register reports the current fid and<br />
vid, as well as the maximum fid, the start fid,<br />
the maximum vid and the start vid of the particular<br />
processor.<br />
<strong>The</strong>se registers are documented in the<br />
BKDG[4].<br />
As MSRs can only be accessed by executing<br />
code (the rdmsr and wrmsr instructions) on<br />
the target processor, the frequency driver has to<br />
use the processor affinity support to force execution<br />
on the correct processor.<br />
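The pattern described above might look roughly like the following sketch (2.6-era kernel interfaces; this is not the actual powernow-k8 code, and the MSR address, taken from the BKDG[4], should be treated as illustrative):

#include <linux/types.h>
#include <linux/sched.h>
#include <linux/cpumask.h>
#include <asm/msr.h>

#define MSR_FIDVID_STATUS 0xc0010042	/* fid/vid status MSR, per the BKDG[4] */

/* Read the fid/vid status MSR of a given processor.  The MSR can
 * only be read by code executing on that processor, so pin the
 * current task there first, then restore the original affinity. */
void query_fidvid_status(int cpu, u32 *lo, u32 *hi)
{
	cpumask_t oldmask = current->cpus_allowed;

	set_cpus_allowed(current, cpumask_of_cpu(cpu));
	rdmsr(MSR_FIDVID_STATUS, *lo, *hi);
	set_cpus_allowed(current, oldmask);
}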
7.2 Multiple Memory Controllers<br />
In PC architectures, the memory controller is<br />
a component of the northbridge, which is traditionally<br />
a separate component from the processor.<br />
With AMD Opteron processors, the<br />
northbridge is built into the processor. Thus,<br />
in a multi-processor system there are multiple<br />
memory controllers.<br />
See Figure 1 for a block diagram of a two processor<br />
system.<br />
If a processor is accessing DRAM that is physically<br />
attached to a different processor, the<br />
DRAM access (and any cache coherency traffic)<br />
crosses the coherent HyperTransport interprocessor<br />
links. <strong>The</strong>re is a small performance<br />
penalty in this case. This penalty is of the order<br />
of a DRAM page hit versus a DRAM page<br />
miss, about 1.7 times slower than a local access.<br />
This penalty is minimized by the processor<br />
caches, where data/code residing in remote<br />
DRAM is locally cached. It is also minimized
by <strong>Linux</strong>’s NUMA support.<br />
Note that a single threaded application that<br />
is memory bandwidth constrained may benefit<br />
from multiple memory controllers, due to the<br />
increase in memory bandwidth.<br />
When the remote processor is transitioned to<br />
a lower frequency, this performance penalty is<br />
worse. An upper bound to the penalty may<br />
be calculated as proportional to the frequency<br />
slowdown. I.e., taking the remote processor<br />
from 2.2 GHz to 1.0 GHz would take the 1.7<br />
factor from above to a factor of 2.56. Note that<br />
this is an absolute worst case, an upper bound<br />
to the factor. Actual impact is workload dependent.<br />
A worst case scenario would be a memory<br />
bound task, doing memory reads at addresses<br />
that are pathologically the worst case for the<br />
caches, with all accesses being to remote memory.<br />
A more typical scenario would see this<br />
penalty alleviated by:<br />
• processor caches, where 64 bytes will<br />
be read and cached for a single access,<br />
so applications that walk linearly through<br />
memory will only see the penalty on 64<br />
byte boundaries,<br />
• memory writes do not take a penalty<br />
(as processor execution continues without<br />
waiting for a write to complete),<br />
• memory may be interleaved,<br />
• kernel NUMA optimizations for noninterleaved<br />
memory (which allocate<br />
memory local to the processor when<br />
possible to avoid this penalty).<br />
7.3 DRAM Interface Speed<br />
<strong>The</strong> DRAM interface speed is impacted by the<br />
core clock frequency. A full table is published<br />
in the processor data sheet; Table 2 shows a<br />
sample of actual DRAM frequencies for the<br />
common specified DRAM frequencies, across<br />
a range of core frequencies.<br />
This table shows that certain DRAM speed /<br />
core speed combinations are suboptimal.<br />
Effective memory performance is influenced<br />
by many factors:<br />
• cache hit rates,<br />
• effectiveness of NUMA memory allocation<br />
routines,<br />
• load on the memory controller,<br />
• size of penalty for remote memory accesses,<br />
• memory speed,<br />
• other hardware related items, such as<br />
types of DRAM accesses.<br />
It is therefore necessary to benchmark the actual<br />
workload to get meaningful data for that<br />
workload.<br />
7.4 UMA<br />
During frequency transitions, and when HyperTransport<br />
LDTSTOP is asserted, DRAM is<br />
placed into self refresh mode. UMA graphics<br />
devices therefore can not access DRAM.<br />
UMA systems therefore need to limit the time<br />
that DRAM is in self refresh mode. Time constraints<br />
are bandwidth dependent, with high<br />
resolution displays needing higher memory<br />
bandwidth. This is handled by the IRT delay<br />
time during frequency transitions. When transitioning<br />
multiple steps, the driver waits an appropriate<br />
length of time to allow external devices<br />
to access memory.
[Figure 1: Two Processor System. Block diagram: two AMD Opteron processors, each with<br />
its own DDR memory, linked by coherent HyperTransport (cHT); non-coherent HyperTransport<br />
(ncHT) links connect to the AMD-8151 graphics tunnel (8X AGP), the AMD-8131 PCI-X tunnel<br />
(PCI-X), and the AMD-8111 I/O hub (legacy PCI, USB, LPC, AC ’97, EIDE).]<br />
Figure 1: Two Processor System<br />
Processor Core    100MHz      133MHz      166MHz      200MHz
Frequency         DRAM spec   DRAM spec   DRAM spec   DRAM spec
800MHz            100.00      133.33      160.00      160.00
1000MHz           100.00      125.00      166.66      200.00
2000MHz           100.00      133.33      166.66      200.00
2200MHz           100.00      129.41      157.14      200.00

Table 2: DRAM Frequencies For A Range Of Processor Core Frequencies<br />
7.5 TSC Varying<br />
<strong>The</strong> Time Stamp Counter (TSC) register is<br />
a register that increments with the processor<br />
clock. Multiple reads of the register will see<br />
increasing values. This register increments on<br />
each core clock cycle in the current generation<br />
of processors. Thus, the rate of increase of the<br />
TSC when compared with “wall clock time”<br />
varies as the frequency varies. This causes<br />
problems in code that calibrates the TSC increments<br />
against an external time source, and then<br />
attempts to use the TSC to measure time.<br />
<strong>The</strong> <strong>Linux</strong> kernel uses the TSC for such timings,<br />
for example when a driver calls udelay().<br />
In this case it is not a disaster if the udelay()<br />
call waits for too long as the call is defined to<br />
allow this behavior. <strong>The</strong> case of the udelay()<br />
call returning too quickly can be fatal, and this<br />
has been demonstrated during experimentation<br />
with this code.<br />
This particular problem is resolved by the<br />
cpufreq driver correcting the kernel TSC calibration<br />
whenever the frequency changes.<br />
This issue may impact other code that uses<br />
the TSC register directly. It is interesting to<br />
note that it is hard to define a correct behavior.<br />
Code that calibrates the TSC against an external<br />
clock will be thrown off if the rate of increment<br />
of the TSC should change. However,<br />
other code may expect a certain code sequence<br />
to consistently execute in approximately the<br />
same number of cycles, as measured by the<br />
TSC, and this code will be thrown off if the behavior<br />
of the TSC changes relative to the processor<br />
speed.<br />
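A small user-space illustration of the calibration problem described above (not from the kernel; x86-specific, and purely a demonstration): any ticks-per-microsecond ratio measured here is only valid while the core stays at the frequency it had during calibration.

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/time.h>

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;
	__asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	struct timeval t0, t1;
	uint64_t c0, c1;
	double usecs;

	/* Calibrate: count TSC ticks over a known wall-clock interval. */
	gettimeofday(&t0, NULL);
	c0 = rdtsc();
	usleep(100000);			/* 100 ms */
	gettimeofday(&t1, NULL);
	c1 = rdtsc();

	usecs = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
	printf("calibrated: %.1f TSC ticks per microsecond\n",
	       (double)(c1 - c0) / usecs);

	/* If the core frequency changes after this point, converting
	 * later TSC deltas to time with this ratio gives wrong answers,
	 * which is exactly the udelay() hazard described above. */
	return 0;
}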
7.6 Measurement Of Frequency Transition<br />
Times<br />
<strong>The</strong> time required to perform a transition is a<br />
combination of the software time to execute the<br />
required code, and the hardware time to perform<br />
the transition.<br />
Examples of hardware wait time are:<br />
• waiting for the VRM to be stable at a<br />
newer voltage,<br />
• waiting for the PLL to lock at the new frequency,<br />
• waiting for DRAM to be placed into and<br />
then taken out of self refresh mode around<br />
a frequency transition.<br />
<strong>The</strong> time taken to transition between two states<br />
is dependent on both the initial state and the<br />
target state. This is due to :<br />
• multiple steps being required in some<br />
cases,<br />
• certain operations are lengthier (for example,<br />
voltage is stepped up in multiple<br />
stages, but stepped down in a single step),<br />
• difference in code execution time dependent<br />
on processor speed (although this is<br />
minor).<br />
Measurements, taken by calibrating the frequency<br />
driver, show that frequency transitions<br />
for a processor are taking less than 0.015 seconds.<br />
Further experimentation with multiple processors<br />
showed a worst case transition time of less<br />
than 0.08 seconds to transition all 4 processors<br />
from minimum to maximum frequency, and<br />
slightly faster to transition from maximum to<br />
minimum frequency.<br />
Note, there is a driver optimization under<br />
consideration that would approximately halve<br />
these transition times.
7.7 Use of Hardware Enforced Throttling<br />
<strong>The</strong> southbridge (I/O Hub, example AMD-<br />
8111 HyperTransport I/O Hub) is capable<br />
of initiating throttling via the HyperTransport<br />
stopclock message, which will ramp down the<br />
CPU grid by the programmed amount. This<br />
may be initiated by the southbridge for thermal<br />
throttling or for other reasons.<br />
This throttling is transparent to software, other<br />
than the performance impact.<br />
This throttling is of greatest value in the lowest<br />
pstate, due to the reduced voltage.<br />
<strong>The</strong> hardware enforced throttling is generally<br />
not of relevance to the software management<br />
of processor frequencies. However, a system<br />
designer would need to take care to ensure<br />
that the optimal scenarios occur—i.e., transition<br />
to a lower frequency/voltage in preference<br />
to hardware throttling in high pstates. <strong>The</strong><br />
BIOS configurations are documented in the<br />
BKDG[4].<br />
For maximum power savings, the southbridge<br />
would be configured to initiate throttling when<br />
the processor executes the hlt instruction.<br />
8 Software<br />
<strong>The</strong> AMD frequency driver is a small part of<br />
the software involved. <strong>The</strong> frequency driver<br />
fits into the CPUFreq architecture, which is<br />
part of the 2.6 kernel. It is also available as a<br />
patch for the 2.4 kernel, and many distributions<br />
do include it.<br />
<strong>The</strong> CPUFreq architecture includes kernel support,<br />
the CPUFreq driver itself (drivers/<br />
cpufreq), an architecture specific driver to<br />
control the hardware (powernow-k8.ko in this<br />
case), and /sys file system code for userland<br />
access.<br />
<strong>The</strong> kernel support code (linux/kernel/<br />
cpufreq.c) handles timing changes such as<br />
updating the kernel constant loops_per_<br />
jiffy, as well as notifiers (system components<br />
that need to be notified of a frequency<br />
change).<br />
8.1 History Of <strong>The</strong> AMD Frequency Driver<br />
<strong>The</strong> CPU frequency driver for AMD Athlon<br />
(the previous generation of processors) was<br />
developed by Dave Jones. This driver supports<br />
single processor transitions only, as the<br />
pstate transition capability was only enabled in<br />
mobile processors. This driver used the PSB<br />
mechanism to determine valid pstates for the<br />
processor. This driver has subsequently been<br />
enhanced to add ACPI support.<br />
<strong>The</strong> initial AMD Athlon 64 and AMD Opteron<br />
driver (developed by me, based upon Dave’s<br />
earlier work, and with much input from Dominik<br />
and others), was also PSB based. This<br />
was followed by a version of the driver that<br />
added ACPI support.<br />
<strong>The</strong> next release is intended to add a built-in<br />
table of pstates that will allow the checking of<br />
BIOS supplied data, and also allow an override<br />
capability to provide pstate data when not supplied<br />
by BIOS.<br />
8.2 User Interface<br />
<strong>The</strong> deprecated /proc/cpufreq (and<br />
/proc/sys) file system offers control over<br />
all processors or individual processors. By<br />
echoing values into this file, the root user<br />
can change policies and change the limits on<br />
available frequencies.<br />
Examples:<br />
Constrain all processors to frequencies between<br />
1.0 GHz and 1.6 GHz, with the performance<br />
policy (effectively chooses 1.6 GHz):
echo -n "1000000:16000000:<br />
performance" > /proc/cpufreq<br />
Constrain processor 2 to run at only 2.0 GHz:<br />
echo -n "2:2000000:2000000:<br />
performance" > proc/cpufreq<br />
<strong>The</strong> “performance” refers to a policy, with<br />
the other policy available being “powersave.”<br />
<strong>The</strong>se policies simply forced the frequency to<br />
be at the appropriate extreme of the available<br />
range. With the 2.6 kernel, the choice is normally<br />
for a “userspace” governor, which allows<br />
the (root) user or any user space code (running<br />
with root privilege) to dynamically control the<br />
frequency.<br />
With the 2.6 kernel, a new interface in the<br />
/sys filesystem is available to the root user,<br />
deprecating the /proc/cpufreq method.<br />
<strong>The</strong> control and status files exist under<br />
/sys/devices/system/cpu/cpuN/<br />
cpufreq, where N varies from 0 upwards,<br />
dependent on which processors are<br />
online. Among the other files in each processor’s<br />
directory, scaling_min_freq and<br />
scaling_max_freq control the minimum<br />
and maximum of the ranges in which the frequency<br />
may vary. <strong>The</strong> scaling_governor<br />
file is used to control the choice of governor.<br />
See linux/Documentation/<br />
cpu-freq/user-guide.txt for more<br />
information.<br />
Examples:<br />
Constrain processor 2 to run only in the range<br />
1.6 GHz to 2.0 GHz:<br />
cd /sys/devices/system/cpu<br />
cd cpu2/cpufreq<br />
echo 1600000 > scaling_min_freq<br />
echo 2000000 > scaling_max_freq<br />
8.3 Control From User Space And User Daemons<br />
<strong>The</strong> interface to the /sys filesystem allows<br />
userland control and query functionality. Some<br />
form of automation of the policy would normally<br />
be part of the desired complete implementation.<br />
This automation is dependent on the reason for<br />
using frequency management. As an example,<br />
for the case of transitioning to a lower pstate<br />
when running on a UPS, a daemon will be notified<br />
of the failure of mains power, and that<br />
daemon will trigger the frequency change by<br />
writing to the control files in the /sys filesystem.<br />
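As a sketch of what such a daemon action might look like (hypothetical code, not cpufreqd; the 1.0 GHz cap and the four-processor loop are only examples), the same sysfs files shown in the echo examples above can be written from C:

#include <stdio.h>

/* Cap one processor's maximum frequency (in kHz) via sysfs. */
static int set_max_khz(int cpu, unsigned int khz)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_max_freq", cpu);
	f = fopen(path, "w");
	if (!f)
		return -1;		/* processor offline, or no cpufreq */
	fprintf(f, "%u\n", khz);
	fclose(f);
	return 0;
}

int main(void)
{
	int cpu;

	/* On notification that mains power has failed, clamp all four
	 * processors of the test system to 1.0 GHz. */
	for (cpu = 0; cpu < 4; cpu++)
		set_max_khz(cpu, 1000000);
	return 0;
}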
<strong>The</strong> CPUFreq architecture has thus split the<br />
implementation into multiple parts:<br />
1. user space policy<br />
2. kernel space driver for common functionality<br />
3. kernel space driver for processor specific<br />
implementation.<br />
<strong>The</strong>re are multiple user space automation<br />
implementations, not all of which currently<br />
support multiprocessor systems. <strong>One</strong> that<br />
does, and that has been used in this<br />
project, is cpufreqd version 1.1.2 (http://<br />
sourceforge.net/projects/cpufreqd).<br />
This daemon is controlled by a configuration<br />
file. Other than making changes to the configuration<br />
file, the author of this paper has not<br />
been involved in any of the development work<br />
on cpufreqd, and is a mere user of this tool.<br />
<strong>The</strong> configuration file specifies profiles and<br />
rules. A profile is a description of the system<br />
settings in that state, and my configuration file<br />
is set up to map the profiles to the processor<br />
pstates. Rules are used to dynamically choose<br />
which profile to use, and my rules are set up<br />
to transition profiles based on total processor<br />
load.<br />
My simple configuration file to change processor<br />
frequency dependent on system load is:<br />
[General]<br />
pidfile=/var/run/cpufreqd.pid<br />
poll_interval=2<br />
pm_type=acpi<br />
# 2.2 GHz processor speed<br />
[Profile]<br />
name=hi_boost<br />
minfreq=95%<br />
maxfreq=100%<br />
policy=performance<br />
# 2.0 GHz processor speed<br />
[Profile]<br />
name=medium_boost<br />
minfreq=90%<br />
maxfreq=93%<br />
policy=performance<br />
# 1.0 GHz processor Speed<br />
[Profile]<br />
name=lo_boost<br />
minfreq=40%<br />
maxfreq=50%<br />
policy=powersave<br />
[Profile]<br />
name=lo_power<br />
minfreq=40%<br />
maxfreq=50%<br />
policy=powersave<br />
#not busy 0%-40%<br />
[Rule]<br />
name=conservative<br />
ac=on<br />
battery_interval=0-100<br />
cpu_interval=0-40<br />
profile=lo_boost<br />
#medium busy 30%-80%<br />
[Rule]<br />
name=lo_cpu_boost<br />
ac=on<br />
battery_interval=0-100<br />
cpu_interval=30-80<br />
profile=medium_boost<br />
#really busy 70%-100%<br />
[Rule]<br />
name=hi_cpu_boost<br />
ac=on<br />
battery_interval=50-100<br />
cpu_interval=70-100<br />
profile=hi_boost<br />
This approach actually works very well for<br />
multiple small tasks, for transitioning the frequencies<br />
of all the processors together based<br />
on a collective loading statistic.<br />
For a long running, single threaded task, this<br />
approach does not work well as the load is only<br />
high on a single processor, with the others being<br />
idle. <strong>The</strong> average load is thus low, and<br />
all processors are kept at a slow speed. Such<br />
a workload scenario would require an implementation<br />
that looked at the loading of individual<br />
processors, rather than the average. See the<br />
section below on future work.<br />
8.4 <strong>The</strong> Drivers Involved<br />
• powernow-k8.ko: arch/i386/kernel/cpu/cpufreq/powernow-k8.c<br />
(the same source code is built as a 32-bit driver in the i386 tree<br />
and as a 64-bit driver in the x86_64 tree)<br />
• drivers/acpi<br />
• drivers/cpufreq<br />
<strong>The</strong> Test Driver<br />
Note that the powernow-k8.ko driver does<br />
not export any read, write, or ioctl interfaces.<br />
For test purposes, a second driver exists with<br />
an ioctl interface for test application use. <strong>The</strong><br />
test driver was a big part of the test effort on<br />
powernow-k8.ko prior to release.<br />
8.5 Frequency Driver Entry Points<br />
powernowk8_init()<br />
Driver late_initcall. Initialization is<br />
late as the acpi driver needs to be initialized<br />
first. Verifies that all processors in the system<br />
are capable of frequency transitions, and that<br />
all processors are supported processors. Builds<br />
a data structure with the addresses of the four<br />
entry points for cpufreq use (listed below), and<br />
calls cpufreq_register_driver().<br />
powernowk8_exit()<br />
Called when the driver is to be unloaded. Calls<br />
cpufreq_unregister_driver().<br />
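In outline, the registration described above looks something like the following sketch against the 2.6 CPUFreq interfaces; the names mirror the driver's entry points, but the callback bodies, where the real driver logic lives, are reduced to stubs here:

#include <linux/cpufreq.h>
#include <linux/errno.h>
#include <linux/init.h>
#include <linux/module.h>

static int powernowk8_cpu_init(struct cpufreq_policy *pol)
{
	return -ENODEV;		/* real per-processor setup elided */
}

static int powernowk8_cpu_exit(struct cpufreq_policy *pol)
{
	return 0;		/* per-processor cleanup elided */
}

static int powernowk8_verify(struct cpufreq_policy *pol)
{
	return 0;		/* constrain pol to valid frequencies */
}

static int powernowk8_target(struct cpufreq_policy *pol,
			     unsigned int target_freq, unsigned int relation)
{
	return 0;		/* perform the fid/vid transition */
}

/* The four entry points handed to the CPUFreq core. */
static struct cpufreq_driver cpufreq_amd64_driver = {
	.verify	= powernowk8_verify,
	.target	= powernowk8_target,
	.init	= powernowk8_cpu_init,
	.exit	= powernowk8_cpu_exit,
	.name	= "powernow-k8",
	.owner	= THIS_MODULE,
};

static int __init powernowk8_init(void)
{
	/* Checks that every processor supports pstate transitions
	 * would go here, before registering. */
	return cpufreq_register_driver(&cpufreq_amd64_driver);
}

static void __exit powernowk8_exit(void)
{
	cpufreq_unregister_driver(&cpufreq_amd64_driver);
}

late_initcall(powernowk8_init);
module_exit(powernowk8_exit);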
8.6 Frequency Driver Entry Points For Use By<br />
<strong>The</strong> CPUFreq driver<br />
powernowk8_cpu_init()<br />
This is a per-processor initialization routine.<br />
As we are not guaranteed to be executing on<br />
the processor in question, and as the driver<br />
needs access to MSRs, the driver needs to force<br />
itself to run on the correct processor by using<br />
set_cpus_allowed().<br />
This per-processor initialization allows for processors<br />
to be taken offline or brought online dynamically.<br />
I.e., this is part of the software support<br />
that would be needed for processor hotplug,<br />
although this is not supported in the hardware.<br />
This routine finds the ACPI pstate data for this<br />
processor, and extracts the (proprietary) data<br />
from the ACPI _PSS objects. This data is verified<br />
as far as is reasonable. Per-processor data<br />
tables for use during frequency transitions are<br />
constructed from this information.<br />
powernowk8_cpu_exit()<br />
Per-processor cleanup routine.<br />
powernowk8_verify()<br />
When the root user (or an application running<br />
on behalf of the root user) requests a change to<br />
the minimum/maximum frequencies, or to the<br />
policy or governor, the frequency driver’s verification<br />
routine is called to verify (and correct<br />
if necessary) the input values. For example,<br />
if the maximum speed of the processor is 2.4<br />
GHz and the user requests that the maximum<br />
range be set to 3.0 GHz, the verify routine will<br />
correct the maximum value to a value that is actually<br />
possible. <strong>The</strong> user can, however, choose a<br />
value that is less than the hardware maximum,<br />
for example 2.0 GHz in this case.<br />
As this routine just needs to access the perprocessor<br />
data, and not any MSRs, it does not<br />
matter which processor executes this code.<br />
powernowk8_target()<br />
This is the driver entry point that actually performs<br />
a transition to a new frequency/voltage.<br />
This entry point is called for each processor<br />
that needs to transition to a new frequency.<br />
<strong>The</strong>re is therefore an optimization possible by<br />
enhancing the interface between the frequency<br />
driver and the CPUFreq driver for the case<br />
where all processors are to be transitioned to<br />
a new, common frequency. However, it is not<br />
clear that such an optimization is worth the<br />
complexity, as the functionality to transition a<br />
single processor would still be needed.<br />
This routine is invoked with the processor
number as a parameter, and there is no guarantee<br />
as to which processor we are currently executing<br />
on. As the mechanism for changing the<br />
frequency involves accessing MSRs, it is necessary<br />
to execute on the target processor, and<br />
the driver forces its execution onto the target<br />
processor by using set_cpus_allowed().<br />
<strong>The</strong> CPUFreq helpers are then used to determine<br />
the correct target frequency. Once a chosen<br />
target fid and vid are identified:<br />
• the cpufreq driver is called to warn that a<br />
transition is about to occur,<br />
• the actual transition code within<br />
powernow-k8 is called, and then<br />
• the cpufreq driver is called again to confirm<br />
that the transition was successful.<br />
<strong>The</strong> actual transition is protected with a<br />
semaphore that is used across all processors.<br />
This is to prevent transitions on one processor<br />
from interfering with transitions on other<br />
processors. This is due to the inter-processor<br />
communication that occurs at a hardware level<br />
when a frequency transition occurs.<br />
8.7 CPUFreq Interface<br />
<strong>The</strong> CPUFreq interface provides entry points<br />
that are required to make the system function.<br />
It also provides helper functions, which need<br />
not be used, but are there to provide common<br />
functionality across the set of all architecture<br />
specific drivers. Elimination of duplicate code<br />
is a good thing! An architecture specific driver<br />
can build a table of available frequencies, and<br />
pass this table to the CPUFreq driver. <strong>The</strong><br />
helper functions then simplify the architecture<br />
driver code by manipulating this table.<br />
cpufreq_register_driver()<br />
Registers the frequency driver as being the<br />
driver capable of performing frequency transitions<br />
on this platform. Only one driver may be<br />
registered.<br />
cpufreq_unregister_driver()<br />
Unregisters the driver, when it is being unloaded.<br />
cpufreq_notify_transition()<br />
Used to notify the CPUFreq driver, and thus the<br />
kernel, that a frequency transition is occurring,<br />
and triggering recalibration of timing specific<br />
code.<br />
cpufreq_frequency_table_target()<br />
Helper function to find an appropriate table entry<br />
for a given target frequency. Used in the<br />
driver’s target function.<br />
cpufreq_frequency_table_verify()<br />
Helper function to verify that an input frequency<br />
is valid. This helper is effectively a<br />
complete implementation of the driver’s verify<br />
function.<br />
cpufreq_frequency_table_cpuinfo()<br />
Supplies the frequency table data that is used<br />
on subsequent helper function calls. Also aids<br />
with providing information as to the capabilities<br />
of the processors.<br />
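A short sketch of how an architecture driver can lean on these helpers (the two-entry frequency table, in kHz, is invented for illustration; a real driver builds it from the pstate data described earlier):

#include <linux/cpufreq.h>
#include <linux/errno.h>

static struct cpufreq_frequency_table sample_table[] = {
	{ .index = 0, .frequency = 2200000 },		/* 2.2 GHz pstate */
	{ .index = 1, .frequency = 1000000 },		/* 1.0 GHz pstate */
	{ .index = 2, .frequency = CPUFREQ_TABLE_END },
};

int sample_cpu_init(struct cpufreq_policy *pol)
{
	/* Publish the table; fills in the policy's cpuinfo limits. */
	return cpufreq_frequency_table_cpuinfo(pol, sample_table);
}

int sample_verify(struct cpufreq_policy *pol)
{
	/* The helper is effectively a complete verify implementation. */
	return cpufreq_frequency_table_verify(pol, sample_table);
}

int sample_target(struct cpufreq_policy *pol,
		  unsigned int target_freq, unsigned int relation)
{
	unsigned int idx;

	if (cpufreq_frequency_table_target(pol, sample_table,
					   target_freq, relation, &idx))
		return -EINVAL;

	/* idx now selects the table entry (and hence the fid/vid pair);
	 * the hardware-specific transition itself is elided. */
	return 0;
}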
8.8 Calls To <strong>The</strong> ACPI Driver<br />
acpi_processor_register_performance()<br />
acpi_processor_unregister_performance()<br />
Helper functions used at per-processor initialization<br />
time to gain access to the data from the<br />
_PSS object for that processor. This is a preferable<br />
solution to the frequency driver having to<br />
walk the ACPI namespace itself.
8.9 <strong>The</strong> Single Processor Solution<br />
Many of the kernel functions collapse to<br />
constants when the kernel is built without<br />
multiprocessor support. For example, num_<br />
online_cpus() becomes a macro with the<br />
value 1. By the careful use of the definitions<br />
in smp.h, the same driver code handles<br />
both multiprocessor and single processor machines<br />
without the use of conditional compilation.<br />
<strong>The</strong> multiprocessor support obviously<br />
adds complexity to the code for a single processor<br />
system, but this overhead is negligible in the case<br />
of transitioning frequencies. <strong>The</strong> driver initialization<br />
and termination code is made more<br />
complex and lengthy, but this is not frequently<br />
executed code. <strong>The</strong>re is also a small penalty in<br />
terms of code space.<br />
<strong>The</strong> author does not feel that the penalty of the<br />
multiple processor support code is noticeable<br />
on a single processor system, but this is obviously<br />
debatable. <strong>The</strong> current choice is to have<br />
a single driver that supports both single processor<br />
and multiple processor systems.<br />
As the primary performance cost is in terms<br />
of additional code space, it is true that a single<br />
processor machine with highly constrained<br />
memory may benefit from a simplified driver<br />
without the additional multi-processor support<br />
code. However, such a machine would see<br />
greater benefit by eliminating other code that<br />
would not be necessary on a chosen platform.<br />
For example, the PSB support code could be<br />
removed from a memory constrained single<br />
processor machine that was using ACPI.<br />
This approach of removing code unnecessary<br />
for a particular platform is not a wonderful approach<br />
when it leads to multiple variants of<br />
the driver, all of which have to be supported<br />
and enhanced, and which makes Kconfig even<br />
more complex.<br />
8.10 Stages Of Development, Test And Debug<br />
Of <strong>The</strong> Driver<br />
<strong>The</strong> algorithm for transitioning to a new frequency<br />
is complex. See the BKDG[4] for a<br />
good description of the steps required, including<br />
flowcharts. In order to test and debug the<br />
frequency/voltage transition code thoroughly,<br />
the author first wrote a simple simulation of the<br />
processor. This simulation maintained a state<br />
machine, verified that fid/vid MSR control activity<br />
was legal, provided fid/vid status MSR<br />
results, and wrote a log file of all activity. <strong>The</strong><br />
core driver code was then written as an application<br />
and linked with this simulation code to<br />
allow testing of all combinations.<br />
<strong>The</strong> driver was then developed as a skeleton<br />
using printk to develop and test the<br />
BIOS/ACPI interfaces without having the frequency/voltage<br />
transition code present. This is<br />
because attempts to actually transition to an invalid<br />
pstate often result in total system lockups<br />
that offer no debug output—if the processor<br />
voltage is too low for the frequency, successful<br />
code execution ceases.<br />
When the skeleton was working correctly, the<br />
actual transition code was dropped into place,<br />
and tested on real hardware, both single processor<br />
and multiple processor. (<strong>The</strong> single processor<br />
driver was released many months before<br />
the multi-processor capable driver as the multiprocessor<br />
capable hardware was not available<br />
in the marketplace.) <strong>The</strong> functional driver was<br />
tested, using printk to trace activity, and using<br />
external hardware to track power usage, and<br />
using a test driver to independently verify register<br />
settings.<br />
<strong>The</strong> functional driver was then made available<br />
to various people in the community for their<br />
feedback. <strong>The</strong> author is grateful for the extensive<br />
feedback received, which included the<br />
changed code to implement suggestions. <strong>The</strong><br />
driver as it exists today is considerably improved<br />
from the initial release, due to this feedback<br />
mechanism.<br />
9 How To Determine Valid PStates<br />
For A Given Processor<br />
AMD defines pstates for each processor. A<br />
performance state is a frequency/voltage pair<br />
that is valid for operation of that processor.<br />
<strong>The</strong>se are specified as fid/vid (frequency identifier/voltage<br />
identifier values) pairs, and are<br />
documented in the Processor <strong>The</strong>rmal and Data<br />
Sheets (see references). <strong>The</strong> worst case processor<br />
power consumption for each pstate is also<br />
characterized. <strong>The</strong> BKDG[4] contains tables<br />
for mapping fid to frequency and vid to voltage.<br />
Pstates are processor specific. I.e., 2.0 GHz at<br />
1.45V may be correct for one model/revision<br />
of processor, but is not necessarily correct for<br />
a different model/revision of processor.<br />
Code can determine whether a processor supports<br />
or does not support pstate transitions by<br />
executing the cpuid instruction. (For details,<br />
see the BKDG[4] or the source code for the<br />
<strong>Linux</strong> frequency driver). This needs to be done<br />
for each processor in an MP system.<br />
Each processor in an MP system could theoretically<br />
have different pstates.<br />
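The capability check might be sketched as follows, using the kernel's cpuid helpers. 0x80000007 is the extended power-management-features leaf; the exact bit layout comes from the BKDG[4], so the mask below should be treated as illustrative, and the check has to be repeated on each processor, pinned there as in the earlier MSR example.

#include <linux/types.h>
#include <asm/processor.h>

#define CPUID_FREQ_VOLT_CAPABILITIES	0x80000007
#define P_STATE_TRANSITION_CAPABLE	0x00000006	/* fid + vid control */

/* Returns non-zero if the processor this runs on advertises both
 * frequency (fid) and voltage (vid) control. */
int cpu_supports_pstate_transitions(void)
{
	u32 edx = cpuid_edx(CPUID_FREQ_VOLT_CAPABILITIES);

	return (edx & P_STATE_TRANSITION_CAPABLE) == P_STATE_TRANSITION_CAPABLE;
}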
Ideally, the processor frequency driver would<br />
not contain hardcoded pstate tables, as the<br />
driver would then need to be revised for new<br />
processor revisions. <strong>The</strong> chosen solution is to<br />
have the BIOS provide the tables of pstates,<br />
and have the driver retrieve the pstate data from<br />
the BIOS. <strong>The</strong>re are two such tables defined for<br />
use by BIOSs for AMD systems:<br />
1. PSB, AMD’s original proprietary mechanism,<br />
which does not support MP. This<br />
mechanism is being deprecated.<br />
2. ACPI _PSS objects. Whereas the ACPI<br />
specification is a standard, the data within<br />
the _PSS objects is AMD specific (and, in<br />
fact, processor family specific), and thus<br />
there is still a proprietary nature of this solution.<br />
<strong>The</strong> current AMD frequency driver obtains<br />
data from the ACPI objects. ACPI does introduce<br />
some limitations, which are discussed<br />
later. Experimentation is ongoing with a builtin<br />
database approach to the problem in an attempt<br />
to bypass these issues, and also to allow<br />
checking of validity of the ACPI provided data.<br />
10 ACPI And Frequency Restrictions<br />
ACPI[5] provides the _PPC object, that is used<br />
to constrain the pstates available. This object<br />
is dynamic, and can therefore be used in platforms<br />
for purposes such as:<br />
• forcing frequency restrictions when operating<br />
on battery power,<br />
• forcing frequency restrictions due to thermal<br />
conditions.<br />
For battery / mains power transitions, an ACPI-compliant<br />
GPE (General Purpose Event) input<br />
to the chipset (I/O hub) is dedicated to asserting<br />
an SCI (System Control Interrupt) when the<br />
power source changes. <strong>The</strong> ACPI driver will<br />
then execute the ACPI control method (see the<br />
_PSR power source ACPI object), which issues<br />
a notify to the _CPUn object, which triggers<br />
the ACPI driver to re-evaluate the _PPC<br />
object. If the current pstate exceeds that allowed<br />
by this new evaluation of the _PPC object,<br />
the CPU frequency driver will be called to<br />
transition to a lower pstate.
11 ACPI Issues<br />
ACPI as a standard is not perfect. <strong>The</strong>re is variation<br />
among different implementations, and<br />
<strong>Linux</strong> ACPI support does not work on all machines.<br />
ACPI does introduce some overhead, and some<br />
users are not willing to enable ACPI.<br />
ACPI requires that pstates be of equivalent<br />
power usage and frequency across all processors.<br />
In a system with processors that are capable<br />
of different maximum frequencies (for<br />
example, one processor capable of 2.0 GHz<br />
and a second processor capable of 2.2 GHz),<br />
compliance with the ACPI specification means<br />
that the faster processor(s) will be restricted to<br />
the maximum speed of the slowest processor.<br />
Also, if one processor has 5 available pstates,<br />
the presence of a processor with only 4 available<br />
pstates will restrict all processors to 4 pstates.<br />
12 What Is <strong>The</strong>re Today?<br />
AMD is shipping pstate capable AMD Opteron<br />
processors (revision CG). Server processors<br />
prior to revision CG were not pstate capable.<br />
All AMD Athlon 64 processors for mobile and<br />
desktop are pstate capable.<br />
BKDG[4] enhancements to describe the capability<br />
are in progress.<br />
AMD internal BIOSs have the enhancements.<br />
<strong>The</strong>se enhancements are rolling out to the publicly<br />
available BIOSs along with the BKDG<br />
enhancements.<br />
<strong>The</strong> multi-processor capable <strong>Linux</strong> frequency<br />
driver has been released under the GPL.<br />
<strong>The</strong> cpufreqd user-mode daemon, available for download<br />
from http://sourceforge.net/projects/cpufreqd,<br />
supports multiple processors.<br />
13 Other Software-directed Power<br />
Saving Mechanisms<br />
13.1 Use Of <strong>The</strong> HLT Instruction<br />
<strong>The</strong> hlt instruction is normally used when the<br />
operating system has no code for the processor<br />
to execute. This is the ACPI C1 state. Execution<br />
of instructions ceases, until the processor<br />
is restarted with an interrupt. <strong>The</strong> power<br />
savings are maximized when the hlt state is entered<br />
in the minimum pstate, due to the lower<br />
voltage. <strong>The</strong> alternative to the use of the hlt<br />
instruction is a do nothing loop.<br />
13.2 Use of Power Managed Chipset Drivers<br />
Devices on the planar board, such as a PCI-X<br />
bridge or an AGP tunnel, may have the capability<br />
to operate in lower power modes. Entering<br />
and leaving the lower power modes is under the<br />
control of the driver for that device.<br />
Note that HyperTransport attached devices can<br />
transition themselves to lower power modes<br />
when certain messages are seen on the bus.<br />
However, this functionality is typically configurable,<br />
so a chipset driver (or the system BIOS<br />
during bootup) would need to enable this capability.<br />
14 Items For Future Exploration<br />
14.1 A Built-in Database<br />
<strong>The</strong> theory is that the driver could have a builtin<br />
database of processors and the pstates that<br />
they support. <strong>The</strong> driver could then use this<br />
database to obtain the pstate data without dependencies<br />
on ACPI, or use it for enhanced
checking of the ACPI provided data. <strong>The</strong> disadvantage<br />
of this is the need to update the<br />
database for new processor revisions. <strong>The</strong> advantages<br />
are the ability to overcome the ACPI<br />
imposed restrictions, and also to allow the use<br />
of the technology on systems where the ACPI<br />
support is not enabled.<br />
14.2 <strong>Kernel</strong> Scheduler—CPU Power<br />
An enhanced scheduler for the 2.6 kernel<br />
(2.6.6-bk1) is aware of groups of processors<br />
with different processing power. <strong>The</strong> power<br />
rating of each CPU group should be dynamically<br />
adjusted using a cpufreq transition notifier<br />
as the processor frequencies are changed.<br />
See http://lwn.net/Articles/<br />
80601/ for a detailed account of the scheduler<br />
changes.<br />
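One plausible shape for this, sketched against the existing cpufreq notifier interface (recalc_cpu_power() is a placeholder for whatever the scheduler change would provide, not an existing kernel function):

#include <linux/cpufreq.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/notifier.h>

static int sched_power_notify(struct notifier_block *nb,
			      unsigned long state, void *data)
{
	struct cpufreq_freqs *freqs = data;

	if (state == CPUFREQ_POSTCHANGE) {
		/* freqs->cpu has moved from freqs->old to freqs->new (kHz);
		 * a call such as recalc_cpu_power(freqs->cpu, freqs->new)
		 * would go here. */
		printk(KERN_DEBUG "cpu %u now at %u kHz\n",
		       freqs->cpu, freqs->new);
	}
	return 0;
}

static struct notifier_block sched_power_nb = {
	.notifier_call = sched_power_notify,
};

static int __init sched_power_init(void)
{
	return cpufreq_register_notifier(&sched_power_nb,
					 CPUFREQ_TRANSITION_NOTIFIER);
}

late_initcall(sched_power_init);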
14.3 <strong>The</strong>rmal Management, ACPI <strong>The</strong>rmal<br />
Zones<br />
Publicly available BIOSs for AMD machines<br />
do not implement thermal zones. Obviously<br />
this is one way to provide the input control for<br />
frequency management based on thermal conditions.<br />
14.4 <strong>The</strong>rmal Management, Service Processor<br />
Servers typically have a service processor,<br />
which may be compliant with the IPMI specification.<br />
This service processor is able to accurately<br />
monitor temperature at different locations<br />
within the chassis. <strong>The</strong> 2.6 kernel<br />
includes an IPMI driver. User space code<br />
may use these thermal readings to control fan<br />
speeds and generate administrator alerts. It<br />
may make sense to also use these accurate thermal<br />
readings to trigger frequency transitions.<br />
<strong>The</strong> interaction between thermal events from<br />
the service processor and ACPI thermal zones<br />
may be a problem.<br />
14.5 Hiding <strong>The</strong>rmal Conditions<br />
<strong>One</strong> concern with the use of CPU frequency<br />
manipulation to avoid overheating is that hardware<br />
problems may not be noticed. Over temperature<br />
conditions would normally cause administrator<br />
alerts, but if the processor is first<br />
taken to a lower frequency to hold temperature<br />
down, then the alert may not be generated. A<br />
failing fan (not spinning at full speed) could<br />
therefore be missed. Some hardware components<br />
fail gradually, and early warning of imminent<br />
failures is needed to perform planned<br />
maintenance. Losing this data would be badness.<br />
15 Legal Information<br />
Copyright © 2004 Advanced Micro Devices, Inc<br />
Permission to redistribute in accordance with <strong>Linux</strong><br />
Symposium submission guidelines is granted; all<br />
other rights reserved.<br />
AMD, the AMD Arrow logo, AMD Opteron,<br />
AMD Athlon and combinations thereof, AMD-<br />
8111, AMD-8131, and AMD-8151 are trademarks<br />
of Advanced Micro Devices, Inc.<br />
<strong>Linux</strong> is a registered trademark of Linus Torvalds.<br />
HyperTransport is a licensed trademark of the HyperTransport<br />
Technology Consortium.<br />
Other product names used in this publication are for<br />
identification purposes only and may be trademarks<br />
of their respective companies.<br />
16 References<br />
1. AMD Opteron Processor Data Sheet,<br />
publication 23932, available from<br />
www.amd.com<br />
2. AMD Opteron Processor Power And
<strong>The</strong>rmal Data Sheet, publication 30417,<br />
available from www.amd.com<br />
3. AMD Athlon 64 Processor Power And<br />
<strong>The</strong>rmal Data Sheet, publication 30430,<br />
available from www.amd.com<br />
4. BIOS and <strong>Kernel</strong> Developer’s Guide (the<br />
BKDG) for AMD Athlon 64 and AMD<br />
Opteron Processors, publication 26094,<br />
available from www.amd.com. Chapter<br />
9 covers frequency management.<br />
5. ACPI 2.0b Specification, from<br />
www.acpi.info<br />
6. Text documentation files in the kernel<br />
linux/Documentation/cpu-freq/<br />
directory:<br />
• index.txt<br />
• user-guide.txt<br />
• core.txt<br />
• cpu-drivers.txt<br />
• governors.txt
Dynamic <strong>Kernel</strong> Module Support:<br />
From <strong>The</strong>ory to Practice<br />
Matt Domsch & Gary Lerhaupt<br />
Dell <strong>Linux</strong> Engineering<br />
Matt_Domsch@dell.com, Gary_Lerhaupt@dell.com<br />
Abstract<br />
DKMS is a framework which allows individual<br />
kernel modules to be upgraded without changing<br />
your whole kernel. Its primary audience<br />
is fourfold: system administrators who want<br />
to update a single device driver rather than<br />
wait for a new kernel from elsewhere with it<br />
included; distribution maintainers, who want<br />
to release a single targeted bugfix in between<br />
larger scheduled updates; system manufacturers<br />
who need single modules changed to support<br />
new hardware or to fix bugs, but do not<br />
wish to test whole new kernels; and driver<br />
developers, who must provide updated device<br />
drivers for testing and general use on a wide<br />
variety of kernels, as well as submit drivers to<br />
kernel.org.<br />
Since OLS2003, DKMS has gone from a good<br />
idea to deployed and used. Based on end user<br />
feedback, additional features have been added:<br />
precompiled module tarball support to speed<br />
factory installation; driver disks for Red Hat<br />
distributions; 2.6 kernel support; SuSE kernel<br />
support. Planned features include cross-architecture<br />
build support and additional distribution<br />
driver disk methods.<br />
In addition to overviewing DKMS and its features,<br />
we explain how to create a dkms.conf file<br />
to DKMS-ify your kernel module source.<br />
1 History<br />
Historically, <strong>Linux</strong> distributions bundle device<br />
drivers into essentially one large kernel package,<br />
for several primary reasons:<br />
• Completeness: <strong>The</strong> <strong>Linux</strong> kernel as distributed<br />
on kernel.org includes all the device<br />
drivers packaged neatly together in<br />
the same kernel tarball. Distro kernels follow<br />
kernel.org in this respect.<br />
• Maintainer simplicity: With over 4000<br />
files in the kernel drivers/ directory,<br />
each possibly separately versioned, it<br />
would be impractical for the kernel maintainer(s)<br />
to provide a separate package for<br />
each driver.<br />
• Quality Assurance / Support organization<br />
simplicity: It is easiest to ask a user “what<br />
kernel version are you running,” and to<br />
compare this against the list of approved<br />
kernel versions released by the QA team,<br />
rather than requiring the customer to provide<br />
a long and extensive list of package<br />
versions, possibly one per module.<br />
• End user install experience: End users<br />
don’t care about which of the 4000 possible<br />
drivers they need to install, they just<br />
want it to work.<br />
This works well as long as you are able to make<br />
the “top of the tree” contain the most current
and most stable device driver, and you are able<br />
to convince your end users to always run the<br />
“top of the tree.” <strong>The</strong> kernel.org development<br />
processes tend to follow this model with<br />
great success.<br />
But widely used distros cannot ask their users<br />
to always update to the top of the kernel.org<br />
tree. Instead, they start their products from the<br />
top of the kernel.org tree at some point in time,<br />
essentially freezing with that, to begin their test<br />
cycles. <strong>The</strong> duration of these test cycles can<br />
be as short as a few weeks, and as long as a<br />
few years, but 3-6 months is not unusual. During<br />
this time, the kernel.org kernels march forward,<br />
and some (but not all) of these changes<br />
are backported into the distro’s kernel. <strong>The</strong>y<br />
then apply the minimal patches necessary for<br />
them to declare the product finished, and move<br />
the project into the sustaining phase, where<br />
changes are very closely scrutinized before releasing<br />
them to the end users.<br />
1.1 Backporting<br />
It is this sustaining phase that DKMS targets.<br />
DKMS can be used to backport newer device<br />
driver versions from the “top of the tree” kernels<br />
where most development takes place to the<br />
now-historical kernels of released products.<br />
<strong>The</strong> PATCH_MATCH mechanism was specifically<br />
designed to allow the application of<br />
patches to a “top of the tree” device driver to<br />
make it work with older kernels. This allows<br />
driver developers to continue to focus their efforts<br />
on keeping kernel.org up to date, while allowing<br />
that same effort to be used on existing<br />
products with minimal changes. See Section 6<br />
for a further explanation of this feature.<br />
1.2 Driver developers’ packaging<br />
Driver developers have recognized for a long<br />
time that they needed to provide backported<br />
versions of their drivers to match their end<br />
users’ needs. Often these requirements are<br />
imposed on them by system vendors such<br />
as Dell in support of a given distro release.<br />
However, each driver developer was free to<br />
provide the backport mechanism in any way<br />
they chose. Some provided architecture-specific<br />
RPMs which contained only precompiled<br />
modules. Some provided source RPMs<br />
which could be rebuilt for the running kernel.<br />
Some provided driver disks with precompiled<br />
modules. Some provided just source code<br />
patches, and expected the end user to rebuild<br />
the kernel themselves to obtain the desired device<br />
driver version. All provided their own<br />
Makefiles rather than use the kernel-provided<br />
build system.<br />
As a result, different problems were encountered<br />
with each developers’ solution. Some<br />
developers had not included their drivers in<br />
the kernel.org tree for so long that there<br />
were discrepancies, e.g. CONFIG_SMP vs<br />
__SMP__, CONFIG_2G vs. CONFIG_3G,<br />
and compiler option differences which went<br />
unnoticed and resulted in hard-to-debug issues.<br />
Needless to say, with so many different mechanisms,<br />
all done differently, and all with different<br />
problems, it was a nightmare for end users.<br />
A new mechanism was needed to cleanly handle<br />
applying updated device drivers onto an<br />
end user’s system. Hence DKMS was created<br />
as the one module update mechanism to replace<br />
all previous methods.<br />
2 Goals<br />
DKMS has several design goals.<br />
• Implement only mechanism, not policy.<br />
• Allow system administrators to easily<br />
know what modules, what versions, for
what kernels, and in what state, they have<br />
on their system.<br />
• Keep module source as it would be found<br />
in the “top of the tree” on kernel.org. Apply<br />
patches to backport the modules to<br />
earlier kernels as necessary.<br />
• Use the kernel-provided build mechanism.<br />
This reduces the Makefile magic<br />
that driver developers need to know, thus<br />
the likelihood of getting it wrong.<br />
• Keep additional DKMS knowledge a<br />
driver developer must have to a minimum.<br />
Only a small per-driver dkms.conf file is<br />
needed.<br />
• Allow multiple versions of any one module<br />
to be present on the system, with only<br />
one active at any given time.<br />
• Allow DKMS-aware drivers to be<br />
packaged in the <strong>Linux</strong> Standard Base-conformant<br />
RPM format.<br />
• Ease of use by multiple audiences: driver<br />
developers, system administrators, <strong>Linux</strong><br />
distros, and system vendors.<br />
We discuss DKMS as it applies to each of these<br />
four audiences.<br />
3 Distributions<br />
All present <strong>Linux</strong> distributions distribute device<br />
drivers bundled into essentially one large<br />
kernel package, for reasons outlined in Section<br />
1. It makes the most sense, most of the<br />
time.<br />
However, there are cases where it does not<br />
make sense.<br />
• Severity 1 bugs are discovered in a single<br />
device driver between larger scheduled<br />
updates. Ideally you’d like your affected<br />
users to be able to get the single<br />
module update without having to release<br />
and Q/A a whole new kernel. Only customers<br />
who are affected by the particular<br />
bug need to update “off-cycle.”<br />
• Solutions vendors, for change control reasons,<br />
often certify their solution on a particular<br />
distribution, scheduled update release,<br />
and sometimes specific kernel version.<br />
<strong>The</strong> latter, combined with releasing<br />
device driver bug fixes as whole new kernels,<br />
puts the customer in the untenable<br />
position of either updating to the new kernel<br />
(and losing the certification of the solution<br />
vendor), or forgoing the bug fix and<br />
possibly putting their data at risk.<br />
• Some device drivers are not (yet) included<br />
in kernel.org nor a distro kernel, however<br />
one may be required for a functional software<br />
solution. <strong>The</strong> current support models<br />
require that the add-on driver “taint”<br />
the kernel or in some way flag to the support<br />
organization that the user is running<br />
an unsupported kernel module. Tainting,<br />
while valid, only has three dimensions<br />
to it at present: Proprietary—non-GPL<br />
licensed; Forced—loaded via insmod<br />
-f; and Unsafe SMP—for some CPUs<br />
which are not designed to be SMP-capable.<br />
A GPL-licensed device driver<br />
which is not yet in kernel.org or provided<br />
by the distribution may trigger none of<br />
these taints, yet the support organization<br />
needs to be aware of this module’s presence.<br />
To avoid this, we expect to see<br />
the distros begin to cryptographically sign<br />
kernel modules that they produce, and<br />
taint on load of an unsigned module. This<br />
would help reduce the support organization’s<br />
work for calls about “unsupported”
configurations. With DKMS in use, there<br />
is less a need for such methods, as it’s easy<br />
to see which modules have been changed.<br />
Note: this is not to suggest that driver authors<br />
should not submit their drivers to<br />
kernel.org—absolutely they should.<br />
• <strong>The</strong> distro QA team would like to test updates<br />
to specific drivers without waiting<br />
for the kernel maintenance team to rebuild<br />
the kernel package (which can take many<br />
hours in some cases). Likewise, individual<br />
end users may be willing (and often be<br />
required, e.g. if the distro QA team can’t<br />
reproduce the user’s hardware and software<br />
environment exactly) to show that a<br />
particular bug is fixed in a driver, prior<br />
to releasing the fix to all of that distro’s<br />
users.<br />
• New hardware support via driver disks:<br />
Hardware vendors release new hardware<br />
asynchronously to any software vendor<br />
schedule, no matter how hard companies<br />
may try to synchronize releases. OS distributions<br />
provide install methods which<br />
use driver diskettes to enable new hardware<br />
for previously-released versions of<br />
the OS. Generating driver disks has always<br />
been a difficult and error-prone procedure,<br />
different for each OS distribution,<br />
not something that the casual end-user<br />
would dare attempt.<br />
DKMS was designed to address all of these<br />
concerns.<br />
DKMS aims to provide a clear separation between<br />
mechanism (how one updates individual<br />
kernel modules and tracks such activity) and<br />
policy (when should one update individual kernel<br />
modules).<br />
3.1 Mechanism<br />
DKMS provides only the mechanism for updating<br />
individual kernel modules, not policy.<br />
As such, it can be used by distributions (per<br />
their policy) for updating individual device<br />
drivers for individual users affected by Severity<br />
1 bugs, without releasing a whole new kernel.<br />
<strong>The</strong> first mechanism critical to a system administrator<br />
or support organization is the status<br />
command, which reports the name, version,<br />
and state of each kernel module under DKMS<br />
control. By querying DKMS for this information,<br />
system administrators and distribution<br />
support organizations may quickly understand<br />
when an updated device driver is in use to<br />
speed resolution when issues are seen.<br />
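For instance, a support engineer can ask an administrator to run a quick query such as the following (module name, version, and kernel are illustrative; the output format matches the status examples shown later in this paper):<br />
# dkms status -m megaraid2<br />
megaraid2, 2.10.3, 2.4.21-4.ELsmp: installed<br />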
DKMS’s ability to generate driver diskettes<br />
gives control to both novice and seasoned system<br />
administrators alike, as they can now perform<br />
work which otherwise they would have<br />
to wait for a support organization to do for<br />
them. <strong>The</strong>y can get their new hardware systems<br />
up-and-running quickly by themselves,<br />
leaving the support organizations with time to<br />
do other more interesting value-added work.<br />
3.2 Policy<br />
Suggested policy items include:<br />
• Updates must pass QA. This seems obvious,<br />
but it keeps broken updates (designed<br />
to fix other problems) from being<br />
released.<br />
• Updates must be submitted, and ideally be<br />
included already, upstream. For this we<br />
expect kernel.org and the OS distribution<br />
to include the update in their next larger<br />
scheduled update. This ensures that when<br />
the next kernel.org kernel or distro update
comes out, the short-term fix provided via<br />
DKMS is incorporated already.<br />
• <strong>The</strong> AUTOINSTALL mechanism is set to<br />
NO for all modules which are shipped with<br />
the target distro’s kernel. This prevents<br />
the DKMS autoinstaller from installing<br />
a (possibly older) kernel module onto a<br />
newer kernel without being explicitly told<br />
to do so by the system administrator. This<br />
follows from the “all DKMS updates must<br />
be in the next larger release” rule above.<br />
• All issues for which DKMS is used are<br />
tracked in the appropriate bug tracking<br />
databases until they are included upstream,<br />
and are reviewed regularly.<br />
• All DKMS packages are provided as<br />
DKMS-enabled RPMs for easy installation<br />
and removal, per the <strong>Linux</strong> Standard<br />
Base specification.<br />
• All DKMS packages are posted to the distro’s<br />
support web site for download by<br />
system administrators affected by the particular<br />
issue.<br />
4 System Vendors<br />
DKMS is useful to System Vendors such as<br />
Dell for many of the same reasons it’s useful<br />
to the <strong>Linux</strong> distributions. In addition, system<br />
vendors face additional issues:<br />
• Critical bug fixes for distro-provided<br />
drivers: While we hope to never need<br />
such, and we test extensively with distro-provided<br />
drivers, occasionally we have<br />
discovered a critical bug after the distribution<br />
has cut their gold CDs. We use<br />
DKMS to update just the affected device<br />
drivers.<br />
• Alternate drivers: Dell occasionally needs<br />
to provide an alternate driver for a piece of<br />
hardware rather than that provided by the<br />
distribution natively. For example, Dell<br />
provides the Intel iANS network channel<br />
bonding and failover driver for customers<br />
who have used iANS in the past, and wish<br />
to continue using it rather than upgrading<br />
to the native channel bonding driver resident<br />
in the distribution.<br />
• Factory installation: Dell installs various<br />
OS distribution releases onto new hardware<br />
in its factories. We try not to update<br />
from the gold release of a distribution<br />
version to any of the scheduled updates,<br />
as customers expect to receive gold. We<br />
use DKMS to enable newer device drivers<br />
to handle newer hardware than was supported<br />
natively in the gold release, while<br />
keeping the gold kernel the same.<br />
We briefly describe the policy Dell uses, in addition<br />
to the above rules suggested to OS distributions:<br />
• Prebuilt DKMS tarballs are required for<br />
factory installation use, for all kernels<br />
used in the factory install process. This<br />
prevents the need for the compiler to be<br />
run, saving time through the factories.<br />
Dell rarely changes the factory install images<br />
for a given OS release, so this is not<br />
a huge burden on the DKMS packager.<br />
• All DKMS packages are posted to support.dell.com<br />
for download by system administrators<br />
purchasing systems without<br />
<strong>Linux</strong> factory-installed.
Figure 1: DKMS state diagram.<br />
5 System Administrators<br />
5.1 Understanding the DKMS Life Cycle<br />
Before diving into using DKMS to manage kernel<br />
modules, it is helpful to understand the life<br />
cycle by which DKMS maintains your kernel<br />
modules. In Figure 1, each rectangle represents<br />
a state your module can be in and each<br />
italicized word represents a DKMS action that<br />
can used to switch between the various DKMS<br />
states. In the following section we will look<br />
further into each of these DKMS actions and<br />
then continue on to discuss auxiliary DKMS<br />
functionality that extends and improves upon<br />
your ability to utilize these basic commands.<br />
5.2 RPM and DKMS<br />
DKMS was designed to work well with the Red<br />
Hat Package Manager (RPM). Many times, using<br />
DKMS to install a kernel module is as easy<br />
as installing a DKMS-enabled module RPM.<br />
Internally in these RPMs, DKMS is used to<br />
add, build, and install a module. By<br />
wrapping DKMS commands inside of an RPM,<br />
you get the benefits of RPM (package versioning,<br />
security, dependency resolution, and package<br />
distribution methodologies) while DKMS<br />
handles the work RPM does not: versioning<br />
and building of individual kernel modules.<br />
For reference, a sample DKMS-enabled RPM<br />
specfile can be found in the DKMS package.<br />
5.3 Using DKMS<br />
DKMS manages kernel module versions at<br />
the source code level. <strong>The</strong> first requirement<br />
of using DKMS is that the module<br />
source be located on the build system and<br />
that it be located in the directory /usr/src/<br />
<module>-<module-version>/. It<br />
also requires that a dkms.conf file exists with<br />
the appropriately formatted directives within<br />
this configuration file to tell DKMS such things<br />
as where to install the module and how to build<br />
it.<br />
5.3.1 Add<br />
Once these two requirements have been<br />
met and DKMS has been installed on your system,<br />
you can begin using DKMS by adding a<br />
module/module-version to the DKMS tree. For<br />
example:<br />
# dkms add -m megaraid2 -v 2.10.3<br />
This example add command would add<br />
megaraid2/2.10.3 to the already existent<br />
/var/dkms tree, leaving it in the Added<br />
state.<br />
5.3.2 Build<br />
Once in the Added state, the module is ready<br />
to be built. This occurs through the DKMS<br />
build command and requires that the proper<br />
kernel sources are located on the system from<br />
the /lib/modules/<kernel-version>/build<br />
symlink. <strong>The</strong> make command that is
used to compile the module is specified in the<br />
dkms.conf configuration file. Continuing with<br />
the megaraid2/2.10.3 example:<br />
# dkms build -m megaraid2<br />
-v 2.10.3 -k 2.4.21-4.ELsmp<br />
<strong>The</strong> build command compiles the module<br />
but stops short of installing it. As can be seen<br />
in the above example, build expects a kernel-version<br />
parameter. If this kernel name is left<br />
out, it assumes the currently running kernel.<br />
However, it functions perfectly well to build<br />
modules for kernels that are not currently running.<br />
This functionality is assured through use<br />
of a kernel preparation subroutine that runs before<br />
any module build is performed in order<br />
to ensure that the module being built is linked<br />
against the proper kernel symbols.<br />
Successful completion of a build creates, for<br />
this example, the /var/dkms/megaraid2/<br />
2.10.3/2.4.21-4.ELsmp/ directory as<br />
well as the log and module subdirectories<br />
within this directory. <strong>The</strong> log directory holds<br />
a log file of the module make and the module<br />
directory holds copies of the resultant binaries.<br />
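As a rough sketch (the exact file names vary; make.log and the module binary name here are only illustrative), the resulting layout looks something like:<br />
/var/dkms/megaraid2/2.10.3/2.4.21-4.ELsmp/log/make.log<br />
/var/dkms/megaraid2/2.10.3/2.4.21-4.ELsmp/module/megaraid2.o<br />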
5.3.3 Install<br />
With the completion of a build, the module<br />
can now be installed on the kernel for<br />
which it was built. Installation copies the compiled<br />
module binary to the correct location in<br />
the /lib/modules/ tree as specified in the<br />
dkms.conf file. If a module by that name is<br />
already found in that location, DKMS saves it<br />
in its tree as an original module so at a later<br />
time it can be put back into place if the newer<br />
module is uninstalled. An example install<br />
command:<br />
# dkms install -m megaraid2<br />
-v 2.10.3 -k 2.4.21-4.ELsmp<br />
If a module by the same name is already<br />
installed, DKMS saves a copy in its<br />
tree and does so in the /var/dkms/<module>/<br />
original_module/<kernel-version>/<br />
directory. In this case, it would be saved to<br />
/var/dkms/megaraid2/original_<br />
module/2.4.21-4.ELsmp/.<br />
5.3.4 Uninstall and Remove<br />
To complete the DKMS cycle, you can also<br />
uninstall or remove your module from the<br />
tree. <strong>The</strong> uninstall command deletes from<br />
/lib/modules the module you installed<br />
and, if applicable, replaces it with its original<br />
module. In scenarios where multiple versions<br />
of a module are located within the DKMS tree,<br />
when one version is uninstalled, DKMS does<br />
not try to understand or assume which of these<br />
other versions to put in its place. Instead, if<br />
a true “original_module” was saved from the<br />
very first DKMS installation, it will be put back<br />
into the kernel and all of the other module versions<br />
for that module will be left in the Built<br />
state. An example uninstall would be:<br />
# dkms uninstall -m megaraid2<br />
-v 2.10.3 -k 2.4.21-4.ELsmp<br />
Again, if the kernel version parameter is unset,<br />
the currently running kernel is assumed<br />
(the same default does not apply to the<br />
remove command). <strong>The</strong> remove and<br />
uninstall are very similar in that remove<br />
will do all of the same steps as uninstall.<br />
However, when remove is employed, if the<br />
module-version being removed is the last instance<br />
of that module-version for all kernels<br />
on your system, after the uninstall portion of<br />
the remove completes, it will delete all traces<br />
of that module from the DKMS tree. To put it<br />
another way, when an uninstall command<br />
completes, your modules are left in the Built
state. However, when a remove completes,<br />
you would be left in the Not in Tree state. Here<br />
are two sample remove commands:<br />
# dkms remove -m megaraid2<br />
-v 2.10.3 -k 2.4.21-4.ELsmp<br />
# dkms remove -m megaraid2<br />
-v 2.10.3 --all<br />
With the first example remove command,<br />
your module would be uninstalled and if this<br />
module/module-version were not installed on<br />
any other kernel, all traces of it would be removed<br />
from the DKMS tree altogether. If,<br />
say, megaraid2/2.10.3 was also installed on the<br />
2.4.21-4.ELhugemem kernel, the first remove<br />
command would leave it alone and it would remain<br />
intact in the DKMS tree. In the second<br />
example, that would not be the case. It would<br />
uninstall all versions of the megaraid2/2.10.3<br />
module from all kernels and then completely<br />
expunge all references of megaraid2/2.10.3<br />
from the DKMS tree. Thus, remove is what<br />
cleans your DKMS tree.<br />
5.4 Miscellaneous DKMS Commands<br />
5.4.1 Status<br />
DKMS also comes with a fully functional status<br />
command that returns information about<br />
what is currently located in your tree. If no<br />
parameters are set, it will return all information<br />
found. Logically, the specificity of information<br />
returned depends on which parameters<br />
are passed to your status command. Each status<br />
entry returned will be of the state: “added,”<br />
“built,” or “installed,” and if an original module<br />
has been saved, this information will also<br />
be displayed. Some example status commands<br />
include:<br />
# dkms status<br />
# dkms status -m megaraid2<br />
# dkms status -m megaraid2 -v 2.10.3<br />
# dkms status -k 2.4.21-4.ELsmp<br />
# dkms status -m megaraid2<br />
-v 2.10.3 -k 2.4.21-4.ELsmp<br />
5.4.2 Match<br />
Another major feature of DKMS is the match<br />
command. <strong>The</strong> match command takes the configuration<br />
of DKMS installed modules for one<br />
kernel and applies this same configuration to<br />
some other kernel. When the match completes,<br />
the same module/module-versions that were<br />
installed for one kernel are also then installed<br />
on the other kernel. This is helpful when you<br />
are upgrading from one kernel to the next, but<br />
would like to keep the same DKMS modules in<br />
place for the new kernel. Here is an example:<br />
# dkms match<br />
--templatekernel 2.4.21-4.ELsmp<br />
-k 2.4.21-5.ELsmp<br />
As can be seen in the example, the<br />
--templatekernel is the “match-er”<br />
kernel from which the configuration is based,<br />
while the -k kernel is the “match-ee” upon<br />
which the configuration is instated.<br />
5.4.3 dkms_autoinstaller<br />
Similar in nature to the match command is<br />
the dkms_autoinstaller service. This service<br />
gets installed as part of the DKMS RPM<br />
in the /etc/init.d directory. Depending on<br />
whether an AUTOINSTALL directive is set<br />
within a module’s dkms.conf configuration<br />
file, the dkms_autoinstaller will automatically<br />
build and install that module as you boot your<br />
system into new kernels which do not already<br />
have this module installed.<br />
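For example, on a Red Hat-style system the service can be enabled with the usual init-script tooling (chkconfig is shown here only as an illustration), provided the module’s dkms.conf sets AUTOINSTALL="yes" as described in Section 6.3:<br />
# chkconfig dkms_autoinstaller on<br />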
5.4.4 mkdriverdisk<br />
<strong>The</strong> last miscellaneous DKMS command is<br />
mkdriverdisk. As can be inferred from its<br />
name, mkdriverdisk will take the proper
sources in your DKMS tree and create a driver<br />
disk image for use in providing updated drivers<br />
to <strong>Linux</strong> distribution installations. A sample<br />
mkdriverdisk might look like:<br />
# dkms mkdriverdisk -d redhat<br />
-m megaraid2 -v 2.10.3<br />
-k 2.4.21-4.ELBOOT<br />
Currently, the only supported distribution<br />
driver disk format is Red Hat. For more<br />
information on the extra necessary files and<br />
their formats for DKMS to create Red<br />
Hat driver disks, see http://people.<br />
redhat.com/dledford. <strong>The</strong>se files<br />
should be placed in your module source directory.<br />
5.5 Systems Management with DKMS Tarballs<br />
As we have seen, DKMS provides a simple<br />
mechanism to build, install, and track device<br />
driver updates. So far, all these actions have<br />
related to a single machine. But what if you’ve<br />
got many similar machines under your administrative<br />
control? What if you have a compiler<br />
and kernel source on only one system (your<br />
master build system), but you need to deploy<br />
your newly built driver to all your other systems?<br />
DKMS provides a solution to this as<br />
well—in the mktarball and ldtarball<br />
commands.<br />
<strong>The</strong> mktarball command rolls up copies of<br />
each device driver module file which you’ve<br />
built using DKMS into a compressed tarball.<br />
You may then copy this tarball to each<br />
of your target systems, and use the DKMS<br />
ldtarball command to load those into your<br />
DKMS tree, leaving each module in the Built<br />
state, ready to be installed. This avoids the<br />
need for both kernel source and compilers to<br />
be on every target system.<br />
You have built the megaraid2 device driver,<br />
version 2.10.3, for two different kernel families<br />
(here 2.4.20-9 and 2.4.21-4.EL), on your<br />
master build system. For example:<br />
# dkms status<br />
megaraid2, 2.10.3, 2.4.20-9: built<br />
megaraid2, 2.10.3, 2.4.20-9bigmem: built<br />
megaraid2, 2.10.3, 2.4.20-9BOOT: built<br />
megaraid2, 2.10.3, 2.4.20-9smp: built<br />
megaraid2, 2.10.3, 2.4.21-4.EL: built<br />
megaraid2, 2.10.3, 2.4.21-4.ELBOOT: built<br />
megaraid2, 2.10.3, 2.4.21-4.ELhugemem: built<br />
megaraid2, 2.10.3, 2.4.21-4.ELsmp: built<br />
You wish to deploy this version of the<br />
driver to several systems, without rebuilding<br />
from source each time. You can use the<br />
mktarball command to generate one tarball<br />
for each kernel family:<br />
# dkms mktarball -m megaraid2<br />
-v 2.10.3<br />
-k 2.4.21-4.EL,2.4.21-4.ELsmp,<br />
2.4.21-4.ELBOOT,2.4.21-4.ELhugemem<br />
Marking /usr/src/megaraid2-2.10.3 for archiving...<br />
Marking kernel 2.4.21-4.EL for archiving...<br />
Marking kernel 2.4.21-4.ELBOOT for archiving...<br />
Marking kernel 2.4.21-4.ELhugemem for archiving...<br />
Marking kernel 2.4.21-4.ELsmp for archiving...<br />
Tarball location:<br />
/var/dkms/megaraid2/2.10.3/tarball/<br />
megaraid2-2.10.3-manykernels.tgz<br />
Done.<br />
You can make one big tarball containing modules<br />
for both families by omitting the -k argument<br />
and kernel list; DKMS will include a<br />
module for every kernel version found.<br />
You may then copy the tarball (renaming it if<br />
you wish) to each of your target systems using<br />
any mechanism you wish, and load the modules<br />
in. First, see that the target DKMS tree<br />
does not contain the modules you’re loading:<br />
# dkms status<br />
Nothing found within the DKMS tree for<br />
this status command. If your modules were<br />
not installed with DKMS, they will not show<br />
up here.<br />
<strong>The</strong>n, load the tarball on your target system:
# dkms ldtarball<br />
--archive=megaraid2-2.10.3-manykernels.tgz<br />
Loading tarball for module:<br />
megaraid2 / version: 2.10.3<br />
Loading /usr/src/megaraid2-2.10.3...<br />
Loading /var/dkms/megaraid2/2.10.3/2.4.21-4.EL...<br />
Loading /var/dkms/megaraid2/2.10.3/2.4.21-4.ELBOOT...<br />
Loading /var/dkms/megaraid2/2.10.3/2.4.21-4.ELhugemem...<br />
Loading /var/dkms/megaraid2/2.10.3/2.4.21-4.ELsmp...<br />
Creating /var/dkms/megaraid2/2.10.3/source symlink...<br />
Finally, verify the modules are present, and in<br />
the Built state:<br />
# dkms status<br />
megaraid2, 2.10.3, 2.4.21-4.EL: built<br />
megaraid2, 2.10.3, 2.4.21-4.ELBOOT: built<br />
megaraid2, 2.10.3, 2.4.21-4.ELhugemem: built<br />
megaraid2, 2.10.3, 2.4.21-4.ELsmp: built<br />
DKMS ldtarball leaves the modules in the<br />
Built state, not the Installed state. For each kernel<br />
version you want your modules to be installed<br />
into, follow the install steps as above.<br />
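For example, to move the module into the Installed state on a target system running the 2.4.21-4.ELsmp kernel:<br />
# dkms install -m megaraid2 -v 2.10.3 -k 2.4.21-4.ELsmp<br />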
6 Driver Developers<br />
As the maintainer of a kernel module, the only<br />
thing you need to do to get DKMS interoperability<br />
is place a small dkms.conf file in your<br />
driver source tarball. Once this has been done,<br />
any user of DKMS can simply do:<br />
dkms ldtarball --archive /path/to/foo-1.0.tgz<br />
That’s it. We could discuss at length (which<br />
we will not rehash in this paper) the best methods<br />
of utilizing DKMS within a DKMS-enabled<br />
module RPM, but for simple DKMS usability,<br />
the buck stops here. With the dkms.conf file<br />
in place, you have now positioned your source<br />
tarball to be usable by all manner and skill level<br />
of <strong>Linux</strong> users utilizing your driver. Effectively,<br />
you have widely increased your testing<br />
base without having to wade into package management<br />
or pre-compiled binaries. DKMS will<br />
handle this all for you. Along the same line,<br />
by leveraging DKMS you can now easily allow<br />
more widespread testing of your driver. Since<br />
driver versions can now be cleanly tracked outside<br />
of the kernel tree, you no longer must wait<br />
for the next kernel release in order for the community<br />
to register the necessary debugging cycles<br />
against your code. Instead, DKMS can be<br />
counted on to manage various versions of your<br />
kernel module such that any catastrophic errors<br />
in your code can be easily mitigated by a singular<br />
dkms uninstall command.<br />
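For instance, continuing the hypothetical foo-1.0 example above, a tester who hits a regression can fall back with a single command (the kernel defaults to the running one, as described in Section 5.3.4):<br />
# dkms uninstall -m foo -v 1.0<br />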
This leaves the composition of the dkms.conf<br />
as the only interesting piece left to discuss<br />
for the driver developer audience. With that<br />
in mind, we will now explicate over two<br />
dkms.conf examples ranging from that which<br />
is minimally required (Figure 2) to that which<br />
expresses maximal configuration (Figure 3).<br />
6.1 Minimal dkms.conf for 2.4 kernels<br />
Referring to Figure 2, the first thing that is distinguishable<br />
is the definition of the version of<br />
the package and the make command to be used<br />
to compile your module. This is only necessary<br />
for 2.4-based kernels, and lets the developer<br />
specify their desired make incantation.<br />
Reviewing the rest of the dkms.conf,<br />
PACKAGE_NAME and BUILT_MODULE_<br />
NAME[0] appear to be duplicate in nature,<br />
but this is only the case for a package which<br />
contains only one kernel module within it.<br />
Had this example been for something like<br />
ALSA, the name of the package would be<br />
“alsa,” but the BUILT_MODULE_NAME array<br />
would instead be populated with the names of<br />
the kernel modules within the ALSA package.<br />
<strong>The</strong> final required piece of this minimal example<br />
is the DEST_MODULE_LOCATION array.<br />
This simply tells DKMS where in the<br />
/lib/modules tree it should install your module.
PACKAGE_NAME="megaraid2"<br />
PACKAGE_VERSION="2.10.3"<br />
MAKE[0]="make -C ${kernel_source_dir}<br />
SUBDIRS=${dkms_tree}/${PACKAGE_NAME}/${PACKAGE_VERSION}/build modules"<br />
BUILT_MODULE_NAME[0]="megaraid2"<br />
DEST_MODULE_LOCATION[0]="/kernel/drivers/scsi/"<br />
Figure 2: A minimal dkms.conf<br />
6.2 Minimal dkms.conf for 2.6 kernels<br />
In the current version of DKMS, for 2.6 kernels<br />
the MAKE command listed in the dkms.conf<br />
is wholly ignored, and instead DKMS will always<br />
use:<br />
make -C /lib/modules/$kernel_version/build \<br />
M=$dkms_tree/$module/$module_version/build<br />
This jibes with the new external module build<br />
infrastructure supported by Sam Ravnborg’s<br />
kernel Makefile improvements, as DKMS will<br />
always build your module in a build subdirectory<br />
it creates for each version you have<br />
installed. Similarly, an impending future<br />
version of DKMS will also begin to ignore<br />
the PACKAGE_VERSION as specified in<br />
dkms.conf in favor of the new modinfo provided<br />
information as implemented by Rusty<br />
Russell.<br />
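That per-module version tag can be inspected directly from a built module file with modinfo; for example (module file name illustrative):<br />
# modinfo megaraid2.ko | grep version<br />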
With regard to removing the requirement for<br />
DEST_MODULE_LOCATION for 2.6 kernels,<br />
given that similar information should be located<br />
in the install target of the Makefile provided<br />
with your package, it is theoretically possible<br />
that DKMS could one day glean such<br />
information from the Makefile instead. In<br />
fact, in a simple scenario as this example, it<br />
is further theoretically possible that the name<br />
of the package and of the built module could<br />
also be determined from the package Makefile.<br />
In effect, this would completely remove<br />
any need for a dkms.conf whatsoever, thus enabling<br />
all simple module tarballs to be automatically<br />
DKMS enabled.<br />
However, as these features have not been explored,<br />
and as package maintainers will<br />
likely want to use some of the other dkms.conf<br />
directives elaborated upon below,<br />
the dkms.conf requirement is likely to remain<br />
for the foreseeable future.<br />
6.3 Optional dkms.conf directives<br />
In the real-world version of Dell’s DKMS-enabled<br />
megaraid2 package, we also specify<br />
the optional directives:<br />
MODULES_CONF_ALIAS_TYPE[0]="scsi_hostadapter"<br />
MODULES_CONF_OBSOLETES[0]="megaraid,megaraid_2002"<br />
REMAKE_INITRD="yes"<br />
<strong>The</strong>se directives tell DKMS to remake the kernel’s<br />
initial ramdisk after every DKMS install<br />
or uninstall of this module. <strong>The</strong>y further specify<br />
that before this happens, /etc/modules.conf<br />
(or /etc/sysconfig/kernel) should be edited intelligently<br />
so that the initrd is properly assembled.<br />
In this case, if /etc/modules.conf already<br />
contains a reference to either “megaraid” or<br />
“megaraid_2002,” these will be switched to<br />
“megaraid2.” If no such references are found,
then a new “scsi_hostadapter” entry will be<br />
added as the last such scsi_hostadapter number.<br />
On the other hand, if it had also included:<br />
MODULES_CONF_OBSOLETES_ONLY="yes"<br />
then had no obsolete references been found,<br />
a new “scsi_hostadapter” line would not have<br />
been added. This would be useful in scenarios<br />
where you instead want to rely on something<br />
like Red Hat’s kudzu program for adding references<br />
for your kernel modules.<br />
As well one could hypothetically also specify<br />
within the dkms.conf:<br />
DEST_MODULE_NAME[0]="megaraid"<br />
This would cause the resultant megaraid2 kernel<br />
module to be renamed to “megaraid” before<br />
being installed. Rather than having to<br />
propagate various one-off naming mechanisms<br />
which include the version as part of the module<br />
name in /lib/modules as has been previous<br />
common practice, DKMS could instead be relied<br />
upon to manage all module versioning to<br />
avoid such clutter. Was megaraid_2002 a version<br />
or just a special year in the hearts of the<br />
megaraid developers? While you and I might<br />
know the answer to this, it certainly confused<br />
Dell’s customers.<br />
Continuing with hypothetical additions to the<br />
dkms.conf in Figure 2, one could also include:<br />
BUILD_EXCLUSIVE_KERNEL="^2\.4.*"<br />
BUILD_EXCLUSIVE_ARCH="i.86"<br />
In the event that you know the code you produced<br />
is not portable, this is how you can tell<br />
DKMS to keep people from trying to build it<br />
elsewhere. <strong>The</strong> above restrictions would only<br />
allow the kernel module to be built on 2.4 kernels<br />
on x86 architectures.<br />
Continuing with optional dkms.conf directives,<br />
the ALSA example in Figure 3 is taken<br />
directly from a DKMS-enabled package that<br />
Dell released to address sound issues on the<br />
Precision 360 workstation. It is slightly<br />
abridged as the alsa-driver as delivered actually<br />
installs 13 separate kernel modules, but for the<br />
sake of this example, only 9 are shown.<br />
In this example, we have:<br />
AUTOINSTALL="yes"<br />
This tells the boot-time service<br />
dkms_autoinstaller that this package should be<br />
built and installed as you boot into a new kernel<br />
onto which DKMS has not already installed this<br />
package.<br />
allows AUTOINSTALL to be set if the kernel<br />
modules are not already natively included<br />
with the kernel. This is to avoid the scenario<br />
where DKMS might automatically install<br />
over a newer version of the kernel module as<br />
provided by some newer version of the kernel.<br />
However, given the 2.6 modinfo changes,<br />
DKMS can now be modified to intelligently<br />
check the version of a native kernel module<br />
before clobbering it with some older version.<br />
This will likely result in a future policy change<br />
within Dell with regard to this feature.<br />
In this example, we also have:<br />
PATCH[0]="adriver.h.patch"<br />
PATCH_MATCH[0]="2.4.[2-9][2-9]"<br />
<strong>The</strong>se two directives indicate to DKMS that<br />
if the kernel that the kernel module is being<br />
built for is >=2.4.22 (but still of the 2.4 family),<br />
the included adriver.h.patch should first be
PACKAGE_NAME="alsa-driver"<br />
PACKAGE_VERSION="0.9.0rc6"<br />
MAKE="sh configure --with-cards=intel8x0 --with-sequencer=yes \<br />
--with-kernel=/lib/modules/$kernelver/build \<br />
--with-moddir=/lib/modules/$kernelver/kernel/sound > /dev/null; make"<br />
AUTOINSTALL="yes"<br />
PATCH[0]="adriver.h.patch"<br />
PATCH_MATCH[0]="2.4.[2-9][2-9]"<br />
POST_INSTALL="alsa-driver-dkms-post.sh"<br />
MODULES_CONF[0]="alias char-major-116 snd"<br />
MODULES_CONF[1]="alias snd-card-0 snd-intel8x0"<br />
MODULES_CONF[2]="alias char-major-14 soundcore"<br />
MODULES_CONF[3]="alias sound-slot-0 snd-card-0"<br />
MODULES_CONF[4]="alias sound-service-0-0 snd-mixer-oss"<br />
MODULES_CONF[5]="alias sound-service-0-1 snd-seq-oss"<br />
MODULES_CONF[6]="alias sound-service-0-3 snd-pcm-oss"<br />
MODULES_CONF[7]="alias sound-service-0-8 snd-seq-oss"<br />
MODULES_CONF[8]="alias sound-service-0-12 snd-pcm-oss"<br />
MODULES_CONF[9]="post-install snd-card-0 /usr/sbin/alsactl restore >/dev/null 2>&1 || :"<br />
MODULES_CONF[10]="pre-remove snd-card-0 /usr/sbin/alsactl store >/dev/null 2>&1 || :"<br />
BUILT_MODULE_NAME[0]="snd-pcm"<br />
BUILT_MODULE_LOCATION[0]="acore"<br />
DEST_MODULE_LOCATION[0]="/kernel/sound/acore"<br />
BUILT_MODULE_NAME[1]="snd-rawmidi"<br />
BUILT_MODULE_LOCATION[1]="acore"<br />
DEST_MODULE_LOCATION[1]="/kernel/sound/acore"<br />
BUILT_MODULE_NAME[2]="snd-timer"<br />
BUILT_MODULE_LOCATION[2]="acore"<br />
DEST_MODULE_LOCATION[2]="/kernel/sound/acore"<br />
BUILT_MODULE_NAME[3]="snd"<br />
BUILT_MODULE_LOCATION[3]="acore"<br />
DEST_MODULE_LOCATION[3]="/kernel/sound/acore"<br />
BUILT_MODULE_NAME[4]="snd-mixer-oss"<br />
BUILT_MODULE_LOCATION[4]="acore/oss"<br />
DEST_MODULE_LOCATION[4]="/kernel/sound/acore/oss"<br />
BUILT_MODULE_NAME[5]="snd-pcm-oss"<br />
BUILT_MODULE_LOCATION[5]="acore/oss"<br />
DEST_MODULE_LOCATION[5]="/kernel/sound/acore/oss"<br />
BUILT_MODULE_NAME[6]="snd-seq-device"<br />
BUILT_MODULE_LOCATION[6]="acore/seq"<br />
DEST_MODULE_LOCATION[6]="/kernel/sound/acore/seq"<br />
BUILT_MODULE_NAME[7]="snd-seq-midi-event"<br />
BUILT_MODULE_LOCATION[7]="acore/seq"<br />
DEST_MODULE_LOCATION[7]="/kernel/sound/acore/seq"<br />
BUILT_MODULE_NAME[8]="snd-seq-midi"<br />
BUILT_MODULE_LOCATION[8]="acore/seq"<br />
DEST_MODULE_LOCATION[8]="/kernel/sound/acore/seq"<br />
BUILT_MODULE_NAME[9]="snd-seq"<br />
BUILT_MODULE_LOCATION[9]="acore/seq"<br />
DEST_MODULE_LOCATION[9]="/kernel/sound/acore/seq"<br />
Figure 3: An elaborate dkms.conf
applied to the module source before a module<br />
build occurs. In this way, by including various<br />
patches needed for various kernel versions,<br />
you can distribute one source tarball and ensure<br />
it will always properly build regardless of<br />
the end user target kernel. If no corresponding<br />
PATCH_MATCH[0] entry were specified for<br />
PATCH[0], then the adriver.h.patch would always<br />
get applied before a module build. As<br />
DKMS always starts off each module build<br />
with pristine module source, you can always<br />
ensure the right patches are being applied.<br />
Also seen in this example is:<br />
MODULES_CONF[0]="alias char-major-116 snd"<br />
MODULES_CONF[1]="alias snd-card-0 snd-intel8x0"<br />
Unlike the previous discussion of<br />
/etc/modules.conf changes, any entries<br />
placed into the MODULES_CONF array are<br />
automatically added into /etc/modules.conf<br />
during a module install. <strong>The</strong>se are later only<br />
removed during the final module uninstall.<br />
Lastly, we have:<br />
POST_INSTALL="alsa-driver-dkms-post.sh"<br />
In the event that you have other scripts that<br />
must be run during various DKMS events,<br />
DKMS includes POST_ADD, POST_BUILD,<br />
POST_INSTALL and POST_REMOVE functionality.<br />
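A hypothetical dkms.conf could therefore also carry hooks such as the following (the script names are invented purely for illustration):<br />
POST_BUILD="check-module-symbols.sh"<br />
POST_REMOVE="cleanup-local-config.sh"<br />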
7 Future<br />
As you can tell from the above, DKMS is very<br />
much ready for deployment now. However, as<br />
with all software projects, there’s room for improvement.<br />
7.1 Cross-Architecture Builds<br />
DKMS today has no concept of a platform architecture<br />
such as i386, x86_64, ia64, sparc,<br />
and the like. It expects that it is building kernel<br />
modules with a native compiler, not a cross<br />
compiler, and that the target architecture is the<br />
native architecture. While this works in practice,<br />
it would be convenient if DKMS were able<br />
to be used to build kernel modules for non-native<br />
architectures.<br />
Today DKMS handles the cross-architecture<br />
build process by having separate /var/dkms directory<br />
trees for each architecture, using<br />
the dkmstree option to specify a different<br />
tree and the config option to specify<br />
a different kernel configuration file.<br />
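A cross-build invocation today might therefore look roughly like the following sketch (assuming the long-option spellings --dkmstree and --config for the options described above; the tree and configuration-file paths are purely illustrative):<br />
# dkms build -m megaraid2 -v 2.10.3 -k 2.4.21-4.ELsmp \<br />
--dkmstree /var/dkms-x86_64 \<br />
--config /path/to/x86_64-kernel-config<br />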
Going forward, we plan to add an --arch<br />
option to DKMS, or have it glean it from the<br />
kernel config file and act accordingly.<br />
7.2 Additional distribution driver disks<br />
DKMS today supports generating driver disks<br />
in the Red Hat format only. We recognize that<br />
other distributions accomplish the same goal<br />
using other driver disk formats. This should<br />
be relatively simple to add once we understand<br />
what the additional formats are.<br />
8 Conclusion<br />
DKMS provides a simple and unified mechanism<br />
for driver authors, <strong>Linux</strong> distributions,<br />
system vendors, and system administrators to<br />
update the device drivers on a target system<br />
without updating the whole kernel. It allows<br />
driver developers to keep their work aimed at<br />
the “top of the tree,” and to backport that work<br />
to older kernels painlessly. It allows <strong>Linux</strong> distributions<br />
to provide updates to single device<br />
drivers asynchronous to the release of a larger
scheduled update, and to know what drivers<br />
have been updated. It lets system vendors<br />
ship newer hardware than was supported in a<br />
distribution’s “gold” release without invalidating<br />
any test or certification work done on the<br />
“gold” release. It lets system administrators<br />
update individual drivers to match their environment<br />
and their needs, regardless of whose<br />
kernel they are running. It lets end users track<br />
which module versions have been added to<br />
their system.<br />
We believe DKMS is a project whose time has<br />
come, and encourage everyone to use it.<br />
9 References<br />
DKMS is licensed under the GNU General<br />
Public License. It is available at<br />
http://linux.dell.com/dkms/,<br />
and has a mailing list dkms-devel@<br />
lists.us.dell.com to which you may<br />
subscribe at http://lists.us.dell.<br />
com/.
e100 Weight Reduction Program<br />
Writing for Maintainability<br />
Scott Feldman<br />
Intel Corporation<br />
scott.feldman@intel.com<br />
Abstract<br />
Corporate-authored device drivers are<br />
bloated/buggy with dead code, HW and<br />
OS abstraction layers, non-standard user<br />
controls, and support for complicated HW<br />
features that provide little or no value. e100<br />
in 2.6.4 has been rewritten to address these<br />
issues and in the process lost 75% of the lines<br />
of code, with no loss of functionality. This<br />
paper gives guidelines to other corporate driver<br />
authors.<br />
Introduction<br />
This paper gives some basic guidelines to corporate<br />
device driver maintainers based on experiences<br />
I had while re-writing the e100 network<br />
device driver for Intel’s PRO/100+ Ethernet<br />
controllers. By corporate maintainer, I<br />
mean someone employed by a corporation to<br />
provide <strong>Linux</strong> driver support for that corporation’s<br />
device. Of course, these guidelines may<br />
apply to non-corporate individuals as well, but<br />
the intended audience is the corporate driver<br />
author.<br />
<strong>The</strong> assumption behind these guidelines is that<br />
the device driver is intended for inclusion in<br />
the <strong>Linux</strong> kernel. For a driver to be accepted<br />
into the <strong>Linux</strong> kernel, it must meet both technical<br />
and non-technical requirements. This paper<br />
focuses on the non-technical requirements,<br />
specifically maintainability.<br />
Guideline #1: Maintainability over<br />
Everything Else<br />
Corporate marketing requirements documents<br />
assign priorities to features, performance,<br />
and schedule (time-to-market), but<br />
rarely specify maintainability. However, maintainability<br />
is the most important requirement<br />
for <strong>Linux</strong> kernel drivers.<br />
Why?<br />
• You will not be the long-term driver maintainer.<br />
• Your company will not be the long-term<br />
driver maintainer.<br />
• Your driver will out-live your interest in it.<br />
Driver code should be written so a like-skilled<br />
kernel maintainer can fix a problem in a reasonable<br />
amount of time without you or your resources.<br />
Here are a few items to keep in mind<br />
to improve maintainability.<br />
• Use kernel coding style over corporate<br />
coding style<br />
• Document how the driver/device works, at<br />
a high level, in a “<strong>The</strong>ory of Operation”<br />
comment section
• Document hardware workarounds<br />
Guideline #2: Don’t Add Features<br />
for Feature’s Sake<br />
Consider the code complexity to support the<br />
feature versus the user’s benefit. Is the device<br />
still usable without the feature? Is the device<br />
performing reasonably for the 80% use-case<br />
without the feature? Is the hardware offload<br />
feature working against ever increasing<br />
CPU/memory/IO speeds? Is there a software<br />
equivalent to the feature already provided in<br />
the OS?<br />
If the answer is yes to any of these questions, it<br />
is better to not implement the feature, keeping<br />
the complexity in the driver low and maintainability<br />
high.<br />
Table 1 shows features removed from the driver<br />
during the re-write of e100 because the OS already<br />
provides software equivalents.<br />
old driver v2              | new driver v3<br />
VLAN tagging/stripping     | use SW VLAN support in kernel<br />
Tx/Rx checksum offloading  | use SW checksum support in kernel<br />
interrupt moderation       | use NAPI support in kernel<br />
Table 1: Feature migration in e100<br />
Guideline #3: Limit User-Controls—<br />
Use What’s Built into the OS<br />
Most users will use the default settings, so before<br />
adding a user-control, consider:<br />
1. If the driver model for your device class<br />
already provides a mechanism for the<br />
user-control, enable that support in the<br />
driver rather than adding a custom user-control.<br />
2. If the driver model doesn’t provide a user-control,<br />
but the user-control is potentially<br />
useful to other drivers, extend the driver<br />
model to include user-control.<br />
3. If the user-control is to enable/disable a<br />
workaround, enable the workaround without<br />
the use of a user-control. (Solve<br />
the problem without requiring a decision<br />
from the user).<br />
4. If the user-control is to tune performance,<br />
tune the driver for the 80% use-case and<br />
remove the user-control.<br />
Table 2 shows user-controls (implemented as<br />
module parameters) removed from the driver<br />
during the re-write of e100 because the OS<br />
already provides built-in user-controls, or the<br />
user-control was no longer needed.<br />
old driver v2      | new driver v3<br />
BundleMax          | not needed – NAPI<br />
BundleSmallFr      | not needed – NAPI<br />
IntDelay           | not needed – NAPI<br />
ucode              | not needed – NAPI<br />
RxDescriptors      | ethtool -G<br />
TxDescriptors      | ethtool -G<br />
XsumRX             | not needed – checksum in OS<br />
IFS                | always enabled<br />
e100_speed_duplex  | ethtool -s<br />
Table 2: User-control migration in e100<br />
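To illustrate, the ring sizes and speed/duplex settings that the removed module parameters once controlled are reachable through standard ethtool invocations such as the following (values are illustrative):<br />
# ethtool -G eth0 rx 256 tx 256<br />
# ethtool -s eth0 speed 100 duplex full autoneg off<br />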
Guideline #4: Don’t Write Code<br />
that’s Already in the <strong>Kernel</strong><br />
Look for library code that’s already used by<br />
other drivers and adapt that to your driver.<br />
Common hardware is often used between vendors’<br />
devices, so shared code will work for all<br />
(and be debugged by all).
For example, e100 has a highly MDI-compliant<br />
PHY interface, so use mii.c for<br />
standard PHY access and remove custom code<br />
from the driver.<br />
For another example, e100 v2 used /proc/<br />
net/IntelPROAdapter to report driver<br />
information. This functionality was replaced<br />
with ethtool, sysfs, lspci, etc.<br />
Look for opportunities to move code out of the<br />
driver into generic code.<br />
Guideline #5: Don’t Use OS-abstraction<br />
Layers<br />
A common corporate design goal is to reuse<br />
driver code as much as possible between OSes.<br />
This allows a driver to be brought up on one OS<br />
and “ported” to another OS with little work.<br />
After all, the hardware interface to the device<br />
didn’t change from one OS to the next, so<br />
all that is required is an OS-abstraction layer<br />
that wraps the OS’s native driver model with a<br />
generic driver model. <strong>The</strong> driver is then written<br />
to the generic driver model and it’s just a matter<br />
of porting the OS-abstraction layer to each<br />
target OS.<br />
<strong>The</strong>re are problems when doing this with<br />
<strong>Linux</strong>:<br />
1. <strong>The</strong> OS-abstraction wrapper code means<br />
nothing to an outside <strong>Linux</strong> maintainer<br />
and just obfuscates the real meaning behind<br />
the code. This makes your code<br />
harder to follow and therefore harder to<br />
maintain.<br />
2. <strong>The</strong> generic driver model may not map 1:1<br />
with the native driver model leaving gaps<br />
in compatibility that you’ll need to fix up<br />
with OS-specific code.<br />
3. Limits your ability to back-port contributions<br />
given under GPL to non-GPL OSes.<br />
Guideline #6: Use kcompat Techniques<br />
to Move Legacy <strong>Kernel</strong> Support<br />
out of the Driver (and <strong>Kernel</strong>)<br />
Users may not be able to move to the latest<br />
kernel.org kernel, so there is a need<br />
to provide updated device drivers that can be<br />
installed against legacy kernels. <strong>The</strong> need is<br />
driven by 1) bug fixes, 2) new hardware support<br />
that wasn’t included in the driver when the<br />
driver was included in the legacy kernel.<br />
<strong>The</strong> best strategy is to:<br />
1. Maintain your driver code to work against<br />
the latest kernel.org development<br />
kernel API. This will make it easier to<br />
keep the driver in the kernel.org kernel<br />
synchronized with your code base as<br />
changes (patches) are almost always in<br />
reference to the latest kernel.org kernel.<br />
2. Provide a kernel-compat-layer (kcompat)<br />
to translate the latest API to the supported<br />
legacy kernel API. <strong>The</strong> driver code is void<br />
of any ifdef code for legacy kernel support.<br />
All of the ifdef logic moves to the<br />
kcompat layer. <strong>The</strong> kcompat layer is not<br />
included in the latest kernel.org kernel<br />
(by definition).<br />
Here is an example with e100.<br />
In driver code, use the latest API:<br />
s = pci_name(pdev);<br />
...<br />
free_netdev(netdev);
In kcompat code, translate to legacy kernel<br />
API:<br />
#if ( LINUX_VERSION_CODE < \<br />
KERNEL_VERSION(2,4,22) )<br />
#define pci_name(x) ((x)->slot_name)<br />
#endif<br />
#ifndef HAVE_FREE_NETDEV<br />
#define free_netdev(x) kfree(x)<br />
#endif<br />
Guideline #7: Plan to Re-write the<br />
Driver at Least Once<br />
You will not get it right the first time. Plan on<br />
rewriting the driver from scratch at least once.<br />
This will cleanse the code, removing dead code<br />
and organizing/consolidating functionality.<br />
For example, the last e100 re-write reduced the<br />
driver size by 75% without loss of functionality.<br />
Conclusion<br />
Following these guidelines will result in more<br />
maintainable device drivers with better acceptance<br />
into the <strong>Linux</strong> kernel tree. <strong>The</strong> basic<br />
idea is to remove as much as possible from the<br />
driver without loss of functionality.<br />
References<br />
• <strong>The</strong> latest e100 driver code is available at<br />
linux/driver/net/e100.c (2.6.4<br />
kernel or higher).<br />
• An example of kcompat is here:<br />
http://sf.net/projects/<br />
gkernel
NFSv4 and rpcsec_gss for linux<br />
J. Bruce Fields<br />
University of Michigan<br />
bfields@umich.edu<br />
Abstract<br />
<strong>The</strong> 2.6 <strong>Linux</strong> kernels now include support for<br />
version 4 of NFS. In addition to built-in locking<br />
and ACL support, and features designed to<br />
improve performance over the Internet, NFSv4<br />
also mandates the implementation of strong<br />
cryptographic security. This security is provided<br />
by rpcsec_gss, a standard, widely implemented<br />
protocol that operates at the rpc level,<br />
and hence can also provide security for NFS<br />
versions 2 and 3.<br />
1 <strong>The</strong> rpcsec_gss protocol<br />
<strong>The</strong> rpc protocol, which all versions of NFS<br />
and related protocols are built upon, includes<br />
generic support for authentication mechanisms:<br />
each rpc call has two fields, the credential<br />
and the verifier, each consisting of a<br />
32-bit integer, designating a “security flavor,”<br />
followed by up to 400 bytes of opaque data whose
structure depends on the specified flavor. Similarly,<br />
each reply includes a single “verifier.”<br />
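As a rough sketch of these fields (the struct and field names below are ours, for illustration only; see rfc 1831 for the actual XDR definitions), every call carries something like:

#include <stdint.h>

/* Generic RPC authentication data: a 32-bit flavor followed by an
 * opaque body of at most 400 bytes whose layout depends on the flavor.
 * Names are illustrative, not those of any particular implementation. */
struct rpc_opaque_auth {
    uint32_t      flavor;     /* e.g. AUTH_UNIX or RPCSEC_GSS */
    uint32_t      length;     /* number of valid bytes in body[] */
    unsigned char body[400];  /* flavor-specific data */
};

struct rpc_call_auth {
    struct rpc_opaque_auth credential;  /* carried by each call */
    struct rpc_opaque_auth verifier;    /* carried by each call */
    /* each reply carries a single verifier of the same form */
};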
Until recently, the only widely implemented<br />
security flavor has been the auth_unix flavor,<br />
which uses the credential to pass uid’s and<br />
gid’s and simply asks the server to trust them.<br />
This may be satisfactory given physical security<br />
over the clients and the network, but for<br />
many situations (including use over the Internet),<br />
it is inadequate.<br />
Thus rfc 2203 defines the rpcsec_gss protocol,<br />
which uses rpc’s opaque security fields to carry<br />
cryptographically secure tokens. <strong>The</strong> cryptographic<br />
services are provided by the GSS-API<br />
(“Generic Security Service Application Program<br />
Interface,” defined by rfc 2743), allowing<br />
the use of a wide variety of security mechanisms,<br />
including, for example, Kerberos.<br />
Three levels of security are provided by rpcsec_gss:<br />
1. Authentication only: <strong>The</strong> rpc header of<br />
each request and response is signed.<br />
2. Integrity: <strong>The</strong> header and body of each request<br />
and response are signed.
3. Privacy: <strong>The</strong> header of each request is<br />
signed, and the body is encrypted.<br />
<strong>The</strong> combination of a security level with a<br />
GSS-API mechanism can be designated by a<br />
32-bit “pseudoflavor.” <strong>The</strong> mount protocol<br />
used with NFS versions 2 and 3 uses a list<br />
of pseudoflavors to communicate the security<br />
capabilities of a server. NFSv4 does not use<br />
pseudoflavors on the wire, but they are still useful<br />
in internal interfaces.<br />
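As an example, the pseudoflavors used for the Kerberos mechanism can be pictured as follows; the numeric values shown are the ones conventionally registered for rpcsec_gss with Kerberos, and they are included here only to illustrate the idea:

/* A pseudoflavor names one (GSS-API mechanism, security level) pair. */
enum rpc_pseudoflavor_sketch {
    RPC_AUTH_UNIX      = 1,        /* classic uid/gid flavor     */
    RPC_AUTH_GSS_KRB5  = 390003,   /* krb5:  authentication only */
    RPC_AUTH_GSS_KRB5I = 390004,   /* krb5i: integrity           */
    RPC_AUTH_GSS_KRB5P = 390005,   /* krb5p: privacy             */
};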
Security protocols generally require some initial<br />
negotiation, to determine the capabilities<br />
of the systems involved and to choose session<br />
keys. <strong>The</strong> rpcsec_gss protocol uses calls with<br />
procedure number 0 for this purpose. Normally<br />
such a call is a simple “ping” with no<br />
side-effects, useful for measuring round-trip
latency or testing whether a certain service is<br />
running. However a call with procedure number<br />
0, if made with authentication flavor rpcsec_gss,<br />
may use certain fields in the credential<br />
to indicate that it is part of a context-initiation<br />
exchange.<br />
2 <strong>Linux</strong> implementation of rpcsec_gss<br />
<strong>The</strong> <strong>Linux</strong> implementation of rpcsec_gss consists<br />
of several pieces:<br />
1. Mechanism-specific code, currently for<br />
two mechanisms: krb5 and spkm3.<br />
2. A stripped-down in-kernel version of the<br />
GSS-API interface, with an interface that<br />
allows mechanism-specific code to register<br />
support for various pseudoflavors.<br />
3. Client and server code which uses the<br />
GSS-API interface to encode and decode<br />
rpc calls and replies.<br />
4. A userland daemon, gssd, which performs<br />
context initiation.<br />
2.1 Mechanism-specific code<br />
<strong>The</strong> NFSv4 RFC mandates the implementation<br />
(though not the use) of three GSS-API mechanisms:<br />
krb5, spkm3, and lipkey.<br />
Our krb5 implementation supports three<br />
pseudoflavors: krb5, krb5i, and krb5p, providing<br />
authentication only, integrity, and<br />
privacy, respectively. <strong>The</strong> code is derived from<br />
MIT’s Kerberos implementation, somewhat<br />
simplified, and not currently supporting the<br />
variety of encryption algorithms that MIT’s<br />
does. <strong>The</strong> krb5 mechanism is also supported<br />
by NFS implementations from Sun, Network<br />
Appliance, and others, which it interoperates<br />
with.<br />
<strong>The</strong> Low Infrastructure Public Key Mechanism<br />
(“lipkey,” specified by rfc 2847), is a public key<br />
mechanism built on top of the Simple Public<br />
Key Mechanism (spkm), which provides functionality<br />
similar to that of TLS, allowing a secure<br />
channel to be established using a server-side
certificate and a client-side password.<br />
We have a preliminary implementation of<br />
spkm3 (without privacy), but none yet of lipkey.<br />
Other NFS implementors have not yet<br />
implemented either of these mechanisms, but<br />
there appears to be sufficient interest from the<br />
grid community for us to continue implementation<br />
even if it is <strong>Linux</strong>-only for now.<br />
2.2 GSS-API<br />
<strong>The</strong> GSS-API interface as specified is very<br />
complex. Fortunately, rpcsec_gss only requires<br />
a subset of the GSS-API, and even less is required<br />
for per-packet processing.<br />
Our implementation is derived from the implementation in MIT Kerberos, and initially stayed fairly close to the GSS-API specification;
but over time we have pared it down to<br />
something quite a bit simpler.<br />
<strong>The</strong> kernel gss interface also provides APIs<br />
by which code implementing particular mechanisms<br />
can register itself with the gss-api code
and hence can be safely provided by modules<br />
loaded at runtime.<br />
2.3 RPC code<br />
<strong>The</strong> RPC code has been enhanced by the addition<br />
of a new rpcsec_gss mechanism which authenticates<br />
calls and replies and which wraps<br />
and unwraps rpc bodies in the case of integrity<br />
and privacy.
This is relatively straightforward, though<br />
somewhat complicated by the need to handle<br />
discontiguous buffers containing page data.<br />
Caches for session state are also required on<br />
both client and server; on the client a preexisting<br />
rpc credentials cache is used, and on the<br />
server we use the same caching infrastructure<br />
used for caching of client and export information.<br />
2.4 Userland daemon<br />
We had no desire to put a complete implementation<br />
of Kerberos version 5 or the other mechanisms<br />
into the kernel. Fortunately, the work<br />
performed by the various GSS-API mechanisms<br />
can be divided neatly into context initiation<br />
and per-packet processing. <strong>The</strong> former<br />
is complex and is performed only once per session,<br />
while the latter is simple by comparison<br />
and needs to be performed on every packet.<br />
<strong>The</strong>refore it makes sense to put the packet processing<br />
in the kernel, and have the context initiation<br />
performed in userspace.<br />
Since it is the kernel that knows when context<br />
initiation is necessary, we require a mechanism<br />
allowing the kernel to pass the necessary parameters<br />
to a userspace daemon whenever context<br />
initiation is needed, and allowing the daemon<br />
to respond with the completed security<br />
context.<br />
This problem was solved in different ways<br />
on the client and server, but both use special<br />
files (the former in a dedicated filesystem,<br />
rpc_pipefs, and the latter in the proc filesystem),<br />
which our userspace daemon, gssd, can<br />
poll for requests and then write responses back<br />
to.<br />
In the case of Kerberos, the sequence of events<br />
will be something like this:<br />
1. <strong>The</strong> user gets Kerberos credentials using<br />
kinit, which are cached on a local filesystem.<br />
2. <strong>The</strong> user attempts to perform an operation<br />
on an NFS filesystem mounted with krb5<br />
security.<br />
3. The kernel rpc client looks for a security
context for the user in its cache; not<br />
finding any, it does an upcall to gssd to request<br />
one.<br />
4. Gssd, on receiving the upcall, reads the<br />
user’s Kerberos credentials from the local<br />
filesystem and uses them to construct<br />
a null rpc request which it sends to the<br />
server.<br />
5. <strong>The</strong> server kernel makes an upcall which<br />
passes the null request to its gssd.<br />
6. At this point, the server gssd has all it<br />
needs to construct a security context for<br />
this session, consisting mainly of a session<br />
key. It passes this context down to<br />
the kernel rpc server, which stores it in its<br />
context cache.<br />
7. <strong>The</strong> server’s gssd then constructs the null<br />
rpc reply, which it gives to the kernel to<br />
return to the client gssd.<br />
8. <strong>The</strong> client gssd uses this reply to construct<br />
its own security context, and passes this<br />
context to the kernel rpc client.<br />
9. <strong>The</strong> kernel rpc client then uses this context<br />
to send the first real rpc request to the<br />
server.<br />
10. <strong>The</strong> server uses the new context in its<br />
cache to verify the rpc request, and to<br />
compose its reply.
3 <strong>The</strong> NFSv4 protocol<br />
While rpcsec_gss works equally well on all existing<br />
versions of NFS, much of the work on<br />
rpcsec_gss has been motivated by NFS version<br />
4, which is the first version of NFS to make<br />
rpcsec_gss mandatory to implement.<br />
This new version of NFS is specified by rfc<br />
3530, which says:<br />
“Unlike earlier versions, the NFS version 4<br />
protocol supports traditional file access while<br />
integrating support for file locking and the<br />
mount protocol. In addition, support for strong<br />
security (and its negotiation), compound operations,<br />
client caching, and internationalization<br />
have been added. Of course, attention has been<br />
applied to making NFS version 4 operate well<br />
in an Internet environment.”<br />
Descriptions of some of these features follow,<br />
with some notes about their implementation in<br />
<strong>Linux</strong>.<br />
3.1 Compound operations<br />
Each rpc request includes a procedure number,<br />
which describes the operation to be performed.<br />
The format of the body of the rpc request (the arguments) and of the reply depends on the procedure number. Procedure 0 is reserved as a no-op
(except when it is used for rpcsec_gss context<br />
initiation, as described above).<br />
<strong>The</strong> NFSv4 protocol only supports one nonzero<br />
procedure, procedure 1, the compound<br />
procedure.<br />
<strong>The</strong> body of a compound is a list of operations,<br />
each with its own arguments. For example,<br />
a compound request performing a lookup<br />
might consist of four operations: a PUTFH, with
a filehandle, which sets the “current filehandle”<br />
to the provided filehandle; a LOOKUP, with a<br />
name, which looks up the name in the directory<br />
given by the current filehandle and then modifies<br />
the current filehandle to be the filehandle of<br />
the result; a GETFH, with no arguments, which<br />
returns the new value of the current filehandle;<br />
and a GETATTR, with a bitmask specifying a<br />
set of attributes to return for the looked-up file.<br />
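Written out, that compound can be pictured as the following list of operations (the operation numbers are those assigned by rfc 3530; the C types and names are ours and purely illustrative):

#include <stddef.h>

enum nfs4_op_sketch { OP_GETATTR = 9, OP_GETFH = 10, OP_LOOKUP = 15, OP_PUTFH = 22 };

struct nfs4_compound_op_sketch {
    enum nfs4_op_sketch op;
    const char         *arg;   /* simplified; real arguments are XDR-encoded */
};

/* The lookup compound described above, in order of execution. */
static const struct nfs4_compound_op_sketch lookup_compound[] = {
    { OP_PUTFH,   "<directory filehandle>" },  /* set the current filehandle    */
    { OP_LOOKUP,  "name"                   },  /* current fh = looked-up object */
    { OP_GETFH,   NULL                     },  /* return the new current fh     */
    { OP_GETATTR, "<attribute bitmask>"    },  /* attributes of the result      */
};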
<strong>The</strong> server processes these operations in order,<br />
but with no guarantee of atomicity. On encountering<br />
any error, it stops and returns the results<br />
of the operations up to and including the operation<br />
that failed.<br />
In theory complex operations could therefore<br />
be done by long compounds which perform<br />
complex series of operations.<br />
In practice, the compounds sent by the <strong>Linux</strong><br />
client correspond very closely to NFSv2/v3<br />
procedures—the VFS and the POSIX filesystem<br />
API make it difficult to do otherwise—and<br />
our server, like most NFSv4 servers we know<br />
of, rejects overly long or complex compounds.<br />
3.2 Well-known port for NFS<br />
RPC allows services to be run on different<br />
ports, using the “portmap” service to map program<br />
numbers to ports. While flexible, this<br />
system complicates firewall management; so<br />
NFSv4 recommends the use of port 2049.<br />
In addition, the use of sideband protocols for<br />
mounting, locking, etc. also complicates firewall<br />
management, as multiple connections to<br />
multiple ports are required for a single NFS<br />
mount. NFSv4 eliminates these extra protocols,<br />
allowing all traffic to pass over a single<br />
connection using one protocol.<br />
3.3 No more mount protocol<br />
Earlier versions of NFS use a separate protocol<br />
for mount. <strong>The</strong> mount protocol exists primarily<br />
to map path names, presented to the server as
strings, to filehandles, which may then be used<br />
in the NFS protocol.<br />
NFSv4 instead uses a single operation, PUT-<br />
ROOTFH, that returns a filehandle; clients can<br />
then use ordinary lookups to traverse to the<br />
filesystem they wish to mount. This changes<br />
the behavior of NFS in a few subtle ways: for<br />
example, the special status of mounts in the old<br />
protocol meant that mounting /usr and then<br />
looking up local might get you a different<br />
object than would mounting /usr/local;<br />
under NFSv4 this can no longer happen.<br />
A server that exports multiple filesystems must<br />
knit them together using a single “pseudofilesystem”<br />
which links them to a common<br />
root.<br />
On <strong>Linux</strong>’s nfsd the pseudofilesystem is a<br />
real filesystem, marked by the export option<br />
“fsid=0”. An administrator who is content to
export a single filesystem can export it with<br />
“fsid=0”, and clients will find it just by mounting<br />
the path “/”.<br />
The expected use for “fsid=0”, however, is to designate a filesystem that is just a collection of empty directories serving as mountpoints for exported filesystems, which are mounted using mount --bind; thus an administrator could export /home and /bin by:
mkdir -p /exports/home<br />
mkdir -p /exports/bin/<br />
mount --bind /home /exports/home<br />
mount --bind /bin/ /exports/bin<br />
and then using an exports file something like:<br />
/exports *.foo.com(fsid=0,crossmnt)<br />
/exports/home *.foo.com<br />
/exports/bin *.foo.com<br />
Clients in foo.com can then mount<br />
server.foo.com:/bin or server.<br />
foo.com:/home. However the relationship<br />
between the original mountpoint on the server<br />
and the mountpoint under /exports (which<br />
determines the path seen by the client) is<br />
arbitrary, so the administrator could just as<br />
well export /home as /some/other/path<br />
if desired.<br />
This gives maximum flexibility at the expense<br />
of some confusion for administrators used to
earlier NFS versions.<br />
3.4 No more lock protocol<br />
Locking has also been absorbed into the<br />
NFSv4 protocol. In addition to advantages<br />
enumerated above, this allows servers to support<br />
mandatory locking if desired. Previously<br />
this was impossible because it was impossible<br />
to tell whether a given read or write<br />
should be ordered before or after a lock request.<br />
NFSv4 enforces such sequencing by<br />
providing a stateid field on each read or write<br />
which identifies the locking state that the operation<br />
was performed under; thus for example a<br />
write that occurred while a lock was held, but<br />
that appeared on the server to have occurred after<br />
an unlock, can be identified as belonging to<br />
a previous locking context, and can therefore<br />
be correctly rejected.<br />
<strong>The</strong> additional state required to manage locking<br />
is the source of much of the additional complexity<br />
in NFSv4.<br />
3.5 String representations of user and group<br />
names<br />
Previous versions of NFS use integers to represent<br />
users and groups; while simple to handle,<br />
they can make NFS installations difficult to<br />
manage, particularly across administrative domains.
Version 4, therefore, uses string names<br />
of the form user@domain.<br />
This poses some challenges for the kernel im-
plementation. In particular, while the protocol<br />
may use string names, the kernel still needs to<br />
deal with uid’s, so it must map between NFSv4<br />
string names and integers.<br />
As with rpcsec_gss context initiation, we solve
this problem by making upcalls to a userspace<br />
daemon; with the mapping in userspace, it is<br />
easy to use mechanisms such as NIS or LDAP<br />
to do the actual mapping without introducing<br />
large amounts of code into the kernel. So as not<br />
to degrade performance by requiring a context<br />
switch every time we process a packet carrying<br />
a name, we cache the results of this mapping in<br />
the kernel.<br />
3.6 Delegations<br />
NFSv4, like previous versions of NFS, does<br />
not attempt to provide full cache consistency.<br />
Instead, all that is guaranteed is that if an open<br />
follows a close of the same file, then data read<br />
after the open will reflect any modifications<br />
performed before the close. This makes both<br />
open and close potentially high latency operations,<br />
since they must wait for at least one<br />
round trip before returning: in the close case,
to flush out any pending writes, and in the<br />
open case, to check the attributes of the file in<br />
question to determine whether the local cache<br />
should be invalidated.<br />
Locks provide similar semantics—writes are<br />
flushed on unlock, and cache consistency is<br />
verified on lock—and hence lock operations<br />
are also prone to high latencies.<br />
To mitigate these concerns, and to encourage<br />
the use of NFS’s locking features, delegations<br />
have been added to NFSv4. Delegations are<br />
granted or denied by the server in response to<br />
open calls, and give the client the right to perform<br />
later locks and opens locally, without the<br />
need to contact the server. A set of callbacks<br />
is provided so that the server can notify the<br />
client when another client requests an open that<br />
would conflict with the open originally obtained
by the client.<br />
Thus locks and opens may be performed<br />
quickly by the client in the common case when<br />
files are not being shared, but callbacks ensure<br />
that correct close-to-open (and unlock-to-lock)<br />
semantics may be enforced when there is contention.<br />
To allow other clients to proceed when a client<br />
holding a delegation reboots, clients are required<br />
to periodically send a “renew” operation<br />
to the server, indicating that they are still alive;
a client that fails to send a renew operation<br />
within a given lease time (established when the<br />
client first contacts the server) may have all of<br />
its delegations and other locking state revoked.<br />
Most implementations of NFSv4 delegations,<br />
including <strong>Linux</strong>’s, are still young, and we<br />
haven’t yet gathered good data on the performance<br />
impact.<br />
Nevertheless, further extensions, including<br />
delegations over directories, are under consideration<br />
for future versions of the protocol.<br />
3.7 ACLs<br />
ACL support is integrated into the protocol,<br />
with ACLs that are more similar to those found<br />
in NT than to the POSIX ACLs supported by<br />
<strong>Linux</strong>.<br />
Thus while it is possible to translate an arbitrary<br />
<strong>Linux</strong> ACL to an NFS4 ACL with nearly<br />
identical meaning, most NFS ACLs have no<br />
reasonable representation as <strong>Linux</strong> ACLs.<br />
Marius Eriksen has written a draft describing<br />
the POSIX to NFS4 ACL translation. Currently<br />
the <strong>Linux</strong> implementation uses this mapping,<br />
and rejects any NFS4 ACL that isn’t exactly<br />
in the image of this mapping. This en-
sures userland support from all tools that currently<br />
support POSIX ACLs, and simplifies<br />
ACL management when an exported filesystem<br />
is also used by local users, since both nfsd<br />
and the local users can use the backend filesystem’s<br />
POSIX ACL implementation. However<br />
it makes it difficult to interoperate with NFSv4<br />
implementations that support the full ACL protocol.<br />
For that reason we will eventually also<br />
want to add support for NFSv4 ACLs.<br />
4 Acknowledgements and Further<br />
Information<br />
This work has been sponsored by Sun Microsystems,<br />
Network Appliance, and the<br />
Accelerated Strategic Computing Initiative<br />
(ASCI). For further information, see www.<br />
citi.umich.edu/projects/nfsv4/.
Comparing and Evaluating epoll, select, and poll<br />
Event Mechanisms<br />
Louay Gammo, Tim Brecht, Amol Shukla, and David Pariag<br />
University of Waterloo<br />
{lgammo,brecht,ashukla,db2pariag}@cs.uwaterloo.ca<br />
Abstract<br />
This paper uses a high-performance, event-driven HTTP server (the µserver) to compare
the performance of the select, poll, and epoll<br />
event mechanisms. We subject the µserver to<br />
a variety of workloads that allow us to expose<br />
the relative strengths and weaknesses of each<br />
event mechanism.<br />
Interestingly, initial results show that the select<br />
and poll event mechanisms perform comparably<br />
to the epoll event mechanism in the<br />
absence of idle connections. Profiling data<br />
shows a significant amount of time spent in executing<br />
a large number of epoll_ctl system<br />
calls. As a result, we examine a variety<br />
of techniques for reducing epoll_ctl overhead, including edge-triggered notification and the introduction of a new system call (epoll_ctlv)
that aggregates several epoll_ctl calls into<br />
a single call. Our experiments indicate that although<br />
these techniques are successful at reducing<br />
epoll_ctl overhead, they only improve<br />
performance slightly.<br />
1 Introduction<br />
<strong>The</strong> Internet is expanding in size, number of<br />
users, and in volume of content; thus it is imperative
to be able to support these changes<br />
with faster and more efficient HTTP servers.<br />
A common problem in HTTP server scalability<br />
is how to ensure that the server handles<br />
a large number of connections simultaneously<br />
without degrading the performance. An<br />
event-driven approach is often implemented in<br />
high-performance network servers [14] to multiplex<br />
a large number of concurrent connections<br />
over a few server processes. In event-driven
servers it is important that the server<br />
focuses on connections that can be serviced<br />
without blocking its main process. An event<br />
dispatch mechanism such as select is used<br />
to determine the connections on which forward<br />
progress can be made without invoking<br />
a blocking system call. Many different<br />
event dispatch mechanisms have been used<br />
and studied in the context of network applications.<br />
<strong>The</strong>se mechanisms range from select,<br />
poll, /dev/poll, RT signals, and epoll<br />
[2, 3, 15, 6, 18, 10, 12, 4].<br />
<strong>The</strong> epoll event mechanism [18, 10, 12] is designed<br />
to scale to larger numbers of connections<br />
than select and poll. <strong>One</strong> of the<br />
problems with select and poll is that in<br />
a single call they must both inform the kernel<br />
of all of the events of interest and obtain new<br />
events. This can result in large overheads, particularly<br />
in environments with large numbers<br />
of connections and relatively few new events<br />
occurring. In a fashion similar to that described<br />
by Banga et al. [3] epoll separates mechanisms<br />
for obtaining events (epoll_wait)<br />
from those used to declare and control interest
in events (epoll_ctl).<br />
Further reductions in the number of generated<br />
events can be obtained by using edge-triggered<br />
epoll semantics. In this mode events are only<br />
provided when there is a change in the state of<br />
the socket descriptor of interest. For compatibility<br />
with the semantics offered by select<br />
and poll, epoll also provides level-triggered<br />
event mechanisms.<br />
To compare the performance of epoll with<br />
select and poll, we use the µserver [4, 7]<br />
web server. <strong>The</strong> µserver facilitates comparative<br />
analysis of different event dispatch mechanisms<br />
within the same code base through<br />
command-line parameters. Recently, a highly<br />
tuned version of the single process event driven<br />
µserver using select has shown promising<br />
results that rival the performance of the in-kernel
TUX web server [4].<br />
Interestingly, in this paper, we found that for<br />
some of the workloads considered select<br />
and poll perform as well as or slightly better<br />
than epoll. <strong>One</strong> such result is shown in<br />
Figure 1. This motivated further investigation<br />
with the goal of obtaining a better understanding<br />
of epoll’s behaviour. In this paper, we describe<br />
our experience in trying to determine<br />
how to best use epoll, and examine techniques<br />
designed to improve its performance.<br />
<strong>The</strong> rest of the paper is organized as follows:<br />
In Section 2 we summarize some existing work<br />
that led to the development of epoll as a scalable<br />
replacement for select. In Section 3 we<br />
describe the techniques we have tried to improve<br />
epoll’s performance. In Section 4 we describe<br />
our experimental methodology, including<br />
the workloads used in the evaluation. In<br />
Section 5 we describe and analyze the results<br />
of our experiments. In Section 6 we summarize<br />
our findings and outline some ideas for future<br />
work.<br />
2 Background and Related Work<br />
Event-notification mechanisms have a long<br />
history in operating systems research and development,<br />
and have been a central issue in<br />
many performance studies. <strong>The</strong>se studies have<br />
sought to improve mechanisms and interfaces<br />
for obtaining information about the state of<br />
socket and file descriptors from the operating<br />
system [2, 1, 3, 13, 15, 6, 18, 10, 12]. Some<br />
of these studies have developed improvements<br />
to select, poll and sigwaitinfo by reducing<br />
the amount of data copied between the<br />
application and kernel. Other studies have reduced<br />
the number of events delivered by the<br />
kernel, for example, the signal-per-fd scheme<br />
proposed by Chandra et al. [6]. Much of the<br />
aforementioned work is tracked and discussed<br />
on the web site, “<strong>The</strong> C10K Problem” [8].<br />
Early work by Banga and Mogul [2] found<br />
that despite performing well under laboratory<br />
conditions, popular event-driven servers performed<br />
poorly under real-world conditions.<br />
<strong>The</strong>y demonstrated that the discrepancy is due<br />
to the inability of the select system call to
scale to the large number of simultaneous connections<br />
that are found in WAN environments.<br />
Subsequent work by Banga et al. [3] sought to<br />
improve on select’s performance by (among<br />
other things) separating the declaration of interest<br />
in events from the retrieval of events on<br />
that interest set. Event mechanisms like select<br />
and poll have traditionally combined these<br />
tasks into a single system call. However, this<br />
amalgamation requires the server to re-declare<br />
its interest set every time it wishes to retrieve<br />
events, since the kernel does not remember the<br />
interest sets from previous calls. This results in<br />
unnecessary data copying between the application<br />
and the kernel.<br />
<strong>The</strong> /dev/poll mechanism was adapted<br />
from Sun Solaris to <strong>Linux</strong> by Provos et al. [15],
and improved on poll’s performance by introducing<br />
a new interface that separated the declaration<br />
of interest in events from retrieval. <strong>The</strong>ir<br />
/dev/poll mechanism further reduced data<br />
copying by using a shared memory region to<br />
return events to the application.<br />
<strong>The</strong> kqueue event mechanism [9] addressed<br />
many of the deficiencies of select and poll<br />
for FreeBSD systems. In addition to separating<br />
the declaration of interest from retrieval,<br />
kqueue allows an application to retrieve<br />
events from a variety of sources including<br />
file/socket descriptors, signals, AIO completions,<br />
file system changes, and changes in<br />
process state.<br />
<strong>The</strong> epoll event mechanism [18, 10, 12] investigated<br />
in this paper also separates the declaration<br />
of interest in events from their retrieval.<br />
<strong>The</strong> epoll_create system call instructs<br />
the kernel to create an event data structure<br />
that can be used to track events on a number<br />
of descriptors. <strong>The</strong>reafter, the epoll_ctl<br />
call is used to modify interest sets, while the<br />
epoll_wait call is used to retrieve events.<br />
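A minimal level-triggered event loop built on this interface looks roughly like the following sketch (ours; error handling and the server's own connection bookkeeping are omitted):

#include <sys/epoll.h>

#define MAX_EVENTS 64

void event_loop_sketch(int listen_fd)
{
    struct epoll_event ev, events[MAX_EVENTS];
    int epfd = epoll_create(256);        /* create the kernel event structure */

    ev.events = EPOLLIN;                 /* level-triggered by default */
    ev.data.fd = listen_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);  /* declare interest once */

    for (;;) {
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);  /* retrieve events */
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            /* accept on listen_fd (then EPOLL_CTL_ADD the new socket),
             * or read/write on an already-registered connection */
            (void)fd;
        }
    }
}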
Another drawback of select and poll is<br />
that they perform work that depends on the<br />
size of the interest set, rather than the number<br />
of events returned. This leads to poor performance<br />
when the interest set is much larger than<br />
the active set. <strong>The</strong> epoll mechanisms avoid this<br />
pitfall and provide performance that is largely<br />
independent of the size of the interest set.<br />
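For contrast, here is a sketch (again ours) of the work a select-based server repeats on every iteration: the interest set is rebuilt and copied into the kernel, and afterwards every tracked descriptor is scanned even if only a handful are ready.

#include <sys/select.h>

void select_iteration_sketch(const int *conn_fds, int nconns)
{
    fd_set rfds;
    int maxfd = -1;

    FD_ZERO(&rfds);
    for (int i = 0; i < nconns; i++) {   /* re-declare interest on every call */
        FD_SET(conn_fds[i], &rfds);
        if (conn_fds[i] > maxfd)
            maxfd = conn_fds[i];
    }

    if (select(maxfd + 1, &rfds, NULL, NULL, NULL) > 0) {
        for (int i = 0; i < nconns; i++) /* scan the whole interest set */
            if (FD_ISSET(conn_fds[i], &rfds)) {
                /* service conn_fds[i] */
            }
    }
}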
3 Improving epoll Performance<br />
Figure 1 in Section 5 shows the throughput<br />
obtained when using the µserver with the select,<br />
poll, and level-triggered epoll (epoll-LT)<br />
mechanisms. In this graph the x-axis shows<br />
increasing request rates and the y-axis shows<br />
the reply rate as measured by the clients that<br />
are inducing the load. This graph shows results<br />
for the one-byte workload. <strong>The</strong>se results<br />
demonstrate that the µserver with level-triggered
epoll does not perform as well as<br />
select under conditions that stress the event<br />
mechanisms. This led us to more closely examine<br />
these results. Using gprof, we observed<br />
that epoll_ctl was responsible for a<br />
large percentage of the run-time. As can be seen in Table 1 in Section 5, over 16% of the
time is spent in epoll_ctl. <strong>The</strong> gprof output<br />
also indicates (not shown in the table) that<br />
epoll_ctl was being called a large number<br />
of times because it is called for every state<br />
change for each socket descriptor. We examine<br />
several approaches designed to reduce the<br />
number of epoll_ctl calls. <strong>The</strong>se are outlined<br />
in the following paragraphs.<br />
The first method uses epoll in an edge-triggered
fashion, which requires the µserver<br />
to keep track of the current state of the socket<br />
descriptor. This is required because with the<br />
edge-triggered semantics, events are only received<br />
for transitions on the socket descriptor<br />
state. For example, once the server reads data<br />
from a socket, it needs to keep track of whether<br />
or not that socket is still readable, or if it will<br />
get another event from epoll_wait indicating<br />
that the socket is readable. Similar state<br />
information is maintained by the server regarding<br />
whether or not the socket can be written.<br />
This method is referred to in our graphs and<br />
the rest of the paper as epoll-ET.
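The bookkeeping this implies is sketched below (our illustration, not the µserver's code): descriptors are registered with EPOLLET, and on each read notification the socket is drained until read returns EAGAIN, since no further event will be delivered while unread data remains.

#include <sys/epoll.h>
#include <sys/types.h>
#include <errno.h>
#include <unistd.h>

void register_edge_triggered(int epfd, int fd)
{
    struct epoll_event ev;
    ev.events = EPOLLIN | EPOLLOUT | EPOLLET;   /* events only on state changes */
    ev.data.fd = fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

void on_readable(int fd)
{
    char buf[4096];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0)
            continue;                  /* process buf[0..n) and keep reading */
        if (n < 0 && errno == EAGAIN)
            break;                     /* drained: wait for the next edge */
        break;                         /* EOF or real error: close elsewhere */
    }
}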
<strong>The</strong> second method, which we refer to as<br />
epoll2, simply calls epoll_ctl twice per<br />
socket descriptor. The first call registers with the kernel that the server is interested in read and
write events on the socket. <strong>The</strong> second call occurs<br />
when the socket is closed. It is used to<br />
tell epoll that we are no longer interested in<br />
events on that socket. All events are handled<br />
in a level-triggered fashion. Although this approach<br />
will reduce the number of epoll_ctl<br />
calls, it does have potential disadvantages.
<strong>One</strong> disadvantage of the epoll2 method is that<br />
because many of the sockets will continue to be<br />
readable or writable epoll_wait will return<br />
sooner, possibly with events that are currently<br />
not of interest to the server. For example, if the<br />
server is waiting for a read event on a socket it<br />
will not be interested in the fact that the socket<br />
is writable until later. Another disadvantage is<br />
that these calls return sooner, with fewer events<br />
being returned per call, resulting in a larger<br />
number of calls. Lastly, because many of the<br />
events will not be of interest to the server, the<br />
server is required to spend a bit of time to determine<br />
if it is or is not interested in each event<br />
and in discarding events that are not of interest.<br />
<strong>The</strong> third method uses a new system call named<br />
epoll_ctlv. This system call is designed to<br />
reduce the overhead of multiple epoll_ctl<br />
system calls by aggregating several calls to<br />
epoll_ctl into one call to epoll_ctlv.<br />
This is achieved by passing an array of epoll<br />
events structures to epoll_ctlv, which then<br />
calls epoll_ctl for each element of the array.<br />
Events are generated in level-triggered<br />
fashion. This method is referred to in the figures<br />
and the remainder of the paper as epoll-ctlv.
We use epoll_ctlv to add socket descriptors<br />
to the interest set, and for modifying<br />
the interest sets for existing socket descriptors.<br />
However, removal of socket descriptors<br />
from the interest set is done by explicitly calling<br />
epoll_ctl just before the descriptor is<br />
closed. We do not aggregate deletion operations<br />
because by the time epoll_ctlv is<br />
invoked, the µserver has closed the descriptor<br />
and the epoll_ctl invoked on that descriptor<br />
will fail.<br />
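The batching idea can be pictured with the user-space stand-in below; this is only an illustration of the aggregation concept, not the interface of the epoll_ctlv system call itself, which performs the equivalent loop inside the kernel so that a whole batch costs one user-kernel crossing.

#include <sys/epoll.h>

struct ctl_op {
    int                op;   /* EPOLL_CTL_ADD or EPOLL_CTL_MOD */
    int                fd;
    struct epoll_event ev;
};

/* Apply a batch of interest-set changes; returns the number of failures. */
int apply_ctl_batch(int epfd, struct ctl_op *ops, int nops)
{
    int errs = 0;
    for (int i = 0; i < nops; i++)
        if (epoll_ctl(epfd, ops[i].op, ops[i].fd, &ops[i].ev) < 0)
            errs++;
    return errs;
}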
<strong>The</strong> µserver does not attempt to batch the closing<br />
of descriptors because it can run out of<br />
available file descriptors. Hence, the epoll-ctlv
method uses both the epoll_ctlv and<br />
the epoll_ctl system calls. Alternatively,<br />
we could rely on the close system call to<br />
remove the socket descriptor from the interest<br />
set (and we did try this). However, this<br />
increases the time spent by the µserver in<br />
close, and does not alter performance. We<br />
verified this empirically and decided to explicitly<br />
call epoll_ctl to perform the deletion<br />
of descriptors from the epoll interest set.<br />
4 Experimental Environment<br />
<strong>The</strong> experimental environment consists of a<br />
single server and eight clients. <strong>The</strong> server contains<br />
dual 2.4 GHz Xeon processors, 1 GB of<br />
RAM, a 10,000 rpm SCSI disk, and two one<br />
Gigabit Ethernet cards. <strong>The</strong> clients are identical<br />
to the server with the exception of their<br />
disks which are EIDE. <strong>The</strong> server and clients<br />
are connected with a 24-port Gigabit switch.<br />
To avoid network bottlenecks, the first four<br />
clients communicate with the server’s first Ethernet<br />
card, while the remaining four use a different<br />
IP address linked to the second Ethernet<br />
card. <strong>The</strong> server machine runs a slightly modified<br />
version of the 2.6.5 <strong>Linux</strong> kernel in uniprocessor<br />
mode.<br />
4.1 Workloads<br />
This section describes the workloads that we<br />
used to evaluate performance of the µserver<br />
with the different event notification mechanisms.<br />
In all experiments, we generate HTTP<br />
loads using httperf [11], an open-loop workload<br />
generator that uses connection timeouts to<br />
generate loads that can exceed the capacity of<br />
the server.<br />
Our first workload is based on the widely used<br />
SPECweb99 benchmarking suite [17]. We use<br />
httperf in conjunction with a SPECweb99 file<br />
set and synthetic HTTP traces. Our traces<br />
have been carefully generated to recreate the
file classes, access patterns, and number of requests<br />
issued per (HTTP 1.1) connection that<br />
are used in the static portion of SPECweb99.<br />
<strong>The</strong> file set and server caches are sized so that<br />
the entire file set fits in the server’s cache. This<br />
ensures that differences in cache hit rates do<br />
not affect performance.<br />
Our second workload is called the one-byte<br />
workload. In this workload, the clients repeatedly<br />
request the same one byte file from the<br />
server’s cache. We believe that this workload<br />
stresses the event dispatch mechanism by minimizing<br />
the amount of work that needs to be<br />
done by the server in completing a particular<br />
request. By reducing the effect of system calls<br />
such as read and write, this workload isolates<br />
the differences due to the event dispatch<br />
mechanisms.<br />
To study the scalability of the event dispatch<br />
mechanisms as the number of socket descriptors<br />
(connections) is increased, we use idleconn,<br />
a program that comes as part of the<br />
httperf suite. This program maintains a steady<br />
number of idle connections to the server (in addition<br />
to the active connections maintained by<br />
httperf). If any of these connections are closed<br />
idleconn immediately re-establishes them. We<br />
first examine the behaviour of the event dispatch<br />
mechanisms without any idle connections<br />
to study scenarios where all of the connections<br />
present in a server are active. We then<br />
pre-load the server with a number of idle connections<br />
and then run experiments. <strong>The</strong> idle<br />
connections are used to increase the number<br />
of simultaneous connections in order to simulate<br />
a WAN environment. In this paper we<br />
present experiments using 10,000 idle connections,<br />
our findings with other numbers of idle<br />
connections were similar and they are not presented<br />
here.<br />
4.2 Server Configuration<br />
For all of our experiments, the µserver is run<br />
with the same set of configuration parameters<br />
except for the event dispatch mechanism. <strong>The</strong><br />
µserver is configured to use sendfile to take<br />
advantage of zero-copy socket I/O while writing<br />
replies. We use TCP_CORK in conjunction<br />
with sendfile. <strong>The</strong> same server options<br />
are used for all experiments even though<br />
the use of TCP_CORK and sendfile may<br />
not provide benefits for the one-byte workload<br />
when compared with simply using writev.<br />
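A rough sketch of that reply path (ours; the µserver's actual code is more involved): the socket is corked, the HTTP headers are written, the file body is sent with sendfile, and the cork is removed so the data goes out in full segments.

#include <sys/types.h>
#include <sys/socket.h>
#include <sys/sendfile.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <unistd.h>

void send_reply_sketch(int sock, const char *hdrs, size_t hdr_len,
                       int file_fd, size_t file_len)
{
    int on = 1, off = 0;

    setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));   /* hold partial frames */
    write(sock, hdrs, hdr_len);                                  /* HTTP headers */
    sendfile(sock, file_fd, NULL, file_len);                     /* zero-copy file body */
    setsockopt(sock, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));  /* flush */
}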
4.3 Experimental Methodology<br />
We measure the throughput of the µserver using<br />
different event dispatch mechanisms. In<br />
our graphs, each data point is the result of a<br />
two minute experiment. Trial and error revealed<br />
that two minutes is sufficient for the<br />
server to achieve a stable state of operation. A<br />
two minute delay is used between consecutive<br />
experiments, which allows the TIME_WAIT<br />
state on all sockets to be cleared before the subsequent<br />
run. All non-essential services are terminated<br />
prior to running any experiment.<br />
5 Experimental Results<br />
In this section we first compare the throughput<br />
achieved when using level-triggered epoll with<br />
that observed when using select and poll<br />
under both the one-byte and SPECweb99-<br />
like workloads with no idle connections. We<br />
then examine the effectiveness of the different<br />
methods described for reducing the number<br />
of epoll_ctl calls under these same<br />
workloads. This is followed by a comparison<br />
of the performance of the event dispatch<br />
mechanisms when the server is pre-loaded with<br />
10,000 idle connections. Finally, we describe<br />
the results of experiments in which we tune the
accept strategy used in conjunction with epoll-<br />
LT and epoll-ctlv to further improve their performance.<br />
We initially ran the one byte and the<br />
SPECweb99-like workloads to compare the<br />
performance of the select, poll, and level-triggered
epoll mechanisms.<br />
As shown in Figure 1 and Figure 2, for both<br />
of these workloads select and poll perform as<br />
well as epoll-LT. It is important to note that because<br />
there are no idle connections for these<br />
experiments the number of socket descriptors<br />
tracked by each mechanism is not very high.<br />
As expected, the gap between epoll-LT and select<br />
is more pronounced for the one byte workload<br />
because it places more stress on the event<br />
dispatch mechanism.<br />
We tried to improve the performance of the<br />
server by exploring different techniques for using<br />
epoll as described in Section 3. <strong>The</strong> effect<br />
of these techniques on the one-byte workload<br />
is shown in Figure 3. <strong>The</strong> graphs in this figure<br />
show that for this workload the techniques used<br />
to reduce the number of epoll_ctl calls do<br />
not provide significant benefits when compared<br />
with their level-triggered counterpart (epoll-<br />
LT). Additionally, the performance of select<br />
and poll is equal to or slightly better than each<br />
of the epoll techniques. Note that we omit the<br />
line for poll from Figures 3 and 4 because it is<br />
nearly identical to the select line.<br />
[Figure 1: µserver performance on one byte workload using select, poll, and epoll-LT (graph of Replies/s vs. Requests/s; curves for select, poll, and epoll-LT)]
[Figure 2: µserver performance on SPECweb99-like workload using select, poll, and epoll-LT (graph of Replies/s vs. Requests/s; curves for select, poll, and epoll-LT)]
[Figure 3: µserver performance on one byte workload with no idle connections (graph of Replies/s vs. Requests/s; curves for select, epoll-LT, epoll-ET, epoll-ctlv, and epoll2)]
We further analyze the results from Figure 3<br />
by profiling the µserver using gprof at the request<br />
rate of 22,000 requests per second. Table<br />
1 shows the percentage of time spent in system<br />
calls (rows) under the various event dispatch<br />
methods (columns). <strong>The</strong> output for system<br />
calls and µserver functions which do not<br />
contribute significantly to the total run-time is<br />
left out of the table for clarity.<br />
If we compare the select and poll columns<br />
we see that they have a similar breakdown including<br />
spending about 13% of their time indicating<br />
to the kernel events of interest and<br />
obtaining events. In contrast the epoll-LT,<br />
epoll-ctlv, and epoll2 approaches spend about<br />
21 – 23% of their time on their equivalent<br />
functions (epoll_ctl, epoll_ctlv and<br />
epoll_wait). Despite these extra overheads<br />
the throughputs obtained using the epoll techniques<br />
compare favourably with those obtained
System call (% of time)   select  epoll-LT  epoll-ctlv  epoll2  epoll-ET   poll
read                       21.51     20.95       21.41   20.08     22.19  20.97
close                      14.90     14.05       14.90   13.02     14.14  14.79
select                     13.33         -           -       -         -      -
poll                           -         -           -       -         -  13.32
epoll_ctl                      -     16.34        5.98   10.27     11.06      -
epoll_wait                     -      7.15        6.01   12.56      6.52      -
epoll_ctlv                     -         -        9.28       -         -      -
setsockopt                 11.17      9.13        9.13    7.57      9.08  10.68
accept                     10.08      9.51        9.76    9.05      9.30  10.20
write                       5.98      5.06        5.10    4.13      5.31   5.70
fcntl                       3.66      3.34        3.37    3.14      3.34   3.61
sendfile                    3.43      2.70        2.71    3.00      3.91   3.43
Table 1: gprof profile data for the µserver under the one-byte workload at 22,000 requests/sec<br />
using select and poll. We note that when<br />
using select and poll the application requires<br />
extra manipulation, copying, and event<br />
scanning code that is not required in the epoll<br />
case (and does not appear in the gprof data).<br />
<strong>The</strong> results in Table 1 also show that the<br />
overhead due to epoll_ctl calls is reduced<br />
in epoll-ctlv, epoll2 and epoll-ET, when<br />
compared with epoll-LT. However, in each<br />
case these improvements are offset by increased<br />
costs in other portions of the code.<br />
<strong>The</strong> epoll2 technique spends twice as much<br />
time in epoll_wait when compared with<br />
epoll-LT. With epoll2 the number of calls<br />
to epoll_wait is significantly higher, the<br />
average number of descriptors returned is<br />
lower, and only a very small proportion of<br />
the calls (less than 1%) return events that<br />
need to be acted upon by the server. On the<br />
other hand, when compared with epoll-LT the<br />
epoll2 technique spends about 6% less time<br />
on epoll_ctl calls so the total amount of<br />
time spent dealing with events is comparable<br />
with that of epoll-LT. Despite the significant<br />
epoll_wait overheads epoll2 performance<br />
compares favourably with the other methods<br />
on this workload.<br />
Using the epoll-ctlv technique, gprof indicates<br />
that epoll_ctlv and epoll_ctl combine<br />
for a total of 1,949,404 calls compared<br />
with 3,947,769 epoll_ctl calls when using<br />
epoll-LT. While epoll-ctlv helps to reduce<br />
the number of user-kernel boundary crossings,<br />
the net result is no better than epoll-<br />
LT. <strong>The</strong> amount of time taken by epoll-ctlv<br />
in epoll_ctlv and epoll_ctl system<br />
calls is about the same (around 16%) as<br />
that spent by level-triggered epoll in invoking<br />
epoll_ctl.<br />
When comparing the percentage of time epoll-<br />
LT and epoll-ET spend in epoll_ctl we see<br />
that it has been reduced using epoll-ET from<br />
16% to 11%. Although the epoll_ctl time<br />
has been reduced it does not result in an appreciable<br />
improvement in throughput. We also<br />
note that about 2% of the run-time (which is<br />
not shown in the table) is also spent in the<br />
epoll-ET case checking, and tracking the state<br />
of the request (i.e., whether the server should<br />
be reading or writing) and the state of the<br />
socket (i.e., whether it is readable or writable).<br />
We expect that this can be reduced but that it<br />
wouldn’t noticeably impact performance.<br />
Results for the SPECweb99-like workload are
shown in Figure 4. Here the graph shows that<br />
all techniques produce very similar results with<br />
a very slight performance advantage going to<br />
epoll-ET after the saturation point is reached.<br />
<strong>The</strong> results for the SPECweb99-like workload<br />
with 10,000 idle connections are shown in Figure<br />
6. In this case each of the event mechanisms<br />
is impacted in a manner similar to that<br />
in which they are impacted by idle connections<br />
in the one-byte workload case.<br />
[Figure 4: µserver performance on SPECweb99-like workload with no idle connections (graph of Replies/s vs. Requests/s)]
5.1 Results With Idle Connections<br />
We now compare the performance of the event<br />
mechanisms with 10,000 idle connections. <strong>The</strong><br />
idle connections are intended to simulate the<br />
presence of larger numbers of simultaneous<br />
connections (as might occur in a WAN environment).<br />
Thus, the event dispatch mechanism<br />
has to keep track of a large number of descriptors<br />
even though only a very small portion of<br />
them are active.<br />
By comparing results in Figures 3 and 5, one can see that the performance of select and poll degrades by up to 79% when the 10,000 idle
connections are added. <strong>The</strong> performance of<br />
epoll2 with idle connections suffers similarly<br />
to select and poll. In this case, epoll2 suffers<br />
from the overheads incurred by making a large<br />
number of epoll_wait calls, the vast majority
of which return events that are not of current<br />
interest to the server. Throughput with<br />
level-triggered epoll is slightly reduced with<br />
the addition of the idle connections, while edge-triggered epoll is not impacted.
[Figure 5: µserver performance on one byte workload and 10,000 idle connections (graph of Replies/s vs. Requests/s)]
[Figure 6: µserver performance on SPECweb99-like workload and 10,000 idle connections (graph of Replies/s vs. Requests/s)]
5.2 Tuning Accept Strategy for epoll<br />
<strong>The</strong> µserver’s accept strategy has been tuned<br />
for use with select. <strong>The</strong> µserver includes a<br />
parameter that controls the number of connections<br />
that are accepted consecutively. We call<br />
this parameter the accept-limit. Parameter values<br />
range from one to infinity (Inf). A value of<br />
one limits the server to accepting at most one<br />
connection when notified of a pending connection<br />
request, while Inf causes the server to consecutively<br />
accept all currently pending connections.<br />
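In outline (our sketch, not the µserver source), the parameter simply bounds the loop that drains the listening socket when it becomes readable:

#include <stddef.h>
#include <sys/socket.h>

/* Accept at most 'accept_limit' pending connections per notification;
 * a non-positive limit stands in for Inf (accept everything pending). */
void accept_batch_sketch(int listen_fd, int accept_limit)
{
    int accepted = 0;

    while (accept_limit <= 0 || accepted < accept_limit) {
        int fd = accept(listen_fd, NULL, NULL);
        if (fd < 0)
            break;                 /* no more pending connections (or error) */
        /* register fd with the event dispatch mechanism here */
        accepted++;
    }
}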
To this point we have used the accept strategy<br />
that was shown to be effective for select by
Brecht et al. [4] (i.e., accept-limit is Inf). In<br />
order to verify whether the same strategy performs<br />
well with the epoll-based methods we<br />
explored their performance under different accept<br />
strategies.<br />
Figure 7 examines the performance of level-triggered
epoll after the accept-limit has been<br />
tuned for the one-byte workload (other values<br />
were explored but only the best values<br />
are shown). Level-triggered epoll with an accept<br />
limit of 10 shows a marked improvement<br />
over the previous accept-limit of Inf,<br />
and now matches the performance of select<br />
on this workload. <strong>The</strong> accept-limit of 10 also<br />
improves peak throughput for the epoll-ctlv<br />
model by 7%. This gap widens to 32% at<br />
21,000 requests/sec. In fact the best accept<br />
strategy for epoll-ctlv fares slightly better than<br />
the best accept strategy for select.<br />
[Figure 7: µserver performance on one byte workload with different accept strategies and no idle connections (graph of Replies/s vs. Requests/s; curves for select accept=Inf, epoll-LT accept=Inf, epoll-LT accept=10, epoll-ctlv accept=Inf, and epoll-ctlv accept=10)]
Varying the accept-limit did not improve the<br />
performance of the edge-triggered epoll technique<br />
under this workload and it is not shown<br />
in the graph. However, we believe that the effects<br />
of the accept strategy on the various epoll<br />
techniques warrant further study, as the efficacy
of the strategy may be workload dependent.<br />
6 Discussion<br />
In this paper we use a high-performance event-driven
HTTP server, the µserver, to compare<br />
and evaluate the performance of select, poll,<br />
and epoll event mechanisms. Interestingly,<br />
we observe that under some of the workloads<br />
examined the throughput obtained using<br />
select and poll is as good or slightly better<br />
than that obtained with epoll. While these<br />
workloads may not utilize representative numbers<br />
of simultaneous connections they do stress<br />
the event mechanisms being tested.<br />
Our results also show that a main source of<br />
overhead when using level-triggered epoll is<br />
the large number of epoll_ctl calls. We<br />
explore techniques which significantly reduce<br />
the number of epoll_ctl calls, including<br />
the use of edge-triggered events and a system<br />
call, epoll_ctlv, which allows the µserver<br />
to aggregate large numbers of epoll_ctl<br />
calls into a single system call. While these<br />
techniques are successful in reducing the number<br />
of epoll_ctl calls they do not appear<br />
to provide appreciable improvements in performance.<br />
As expected, the introduction of idle connections<br />
results in dramatic performance degradation<br />
when using select and poll, while not<br />
noticeably impacting the performance when<br />
using epoll. Although it is not clear that<br />
the use of idle connections to simulate larger<br />
numbers of connections is representative of<br />
real workloads, we find that the addition of<br />
idle connections does not significantly alter<br />
the performance of the edge-triggered and<br />
level-triggered epoll mechanisms. The edge-triggered epoll mechanism performs best, with the level-triggered epoll mechanism offering performance that is very close to edge-triggered.
In the future we plan to re-evaluate some of
the mechanisms explored in this paper under<br />
more representative workloads that include realistic wide area network conditions.
<strong>The</strong> problem with the technique of using<br />
idle connections is that the idle connections<br />
simply inflate the number of connections without<br />
doing any useful work. We plan to explore<br />
tools similar to Dummynet [16] and NIST Net<br />
[5] in order to more accurately simulate traffic<br />
delays, packet loss, and other wide area network<br />
traffic characteristics, and to re-examine<br />
the performance of Internet servers using different<br />
event dispatch mechanisms and a wider<br />
variety of workloads.<br />
7 Acknowledgments<br />
We gratefully acknowledge Hewlett Packard,<br />
the Ontario Research and Development Challenge<br />
Fund, and the Natural Sciences and Engineering
Research Council of Canada for financial<br />
support for this project.<br />
References<br />
[1] G. Banga, P. Druschel, and J.C. Mogul.<br />
Resource containers: A new facility for<br />
resource management in server systems.<br />
In Operating Systems Design and<br />
Implementation, pages 45–58, 1999.<br />
[2] G. Banga and J.C. Mogul. Scalable<br />
kernel performance for Internet servers<br />
under realistic loads. In Proceedings of<br />
the 1998 USENIX Annual Technical<br />
Conference, New Orleans, LA, 1998.<br />
[3] G. Banga, J.C. Mogul, and P. Druschel.<br />
A scalable and explicit event delivery<br />
mechanism for UNIX. In Proceedings of<br />
the 1999 USENIX Annual Technical<br />
Conference, Monterey, CA, June 1999.<br />
[4] Tim Brecht, David Pariag, and Louay<br />
Gammo. accept()able strategies for<br />
improving web server performance. In<br />
Proceedings of the 2004 USENIX Annual<br />
Technical Conference (to appear), June<br />
2004.<br />
[5] M. Carson and D. Santay. NIST Net – a<br />
<strong>Linux</strong>-based network emulation tool.<br />
Computer Communication Review, to<br />
appear.<br />
[6] A. Chandra and D. Mosberger.<br />
Scalability of <strong>Linux</strong> event-dispatch<br />
mechanisms. In Proceedings of the 2001<br />
USENIX Annual Technical Conference,<br />
Boston, 2001.<br />
[7] HP Labs. <strong>The</strong> userver home page, 2004.<br />
Available at http://hpl.hp.com/<br />
research/linux/userver.<br />
[8] Dan Kegel. <strong>The</strong> C10K problem, 2004.<br />
Available at http:<br />
//www.kegel.com/c10k.html.<br />
[9] Jonathon Lemon. Kqueue—a generic<br />
and scalable event notification facility. In<br />
Proceedings of the USENIX Annual<br />
Technical Conference, FREENIX Track,<br />
2001.<br />
[10] Davide Libenzi. Improving (network)<br />
I/O performance. Available at<br />
http://www.xmailserver.org/<br />
linux-patches/nio-improve.<br />
html.<br />
[11] D. Mosberger and T. Jin. httperf: A tool<br />
for measuring web server performance.<br />
In <strong>The</strong> First Workshop on Internet Server<br />
Performance, pages 59–67, Madison,<br />
WI, June 1998.<br />
[12] Shailabh Nagar, Paul Larson, Hanna<br />
Linder, and David Stevens. epoll<br />
scalability web page. Available at<br />
http://lse.sourceforge.net/<br />
epoll/index.html.
[13] M. Ostrowski. A mechanism for scalable<br />
event notification and delivery in <strong>Linux</strong>.<br />
Master’s thesis, Department of Computer<br />
Science, University of Waterloo,<br />
November 2000.<br />
[14] Vivek S. Pai, Peter Druschel, and Willy<br />
Zwaenepoel. Flash: An efficient and<br />
portable Web server. In Proceedings of<br />
the USENIX 1999 Annual Technical<br />
Conference, Monterey, CA, June 1999.<br />
http://citeseer.nj.nec.com/<br />
article/pai99flash.html.<br />
[15] N. Provos and C. Lever. Scalable<br />
network I/O in <strong>Linux</strong>. In Proceedings of<br />
the USENIX Annual Technical<br />
Conference, FREENIX Track, June 2000.<br />
[16] Luigi Rizzo. Dummynet: a simple<br />
approach to the evaluation of network<br />
protocols. ACM Computer<br />
Communication Review, 27(1):31–41,<br />
1997.<br />
http://citeseer.ist.psu.<br />
edu/rizzo97dummynet.html.<br />
[17] Standard Performance Evaluation<br />
Corporation. SPECWeb99 Benchmark,<br />
1999. Available at http://www.<br />
specbench.org/osg/web99.<br />
[18] David Weekly. /dev/epoll – a high-speed<br />
<strong>Linux</strong> kernel patch. Available at<br />
http://epoll.hackerdojo.com.
<strong>The</strong> (Re)Architecture of the X Window System<br />
James Gettys<br />
jim.gettys@hp.com<br />
Keith Packard<br />
keithp@keithp.com<br />
HP Cambridge Research Laboratory<br />
Abstract<br />
<strong>The</strong> X Window System, Version 11, is the standard<br />
window system on <strong>Linux</strong> and UNIX systems.<br />
X11, designed in 1987, was “state of<br />
the art” at that time. From its inception, X has<br />
been a network transparent window system in<br />
which X client applications can run on any machine<br />
in a network using an X server running<br />
on any display. While there have been some<br />
significant extensions to X over its history (e.g.<br />
OpenGL support), X’s design lay fallow over<br />
much of the 1990’s. With the increasing interest<br />
in open source systems, it was no longer<br />
sufficient for modern applications, and a significant<br />
overhaul is now well underway. This<br />
paper describes revisions to the architecture of<br />
the window system used in a growing fraction<br />
of desktops and embedded systems.<br />
1 Introduction<br />
While part of this work on the X window system<br />
[SG92] is “good citizenship” required by<br />
open source, some of the architectural problems<br />
solved ease the ability of open source applications<br />
to print their results, and some of<br />
the techniques developed are believed to be in<br />
advance of the commercial computer industry.<br />
<strong>The</strong> challenges being faced include:<br />
• X’s fundamentally flawed font architecture<br />
made it difficult to implement good<br />
WYSIWYG systems<br />
• Inadequate 2D graphics, which had always<br />
been intended to be augmented<br />
and/or replaced<br />
• Developers are loath to adopt any new<br />
technology that limits the distribution of<br />
their applications<br />
• Legal requirements for accessibility for<br />
screen magnifiers are difficult to implement<br />
• Users desire modern user interface eye<br />
candy, which sports translucent graphics<br />
and windows, drop shadows, etc.<br />
• Full integration of applications into 3D<br />
environments<br />
• Collaborative shared use of X (e.g. multiple<br />
simultaneous use of projector walls or<br />
other shared applications)<br />
While some of this work has been published<br />
elsewhere, there has never been any overview<br />
paper describing this work as an integrated<br />
whole, and the compositing manager work described<br />
below is novel as of fall 2003. This<br />
work represents a long term effort that started<br />
in 1999, and will continue for several years<br />
more.
2 Text and Graphics<br />
The problems with X’s obsolete 2D bit-blit based text and graphics<br />
system were the most urgent. <strong>The</strong> development<br />
of the Gnome and KDE GUI environments<br />
in the period 1997-2000 had shown<br />
X11’s fundamental soundness, but confirmed<br />
the authors’ belief that the rendering system in<br />
X was woefully inadequate. <strong>One</strong> of us participated<br />
in the original X11 design meetings<br />
where the intent was to augment the rendering<br />
design at a later date; but the “GUI Wars” of the<br />
late 1980’s doomed effort in this area. Good<br />
printing support has been particularly difficult<br />
to implement in X applications, as fonts<br />
were opaque X server-side objects not directly<br />
accessible by applications.<br />
Most applications now composite images in<br />
sophisticated ways, whether it be in Flash media<br />
players, or subtly as part of anti-aliased<br />
characters. Bit-Blit is not sufficient for these<br />
applications, and these modern applications<br />
were (if only by their use of modern toolkits)<br />
all resorting to pixel based image manipulation.<br />
<strong>The</strong> screen pixels are retrieved from<br />
the window system, composited in clients, and<br />
then restored to the screen, rather than directly<br />
composited in hardware, resulting in poor performance.<br />
Inspired by the model first implemented<br />
in the Plan 9 window system, a graphics<br />
model based on Porter/Duff [PD84] image<br />
compositing was chosen. This work resulted in<br />
the X Render extension [Pac01a].<br />
X11’s core graphics exposed fonts as a server<br />
side abstraction. This font model was, at best,<br />
marginally adequate by 1987 standards. Even<br />
WYSIWYG systems of that era found them insufficient.<br />
Much additional information embedded<br />
in fonts (e.g. kerning tables) was not<br />
available from X whatsoever. Current competitive<br />
systems implement anti-aliased outline<br />
fonts. Discovering the Unicode coverage of a<br />
font, required by current toolkits for internationalization,<br />
was causing major performance<br />
problems. Deploying new server side font<br />
technology is slow, as X is a distributed system,<br />
and many X servers are seldom (or never)<br />
updated.<br />
<strong>The</strong>refore, a more fundamental change in X’s<br />
architecture was undertaken: to no longer use<br />
server side fonts at all, but to allow applications<br />
direct access to font files and have the window<br />
system cache and composite glyphs onto the<br />
screen.<br />
<strong>The</strong> first implementation of the new font system<br />
[Pac01b] taught a vital lesson. Xft1<br />
provided anti-aliased text and proper font<br />
naming/substitution support, but reverted to<br />
the core X11 bitmap fonts if the Render<br />
extension was not present. Xft1 included<br />
the first implementation of what is called “subpixel<br />
decimation,” which provides higher quality<br />
subpixel based rendering than Microsoft’s<br />
ClearType [Pla00] technology in a completely<br />
general algorithm.<br />
Despite these advances, Xft1 received at best<br />
a lukewarm reception. If an application developer<br />
wanted anti-aliased text universally, Xft1<br />
did not help them, since it relied on the Render<br />
extension which had not yet been widely deployed;<br />
instead, the developer would be faced<br />
with two implementations, and higher maintenance<br />
costs. This (in retrospect obvious) rational<br />
behavior of application developers shows<br />
the high importance of backwards compatibility;<br />
X extensions intended for application developers’<br />
use must be designed in a downward<br />
compatible form whenever possible, and<br />
should enable a complete conversion to a new<br />
facility, so that multiple code paths in applications<br />
do not need testing and maintenance.<br />
<strong>The</strong>se principles have guided later development.<br />
<strong>The</strong> font installation, naming, substitution,<br />
and internationalization problems were sepa-
rated from Xft into a library named Fontconfig<br />
[Pac02], since some print-only applications<br />
need this functionality independent of<br />
the window system. Fontconfig provides internationalization<br />
features in advance of those<br />
in commercial systems such as Windows or<br />
OS X, and enables trivial font installation with<br />
good performance even when using thousands<br />
of fonts. Xft2 was also modified to operate<br />
against legacy X servers lacking the Render extension.<br />
Because Xft2 and Fontconfig solved several major<br />
problems and faced no deployment barriers,<br />
they saw rapid acceptance and deployment in<br />
the open source community, achieving almost universal<br />
use and uptake in less than one calendar<br />
year. <strong>The</strong>y have been widely deployed on<br />
<strong>Linux</strong> systems since the end of 2002. <strong>The</strong>y also<br />
“future proof” open source systems against<br />
coming improvements in font systems (e.g.<br />
OpenType), as the window system is no longer<br />
a gating item for font technology.<br />
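To make the client-side model concrete, here is a minimal sketch (all names come from the Xft2 and Fontconfig client libraries; the window, the font pattern, and the absence of error handling are illustrative assumptions) of an application drawing text with no server-side font support at all:<br />
#include &lt;X11/Xft/Xft.h&gt;<br />
<br />
/* Open a font via a Fontconfig pattern and draw a UTF-8 string; the glyphs<br />
 * are rasterized on the client and composited by the server (through the<br />
 * Render extension when present, core requests otherwise). */<br />
static void draw_hello(Display *dpy, Window win)<br />
{<br />
    int scr = DefaultScreen(dpy);<br />
    XftFont *font = XftFontOpenName(dpy, scr, "Sans-12");<br />
    XftDraw *draw = XftDrawCreate(dpy, win, DefaultVisual(dpy, scr),<br />
                                  DefaultColormap(dpy, scr));<br />
    XftColor color;<br />
<br />
    XftColorAllocName(dpy, DefaultVisual(dpy, scr),<br />
                      DefaultColormap(dpy, scr), "black", &amp;color);<br />
    XftDrawStringUtf8(draw, &amp;color, font, 10, 20,<br />
                      (const FcChar8 *) "Hello, world", 12);<br />
}<br />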
Sun Microsystems implemented a server side<br />
font extension over the last several years; for<br />
the reasons outlined in this section, it has not<br />
been adopted by open source developers.<br />
While Xft2 and Fontconfig finally freed application<br />
developers from the tyranny of<br />
X11’s core font system, improved performance<br />
[PG03], and at a stroke simplified their<br />
printing problems, it has still left a substantial<br />
burden on applications. <strong>The</strong> X11 core graphics,<br />
even augmented by the Render extension,<br />
lack convenient facilities for many applications<br />
for even simple primitives like splines, tasteful<br />
wide lines, stroking paths, etc., much less provide<br />
simple ways for applications to print the<br />
results on paper.<br />
3 Cairo<br />
<strong>The</strong> Cairo library [WP03], developed by one of<br />
the authors in conjunction with Carl Worth<br />
of ISI, is designed to solve this problem. Cairo<br />
provides a stateful user-level API with support<br />
for the PDF 1.4 imaging model. Cairo provides<br />
operations including stroking and filling<br />
Bézier cubic splines, transforming and compositing<br />
translucent images, and anti-aliased<br />
text rendering. <strong>The</strong> PostScript drawing model<br />
has been adapted for use within applications.<br />
Extensions needed to support much of the PDF<br />
1.4 imaging operations have been included.<br />
This integration of the familiar PostScript operational<br />
model within the native application<br />
language environments provides a simple and<br />
powerful new tool for graphics application development.<br />
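As a purely illustrative sketch of that drawing model (the constructor and function names below follow later Cairo releases, since the API was still settling at the time of writing; treat them as assumptions rather than the exact interface), stroking a translucent Bézier spline looks like this:<br />
#include &lt;cairo.h&gt;<br />
<br />
int main(void)<br />
{<br />
    /* Draw an anti-aliased, stroked cubic Bézier into an image surface,<br />
     * using the stateful, PostScript-like drawing model described above. */<br />
    cairo_surface_t *surface =<br />
        cairo_image_surface_create(CAIRO_FORMAT_ARGB32, 120, 60);<br />
    cairo_t *cr = cairo_create(surface);<br />
<br />
    cairo_set_source_rgba(cr, 0.0, 0.0, 0.8, 0.7);  /* translucent blue */<br />
    cairo_set_line_width(cr, 4.0);<br />
    cairo_move_to(cr, 10, 30);<br />
    cairo_curve_to(cr, 40, 0, 80, 60, 110, 30);     /* cubic spline */<br />
    cairo_stroke(cr);<br />
<br />
    cairo_surface_write_to_png(surface, "spline.png");<br />
    cairo_destroy(cr);<br />
    cairo_surface_destroy(surface);<br />
    return 0;<br />
}<br />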
Cairo’s rendering algorithms use work done<br />
in the 1980’s by Guibas, Ramshaw, and<br />
Stolfi [GRS83] along with work by John<br />
Hobby [Hob85], which has never been exploited<br />
in PostScript or in Windows. <strong>The</strong> implementation<br />
is fast, precise, and numerically<br />
stable, supports hardware acceleration, and is<br />
in advance of commercial systems.<br />
Of particular note is the current development of<br />
Glitz [NR04], an OpenGL backend for Cairo,<br />
being developed by a pair of master’s students<br />
in Sweden. Not only is it showing that a high<br />
speed implementation of Cairo is possible, it<br />
implements an interface very similar to the X<br />
Render extension’s interface. More about this<br />
in the OpenGL section below.<br />
Cairo is in the late stages of development and<br />
is being widely adopted in the open source<br />
community. It includes the ability to render<br />
to PostScript, and a PDF back end is planned,<br />
which should greatly improve applications’<br />
printing support. Work to incorporate Cairo in<br />
the Gnome and KDE desktop environments is
well underway, as are ports to Windows and<br />
Apple’s Macintosh, and it is being used by the<br />
Mono project. As with Xft2, Cairo works with<br />
all X servers, even those without the Render<br />
extension.<br />
4 Accessibility and Eye-Candy<br />
Several years ago, one of us implemented a<br />
prototype X system that used image compositing<br />
as the fundamental primitive for constructing<br />
the screen representation of the window hierarchy<br />
contents. Child window contents were<br />
composited to their parent windows which<br />
were incrementally composed to their parents<br />
until the final screen image was formed, enabling<br />
translucent windows. <strong>The</strong> problem with<br />
this simplistic model was twofold—first, a<br />
naïve implementation consumed enormous resources<br />
as each window required two complete<br />
off screen buffers (one for the window<br />
contents themselves, and one for the window<br />
contents composited with the children) and<br />
took huge amounts of time to build the final<br />
screen image as it recursively composited windows<br />
together. Secondly, the policy governing<br />
the compositing was hardwired into the X<br />
server. An architecture for exposing the same<br />
semantics with less overhead seemed almost<br />
possible, and pieces of it were implemented<br />
(miext/layer). However, no complete system<br />
was fielded, and every copy of the code was tracked<br />
down and destroyed to prevent its escape into<br />
the wild.<br />
Both Mac OS X and DirectFB [Hun04] perform<br />
window-level compositing by creating<br />
off-screen buffers for each top-level window<br />
(in OS X, the window system is not nested,<br />
so there are only top-level windows). <strong>The</strong><br />
screen image is then formed by taking the resulting<br />
images and blending them together on<br />
the screen. Without handling the nested window<br />
case, both of these systems provide the<br />
desired functionality with a simple implementation.<br />
This simple approach is inadequate<br />
for X as some desktop environments nest the<br />
whole system inside a single top-level window<br />
to allow panning, and X’s long history<br />
has shown the value of separating mechanism<br />
from policy (Gnome and KDE were developed<br />
over 10 years after X11’s design). <strong>The</strong> fix is<br />
pretty easy—allow applications to select which<br />
pieces of the window hierarchy are to be stored<br />
off-screen and which are to be drawn to their<br />
parent storage.<br />
With window hierarchy contents stored in offscreen<br />
buffers, an external application can now<br />
control how the screen contents are constructed<br />
from the constituent sub-windows and whatever<br />
other graphical elements are desired. This<br />
eliminated the complexities surrounding precisely<br />
what semantics would be offered in<br />
window-level compositing within the X server<br />
and the design of the underlying X extensions.<br />
<strong>The</strong>y were replaced by some concerns over the<br />
performance implications of using an external<br />
agent (the “Compositing Manager”) to execute<br />
the requests needed to present the screen image.<br />
Note that every visible pixel is under the<br />
control of the compositing manager, so screen<br />
updates are limited to how fast that application<br />
can get the bits painted to the screen.<br />
<strong>The</strong> architecture is split across three new extensions:<br />
• Composite, which controls which subhierarchies<br />
within the window tree are<br />
rendered to separate buffers.<br />
• Damage, which tracks modified areas<br />
within windows, informing the Compositing<br />
Manager which areas of the off-screen hierarchy<br />
components have changed.<br />
• Xfixes, which includes new Region objects<br />
permitting all of the above computation<br />
to be performed indirectly within the
X server, avoiding round trips.<br />
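Putting the three extensions together, a compositing manager starts by redirecting window contents off-screen and asking to be told what changed. The following is only a minimal sketch using the Xcomposite and Xdamage client libraries; extension version checks, error handling, and the Render-based repaint loop are omitted:<br />
#include &lt;X11/Xlib.h&gt;<br />
#include &lt;X11/extensions/Xcomposite.h&gt;<br />
#include &lt;X11/extensions/Xdamage.h&gt;<br />
<br />
/* Redirect all top-level windows into off-screen buffers and ask the<br />
 * Damage extension to report which areas change. */<br />
static void start_compositing(Display *dpy)<br />
{<br />
    Window root = DefaultRootWindow(dpy);<br />
    Damage damage;<br />
<br />
    /* Composite: children of the root are now rendered off-screen;<br />
     * presenting them on the screen becomes this client's job. */<br />
    XCompositeRedirectSubwindows(dpy, root, CompositeRedirectManual);<br />
<br />
    /* Damage: DamageNotify events report changed areas; the event loop<br />
     * would accumulate them in XFixes regions and repaint only those<br />
     * areas with Render requests. */<br />
    damage = XDamageCreate(dpy, root, XDamageReportNonEmpty);<br />
    (void) damage;<br />
}<br />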
Multiple applications can take advantage of the<br />
off screen window contents, allowing thumbnail<br />
or screen magnifier applications to be included<br />
in the desktop environment.<br />
To allow applications other than the compositing<br />
manager to present alpha-blended content<br />
to the screen, a new X Visual was added to the<br />
server. At 32 bits deep, it provides 8 bits of<br />
red, green and blue along with 8 bits of alpha<br />
value. Applications can create windows using<br />
this visual and the compositing manager can<br />
composite them onto the screen.<br />
Nothing in this fundamental design indicates<br />
that it is used for constructing translucent windows;<br />
redirection of window contents and notification<br />
of window content change seems<br />
pretty far removed from one of the final goals.<br />
But note the compositing manager can use whatever<br />
X requests it likes to paint the combined<br />
image, including requests from the Render<br />
extension, which does know how to blend<br />
translucent images together. <strong>The</strong> final image<br />
is constructed programmatically so the possible<br />
presentation on the screen is limited only<br />
by the fertile imagination of the numerous eyecandy<br />
developers, and not restricted to any policy<br />
imposed by the base window system. And<br />
vital to rapid deployment, most applications<br />
can be completely oblivious to this background<br />
legerdemain.<br />
In this design, such sophisticated effects need<br />
only be applied at frame update rates on only<br />
modified sections of the screen rather than at<br />
the rate applications perform graphics; this<br />
constant behavior is highly desirable in systems.<br />
<strong>The</strong>re is very strong “pull” from both commercial<br />
and non-commercial users of X for this<br />
work and the current early version will likely<br />
be shipped as part of the next X.org Foundation<br />
X Window System release, sometime<br />
this summer. Since there has not been sufficient<br />
exposure through widespread use, further<br />
changes will certainly be required as further experience<br />
with the facilities is gained in a much<br />
larger audience; as these can be made without<br />
affecting existing applications, immediate deployment<br />
is both possible and extremely desirable.<br />
<strong>The</strong> mechanisms described above realize a fundamentally<br />
more interesting architecture than<br />
either Windows or Mac OS X, where the compositing<br />
policy is hardwired into the window<br />
system. We expect a fertile explosion of experimentation,<br />
experience (both good and bad),<br />
and a winnowing of ideas as these facilities<br />
gain wider exposure.<br />
5 Input Transformation<br />
In the “naïve,” eye-candy use of the new compositing<br />
functions, no transformation of input<br />
events is required, as input to windows remains<br />
at the same geometric position on the<br />
screen, even though the windows are first rendered<br />
off screen. More sophisticated use, for<br />
example, screen readers or immersive environments<br />
such as Croquet [SRRK02], or Sun’s<br />
Looking Glass [KJ04] requires transformation<br />
of input events from where they first occur<br />
on the visible screen to the actual position in<br />
the windows (being rendered from off screen),<br />
since the window’s contents may have been arbitrarily<br />
transformed or even texture mapped<br />
onto shapes on the screen.<br />
As part of Sun Microsystems’ award-winning<br />
work on accessibility in open source for screen<br />
readers, Sun has developed the XEvIE extension<br />
[Kre], which allows external clients to<br />
transform input events. This looks like a good<br />
starting point for the somewhat more general<br />
problem that 3D systems pose, and with some
modification can serve both the accessibility<br />
needs and those of more sophisticated applications.<br />
6 Synchronization<br />
Synchronization is probably the largest remaining<br />
challenge posed by compositing.<br />
While composite has eliminated much flashing<br />
of the screen since window exposure is eliminated,<br />
this does not solve the challenge of the<br />
compositing manager happening to copy an application’s<br />
window to the frame buffer in the<br />
middle of an application painting a sequence<br />
of updates. No “tearing” of single graphics operations<br />
takes place since the X server is single<br />
threaded, and all graphics operations are run to<br />
completion.<br />
<strong>The</strong> X Synchronization extension<br />
(XSync) [GCGW92], widely available<br />
but to date seldom used, provides a general set<br />
of mechanisms for applications to synchronize<br />
with each other, with real time, and potentially<br />
with other system provided counters. XSync’s<br />
original design intended system-provided<br />
counters for vertical retrace interrupts,<br />
audio sample clocks, and similar system<br />
facilities, enabling very tight synchronization<br />
of graphics operations with these time bases.<br />
Work has begun on <strong>Linux</strong> to provide these<br />
counters at long last, when available, to flesh<br />
out the design originally put in place and tested<br />
in the early 1990’s.<br />
A possible design for solving the application<br />
synchronization problem at low overhead may<br />
be to mark sections of requests with increments<br />
of XSync counters: if the count is odd<br />
(or even) the window would be unstable/stable.<br />
<strong>The</strong> compositing manager might then copy the<br />
window only if the window is in a stable state.<br />
Some details and possibly extensions to XSync<br />
will need to be worked out, if this approach is<br />
pursued.<br />
7 Next Steps<br />
We believe we are slightly more than half way<br />
through the process of rearchitecting and reimplementing<br />
the X Window System. <strong>The</strong> existing<br />
prototype needs to become a production<br />
system requiring significant infrastructure<br />
work as described in this section.<br />
7.1 OpenGL based X<br />
Current X-based systems which support<br />
OpenGL do so by encapsulating the OpenGL<br />
environment within X windows. As such,<br />
an OpenGL application cannot manipulate X<br />
objects with OpenGL drawing commands.<br />
Using OpenGL as the basis for the X server itself<br />
will place X objects such as pixmaps and<br />
off-screen window contents inside OpenGL<br />
objects allowing applications to use the full<br />
OpenGL command set to manipulate them.<br />
A “proof of concept” implementation of the<br />
X Render extension is being done as part of<br />
the Glitz back-end for Cairo, which is showing<br />
very good performance for render based applications.<br />
Whether the “core” X graphics will require<br />
any OpenGL extensions is still somewhat<br />
of an open question.<br />
In concert with the new compositing extensions,<br />
conventional X applications can then be<br />
integrated into 3D environments such as Croquet,<br />
or Sun’s Looking Glass. X application<br />
contents can be used as textures and mapped<br />
onto any surface desired in those environments.<br />
This work is underway, but not demonstrable<br />
at this date.
7.2 <strong>Kernel</strong> support for graphics cards<br />
In current open source systems, graphics cards<br />
are supported in a manner totally unlike that<br />
of any other operating system, and unlike previous<br />
device drivers for the X Window System<br />
on commercial UNIX systems. <strong>The</strong>re is no single<br />
central kernel driver responsible for managing<br />
access to the hardware. Instead, a large set<br />
of cooperating user and kernel mode systems<br />
are involved in mutual support of the hardware,<br />
including the X server (for 2D graphics), the<br />
direct-rendering infrastructure (DRI) (for accelerated<br />
3D graphics), the kernel frame buffer<br />
driver (for text console emulation), the General<br />
ATI TV and Overlay Software (GATOS)<br />
(for video input and output) and alternate 2D<br />
graphics systems like DirectFB.<br />
Two of these systems, the kernel frame buffer<br />
driver and the X server both include code to<br />
configure the graphics card “video mode”—<br />
the settings needed to send the correct video<br />
signals to monitors connected to the card.<br />
Three of these systems, DRI, the X server<br />
and GATOS, all include code for managing<br />
the memory space within the graphics card.<br />
All of these systems directly manipulate hardware<br />
registers without any coordination among<br />
them.<br />
<strong>The</strong> X server has no kernel component for<br />
2D graphics. Long-latency operations cannot<br />
use interrupts, instead the X server spins while<br />
polling status registers. DMA is difficult or impossible<br />
to configure in this environment. Perhaps<br />
the most egregious problem is that the<br />
X server reconfigures the PCI bus to correct<br />
BIOS mapping errors without informing the<br />
operating system kernel. <strong>Kernel</strong> access to devices<br />
while this remapping is going on may<br />
find the related devices mismapped.<br />
To rationalize this situation, various groups and<br />
vendors are coordinating efforts to create a single<br />
kernel-level entity responsible for basic device<br />
management, but this effort has just begun.<br />
7.3 Housecleaning and Latency Elimination<br />
and Latency Hiding<br />
Serious attempts were made in the early 1990’s<br />
to multi-thread the X server itself, with the discovery<br />
that the threading overhead in the X<br />
server is a net performance loss [Smi92].<br />
Applications, however, often need to be multithreaded.<br />
<strong>The</strong> primary C binding to the X protocol<br />
is called Xlib, and its current implementation<br />
by one of us dates from 1987. While it<br />
was partially developed on a Firefly multiprocessor<br />
workstation of that era, something almost<br />
unheard of at that date, and some consideration<br />
of multi-threaded applications was<br />
taken in its implementation, its internal transport<br />
facilities were never expected/intended to<br />
be preserved when serious multi-threaded operating<br />
systems became available. Unfortunately,<br />
rather than a full rewrite as one of us expected,<br />
multi-threaded support was debugged<br />
into existence using the original code base and<br />
the resulting code is very bug-prone and hard to<br />
maintain. Additionally, over the years, Xlib became<br />
a “kitchen sink” library, including functionality<br />
well beyond its primary use as a binding<br />
to the X protocol. We have both seriously<br />
regretted the precedents both of us set<br />
introducing extraneous functionality into Xlib,<br />
causing it to be one of the largest libraries on<br />
UNIX/<strong>Linux</strong> systems. Due to better facilities<br />
in modern toolkits and system libraries, more<br />
than half of Xlib’s current footprint is obsolete<br />
code or data.<br />
While serious work was done in X11’s design<br />
to mitigate latency, X’s performance, particularly<br />
over low speed networks, is often limited<br />
by round trip latency, and with retrospect<br />
much more can be done [PG03]. As this
work shows, client side fonts have made a significant<br />
improvement in startup latency, and<br />
work has already been completed in toolkits<br />
to mitigate some of the other hot spots. Much<br />
of the latency can be recovered by some simple<br />
techniques already underway, but some require<br />
more sophisticated techniques that the<br />
current Xlib implementation is not capable of.<br />
Potentially 90% of the latency as of 2003 can be<br />
recovered by various techniques. <strong>The</strong> XCB<br />
library [MS01] by Bart Massey and Jamey<br />
Sharp is both carefully engineered to be multithreaded<br />
and to expose interfaces that will allow<br />
for latency hiding.<br />
Since libraries linked against different basic<br />
X transport systems would cause havoc in the<br />
same address space, an Xlib compatibility layer<br />
(XCL) has been developed that provides the<br />
“traditional” X library API, using the original<br />
Xlib stubs, but replacing the internal transport<br />
and locking system, which will allow for much<br />
more useful latency hiding interfaces. <strong>The</strong><br />
XCB/XCL version of Xlib is now able to run<br />
essentially all applications, and after a shakedown<br />
period, should be able to replace the existing<br />
Xlib transport soon. Other bindings than<br />
the traditional Xlib bindings then become possible<br />
in the same address space, and we may<br />
see toolkits adopt those bindings at substantial<br />
savings in space.<br />
7.4 Mobility, Collaboration, and Other Topics<br />
X’s original intended environment included<br />
highly mobile students, and a hope, never generally<br />
realized for X, was the migration of applications<br />
between X servers.<br />
<strong>The</strong> user should be able to travel between systems<br />
running X and retrieve their running applications<br />
(with suitable authentication and authorization).<br />
<strong>The</strong> user should be able to log out<br />
and “park” applications somewhere for later<br />
retrieval, either on the same display, or elsewhere.<br />
Users should be able to replicate an<br />
application’s display on a wall projector for<br />
presentation. Applications should be able to<br />
easily survive the loss of the X server (most<br />
commonly caused by the loss of the underlying<br />
TCP connection, when running remotely).<br />
Toolkit implementers typically did not understand<br />
and share this poorly enunciated vision<br />
and were primarily driven by pressing immediate<br />
needs, and X’s design and implementation<br />
made migration or replication difficult<br />
to implement as an afterthought. As a result,<br />
migration (and replication) was seldom<br />
implemented, and early toolkits such as Xt<br />
made it even more difficult. Emacs is the only<br />
widespread application capable of both migration<br />
and replication, and it avoided using any<br />
toolkit. A more detailed description of this vision<br />
is available in [Get02].<br />
Recent work in some of the modern toolkits<br />
(e.g. GTK+) and evolution of X itself make<br />
much of this vision demonstrable in current applications.<br />
Some work in the X infrastructure<br />
(Xlib) is underway to enable the prototype in<br />
GTK+ to be finished.<br />
Similarly, input devices need to become full-fledged<br />
network data sources, to enable much<br />
looser coupling of keyboards, mice, game consoles<br />
and projectors and displays; the challenge<br />
here will be the authentication, authorization<br />
and security issues this will raise. <strong>The</strong> HAL<br />
and DBUS projects hosted at freedesktop.org<br />
are working on at least part of the solutions for<br />
the user interface challenges posed by hotplug<br />
of input devices.<br />
7.5 Color Management<br />
<strong>The</strong> existing color management facilities in<br />
X are over 10 years old, have never seen<br />
widespread use, and do not meet current needs.<br />
This area is ripe for revisiting. Marti Maria Sa-
guer’s LittleCMS [Mar] may be of use here.<br />
For the first time, we have the opportunity to<br />
“get it right” from one end to the other if we<br />
choose to make the investment.<br />
7.6 Security and Authentication<br />
Transport security has become a burning issue;<br />
X is network transparent (applications can<br />
run on any system in a network, using remote<br />
displays), yet we dare no longer use X over the<br />
network directly due to password grabbing kits<br />
in the hands of script kiddies. SSH [BS01] provides<br />
such facilities via port forwarding and<br />
is being used as a temporary stopgap. Urgent<br />
work on something better is vital to enable<br />
scaling and avoid the performance and latency<br />
issues introduced by transit of extra processes,<br />
particularly on (<strong>Linux</strong> Terminal Server<br />
Project (LTSP [McQ02]) servers, which are beginning<br />
break out of their initial use in schools<br />
and other non security sensitive environments<br />
into very sensitive commercial environments.<br />
Another aspect of security arises between applications<br />
sharing a display. In the early and<br />
mid 1990’s efforts were made as a result of the<br />
compartmented mode workstation projects to<br />
make it much more difficult for applications to<br />
share or steal data from each other on an X display.<br />
<strong>The</strong>se facilities are very inflexible, and<br />
have gone almost unused.<br />
As projectors and other shared displays become<br />
common over the next five years, applications<br />
from multiple users sharing a display<br />
will become commonplace. In such environments,<br />
different people may be using the same<br />
display at the same time and would like some<br />
level of assurance that their application’s data<br />
is not being grabbed by the other user’s application.<br />
Eamon Walsh has, as part of the SE<strong>Linux</strong><br />
project [Wal04], been working to replace the<br />
existing X Security extension with an extension<br />
that, as in SE<strong>Linux</strong>, will allow multiple<br />
different security policies to be developed external<br />
to the X server. This should allow multiple<br />
different policies to be available to suit the<br />
varied uses: normal workstations, secure workstations,<br />
shared displays in conference rooms,<br />
etc.<br />
7.7 Compression and Image Transport<br />
Many/most modern applications and desktops,<br />
including the most commonly used application<br />
(a web browser) are now intensive users of synthetic<br />
and natural images. <strong>The</strong> previous attempt<br />
(XIE [SSF+96]) to provide compressed<br />
image transport failed due to excessive complexity<br />
and overambition of the designers; it has<br />
never been significantly used, and is now in<br />
fact not even shipped as part of current X distributions.<br />
Today, many images are being read from disk<br />
or the network in compressed form, uncompressed<br />
into memory in the X client, moved<br />
to the X server (where they often occupy another<br />
copy of the uncompressed data). If we<br />
add general data compression to X (or run X<br />
over ssh with compression enabled) the data<br />
would be both compressed and uncompressed<br />
on its way to the X server. A simple replacement<br />
for XIE (if the complexity slippery slope<br />
can be avoided in a second attempt) would be<br />
worthwhile, along with other general compression<br />
of the X protocol.<br />
Results in our 2003 Usenix X Network Performance<br />
paper show that, in real application<br />
workloads (the startup of a Gnome desktop),<br />
using even simple GZIP [Gai93] style<br />
compression can make a tremendous difference<br />
in a network environment, with a factor<br />
of 300(!) savings in bandwidth. Apparently<br />
the synthetic images used in many current<br />
UI’s are extremely good candidates for
compression. A simple X extension that could<br />
encapsulate one or more X requests into the<br />
extension request would avoid multiple compression/uncompression<br />
of the same data in<br />
the system where an image transport extension<br />
was also present. <strong>The</strong> basic X protocol framework<br />
is actually very byte efficient relative to<br />
most conventional RPC systems, with a basic<br />
X request only occupying 4 bytes (contrast this<br />
with HTTP or CORBA, in which a simple request<br />
is more than 100 bytes).<br />
With the great recent interest in LTSP in commercial<br />
environments, work here would be extremely<br />
well spent, saving both memory and<br />
CPU, and network bandwidth.<br />
We are more than happy to hear from anyone<br />
interested in helping in this effort to bring X<br />
into the new millennium.<br />
References<br />
[BS01] Daniel J. Barrett and Richard Silverman. SSH, The Secure Shell: The Definitive Guide. O’Reilly &amp; Associates, Inc., 2001.<br />
[Gai93] Jean-Loup Gailly. Gzip: The Data Compression Program. iUniverse.com, 1.2.4 edition, 1993.<br />
[GCGW92] Tim Glauert, Dave Carver, James Gettys, and David Wiggins. X Synchronization Extension Protocol, Version 3.0. X consortium standard, 1992.<br />
[Get02] James Gettys. The Future is Coming, Where the X Window System Should Go. In FREENIX Track, 2002 Usenix Annual Technical Conference, Monterey, CA, June 2002. USENIX.<br />
[GRS83] Leo Guibas, Lyle Ramshaw, and Jorge Stolfi. A kinetic framework for computational geometry. In Proceedings of the IEEE 1983 24th Annual Symposium on the Foundations of Computer Science, pages 100–111. IEEE Computer Society Press, 1983.<br />
[Hob85] John D. Hobby. Digitized Brush Trajectories. PhD thesis, Stanford University, 1985. Also Stanford Report STAN-CS-85-1070.<br />
[Hun04] A. Hundt. DirectFB Overview (v0.2 for DirectFB 0.9.21), February 2004. http://www.directfb.org/documentation.<br />
[KJ04] H. Kawahara and D. Johnson. Project Looking Glass: 3D Desktop Exploration. In X Developers Conference, Cambridge, MA, April 2004.<br />
[Kre] S. Kreitman. XEvIE – X Event Interception Extension. http://freedesktop.org/~stukreit/xevie.html.<br />
[Mar] M. Maria. Little CMS Engine 1.12 API Definition. Technical report. http://www.littlecms.com/lcmsapi.txt.<br />
[McQ02] Jim McQuillan. LTSP – Linux Terminal Server Project, Version 3.0. Technical report, March 2002. http://www.ltsp.org/documentation/ltsp-3.0-4-en.html.<br />
[MS01] Bart Massey and Jamey Sharp. XCB: An X protocol C binding. In XFree86 Technical Conference, Oakland, CA, November 2001. USENIX.<br />
[NR04] Peter Nilsson and David Reveman. Glitz: Hardware Accelerated Image Compositing using OpenGL. In FREENIX Track, 2004 Usenix Annual Technical Conference, Boston, MA, July 2004. USENIX.<br />
[Pac01a] Keith Packard. Design and Implementation of the X Rendering Extension. In FREENIX Track, 2001 Usenix Annual Technical Conference, Boston, MA, June 2001. USENIX.<br />
[Pac01b] Keith Packard. The Xft Font Library: Architecture and Users Guide. In XFree86 Technical Conference, Oakland, CA, November 2001. USENIX.<br />
[Pac02] Keith Packard. Font Configuration and Customization for Open Source Systems. In 2002 Gnome User’s and Developers European Conference, Seville, Spain, April 2002. Gnome.<br />
[PD84] Thomas Porter and Tom Duff. Compositing Digital Images. Computer Graphics, 18(3):253–259, July 1984.<br />
[PG03] Keith Packard and James Gettys. X Window System Network Performance. In FREENIX Track, 2003 Usenix Annual Technical Conference, San Antonio, TX, June 2003. USENIX.<br />
[Pla00] J. Platt. Optimal filtering for patterned displays. IEEE Signal Processing Letters, 7(7):179–180, 2000.<br />
[SG92] Robert W. Scheifler and James Gettys. X Window System. Digital Press, third edition, 1992.<br />
[Smi92] John Smith. The Multi-Threaded X Server. The X Resource, 1:73–89, Winter 1992.<br />
[SRRK02] D. Smith, A. Raab, D. Reed, and A. Kay. Croquet: The Users Manual, October 2002. http://glab.cs.uni-magdeburg.de/~croquet/downloads/Croquet0.1.pdf.<br />
[SSF+96] Robert N.C. Shelley, Robert W. Scheifler, Ben Fahy, Jim Fulton, Keith Packard, Joe Mauro, Richard Hennessy, and Tom Vaughn. X Image Extension Protocol Version 5.02. X consortium standard, 1996.<br />
[Wal04] Eamon Walsh. Integrating XFree86 With Security-Enhanced Linux. In X Developers Conference, Cambridge, MA, April 2004. http://freedesktop.org/Software/XDevConf/x-security-walsh.pdf.<br />
[WP03] Carl Worth and Keith Packard. Xr: Cross-device Rendering for Vector Graphics. In Proceedings of the Ottawa Linux Symposium, Ottawa, ON, July 2003. OLS.<br />
IA64-<strong>Linux</strong> perf tools for IO dorks<br />
Examples of IA-64 PMU usage<br />
Grant Grundler<br />
Hewlett-Packard<br />
iod00d@hp.com<br />
grundler@parisc-linux.org<br />
Abstract<br />
Itanium processors have very sophisticated<br />
performance monitoring tools integrated into<br />
the CPU. McKinley and Madison Itanium<br />
CPUs have over three hundred different types<br />
of events they can filter, trigger on, and count.<br />
<strong>The</strong> restrictions on which combinations of triggers<br />
are allowed are daunting and vary across<br />
CPU implementations. Fortunately, the tools<br />
hide this complicated mess. While the tools<br />
prevent us from shooting ourselves in the foot,<br />
it’s not obvious how to use those tools for measuring<br />
kernel device driver behaviors.<br />
IO driver writers can use pfmon to measure two<br />
key areas generally not obvious from the code:<br />
MMIO read and write frequency and precise<br />
addresses of instructions regularly causing L3<br />
data cache misses. Measuring MMIO reads has<br />
some nuances related to instruction execution<br />
which are relevant to understanding ia64 and<br />
likely ia32 platforms. Similarly, the ability to<br />
pinpoint exactly which data is being accessed<br />
by drivers enables driver writers to either modify<br />
the algorithms or add prefetching directives<br />
where feasible. I include some examples on<br />
how I used pfmon to measure NIC drivers and<br />
give some guidelines on use.<br />
q-syscollect is a “gprof without the pain” kind<br />
of tool. While q-syscollect uses the same kernel<br />
perfmon subsystem as pfmon, the former<br />
works at a higher level. With some knowledge<br />
about how the kernel operates, q-syscollect can<br />
collect call-graphs, function call counts, and<br />
percentage of time spent in particular routines.<br />
In other words, pfmon can tell us how much<br />
time the CPU spends stalled on d-cache misses<br />
and q-syscollect can give us the call-graph for<br />
the worst offenders.<br />
Updated versions of this paper will be available<br />
from http://iou.parisc-linux.<br />
org/ols2004/<br />
1 Introduction<br />
Improving the performance of IO drivers is really<br />
not that easy. It usually goes something<br />
like:<br />
1. Determine which workload is relevant<br />
2. Set up the test environment<br />
3. Collect metrics<br />
4. Analyze the metrics<br />
5. Change the code based on theories about<br />
the metrics<br />
6. Iterate on Collect metrics<br />
This paper attempts to make the collect-analyze-change<br />
loop more efficient for three
obvious things: MMIO reads, MMIO writes,<br />
and cache line misses.<br />
MMIO reads and writes are easier to locate in<br />
<strong>Linux</strong> code than for other OSs which support<br />
memory-mapped IO—just search for readl()<br />
and writel() calls. But pfmon [1] can provide<br />
statistics of actual behavior and not just where<br />
in the code MMIO space is touched.<br />
Cache line misses are hard to detect. None<br />
of the regular performance tools I’ve used<br />
can precisely tell where CPU stalls are taking<br />
place. We can guess some of them based on<br />
data usage—like spin locks ping-ponging between<br />
CPUs. This requires a level of understanding<br />
that most of us mere mortals don’t<br />
possess. Again, pfmon can help out here.<br />
Lastly, getting an overview of system performance<br />
and getting a run-time call graph usually<br />
requires compiler support that gcc doesn’t provide.<br />
q-tools[4] can provide that information.<br />
Driver writers can then manually adjust the<br />
code knowing where the “hot spots” are.<br />
1.1 pfmon<br />
<strong>The</strong> author of pfmon, Stephane Eranian [2],<br />
describes pfmon as “the performance tool<br />
for IA64-<strong>Linux</strong> which exploits all the features<br />
of the IA-64 Performance Monitoring Unit<br />
(PMU).” pfmon uses a command line interface<br />
and does not require any special privilege<br />
to run. pfmon can monitor a single process, a<br />
multi-threaded process, multi-process workloads,<br />
and the entire system.<br />
pfmon is the user command line interface to<br />
the kernel perfmon subsystem. perfmon does<br />
the ugly work of programming the PMU. Perfmon<br />
is versioned separately from pfmon command.<br />
When in doubt, use the perfmon in the<br />
latest 2.6 kernel.<br />
<strong>The</strong>re are two major types of measurements:<br />
counting and sampling. For counting, pfmon<br />
simply reports the number of occurrences of<br />
the desired events during the monitoring period.<br />
pfmon can also be configured to sample<br />
at certain intervals information about the execution<br />
of a command or for the entire system.<br />
It is possible to sample any events provided by<br />
the underlying PMU.<br />
<strong>The</strong> information recorded by the PMU depends<br />
on what the user wants. pfmon contains a few<br />
preset measurements but for the most part the<br />
user is free to set up custom measurements.<br />
On Itanium2, pfmon provides access to all the<br />
PMU advanced features such as opcode matching,<br />
range restrictions, the Event Address Registers<br />
(EAR) and the Branch Trace Buffer.<br />
1.2 pfmon command line options<br />
Here is a summary of command line options<br />
used in the examples later in this paper:<br />
--us-c use the US-style comma separator for<br />
large numbers.<br />
--cpu-list=0 bind pfmon to CPU 0 and only<br />
count on CPU 0<br />
--pin-command bind the command at the end<br />
of the command line to the same CPU as<br />
pfmon.<br />
--resolve-addr look up addresses and print the<br />
symbols<br />
--long-smpl-periods=2000 take a sample of<br />
every 2000th event.<br />
--smpl-periods-random=0xfff:10 randomize<br />
the sampling period. This is necessary<br />
to avoid bias when sampling repetitive<br />
behaviors. <strong>The</strong> first value is the mask<br />
of bits to randomize (e.g., 0xfff) and the<br />
second value is the initial seed (e.g., 10).<br />
-k kernel only.
--system-wide measure the entire system (all<br />
processes and kernel)<br />
Parameters only available on a to-be-released<br />
pfmon v3.1:<br />
--smpl-module=dear-hist-itanium2 This particular<br />
module is to be used ONLY in<br />
conjunction with the Data EAR (Event<br />
Address Registers) and presents recorded<br />
samples as histograms about the cache<br />
misses. By default, the information is presented<br />
in the instruction view but it is possible<br />
to get the data view of the misses<br />
also.<br />
-e data_ear_cache_lat64 pseudo event for<br />
memory loads with latency ≥ 64 cycles.<br />
<strong>The</strong> real event is DATA_EAR_EVENT<br />
(counts the number of times Data EAR<br />
has recorded something) and the pseudo<br />
event expresses the latency filter for the<br />
event. Use “pfmon -ldata_ear_<br />
cache*” to list all valid values. Valid<br />
values with McKinley CPU are powers of<br />
two (4 – 4096).<br />
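As a purely illustrative combination of the options above (my_app stands in for whatever workload is being measured, and --smpl-module requires the pfmon v3.1 noted above), a sampling run for long-latency data cache misses might look like:<br />
pfmon --us-c --cpu-list=0 --pin-command -k --resolve-addr \<br />
&nbsp;&nbsp;-e data_ear_cache_lat64 --long-smpl-periods=2000 \<br />
&nbsp;&nbsp;--smpl-periods-random=0xfff:10 --smpl-module=dear-hist-itanium2 my_app<br />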
1.3 q-tools<br />
<strong>The</strong> author of q-tools, David Mosberger [5],<br />
has described q-tools as “gprof without the<br />
pain.”<br />
The q-tools package contains q-syscollect,<br />
q-view, qprof, and q-dot.<br />
q-syscollect collects profile information<br />
using the kernel perfmon subsystem to<br />
sample the PMU. q-view will present the<br />
data collected in both flat-profile and call<br />
graph form. q-dot displays the call-graph<br />
in graphical form. Please see the qprof [6]<br />
website for details on qprof.<br />
q-syscollect depends on the kernel perfmon<br />
subsystem which is included in all 2.6<br />
<strong>Linux</strong> kernels. Because q-syscollect uses<br />
the PMU, it has the following advantages over<br />
other tools:<br />
• no special kernel support needed (besides<br />
perfmon subsystem).<br />
• provides call-graph of kernel functions<br />
• can collect call-graphs of the kernel while<br />
interrupts are blocked.<br />
• measures multi-threaded applications<br />
• data is collected per-CPU and can be<br />
merged<br />
• instruction level granularity (not bundles)<br />
2 Measuring MMIO Reads<br />
Nearly every driver uses MMIO reads to either<br />
flush MMIO writes, flush in-flight DMA,<br />
or (most obviously) collect status data from the<br />
IO device directly. While use of MMIO read is<br />
necessary in most cases, it should be avoided<br />
where possible.<br />
2.1 Why worry about MMIO Reads?<br />
MMIO reads are expensive—how expensive<br />
depends on the speed of the IO bus, the number of<br />
bridges the read (and its corresponding read return)<br />
has to cross, how “busy” each bus is, and<br />
finally how quickly the device responds to the<br />
read request. On most architectures, one can<br />
precisely measure the cost by measuring a loop<br />
of MMIO reads and calling get_cycles()<br />
before/after the loop.<br />
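For example, a throwaway kernel hack along the following lines (the function name and the choice of a harmless register are illustrative assumptions, not code from any real driver) reports the average per-read cost:<br />
#include &lt;asm/io.h&gt;        /* readl() */<br />
#include &lt;linux/timex.h&gt;   /* get_cycles(), cycles_t */<br />
#include &lt;linux/kernel.h&gt;  /* printk() */<br />
<br />
/* Time a burst of MMIO reads from a register that is safe to read<br />
 * repeatedly, and report the average cost of one read in CPU cycles. */<br />
static void measure_mmio_read_cost(void __iomem *harmless_reg)<br />
{<br />
&nbsp;&nbsp;&nbsp;&nbsp;cycles_t before, after;<br />
&nbsp;&nbsp;&nbsp;&nbsp;int i;<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;before = get_cycles();<br />
&nbsp;&nbsp;&nbsp;&nbsp;for (i = 0; i &lt; 1024; i++)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;readl(harmless_reg);  /* uncached load; CPU waits for the read return */<br />
&nbsp;&nbsp;&nbsp;&nbsp;after = get_cycles();<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;printk(KERN_INFO "MMIO read cost: ~%lu cycles\n",<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(unsigned long)(after - before) / 1024);<br />
}<br />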
I’ve measured anywhere from 1µs to 2µs per<br />
read. In practical terms:<br />
• ∼ 500–600 cycles on an otherwise-idle<br />
400 MHz PA-RISC machine.
• ∼ 1000 cycles on a 450 MHz Pentium machine<br />
which included crossing a PCI-PCI<br />
bridge.<br />
• ∼ 900–1000 cycles on a 800 MHz IA64<br />
HP ZX1 machine.<br />
And for those who still don’t believe me, try<br />
watching a DVD movie after turning DMA off<br />
for an IDE DVD player:<br />
hdparm -d 0 /dev/cdrom<br />
By switching the IDE controller to use PIO<br />
(Programmed I/O) mode, all data will be transferred<br />
to/from host memory under CPU control,<br />
a byte (or word) at a time. pfmon can measure<br />
this. And pfmon looks broken when it<br />
displays three- and four-digit “Average Cycles<br />
Per Instruction” (CPI) output.<br />
2.2 Eh? Memory Reads don’t stall?<br />
<strong>The</strong>y do. But the CPU and PMU don’t “realize”<br />
the stall until the next memory reference.<br />
<strong>The</strong> CPU continues execution until memory order<br />
is enforced by the acquire semantics in the<br />
MMIO read. This means the Data Event Address<br />
Registers record the next stalled memory<br />
reference due to memory ordering constraints,<br />
not the MMIO read. <strong>One</strong> has to look<br />
at the instruction stream carefully to determine<br />
which instruction actually caused the stall.<br />
This also means the following sequence<br />
doesn’t work exactly like we expect:<br />
writel(CMD, addr);   /* post the command to the device */<br />
readl(addr);         /* meant to flush the write; the value is never consumed */<br />
udelay(1);           /* may start before the read return arrives */<br />
y = buf->member;     /* first ordered memory reference after the read */<br />
<strong>The</strong> problem is the value returned by<br />
readl(x) is never consumed. Memory<br />
ordering imposes no constraint on non-load/store<br />
instructions. Hence udelay(1)<br />
begins before the CPU stalls. <strong>The</strong> CPU will<br />
stall on buf->member because of memory<br />
ordering restrictions if the udelay(1) completes<br />
before readl(x) is retired. Drop the<br />
udelay(1) call and pfmon will always see<br />
the stall caused by MMIO reads on the next<br />
memory reference.<br />
Unfortunately, the IA32 Software Developer’s<br />
Manual[3] Volume 3, Chapter 7.2 “MEMORY<br />
ORDERING” is silent on the issue of how<br />
MMIO (uncached accesses) will (or will not)<br />
stall the instruction stream. This document<br />
is very clear on how “IO Operations” (e.g.,<br />
IN/OUT) will stall the instruction pipeline until<br />
the read return arrives at the CPU. A direct response<br />
from Intel(R) indicated readl() does<br />
not stall like IN or OUT do and IA32 has the<br />
same problem. <strong>The</strong> Intel® architect who responded<br />
did hedge the above statement claiming<br />
a “udelay(10) will be as close as expected”<br />
for an example similar to mine. Anyone who<br />
has access to a frontside bus analyzer can verify<br />
the above statement by measuring timing<br />
loops between uncached accesses. I’m not that<br />
privileged and have to trust Intel® in this case.<br />
For IA64, we considered putting an extra burden<br />
on udelay to stall the instruction stream<br />
until previous memory references were retired.<br />
We could use dummy loads/stores before and<br />
after the actual delay loop so memory ordering<br />
could be used to stall the instruction pipeline.<br />
That seemed excessive for something that we<br />
didn’t have a bug report for.<br />
Consensus was adding mf.a (memory fence)<br />
instruction to readl() should be sufficient.<br />
<strong>The</strong> architecture only requires mf.a serve as<br />
an ordering token and need not cause any delays<br />
of its own. In other words, the implementation<br />
is platform specific. mf.a has not<br />
been added to readl() yet because everything<br />
was working without it so far.<br />
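For reference, a minimal sketch of what such a change could look like is<br />
shown below. This is my illustration, not the actual ia64 readl()<br />
implementation, and readl_with_mfa() is a made-up name:<br />
static inline unsigned int readl_with_mfa(volatile void *addr)<br />
{<br />
        unsigned int val = *(volatile unsigned int *)addr;<br />
        /* mf.a is an ordering token only; any delay is platform specific */<br />
        asm volatile ("mf.a" ::: "memory");<br />
        return val;<br />
}<br />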
2.3 pfmon -e uc_loads_retired<br />
IO accesses are generally the only uncached<br />
references made on IA64-linux and normally<br />
will represent MMIO reads. <strong>The</strong> basic measurement<br />
will tell us roughly how many cycles<br />
the CPU stalls for MMIO reads. Get the number<br />
of MMIO reads per sample period and then<br />
multiply by the actual cycle counts a MMIO<br />
read takes for the given device. <strong>One</strong> needs to<br />
measure MMIO read cost by using a CPU internal<br />
cycle counter and hacking the kernel to<br />
read a harmless address from the target device<br />
a few thousand times.<br />
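A minimal sketch of such a hack, assuming a hypothetical ioremap()ed<br />
register safe_reg that has no read side effects, might look like this<br />
(my illustration, not code from any driver):<br />
/* uses readl() from asm/io.h and get_cycles() from asm/timex.h */<br />
static unsigned long mmio_read_cost(void *safe_reg, unsigned int n)<br />
{<br />
        cycles_t start, end;<br />
        unsigned int i;<br />
        start = get_cycles();<br />
        for (i = 0; i < n; i++)<br />
                (void) readl(safe_reg);   /* one uncached load per iteration */<br />
        end = get_cycles();<br />
        return (unsigned long)(end - start) / n;   /* average cycles per read */<br />
}<br />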
In order to make statements about per transaction<br />
or per interrupt, we need to know<br />
the cumulative number of transactions or<br />
interrupts processed for the sample period.<br />
pktgen is straightforward in this regard since<br />
pktgen will print transaction statistics when<br />
a run is terminated. And one can record<br />
/proc/interrupts contents before and<br />
after each pfmon run to collect interrupt<br />
events as well.<br />
A drawback to the above is that one assumes a homogeneous<br />
driver environment; i.e., only one<br />
type of driver is under load during the test. I<br />
think that’s a fair assumption for development<br />
in most cases. Bridges (e.g., routing traffic<br />
across different interconnects) are probably the<br />
one case it’s not true. <strong>One</strong> has to work a bit<br />
harder to figure out what the counts mean in<br />
that case.<br />
For other benchmarks, like SpecWeb, we want<br />
to grab /proc/interrupts and networking<br />
stats before/after pfmon runs.<br />
2.4 tg3 Memory Reads<br />
In summary, Figure 1 shows tg3 is doing<br />
2749675/(1834959 − 918505) ≈ 3<br />
MMIO reads per interrupt and averaging about<br />
5000000/(1834959 − 918505) ≈ 5 packets<br />
per interrupt. This is with the BCM5701 chip<br />
running in PCI mode at 66MHz:64-bit.<br />
Based on code inspection, here is a break down<br />
of where the MMIO reads occur in temporal<br />
order:<br />
1. tg3_interrupt() flushes MMIO<br />
write to MAILBOX_INTERRUPT_0<br />
2. tg3_poll() → tg3_enable_<br />
ints() → tw32(TG3PCI_MISC_<br />
HOST_CTRL)<br />
3. tg3_enable_ints() flushes MMIO<br />
write to MAILBOX_INTERRUPT_0<br />
It’s obvious when inspecting tw32() that the<br />
BCM5701 chip has a serious bug. Every call<br />
to tw32() on BCM5701 requires a MMIO<br />
read to follow the MMIO write. Only writes to<br />
mailbox registers don’t require this and a different<br />
routine is used for mailbox writes.<br />
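The pattern looks roughly like the sketch below. It is modeled loosely on<br />
the behaviour described for tg3 and is not the driver’s actual code; the<br />
structure and field names are my own:<br />
struct nic_like {<br />
        void *regs;      /* ioremap()ed register window */<br />
        int   is_5701;   /* chip needs the read-back workaround */<br />
};<br />
static void tw32_like(struct nic_like *tp, unsigned long off, u32 val)<br />
{<br />
        writel(val, tp->regs + off);<br />
        if (tp->is_5701)<br />
                (void) readl(tp->regs + off);   /* the costly extra MMIO read */<br />
}<br />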
Given the NIC was designed for zero MMIO<br />
reads, this is pretty poor performance. Using<br />
a BCM5703 or BCM5704 would avoid the<br />
MMIO read in tw32().<br />
I’ve exchanged email with David Miller and<br />
Jeff Garzik (tg3 driver maintainers). <strong>The</strong>y have<br />
valid concerns with portability. We agree tg3<br />
could be reduced to one MMIO read after the<br />
last MMIO write (to guarantee interrupts get<br />
re-enabled).<br />
<strong>One</strong> would need to use the “tag” field in the<br />
status block when writing the mailbox register<br />
to indicate which “tag” the CPU most recently saw.<br />
gsyprf3:~# pfmon -e uc_loads_retired -k --system-wide \<br />
-- /usr/src/pktgen-testing/pktgen-single-tg3<br />
Adding devices to run.<br />
Configuring devices<br />
Running... ctrl^C to stop<br />
57: 918505 0 IO-SAPIC-level eth1<br />
Result: OK: 7613693(c7613006+d687) usec, 5000000 (64byte) 656771pps 320Mb/sec<br />
(336266752bps) errors: 0<br />
57: 1834959 0 IO-SAPIC-level eth1<br />
CPU0<br />
2749675 UC_LOADS_RETIRED<br />
CPU1<br />
1175 UC_LOADS_RETIRED<br />
Figure 1: tg3 v3.6 MMIO reads with pktgen/IRQ on same CPU<br />
Using Message Signaled Interrupts (MSI)<br />
instead of Line based IRQs would guarantee<br />
the most recent status block update (transferred<br />
via DMA writes) would be visible to the CPU<br />
before tg3_interrupt() gets called.<br />
<strong>The</strong> protocol would allow correct operation<br />
without using MSI, too.<br />
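A rough sketch of such a tag-based handler is shown below. This is my<br />
illustration of the protocol, not tg3 code; the structure, the field names,<br />
and the tag shift are assumptions (the mailbox offset field stands in for<br />
MAILBOX_INTERRUPT_0 mentioned above):<br />
/* needs linux/interrupt.h and asm/io.h */<br />
struct status_blk { u32 status_tag; /* plus ring indices, etc. */ };<br />
struct tagged_nic {<br />
        void *regs;                     /* ioremap()ed registers */<br />
        unsigned long mbox_int0;        /* offset of MAILBOX_INTERRUPT_0 */<br />
        struct status_blk *status_blk;  /* DMA target written by the NIC */<br />
};<br />
static irqreturn_t tag_based_interrupt(int irq, void *dev_id,<br />
                                       struct pt_regs *regs)<br />
{<br />
        struct tagged_nic *tp = dev_id;<br />
        u32 tag = tp->status_blk->status_tag;   /* DMAed here by the NIC */<br />
        /* ... process the RX/TX work described by the status block ... */<br />
        /* record the consumed tag; the NIC re-interrupts only if a newer<br />
           tag (newer work) exists, so no trailing MMIO read is required */<br />
        writel(tag << 24, tp->regs + tp->mbox_int0);<br />
        return IRQ_HANDLED;<br />
}<br />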
2.5 Benchmarking, pfmon, and CPU bindings<br />
<strong>The</strong> purpose of binding pktgen to CPU1 is<br />
to verify the transmit code path is NOT doing<br />
any MMIO reads. We split the transmit code<br />
path and interrupt handler across CPUs to narrow<br />
down which code path is performing the<br />
MMIO reads. This change is not obvious from<br />
Figure 2 output since tg3 only performs MMIO<br />
reads from CPU 0 (tg3_interrupt()).<br />
But in Figure 2, performance goes up 30%!<br />
Offhand, I don’t know if this is due to CPU<br />
utilization (pktgen and tg3_interrupt()<br />
contending for CPU cycles) or if DMA is more<br />
efficient because of cache-line flows. When I<br />
don’t have any deadlines looming, I’d like to<br />
determine the difference.<br />
2.6 e1000 Memory Reads<br />
e1000 version 5.2.52-k4 has a more efficient<br />
implementation than the tg3 driver. In a nutshell,<br />
MMIO reads are pretty much irrelevant to the<br />
pktgen workload with the e1000 driver using default<br />
values.<br />
Figure 3 shows e1000 performs<br />
173315/(703829 − 622143) ≈ 2 MMIO<br />
reads per interrupt and 5000000/(703829 −<br />
622143) ≈ 61 packets per interrupt.<br />
Being the curious soul I am, I tracked down<br />
the two MMIO reads anyway. <strong>One</strong> is in the interrupt<br />
handler and the second when interrupts<br />
are re-enabled. It looks like e1000 will always<br />
need at least 2 MMIO reads per interrupt.<br />
3 Measuring MMIO Writes<br />
3.1 Why worry about MMIO Writes?<br />
MMIO writes are clearly not as significant as<br />
MMIO reads. Nonetheless, every time a driver<br />
writes to MMIO space, some subtle things happen.<br />
<strong>The</strong>re are four minor issues to think about:<br />
memory ordering, PCI bus utilization, filling<br />
outbound write queues, and stalling MMIO<br />
reads longer than necessary.
gsyprf3:~# pfmon -e uc_loads_retired -k --system-wide \<br />
-- /usr/src/pktgen-testing/pktgen-single-tg3<br />
Adding devices to run.<br />
Configuring devices<br />
Running... ctrl^C to stop<br />
57: 5809687 0 IO-SAPIC-level eth1<br />
Result: OK: 5914889(c5843865+d71024) usec, 5000000 (64byte) 845451pps 412Mb/se<br />
c (432870912bps) errors: 0<br />
57: 6427969 0 IO-SAPIC-level eth1<br />
CPU0<br />
1855253 UC_LOADS_RETIRED<br />
CPU1<br />
950 UC_LOADS_RETIRED<br />
Figure 2: tg3 v3.6 MMIO reads with pktgen/IRQ on diff CPU<br />
gsyprf3:~# pfmon -e uc_loads_retired -k --system-wide \<br />
-- /usr/src/pktgen-testing/pktgen-single-e1000<br />
Configuring devices<br />
Running... ctrl^C to stop<br />
59: 622143 0 IO-SAPIC-level eth3<br />
Result: OK: 10228738(c9990105+d238633) usec, 5000000 (64byte) 488854pps 238Mb/<br />
sec (250293248bps) errors: 81669<br />
59: 703829 0 IO-SAPIC-level eth3<br />
CPU0<br />
173315 UC_LOADS_RETIRED<br />
CPU1<br />
1422 UC_LOADS_RETIRED<br />
Figure 3: MMIO reads for e1000 v5.2.52-k4
First, memory ordering is enforced since PCI<br />
requires strong ordering of MMIO writes. This<br />
means the MMIO write will push all previous<br />
regular memory writes ahead. This is not a serious<br />
issue but it can make a MMIO write take<br />
longer.<br />
MMIO writes are short transactions (i.e., much<br />
less than a cache-line). <strong>The</strong> PCI bus setup time<br />
to select the device, send the target address and<br />
data, and disconnect measurably reduces PCI<br />
bus utilization. It typically results in six or<br />
more PCI bus cycles to send four (or eight)<br />
bytes of data. On systems which strongly order<br />
DMA Read Returns and MMIO Writes, the<br />
latter will also interfere with DMA flows by interrupting<br />
in-flight, outbound DMA.<br />
If the IO bridge (e.g., PCI Bus controller) nearest<br />
the CPU has a full write queue, the CPU<br />
will stall. <strong>The</strong> bridge would normally queue<br />
the MMIO write and then tell the CPU it’s<br />
done. <strong>The</strong> chip designers normally make the<br />
write queue deep enough so the CPU never<br />
needs to stall. But drivers that perform many<br />
MMIO writes (e.g., use doorbells) and burst<br />
many MMIO writes at a time, could run into<br />
a worst case.<br />
<strong>The</strong> last concern, stalling MMIO reads longer<br />
than normal, exists because of PCI ordering<br />
rules. MMIO reads and MMIO writes are<br />
strongly ordered. E.g., if four MMIO writes<br />
are queued before a MMIO read, the read will<br />
wait until all four MMIO write transactions<br />
have completed. So instead of say 1000 CPU<br />
cycles, the MMIO read might take more than<br />
2000 CPU cycles on current platforms.<br />
3.2 pfmon -e uc_stores_retired<br />
pfmon counts MMIO Writes with no surprises.<br />
3.3 tg3 Memory Writes<br />
Figure 4 shows tg3 does about 10M MMIO<br />
writes to send 5M packets. However, we<br />
can break the MMIO writes down into base<br />
level (feed packets onto transmit queue) and<br />
tg3_interrupt which handles TX (and<br />
RX) completions. Knowing which code path<br />
the MMIO writes are in helps track down usage<br />
in the source code.<br />
Output in Figure 5 is after hacking the<br />
pktgen-single-tg3 script to bind the<br />
pktgen kernel thread to CPU 1 while<br />
eth1 is directing interrupts to CPU 0.<br />
<strong>The</strong> distribution between TX queue setup<br />
and interrupt handling is obvious now.<br />
CPU 0 is handling interrupts and performs<br />
3013580/(5803789 − 5201193) ≈ 5 MMIO<br />
writes per interrupt. CPU 1 is handling TX<br />
setup and performs 5000376/5000000 ≈ 1<br />
MMIO write per packet.<br />
Again, as noted in section 2.5, binding the pktgen<br />
thread to one CPU and interrupts to another<br />
changes the performance dramatically.<br />
3.4 e1000 Memory Writes<br />
Figure 6 shows 248891/(991082 −<br />
908366) ≈ 3 MMIO writes per interrupt<br />
and 5001303/5000000 ≈ 1 MMIO write<br />
per packet. In other words, slightly better than<br />
the tg3 driver. Nonetheless, the hardware can’t<br />
push as many packets. <strong>One</strong> difference is the<br />
e1000 driver is pushing data to a NIC behind a<br />
PCI-PCI Bridge.<br />
Figure 7 shows a ≈40% improvement in<br />
throughput 1 for pktgen without a PCI-PCI<br />
Bridge in the way. Note the ratios of MMIO<br />
writes per interrupt and MMIO writes per packet<br />
1 This demonstrates how the distance between the IO<br />
device and CPU (and memory) directly translates into<br />
latency and performance.
gsyprf3:~# pfmon -e uc_stores_retired -k --system-wide -- /usr/src/pktgen-test<br />
ing/pktgen-single-tg3<br />
Adding devices to run.<br />
Configuring devices<br />
Running... ctrl^C to stop<br />
57: 4284466 0 IO-SAPIC-level eth1<br />
Result: OK: 7611689(c7610900+d789) usec, 5000000 (64byte) 656943pps 320Mb/sec<br />
(336354816bps) errors: 0<br />
57: 5198436 0 IO-SAPIC-level eth1<br />
CPU0<br />
9570269 UC_STORES_RETIRED<br />
CPU1<br />
445 UC_STORES_RETIRED<br />
Figure 4: tg3 v3.6 MMIO writes<br />
gsyprf3:~# pfmon -e uc_stores_retired -k --system-wide -- /usr/src/<br />
pktgen-testing/pktgen-single-tg3<br />
Adding devices to run.<br />
Configuring devices<br />
Running... ctrl^C to stop<br />
57: 5201193 0 IO-SAPIC-level eth1<br />
Result: OK: 5880249(c5811180+d69069) usec, 5000000 (64byte) 850340pps 415Mb<br />
/sec (435374080bps) errors: 0<br />
57: 5803789 0 IO-SAPIC-level eth1<br />
CPU0<br />
3013580 UC_STORES_RETIRED<br />
CPU1<br />
5000376 UC_STORES_RETIRED<br />
Figure 5: tg3 v3.6 MMIO writes with pktgen/IRQ split across CPUs<br />
gsyprf3:~# pfmon -e uc_stores_retired -k --system-wide -- /usr/src/<br />
pktgen-testing/pktgen-single-e1000<br />
Running... ctrl^C to stop<br />
59: 908366 0 IO-SAPIC-level eth3<br />
Result: OK: 10340222(c10104719+d235503) usec, 5000000 (64byte) 483558pps 236Mb<br />
/sec (247581696bps) errors: 82675<br />
59: 991082 0 IO-SAPIC-level eth3<br />
CPU0<br />
248891 UC_STORES_RETIRED<br />
CPU1<br />
5001303 UC_STORES_RETIRED<br />
Figure 6: MMIO writes for e1000 v5.2.52-k4<br />
gsyprf3:~# pfmon -e uc_stores_retired -k --system-wide -- /usr/src/pktgen-test<br />
ing/pktgen-single-e1000<br />
Running... ctrl^C to stop<br />
71: 3 0 IO-SAPIC-level eth7<br />
Result: OK: 7491358(c7342756+d148602) usec, 5000000 (64byte) 667467pps 325Mb/s<br />
ec (341743104bps) errors: 59870<br />
71: 59907 0 IO-SAPIC-level eth7<br />
CPU0<br />
180406 UC_STORES_RETIRED<br />
CPU1<br />
5000939 UC_STORES_RETIRED<br />
Figure 7: e1000 v5.2.52-k4 MMIO writes without PCI-PCI Bridge
are the same. I doubt the MMIO<br />
reads and MMIO writes are the limiting factors.<br />
More likely DMA access to memory<br />
(and thus TX/RX descriptor rings) limits NIC<br />
packet processing.<br />
4 Measuring Cache-line Misses<br />
<strong>The</strong> Event Address Registers 2 (EAR) can only<br />
record one event at a time. What is so interesting<br />
about them is that they record precise information<br />
about data cache misses. For instance<br />
for a data cache miss, you get the:<br />
• address of the instruction, likely a load<br />
• address of the target data<br />
• latency in cycles to resolve the miss<br />
<strong>The</strong> information pinpoints the source of the<br />
miss, not the consequence (i.e., the stall).<br />
<strong>The</strong> Data EAR (DEAR) can also tell us about<br />
MMIO reads via sampling. <strong>The</strong> DEAR can<br />
only record loads that miss, not stores. Of<br />
course, MMIO reads always miss because they<br />
are uncached. This is interesting if we want to<br />
track down which MMIO addresses are “hot.”<br />
It’s usually easier to track down usage in source<br />
code knowing which MMIO address is referenced.<br />
Collecting with DEAR sampling requires two<br />
parameters be tweaked to statistically improve<br />
the samples. <strong>One</strong> is the frequency at which<br />
Data Addresses are recorded and the other is<br />
the threshold (how many CPU cycles latency).<br />
Because we know the latency to L3 is about<br />
21 cycles, setting the EAR threshold to a value<br />
higher (e.g., 64 cycles) ensures that only the load<br />
misses accessing main memory will be captured.<br />
This is how to select which level of<br />
cache-line misses one samples.<br />
2 pfmon v3.1 is the first version to support EAR<br />
and is expected to be available in August, 2004.<br />
While high thresholds (e.g., 64 cycles) will<br />
show us where the longest delays occur, it will<br />
not show us the worst offenders. Doing a second<br />
run with a lower threshold (e.g., 4 cycles)<br />
shows all L1, L2, and L3 cache misses and provides<br />
a much broader picture of cache utilization.<br />
When sampling events with low thresholds,<br />
we will get saturated with events and need to<br />
reduce the number of events actually sampled<br />
to every 5000th. <strong>The</strong> appropriate value will<br />
depend on the workload and how patient one<br />
is. <strong>The</strong> workload needs to be run long enough<br />
to be statistically significant and the sampling<br />
period needs to be high enough to not significantly<br />
perturb the workload.<br />
4.1 tg3 Data Cache misses > 64 cycles<br />
For the output in Figure 8, I’ve iteratively decreased<br />
the smpl-periods until I noticed the total<br />
pktgen throughput starting to drop. Figure<br />
8 output only shows the tg3 interrupt code<br />
path since pfmon is bound to CPU 0. Normally,<br />
it would be useful to run this again with<br />
cpu-list=1. We could then see what the<br />
TX code path and pktgen are doing.<br />
Also, the pin-command option in<br />
this example doesn’t do anything since<br />
pktgen-single-tg3 directs a pktgen<br />
kernel thread bound to CPU 1 to do the real<br />
work. I’ve included the option only to make<br />
people aware of it.<br />
4.2 tg3 Data Cache misses > 4 cycles<br />
Figure 9 puts the lat64 output in Figure 8<br />
into better perspective. It shows tg3 is spending<br />
more time for L1 and L2 misses than L3 misses
gsyprf3:~# pfmon31 --us-c --cpu-list=0 --pin-command --resolve-addr \<br />
--smpl-module=dear-hist-itanium2 \<br />
-e data_ear_cache_lat64 --long-smpl-periods=500 \<br />
--smpl-periods-random=0xfff:10 --system-wide \<br />
-k -- /usr/src/pktgen-testing/pktgen-single-tg3<br />
added event set 0<br />
only kernel symbols are resolved in system-wide mode<br />
Adding devices to run.<br />
Configuring devices<br />
Running... ctrl^C to stop<br />
57: 7209769 0 IO-SAPIC-level eth1<br />
Result: OK: 5915877(c5845032+d70845) usec, 5000000 (64byte) 845308pps 412Mb/sec<br />
(432797696bps) errors: 0<br />
57: 7827812 0 IO-SAPIC-level eth1<br />
# total_samples 672<br />
# instruction addr view<br />
# sorted by count<br />
# showing per per distinct value<br />
# %L2 : percentage of L1 misses that hit L2<br />
# %L3 : percentage of L1 misses that hit L3<br />
# %RAM : percentage of L1 misses that hit memory<br />
# L2 : 5 cycles load latency<br />
# L3 : 12 cycles load latency<br />
# sampling period: 500<br />
#count %self %cum %L2 %L3 %RAM instruction addr<br />
38 5.65% 5.65% 0.00% 0.00% 100.00% 0xa000000100009141 ia64_spinlock_contention<br />
+0x21<br />
36 5.36% 11.01% 0.00% 0.00% 100.00% 0xa00000020003e580 tg3_interrupt[tg3]+0xe0<br />
32 4.76% 15.77% 0.00% 0.00% 100.00% 0xa000000200034770 tg3_write_indirect_reg32[tg3]<br />
+0x90<br />
32 4.76% 20.54% 0.00% 0.00% 100.00% 0xa00000020003e640 tg3_interrupt[tg3]+0x1a0<br />
30 4.46% 25.00% 0.00% 0.00% 100.00% 0xa000000200034e91 tg3_enable_ints[tg3]+0x91<br />
29 4.32% 29.32% 0.00% 0.00% 100.00% 0xa00000020003e510 tg3_interrupt[tg3]+0x70<br />
28 4.17% 33.48% 0.00% 0.00% 100.00% 0xa00000020003d1a0 tg3_tx[tg3]+0x2e0<br />
27 4.02% 37.50% 0.00% 0.00% 100.00% 0xa00000020003cfa0 tg3_tx[tg3]+0xe0<br />
24 3.57% 41.07% 0.00% 0.00% 100.00% 0xa00000020003cfd1 tg3_tx[tg3]+0x111<br />
21 3.12% 44.20% 0.00% 0.00% 100.00% 0xa000000200034e60 tg3_enable_ints[tg3]+0x60<br />
.<br />
.<br />
.<br />
# level 0 : counts=0 avg_cycles=0.0ms 0.00%<br />
# level 1 : counts=0 avg_cycles=0.0ms 0.00%<br />
# level 2 : counts=672 avg_cycles=0.0ms 100.00%<br />
approx cost: 0.0s<br />
Figure 8: tg3 v3.6 lat64 output
gsyprf3:~# pfmon31 --us-c --cpu-list=0 --resolve-addr --smpl-module=dear-hist-itanium2 \<br />
-e data_ear_cache_lat4 --long-smpl-periods=5000 --smpl-periods-random=0xfff:10 \<br />
--system-wide -k -- /usr/src/pktgen-testing/pktgen-single-tg3<br />
added event set 0<br />
only kernel symbols are resolved in system-wide mode<br />
Adding devices to run.<br />
Configuring devices<br />
Running... ctrl^C to stop<br />
57: 8484552 0 IO-SAPIC-level eth1<br />
Result: OK: 5938001(c5866437+d71564) usec, 5000000 (64byte) 842034pps 411Mb/sec<br />
(431121408bps) errors: 0<br />
57: 9093642 0 IO-SAPIC-level eth1<br />
# total_samples 795<br />
# instruction addr view<br />
# sorted by count<br />
# showing per per distinct value<br />
# %L2 : percentage of L1 misses that hit L2<br />
# %L3 : percentage of L1 misses that hit L3<br />
# %RAM : percentage of L1 misses that hit memory<br />
# L2 : 5 cycles load latency<br />
# L3 : 12 cycles load latency<br />
# sampling period: 5000<br />
# #count %self %cum %L2 %L3 %RAM instruction addr<br />
95 11.95% 11.95% 0.00% 98.95% 1.05% 0xa00000020003d150 tg3_tx[tg3]+0x290<br />
83 10.44% 22.39% 93.98% 4.82% 1.20% 0xa00000020003d030 tg3_tx[tg3]+0x170<br />
21 2.64% 25.03% 0.00% 95.24% 4.76% 0xa0000001000180f0 ia64_handle_irq+0x170<br />
20 2.52% 27.55% 5.00% 80.00% 15.00% 0xa00000020003d040 tg3_tx[tg3]+0x180<br />
18 2.26% 29.81% 50.00% 11.11% 38.89% 0xa00000020003cfa0 tg3_tx[tg3]+0xe0<br />
17 2.14% 31.95% 0.00% 0.00% 100.00% 0xa00000020003e671 tg3_interrupt[tg3]<br />
+0x1d1<br />
17 2.14% 34.09% 0.00% 100.00% 0.00% 0xa00000020003e700 tg3_interrupt[tg3]<br />
+0x260<br />
16 2.01% 36.10% 56.25% 43.75% 0.00% 0xa000000100012160 ia64_leave_kernel<br />
+0x180<br />
16 2.01% 38.11% 62.50% 0.00% 37.50% 0xa00000020003cf60 tg3_tx[tg3]+0xa0<br />
15 1.89% 40.00% 86.67% 6.67% 6.67% 0xa00000020003cfd0 tg3_tx[tg3]+0x110<br />
15 1.89% 41.89% 0.00% 0.00% 100.00% 0xa000000100016041 do_IRQ+0x1a1<br />
15 1.89% 43.77% 0.00% 53.33% 46.67% 0xa00000020003e370 tg3_poll[tg3]+0x350<br />
.<br />
.<br />
.<br />
# level 0 : counts=226 avg_cycles=0.0ms 28.43%<br />
# level 1 : counts=264 avg_cycles=0.0ms 33.21%<br />
# level 2 : counts=305 avg_cycles=0.0ms 38.36%<br />
approx cost: 0.0s<br />
Figure 9: tg3 v3.6 lat4 output
and in only two locations. Adding one prefetch<br />
to pull data from L3 into L2 would help for the<br />
top offender. <strong>One</strong> needs to figure out which bit<br />
of data each recorded access refers to and determine<br />
how early one can prefetch that data.<br />
We can also rule out MMIO accesses as the top<br />
culprit. tg3_interrupt+0x1d1 could be<br />
an MMIO read but it doesn’t show up in Figure<br />
8 like tg3_write_indirect_reg32<br />
does.<br />
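As an illustration only (not the actual fix; the ring and index names are<br />
my own assumptions, loosely modeled on tg3_tx()), a software prefetch<br />
issued at the top of the completion loop could start pulling the next<br />
entry toward L2/L1 well before it is dereferenced:<br />
/* prefetch() comes from linux/prefetch.h */<br />
while (sw_idx != hw_idx) {<br />
        /* start fetching the entry the next iteration will touch */<br />
        prefetch(&tp->tx_buffers[(sw_idx + 1) % TX_RING_SIZE]);<br />
        /* ... unmap and free tp->tx_buffers[sw_idx].skb ... */<br />
        sw_idx = (sw_idx + 1) % TX_RING_SIZE;<br />
}<br />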
Note smpl-periods is 10x higher in Figure<br />
9 than in Figure 8. Collecting 10x more<br />
samples with lat4 definitely disturbs the<br />
workload.<br />
5 q-tools<br />
q-syscollect and q-view are trivial to<br />
use. An example and brief explanation for kernel<br />
usage follow.<br />
Please remember that most applications spend most<br />
of their time in user space and not in the kernel.<br />
q-tools is especially good in user space.<br />
5.1 q-syscollect<br />
q-syscollect -c 5000 -C 5000 -t 20 -k<br />
This will collect system-wide kernel data during<br />
the 20-second period. Twenty to thirty seconds<br />
is usually long enough to get sufficient accuracy 3 .<br />
However, if the workload generates<br />
a very wide call graph with even distribution,<br />
one will likely need to sample for longer periods<br />
to get accuracy in the ±1% range. When<br />
in doubt, try sampling for longer periods to see<br />
if the call-counts change significantly.<br />
3 See Page 7 of David Mosberger’s Gelato talk<br />
[4] for a nice graph on accuracy, which only applies to<br />
his example.<br />
<strong>The</strong> -c and -C set the call sample rate and<br />
code sample rate respectively. <strong>The</strong> call sample<br />
rate is used to collect function call counts.<br />
This is one of the key differences compared to<br />
traditional profiling tools: q-syscollect obtains<br />
call-counts in a statistical fashion, just as has<br />
been done traditionally for the execution-time<br />
profile. <strong>The</strong> code sample rate is used to collect<br />
a flat profile (CPU_CYCLES by default).<br />
<strong>The</strong> -e option allows one to change the event<br />
used to sample for the flat profile. <strong>The</strong> default<br />
is to sample the CPU_CYCLES event. This provides<br />
traditional execution time in the flat profile.<br />
<strong>The</strong> data is stored in the current directory under<br />
the .q/ directory. <strong>The</strong> next section demonstrates<br />
how q-view displays the data.<br />
5.2 q-view<br />
I was running the netperf [7] TCP_RR test in<br />
the background to another server when I collected<br />
the following data. As Figure 10 shows,<br />
this particular TCP_RR test isn’t costing many<br />
cycles in the tg3 driver. Or, at least not ones I can<br />
measure.<br />
tg3_interrupt() shows up in the flat profile<br />
with 0.314 seconds time associated with<br />
it. <strong>The</strong> time measurement is only possible<br />
because handle_IRQ_event() re-enables<br />
interrupts if the IRQ handler is not registered<br />
with SA_INTERRUPT (to indicate latency<br />
sensitive IRQ handler). do_IRQ() and<br />
other functions in that same call graph do NOT<br />
have any time measurements because interrupts<br />
are disabled. As noted before, the callgraph<br />
is sampled using a different part of the<br />
PMU than the part which samples the flat profile.<br />
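For comparison, a driver that registers its handler with SA_INTERRUPT<br />
runs it with interrupts disabled, so no time would show up for that<br />
handler either. A hedged one-line example, where my_handler and dev are<br />
placeholders:<br />
err = request_irq(dev->irq, my_handler, SA_INTERRUPT, "mydev", dev);<br />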
Lastly, I’ve omitted the trailing output of<br />
q-view which explains the fields and<br />
columns more completely. Read that first before<br />
gsyprf3:~# q-view .q/kernel-cpu0.info | more<br />
Flat profile of CPU_CYCLES in kernel-cpu0.hist#0:<br />
Each histogram sample counts as 200.510u seconds<br />
% time self cumul calls self/call tot/call name<br />
68.88 13.41 13.41 215k 62.5u 62.5u default_idle<br />
2.90 0.56 13.97 431k 1.31u 1.31u finish_task_switch<br />
2.50 0.49 14.46 233k 2.09u 4.89u tg3_poll<br />
1.77 0.35 14.80 1.38M 251n 268n ipt_do_table<br />
1.61 0.31 15.12 240k 1.31u 1.31u tg3_interrupt<br />
1.51 0.29 15.41 240k 1.22u 5.95u net_rx_action<br />
.<br />
.<br />
.<br />
Call-graph table:<br />
index %time self children called name<br />
<br />
[176] 69.4 30.5m 13.4 - cpu_idle<br />
29.5m 0.285 231k/457k schedule [164]<br />
10.0m 0.00 244k/244k check_pgt_cache [178]<br />
13.4 0.00 215k/215k default_idle [177]<br />
----------------------------------------------------<br />
.<br />
.<br />
.<br />
----------------------------------------------------<br />
0.293 1.14 240k __do_softirq [40]<br />
[56] 7.4 0.293 1.14 240k net_rx_action<br />
0.487 0.649 233k/233k tg3_poll [57]<br />
----------------------------------------------------<br />
0.487 0.649 233k net_rx_action [56]<br />
[57] 5.9 0.487 0.649 233k tg3_poll<br />
- 0.00 229k/229k tg3_enable_ints [133]<br />
97.7m 0.552 225k/225k tg3_rx [61]<br />
- 0.00 227k/227k tg3_tx [58]<br />
----------------------------------------------------<br />
.<br />
.<br />
.<br />
----------------------------------------------------<br />
- 1.88 348k ia64_leave_kernel [10]<br />
[11] 9.7 - 1.88 348k ia64_handle_irq<br />
- 1.52 239k/240k do_softirq [39]<br />
- 0.367 356k/356k do_IRQ [12]<br />
----------------------------------------------------<br />
.<br />
.<br />
.<br />
Figure 10: q-view output for TCP_RR over tg3 v3.6
going through the rest of the output.<br />
6 Conclusion<br />
6.1 More pfmon examples<br />
CPU L2 cache misses in one kernel function<br />
pfmon --verb -k \<br />
--irange=sba_alloc_range \<br />
-el2_misses --system-wide \<br />
--session-timeout=10<br />
Show all L2 cache misses in<br />
sba_alloc_range. This is interesting<br />
since sba_alloc_range() walks<br />
a bitmap to look for “free” resources.<br />
<strong>One</strong> can instead specify -el3_misses<br />
since L3 cache misses are much more<br />
expensive.<br />
CPU 1 memory loads<br />
pfmon --us-c \<br />
--cpu-list=1 \<br />
-e loads_retired \<br />
-k --system-wide \<br />
-- /tmp/pktgen-single<br />
Only count memory loads on CPU 1. This is<br />
useful when we can bind the interrupt<br />
to CPU 1 and the workload to a different<br />
CPU. This lets us separate interrupt path<br />
from base level code, i.e., when is the<br />
load happening (before or after DMA<br />
occurred) and which code path should<br />
one be looking more closely at.<br />
List EAR events supported<br />
pfmon -lear<br />
List all EAR types supported by pfmon 4 .<br />
More info on Event<br />
pfmon -i DATA_EAR_TLB_ALL<br />
pfmon can provide more info on particular events it supports.<br />
4 EAR isn’t supported until pfmon v3.1<br />
6.2 And thanks to. . .<br />
Special thanks to Stephane Eranian [2] for dedicating<br />
so much time to the perfmon kernel<br />
driver and associated tools. People might think<br />
the PMU does it all—but only with a lot of SW<br />
driving it. His review of this paper caught some<br />
good bloopers. This talk only happened because<br />
I sit across the aisle from him and could<br />
pester him regularly.<br />
Thanks to David Mosberger[5] for putting together<br />
q-tools and making it so trivial to use.<br />
In addition, in no particular order:<br />
Christophe de Dinechin, Bjorn Helgaas,<br />
Matthew Wilcox, Andrew Patterson, Al Stone,<br />
Asit Mallick, and James Bottomley for reviewing<br />
this document or providing technical guidance.<br />
Thanks also to the OLS staff for making this<br />
event happen every year.<br />
My apologies if I omitted other contributors.<br />
References<br />
[1] perfmon homepage, http://www.hpl.hp.com/research/linux/perfmon/<br />
[2] Stephane Eranian, http://www.gelato.org/community/gelato_meeting.php?id=CU2004#talk22<br />
[3] <strong>The</strong> IA-32 Intel(R) Architecture Software Developer’s Manuals, http://www.intel.com/design/pentium4/manuals/253668.htm<br />
[4] q-tools homepage, http://www.hpl.hp.com/research/linux/q-tools/<br />
[5] David Mosberger, http://www.gelato.org/community/gelato_meeting.php?id=CU2004#talk19<br />
[6] qprof homepage, http://www.hpl.hp.com/research/linux/qprof/<br />
[7] netperf homepage, http://www.netperf.org/<br />
Carrier Grade Server Features in the <strong>Linux</strong> <strong>Kernel</strong><br />
Towards <strong>Linux</strong>-based Telecom Platforms<br />
Ibrahim Haddad<br />
Ericsson Research<br />
ibrahim.haddad@ericsson.com<br />
Abstract<br />
Traditionally, communications and data service<br />
networks were built on proprietary platforms<br />
that had to meet very specific availability,<br />
reliability, performance, and service response<br />
time requirements. Today, communication<br />
service providers are challenged to meet<br />
their needs cost-effectively for new architectures,<br />
new services, and increased bandwidth,<br />
with highly available, scalable, secure, and<br />
reliable systems that have predictable performance<br />
and that are easy to maintain and upgrade.<br />
This paper presents the technological<br />
trend of migrating from proprietary to open<br />
platforms based on software and hardware<br />
building blocks. It also focuses on the ongoing<br />
work by the Carrier Grade <strong>Linux</strong> working<br />
group at the Open Source Development Labs,<br />
examines the CGL architecture, the requirements<br />
from the latest specification release, and<br />
presents some of the needed kernel features<br />
that are not currently supported by <strong>Linux</strong> such<br />
as a <strong>Linux</strong> cluster communication mechanism,<br />
a low-level kernel mechanism for improved reliability<br />
and soft-realtime performance, support<br />
for multi-FIB, and support for additional<br />
security mechanisms.<br />
1 Open platforms<br />
<strong>The</strong> demand for rich media and enhanced<br />
communication services is rapidly leading to<br />
significant changes in the communication industry,<br />
such as the convergence of data and<br />
voice technologies. <strong>The</strong> transition to packet-based,<br />
converged, multi-service IP networks<br />
requires a carrier grade infrastructure based on<br />
interoperable hardware and software building<br />
blocks, management middleware, and applications,<br />
implemented with standard interfaces.<br />
<strong>The</strong> communication industry is witnessing a<br />
technology trend moving away from proprietary<br />
systems toward open and standardized<br />
systems that are built using modular and flexible<br />
hardware and software (operating system<br />
and middleware) common off-the-shelf components.<br />
<strong>The</strong> trend is to move forward by delivering<br />
next generation and multimedia communication<br />
services, using open standard carrier<br />
grade platforms. This trend is motivated<br />
by the expectation that open platforms are going<br />
to reduce the cost and risk of developing<br />
and delivering rich communications services.<br />
Also, they will enable faster time to market and<br />
ensure portability and interoperability between<br />
various components from different providers.<br />
<strong>One</strong> frequently asked question is: ’How can we<br />
meet tomorrow’s requirements using existing<br />
infrastructures and technologies?’. Proprietary<br />
platforms are closed systems, expensive to develop,<br />
and often lack support of the current<br />
and upcoming standards. Using such closed<br />
platforms to meet tomorrow’s requirements for<br />
new architectures and services is almost impossible.<br />
A uniform open software environment<br />
with the characteristics demanded by telecom
applications, combined with commercial off-the-shelf<br />
software and hardware components,<br />
is a necessary part of these new architectures.<br />
<strong>The</strong> following key industry consortia are defining<br />
hardware and software high availability<br />
specifications that are directly related to telecom<br />
platforms:<br />
1. <strong>The</strong> PCI Industrial Computer Manufacturers<br />
Group [1] (PICMG) defines standards<br />
for high availability (HA) hardware.<br />
2. <strong>The</strong> Open Source Development Labs [2]<br />
(OSDL) Carrier Grade <strong>Linux</strong> [3] (CGL)<br />
working group was established in January<br />
2002 with the goal of enhancing the<br />
<strong>Linux</strong> operating system, to achieve an<br />
Open Source platform that is highly available,<br />
secure, scalable and easily maintained,<br />
suitable for carrier grade systems.<br />
3. <strong>The</strong> Service Availability Forum [4] (SA<br />
Forum) defines the interfaces of HA middleware<br />
and focuses on APIs for hardware<br />
platform management and for application<br />
failover in the application API. SA<br />
compliant middleware will provide services<br />
to an application that needs to be HA<br />
in a portable way.<br />
2 <strong>The</strong> term Carrier Grade<br />
In this paper, we refer to the term Carrier Grade<br />
on many occasions. Carrier grade is a term<br />
for public network telecommunications products<br />
that require a reliability of up to 5<br />
or 6 nines of uptime; a quick arithmetic check<br />
of these figures follows the list below.<br />
• 5 nines refers to 99.999% of uptime per<br />
year (i.e., 5 minutes of downtime per<br />
year). This level of availability is usually<br />
associated with Carrier Grade servers.<br />
• 6 nines refers to 99.9999% of uptime per<br />
year (i.e., 30 seconds of downtime per<br />
year). This level of availability is usually<br />
associated with Carrier Grade switches.<br />
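As that quick check: a year has 365 × 24 × 60 ≈ 525,600 minutes, so<br />
0.001% downtime (five nines) is about 525,600 × 0.00001 ≈ 5.3 minutes per<br />
year, while 0.0001% downtime (six nines) is about 31.5 seconds per year.<br />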
3 <strong>Linux</strong> versus proprietary operating<br />
systems<br />
This section describes briefly the motivating<br />
reasons in favor of using <strong>Linux</strong> on Carrier<br />
Grade systems, versus continuing with proprietary<br />
operating systems. <strong>The</strong>se motivations include:<br />
• Cost: <strong>Linux</strong> is available free of charge in<br />
the form of a downloadable package from<br />
the Internet.<br />
• Source code availability: With <strong>Linux</strong>, you<br />
gain full access to the source code allowing<br />
you to tailor the kernel to your needs.<br />
Figure 1: From Proprietary to Open Solutions<br />
<strong>The</strong> operating system is a core component in<br />
such architectures. In the remainder of this paper,<br />
we will be focusing on CGL, its architecture<br />
and specifications.<br />
• Open development process (Figure 2):<br />
<strong>The</strong> development process of the kernel is<br />
open to anyone to participate and contribute.<br />
<strong>The</strong> process is based on the concept<br />
of "release early, release often."<br />
• Peer review and testing resources: With<br />
access to the source code, people using a
wide variety of platform, operating system,<br />
and compiler combinations can<br />
compile, link, and run the code on their<br />
systems to test for portability, compatibility<br />
and bugs.<br />
• Vendor independent: With <strong>Linux</strong>, you no<br />
longer have to be locked into a specific<br />
vendor. <strong>Linux</strong> is supported on multiple<br />
platforms.<br />
• High innovation rate: New features are<br />
usually implemented on <strong>Linux</strong> before they<br />
are available on commercial or proprietary<br />
systems.<br />
Figure 2: Open development process of the<br />
<strong>Linux</strong> kernel<br />
Other contributing factors include <strong>Linux</strong>’s support<br />
for a broad range of processors and<br />
peripherals, commercial support availability,<br />
high performance networking, and the proven<br />
record of being a stable and reliable server<br />
platform.<br />
4 Carrier Grade <strong>Linux</strong><br />
<strong>The</strong> <strong>Linux</strong> kernel is missing several features<br />
that are needed in a telecom environment. It<br />
is not adapted to meet telecom requirements<br />
in various areas such as reliability, security,<br />
and scalability. To help the advancement of<br />
<strong>Linux</strong> in the telecom space, OSDL established<br />
the CGL working group. <strong>The</strong> group specifies<br />
and helps implement an Open Source platform<br />
targeted for the communication industry that<br />
is highly available, secure, scalable and easily<br />
maintained. <strong>The</strong> CGL working group is composed<br />
of several members from network equipment<br />
providers, system integrators, platform<br />
providers, and <strong>Linux</strong> distributors. <strong>The</strong>y all<br />
contribute to the requirement definition of Carrier<br />
Grade <strong>Linux</strong>, help Open Source projects<br />
to meet these requirements, and in some cases<br />
start new Open Source projects. Many of<br />
the CGL member companies have contributed<br />
pieces of technology to Open Source in order<br />
to make the <strong>Linux</strong> <strong>Kernel</strong> a more viable option<br />
for telecom platforms. For instance, the Open<br />
Systems Lab [5] from Ericsson Research has<br />
contributed three key technologies: the Transparent<br />
IPC [6], the Asynchronous Event Mechanism<br />
[7], and the Distributed Security Infrastructure<br />
[8]. <strong>The</strong>re are already <strong>Linux</strong> distributions,<br />
MontaVista [9] for instance, that are<br />
providing CGL distributions based on the CGL<br />
requirement definition. Many companies are<br />
also either deploying CGL, or at least evaluating<br />
and experimenting with it.<br />
Consequently, CGL activities are giving much<br />
momentum to <strong>Linux</strong> in the telecom space,<br />
allowing it to be a viable alternative to proprietary<br />
operating systems. Member companies of<br />
CGL are releasing code to Open Source and<br />
are making some of their proprietary technologies<br />
open, which drives the move from<br />
closed platforms to open platforms that use<br />
CGL <strong>Linux</strong>.<br />
5 Target CGL applications<br />
<strong>The</strong> CGL Working Group has identified three<br />
main categories of application areas into which<br />
they expect the majority of applications implemented<br />
on CGL platforms to fall. <strong>The</strong>se application<br />
areas are gateways, signaling, and management<br />
servers.<br />
• Gateways are bridges between two different<br />
technologies or administration domains.<br />
For example, a media gateway performs<br />
the critical function of converting<br />
voice messages from a native telecommunications<br />
time-division-multiplexed network,<br />
to an Internet protocol packet-switched<br />
network. A gateway processes a<br />
large number of small messages received<br />
and transmitted over a large number of<br />
physical interfaces. Gateways perform<br />
in a timely manner very close to hard<br />
real-time. <strong>The</strong>y are implemented on dedicated<br />
platforms with replicated (rather<br />
than clustered) systems used for redundancy.<br />
• Signaling servers handle call control, session<br />
control, and radio resource control.<br />
A signaling server handles the routing and<br />
maintains the status of calls over the network.<br />
It takes the request of user agents<br />
who want to connect to other user agents<br />
and routes it to the appropriate signaling.<br />
Signaling servers require soft real time response<br />
capabilities of less than 80 milliseconds,<br />
and may manage tens of thousands<br />
of simultaneous connections. A signaling<br />
server application is context switch and<br />
memory intensive due to requirements for<br />
quick switching and a capacity to manage<br />
large numbers of connections.<br />
• Management servers handle traditional<br />
network management operations, as well<br />
as service and customer management.<br />
<strong>The</strong>se servers provide services such as: a<br />
Home Location Register and Visitor Location<br />
Register (for wireless networks)<br />
or customer information (such as personal<br />
preferences including features the<br />
customer is authorized to use). Typically,<br />
management applications are data<br />
and communication intensive. <strong>The</strong>ir response<br />
time requirements are less stringent<br />
by several orders of magnitude, compared<br />
to those of signaling and gateway<br />
applications.<br />
6 Overview of the CGL working<br />
group<br />
<strong>The</strong> CGL working group has the vision that<br />
next-generation and multimedia communication<br />
services can be delivered using <strong>Linux</strong>-based<br />
open standards platforms for carrier<br />
grade infrastructure equipment. To achieve this<br />
vision, the working group has set up a strategy<br />
to define the requirements and architecture<br />
for the Carrier Grade <strong>Linux</strong> platform, develop<br />
a roadmap for the platform, and promote the<br />
development of a stable platform upon which<br />
commercial components and services can be<br />
deployed.<br />
In the course of achieving this strategy, the<br />
OSDL CGL working group is creating the requirement<br />
definitions, and identifying existing<br />
Open Source projects that support the roadmap<br />
to implement the required components and interfaces<br />
of the platform. When an Open Source<br />
project does not exist to support a certain requirement,<br />
OSDL CGL is launching (or supporting<br />
the launch of) new Open Source projects<br />
to implement missing components and interfaces<br />
of the platform.<br />
<strong>The</strong> CGL working group consists of three distinct<br />
sub-groups that work together. <strong>The</strong>se subgroups<br />
are: specification, proof-of-concept,<br />
and validation. Responsibilities of each subgroup<br />
are as follows:<br />
1. Specifications: <strong>The</strong> specifications subgroup<br />
is responsible for defining a set of
requirements that lead to enhancements in<br />
the <strong>Linux</strong> kernel, that are useful for carrier<br />
grade implementations and applications.<br />
<strong>The</strong> group collects, categorizes, and<br />
prioritizes the requirements from participants<br />
to allow reasonable work to proceed<br />
on implementations. <strong>The</strong> group also interacts<br />
with other standard defining bodies,<br />
open source communities, developers<br />
and distributions to ensure that the requirements<br />
identify useful enhancements<br />
in such a way that they can be adopted<br />
into the base <strong>Linux</strong> kernel.<br />
2. Proof-of-Concept: This sub-group generates<br />
documents covering the design, features,<br />
and technology relevant to CGL. It<br />
drives the implementation and integration<br />
of core Carrier Grade enhancements to<br />
<strong>Linux</strong> as identified and prioritized by the<br />
requirement document. <strong>The</strong> group is also<br />
responsible for ensuring the integrated enhancements<br />
pass the CGL validation test<br />
suite, and for establishing and leading an<br />
open source umbrella project to coordinate<br />
implementation and integration activities<br />
for CGL enhancements.<br />
3. Validation: This sub-group defines standard<br />
test environments for developing validation<br />
suites. It is responsible for coordinating<br />
the development of validation<br />
suites, to ensure that all of the CGL requirements<br />
are covered. This group is<br />
also responsible for the development of<br />
an Open Source project: the CGL validation<br />
suite.<br />
7 CGL architecture<br />
Figure 3 presents the scope of the CGL Working<br />
Group, which covers two areas:<br />
• Carrier Grade <strong>Linux</strong>: Various requirements<br />
such as availability and scalability<br />
Figure 3: CGL architecture and scope<br />
are related to the CGL enhancements to<br />
the operating system. Enhancements may<br />
also be made to hardware interfaces, interfaces<br />
to the user level or application code<br />
and interfaces to development and debugging<br />
tools. In some cases, to access the<br />
kernel services, user level library changes<br />
will be needed.<br />
• Software Development Tools: <strong>The</strong>se tools<br />
will include debuggers and analyzers.<br />
On October 9, 2003, OSDL announced<br />
the availability of the OSDL Carrier<br />
Grade <strong>Linux</strong> Requirements Definition,<br />
Version 2.0 (CGL 2.0). This latest requirement<br />
definition for next-generation<br />
carrier grade <strong>Linux</strong> offers major advances<br />
in security, high availability, and clustering.<br />
8 CGL requirements<br />
<strong>The</strong> requirement definition document of CGL<br />
version 2.0 introduced new and enhanced features<br />
to support <strong>Linux</strong> as a carrier grade platform.<br />
<strong>The</strong> CGL requirement definition divides<br />
the requirements into main categories, described<br />
briefly below:
8.1 Clustering<br />
<strong>The</strong>se requirements support the use of multiple<br />
carrier server systems to provide higher levels<br />
of service availability through redundant resources<br />
and recovery capabilities, and to provide<br />
a horizontally scaled environment supporting<br />
increased throughput.<br />
8.2 Security<br />
<strong>The</strong> security requirements are aimed at maintaining<br />
a certain level of security while not endangering<br />
the goals of high availability, performance,<br />
and scalability. <strong>The</strong> requirements support<br />
the use of additional security mechanisms<br />
to protect the systems against attacks from both<br />
the Internet and intranets, and provide special<br />
mechanisms at kernel level to be used by telecom<br />
applications.<br />
8.3 Standards<br />
CGL specifies standards that are required for<br />
compliance for carrier grade server systems.<br />
Examples of these standards include:<br />
• <strong>Linux</strong> Standard Base<br />
• POSIX Timer Interface<br />
• POSIX Signal Interface<br />
• POSIX Message Queue Interface<br />
• POSIX Semaphore Interface<br />
• IPv6 RFCs compliance<br />
• IPsecv6 RFCs compliance<br />
• MIPv6 RFCs compliance<br />
• SNMP support<br />
• POSIX threads<br />
8.4 Platform<br />
OSDL CGL specifies requirements that support<br />
interactions with the hardware platforms<br />
making up carrier server systems. Platform capabilities<br />
are not tied to a particular vendor’s<br />
implementation. Examples of the platform requirements<br />
include:<br />
• Hot insert: supports hot-swap insertion of<br />
hardware components<br />
• Hot remove: supports hot-swap removal<br />
of hardware components<br />
• Remote boot support: supports remote<br />
booting functionality<br />
• Boot cycle detection: supports detecting<br />
reboot cycles due to recurring failures.<br />
If the system experiences a problem that<br />
causes it to reboot repeatedly, the system<br />
will go offline. This is to prevent additional<br />
difficulties from occurring as a result<br />
of the repeated reboots<br />
• Diskless systems: Provide support for<br />
diskless systems loading their kernel/application<br />
over the network<br />
• Support remote booting across common<br />
LAN and WAN communication media<br />
8.5 Availability<br />
<strong>The</strong> availability requirements support heightened<br />
availability of carrier server systems, such<br />
as improving the robustness of software components<br />
or by supporting recovery from failure<br />
of hardware or software. Examples of these requirements<br />
include:<br />
• RAID 1: support for RAID 1 offers mirroring<br />
to provide duplicate sets of all data<br />
on separate hard disks
• Watchdog timer interface: support for<br />
watchdog timers to perform certain specified<br />
operations when timeouts occur<br />
• Support for Disk and volume management:<br />
to allow grouping of disks into volumes<br />
• Ethernet link aggregation and link<br />
failover: support bonding of multiple NIC<br />
for bandwidth aggregation and provide<br />
automatic failover of IP addresses from<br />
one interface to another<br />
• Support for application heartbeat monitor:<br />
monitors application availability and<br />
functionality.<br />
8.6 Serviceability<br />
<strong>The</strong> serviceability requirements support servicing<br />
and managing hardware and software on<br />
carrier server systems. <strong>The</strong>se are a wide-ranging<br />
set of requirements that, put together, help support the<br />
availability of applications and the operating<br />
system. Examples of these requirements include:<br />
• Support for producing and storing kernel<br />
dumps<br />
• Support for dynamic debug to allow the<br />
dynamic insertion of software instrumentation<br />
into a running system, in the<br />
kernel or in applications<br />
• Support for platform signal handler enabling<br />
infrastructures to allow interrupts<br />
generated by hardware errors to be logged<br />
using the event logging mechanism<br />
• Support for remote access to event log information<br />
8.7 Performance<br />
OSDL CGL specifies the requirements that<br />
support performance levels necessary for the<br />
environments expected to be encountered by<br />
carrier server systems. Examples of these requirements<br />
include:<br />
• Support for application (pre) loading.<br />
• Support for soft real time performance<br />
through configuring the scheduler to provide<br />
soft real time support with latency of<br />
10 ms.<br />
• Support <strong>Kernel</strong> preemption.<br />
• Raid 0 support: RAID Level 0 provides<br />
"disk striping" support to enhance<br />
performance for request-rate-intensive or<br />
transfer-rate-intensive environments<br />
8.8 Scalability<br />
<strong>The</strong>se requirements support vertical and horizontal<br />
scaling of carrier server systems, such as<br />
the addition of hardware resources resulting in<br />
acceptable increases in capacity.<br />
8.9 Tools<br />
<strong>The</strong> tools requirements provide capabilities to<br />
facilitate diagnosis. Examples of these requirements<br />
include:<br />
• Support the usage of a kernel debugger.<br />
• Support for <strong>Kernel</strong> dump analysis.<br />
• Support for debugging multi-threaded<br />
programs
9 CGL 3.0<br />
<strong>The</strong> work on the next version of the OSDL<br />
CGL requirements, version 3.0, started in January<br />
2004 with focus on advanced requirement<br />
areas such as manageability, serviceability,<br />
tools, security, standards, performance,<br />
hardware, clustering and availability. With the<br />
success of CGL’s first two requirement documents,<br />
the OSDL CGL working group anticipates<br />
that their third version will be quite beneficial<br />
to the Carrier Grade ecosystem. Official release<br />
of the CGL requirement document Version<br />
3.0 is expected in October 2004.<br />
10 CGL implementations<br />
<strong>The</strong>re are several enhancements to the <strong>Linux</strong><br />
<strong>Kernel</strong> that are required by the communication<br />
industry, to help adopt <strong>Linux</strong> on their carrier<br />
grade platforms, and support telecom applications.<br />
<strong>The</strong>se enhancements (Figure 4) fall into<br />
the following categories: availability, security,<br />
serviceability, performance, scalability, reliability,<br />
standards, and clustering.<br />
Figure 4: CGL enhancements areas<br />
<strong>The</strong> implementations providing these enhancements<br />
are Open Source projects<br />
planned for integration with the <strong>Linux</strong> kernel<br />
when the implementations are mature and<br />
ready for merging with the kernel code. In<br />
some cases, bringing a project to maturity<br />
takes a considerable amount of time<br />
before its integration into<br />
the <strong>Linux</strong> kernel can be requested. Nevertheless, some of the enhancements<br />
are targeted for inclusion in kernel<br />
version 2.7. Other enhancements will follow in<br />
later kernel releases. Meanwhile, all enhancements,<br />
in the form of packages, kernel modules<br />
and patches, are available from their respective<br />
project web sites. <strong>The</strong> CGL 2.0 requirements<br />
are in-line with the <strong>Linux</strong> development community.<br />
<strong>The</strong> purpose of this project is to form a<br />
catalyst to capture common requirements from<br />
end-users for a CGL distribution. With a common<br />
set of requirements from the major Network<br />
Equipment Providers, developers can be<br />
much more productive and efficient within development<br />
projects. Many individuals within<br />
the CGL initiative are also active participants<br />
and contributors in the Open Source development<br />
community.<br />
11 Examples of needed features in<br />
the <strong>Linux</strong> <strong>Kernel</strong><br />
In this section, we provide some examples<br />
of missing features and mechanisms from the<br />
<strong>Linux</strong> kernel that are necessary in a telecom<br />
environment.<br />
11.1 Transparent Inter-Process and Inter-<br />
Processor Communication Protocol for<br />
<strong>Linux</strong> Clusters<br />
Today’s telecommunication environments are<br />
increasingly adopting clustered servers to gain<br />
benefits in performance, availability, and scalability.<br />
<strong>The</strong> resulting benefits of a cluster<br />
are greater or more cost-efficient than what a<br />
single server can provide. Furthermore, the<br />
telecommunications industry’s interest in clustering<br />
originates from the fact that clusters<br />
address carrier grade characteristics such as<br />
guaranteed service availability, reliability and
scaled performance, using cost-effective hardware<br />
and software. Without being absolute<br />
about these requirements, they can be divided<br />
into these three categories: short failure detection<br />
and failure recovery, guaranteed availability of<br />
service, and short response times. <strong>The</strong> most<br />
widely adopted clustering technique is use of<br />
multiple interconnected loosely coupled nodes<br />
to create a single highly available system.<br />
<strong>One</strong> missing feature from the <strong>Linux</strong> kernel in<br />
this area is a reliable, efficient, and transparent<br />
inter-process and inter-processor communication<br />
protocol. Transparent Inter Process<br />
Communication (TIPC) [6] is a suitable Open<br />
Source implementation that fills this gap and<br />
provides an efficient cluster communication<br />
protocol. This leverages the particular conditions<br />
present within loosely coupled clusters.<br />
It runs on <strong>Linux</strong> and is provided as a portable<br />
source code package implementing a loadable<br />
kernel module.<br />
TIPC is unique because there seems to be no<br />
other protocol providing a comparable combination<br />
of versatility and performance. It<br />
includes some original innovations such as<br />
functional addressing, topology subscription<br />
services, and the reactive connection<br />
concept. Other important TIPC features<br />
include full location transparency, support<br />
for lightweight connections, reliable multicast,<br />
signaling link protocol, topology subscription<br />
services and more.<br />
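As a concrete illustration of functional addressing, the short sketch below sends a datagram to a cluster service identified only by a {service type, instance} name, letting TIPC resolve it to whichever node currently publishes that name. This is a hedged example assuming the AF_TIPC socket interface shipped with the TIPC package (and later merged into mainline kernels); the service type value 18888 is arbitrary.<br />
#include &lt;stdio.h&gt;<br />
#include &lt;string.h&gt;<br />
#include &lt;sys/socket.h&gt;<br />
#include &lt;linux/tipc.h&gt;<br />
<br />
int main(void)<br />
{<br />
    /* Connectionless, reliable datagram socket in the TIPC domain. */<br />
    int sd = socket(AF_TIPC, SOCK_RDM, 0);<br />
    if (sd &lt; 0) { perror("socket(AF_TIPC)"); return 1; }<br />
<br />
    /* Functional (name-based) address: {type, instance} identifies the<br />
     * service, not a node; domain 0 means "look up cluster-wide". */<br />
    struct sockaddr_tipc server;<br />
    memset(&amp;server, 0, sizeof(server));<br />
    server.family = AF_TIPC;<br />
    server.addrtype = TIPC_ADDR_NAME;<br />
    server.addr.name.name.type = 18888;   /* arbitrary service type */<br />
    server.addr.name.name.instance = 1;<br />
    server.addr.name.domain = 0;<br />
<br />
    const char msg[] = "hello, cluster service";<br />
    if (sendto(sd, msg, sizeof(msg), 0,<br />
               (struct sockaddr *)&amp;server, sizeof(server)) &lt; 0)<br />
        perror("sendto");<br />
    return 0;<br />
}<br />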
TIPC should be regarded as a useful toolbox<br />
for anyone wanting to develop or use Carrier<br />
Grade or Highly Available <strong>Linux</strong> clusters. It<br />
provides the necessary infrastructure for cluster,<br />
network and software management functionality,<br />
as well as a good support for designing<br />
site-independent, scalable, distributed,<br />
high-availability and high-performance applications.<br />
It is also worthwhile to mention that the<br />
ForCES (Forwarding and Control Element<br />
WG) [11] working group within IETF has<br />
agreed that their router internal protocol (the<br />
ForCES protocol) must be possible to carry<br />
over different types of transport protocols.<br />
<strong>The</strong>re is consensus that TCP is the protocol<br />
to be used when ForCES messages are<br />
transported over the Internet, while TIPC is<br />
the protocol to be used in closed environments<br />
(LANs), where special characteristics such as<br />
high performance and multicast support are desirable.<br />
Other protocols may also be added as<br />
options.<br />
TIPC is a contribution from Ericsson [5] to<br />
the Open Source community. TIPC was announced<br />
on LKML on June 28, 2004; it is licensed<br />
under a dual GPL and BSD license.<br />
11.2 IPv4, IPv6, MIPv6 forwarding tables fast<br />
access and compact memory with multiple<br />
FIB support<br />
Routers are core elements of modern telecom<br />
networks. <strong>The</strong>y propagate and direct billions<br />
of data packets from their source to their destination<br />
using air transport devices or through<br />
high-speed links. <strong>The</strong>y must operate as fast as<br />
the medium in order to deliver the best quality<br />
of service and have a negligible effect on<br />
communications. To give some figures, it is<br />
common for routers to manage between 10,000<br />
and 500,000 routes. In these situations, good<br />
performance means being able to handle around<br />
2,000 routes/sec. <strong>The</strong> current implementation of<br />
the IP stack in <strong>Linux</strong> works fine for home or<br />
small business routers. However, with the high<br />
expectation of telecom operators and the new<br />
capabilities of telecom hardware, it appears<br />
barely possible to use <strong>Linux</strong> as an efficient<br />
forwarding and routing element of a high-end<br />
router for large network (core/border/access<br />
router) or a high-end server with routing capabilities.
<strong>One</strong> problem with the networking stack in<br />
<strong>Linux</strong> is the lack of support for multiple<br />
forwarding information bases (multi-FIB) with<br />
overlapping interface IP addresses, and the<br />
lack of appropriate interfaces for addressing a<br />
FIB. Another problem with the current implementation<br />
is the limited scalability of the routing<br />
table.<br />
<strong>The</strong> solution to these problems is to provide<br />
support for multi-FIB with overlapping IP addresses.<br />
As such, we can have, on different<br />
VLANs or different physical interfaces, independent<br />
networks in the same <strong>Linux</strong> box. For<br />
example, we can have two HTTP servers serving<br />
two different networks with potentially the<br />
same IP address. <strong>One</strong> HTTP server will serve<br />
the network/FIB 10, and the other HTTP<br />
server will serve the network/FIB 20. <strong>The</strong> advantage<br />
gained is to have one <strong>Linux</strong> box serving<br />
two different customers using the same IP<br />
address. ISPs adopt this approach by providing<br />
services for multiple customers sharing the<br />
same server (server partitioning), instead of<br />
using a server per customer.<br />
<strong>The</strong> way to achieve this is to have an ID (an<br />
identifier that identifies the customer or user of<br />
the service) to completely separate the routing<br />
tables in memory. Two approaches exist:<br />
the first is to have separate routing tables,<br />
where each routing table is looked up by its ID and,<br />
within that table, the lookup is done on the<br />
prefix. <strong>The</strong> second approach is to have one table,<br />
where the lookup is done on the combined<br />
key = prefix + ID.<br />
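To make the second approach concrete, the following self-contained sketch (hypothetical structures and values, not kernel code) shows how including the FIB ID in the lookup key lets overlapping prefixes in different FIBs resolve to different next hops; a real implementation would use a radix or Patricia tree rather than a linear scan.<br />
#include &lt;stdint.h&gt;<br />
#include &lt;stddef.h&gt;<br />
#include &lt;stdio.h&gt;<br />
<br />
struct fib_entry {<br />
    uint32_t fib_id;     /* identifies the customer/VLAN (the "ID") */<br />
    uint32_t prefix;     /* network prefix (IPv4, host byte order)  */<br />
    uint8_t  plen;       /* prefix length in bits                   */<br />
    uint32_t gateway;    /* next hop                                */<br />
};<br />
<br />
static const struct fib_entry fib[] = {<br />
    { 10, 0x0A000000, 8, 0xC0A80001 },   /* FIB 10: 10.0.0.0/8 -&gt; 192.168.0.1   */<br />
    { 20, 0x0A000000, 8, 0xC0A80002 },   /* FIB 20: same prefix, different route */<br />
};<br />
<br />
/* Longest-prefix match restricted to entries whose fib_id matches. */<br />
static const struct fib_entry *fib_lookup(uint32_t fib_id, uint32_t dst)<br />
{<br />
    const struct fib_entry *best = NULL;<br />
    for (size_t i = 0; i &lt; sizeof(fib) / sizeof(fib[0]); i++) {<br />
        uint32_t mask = fib[i].plen ? ~0u &lt;&lt; (32 - fib[i].plen) : 0;<br />
        if (fib[i].fib_id == fib_id &amp;&amp; (dst &amp; mask) == fib[i].prefix)<br />
            if (!best || fib[i].plen &gt; best-&gt;plen)<br />
                best = &amp;fib[i];<br />
    }<br />
    return best;<br />
}<br />
<br />
int main(void)<br />
{<br />
    /* 10.10.10.10 looked up in FIB 20 resolves to FIB 20's gateway. */<br />
    const struct fib_entry *r = fib_lookup(20, 0x0A0A0A0A);<br />
    if (r)<br />
        printf("next hop: %u.%u.%u.%u\n",<br />
               r-&gt;gateway &gt;&gt; 24, (r-&gt;gateway &gt;&gt; 16) &amp; 0xFF,<br />
               (r-&gt;gateway &gt;&gt; 8) &amp; 0xFF, r-&gt;gateway &amp; 0xFF);<br />
    return 0;<br />
}<br />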
A different kind of problem arises when we are<br />
not able to predict access times, because of the chaining<br />
in the hash table of the routing cache (and<br />
FIB). This problem is of particular interest in<br />
an environment that requires predictable performance.<br />
Another aspect of the problem is that the route<br />
cache and the routing table are not kept synchronized<br />
most of the time (path MTU, just<br />
to name one). <strong>The</strong> route cache flush is executed<br />
regularly; therefore, any updates on the<br />
cache are lost. For example, if you have a routing<br />
cache flush, you have to rebuild every route<br />
that you are currently talking to, by going for<br />
every route in the hash/try table and rebuilding<br />
the information. First, you have to lookup in<br />
the routing cache, and if you have a miss, then<br />
you need to go in the hash/try table. This process<br />
is very slow and not predictable since the<br />
hash/try table is implemented wi th linked list<br />
and there is high potential for collisions when a<br />
large number of routes are present. This design<br />
is suitable fo r a home PC with a few routes, but<br />
it is not scalable for a large server.<br />
To support the various routing requirements<br />
of server nodes operating in high performance<br />
and mission-critical environments,<br />
<strong>Linux</strong> should support the following:<br />
• Implementation of multi-FIB using tree<br />
(radix, patricia, etc.): It is very important<br />
to have predictable performance in insert/delete/lookup<br />
with 10,000 to 500,000<br />
routes. In addition, it is favourable to have<br />
the same data structure for both IPv4 and<br />
IPv6.<br />
• Socket and ioctl interfaces for addressing<br />
multi-FIB.<br />
• Multi-FIB support for neighbors (ARP).<br />
Providing these implementations in <strong>Linux</strong> will<br />
affect a large part of net/core, net/ipv4 and<br />
net/ipv6; these subsystems (mostly network<br />
layer) will need to be re-written. Other areas<br />
will have minimal impact at the source code<br />
level, mostly at the transport layer (socket,<br />
TCP, UDP, RAW, NAT, IPIP, IGMP, etc.).<br />
As for the availability of an Open Source<br />
project that can provide these functionalities,
there exists a project called "<strong>Linux</strong> Virtual<br />
Routing and Forwarding" [12]. This project<br />
aims to implement a flexible and scalable<br />
mechanism for providing multiple routing instances<br />
within the <strong>Linux</strong> kernel. <strong>The</strong> project<br />
has some potential in providing the needed<br />
functionalities; however, no progress has been<br />
made since 2002 and the project seems to be<br />
inactive.<br />
11.3 Run-time Authenticity Verification for Binaries<br />
<strong>Linux</strong> has generally been considered immune<br />
to the spread of viruses, backdoors and Trojan<br />
programs on the Internet. However, with<br />
the increasing popularity of <strong>Linux</strong> as a desktop<br />
platform, the risk of seeing viruses or Trojans<br />
developed for this platform is rapidly<br />
growing. To alleviate this problem, the system<br />
should prevent, at run time, the execution<br />
of un-trusted software. <strong>One</strong> solution is<br />
to digitally sign the trusted binaries and have<br />
the system check the digital signature of binaries<br />
before running them. <strong>The</strong>refore, untrusted<br />
(not signed) binaries are denied execution.<br />
This can improve the security of the system<br />
by avoiding a wide range of malicious binaries<br />
like viruses, worms, Trojan programs and<br />
backdoors from running on the system.<br />
DigSig [13] is a <strong>Linux</strong> kernel module that<br />
checks the signature of a binary before running<br />
it. It inserts digital signatures inside the ELF<br />
binary and verifies this signature before loading<br />
the binary. It is based on the <strong>Linux</strong> Security<br />
Module hooks (LSM has been integrated into<br />
the <strong>Linux</strong> kernel since version 2.5.x).<br />
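The overall wiring of such a check can be sketched as a small LSM module. This assumes the 2.6-era bprm_check_security hook and register_security() interface; digsig_verify_elf() below is a hypothetical stand-in for the project's actual signature verification code.<br />
/* Minimal sketch of an LSM-based run-time verification hook (2.6-era API). */<br />
#include &lt;linux/module.h&gt;<br />
#include &lt;linux/init.h&gt;<br />
#include &lt;linux/errno.h&gt;<br />
#include &lt;linux/security.h&gt;<br />
#include &lt;linux/binfmts.h&gt;<br />
<br />
/* Hypothetical helper: returns 0 if the ELF image carries a valid signature. */<br />
static int digsig_verify_elf(struct file *file)<br />
{<br />
    return 0;   /* real code would hash the file and check the embedded signature */<br />
}<br />
<br />
/* Called by the kernel before an execve() is allowed to proceed. */<br />
static int sketch_bprm_check_security(struct linux_binprm *bprm)<br />
{<br />
    if (digsig_verify_elf(bprm-&gt;file))<br />
        return -EPERM;          /* unsigned or badly signed: refuse to run */<br />
    return 0;<br />
}<br />
<br />
static struct security_operations sketch_ops = {<br />
    .bprm_check_security = sketch_bprm_check_security,<br />
};<br />
<br />
static int __init sketch_init(void)<br />
{<br />
    return register_security(&amp;sketch_ops);<br />
}<br />
<br />
module_init(sketch_init);<br />
MODULE_LICENSE("GPL");<br />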
Typically, in this approach, vendors do not sign<br />
binaries; control of the system remains with<br />
the local administrator, who is responsible<br />
for signing all binaries they trust with<br />
their private key. <strong>The</strong>refore, DigSig guarantees<br />
two things: (1) if you signed a binary, nobody<br />
else other than yourself can modify that binary<br />
without being detected. (2) Nobody can run a<br />
binary which is not signed or badly signed.<br />
<strong>The</strong>re have already been several initiatives in<br />
this domain, such as Tripwire [14], BSign [15],<br />
Cryptomark [16], but we believe the DigSig<br />
project is the first to be both easily accessible to<br />
all (available on SourceForge, under the GPL<br />
license) and to operate at the kernel level at run<br />
time. Run-time operation is important for Carrier<br />
Grade <strong>Linux</strong> because it takes into account the<br />
high-availability aspects of the system.<br />
<strong>The</strong> DigSig approach has been using existing<br />
solutions like GnuPG [17] and BSign (a<br />
Debian package) rather than reinventing the<br />
wheel. However, in order to reduce the overhead<br />
in the kernel, the DigSig project only took<br />
the minimum code necessary from GnuPG.<br />
This helped greatly to reduce the amount of<br />
code imported into the kernel (only about 1/10<br />
of the original GnuPG 1.2.2 source code has<br />
been imported into the kernel module).<br />
DigSig is a contribution from Ericsson [5] to<br />
the Open Source community. It was released<br />
under the GPL license and it is available from<br />
[8].<br />
DigSig has been announced on LKML [18] but<br />
is not yet integrated into the <strong>Linux</strong> <strong>Kernel</strong>.<br />
11.4 Efficient Low-Level Asynchronous Event<br />
Mechanism<br />
Carrier grade systems must provide 5-nines<br />
availability: a maximum of five minutes per<br />
year of downtime, which includes hardware,<br />
operating system, software upgrade and maintenance.<br />
Operating systems for such systems<br />
must ensure that they can deliver a high response<br />
rate with minimum downtime. In addition,<br />
carrier-grade systems must take into<br />
account characteristics such as scalabil-<br />
ity, high availability and performance. In carrier<br />
grade systems, thousands of requests must<br />
be handled concurrently without affecting the<br />
overall system’s performance, even under extremely<br />
high loads. Subscribers can expect<br />
some latency time when issuing a request, but<br />
they are not willing to accept an unbounded<br />
response time. Such transactions are not handled<br />
instantaneously for many reasons, and it<br />
can take some milliseconds or seconds to reply.<br />
Waiting for an answer reduces an application’s<br />
ability to handle other transactions.<br />
Many different solutions have been envisaged<br />
to improve <strong>Linux</strong>’s capabilities in this area using<br />
different types of software organization,<br />
such as multithreaded architectures, implementing<br />
efficient POSIX interfaces, or improving<br />
the scalability of existing kernel routines.<br />
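For reference, the skeleton below shows the event-driven style that such mechanisms aim to support, expressed with the stock epoll interface already present in the 2.6 kernel; it is not AEM's own API, and error handling is omitted. A single thread multiplexes a listening socket and all of its client connections without blocking on any one of them.<br />
#include &lt;unistd.h&gt;<br />
#include &lt;netinet/in.h&gt;<br />
#include &lt;sys/socket.h&gt;<br />
#include &lt;sys/epoll.h&gt;<br />
<br />
int main(void)<br />
{<br />
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);<br />
    struct sockaddr_in addr = { .sin_family = AF_INET,<br />
                                .sin_port = htons(8000),<br />
                                .sin_addr.s_addr = htonl(INADDR_ANY) };<br />
    bind(listen_fd, (struct sockaddr *)&amp;addr, sizeof(addr));<br />
    listen(listen_fd, 128);<br />
<br />
    int epfd = epoll_create(1024);<br />
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };<br />
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &amp;ev);<br />
<br />
    for (;;) {<br />
        struct epoll_event events[64];<br />
        int n = epoll_wait(epfd, events, 64, -1);   /* sleep until work arrives */<br />
        for (int i = 0; i &lt; n; i++) {<br />
            int fd = events[i].data.fd;<br />
            if (fd == listen_fd) {                  /* new connection */<br />
                int client = accept(listen_fd, NULL, NULL);<br />
                struct epoll_event cev = { .events = EPOLLIN, .data.fd = client };<br />
                epoll_ctl(epfd, EPOLL_CTL_ADD, client, &amp;cev);<br />
            } else {                                /* request data from a client */<br />
                char buf[512];<br />
                ssize_t len = read(fd, buf, sizeof(buf));<br />
                if (len &lt;= 0) { close(fd); continue; }<br />
                write(fd, buf, len);                /* trivial echo "transaction" */<br />
            }<br />
        }<br />
    }<br />
}<br />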
<strong>One</strong> possible solution that is adequate for carrier<br />
grade servers is the Asynchronous Event<br />
Mechanism (AEM), which provides asynchronous<br />
execution of processes in the <strong>Linux</strong><br />
kernel. AEM implements a native support<br />
for asynchronous events in the <strong>Linux</strong> kernel<br />
and aims to bring carrier-grade characteristics<br />
to <strong>Linux</strong> in areas of scalability and soft realtime<br />
responsiveness. In addition, AEM offers<br />
event-based development framework, scalability,<br />
flexibility, and extensibility.<br />
Ericsson [5] released AEM to Open Source in<br />
February 2003 under the GPL license. AEM<br />
was announced on the <strong>Linux</strong> <strong>Kernel</strong> Mailing<br />
List (LKML) [20], and received feedback that<br />
resulted in some changes to the design and implementation.<br />
AEM is not yet integrated with<br />
the <strong>Linux</strong> kernel.<br />
12 Conclusion<br />
<strong>The</strong>re are many challenges accompanying the<br />
migration from proprietary to open platforms.<br />
<strong>The</strong> main challenge remains the availability<br />
of the various kernel features and mechanisms<br />
needed for telecom platforms and integrating<br />
these features in the <strong>Linux</strong> kernel.<br />
References<br />
[1] PCI Industrial Computer Manufacturers<br />
Group,<br />
http://www.picmg.org<br />
[2] Open Source Development Labs,<br />
http://www.osdl.org<br />
[3] Carrier Grade <strong>Linux</strong>,<br />
http://osdl.org/lab_activities<br />
[4] Service Availability Forum,<br />
http://www.saforum.org<br />
[5] Open System Lab,<br />
http://www.linux.ericsson.ca<br />
[6] Transparent IPC,<br />
http://tipc.sf.net<br />
[7] Asynchronous Event Mechanism,<br />
http://aem.sf.net<br />
[8] Distributed Security Infrastructure,<br />
http://disec.sf.net<br />
[9] MontaVista Carrier Grade Edition,<br />
http://www.mvista.com/cge<br />
[10] Make Clustering Easy with TIPC,<br />
<strong>Linux</strong>World Magazine, April 2004<br />
[11] IETF ForCES working group,<br />
http://www.sstanamera.com/~forces<br />
[12] <strong>Linux</strong> Virtual Routing and Forwarding<br />
project,<br />
http://linux-vrf.sf.net<br />
[13] Stop Malicious Code Execution at<br />
<strong>Kernel</strong> Level, <strong>Linux</strong>World Magazine,<br />
January 2004
[14] Tripwire,<br />
http://www.tripwire.com<br />
[15] Bsign,<br />
http://packages.debian.org/bsign<br />
[16] Cryptomark,<br />
http://immunix.org/cryptomark.html<br />
[17] GnuPG,<br />
http://www.gnupg.org<br />
[18] DigSig announcement on LKML,<br />
http://lwn.net/Articles/51007<br />
[19] An Event Mechanism for <strong>Linux</strong>, <strong>Linux</strong><br />
Journal, July 2003<br />
[20] AEM announcement on LKML,<br />
http://lwn.net/Articles/45633<br />
Acknowledgments<br />
Thank you to Ludovic Beliveau, Mathieu<br />
Giguere, Magnus Karlson, Jon Maloy, Mats<br />
Naslund, Makan Pourzandi, and Frederic<br />
Rossi, for their valuable contributions and reviews.
Demands, Solutions, and Improvements for <strong>Linux</strong><br />
Filesystem Security<br />
Michael Austin Halcrow<br />
International Business Machines, Inc.<br />
mike@halcrow.us<br />
Abstract<br />
Securing file resources under <strong>Linux</strong> is a team<br />
effort. No one library, application, or kernel<br />
feature can stand alone in providing robust security.<br />
Current <strong>Linux</strong> access control mechanisms<br />
work in concert to provide a certain level<br />
of security, but they depend upon the integrity<br />
of the machine itself to protect that data. Once<br />
the data leaves that machine, or if the machine<br />
itself is physically compromised, those access<br />
control mechanisms can no longer protect the<br />
data in the filesystem. At that point, data privacy<br />
must be enforced via encryption.<br />
As <strong>Linux</strong> makes inroads in the desktop market,<br />
the need for transparent and effective data encryption<br />
increases. To be practically deployable,<br />
the encryption/decryption process must<br />
be secure, unobtrusive, consistent, flexible, reliable,<br />
and efficient. Most encryption mechanisms<br />
that run under <strong>Linux</strong> today fail in one<br />
or more of these categories. In this paper, we<br />
discuss solutions to many of these issues via<br />
the integration of encryption into the <strong>Linux</strong><br />
filesystem. This will provide access control enforcement<br />
on data that is not necessarily under<br />
the control of the operating environment.<br />
We also explore how stackable filesystems, Extended<br />
Attributes, PAM, GnuPG web-of-trust,<br />
supporting libraries, and applications (such as<br />
GNOME/KDE) can all be orchestrated to provide<br />
robust encryption-based access control<br />
over filesystem content.<br />
1 Development Efforts<br />
This paper is motivated by an effort on the part<br />
of the IBM <strong>Linux</strong> Technology Center to enhance<br />
<strong>Linux</strong> filesystem security through better<br />
integration of encryption technology. <strong>The</strong><br />
author of this paper is working together with<br />
the external community and several members<br />
of the LTC in the design and development of<br />
a transparent cryptographic filesystem layer in<br />
the <strong>Linux</strong> kernel. <strong>The</strong> “we” in this paper refers<br />
to immediate members of the author’s development<br />
team who are working together on this<br />
project, although many others outside that development<br />
team have thus far had a significant<br />
part in this development effort.<br />
2 <strong>The</strong> Filesystem Security<br />
2.1 Threat Model<br />
Computer users tend to be overly concerned<br />
about protecting their credit card numbers from<br />
being sniffed as they are transmitted over the<br />
Internet. At the same time, many do not think<br />
twice when sending equally sensitive information<br />
in the clear via an email message. A<br />
thief who steals a removable device, laptop, or<br />
server can also read the confidential files on<br />
those devices if they are left unprotected. Nevertheless,<br />
far too many users neglect to take the<br />
necessary steps to protect their files from such<br />
an event. Your liability limit for unauthorized
270 • <strong>Linux</strong> Symposium 2004 • Volume <strong>One</strong><br />
charges to your credit card is $50 (and most<br />
credit card companies waive that liability for<br />
victims of fraud); on the other hand, confidentiality<br />
cannot be restored once lost.<br />
Today, we see countless examples of neglect<br />
to use encryption to protect the integrity and<br />
the confidentiality of sensitive data. Those<br />
who are trusted with sensitive information routinely<br />
send that information as unencrypted<br />
email attachments. <strong>The</strong>y also store that information<br />
in clear text on disks, USB keychain<br />
drives, backup tapes, and other removable media.<br />
GnuPG[7] and OpenSSL[8] provide all the<br />
encryption tools necessary to protect this information,<br />
but these tools are not used nearly as<br />
often as they ought to be.<br />
If required to go through tedious encryption or<br />
decryption steps every time they need to work<br />
with a file or share it, people will select insecure<br />
passwords, transmit passwords in an insecure<br />
manner, fail to consider or use public key<br />
encryption options, or simply stop encrypting<br />
their files altogether. If security is overly obstructive,<br />
people will remove it, work around<br />
it, or misuse it (thus rendering it less effective).<br />
As <strong>Linux</strong> gains adoption in the desktop market,<br />
we need integrated file integrity and confidentiality<br />
that is seamless, transparent, easy to use,<br />
and effective.<br />
2.2 Integration of File Encryption into the<br />
Filesystem<br />
Several solutions exist that solve separate<br />
pieces of the problem. In one example highlighting<br />
transparency, employees within an organization<br />
that uses IBM Lotus Notes [9]<br />
for its email will not even notice the complex<br />
PKI or the encryption process that is integrated<br />
into the product. Encryption and decryption<br />
of sensitive email messages is seamless to the<br />
end user; it involves checking an “Encrypt”<br />
box, specifying a recipient, and sending the<br />
message. This effectively addresses a significant<br />
file in-transit confidentiality problem. If<br />
the local replicated mailbox database is also<br />
encrypted, then it also addresses confidentiality<br />
on the local storage device, but the protection<br />
is lost once the data leaves the domain of<br />
Notes (for example, if an attached file is saved<br />
to disk). <strong>The</strong> process must be seamlessly integrated<br />
into all relevant aspects of the user’s<br />
operating environment.<br />
In Section 4, we discuss filesystem security<br />
in general under <strong>Linux</strong>, with an emphasis<br />
on confidentiality and integrity enforcement<br />
via cryptographic technologies. In Section<br />
6, we propose a mechanism to integrate encryption<br />
of files at the filesystem level, including<br />
integration of GnuPG[7] web-of-trust,<br />
PAM[10], a stackable filesystem model[2], Extended<br />
Attributes[6], and libraries and applications,<br />
in order to make the entire process as<br />
transparent as possible to the end user.<br />
3 A Team Effort<br />
Filesystem security encompasses more than<br />
just the filesystem itself. It is a team effort,<br />
involving the kernel, the shells, the login processes,<br />
the filesystems, the applications, the administrators,<br />
and the users. When we speak of<br />
“filesystem security,” we refer to the security<br />
of the files in a filesystem, no matter what ends<br />
up providing that security.<br />
For any filesystem security problem that exists,<br />
there are usually several different ways of<br />
solving it. Solutions that involve modifications<br />
in the kernel tend to introduce less overhead.<br />
This is due to the fact that context switches and<br />
copying of data between kernel and user memory<br />
are reduced. However, changes in the kernel<br />
may reduce the efficiency of the kernel’s<br />
VFS while making it both harder to maintain<br />
and more bug-prone. As notable exceptions,
Erez Zadok’s stackable filesystem framework,<br />
FiST[3], and Loop-aes require no change to<br />
the current <strong>Linux</strong> kernel VFS. Solutions that<br />
exist entirely in userspace do not complicate<br />
the kernel, but they tend to have more overhead<br />
and may be limited in the functionality they are<br />
able to provide, as they are limited by the interface<br />
to the kernel from userspace. Since they<br />
are in userspace, they are also more prone to<br />
attack.<br />
4 Aspects of Filesystem Security<br />
Computer security can be decomposed into<br />
several areas:<br />
• Identifying who you are and having the<br />
machine recognize that identification (authentication).<br />
• Determining whether or not you should be<br />
granted access to a resource such as a sensitive<br />
file (authorization). This is often<br />
based on the permissions associated with<br />
the resource by its owner or an administrator<br />
(access control).<br />
• Transforming your data into an encrypted<br />
format in order to make it prohibitively<br />
costly for unauthorized users to decrypt<br />
and view (confidentiality).<br />
• Performing checksums, keyed hashes,<br />
and/or signing of your data to make unauthorized<br />
modifications of your data detectable<br />
(integrity).<br />
4.1 Filesystem Integrity<br />
When people consider filesystem security, they<br />
traditionally think about access control (file<br />
permissions) and confidentiality (encryption).<br />
File integrity, however, can be just as important<br />
as confidentiality, if not more so. If a script<br />
that performs an administrative task is altered<br />
in an unauthorized fashion, the script may perform<br />
actions that violate the system’s security<br />
policies. For example, many rootkits modify<br />
system startup and shutdown scripts to facilitate<br />
the attacker’s attempts to record the user’s<br />
keystrokes, sniff network traffic, or otherwise<br />
infiltrate the system.<br />
More often than not, the value of the data<br />
stored in files is greater than that of the machine<br />
that hosts the files. For example, if an<br />
attacker manages to insert false data into a financial<br />
report, the alteration to the report may<br />
go unnoticed until substantial damage has been<br />
done; jobs could be at stake and in more extreme<br />
cases even criminal charges against the<br />
user could result. If trojan code sneaks into the<br />
source repository for a major project, the public<br />
release of that project may contain a backdoor. 1<br />
Many security professionals foresee a nightmare<br />
scenario wherein a widely propagated Internet<br />
worm quietly alters the contents of word<br />
processing and spreadsheet documents. Without<br />
any sort of integrity mechanism in place<br />
in the vast majority of the desktop machines<br />
in the world, nobody would know if any data<br />
that traversed vulnerable machines could be<br />
trusted. This threat could be very effectively<br />
addressed with a combination of a kernel-level<br />
mandatory access control (MAC)[11] protection<br />
profile and a filesystem that provides integrity<br />
and auditing capabilities. Such a combination<br />
would be resistant to damage done by<br />
a root compromise, especially if aided by a<br />
Trusted Platform Module (TPM)[13] using attestation.<br />
1 A high-profile example of an attempt to do this occurred<br />
with the <strong>Linux</strong> kernel last year. Fortunately, the<br />
source code management process used by the kernel developers<br />
allowed them to catch the attempted insertion<br />
of the trojan code before it made it into the actual kernel.
<strong>One</strong> can approach filesystem integrity from<br />
two angles. <strong>The</strong> first is to have strong authentication<br />
and authorization mechanisms in<br />
place that employ sufficiently flexible policy<br />
languages. <strong>The</strong> second is to have an auditing<br />
mechanism, to detect unauthorized attempts at<br />
modifying the contents of a filesystem.<br />
4.1.1 Authentication and Authorization<br />
<strong>The</strong> filesystem must contain support for the<br />
kernel’s security structure, which requires<br />
stateful security attributes on each file. Most<br />
GNU/<strong>Linux</strong> applications today use PAM[10]<br />
(see Section 4.1.2 below) for authentication<br />
and process credentials to represent their authorization;<br />
policy language is limited to<br />
what can be expressed using the file owner<br />
and group, along with the owner/group/world<br />
read/write/execute attributes of the file. <strong>The</strong><br />
administrator and the current owner have the<br />
authority to set the owner of the file or the<br />
read/write/execute policies for that file. In<br />
many filesystems, files may also contain additional<br />
security flags, such as an immutable or<br />
append-only flag.<br />
Posix Access Control Lists (ACL’s)[6] provide<br />
for more stringent delegations of access authority<br />
on a per-file basis. In an ACL, individual<br />
read/write/execute permissions can be assigned<br />
to the owner, the owning group, individual<br />
users, or groups. Masks can also be applied<br />
that indicate the maximum effective permissions<br />
for a class.<br />
For those who require even more flexible access<br />
control, SE <strong>Linux</strong>[15] uses a powerful<br />
policy language that can express a wide variety<br />
of access control policies for files and<br />
filesystem operations. In fact, <strong>Linux</strong> Security<br />
Module (LSM)[14] hooks (see Section 4.1.3<br />
below) exist for most of the security-relevant<br />
filesystem operations, which makes it easier to<br />
implement custom filesystem-agnostic security<br />
models. Authentication and authorization are<br />
pretty well covered with a combination of existing<br />
filesystem, kernel, and user-space solutions<br />
that are part of most GNU/<strong>Linux</strong> distributions.<br />
Many distributions could, however, do a<br />
better job of aiding both the administrator and<br />
the user in understanding and using all the tools<br />
that they have available to them.<br />
Policies that safeguard sensitive data should include<br />
timeouts, whereby the user must periodically<br />
re-authenticate in order to continue to<br />
access the data. In the event that the authorized<br />
users neglect to lock down the machine<br />
before leaving work for the day, timeouts help<br />
to keep the custodial staff from accessing the<br />
data when they come in at night to clean the<br />
office. As usual, this must be implemented in<br />
such a way as to be unobtrusive to the user. If a<br />
user finds a security mechanism overly imposing<br />
or inconvenient, he will usually disable or<br />
circumvent it.<br />
4.1.2 PAM<br />
Pluggable Authentication Modules (PAM)[10]<br />
implement authentication-related security policies.<br />
PAM offers discretionary access control<br />
(DAC)[12]; applications must defer to PAM in<br />
order to authenticate a user. If the authenticating<br />
PAM function that is called returns an affirmative<br />
answer, then the application can use<br />
that response to authorize the action, and vice<br />
versa. <strong>The</strong> exact mechanism that the PAM<br />
function uses to evaluate the authentication is<br />
dependent on the module called. 2<br />
In the case of filesystem security and encryption,<br />
PAM can be employed to obtain and forward<br />
keys to a filesystem encryption layer in<br />
kernel space. This would allow seamless integration<br />
with any key retrieval mechanism that<br />
can be coded as a Pluggable Authentication<br />
Module.<br />
2 This is parameterizable in the configuration files<br />
found under /etc/pam.d/<br />
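A rough sketch of that idea follows, assuming the Linux-PAM module interface and, purely as one possible hand-off path, the kernel key retention service (add_key(2)); the key description "cryptfs:user" and the direct reuse of the authentication token as key material are illustrative assumptions only.<br />
/* Sketch of a PAM module that captures the authentication token and<br />
 * hands it to the kernel, where a filesystem encryption layer could<br />
 * later look it up.  A real module would derive a proper key instead<br />
 * of storing the raw token. */<br />
#define PAM_SM_AUTH<br />
#include &lt;string.h&gt;<br />
#include &lt;keyutils.h&gt;<br />
#include &lt;security/pam_modules.h&gt;<br />
<br />
PAM_EXTERN int pam_sm_authenticate(pam_handle_t *pamh, int flags,<br />
                                   int argc, const char **argv)<br />
{<br />
    const void *token = NULL;<br />
<br />
    /* The password was already collected by an earlier module in the stack. */<br />
    if (pam_get_item(pamh, PAM_AUTHTOK, &amp;token) != PAM_SUCCESS || !token)<br />
        return PAM_AUTHINFO_UNAVAIL;<br />
<br />
    /* Place it on the session keyring under a name the filesystem<br />
     * layer is assumed to know ("cryptfs:user" is made up here). */<br />
    if (add_key("user", "cryptfs:user", token, strlen(token),<br />
                KEY_SPEC_SESSION_KEYRING) &lt; 0)<br />
        return PAM_SYSTEM_ERR;<br />
<br />
    return PAM_SUCCESS;<br />
}<br />
<br />
PAM_EXTERN int pam_sm_setcred(pam_handle_t *pamh, int flags,<br />
                              int argc, const char **argv)<br />
{<br />
    return PAM_SUCCESS;<br />
}<br />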
4.1.3 LSM<br />
<strong>Linux</strong> Security Modules (LSM) can provide<br />
customized security models. <strong>One</strong> possible use<br />
of LSM is to allow decryption of certain files<br />
only when a physical device is connected to the<br />
machine. This could be, for example, a USB<br />
keychain device, a Smartcard, or an RFID device.<br />
Some devices of these classes can also be<br />
used to house the encryption keys (retrievable<br />
via PAM, as previously discussed).<br />
4.1.4 Auditing<br />
<strong>The</strong> second angle to filesystem integrity is auditing.<br />
Auditing should only fill in where authentication<br />
and authorization mechanisms fall<br />
short. In a utopian world, where security systems<br />
are perfect and trusted people always act<br />
trustworthily, auditing does not have much of<br />
a use. In reality, code that implements security<br />
has defects and vulnerabilities. Passwords can<br />
be compromised, and authorized people can<br />
act in an untrustworthy manner. Auditing can<br />
involve keeping a log of all changes made to<br />
the attributes of the file or to the file data itself.<br />
It can also involve taking snapshots of the attributes<br />
and/or contents of the file and comparing<br />
the current state of the file with what was<br />
recorded in a prior snapshot.<br />
Intrusion detection systems (IDS), such as<br />
Tripwire[16], AIDE[17], or Samhain[18], perform<br />
auditing functions. As an example, Tripwire<br />
periodically scans the contents of the<br />
filesystem, checking file attributes, such as the<br />
size, the modification time, and the cryptographic<br />
hash of each file. If any attributes for<br />
the files being checked are found to be altered,<br />
Tripwire will report it. This approach can work<br />
fairly well in cases where the files are not expected<br />
to change very often, as is the case with<br />
most system scripts, shared libraries, executables,<br />
or configuration files. However, care must<br />
be taken to assure that the attacker cannot also<br />
modify Tripwire’s database when he modifies<br />
a system file; the integrity of the IDS system<br />
itself must also be assured.<br />
In cases where a file changes often, such as<br />
a database file or a spreadsheet file in an active<br />
project, we see a need for a more dynamic<br />
auditing solution, one that is perhaps<br />
more closely integrated with the filesystem<br />
itself. In many cases, the simple fact that<br />
the file has changed does not imply a security<br />
violation. We must also know who made<br />
the change. More robust security requirements<br />
also demand that we know what parts<br />
of the file were changed and when the changes<br />
were made. <strong>One</strong> could even imagine scenarios<br />
where the context of the change must also be<br />
taken into consideration (i.e., who was logged<br />
in, which processes were running, or what network<br />
activity was taking place at the time the<br />
change was made).<br />
File integrity, particularly in the area of auditing,<br />
is perhaps the security aspect of <strong>Linux</strong><br />
filesystems that could use the most improvement.<br />
Most efforts in secure filesystem development<br />
have focused on confidentiality more<br />
so than integrity, and integrity has been relegated<br />
to the domain of userland utilities that<br />
must periodically scan the entire filesystem.<br />
Sometimes, just knowing that a file has been<br />
changed is insufficient. Administrators would<br />
like to know exactly how the attacker made<br />
the changes and under what circumstances they<br />
were made.<br />
Cryptographic hashes are often used. <strong>The</strong>se<br />
can detect unauthorized circumvention of the<br />
filesystem itself, as long as the attacker forgets
(or is unable) to update the hashes when making<br />
unauthorized changes to the files. Some<br />
auditing solutions, such as the <strong>Linux</strong> Auditing<br />
System (LAuS) 3 that is part of SuSE <strong>Linux</strong><br />
Enterprise Server, can track system calls that<br />
affect the filesystem. Another recent addition<br />
to the 2.6 <strong>Linux</strong> kernel is the Light-weight<br />
Auditing Framework written by Rik Faith[28].<br />
<strong>The</strong>se are implemented independently of the<br />
filesystem itself, and the level of detail in the<br />
records is largely limited to the system call parameters<br />
and return codes. It is advisable that<br />
you keep your log files on a separate machine<br />
from the one being audited, since the attacker<br />
could modify the audit logs themselves once<br />
he has compromised the machine’s security.<br />
4.1.5 Improvements on Integrity<br />
Extended Attributes provide for a convenient<br />
way to attach metadata relating to a file to the<br />
file itself. On the premise that possession of<br />
a secret equates to authentication, every time<br />
an authenticated subject makes an authorized<br />
write to a file, a hash over the concatenation of<br />
that secret to the file contents (keyed hashing;<br />
HMAC is one popular standard) can be written<br />
as an Extended Attribute on that file. Since<br />
this action would be performed on the filesystem<br />
level, the user would not have to conscientiously<br />
re-run userspace tools to perform such<br />
an operation every time he wants to generate<br />
an integrity verifier on the file.<br />
This is an expensive operation to perform over<br />
large files, and so it would be a good idea to<br />
define extent sizes over which keyed hashes are<br />
formed, with the Extended Attributes including<br />
extent descriptors along with the keyed hashes.<br />
That way, a small change in the middle of a<br />
large file would only require the keyed hash<br />
to be re-generated over the extent in which the<br />
change occurs. A keyed hash over the sequential<br />
set of the extent hashes would also keep an<br />
attacker from swapping around extents undetected.<br />
3 Note that LAuS is being covered in more detail in<br />
the 2004 Ottawa <strong>Linux</strong> Symposium by Doc Shankar,<br />
Emily Ratliff, and Olaf Kirch as part of their presentation<br />
regarding CAPP/EAL3+ Certification.<br />
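A minimal userspace sketch of that scheme is shown below, assuming OpenSSL's HMAC() and the setxattr(2) interface; the attribute name "user.hmac.extents", the fixed 4 KiB extent size, and the placeholder key are arbitrary choices for illustration.<br />
/* Compute an HMAC-SHA1 per 4 KiB extent of a file and store the<br />
 * concatenated digests as an Extended Attribute.  Key handling and<br />
 * error paths are simplified. */<br />
#include &lt;stdio.h&gt;<br />
#include &lt;fcntl.h&gt;<br />
#include &lt;unistd.h&gt;<br />
#include &lt;sys/xattr.h&gt;<br />
#include &lt;openssl/evp.h&gt;<br />
#include &lt;openssl/hmac.h&gt;<br />
<br />
#define EXTENT_SIZE 4096<br />
#define DIGEST_LEN  20          /* SHA-1 */<br />
<br />
int main(int argc, char **argv)<br />
{<br />
    if (argc != 2) { fprintf(stderr, "usage: %s &lt;file&gt;\n", argv[0]); return 1; }<br />
<br />
    const unsigned char secret[] = "demo-secret";   /* placeholder key */<br />
    int fd = open(argv[1], O_RDONLY);<br />
    if (fd &lt; 0) { perror("open"); return 1; }<br />
<br />
    unsigned char buf[EXTENT_SIZE], digests[1024 * DIGEST_LEN];<br />
    size_t used = 0;<br />
    ssize_t len;<br />
<br />
    /* One keyed hash per extent, so a change in the middle of a large<br />
     * file only forces one extent's hash to be recomputed. */<br />
    while ((len = read(fd, buf, sizeof(buf))) &gt; 0 &amp;&amp;<br />
           used + DIGEST_LEN &lt;= sizeof(digests)) {<br />
        unsigned int dlen = 0;<br />
        HMAC(EVP_sha1(), secret, sizeof(secret) - 1, buf, (size_t)len,<br />
             digests + used, &amp;dlen);<br />
        used += dlen;<br />
    }<br />
<br />
    if (setxattr(argv[1], "user.hmac.extents", digests, used, 0) &lt; 0)<br />
        perror("setxattr");<br />
    close(fd);<br />
    return 0;<br />
}<br />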
4.2 File Confidentiality<br />
Confidentiality means that only authorized<br />
users can read the contents of a file. Sometimes<br />
the names of the files themselves or a directory<br />
structure can be sensitive. In other cases, the<br />
sizes of the files or the modification times can<br />
betray more information than one might want<br />
to be known. Even the security policies protecting<br />
the files can reveal sensitive information.<br />
For example, “Only employees of Novell<br />
and SuSE can read this file” would imply that<br />
Novell and SuSE are collaborating on something,<br />
and neither of them may want this fact<br />
to be public knowledge as of yet. Many interesting<br />
protocols have been developed that can<br />
address these sorts of issues; some of them are<br />
easier to implement than others.<br />
When approaching the question of confidentiality,<br />
we assume that the block device that<br />
contains the file is vulnerable to physical compromise.<br />
For example, a laptop that contains<br />
sensitive material might be lost, or a database<br />
server might be stolen in a burglary. In either<br />
event, the data on the hard drive must not be<br />
readable by an unauthorized individual. If any<br />
individual must be authenticated before he is<br />
able to access the data, then the data is protected<br />
against unauthorized access.<br />
Surprisingly, many users surrender their own<br />
data’s confidentiality (and more often than not<br />
they do so unwittingly). It has been my personal<br />
observation that most people do not fully<br />
understand the lack of confidentiality afforded<br />
their data when they send it over the Internet.<br />
To compound this problem, comprehend-
ing and even using most encryption tools takes<br />
considerable time and effort on the part of most<br />
users. If sensitive files could be encrypted by<br />
default, only to be decrypted by those authorized<br />
at the time of access, then the user would<br />
not have to expend so much effort toward protecting<br />
the data’s confidentiality.<br />
By putting the encryption at the filesystem<br />
layer, this model becomes possible without any<br />
modifications to the applications or libraries.<br />
A policy at that layer can dictate that certain<br />
processes, such as the mail client, are to receive<br />
the encrypted version any files that are<br />
read from disk.<br />
4.2.1 Encryption<br />
File confidentiality is most commonly accomplished<br />
through encryption. For performance<br />
reasons, secure filesystems use symmetric key<br />
cryptography, like AES or Triple-DES, although<br />
an asymmetric public/private keypair<br />
may be used to encrypt the symmetric key in<br />
some key management schemes. This hybrid<br />
approach is in common use through SSL and<br />
PGP encryption protocols.<br />
<strong>One</strong> of our proposals to extend Cryptfs is to<br />
mirror the techniques used in GnuPG encryption.<br />
If the symmetric key that protects the contents<br />
of a file is encrypted with the public key<br />
of the intended recipient of the file and stored<br />
as an Extended Attribute of the file, then that<br />
file can be transmitted in multiple ways (e.g.,<br />
physical device such as removable storage); as<br />
long as the Extended Attributes of the file are<br />
preserved across filesystem transfers, then the<br />
recipient with the corresponding private key<br />
has all the information that his Cryptfs layer<br />
needs to transparently decrypt the contents of<br />
the file.<br />
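A userspace sketch of that key handling follows; it is not the Cryptfs implementation itself, and it assumes OpenSSL's EVP_PKEY envelope routines, a recipient public key in PEM form, and an arbitrary attribute name "user.crypt.wrapped_key".<br />
/* Wrap a per-file symmetric key with the recipient's public key and<br />
 * store the wrapped key as an Extended Attribute of the file, so the<br />
 * key travels with the file.  Error handling trimmed. */<br />
#include &lt;stdio.h&gt;<br />
#include &lt;sys/xattr.h&gt;<br />
#include &lt;openssl/evp.h&gt;<br />
#include &lt;openssl/pem.h&gt;<br />
#include &lt;openssl/rand.h&gt;<br />
<br />
int main(int argc, char **argv)<br />
{<br />
    if (argc != 3) {<br />
        fprintf(stderr, "usage: %s &lt;file&gt; &lt;recipient-pubkey.pem&gt;\n", argv[0]);<br />
        return 1;<br />
    }<br />
<br />
    unsigned char file_key[16];                 /* per-file AES-128 key */<br />
    RAND_bytes(file_key, sizeof(file_key));     /* the file contents would be<br />
                                                   encrypted with this key */<br />
<br />
    FILE *fp = fopen(argv[2], "r");<br />
    EVP_PKEY *pub = PEM_read_PUBKEY(fp, NULL, NULL, NULL);<br />
    fclose(fp);<br />
<br />
    /* Encrypt (wrap) the symmetric key with the recipient's public key. */<br />
    EVP_PKEY_CTX *ctx = EVP_PKEY_CTX_new(pub, NULL);<br />
    EVP_PKEY_encrypt_init(ctx);<br />
<br />
    unsigned char wrapped[1024];<br />
    size_t wrapped_len = sizeof(wrapped);<br />
    EVP_PKEY_encrypt(ctx, wrapped, &amp;wrapped_len, file_key, sizeof(file_key));<br />
<br />
    /* Attach the wrapped key to the file; any transfer that preserves<br />
     * Extended Attributes carries it along. */<br />
    if (setxattr(argv[1], "user.crypt.wrapped_key", wrapped, wrapped_len, 0) &lt; 0)<br />
        perror("setxattr");<br />
<br />
    EVP_PKEY_CTX_free(ctx);<br />
    EVP_PKEY_free(pub);<br />
    return 0;<br />
}<br />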
4.2.2 Key Management<br />
Key management will make or break a cryptographic<br />
filesystem.[5] If the key can be easily<br />
compromised, then even the strongest cipher<br />
will provide weak protection. If your<br />
key is accessible in an unencrypted file or in<br />
an unprotected region of memory, or if it is<br />
ever transmitted over the network in the clear,<br />
a rogue user can capture that key and use<br />
it later. Most passwords have poor entropy,<br />
which means that an attacker can have pretty<br />
good success with a brute force attack against<br />
the password. Thus the weakest link in the<br />
chain for password-based encryption is usually<br />
the password itself. <strong>The</strong> Cryptographic<br />
Filesystem (CFS)[22] mandates that the user<br />
choose a password with a length of at least 16<br />
characters. 4<br />
Ideally, the key would be kept in password-encrypted<br />
form on a removable device (like a<br />
USB keychain drive) that is stored separately<br />
from the files that the key is used to encrypt.<br />
That way, an attacker would have to both compromise<br />
the password and gain physical access<br />
to the removable device before he could decrypt<br />
your files.<br />
Filesystem encryption is one of the most exciting<br />
applications for the Trusted Computing<br />
Platform. Given that the attacker has physical<br />
access to a machine with a Trusted Platform<br />
Module, it is significantly more difficult<br />
to compromise the key. By using secret sharing<br />
(otherwise known as key splitting)[4], the actual<br />
key used to decrypt a file on the filesystem<br />
can be derived from both the user’s key and the<br />
machine’s key (as contained in the TPM). In<br />
order to decrypt the files, an attacker must not<br />
only compromise the user key, but he must also<br />
have access to the machine on which the TPM<br />
chip is installed. This “binds” the encrypted<br />
files to the machine. This is especially useful<br />
for protecting files on removable backup media.<br />
4 <strong>The</strong> subject of secure password selection, although<br />
an important one, is beyond the scope of this<br />
article. Recommended reading on this subject is at<br />
http://www.alw.nih.gov/Security/Docs/<br />
passwd.html.<br />
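One simple (2,2) form of such key splitting combines a user-held share and a machine-held share with XOR, so that neither share alone reveals the file key. The sketch below is a self-contained illustration of the idea, not how any particular TPM integration actually derives its keys.<br />
/* (2,2) key splitting by XOR: the file key is recoverable only when<br />
 * both the user share and the machine (e.g. TPM-held) share are present. */<br />
#include &lt;stdio.h&gt;<br />
#include &lt;string.h&gt;<br />
#include &lt;openssl/rand.h&gt;<br />
<br />
#define KEY_LEN 16<br />
<br />
static void xor_combine(const unsigned char *a, const unsigned char *b,<br />
                        unsigned char *out)<br />
{<br />
    for (int i = 0; i &lt; KEY_LEN; i++)<br />
        out[i] = a[i] ^ b[i];<br />
}<br />
<br />
int main(void)<br />
{<br />
    unsigned char file_key[KEY_LEN], user_share[KEY_LEN], machine_share[KEY_LEN];<br />
<br />
    RAND_bytes(file_key, KEY_LEN);       /* the key that encrypts the file */<br />
    RAND_bytes(user_share, KEY_LEN);     /* random share given to the user */<br />
    xor_combine(file_key, user_share, machine_share);  /* share kept by the machine */<br />
<br />
    /* Later: both shares are required to reconstruct the file key. */<br />
    unsigned char recovered[KEY_LEN];<br />
    xor_combine(user_share, machine_share, recovered);<br />
<br />
    printf("shares %s the original key\n",<br />
           memcmp(recovered, file_key, KEY_LEN) == 0 ? "reproduce" : "do not reproduce");<br />
    return 0;<br />
}<br />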
4.2.3 Cryptanalysis<br />
All block ciphers and most stream ciphers are,<br />
to various degrees, vulnerable to successful<br />
cryptanalysis. If a cipher is used improperly,<br />
then it may become even easier to discover the<br />
plaintext and/or the key. For example, with<br />
certain ciphers operating in certain modes, an<br />
attacker could discover information that aids<br />
in cryptanalysis by getting the filesystem to<br />
re-encrypt an already encrypted block of data.<br />
Other times, a cryptanalyst can deduce information<br />
about the type of data in the encrypted<br />
file when that data has predictable segments of<br />
data, like a common header or footer (thus allowing<br />
for a known-plaintext attack).<br />
4.2.4 Cipher Modes<br />
A block encryption mode that is resistant to<br />
cryptanalysis can involve dependencies among<br />
chains of bytes or blocks of data. Cipher-block chaining<br />
(CBC) mode, for example, provides<br />
adequate encryption in many circumstances.<br />
In CBC mode, a change to one block<br />
of data will require that all subsequent blocks<br />
of data be re-encrypted. <strong>One</strong> can see how this<br />
would impact performance for large files, as a<br />
modification to data near the beginning of the<br />
file would require that all subsequent blocks be<br />
read, decrypted, re-encrypted, and written out<br />
again.<br />
This particular inefficiency can be effectively<br />
addressed by defining chaining extents. By<br />
limiting regions of the file that encompass<br />
chained blocks, it is feasible to decrypt and reencrypt<br />
the smaller segments. For example, if<br />
the block size for a cipher is 64 bits (8 bytes)<br />
and the block size, which is (we assume) the<br />
minimum unit of data that the block device<br />
driver can transfer at a time (512 bytes) then<br />
one could limit the number of blocks in any extent<br />
to 64 blocks. Depending on the plaintext<br />
(and other factors), this may be too few to effectively<br />
counter cryptanalysis, and so the extent<br />
size could be set to a small multiple of the<br />
page size without severely impacting overall<br />
performance. <strong>The</strong> optimal extent size largely<br />
depends on the access patterns and data patterns<br />
for the file in question; we plan on benchmarking<br />
against varying extent lengths under<br />
varying access patterns.<br />
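The sketch below shows the core of such extent-limited CBC encryption using OpenSSL's EVP interface; the 4 KiB extent size, the way the per-extent IV is derived from the extent index, and the hard-coded key are illustrative assumptions rather than the parameters actually proposed above.<br />
/* Encrypt one extent of a file with AES-128-CBC so that a change inside<br />
 * the extent only forces that extent to be re-encrypted. */<br />
#include &lt;stdio.h&gt;<br />
#include &lt;string.h&gt;<br />
#include &lt;stdint.h&gt;<br />
#include &lt;openssl/evp.h&gt;<br />
<br />
#define EXTENT_SIZE 4096        /* bytes per chaining extent (illustrative) */<br />
<br />
/* Derive a deterministic 16-byte IV from the extent index (illustrative;<br />
 * a real design would use something stronger, e.g. an encrypted counter). */<br />
static void extent_iv(uint64_t extent_idx, unsigned char iv[16])<br />
{<br />
    memset(iv, 0, 16);<br />
    memcpy(iv, &amp;extent_idx, sizeof(extent_idx));<br />
}<br />
<br />
static int encrypt_extent(const unsigned char key[16], uint64_t extent_idx,<br />
                          const unsigned char *plain, unsigned char *cipher)<br />
{<br />
    unsigned char iv[16];<br />
    int outl = 0, tmpl = 0;<br />
<br />
    extent_iv(extent_idx, iv);<br />
<br />
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();<br />
    EVP_EncryptInit_ex(ctx, EVP_aes_128_cbc(), NULL, key, iv);<br />
    EVP_CIPHER_CTX_set_padding(ctx, 0);      /* extent is block-aligned */<br />
    EVP_EncryptUpdate(ctx, cipher, &amp;outl, plain, EXTENT_SIZE);<br />
    EVP_EncryptFinal_ex(ctx, cipher + outl, &amp;tmpl);<br />
    EVP_CIPHER_CTX_free(ctx);<br />
    return outl + tmpl;<br />
}<br />
<br />
int main(void)<br />
{<br />
    static const unsigned char key[16] = "0123456789abcdef";<br />
    static unsigned char plain[EXTENT_SIZE], cipher[EXTENT_SIZE];<br />
<br />
    memset(plain, 'A', sizeof(plain));<br />
    int n = encrypt_extent(key, 7, plain, cipher);   /* re-encrypt extent #7 only */<br />
    printf("encrypted %d bytes of extent 7\n", n);<br />
    return 0;<br />
}<br />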
4.2.5 Key Escrow<br />
<strong>The</strong> proverbial question, “What if the sysadmin<br />
gets hit by a bus?” is one that no organization<br />
should ever stop asking. In fact, sometimes<br />
no one person should alone have independent<br />
access to the sensitive data; multiple<br />
passwords may be required before the data is<br />
decrypted. Shareholders should demand that<br />
no single person in the company have full access<br />
to certain valuable data, in order to mitigate<br />
the damage to the company that could be<br />
done by a single corrupt administrator or executive.<br />
Methods for secret sharing can be employed<br />
to assure that multiple keys be required<br />
for file access, and (m,n)-threshold schemes [4]<br />
can ensure that the data is retrievable, even if a<br />
certain number of the keys are lost. Secret sharing<br />
would be easily implementable as part of<br />
any of the existing cryptographic filesystems.<br />
4.3 File Resilience<br />
<strong>The</strong> loss of a file can be just as devastating<br />
as the compromise of a file. <strong>The</strong>re are many
well-established solutions to performing backups<br />
of your filesystem, but some cryptographic<br />
filesystems preclude the ability to efficiently<br />
and/or securely use them. Backup tapes tend<br />
to be easier to steal than secure computer systems<br />
are, and if unencrypted versions of secure<br />
files exist on the tapes, that constitutes an<br />
often-overlooked vulnerability.<br />
<strong>The</strong> <strong>Linux</strong> 2.6 kernel cryptoloop device 5<br />
takes an all-or-nothing approach. Most<br />
backup utilities must be given free rein on<br />
the unencrypted directory listings in order to<br />
perform incremental backups. Most other<br />
encrypted filesystems keep sets of encrypted<br />
files in directories in the underlying filesystem,<br />
which makes incremental backups possible<br />
without giving the backup tools access to<br />
the unencrypted content of the files.<br />
<strong>The</strong> backup utilities must, however, maintain<br />
backups of the metadata in the directories containing<br />
the encrypted files in addition to the<br />
files themselves. On the other hand, when the<br />
filesystem takes the approach of storing the<br />
cryptographic metadata as Extended Attributes<br />
for each file, then backup utilities need only<br />
worry about copying just the file in question to<br />
the backup medium (preserving the Extended<br />
Attributes, of course).<br />
4.4 Advantages of FS-Level, EA-Guided Encryption<br />
Most encrypted filesystem solutions either operate<br />
on the entire block device or operate on<br />
entire directories. <strong>The</strong>re are several advantages<br />
to implementing filesystem encryption at the<br />
filesystem level and storing encryption metadata<br />
in the Extended Attributes of each file:<br />
• Granularity: Keys can be mapped to individual<br />
files, rather than entire block devices<br />
or entire directories.<br />
5 Note that this is deprecated and is in the process of<br />
being replaced with the Device Mapper crypto target.<br />
• Backup Utilities: Incremental backup<br />
tools can correctly operate without having<br />
to have access to the decrypted content of<br />
the files it is backing up.<br />
• Performance: In most cases, only certain<br />
files need to be encrypted. System<br />
libraries and executables, in general, do<br />
not need to be encrypted. By limiting the<br />
actual encryption and decryption to only<br />
those files that really need it, system resources<br />
will not be taxed as much.<br />
• Transparent Operation: Individual encrypted<br />
files can be easily transfered off of<br />
the block device without any extra transformation,<br />
and others with authorization<br />
will be able to decrypt those files. <strong>The</strong><br />
userspace applications and libraries do not<br />
need to be modified and recompiled to<br />
support this transparency.<br />
Since all the information necessary to decrypt<br />
a file is contained in the Extended Attributes<br />
of the file, it is possible for a user on a machine<br />
that is not running Cryptfs to use userland<br />
utilities to access the contents of the file.<br />
This also applies to other security-related operations,<br />
like verifying keyed hashes. This addresses<br />
compatibility issues with machines that<br />
are not running the encrypted filesystem layer.<br />
5 Survey of <strong>Linux</strong> Encrypted<br />
Filesystems<br />
5.1 Encrypted Loopback Filesystems<br />
5.1.1 Loop-aes<br />
<strong>The</strong> most well-known method of encrypting<br />
a filesystem is to use a loopback en-
crypted filesystem. 6 Loop-aes[20] is part<br />
of the 2.6 <strong>Linux</strong> kernel (CONFIG_BLK_DEV_<br />
CRYPTOLOOP). It performs encryption at the<br />
block device level. With Loop-aes, the administrator<br />
can choose whatever cipher he wishes<br />
to use with the filesystem. <strong>The</strong> mount package<br />
on most popular GNU/<strong>Linux</strong> distributions<br />
contains the losetup utility, which can be used<br />
to set up the encrypted loopback mount (you<br />
can choose whatever cipher that the kernel supports;<br />
we use blowfish in this example):<br />
root# modprobe cryptoloop<br />
root# modprobe blowfish<br />
root# dd if=/dev/urandom of=encrypted.img \<br />
bs=4k count=1000<br />
root# losetup -e blowfish /dev/loop0 \<br />
encrypted.img<br />
root# mkfs.ext3 /dev/loop0<br />
root# mkdir /mnt/unencrypted-view<br />
root# mount /dev/loop0 /mnt/unencrypted-view<br />
<strong>The</strong> loopback encrypted filesystem falls short<br />
in that it is an all-or-nothing solution.<br />
It is impossible for most standard backup utilities<br />
to perform incremental backups on sets<br />
of encrypted files without being given access<br />
to the unencrypted files. In addition, remote<br />
users will need to use IPSec or some other network<br />
encryption layer when accessing the files,<br />
which must be exported from the unencrypted<br />
mount point on the server. Loop-aes is, however,<br />
the best performing encrypted filesystem<br />
that is freely available and integrated with most<br />
GNU/<strong>Linux</strong> distributions. It is an adequate solution<br />
for many who require little more than<br />
basic encryption of their entire filesystems.<br />
5.1.2 BestCrypt<br />
BestCrypt[23] is a non-free product that uses a<br />
loopback approach, similar to Loop-aes.<br />
6 Note that Loop-aes is being deprecated, in favor of<br />
Device Mapping (DM) Crypt, which also does encryption<br />
at the block device layer.<br />
5.1.3 PPDD<br />
PPDD[21] is a block device driver that encrypts<br />
and decrypts data as it goes to and comes<br />
from another block device. It works very much<br />
like Loop-aes; in fact, in the 2.4 kernel, it uses<br />
the loopback device, as Loop-aes does. PPDD<br />
has not been ported to the 2.6 kernel. Loop-aes<br />
takes the same approach, and Loop-aes ships<br />
with the 2.6 kernel itself.<br />
5.2 CFS<br />
<strong>The</strong> Cryptographic Filesystem (CFS)[22] by<br />
Matt Blaze is a well established transparent encrypted<br />
filesystem, originally written for BSD<br />
platforms. CFS is implemented entirely in<br />
userspace and operates similarly to NFS. A<br />
userspace daemon, cfsd, acts as a pseudo-NFS<br />
server, and the kernel makes RPC calls to the<br />
daemon. <strong>The</strong> CFS daemon performs transparent<br />
encryption and decryption when writing<br />
and reading data. Just as NFS can export a<br />
directory from any exportable filesystem, CFS<br />
can do the same, while managing the encryption<br />
on top of that filesystem.<br />
In the background, CFS stores the metadata<br />
necessary to encrypt and decrypt files with<br />
the files being encrypted or decrypted on the<br />
filesystem. If you were to look at those directories<br />
directly, you would see a set of files<br />
with encrypted values for filenames, and there<br />
would be a handful of metadata files mixed in.<br />
When accessed through CFS, those metadata<br />
files are hidden, and the files are transparently<br />
encrypted and decrypted for the user applications<br />
(with the proper credentials) to freely<br />
work with the data.<br />
While CFS is capable of acting as a remote<br />
NFS server, this is not recommended for many<br />
reasons, some of which include performance<br />
and security issues with plaintext passwords<br />
and unencrypted data being transmitted over
the network. You would be better off, from a<br />
security perspective (and perhaps also performance,<br />
depending on the number of clients),<br />
to use a regular NFS server to handle remote<br />
mounts of the encrypted directories, with local<br />
CFS mounts off of the NFS mounts.<br />
Perhaps the most attractive attribute of CFS<br />
is the fact that it does not require any modifications<br />
to the standard <strong>Linux</strong> kernel. <strong>The</strong><br />
source code for CFS is freely obtainable. It is<br />
packaged in the Debian repositories and is also<br />
available in RPM form. Using apt, CFS is perhaps<br />
the easiest encrypted filesystem for a user<br />
to set up and start using:<br />
root# apt-get install cfs<br />
user# cmkdir encrypted-data<br />
user# cattach encrypted-data unencrypted-view<br />
<strong>The</strong> user will be prompted for his password<br />
at the requisite stages. At this point,<br />
anything the user writes to or reads from<br />
/crypt/unencrypted-view will be transparently<br />
encrypted to and decrypted from files in<br />
encrypted-data. Note that any user on the system<br />
can make a new encrypted directory and<br />
attach it. It is not necessary to initialize and<br />
mount an entire block device, as is the case<br />
with Loop-aes.<br />
5.3 TCFS<br />
TCFS[24] is a variation on CFS that includes<br />
secure integrated remote access and file integrity<br />
features. TCFS assumes the client’s<br />
workstation is trusted, and the server cannot<br />
necessarily be trusted. Everything sent to and<br />
from the server is encrypted. Encryption and<br />
decryption take place on the client side.<br />
Note that this behavior can be mimicked with<br />
a CFS mount on top of an NFS mount. However,<br />
because TCFS works within the kernel<br />
(thus requiring a patch) and does not necessitate<br />
two levels of mounting, it is faster than an<br />
NFS+CFS combination.<br />
TCFS is no longer an actively maintained<br />
project. <strong>The</strong> last release was made three years<br />
ago for the 2.0 kernel.<br />
5.4 Cryptfs<br />
As a proof-of-concept for the FiST stackable filesystem framework, Erez Zadok et al. developed Cryptfs[1]. Under Cryptfs, symmetric
keys are associated with groups of files<br />
within a single directory. <strong>The</strong> key is generated<br />
with a password that is entered at the time that<br />
the filesystem is mounted. <strong>The</strong> Cryptfs mount<br />
point provides an unencrypted view of the directory<br />
that contains the encrypted files.<br />
<strong>The</strong> authors of this paper are currently working<br />
on extending Cryptfs to provide seamless<br />
integration into the user’s desktop environment<br />
(see Section 6).<br />
5.5 Userspace Encrypted Filesystems<br />
EncFS[25] utilizes the Filesystem in Userspace<br />
(FUSE) library and kernel module to implement<br />
an encrypted filesystem in userspace.<br />
Like CFS, EncFS encrypts on a per-file basis.<br />
CryptoFS[26] is similar to EncFS, except it<br />
uses the <strong>Linux</strong> Userland Filesystem (LUFS) library<br />
instead of FUSE.<br />
SSHFS[27], like CryptoFS, uses the LUFS kernel<br />
module and userspace daemon. It limits itself<br />
to encrypting the files via SFTP as they are transferred over a network; the files stored on disk are unencrypted. From the user's perspective, all file accesses take place as though they were being performed on any regular filesystem (opens, reads, writes, etc.). SSHFS transfers the files back and forth via SFTP with the file server as these operations occur.
5.6 Reiser4<br />
ReiserFS version 4 (Reiser4)[29], while still in<br />
the development stage, features pluggable security<br />
modules. <strong>The</strong>re are currently proposed<br />
modules for Reiser4 that will perform encryption<br />
and auditing.<br />
5.7 Network Filesystem Security<br />
Much research has taken place in the domain of<br />
networking filesystem security. CIFS, NFSv4,<br />
and other networking filesystems face special<br />
challenges in relation to user identification, access<br />
control, and data secrecy. <strong>The</strong> NFSv4 protocol<br />
definition in RFC 3010 contains descriptions<br />
of security mechanisms in section 3[30].<br />
6 Proposed Extensions to Cryptfs<br />
Our proposal is to place file encryption metadata<br />
into the Extended Attributes (EA’s) of the<br />
file itself. Extended Attributes are a generic<br />
interface for attaching metadata to files. <strong>The</strong><br />
Cryptfs layer will be extended to extract that<br />
information and to use the information to direct<br />
the encrypting and decrypting of the contents<br />
of the file. In the event that the filesystem<br />
does not support Extended Attributes, another<br />
filesystem layer can provide that functionality.<br />
<strong>The</strong> stackable framework effectively<br />
allows Cryptfs to operate on top of any filesystem.<br />
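As a rough illustration of the mechanism, the sketch below stores and reads back a piece of hypothetical encryption metadata as an EA using the standard Linux xattr syscalls. The attribute name user.cryptfs.metadata and its contents are placeholders, not the format the Cryptfs layer would actually use, and on ext2/ext3 the filesystem must be mounted with EA support (e.g. the user_xattr option) for user.* attributes to work.

/* Sketch: attach hypothetical encryption metadata to a file as an
 * Extended Attribute using the standard Linux xattr syscalls.
 * The attribute name "user.cryptfs.metadata" is illustrative only. */
#include <stdio.h>
#include <string.h>
#include <sys/xattr.h>

int main(int argc, char *argv[])
{
        const char *path = argc > 1 ? argv[1] : "secret.dat";
        const char *meta = "cipher=aes;keylen=128";   /* placeholder metadata */
        char buf[256];
        ssize_t len;

        /* Store the metadata in the file's EA set. */
        if (setxattr(path, "user.cryptfs.metadata", meta, strlen(meta), 0) != 0) {
                perror("setxattr");
                return 1;
        }

        /* Read it back, as a filesystem layer would when opening the file. */
        len = getxattr(path, "user.cryptfs.metadata", buf, sizeof(buf) - 1);
        if (len < 0) {
                perror("getxattr");
                return 1;
        }
        buf[len] = '\0';
        printf("%s: %s\n", path, buf);
        return 0;
}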
<strong>The</strong> encryption process is very similar to that of<br />
GnuPG and other public key cryptography programs<br />
that use a hybrid approach to encrypting<br />
data. By integrating the process into the<br />
filesystem, we can achieve a greater degree of<br />
transparency, without requiring any changes to<br />
userspace applications or libraries.<br />
Under our proposed design, when a new file is created as an encrypted file, the Cryptfs layer generates a new symmetric key K_s for the encryption of the data that will be written. File creation policy enacted by Cryptfs can be dictated by directory attributes or globally defined behavior. The owner of the file is automatically authorized to access the file, and so the symmetric key is encrypted with the public key of the owner of the file, K_u, which was passed into the Cryptfs layer at the time that the user logged in by a Pluggable Authentication Module linked against libcryptfs. The encrypted symmetric key is then added to the Extended Attribute set of the file:

{K_s}K_u
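A minimal sketch of how one such key record in the EA set might be laid out is shown below. The field names and sizes are assumptions made for illustration; the text above only specifies that the symmetric key K_s, encrypted with the public key K_u, is stored.

/* Sketch of one per-user key record stored in the file's EA set.
 * The layout and field names are assumptions for illustration; the
 * text only specifies that {K_s} encrypted with K_u is stored. */
#include <stdint.h>

#define CRYPTFS_KEYID_LEN   8      /* hypothetical: GnuPG-style key ID */
#define CRYPTFS_MAX_ENC_KEY 512    /* hypothetical: room for the wrapped key */

struct cryptfs_key_record {
        uint8_t  keyid[CRYPTFS_KEYID_LEN];      /* identifies the public key (K_u or K_a) */
        uint16_t cipher;                        /* symmetric cipher used for file data */
        uint16_t enc_key_len;                   /* length of the wrapped key below */
        uint8_t  enc_key[CRYPTFS_MAX_ENC_KEY];  /* {K_s} encrypted with that public key */
};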
Suppose that the user at this point wants to grant Alice access to the file. Alice's public key, K_a, is in the user's GnuPG keyring. He can run a utility that selects Alice's key, extracts it from the GnuPG keyring, and passes it to the Cryptfs layer, with instructions to add Alice as an authorized user for the file. The new key list in the Extended Attribute set for the file then contains two copies of the symmetric key, encrypted with different public keys:

{K_s}K_u
{K_s}K_a
Note that this is not an access control directive;<br />
it is rather a confidentiality enforcement mechanism<br />
that extends beyond the local machine’s<br />
access control. Without either the user’s or Alice’s<br />
private key, no entity will be able to access<br />
the decrypted contents of the file. <strong>The</strong> machine<br />
that harbors such keys will enact its own access<br />
control over the decrypted file, based on<br />
standard UNIX file permissions and/or ACL’s.<br />
When that file is copied to removable media or attached to an email, as long as the Extended Attributes are preserved, Alice will have all the information that she needs in order to retrieve the symmetric key for the file and decrypt it.
[Figure 1: Overview of proposed extended Cryptfs architecture. The original diagram shows the userspace side (login through a PAM module, GNOME/KDE and other utilities, libcryptfs, and key sources such as a USB keychain device, smartcard, TPM, or GnuPG keyring) passing keys and file encryption attributes into the kernel Cryptfs layer, which sits above the filesystem, calls the kernel crypto API, and keeps a keystore and per-file security attributes.]
If Alice is also running Cryptfs, when
she launches an application that accesses the<br />
file, the decryption process is entirely transparent<br />
to her, since her Cryptfs layer received<br />
her private key from PAM at the time that she<br />
logged in.<br />
If the user requires the ability to encrypt a file<br />
for access by a group of users, then the user<br />
can associate sets of public keys with groups<br />
and refer to the groups when granting access.<br />
<strong>The</strong> userspace application that links against<br />
libcryptfs can then pass in the public keys to<br />
Cryptfs for each member of the group and instruct<br />
Cryptfs to add the associated key record<br />
to the Extended Attributes. Thus no special<br />
support for groups is needed within the Cryptfs<br />
layer itself.<br />
6.1 <strong>Kernel</strong>-level Changes<br />
No modifications to the 2.6 kernel itself are<br />
necessary to support the stackable Cryptfs<br />
layer. <strong>The</strong> Cryptfs module’s logical divisions<br />
include a sysfs interface, a keystore, and<br />
the VFS operation routines that perform the<br />
encryption and the decryption on reads and<br />
writes.<br />
By working with a userspace daemon, it would<br />
be possible for Cryptfs to export public key<br />
cryptographic operations to userspace. In order<br />
to avoid the need for such a daemon while<br />
using public key cryptography, the kernel cryptographic<br />
API must be extended to support it.<br />
6.2 PAM<br />
At login, the user’s public and private keys<br />
need to find their way into the kernel<br />
Cryptfs layer. This can be accomplished by<br />
writing a Pluggable Authentication Module,<br />
pam_cryptfs.so. This module will link against
libcryptfs and will extract keys from the user’s<br />
GnuPG keystore. <strong>The</strong> libcryptfs library will<br />
use the sysfs interface to pass the user’s keys<br />
into the Cryptfs layer.<br />
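A heavily trimmed sketch of what such a module's entry points could look like follows. The PAM entry points and pam_get_user() are the standard PAM interfaces; cryptfs_extract_gpg_key() and cryptfs_push_key() are hypothetical stand-ins for the libcryptfs calls described above.

/* Sketch of the proposed pam_cryptfs.so entry points.  The PAM entry
 * points are standard; the two cryptfs_* helpers are hypothetical
 * placeholders for libcryptfs. */
#define PAM_SM_AUTH
#define PAM_SM_SESSION
#include <security/pam_modules.h>

/* Hypothetical libcryptfs helpers. */
extern int cryptfs_extract_gpg_key(const char *user, void *key, int max);
extern int cryptfs_push_key(const void *key, int len);

PAM_EXTERN int pam_sm_authenticate(pam_handle_t *pamh, int flags,
                                   int argc, const char **argv)
{
        return PAM_SUCCESS;     /* authentication handled elsewhere in the stack */
}

PAM_EXTERN int pam_sm_open_session(pam_handle_t *pamh, int flags,
                                   int argc, const char **argv)
{
        const char *user;
        char key[4096];
        int len;

        if (pam_get_user(pamh, &user, NULL) != PAM_SUCCESS)
                return PAM_SESSION_ERR;

        /* Pull the user's keys out of the GnuPG keystore... */
        len = cryptfs_extract_gpg_key(user, key, sizeof(key));
        if (len <= 0)
                return PAM_SESSION_ERR;

        /* ...and hand them to the kernel Cryptfs layer via sysfs. */
        if (cryptfs_push_key(key, len) != 0)
                return PAM_SESSION_ERR;
        return PAM_SUCCESS;
}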
6.3 libcryptfs<br />
The libcryptfs library works with Cryptfs's sysfs interface. Userspace utilities, such as pam_cryptfs.so, GNOME/KDE, or stand-alone utilities, will link against this library and use it
to communicate with the kernel Cryptfs layer.
[Figure 2: Structure of Cryptfs layer in kernel. The original diagram shows VFS syscalls entering the Cryptfs layer, crypto calls to the kernel crypto API parameterized by file security attributes, Extended Attributes being parsed into the layer's file attribute structure, and symmetric keys retrieved from the keystore, which userspace (libcryptfs) fills with private/public keys through the sysfs interface.]
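As a sketch of that hand-off, the hypothetical cryptfs_push_key() helper used in the PAM example above could be little more than a write to a sysfs file; the path below is assumed, since the text only states that a sysfs interface exists.

/* Sketch of how libcryptfs might pass a user's key to the kernel
 * through sysfs.  The file name below is assumed for illustration. */
#include <stdio.h>

int cryptfs_push_key(const void *key, int len)
{
        FILE *f = fopen("/sys/fs/cryptfs/keystore/add_key", "w");  /* assumed path */

        if (!f)
                return -1;
        if (fwrite(key, 1, len, f) != (size_t)len) {
                fclose(f);
                return -1;
        }
        fclose(f);
        return 0;
}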
6.4 User Interface<br />
Desktop environments such as GNOME or<br />
KDE can link against libcryptfs to provide<br />
users with a convenient interface through<br />
which to work with the files. For example,<br />
by right-clicking on an icon representing the<br />
file and selecting “Security”, the user will be<br />
presented with a window that can be used to<br />
control the encryption status of the file. Such<br />
options will include whether or not the file is<br />
encrypted, which users should be able to encrypt<br />
and decrypt the file (identified by their<br />
public keys from the user’s GnuPG keyring),<br />
what cipher is used, what keylength is used,<br />
an optional password that encrypts the symmetric<br />
key, whether or not to use keyed hashing<br />
over extents of the file for integrity, the<br />
hash algorithm to use, whether accesses to the<br />
file when no key is available should result in<br />
an error or in the encrypted blocks being returned<br />
(perhaps associated with UID’s - good<br />
for backup utilities), and other properties that<br />
are controlled by the Cryptfs layer.<br />
6.5 Example Walkthrough<br />
When a file’s encryption attribute is set, the<br />
first thing that the Cryptfs layer will do is generate a new symmetric key, which will be
used for all encryption and decryption of the<br />
file in question. Any data in that file is then<br />
immediately encrypted with that key. When<br />
using public key-enforced access control, that
key will be encrypted with the process owner's public key and stored as an EA of the file.
When the process owner wishes to allow others<br />
to access the file, he encrypts the symmetric<br />
key with their public keys. From the user's perspective, this can be done by right-clicking
on an icon representing the file, selecting<br />
“Security→Add Authorized User Key”,<br />
and having the user specify the authorized user<br />
while using PAM to retrieve the public key for<br />
that user.<br />
When using password-enforced access control,<br />
the symmetric key is instead encrypted using a<br />
key generated from a password. <strong>The</strong> user can<br />
then share that password with everyone he has authorized to access the file. In either case
(public key-enforced or password-enforced access<br />
control), revocation of access to future<br />
versions of the file will necessitate regeneration<br />
and re-encryption of the symmetric key.<br />
Suppose the encrypted file is then copied to a<br />
removable device and delivered to an authorized<br />
user. When that user logged into his machine,<br />
his private key was retrieved by the key<br />
retrieval Pluggable Authentication Module and<br />
sent to the Cryptfs keystore. When that user<br />
launches any arbitrary application and attempts<br />
to access the encrypted file from the removable<br />
media, Cryptfs retrieves the encrypted symmetric<br />
key correlating with that user’s public<br />
key, uses the authenticated user’s private key<br />
to decrypt the symmetric key, associates that<br />
symmetric key with the file, and then proceeds<br />
to use that symmetric key for reading and writing<br />
the file. This is done in an entirely transparent<br />
manner from the perspective of the user,<br />
and the file maintains its encrypted status on<br />
the removable media throughout the entire process.<br />
No modification to the application or applications<br />
accessing the file are necessary to<br />
implement such functionality.<br />
In the case where a file’s symmetric key is encrypted<br />
with a password, it will be necessary<br />
for the user to launch a daemon that listens for<br />
password queries from the kernel cryptfs layer.<br />
Without such a daemon, the user’s initial attempt<br />
to access the file will be denied, and the<br />
user will have to use a password set utility to<br />
send the password to the cryptfs layer in the<br />
kernel.<br />
6.6 Other Considerations<br />
Sparse files present a challenge to encrypted<br />
filesystems. Under traditional UNIX semantics,<br />
when a user seeks more than a block beyond<br />
the end of a file to write, then that space<br />
is not stored on the block device at all. <strong>The</strong>se<br />
missing blocks are known as “holes.”<br />
When holes are later read, the kernel simply<br />
fills in zeros into the memory without actually<br />
reading the zeros from disk (recall that they<br />
do not exist on the disk at all; the filesystem<br />
“fakes it”). From the point of view of whatever<br />
is asking for the data from the filesystem,<br />
the section of the file being read appears to be<br />
all zeros. This presents a problem when the<br />
file is supposed to be encrypted. Without taking<br />
sparse files into consideration, the encryption<br />
layer will naïvely assume that the zeros being<br />
passed to it from the underlying filesystem<br />
are actually encrypted data, and it will attempt<br />
to decrypt the zeros. Obviously, this will result<br />
in something other than zeros being presented
above the encryption layer, thus violating<br />
UNIX sparse file semantics.<br />
<strong>One</strong> solution to this problem is to abandon the<br />
concept of “holes” altogether at the Cryptfs<br />
layer. Whenever we seek past the end of the<br />
file and write, we can actually encrypt blocks<br />
of zeros and write them out to the underlying<br />
filesystem. While this allows Cryptfs to adhere<br />
to UNIX semantics, it is much less efficient.<br />
<strong>One</strong> possible solution might be to store a<br />
“hole bitmap” as an Extended Attribute of the
file. Each bit would correspond with a block of<br />
the file; a “1” might indicate that the block is a<br />
“hole” and should be zeroed out rather than decrypted,
and a “0” might indicate that the block<br />
should be normally decrypted.<br />
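A sketch of how the read path might consult such a bitmap is shown below; the block size, helper names, and bitmap layout are all assumptions for illustration.

/* Sketch of a read path that consults a per-file "hole bitmap" EA.
 * All names here (hole_bitmap, decrypt_block, ...) are hypothetical. */
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096

static int block_is_hole(const uint8_t *hole_bitmap, unsigned long blk)
{
        return (hole_bitmap[blk / 8] >> (blk % 8)) & 1;
}

/* Fill 'out' with the plaintext of logical block 'blk'. */
void read_block(const uint8_t *hole_bitmap, unsigned long blk,
                const void *lower_data, void *out,
                void (*decrypt_block)(const void *in, void *out))
{
        if (block_is_hole(hole_bitmap, blk))
                memset(out, 0, BLOCK_SIZE);     /* never written: present zeros  */
        else
                decrypt_block(lower_data, out); /* normal case: decrypt the data */
}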
Our proposed extensions to Cryptfs in the near<br />
future do not currently address the issues of directory<br />
structure and file size secrecy. We recognize<br />
that this type of confidentiality is important<br />
to many, and we plan to explore ways<br />
to integrate such features into Cryptfs, possibly<br />
by employing extra filesystem layers to aid in<br />
the process.<br />
Extended Attribute content can also be sensitive.<br />
Technically, only enough information to<br />
retrieve the symmetric decryption key need be<br />
accessible by authorized individuals; all other<br />
attributes can be encrypted with that key, just<br />
as the contents of the file are encrypted.<br />
Processes that are not authorized to access the<br />
decrypted content will either be denied access<br />
to the file or will receive the encrypted content,<br />
depending on how the Cryptfs layer is parameterized.<br />
This behavior permits incremental<br />
backup utilities to function properly, without<br />
requiring access to the unencrypted content<br />
of the files they are backing up.<br />
At some point, we would like to include file integrity<br />
information in the Extended Attributes.<br />
As previously mentioned, this can be accomplished<br />
via sets of keyed hashes over extents<br />
within the file:<br />
H_0 = H{O_0, D_0, K_s}
H_1 = H{O_1, D_1, K_s}
. . .
H_n = H{O_n, D_n, K_s}
H_f = H{H_0, H_1, . . . , H_n, n, s, K_s}
where n is the number of extents in the file, s is the extent size (also contained as another EA), O_i is the offset number i within the file, D_i is the data from offset O_i to O_i + s, K_s is the key that one must possess in order to make authorized changes to the file, and H_f is the hash of the hashes, the number of extents, the extent size, and the secret key, to help detect when an attacker swaps around extents or alters the extent size.
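The following toy illustration computes the per-extent hashes H_i and the summary hash H_f exactly as laid out above, but substitutes a simple FNV-1a mix for the keyed hash; a real implementation would use something like HMAC-SHA1, and every name and size here is illustrative.

/* Toy illustration of per-extent keyed hashes H_i = H{O_i, D_i, K_s} and
 * the summary hash H_f.  FNV-1a stands in for a real keyed hash. */
#include <stddef.h>
#include <stdint.h>

static uint64_t fnv1a(uint64_t h, const void *data, size_t len)
{
        const uint8_t *p = data;
        while (len--) {
                h ^= *p++;
                h *= 0x100000001b3ULL;
        }
        return h;
}

/* Compute hashes[0..n-1] over n extents of size s and return H_f. */
uint64_t hash_extents(const uint8_t *file_data, size_t n, size_t s,
                      const uint8_t *key, size_t keylen, uint64_t *hashes)
{
        uint64_t hf = 0xcbf29ce484222325ULL;
        size_t i;

        for (i = 0; i < n; i++) {
                uint64_t h = 0xcbf29ce484222325ULL;
                uint64_t offset = (uint64_t)i * s;      /* O_i */

                h = fnv1a(h, &offset, sizeof(offset));  /* mix in O_i */
                h = fnv1a(h, file_data + offset, s);    /* mix in D_i */
                h = fnv1a(h, key, keylen);              /* mix in K_s */
                hashes[i] = h;
                hf = fnv1a(hf, &h, sizeof(h));          /* H_f covers every H_i */
        }
        /* H_f also covers n, s, and K_s, as in the formula above. */
        hf = fnv1a(hf, &n, sizeof(n));
        hf = fnv1a(hf, &s, sizeof(s));
        hf = fnv1a(hf, key, keylen);
        return hf;
}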
Keyed hashes prove that whoever modified the<br />
data had access to the shared secret, which is,<br />
in this case, the symmetric key. Digital signatures<br />
can also be incorporated into Cryptfs.<br />
Executables downloaded over the Internet can<br />
often be of questionable origin or integrity. If<br />
you trust the person who signed the executable,<br />
then you can have a higher degree of certainty<br />
that the executable is safe to run if the digital<br />
signature is verifiable. <strong>The</strong> verification of the<br />
digital signature can be dynamically performed<br />
at the time of execution.<br />
As previously mentioned, in addition to the extensions to the Cryptfs stackable layer, this effort requires the development of a cryptfs
library, a set of PAM modules, hooks into<br />
GNOME and KDE, and some utilities for managing<br />
file encryption. Applications that copy<br />
files with Extended Attributes must take steps<br />
to make sure that they preserve the Extended<br />
Attributes. 7<br />
7 Conclusion<br />
<strong>Linux</strong> currently has a comprehensive framework<br />
for managing filesystem security. Standard<br />
file security attributes, process credentials,<br />
ACL, PAM, LSM, Device Mapping (DM)<br />
Crypt, and other features together provide good<br />
security in a contained environment. To extend<br />
access control enforcement over individual<br />
files beyond the local environment, you<br />
must use encryption in a way that can be easily<br />
7 See http://www.suse.de/~agruen/<br />
ea-acl-copy/
applied to individual files. The currently employed process of encrypting and decrypting files, however, is inconvenient and often obstructive.
By integrating the encryption and the decryption<br />
of the individual files into the filesystem<br />
itself, associating encryption metadata with the<br />
individual files, we can extend <strong>Linux</strong> security<br />
to provide seamless encryption-enforced access<br />
control and integrity auditing.<br />
8 Recognitions<br />
We would like to express our appreciation for<br />
the contributions and input on the part of all<br />
those who have laid the groundwork for an effort<br />
toward transparent filesystem encryption.<br />
This includes contributors to FiST and Cryptfs,<br />
GnuPG, PAM, and many others from which<br />
we are basing our development efforts, as well<br />
as several members of the kernel development<br />
community.<br />
9 Legal Statement<br />
This work represents the view of the author and<br />
does not necessarily represent the view of IBM.<br />
IBM and Lotus Notes are registered trademarks<br />
of International Business Machines Corporation<br />
in the United States, other countries,<br />
or both.<br />
Other company, product, and service names<br />
may be trademarks or service marks of others.<br />
References<br />
[1] E. Zadok, L. Badulescu, and A. Shender.<br />
Cryptfs: A stackable vnode level<br />
encryption file system. Technical Report<br />
CUCS-021-98, Computer Science<br />
Department, Columbia University, 1998.<br />
[2] J.S. Heidemann and G.J. Popek. File<br />
system development with stackable layers.<br />
ACM Transactions on Computer Systems,<br />
12(1):58–89, February 1994.<br />
[3] E. Zadok and J. Nieh. FiST: A Language<br />
for Stackable File Systems. Proceedings<br />
of the Annual USENIX Technical<br />
Conference, pp. 55–70, San Diego, June<br />
2000.<br />
[4] S.C. Kothari, Generalized Linear<br />
Threshold Scheme, Advances in<br />
Cryptology: Proceedings of CRYPTO 84,<br />
Springer-Verlag, 1985, pp. 231–241.<br />
[5] Matt Blaze. “Key Management in an<br />
Encrypting File System,” Proc. Summer<br />
’94 USENIX Tech. Conference, Boston,<br />
MA, June 1994.<br />
[6] For more information on Extended<br />
Attributes (EA’s) and Access Control Lists<br />
(ACL’s), see<br />
http://acl.bestbits.at/ or<br />
http://www.suse.de/~agruen/<br />
acl/chapter/fs_acl-en.pdf<br />
[7] For more information on GnuPG, see<br />
http://www.gnupg.org/<br />
[8] For more information on OpenSSL, see<br />
http://www.openssl.org/<br />
[9] For more information on IBM Lotus<br />
Notes, see http://www-306.ibm.<br />
com/software/lotus/. Information<br />
on Notes security can be obtained from<br />
http://www-10.lotus.com/ldd/<br />
today.nsf/f01245ebfc115aaf<br />
8525661a006b86b9/<br />
232e604b847d2cad8<br />
8256ab90074e298?OpenDocument<br />
[10] For more information on Pluggable<br />
Authentication Modules (PAM), see<br />
http://www.kernel.org/pub/<br />
linux/libs/pam/
[11] For more information on Mandatory<br />
Access Control (MAC), see http://<br />
csrc.nist.gov/publications/<br />
nistpubs/800-7/node35.html<br />
[12] For more information on Discretionary<br />
Access Control (DAC), see http://<br />
csrc.nist.gov/publications/<br />
nistpubs/800-7/node25.html<br />
[13] For more information on the Trusted<br />
Computing Platform Alliance (TCPA), see<br />
http://www.trustedcomputing.<br />
org/home<br />
[14] For more information on <strong>Linux</strong> Security<br />
Modules (LSM’s), see<br />
http://lsm.immunix.org/<br />
[15] For more information on<br />
Security-Enhanced <strong>Linux</strong> (SE <strong>Linux</strong>), see<br />
http://www.nsa.gov/selinux/<br />
index.cfm<br />
[16] For more information on Tripwire, see<br />
http://www.tripwire.org/<br />
[17] For more information on AIDE, see<br />
http://www.cs.tut.fi/<br />
~rammer/aide.html<br />
[18] For more information on Samhain, see<br />
http://la-samhna.de/samhain/<br />
[19] For more information on Logcrypt, see http://www.lunkwill.org/logcrypt/
[20] For more information on Loop-aes, see http://sourceforge.net/projects/loop-aes/
[21] For more information on PPDD, see http://linux01.gwdg.de/~alatham/ppdd.html
[22] For more information on CFS, see http://sourceforge.net/projects/cfsnfs/
[23] For more information on BestCrypt, see http://www.jetico.com/index.htm#/products.htm
[24] For more information on TCFS, see http://www.tcfs.it/
[25] For more information on EncFS, see http://arg0.net/users/vgough/encfs.html
[26] For more information on CryptoFS, see http://reboot.animeirc.de/cryptofs/
[27] For more information on SSHFS, see http://lufs.sourceforge.net/lufs/fs.html
[28] For more information on the Light-weight Auditing Framework, see http://lwn.net/Articles/79326/
[29] For more information on Reiser4, see http://www.namesys.com/v4/v4.html
[30] NFSv4 RFC 3010 can be obtained from http://www.ietf.org/rfc/rfc3010.txt
Hotplug Memory and the <strong>Linux</strong> VM<br />
Dave Hansen, Mike Kravetz, with Brad Christiansen<br />
IBM <strong>Linux</strong> Technology Center<br />
haveblue@us.ibm.com, kravetz@us.ibm.com, bradc1@us.ibm.com<br />
Matt Tolentino<br />
Intel<br />
matthew.e.tolentino@intel.com<br />
Abstract<br />
This paper will describe the changes needed to<br />
the <strong>Linux</strong> memory management system to cope<br />
with adding or removing RAM from a running<br />
system. In addition to support for physically<br />
adding or removing DIMMs, there is an ever-increasing
number of virtualized environments<br />
such as UML or the IBM pSeries Hypervisor<br />
which can transition RAM between virtual<br />
system images, based on need. This paper will<br />
describe techniques common to all supported<br />
platforms, as well as challenges for specific architectures.<br />
1 Introduction<br />
As Free Software Operating Systems continue<br />
to expand their scope of use, so do the demands<br />
placed upon them. <strong>One</strong> area of continuing<br />
growth for <strong>Linux</strong> is the adaptation to<br />
incessantly changing hardware configurations<br />
at runtime. While initially confined to commonly<br />
removed devices such as keyboards,<br />
digital cameras or hard disks, <strong>Linux</strong> has recently<br />
begun to grow to include the capability<br />
to hot-plug integral system components. This<br />
paper describes the changes necessary to enable<br />
<strong>Linux</strong> to adapt to dynamic changes in one<br />
of the most critical system resources: system
RAM.<br />
2 Motivation<br />
<strong>The</strong> underlying reason for wanting to change<br />
the amount of RAM is very simple: availability.<br />
<strong>The</strong> systems that support memory hot-plug<br />
operations are designed to fulfill mission critical<br />
roles; significant enough that the cost of<br />
a reboot cycle for the sole purpose of adding<br />
or replacing system RAM is simply too expensive.<br />
For example, some large ppc64 machines<br />
have been reported to take well over thirty minutes<br />
for a simple reboot. <strong>The</strong>refore, the downtime<br />
necessary for an upgrade may compromise<br />
the five-nines uptime requirement critical
to high-end system customers [1].<br />
However, memory hotplug is not just important<br />
for big-iron. <strong>The</strong> availability of high<br />
speed, commodity hardware has prompted a<br />
resurgence of research into virtual machine<br />
monitors—layers of software such as Xen<br />
[2], VMWare [3], and conceptually even User<br />
Mode <strong>Linux</strong> that allow for multiple operating<br />
system instances to be run in isolated, virtual<br />
domains. As computing hardware density has<br />
increased, so has the possibility of splitting up<br />
that computing power into more manageable<br />
pieces. <strong>The</strong> capability for an operating system<br />
to expand or contract the range of physical
memory resources available presents the possibility<br />
for virtual machine implementations to<br />
balance memory requirements and improve the<br />
management of memory availability between<br />
domains 1 . This author currently leases a small<br />
User Mode <strong>Linux</strong> partition for small Internet<br />
tasks such as DNS and low-traffic web serving.<br />
Similar configurations with an approximately<br />
100 MHz processor and 64 MB of RAM are<br />
not uncommon. Imagine, in the case of an accidental<br />
Slashdotting, how useful radically growing<br />
such a machine could be.<br />
3 <strong>Linux</strong>’s Hotplug Shortcomings<br />
Before being able to handle the full wrath of<br />
Slashdot, we have to consider Linux's current
design. <strong>Linux</strong> only has two data structures<br />
that absolutely limit the amount of RAM that<br />
<strong>Linux</strong> can handle: the page allocator bitmaps,<br />
and mem_map[] (on contiguous memory systems).<br />
<strong>The</strong> page allocator bitmaps are very<br />
simple in concept: a bit is set one way when a page is available, and the opposite way when it
has been allocated. Since there needs to be one<br />
bit available for each page, it obviously has to<br />
scale with the size of the system’s total RAM.<br />
<strong>The</strong> bitmap memory consumption is approximately<br />
1 bit of memory for each page of system<br />
RAM.<br />
4 Resizing mem_map[]<br />
<strong>The</strong> mem_map[] structure is a bit more complicated.<br />
Conceptually, it is an array, with one<br />
struct page for each physical page which<br />
the system contains. <strong>The</strong>se structures contain<br />
bookkeeping information such as flags indicating<br />
page usage and locking structures. <strong>The</strong><br />
complexity with the struct pages is associated<br />
with their size. They have a size of
1 err, I could write a lot about this, so I won't go any further.
40 bytes each on i386 (in the 2.6.5 kernel).<br />
On a system with 4096 byte hardware pages,<br />
this implies that about 1% of the total system<br />
memory will be consumed by struct<br />
pages alone. This use of 1% of the system<br />
memory is not a problem in and of itself. But it does cause other problems.
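A back-of-the-envelope calculation, assuming the 4096-byte pages and 40-byte struct page quoted above, shows where the numbers in this section come from; the 4GB figure is just an example input.

/* Back-of-the-envelope overhead calculation, assuming 4 KB pages and
 * the 40-byte struct page quoted above for i386 in 2.6.5. */
#include <stdio.h>

int main(void)
{
        unsigned long long ram_bytes = 4ULL << 30;      /* example: 4 GB of RAM */
        unsigned long long pages     = ram_bytes / 4096;
        unsigned long long mem_map   = pages * 40;      /* struct page array    */
        unsigned long long bitmap    = pages / 8;       /* ~1 bit per page      */

        printf("pages:   %llu\n", pages);
        printf("mem_map: %llu KB (~%.2f%% of RAM)\n",
               mem_map >> 10, 100.0 * mem_map / ram_bytes);
        printf("bitmaps: %llu KB\n", bitmap >> 10);
        return 0;
}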
<strong>The</strong> <strong>Linux</strong> page allocator has a limitation on<br />
the maximum amount of memory that it can
allocate to a single request. On i386, this<br />
is 4MB, while on ppc64, it is 16MB. It is<br />
easy to calculate that anything larger than a<br />
4GB i386 system will be unable to allocate<br />
its mem_map[] with the normal page allocator.<br />
Normally, this problem with mem_map is<br />
avoided by using a boot-time allocator which<br />
does not have the same restrictions as the allocator<br />
used at runtime. However, memory hotplug<br />
requires the ability to grow the amount of<br />
mem_map[] used at runtime. It is not feasible<br />
to use the same approach as the page allocator<br />
bitmaps because, in contrast, they are kept to<br />
small-enough sizes to not impinge on the maximum<br />
size allocation limits.<br />
4.1 mem_map[] preallocation<br />
A very simple way around the runtime allocator<br />
limitations might be to allocate sufficient<br />
memory for mem_map[] at boot-time to account
for any amount of RAM that could possibly<br />
be added to the system. But, this approach<br />
quickly breaks down in at least one important<br />
case. <strong>The</strong> mem_map[] must be allocated<br />
in low memory, an area on i386 which<br />
is approximately 896MB in total size. This<br />
is very important memory which is commonly<br />
exhausted [4],[5],[6]. Consider an 8GB system<br />
which could be expanded to 64GB in the future.<br />
Its normal mem_map[] use would be<br />
around 84MB, an acceptable 10% use of low<br />
memory. However, had mem_map[] been<br />
preallocated to handle a total capacity of 64GB<br />
of system memory, it would use an astounding 71% of low memory, giving any 8GB system
all of the low memory problems associated<br />
with much larger systems.<br />
Preallocation also has the disadvantage of imposing limitations: it forces the user to decide how large they expect the system to become, either when the kernel is compiled or when it is booted. Perhaps the administrator
of the above 8GB machine knows that it<br />
will never get any larger than 16GB. Does that<br />
make the low memory usage more acceptable?<br />
It would likely solve the immediate problem,<br />
however, such limitations and user intervention<br />
are becoming increasingly unacceptable<br />
to <strong>Linux</strong> vendors, as they drastically increase<br />
possible user configurations, and support costs<br />
along with it.<br />
4.2 Breaking mem_map[] up<br />
Instead of preallocation, another solution is<br />
to break up mem_map[]. Instead of needing<br />
massive amounts of memory, smaller ones<br />
could be used to piece together mem_map[]<br />
from more manageable allocations. Interestingly,
there is already precedent in the <strong>Linux</strong><br />
kernel for such an approach. <strong>The</strong> discontiguous<br />
memory support code tries to solve a different<br />
problem (large holes in the physical address<br />
space), but a similar solution was needed.<br />
In fact, there has been code released to use the<br />
current discontigmem support in <strong>Linux</strong> to implement<br />
memory hotplug. But, this has several<br />
disadvantages. Most importantly, it requires<br />
hijacking the NUMA code for use with<br />
memory hotplug. This would exclude the use<br />
of NUMA and memory hotplug on the same<br />
system, which is likely an unacceptable compromise<br />
due to the vast performance benefits<br />
demonstrated from using the <strong>Linux</strong> NUMA<br />
code for its intended use [6].<br />
Using the NUMA code for memory hotplug is<br />
a very tempting proposition because in addition<br />
to splitting up mem_map[] the NUMA<br />
support also handles discontiguous memory.<br />
Discontiguous memory simply means that the<br />
system does not lay out all of its physical memory<br />
in a single block, rather there are holes.<br />
Handling these holes with memory hotplug is<br />
very important, otherwise the only memory<br />
that could be added or removed would be on<br />
the end.<br />
Although an approach similar to this “node hotplug”
approach will be needed when adding or<br />
removing entire NUMA nodes, using it on a<br />
regular SMP hotplug system could be disastrous.<br />
Each discontiguous area is represented<br />
by several data structures but each has at least<br />
one struct zone. This structure is the basic
unit which <strong>Linux</strong> uses to pool memory. When<br />
the amounts of memory reach certain low levels,<br />
<strong>Linux</strong> will respond by trying to free or<br />
swap memory. Artificially creating too many<br />
zones causes these events to be triggered much<br />
too early, degrading system performance and<br />
under-utilizing available RAM.<br />
5 CONFIG_NONLINEAR<br />
<strong>The</strong> solution to both the mem_map[] and discontiguous<br />
memory problems comes in a single<br />
package: nonlinear memory. First implemented<br />
by Daniel Phillips in April of 2002 as<br />
an alternative to discontiguous memory, nonlinear<br />
solves a similar set of problems.<br />
Laying out mem_map[] as an array has several<br />
advantages. <strong>One</strong> of the most important<br />
is the ability to quickly determine the physical<br />
address of any arbitrary struct page.<br />
Since mem_map[N] represents the Nth page<br />
of physical memory, the physical address of the<br />
memory represented by that struct page<br />
can be determined by simple pointer arithmetic:<br />
physical_address = (&mem_map[N] - &mem_map[0]) * PAGE_SIZE
struct page *page_N = &mem_map[physical_address / PAGE_SIZE]
Figure 1: Physical Address Calculations
Once mem_map[] is broken up, these simple calculations are no longer possible, thus another
approach is required. <strong>The</strong> nonlinear approach<br />
is to use a set of two lookup tables, each<br />
one complementing the above operations: one<br />
for converting struct page to physical addresses,<br />
the other for doing the opposite. While<br />
it would be possible to have a table with an entry<br />
for every single page, that approach wastes<br />
far too much memory. As a result, nonlinear<br />
handles pages in uniformly sized sections, each<br />
of which has its own mem_map[] and an associated<br />
physical address range. <strong>Linux</strong> has some<br />
interesting conventions about how addresses<br />
are represented, and this has serious implications<br />
for how the nonlinear code functions.<br />
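The sketch below shows one way such section tables and the two complementary conversions could look. The section size, table layout, and names are assumptions for illustration and are not the actual CONFIG_NONLINEAR code.

/* Sketch of section-based lookups.  Section size, table layout, and
 * names are assumptions, not the actual CONFIG_NONLINEAR code. */
#define SECTION_SHIFT     26                    /* assume 64 MB sections */
#define PAGE_SHIFT        12
#define PAGES_PER_SECTION (1UL << (SECTION_SHIFT - PAGE_SHIFT))
#define MAX_SECTIONS      1024

struct page {
        unsigned long flags;                    /* minimal stand-in for struct page */
};

struct mem_section {
        struct page  *section_mem_map;          /* this section's piece of mem_map[] */
        unsigned long start_pfn;                /* first pfn covered by the section  */
};

static struct mem_section sections[MAX_SECTIONS];

/* pfn -> struct page: index into the section's private mem_map piece
 * (assumes sections are aligned on PAGES_PER_SECTION boundaries). */
static inline struct page *pfn_to_page(unsigned long pfn)
{
        struct mem_section *sec = &sections[pfn / PAGES_PER_SECTION];

        return sec->section_mem_map + (pfn - sec->start_pfn);
}

/* struct page -> pfn: the complementary lookup.  A linear scan is used
 * here purely for clarity. */
static inline unsigned long page_to_pfn(struct page *page)
{
        unsigned long i;

        for (i = 0; i < MAX_SECTIONS; i++) {
                struct mem_section *sec = &sections[i];

                if (sec->section_mem_map &&
                    page >= sec->section_mem_map &&
                    page <  sec->section_mem_map + PAGES_PER_SECTION)
                        return sec->start_pfn + (page - sec->section_mem_map);
        }
        return ~0UL;    /* not a managed page */
}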
5.1 Physical Address Representations<br />
<strong>The</strong>re are, in fact, at least three different ways<br />
to represent a physical address in <strong>Linux</strong>: a<br />
physical address, a struct page, and a<br />
page frame number (pfn). A pfn is traditionally<br />
just the physical address divided by the size<br />
of a physical page (the N in Figure 1 above). Many parts of the kernel prefer to use
a pfn as opposed to a struct page pointer<br />
to keep track of pages because pfn’s are easier<br />
to work with, being conceptually just array<br />
indexes. <strong>The</strong> page allocator bitmaps discussed<br />
above are just such a part of the kernel. To allocate<br />
or free a page, the page allocator toggles<br />
a bit at an index in one of the bitmaps. That<br />
index is based on a pfn, not a struct page<br />
or a physical address.<br />
Being so easily transposed, that decision does<br />
not seem horribly important. But it does cause<br />
a serious problem for memory hotplug. Consider<br />
a system with 100 1GB DIMM slots<br />
that support hotplug. When the system is first<br />
booted, only one of these DIMM slots is populated.<br />
Later on, the owner decides to hotplug<br />
another DIMM, but puts it in slot 100 instead<br />
of slot 2. Now, nonlinear has a bit of a problem:<br />
the new DIMM happens to appear at a physical<br />
address 100 times higher than that of the first
DIMM. <strong>The</strong> mem_map[] for the new DIMM<br />
is split up properly, but the allocator bitmap’s<br />
length is directly tied to the pfn, and thus the<br />
physical address of the memory.<br />
Having already stated that the allocator bitmap<br />
stays at manageable sizes, this still does not<br />
seem like much of an issue. However, the<br />
physical address of that new memory could<br />
have an even greater range than 100 GB; it has<br />
the capability to have many, many terabytes of<br />
range, based on the hardware. Keeping allocator<br />
bitmaps for terabytes of memory could<br />
conceivably consume all system memory on a<br />
small machine, which is quite unacceptable.<br />
Nonlinear offers a solution to this by introducing<br />
a new way to represent a physical address:<br />
a fourth addressing scheme. With three<br />
addressing schemes already existing, a fourth<br />
seems almost comical, until its small scope is<br />
considered. <strong>The</strong> new scheme is isolated to use<br />
inside of a small set of core allocator functions and a single place in the memory hotplug code itself.
A simple lookup table converts these new<br />
“linear” pfns into the more familiar physical<br />
pfns.
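A sketch of that table is below: densely packed "linear" pfns, which the allocator bitmaps are sized against, are mapped back to physical pfns one section at a time. The names and sizes are assumptions for illustration.

/* Sketch of the fourth addressing scheme: densely packed "linear" pfns
 * mapped back to sparse physical pfns, one entry per section. */
#define SECTION_SHIFT     26
#define PAGE_SHIFT        12
#define PAGES_PER_SECTION (1UL << (SECTION_SHIFT - PAGE_SHIFT))
#define MAX_SECTIONS      1024

/* Indexed by linear section number, filled in as memory is hot-added. */
static unsigned long phys_section_of[MAX_SECTIONS];

static inline unsigned long linear_to_phys_pfn(unsigned long linear_pfn)
{
        unsigned long sec = linear_pfn / PAGES_PER_SECTION;

        return phys_section_of[sec] * PAGES_PER_SECTION +
               linear_pfn % PAGES_PER_SECTION;
}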
5.2 Issues with CONFIG_NONLINEAR<br />
Although it greatly simplifies several issues,<br />
nonlinear is not without its problems. Firstly,<br />
it does require the consultation of a small number<br />
of lookup tables during critical sections of<br />
code. Random access of these tables is likely to<br />
cause cache overhead. <strong>The</strong> more finely grained<br />
the units of hotplug, the larger these tables will<br />
grow, and the worse the cache effects.<br />
Another concern arises with the size of the<br />
nonlinear tables themselves. While they allow<br />
pfns and mem_map[] to have nonlinear relationships,<br />
the nonlinear structures themselves<br />
remain normal, everyday, linear arrays. If<br />
hardware is encountered with sufficiently small<br />
hotplug units, and sufficiently large ranges of<br />
physical addresses, an alternate scheme to the<br />
arrays may be required. However, it is the authors’<br />
desire to keep the implementation simple,<br />
until such a need is actually demonstrated.<br />
6 Memory Removal<br />
While memory addition is a relatively black-and-white
problem, memory removal has many<br />
more shades of gray. <strong>The</strong>re are many different<br />
ways to use memory, and each of them has<br />
specific challenges for unusing it. We will first<br />
discuss the kinds of memory that <strong>Linux</strong> has<br />
which are relevant to memory removal, along<br />
with strategies to go about unusing them.<br />
6.1 “Easy” User Memory<br />
Unusing memory is a matter of either moving<br />
data or simply throwing it away. <strong>The</strong> easiest,<br />
most straightforward kind of memory to<br />
remove is that whose contents can just be discarded.<br />
<strong>The</strong> two most common manifestations<br />
of this are clean page cache pages and swapped<br />
pages. Page cache pages are either dirty (containing<br />
information which has not been written<br />
to disk) or clean pages, which are simply a<br />
copy of something that is present on the disk.<br />
Memory removal logic that encounters a clean<br />
page cache page is free to have it discarded,<br />
just as the low memory reclaim code does today.<br />
<strong>The</strong> same is true of swapped pages; a page<br />
of RAM which has been written to disk is safe<br />
to discard. (Note: there is usually a brief period<br />
between when a page is written to disk,<br />
and when it is actually removed from memory.)<br />
Any page that can be swapped is also an easy<br />
candidate for memory removal, because it can<br />
easily be turned into a swapped page with existing<br />
code.<br />
6.2 Swappable User Memory<br />
Another type of memory which is very similar<br />
to the two types above is something which<br />
is only used by user programs, but is for<br />
some reason not a candidate for swapping.<br />
This at least includes pages which have been<br />
mlock()’d (which is a system call to prevent<br />
swapping). Instead of discarding these pages<br />
out of RAM, they must instead be moved. <strong>The</strong><br />
algorithm to accomplish this should be very<br />
similar to the algorithm for a complete page<br />
swapping: freeze writes to the page, move the<br />
page’s contents to another place in memory,<br />
change all references to the page, and re-enable<br />
writing. Notice that this is the same process as<br />
a complete swap cycle except that the writes to<br />
the disk are removed.<br />
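The sketch below strings those four steps together; every helper named in it is hypothetical pseudocode rather than an existing kernel interface.

/* Sketch of the migration steps described above for a page that cannot
 * be swapped.  Every helper named here is hypothetical. */
struct page;

extern struct page *alloc_replacement_page(void);
extern void freeze_page_writers(struct page *p);
extern void copy_page_contents(struct page *to, struct page *from);
extern int  rewrite_mappings(struct page *from, struct page *to);
extern void unfreeze_page_writers(struct page *p);

int migrate_unswappable_page(struct page *old)
{
        struct page *new = alloc_replacement_page();

        if (!new)
                return -1;

        freeze_page_writers(old);           /* 1. block writes to the page      */
        copy_page_contents(new, old);       /* 2. move the contents elsewhere   */
        if (rewrite_mappings(old, new)) {   /* 3. point all references at 'new' */
                unfreeze_page_writers(old);
                return -1;
        }
        unfreeze_page_writers(new);         /* 4. re-enable writing             */
        return 0;
}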
6.3 <strong>Kernel</strong> Memory<br />
Now comes the hard part. Up until now, we<br />
have discussed memory which is being used<br />
by user programs. <strong>The</strong>re is also memory that<br />
<strong>Linux</strong> sets aside for its own use and this comes<br />
in many more varieties than that used by user<br />
programs. <strong>The</strong> techniques for dealing with this<br />
memory are largely still theoretical, and do not<br />
have existing implementations.
Remember how the <strong>Linux</strong> page allocator can<br />
only keep track of pages in powers of two? <strong>The</strong><br />
<strong>Linux</strong> slab cache was designed to make up for<br />
that [6], [7]. It has the ability to take those powers<br />
of two pages, and chop them up into smaller<br />
pieces. <strong>The</strong>re are some fixed-size groups for<br />
common allocations like 1024, 1532, or 8192<br />
bytes, but there are also caches for certain<br />
kinds of data structures. Some of these caches<br />
have the ability to attempt to shrink themselves<br />
when the system needs some memory back, but<br />
even that is relatively worthless for memory<br />
hotplug.<br />
6.4 Removing Slab Cache Pages<br />
<strong>The</strong> problem is that the slab cache’s shrinking<br />
mechanism does not concentrate on shrinking<br />
any particular memory, it just concentrates on<br />
shrinking, period. Plus, there’s currently no<br />
mechanism to tell which slab a particular page<br />
belongs to. It could just as easily be a simply<br />
discarded dcache entry as it could be a completely<br />
immovable entry like a pte_chain.<br />
<strong>Linux</strong> will need mechanisms to allow the slab<br />
cache shrinking to be much more surgical.<br />
However, there will always be slab cache memory<br />
which is not covered by any of the shrinking<br />
code, like for generic kmalloc() allocations.<br />
<strong>The</strong> slab cache could also make efforts<br />
to keep these “mystery” allocations away from<br />
those for which it knows how to handle.<br />
While the record-keeping for some slab-cache<br />
pages is sparse, there is memory with even<br />
more mysterious origins. Some is allocated<br />
early in the boot process, while other uses pull<br />
pages directly out of the allocator never to be<br />
seen again. If hot-removal of these areas is required,<br />
then a different approach must be employed:<br />
direct replacement. Instead of simply<br />
reducing the usage of an area of memory until<br />
it is unused, a one-to-one replacement of this<br />
memory is required. With the judicious use of<br />
page tables, the best that can be done is to preserve<br />
the virtual address of these areas. While<br />
this is acceptable for most use, it is not without<br />
its pitfalls.<br />
6.5 Removing DMA Memory<br />
<strong>One</strong> unacceptable place to change the physical<br />
address of some data is for a device’s<br />
DMA buffer. Modern disk controllers and network<br />
devices can transfer their data directly<br />
into the system’s memory without the CPU’s<br />
direct involvement. However, since the CPU<br />
is not involved, the devices lack access to the<br />
CPU’s virtual memory architecture. For this<br />
reason, all DMA-capable devices’ transfers are<br />
based on the physical address of the memory<br />
to which they are transferring. Every user of<br />
DMA in <strong>Linux</strong> will either need to be guaranteed<br />
to not be affected by memory replacement,<br />
or to be notified of such a replacement<br />
so that it can take corrective action. It should<br />
be noted, however, that the virtualization layer<br />
on ppc64 can properly handle this remapping<br />
in its IOMMU. Other architectures with IOM-<br />
MUs should be able to employ similar techniques.<br />
6.6 Removal and the Page Allocator<br />
<strong>The</strong> <strong>Linux</strong> page allocator works by keeping<br />
lists of groups of pages in sizes that are powers<br />
of two times the size of a page. It keeps a<br />
list of groups that are available for each power<br />
of two. However, when a request for a page<br />
is made, the only real information provided is<br />
for the size required; there is no component for specifying which particular memory is wanted.
<strong>The</strong> first thing to consider before removing<br />
memory is to make sure that no other part<br />
of the system is using that piece of memory.<br />
Thankfully, that’s exactly what a normal allocation<br />
does: make sure that it is alone in
its use of the page. So, making the page allocator<br />
support memory removal will simply<br />
involve walking the same lists that store the<br />
page groups. But, instead of simply taking the<br />
first available pages, it will be more finicky,<br />
only “allocating” pages that are among those<br />
about to be removed. In addition, the allocator<br />
should have checks in the free_pages()<br />
path to look for pages which were selected for<br />
removal.<br />
1. Inform allocator to catch any pages in the<br />
area being removed.<br />
2. Go into allocator, and remove any pages<br />
in that area.<br />
3. Run the page reclaim mechanisms to trigger
free()s, and hopefully unuse all target<br />
pages.<br />
4. If not complete, goto 3.<br />
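A sketch of this loop, with hypothetical helpers standing in for the allocator and reclaim hooks, might look like the following.

/* Sketch of the section-removal loop from the list above.  The helpers
 * (mark_section_unallocatable, capture_free_pages_in_section, ...) are
 * hypothetical, standing in for allocator and reclaim hooks. */
extern void mark_section_unallocatable(int section);     /* step 1 */
extern long capture_free_pages_in_section(int section);  /* step 2 */
extern void run_page_reclaim(void);                      /* step 3 */
extern long pages_still_in_use(int section);

int remove_section(int section, int max_passes)
{
        int pass;

        mark_section_unallocatable(section);      /* 1. catch pages as they are freed    */
        capture_free_pages_in_section(section);   /* 2. pull already-free pages out      */

        for (pass = 0; pass < max_passes; pass++) {
                if (pages_still_in_use(section) == 0)
                        return 0;                 /* every page captured; safe to remove */
                run_page_reclaim();               /* 3. trigger free()s of in-use pages  */
        }                                         /* 4. loop back and re-check           */
        return -1;                                /* gave up: unmovable pages remain     */
}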
6.7 Page Groupings<br />
As described above, the page allocator is the<br />
basis for all memory allocations. However,<br />
when it comes time to remove memory, a fixed-size block of memory is what is removed. These blocks correspond to the sections defined in the implementation of nonlinear
memory. When removing a section of memory,<br />
the code performing the remove operation<br />
will first try to essentially allocate all the<br />
pages in the section. To remove the section,<br />
all pages within the section must be made free<br />
of use by some mechanism as described above.<br />
However, it should be noted that some pages<br />
will not be able to be made available for removal.<br />
For example, pages in use for kernel<br />
allocations, DMA or via the slab-cache. Since<br />
the page allocator makes no attempt to group<br />
pages based on usage, it is possible in a worst<br />
case situation that every section contains one<br />
in-use page that can not be removed. Ideally,<br />
we would like to group pages based on their usage<br />
to allow the maximum number of sections<br />
to be removed.<br />
Currently, the definition of zones provides<br />
some level of grouping on specific architectures.<br />
For example, on i386, three zones are<br />
defined: DMA, NORMAL and HIGHMEM.<br />
With such definitions, one would expect most<br />
non-removable pages to be allocated out of the<br />
DMA and NORMAL zones. In addition, one<br />
would expect most HIGHMEM allocations to<br />
be associated with userspace pages and thus<br />
removable. Of course, when the page allocator<br />
is under memory pressure it is possible<br />
that zone preferences will be ignored and allocations<br />
may come from an alternate zone. It<br />
should also be noted that on some architectures,<br />
such as ppc64, only one zone (DMA) is<br />
defined. Hence, zones can not provide grouping<br />
of pages on every architecture. It appears<br />
that zones do provide some level of page<br />
grouping, but possibly not sufficient for memory<br />
hotplug.<br />
Ideally, we would like to experiment with<br />
teaching the page allocator about the use of<br />
pages it is handing out. A simple thought<br />
would be to introduce the concept of sections<br />
to the allocator. Allocations of a specific type<br />
are directed to a section that is primarily used<br />
for allocations of that same type. For example,<br />
when allocations for use within the kernel are<br />
needed the allocator will attempt to allocate the<br />
page from a section that contains other internal<br />
kernel allocations. If no such pages can be<br />
found, then a new section is marked for internal<br />
kernel allocations. In this way pages which can<br />
not be easily freed are grouped together rather<br />
than spread throughout the system. The page allocator's use of sections would then be analogous to the slab cache's use of pages.
7 Conclusion<br />
<strong>The</strong> prevalence of hotplug-capable <strong>Linux</strong> systems<br />
is only expanding. Support for these systems<br />
will make <strong>Linux</strong> more flexible and will<br />
make additional capabilities available to other<br />
parts of the system.<br />
Legal Statement<br />
This work represents the view of the authors and<br />
does not necessarily represent the view of IBM or<br />
Intel.<br />
IBM is a trademark or registered trademark of International
Business Machines Corporation in the<br />
United States and/or other countries.<br />
Intel and i386 are trademarks or registered trademarks<br />
of Intel Corporation in the United States,<br />
other countries, or both.<br />
Linux is a registered trademark of Linus Torvalds.
VMware is a trademark of VMware, Inc.
References
[1] Five Nine at the IP Edge. http://www.iec.org/online/tutorials/five-nines
[2] Barham, Paul, et al. Xen and the Art of Virtualization. Proceedings of the ACM Symposium on Operating System Principles (SOSP), October 2003.
[3] Waldspurger, Carl. Memory Resource Management in VMware ESX Server. Proceedings of the USENIX Association Symposium on Operating System Design and Implementation, 2002. pp 181–194.
[4] Dobson, Matthew and Gaughen, Patricia and Hohnbaum, Michael. Linux Support for NUMA Hardware. Proceedings of the Ottawa Linux Symposium. July 2003. pp 181–196.
[5] Gorman, Mel. Understanding the Linux Virtual Memory Manager. Prentice Hall, NJ. 2004.
[6] Martin Bligh and Dave Hansen. Linux Memory Management on Larger Machines. Proceedings of the Ottawa Linux Symposium 2003. pp 53–88.
[7] Bonwick, Jeff. The Slab Allocator: An Object-Caching Kernel Memory Allocator. Proceedings of USENIX Summer 1994 Technical Conference. http://www.usenix.org/publications/library/proceedings/bos94/bonwick.html