Level: Intermediate Anand Santhanam (asanthan@in.ibm.com), Software Engineer, IBM Global Services
23 Sep 2003 The impending release of a new stable kernel promises greater adoption for Linux, as it becomes more reliable and scalable over a larger variety of processors. Here we'll highlight some of the changes, both big and small, with some code samples.
Linux kernel development has come a long way from the humble v 0.1
release in 1991 by Linus Torvalds that contained a basic scheduler and IPC
and memory management algorithms. It's now a a viable alternative operating
system that poses a serious challenge in the marketplace. More and more
governments and IT giants are moving to Linux. Linux now powers everything
from the smallest of embedded devices to S/390s, and from
wristwatches to huge enterprise servers.
Linux 2.6 is the next major release of the Linux kernel development
cycle, and it contains powerful features aimed at
improved performance on high-end enterprise servers as well as support for
a plethora of embedded devices (also see the link to Joseph Pranevich's Wonderful
World of Linux in the Resources
section, for more detailed analyses of Linux 2.6 support for large, small,
and multiple processors).
This article analyzes some of the key features of Linux 2.6 for the
avid Linux user, and discusses various changes that might be of interest
to driver developers.
Linux 2.6 highlights
Linux 2.6 is a big step for Linux on enterprise servers as well as for
embedded systems. For high-end machines, new features
target performance improvements, scalability, throughput, and NUMA support
for SMP machines. For the embedded world, new architectures and processor
types have been added -- including support for MMU-less systems that do
not have a hardware-controlled memory management scheme. And, as usual, a
whole new set of drivers for audio and multimedia have been added to cater
to the desktop crowd.
We analyze some of the most notable features of Linux 2.6 in this
article, but there are many worthy changes we could not cover here,
including enhanced kernel core dumping, fast mutex support, an improved
I/O subsystem, and much more. Some of these are summarized in the sidebar, and we have included links to others
in the Resources section where possible.
New scheduler
The 2.6 Linux kernel uses a new scheduler algorithm, developed by Ingo
Molnar. Called the O(1) algorithm, it should perform notably well under
high loads and it scales well with a large number of processors.
In the 2.4 scheduler, the timeslice recalculation algorithm requires
that all processes exhaust their timeslice before their new timeslices can
be recomputed. So on a system with a large number of processors, most of
the processors become idle as processes complete their timeslice and wait
for recalculation (for a new timeslice); this impacts SMP efficiency.
Moreover, idle processors start executing waiting processes (if their own
processor is busy) whose timeslices haven't yet been exhausted, causing
processes to start "bouncing" between processors. When a bouncing process
happens to be a high priority or interactive process, overall system
performance is affected.
The new scheduler addresses these issues by distributing timeslices on
a per-CPU basis and eliminating the global synchronization and
recalculation loop. The scheduler uses two priority arrays, namely active
and expired arrays, that are accessed through pointers. The active array
contains all tasks that are affine to a CPU and have timeslices left. The
expired array contains a sorted list of all tasks whose timeslices have
expired. If all active tasks are used up, then the access pointers to the
arrays are switched, and the expired array (containing the ready to run
tasks) becomes the active array and the empty active array becomes the new
array for expired tasks. The array indices are stored in a 64-bit bitmap,
and finding the highest priority task is very trivial.
The new scheduler does not have a big runqueue_lock anymore. It
maintains a per-processor run queue/lock mechanism so that two processes
on two different processors can sleep, wake up, and context-switch
completely in parallel. The recalculation loop (which recomputes the time
slices for the processes) and the goodness loop have been eliminated,
and O(1) algorithms are used for wakeup() and schedule().
The main benefits of the new scheduler include:
-
SMP efficiency: If there is
work to be done, all the processors should work.
-
Waiting processes: No process should stay without processor
time for long periods of time; additionally, no process should take an
unreasonably high amount of CPU time.
-
SMP affinity: Processors should affine to one CPU and will not
bounce between CPUs.
-
Priorities: Less important tasks should start with lower
priority (the converse is also true).
-
Load balancing: The scheduler will decrease the priority of
any process that generates more load than the processor can handle.
-
Interactive performance
With the new scheduler, the user should not see the system taking longer
to respond to things like mouse clicks or key taps, even under very high
loads.
Kernel preemption
The kernel preemption patch has been merged into the 2.5 series and
subsequently will make it into 2.6. This should significantly lower
latencies for user interactive applications, multimedia applications, and
the like. This feature is especially good for real-time systems and
embedded devices.
The 2.5 kernel preemption work was done by Robert Love. In previous
versions of the kernel (including the 2.4 kernels), it was not possible to
preempt a task executing in kernel mode (including user tasks that had
entered into kernel mode via system calls) until or unless the task
voluntarily relinquished the CPU.
As of kernel 2.6, the kernel is preemptible. A kernel task can be
preempted, so that some important user application can continue to run.
The main advantage of this is that there will be a big boost to user
interactiveness of the system, and the user will feel that things are
happening at a faster pace for a key stroke or mouse click.
Of course, not all sections of the kernel code can be preempted.
Certain critical sections of the kernel code are locked against
preemption. Locking should ensure both that per-CPU data structures and
state are always protected against preemption.
The following code-snippet demonstrates the per-CPU data structure problem
(in an SMP system):
Listing 1. Code with kernel preemption problem
int arr[NR_CPUS];
arr[smp_processor_id()] = i;
/* kernel preemption could happen here */
j = arr[smp_processor_id()] /* i and j are not equal as
smp_processor_id() may not be the same */
|
In this situation, if kernel preemption had happened at the specified
point, the task would have been assigned to some other processor upon
re-schedule -- in which case smp_processor_id() would have returned a
different value.
This situation should be prevented by locking.
FPU mode is another case where the state of the CPU should be
protected from preemption. When the kernel is executing floating point
instructions, the FPU state is not saved. If preemption happens here, then
upon reschedule, the FPU state is completely different from what was there
before preemption. So FPU code must always be locked against kernel
preemption.
Locking can be done by disabling preemption for the critical section
and re-enabling it afterwards. The following #defines have been provided
by the 2.6 kernel to disable and enable preemption:
-
preempt_enable()
-- decrements the preempt counter
-
preempt_disable()
-- increments the preempt counter
-
get_cpu()
-- calls preempt_disable() followed by a call to
smp_processor_id()
-
put_cpu()
-- re-enables preemption()
Using these defines, Listing 1 could be rewritten as:
Listing 2. Code with locking against preemption
int cpu, arr[NR_CPUS];
arr[get_cpu()] = i; /* disable preemption */
j = arr[smp_processor_id()];
/* do some critical stuff here */
put_cpu() /* re-enable preemption */
|
Note that preempt_disable()/enable() calls are nested. That is,
preempt_disable() can be called n number of times and preemption will only
be re-enabled when the nth preempt_enable() is encountered.
Preemption is implicitly disabled if any spin locks are held. For
instance, a call to spin_lock_irqsave() implicitly prevents preemption by
calling preempt_disable(); a call to spin_unlock_irqrestore() re-enables
preemption by calling preempt_enable().
Improved threading model and support for NPTL
A lot of work in improving threading performance has gone into the 2.5
kernel. The improved threading model in 2.6 was also done by Ingo Molnar.
Based on a 1:1 threading model (one kernel thread for one user thread), it
includes in-kernel support for the new Native Posix Threading Library (or
NPTL), which was developed by Molnar in cooperation with Ulrich Drepper.
 |
Other notable changes in the 2.6 kernel
-
Filesystems
Improvements to the ext2/ext3 filesystem have been made, including
support for extended attributes and POSIX access control lists. Also the
NTFS driver had a rewrite to support (reentrant safe) SMP, cluster size
over 4KB, and more. IBM's JFS (journaling file system) and SGI's XFS have
both been merged into 2.6.
-
Audio
A much-awaited feature is the new Linux audio architecture known as ALSA
(Advanced Linux Sound Architecture). This is a boon for desktop users as
it replaces the old OSS (Open Sound System) architecture that had many
limitations. The new sound architecture supports USB audio and MIDI
devices, full duplex playback, and more. Playing MP3 or other audio files
on a desktop will never be the same again!
-
Buses
The SCSI/IDE subsystems have undergone major rewrites, and some of the
drivers are still under testing or in the cleanup stage.
-
Power management
There is support for ACPI (Advanced Configuration and Power Interface),
for CPU scaling (the ability to vary the CPU clock frequency under varying
loads to conserve power) and software suspend (this feature is still being tested).
-
Networking and IPSec
IPSec (IP Security) support has been added to the kernel. So has support
for various RFCs like IP payload compression. The in-kernel HTTP server,
kttpd, has been removed. The IPSec feature uses the new crypto API
provided by the kernel. This crypto API includes various popular
algorithms like MD4, MD5 DES, and so on. Support for the new NFSv4 (Network
File System) client/server has been added.
-
User interface layer
The 2.6 kernel has also seen a reworking of the framebuffer/console
layer. This will mean that various userspace framebuffer tools like fbset
and fbdesl will need updating. Also a whole range of new devices -- from
touchscreens to braille devices to different varieties of mice -- will
enjoy support from the human interface layer.
|
|
Threading operations will enjoy an increase in speed; the 2.6
kernel can now handle an arbitrary number of threads and up to 2 billion
PIDs (on IA32).
Another change is the introduction of TLS (Thread Local Storage)
system calls, which allow for the allocation of one or more GDT (Global
Descriptor Table) entries that can be used as thread registers. The GDT
is per-CPU and the entries are per-thread. This enables a 1:1 threading
model without limitations on the number of threads being created (since a
new kernel thread is created for every user thread). The 2.4 kernel
allowed only a maximum of 8,192 threads per processor.
The clone system call has been extended to optimize thread creation.
The kernel stores the thread ID in a given memory location if the
CLONE_PARENT_SETID flag is set and clears the memory location upon thread
termination if CLONE_CLEARID is set. This helps user-level memory
management recognize unused memory blocks. Also, support for the
signal-safe loading of thread registers has been incorporated. Futex
(fast user space mutex) is done by the kernel on the thread ID upon
pthread_join (for more on futex, please see Resources).
POSIX signal handling is done in kernel space. A signal is delivered
to one of the available threads in the process; fatal signals terminate
the entire process. Stop and continue signals also affect the entire
process, and this enables job control for multithreaded processes.
A new variant of the exit system call has been introduced. Called
exit_group(), this system call terminates the entire process and its
threads. Moreover, exit handling has been improved with the introduction
of the O(1) algorithm so that it takes but two seconds to terminate a
process with a hundred thousand threads (compare with 15 minutes to do the
same thing in the 2.4 kernel).
The proc filesystem has been modified to report only the original
thread, rather than all threads. This avoids the slowing down of /proc
reporting. The kernel ensures that the original thread remains until all
other threads are terminated.
VM changes
On the VM side, Rik van Riel's r-map (reverse mapping) has been
incorporated, which should show significant improvements in VM behaviour
under certain loads.
To understand the reverse mapping technique, let's quickly go through
some fundamentals of the Linux virtual memory system.
The Linux kernel operates in a virtual memory mode: for every
virtual page there is a corresponding physical page of memory in the
system. Address translation between the virtual and physical pages is done
by the hardware page tables. For a particular virtual page, a page table
entry will give a corresponding physical page or note that the page is not
present (indicating a page fault). But this virtual-to-physical page
mapping is not always one-to-one: multiple virtual pages (pages shared by
different processes) might point to the same physical page. In this case,
each shared process page entry will have mappings to the corresponding
physical page. In cases like this, the situation becomes complicated when
the kernel wants to free a particular physical page, as it has to traverse
through all the processes' page table entries looking for references to
the physical page; it can only release the physical page when the
reference count reaches 0. This can be a long and arduous task as the
kernel scans through all the processes' page tables, because it has no way
of knowing which processes have actual references to this page. This
slows the VM down considerably under high loads.
The reverse mapping patch addresses this situation by introducing a
data structure called pte_chain in struct page (the physical page
structure). The pte_chain is a simple linked list that points to the PTEs
of the page, and can return a list of PTEs where a particular page is
referenced. Page freeing is suddenly very easy.
There is, however, a pointer overhead to this scheme. Every struct
page in the system has to have an extra struct for pte_chain. On a 256 MB
system with 64 KB worth of physical pages, there is 64KB * (sizeof(struct
pte_chain)) amount of memory that needs to be allocated for this purpose
-- a significant number.
Several techniques are being used to address this issue including
removal of the wait_queue_head_t field from the struct page (used for
exclusive access to the page). Since this wait queue is used sparsely, a
smaller wait queue implementation using hash queues to find the correct
wait queue has been implemented in the rmap patch.
Despite this, the performance of the rmap patch -- especially on
high-end systems and under heavy loads -- is significantly better than the
2.4 kernel's VM system.
Driver porting to Linux 2.6
The 2.6 kernel has brought along with it a pretty significant list of
changes for driver developers. This section highlights some of the
important aspects of device driver porting from the 2.4 to 2.6 kernel.
First, the kernel build system has been improved for quicker builds
compared with 2.4. Improved graphical tools have been added: make xconfig
(requires Qt libraries) and make gconfig (requiring GTK libraries).
The following are some of the highlights of the 2.6 build system:
- Creates arch-zImage and modules by default when just
make is issued
- For parallel
make, make -jN is the preferred choice
-
make is not verbose by default (set KBUILD_VERBOSE=1 or use make V=1 to change this)
-
make subdir/ will compile all the files within subdir/ and below
-
make help will provide the make targets supported
-
make dep need not be run at any stage
The kernel module loader has also been completely reimplemented in 2.5,
which means that the module-building mechanism is much different compared
to 2.4. A new set of module utilities is needed for module loading and
unloading (please see Resources for a download
link for those). Old 2.4 makefiles will no longer work as of 2.6.
The new kernel module loader is the work of Rusty Russel. It makes
use of the kernel build mechanism and produces a .ko (kernel object)
module object instead of an .o module object. The kernel build system
compiles the module first and then links to vermagic.o. This creates a
special section in the object module that contains information like the
compiler version used, the kernel version, whether kernel preemption is
used, and so on.
Now let's take an example and analyze how the new kernel build system
should be used to compile and load a simple module. The module concerned
is a "hello world" module and is similar to 2.4 module code except that
module_init and module_exit need to be used instead of init_module and
cleanup_module (kernel 2.4.10 modules onwards use this mechanism). The
module is named hello.c and the Makefile is as follows:
Listing 3. Example driver makefile
KERNEL_SRC = /usr/src/linux
SUBDIR = $(KERNEL_SRC)/drivers/char/hello/
all: modules
obj-m := module.o
hello-objs := hello.o
EXTRA_FLAGS += -DDEBUG=1
modules:
$(MAKE) -C $(KERNEL_SRC) SUBDIR=$(SUBDIR) modules
|
This makefile uses the kernel build mechanism to build the module. The
compiled module will be named module.ko and is obtained by compiling
hello.c and linking with vermagic. KERNEL_SRC indicates the kernel source
directory and SUBDIR indicates the directory where the module is located.
EXTRA_FLAGS indicate any compile-time flags that need to be given.
Once the new module (module.ko) is created, it can be loaded and
unloaded by using the new module utilities. Older module utilities used in
2.4 cannot be used for loading and unloading the 2.6 kernel modules. The
new module loading utility tries to minimize the occurrence of race
conditions that may occur when a device is opened while the corresponding
module is being unloaded, but after the module has made sure that no one
is using it. One of the reasons for these race conditions is that the
module usage count is manipulated within the module code itself (via
MOD_DEC/INC_USE_COUNT).
In 2.6, the module need not do this step of increasing/decreasing the
reference count. Instead, this is now done outside of module code. Any
code that tries to reference the module has to call
try_module_get(&module), and upon success can access the module; this call
will fail if the module is being unloaded. Correspondingly, references to
the module can be released by using module_put().
Memory management changes
Memory pools were added during 2.5 development to satisfy memory
allocations without sleeping. The concept involves pre-allocating a pool
of memory and reserving it until it is actually needed. A mempool is
created by using the mempool_create() call (linux/mempool.h should be
included):
mempool_t *mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
mempool_free_t *free_fn, void *pool_data);
where min_nr is the number of pre-allocated objects needed, alloc_fn
and free_fn are pointers to the standard object allocation and
de-allocation routines that the mempool mechanism provides. They are of
type:
typedef void *(mempool_alloc_t)(int gfp_mask, void *pool_data);
typedef void (mempool_free_t)(void *element, void *pool_data);
pool_data is the pointer used by allocation/de-allocation functions
and gfp_mask is the allocation flag. The allocation function will not
sleep unless the __GFP_WAIT flag is specified.
Allocating and de-allocating objects within the pool is done by:
void *mempool_alloc(mempool_t *pool, int gfp_mask);
void mempool_free(void *element, mempool_t *pool);
mempool_alloc() is for allocating objects; if the mempool allocator
fails to provide memory, then the pre-allocation pool will be used.
The mempool is returned to the system using mempool_destroy().
In addition to introducing the mempool feature for memory allocation,
the 2.5 kernel also introduced three new GFP flags for normal memory
allocations. They are:
-
__GFP_REPEAT -- Tells the page allocator to try hard for
memory. Memory allocation failures should reduce the use of this flag.
-
__GFP_NOFAIL -- Memory allocation failures must not
happen. This might take a longer time to complete since the caller is put
to sleep in order to satisfy requirements.
-
__GFP_NORETRY -- This ensures that failed allocations
are not retried. Instead, failure status is reported back to the caller.
Apart from the memory allocation changes, the remap_page_range() call
-- which is used to map pages to user space -- has been modified slightly.
It now takes an extra parameter compared to 2.4. The virtual memory area
(VMA) pointer needs to be added in as the first parameter followed by the
usual four parameters (start, end, size, and protection flags).
Workqueue interface
Workqueue interface is introduced in 2.5 development to replace the
task queue interface (used to schedule kernel tasks). Each workqueue has
dedicated worker threads associated with it and all the tasks from the run
queue run in the context of the process (so they are allowed to sleep).
Drivers can create and use their own workqueues or use the one that comes
with the kernel. Workqueues are created using:
struct workqueue_struct *create_workqueue(const char *name);
where name is the name of the workqueue.
Workqueue tasks can be initialized at compile time or at run time. The
task needs to be packaged into a structure called the work_struct
structure. A workqueue task is initialized during compile time using:
DECLARE_WORK(name, void (*function)(void *), void *data);
where name is the work_struct name, function is the function to be
invoked when the task is scheduled, and data is the pointer to that
function.
A workqueue task is initialized at run time by using:
INIT_WORK(struct work_struct *work, void (*function)(void *), void *data);
Queuing a job (a workqueue job/task of type work_struct structure)
into the workqueue is done with the following function calls:
int queue_work(struct workqueue_struct *queue, struct work_struct *work);
int queue_delayed_work(struct workqueue_struct *queue, struct work_struct
*work, unsigned long delay);
The delay in queue_delayed_work() is given to ensure that at least a
minimum delay in jiffies is given before the workqueue entry actually
starts executing.
Entries in the workqueue are executed by the associated worker thread
at an unspecified time (depending on load, interrupts, etc.) or after a
delay time. Any workqueue entry that takes an inordinately long time to
run can be cancelled with:
int cancel_delayed_work(struct work_struct *work);
If the entry is actually executing when the cancellation call returns,
the entry continues to execute but will not be added to the queue again.
The workqueue is flushed of entries with:
void flush_workqueue(struct workqueue_struct *queue);
and is destroyed with:
void destroy_workqueue(struct workqueue_struct *queue);
It is not necessary for all drivers to have their own custom
workqueue. Drivers can use the default workqueue provided by the kernel.
Since this workqueue is shared by many drivers, the entries may take a
long time to execute. To alleviate this, delays in worker functions should
be kept to a minimum or avoided altogether.
An important note is that the default queue is available to all
drivers, but only GPL licensed drivers can use their own custom-defined
workqueues:
-
int schedule_work(struct work_struct *work); -- add an
entry to the workqueue
-
int schedule_delayed_work(struct work_struct *work, unsigned
long delay); -- add a workqueue entry and delay execution
There is also a flush_scheduled_work() function that waits for
everything in the queue to be executed, and that should be called by
modules upon unloading.
Interrupt routine changes
The 2.5 interrupt handler internals have undergone a lot of changes, but
most of them do not affect the average driver developer. However, there
are some important changes that affect device drivers.
The interrupt handler function now has a return code of type
irqreturn_t. This change, introduced by Linus, means that the interrupt
handlers tell the generic IRQ layer whether the interrupt is really meant
for it or not. This is done to catch spurious interrupts (especially on
shared PCI lines) where interrupts keep coming in (because the driver has
accidentally enabled the interrupt bits or the hardware has simply gone bad)
and none of the drivers can do anything about it. In 2.6, the driver needs
to return IRQ_HANDLED if the interrupt is from that device or IRQ_NONE if
it has nothing to do with it. This helps the kernel IRQ layer clearly identify which driver is handling that particular interrupt. If an
interrupt keeps coming in and there is no registered handler for that
device (for instance, all drivers return IRQ_NONE), then the kernel can
block interrupts from that device. By default, the driver IRQ routine
should return IRQ_HANDLED as it is a bug to return IRQ_NONE when the
driver has actually handled that interrupt. The new interrupt handler
might look like:
Listing 4. 2.6 Interrupt handler pseudocode
irqreturn_t irq_handler(...) {
..
if (!(my_interrupt)
return IRQ_NONE; // not our interrupt
...
return IRQ_HANDLED; // return by default
}
|
Note that cli(), sti(), save_flags(), and restore_flags() are all
deprecated. Instead, local_save_flags() and local_irq_disable() should be
used to disable all interrupts locally (within that processor). It is not
possible to disable interrupts across all processors.
Unified device model
One of the most notable changes in the 2.5 development is the creation
of a unified device model. The device model represents the overall device
architecture and the overlay of the system by maintaining a number of data
structures. Benefits include improved power management control over
devices and easier management of device-related tasks, including tracking
of:
- The devices that exist in the system and which bus it connects to
- The power state of the device at that particular instance
- The device drivers known to the system and which device(s) they
control
- The bus structure of the system: which devices connect to which bus
and which buses are interconnected (for instance, a USB - PCI
interconnect)
- The device classes present in the system (classes include disks,
partitions, and so on)
Other developments in the 2.5 kernel related to device drivers include:
-
malloc.h has been removed. All code that includes <linux/malloc.h>
(for memory allocation) should now use <linux/slab.h>.
- The HZ value has been increased to 1000 for x86 architecture. A new
jiffy counter called
jiffies_64 has been introduced to avoid the rapid
overflow of the jiffy variable because of the HZ value change.
- A new delay function called
ndelay() has been introduced, which allows
waits for nanoseconds.
- A new type of lock called
seqlock() has been introduced for locking a
small section of data (no pointers) that is frequently accessed.
- Since 2.6 kernels can be preempted,
preempt_disable() and
preempt_enable() need to be used in the driver to protect sections of the
code that should not be preempted (disabling IRQ disables preemption
implicitly).
- Asynchronous I/O has been added in 2.5. This means that user
processes can initiate multiple I/O operations and need not wait for them
to complete. Asynchronous APIs have been introduced to be used by
character drivers.
- The block layer has undergone major changes in 2.5 development. This
means that block device drivers need to be redesigned from 2.4.
- sysfs, which gives a userspace representation of the system's device
model, has been introduced in 2.5. It is mounted on /sys.
 |
Conclusion
Since there are so many changes to Linux 2.6 as compared with 2.4,
there is talk in the kernel world that the new release may actually be
named 3.0. Linus will have the final say in this, and the release may be official as soon as November 2003. Whatever
its version number, the new kernel release promises to give us faster,
more scalable, and more reliable performance across more platforms and
architectures than 2.4.
Linus has asked testers around the world to hunt bugs and report
problems, and has asked distribution maintainers to provide 2.6 versions
for download. If you would like to take part, you can find download and
install links in the Resources below.
Resources
About the author  | |  | Anand K. Santhanam has a Bachelor of Engineering degree in Computer
Science from Madras University, India. He has been in IBM Global Services
(Software Labs), India since July 1999. He is a member of the Linux Group
at IBM, where he has worked with ARM-Linux, character/X-based device
drivers, power management in embedded systems, PCI device drivers, and
multithreaded programming in Linux. Other areas of interest are OS
internals and networking. You can contact him at asanthan@in.ibm.com.
|
Rate this page
|