Speed up the Kernel – The Kernel Column with Jon Masters

Jon Masters examines performance tweaks for the Linux kernel and summarises the latest goings-on in the kernel community


Kernel performance tweaks

The Linux kernel operates at the most fundamental (hardware) level, and it is responsible for providing the many software abstractions that modern Linux systems rely upon. Fundamentally, the job of the kernel is to provide all of these conveniences while keeping out of the way as much as possible. Time spent by the kernel is time that cannot be spent performing useful work within application code, even if that application is the latest version of Angry Birds. The kernel is also intended to run on a wide range of different hardware systems ‘out of the box’ – from the smallest embedded device to the largest supercomputer. As such, the many algorithms it uses have been heavily optimised over the past two decades, but there are limits.

It’s a fact of life that a system used predominantly as a desktop, for example, has different needs from (say) a system used to process real-time stock trade transactions, or to serve webpages to millions of social media consumers. These different use cases have led to considerable flexibility, especially in the scheduling algorithms used within the kernel to determine which task (user application process) should receive time (a quantum) on the CPU(s), or which I/O blocks should be written out to disk next. If the task scheduler isn’t performing adequately for the user’s needs, the system will appear to lack interactive responsiveness (for example, sound glitching or the mouse cursor stuttering), while if the I/O scheduler is not performing, individual applications will appear to hang or become extremely sluggish.

The default Linux task scheduler, known as CFS (the Completely Fair Scheduler), has been designed to be as general purpose as possible. It keeps track of how often individual tasks run (actually, how often they sleep), and adjusts the amount of time they are given based upon various metrics. It is possible to influence the default CFS scheduler by using various system tunables. These are set using the sysctl command (see ‘man sysctl’) and are typically documented at length in your Linux distribution manuals. Various books are available that walk you through tweaking these values for different systems. If you experiment with these settings, and their results, you will quickly come to realise how much performance tuning really is more an art than an exact science. While there are benchmarks and tests, your system may have unique attributes that set it apart. Given this, it’s amazing how good CFS is out of the box, and how much work has gone into making it behave so well.
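Before changing anything, it’s worth seeing which scheduler tunables your particular kernel exposes. A minimal sketch (the exact tunable names, and the example value shown in the comment, vary between kernel versions and are illustrative only):

```shell
# sysctl is a front-end to the files under /proc/sys; list the
# scheduler-related tunables your kernel provides:
sysctl -a 2>/dev/null | grep 'sched' | head

# The same values can be read directly from the filesystem:
ls /proc/sys/kernel | grep '^sched' || true

# Example tweak (requires root; the tunable name and value here are
# illustrative -- check what your kernel actually exposes first):
# sysctl -w kernel.sched_min_granularity_ns=4000000
```

Changes made this way last only until reboot; to make them permanent, distributions conventionally place them in /etc/sysctl.conf.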

Beyond global system tunables, individual applications may have their own scheduler priority levels, across two general classes: SCHED_OTHER (normal) and SCHED_RT (real-time). The regular SCHED_OTHER levels are set using the ‘nice’ and ‘renice’ commands (see ‘man nice’ for details), while the real-time priorities are set using special code, and only by applications with suitable privileges. Real-time priority applications will generally run until they relinquish the CPU, potentially introducing horrible performance penalties on other programs if they are not written to be well behaved. An example of a real-time priority program is the PulseAudio sound daemon, which has sufficient privilege to allow it to always receive the CPU time that it needs in order to process audio data quickly.
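A few quick examples of adjusting priorities from the command line (the PID and program name below are placeholders, not real processes on your system):

```shell
# Run a job at a lower priority. Nice values range from -20 (highest
# priority) to 19 (lowest); unprivileged users may only lower priority:
nice -n 10 sh -c 'echo "running at nice 10"'

# Adjust a process that is already running (PID 1234 is a placeholder):
#   renice -n 5 -p 1234

# Give a program a real-time SCHED_FIFO priority of 50 -- one of the
# real-time classes described above -- using util-linux's chrt command
# (requires privilege; './audio-daemon' is a hypothetical program):
#   chrt -f 50 ./audio-daemon
```

Be careful with chrt: as noted above, a badly behaved real-time task can starve everything else on that CPU.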

The real-time priorities achieved with SCHED_RT aren’t by default truly ‘real-time’ in the sense of low latency – they don’t guarantee that the kernel itself won’t interrupt whatever is running to perform housekeeping duties. To achieve truly low-latency real-time behaviour, of the kind used by stock exchanges, you need to use these priorities with a special kernel, one built with the preempt-rt patches (usually available in your distribution as a separate kernel package). These modify the kernel (though increasing amounts of the code are making their way into stock Linux over time) to sacrifice a little throughput in favour of responsiveness. True ‘RT’ kernels of the kind used by Wall Street accept a 5-10% (or more) hit to overall system throughput (the number of calculations possible) in exchange for a guarantee that no task will experience an unwanted latency higher than a fixed, very small duration. ‘RT’ kernels are also popular with the Linux audio and gaming communities. If you don’t want to swap out the whole kernel, but you are willing to configure and rebuild it, you can find various PREEMPT settings in the standard kernel build system. These can be configured such that the kernel performs more optimally on servers or on desktop systems. CFS itself can be replaced with another scheduler, but this is not an easy operation. At this time, one popular alternative with some desktop users is the BFS (Brain F**k Scheduler) from Con Kolivas. Con provides patches that are intended for desktop systems. His BFS is far less tunable than CFS, but this is intentional. BFS isn’t intended to be good for servers; it is optimised specifically for desktop users, and in particular those who play games or wish to enjoy a low-latency audio experience. Some distributions have gone as far as to provide specially built kernels with Con’s patches applied.
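To see which preemption model your current kernel was built with, you can grep its build configuration. A sketch – the config file location varies by distribution, and /proc/config.gz only exists if that option was enabled at build time:

```shell
# Look for the CONFIG_PREEMPT* options in the running kernel's config;
# try /boot first, then the in-kernel copy, then give up gracefully:
grep 'CONFIG_PREEMPT' "/boot/config-$(uname -r)" 2>/dev/null \
    || zcat /proc/config.gz 2>/dev/null | grep 'CONFIG_PREEMPT' \
    || echo "kernel config not available on this system"
```

A desktop-oriented build typically shows CONFIG_PREEMPT=y, while server kernels often use CONFIG_PREEMPT_NONE or CONFIG_PREEMPT_VOLUNTARY.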

As with task scheduling, it is also possible to tune how system I/O is scheduled. Linux provides several possible I/O schedulers, also known as ‘elevator’ algorithms. The name is a historical nod to the way early rotating-disk drivers were written: they would read or write the disk in sequential sweeps across its surface (keeping physically related reads and writes together), rather than asking the disk heads to seek rapidly from one location to another. Rapid seeking is both mechanically more wearing and less efficient than reordering I/O accesses in the same way that elevators (lifts) in buildings reorder requests to stop at floors in the order in which they are passed. While rotational disks are becoming a thing of the past, they’re not gone yet, and even when they are, a multitude of other factors can render one I/O scheduler more efficient than another on a specific system.

The default I/O scheduler is known as ‘CFQ’ (Completely Fair Queuing), but there is also a ‘Deadline’ scheduler, and the ‘noop’ (do not reorder any I/O requests) scheduler. These can be selected at system boot time by using the kernel parameter ‘elevator=’ with an option of ‘deadline’, ‘noop’ or ‘cfq’, or switched at runtime through the /sys file system. Each provides various tunable parameters under /sys/block/<device>/queue/iosched.
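The current and available I/O schedulers can be inspected (and, with root, changed) through sysfs. A sketch, assuming at least one block device is present on the system:

```shell
# Pick the first block device present (often 'sda'):
dev=$(ls /sys/block | head -n 1)

# The active scheduler is shown in brackets among those available:
cat "/sys/block/$dev/queue/scheduler" 2>/dev/null || true

# Switch that device to the deadline elevator at runtime (requires root):
#   echo deadline > /sys/block/$dev/queue/scheduler

# Tunables for the active scheduler then appear under iosched:
ls "/sys/block/$dev/queue/iosched" 2>/dev/null || true
```

Runtime switching is handy for benchmarking: you can compare schedulers on a live workload without rebooting, then make the winner permanent with the ‘elevator=’ boot parameter.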

If you would like to read more about tuning in future issues, drop me a line using @jonmasters on Twitter.

Linux kernel 3.7 update

The final Linux 3.7 release is imminent at the time of writing. The final release candidate (3.7-rc7) was announced by Linus soon after returning from vacation. He had considered skipping a final release candidate, but concluded in the end that he had been right not to. Apparently, -rc7 is “in fact slightly scarier” (than previous RCs) because it contains last-minute fixes to various storage subsystems (MD, SCSI and the generic block layer), which warrant some testing.

Linus’s week-long vacation contained the usual amount of diving-related fun. When he is not maintaining the Linux kernel, he maintains a diving app that is growing in popularity, and takes some awesome photos that are visible in his public Google+ profile. One photo of a giant tortoise led to the suggestion (from this author) that we need a ‘Giant Tortoise’ Linux kernel release code name for a future kernel, joining the likes of the ‘Saber-Toothed Squirrel’.

Ongoing kernel development

A flurry of activity is happening in the virtual memory subsystem, in particular with relation to memory pressure. When a system is running low on free memory, the kernel uses a ‘reclaim’ algorithm to scan through physical memory for those chunks (pages) that can be freed for reuse. Typically, these pages will come from the page cache, which contains copies of data that should (in the longer term anyway) be stored onto some kind of disk or other persistent media. Since these pages are intended for storage to disk, they will be written out (even synchronously, if the system is really busy, resulting in the ‘thrashing’ experience of sluggishness). In some cases, copies of program code can be discarded and reloaded (at a noticeable performance penalty) when later accessed.

Although the kernel contains special code to handle very low memory situations, it does not currently convey this to (user-space) applications until it is about to force-terminate them in the case of memory exhaustion. Various patches currently under review modify this situation by adding the ability for the kernel to report various kinds of memory pressure ahead of an emergency to interested applications (which register by opening a special file descriptor located in /dev), or even to ask an application for assistance in freeing unnecessary memory. Many applications contain built-in caches of several hundred megabytes (eg a web browser cache) that can readily be recreated later as necessary. In the future, systems will be smarter in asking such greedy applications to voluntarily release memory before they are killed.