Kernel 3.17 and kdbus – the kernel column

Jon Masters summarises the latest happenings in the Linux kernel community, including ongoing work towards the 3.17 kernel


Linus Torvalds announced the first release candidate of the 3.17 kernel (3.17-rc1) just ahead of this year’s Kernel Summit. Noting that he would be travelling (and thus unable to keep up with a massive influx of patches), he closed the “merge window” (the period of time during which disruptive churn is allowed in any kernel development cycle) for 3.17 one day early. He also noted that, as is typical of northern hemisphere summers, this merge window had been “slightly smaller than the last few ones”. New features pulled into 3.17-rc1 were spread all over the kernel. They include the getrandom() system call, and support for the “memfd” and “file sealing” features needed for kdbus.

We covered getrandom() in LU&D a few issues ago. It provides a new API for obtaining random numbers (entropy) that remains safe even when resources are very scarce (starvation), a situation that can occur, or be arranged to occur by an attacker, on some systems. When a process cannot open entries under /proc and /dev (for example, because it has run out of file descriptors), programs such as the OpenSSL library and the OpenSSH system service may be unable to use the traditional /dev/random style interfaces for obtaining the random numbers used to seed the cryptographic algorithms that power secure sessions. The new system call provides an always-available random number source, such as has been available on BSD systems for years, because a lack of free file descriptors does not affect the ability to make system calls. Thus, a possible attack vector is mitigated, and tools can begin to use the new system call on Linux kernels from 3.17 onwards. More specifically, library functions that provide entropy can choose to use it.
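To make the difference concrete, here is a minimal sketch in Python, whose os.getrandom() function wraps the system call. The helper name read_entropy() is purely illustrative, and the fallback branch is an assumption added for portability rather than anything the kernel change requires.

```python
import os


def read_entropy(num_bytes: int) -> bytes:
    """Return num_bytes of kernel-supplied random data (illustrative helper)."""
    if hasattr(os, "getrandom"):
        # os.getrandom() wraps the getrandom() system call (Linux 3.17+).
        # No file is opened, so it keeps working even when the process
        # has no free file descriptors left.
        buf = b""
        while len(buf) < num_bytes:
            # The call may return fewer bytes than requested, so loop.
            buf += os.getrandom(num_bytes - len(buf))
        return buf
    # Fallback for platforms without the system call; os.urandom() may
    # need to open /dev/urandom there, which is the weakness discussed above.
    return os.urandom(num_bytes)


seed = read_entropy(32)
print(seed.hex())
```

In practice most applications would not call this directly; as noted above, it is the libraries that hand out entropy that are expected to adopt the system call.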

At the architecture level, 3.17 adds four-level page table support to the 64-bit AArch64 (arm64) architecture, which brings it on par with x86 in terms of the addressable physical memory that can be used by Linux on such hardware. Systems with enough physical memory to require the full range (nominally 64-bit, but actually 2^48 bytes, which is 256 tebibytes) will not exist for a long, long time, but there can be value in being able to have a sparsely populated physical memory map on 64-bit systems (mixing memory and IO devices). 3.17 also removed support for several ancient architectures that are no longer in widespread use. As is typically the case, no formal planning was in place to remove such architecture support; rather, breakage went unfixed for long enough to make it obvious to the kernel community that there were no actual users with an interest in support.
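The 2^48 figure can be derived from the page table geometry. The sketch below assumes 4KiB pages (12 offset bits) and tables of 512 eight-byte entries (9 index bits per level); neither value is stated above, and arm64 supports other page sizes, so treat the constants as illustrative.

```python
PAGE_SHIFT = 12      # assumption: 4 KiB pages, so 12 bits of page offset
BITS_PER_LEVEL = 9   # assumption: 512 eight-byte entries per 4 KiB table


def address_bits(levels: int) -> int:
    """Number of address bits translated by a page table of this depth."""
    return PAGE_SHIFT + levels * BITS_PER_LEVEL


def addressable_tib(levels: int) -> float:
    """Size of the translated address range, in tebibytes."""
    return 2 ** address_bits(levels) / 2 ** 40


for levels in (3, 4):
    print(levels, "levels:", address_bits(levels), "bits,",
          addressable_tib(levels), "TiB")
```

Under these assumptions, three levels cover 39 bits (half a tebibyte) while the fourth level adds another 9 bits, giving the 48 bits and 256 tebibytes mentioned above.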

Preparing for kdbus

Preparation work landed in 3.17 for the forthcoming “kdbus” in-kernel implementation of D-BUS. D-BUS is an IPC (Inter-Process Communication) system originally defined by the FreeDesktop and GNOME communities (and also used in KDE as of Qt 4). D-BUS allows system services to register themselves using an assigned name within a global namespace and then to exchange messages with other services. Services can be activated in response to messages, which is how many of the modern Linux desktop capabilities are provided. In a typical implementation, there is one specially privileged “system channel” upon which various core services communicate, and one “session channel” per login session for system users. Typically, D-BUS is implemented using the (now venerable) dbus-daemon, but work is underway to roll much of this into the all-consuming systemd service. As part of that effort, a broader effort is underway to enhance performance using a kernel mechanism.

Kdbus aims to reduce the number of different interactions required in order for one service to send a message to another using the D-BUS mechanism. Because the existing implementation lives purely in userspace, a sending process must make a system call to transmit a message, which is then handled by a (lightweight, but still in the middle) dbus-daemon, and finally received by the intended recipient of the message via yet more round trips into and out of the kernel. Rather than requiring so many context switches between various processes (known as “tasks” within the kernel), kdbus obviates the need for a special daemon in many cases. Instead, in this fast path, messages can be sent and received directly. In some other cases, more heavyweight handling might be required, so a system service will remain, especially for legacy applications.

In order for kdbus to operate, some changes to the kernel are required (beyond simply providing kdbus itself). These include the (newly merged) “memfd” mechanism that provides for zero-copy sharing of trackable memory regions between processes (using a memory file descriptor as one would use a traditional file descriptor). The “file sealing” capability works hand in hand with “memfd”: a message sent within a shared memfd must be decoded and interpreted by a receiving process without risk of the sending process modifying the content after it has been sent, hence the “sealing” part. Once a memfd is sealed it cannot be changed by the sending process. A typical kdbus message transmission will take the form of a process allocating a memfd, building a message into that newly allocated memfd, sealing it and then passing it to kdbus for transmission. Since kdbus is an in-kernel mechanism, that transmission can take place in a very lightweight “zero copy” fashion, in that the kernel can play a few classical memory management tricks to directly share the memfd buffer.

Ongoing development

A number of increasingly urgent conversations have taken place over the past few months about Linux readiness for the year 2038. These culminated in a (lengthy) “discussion fodder” post from John Stultz on the topic ahead of the 2014 Kernel Summit. John proposed a number of responses to the “end of time for 32-bit architectures”, such as the introduction of a new special sub-architecture variant of the popular 32-bit architectures with a newer ABI supporting full 64-bit time quantities, as well as other less drastic options. 2038 is special for Unix and Unix-like operating systems in that just after 3am UTC on 19 January 2038, 32-bit Unix time_t quantities representing the number of seconds since the Unix epoch (birth) in 1970 will overflow, and affected systems will believe they have been thrown back in time (a signed 32-bit counter wraps around to December 1901). There is some time (pun intended) to address this problem, but not as much as readers may think. While most financial systems (calculating interest on mortgages 25 and 30 years out, for example) have long since been upgraded to 64-bit (or have other solutions that are unaffected), other embedded or less critical systems that have to represent future dates will take many years to be replaced, so the sooner a solution is in place (other than “just go 64-bit”) the better for everyone.
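The exact moment is easy to work out. This short sketch assumes a signed 32-bit time_t counting seconds from the 1970 epoch, and uses an average year of 365.25 days for the rough 64-bit comparison; the variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1   # largest value a signed 32-bit time_t can hold
INT64_MAX = 2**63 - 1   # largest value a signed 64-bit time_t can hold

# Last second that a signed 32-bit time_t can represent.
last_32bit_second = EPOCH + timedelta(seconds=INT32_MAX)
print(last_32bit_second.isoformat())   # 2038-01-19T03:14:07+00:00

# Rough range of a 64-bit counter, in years (too large for datetime).
SECONDS_PER_YEAR = 365.25 * 24 * 3600
years_64bit = INT64_MAX / SECONDS_PER_YEAR
print(f"{years_64bit:.3g} years")
```

The second figure comes out at roughly 292 billion years, which is why a 64-bit time quantity is considered a permanent fix.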

Matt Fleming, Yinghai Lu, Harald Hoyer, and Mantas Mikulenas had a discussion around support for loading kernel initramfs (initrd) image files above the 4GB boundary in physical memory on 64-bit EFI-based x86 systems. Traditionally, such support has been limited due to various assumptions that might be in place within system firmware and kernel code around the venerable x86 PC platform. Although the kernel handles this well, and UEFI is more than capable, the introduction of a patch allowing for an initramfs above 4GB resulted in breakage for Mantas due to “buggy firmware” on his system. Yinghai noted that his original reason for requesting the patch had been to support multi-GB initramfs images, which are generally rare corner cases today.

Daniel Thompson posted patches allowing “selective reduction in capabilities” within kdb (the optionally built-in Linux kernel debugger). These can be used for so-called “kiosk mode” applications, in which physical system access might be made available to those who should not be able to trivially launch a kernel debugger through a few physical keypresses. This is similar to work done to restrict general use of the kernel’s “magic sysrq” key combinations such that users can’t arbitrarily reboot the system or cause kernel diagnostic output.

Mitchell Joblin of the University of Passau posted to the Linux Kernel Mailing List (LKML) noting that the university is conducting research into various “Linux Development Collaboration Patterns” concerning social interaction between developers in open source communities. He solicited feedback via a survey.

Finally this month, Stephen Rothwell followed up the Linux 3.17-rc1 release with his usual set of linux-next statistics for the preceding cycle. Linux-next is the feeder for many features that end up in future kernels, as shown by the fact that 90.7 per cent of the changes in 3.17-rc1 had previously been listed in linux-next.