Ongoing developments – the kernel column

Jon Masters covers the latest in the kernel community, as work wraps up on Linux 3.18 and new feature development continues


Linus Torvalds announced Linux 3.18-rc5, noting that “Hmm. We had a very calm -rc4, and I wish I could say that things continued to calm down, but… Yeah, rc5 is clearly bigger than rc4 was. Oh well.” He proceeded to note that rc4 had been smaller than usual and that the content “[looks] fairly normal”. All in all, the level of churn is about what one would expect in the latter part of a kernel development cycle. We’ll have a full summary of 3.18 in our next issue.

Cache Allocation Technology

Recent Intel chips contain a new feature known as Cache Allocation Technology (CAT). This architecture “provides a way for the Software (OS/VMM) to restrict cache allocation to a defined ‘subset’ of cache which may be overlapping with other ‘subsets’”. In other words, Intel’s new feature enables the (increasingly large) on-chip caches to be controlled in a fairly fine-grained fashion, biasing the allocation policy in favour of specific running applications. Traditionally, CPU caches have not been directly visible to applications. Instead, they are managed automatically by the hardware, which caches memory locations as they are accessed in case they are used again. The kernel has had to be aware of the caches only as far as handling invalidation and other system-level details; it has not normally been involved in the underlying allocation policy.

The new CAT cgroup patches expose the underlying cache allocation policy to the user, allowing for some fairly novel and exciting tunings at runtime. The new CBM (Cache Bit Mask) file within the cpuset cgroup entries enables a user to assign a bitmask range of cache memory to a given cgroup, which the CPU will then enforce in future allocations. This means that it is possible to reserve chunks of the cache for specific parts of the cgroup hierarchy. An example provided by Intel’s Vikas Shivappa shows a 16-bit mask being used in which the low 4 bits (0xf) are written into the cpuset.cbm file for a given cgroup. In the example, this allocates the “right 1/4th(512KB)” of a 2MB cache to the specified cgroup, which will have exclusive use of that part of the cache. Cache Allocation Technology is likely to be of interest to very sophisticated users who have exacting control over their environment and wish to ensure that their applications do not suffer from unwanted latencies due to cache and memory interaction.
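The arithmetic behind that example can be sketched in a few lines of Python. This is only an illustration: the helper name cbm_share is made up for this column, and it assumes that every bit of the mask stands for an equal-sized slice of the cache, which is what the 16-bit mask and 2MB cache in the example imply.

```python
def cbm_share(mask: int, mask_bits: int, cache_bytes: int) -> int:
    """Return how many bytes of cache a cache bit mask covers.

    Assumes each mask bit represents an equal slice of the cache.
    """
    if mask <= 0 or mask >= (1 << mask_bits):
        raise ValueError("mask must be non-zero and fit in mask_bits")
    set_bits = bin(mask).count("1")  # number of slices selected
    return cache_bytes * set_bits // mask_bits


CACHE_BYTES = 2 * 1024 * 1024  # the 2MB cache from the example
share = cbm_share(0xF, 16, CACHE_BYTES)
print(share // 1024, "KB")  # 512 KB
```

Four of the sixteen bits are set, so the cgroup receives a quarter of the cache.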

Restart blocks

The Linux kernel contains support for restarting system calls. Usually, a restart occurs if a system call is interrupted (by a signal to the underlying process) midway through, provided that the associated signal handler has SA_RESTART specified as a flag in its sigaction. When such an interruption occurs, the system call (for example, a file IO operation such as a read) will be aborted and the kernel-side implementation of that system call will ‘return -ERESTARTSYS’. The kernel system call mechanism will automatically intercept this and restart once the interrupting signal has been dealt with. The user application need not implement special support for such restart handling, which happens automatically.
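The flow described above can be modelled with a toy sketch in Python. This is not kernel code: run_syscall and make_read are hypothetical names invented for illustration, and the whole signal delivery step is reduced to a comment. The only real values are the kernel-internal ERESTARTSYS code (512) and the EINTR error that userspace sees when a call is not restarted.

```python
import errno

ERESTARTSYS = 512  # kernel-internal code; userspace never sees it


def make_read(interruptions: int, nbytes: int):
    """Build a fake 'read' that is interrupted a given number of times."""
    state = {"left": interruptions}

    def impl() -> int:
        if state["left"] > 0:
            state["left"] -= 1
            return -ERESTARTSYS  # interrupted by a signal
        return nbytes  # completed normally

    return impl


def run_syscall(impl, sa_restart: bool) -> int:
    """Toy model of the return path for an interruptible system call."""
    while True:
        result = impl()
        if result != -ERESTARTSYS:
            return result
        # (the signal handler would run at this point)
        if not sa_restart:
            return -errno.EINTR  # handler lacks SA_RESTART: caller sees EINTR
        # SA_RESTART set: loop round and rerun the call transparently


print(run_syscall(make_read(1, 42), sa_restart=True))   # 42
print(run_syscall(make_read(1, 42), sa_restart=False) == -errno.EINTR)  # True
```

With SA_RESTART the caller simply sees the read complete; without it, the same interruption surfaces as an EINTR error that the application must handle itself.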

Restart handling is implemented using a small, embedded structure that is contained within the special ‘thread_info’ per-task (process) structure. Each running process (known as a ‘task’ within the kernel) has a thread_info, typically at the very bottom of its kernel stack. The thread_info, or ti, contains various critical pieces of data that the kernel needs to access very frequently; placing it there means those pieces of data can be located cheaply via the kernel stack pointer. The thread_info is also used to locate certain essential task data, such as the larger per-task task_struct that provides the task name, and metadata pointers like VFS information for the running process. Storing system call restart information within the thread_info is not unreasonable, which is where struct restart_block lives today. When a restart is needed, a function pointer within the restart block is populated and called to rerun the system call.
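To see why the stack pointer is such a convenient handle, consider the usual trick of rounding it down to the start of the stack. The sketch below shows the arithmetic in Python; the 8KB stack size and the example address are assumptions chosen for illustration (real sizes vary by architecture and kernel version), and thread_info_addr is a made-up helper name.

```python
THREAD_SIZE = 8192  # assumed kernel stack size; must be a power of two


def thread_info_addr(sp: int, thread_size: int = THREAD_SIZE) -> int:
    """Round a kernel stack pointer down to the base of its stack."""
    return sp & ~(thread_size - 1)


sp = 0xC1234F80  # hypothetical stack pointer somewhere inside the stack
print(hex(thread_info_addr(sp)))  # 0xc1234000
```

Because the stack grows downwards towards that base address, anything that runs off the end of the stack lands on the thread_info first.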

One of the problems with implementing restart handling in this way is that an overflow of the kernel stack can corrupt the thread_info. If an attacker can find an exploit that corrupts the thread_info in the right (controlled) way, they can abuse restart handling to execute arbitrary code of their choice. Andy Lutomirski (of AMA Capital) posted a patch series that moves the restart_block structure into the task_struct, mitigating this theoretical avenue of attack.

Ongoing development

Discussion around Andrea Arcangeli’s userfault memory management subsystem patch set continued as he polishes his work in preparation for merging. We explained last issue how userfault allows userspace code (such as that providing backing for virtual machine live migration) to become aware of the page fault activity of monitored processes. This enables an optimisation in migrating VMs since it is possible to begin virtual machine execution on a new physical machine while the background copy of memory pages (the units by which memory is referenced) completes. Zhanghailiang liked Andrea’s patches and saw potential for a simplified implementation of live memory snapshotting of virtual machines. As such, Zhang requested a few additional tweaks to Andrea’s code, including the ability to specify what should trigger a user fault. If it were possible to configure that only memory writes will trigger a fault, then memory snapshot support could rely upon an event occurring only when changes to the previous snapshot contents have occurred.

Speaking of virtual memory developments, Jerome Glisse posted a patch series implementing ‘Heterogeneous Memory Management’ (HMM). This patch set further helps to realise the natural course of CPU and GPU memory unification efforts over the past few years by enabling the compute and graphics (or, more likely, GPU offload) sides of the universe to share the “exact same address space on the GPU as on the CPU”. The HMM patches deal with the fact that memory bandwidth differs greatly between the CPU and GPU, with the latter typically having far greater bandwidth available. Thus, a particular virtual address may live in GPU or CPU memory, and may need to migrate transparently between the two. HMM allows the kernel to handle this migration by performing a background DMA API-driven movement of underlying memory depending on CPU or GPU ownership.

Martin Tournoij posted a question that led to a small debate around whether Linux should implement the SIGINFO signal for userspace processes, as is the case on the various BSDs (and Mac OS X). SIGINFO lets a user send the Ctrl+T key sequence and receive a printout of the impact of this process upon system load, as well as key status information. Unlike SIGUSR1 and SIGUSR2, SIGINFO does not cause a process to be killed if it does not have special support added to capture the signal and display metrics. Thus, it is a useful and non-dangerous mechanism. SIGINFO is not a POSIX-specified standard signal number (which is why it is not already supported), but Linux has already had a number of non-standard signals defined for decades. The general consensus was that there would be little wrong with adding support. It’s only a matter of time before someone posts such a patch.

Yann Droneaud posted in response to a thread discussing 32-bit userspace compatibility with a link to a presentation that he wrote on the topic of best practices for building kernel to userspace ABIs. Different architectures have varying natural alignments (the requirement that certain memory accesses occur against addresses that are ‘aligned’ or rounded to a multiple of the access size – e.g. a 64-bit access can only occur against an address at a multiple of 8 bytes), different structural padding requirements and backward compatibility complications with 32-bit installs. Yann’s paper can also be read online.
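The padding problem is easy to see from userspace with Python’s struct module. The sketch below is not taken from Yann’s material; it uses a hypothetical structure holding a 32-bit field followed by a 64-bit field, and compares its size with no padding, with the native rules of whatever machine runs the script, and with the padding written out explicitly.

```python
import struct

# Hypothetical ABI structure: a 32-bit field followed by a 64-bit field.
packed = struct.calcsize("=IQ")      # standard sizes, no padding: 4 + 8
native = struct.calcsize("@IQ")      # this machine's native alignment rules
explicit = struct.calcsize("@I4xQ")  # four pad bytes spelled out by hand

print(packed, native, explicit)
```

The native size depends on where the script runs: a machine that aligns 64-bit values to 8 bytes inserts four hidden pad bytes after the first field, while one that aligns them to 4 bytes does not. Writing the padding out explicitly gives the same 16-byte layout under either rule, which is the kind of defensive layout that avoids surprises between 32-bit userspace and a 64-bit kernel.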