News

Linux kernel 3.16 development

Jon Masters covers the latest goings on in the Linux kernel community, as the development on the 3.15 kernel comes to a close and 3.16 work begins


Linus Torvalds announced the final Release Candidate (RC) for what will become Linux 3.15, noting that he felt pretty comfortable with the state of things at this point. The 3.15-rc8 kernel contains just a smattering of core kernel fixes (some in the scheduler, some in the filesystem code), and a few more architecture-specific patches, but relatively little overall in the way of churn. In other words, 3.15 is largely baked and ready to go, with the weekly RCs serving their purpose of gradually tapering off toward the final RC7 or RC8 release. Oftentimes, final Linux kernels are released following the RC7 timeframe, with no need for an RC8 to be issued, but on this particular occasion there were enough small last-minute fixes for Linus to feel justified in holding off another week with an RC8 instead.

As Linus notes, “Normally, an RC8 isn’t really a big deal – 3.15 is one of the biggest (if not the biggest) releases in a long time, and we do RC8s with some regularity”. It’s true that, as he goes on to say in his announcement, roughly half of kernel releases have an RC8, while the rest go out after RC7. Linus’s concern with having an RC8 for Linux 3.15 is more practical: he is going on vacation, and will be travelling enough over the coming weeks that he didn’t want a merge window during which his travel plans could cause disruption. As a consequence, Linus is trying something new for the 3.16 kernel cycle. Rather than wait a week for the final 3.15 to go out before opening the 3.16 merge window for new features to land in his development tree, Linus decided to open the merge window concurrently with the 3.15-rc8 release and see if he could streamline the process a little bit more.

In order to achieve a concurrent release of 3.15-rc8 alongside the opening of the merge window for 3.16, Linus is using a ‘-next’ git branch in his source code repository, similar in name to the branches used by his maintainers to stage bits for the next kernel cycle while the previous cycle is ongoing. There he is staging the bits that will be pulled in for Linux 3.16-rc1, while leaving his standard ‘master’ branch to carry the final 3.15 patches ahead of the 3.15 release. If things work out, he will release 3.15 and move the 3.16 bits over into the usual place (the master branch), saving a week of productive time at a point in the kernel development cycle where the risk of disruption is pretty low. It is an interesting experiment, and if it works out, it may well be repeated in future cycles.

16K kernel stacks

One of the ‘fixes’ that did make it into Linux 3.15 after all was the switch to a 16K kernel stack on 64-bit x86 systems. This ‘one-liner’ from Minchan Kim was called out in Linus’s 3.15-rc8 announcement as doing “something I’ve been trying to delay for a long time”. Linux, like many other operating systems, uses a separate, fixed-size kernel stack for every task (known as a ‘process’ to users) in the system. This stack is statically allocated for each task and is used to store the kernel context whenever a task makes a system call (or an interrupt occurs) that causes the kernel to do something on behalf of an application. The kernel context essentially comprises the local data used by the kernel functions that are servicing the system call request at a given moment. Depending upon how deep the call chain goes (how many functions call one another), this stack can come close to being exhausted. If it is exhausted, the kernel will experience a fault that will at minimum kill whatever task is running, but will likely also crash the system immediately, or within a short space of time.

Over the years, the kernel has become increasingly complex, to the point that the original 4K, and later 8K, stack was no longer sufficient to handle the possible depth of nested function calls – especially when performing complex filesystem operations on network storage or other layered protocols (the XFS filesystem particularly suffers when trying to live within a smaller stack). Theoretically, the level of nesting could result in very large amounts of memory use, but in practice this doesn’t occur, and Linux still largely gets away with the 8K choice on x86_64. But enough developers now feel strongly that the 8K stack is too small that it has been increased to 16K in 3.15 to avoid potential problems.

You might wonder how the kernel could run out of stack space. After all, applications don’t have the same problems, right? It turns out applications don’t have to deal with a fixed size stack because their stack is automatically extended by the kernel on demand. Whenever the userspace stack for a given task (‘process’) reaches its limit (because many nested functions were called, or because a large amount of data was allocated on the stack through local functions – programmers today don’t even think about this), the subsequent page fault that results from the application attempting to access beyond the current stack limit is trapped by the kernel, which allocates more stack to the process.

This cannot be done within the kernel itself, however. Kernel stacks consume a fixed amount of unpageable memory – it cannot be ‘swapped out’ to disk. They also need to be physically contiguous, since they may be used in cases where that matters, and consequently cannot be made too large without the added complication of finding available chunks of contiguous physical memory. For these reasons, the stack size is kept as small as it can possibly be, but it was long since time for x86 to catch up with the other, more stack-hungry architectures that had already moved to 16K.

Ongoing development

Borislav Petkov posted a patch entitled ‘CPU hotplug: Slow down hotplug operations’ that introduces a delay in the onlining and offlining of CPUs within a hotplug system. He claims this is because many of those running tests are creating implausible scenarios with hundreds or thousands of continuous offline/online operations, and this is exposing fundamental problems with the hotplug code. His sentiment was respected, but the consensus was that hiding such problems won’t make the situation better. Thomas Gleixner noted that many had proposed they would work on cleaning up hotplug and that he hoped to focus on this again soon.

Alex Williamson posted a ‘new device binding path using pci_dev.driver_override’ patch that extends the existing support in the kernel for dynamically ‘binding’ and ‘unbinding’ drivers to devices by adding support for telling the kernel which driver should bind to a given device in the case that there is more than one driver loaded that could provide support for a piece of hardware. The new sysfs entry found in /sys/bus/pci/devices/…/driver_override allows for fine-grained control in the case that a single device must be bound to a meta driver within a virtualised environment.
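Usage follows the existing bind/unbind sysfs conventions. A hypothetical session might look like the following – the PCI address 0000:01:10.0 and the choice of vfio-pci as the target meta driver are illustrative examples, not taken from the patch posting, and the commands require root:

```shell
# Tell the PCI core that only vfio-pci may bind this device
echo vfio-pci > /sys/bus/pci/devices/0000:01:10.0/driver_override

# Unbind whatever driver currently owns it, then reprobe so the
# override takes effect
echo 0000:01:10.0 > /sys/bus/pci/devices/0000:01:10.0/driver/unbind
echo 0000:01:10.0 > /sys/bus/pci/drivers_probe

# Writing an empty string clears the override again
echo > /sys/bus/pci/devices/0000:01:10.0/driver_override
```

The advantage over the older per-driver new_id/bind interface is that the override is a property of the device, so no other loaded driver can claim it in the meantime.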

Christoph Hellwig announced a new scsi patch queue tree to help out James Bottomley by staging many of the SCSI patches that have been pending for upstream. The new tree is split into SCSI core patches and those of drivers. It is available from the git.infradead.org site under users/hch/scsi-queue.git and comes with a series of rules of engagement that were posted to the Linux kernel mailing list.

Finally, Ted Ts’o posted a reminder that the nomination process for the 2014 Kernel Summit is open, with those wishing to attend able to propose topics. Kernel Summit typically draws around 100 members of the core kernel developer community, but also aims to be inclusive of the broader community. Those with a topic worthy of consideration can consult the online archives of the relevant mailing list: kernel-summit@lists.linuxfoundation.org
