Linus Torvalds announced Linux 3.17, the Shuffling Zombie Juror, saying, “The past week was fairly calm, and so I have no qualms about releasing 3.17 on the normal schedule”. The latest kernel includes a number of nice headline features, such as the new getrandom() system call and sealed files APIs that we covered in previous issues of LU&D. Linux 3.17 also includes support for less highlighted new features, such as new signature checking of kexec()’d kernel images and sparse files on Samba file systems (which is significant for those mounting Windows and Mac shares).
The Linux kernel includes support for the kexec subsystem, which provides a means for a running Linux kernel to chain load into another kernel. It can be used to test another kernel without a complete reboot, but it is most commonly used as part of the kdump kernel crash dumping framework to capture a kernel crash dump and save logs. When the kernel crashes (especially in Enterprise Linux distros), it can be configured to automatically execute a special kdump kernel that has been previously loaded into a reserved region of memory. This special kernel (and an initramfs containing the necessary scripts for crash capture) performs the task of capturing the state of the crashed (panicked) kernel, including the physical content of memory (either selected regions or all of it) as well as other useful debugging information. This can be saved to local disk, or it can be automatically uploaded across the network to a crash server.
There are a number of issues with using crash dumps or kexec()’d kernels in general. The existing kernel might have had in-flight DMA operations that haven’t been cancelled and which may result in modification of the hardware or the crash kernel’s memory after the crash has occurred. In the worst case, where an IOMMU is involved, things can get difficult because those DMA operations may be going through a translation that disappears mid-operation. But that isn’t the hardest problem with using kexec. Perhaps one of the harder problems has been using kexec with a UEFI Secure Boot system, which enables the platform to enforce that the bootloader (and OS kernel) is signed by a known Certificate Authority (CA). UEFI binaries use the PE/COFF file format created by Microsoft and are typically signed by Microsoft as the current signing authority for Intel x86 systems.
Microsoft provides signing services for a variety of vendors to provide UEFI Secure Boot support beyond just Windows, including Linux distros. In order for a distro to support Secure Boot, it must have certain bootstack components signed using a Microsoft-defined process. Once a distro supports Secure Boot, there is still a little left to be done in order to provide kernels that can be used on a platform running in Secure mode. For one thing, it is necessary to restrict a few of the capabilities provided to even the root user so that a system that has booted securely cannot be abused to load unsigned operating systems.
The fear is that if Linux were able to load and boot into unsigned operating systems, someone might create a specially hacked version of Windows that could be loaded this way, and then embed this into a special piece of malware that could be shipped along with a trusted Linux OS. Windows users hit by this would reboot their computer and have it boot a malicious version of Windows (via the Linux OS) without flagging any warnings. It is believed that Microsoft could then have grounds to react by revoking the keys used to sign Linux bootloaders. This is a somewhat contrived argument, but since it is (vaguely) plausible, it has been enough to force the disabling of kexec within kernels that have been Secure Booted. The latest patches in Linux 3.17 (from Vivek Goyal) add a new kexec_file_load system call that plumbs into a new mechanism for validating a SHA256 hash of the (to be) kexec()’d kernel before actually calling into it.
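To make the idea of checking a hash before handing over control more concrete, here is a minimal userspace sketch in Python. It is only an illustration of the general technique, not the kernel’s implementation: the function name image_matches_digest and the sample image bytes are hypothetical, and the real check happens inside the kernel rather than in a script.

```python
import hashlib
import hmac


def image_matches_digest(image_bytes, expected_hex):
    """Return True if the SHA256 digest of image_bytes equals expected_hex.

    Hypothetical helper for illustration only; this is not how the kernel
    exposes or performs its own validation.
    """
    actual_hex = hashlib.sha256(image_bytes).hexdigest()
    # compare_digest avoids leaking where the two strings first differ.
    return hmac.compare_digest(actual_hex, expected_hex.lower())


if __name__ == "__main__":
    # Stand-in bytes; a real caller would read the kernel image from disk.
    image = b"pretend this is a kernel image"
    trusted_digest = hashlib.sha256(image).hexdigest()

    print(image_matches_digest(image, trusted_digest))          # unmodified image
    print(image_matches_digest(image + b"!", trusted_digest))   # tampered image
```

The point of the sketch is simply that any change to the image, however small, produces a different digest, so a loader that refuses to continue on a mismatch will not jump into a modified kernel.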
Sparse Samba Files
Linux 3.17 includes support for sparse files on Samba shares. Sparse files are files that may contain holes; a file may appear to be hundreds of TB, but may take virtually no storage on a disk. This can happen if the file is largely full of zeros, such as is the case in a newly allocated virtual machine disk image. Rather than allocate terabytes of zeros ahead of time, the file system will wait until the file is modified, filling in the holes with actual data at that time and saving precious space on the underlying physical media. Over the years, sparse file support has become standard on many different operating systems, but until Linux 3.17 it has not been possible for a Linux client to take advantage of sparse files hosted on Windows systems (or on Macs, which use Samba to provide SMB shares).
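The gap between a sparse file’s apparent size and the space it really occupies is easy to see on a local Linux file system. The following Python sketch is an illustration under a few assumptions: the helper name make_sparse_file and the file name disk.img are made up for the example, the underlying file system is assumed to support holes, and st_blocks is assumed to be counted in 512-byte units as it is on Linux.

```python
import os
import tempfile


def make_sparse_file(path, size_bytes):
    """Create a file that is one big hole and report (apparent, allocated) bytes.

    Hypothetical helper for illustration. truncate() extends the file
    without writing any data, so a file system that supports sparse files
    does not need to allocate blocks for it.
    """
    with open(path, "wb") as f:
        f.truncate(size_bytes)
    st = os.stat(path)
    # st_blocks counts 512-byte units actually allocated on disk.
    return st.st_size, st.st_blocks * 512


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        apparent, allocated = make_sparse_file(
            os.path.join(tmp, "disk.img"), 64 * 1024 * 1024
        )
        print("apparent size:", apparent, "bytes")
        print("allocated on disk:", allocated, "bytes")
```

On a file system with sparse file support the allocated figure should be far smaller than the apparent size, and it only grows as real data is written into the holes.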
Bastien Nocera posted a “desktop kernel wishlist” (bit.ly/1zfvVjh), which he described as “Similar to systemd’s ‘Plumber’s wishlist’”. It covers features that had previously been added to the kernel, as well as some that have not been, or are controversial. The idea is to foster a conversation between the GNOME and kernel communities (who have not always gotten along as well as might be desired).
Andrea Arcangeli (of Linux Memory Management fame) posted an RFC patch for his userfault feature. The proposal allows for “postcopy live migration in qemu, and in general demand paging of remote memory, hosted in different cloud nodes”. In other words, if userfault makes it upstream, it will be possible for a virtual machine to instantaneously live migrate to another server without having to first copy all of its memory content. Today that copy can take many seconds. The machine is not actually suspended while it happens (a clever trick is played where pages, the unit of system memory, are repeatedly copied until there are very few being modified that haven’t already been copied, and then the virtual machine is shut down briefly and the rest moved over), but it is not yet running on the new hardware either.
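The existing copy-first approach can be pictured with a toy model. The Python sketch below is not how qemu implements migration; the function name precopy_migrate, the fixed redirty_ratio and the stop_threshold are all assumptions chosen to keep the example deterministic. It only shows the shape of the loop: keep re-sending pages the guest has dirtied until few remain, then pause briefly and send the rest.

```python
def precopy_migrate(num_pages, redirty_ratio=0.25, stop_threshold=8, max_rounds=30):
    """Toy model of copy-first live migration.

    Returns (rounds, pages_sent_while_paused, total_pages_sent).
    All parameters are illustrative assumptions, not qemu settings.
    """
    dirty = set(range(num_pages))  # at the start every page must be copied
    total_sent = 0
    rounds = 0

    while len(dirty) > stop_threshold and rounds < max_rounds:
        total_sent += len(dirty)
        # The guest kept running during this round, so some of the pages
        # just sent were modified again and must be re-sent.
        redirtied = int(len(dirty) * redirty_ratio)
        dirty = set(sorted(dirty)[:redirtied])
        rounds += 1

    # Few enough dirty pages remain: pause the guest and send the rest.
    paused_pages = len(dirty)
    total_sent += paused_pages
    return rounds, paused_pages, total_sent


if __name__ == "__main__":
    rounds, paused_pages, total_sent = precopy_migrate(1024)
    print(rounds, "copy rounds,", paused_pages, "pages sent while paused,",
          total_sent, "pages sent in total")
```

In this model the guest is only paused for the last handful of pages, but more pages are sent in total than the guest actually has, and the guest stays on the old host for the whole of the copy loop.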
Using postcopy would presumably allow the inverse of the current migration path to happen. This would mean copying the frequently used pages first with the virtual machine, then copying the rest after the fact. Any “page faults” that occurred due to memory that hadn’t yet been copied would be trapped and handled in userspace by the virtual machine hypervisor control software (such as qemu-kvm), which would simply arrange that the missing data be moved from the old machine more rapidly. This could be good for instantaneous live migrations.
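The fault-and-fetch idea can also be sketched as a toy model in Python. This is purely illustrative: the class name PostcopyGuest, its methods and the dictionary standing in for the old host’s memory are all hypothetical, and the real mechanism relies on the kernel delivering page faults to userspace rather than on a dictionary lookup.

```python
class PostcopyGuest:
    """Toy model of a guest that starts on the new host before its memory arrives.

    source_memory maps page numbers to page contents on the old host;
    hot_pages are the frequently used pages moved along with the guest.
    Everything here is a hypothetical illustration.
    """

    def __init__(self, source_memory, hot_pages):
        self.source = source_memory
        self.local = {page: source_memory[page] for page in hot_pages}
        self.faults = 0

    def read(self, page):
        if page not in self.local:
            # "Page fault": the page has not been copied yet, so fetch it
            # from the old host on demand.
            self.faults += 1
            self.local[page] = self.source[page]
        return self.local[page]

    def background_copy(self):
        # Move over whatever has not been faulted in yet.
        for page, data in self.source.items():
            self.local.setdefault(page, data)


if __name__ == "__main__":
    old_host = {n: bytes([n]) for n in range(8)}
    guest = PostcopyGuest(old_host, hot_pages=[0, 1])
    guest.read(0)            # already present, no fault
    guest.read(5)            # missing, fetched on demand
    guest.background_copy()  # the rest follows after the fact
    print("faults:", guest.faults, "pages present:", len(guest.local))
```

Only accesses to pages that have not yet arrived pay the cost of a fetch from the old host, which is why moving the frequently used pages first matters.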
Dave Jones pointed out that when the kernel stack was increased to 16K (from 8K previously) on 64-bit x86 systems earlier in the year, Linus Torvalds had suggested that checks should be added to monitor the slow increase in kernel stack use over time. Linux uses a fixed-size kernel stack (per task, otherwise known as a “process” to users) and if this overruns then bad things will quickly happen. Thus kernel code is written carefully to avoid deep stack use, but it can happen anyway, especially in deeply nested file system functions (such as those in XFS). Linus had suggested that a special guard page be added to catch stack use beyond the first 8K when running on a debug kernel build. Dave wanted to follow up.
An interesting discussion was instigated by Christoph Lameter as to why we use 32-bit counters in Linux kernels (even on 64-bit systems) to represent the number of interrupts that have occurred since boot. As an example, Christoph cites a figure of 46 days as the point at which an interrupt count since boot will wrap, assuming the usual 1000 timer ticks per second that is the modern Linux default (although 2^32 / 1000 / 86400 works out to roughly 49.7 days). His contention that diagnostic tools wouldn’t handle this well was shrugged off with comments similar to “it’s always been that way, and this hasn’t come up before”, but there was some interest in fixing this, with test patches.
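The wrap arithmetic is easy to check. The short Python sketch below uses hypothetical helper names (days_until_wrap and increment) and assumes a steady rate of events per second; evaluated as written, 2^32 / 1000 / 86400 comes out at just under 50 days.

```python
SECONDS_PER_DAY = 86400


def days_until_wrap(counter_bits, events_per_second):
    """Days until an unsigned counter of the given width wraps to zero.

    Assumes a constant event rate; the helper name is made up for this example.
    """
    return (2 ** counter_bits) / events_per_second / SECONDS_PER_DAY


def increment(counter, counter_bits=32):
    """Add one to an unsigned counter, wrapping the way a fixed-width integer does."""
    return (counter + 1) & ((1 << counter_bits) - 1)


if __name__ == "__main__":
    print("32-bit counter at 1000/s wraps after",
          round(days_until_wrap(32, 1000), 2), "days")
    print("64-bit counter at 1000/s wraps after",
          round(days_until_wrap(64, 1000) / 365.25), "years")
    print("value after the maximum 32-bit count:", increment(2 ** 32 - 1))
```

A tool that computes interrupt rates by subtracting two samples will see a huge negative or nonsensical delta across the wrap unless it accounts for the counter width, which is the kind of diagnostic problem being raised.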