Tux Machines

New LWN Articles About Kernel (Linux), Outside Paywall

Posted by Roy Schestowitz on Jun 22, 2023

Red Hat Closing RHEL and Other Red Hat News (UPDATED)

Addressing priority inversion with proxy execution

↺ Addressing priority inversion with proxy execution

> Priority inversion comes about when a low-priority task holds a resource that is needed by a higher-priority task, with the result that the wrong task is the only one that can run. This problem is arguably most acute in realtime settings, but it can happen in just about any system that has multiple tasks running. The variety of scheduling classes provided by the Linux kernel make handling priority inversion a difficult problem; the latest version of the proxy execution patch series points toward a possible solution.

> To understand priority inversion, imagine that a low-priority, background task acquires a mutex. If a realtime task happens to need that same mutex, it will find itself blocked, waiting for the low-priority task to let go of it. Should yet another task, with medium priority, come along, it may prevent the low-priority task from executing at all, meaning that the mutex will not be released and the realtime task will be blocked indefinitely. That is exactly the sort of outcome that the priority mechanism is intended to prevent.

> A classic solution to priority inversion is priority inheritance. If a high-priority task finds itself blocked on a resource held by another, it lends its priority to the owning task, allowing that task to complete its work and release the resource. The Linux kernel has supported priority inheritance for a long time, but that is not a complete solution to the problem. Deadline scheduling complicates the situation, in that it is not priority based. Since a task running in the deadline class has no priority, it cannot lend that priority to another task. So priority inheritance will not work with tasks using deadline scheduling.

Yet another memory allocator for executable code

↺ Yet another memory allocator for executable code

> The kernel is an increasingly dynamic body of code, where new executable text can show up at any time. Currently, the task of allocating memory for new kernel code falls on the subsystem that first brought the ability to load code into a running kernel: the module loader. This patch set from Mike Rapoport looks to move the responsibility for these allocations to a new "JIT allocator", addressing a number of rough edges in the process.

> In order to support the ability to load modules at run time, the kernel had to gain the ability to allocate memory to hold those modules. Early on, that was just a matter of calling vmalloc() to obtain the requisite number of pages and enabling execute permission for the resulting pages. Over time, though, things have grown more complicated — as they so often seem to do.

Deadline servers as a realtime throttling replacement

↺ Deadline servers as a realtime throttling replacement

> The CPU scheduler's one job at any given time is to run the task that has the strongest claim to the CPU. There are many factors that complicate that job, not the least of which is that the "strongest claim" is sometimes a bit of a fuzzy concept. Realtime throttling, a mechanism designed to keep a runaway realtime task from monopolizing the CPU, is one case where developers have concluded that the task with, ostensibly, the highest priority should not actually be the one that runs. But realtime throttling has rarely pleased anybody; the deadline-server infrastructure patches posted by Daniel Bristot de Oliveira are the latest attempt to find a better solution.

> The POSIX realtime scheduling classes are conceptually simple; at any given time, the task with the highest priority runs to the exclusion of anything else. In the real world, though, the rule enables a runaway realtime task to take over the system to the point that the only way to recover it may be to pull the plug. Power failures, as it turns out, have an even higher priority than realtime tasks.

Two VFS topics

↺ Two VFS topics

> Two different topics concerning the virtual filesystem (VFS) layer were the subject of a session led by VFS co-maintainer Christian Brauner at the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit. As might be guessed, it was a filesystem-track session; Brauner had three separate items he planned on bringing up, but the discussion on the first two consumed the whole half-hour—and then some. A mechanism to avoid media-change races when mounting loop (or loopback) and other devices was disposed of fairly quickly, but the discussion around the mount-beneath feature went on at length.

Mounting images inside a user namespace

↺ Mounting images inside a user namespace

> There has long been a desire to enable users to mount filesystem images without requiring privileges, but the security implications of allowing it are seriously concerning. Few, if any, kernel filesystems are hardened against maliciously crafted images, after all. Lennart Poettering led a filesystem session at the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit where he presented a possible path forward.

> He started with an overview of the problem, noting that "everybody wants to be able to mount disk images that contain arbitrary filesystems" in user space, without needing to be root. Since malicious images could crash the kernel—or worse—the only way to do that is to establish some trust in the image before it gets mounted. He talked about some components that the systemd developers want to add that would allow container managers and other unprivileged user-space programs to accomplish this.

Hardening magic links

↺ Hardening magic links

> There are some "magic links" in kernel pseudo-filesystems, like procfs, that can be—have been—(ab)used to cause security problems, such as a container-confinement breach in 2019. Aleksa Sarai has long been working on ways to blunt the impact of these magic links. He led a filesystem session at the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit to discuss the status of those efforts.

> Sarai said that he worked on hardening for these links as part of adding the openat2() system call, but he removed some of that work before it was merged because the semantics were unclear. So, he wanted to have a discussion on those pieces to try to ensure that they make sense to everyone, that attendees are happy with them, and to avoid "having things thrown at me when I post them to the list".

Retrieving mount and filesystem information in user space

↺ Retrieving mount and filesystem information in user space

> In something of a follow-on from the mount-operation monitoring session the previous day, Christian Brauner led another discussion about providing user space with a mechanism to get current mount information on day two of the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit. The session also continued on from one at last year's summit—and likely others before that. There are two separate proposals for ways to retrieve this kind of information, one from Miklos Szeredi and another from David Howells, both of whom were present this year; Brauner's intent was to try to reach some kind of agreement on the way forward in the session.

Reports from OSPM 2023, part 1

↺ Reports from OSPM 2023, part 1

> The fifth conference on Power Management and Scheduling in the Linux Kernel (abbreviated "OSPM") was held on April 17 to 19 in Ancona, Italy. LWN was not there, unfortunately, but the attendees of the event have gotten together to write up summaries of the discussions that took place and LWN has the privilege of being able to publish them. Reports from the first day of the event appear below.

> Reports from day 2 are also available.

gemini.tuxmachines.org