CS270 -- Linux Internals


These notes are an amalgamation of excerpts from a number of sources. Linux internals are not static, so these notes should be considered a guide rather than a definitive description of what the most current version of Linux implements.

Process Virtual Memory

The virtual memory layout for a running process is a bit more complicated, particularly for 64-bit architectures, than the "standard" Unix 4-segment model. To begin with, the 6.9 kernel supports both 4-level (48-bit) and 5-level (57-bit) paged addressing. In the following examples, we will use 4-level page tables and 48-bit addressing. We will also use the x86_64 MMU paging specification.

In this mode, on the x86_64, Linux uses the low-order 48 bits of a 64-bit word for virtual addresses. The high-order 16 bits are either set to 0x0000 (if it is a user-space address) or 0xffff (if it is a kernel address). Further, the default virtual address split is 128 TB of user address space and 128 TB of kernel address space. As a result, the addresses available to a process fall into two ranges:

0x0000000000000000 - 0x00007fffffffffff -- user space
0xffff800000000000 - 0xffffffffffffffff -- kernel space
That is, the default VM address split uses 47 of the 48 bits that the 4-level page mapping MMUs can support.

Paging is hierarchical. The address format is

   
-----------------------------------------------------------------------------------------
| 16 bits | PGD (9 bits) | PUD (9 bits) | PMD (9 bits) | PT (9 bits) | offset (12 bits) |
-----------------------------------------------------------------------------------------
where each 9-bit field indexes one level of the page-table hierarchy and a page is 4K.

Thus, each process has 4 levels of page tables, each table containing 512 entries. The PGD, PUD, and PMD tables contain physical addresses of the next-level table. The PT contains Page Table Entries (PTEs), each of which includes a page frame reference.
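To make the decomposition concrete, here is a small user-space sketch (ordinary C, not kernel code) that splits a 48-bit virtual address into its four table indices and page offset. The example address is the address of the global A printed by the program later in these notes; the PGD index it prints (172) comes up again below.

#include <stdio.h>

int main(int argc, char **argv)
{
        unsigned long va = 0x5629a741e014UL; /* &A from the example below */

        unsigned long pgd = (va >> 39) & 0x1ff; /* bits 47-39: PGD index */
        unsigned long pud = (va >> 30) & 0x1ff; /* bits 38-30: PUD index */
        unsigned long pmd = (va >> 21) & 0x1ff; /* bits 29-21: PMD index */
        unsigned long pt  = (va >> 12) & 0x1ff; /* bits 20-12: PT index */
        unsigned long off = va & 0xfff;         /* bits 11-0: page offset */

        printf("PGD %lu PUD %lu PMD %lu PT %lu offset 0x%lx\n",
               pgd, pud, pmd, pt, off);
        return(0);
}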

Notice that 48 bits can address 256TB which is more than most machines can support. For reasons that we will discuss presently, large memory machines (say with 32TB of memory) use a 5-level (57-bit) paging scheme but such machines are not typical. Thus, Linux is fairly profligate with respect to address space usage. Because there are so many more addresses in an address space than can be mapped to physical memory, Linux uses addresses to designate specific regions of the process address space for different purposes.

Consider the following C code

#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

int A;

int main(int argc, char **argv)
{
        FILE *fd;
        pid_t pid;
        char filename[128];
        char line[1024];

        pid = getpid();

        snprintf(filename,sizeof(filename),"/proc/%d/maps",(int)pid);

        fd = fopen(filename,"r");
        if(fd == NULL) {
                printf("could not open %s\n",filename);
                exit(1);
        }
        memset(line,0,sizeof(line));
        while(fgets(line,sizeof(line),fd) != NULL) {
                printf("%s",line);
                memset(line,0,sizeof(line));
        }
        fclose(fd);


        snprintf(filename,sizeof(filename),"/usr/bin/pmap %d",(int)pid);
        system(filename);

        printf("&A %p\n",&A);
        return(0);
}

When I compile and run this code on an Ubuntu 20.04 system, I get
5629a741a000-5629a741b000 r--p 00000000 fc:01 259160                     /home/ubuntu/src/test/a.out
5629a741b000-5629a741c000 r-xp 00001000 fc:01 259160                     /home/ubuntu/src/test/a.out
5629a741c000-5629a741d000 r--p 00002000 fc:01 259160                     /home/ubuntu/src/test/a.out
5629a741d000-5629a741e000 r--p 00002000 fc:01 259160                     /home/ubuntu/src/test/a.out
5629a741e000-5629a741f000 rw-p 00003000 fc:01 259160                     /home/ubuntu/src/test/a.out
5629ceddf000-5629cee00000 rw-p 00000000 00:00 0                          [heap]
7f84f6beb000-7f84f6c0d000 r--p 00000000 fc:01 6252                       /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f84f6c0d000-7f84f6d85000 r-xp 00022000 fc:01 6252                       /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f84f6d85000-7f84f6dd3000 r--p 0019a000 fc:01 6252                       /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f84f6dd3000-7f84f6dd7000 r--p 001e7000 fc:01 6252                       /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f84f6dd7000-7f84f6dd9000 rw-p 001eb000 fc:01 6252                       /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f84f6dd9000-7f84f6ddf000 rw-p 00000000 00:00 0 
7f84f6de6000-7f84f6de7000 r--p 00000000 fc:01 6239                       /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f84f6de7000-7f84f6e0a000 r-xp 00001000 fc:01 6239                       /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f84f6e0a000-7f84f6e12000 r--p 00024000 fc:01 6239                       /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f84f6e13000-7f84f6e14000 r--p 0002c000 fc:01 6239                       /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f84f6e14000-7f84f6e15000 rw-p 0002d000 fc:01 6239                       /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f84f6e15000-7f84f6e16000 rw-p 00000000 00:00 0 
7ffe2eca0000-7ffe2ecc1000 rw-p 00000000 00:00 0                          [stack]
7ffe2ed40000-7ffe2ed43000 r--p 00000000 00:00 0                          [vvar]
7ffe2ed43000-7ffe2ed44000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0                  [vsyscall]
1018983:   ./a.out
00005629a741a000      4K r---- a.out
00005629a741b000      4K r-x-- a.out
00005629a741c000      4K r---- a.out
00005629a741d000      4K r---- a.out
00005629a741e000      4K rw--- a.out
00005629ceddf000    132K rw---   [ anon ]
00007f84f6beb000    136K r---- libc-2.31.so
00007f84f6c0d000   1504K r-x-- libc-2.31.so
00007f84f6d85000    312K r---- libc-2.31.so
00007f84f6dd3000     16K r---- libc-2.31.so
00007f84f6dd7000      8K rw--- libc-2.31.so
00007f84f6dd9000     24K rw---   [ anon ]
00007f84f6de6000      4K r---- ld-2.31.so
00007f84f6de7000    140K r-x-- ld-2.31.so
00007f84f6e0a000     32K r---- ld-2.31.so
00007f84f6e13000      4K r---- ld-2.31.so
00007f84f6e14000      4K rw--- ld-2.31.so
00007f84f6e15000      4K rw---   [ anon ]
00007ffe2eca0000    132K rw---   [ stack ]
00007ffe2ed40000     12K r----   [ anon ]
00007ffe2ed43000      4K r-x--   [ anon ]
ffffffffff600000      4K --x--   [ anon ]
 total             2492K
&A 0x5629a741e014
The format is as follows
start_va-end_va perms offset-in-file device inode# path-to-file
So this entry
5629a741a000-5629a741b000 r--p 00000000 fc:01 259160                     /home/ubuntu/src/test/a.out
says that virtual addresses 0x5629a741a000 to 0x5629a741b000 reside in the first 4K of the file, that their pages are read-only and private, that the file is on fc:01 (major:minor dev #), the inode is 259160 and the file name is /home/ubuntu/src/test/a.out.

Take a look at the first 6 entries. The first five are each a single 4K page of a.out and the sixth is the heap. In order, they are: the ELF headers (r--p), the program text (r-xp), read-only data (r--p), a second read-only data region (r--p), the read-write data (rw-p), and then the heap.

The program also prints the address of a global variable A.
&A 0x5629a741e014
which is located at 0x14 (20) bytes from the beginning of the read-write data segment (the rw-p mapping at 0x5629a741e000) that holds the globals.

Why two read-only data sections? One is for read-only program variables and the other is (likely) for tables that get filled in as a result of dynamic linking and are then write-protected.

The rest of the process virtual address layout includes segments for libc (which has been dynamically linked) and ld (also dynamically linked), the stack, and some pages that the OS uses (including vsyscall which is in kernel space). Notice that the kernel simply creates new segments for the dynamically loaded libraries, each with its own separate text, data, and bss/heap. In short, dynamically linked libraries are just other executables loaded into the address space. For static linking, the linker will coalesce the libraries into a single text and data segment but dynamically linked libraries are really separate executables in the fullest sense of the word.

It is also important to understand that these are regions of virtual address space that are not (necessarily) mapped to physical memory. Also look at the virtual addresses. What does the PGD look like?

First, there are no virtual addresses mapped between 0x0000000000000000 and 0x00005629a741a000, which has high-order 9 bits 010101100 (172). This number is the index into the PGD (which contains 512 entries), so the entries at index 0 through 171 are invalid.

Then, there are 38 pages mapped but the high-order 9 bits for all of the pages are the same so the same PGD entry (entry 172) refers to the same PUD for all of them.

Then, the next mapped address is 0x00007f84f6beb000, which has high-order 9 bits 011111111 (255), so the PGD is empty from index 173 through 254. The remaining user-space mappings (the libraries, the stack, vvar, and vdso) all share those same high-order 9 bits, so they are all reached through PGD entry 255, and the rest of the user-space table is empty.

So the PGD is an array of 512 entries (each 64 bits in size) that contains two valid user-space entries (172 and 255), each of which contains the physical address of a PUD.

The page tables are created in the kernel when the process is created and they are updated as the process runs and as the kernel reclaims memory. However, page tables are architecture specific. The ARM processors also use multi-level paging, but the table formats and the code to load or switch between page tables are different.

Linux uses common "high-level" data structures to represent the process virtual address space, which it then uses to initialize and maintain the architecture-specific page tables. The main data structure is the mm_struct, which contains information on the process's global address layout and a pointer to the PGD. Each contiguous region of virtual address space is described by a vm_area_struct that carries the range of virtual addresses, the access permissions (to be implemented in the page tables), and information (if any) on where to find the backing store for the area.
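Here is a much-simplified sketch of the two structures. The field names follow the kernel's, but the real definitions (in include/linux/mm_types.h) contain many more fields, and recent kernels organize the areas in a maple tree rather than the simple linked list shown here.

struct file;                      /* opaque here */

struct vm_area_struct {           /* simplified sketch */
        unsigned long vm_start;   /* first virtual address in the area */
        unsigned long vm_end;     /* first address beyond the area */
        unsigned long vm_flags;   /* VM_READ, VM_WRITE, VM_EXEC, ... */
        struct file *vm_file;     /* backing file; NULL if anonymous */
        struct vm_area_struct *vm_next; /* next area (older kernels) */
};

struct mm_struct {                /* simplified sketch */
        struct vm_area_struct *mmap;  /* the list of areas (older kernels) */
        void *pgd;                /* the architecture-specific PGD */
        unsigned long start_brk, brk; /* heap boundaries */
        unsigned long start_stack;    /* bottom of the stack area */
};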

Thus, Linux uses a two-level strategy for implementing process virtual memory. The low level is architecture specific and depends on what format the MMU requires to implement demand paging. The high level uses a mm_struct and vm_area_structs which link to the low level data structures.

Kernel Virtual Memory

Linux inherits the relationship between a process' virtual memory layout and the kernel's memory from Unix. Specifically, as noted previously, all memory locations in which the high-order 16 bits are 1s (on the x86_64 in the 4-level paging model) are represented within the process address space, but can only be accessed in privileged mode. The PTEs on the x86 contain a User/Supervisor bit. When the bit is 0, an attempt to access the page from ring 3 will page fault with a protection fault.

Prior to the Meltdown side-channel attack, the kernel addresses were mapped into all page tables by duplicating the kernel PUD, PMD, and PT physical addresses into each process' page table. As a result, during a system call, the page table base register (CR3 on the x86) need not change. After meltdown, the page table base register is switched to point to kernel page tables at the initiation of a system call and back to user-space page tables when the system call completes.

The kernel creates a direct map of physical memory within its half of the address space. That is, the kernel has a way to use a virtual address (in this direct-mapped region) to access a specific physical address. It uses this mechanism to create and store page tables in physical memory and to load their entries with physical addresses.

The kernel also creates a page frame structure (a struct page) for every physical frame in the memory configured into the system. The page frame number indexes this table. The PTE in the PT contains a frame reference to this structure. The page frame structure contains either a pointer to an anonymous-memory descriptor (for pages that have no file backing store) or to an address_space (for file-backed pages); from these, the kernel can find the vm_area_struct data structures, and thus the page tables, that refer to the frame. It walks these data structures to invalidate PTEs when a frame is reclaimed.
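A heavily simplified sketch of struct page and this reverse-mapping idea; the real structure (also in include/linux/mm_types.h) overloads most of these fields in unions.

struct address_space;             /* opaque here: a file's page cache */

struct page {                     /* simplified sketch */
        unsigned long flags;          /* state bits for this frame */
        struct address_space *mapping; /* file pages: the owning file;
                                          anonymous pages: the anon_vma
                                          descriptor (tagged in the low bit) */
        unsigned long index;          /* offset of the page in the mapping */
        int _refcount;                /* how many users hold this frame */
};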

Linux Processes

The Linux process model is somewhat complex because, with the 2.6 kernel, it fully embraced preemption (even in the kernel). At the same time, it remains backward compatible with the Unix process model, which supports multitasking through user processes exclusively.

For a process that is not multi-threaded (i.e. a "traditional" Unix-style process), the virtual address space is as we described above, with a single stack at the highest possible user-space address that grows "down" towards the heap.

The kernel allocates a kernel-space record (a task_struct) for the process that is essentially identified by a process identifier (a small integer, historically 16 bits) that is unique among all running processes. The kernel also allocates a 16KB (4-page) kernel stack for the process in kernel space. For modern x86_64 Linux, the task_struct has a pointer to the kernel stack for the process. The kernel stack has a back pointer to the task_struct, stored in the first (lowest-address) word of the kernel stack. As a result, the kernel can find the current process' task_struct by masking the stack pointer with the stack size to get the word at the lowest address of the stack.
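A sketch of the masking trick, written as ordinary user-space C for illustration. The layout shown, with a small thread_info structure holding the task pointer at the base of the stack, is the historical x86 scheme; current kernels instead keep a per-CPU pointer to the current task.

#define THREAD_SIZE (4UL * 4096)  /* the 16KB kernel stack */

struct task_struct;               /* opaque here */

struct thread_info {              /* sketch: lives at the stack base */
        struct task_struct *task; /* back pointer to the task_struct */
};

static struct task_struct *find_current(unsigned long sp)
{
        /* rounding the stack pointer down to a THREAD_SIZE boundary
           yields the lowest address of the kernel stack */
        struct thread_info *ti =
                (struct thread_info *)(sp & ~(THREAD_SIZE - 1UL));
        return ti->task;
}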

When a process makes a system call, the kernel switches the stack pointer from the user-space stack to the kernel stack so that the kernel program state is protected from manipulation by code that runs in user mode. Additionally, device interrupts can either specify that they are to be handled on the kernel stack or a separate interrupt stack (which is one 4K page).

If you think about it a bit, this model makes the kernel a kind of library that is loaded with the user space program. The only difference is that the user-space program cannot use the compiler's function call assembly language implementation to invoke a kernel routine. Instead, it must execute a trap instruction (syscall on the x86_64) that raises the privilege level and enters the kernel at a fixed entry point.

On the x86_64, the C runtime passes the system call number in a register and up to six arguments to the system call in registers. Each system call must perform sanity checks on the arguments it takes.
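As an illustration, here is a hand-rolled write(2) call on the x86_64. The call number goes in rax and the arguments in rdi, rsi, and rdx (then r10, r8, and r9 for calls with more arguments); the syscall instruction itself clobbers rcx and r11.

#include <string.h>

int main(int argc, char **argv)
{
        long ret;
        char msg[] = "hello from a raw syscall\n";

        __asm__ volatile("syscall"
                : "=a"(ret)              /* rax: return value */
                : "a"(1),                /* rax: __NR_write on x86_64 */
                  "D"(1),                /* rdi: fd 1 (stdout) */
                  "S"(msg),              /* rsi: buffer */
                  "d"(strlen(msg))       /* rdx: count */
                : "rcx", "r11", "memory");
        return(0);
}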

Notice, however, that the page tables for the process are not changed. That is, it is possible to think of the kernel stack as simply "overlaying" or "extending" the user stack (by 16KB from where the stack pointer was at the point of the system call) and that the process's address space consists of the user-space addresses and the kernel addresses. The switch back to unprivileged mode "turns off" access to some of these addresses, but they are logically still part of the process.

Threads

The Linux process model (which is really the Unix process model) is simple and elegant with respect to multitasking. Essentially, each process "owns" the machine when it is running (whether in user mode or kernel mode) because both user space and kernel space share an address space. To switch the entire machine to another process, the kernel need only change address spaces. Only a small part of the kernel (associated with device interrupts) does not "belong" to a process address space. When processes were the only model of concurrency, this implementation strategy proved extremely successful -- perhaps the most successful amongst all operating systems that have ever existed.

Threads, however, complicate this model. Logically, threads share an address space. Here are some questions that arise immediately. If one thread blocks in a system call, should the other threads in the process block too? Which thread should receive a signal sent to the process? What happens to the other threads when one thread calls fork() or exit()? How should the CPU time consumed by individual threads be charged against the process?

There are many other ambiguities as well (e.g. thread-specific permissions, network packet delivery, etc.) all of which stem from a fundamental change in how Linux implements abstractions for managing concurrency compared to the original Unix process model.

There are essentially two ways to introduce threads as a concurrency abstraction when processes are the baseline abstraction.

The first is to make threads an abstraction that exists strictly within a process. In this model, the kernel treats the process as a schedulable unit of resource allocation and, once it is scheduled, each process contains an internal thread scheduler that is active during the process's time slice. This "two-level" approach has several advantages, most of which accrue to specialization. That is, each process essentially implements its own (possibly customized) thread scheduler internally.

From the perspective of the Linux kernel implementation, however, this approach requires a substantial refactoring of the kernel. Under this model, if multiple threads within a process make system calls and, while the system calls are in progress, the scheduler chooses another process to run, the kernel would need to suspend and then keep track of all of the pending system calls so they could be continued when the process is rescheduled in the future. The Linux system call architecture, inherited from Unix, is one in which each system call is independent and, as such, can be managed individually. As a result, the kernel scheduler and resource allocators need not consider "grouped" system calls and any possible interactions between threads within a group and outside of the group.

If this sounds a little hand wavy, then consider the following scenario. Imagine that a process has three threads and all three of them make a system call. Further, Linux treats all three threads as part of a single schedulable process. During the three simultaneous system calls, the scheduler deschedules the process. Later, the process's scheduling priority is such that it should run next, but only two of the three threads are ready to execute. What should the kernel do? You could run the process and allow the two threads to continue while the third waits for its system call to complete. The problem here is that if the process is descheduled again and then the third thread becomes runnable, you either have to schedule the process again (early) or wait for the process to get a time slice so that the third thread can continue. It is possible to come up with a rational approach to this two-level scheme (and some high-performance Linux implementations include such implementations) but the book-keeping required in the kernel is fairly substantial.

The alternative (originally due to Silicon Graphics) which Linux implements is to treat threads as independent processes in the kernel. That is, under Linux, when either a "classic" process or a thread within a process makes a system call, the code in the kernel does not differentiate between the two (with a few small exceptions).

This structural decision has several ramifications.

The first, and most obvious, is for CPU scheduling. Linux accounts for thread CPU occupancy independently and, thus, threads contend for the CPU with other threads regardless of what processes they inhabit (at least, under the SCHED_NORMAL policy). Secondly, each thread has a complete set of kernel data structures that allows the kernel to schedule it and deschedule it independently.

For example, each thread has a task struct that represents the anchor point for the thread's state. As such, each thread has a unique PID (process ID) just as each process does. Each thread also has its own kernel stack which is allocated when the thread is created and linked to the task struct for the thread.

Threads within a process must share the virtual address space (the mm_struct), the table of open file descriptors, the signal handlers, the filesystem context (root directory and current working directory), and the credentials.

These data structures are all referenced via pointers in the task struct and reference counted.
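The sharing is expressed directly in the flags passed to the clone(2) system call, which is what thread libraries use to create threads. The sketch below is a rough approximation of what pthread_create does; the real library passes additional flags and manages the stack and thread-local storage carefully.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static int thread_fn(void *arg)
{
        /* CLONE_THREAD puts us in the caller's thread group, so
           getpid() reports the same (shared) process id */
        printf("thread sees pid %d\n", (int)getpid());
        return 0;
}

int main(int argc, char **argv)
{
        size_t stack_size = 64 * 1024;
        char *stack = malloc(stack_size);

        /* share the address space, open files, filesystem context,
           and signal handlers with the child */
        int flags = CLONE_VM | CLONE_FILES | CLONE_FS |
                    CLONE_SIGHAND | CLONE_THREAD;

        /* clone takes the *top* of the child's stack on the x86_64 */
        if (clone(thread_fn, stack + stack_size, flags, NULL) < 0) {
                perror("clone");
                exit(1);
        }
        sleep(1);   /* crude synchronization, for the sketch only */
        return(0);
}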

The advantage of this approach is that it allows the kernel to have only one way to synchronize data transfer from user space to kernel space and vice versa. Specifically, the kernel has a mechanism to allow a process/thread to make a system call (transitioning from user space to kernel space) and to block until such time that the system call can be completed. Usually the blocking happens because the system call requires some form of device I/O and the device is far slower than the CPU. The mechanism (which uses wait channels or "queues" within the kernel) depends upon there being a kernel stack and a task struct. For historical reasons, the term for this state is process context.

Thus, the Linux kernel is only able to block a thread of control if it has a process context. For reasons that are not at all clear, there is a veritable zoo of asynchronous control abstractions in the kernel (softirqs, tasklets, and workqueues) for prioritizing and deferring work. Softirqs and tasklets cannot block (workqueue items, which run in the process context of kernel worker threads, are the exception). Further, various kernel facilities (like the kernel memory allocator) can block, so there are restrictions on what kernel routines these more exotic abstractions can access. Anything with a process context can call anything else in the kernel, however.

Kernel Threads

One consequence of the decision to make all threads appear as processes in the kernel is that it is possible to have kernel threads -- threads that are not part of a process but, instead, execute exclusively in kernel mode. Linux kernel threads are represented like all other threads, with a task struct that anchors their state. The difference is that Linux does not allocate a user address space for them. Instead, they get the same set of kernel page tables but the user address space structure is NULL.
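From inside the kernel (in a module, say), creating such a thread looks roughly like the following sketch using the kthread API; my_worker and the thread name are invented for illustration.

#include <linux/kthread.h>
#include <linux/delay.h>
#include <linux/err.h>

static struct task_struct *worker;

static int my_worker(void *data)
{
        /* this thread has a task_struct and a kernel stack, so it
           has a process context and is allowed to block */
        while (!kthread_should_stop()) {
                /* do deferred work here */
                msleep(1000);
        }
        return 0;
}

static int start_worker(void)
{
        worker = kthread_run(my_worker, NULL, "my_worker");
        if (IS_ERR(worker))
                return PTR_ERR(worker);
        return 0;
}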

Because each kernel thread has a kernel stack, however, it can block. Thus a kernel thread is like a regular process that contains a single thread which makes a system call that never returns to user space. Under Unix, this concept was called a kernel process. The idea was that a regular process (running as root) would make a special system call that would never return to user space. The user space address allocation and management were "wasted" in this case, but it allowed the process to do work on the kernel's behalf and then sleep when there was no more kernel work to do (instead of returning to user space).

Why does the kernel need kernel threads?

This turns out to be an important question in the evolution from Unix to Linux. Really, it is possible to think of the only architectural difference between Linux and Unix as being the reliance of Linux on kernel threads. Unix, originally, did everything in process context (including interrupt handling, which it did on whatever the current kernel stack was when the interrupt occurred). Thus the kernel was really a dynamically loaded library with respect to a process: it got loaded when the process was scheduled, and the process was responsible for the machine until it was descheduled.

This method was elegant and simple, but as kernel functionality grew, it became complex to manage. In particular, the kernel is responsible for functionality that must be able to block on I/O. When the kernel writes dirty pages from the page cache to disk, for example, the disk I/O is synchronous, which means that a process context must block until the interrupt indicating the disk write has completed unblocks it. Unix solved this problem by using the current process context to start the write and then returning to whatever that process context was doing while the I/O was in progress. Then, when the I/O completed, it would use whatever process context was current to handle the interrupt.

Today, the kernel makes heavy use of kernel threads. There are three primary use cases

The first usage is the oldest. Linux uses the memory (the RAM) of the machine as a cache of pages that, logically, are stored on some kind of persistent storage device. Because these storage devices are much slower than the CPU, the CPU context switches away while the I/O is in progress and, to do so, Linux uses a kernel thread and its process context to block. For example, the pdflush functionality, which writes back dirty pages from the page cache, is implemented using kernel threads.

The second usage is for device drivers. Originally, writing a device driver for Unix was quite difficult because the driver interrupt handler code had to do its work and exit very quickly. The reason is that when an interrupt is handled by the CPU, all other interrupts of the same or lower priority are disabled. Some devices have watchdog timers that trigger an error if their interrupts are not serviced within a certain time window so one had to be careful to get out of interrupt context as soon as possible.

Interrupt handlers, today, do need to run to completion quickly, but as devices become more and more complex, the work needed to handle an interrupt becomes more and more extensive. This work is deferred so that it can run at a lower CPU priority (relative to the interrupt handler priority) using a kernel thread. Thus, an interrupt handler that can't do all of its work in interrupt context (because it would take too long) can hand the work off to a kernel thread and then exit interrupt context. The kernel will run the kernel thread at the next moment when the CPU is not busy fielding another interrupt, and that kernel thread will be preempted by interrupts that come in while it is executing.

The third usage is for functionality that Linux implements as part of a networked collection of systems. For example, it is possible for a Linux system (using iptables and/or ebtables) to act as a network switch or gateway. Network packets pass through the kernel without ever being referenced or accessed by a user-space process running on that system. Another example is the Network File System (NFS): on the server side, requests need not transition to user space because the NFS service can be implemented entirely within the kernel using kernel threads.

Kernel Multithreading and Preemption

It is tempting to equate the adoption of kernel threads with kernel multi-threading but, in fact, they are separate issues. Originally, the kernel (Unix and Linux) was not preemptible. That is, once a process entered the kernel or a kernel thread began to run, no other blocking context would run until the current context blocked or left kernel space. Interrupts might preempt the current context executing in the kernel, but they had to be careful not to "mess up" the kernel state for that context, and there was a way for a context to (temporarily) "shut off" interrupts to avoid race conditions between process context and interrupt context. The kernel would not, however, deschedule a context (process or kernel thread) while it was in kernel mode.

The modern Linux kernel is multithreaded and preemptible. As a result, process contexts executing in the kernel (on some core of the CPU) can be scheduled and descheduled by the Linux scheduler in the same way that they can be context switched when they are in user space. Kernel threads are included, although some run in higher-priority scheduling classes than "regular" threads and processes.

In addition, the kernel is parallelized. That is, multiple threads (each executing on a separate core) can be executing in the kernel at the same time. To avoid race conditions there is, again, a zoo of synchronization primitives. Roughly speaking, they can be categorized into those that can block the current context and those that do not.

When executing in a process or thread context (i.e. where the code can sleep), the kernel synchronization primitive is a counting semaphore. The kernel also includes an API which implements a mutex lock, conceptually a semaphore initialized to 1 (and which disallows unlock calls on locks that are not locked). The key property here, however, is that a call to one of these primitives can result in the current context being descheduled.

When executing in interrupt context, however, semaphores cannot be used. Instead, the kernel includes various atomic data operations (as in C++) and an explicit spin lock primitive. When a process context needs to synchronize with an interrupt context or interrupt contexts must synchronize with each other, they must use one of these non-blocking mechanisms.
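A sketch of the two styles using the kernel's locking APIs; the data being protected here is invented for illustration.

#include <linux/mutex.h>
#include <linux/spinlock.h>

static DEFINE_MUTEX(table_mutex);   /* blocking: process context only */
static DEFINE_SPINLOCK(stats_lock); /* non-blocking: any context */

static int table_entry;             /* protected by table_mutex */
static int stats_count;             /* shared with an interrupt handler */

static void update_table(int v)
{
        mutex_lock(&table_mutex);   /* may put the caller to sleep */
        table_entry = v;
        mutex_unlock(&table_mutex);
}

static void update_stats(int v)
{
        unsigned long flags;

        /* disable local interrupts and spin; safe against an
           interrupt handler that takes the same lock */
        spin_lock_irqsave(&stats_lock, flags);
        stats_count += v;
        spin_unlock_irqrestore(&stats_lock, flags);
}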

Note that it is up to the kernel developer to identify and code the usage of these primitives correctly or the kernel will crash. There is no compiler support or automated runtime system that can implement synchronization correctly.

The Linux Scheduler

One way to think of the kernel, in terms of concurrency, is that there are two types of computations: those that execute in process context and those that execute in interrupt context. Interrupt context is defined by a set of priorities that are "baked" into the kernel. That is, when one configures a device, one specifies an irq (interrupt request line). Thus the interrupt priority (which devices can interrupt which other devices) is set at configuration and/or device installation time. Interrupt contexts must run to completion (over a short period of time) once they start.

Process contexts can be scheduled dynamically, and when a process is not scheduled its kernel state is used to record it as sleeping.

Notice that interrupts can interrupt other interrupts (the term of art is that interrupts can "stack") but there is no notion of time division multiplexing (i.e. "time slicing") for interrupts. Linux does implement time slicing for process contexts, but in a very specific way.

First, though, the Linux kernel implements 5 scheduling priority levels. All tasks at level n that are runnable will be executed before any task at level n+1 is chosen. The 5 scheduling priorities, from highest to lowest, are

SCHED_STOP -- the kernel-internal stop class
SCHED_DEADLINE -- deadline scheduling
SCHED_FIFO / SCHED_RR -- the real-time policies
SCHED_OTHER (SCHED_NORMAL) -- the default
SCHED_IDLE -- low-priority background work

The first (highest) priority (SCHED_STOP) is reserved for special kernel functions (like debug tracing and power down). Even though a thread can block, this priority gives the thread pretty much exclusive access to the entire machine (e.g. interrupts are shut off, no other threads will be run on any other cores, etc.) as if it were the highest-priority interrupt.

The next two levels are available as configuration options in the kernel and they require permission levels to be set for processes to request them. For example, SCHED_DEADLINE is typically only accessible to processes running as root.

SCHED_OTHER (also called SCHED_NORMAL) is the default option and SCHED_IDLE (which is really for low-priority background threads) is available.
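From user space, a process requests a policy with the sched_setscheduler(2) system call. This sketch asks for SCHED_FIFO and will fail with EPERM unless the caller is privileged (root or CAP_SYS_NICE).

#include <sched.h>
#include <stdio.h>

int main(int argc, char **argv)
{
        struct sched_param param = { .sched_priority = 10 };

        /* pid 0 means the calling process */
        if (sched_setscheduler(0, SCHED_FIFO, &param) < 0) {
                perror("sched_setscheduler");
                return(1);
        }
        printf("now running under SCHED_FIFO\n");
        return(0);
}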

Each scheduling class defines its own rules for how to pick the next task to run. The kernel has various scheduler entry points defined where it will consult the scheduling classes (in order) and ask if they wish to run a process.

The Completely Fair Scheduler

The SCHED_OTHER class uses a scheduling algorithm called the Completely Fair Scheduler (CFS) which implements time division multiplexing of the CPU cores, but in a way that sets the time slice duration dynamically. The typical CPU scheduler chooses a fixed maximum time slice duration for each process. A process that is CPU bound will use the full time slice, but a process that executes and then "sleeps" (because it is performing I/O, for example) gives up the remainder of its slice.

The complexities arise out of the need to manage the CPU efficiently (i.e. maintain good utilization) while providing good performance to long-running CPU-bound tasks (low overhead) and fast response time (low latency). These requirements are in tension.

For example, for many years, the Linux scheduler implemented short-term priority inflation after a process blocked. That is, when a process slept to wait for an I/O to complete, it would get a very high priority for a very short time when it was re-enabled. Doing so enabled processes running text editors, for example, to echo the typed character (in full-duplex mode) quickly (low latency) before going immediately back to sleep. CPU-bound processes would then be interrupted whenever a character needed to be echoed, which hopefully would not introduce too much overhead (cache pollution, memory pressure, TLB pressure, etc.). This methodology works well, but it can lead to some pathological situations, especially with respect to network traffic.

After many years of fixing up the existing priority scheduler, Linux adopted CFS which uses a different approach.

The abstract idea behind the CFS is that the system maintains a target latency, which represents the time period over which all runnable processes share the CPU fairly. It then uses the number of runnable processes to time-division multiplex the CPU so that the target latency period is equally shared among all runnable processes.

For example, imagine that the target latency is 20ms and there are two runnable processes. In the "perfect" time slicing case, each process is entitled to 10ms of execution time during each 20ms period. Linux will set the time slice to be 10ms when there are two runnable processes and the configured target latency is 20ms. If there are 4 runnable processes, then the time slice is set to 5ms, and so on, down to some minimum latency (which is also configurable).
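This slice computation is simple enough to sketch. The 20ms target latency is the value from the example; the 3ms minimum granularity is an assumed (configurable) value.

#include <stdio.h>

int main(int argc, char **argv)
{
        double target_latency_ms = 20.0;  /* from the example above */
        double min_granularity_ms = 3.0;  /* assumed configurable floor */
        int n;

        for (n = 1; n <= 16; n *= 2) {
                double slice = target_latency_ms / n;
                if (slice < min_granularity_ms)
                        slice = min_granularity_ms;  /* floor the slice */
                printf("%2d runnable -> %.2f ms slice\n", n, slice);
        }
        return(0);
}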

Thus the first part of the CFS is that it adjusts the time slice duration as a function of the number of runnable processes in the SCHED_OTHER run queue. When more processes join the queue, the time slice duration goes down. When processes block or die, the time slice duration goes up.

The second part of the scheduler is how it chooses the next task to run at a scheduler entry point. Using a fast timer, Linux records the time a process actually uses when it is assigned the CPU; this accumulated, weighted total is the process's virtual runtime. A process is entitled to complete its full time slice (in which case the virtual runtime increases by a full time slice), but if the process performs I/O then the virtual runtime increases by less.

The third part is that the scheduler uses a weighting system (based on nice value) to change the relative virtual runtime values. Each nice value corresponds to a weight (given in a kernel table). The "baseline" or neutral weight is divided by the weight mapped by the nice value of the process to compute a virtual expansion or contraction factor that is then used to increase or decrease the rate of virtual runtime accumulation.

For example, imagine that the weight for a nice 0 process is set to 1024, the weight for a nice 5 process is set to 335, and the weight for a nice -5 process is set to 3121 (thanks chatGPT). The relative fractions that the scheduler wants to assign to the processes are

nice -5 : (3121 / (3121 + 1024 + 335)) == 69%
nice 0 : (1024 / (3121 + 1024 + 335)) == 22%
nice 5 : (335 / (3121 + 1024 + 335)) == 7%
When each process runs, its virtual runtime is incremented by its actual runtime scaled by the ratio of the neutral weight to its own weight. That is,
nice -5: vruntime += (1024 / 3121) * actual_runtime
nice 0 : vruntime += (1024 / 1024) * actual_runtime
nice 5 : vruntime += (1024 / 335) * actual_runtime
Thus, for each 1 ms that a nice -5 process runs, it accumulates 0.32 ms of virtual runtime. For each 1 ms a nice 0 process runs, it accumulates 1 ms of virtual runtime. Finally, for each 1 ms of actual runtime, a nice 5 process accumulates 3.05 ms of virtual runtime.
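The accumulation rule is easy to express in code. The weights are the ones from the example above; the function is an illustration of the rule, not kernel source.

#include <stdio.h>

#define NICE_0_WEIGHT 1024.0  /* the neutral (nice 0) weight */

/* virtual runtime charged for actual_ms of CPU time at weight w */
static double vruntime_delta(double actual_ms, double w)
{
        return actual_ms * (NICE_0_WEIGHT / w);
}

int main(int argc, char **argv)
{
        printf("nice -5: %.2f ms charged per 1 ms run\n",
               vruntime_delta(1.0, 3121.0));
        printf("nice  0: %.2f ms charged per 1 ms run\n",
               vruntime_delta(1.0, 1024.0));
        printf("nice  5: %.2f ms charged per 1 ms run\n",
               vruntime_delta(1.0, 335.0));
        return(0);
}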

The fourth part is that the CFS chooses the process with the smallest virtual runtime at the moment that a scheduling decision must be made. Runnable processes are sorted in a red-black tree by virtual runtime, from smallest to largest. The CFS uses a tick (a minimum clock interrupt duration set at configuration time) to update the virtual runtime of the currently running process. Then, when the scheduler decides it needs to choose another process, the tree is re-sorted and the process with the new lowest virtual runtime is chosen.

There are two conditions under which it might choose a new task. Either the current task has used up its fraction of the target latency time slice or the virtual runtime of a process other than the running process has become smaller than that of the running process.

Let's work through an example. On an Ubuntu 20.04 VM I have running, the target latency is 12ms

cat /proc/sys/kernel/sched_latency_ns
12000000
and the tick interval is 4ms
grep CONFIG_HZ= /boot/config-$(uname -r)
CONFIG_HZ=250
Let's say all three processes start with vruntime = 0.

nice -5 runs for 4 ms and its vruntime increments by 4ms * 0.32

nice -5: vruntime = 1.2ms
nice 0: vruntime = 0
nice 5: vruntime = 0
CFS decides to switch processes and let's say it chooses nice 0. After the next 4ms tick, the accounting looks like
nice -5: vruntime = 1.2ms
nice 0: vruntime = 4ms
nice 5: vruntime = 0
CFS then decides to run nice 5 for 4ms and its vruntime increments by 4*3.05 ms.
nice -5: vruntime = 1.2ms
nice 0: vruntime = 4ms
nice 5: vruntime = 12.2ms
CFS runs again, chooses nice -5, and adds 1.2ms to its vruntime.
nice -5: vruntime = 2.4ms
nice 0: vruntime = 4ms
nice 5: vruntime = 12.2ms
At the next tick, CFS chooses nice -5 again (does not context switch), but adds another 1.2ms.
nice -5: vruntime = 3.6ms
nice 0: vruntime = 4ms
nice 5: vruntime = 12.2ms
nice -5 runs at the next tick and the accounting looks like
nice -5: vruntime = 4.8ms
nice 0: vruntime = 4ms
nice 5: vruntime = 12.2ms
CFS now chooses nice 0 and runs it for 4ms.
nice -5: vruntime = 4.8ms
nice 0: vruntime = 8ms
nice 5: vruntime = 12.2ms
And so on. It will add 1.2ms to nice -5 each time it runs for 4ms, it will add 4ms to nice 0 every time it runs for 4ms, and it will add 12.2ms to nice 5 every time it runs for 4ms.

Note that it does not change the amount it increments vruntime by when the number of runnable processes changes. However, because it is choosing between a different number of processes, the frequency with which a process is chosen changes when the number of processes changes.

For example, let's say that at this moment, nice -5 dies. The accounting looks like

nice 0: vruntime = 8ms
nice 5: vruntime = 12.2ms
CFS chooses nice 0 and the accounting goes to
nice 0: vruntime = 12ms
nice 5: vruntime = 12.2ms
and again to
nice 0: vruntime = 16ms
nice 5: vruntime = 12.2ms
at which point CFS will choose nice 5. Notice that nice 5 now runs more often. With nice -5 running, nice 5 was entitled to only 335/4480 (about 7.5%) of the CPU, so with 4ms slices it ran roughly once every 53ms. With only nice 0 and nice 5 runnable, its share rises to 335/1359 (about 25%), so it runs roughly once every 16ms.

Note that this accounting assumes that all three processes are CPU bound. If they perform I/O, they will block. The kernel uses a high-resolution clock to update vruntime. It uses the tick for the "time slice," but if a process executes for a fraction of a tick, its vruntime (stored in nanoseconds) will be updated correctly.