In this mode, on the x86_64, Linux uses the low-order 48 bits of a 64-bit word for virtual addresses. The high-order 16 bits are either set to 0x0000 (if it is a user-space address) or 0xffff (if the virtual address is a kernel address). Further, the default virtual address split is 128 TB of user address space and 128 TB of kernel address space. As a result, the address ranges are
0x0000000000000000 - 0x00007fffffffffff -- user space
0xffff800000000000 - 0xffffffffffffffff -- kernel space

That is, the default VM address split uses 47 of the 48 bits that the 4-level page mapping MMUs can support.
Paging is hierarchical. The address format is
-----------------------------------------------------------------------------------------
| 16 bits | PGD (9 bits) | PUD (9 bits) | PMD (9 bits) | PT (9 bits) | offset (12 bits) |
-----------------------------------------------------------------------------------------

where PGD indexes the Page Global Directory, PUD the Page Upper Directory, PMD the Page Middle Directory, and PT the Page Table, and the final 12 bits are the offset within the 4K page.
Thus, each process has four levels of page tables, each table containing 512 entries. The PGD, PUD, and PMD tables contain physical addresses of the next-level table. The PT contains Page Table Entries (PTEs), each of which includes a page frame reference.
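To make the layout concrete, here is a small user-space sketch (an illustration only, not kernel code) that pulls the four table indices and the page offset out of a 48-bit virtual address using the 9/9/9/9/12 layout above. The sample address is the &A value printed by the program later in this section.

#include <stdio.h>
#include <stdint.h>

/* Illustration only: split a 48-bit x86_64 virtual address into its
   4-level paging indices. */
int main(void)
{
	uint64_t va = 0x00005629a741e014ULL;   /* the &A address from the example below */

	unsigned pgd = (va >> 39) & 0x1ff;     /* bits 47..39 */
	unsigned pud = (va >> 30) & 0x1ff;     /* bits 38..30 */
	unsigned pmd = (va >> 21) & 0x1ff;     /* bits 29..21 */
	unsigned pt  = (va >> 12) & 0x1ff;     /* bits 20..12 */
	unsigned off = va & 0xfff;             /* bits 11..0  */

	printf("PGD %u PUD %u PMD %u PT %u offset 0x%x\n", pgd, pud, pmd, pt, off);
	return 0;
}

For this particular address the PGD index is 172, a number that will come up again when we look at the process map below.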
Notice that 48 bits can address 256TB which is more than most machines can support. For reasons that we will discuss presently, large memory machines (say with 32TB of memory) use a 5-level (57-bit) paging scheme but such machines are not typical. Thus, Linux is fairly profligate with respect to address space usage. Because there are so many more addresses in an address space than can be mapped to physical memory, Linux uses addresses to designate specific regions of the process address space for different purposes.
Consider the following C code
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

int A;

int main(int argc, char **argv)
{
	FILE *fd;
	pid_t pid;
	char filename[128];
	char line[1024];

	/* print this process's memory map from /proc/<pid>/maps */
	pid = getpid();
	snprintf(filename,sizeof(filename),"/proc/%d/maps",(int)pid);
	fd = fopen(filename,"r");
	if(fd == NULL) {
		printf("could not open %s\n",filename);
		exit(1);
	}
	memset(line,0,sizeof(line));
	while(fgets(line,sizeof(line),fd) != NULL) {
		printf("%s",line);
		memset(line,0,sizeof(line));
	}
	fclose(fd);

	/* print the same map using the pmap utility */
	snprintf(filename,sizeof(filename),"/usr/bin/pmap %d",(int)pid);
	system(filename);

	/* print the address of the global variable A */
	printf("&A %p\n",&A);

	return(0);
}
When I compile and run this code on an Ubuntu 20.04 system, I get
5629a741a000-5629a741b000 r--p 00000000 fc:01 259160    /home/ubuntu/src/test/a.out
5629a741b000-5629a741c000 r-xp 00001000 fc:01 259160    /home/ubuntu/src/test/a.out
5629a741c000-5629a741d000 r--p 00002000 fc:01 259160    /home/ubuntu/src/test/a.out
5629a741d000-5629a741e000 r--p 00002000 fc:01 259160    /home/ubuntu/src/test/a.out
5629a741e000-5629a741f000 rw-p 00003000 fc:01 259160    /home/ubuntu/src/test/a.out
5629ceddf000-5629cee00000 rw-p 00000000 00:00 0         [heap]
7f84f6beb000-7f84f6c0d000 r--p 00000000 fc:01 6252      /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f84f6c0d000-7f84f6d85000 r-xp 00022000 fc:01 6252      /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f84f6d85000-7f84f6dd3000 r--p 0019a000 fc:01 6252      /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f84f6dd3000-7f84f6dd7000 r--p 001e7000 fc:01 6252      /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f84f6dd7000-7f84f6dd9000 rw-p 001eb000 fc:01 6252      /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f84f6dd9000-7f84f6ddf000 rw-p 00000000 00:00 0
7f84f6de6000-7f84f6de7000 r--p 00000000 fc:01 6239      /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f84f6de7000-7f84f6e0a000 r-xp 00001000 fc:01 6239      /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f84f6e0a000-7f84f6e12000 r--p 00024000 fc:01 6239      /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f84f6e13000-7f84f6e14000 r--p 0002c000 fc:01 6239      /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f84f6e14000-7f84f6e15000 rw-p 0002d000 fc:01 6239      /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f84f6e15000-7f84f6e16000 rw-p 00000000 00:00 0
7ffe2eca0000-7ffe2ecc1000 rw-p 00000000 00:00 0         [stack]
7ffe2ed40000-7ffe2ed43000 r--p 00000000 00:00 0         [vvar]
7ffe2ed43000-7ffe2ed44000 r-xp 00000000 00:00 0         [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0 [vsyscall]
1018983:   ./a.out
00005629a741a000      4K r---- a.out
00005629a741b000      4K r-x-- a.out
00005629a741c000      4K r---- a.out
00005629a741d000      4K r---- a.out
00005629a741e000      4K rw--- a.out
00005629ceddf000    132K rw--- [ anon ]
00007f84f6beb000    136K r---- libc-2.31.so
00007f84f6c0d000   1504K r-x-- libc-2.31.so
00007f84f6d85000    312K r---- libc-2.31.so
00007f84f6dd3000     16K r---- libc-2.31.so
00007f84f6dd7000      8K rw--- libc-2.31.so
00007f84f6dd9000     24K rw--- [ anon ]
00007f84f6de6000      4K r---- ld-2.31.so
00007f84f6de7000    140K r-x-- ld-2.31.so
00007f84f6e0a000     32K r---- ld-2.31.so
00007f84f6e13000      4K r---- ld-2.31.so
00007f84f6e14000      4K rw--- ld-2.31.so
00007f84f6e15000      4K rw--- [ anon ]
00007ffe2eca0000    132K rw--- [ stack ]
00007ffe2ed40000     12K r---- [ anon ]
00007ffe2ed43000      4K r-x-- [ anon ]
ffffffffff600000      4K --x-- [ anon ]
 total             2492K
&A 0x5629a741e014

The format is as follows
start_va-end_va perms offset-in-file device inode# path-to-file

So this entry
5629a741a000-5629a741b000 r--p 00000000 fc:01 259160    /home/ubuntu/src/test/a.out

says that virtual addresses 0x5629a741a000 to 0x5629a741b000 reside in the first 4K of the file, that their pages are read-only and private, that the file is on fc:01 (major:minor dev #), the inode is 259160, and the file name is /home/ubuntu/src/test/a.out.
Take a look at the first 6 entries. The first five are each a single 4K page from a.out and the sixth is the heap. They are: a read-only segment holding the ELF header and other read-only data, the program text (the r-x pages), two more read-only data segments, the read/write data segment holding the program's globals, and the heap.
The last line of the program's output shows where the global A landed:

&A 0x5629a741e014

which is located 0x14 (20) bytes from the beginning of the read/write data segment that holds the program's global variables.
Why two read-only data sections? One is for read-only program variables and the other is (likely) for read-only tables that get filled in as a result of dynamic linking and then write-protected.
The rest of the process virtual address layout includes segments for libc (which has been dynamically linked) and ld (also dynamically linked), the stack, and some pages that the OS uses (including vsyscall which is in kernel space). Notice that the kernel simply creates new segments for the dynamically loaded libraries, each with its own separate text, data, and bss/heap. In short, dynamically linked libraries are just other executables loaded into the address space. For static linking, the linker will coalesce the libraries into a single text and data segment but dynamically linked libraries are really separate executables in the fullest sense of the word.
It is also important to understand that these are regions of virtual address space that are not (necessarily) mapped to physical memory. Also look at the virtual addresses. What does the PGD look like?
First, there are no virtual addresses mapped between 0x0000000000000000 and 0x00005629a741a000, which has high-order 9 bits 010101100 (172). This number is the index into the PGD (which contains 512 entries), so the entries at indices 0 through 171 are invalid.
Then, there are 38 pages mapped but the high-order 9 bits for all of the pages are the same so the same PGD entry (entry 172) refers to the same PUD for all of them.
Then, the next mapped address is 0x00007f84f6beb000 (the start of libc), which has high-order 9 bits 011111111 (255), so the PGD is empty from entry 173 through entry 254. The remaining user-space mappings, up to and including the stack ending near 0x00007ffe2ecc1000, also have high-order 9 bits of 011111111 (255), so they all map through that same last user-space entry and the rest of the table is empty.
So the PGD is an array of 512 entries (each 64 bits in size) that contains 2 valid user-space entries (172 and 255), each of which contains the physical address of a PUD.
The page tables are created in the kernel when the process is created, and they are updated as the process runs and as the kernel reclaims memory. However, page tables are architecture specific. The ARM processors also use multi-level paging, but the table formats and the code to load or switch between page tables are different.
Linux uses common "high-level" data structures to represent the process virtual address space, which it then uses to initialize and maintain the architecture-specific page tables. The main data structure is the mm_struct, which contains information on the process's overall address layout and a pointer to the PGD. Each contiguous region of virtual address space is described by a vm_area_struct that carries the range of virtual addresses, the access permissions (to be implemented in the page tables), and information (if any) on where to find the backing store for the region.
Thus, Linux uses a two-level strategy for implementing process virtual memory. The low level is architecture specific and depends on what format the MMU requires to implement demand paging. The high level uses a mm_struct and vm_area_structs which link to the low level data structures.
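A heavily simplified sketch of these two structures follows. The field names are approximate rather than the exact kernel definitions, and recent kernels track the regions in a maple tree rather than a simple list, but the shape of the bookkeeping is the same.

/* Simplified sketch only -- not the actual kernel definitions. */
typedef unsigned long pgd_t;            /* placeholder for the arch-specific entry type */

struct vm_area_struct {
	unsigned long vm_start;         /* first virtual address in the region */
	unsigned long vm_end;           /* first virtual address past the region */
	unsigned long vm_flags;         /* read/write/execute, shared/private, grows-down, ... */
	struct file  *vm_file;          /* backing file, or NULL for anonymous regions */
	struct vm_area_struct *vm_next; /* next region in the address space */
};

struct mm_struct {
	struct vm_area_struct *mmap;    /* the list of regions (VMAs) */
	pgd_t *pgd;                     /* the architecture-specific page tables */
	unsigned long start_brk, brk;   /* heap boundaries */
	unsigned long start_stack;      /* where the stack begins */
};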
Prior to the Meltdown side-channel attack, the kernel addresses were mapped into all page tables by duplicating the physical addresses of the kernel's PUD, PMD, and PT tables into each process's page table. As a result, during a system call, the page table base register (CR3 on the x86) did not need to change. After Meltdown, the page table base register is switched to point to kernel page tables at the initiation of a system call and back to the user-space page tables when the system call completes.
The kernel creates a duplicate map of physical memory in its half of the address space, starting just above the VM split. That is, the kernel has a way to use a virtual address (in this direct-mapped region) to access a specific physical address. It uses this mechanism to create and store page tables in physical memory and to load their entries with physical addresses.
The kernel also creates a page frame structure (a struct page) for every physical frame in the memory configured into the system. The page frame number indexes this table, and the PTE in the PT contains the frame number. The page frame structure contains either a pointer to the anonymous vm_area_struct data structures (for pages that do not have backing store) or a pointer to the address_space of the backing file, which in turn has pointers to the vm_area_struct data structures that map the page. The kernel walks these data structures to invalidate PTEs when a frame is reclaimed.
For a process that is not multi-threaded (i.e. a "traditional" Unix-style process), the virtual address space is as we described above, with a single stack at the highest possible user-space address that grows "down" towards the heap.
The kernel allocates a kernel-space record (a task_struct) for the process that is essentially identified by an integer process identifier that is unique among all running processes. The kernel also allocates a 16KB (4 page) kernel stack for the process in kernel space, and on modern x86_64 Linux the task_struct has a pointer to that kernel stack. Historically, the first (lowest address) word of the kernel stack held a back pointer to the task_struct, so the kernel could find the current process's task_struct by masking the stack pointer with the stack size to get the word at the lowest address of the stack; current x86_64 kernels instead keep a pointer to the current task in a per-CPU variable.
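A sketch of that historical stack-masking trick follows. The constants and names here are illustrative, and, as noted above, modern x86_64 kernels locate the current task through a per-CPU pointer instead.

#define THREAD_SIZE (4 * 4096UL)           /* the 16KB kernel stack */

struct task_struct;                        /* the per-process kernel record */

struct thread_info {                       /* lived at the base (lowest address) of the kernel stack */
	struct task_struct *task;          /* back pointer to the owning task_struct */
};

/* Round the kernel stack pointer down to the start of the stack area to find
   the thread_info, and through it the task_struct. */
static inline struct task_struct *current_task(unsigned long sp)
{
	struct thread_info *ti = (struct thread_info *)(sp & ~(THREAD_SIZE - 1));
	return ti->task;
}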
When a process makes a system call, the kernel switches the stack pointer from the user-space stack to the kernel stack so that the kernel program state is protected from manipulation by code that runs in user mode. Additionally, device interrupts can either specify that they are to be handled on the kernel stack or a separate interrupt stack (which is one 4K page).
If you think about it a bit, this model makes the kernel a kind of library that is loaded with the user space program. The only difference is that the user-space program cannot use the compiler's function call assembly language implementation to invoke a kernel routine. Instead, it must execute a trapping instruction (syscall on the x86_64) that raises the processor's privilege level and transfers control to a kernel-defined entry point.
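For example, the following user-space program enters the kernel through the C library's syscall() wrapper, which loads the system call number and arguments and executes the trap instruction on our behalf. gettid is used here simply because older C libraries provide no wrapper for it.

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <stdio.h>

int main(void)
{
	/* Ask the kernel for this task's id by trapping into kernel mode. */
	long tid = syscall(SYS_gettid);

	printf("the kernel says this task is %ld\n", tid);
	return 0;
}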
Notice, however, that the page tables for the process are not changed. That is, it is possible to think of the kernel stack as simply "overlaying" or "extending" the user stack (by 16KB from where the stack pointer was at the point of the system call) and to think of the process's address space as consisting of both the user-space addresses and the kernel addresses. The switch back to unprotected (user) mode "turns off" access to some of these addresses but they are logically still part of the process.
Threads, however, complicate this model. Logically, threads share an address space, which immediately raises questions: what exactly do the threads share, what must each thread have of its own, and how does the kernel schedule them?
There are essentially two ways to introduce threads as a concurrency abstraction when processes are the baseline abstraction.
The first is to make threads an abstraction that exists strictly within a process. In this model, the kernel treats the process as a schedulable unit of resource allocation and, once it is scheduled, each process runs an internal thread scheduler that is active during the process's time slice. This "two-level" approach has several advantages, most of which accrue to specialization. That is, each process essentially implements its own (possibly customized) thread scheduler internally.
From the perspective of the Linux kernel implementation, however, this approach requires a substantial refactoring of the kernel. Under this model, if multiple threads within a process make system calls, and the scheduler chooses another process to run while those system calls are in progress, the kernel would need to suspend and then keep track of all of the pending system calls so they could be continued when the process is rescheduled in the future. The Linux system call architecture, inherited from Unix, is one in which each system call is independent and, as such, can be managed individually. As a result, the kernel scheduler and resource allocators need not consider "group" system calls or any possible interactions between threads within a group and outside of the group.
If this sounds a little hand-wavy, then consider the following scenario. Imagine that a process has three threads and all three of them make a system call. Further, Linux treats all three threads as part of a single schedulable process. During the three simultaneous system calls, the scheduler deschedules the process. Later, the process's scheduling priority is such that it should run next, but only two of the three threads are ready to execute. What should the kernel do? You could run the process and allow the two threads to continue while the third waits for its system call to complete. The problem here is that if the process is descheduled again and then the third thread becomes runnable, you either have to schedule the process again (early) or wait for the process to get a time slice so that the third thread can continue. It is possible to come up with a rational approach to this two-level scheme (and some high-performance Linux implementations include one) but the book-keeping required in the kernel is fairly substantial.
The alternative (originally due to Silicon Graphics) which Linux implements is to treat threads as independent processes in the kernel. That is, under Linux, when either a "classic" process or a thread within a process makes a system call, the code in the kernel does not differentiate between the two (with a few small exceptions).
This structural decision has several ramifications.
The first, and most obvious, is for CPU scheduling. Linux accounts for thread CPU occupancy independently and, thus, threads contend for the CPU with other threads regardless of what processes they inhabit (at least, under the SCHED_NORMAL policy). Secondly, each thread has a complete set of kernel data structures that allows the kernel to schedule it and deschedule it independently.
For example, each thread has a task struct that represents the anchor point for the thread's state. As such, each thread has a unique PID (process ID) just as each process does. Each thread also has its own kernel stack which is allocated when the thread is created and linked to the task struct for the thread.
Threads within a process must share the process address space (the mm_struct and its vm_area_structs), the open file descriptors, the signal handlers, and the process credentials. In Linux, the clone() system call that creates a new thread takes flags that specify exactly what is shared.
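A small demonstration of the "threads are kernel tasks" view: each thread created with pthread_create() shows up in the kernel with its own task id, while the process id is shared. Compile with -pthread.

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Print the shared process id and the per-thread kernel task id. */
static void *worker(void *arg)
{
	(void)arg;
	printf("worker: pid %d, kernel task id %ld\n", (int)getpid(), syscall(SYS_gettid));
	return NULL;
}

int main(void)
{
	pthread_t t;

	printf("main:   pid %d, kernel task id %ld\n", (int)getpid(), syscall(SYS_gettid));
	pthread_create(&t, NULL, worker, NULL);
	pthread_join(t, NULL);
	return 0;
}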
The advantage of this approach is that it allows the kernel to have only one way to synchronize data transfer from user space to kernel space and vice versa. Specifically, the kernel has a mechanism to allow a process/thread to make a system call (transitioning from user space to kernel space) and to block until such time that the system call can be completed. Usually the blocking happens because the system call requires some form of device I/O and the device is far slower than the CPU. The mechanism (which uses wait channels or "queues" within the kernel) depends upon there being a kernel stack and a task struct. For historical reasons, the term for this state is process context.
Thus, the Linux kernel is only able to block a thread of control if it has a process context. For reasons that are not at all clear, there is a veritable zoo of asynchronous control abstractions in the kernel (softirqs, tasklets, and workqueues) for prioritizing and deferring work. Softirqs and tasklets run outside of process context and cannot block; workqueue items are handed to kernel threads, which gives them a process context. Further, various kernel facilities (like the kernel memory allocator) can block, so there are restrictions on what kernel routines the non-blocking abstractions can access. Anything with a process context can call anything else in the kernel, however.
Because each kernel thread has a kernel stack, however, it can block. Thus a kernel thread is like a regular process that contains a single thread which makes a system call that never returns to user space. Under Unix, this concept was called a kernel process. The idea was that a regular process (running as root) would make a special system call that would never return to user space. The user space address allocation and management were "wasted" in this case, but it allowed the process to do work on the kernel's behalf and then sleep when there was no more kernel work to do (instead of returning to user space).
Why does the kernel need kernel threads?
This turns out to be an important question in the evolution from Unix to Linux. Really, it is possible to think of the only architectural difference between Linux and Unix as being Linux's reliance on kernel threads. Unix, originally, did everything in process context (including interrupt handling, which it did on whatever the current kernel stack was when the interrupt occurred). Thus the kernel was really a dynamically loaded library with respect to a process: it got loaded when the process was scheduled, and that process was responsible for the machine until it was descheduled.
This method was elegant and simple, but as kernel functionality grew it became complex to manage. In particular, the kernel is responsible for functionality that must be able to block on I/O. When the kernel writes dirty pages from the page cache to disk, for example, the disk I/O is synchronous, which means that a process context must block until the interrupt indicating that the disk write has completed unblocks it. Unix solved this problem by using the current process context to start the write and then returning to whatever that process context was doing once the I/O was started. Then, when the I/O completed, it would use another process context to handle the interrupt.
Today, the kernel makes heavy use of kernel threads. There are three primary use cases. The first is internal housekeeping that must be able to block, such as writing dirty pages from the page cache to storage and reclaiming memory (the writeback and kswapd threads, for example).
The second usage is for device drivers. Originally, writing a device driver for Unix was quite difficult because the driver interrupt handler code had to do its work and exit very quickly. The reason is that when an interrupt is handled by the CPU, all other interrupts of the same or lower priority are disabled. Some devices have watchdog timers that trigger an error if their interrupts are not serviced within a certain time window so one had to be careful to get out of interrupt context as soon as possible.
Interrupt handlers, today, still need to run to completion quickly, but as devices become more and more complex, the work needed to handle an interrupt becomes more and more extensive. This work is deferred so that it can run at a lower CPU priority (relative to the interrupt handler priority) using a kernel thread. Thus, an interrupt handler that can't do all of its work in interrupt context (because it would take too long) can hand the work off to a kernel thread and then exit interrupt context. The kernel will run the kernel thread at the next moment when the CPU is not busy fielding another interrupt, and that kernel thread will be preempted by interrupts that come in while it is executing.
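Here is a hedged sketch of that hand-off using the kernel's workqueue facility (one of the deferral mechanisms mentioned earlier). Driver boilerplate such as registering the handler with request_irq() is omitted, and the device-specific work is only suggested by comments.

#include <linux/interrupt.h>
#include <linux/workqueue.h>

/* The deferred half: runs later in process context (a kworker kernel thread)
   and is therefore allowed to sleep. */
static void slow_part(struct work_struct *work)
{
	/* ... the lengthy, possibly blocking, portion of servicing the device ... */
}

static DECLARE_WORK(deferred_work, slow_part);

/* The interrupt handler: do the minimum and get out of interrupt context fast. */
static irqreturn_t fast_isr(int irq, void *dev)
{
	/* ... acknowledge the device (device-specific, omitted) ... */
	schedule_work(&deferred_work);   /* hand the rest to a kernel thread */
	return IRQ_HANDLED;
}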
The third usage is for functionality that Linux implements as part of a networked collection of systems. For example, it is possible for a Linux system (using iptables and/or ebtables) to act as a network switch or gateway. Network packets then pass through the kernel without ever being referenced or accessed by a user-space process running on that system. Another example is the Network File System (NFS): on the server side, requests need not transition to user space because the NFS server is implemented in the kernel (the nfsd kernel threads) and serves file requests directly.
The modern Linux kernel is multithreaded and preemptible. As a result, process contexts executing in the kernel (on some core of the CPU) can be scheduled and descheduled by the Linux scheduler in the same way that they can be context switched when they are in user space. Kernel threads are included, although some of them run in a higher-priority scheduling class than "regular" threads and processes.
In addition, the kernel is parallelized. That is, multiple threads (each executing on a separate core) can be executing in the kernel at the same time. To avoid race conditions, there are, again, a zoo of synchronization primitives. Roughly speaking they can be categorized into those that can block the current context and those that do not.
When executing in a process or thread context (i.e. where the code can sleep), the kernel synchronization primitive is a counting semaphore. The kernel also includes an API which implements a mutex lock, essentially a semaphore initialized to 1 (and which presumably disallows unlock calls on locks that are not locked). The key property here, however, is that a call to one of these primitives can result in the current context being descheduled.
When executing in interrupt context, however, semaphores cannot be used. Instead, the kernel includes various atomic data operations (as in C++) and an explicit spin lock primitive. When a process context needs to synchronize with an interrupt context or interrupt contexts must synchronize with each other, they must use one of these non-blocking mechanisms.
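A sketch of the two families in use (kernel-style pseudocode; the data being protected is only suggested by comments):

#include <linux/mutex.h>
#include <linux/spinlock.h>

/* The mutex may put the caller to sleep, so it is only legal in process context.
   The spin lock never sleeps, so it can also be used to synchronize with
   interrupt context. */

static DEFINE_MUTEX(state_mutex);      /* used from process context only */
static DEFINE_SPINLOCK(queue_lock);    /* shared between process and interrupt context */

static void update_state(void)
{
	mutex_lock(&state_mutex);          /* may sleep while waiting */
	/* ... manipulate data touched only from process context ... */
	mutex_unlock(&state_mutex);
}

static void add_to_queue(void)
{
	unsigned long flags;

	spin_lock_irqsave(&queue_lock, flags);   /* disables local interrupts, never sleeps */
	/* ... manipulate data shared with the interrupt handler ... */
	spin_unlock_irqrestore(&queue_lock, flags);
}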
Note that it is up to the kernel developer to identify and code the usage of these primitives correctly or the kernel will crash. There is no compiler support or automated runtime system that can implement synchronization correctly.
A process context can be scheduled dynamically, and when a process is not scheduled its kernel state (the task_struct and kernel stack) records where it is sleeping.
Notice that interrupts can interrupt other interrupts (the term of art is that interrupts can "stack") but there is no notion of time division multiplexing (i.e. "time slicing") for interrupts. Linux does implement time slicing for process contexts, but in a very specific way.
First, though, the Linux kernel implements 5 scheduling priority levels (scheduling classes). All tasks at level n that are runnable will be executed before any task at level n+1 is chosen. From highest to lowest priority, the 5 levels are the stop class (used internally by the kernel itself), the deadline class (SCHED_DEADLINE), the real-time class (SCHED_FIFO and SCHED_RR), the fair class (SCHED_OTHER, implemented by CFS), and the idle class (SCHED_IDLE).
The next two levels are available as configuration options in the kernel and they require permission levels to be set for processes to request them. For example, SCHED_DEADLINE is typically only accessible to processes running as root.
SCHED_OTHER (also called SCHED_NORMAL) is the default option and SCHED_IDLE (which is really for low-priority background threads) is available.
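From user space, a (privileged) process asks for one of these policies with sched_setscheduler(). For example, the following sketch requests SCHED_FIFO; run as an ordinary user it will fail with EPERM.

#include <sched.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	struct sched_param p;

	memset(&p, 0, sizeof(p));
	p.sched_priority = 10;                     /* real-time priority level */

	/* pid 0 means "the calling process/thread" */
	if (sched_setscheduler(0, SCHED_FIFO, &p) != 0)
		perror("sched_setscheduler");
	else
		printf("now running under SCHED_FIFO\n");
	return 0;
}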
Each scheduling class defines its own rules for how to pick the next task to run. The kernel has various scheduler entry points defined where it will consult the scheduling classes (in order) and ask if they wish to run a process.
The complexities arise out of the need to manage the CPU efficiently (i.e. maintain good utilization) while providing good performance to long-running CPU-bound tasks (low overhead) and fast response time (low latency). These requirements are in tension.
For example, for many years, the Linux scheduler implemented short-term priority inflation after a process blocked. That is, when a process slept to wait for an I/O to complete, it would get a very high priority for a very short time when it was re-enabled. Doing so enabled processes running text editors, for example, to echo a typed character (in full-duplex mode) quickly (low latency) before going immediately back to sleep. CPU-bound processes would then be interrupted whenever a character needed to be echoed, which hopefully would not introduce too much overhead (cache pollution, memory pressure, TLB pressure, etc.). This methodology works well, but it can lead to some pathological situations, especially with respect to network traffic.
After many years of fixing up the existing priority scheduler, Linux adopted the Completely Fair Scheduler (CFS), which uses a different approach.
The abstract idea behind CFS is that the system maintains a target latency which represents the time period over which all runnable processes should share the CPU fairly. It then uses the number of runnable processes to time division multiplex the CPU so that the target latency period is divided equally among all runnable processes.
For example, imagine that the target latency is 20ms and there are two runnable processes. In the "perfect" time slicing case, each process is entitled to 10ms of execution time during each 20ms period. Linux will set the time slice to be 10ms when there are two runnable processes and the configured target latency is 20ms. If there are 4 runnable processes, then the time slice is set to 5ms, and so on, down to some minimum latency (which is also configurable).
Thus the first part of the CFS is that it adjusts the time slice duration as a function of the number of runnable processes in the SCHED_OTHER run queue. When more processes join the queue, the time slice duration goes down. When processes block or die, the time slice duration goes up.
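A toy calculation of that first part (the constants here are just assumptions for illustration; the real values are kernel configuration parameters):

#include <stdio.h>

/* Divide the target latency among the runnable tasks, but never hand out a
   slice smaller than an assumed minimum granularity. Not kernel code. */
int main(void)
{
	double target_latency_ms = 20.0;   /* example value from the text */
	double min_granularity_ms = 3.0;   /* assumed minimum for illustration */

	for (int nr_running = 1; nr_running <= 8; nr_running *= 2) {
		double slice = target_latency_ms / nr_running;
		if (slice < min_granularity_ms)
			slice = min_granularity_ms;
		printf("%d runnable -> %.1f ms slice\n", nr_running, slice);
	}
	return 0;
}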
The second part of the scheduler is how it chooses the next task to run at a scheduler entry point. Using a fast timer, Linux records the CPU time a process actually uses when it is assigned the CPU; this accumulated (and, as we will see, weighted) time is the process's virtual runtime. A process is entitled to complete its full time slice (in which case the virtual runtime increases by a full time slice), but if the process blocks for I/O then the virtual runtime increases by less.
The third part is that the scheduler uses a weighting system (based on the nice value) to change the rate at which virtual runtime accumulates. Each nice value corresponds to a weight (given in a kernel table). The "baseline" or neutral weight (that of a nice 0 process) is divided by the weight mapped from the process's nice value to compute a scaling factor that is then used to increase or decrease the rate of virtual runtime accumulation.
For example, imagine that the weight for a nice 0 process is set to 1024, the weight for a nice 5 process is set to 335, and the weight for a nice -5 process is set to 3121 (thanks chatGPT). The relative fractions that the scheduler wants to assign to the processes are
nice -5 : (3121 / (3121 + 1024 + 335)) == 69%
nice  0 : (1024 / (3121 + 1024 + 335)) == 22%
nice  5 : ( 335 / (3121 + 1024 + 335)) == 7%

When each process runs, its virtual runtime is incremented by its actual runtime scaled by the ratio of the neutral weight to its own weight. That is,
nice -5: vruntime += (1024 / 3121) * actual_runtime
nice  0: vruntime += (1024 / 1024) * actual_runtime
nice  5: vruntime += (1024 /  335) * actual_runtime

Thus, for each 1 ms that a nice -5 process runs, it accumulates 0.32 ms of virtual runtime. For each 1 ms a nice 0 process runs, it accumulates 1 ms of virtual runtime. Finally, for each 1 ms of actual runtime, a nice 5 process accumulates 3.05 ms of virtual runtime.
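The same arithmetic in a toy C fragment (the weights are the kernel's table values quoted above; everything else is illustrative):

#include <stdio.h>

/* Toy model of the vruntime scaling: vruntime grows by
   actual_runtime * (nice 0 weight) / (task weight). Not kernel code. */
int main(void)
{
	int    nice_vals[] = { -5, 0, 5 };
	double weights[]   = { 3121.0, 1024.0, 335.0 };
	double actual_ms   = 4.0;            /* each task runs for one 4ms tick */

	for (int i = 0; i < 3; i++) {
		double delta = actual_ms * 1024.0 / weights[i];
		printf("nice %2d: +%.2f ms of vruntime per %.0f ms of CPU\n",
		       nice_vals[i], delta, actual_ms);
	}
	return 0;
}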
The fourth part is that CFS chooses the process with the smallest virtual runtime at the moment that a scheduling decision must be made. Runnable processes are kept sorted in a red-black tree by virtual runtime, from smallest to largest. CFS uses a tick (a minimum clock interrupt duration set at configuration time) to update the virtual runtime of the currently running process. Then, when the scheduler decides it needs to choose another process, the running process is reinserted into the tree and the new lowest-virtual-runtime process is chosen.
There are two conditions under which it might choose a new task. Either the current task has used up its fraction of the target latency time slice or the virtual runtime of a process other than the running process has become smaller than that of the running process.
Let's work through an example. On an Ubuntu 20.04 VM I have running, the target latency is 12ms
cat /proc/sys/kernel/sched_latency_ns
12000000

and the tick interval is 4ms
grep CONFIG_HZ= /boot/config-$(uname -r)
CONFIG_HZ=250

Let's say all three processes start with vruntime = 0.
nice -5 runs for 4 ms and its vruntime increments by 4ms * 0.32 (call it 1.2ms for this example)
nice -5: vruntime = 1.2ms
nice  0: vruntime = 0
nice  5: vruntime = 0

CFS decides to switch processes and let's say it chooses nice 0. After the next 4ms tick, the accounting looks like
nice -5: vruntime = 1.2ms
nice  0: vruntime = 4ms
nice  5: vruntime = 0

CFS then decides to run nice 5 for 4ms and its vruntime increments by 4 * 3.05 ms.
nice -5: vruntime = 1.2ms
nice  0: vruntime = 4ms
nice  5: vruntime = 12.2ms

CFS runs again, chooses nice -5, and adds 1.2ms to its vruntime.
nice -5: vruntime = 2.4ms
nice  0: vruntime = 4ms
nice  5: vruntime = 12.2ms

At the next tick, CFS chooses nice -5 again (does not context switch), but adds another 1.2ms.
nice -5: vruntime = 3.6ms
nice  0: vruntime = 4ms
nice  5: vruntime = 12.2ms

nice -5 runs at the next tick and the accounting looks like
nice -5: vruntime = 4.8ms
nice  0: vruntime = 4ms
nice  5: vruntime = 12.2ms

CFS now chooses nice 0 and runs it for 4ms.
nice -5: vruntime = 4.8ms
nice  0: vruntime = 8ms
nice  5: vruntime = 12.2ms

And so on. It will add 1.2ms to nice -5 each time it runs for 4ms, it will add 4ms to nice 0 every time it runs for 4ms, and it will add 12.2ms to nice 5 every time it runs for 4ms.
Note that it does not change the amount it increments vruntime by when the number of runnable processes changes. However, because it is choosing between a different number of processes, the frequency with which a process is chosen changes when the number of processes changes.
For example, let's say that at this moment, nice -5 dies. The accounting looks like
nice 0: vruntime = 8ms
nice 5: vruntime = 12.2ms

CFS chooses nice 0 and the accounting goes to
nice 0: vruntime = 12ms
nice 5: vruntime = 12.2ms

and again to
nice 0: vruntime = 16ms
nice 5: vruntime = 12.2ms

at which point CFS will choose nice 5. The effect is that nice 5 now gets the CPU much more often. With nice -5 running, nice 5's share of the CPU is about 7% (335/4480), so it gets its 4ms of CPU roughly every 53ms. With only nice 0 running alongside it, its share rises to about 25% (335/1359), so it runs roughly every 16ms.
Note that this accounting assumes that all three processes are CPU bound. If they perform I/O, they will block. The kernel uses a high-resolution clock to update vruntime: it uses the tick as the "time slice," but if a process executes for only a fraction of a tick, its vruntime (stored in nanoseconds) is still updated correctly.
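Finally, the hand accounting above can be reproduced (up to the rounding of 1.28ms to 1.2ms) with a toy simulation that simply picks the smallest vruntime at every 4ms tick:

#include <stdio.h>

/* Toy re-creation of the hand accounting: three CPU-bound tasks, 4ms ticks,
   and a pick-the-smallest-vruntime rule. Not kernel code. */
struct task { const char *name; double weight; double vruntime; };

int main(void)
{
	struct task tasks[] = {
		{ "nice -5", 3121.0, 0.0 },
		{ "nice  0", 1024.0, 0.0 },
		{ "nice  5",  335.0, 0.0 },
	};
	double tick_ms = 4.0;

	for (int t = 0; t < 10; t++) {
		int next = 0;
		for (int i = 1; i < 3; i++)            /* pick the smallest vruntime */
			if (tasks[i].vruntime < tasks[next].vruntime)
				next = i;
		tasks[next].vruntime += tick_ms * 1024.0 / tasks[next].weight;
		printf("tick %2d: ran %s  (vruntimes: %.1f %.1f %.1f)\n", t, tasks[next].name,
		       tasks[0].vruntime, tasks[1].vruntime, tasks[2].vruntime);
	}
	return 0;
}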