This is the ninth part of the Linux kernel initialization process series. In the previous part we stopped at the scheduler initialization. In this part we will continue to dive into the Linux kernel initialization process, and the main purpose of this part is to learn about the initialization of RCU. We can see that the next step in init/main.c after the sched_init
is the call of the preempt_disable
. There are two macros:
preempt_disable
preempt_enable
for disabling and enabling preemption. First of all, let's try to understand what preemption means
in the context of an operating system kernel. In simple words, preemption is the ability of the operating system kernel to interrupt the current task in order to run a task with higher priority. Here we need to disable preemption because there is only one init
process during early boot, and we don't need to stop it before we call the cpu_idle
function. The preempt_disable
macro is defined in the include/linux/preempt.h and depends on the CONFIG_PREEMPT_COUNT
kernel configuration option. This macro is implemented as:
#define preempt_disable() \
do { \
    preempt_count_inc(); \
    barrier(); \
} while (0)
and if CONFIG_PREEMPT_COUNT
is not set just:
#define preempt_disable() barrier()
Let's look at them. First of all, we can see one difference between these macro implementations. The preempt_disable
with CONFIG_PREEMPT_COUNT
set contains the call of the preempt_count_inc
. There is a special percpu
variable which stores the number of held locks and preempt_disable
calls:
DECLARE_PER_CPU(int, __preempt_count);
In the first implementation of the preempt_disable
we increment this __preempt_count
. There is an API for reading the value of __preempt_count
: the preempt_count
function. When we call preempt_disable
, we first increment the preemption counter with the preempt_count_inc
macro, which expands to:
#define preempt_count_inc() preempt_count_add(1)
#define preempt_count_add(val) __preempt_count_add(val)
where preempt_count_add
calls the raw_cpu_add_4
macro which adds 1
to the given percpu
variable (__preempt_count
in our case; you can read more about percpu
variables in the part about Per-CPU variables). OK, we have increased __preempt_count
, and as the next step we can see the call of the barrier
macro in both macros. The barrier
macro inserts an optimization barrier for the compiler. Both the compiler and the processor (including processors with the x86_64
architecture) may reorder some independent memory accesses, so we need a way to tell them where the order must be preserved. Such mechanisms are called memory barriers. The barrier macro is the compiler-only kind: it constrains the compiler and emits no processor instruction. Let's consider a simple example:
preempt_disable();
foo();
preempt_enable();
Without a barrier, the compiler is free to rearrange it as:
preempt_disable();
preempt_enable();
foo();
In this case the non-preemptible function foo
can be preempted. As we put the barrier
macro in the preempt_disable
and preempt_enable
macros, it prevents the compiler from swapping preempt_count_inc
with other statements. More about barriers you can read here and here.
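To make the counter-plus-barrier pattern concrete, here is a minimal userspace sketch. It is not the kernel implementation: model_preempt_count, model_barrier and the other model_ names are hypothetical stand-ins for the percpu __preempt_count and the real macros, only one CPU is modelled, and the reschedule check that a preemptible kernel performs in preempt_enable is left out. The empty asm statement with a "memory" clobber is the usual GCC/Clang way to write a compiler-only barrier.

```c
#include <assert.h>

/* Hypothetical stand-in for the percpu __preempt_count (one CPU only). */
static int model_preempt_count;

/* Compiler-only barrier: no instruction is emitted, but GCC/Clang will not
 * move memory accesses across this point. */
#define model_barrier() __asm__ __volatile__("" ::: "memory")

/* Increment first, then fence: the protected code cannot move above it. */
static inline void model_preempt_disable(void)
{
    model_preempt_count++;
    model_barrier();
}

/* Fence first, then decrement: the protected code cannot move below it. */
static inline void model_preempt_enable(void)
{
    model_barrier();
    assert(model_preempt_count > 0); /* an unbalanced enable is a bug */
    model_preempt_count--;
}

/* Preemption is allowed only when every disable has been matched. */
static inline int model_preemptible(void)
{
    return model_preempt_count == 0;
}
```

The calls nest: after two model_preempt_disable() calls, model_preemptible() keeps returning 0 until the second model_preempt_enable().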
In the next step we can see the following statement:
if (WARN(!irqs_disabled(),
         "Interrupts were enabled *very* early, fixing it\n"))
    local_irq_disable();
which checks the state of IRQs and disables them (with the cli
instruction for x86_64
) if they are enabled.
That's all. Preemption is disabled and we can go ahead.
In the next step we can see the call of the idr_init_cache
function, which is defined in lib/idr.c. The idr
library is used in various places in the Linux kernel to manage assigning integer IDs
to objects and looking up objects by ID.
Let's look at the implementation of the idr_init_cache
function:
void __init idr_init_cache(void)
{
    idr_layer_cache = kmem_cache_create("idr_layer_cache",
                sizeof(struct idr_layer), 0, SLAB_PANIC, NULL);
}
Here we can see the call of kmem_cache_create
. We already called kmem_cache_init
in init/main.c. This call creates one more generalized cache, again using kmem_cache_alloc
(we will see more about caches in the Linux kernel memory management chapter). In our case we need a kmem_cache_t
, which will be used by the slab allocator, and kmem_cache_create
creates it. As you can see, we pass five parameters to kmem_cache_create
:
name of the cache;
size of an object stored in the cache;
alignment of the objects (0 here);
flags (SLAB_PANIC here);
constructor for the objects (NULL here),
and it will create a kmem_cache
for the integer IDs. Integer IDs
are a commonly used pattern for mapping a set of integer IDs to a set of pointers. We can see the usage of integer IDs in the i2c drivers subsystem. For example, drivers/i2c/i2c-core.c, which represents the core of the i2c
subsystem, defines an idr
for the i2c
adapters with the DEFINE_IDR
macro:
static DEFINE_IDR(i2c_adapter_idr);
and then uses it to allocate an ID for the i2c
adapter:
static int __i2c_add_numbered_adapter(struct i2c_adapter *adap)
{
    int id;
    ...
    ...
    ...
    id = idr_alloc(&i2c_adapter_idr, adap, adap->nr, adap->nr + 1, GFP_KERNEL);
    ...
    ...
    ...
}
and i2c_adapter_idr
maps bus numbers, which serve as the IDs, to their adapters.
More about integer ID management you can read here.
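The ID-to-pointer mapping can be sketched in a few lines of userspace C. This is only an illustration under simplifying assumptions: struct model_idr and the model_idr_ functions are hypothetical names, the table has a fixed size instead of the layered structure the kernel uses, and failure is reported as -1 rather than a negative errno. The [start, end) range and the "end <= 0 means no upper limit" convention follow the way idr_alloc is called above.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_IDR_MAX 64 /* assumed fixed capacity; the real IDR grows */

/* Hypothetical, array-backed stand-in for struct idr. */
struct model_idr {
    void *slots[MODEL_IDR_MAX];
};

/* Bind ptr to the lowest free ID in [start, end); end <= 0 means "no upper
 * limit". Returns the ID, or -1 if the range is full. */
static int model_idr_alloc(struct model_idr *idr, void *ptr, int start, int end)
{
    assert(ptr != NULL); /* NULL marks a free slot in this sketch */
    if (start < 0)
        return -1;
    if (end <= 0 || end > MODEL_IDR_MAX)
        end = MODEL_IDR_MAX;
    for (int id = start; id < end; id++) {
        if (idr->slots[id] == NULL) {
            idr->slots[id] = ptr;
            return id;
        }
    }
    return -1;
}

/* Look up the pointer bound to id, or NULL if there is none. */
static void *model_idr_find(const struct model_idr *idr, int id)
{
    if (id < 0 || id >= MODEL_IDR_MAX)
        return NULL;
    return idr->slots[id];
}

/* Release id so that it can be handed out again. */
static void model_idr_remove(struct model_idr *idr, int id)
{
    if (id >= 0 && id < MODEL_IDR_MAX)
        idr->slots[id] = NULL;
}
```

Passing start = nr and end = nr + 1, as the i2c code does, asks for exactly the ID nr and fails if it is already taken.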
The next step is RCU initialization with the rcu_init
function, and its implementation depends on two kernel configuration options:
CONFIG_TINY_RCU
CONFIG_TREE_RCU
In the first case rcu_init
is defined in kernel/rcu/tiny.c, and in the second case it is defined in kernel/rcu/tree.c. We will look at the implementation of the tree rcu
, but first a few words about RCU
in general.
RCU
or read-copy update is a scalable, high-performance synchronization mechanism implemented in the Linux kernel. In its early days the Linux kernel provided support and an environment for concurrently running applications, but all execution inside the kernel was serialized by a single global lock. Today the Linux kernel has no single global lock, but provides different mechanisms including lock-free data structures, percpu data structures and others. One of these mechanisms is read-copy update
. The RCU
technique is designed for rarely-modified data structures. The idea of RCU
is simple. Suppose we have a rarely-modified data structure. If somebody wants to change it, we make a copy of the data structure and make all changes in the copy. At the same time all other users of the data structure keep using the old version. Next, we need to choose a safe moment when the original version of the data structure has no users, and replace it with the modified copy.
Of course, this description of RCU
is very simplified. To understand some details about RCU
, we first need to learn some terminology. Data readers in RCU
execute in a critical section. Every time a data reader enters the critical section it calls rcu_read_lock
, and it calls rcu_read_unlock
on exit from the critical section. If a thread is not in the critical section, it is in a state called the quiescent state
. An interval during which every thread passes through at least one quiescent state
is called a grace period
. If a thread wants to remove an element from the data structure, this occurs in two steps. The first step is removal
: the writer atomically unlinks the element from the data structure, but does not release its memory. After this the writer announces the removal and waits for a grace period to elapse. From this moment the removed element can no longer be found by new readers, although readers that started earlier may still hold a reference to it. After the grace period
has finished, the second step of the element removal begins: it simply frees the memory of the element.
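The two steps can be played through in a small single-threaded model. This is only a sketch of the bookkeeping, not of real RCU: there is no concurrency here, every model_ name is hypothetical, the "CPUs" report their quiescent states by explicit function calls, and only one update may be in flight at a time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define MODEL_NR_CPUS 4

struct model_item {
    int value;
};

static struct model_item *model_published; /* what new readers see */
static struct model_item *model_retired;   /* removed, but not yet freed */
static unsigned int model_qs_pending;      /* bit n: CPU n owes a quiescent state */

/* Step 1, removal: publish the new version, keep the old one around and
 * start a grace period in which every CPU owes one quiescent state. */
static void model_update(struct model_item *new_item)
{
    assert(model_retired == NULL); /* sketch: one update in flight */
    model_retired = model_published;
    model_published = new_item;
    model_qs_pending = (1u << MODEL_NR_CPUS) - 1;
}

/* cpu reports that it is outside any read-side critical section.
 * Returns true if this report ended the grace period and the retired
 * version was reclaimed (step 2). */
static bool model_quiescent_state(int cpu)
{
    assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
    model_qs_pending &= ~(1u << cpu);
    if (model_qs_pending != 0 || model_retired == NULL)
        return false;
    free(model_retired);
    model_retired = NULL;
    return true;
}
```

The old version is freed only by the last CPU to report, which mirrors the rule that memory is reclaimed after the grace period and never before it.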
There are a couple of implementations of RCU
. The old RCU
is called classic, and the new implementation is called tree
RCU. As you may already understand, the CONFIG_TREE_RCU
kernel configuration option enables tree RCU
. Another one is tiny
RCU, which depends on CONFIG_TINY_RCU
and CONFIG_SMP=n
. We will see more details about RCU
in general in the separate chapter about synchronization primitives, but now let's look at the rcu_init
implementation from kernel/rcu/tree.c:
void __init rcu_init(void)
{
    int cpu;
    rcu_bootup_announce();
    rcu_init_geometry();
    rcu_init_one(&rcu_bh_state, &rcu_bh_data);
    rcu_init_one(&rcu_sched_state, &rcu_sched_data);
    __rcu_init_preempt();
    open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);
    /*
     * We don't need protection against CPU-hotplug here because
     * this is called early in boot, before either interrupts
     * or the scheduler are operational.
     */
    cpu_notifier(rcu_cpu_notify, 0);
    pm_notifier(rcu_pm_notify, 0);
    for_each_online_cpu(cpu)
        rcu_cpu_notify(NULL, CPU_UP_PREPARE, (void *)(long)cpu);
    rcu_early_boot_tests();
}
At the beginning of the rcu_init
function we define the cpu
variable and call rcu_bootup_announce
. The rcu_bootup_announce
function is pretty simple:
static void __init rcu_bootup_announce(void)
{
    pr_info("Hierarchical RCU implementation.\n");
    rcu_bootup_announce_oddness();
}
It just prints information about RCU
with the pr_info
function and calls rcu_bootup_announce_oddness
, which uses pr_info
too, to print various information about the current RCU
configuration, which depends on kernel configuration options like CONFIG_RCU_TRACE
, CONFIG_PROVE_RCU
, CONFIG_RCU_FANOUT_EXACT
, etc. In the next step, we can see the call of the rcu_init_geometry
function. This function is defined in the same source code file and computes the node tree geometry depending on the number of CPUs. RCU
provides scalability with extremely low internal RCU lock contention. But what if a data structure is read from many different CPUs? The RCU
API provides the rcu_state
structure, which represents the global RCU state including the node hierarchy. The hierarchy is represented by the:
struct rcu_node node[NUM_RCU_NODES];
array of structures. As we can read in the comment above this definition:
The root (first level) of the hierarchy is in ->node[0] (referenced by ->level[0]), the second
level in ->node[1] through ->node[m] (->node[1] referenced by ->level[1]), and the third level
in ->node[m+1] and following (->node[m+1] referenced by ->level[2]). The number of levels is
determined by the number of CPUs and by CONFIG_RCU_FANOUT.
Small systems will have a "hierarchy" consisting of a single rcu_node.
The rcu_node
structure is defined in kernel/rcu/tree.h and contains information about the current grace period, whether the grace period has completed, which CPUs or groups still need to switch (pass through a quiescent state) for the current grace period to proceed, etc. Every rcu_node
contains a lock that covers a small group of CPUs. These rcu_node
structures are embedded into a linear array in the rcu_state
structure and are organized as a tree whose root is the first element and which covers all CPUs. As you can see, the number of rcu nodes is determined by NUM_RCU_NODES
, which depends on the number of available CPUs:
#define NUM_RCU_NODES (RCU_SUM - NR_CPUS)
#define RCU_SUM (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3 + NUM_RCU_LVL_4)
where the values of the levels depend on the CONFIG_RCU_FANOUT_LEAF
configuration option. For example, in the simplest case one rcu_node
covers two CPUs on a machine with eight CPUs:
+-----------------------------------------------------------------+
|  rcu_state                                                      |
|                 +----------------------+                        |
|                 |         root         |                        |
|                 |       rcu_node       |                        |
|                 +----------------------+                        |
|                    |                |                           |
|               +----v-----+       +--v-------+                   |
|               |          |       |          |                   |
|               | rcu_node |       | rcu_node |                   |
|               |          |       |          |                   |
|         +------------------+     +----------------+             |
|         |                  |        |             |             |
|         |                  |        |             |             |
|    +----v-----+    +-------v--+   +-v--------+  +-v--------+    |
|    |          |    |          |   |          |  |          |    |
|    | rcu_node |    | rcu_node |   | rcu_node |  | rcu_node |    |
|    |          |    |          |   |          |  |          |    |
|    +----------+    +----------+   +----------+  +----------+    |
|         |                 |             |               |       |
|         |                 |             |               |       |
|         |                 |             |               |       |
|         |                 |             |               |       |
+---------|-----------------|-------------|---------------|-------+
          |                 |             |               |
+---------v-----------------v-------------v---------------v--------+
|                 |                |               |               |
|     CPU1        |      CPU3      |      CPU5     |     CPU7      |
|                 |                |               |               |
|     CPU2        |      CPU4      |      CPU6     |     CPU8      |
|                 |                |               |               |
+------------------------------------------------------------------+
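To see why such a tree keeps lock contention low, here is a hedged sketch of how a quiescent-state report could climb the eight-CPU tree from the diagram. It is a model, not the kernel code: struct model_rcu_node keeps only a pending mask, the per-node locks are left out, and the array layout (root at index 0, leaves at indexes 3 to 6) is an assumption chosen to match the linear array described above.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_FANOUT     2 /* children per rcu_node, as in the diagram */
#define MODEL_NR_CPUS    8
#define MODEL_NR_NODES   7 /* 1 root + 2 inner + 4 leaf nodes */
#define MODEL_FIRST_LEAF 3 /* leaves are model_tree[3..6] */

/* Hypothetical, much reduced rcu_node: only the mask of children (CPUs for
 * a leaf, nodes otherwise) that still owe a quiescent state. */
struct model_rcu_node {
    unsigned int qsmask;
};

static struct model_rcu_node model_tree[MODEL_NR_NODES];

/* Start a grace period: every child of every node owes a quiescent state. */
static void model_tree_start_gp(void)
{
    for (int i = 0; i < MODEL_NR_NODES; i++)
        model_tree[i].qsmask = (1u << MODEL_FANOUT) - 1;
}

/* Report a quiescent state for cpu. Returns true if the root mask is
 * empty after this report, i.e. the grace period can end. */
static bool model_tree_report_qs(int cpu)
{
    assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
    int node = MODEL_FIRST_LEAF + cpu / MODEL_FANOUT;
    unsigned int bit = 1u << (cpu % MODEL_FANOUT);

    for (;;) {
        model_tree[node].qsmask &= ~bit;
        if (model_tree[node].qsmask != 0)
            return false; /* siblings still pending: stop climbing */
        if (node == 0)
            return true;  /* the root is clear */
        /* Move to the parent and clear the bit that stands for this node. */
        bit = 1u << ((node - 1) % MODEL_FANOUT);
        node = (node - 1) / MODEL_FANOUT;
    }
}
```

Most reports stop at the leaf, so in a locked version only the last CPU of each group would ever need to take the lock of the parent node.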
So, in the rcu_init_geometry
function we just need to calculate the total number of rcu_node
structures. We start by calculating the number of jiffies
until the first and the next fqs
, which stands for force-quiescent-state
(quiescent states are described above):
d = RCU_JIFFIES_TILL_FORCE_QS + nr_cpu_ids / RCU_JIFFIES_FQS_DIV;
if (jiffies_till_first_fqs == ULONG_MAX)
    jiffies_till_first_fqs = d;
if (jiffies_till_next_fqs == ULONG_MAX)
    jiffies_till_next_fqs = d;
where:
#define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))
#define RCU_JIFFIES_FQS_DIV 256
After calculating these jiffies, we check whether the previously defined jiffies_till_first_fqs
and jiffies_till_next_fqs
variables are equal to ULONG_MAX (their default value) and, if so, set them to the calculated value. As we did not touch these variables before, they are equal to ULONG_MAX
:
static ulong jiffies_till_first_fqs = ULONG_MAX;
static ulong jiffies_till_next_fqs = ULONG_MAX;
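The arithmetic is easy to check in isolation. The function below is a hypothetical helper, not part of the kernel: it takes the tick rate and the CPU count as parameters instead of reading HZ and nr_cpu_ids, and otherwise repeats the two macros shown above.

```c
#include <assert.h>

/* Default number of jiffies before quiescent states are forced, following
 * RCU_JIFFIES_TILL_FORCE_QS and RCU_JIFFIES_FQS_DIV from the text. */
static unsigned long model_jiffies_till_fqs(unsigned int hz,
                                            unsigned int nr_cpu_ids)
{
    /* 1 jiffy at HZ <= 250, 2 at HZ <= 500, 3 above that. */
    unsigned long base = 1 + (hz > 250) + (hz > 500);

    assert(hz > 0);
    /* One extra jiffy for every 256 possible CPUs. */
    return base + nr_cpu_ids / 256;
}
```

With HZ = 1000 and 8 CPUs this gives 3 + 8 / 256 = 3 jiffies, so on small machines only the tick rate matters.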
In the next step of rcu_init_geometry
, we check whether rcu_fanout_leaf
has changed. If it still has its compile-time value, CONFIG_RCU_FANOUT_LEAF
, and the number of possible CPUs, nr_cpu_ids
, equals NR_CPUS
, the compile-time geometry is already correct and we just return:
if (rcu_fanout_leaf == CONFIG_RCU_FANOUT_LEAF &&
    nr_cpu_ids == NR_CPUS)
    return;
After this we need to compute the number of nodes that an rcu_node
tree can handle with the given number of levels:
rcu_capacity[0] = 1;
rcu_capacity[1] = rcu_fanout_leaf;
for (i = 2; i <= MAX_RCU_LVLS; i++)
    rcu_capacity[i] = rcu_capacity[i - 1] * CONFIG_RCU_FANOUT;
And in the last step we calculate the number of rcu_nodes at each level of the tree in the loop.
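The capacity table and the per-level node counts can be sketched as one standalone function. Everything named model_ is hypothetical, the leaf fanout and the inner fanout are passed as parameters rather than taken from the configuration, and the per-level loop is my reading of "the number of rcu_nodes at each level", so treat it as an illustration rather than the kernel's exact code.

```c
#include <assert.h>

#define MODEL_MAX_LVLS 4

struct model_geometry {
    int levels;                          /* number of rcu_node levels */
    int nodes_per_level[MODEL_MAX_LVLS]; /* [0] is the root level */
    int total_nodes;
};

/* Compute the tree shape for ncpus CPUs. Returns 0 on success, or -1 if
 * even MODEL_MAX_LVLS levels cannot cover that many CPUs. */
static int model_rcu_geometry(int ncpus, int fanout_leaf, int fanout,
                              struct model_geometry *g)
{
    long capacity[MODEL_MAX_LVLS + 1];

    assert(ncpus > 0 && fanout_leaf > 1 && fanout > 1);

    /* capacity[i]: CPUs that a subtree with i levels of nodes can cover. */
    capacity[0] = 1;
    capacity[1] = fanout_leaf;
    for (int i = 2; i <= MODEL_MAX_LVLS; i++)
        capacity[i] = capacity[i - 1] * fanout;

    /* Pick the smallest number of levels that covers all CPUs. */
    for (int i = 1; i <= MODEL_MAX_LVLS; i++) {
        if (ncpus > capacity[i])
            continue;
        g->levels = i;
        g->total_nodes = 0;
        for (int j = 0; j < MODEL_MAX_LVLS; j++) {
            /* A node at level j covers capacity[i - j] CPUs: round up. */
            long cap = j < i ? capacity[i - j] : 0;

            g->nodes_per_level[j] = cap ? (int)((ncpus + cap - 1) / cap) : 0;
            g->total_nodes += g->nodes_per_level[j];
        }
        return 0;
    }
    return -1;
}
```

For eight CPUs with both fanouts set to 2, this yields three levels with 1, 2 and 4 nodes, the seven rcu_node structures of the diagram above.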
After we have calculated the geometry of the rcu_node
tree, we need to go back to the rcu_init
function, where the next step is to initialize two rcu_state
structures with the rcu_init_one
function:
rcu_init_one(&rcu_bh_state, &rcu_bh_data);
rcu_init_one(&rcu_sched_state, &rcu_sched_data);
The rcu_init_one
function takes two arguments:
the global RCU state;
the per-CPU data for RCU.
Both variables are defined in kernel/rcu/tree.h together with their percpu
data:
extern struct rcu_state rcu_bh_state;
DECLARE_PER_CPU(struct rcu_data, rcu_bh_data);
You can read about these states here. As I wrote above, we need to initialize the rcu_state
structures, and the rcu_init_one
function will help us with it. After the rcu_state
initialization, we can see the call of __rcu_init_preempt
, which depends on the CONFIG_PREEMPT_RCU
kernel configuration option. It does the same as the previous calls: it initializes the rcu_preempt_state
structure, which also has the rcu_state
type, with the rcu_init_one
function. After this, in rcu_init
, we can see the call of the:
open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);
function. This function registers a handler for a pending interrupt
. A pending interrupt, or softirq
, means that part of the work can be postponed for later execution, when the system is less loaded. Pending interrupts are represented by the following structure:
struct softirq_action
{
    void (*action)(struct softirq_action *);
};
which is defined in include/linux/interrupt.h and contains only one field: the handler of an interrupt. You can check the softirqs
on your system with:
$ cat /proc/softirqs
                 CPU0    CPU1    CPU2    CPU3    CPU4    CPU5    CPU6    CPU7
          HI:       2       0       0       1       0       2       0       0
       TIMER:  137779  108110  139573  107647  107408  114972   99653   98665
      NET_TX:    1127       0       4       0       1       1       0       0
      NET_RX:     334     221  132939    3076     451     361     292     303
       BLOCK:    5253    5596       8     779    2016   37442      28    2855
BLOCK_IOPOLL:       0       0       0       0       0       0       0       0
     TASKLET:      66       0    2916     113       0      24   26708       0
       SCHED:  102350   75950   91705   75356   75323   82627   69279   69914
     HRTIMER:     510     302     368     260     219     255     248     246
         RCU:   81290   68062   82979   69015   68390   69385   63304   63473
The open_softirq
function takes two parameters:
index of the softirq;
handler of the softirq,
and adds the interrupt handler to the array of pending interrupts:
void open_softirq(int nr, void (*action)(struct softirq_action *))
{
    softirq_vec[nr].action = action;
}
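Registration alone does not run anything: the handler is called later, when the softirq has been marked as pending. The following userspace sketch shows that split under simplifying assumptions. The model_ names are hypothetical, there is a single global pending mask instead of a per-CPU one, and model_do_softirq is a plain loop standing in for the kernel's far more careful softirq processing.

```c
#include <assert.h>

enum {
    MODEL_TIMER_SOFTIRQ,
    MODEL_RCU_SOFTIRQ,
    MODEL_NR_SOFTIRQS
};

struct model_softirq_action {
    void (*action)(struct model_softirq_action *);
};

static struct model_softirq_action model_softirq_vec[MODEL_NR_SOFTIRQS];
static unsigned int model_pending; /* bit nr set: softirq nr was raised */
static int model_rcu_runs;         /* how often the sample handler ran */

/* Register a handler, as open_softirq does above. */
static void model_open_softirq(int nr,
                               void (*action)(struct model_softirq_action *))
{
    assert(nr >= 0 && nr < MODEL_NR_SOFTIRQS);
    model_softirq_vec[nr].action = action;
}

/* Mark softirq nr as pending; nothing is executed yet. */
static void model_raise_softirq(int nr)
{
    assert(nr >= 0 && nr < MODEL_NR_SOFTIRQS);
    model_pending |= 1u << nr;
}

/* Run the handlers of all pending softirqs; returns how many ran. */
static int model_do_softirq(void)
{
    unsigned int pending = model_pending;
    int ran = 0;

    model_pending = 0;
    for (int nr = 0; nr < MODEL_NR_SOFTIRQS; nr++) {
        if (!(pending & (1u << nr)) || !model_softirq_vec[nr].action)
            continue;
        model_softirq_vec[nr].action(&model_softirq_vec[nr]);
        ran++;
    }
    return ran;
}

/* Sample handler standing in for rcu_process_callbacks. */
static void model_rcu_handler(struct model_softirq_action *unused)
{
    (void)unused;
    model_rcu_runs++;
}
```

A raised softirq without a registered handler is simply skipped in this model, which is why registration has to happen this early in the boot.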
In our case the interrupt handler is rcu_process_callbacks
, which is defined in kernel/rcu/tree.c and does the RCU
core processing for the current CPU. After we have registered the softirq
handler for RCU
, we can see the following code:
cpu_notifier(rcu_cpu_notify, 0);
pm_notifier(rcu_pm_notify, 0);
for_each_online_cpu(cpu)
    rcu_cpu_notify(NULL, CPU_UP_PREPARE, (void *)(long)cpu);
Here we can see the registration of the cpu
notifier, which is needed on systems that support CPU hotplug; we will not dive into details about this topic. The last function in rcu_init
is rcu_early_boot_tests
:
void rcu_early_boot_tests(void)
{
    pr_info("Running RCU self tests\n");
    if (rcu_self_test)
        early_boot_test_call_rcu();
    if (rcu_self_test_bh)
        early_boot_test_call_rcu_bh();
    if (rcu_self_test_sched)
        early_boot_test_call_rcu_sched();
}
which runs self tests for RCU
.
That's all. We saw the initialization process of the RCU
subsystem. As I wrote above, there will be more about RCU
in the separate chapter about synchronization primitives.
OK, we have already passed the main topic of this part, which is RCU
initialization, but it is not the end of the Linux kernel initialization process. In the last section of this part we will see a couple of functions which run at initialization time, but we will not dive deep into the details of these functions, mainly because they are covered in other parts and chapters.
After we have initialized RCU
, the next step which you can see in init/main.c is the trace_init
function. As you can understand from its name, this function initializes the tracing subsystem. You can read more about the Linux kernel trace system here.
After the trace_init
, we can see the call of the radix_tree_init
. If you are familiar with the different data structures, you can understand from the name of this function that it initializes kernel implementation of the Radix tree. This function is defined in the lib/radix-tree.c and you can read more about it in the part about Radix tree.
In the next step we can see the functions which are related to the interrupt handling
subsystem. They are:
early_irq_init
init_IRQ
softirq_init
We will see an explanation of these functions and their implementation in the special part about interrupts and exceptions handling. After this come many different functions (like init_timers
, hrtimers_init
, time_init
, etc.) which are related to timekeeping and timers. We will see more about these functions in the chapter about timers.
The next couple of functions are related to perf events: perf_event_init
(there will be a separate chapter about perf), and to the initialization of profiling
with profile_init
. After this we enable irq
with the call of the:
local_irq_enable();
which expands to the sti
instruction, and perform the post-initialization of SLAB with the call of the kmem_cache_init_late
function (as I wrote above, we will learn about SLAB
in the Linux memory management chapter).
After the post-initialization of SLAB
, the next point is the initialization of the console with the console_init
function from the drivers/tty/tty_io.c.
After the console initialization, we can see the lockdep_info
function which prints information about the Lock dependency validator. After this, we can see the initialization of the dynamic allocation of the debug objects
with the debug_objects_mem_init
, kernel memory leak detector initialization with the kmemleak_init
, percpu
pageset setup with the setup_per_cpu_pageset
, setup of the NUMA policy with the numa_policy_init
, setting time for the scheduler with the sched_clock_init
, pidmap
initialization with the call of the pidmap_init
function for the initial PID
namespace, cache creation with the anon_vma_init
for the private virtual memory areas and early initialization of the ACPI with the acpi_early_init
.
This is the end of the ninth part of the Linux kernel initialization process, and here we saw the initialization of RCU. In the last section of this part (Rest of the initialization process
) we went through many functions but did not dive into the details of their implementations. Do not worry if you do not know anything about this stuff, or if you know about it but do not understand all of it. As I already wrote many times, we will see the details of the implementations in other parts or other chapters.
It is the end of the ninth part about the Linux kernel initialization process. In this part, we looked at the initialization process of the RCU
subsystem. In the next part we will continue to dive into the Linux kernel initialization process, and I hope that we will finish with the start_kernel
function and go on to the rest_init
function from the same init/main.c source code file, where we will see the start of the first process.
If you have any questions or suggestions, write me a comment or ping me on twitter.
Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.