Linux Insides

Timers and time management in the Linux kernel. Part 3.

The tick broadcast framework and dyntick

This is the third part of the chapter which describes timers and time management related stuff in the Linux kernel. We stopped at the clocksource framework in the previous part. We started to consider this framework because it is closely related to the special counters which are provided by the Linux kernel. One of these counters, which we already saw in the first part of this chapter, is jiffies. As I already wrote in the first part of this chapter, we will consider time management related stuff step by step during the Linux kernel initialization. The previous step was the call of the:

register_refined_jiffies(CLOCK_TICK_RATE);

function which is defined in the kernel/time/jiffies.c source code file and executes initialization of the refined_jiffies clock source for us. Recall that this function is called from the setup_arch function that is defined in the https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c source code file and executes architecture-specific (x86_64 in our case) initialization. Look at the implementation of the setup_arch and you will note that the call of the register_refined_jiffies is the last step before the setup_arch function finishes its work.

Many different x86_64 specific things are already configured by the end of the setup_arch execution. For example, some early interrupt handlers are already able to handle interrupts, memory space is reserved for the initrd, DMI is scanned, the Linux kernel log buffer is already set up, which means that the printk function is able to work, e820 is parsed so the Linux kernel already knows about available memory, and many other architecture specific things are done (if you are interested, you can read more about the setup_arch function and the Linux kernel initialization process in the second chapter of this book).

Now the setup_arch has finished its work and we can go back to the generic Linux kernel code. Recall that the setup_arch function was called from the start_kernel function which is defined in the init/main.c source code file. So, we shall return to this function. You can see that many different functions are called right after the setup_arch function inside of the start_kernel function, but since our chapter is devoted to timers and time management related stuff, we will skip all code which is not related to this topic. The first function which is related to time management in the Linux kernel is:

tick_init();

in the start_kernel. The tick_init function is defined in the kernel/time/tick-common.c source code file and does two things:

Initialization of tick broadcast framework related data structures;
Initialization of full tickless mode related data structures.

We haven't seen anything related to the tick broadcast framework in this book yet and don't know anything about tickless mode in the Linux kernel. So, the main point of this part is to look at these concepts and to learn what they are.

The idle process

First of all, let's look at the implementation of the tick_init function. As I already wrote, this function is defined in the kernel/time/tick-common.c source code file and consists of calls of the following two functions:

void __init tick_init(void)
{
	tick_broadcast_init();
	tick_nohz_init();
}

As you can understand from the paragraph's title, we are interested only in the tick_broadcast_init function for now. This function is defined in the kernel/time/tick-broadcast.c source code file and executes initialization of the tick broadcast framework related data structures. Before we look at the implementation of the tick_broadcast_init function and try to understand what this function does, we need to know about the tick broadcast framework.

The main point of a central processor is to execute programs. But sometimes a processor may be in a special state when it is not being used by any program. This special state is called idle. When the processor has nothing to execute, the Linux kernel launches the idle task. We already saw a little about this in the last part of the Linux kernel initialization process chapter. When the Linux kernel finishes all initialization processes in the start_kernel function from the init/main.c source code file, it calls the rest_init function from the same source code file. The main point of this function is to launch the kernel init thread and the kthreadd thread, to call the schedule function to start task scheduling, and to go to sleep by calling the cpu_idle_loop function that is defined in the kernel/sched/idle.c source code file.

The cpu_idle_loop function is an infinite loop which checks the need for rescheduling on each iteration. After the scheduler finds something to execute, the idle process finishes its work and control is moved to a new runnable task with the call of the schedule_preempt_disabled function:

static void cpu_idle_loop(void)
{
	while (1) {
		while (!need_resched()) {
			...
			...
			...
			/* the main idle function */
			cpuidle_idle_call();
		}
		...
		...
		...
		schedule_preempt_disabled();
	}
}

Of course, we will not consider the full implementation of the cpu_idle_loop function and details of the idle state in this part, because it is not related to our topic. But there is one interesting moment for us. We know that the processor can execute only one task at a time. How does the Linux kernel decide to reschedule and stop the idle process if the processor executes an infinite loop in the cpu_idle_loop? The answer is system timer interrupts. When an interrupt occurs, the processor stops the idle thread and transfers control to an interrupt handler. Once the system timer interrupt handler has requested a reschedule, need_resched returns true and the Linux kernel stops the idle process and transfers control to the current runnable task. But handling of the system timer interrupts is not effective for power management, because if a processor is in the idle state, there is little point in sending it a system timer interrupt.
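The interplay between the idle loop and the timer interrupt can be modelled in user space. The following is only a minimal sketch and not kernel code: resched_flag, ticks_until_wakeup, timer_interrupt and idle_until_resched are hypothetical names, and the interrupt is simulated with a plain function call.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the "reschedule needed" flag of the idle task. */
static bool resched_flag;

/* Number of simulated ticks left until some task becomes runnable. */
static int ticks_until_wakeup;

static bool need_resched(void)
{
	return resched_flag;
}

/* Models the system timer interrupt handler: on the tick where a task
 * becomes runnable it asks for a reschedule. */
static void timer_interrupt(void)
{
	if (--ticks_until_wakeup == 0)
		resched_flag = true;
}

/* Models the inner idle loop and returns how many timer interrupts the
 * idle processor had to handle before it could leave the loop. */
static int idle_until_resched(int wakeup_after_ticks)
{
	int ticks = 0;

	assert(wakeup_after_ticks > 0);
	resched_flag = false;
	ticks_until_wakeup = wakeup_after_ticks;

	while (!need_resched()) {
		/* A real processor would sleep here until the next interrupt. */
		timer_interrupt();
		ticks++;
	}
	return ticks;
}
```

In this model a call like idle_until_resched(3) handles three timer interrupts, although only the last one brings anything useful. This is exactly the cost of periodic ticks on an idle processor.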

By default, the CONFIG_HZ_PERIODIC kernel configuration option is enabled in the Linux kernel and tells it to handle each interrupt of the system timer. To solve this problem, the Linux kernel provides two additional ways of managing scheduling-clock interrupts:

The first is to omit scheduling-clock ticks on idle processors. To enable this behaviour in the Linux kernel, we need to enable the CONFIG_NO_HZ_IDLE kernel configuration option. This option allows the Linux kernel to avoid sending timer interrupts to idle processors. In this case periodic timer interrupts are replaced with on-demand interrupts. This mode is called dyntick-idle mode. But if the kernel does not handle interrupts of a system timer, how can the kernel decide if the system has nothing to do?

Whenever the idle task is selected to run, the periodic tick is disabled with the call of the tick_nohz_idle_enter function that is defined in the kernel/time/tick-sched.c source code file, and it is enabled again with the call of the tick_nohz_idle_exit function. There is a special concept in the Linux kernel called clock event devices, which are used to schedule the next interrupt. This concept provides an API for devices which can deliver interrupts at a specific time in the future, and it is represented by the clock_event_device structure in the Linux kernel. We will not dive into the implementation of the clock_event_device structure now. We will see it in the next part of this chapter. But there is one interesting moment for us right now.

The second way is to omit scheduling-clock ticks on processors that are either in the idle state or have only one runnable task, in other words on busy processors that have nothing to switch to. We can enable this feature with the CONFIG_NO_HZ_FULL kernel configuration option, and it allows us to reduce the number of timer interrupts significantly.
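The difference between the three ways of handling the scheduling-clock tick can be summarized with a tiny decision function. This is a simplified sketch: the tick_mode enum and the tick_may_stop function are hypothetical names and do not exist in the kernel, and the real kernel takes more conditions into account before it stops the tick.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names that mirror the three kernel configuration options. */
enum tick_mode {
	TICK_MODE_PERIODIC,	/* CONFIG_HZ_PERIODIC */
	TICK_MODE_NOHZ_IDLE,	/* CONFIG_NO_HZ_IDLE */
	TICK_MODE_NOHZ_FULL,	/* CONFIG_NO_HZ_FULL */
};

/* Returns true if, in this simplified model, the periodic tick may be
 * omitted on a processor with nr_running runnable tasks. */
static bool tick_may_stop(enum tick_mode mode, unsigned int nr_running)
{
	switch (mode) {
	case TICK_MODE_PERIODIC:
		return false;			/* every tick is handled */
	case TICK_MODE_NOHZ_IDLE:
		return nr_running == 0;		/* only idle processors */
	case TICK_MODE_NOHZ_FULL:
		return nr_running <= 1;		/* idle or a single task */
	}
	assert(!"unknown tick mode");
	return false;
}
```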

Besides the cpu_idle_loop, an idle processor can be in a sleeping state. The Linux kernel provides the special cpuidle framework. The main point of this framework is to put an idle processor into sleeping states. The name of the set of these states is C-states. But how will a processor be woken up if its local timer is disabled? The Linux kernel provides the tick broadcast framework for this. The main point of this framework is to assign a timer which is not affected by the C-states. This timer will wake a sleeping processor.

Now, after some theory, we can return to the implementation of our function. Let's recall that the tick_init function just calls the following two functions:

void __init tick_init(void)
{
	tick_broadcast_init();
	tick_nohz_init();
}

Let's consider the first function. The tick_broadcast_init function is defined in the kernel/time/tick-broadcast.c source code file and executes initialization of the tick broadcast framework related data structures. Let's look at the implementation of the tick_broadcast_init function:

void __init tick_broadcast_init(void)
{
        zalloc_cpumask_var(&tick_broadcast_mask, GFP_NOWAIT);
        zalloc_cpumask_var(&tick_broadcast_on, GFP_NOWAIT);
        zalloc_cpumask_var(&tmpmask, GFP_NOWAIT);
#ifdef CONFIG_TICK_ONESHOT
        zalloc_cpumask_var(&tick_broadcast_oneshot_mask, GFP_NOWAIT);
        zalloc_cpumask_var(&tick_broadcast_pending_mask, GFP_NOWAIT);
        zalloc_cpumask_var(&tick_broadcast_force_mask, GFP_NOWAIT);
#endif
}

As we can see, the tick_broadcast_init function allocates different cpumasks with the help of the zalloc_cpumask_var function. The zalloc_cpumask_var function is defined in the lib/cpumask.c source code file and expands to the call of the following function:

bool zalloc_cpumask_var(cpumask_var_t *mask, gfp_t flags)
{
        return alloc_cpumask_var(mask, flags | __GFP_ZERO);
}

Ultimately, the memory space for the given cpumask is allocated with the given flags with the help of the kmalloc_node function:

*mask = kmalloc_node(cpumask_size(), flags, node);
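To get a feeling for what such a zeroed cpumask is, here is a small user-space sketch of a bitmap with one bit per processor. All names with the toy_ prefix and the TOY_NR_CPUS constant are hypothetical and only mimic the kernel API, and calloc plays the role of an allocation with the __GFP_ZERO flag.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

#define TOY_NR_CPUS		128
#define TOY_BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)
#define TOY_MASK_LONGS \
	((TOY_NR_CPUS + TOY_BITS_PER_LONG - 1) / TOY_BITS_PER_LONG)

typedef unsigned long *toy_cpumask_t;

/* Allocates a mask with all bits cleared; returns false on failure. */
static bool toy_zalloc_cpumask(toy_cpumask_t *mask)
{
	*mask = calloc(TOY_MASK_LONGS, sizeof(unsigned long));
	return *mask != NULL;
}

/* Sets the bit that belongs to the given processor. */
static void toy_cpumask_set_cpu(unsigned int cpu, toy_cpumask_t mask)
{
	assert(cpu < TOY_NR_CPUS);
	mask[cpu / TOY_BITS_PER_LONG] |= 1UL << (cpu % TOY_BITS_PER_LONG);
}

/* Tests the bit that belongs to the given processor. */
static bool toy_cpumask_test_cpu(unsigned int cpu, toy_cpumask_t mask)
{
	assert(cpu < TOY_NR_CPUS);
	return (mask[cpu / TOY_BITS_PER_LONG] >> (cpu % TOY_BITS_PER_LONG)) & 1UL;
}

/* Returns true if no processor is set in the mask. */
static bool toy_cpumask_empty(toy_cpumask_t mask)
{
	for (size_t i = 0; i < TOY_MASK_LONGS; i++)
		if (mask[i])
			return false;
	return true;
}
```

A freshly allocated mask is empty, and after toy_cpumask_set_cpu(70, mask) only the processor 70 is reported by toy_cpumask_test_cpu. The caller has to release the mask with free.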

Now let's look at the cpumasks that are initialized in the tick_broadcast_init function. As we can see, the tick_broadcast_init function initializes six cpumasks, and moreover, initialization of the last three cpumasks depends on the CONFIG_TICK_ONESHOT kernel configuration option.

The first three cpumasks are:

tick_broadcast_mask - the bitmap which represents the list of processors that are in a sleeping mode;
tick_broadcast_on - the bitmap that stores the numbers of processors which are in a periodic broadcast state;
tmpmask - the bitmap for temporary usage.

As we already know, the next three cpumasks depend on the CONFIG_TICK_ONESHOT kernel configuration option. Actually, each clock event device can be in one of two modes:

periodic - clock events devices that support periodic events;
oneshot - clock events devices that are capable of issuing events that happen only once.

The Linux kernel defines two masks for such clock events devices in the include/linux/clockchips.h header file:

#define CLOCK_EVT_FEAT_PERIODIC        0x000001
#define CLOCK_EVT_FEAT_ONESHOT         0x000002

So, the last three cpumasks are:

tick_broadcast_oneshot_mask - stores the numbers of processors that must be notified;
tick_broadcast_pending_mask - stores the numbers of processors with a pending broadcast;
tick_broadcast_force_mask - stores the numbers of processors with an enforced broadcast.

We have initialized six cpumasks in the tick broadcast framework, and now we can proceed to the implementation of this framework.

The tick broadcast framework

Hardware may provide some clock source devices. When a processor sleeps and its local timer is stopped, there must be an additional clock source device that will handle the awakening of the processor. The Linux kernel uses these special clock source devices which can raise an interrupt at a specified time. We already know that such timers are called clock events devices in the Linux kernel. Besides clock events devices, each processor in the system has its own local timer which is programmed to issue an interrupt at the time of the next deferred task. These timers can also be programmed to do a periodic job, like updating jiffies and so on. These timers are represented by the tick_device structure in the Linux kernel. This structure is defined in the kernel/time/tick-sched.h header file and looks like this:

struct tick_device {
        struct clock_event_device *evtdev;
        enum tick_device_mode mode;
};

Note that the tick_device structure contains two fields. The first field, evtdev, is a pointer to the clock_event_device structure that is defined in the include/linux/clockchips.h header file and represents the descriptor of a clock event device. A clock event device allows us to register an event that will happen in the future. As I already wrote, we will not consider the clock_event_device structure and related API in this part, but will see it in the next part.

The second field of the tick_device structure represents the mode of the tick_device. As we already know, the mode can be one of the following:

enum tick_device_mode {
        TICKDEV_MODE_PERIODIC,
        TICKDEV_MODE_ONESHOT,
};

Each clock events device in the system registers itself by the call of the clockevents_register_device function or the clockevents_config_and_register function during the initialization process of the Linux kernel. During the registration of a new clock events device, the Linux kernel calls the tick_check_new_device function that is defined in the kernel/time/tick-common.c source code file and checks whether the given clock events device should be used by the Linux kernel. After all checks, the tick_check_new_device function executes a call of the:

tick_install_broadcast_device(newdev);

function that checks whether the given clock event device can be a broadcast device and installs it if so. Let's look at the implementation of the tick_install_broadcast_device function:

void tick_install_broadcast_device(struct clock_event_device *dev)
{
	struct clock_event_device *cur = tick_broadcast_device.evtdev;

	if (!tick_check_broadcast_device(cur, dev))
		return;

	if (!try_module_get(dev->owner))
		return;

	clockevents_exchange_device(cur, dev);

	if (cur)
		cur->event_handler = clockevents_handle_noop;

	tick_broadcast_device.evtdev = dev;

	if (!cpumask_empty(tick_broadcast_mask))
		tick_broadcast_start_periodic(dev);

	if (dev->features & CLOCK_EVT_FEAT_ONESHOT)
		tick_clock_notify();
}

First of all we get the current clock event device from the tick_broadcast_device. The tick_broadcast_device is defined in the kernel/time/tick-broadcast.c source code file:

static struct tick_device tick_broadcast_device;

and represents an external clock device that keeps track of events for a processor. The first step after we got the current clock device is the call of the tick_check_broadcast_device function, which checks that the given clock events device can be used as a broadcast device. The main point of the tick_check_broadcast_device function is to check the value of the features field of the given clock events device. As we can understand from the name of this field, the features field contains the features of a clock event device. The available values are defined in the include/linux/clockchips.h header file, for example CLOCK_EVT_FEAT_PERIODIC, which marks a clock events device that supports periodic events. So, the tick_check_broadcast_device function checks the features flags for CLOCK_EVT_FEAT_DUMMY, CLOCK_EVT_FEAT_PERCPU and CLOCK_EVT_FEAT_C3STOP and returns false if the given clock events device has one of these features. It also returns false if the broadcast device works in oneshot mode but the given device does not support CLOCK_EVT_FEAT_ONESHOT. Otherwise the tick_check_broadcast_device function compares the ratings of the given clock event device and the current clock event device and returns true if there is no current device or the given device has a better rating.
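The idea of this check can be shown with a reduced user-space model. It is a sketch under stated assumptions: the toy_clock_event_device structure and the toy_check_broadcast_device function are hypothetical, only the dummy feature is rejected here, and the value of TOY_FEAT_DUMMY is a placeholder (only the periodic and oneshot values come from the text above).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_FEAT_PERIODIC	0x000001	/* value from the text above */
#define TOY_FEAT_ONESHOT	0x000002	/* value from the text above */
#define TOY_FEAT_DUMMY		0x000010	/* placeholder value */

/* Hypothetical, heavily reduced clock event device descriptor. */
struct toy_clock_event_device {
	unsigned int features;
	int rating;
};

/* Returns true if newdev should replace cur as the broadcast device:
 * unsuitable devices are rejected first, then the ratings decide. */
static bool toy_check_broadcast_device(const struct toy_clock_event_device *cur,
				       const struct toy_clock_event_device *newdev)
{
	assert(newdev != NULL);

	if (newdev->features & TOY_FEAT_DUMMY)
		return false;

	return cur == NULL || newdev->rating > cur->rating;
}
```

With no current device any suitable device wins, a device with a higher rating replaces a device with a lower one, and a dummy device is never accepted regardless of its rating.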

After the tick_check_broadcast_device function, we can see the call of the try_module_get function, which tries to take a reference on the module that owns the clock events device. We need this to be sure that the owner of the given clock events device is alive and cannot be unloaded while the device is in use. The next step is the call of the clockevents_exchange_device function that is defined in the kernel/time/clockevents.c source code file and releases the old clock events device. After this, the event handler of the old device is replaced with the dummy clockevents_handle_noop handler.

In the last step of the tick_install_broadcast_device function we check that the tick_broadcast_mask is not empty and start the given clock events device in periodic mode with the call of the tick_broadcast_start_periodic function:

if (!cpumask_empty(tick_broadcast_mask))
	tick_broadcast_start_periodic(dev);

if (dev->features & CLOCK_EVT_FEAT_ONESHOT)
	tick_clock_notify();

The tick_broadcast_mask is filled in the tick_device_uses_broadcast function that checks a clock events device during its registration:

int cpu = smp_processor_id();

int tick_device_uses_broadcast(struct clock_event_device *dev, int cpu)
{
	...
	...
	...
	if (!tick_device_is_functional(dev)) {
		...
		cpumask_set_cpu(cpu, tick_broadcast_mask);
		...
	}
	...
	...
	...
}

You can read more about the smp_processor_id macro in the fourth part of the Linux kernel initialization process chapter.

The tick_broadcast_start_periodic function checks the given clock event device and calls the tick_setup_periodic function:

static void tick_broadcast_start_periodic(struct clock_event_device *bc)
{
	if (bc)
		tick_setup_periodic(bc, 1);
}

that is defined in the kernel/time/tick-common.c source code file and sets the broadcast handler for the given clock event device by the call of the following function:

tick_set_periodic_handler(dev, broadcast);

This function checks the second parameter, which represents the broadcast state (on or off), and sets the handler depending on its value:

void tick_set_periodic_handler(struct clock_event_device *dev, int broadcast)
{
	if (!broadcast)
		dev->event_handler = tick_handle_periodic;
	else
		dev->event_handler = tick_handle_periodic_broadcast;
}

When a clock event device issues an interrupt, the dev->event_handler will be called. For example, let's look at the interrupt handler of the high precision event timer which is located in the arch/x86/kernel/hpet.c source code file:

static irqreturn_t hpet_interrupt_handler(int irq, void *data)
{
	struct hpet_dev *dev = (struct hpet_dev *)data;
	struct clock_event_device *hevt = &dev->evt;

	if (!hevt->event_handler) {
		printk(KERN_INFO "Spurious HPET timer interrupt on HPET timer %d\n",
				dev->num);
		return IRQ_HANDLED;
	}

	hevt->event_handler(hevt);
	return IRQ_HANDLED;
}

The hpet_interrupt_handler gets the irq specific data and checks the event handler of the clock event device. Recall that we just set it in the tick_set_periodic_handler function. So the tick_handle_periodic_broadcast function will be called at the end of the high precision event timer interrupt handler.
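The pattern of installing a handler and calling it later from an interrupt can be sketched in plain C with a function pointer. Everything with the toy_ prefix is hypothetical; the two counters exist only to make the effect of the handlers observable.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced clock event device with an event handler. */
struct toy_clock_event_device {
	void (*event_handler)(struct toy_clock_event_device *dev);
	int periodic_calls;	/* how often the local handler ran */
	int broadcast_calls;	/* how often the broadcast handler ran */
};

static void toy_handle_periodic(struct toy_clock_event_device *dev)
{
	dev->periodic_calls++;
}

static void toy_handle_periodic_broadcast(struct toy_clock_event_device *dev)
{
	dev->broadcast_calls++;
}

/* Chooses the handler depending on the broadcast state (0 is off). */
static void toy_set_periodic_handler(struct toy_clock_event_device *dev,
				     int broadcast)
{
	assert(dev != NULL);
	if (!broadcast)
		dev->event_handler = toy_handle_periodic;
	else
		dev->event_handler = toy_handle_periodic_broadcast;
}

/* Models the timer interrupt handler: returns -1 for a spurious
 * interrupt (no handler installed) and 0 after a dispatched event. */
static int toy_timer_interrupt(struct toy_clock_event_device *dev)
{
	assert(dev != NULL);
	if (!dev->event_handler)
		return -1;

	dev->event_handler(dev);
	return 0;
}
```

The interrupt code does not need to know which handler is installed. It only has to check that there is one, and the mode of the device decides what really happens on each interrupt.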

The tick_handle_periodic_broadcast function calls the

bc_local = tick_do_periodic_broadcast();

function which stores the numbers of processors which have asked to be woken up in the temporary cpumask and calls the tick_do_broadcast function:

cpumask_and(tmpmask, cpu_online_mask, tick_broadcast_mask);
return tick_do_broadcast(tmpmask);

The tick_do_broadcast calls the broadcast function of the given clock events device, which sends an IPI to the set of processors. In the end we can call the event handler of the given tick_device:

if (bc_local)
	td->evtdev->event_handler(td->evtdev);

which actually represents the interrupt handler of the local timer of a processor. After this the processor will wake up. That is all about the tick broadcast framework in the Linux kernel. We have missed some aspects of this framework, for example reprogramming of a clock event device, broadcast with the oneshot timer and so on. But the Linux kernel is very big and it is not realistic to cover all aspects of it. I think it will be interesting to dive into them yourself.
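To make the periodic broadcast path a bit more tangible, here is a small user-space sketch of it. This is a model under assumptions and not the kernel implementation: processors are bits of a 64-bit integer, the toy_ functions are hypothetical, and sending an IPI is replaced by incrementing a counter. The return value plays the role of bc_local: it tells the caller that the current processor itself is waiting for the broadcast, so its local handler should be called directly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NR_CPUS 64

/* Counts the simulated IPIs that were delivered to each processor. */
static unsigned int toy_ipi_count[TOY_NR_CPUS];

/* Stand-in for the broadcast function of a clock events device. */
static void toy_send_ipi(uint64_t mask)
{
	for (unsigned int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (mask & (UINT64_C(1) << cpu))
			toy_ipi_count[cpu]++;
}

/* Wakes up all online processors that wait for the broadcast.
 * Returns true if this_cpu is one of them; in this case no IPI is
 * sent to it, and the caller handles the event locally. */
static bool toy_do_periodic_broadcast(uint64_t online, uint64_t broadcast,
				      unsigned int this_cpu)
{
	uint64_t tmpmask = online & broadcast;	/* like cpumask_and() */
	uint64_t self;
	bool local = false;

	assert(this_cpu < TOY_NR_CPUS);
	self = UINT64_C(1) << this_cpu;

	if (tmpmask & self) {
		tmpmask &= ~self;
		local = true;
	}

	if (tmpmask)
		toy_send_ipi(tmpmask);

	return local;
}
```

For example, with the processors 0-3 online (0x0F), the processors 1, 2 and 5 in the broadcast mask (0x26) and the code running on the processor 1, only the processor 2 receives an IPI and the function returns true. The processor 5 is skipped because it is not online.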

If you remember, we started this part with the call of the tick_init function. We have just considered the tick_broadcast_init function and the related theory, but the tick_init function contains a call of another function, and this function is tick_nohz_init. Let's look at the implementation of this function.

Initialization of dyntick related data structures

We already saw some information about the dyntick concept in this part and we know that this concept allows the kernel to disable system timer interrupts in the idle state. The tick_nohz_init function initializes the different data structures which are related to this concept. This function is defined in the kernel/time/tick-sched.c source code file and starts with a check of the value of the tick_nohz_full_running variable, which represents the state of the tickless mode for the idle state and for the state when system timer interrupts are disabled while a processor has only one runnable task:

if (!tick_nohz_full_running) {
	if (tick_nohz_init_all() < 0)
		return;
}

If this mode is not running, we call the tick_nohz_init_all function that is defined in the same source code file and check its result. The tick_nohz_init_all function tries to allocate the tick_nohz_full_mask with the call of the alloc_cpumask_var function. The tick_nohz_full_mask will store the numbers of processors that have full NO_HZ enabled. After successful allocation of the tick_nohz_full_mask we set all bits in the tick_nohz_full_mask, set the tick_nohz_full_running variable to true and return the result to the tick_nohz_init function:

static int tick_nohz_init_all(void)
{
        int err = -1;
#ifdef CONFIG_NO_HZ_FULL_ALL
        if (!alloc_cpumask_var(&tick_nohz_full_mask, GFP_KERNEL)) {
                WARN(1, "NO_HZ: Can't allocate full dynticks cpumask\n");
                return err;
        }
        err = 0;
        cpumask_setall(tick_nohz_full_mask);
        tick_nohz_full_running = true;
#endif
        return err;
}

In the next step we try to allocate a memory space for the housekeeping_mask:

if (!alloc_cpumask_var(&housekeeping_mask, GFP_KERNEL)) {
	WARN(1, "NO_HZ: Can't allocate not-full dynticks cpumask\n");
	cpumask_clear(tick_nohz_full_mask);
	tick_nohz_full_running = false;
	return;
}

This cpumask will store the numbers of processors used for housekeeping. In other words, we need at least one processor that will not be in NO_HZ mode, because it will do timekeeping and so on. After this we check the result of the architecture-specific arch_irq_work_has_interrupt function. This function checks the ability to send an inter-processor interrupt on the given architecture. We need to check this because the system timer of a processor will be disabled during NO_HZ mode, so there must be at least one online processor which can send an inter-processor interrupt to wake up a processor whose tick is stopped. This function is defined in the arch/x86/include/asm/irq_work.h header file for the x86_64 and just checks, based on the CPUID, that a processor has an APIC:

static inline bool arch_irq_work_has_interrupt(void)
{
    return cpu_has_apic;
}

If a processor does not have an APIC, the Linux kernel prints a warning message, clears the tick_nohz_full_mask cpumask, copies the numbers of all possible processors in the system to the housekeeping_mask and resets the value of the tick_nohz_full_running variable:

if (!arch_irq_work_has_interrupt()) {
	pr_warning("NO_HZ: Can't run full dynticks because arch doesn't "
		   "support irq work self-IPIs\n");
	cpumask_clear(tick_nohz_full_mask);
	cpumask_copy(housekeeping_mask, cpu_possible_mask);
	tick_nohz_full_running = false;
	return;
}

After this step, we get the number of the current processor by the call of the smp_processor_id and check this processor in the tick_nohz_full_mask. If the tick_nohz_full_mask contains the given processor, we clear the appropriate bit in the tick_nohz_full_mask:

cpu = smp_processor_id();

if (cpumask_test_cpu(cpu, tick_nohz_full_mask)) {
	pr_warning("NO_HZ: Clearing %d from nohz_full range for timekeeping\n", cpu);
	cpumask_clear_cpu(cpu, tick_nohz_full_mask);
}

We do this because this processor will be used for timekeeping. After this step we put into the housekeeping_mask the numbers of all processors that are in the cpu_possible_mask and not in the tick_nohz_full_mask:

cpumask_andnot(housekeeping_mask,
	       cpu_possible_mask, tick_nohz_full_mask);

After this operation, the housekeeping_mask will contain all processors of the system that are not in the tick_nohz_full_mask, including the processor reserved for timekeeping. In the last step of the tick_nohz_init function, we go through all processors that are set in the tick_nohz_full_mask and call the following function for each of them:

for_each_cpu(cpu, tick_nohz_full_mask)
	context_tracking_cpu_set(cpu);

The context_tracking_cpu_set function is defined in the kernel/context_tracking.c source code file, and the main point of this function is to set the context_tracking.active percpu variable to true. When the active field is set to true for a certain processor, the Linux kernel context tracking subsystem tracks the transitions between user and kernel mode for this processor. While it is false, the context tracking callbacks are ignored for this processor.
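Going back to the cpumask arithmetic of the previous steps, the split between full dynticks processors and housekeeping processors can be checked with a short user-space sketch. The toy_nohz_masks structure and the toy_nohz_split function are hypothetical names, and the cpumasks are again modelled as bits of a 64-bit integer.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical pair of masks, one bit per processor. */
struct toy_nohz_masks {
	uint64_t nohz_full;	/* processors in full dynticks mode */
	uint64_t housekeeping;	/* processors that keep their tick */
};

/* Removes the boot processor from the full dynticks mask (it does the
 * timekeeping) and puts every other possible processor that is not in
 * full dynticks mode into the housekeeping mask. */
static struct toy_nohz_masks toy_nohz_split(uint64_t possible,
					    uint64_t nohz_full,
					    unsigned int boot_cpu)
{
	struct toy_nohz_masks m;

	assert(boot_cpu < 64);

	/* like cpumask_clear_cpu(cpu, tick_nohz_full_mask) */
	nohz_full &= ~(UINT64_C(1) << boot_cpu);

	m.nohz_full = nohz_full;
	/* like cpumask_andnot(housekeeping_mask, possible, nohz_full) */
	m.housekeeping = possible & ~nohz_full;
	return m;
}
```

With four possible processors (0x0F), all of them requested for full dynticks mode and the processor 0 as the boot processor, the full dynticks mask becomes 0x0E and the housekeeping mask becomes 0x01. So the timekeeping processor ends up in the housekeeping mask.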

That's all. This is the end of the tick_nohz_init function. After this, the NO_HZ related data structures are initialized. We didn't see the API of the NO_HZ mode, but will see it soon.

Conclusion

This is the end of the third part of the chapter that describes timers and time management related stuff in the Linux kernel. In the previous part we got acquainted with the clocksource concept in the Linux kernel, which represents a framework for managing different clock sources in an interrupt and hardware characteristics independent way. In this part we continued to look at the Linux kernel initialization process in a time management context and got acquainted with two concepts that are new for us: the tick broadcast framework and tickless mode. The first concept helps the Linux kernel to deal with processors which are in deep sleep, and the second concept represents the mode in which the kernel may work to improve power management of idle processors.

In the next part we will continue to dive into timer management related things in the Linux kernel and will see a new concept for us: timers.

If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email or just create an issue.

Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to linux-insides.
