ARM assembler in Raspberry Pi – Chapter 24
Today we will continue with nested functions.
Sorting
We will first take a detour. The C function qsort
can be used to sort any kind of array. Its C signature is the following.
void qsort(void *base, size_t nmemb, size_t size, int (*compar)(const void *, const void *)); |
qsort
returns void
, this is, it does not return anything because it performs the sort in place. This means that we will pass a (potentially unsorted) array called base
of length nmemb
to qsort
. When qsort
returns, the elements in this array will be sorted. If qsort
were able to just sort a specific kind of arrays it would be rather limited. In order to be able to sort any array, qsort
requires the size
of each element in the array. Note that the array is passed by reference (otherwise the in place sorting would not be possible): void*
is the C way to say «I accept an address to any kind of data».
We will come back later to the compar
bit of qsort
.
Print an array
Before we sort an array, we need a way to examine it. We will use for that a function print_array
that prints an array of integers.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 |
/* print-array.s */ .data /* declare an array of 10 integers called my_array */ .align 4 my_array: .word 82, 70, 93, 77, 91, 30, 42, 6, 92, 64 /* format strings for printf */ /* format string that prints an integer plus a space */ .align 4 integer_printf: .asciz "%d " /* format string that simply prints a newline */ .align 4 newline_printf: .asciz "\n" .text print_array: /* r0 will be the address of the integer array */ /* r1 will be the number of items in the array */ push {r4, r5, r6, lr} /* keep r4, r5, r6 and lr in the stack */ mov r4, r0 /* r4 ← r0. keep the address of the array */ mov r5, r1 /* r5 ← r1. keep the number of items */ mov r6, #0 /* r6 ← 0. current item to print */ b .Lprint_array_check_loop /* go to the condition check of the loop */ .Lprint_array_loop: /* prepare the call to printf */ ldr r0, addr_of_integer_printf /* r0 ← &integer_printf */ /* load the item r6 in the array in address r4. elements are of size 4 bytes so we need to multiply r6 by 4 */ ldr r1, [r4, +r6, LSL #2] /* r1 ← *(r4 + r6 << 2) this is the same as r1 ← *(r4 + r6 * 4) */ bl printf /* call printf */ add r6, r6, #1 /* r6 ← r6 + 1 */ .Lprint_array_check_loop: cmp r6, r5 /* perform r6 - r5 and update cpsr */ bne .Lprint_array_loop /* if cpsr states that r6 is not equal to r5 branch to the body of the loop */ /* prepare call to printf */ ldr r0, addr_of_newline_printf /* r0 ← &newline_printf */ bl printf pop {r4, r5, r6, lr} /* restore r4, r5, r6 and lr from the stack */ bx lr /* return */ addr_of_integer_printf: .word integer_printf addr_of_newline_printf: .word newline_printf .globl main main: push {r4, lr} /* keep r4 and lr in the stack */ /* prepare call to print_array */ ldr r0, addr_of_my_array /* r0 ← &my_array */ mov r1, #10 /* r1 ← 10 our array is of length 10 */ bl print_array /* call print_array */ mov r0, #0 /* r0 ← 0 set errorcode to 0 prior returning from main */ pop {r4, lr} /* restore r4 and lr in the stack */ bx lr /* return */ addr_of_my_array: .word my_array |
.data
/* declare an array of 10 integers called my_array */ .align 4 my_array: .word 82, 70, 93, 77, 91, 30, 42, 6, 92, 64
/* format strings for printf / / format string that prints an integer plus a space / .align 4 integer_printf: .asciz "%d " / format string that simply prints a newline */ .align 4 newline_printf: .asciz "\n"
.text
print_array: /* r0 will be the address of the integer array / / r1 will be the number of items in the array / push {r4, r5, r6, lr} / keep r4, r5, r6 and lr in the stack */
mov r4, r0 /* r4 ← r0. keep the address of the array */
mov r5, r1 /* r5 ← r1. keep the number of items */
mov r6, #0 /* r6 ← 0. current item to print */
b .Lprint_array_check_loop /* go to the condition check of the loop */
.Lprint_array_loop:
/* prepare the call to printf */
ldr r0, addr_of_integer_printf /* r0 ← &integer_printf */
/* load the item r6 in the array in address r4.
elements are of size 4 bytes so we need to multiply r6 by 4 */
ldr r1, [r4, +r6, LSL #2] /* r1 ← *(r4 + r6 << 2)
this is the same as
r1 ← *(r4 + r6 * 4) */
bl printf /* call printf */
add r6, r6, #1 /* r6 ← r6 + 1 */
.Lprint_array_check_loop:
cmp r6, r5 /* perform r6 - r5 and update cpsr */
bne .Lprint_array_loop /* if cpsr states that r6 is not equal to r5
branch to the body of the loop */
/* prepare call to printf */
ldr r0, addr_of_newline_printf /* r0 ← &newline_printf */
bl printf
pop {r4, r5, r6, lr} /* restore r4, r5, r6 and lr from the stack */
bx lr /* return */
addr_of_integer_printf: .word integer_printf addr_of_newline_printf: .word newline_printf
.globl main main: push {r4, lr} /* keep r4 and lr in the stack */
/* prepare call to print_array */
ldr r0, addr_of_my_array /* r0 ← &my_array */
mov r1, #10 /* r1 ← 10
our array is of length 10 */
bl print_array /* call print_array */
mov r0, #0 /* r0 ← 0 set errorcode to 0 prior returning from main */
pop {r4, lr} /* restore r4 and lr in the stack */
bx lr /* return */
addr_of_my_array: .word my_array
The code above is pretty straightforward and it does not feature anything that has not been seen in previous installments. Running it simply prints the current contents of the array.
$ ./print-array 82 70 93 77 91 30 42 6 92 64 |
Comparison
Above, when we talked about qsort
we skipped the compar
parameter. What is compar
? It is a an address to a function. The funky syntax for C tells us that this function, if it is ever called, will be passed two addreses (again, it does not care what they are, so they are void*
) and returns an integer. The manual of qsort explains that this function has to return lower than zero, zero or greater than zero. If the object in the address of the first parameter of compar
is lower than the object in the address of the second parameter, then it has to return lower than zero. If they are equal, it should return zero. If the first object is greater than the second, then it should return greater than zero.
If you wonder why the parameters of compar
are actually const void*
rather than void*
, it is the C way of telling us that the data of the referenced objects cannot change during the comparison. This may sound obvious given that changing things is not the job of a comparison function. Passing them by reference would let us to change them. So this is reminder that we should not.
Since our array is an array of integers we will have to compare integers: let’s write a function that, given two pointers to integers (i.e. addresses) behaves as stated above.
19 20 21 22 23 24 25 26 27 28 29 30 31 |
integer_comparison: /* r0 will be the address to the first integer */ /* r1 will be the address to the second integer */ ldr r0, [r0] /* r0 ← *r0 load the integer pointed by r0 in r0 */ ldr r1, [r1] /* r1 ← *r1 load the integer pointed by r1 in r1 */ cmp r0, r1 /* compute r0 - r1 and update cpsr */ moveq r0, #0 /* if cpsr means that r0 == r1 then r0 ← 0 */ movlt r0, #-1 /* if cpsr means that r0 < r1 then r0 ← -1 */ movgt r0, #1 /* if cpsr means that r0 > r1 then r0 ← 1 */ bx lr /* return */ |
cmp r0, r1 /* compute r0 - r1 and update cpsr */
moveq r0, #0 /* if cpsr means that r0 == r1 then r0 ← 0 */
movlt r0, #-1 /* if cpsr means that r0 < r1 then r0 ← -1 */
movgt r0, #1 /* if cpsr means that r0 > r1 then r0 ← 1 */
bx lr /* return */</p></div>
Function integer_comparison
does not feature anything new either: it simply avoids branches by using predication as we saw in chapter 11.
Now we have the last missing bit to be able to call qsort
. Here is a program that prints (only main
is shown) the array twice, before sorting it and after sorting it.
66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 |
.globl main main: push {r4, lr} /* keep r4 and lr in the stack */ /* prepare call to print_array */ ldr r0, addr_of_my_array /* r0 ← &my_array */ mov r1, #10 /* r1 ← 10 our array is of length 10 */ bl print_array /* call print_array */ /* prepare call to qsort */ /* void qsort(void *base, size_t nmemb, size_t size, int (*compar)(const void *, const void *)); */ ldr r0, addr_of_my_array /* r0 ← &my_array base */ mov r1, #10 /* r1 ← 10 nmemb = number of members our array is 10 elements long */ mov r2, #4 /* r2 ← 4 size of each member is 4 bytes */ ldr r3, addr_of_integer_comparison /* r3 ← &integer_comparison compar */ bl qsort /* call qsort */ /* now print again to see if elements were sorted */ /* prepare call to print_array */ ldr r0, addr_of_my_array /* r0 ← &my_array */ mov r1, #10 /* r1 ← 10 our array is of length 10 */ bl print_array /* call print_array */ mov r0, #0 /* r0 ← 0 set errorcode to 0 prior returning from main */ pop {r4, lr} /* restore r4 and lr in the stack */ bx lr /* return */ addr_of_my_array: .word my_array addr_of_integer_comparison : .word integer_comparison |
/* prepare call to print_array */
ldr r0, addr_of_my_array /* r0 ← &my_array */
mov r1, #10 /* r1 ← 10
our array is of length 10 */
bl print_array /* call print_array */
/* prepare call to qsort */
/*
void qsort(void *base,
size_t nmemb,
size_t size,
int (*compar)(const void *, const void *));
*/
ldr r0, addr_of_my_array /* r0 ← &my_array
base */
mov r1, #10 /* r1 ← 10
nmemb = number of members
our array is 10 elements long */
mov r2, #4 /* r2 ← 4
size of each member is 4 bytes */
ldr r3, addr_of_integer_comparison
/* r3 ← &integer_comparison
compar */
bl qsort /* call qsort */
/* now print again to see if elements were sorted */
/* prepare call to print_array */
ldr r0, addr_of_my_array /* r0 ← &my_array */
mov r1, #10 /* r1 ← 10
our array is of length 10 */
bl print_array /* call print_array */
mov r0, #0 /* r0 ← 0 set errorcode to 0 prior returning from main */
pop {r4, lr} /* restore r4 and lr in the stack */
bx lr /* return */
addr_of_my_array: .word my_array addr_of_integer_comparison : .word integer_comparison
If we put everything together, we can verify that our array is effectively sorted after the call to the qsort
function.
$ ./sort-array 82 70 93 77 91 30 42 6 92 64 6 30 42 64 70 77 82 91 92 93 |
What is going on?
C function qsort
implements a sorting algorithm (the C Standard does not specify which one must be but it is usually a fine-tuned version of quicksort) which at some point will require to compare two elements. To do this, qsort
calls the function compar
.
Count how many comparisons happen
Now, we want to count how many comparisons (i.e., how many times integer_comparison
is called) when sorting the array. We could change integer_comparison
so it increments a global counter.
.data global_counter: .word 0 .text integer_comparison_count_global: /* r0 will be the address to the first integer */ /* r1 will be the address to the second integer */ push {r4, r5} /* keep callee-saved registers */ ldr r0, [r0] /* r0 ← *r0 load the integer pointed by r0 in r0 */ ldr r1, [r1] /* r1 ← *r1 load the integer pointed by r1 in r1 */ cmp r0, r1 /* compute r0 - r1 and update cpsr */ moveq r0, #0 /* if cpsr means that r0 == r1 then r0 ← 0 */ movlt r0, #-1 /* if cpsr means that r0 < r1 then r0 ← -1 */ movgt r0, #1 /* if cpsr means that r0 > r1 then r0 ← 1 */ ldr r4, addr_of_global_counter /* r4 ← &global_counter */ ldr r5, [r4] /* r5 ← *r4 */ add r5, r5, #1 /* r5 ← r5 + 1 */ str r5, [r4] /* *r4 ← r5 */ pop {r4, r5} /* restore callee-saved registers */ bx lr /* return */ addr_of_global_counter: .word global_counter |
.text integer_comparison_count_global: /* r0 will be the address to the first integer / / r1 will be the address to the second integer / push {r4, r5} / keep callee-saved registers / ldr r0, [r0] / r0 ← *r0 load the integer pointed by r0 in r0 / ldr r1, [r1] / r1 ← *r1 load the integer pointed by r1 in r1 */
cmp r0, r1 /* compute r0 - r1 and update cpsr */
moveq r0, #0 /* if cpsr means that r0 == r1 then r0 ← 0 */
movlt r0, #-1 /* if cpsr means that r0 < r1 then r0 ← -1 */
movgt r0, #1 /* if cpsr means that r0 > r1 then r0 ← 1 */
ldr r4, addr_of_global_counter /* r4 ← &global_counter */
ldr r5, [r4] /* r5 ← *r4 */
add r5, r5, #1 /* r5 ← r5 + 1 */
str r5, [r4] /* *r4 ← r5 */
pop {r4, r5} /* restore callee-saved registers */
bx lr /* return */
addr_of_global_counter: .word global_counter
But this post is about nested functions so we will use nested functions. Recall that nested functions can access local variables of their enclosing functions. So we will use a local variable of main
as the counter and a nested function (of main
) that performs the comparison and updates the counter.
In the last chapter we ended with a short discussion about nested functions. A downside of nested functions it that a pointer to a nested function requires two things: the address of the function and the lexical scope. If you check again the previous example where we call qsort
, you will see that we do not mention anywhere the lexical scope. And there is a reason for that, it is not possible to pass it to qsort
. In C, functions cannot be nested so a pointer to a function can just be the address of the function.
Trampoline
We will continue using the convention of the last chapter: r10
will have the lexical scope upon the entry of the function. But qsort
, when calls integer_compare_count
will not set it for us: we cannout count on r10
having a meaningful value when called from qsort
. This means that qsort
should actually call something that first sets r10
with the right value and then jumps to integer_compare_count
. We will call this ancillary code (or pseudofunction) a trampoline. The technique used here is similar to the one used by GCC described in Lexical Closures for C++ (Thomas M. Breuel, USENIX C++ Conference Proceedings, October 17-21, 1988).
The trampoline is a small, always the same, sequence of instructions that behaves like a function and its only purpose is to set r10
and then make an indirect call to the nested function. Since the sequence of instructions is always the same, the instructions themselves look like a template.
174 175 176 177 178 179 180 181 182 183 |
.Laddr_trampoline_template : .word .Ltrampoline_template /* we will use this below */ .Ltrampoline_template: .Lfunction_called: .word 0x0 .Llexical_scope: .word 0x0 push {r4, r5, r10, lr} /* keep callee-saved registers */ ldr r4, .Lfunction_called /* r4 ← function called */ ldr r10, .Llexical_scope /* r10 ← lexical scope */ blx r4 /* indirect call to r4 */ pop {r4, r5, r10, lr} /* restore callee-saved registers */ bx lr /* return */ |
I used the word template because while the instructions are not going to change, there are two items in the trampoline, labeled function_called
and lexical_scope
, that will have to be appropriately set before using the trampoline.
It may be easier to understand if you consider the code above as if it were data: see it as an array of integers. The first two integers, function_called
and lexical_scope
, are still zero but will be set at some point. The remaining elements in the array are other integers (we do not care which ones) that happen to encode ARM instructions. The cool thing is that these instructions refer to the two first integers, so by changing them we are indirectly changing what the trampoline does. This trampoline takes 8 words, so 32 bytes.
Let’s start with this example.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
/* trampoline-sort-arrays.s */ .data /* declare an array of 10 integers called my_array */ .align 4 my_array: .word 82, 70, 93, 77, 91, 30, 42, 6, 92, 64 /* format strings for printf */ /* format string that prints an integer plus a space */ .align 4 integer_printf: .asciz "%d " /* format string that simply prints a newline */ .align 4 newline_printf: .asciz "\n" .align 4 /* format string for number of comparisons */ comparison_message: .asciz "Num comparisons: %d\n" .text |
.data
/* declare an array of 10 integers called my_array */ .align 4 my_array: .word 82, 70, 93, 77, 91, 30, 42, 6, 92, 64
/* format strings for printf / / format string that prints an integer plus a space / .align 4 integer_printf: .asciz "%d " / format string that simply prints a newline / .align 4 newline_printf: .asciz "\n" .align 4 / format string for number of comparisons */ comparison_message: .asciz "Num comparisons: %d\n"
.text
The function print_array
will be the same as above. Next is main
.
54 55 56 57 58 59 60 61 62 63 64 |
.globl main main: push {r4, r5, r6, fp, lr} /* keep callee saved registers */ mov fp, sp /* setup dynamic link */ sub sp, sp, #4 /* counter will be in fp - 4 */ /* note that now the stack is 8-byte aligned */ /* set counter to zero */ mov r4, #0 /* r4 ← 0 */ str r4, [fp, #-4] /* counter ← r4 */ |
sub sp, sp, #4 /* counter will be in fp - 4 */
/* note that now the stack is 8-byte aligned */
/* set counter to zero */
mov r4, #0 /* r4 ← 0 */
str r4, [fp, #-4] /* counter ← r4 */</p></div>
Nothing fancy here, we set the dynamic link, allocate space in the stack for the counter and set it to zero.
Now we make room for the trampoline in the stack. Recall that our trampoline takes 32 bytes.
66 67 68 69 |
/* Make room for the trampoline */ sub sp, sp, #32 /* sp ← sp - 32 */ /* note that 32 is a multiple of 8, so the stack is still 8-byte aligned */ |
Now we will copy the trampoline template into the stack storage we just allocated. We do this with a loop that copies a word (4 bytes) at a time.
71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 |
/* copy the trampoline into the stack */ mov r4, #32 /* r4 ← 32 */ ldr r5, .Laddr_trampoline_template /* r4 ← &trampoline_template */ mov r6, sp /* r6 ← sp */ b .Lcopy_trampoline_loop_check /* branch to copy_trampoline_loop_check */ .Lcopy_trampoline_loop: ldr r7, [r5] /* r7 ← *r5 */ str r7, [r6] /* *r6 ← r7 */ add r5, r5, #4 /* r5 ← r5 + 4 */ add r6, r6, #4 /* r6 ← r6 + 4 */ sub r4, r4, #4 /* r4 ← r4 - 4 */ .Lcopy_trampoline_loop_check: cmp r4, #0 /* compute r4 - 0 and update cpsr */ bgt .Lcopy_trampoline_loop /* if cpsr means that r4 > 0 then branch to copy_trampoline_loop */ |
.Lcopy_trampoline_loop:
ldr r7, [r5] /* r7 ← *r5 */
str r7, [r6] /* *r6 ← r7 */
add r5, r5, #4 /* r5 ← r5 + 4 */
add r6, r6, #4 /* r6 ← r6 + 4 */
sub r4, r4, #4 /* r4 ← r4 - 4 */
.Lcopy_trampoline_loop_check:
cmp r4, #0 /* compute r4 - 0 and update cpsr */
bgt .Lcopy_trampoline_loop /* if cpsr means that r4 > 0
then branch to copy_trampoline_loop */</p></div>
In the loop above, r4
counts how many bytes remain to copy. r5
and r6
are pointers inside the (source) trampoline and the (destination) stack, respectively. Since we copy 4 bytes at a time, all three registers are updated by 4.
Now we have the trampoline copied in the stack. Recall, it is just an array of words, the two first of which must be updated. The first 4 bytes must be the address of function to be called, i.e. integer_comparison_count
and the second 4 bytes must be the static link, i.e. fp
.
88 89 90 91 92 93 94 95 96 |
/* setup the trampoline */ ldr r4, addr_of_integer_comparison_count /* r4 ← &integer_comparison_count */ str r4, [fp, #-36] /* *(fp - 36) ← r4 */ /* set the function_called in the trampoline to be &integer_comparison_count */ str fp, [fp, #-32] /* *(fp - 32) ← fp */ /* set the lexical_scope in the trampoline to be fp */ |
Recall that our trampoline takes 32 bytes but in the stack we also have the counter. This is the reason why the trampoline starts in fp - 36
(this is also the address of the first word of the trampoline, of course). The second word is then at fp - 32
.
Now we proceed like in the sort example above: we print the array before sorting it and after sorting it. Before printing the sorted array, we will also print the number of comparisons that were performed.
103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 |
/* prepare call to print_array */ ldr r0, addr_of_my_array /* r0 ← &my_array */ mov r1, #10 /* r1 ← 10 our array is of length 10 */ bl print_array /* call print_array */ /* prepare call to qsort */ /* void qsort(void *base, size_t nmemb, size_t size, int (*compar)(const void *, const void *)); */ ldr r0, addr_of_my_array /* r0 ← &my_array base */ mov r1, #10 /* r1 ← 10 nmemb = number of members our array is 10 elements long */ mov r2, #4 /* r2 ← 4 size of each member is 4 bytes */ sub r3, fp, #28 /* r3 ← fp - 28 */ bl qsort /* call qsort */ /* prepare call to printf */ ldr r1, [fp, #-4] /* r1 ← counter */ ldr r0, addr_of_comparison_message /* r0 ← &comparison_message */ bl printf /* call printf */ /* now print again the array to see if elements were sorted */ /* prepare call to print_array */ ldr r0, addr_of_my_array /* r0 ← &my_array */ mov r1, #10 /* r1 ← 10 our array is of length 10 */ bl print_array /* call print_array */ |
/* prepare call to qsort */
/*
void qsort(void *base,
size_t nmemb,
size_t size,
int (*compar)(const void *, const void *));
*/
ldr r0, addr_of_my_array /* r0 ← &my_array
base */
mov r1, #10 /* r1 ← 10
nmemb = number of members
our array is 10 elements long */
mov r2, #4 /* r2 ← 4
size of each member is 4 bytes */
sub r3, fp, #28 /* r3 ← fp - 28 */
bl qsort /* call qsort */
/* prepare call to printf */
ldr r1, [fp, #-4] /* r1 ← counter */
ldr r0, addr_of_comparison_message /* r0 ← &comparison_message */
bl printf /* call printf */
/* now print again the array to see if elements were sorted */
/* prepare call to print_array */
ldr r0, addr_of_my_array /* r0 ← &my_array */
mov r1, #10 /* r1 ← 10
our array is of length 10 */
bl print_array /* call print_array */</p></div>
Note that the argument compar
passed to qsort (line 123) is not the address of the nested function but the trampoline. In fact, it is not the trampoline but its third word since, as we know, the two first words of the trampoline are the address of the nested function to call and the lexical scope (that we set earlier, lines 91 and 94).
Finally we return from main as usual.
139 140 141 142 143 144 145 146 |
mov r0, #0 /* r0 ← 0 set errorcode to 0 prior returning from main */ mov sp, fp pop {r4, r5, r6, fp, lr} /* restore callee-saved registers */ bx lr /* return */ addr_of_my_array: .word my_array addr_of_comparison_message : .word comparison_message |
mov sp, fp
pop {r4, r5, r6, fp, lr} /* restore callee-saved registers */
bx lr /* return */
addr_of_my_array: .word my_array addr_of_comparison_message : .word comparison_message
The nested comparison function is next.
148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 |
/* nested function integer comparison */ addr_of_integer_comparison_count : .word integer_comparison_count integer_comparison_count: /* r0 will be the address to the first integer */ /* r1 will be the address to the second integer */ push {r4, r5, r10, fp, lr} /* keep callee-saved registers */ mov fp, sp /* setup dynamic link */ ldr r0, [r0] /* r0 ← *r0 load the integer pointed by r0 in r0 */ ldr r1, [r1] /* r1 ← *r1 load the integer pointed by r1 in r1 */ cmp r0, r1 /* compute r0 - r1 and update cpsr */ moveq r0, #0 /* if cpsr means that r0 == r1 then r0 ← 0 */ movlt r0, #-1 /* if cpsr means that r0 < r1 then r0 ← -1 */ movgt r0, #1 /* if cpsr means that r0 > r1 then r0 ← 1 */ ldr r4, [fp, #8] /* r4 ← *(fp + 8) get static link in the stack */ ldr r5, [r4, #-4] /* r5 ← counter get value of counter */ add r5, r5, #1 /* r5 ← r5 + 1 */ str r5, [r4, #-4] /* counter ← r5 update counter */ mov sp, fp /* restore stack */ pop {r4, r5, r10, fp, lr} /* restore callee-saved registers */ bx lr /* return */ |
ldr r0, [r0] /* r0 ← *r0
load the integer pointed by r0 in r0 */
ldr r1, [r1] /* r1 ← *r1
load the integer pointed by r1 in r1 */
cmp r0, r1 /* compute r0 - r1 and update cpsr */
moveq r0, #0 /* if cpsr means that r0 == r1 then r0 ← 0 */
movlt r0, #-1 /* if cpsr means that r0 < r1 then r0 ← -1 */
movgt r0, #1 /* if cpsr means that r0 > r1 then r0 ← 1 */
ldr r4, [fp, #8] /* r4 ← *(fp + 8)
get static link in the stack */
ldr r5, [r4, #-4] /* r5 ← counter
get value of counter */
add r5, r5, #1 /* r5 ← r5 + 1 */
str r5, [r4, #-4] /* counter ← r5
update counter */
mov sp, fp /* restore stack */
pop {r4, r5, r10, fp, lr} /* restore callee-saved registers */
bx lr /* return */</p></div>
As you can see, the nested function expects r10
to be correctly set. This is what the trampoline does.
Harvard architecture
If you try to run the program as shown, it will probably work. But it will do it by chance. The reason is that we are featuring some simple form of self modifying code.
The Raspberry Pi processor features a modified Harvard architecture. This means that at some point there exists a distinction between instructions memory (.text
) and the data memory (.data
). Nowadays there are not many processors that feature a strict distinction between instruction and data memory (so at some point the program and the data are both in main memory, commonly called the RAM) but such differentiation is kept for caches.
A cache is a smaller and faster memory, that sits between the processor and the main memory. It is used to speed up memory accesses since most of the time such accesses happen close to other memory accesses (i.e. accessing elements of an array, different local variables in the stack or one instruction after the other in implicit sequencing) or close in time (i.e. accessing several times the same local variable or executing the same instruction when the code is in a loop).
Most modern processors feature distinguished caches for data (called the data cache) and instructions (called the instruction cache). The reason for such differentiation is that accessing to memory to execute instruction has a different pattern than accessing to memory to load/store data. It is beneficial to make such distinction but it comes at some price: when a program manipulates data that later will be executed as instructions (like we did with the trampoline, but also when the operating system loads a program in memory) the view of the two caches respect to the program state becomes incoherent: changes that we did in the data will have effect in the data cache but not in the instruction cache. Conversely, since the instruction cache will only get data from the main memory (and not from the data cache), we need to write back all the changes we did in the data cache to the main memory (this is called flushing the cache). We also have to make sure the instruction cache effectively gets the instructions from the memory, rather than reusing previously loaded instructions (which would be stale now), so we have to invalidate (or clear) the instruction cache.
In ARM the instructions that flush and invalidate caches are privileged operations (done through coprocessor instructions on the coprocessor 15 which manages the memory system of the CPU). This means that only the operating system can execute such instructions. As you see, user code may have to request a cache clear. Linux provides a cacheflush
system call for this purpose.
Recall that in chapter 19 we saw ho to make system calls.
According to the Linux kernel, register r0 must contain the address of the beginning of the region to be flushed and invalidated. r1
must contain the address of the first byte that will not be invalidated. r2
must be zero. The cacheflush service number, that has to be set in r7
is 0xf0002
.
push {r7} /* keep r7 because we are going to modify it */ mov r7, #0xf0000 /* r7 ← 0xf0000 */ add r7, r7, #2 /* r7 ← r7 + 2. So r7 ← 0xf0002 We do this in two steps because we cannot encode 0xf0002 in the instruction */ mov r0, sp /* r0 ← sp */ add r1, sp, #32 /* r1 ← sp + 32 */ mov r2, #0 /* r2 ← 0 */ swi 0 /* system call */ pop {r7} /* restore r7 */ |
As an alternative we can call an internal function implemented in libgcc
(the GCC low-level runtime library) called __clear_cache. This function will internally call the Linux service.
98 99 100 101 |
/* prepare call to __clear_cache */ mov r0, sp /* r0 ← sp */ add r1, sp, #32 /* r1 ← sp + 32 */ bl __clear_cache /* call __clear_cache */ |
We will invalidate and flush the caches right after setting up the trampoline (lines 89 to 94).
Now it only remains to run our program.
$ ./trampoline-sort-array 82 70 93 77 91 30 42 6 92 64 Num comparisons: 22 6 30 42 64 70 77 82 91 92 93 |
You can see the full listing here.
Discussion
Given that nested functions require a lexical scope, they cannot be trivially passed as plain addresses to other functions. Today we have seen that by using a trampoline it is possible to pass them to functions that do not allow passing a lexical scope. The price is having to copy a template, the trampoline, having to set it up with the proper values. We also have to flush caches in order to avoid executing wrong code. It is complicated but doable.
Having to flush the cache is undesirable (although not required in all architectures) and may cause a severe degradation of performance. Performance-critical pieces of code typically would not want to do this.
A serious concern, though, with the trampoline approach relates to the fact that we need an executable stack. A modern operating system, like Linux, can mark regions of memory to be readable, writable or executable. A region of memory that is not executable may contain instructions but if we branch to that region the processor will signal a fault, and the operating system will likely kill our process. Being able to disable execution of specific memory regions is done for security purposes. Most of the time one does not have to execute instructions that are found in stack or .data
section. Only .text
makes sense in these cases to be executable.
If you check what we did above, we actually copied some code (which was in .text
) into the stack and then, qsort
branched to the stack. This is because our programs allow an executable stack. Executable stacks are linked to common program vulnerability exploits like buffer overflows.
As we’ve seen in this chapter and in the previous one, nested functions come with several downsides, so it is not surprising that several programming languages do not provide support for them.
That’s all for today.
Read DVDs with bogus permissions in Ubuntu When an array is not an array
So, the question is: why would you rather use nested functions? I mean, I know this is just an example of a neat feature, but, would you use it in real-life applications? It looks like a lot of trouble for little-to-none benefit.
I admit that the example in this chapter may not be very revealing on why a nested function may be interesting, so I understand you think that a global variable would be easier (and indeed it is!).
But it would not work in any context: if for some reason we have more than one thread and we want to sort more than one array concurrently, a global variable will not be appropiate because each thread will overwrite the value of the counter. If you want the aggregate sum of the comparisons this could suffice, but imagine you need each counter (for instance you want to compute the average of comparisons done by each thread). If rather than a global variable we use a local variable, each thread can call
qsort
without having to worry if another thread is going to overwrite it.We have not considered threads in any chapter, so if you are not happy with the explanation above may be you can be convinced that in general you want the state of your program to be isolated as much as possible in order to unnecessarily expose it to other parts of the program. A global variable may be modified everywhere and so you have to believe that only the comparison function is going to modify it. By wrapping the state as a local variable, only that function can access that state and discard it as soon as it is not needed anymore (when leaving the function by shrinking the stack). You cannot do this with a global variable which will always take some amount of memory.
Of course, there is a tradeoff: accessing a global variable is easy and fast. Passing a pointer to a nested function that needs to access a local variable requires some acrobatics. Sometimes such acrobatics are not acceptable and nested functions get out of the way.
Usually, for languages where there are no nested functions, the usual workaround involves having functions with an extra parameter that plays a similar role to the lexical scope. Note that for qsort this is not possible but it would be possible for qsort_r.
Your last reply was crystal clear: you are right that I was only thinking in single threaded programs, but still your point of minimizing the exposition of each function is good enough to think about nested functions.
Thanks!
thanks for your kind words. Several friends of mine also suggested me to compile the posts in a book or something but, admittedly, I never thought to make a textbook of it. I am afraid that I would not be able to devote much time to it so a collaboration does not seem possible now. My posting frequency is rather irregular already.
That said, feel free to use (and even improve, I bet there is room for that) the examples in this tutorial.
Kind regards,
Could you give us some insight on how you did this? Are the laptops running Windows?
Thank you very much!
Clyde Wilson