Predication

ARM assembler in Raspberry Pi – Chapter 11

March 16, 2013 Roger Ferrer Ibáñez

Several times in earlier chapters I stated that the ARM architecture was designed with the embedded world in mind. Although memory gets cheaper every day, it may still account for an important part of the budget of an embedded system. The ARM instruction set has several features meant to reduce code size. One of the features that helps here is predication.

Predication

We saw in chapters 6 and 7 how to use branches in our program in order to modify the execution flow of instructions and implement useful control structures. Branches can be unconditional, for instance when calling a function as we did in chapters 9 and 10, or conditional when we want to jump to some part of the code only when a previously tested condition is met.

Predication is related to conditional branches. What if, instead of branching to some part of the code meant to be executed only when a condition C holds, we were able to turn some instructions off when that condition C does not hold? Consider a case like this.

if (C)
  T();
else
  E();

Using predication (and with some invented syntax to express it) we could write the above if as follows.

P = C;
[P]  T();
[!P] E();

This way we avoid branches. But why would we want to avoid branches? Well, executing a conditional branch involves a bit of uncertainty, and this deserves some explanation.

The assembly line of instructions

Imagine an assembly line. In that assembly line there are 5 workers, each one fully specialized in a single task. That assembly line executes instructions. Every instruction enters the assembly line from the left and leaves it at the right. Each worker does some task on the instruction and passes it to the next worker to the right. Also, imagine all workers are more or less synchronized: each one finishes the task in at most 6 seconds. This means that every 6 seconds an instruction leaves the assembly line, fully executed. It also means that at any given time there may be up to 5 instructions being processed (although not fully executed; only one instruction is fully executed every 6 seconds).

The first worker fetches instructions and puts them in the assembly line. It fetches the instruction at the address specified by the register pc. By default, unless told otherwise, this worker fetches the instruction physically following the one it previously fetched (this is implicit sequencing).

In this assembly line, the worker that checks the condition of a conditional branch is not the first one but the third one. Now consider what happens when the first worker fetches a conditional branch and puts it in the assembly line. The second worker will process it and pass it to the third one. The third one will process it by checking the condition of the conditional branch. If it does not hold, nothing happens, the branch has no effect. But if the condition holds, the third worker must notify the first one that the next instruction fetched should be the instruction at the address of the branch.

But now there are two instructions in the assembly line that should not be fully executed (the ones that were physically after the conditional branch). There are several options here. The third worker may pick two stickers labeled do nothing, and stick them to the next two instructions. Another approach would be for the third worker to tell the first and second workers «hey guys, stick a do nothing to your current instruction». Later workers, when they see these do nothing stickers, will do, huh, nothing. This way each do nothing instruction will never be fully executed.

But by doing this, that nice property of our assembly line is gone: we no longer have a fully executed instruction every 6 seconds. In fact, after the conditional branch there are two do nothing instructions. A program that is constantly taking branches may well reduce the performance of our assembly line from one (useful) instruction every 6 seconds to one instruction every 18 seconds. This is three times slower!
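The arithmetic above can be checked with a small model. The following Python sketch is not a simulation of any real processor: it only assumes, as in the example, a stage time of 6 seconds and two discarded instructions per taken branch, and it ignores the time needed to fill the line at the start. The names `STAGE_SECONDS`, `BUBBLES_PER_TAKEN_BRANCH` and `seconds_per_useful_instruction` are invented for this illustration.

```python
STAGE_SECONDS = 6              # every worker finishes its task in at most 6 seconds
BUBBLES_PER_TAKEN_BRANCH = 2   # instructions that get a "do nothing" sticker


def seconds_per_useful_instruction(useful, taken_branches):
    """Average time between useful instructions leaving the line.

    useful: number of instructions fully executed (branches included)
    taken_branches: how many of those instructions are taken branches
    """
    slots = useful + BUBBLES_PER_TAKEN_BRANCH * taken_branches
    return STAGE_SECONDS * slots / useful


# No taken branches: one instruction every 6 seconds.
print(seconds_per_useful_instruction(100, 0))    # 6.0
# Every instruction is a taken branch: one useful instruction every 18 seconds.
print(seconds_per_useful_instruction(100, 100))  # 18.0
```

Real programs sit somewhere between these two extremes, depending on how often they branch.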

The truth is that modern processors, including the one in the Raspberry Pi, have branch predictors which are able to mitigate these problems: they try to predict whether the condition will hold, that is, whether the branch will be taken or not. Branch predictors, though, predict the future like stock brokers: using the past and, when there is no past information, using some sensible assumptions. So branch predictors may work very well with relatively predictable code but not so well if the code has unpredictable behaviour. Such behaviour is observed, for instance, when running decompressors. A compressor reduces the size of your files by removing redundancy. Redundant stuff is predictable and can be omitted (for instance in “he is wearing his coat” you could omit “he” or replace “his” by “its”, regardless of whether doing this is rude, because you know you are talking about a male). So a decompressor has to process a file which has very little redundancy, driving the predictor nuts.

Back to the assembly line example, it would be the first worker who attempts to predict whether the branch will be taken or not. It is the third worker who verifies whether the first worker made the right prediction. If the first worker mispredicted the branch, then we have to apply two stickers again and notify the first worker of the right address of the next instruction. If the first worker predicted the branch right, nothing special has to be done, which is great.
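To get a feeling for what "using the past" may mean, here is a toy predictor in Python. The text does not say which scheme the Raspberry Pi processor uses, so this sketch picks a classic textbook one, a single two-bit saturating counter, purely as an assumption for illustration. The function name `count_mispredictions` and the two input patterns are made up.

```python
def count_mispredictions(outcomes, counter=1):
    """Run one two-bit saturating counter over a list of branch outcomes.

    outcomes: iterable of booleans, True meaning the branch was taken
    counter: initial state in 0..3 (0-1 predict not taken, 2-3 predict taken)
    """
    mispredictions = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken != taken:
            mispredictions += 1
        # Learn from the past: move one step towards the observed outcome.
        if taken:
            counter = min(counter + 1, 3)
        else:
            counter = max(counter - 1, 0)
    return mispredictions


steady = [True] * 100              # a branch that is always taken
alternating = [True, False] * 50   # a branch whose outcome flips every time

print(count_mispredictions(steady))       # 1
print(count_mispredictions(alternating))  # 100
```

The steady branch is mispredicted only once, while this simple counter never gets the flipping one right. Real predictors are more elaborate, but the idea is the same: the less the past tells about the future, the more stickers we need.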

If we avoid branches, we avoid the uncertainty of whether the branch is taken or not. So it looks like predication is the way to go. Not so fast. Processing a bunch of instructions that are actually turned off is not an efficient use of a processor.

Back to our assembly line, the third worker will check the predicate. If it does not hold, the current instruction will get a do nothing sticker but in contrast to a branch, it does not notify the first worker.
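The trade-off can be put into numbers with a rough cost model. The Python sketch below counts assembly line slots for one if-then-else under assumptions taken from the example: a mispredicted conditional branch costs two extra slots, a correctly predicted branch costs nothing extra, and the unconditional branch that skips the else part is always predicted right. The function names and the sizes used are hypothetical.

```python
MISPREDICT_PENALTY = 2  # slots lost when the first worker guessed wrong


def predicated_slots(then_len, else_len):
    # Every instruction of both parts goes through the line. Those whose
    # predicate does not hold are turned off but still use a slot.
    return then_len + else_len


def branching_slots(then_len, else_len, condition_holds, mispredicted):
    # Assumed shape: conditional branch, then part followed by an
    # unconditional branch over the else part, else part.
    if condition_holds:
        slots = 1 + then_len + 1
    else:
        slots = 1 + else_len
    if mispredicted:
        slots += MISPREDICT_PENALTY
    return slots


# Small if-then-else (1 and 2 instructions):
print(predicated_slots(1, 2))              # 3
print(branching_slots(1, 2, True, False))  # 3
print(branching_slots(1, 2, True, True))   # 5

# Big if-then-else (10 and 10 instructions):
print(predicated_slots(10, 10))             # 20
print(branching_slots(10, 10, True, True))  # 14
```

With small parts, predication is never worse than a well predicted branch and clearly better than a mispredicted one. With big parts, even a mispredicted branch wastes fewer slots than pushing all the turned off instructions through the line.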

So, as usual, no approach is perfect on its own.

Predication in ARM

In ARM, predication is very simple to use: almost all instructions can be predicated. The predicate is specified as a suffix to the instruction name. The suffixes are exactly the same as those used in branches in chapter 5: eq, ne, le, lt, ge and gt. Instructions that are not predicated are assumed to have the suffix al, standing for always. That predicate always holds and we do not write it for economy (it is valid, though). You can understand conditional branches as predicated branches if you like.
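Each suffix is just a test on the flags of the cpsr that a previous instruction such as cmp has set. As a reminder, the Python sketch below spells out the test for each suffix mentioned above. The function `predicate_holds` is invented for this illustration, and it takes only the negative (N), zero (Z) and overflow (V) flags, because these suffixes do not look at the carry flag.

```python
def predicate_holds(suffix, n, z, v):
    """Tell whether an instruction with the given suffix would execute.

    n, z, v: the N, Z and V flags as booleans.
    Only the suffixes mentioned in the text are modelled.
    """
    checks = {
        "eq": z,                    # equal
        "ne": not z,                # not equal
        "ge": n == v,               # signed greater than or equal
        "lt": n != v,               # signed lower than
        "gt": (not z) and n == v,   # signed greater than
        "le": z or n != v,          # signed lower than or equal
        "al": True,                 # always
    }
    return checks[suffix]


# Flags after comparing two equal values: Z is set, N and V are clear.
print(predicate_holds("eq", n=False, z=True, v=False))  # True
print(predicate_holds("ne", n=False, z=True, v=False))  # False
print(predicate_holds("al", n=False, z=True, v=False))  # True
```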

Collatz conjecture revisited

In chapter 6 we implemented an algorithm that computed the length of the Hailstone sequence of a given number. Though not proved yet, no number has been found that has an infinite Hailstone sequence. Given the knowledge of functions we acquired in chapters 9 and 10, I encapsulated the code that computes the length of the Hailstone sequence in a function.

/* -- collatz02.s */
.data
 
message: .asciz "Type a number: "
scan_format : .asciz "%d"
message2: .asciz "Length of the Hailstone sequence for %d is %d\n"
 
.text
 
collatz:
    /* r0 contains the first argument */
    /* Only r0, r1 and r2 are modified, 
       so we do not need to keep anything
       in the stack */
    /* Since we do not do any call, we do
       not have to keep lr either */
    mov r1, r0                 /* r1 ← r0 */
    mov r0, #0                 /* r0 ← 0 */
  collatz_loop:
    cmp r1, #1                 /* compare r1 and 1 */
    beq collatz_end            /* if r1 == 1 branch to collatz_end */
    and r2, r1, #1             /* r2 ← r1 & 1 */
    cmp r2, #0                 /* compare r2 and 0 */
    bne collatz_odd            /* if r2 != 0 (this is r1 % 2 != 0) branch to collatz_odd */
  collatz_even:
    mov r1, r1, ASR #1         /* r1 ← r1 >> 1. This is r1 ← r1/2 */
    b collatz_end_loop         /* branch to collatz_end_loop */
  collatz_odd:
    add r1, r1, r1, LSL #1     /* r1 ← r1 + (r1 << 1). This is r1 ← 3*r1 */
    add r1, r1, #1             /* r1 ← r1 + 1. */
  collatz_end_loop:
    add r0, r0, #1             /* r0 ← r0 + 1 */
    b collatz_loop             /* branch back to collatz_loop */
  collatz_end:
    bx lr
 
.global main
main:
    push {lr}                       /* keep lr */
    sub sp, sp, #4                  /* make room for 4 bytes in the stack */
                                    /* The stack is already 8 byte aligned */
 
    ldr r0, address_of_message      /* first parameter of printf: &message */
    bl printf                       /* call printf */
 
    ldr r0, address_of_scan_format  /* first parameter of scanf: &scan_format */
    mov r1, sp                      /* second parameter of scanf: 
                                       address of the top of the stack */
    bl scanf                        /* call scanf */
 
    ldr r0, [sp]                    /* first parameter of collatz:
                                       the value stored (by scanf) in the top of the stack */
    bl collatz                      /* call collatz */
 
    mov r2, r0                      /* third parameter of printf: 
                                       the result of collatz */
    ldr r1, [sp]                    /* second parameter of printf:
                                       the value stored (by scanf) in the top of the stack */
    ldr r0, address_of_message2     /* first parameter of printf: &message2 */
    bl printf
 
    add sp, sp, #4
    pop {lr}
    bx lr
 
 
address_of_message: .word message
address_of_scan_format: .word scan_format
address_of_message2: .word message2
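If you want something to compare the output of the program against, the collatz function can be mirrored in a few lines of Python. This is only a sketch for checking results: it assumes a positive input and, unlike the assembler, it does not model 32-bit registers. The comments show which instruction each line stands for.

```python
def collatz(n):
    """Length of the Hailstone sequence of n (n must be positive)."""
    length = 0                  # mov r0, #0
    while n != 1:               # cmp r1, #1 / beq collatz_end
        if (n & 1) == 0:        # and r2, r1, #1 / cmp r2, #0
            n = n >> 1          # mov r1, r1, ASR #1
        else:
            n = n + (n << 1)    # add r1, r1, r1, LSL #1
            n = n + 1           # add r1, r1, #1
        length += 1             # add r0, r0, #1
    return length


print(collatz(123))  # 46
```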


Adding predication

OK, let’s add some predication. There is an if-then-else construct in lines 22 to 31. There we check if the number is even or odd. If even we divide it by 2, if odd we multiply it by 3 and add 1.

    and r2, r1, #1             /* r2 ← r1 & 1 */
    cmp r2, #0                 /* compare r2 and 0 */
    bne collatz_odd            /* if r2 != 0 (this is r1 % 2 != 0) branch to collatz_odd */
  collatz_even:
    mov r1, r1, ASR #1         /* r1 ← r1 >> 1. This is r1 ← r1/2 */
    b collatz_end_loop         /* branch to collatz_end_loop */
  collatz_odd:
    add r1, r1, r1, LSL #1     /* r1 ← r1 + (r1 << 1). This is r1 ← 3*r1 */
    add r1, r1, #1             /* r1 ← r1 + 1. */
  collatz_end_loop:

Note in line 24 that there is a bne (branch if not equal). We can use this condition (and its opposite eq) to predicate this if-then-else construct. Instructions in the then part will be predicated using eq, instructions in the else part will be predicated using ne. The resulting code is shown below.

    cmp r2, #0                 /* compare r2 and 0 */
    moveq r1, r1, ASR #1       /* if r2 == 0, r1 ← r1 >> 1. This is r1 ← r1/2 */
    addne r1, r1, r1, LSL #1   /* if r2 != 0, r1 ← r1 + (r1 << 1). This is r1 ← 3*r1 */
    addne r1, r1, #1           /* if r2 != 0, r1 ← r1 + 1. */

As you can see there are no labels in the predicated version. We do not branch now, so they are not needed anymore. Note also that we actually removed two branches: the one that branches from the condition test code to the else part and the one that branches from the end of the then part to the instruction after the whole if-then-else. This leads to more compact code.
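To convince ourselves that the predicated sequence computes the same thing, we can emulate it. In the Python sketch below every instruction goes through in order and only takes effect if its predicate matches the Z flag set by cmp. This works because none of the three instructions updates the flags (they do not carry the s suffix), so all of them see the result of the same comparison. The function `predicated_step` and the way it represents instructions are made up for this illustration.

```python
def predicated_step(r1):
    """Emulate the predicated if-then-else for one value of r1.

    Returns the new r1 and how many instructions actually took effect.
    """
    r2 = r1 & 1                          # and r2, r1, #1
    z = (r2 == 0)                        # cmp r2, #0 sets Z when r2 is 0
    program = [
        ("eq", lambda r: r >> 1),        # moveq r1, r1, ASR #1
        ("ne", lambda r: r + (r << 1)),  # addne r1, r1, r1, LSL #1
        ("ne", lambda r: r + 1),         # addne r1, r1, #1
    ]
    executed = 0
    for suffix, operation in program:
        holds = z if suffix == "eq" else not z
        if holds:                        # otherwise it gets a "do nothing" sticker
            r1 = operation(r1)
            executed += 1
    return r1, executed


print(predicated_step(10))  # (5, 1)
print(predicated_step(7))   # (22, 2)
```

For an even number only the moveq takes effect, for an odd number only the two addne do. In both cases three instructions occupy the assembly line.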

Does it make any difference in performance?

Taken as is, this program is too small to be timed reliably, so I modified it to run the same calculation inside the collatz function 4194304 (this is 2^22) times. I chose the number after some tests, so that the execution did not take tediously long.

Sadly, while the Raspberry Pi processor provides some hardware performance counters, I have not been able to use any of them. The perf tool (from the package linux-tools-3.2) complains that the counter cannot be opened.

$ perf_3.2 stat -e cpu-cycles ./collatz02
  Error: open_counter returned with 19 (No such device). /bin/dmesg may provide additional information.
 
  Fatal: Not all events could be opened


dmesg does not provide any additional information. We can see, though, that the performance counters were enabled by the kernel.

$ dmesg | grep perf
[    0.061722] hw perfevents: enabled with v6 PMU driver, 3 counters available

Supposedly I should be able to measure up to 3 hardware events at the same time. I suspect that the Raspberry Pi processor, packaged in the BCM2835 SoC, does not provide a PMU (Performance Monitoring Unit), which is required for performance counters. Nevertheless, we can use cpu-clock to measure the time.

Below are the versions I used for this comparison. First is the branches version, second the predication version.

collatz:
    /* r0 contains the first argument */
    push {r4}
    sub sp, sp, #4  /* Make sure the stack is 8 byte aligned */
    mov r4, r0
    mov r3, #4194304
  collatz_repeat:
    mov r1, r4                 /* r1 ← r4 */
    mov r0, #0                 /* r0 ← 0 */
  collatz_loop:
    cmp r1, #1                 /* compare r1 and 1 */
    beq collatz_end            /* if r1 == 1 branch to collatz_end */
    and r2, r1, #1             /* r2 ← r1 & 1 */
    cmp r2, #0                 /* compare r2 and 0 */
    bne collatz_odd            /* if r2 != 0 (this is r1 % 2 != 0) branch to collatz_odd */
  collatz_even:
    mov r1, r1, ASR #1         /* r1 ← r1 >> 1. This is r1 ← r1/2 */
    b collatz_end_loop         /* branch to collatz_end_loop */
  collatz_odd:
    add r1, r1, r1, LSL #1     /* r1 ← r1 + (r1 << 1). This is r1 ← 3*r1 */
    add r1, r1, #1             /* r1 ← r1 + 1. */
  collatz_end_loop:
    add r0, r0, #1             /* r0 ← r0 + 1 */
    b collatz_loop             /* branch back to collatz_loop */
  collatz_end:
    sub r3, r3, #1
    cmp r3, #0
    bne collatz_repeat
    add sp, sp, #4             /* Restore the stack */
    pop {r4}
    bx lr

collatz2:
    /* r0 contains the first argument */
    push {r4}
    sub sp, sp, #4  /* Make sure the stack is 8 byte aligned */
    mov r4, r0
    mov r3, #4194304
  collatz2_repeat:
    mov r1, r4                 /* r1 ← r4 */
    mov r0, #0                 /* r0 ← 0 */
  collatz2_loop:
    cmp r1, #1                 /* compare r1 and 1 */
    beq collatz2_end           /* if r1 == 1 branch to collatz2_end */
    and r2, r1, #1             /* r2 ← r1 & 1 */
    cmp r2, #0                 /* compare r2 and 0 */
    moveq r1, r1, ASR #1       /* if r2 == 0, r1 ← r1 >> 1. This is r1 ← r1/2 */
    addne r1, r1, r1, LSL #1   /* if r2 != 0, r1 ← r1 + (r1 << 1). This is r1 ← 3*r1 */
    addne r1, r1, #1           /* if r2 != 0, r1 ← r1 + 1. */
  collatz2_end_loop:
    add r0, r0, #1             /* r0 ← r0 + 1 */
    b collatz2_loop            /* branch back to collatz2_loop */
  collatz2_end:
    sub r3, r3, #1
    cmp r3, #0
    bne collatz2_repeat
    add sp, sp, #4             /* Restore the stack */
    pop {r4}
    bx lr

The tool perf can be used to gather performance counters. We will run each version 5 times, using the number 123. We pipe the output of yes 123 to the standard input of our tested program. This way we do not have to type it (which may affect the timing of the comparison).

The version with branches gives the following results:

$ yes 123 | perf_3.2 stat --log-fd=3 --repeat=5 -e cpu-clock ./collatz_branches 3>&1
Type a number: Length of the Hailstone sequence for 123 is 46
Type a number: Length of the Hailstone sequence for 123 is 46
Type a number: Length of the Hailstone sequence for 123 is 46
Type a number: Length of the Hailstone sequence for 123 is 46
Type a number: Length of the Hailstone sequence for 123 is 46
 
 Performance counter stats for './collatz_branches' (5 runs):
 
       3359,953200 cpu-clock                  ( +-  0,01% )
 
       3,365263737 seconds time elapsed                                          ( +-  0,01% )


(When redirecting the input of perf one must specify the file descriptor for the output of perf stat itself. In this case we have used file descriptor number 3 and then told the shell to redirect file descriptor number 3 to the standard output, which is file descriptor number 1.)

The version with predication gives the following results:

$ yes 123 | perf_3.2 stat --log-fd=3 --repeat=5 -e cpu-clock ./collatz_predication 3>&1
Type a number: Length of the Hailstone sequence for 123 is 46
Type a number: Length of the Hailstone sequence for 123 is 46
Type a number: Length of the Hailstone sequence for 123 is 46
Type a number: Length of the Hailstone sequence for 123 is 46
Type a number: Length of the Hailstone sequence for 123 is 46
 
 Performance counter stats for './collatz_predication' (5 runs):
 
       2318,217200 cpu-clock                  ( +-  0,01% )
 
       2,322732232 seconds time elapsed                                          ( +-  0,01% )


So the answer is yes. In this case it does make a difference. The predicated version runs about 1.45 times faster than the version using branches. It would be bold, though, to assume that predication outperforms branches in general. Always measure your time.
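The ratio comes straight from the two cpu-clock values reported by perf stat above. A couple of lines of Python are enough to redo the arithmetic. The variable names are just for this sketch, and the values are in milliseconds.

```python
# cpu-clock values reported by perf stat for each version
branches_ms = 3359.953200
predication_ms = 2318.217200

speedup = branches_ms / predication_ms
saved = 1 - predication_ms / branches_ms

print(f"speedup: {speedup:.2f}x")   # speedup: 1.45x
print(f"time saved: {saved:.0%}")   # time saved: 31%
```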

That’s all for today.


		<h3>8 thoughts on “<span>ARM assembler in Raspberry Pi – Chapter 11</span>”</h3>
	<ul class="commentlist">
			<li class="comment even thread-even depth-1 parent" id="comment-1130">
			<div id="div-comment-1130" class="comment-body">
			<div class="comment-author vcard">
		<cite class="fn">Fernando</cite> <span class="says">says:</span>		</div>
	
	<div class="comment-meta commentmetadata"><a href="http://thinkingeek.com/2013/03/16/arm-assembler-raspberry-pi-chapter-11/#comment-1130">
		April 2, 2013 at 8:42 am</a>		</div>

	<p>Hi, why is predication supposed to be faster? As far as I can see, both ways there is a number of instructions that will be “noped”.</p>

Even more, if the if…then…else is ‘big’ (let’s say 10 or more instructions in each case), there will be much more instructions wasted than with a typical branch.

			</div>
	<ul class="children">
	<li class="comment byuser comment-author-rferrer bypostauthor odd alt depth-2" id="comment-1140">
			<div id="div-comment-1140" class="comment-body">
			<div class="comment-author vcard">
		<cite class="fn">rferrer</cite> <span class="says">says:</span>		</div>
	
	<div class="comment-meta commentmetadata"><a href="http://thinkingeek.com/2013/03/16/arm-assembler-raspberry-pi-chapter-11/#comment-1140">
		April 6, 2013 at 9:28 pm</a>		</div>

	<p>It is not supposed to be always faster. It just happens that in some cases, like in this 3n+1 example, a small predicated part may beat branching if the branches are hard to predict.</p>

Of course, predicating big chunks of instructions is not beneficial, as the processor will be doing nothing for them. In such cases branching is the right choice.

No solution is always better. Sometimes predication is better, sometimes branching is the way to go.

			</div>
	</li><!-- #comment-## -->

Loren Blaney says:

	<div class="comment-meta commentmetadata"><a href="http://thinkingeek.com/2013/03/16/arm-assembler-raspberry-pi-chapter-11/#comment-1145">
		April 7, 2013 at 6:50 pm</a>		</div>

	<p>These lessons have been very helpful. Thanks!</p>

I’ve just gotten my XPL0 compiler to generate enough arm assembly code to run the classic Sieve of Eratosthenes benchmark. I’m really tired of seeing “Segmentation fault.” I haven’t found the manual that explains how the gcc assembler works.

			</div>
	<ul class="children">
	<li class="comment byuser comment-author-rferrer bypostauthor odd alt depth-2" id="comment-1171">
			<div id="div-comment-1171" class="comment-body">
			<div class="comment-author vcard">
		<cite class="fn">rferrer</cite> <span class="says">says:</span>		</div>
	
	<div class="comment-meta commentmetadata"><a href="http://thinkingeek.com/2013/03/16/arm-assembler-raspberry-pi-chapter-11/#comment-1171">
		April 11, 2013 at 8:51 pm</a>		</div>

	<p>Glad to know your XPL0 compiler is starting to be useful!</p>

Regarding the “gcc assembler” issue you had, you’ll never find a manual. GCC targets many architectures (arm, x86, powerpc, etc), so they generate different assembler for each one. Every architecture is different, so you first learn assembler and then you map your programming language to that assembler. Of course, the intermediate steps are more complicated than that, but you get the idea

			</div>
	</li><!-- #comment-## -->

Mariani Antonio Mario says:

	<div class="comment-meta commentmetadata"><a href="http://thinkingeek.com/2013/03/16/arm-assembler-raspberry-pi-chapter-11/#comment-1440">
		July 1, 2013 at 5:28 am</a>		</div>

	<p>very very very nice code … i loved it …</p>

			</div>
	</li><!-- #comment-## -->
	<li class="comment odd alt thread-odd thread-alt depth-1" id="comment-157446">
			<div id="div-comment-157446" class="comment-body">
			<div class="comment-author vcard">
		<cite class="fn"><a href="http://antoniovillena.es" rel="external nofollow" class="url">Antonio Villena</a></cite> <span class="says">says:</span>		</div>
	
	<div class="comment-meta commentmetadata"><a href="http://thinkingeek.com/2013/03/16/arm-assembler-raspberry-pi-chapter-11/#comment-157446">
		June 15, 2014 at 3:49 pm</a>		</div>

	<p>I have optimized a little:</p>

main:
    mov r1, #123               /* r1 <- 123 */
    mov r0, #0                 /* r0 <- 0 */
loop:
    add r0, r0, #1             /* r0 <- r0 + 1 */
    movs r1, r1, ASR #1        /* r1 <- (r1 >> 1) */
    bxeq lr
    bcc loop
    adc r1, r1, r1
    adc r1, r1, r1, LSL #1     /* r1 <- r1 + (r1 << 1) */
    b loop

Or predicated version:

main:
    mov r1, #123               /* r1 <- 123 */
    mov r0, #0                 /* r0 <- 0 */
loop:
    add r0, r0, #1             /* r0 <- r0 + 1 */
    movs r1, r1, ASR #1        /* r1 <- (r1 >> 1) */
    bxeq lr
    adccs r1, r1, r1
    adccs r1, r1, r1, LSL #1   /* r1 <- r1 + (r1 << 1) */
    b loop

			</div>
	</li><!-- #comment-## -->
	<li class="comment even thread-even depth-1 parent" id="comment-969498">
			<div id="div-comment-969498" class="comment-body">
			<div class="comment-author vcard">
		<cite class="fn">Dennis Ng</cite> <span class="says">says:</span>		</div>
	
	<div class="comment-meta commentmetadata"><a href="http://thinkingeek.com/2013/03/16/arm-assembler-raspberry-pi-chapter-11/#comment-969498">
		June 11, 2016 at 4:58 am</a>		</div>


It seems it is not a hardware issue, as the latest Jessie is able to run the cpu-cycles perf command (you need linux-tools 4.4; upgrading and then installing 4.4 works). The error below comes from Make: even after cutting everything down I still cannot make your command work under Make. (The last command was cut and pasted, and it works.)

 Type a number: Length of the Hailstone sequence for 123 is 46
 Type a number: Length of the Hailstone sequence for 123 is 46
 Type a number: Length of the Hailstone sequence for 123 is 46
 Type a number: Length of the Hailstone sequence for 123 is 46
 Type a number: Length of the Hailstone sequence for 123 is 46
 Performance counter stats for './collatz1' (5 runs):
 6.189200 cpu-clock (msec) ( +- 1.52% )
 0.011474382 seconds time elapsed ( +- 0.40% )
 Makefile:42: recipe for target 'perf1' failed
 make: *** [perf1] Error 47
 d@raspberrypi:~/ch11 $ yes 123 | perf stat --repeat=5 -e cpu-clock ./collatz1
 Type a number: Length of the Hailstone sequence for 123 is 46
 Type a number: Length of the Hailstone sequence for 123 is 46
 Type a number: Length of the Hailstone sequence for 123 is 46
 Type a number: Length of the Hailstone sequence for 123 is 46
 Type a number: Length of the Hailstone sequence for 123 is 46
 Performance counter stats for './collatz1' (5 runs):
 6.151000 cpu-clock (msec) ( +- 1.57% )
 0.011320386 seconds time elapsed ( +- 0.37% )
 d@raspberrypi:~/ch11 $ yes 123 | perf stat --repeat=5 -e cpu-cycles ./collatz1
 Type a number: Length of the Hailstone sequence for 123 is 46
 Type a number: Length of the Hailstone sequence for 123 is 46
 Type a number: Length of the Hailstone sequence for 123 is 46
 Type a number: Length of the Hailstone sequence for 123 is 46
 Type a number: Length of the Hailstone sequence for 123 is 46
 Performance counter stats for './collatz1' (5 runs):
 4258777 cpu-cycles ( +- 1.08% )
 0.012091358 seconds time elapsed ( +- 5.45% )
 d@raspberrypi:~/ch11 $

Not sure what #3>&1 is, and as Make cannot handle the --log-fd=3, I deleted that part.

Strangely, version 2 is slower than version 1, i.e. predication does not work:

 model name : ARMv6-compatible processor rev 7 (v6l)
 BogoMIPS : 697.95
 Features : half thumb fastmult vfp edsp java tls
 CPU implementer : 0x41
 CPU architecture: 7
 CPU variant : 0x0
 CPU part : 0xb76
 CPU revision : 7
 Hardware : BCM2708
 Revision : 0002
 Serial : 000000002ba0c3d4
 #perf stat -e cpu-cycles ./collatz2
 yes 123 | perf stat --repeat=5 -e cpu-clock ./collatz2 #3>&1
 Type a number: Length of the Hailstone sequence for 123 is 46
 Type a number: Length of the Hailstone sequence for 123 is 46
 Type a number: Length of the Hailstone sequence for 123 is 46
 Type a number: Length of the Hailstone sequence for 123 is 46
 Type a number: Length of the Hailstone sequence for 123 is 46
 Performance counter stats for './collatz2' (5 runs):
 2327.793800 cpu-clock (msec) ( +- 0.01% )
 2.344818639 seconds time elapsed ( +- 0.07% )
 Makefile:60: recipe for target 'perf2' failed
 make: *** [perf2] Error 47
 d@raspberrypi:~/ch11 $ yes 123 | perf stat --repeat=5 -e cpu-cycles ./collatz2 #3>&1
 Type a number: Length of the Hailstone sequence for 123 is 46
 Type a number: Length of the Hailstone sequence for 123 is 46
 Type a number: Length of the Hailstone sequence for 123 is 46
 Type a number: Length of the Hailstone sequence for 123 is 46
 Type a number: Length of the Hailstone sequence for 123 is 46
 Performance counter stats for './collatz2' (5 runs):
 1629471236 cpu-cycles ( +- 0.00% )
 2.345886541 seconds time elapsed ( +- 0.07% )
 d@raspberrypi:~/ch11 $

The result is the same on a Raspberry Pi 2 (the previous ones were all from a Pi 1 or B):

 d@raspberrypi:~/ch11 $ make perf2
 uname -a
 Linux raspberrypi 4.4.11-v7+ #888 SMP Mon May 23 20:10:33 BST 2016 armv7l GNU/Linux
 cat /proc/cpuinfo
 processor : 0
 model name : ARMv7 Processor rev 5 (v7l)
 BogoMIPS : 38.40
 Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm
 CPU implementer : 0x41
 CPU architecture: 7
 CPU variant : 0x0
 CPU part : 0xc07
 CPU revision : 5

 processor : 1
 model name : ARMv7 Processor rev 5 (v7l)
 BogoMIPS : 38.40
 Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm
 CPU implementer : 0x41
 CPU architecture: 7
 CPU variant : 0x0
 CPU part : 0xc07
 CPU revision : 5
 processor : 2
 model name : ARMv7 Processor rev 5 (v7l)
 BogoMIPS : 38.40
 Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm
 CPU implementer : 0x41
 CPU architecture: 7
 CPU variant : 0x0
 CPU part : 0xc07
 CPU revision : 5
 processor : 3
 model name : ARMv7 Processor rev 5 (v7l)
 BogoMIPS : 38.40
 Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm
 CPU implementer : 0x41
 CPU architecture: 7
 CPU variant : 0x0
 CPU part : 0xc07
 CPU revision : 5
 Hardware : BCM2709
 Revision : a21041
 Serial : 000000000644243a
 #perf stat -e cpu-cycles ./collatz2
 yes 123 | perf stat --repeat=5 -e cpu-clock ./collatz2 #3>&1
 Type a number: Length of the Hailstone sequence for 123 is 46
 Type a number: Length of the Hailstone sequence for 123 is 46
 Type a number: Length of the Hailstone sequence for 123 is 46
 Type a number: Length of the Hailstone sequence for 123 is 46
 Type a number: Length of the Hailstone sequence for 123 is 46
 Performance counter stats for './collatz2' (5 runs):
 1560.273734 cpu-clock (msec) ( +- 0.14% )
 1.562665742 seconds time elapsed ( +- 0.16% )

 Makefile:61: recipe for target 'perf2' failed
 make: *** [perf2] Error 47
 d@raspberrypi:~/ch11 $

			</div>
	<ul class="children">
	<li class="comment byuser comment-author-rferrer bypostauthor odd alt depth-2" id="comment-970003">
			<div id="div-comment-970003" class="comment-body">
			<div class="comment-author vcard">
		<cite class="fn">Roger Ferrer Ibáñez</cite> <span class="says">says:</span>		</div>
	
	<div class="comment-meta commentmetadata"><a href="http://thinkingeek.com/2013/03/16/arm-assembler-raspberry-pi-chapter-11/#comment-970003">
		June 19, 2016 at 4:58 pm</a>		</div>

	<p>Hi Dennis,</p>

I’m not sure what collatz1 and collatz2 are in your case, but there does indeed seem to be a performance difference between the two versions.

I assume your results do not match mine, do they?

The difference in the perf command is likely due to the state of linux-tools support in Raspbian when I wrote this article in 2013. Glad to see linux-tools is better supported now.

Regards,

			</div>
	</li><!-- #comment-## -->

	<div id="respond" class="comment-respond">
	<h3 id="reply-title" class="comment-reply-title">Leave a Reply <small><a rel="nofollow" id="cancel-comment-reply-link" href="/2013/03/16/arm-assembler-raspberry-pi-chapter-11/#respond" style="display:none;">Cancel reply</a></small></h3>			<form action="http://thinkingeek.com/wp-comments-post.php" method="post" id="commentform" class="comment-form">
