I came into some interesting stuff some time ago, while trying to detect performance issues in some ANSI-C software.
Sometimes, while coding, it may be difficult to decide whether to use a recursive function or a loop, and if so, what kind of loop.
The best example, is the factorial computation. That example is typically used to teach recursive function.
I always felt something strange with those examples. Why the hell using a recursive function for a factorial computation, since we can use a while
loop, for instance?
If you think about machine code, or assembly, calling a procedure implies to copy the values of some registers (instruction pointer, etc), to execute a jump into another portion of memory (maybe in another segment), to execute the code, to restore the registers, and finally to return to the previous memory location (RETF
).
In pure assembly, a simple loop, using the CX
register, on Intel, is far more efficient, as there's no jump, nor stack operation to copy the register values.
So what about C? Take a look at this typical example:
unsigned long factorial( unsigned long n ) { if( n == 0 ) { return 1; } return n * factorial( n - 1 ); }
The same can be achieved with a while
loop:
unsigned long x = n; if( n == 0 ) { x = 1; } else { while( n > 1 ) { x *= --n; } }
We can also choose to use a for
loop:
unsigned long x = n; if( n == 0 ) { x = 1; } for( ; n > 1 ; ) { x = --n * x; }
In pure assembly, we would also write a loop. So with GCC inline assembly:
unsigned long x; if( n == 0 ) { x = 1; } __asm__ ( " movq %[n], %%rax \n\t" " movq %[n], %%rcx \n\t" " decq %%rcx \n\t" "FACTORIAL:" " mulq %%rcx \n\t" " decq %%rcx \n\t" " jnz FACTORIAL \n\t" " movq %%rax, %[x]" : [ x ] "=m" ( x ) : [ n ] "m" ( n ) );
Which one will be the fastest? At first sight, I would have said inline assembly, while, for, and finally the recursive function.
But that's not the case.
Running each case 10000000 time gives the following results:
Recursive: 0.64492511749267578125 While: 1.47788429260253906250 Assembler: 0.62227106094360351562 For: 1.47960233688354492188
Inline assembly is the fastest, but the recursive function is very very close.
The while and for loops takes much longer.
I was very surprised. Why such results?
The answer is in the way GCC optimizes to source code to generate intermediate object code.
If we take a look closer at what GCC does, we can see it optimizes for us a lot of things.
For instance, our recusrisve function, in pure assembly, generated by GCC:
LFE7: .globl _factorial_recursive _factorial_recursive: LFB8: pushq %rbp LCFI3: movq %rsp, %rbp LCFI4: subq $16, %rsp LCFI5: movq %rdi, -8(%rbp) cmpq $0, -8(%rbp) jne L4 movq $1, -16(%rbp) jmp L6 L4: movq -8(%rbp), %rdi decq %rdi call _factorial_recursive movq %rax, %rdx imulq -8(%rbp), %rdx movq %rdx, -16(%rbp) L6: movq -16(%rbp), %rax leave ret
And now the while
loop:
LFB9: pushq %rbp LCFI6: movq %rsp, %rbp LCFI7: movq %rdi, -24(%rbp) movq -24(%rbp), %rax movq %rax, -8(%rbp) cmpq $0, -24(%rbp) jne L12 movq $1, -32(%rbp) jmp L11 L13: decq -24(%rbp) movq -8(%rbp), %rax imulq -24(%rbp), %rax movq %rax, -8(%rbp) L12: cmpq $1, -24(%rbp) ja L13 movq -8(%rbp), %rax movq %rax, -32(%rbp) L11: movq -32(%rbp), %rax
The same applies for the for
loop.
We can see here the compiler (GCC for instance) did a HUGE optimization of our source code.
It means actually that, except if you're a assembly genius, you probably won't code that will run faster, because you did it with inline assembly.
Nowadays, compilers are so efficient that some optimizations just won't work.
In doubt, always remember to benchmark your code, and always remember that compiler can usually generate intermediate assembly code, so you can check what your code will really look like, when executed by the CPU.