Paul Bone

Tail Recursion

New functional programmers tend to learn about tail recursion early on. We’re told how important it is, and that our code should use it, but often that’s all. Developers are left guessing whether their code is tail recursive and, if not, how to make it tail recursive. In this article I’ll address these gaps for strict/eager functional languages, with a brief note about lazy languages. What I think is the interesting part, and why I’m writing this at all, is the handful of cases where something looks tail recursive but isn’t, plus what developers and compilers can do to get tail recursion when they need it.

I have been working on improving mutual tail call optimisation for Mercury; this post is not about that work, but it prompted me to finally write this up, something I’d been considering for a while.

Definitions

Before we get started we’re going to need some definitions. We’ll mostly be working in a Haskell-like syntax. It is like Haskell except that it is eager and does not support currying.

Self Recursion

We’ll start simple: recursion is when a function calls itself, almost always with "less" input so that each call solves part of the problem.

map f [] = []
map f x:xs = (f x):(map f xs)

map is recursive, because it calls itself (hence self-recursion).

Mutual Recursion

Mutual recursion occurs when two or more functions call each other, such that execution may eventually loop.

odd 0 = False
odd n = even (n - 1)

even 0 = True
even n = odd (n - 1)

odd can call even and even can call odd; the recursion is mutual.

Note that there does not have to be a call from every function to every other function; for example, f can call g, which calls h, which calls f. You may hear people use the term strongly connected component (SCC). It’s not important for this article, but an SCC is a group of mutually recursive functions. Technically even a self-recursive function like map, or a non-recursive function, is an SCC; the SCC just contains a single item. f is not an SCC on its own, it needs g and h.
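
For example (a made-up sketch in our syntax), f, g and h below form a single SCC even though f never calls h directly:

f 0 = 0
f x = g (x - 1)
g 0 = 0
g x = h (x - 1)
h 0 = 0
h x = f (x - 1)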

Tail Position

Tail position is the last thing a function does before it exits. Using our Haskell-like syntax this is simply the outermost expression on the right hand side of the = symbol.

f1 = tail_position 1 2
f2 = 1 + (not_tail_position 2)

In f1 the call to tail_position is in tail position. In f2 it’s the addition that’s in tail position, not the call to not_tail_position.

Tail Recursion

Tail recursion is when the recursive (self or mutual) call appears in tail position. In other words, the last thing that the function does is to call itself. foldl is tail-recursive.

foldl _ [] acc = acc
foldl f x:xs acc = foldl f xs (f acc x)

Compilers can implement tail recursion in a number of ways (collectively called Tail Call Optimisation, or TCO): with a loop, a trampoline (see below), or other methods. However, most compilers implement Last Call Optimisation (LCO).

Last Call Optimisation (LCO)

In assembly and abstract machines, procedure calls are handled by pushing a return address onto the stack and jumping to the callee. The call returns by popping the return address off the stack and jumping to it. Depending on the calling convention, other information may also be placed on the stack. However the tail call, the last call of a procedure, doesn’t need to return to its caller: when it returns it can return to the caller’s caller, or if that’s also a tail call, the caller’s caller’s caller, and so on. It does this by not bothering to save the return address on the stack, and simply jumping to the callee. When it’s time for the callee to return, the desired return address is already in the correct place on the stack. In other words, you can think of a tail call as a goto. This is very easy for the compiler to do provided that it has low-level control over the generated code, and it works for recursive and non-recursive calls alike. Therefore we call it Last Call Optimisation.
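
As a concrete illustration, here’s a hand-written C sketch (hypothetical code, not real compiler output) of a tail-recursive factorial after this transformation; the tail call becomes an update of the arguments followed by a jump:

long fact_acc(long n, long acc)
{
tail:
  if (n <= 1)
    return acc;
  // The tail call fact_acc(n - 1, n * acc) doesn't need a new stack
  // frame: overwrite the arguments and jump back to the top.
  acc = n * acc;
  n = n - 1;
  goto tail;
}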

Do I have it?

We’ve kind-of covered this already: you have tail recursion when a recursive call is in tail position. Our main example of this is foldl:

foldl _ [] acc = acc
foldl f x:xs acc = foldl f xs (f acc x)

However it’s not quite that simple: language features such as laziness, logic variables and the order of operations can change or hide what is and isn’t tail recursive.

Lazy languages

Anything that’s tail recursive in an eager language is also tail recursive in a lazy language. It may create a long chain of thunks on the heap, but that’s a different discussion. What’s different about lazy languages is that some things, like map, are tail recursive when they normally are not in eager languages.

map f [] = []
map f x:xs = (f x):(map f xs)

Why? Because the recursive call to map will not be executed until after this call returns. Once pattern matching chooses the second case, the cons cell will be allocated containing thunks for f x and map f xs, which will not be evaluated until something forces them.
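
One way to see this (in standard Haskell, rather than our eager dialect): because the recursive call is only a thunk, map even works on an infinite list, as long as we force only a finite prefix.

take 3 (map (\x -> x + 1) [1 ..])    -- evaluates to [2, 3, 4]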

Prolog

A similar thing occurs in Prolog. Okay so it’s not a functional language, but it is a language that makes a lot of use of recursion, and therefore tail recursion is important. In Prolog, map looks like this:

map(_, [], []).
map(P, [X | Xs], [Y | Ys]) :-
    P(X, Y),
    map(P, Xs, Ys).

In case you can’t read Prolog I’ll step through execution, assuming we pass ground values for the first two arguments and a free variable for the third (mode in, in, out).

First, it’s easy to see that, like the functional version, it’s divided into two cases (Prolog calls them clauses). Both clauses have three arguments. We’ll execute the second clause, where the inputs are P and [X | Xs] (a cons deconstruction) and the output is [Y | Ys].

  1. P unifies with the first argument, setting P to a ground value.

  2. The second argument is deconstructed into X and Xs, which are both ground.

  3. The third argument is unified with a cons cell. This is the tricky part: the third argument is free (has no value) and so are Y and Ys. The third argument becomes bound to the new cons cell, and Y and Ys are bound to its free arguments. If Y or Ys ever get bound to values of their own, that will change the value of the third argument. It’s a little bit like a pointer to something you fill in later, but more complex. This is what is called a logic variable.

  4. P(X, Y) is executed; this (probably) gives Y a ground value.

  5. map(P, Xs, Ys) is the last thing to be executed, so it can be a tail call! It provides the value for Ys, which makes our call’s third argument become ground.

Like laziness above, this gives us tail recursion by changing the order in which things happen; the order is just different.
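
To make this concrete, here’s a hypothetical query against the map above, using a helper inc/2 that I’ve made up:

inc(X, Y) :- Y is X + 1.

?- map(inc, [1, 2, 3], Ys).
Ys = [2, 3, 4].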

Mercury

Although Prolog and Mercury are similar, how they’re executed is quite different. Mercury compiles each mode of each predicate separately, so the generated code will execute in a particular order, and definitely will not allow free/free unifications. This means that the construction [Y | Ys] is performed after the recursive call to map and before the predicate can return. (We’re specifically talking about the in, in, out mode; a similar thing is true for the in, out, in mode.)

:- pred map(pred(A, B), list(A), list(B)).
:- mode map(pred(in, out) is det, in, out) is det.

map(_, [], []).
map(P, [X | Xs], [Y | Ys]) :-
    P(X, Y),
    map(P, Xs, Ys).

In this case, as in an eagerly evaluated functional language, map is not tail recursive due to the construction after the recursive call. What’s worth noting here is that this looks tail recursive but isn’t. It’s deceptive! The equivalent functional code is much more honest. To complicate this further, when last-call-modulo-constructor optimisation is enabled this becomes tail recursive; see below.

There is another difference from functional languages. Mercury supports returning more than one value.

foo(..., a, b).
foo(..., A, B) :-
    ...,
    foo(..., B, A).

This also looks tail recursive but isn’t: the outputs are swapped, so after the recursive call returns its results must be moved into place before this call can return. In this case the equivalent functional code would use a tuple and need to explicitly swap the cells.

-- Syntax may vary WRT the deconstruction of the tuple
foo ... = ...
foo ... = let (a, b) = foo(...)
            in (b, a)

How do I get it?

So now that you’ve realised that you may not have tail recursion, how can you get it? Obviously, if your program is crashing because it ran out of stack space, you could just increase the stack size or use a segmented stack. However this isn’t ideal: a lot of stack space may still be wasted. Ideally you can write the code differently and avoid unbounded stack growth altogether. Sometimes, however, you may only be able to reduce the stack usage; in that case it can still be a good idea to use a larger stack or a segmented stack.

Before we discuss each method/trick, let’s digress and introduce segmented stacks. Stack space is usually a fixed amount of memory per thread. Once you run out of space (overflowing the stack) that’s it, the thread (or program) crashes. Some language runtimes, for example Mercury (in the stseg grade) and Go, support a segmented stack. Each thread’s stack is divided into segments. Each segment has a fixed size, but new segments can be added as needed and removed when they’re no longer needed. The total size of the stack is now dynamic, but there is a small performance penalty for checking the stack size and for adding and removing segments. The check can be avoided using memory protection tricks, but this makes adding segments more expensive.

Some of the following tricks are the kinds of things that you can do as a developer, some are things that the compiler can do, and a few are things that both developers and compilers can do.

Accumulator introduction

By introducing an extra parameter, an accumulator, we can move information that we used to keep on the stack into a parameter (this often trades stack storage for heap storage, but not always).

map f []   = []
map f x:xs = (f x) : (map f xs)

becomes

map f xs = map' f xs []

map' _ []   acc = reverse acc
map' f x:xs acc = map' f xs ((f x):acc)

The accumulator accumulates the mapped items in reverse order, so the base case has to reverse them back.
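
For example, tracing map' over a three-element list shows the accumulator building up backwards:

map' f [1, 2, 3] []
  = map' f [2, 3] [f 1]
  = map' f [3]    [f 2, f 1]
  = map' f []     [f 3, f 2, f 1]
  = reverse [f 3, f 2, f 1]
  = [f 1, f 2, f 3]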

This looks a lot like foldl. That shouldn’t be surprising: folding an accumulator over a list is exactly what foldl does. We could save ourselves some typing and get more code reuse:

map f xs = reverse (foldl (\a, x -> (f x):a) xs [])

This is a pretty reasonable thing to do. You can be reasonably sure that the standard library of whatever language you’re using implements most of its primitives, especially foldl, to be tail recursive.

This is sometimes an optimisation that the compiler can perform.

Reduce the required stack depth

This is my favourite solution, because it’s one of those "obvious once someone tells you" things. It’s useful when introducing an accumulator isn’t enough, for example when the language you’re using doesn’t support tail call optimisation at all. It works by using an accumulator and two loops. The inner loop has a depth limit; once it has completed that many recursions it returns the incomplete work (the accumulator), and the outer loop makes a recursive call. This way all the stack frames used by the inner loop can be cleaned up periodically.

my_map :: (a -> b) -> [a] -> [b]
my_map f list = map' f [] list

map' :: (a -> b) -> [b] -> [a] -> [b]
map' f acc0 xs = let r = map'' 500 f acc0 xs in
                   case r of
                     Complete list       -> list
                     Incomplete acc list -> map' f acc list

data Result x y = Complete [y]
                | Incomplete [y] [x]

map'' :: Int -> (a -> b) -> [b] -> [a] -> Result a b
map'' 0 f acc  xs       = Incomplete acc xs
map'' n _ acc  []       = Complete $ reverse acc
map'' n f acc0 (x0:xs0) = let acc = (f x0):acc0 in
                              map'' (n-1) f acc xs0

If you find that your computation runs out of stack after about 1000 recursions, then try setting the inner loop’s limit to something like 500 iterations. This will allow it to handle a few less than 500 x 500 iterations, since each outer-loop frame left on the stack pays for 500 inner-loop iterations.

Unlike the other solutions, this does not make the program use constant stack space (unless the language supports LCO, in which case you didn’t need this trick anyway). It simply makes the computation unwind the stack periodically.
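
As a rough usage sketch (the list length and names here are made up), a call like the following keeps at most about 500 inner-loop frames live, plus one outer-loop frame per completed chunk, rather than one frame per element:

main :: IO ()
main = print (length (my_map (\x -> x + 1) [1 .. 200000]))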

Last call modulo constructor optimisation (LCMC)

This is something that the compiler can do, and, without specific language features, something only the compiler can do.

It works when the construction of a value is what’s blocking the recursive call; in the case of map that’s :.

map _ [] = []
map f x:xs = (f x):(map f xs)

is optimised to:

map _ [] = []
map f x:xs = {
                let cons = (f x):_
                map' f xs address_of(cons, field 1)
                return cons
             }

map' f [] result_ptr = {
                          *result_ptr := []
                       }
map' f x:xs result_ptr = {
                            let cons = (f x):_
                            *result_ptr := cons
                            map' f xs address_of(cons, field 1)
                         }

This works well and generates fast code. There is a minor problem: it makes it less obvious when tail recursion actually occurs, which makes it harder to predict the performance of your programs. This also changes as optimisation settings (and, in Mercury, grade features) are changed. In my opinion that’s a price worth paying in this case, and one that can be worked around to a degree by compiler warnings.

A brief note about C

As I understand it, the C standard does not address last call optimisation or make other guarantees about constant stack space usage. However some C compilers will optimise self and mutual recursion in specific circumstances. This article (from 2004) explains why LCO is difficult in C and how GCC optimises sibling calls: mutually tail recursive calls with matching parameter lists. There may be more recent work on these compilers to enable LCO more generally, but I’m not aware of it. For the purpose of this article, let’s assume that C and any other target languages do not support TCO/LCO at all.

Compiling to loops and state machines

When it comes to actually generating code for recursive calls in a target language that doesn’t support tail calls (such as C or Java), the simplest solution is to transform direct recursion into a loop. This can be combined with LCMC above.

foldl _ [] acc = acc
foldl f x:xs acc = foldl f xs (f acc x)

becomes something like

foldl(f, list, acc) {
  while (true) {
    switch (list) {
      case []:
        return acc;
      case x:xs:
        acc1 = f(acc, x);
        // Replace the input variables and loop.
        acc = acc1;
        list = xs;
    }
  }
}

But what happens when the code is mutually recursive? One solution, and the one we’ve chosen to implement for the Mercury high-level C backend, is to inline the bodies of the mutually recursive procedures into one procedure and use goto statements for tail calls.

scc_entry(input1, input2) {
  proc1:
    ...
    // use input1 and input2.
    ...
    if (exit) {
      return result;
    } else {
      // Provide new inputs and "call" proc2
      input1 = ...;
      input3 = ...;
      goto proc2;
    }
  proc2:
    ...
    // use input1 and input3.
    ...
    // Provide new inputs and "call" proc1
    input1 = ...;
    input2 = ...;
    goto proc1;
}

This is analogous to a loop; in fact one can use a switch (on a token) and a while loop to get the same behaviour.
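
Here’s a hedged sketch of that switch-on-a-token variant; the states, inputs and conditions are invented for illustration:

enum state { PROC1, PROC2, DONE };

int scc_entry(int input1, int input2)
{
  enum state s = PROC1;
  int result = 0;

  while (s != DONE) {
    switch (s) {
      case PROC1:
        // ... use input1 and input2 ...
        if (input1 <= 0) {        // exit condition
          result = input2;
          s = DONE;
        } else {
          input1 = input1 - 1;    // provide new inputs
          s = PROC2;              // "call" proc2
        }
        break;
      case PROC2:
        // ... use the new inputs ...
        input2 = input2 + 1;
        s = PROC1;                // "call" proc1
        break;
      case DONE:
        break;                    // not reached
    }
  }
  return result;
}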

There are also many other solutions. For example, mutual tail recursion can often be turned into direct tail recursion by inlining one function into another.

Trampoline

Here’s a fun one. It can be used either by the compiler when generating C or similar code, or in hand-written code in various languages (you may recognise the pattern). When used by the compiler it can be applied throughout the program, only to those calls that are (normally) tail calls, or even only to the tail calls that create mutual recursion. Like most optimisations, there are many variations and tuning decisions you could make.

Every call returns a pointer to a function; this represents the tail call that the function wants to make, but hasn’t made yet because it’s returning instead. A loop, called a trampoline or driver, performs the tail call, which may return another function pointer representing the next tail call.

struct arg_struct;   // Holds the parameters; one possible layout is sketched below.

typedef void* Func (struct arg_struct* args);

void driver (Func* entry, struct arg_struct* args)
{
  Func* fp = entry;
  while (fp != NULL) {
    fp = (Func*) (*fp)(args);
  }
}

One downside is that all the calls must have the same signature; you will have to do extra work to ensure that all functions have the same parameter list and return type, for example by passing parameters in a struct, as sketched below.
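
For example, here’s a hypothetical odd/even pair (mirroring the definitions from earlier) written against that driver. The arg_struct layout and function names are invented, and the casts between function pointers and void* are technically implementation-defined; a strictly portable version would encode the next function differently.

struct arg_struct {
  int n;        // input
  int result;   // output: 1 for true, 0 for false
};

static void* even_step(struct arg_struct* args);

static void* odd_step(struct arg_struct* args)
{
  if (args->n == 0) {
    args->result = 0;          // odd 0 = False
    return NULL;               // no more tail calls
  }
  args->n -= 1;
  return (void*) even_step;    // "tail call" to even
}

static void* even_step(struct arg_struct* args)
{
  if (args->n == 0) {
    args->result = 1;          // even 0 = True
    return NULL;
  }
  args->n -= 1;
  return (void*) odd_step;     // "tail call" to odd
}

int is_even(int n)
{
  struct arg_struct args;
  args.n = n;
  driver(even_step, &args);
  return args.result;
}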

If used throughout a program this could be a lot slower, due to the indirection and the non-idiomatic code involved.

Advanced FP developers, have you recognised the pattern? If you said "continuation passing style", give yourself a pat on the back: the returned function pointer is analogous to a continuation.

Like a lot of these techniques, this one is also used in Mercury: I believe it is used to compile nondet code in Mercury’s high-level C backend.

Conclusion

Most of these solutions are conceptually independent, even though they can work together, so there’s not much more to say about combining them.

If you’re new to functional programming, I hope that you can put the earlier content of this article, particularly the definitions and the discussion of when your code is tail recursive, to practical use.

If you’ve been doing functional programming for some time, I hope that I’ve impressed upon you that tail recursion is worth a revisit: there’s more to it than our "Intro to Functional Programming" resources suggest! Additionally, it is not always easy to know whether something is tail recursive or not.