Coroutines in C/C++: beyond Duff's Device Saturday, April 5, 2008

Introduction

Recently on the ns-developers email list there was a posting about using coroutines to support blocking applications models in the ns-3 network simulator. I find this an interesting topic: network applications fairly often are written in a sequential, or blocking, manner, where the application waits for possibly long-lived system calls to return. An example of this is calling read on a TCP socket, which waits until the data is received by TCP. Network simulators are generally discrete event simulators; they require that the application models return control to the event loop to allow the simulation to keep running. Therefore the application models must be written in some sort of non-blocking, event driven manner.

Enter coroutines. With coroutines one can build a blocking API on top of the event driven simulator and allow a simulation user to write models without needing to worry about the underlying event-driven nature. This also means porting real world applications to the simulator is made a lot simpler, as they can retain their existing design. Some languages, such as newer versions of Python, include support for coroutines as part of the language. ns-3 is written in C++, a language that has no built-in support for coroutines.

Coroutines for C/C++: the state of the art

There are many articles around pointing out the neat trick you can play with the switch statement in C to make a very efficient, and very basic, co-routine implementation. The best article around is probably this one by Simon Tantham, which explains the trick in detail. There is even a good library implementing this approach, designed for embedded devices where it is not appropriate to store the stack for each "thread". I'm not going to cover this approach here; suffice to say that it has some major drawbacks, such as not handling local variables and not being able to yield from within a subroutine/function call.

There are some libraries out there that instead use the idea of switching the process stack. This is semi-portable, as there are POSIX/SySV functions called makecontext, swapcontext, getcontext and setcontext which perform the functions of swapping and storing the stack. This approach has more general applicability, as it does not have the serious limitations listed earlier. An alternate approach to these functions is to use setjmp and longjmp. Alternatively, the assembler code to implement stack switching and saving is quite simple; many libraries have hand-crafted assembly code for various platforms. There are several libraries freely available that use one or more these methods of switching stacks, such as "CORO" (man page, download), libtask, GNU Portable Threads, and libCoroutine.

libCoroutine appears to be the only library with Win32 support; where it uses the Win32 "Fiber" functions. This library has wide ranging platform support, and actually uses all of the methods mentioned above to find a working coroutine implementation for a platform.

Rewind to makecontext, et. al.

Interestingly, the POSIX specification for makecontext and the other context functions has the following to say:

The obsolescent functions getcontext(), makecontext(), and swapcontext() can be replaced using POSIX threads functions.

The functions are now deprecated; at the bottom of the specification it makes this a little clearer:

IEEE Std 1003.1-2001/Cor 2-2004, item XSH/TC2/D6/57 is applied, obsoleting the functions and giving advice on the alternatives.

This may be the reason for all the methods used by the libraries earlier; not all platforms will implement deprecated functions. However, the suggestion is interesting: "[use] POSIX threads functions [instead]."

A pthreads based approach

A long time ago in my university honours project I implemented something similar using threads without realising the general applicability of my approach. I thought I would see how long it would take to implement coroutines using pthreads and how complex the solution really needs to be with this approach.

The implementation

So the implementation is in terms of the proposed ns-3 coroutines API which calls the contexts FiberContext. The API we need to implement is as follows:

/* Called by the scheduler itself in the main thread. */
struct FiberContext *FiberContextNewBlank (void);
/* Called by the scheduler to start a new context running the specified
   function. This function will not run right away, but when switched to
   later. */
struct FiberContext *FiberContextNew (void (*callback) (void), 
                                      uint32_t stackSize);
/* Switch to another context. */
void FiberContextSwitchTo (struct FiberContext *from,
                           struct FiberContext *to);
/* Cleanup a previously created context. */
void FiberContextDelete (struct FiberContext *context);

This C API is simple and enough to implement a scheduler on top of. I began by defining my struct FiberContext with the information I'll need:

struct FiberContext {
  pthread_t thread;
  pthread_mutex_t mutex;
  pthread_cond_t condvar;
  int running;
  int thread_started;
  size_t stack_size;
  void (*func) ();
};

The idea of the code is that every context has a thread, mutex, and condition variable. When switching to another context, we signal the condition variable of that context, then wait on it ourselves until that context has finished its processing. This can be thought of as a classical producer-consumer design with condition variables, where the producer only ever produces one item of work at a time, and waits for the consumer to finish its work every time.

The implementations of the functions to create the contexts are simple. The creation of the thread itself is deferred until required.

struct FiberContext *
FiberContextNewBlank (void)
{
  struct FiberContext *context;
  context = (struct FiberContext *)malloc (sizeof (*context));
  pthread_mutex_init (&context->mutex, NULL);
  pthread_cond_init (&context->condvar, NULL);
  context->running = 1;
  context->thread_started = 1;
  context->func = NULL;
  context->stack_size = 0;
  return context;
}

struct FiberContext *
FiberContextNew (void (*callback) (void), uint32_t stackSize)
{
  struct FiberContext *context;
  context = FiberContextNewBlank ();
  context->func = callback;
  context->running = 0;
  context->thread_started = 0;
  context->stack_size = stackSize;
  return context;
}

Switching between contexts is where the magic comes in. To implement this I've separated out the functionality of when the scheduler is starting up a context, and when a context is returning control to the scheduler. I call these two operations wakeup and yield respectively. Hence the code becomes:

void 
FiberContextSwitchTo (struct FiberContext *from, 
                      struct FiberContext *to)
{
  if (from->func != NULL) 
    {
      /* We're in an application thread, and we know our mutexes are locked */
      FiberContextYield (from);
    } 
  else 
    {
      /* We're the controller (main) thread */
      FiberContextWakeup (to);
    }
}

The yield case is the simplest so lets look at that:

static void
FiberContextYield (struct FiberContext *context)
{
  context->running = 0;
  pthread_cond_signal (&context->condvar);
  while (!context->running)
    {
      pthread_cond_wait (&context->condvar, &context->mutex);
    }
}

This function is called with the lock for the context already held (this will become evident later). The code should be fairly readable: mark ourselves as not running, signal our condition variable to wake up the main scheduler thread, and go to sleep, waiting on our condition variable. Note that the while loop is required because there is the possibility of spurious wakeups with condition variables. This is explained in books on concurrent programming and on the Internet in websites such as IBM's excellent multi-threaded programming documentation.

The wakeup function is somewhat more complex. It needs to signal the thread it is waking up, so the thread can exit from the wait loop shown above in FiberContextYield. It then needs to wait on the condition variable with the opposite predicate as above (this is the same design as is often used in producer-consumer situations). Code is as follows:

static void
FiberContextWakeup (struct FiberContext *context)
{
  int error;
  pthread_mutex_lock (&context->mutex);
  context->running = 1;

  if (context->thread_started)
    {
      pthread_cond_signal (&context->condvar);
    }
  else
    {
      FiberContextStartThread (context);
    }

  while (context->running)
    {
      pthread_cond_wait (&context->condvar, &context->mutex);
    }

  pthread_mutex_unlock (&context->mutex);
}

In this case the locks are not already held: that is only true in the non-main threads. So a lock is acquired and released at the start and end of the function respectively. In addition to what is described above, we also create the thread in this function if it is not already created. This design is because the FiberContext API breaks out creation of the FiberContext (FiberContextNew) and running of the FiberContext (FiberContextSwitchTo). This is not the case in pthreads: when a thread is created with pthread_create, it begins running right away.

The code to create the new thread is as follows:

static void
FiberContextStartThread (struct FiberConext *context)
{
  int error;
  pthread_attr_t attr;
  error = pthread_attr_init (&attr);
  assert (error == 0);
  error = pthread_attr_setstacksize (&attr, context->stack_size);
  assert (error == 0);
  error = pthread_create (&context->thread, &attr, &FiberContextRun,
      (void*) context);
  assert (error == 0);
  error = pthread_attr_destroy (&attr);
  assert (error == 0);
  context->thread_started = 1;
}

This code is fairly straightforward. To allow many threads, it is important to set the stack size. The default stack size can be quite large (e.g., 8MB, it depends on your system's set-up) which restricts the amount of threads you can create. pthread_create is specified to run another FiberContext function; FiberContextRun. It is detailed blow:

static void *
FiberContextRun (void *arg)
{
  struct FiberContext *context = (struct FiberContext *) arg;
  pthread_mutex_lock (&context->mutex);
  context->func();
  context->running = 0;
  pthread_cond_signal (&context->condvar);
  pthread_mutex_unlock (&context->mutex);
  pthread_detach(context->thread);
  return 0;
}

We can see in this function why we assume the lock is held in the non-main thread: all non-main threads will run this thread function, which holds the mutex for the duration of the thread function specified in the FiberContext structure. It is important to signal once the function returns, as there is no guarantee the function will explicitly yield before it finished.

The only other function in the implementation is FiberContextDelete which is trivial:

void 
FiberContextDelete (struct FiberContext *context)
{
  pthread_mutex_destroy (&context->mutex);
  pthread_cond_destroy (&context->condvar);
  free (context);
}

Together these functions implement coroutines with pthreads. The API being implemented is very similar, but simpler than, the makecontext et. al. API, so it may be of general use outside of ns-3.

The limitations

The main benefit of using a pthreads-based coroutine implementation is that it is portable to any system that has pthreads support. Unfortunately this is not as general as one might think; many of the coroutine libraries are written for systems that specifically don't have threading implementations.

Intuitively, performance with pthreads will be lower: to context-switch between threads, the kernel needs to be involved. Functions like swapcontext only need to switch the stack pointer and load/store the registers, which is probably a very quick set of operations. It's not obvious how different the performance characteristics are without running some tests though, so that is what I did.

I tested the FiberContext API introduced in this article that uses the context functions with my own implementation using the pthreads functions. I made each thread/context/coroutine run just a very simple function that switches between contexts a specified number of times:

void MyContextFunc() {
    for (int counter = 0; counter < num_counters; counter++) {
        Yield();
    }
}

The test program is given a number of contexts to create, and a number of times the function above will iterate through the loop (called counter here). Yield() above will return control to the scheduler, which executes the contexts in a round-robin fashion. I then ran this over a series of input and used the time command to time the total execution of the function. Each set of input parameters is run 5 times, and the median is recorded (not the most scientific, but probably good enough to show general trends). The computer used to run the experiments is an AMD Athlon 2100+ (1730MHz) with 1GB of RAM running Ubuntu Gutsy 7.10. The output of getconf GNU_LIBPTHREAD_VERSION is NPTL 2.6.1.

The graph show a large performance difference between the two results. We consistently see that the context-based approach is 20x faster or more as the number of contexts is increased. Increasing the counter above appears to scale the graphs as the next two graphs show.

These two graphs tell a slightly different tale. It appears as counter goes up, the difference stays approximately the same. We can verify this by fixing the number of contexts at some arbitrary number and increasing counter. This is shown in the following graph.

In this case there appears to be a fairly constant 7x difference in performance. Unsurprisingly, the context-based approach is one again much faster.

Conclusions

This article shows a basic review of current C/C++ coroutine implementations, a new implementation based on pthreads, and shows some performance results comparing the new pthreads implementation to an implementation based on the POSIX makecontext, getcontext and swapcontext functions. The pthreads implementation is much slower than the other, it seems it is only of use as a portable alternative to using the POSIX functions which have been deprecated and are not implemented on all systems.

As an interesting sidenote, I'm not sure how debuggers such as gdb and valgrind handle applications that switch their stacks. It might be that using threads actually makes things easier to debug in this case (as the tools already have support for threads). There's a sentence I never thought I would write!

2 comments:

Unknown said...

I realize this is an old blog post, but I found my way here via Google, so I thought I'd leave a comment for anyone else who ends up here.

It's very likely that even your makecontext-based implementation was way off the performance one would like for a cooperative threading implementation. The principal culprit is that most (all?) swapcontext implementations make syscalls for saving and restoring the signal mask. See the following for more details.

http://rethinkdb.com/blog/making-coroutines-fast/

Ben

Kazade said...

Thanks fro this article, it's really useful. One thing I've noticed, and maybe I'm misunderstanding, but if you destroy a coroutine while it is yielded, won't the thread say alive (waiting) until the end of the application's lifetime?

In fact, would destroying the condition variable cause undefined behaviour in this case?