Introduction
Recently on the ns-developers email list there was a posting about using coroutines to support blocking application models in the ns-3 network simulator. I find this an interesting topic: network applications are fairly often written in a sequential, or blocking, manner, where the application waits for possibly long-lived system calls to return. An example of this is calling read on a TCP socket, which waits until data has been received by TCP. Network simulators are generally discrete event simulators; they require that the application models return control to the event loop to allow the simulation to keep running. Therefore the application models must be written in some sort of non-blocking, event-driven manner.
Enter coroutines. With coroutines one can build a blocking API on top of the event-driven simulator and allow a simulation user to write models without needing to worry about the underlying event-driven nature. This also means porting real-world applications to the simulator becomes a lot simpler, as they can retain their existing design. Some languages, such as newer versions of Python, include support for coroutines as part of the language. ns-3 is written in C++, a language that has no built-in support for coroutines.
Coroutines for C/C++: the state of the art
There are many articles around pointing out the neat trick you can play with the switch statement in C to make a very efficient, and very basic, coroutine implementation. The best article around is probably the one by Simon Tatham, which explains the trick in detail. There is even a good library implementing this approach, designed for embedded devices where it is not appropriate to store a stack for each "thread". I'm not going to cover this approach in depth here; a minimal sketch is given below, but suffice to say that it has some major drawbacks, such as not preserving local variables and not being able to yield from within a subroutine/function call.
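To give a flavour of the switch trick, here is a minimal sketch of my own (not code from Tatham's article): a generator-style function that "yields" three values to successive callers. Note that every variable that must survive a yield has to be static, which is exactly the local-variable limitation mentioned above.
#include <stdio.h>

/* Returns 0, 1, 2 on successive calls, then -1. The static 'state' records
   where to resume; the case label inside the loop is the resume point. */
int
next_value (void)
{
  static int state = 0;   /* 0 = first call, 1 = resume after the yield */
  static int i;           /* must be static so it survives the return */

  switch (state)
    {
    case 0:
      for (i = 0; i < 3; i++)
        {
          state = 1;
          return i;        /* "yield" i to the caller */
    case 1:;               /* execution resumes here on the next call */
        }
    }
  state = 0;
  return -1;               /* finished */
}

int
main (void)
{
  for (int v = next_value (); v != -1; v = next_value ())
    printf ("%d\n", v);
  return 0;
}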
There are some libraries out there that instead use the idea of switching the process stack. This is semi-portable, as there are POSIX/SysV functions called makecontext, swapcontext, getcontext and setcontext which perform the work of swapping and storing the stack. This approach has more general applicability, as it does not have the serious limitations listed earlier. An alternate approach to these functions is to use setjmp and longjmp.
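As a concrete illustration (my own minimal sketch, not taken from any of the libraries mentioned below), here is how the ucontext functions can be used to switch between a main context and a single coroutine:
#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, coro_ctx;
static char coro_stack[64 * 1024];

static void
coro_func (void)
{
  printf ("coroutine: started\n");
  swapcontext (&coro_ctx, &main_ctx);   /* yield back to main */
  printf ("coroutine: resumed\n");
}                                       /* returning follows uc_link */

int
main (void)
{
  getcontext (&coro_ctx);
  coro_ctx.uc_stack.ss_sp = coro_stack;
  coro_ctx.uc_stack.ss_size = sizeof coro_stack;
  coro_ctx.uc_link = &main_ctx;         /* where to go when coro_func returns */
  makecontext (&coro_ctx, coro_func, 0);

  swapcontext (&main_ctx, &coro_ctx);   /* run the coroutine until it yields */
  printf ("main: coroutine yielded\n");
  swapcontext (&main_ctx, &coro_ctx);   /* resume it until it returns */
  printf ("main: coroutine finished\n");
  return 0;
}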
Alternatively, the assembler code to implement stack switching and saving is quite simple; many libraries have hand-crafted assembly code for various platforms. There are several libraries freely available that use one or more of these methods of switching stacks, such as CORO, libtask, GNU Portable Threads, and libCoroutine.
libCoroutine appears to be the only library with Win32 support, where it uses the Win32 "Fiber" functions. This library has wide-ranging platform support, and actually uses all of the methods mentioned above to find a working coroutine implementation for a given platform.
Rewind to makecontext et al.
Interestingly, the POSIX specification for makecontext and the other context functions has the following to say:
The obsolescent functions getcontext(), makecontext(), and swapcontext() can be replaced using POSIX threads functions.
The functions are now deprecated; the bottom of the specification makes this a little clearer:
IEEE Std 1003.1-2001/Cor 2-2004, item XSH/TC2/D6/57 is applied, obsoleting the functions and giving advice on the alternatives.
This may be the reason the libraries mentioned earlier support so many different methods; not all platforms will implement deprecated functions. However, the suggestion is interesting: "[use] POSIX threads functions [instead]."
A pthreads-based approach
A long time ago in my university honours project I implemented something similar using threads without realising the general applicability of my approach. I thought I would see how long it would take to implement coroutines using pthreads and how complex the solution really needs to be with this approach.
The implementation
So the implementation is in terms of the proposed ns-3 coroutine API, which calls its contexts FiberContext. The API we need to implement is as follows:
/* Called by the scheduler itself in the main thread. */
struct FiberContext *FiberContextNewBlank (void);

/* Called by the scheduler to start a new context running the specified
   function. This function will not run right away, but when switched to
   later. */
struct FiberContext *FiberContextNew (void (*callback) (void),
                                      uint32_t stackSize);

/* Switch to another context. */
void FiberContextSwitchTo (struct FiberContext *from,
                           struct FiberContext *to);

/* Cleanup a previously created context. */
void FiberContextDelete (struct FiberContext *context);
This C API is simple, yet enough to implement a scheduler on top of. I began by defining my struct FiberContext with the information I'll need:
struct FiberContext {
  pthread_t thread;        /* the thread backing this context */
  pthread_mutex_t mutex;   /* protects 'running' */
  pthread_cond_t condvar;  /* signalled when 'running' changes */
  int running;             /* 1 while this context should be executing */
  int thread_started;      /* has the backing thread been created yet? */
  size_t stack_size;       /* stack size to use when creating the thread */
  void (*func) (void);     /* function to run; NULL for the main context */
};
The idea of the code is that every context has a thread, mutex, and condition variable. When switching to another context, we signal the condition variable of that context, then wait on it ourselves until that context has finished its processing. This can be thought of as a classical producer-consumer design with condition variables, where the producer only ever produces one item of work at a time, and waits for the consumer to finish its work every time.
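To make the analogy concrete, here is a generic sketch of that one-item-at-a-time handshake (my own illustration, separate from the FiberContext code that follows); the producer publishes a single item and blocks until the consumer has dealt with it:
#include <pthread.h>

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t condvar = PTHREAD_COND_INITIALIZER;
static int work_ready = 0;

/* Producer: hand over one item of work, then wait until it is consumed. */
void
produce_one (void)
{
  pthread_mutex_lock (&mutex);
  work_ready = 1;
  pthread_cond_signal (&condvar);
  while (work_ready)
    pthread_cond_wait (&condvar, &mutex);
  pthread_mutex_unlock (&mutex);
}

/* Consumer: wait for one item of work, consume it, signal completion. */
void
consume_one (void)
{
  pthread_mutex_lock (&mutex);
  while (!work_ready)
    pthread_cond_wait (&condvar, &mutex);
  /* ... do the work here ... */
  work_ready = 0;
  pthread_cond_signal (&condvar);
  pthread_mutex_unlock (&mutex);
}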
The implementations of the functions to create the contexts are simple. The creation of the thread itself is deferred until required.
struct FiberContext *
FiberContextNewBlank (void)
{
  struct FiberContext *context;
  context = (struct FiberContext *) malloc (sizeof (*context));
  pthread_mutex_init (&context->mutex, NULL);
  pthread_cond_init (&context->condvar, NULL);
  context->running = 1;
  context->thread_started = 1;
  context->func = NULL;
  context->stack_size = 0;
  return context;
}
struct FiberContext *
FiberContextNew (void (*callback) (void), uint32_t stackSize)
{
  struct FiberContext *context;
  context = FiberContextNewBlank ();
  context->func = callback;
  context->running = 0;
  context->thread_started = 0;
  context->stack_size = stackSize;
  return context;
}
Switching between contexts is where the magic comes in. To implement this I've separated out the two cases: the scheduler starting up (or resuming) a context, and a context returning control to the scheduler. I call these two operations wakeup and yield respectively. Hence the code becomes:
void
FiberContextSwitchTo (struct FiberContext *from,
                      struct FiberContext *to)
{
  if (from->func != NULL)
    {
      /* We're in an application thread, and we know our mutexes are locked */
      FiberContextYield (from);
    }
  else
    {
      /* We're the controller (main) thread */
      FiberContextWakeup (to);
    }
}
The yield case is the simplest, so let's look at that first:
static void
FiberContextYield (struct FiberContext *context)
{
  context->running = 0;
  pthread_cond_signal (&context->condvar);
  while (!context->running)
    {
      pthread_cond_wait (&context->condvar, &context->mutex);
    }
}
This function is called with the lock for the context already held (this will become evident later). The code should be fairly readable: mark ourselves as not running, signal our condition variable to wake up the main scheduler thread, and go to sleep, waiting on our condition variable. Note that the while loop is required because of the possibility of spurious wakeups with condition variables. This is explained in books on concurrent programming and on the Internet, for example in IBM's excellent multi-threaded programming documentation.
The wakeup function is somewhat more complex. It needs to signal the thread it is waking up, so that thread can exit from the wait loop shown above in FiberContextYield. It then needs to wait on the condition variable with the opposite predicate to the one above (this is the same design as is often used in producer-consumer situations). The code is as follows:
static void
FiberContextWakeup (struct FiberContext *context)
{
  pthread_mutex_lock (&context->mutex);
  context->running = 1;
  if (context->thread_started)
    {
      pthread_cond_signal (&context->condvar);
    }
  else
    {
      FiberContextStartThread (context);
    }
  while (context->running)
    {
      pthread_cond_wait (&context->condvar, &context->mutex);
    }
  pthread_mutex_unlock (&context->mutex);
}
In this case the lock is not already held: that is only true in the non-main threads. So the lock is acquired and released at the start and end of the function respectively. In addition to what is described above, we also create the thread in this function if it has not already been created. This is because the FiberContext API separates creation of the FiberContext (FiberContextNew) from running it (FiberContextSwitchTo). This is not the case in pthreads: when a thread is created with pthread_create, it begins running right away.
The code to create the new thread is as follows:
static void
FiberContextStartThread (struct FiberContext *context)
{
  int error;
  pthread_attr_t attr;
  error = pthread_attr_init (&attr);
  assert (error == 0);
  error = pthread_attr_setstacksize (&attr, context->stack_size);
  assert (error == 0);
  error = pthread_create (&context->thread, &attr, &FiberContextRun,
                          (void *) context);
  assert (error == 0);
  error = pthread_attr_destroy (&attr);
  assert (error == 0);
  context->thread_started = 1;
}
This code is fairly straightforward. To allow many threads, it is important to set the stack size. The default stack size can be quite large (e.g., 8MB, depending on your system's set-up), which restricts the number of threads you can create. pthread_create is given another FiberContext function to run: FiberContextRun. It is detailed below:
static void *
FiberContextRun (void *arg)
{
  struct FiberContext *context = (struct FiberContext *) arg;
  pthread_mutex_lock (&context->mutex);
  context->func ();
  context->running = 0;
  pthread_cond_signal (&context->condvar);
  pthread_mutex_unlock (&context->mutex);
  pthread_detach (context->thread);
  return 0;
}
We can see in this function why we assume the lock is held in the non-main threads: all non-main threads run this thread function, which holds the mutex for the duration of the function specified in the FiberContext structure. It is important to signal once the function returns, as there is no guarantee the function will explicitly yield before it finishes.
The only other function in the implementation is FiberContextDelete, which is trivial:
void
FiberContextDelete (struct FiberContext *context)
{
  pthread_mutex_destroy (&context->mutex);
  pthread_cond_destroy (&context->condvar);
  free (context);
}
Together these functions implement coroutines with pthreads. The API being implemented is very similar to, but simpler than, the makecontext et al. API, so it may be of general use outside of ns-3.
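As an illustration of how the API fits together, a minimal "scheduler" might drive a single worker context as follows. This is my own hypothetical example, not part of the ns-3 code; the names g_main, g_worker and WorkerFunc are made up.
#include <stdio.h>

static struct FiberContext *g_main;
static struct FiberContext *g_worker;

static void
WorkerFunc (void)
{
  printf ("worker: step 1\n");
  FiberContextSwitchTo (g_worker, g_main);   /* yield back to the scheduler */
  printf ("worker: step 2\n");
}                                            /* returning finishes the context */

int
main (void)
{
  g_main = FiberContextNewBlank ();
  g_worker = FiberContextNew (&WorkerFunc, 1 << 17);

  FiberContextSwitchTo (g_main, g_worker);   /* run the worker until it yields */
  printf ("main: worker yielded\n");
  FiberContextSwitchTo (g_main, g_worker);   /* resume it until it returns */
  printf ("main: worker finished\n");

  FiberContextDelete (g_worker);
  FiberContextDelete (g_main);
  return 0;
}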
The limitations
The main benefit of using a pthreads-based coroutine implementation is that it is portable to any system that has pthreads support. Unfortunately this is not as general as one might think; many of the coroutine libraries are written for systems that specifically don't have threading implementations.
Intuitively, performance with pthreads will be lower: to context-switch between threads, the kernel needs to be involved. Functions like swapcontext only need to switch the stack pointer and load/store the registers, which is probably a very quick set of operations. It's not obvious how different the performance characteristics are without running some tests though, so that is what I did.
I compared an implementation of the FiberContext API introduced in this article that uses the context functions against my own implementation using the pthreads functions. I made each thread/context/coroutine run just a very simple function that switches between contexts a specified number of times:
void MyContextFunc() {
  for (int counter = 0; counter < num_counters; counter++) {
    Yield();
  }
}
The test program is given a number of contexts to create, and a number of times the function above will iterate through the loop (called counter here). Yield() above returns control to the scheduler, which executes the contexts in a round-robin fashion. I then ran this over a series of inputs and used the time command to time the total execution of the test program. Each set of input parameters is run 5 times, and the median is recorded (not the most scientific, but probably good enough to show general trends). The computer used to run the experiments is an AMD Athlon 2100+ (1730MHz) with 1GB of RAM running Ubuntu Gutsy 7.10. The output of getconf GNU_LIBPTHREAD_VERSION is NPTL 2.6.1.
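For completeness, here is a rough sketch of what such a round-robin driver looks like on top of the pthreads-based API above. This is my own reconstruction of the test program, not the original code; NUM_CONTEXTS, NUM_COUNTERS and the g_* names are illustrative only.
#include <stdio.h>

#define NUM_CONTEXTS 100
#define NUM_COUNTERS 10

static struct FiberContext *g_main;
static struct FiberContext *g_contexts[NUM_CONTEXTS];
static struct FiberContext *g_current;   /* context currently being run */

static void
Yield (void)
{
  FiberContextSwitchTo (g_current, g_main);
}

static void
MyContextFunc (void)   /* as above, but using the Yield wrapper */
{
  for (int counter = 0; counter < NUM_COUNTERS; counter++)
    Yield ();
}

int
main (void)
{
  g_main = FiberContextNewBlank ();
  for (int i = 0; i < NUM_CONTEXTS; i++)
    g_contexts[i] = FiberContextNew (&MyContextFunc, 1 << 17);

  /* Round-robin: each pass wakes every context once; NUM_COUNTERS passes
     cover the yields, and one extra pass lets each function return. */
  for (int round = 0; round <= NUM_COUNTERS; round++)
    for (int i = 0; i < NUM_CONTEXTS; i++)
      {
        g_current = g_contexts[i];
        FiberContextSwitchTo (g_main, g_current);
      }

  for (int i = 0; i < NUM_CONTEXTS; i++)
    FiberContextDelete (g_contexts[i]);
  FiberContextDelete (g_main);
  return 0;
}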
The graph shows a large performance difference between the two implementations. We consistently see that the context-based approach is 20x faster or more as the number of contexts is increased. Increasing counter appears simply to scale the graphs, as the next two graphs show.
These two graphs tell a slightly different tale. It appears that as counter goes up, the difference stays approximately the same. We can verify this by fixing the number of contexts at some arbitrary number and increasing counter. This is shown in the following graph.
In this case there appears to be a fairly constant 7x difference in performance. Unsurprisingly, the context-based approach is once again much faster.
Conclusions
This article has given a basic review of current C/C++ coroutine implementations, presented a new implementation based on pthreads, and shown some performance results comparing the new pthreads implementation to an implementation based on the POSIX makecontext, getcontext and swapcontext functions. The pthreads implementation is much slower than the other; it seems it is only of use as a portable alternative to the POSIX functions, which have been deprecated and are not implemented on all systems.
As an interesting side note, I'm not sure how tools such as gdb and valgrind handle applications that switch their stacks. It might be that using threads actually makes things easier to debug in this case (as the tools already have support for threads). There's a sentence I never thought I would write!
2 comments:
I realize this is an old blog post, but I found my way here via Google, so I thought I'd leave a comment for anyone else who ends up here.
It's very likely that even your makecontext-based implementation was way off the performance one would like for a cooperative threading implementation. The principal culprit is that most (all?) swapcontext implementations make syscalls for saving and restoring the signal mask. See the following for more details.
http://rethinkdb.com/blog/making-coroutines-fast/
Ben
Thanks for this article, it's really useful. One thing I've noticed, and maybe I'm misunderstanding, but if you destroy a coroutine while it is yielded, won't the thread stay alive (waiting) until the end of the application's lifetime?
In fact, would destroying the condition variable cause undefined behaviour in this case?