Implementing a User Level Thread Library (28 Days of Hacking: Day 17)

So today is one of the larger projects I will cover: implementing a user level thread library with preemption and synchronization support, as well as basic true parallelism for improved performance. First, let’s think about what structures we will need to implement. We will need a struct for each thread which can act as an entry in the Thread Control Block (TCB). In here, we need to store a pointer to the stack space for the user thread and a unique thread id which our library will assign. We also need some way to track the status of the thread. Right now this is not as critical, since before we add preemption we really only have two states: running or waiting. Next, we need somewhere to store the thread structs. One way to do it is simply creating an array and either dynamically resizing it or defining a maximum number of threads our library will support. I chose to do it a little differently and used a linked list as the core data structure, then implemented a queue around it.
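
Concretely, the structs might look something like this. This is a minimal sketch: the names (Thread, Node, the status values) are my own choices, and I’ve included a ucontext_t field up front since we will need it for context switching later.

```c
#include <ucontext.h>

typedef enum { RUNNING, WAITING } Status;  /* BLOCKED gets added later, with join */

typedef struct Thread {
    int        tid;      /* unique id assigned by the library */
    Status     status;
    void      *stack;    /* malloc'd stack space for this user thread */
    ucontext_t context;  /* saved execution state, used for context switches */
} Thread;

/* Linked list node; the queue (push at tail, pop at head) is built around it. */
typedef struct Node {
    Thread      *thread;
    struct Node *next;
} Node;
```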

Once we have that, we can write push and pop functions for the linked list, as well as a function to return the length of the linked list.
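
Here is a sketch of those operations, assuming the Node type above plus a small Queue struct that tracks head, tail, and a length counter:

```c
#include <stdlib.h>

typedef struct Queue { Node *head, *tail; int len; } Queue;

/* Enqueue at the tail. */
void push(Queue *q, Thread *t) {
    Node *n = malloc(sizeof(Node));
    n->thread = t;
    n->next   = NULL;
    if (q->tail) q->tail->next = n;
    else         q->head = n;
    q->tail = n;
    q->len++;
}

/* Dequeue from the head; NULL if the queue is empty. */
Thread *pop(Queue *q) {
    if (!q->head) return NULL;
    Node   *n = q->head;
    Thread *t = n->thread;
    q->head = n->next;
    if (!q->head) q->tail = NULL;
    free(n);
    q->len--;
    return t;
}

int length(Queue *q) { return q->len; }
```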

Now let’s start looking at the core of the user level thread library. First, it is pretty critical to have the ability to create a thread. We will model the signature of the function after pthread (POSIX Threads), which looks like int pthread_create(pthread_t *thread, const pthread_attr_t *attr, void *(*start_routine) (void *), void *arg). To do this, we can just malloc the thread struct and set its context to a wrapper function which will pass the arguments into the start routine. Then we push this onto the TCB. If this is the first time a thread is being created, we want to initialize the library, which also means adopting the calling (parent) thread as a running thread. We want the parent thread to know the id of the thread it created, so we pass it back through the pthread_t *thread argument. Finally, if all of the above is successful, we return 0; otherwise we return a nonzero value.
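
Here is roughly what create could look like using ucontext(3). Everything named here (uthread_create, wrapper, init_library, next_tid, STACK_SIZE, and the start/arg fields I’m assuming were added to the Thread struct so the wrapper can reach them) is an illustrative choice, not a fixed API:

```c
#include <stdlib.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)

static Queue   tcb;            /* the ready queue from earlier */
static int     next_tid = 1;
static Thread *current;        /* set by the scheduler before each switch */

void uthread_exit(void *retval);   /* defined later */
static void init_library(void);    /* assumed: adopts the caller as a running thread */

static void wrapper(void) {
    /* Run the start routine, then exit in case it never calls exit itself. */
    void *ret = current->start(current->arg);
    uthread_exit(ret);
}

int uthread_create(int *tid, void *(*start)(void *), void *arg) {
    if (current == NULL)
        init_library();        /* first create: set up the library */

    Thread *t = malloc(sizeof(Thread));
    if (t == NULL) return -1;
    t->stack = malloc(STACK_SIZE);
    if (t->stack == NULL) { free(t); return -1; }

    t->tid    = next_tid++;
    t->status = WAITING;
    t->start  = start;
    t->arg    = arg;

    getcontext(&t->context);
    t->context.uc_stack.ss_sp   = t->stack;
    t->context.uc_stack.ss_size = STACK_SIZE;
    t->context.uc_link          = NULL;   /* wrapper never returns; it calls exit */
    makecontext(&t->context, wrapper, 0);

    push(&tcb, t);             /* onto the TCB */
    *tid = t->tid;             /* parent learns the child's id */
    return 0;
}
```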

Now we want to have the ability to join on a thread, which means waiting until it has finished executing before continuing. It should also accept a void** through which we can pass back what the user thread exited with. To implement join, we must add a third status to the Thread struct: blocked. When a thread joins on another thread, we don’t want it to run again until the thread it joined on has finished. Thus it also makes sense to record, when a thread blocks, which thread it is blocked on, so we know when to change its status back to waiting. Now we just have to add some checks to avoid deadlocks, such as making sure that when a join happens we always have at least one other thread to execute, and that the thread being joined on exists and is in the waiting state (or running, once we move to true parallelism). All we have to do then is add a little cleanup code, called by our wrapper function, to ensure we change the status of the joined threads.
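
A sketch of join, assuming the Thread struct has gained blocked_on and retval fields, and a hypothetical find_thread() helper that searches the TCB by id:

```c
int uthread_join(int tid, void **retval) {
    Thread *target = find_thread(tid);
    if (target == NULL) return -1;      /* thread being joined on must exist */
    if (length(&tcb) == 0) return -1;   /* nothing else runnable: would deadlock */

    current->status     = BLOCKED;
    current->blocked_on = tid;          /* the cleanup code uses this to wake us */
    uthread_yield();                    /* we resume only after the target exits */

    if (retval != NULL)
        *retval = target->retval;       /* exit keeps the struct alive for this read */
    return 0;
}
```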

When a thread joins, it needs to give up the CPU for the other threads, so we need to implement a yield function. The yield function just needs to push the running thread back into the linked list, then get the next runnable thread and perform a context switch.
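
A minimal yield might look like this, assuming a pop_runnable() helper that skips over blocked entries in the queue:

```c
void uthread_yield(void) {
    Thread *prev = current;
    if (prev->status == RUNNING)      /* a joining thread stays BLOCKED */
        prev->status = WAITING;
    push(&tcb, prev);                 /* running thread goes to the back */
    current = pop_runnable(&tcb);     /* next WAITING thread, round robin */
    current->status = RUNNING;
    if (current != prev)
        swapcontext(&prev->context, &current->context);
}
```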

So back to joins: an exiting thread should call the exit function in the user level thread library, but we should still handle the case where it doesn’t, by having our wrapper function call exit when the start routine returns. The exit function needs to accept a void*, which enables the user thread to pass information back to the thread that joined on it, and then perform the cleanup routine on joined threads.
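
A sketch of exit under the same assumptions; wake_joiners() is the hypothetical cleanup routine that flips threads blocked on this tid back to waiting, and EXITED is an assumed fourth status:

```c
/* An exited thread is never scheduled again, but its struct stays
   allocated so the joiner can still read retval. */
void uthread_exit(void *retval) {
    current->retval = retval;       /* handed back through join's void** */
    current->status = EXITED;
    wake_joiners(current->tid);     /* cleanup routine on joined threads */
    current = pop_runnable(&tcb);   /* pick the next thread... */
    current->status = RUNNING;
    setcontext(&current->context);  /* ...and never come back here */
}
```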

Implementing the pthread_self() function is quite trivial. We just need to grab the thread id from the Thread struct and return it.
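
With the current pointer from the sketches above, it really is a one-liner:

```c
int uthread_self(void) {
    return current->tid;   /* id stored in the Thread struct at create time */
}
```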

Having a threading library is great, but we should probably extend it with preemption so no thread hogs the CPU. To do this, we can create a POSIX timer whose sigevent delivers SIGUSR1, and write some simple handler code which ultimately just calls our yield function. We use the itimerspec struct to set the scheduler’s quantum. Because the TCB is a queue, we get a simple round-robin scheduler out of this.
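
Here is roughly what the timer setup could look like. QUANTUM_MS is an arbitrary illustrative value, yielding out of a signal handler is only tolerable here because the library is assumed to mask SIGUSR1 around its own critical regions, and on older glibc you need to link with -lrt:

```c
#include <signal.h>
#include <time.h>

#define QUANTUM_MS 10

static void preempt_handler(int sig) {
    (void)sig;
    uthread_yield();              /* hand the CPU to the next thread */
}

void start_preemption(void) {
    struct sigaction sa = {0};
    sa.sa_handler = preempt_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, NULL);

    struct sigevent sev = {0};
    sev.sigev_notify = SIGEV_SIGNAL;
    sev.sigev_signo  = SIGUSR1;   /* the timer delivers SIGUSR1 */

    timer_t timerid;
    timer_create(CLOCK_REALTIME, &sev, &timerid);

    /* Fire every quantum, starting one quantum from now. */
    struct itimerspec its = {0};
    its.it_value.tv_nsec    = QUANTUM_MS * 1000000L;
    its.it_interval.tv_nsec = QUANTUM_MS * 1000000L;
    timer_settime(timerid, 0, &its, NULL);
}
```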

But since we are now running multithreaded, we need to worry about synchronization! First, let us implement a mutex. We can use GCC built-ins like __sync_lock_test_and_set and __sync_lock_release to realize this without diving too deep into mutex implementation. The trivial implementation is a spin lock: we just loop on the test-and-set instruction and break once we have acquired the lock. While this can be implemented in a single line of code, it is not ideal in our user level thread library, because the spinning thread prevents the thread that holds the lock from running until the spinner is preempted, wasting an entire quantum. So we can improve this and add a blocking implementation. We test the lock, and if we acquire it we just return. Otherwise, we block the current user thread and tag it with the mutex it is blocking on. This is a critical region: we have to make sure the mutex doesn’t get released between the trylock call and the thread formally being blocked, which we can ensure by masking the preemption signal there.
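
Sketches of both lock flavors, where the Mutex type and the block_on_mutex() helper are assumptions of this particular design:

```c
typedef struct Mutex { int locked; } Mutex;

/* Spin lock: one line, but the spinner burns its whole quantum. */
void mutex_spinlock(Mutex *m) {
    while (__sync_lock_test_and_set(&m->locked, 1)) ;
}

/* Blocking lock: acquire fast if free, otherwise block on this mutex.
   Preemption must be masked in the loop body so the mutex cannot be
   released between the test-and-set and the thread formally blocking. */
void mutex_lock(Mutex *m) {
    while (__sync_lock_test_and_set(&m->locked, 1)) {
        current->status = BLOCKED;
        block_on_mutex(current, m);   /* tag the thread with the mutex */
        uthread_yield();              /* resumes when unlock wakes us */
    }
}
```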

To unlock, we just need to release the mutex, then run a similar cleanup routine to change the status of blocked threads back to waiting if they blocked while trying to lock this mutex. A quick note: we only want to move one thread from blocked to waiting. If we wake up more than one, we will have a problem!
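
And the matching unlock, assuming a wake_one_blocked_on() helper that moves exactly one blocked thread back to waiting:

```c
void mutex_unlock(Mutex *m) {
    __sync_lock_release(&m->locked);  /* release the lock */
    wake_one_blocked_on(m);           /* wake one waiter, never more */
}
```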

Once we have mutexes implemented, it is fairly straightforward to implement condition variables on top: wait releases the mutex and blocks the thread on the condition, signal moves one blocked thread back to waiting, and the woken thread reacquires the mutex before returning.
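
A minimal sketch of what wait and signal could look like in this design (Cond and the blocked-list helpers are again assumptions):

```c
typedef struct Cond { int dummy; } Cond;  /* identity is all we need here */

void cond_wait(Cond *c, Mutex *m) {
    /* With preemption masked: release the mutex and block on c atomically. */
    current->status = BLOCKED;
    block_on_cond(current, c);
    mutex_unlock(m);
    uthread_yield();
    mutex_lock(m);     /* reacquire before returning, as pthreads does */
}

void cond_signal(Cond *c) {
    wake_one_blocked_on_cond(c);  /* move one waiter back to WAITING */
}
```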

That gives a halfway decent user level thread library, but if we really want to speed things up, we want to implement it on top of kernel threads. Before we can consider doing this, we have to make sure the user level thread library is race condition free (every access to the linked list is a potential race). Once we have this, the simplest way to extend to true parallelism is to create another struct which holds our kernel threads, along with the user thread (if any) each one is currently executing. In our init function, we can create our bank of kernel threads. We want these threads to run as long as there are still user threads that haven’t exited, which means they are either in the TCB or a kernel thread is running them. This isn’t too difficult to check, as long as every time a kernel thread pushes the user thread it was running back into the queue, it also sets the user thread field in its struct to null. Then all we need to do is have the kernel threads look for user threads to run. We probably should add some affinity to help with caching, but for now we will ignore that optimization.
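
A sketch of the kernel thread layer using pthreads; KThread, NUM_KTHREADS, live_threads(), run_until_switch(), and the locked queue helpers are all assumptions of this design:

```c
#include <pthread.h>

#define NUM_KTHREADS 4

typedef struct KThread {
    pthread_t pt;
    Thread   *running;   /* user thread on this kernel thread, NULL if idle */
} KThread;

static KThread kthreads[NUM_KTHREADS];

static void *worker(void *self_) {
    KThread *self = self_;
    /* Run as long as any user thread is queued or held by a worker. */
    while (live_threads()) {
        Thread *t = locked_pop_runnable(&tcb);  /* queue ops now take a lock */
        if (t == NULL) continue;
        self->running = t;
        run_until_switch(self, t);    /* swapcontext into t, back on yield */
        if (t->status != EXITED)      /* exited threads are not requeued */
            locked_push(&tcb, t);
        self->running = NULL;         /* mark idle only after requeueing */
    }
    return NULL;
}

void init_kthreads(void) {
    for (int i = 0; i < NUM_KTHREADS; i++)
        pthread_create(&kthreads[i].pt, NULL, worker, &kthreads[i]);
}
```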

So overall, the system runs by spinning up some kernel threads and scheduling the user threads across them, without making any changes to the user level thread library API.