Introduction

Last updated: 2023-10-22

atomik is an approach to shared-state concurrent programming based on atomic actions[1]. An atomic action guarantees atomicity (all-or-nothing execution, even on failure), isolation (no other code can observe its effects until it has committed), consistency (an action always takes a valid application state to another valid state), and composability (any sequence of atomic actions can be combined to form a new atomic action).

The most significant advantage of atomic actions over lock-based critical sections is that they can be executed concurrently: they do not require mutual exclusion. Atomic actions give you the simplicity of coarse-grained locking with the performance of fine-grained locking.

[1] In the database world, the equivalent of an atomic action is called a transaction. Transactions are often described using the acronym "ACID": Atomic, Consistent, Isolated, Durable. Note that one of the properties in our definition of atomic actions (composability) doesn't appear in "ACID", and one of the properties (durability) in "ACID" doesn't appear in our definition. "Composability" isn't part of "ACID" because its importance was poorly understood at the time that transactions were formalized (which is why SQL has poor composability). We don't include "durability" as a required property because atomic actions apply only to memory, not storage. However, the current state of the atomik store can be consistently written to storage (checkpointing) and later read from storage (recovery).

Atomicity

Atomicity, again, is the property that an action takes effect either entirely or not at all, regardless of failures. Lock-based critical sections do not provide atomicity in the presence of side effects that can fail (unless you manually roll back changes after a failure or rely on RAII for cleanup), although they can achieve atomicity in the absence of such side effects (if nothing inside the critical section can fail, the critical section cannot fail partway through, since it has exclusive access to the data it's modifying). In contrast, a single execution of an atomic action can fail due to data conflicts with concurrently executing atomic actions, but its modifications to data are always rolled back on failure, and the action is retried until it succeeds (if you're wondering about starvation, we'll address that later). One advantage of lock-based critical sections is that they never have to be retried, so side effects never need to be undone. Atomic actions can accommodate side effects in a variety of ways: they can defer executing them until the atomic action commits (atomik::on_commit()); they can execute them immediately and then execute a compensating action if the atomic action must be retried (atomik::on_abort()); or they can designate themselves as infallible, meaning they cannot fail and so never need to be retried. (Read-only atomic actions are automatically infallible.)
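
For example, here is a minimal sketch of the first two options. The helpers send_receipt_email(), reserve_funds(), and release_reserved_funds() are hypothetical, and the minimal account type is only for illustration (the full bank account example appears later in this document):

#include "atomik.hpp"

struct bank_account { int balance; };

// hypothetical side-effect helpers, not part of atomik
void send_receipt_email(int amount);
void reserve_funds(int amount);
void release_reserved_funds(int amount);

void deposit_and_notify(atomik::var<bank_account> account, int amount)
{
    atomik::do_atomically(
        [=]() {
            // option 1: defer an irreversible side effect until the action commits
            atomik::on_commit([=]() { send_receipt_email(amount); });

            // option 2: perform a side effect immediately and register a
            // compensating action in case the action aborts or retries
            reserve_funds(amount);
            atomik::on_abort([=]() { release_reserved_funds(amount); });

            account.set({account.get().balance + amount});
        }
    );
}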

Isolation

Isolation, again, is the property that an atomic action cannot observe the effects of any concurrently executing atomic actions. The isolation of lock-based critical sections is limited, because any code that does not observe the locking protocol can observe their effects. But with atomic actions, there is no "back door": it is impossible to read or write any atomik::var outside an atomic action. This means that data races (unsynchronized concurrent access to a memory location, where at least one access is a write) on an atomik::var are literally impossible: the atomik runtime mediates all access to memory wrapped in an atomik::var, so you never need to run a data race detector. What it doesn't prevent, however, are race conditions, which unlike data races are application-specific: they are violations of an application's consistency requirements.
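
As a minimal illustration of this guarantee (the exact exception type thrown on out-of-action access is not specified here, so the sketch only notes that an exception is thrown):

#include "atomik.hpp"

void isolation_example(atomik::var<int> counter)
{
    // OK: all reads and writes happen inside an atomic action
    atomik::do_atomically(
        [=]() {
            counter.set(counter.get() + 1);
        }
    );

    // NOT OK: reading or writing an atomik::var outside an atomic action
    // is rejected by the runtime and throws an exception, so there is no
    // "back door" that could introduce a data race:
    //
    // int value = counter.get();   // throws
}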

Consistency

"Consistent" has meaning only in the context of a particular application: it means that, given the set of invariants defined by the application on its state, the current application state satisfies those invariants. Imagine an application that stores a person's location information along with their name and other identifying information. One reasonable invariant for this application to enforce would be that the city and ZIP code associated with a person's location must match. Thus, it would not be a consistency violation to modify a person's name independently of their location, but it would violate consistency if we allowed updates to their city and ZIP code to be non-atomic: the initial state ("New York", "10001") and the final state ("Seattle", "98101") are consistent, but the intermediate state ("Seattle", "10001") is inconsistent, so it must never be observed from outside the updating thread. The isolation property of atomic actions ensures that such inconsistent intermediate states are unobservable to other code. But because consistent states are relative to the application, only the programmer can define consistency. In lock-based concurrency, this is done by defining the boundaries of critical sections. In the atomik concurrency model, this is done by defining the boundaries of atomic actions. Once those boundaries are defined, the atomik runtime ensures that all application states are consistent, as defined by the application.

Composability

Composability is the ability to make one new thing out of several existing things. In programming, it can mean making a new structure out of several smaller structures, or a new function out of several smaller functions. For atomik, it means making a new atomic action out of a set of existing atomic actions, by just writing a new atomic action that calls functions defining those atomic actions. (You can also combine atomic actions directly using atomik::and_then and atomik::or_else, but for now we'll just look at calling functions containing an atomic action.)

One of the reasons that locks make programming hard is that they don't compose. What does this mean? It means that if you have two methods on a data structure, each of which is thread-safe, and you call both of those methods from a new function, the new function is not automatically thread-safe. To make the new function thread-safe, you need to have access to the data structure's locking implementation. Here's a simple example: a bank account application that requires thread-safe access to individual accounts.

This is Your Code on Locks

Note that each account needs its own mutex, and the mutexes associated with the accounts must be carefully managed to avoid deadlock and race conditions. A correct transfer() function cannot be naturally composed from the increment_balance() function: instead, we need full knowledge of, and access to, the synchronization implementation of the bank_account type, and we cannot reuse increment_balance() at all.

#include <mutex>

struct bank_account {
    int balance;
    std::mutex mutex;
};

void increment_balance(bank_account& account, int amount)
{
    std::lock_guard lock{account.mutex};

    account.balance += amount;
}

void racy_transfer(bank_account& from, bank_account& to, int amount)
{
    // we do not hold both locks at the same time,
    // so a partially completed transfer is observable to other threads,
    // breaking the invariant that the total change in balances is zero!
    increment_balance(from, -amount);
    increment_balance(to, amount);
}

void deadlocking_transfer(bank_account& from, bank_account& to, int amount)
{
    // if we concurrently call this function with `from` and `to` reversed,
    // it will deadlock, because we acquire the locks in the opposite order!
    std::lock_guard lock1{from.mutex};
    std::lock_guard lock2{to.mutex};

    from.balance -= amount;
    to.balance += amount;
}

void correct_transfer(bank_account& from, bank_account& to, int amount)
{
    // avoid deadlock in case of self-transfer
    if (&from == &to) return;

    // lock both mutexes without deadlock
    std::lock(from.mutex, to.mutex);

    // make sure both already-locked mutexes are unlocked at end of scope
    std::lock_guard lock1{from.mutex, std::adopt_lock};
    std::lock_guard lock2{to.mutex, std::adopt_lock};

    // finally, the business logic
    from.balance -= amount;
    to.balance += amount;
}

These toy examples of deadlock and race conditions have simple solutions, but real systems make composability much more difficult: you may not have access to the locking implementation at all, or there may be no way to acquire all locks in a single fixed order to avoid deadlock.

This is Your Code on atomik

Here's an example of the same bank account application using atomik instead of locks. Note that once we have the increment_balance() function, we can naturally compose it to create the transfer() function. Also, no explicit synchronization is required anywhere; the programmer is only responsible for defining the scope of atomic actions to preserve their application's invariants.

#include "atomik.hpp"

struct bank_account {
    int balance;
};

void increment_balance(atomik::var<bank_account> account, int amount)
{
    atomik::do_atomically(
        [=]() {
            account.set({account.get().balance + amount});
        }
    );
}

void transfer(atomik::var<bank_account> from, atomik::var<bank_account> to, int amount)
{
    atomik::do_atomically(
        [=]() {
            increment_balance(from, -amount);
            increment_balance(to, amount);
        }
    );
}

Failure

An atomic action can fail because of a data conflict, an unsatisfied predicate, or an unhandled exception. In the first two cases, the atomic action will automatically retry (up to a user-specified limit or timeout, after which the atomic action will abort and throw an informative exception). In the last case, the atomic action will undo its changes and rethrow the exception. Thus, exceptions thrown from inside an atomic action can be handled either inside the atomic action (so it doesn't need to undo its changes and retry), or outside it (so you can handle exceptions rethrown after the atomic action has undone its changes).
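
A minimal sketch of both options (risky_operation() is a hypothetical helper that may throw; the exception type is a placeholder):

#include "atomik.hpp"

#include <stdexcept>

void risky_operation();   // hypothetical helper that may throw

void handle_inside(atomik::var<int> counter)
{
    atomik::do_atomically(
        [=]() {
            counter.set(counter.get() + 1);
            try {
                risky_operation();
            } catch (const std::runtime_error&) {
                // handled inside: the action does not need to undo its
                // changes and retry, and the write above can still commit
            }
        }
    );
}

void handle_outside(atomik::var<int> counter)
{
    try {
        atomik::do_atomically(
            [=]() {
                counter.set(counter.get() + 1);
                risky_operation();   // if this throws, the write is undone
            }
        );
    } catch (const std::runtime_error&) {
        // handled outside: the action has already undone its changes
        // before the exception is rethrown
    }
}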

When an atomic action fails due to a data conflict, it is automatically retried, but it doesn't necessarily restart immediately. Instead, it may wait for a short random period (increasing with the number of retries) to try to avoid another data conflict, or it may be added to a queue of atomic actions that are executed serially to avoid conflicts. Eventually, if it reaches a retry limit, then all atomic actions (except read-only actions) in the system may be forced to execute serially to prevent any further conflicts with the retrying action.

An atomic action can also fail because a predicate supplied to atomik::check() evaluated to false. This means that a necessary condition for progress is not satisfied, so the atomic action undoes its changes, waits until at least one atomik::var it read has changed, and then retries. atomik::check() has semantics roughly equivalent to a condition variable used with a mutex, but there is a crucial difference: the retried execution could follow a completely different codepath than the original execution, so the atomik::check() predicate that failed might not even be evaluated during the retried execution! This allows a programmer to specify many possible branches for an atomic action, each depending on a different atomik::check() predicate. The first predicate that is satisfied will determine the codepath taken for the current execution of the atomic action. It is possible, of course, for no atomik::check() predicate to ever be satisfied, so the programmer can specify a failure policy for this case. If no retried execution succeeds within a specified timeout, or if the number of retries reaches a specified limit, then the atomic action aborts and throws an informative exception.
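
For example, a withdrawal that waits for sufficient funds might look like this sketch (reusing the bank_account type from the composability example above):

#include "atomik.hpp"

struct bank_account { int balance; };   // as in the composability example

void withdraw_when_available(atomik::var<bank_account> account, int amount)
{
    atomik::do_atomically(
        [=]() {
            // if the predicate is false, the action undoes its changes, waits
            // until an atomik::var it has read changes, and then retries
            atomik::check(account.get().balance >= amount);

            account.set({account.get().balance - amount});
        }
    );
}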

Progress

Progress means that "something good will eventually happen." The "good thing" could be a thread acquiring a lock, an atomic action committing its changes, or all threads in a pool completing their tasks. Here we're interested in the second case: we want any atomic action to succeed within a given time or a given number of retries. Locks guarantee progress only under very restrictive conditions: deadlock, livelock, and starvation must be impossible. Avoiding deadlock requires either specifying a global order on all locks or lock acquirers, or monitoring the system for symptoms of deadlock. The former is often impossible in complex, dynamic systems (such as conventional database transactions), and the latter is expensive and can easily turn deadlock into livelock[2] if not carefully implemented. There are simple deadlock-, livelock-, and starvation-free locking protocols[3], but they require aborts and retries (just like atomic actions), which removes one of the major advantages of locks: that side effects never need to be deferred or undone.

The atomik platform guarantees progress for every atomic action by automatically retrying on data conflicts and unsatisfied predicates. The system ensures that every atomic action can eventually run without any data conflicts, and that unsatisfied predicates eventually cause an abort after a user-specified timeout or retry limit. Deadlock is impossible because no locks are used, and livelock and starvation are prevented by forcing conflicting atomic actions to wait until the retrying action has committed.

[2] A database lock manager typically monitors the "waits-for graph" of locks and their owning and waiting transactions for cycles in the graph, which indicate deadlock. One of the waiting transactions in a cycle must be aborted to resolve a deadlock, but if this choice is arbitrary it can lead to livelock: the same two deadlocking transactions can repeatedly abort each other. This is also true of the "no-wait" database locking algorithm that avoids waiting for locks entirely: when a lock acquirer encounters an owned lock, it immediately aborts and retries. Again, the same two transactions can repeatedly abort each other under this algorithm, leading to livelock.

[3] The "wound-wait" and "wait-die" database locking algorithms use transaction timestamps to prevent deadlock, livelock, and starvation. These algorithms can actually be used to implement atomic actions, but are difficult for a programmer to directly implement and integrate with an existing locking protocol.

Installation

The atomik platform can be downloaded as a shared or statically linked client library, plus a standalone server binary. A sample Makefile is provided, as well as a CMake install script for use within CMake projects.

Download

Download a GZIP archive from the "Releases" tab on the GitHub repository page.

Install

Install the atomik CMake package using the provided CMake script.

Build and Run Sample Application

Build the sample bank_account CMake project and run the test suite using CTest.

APIs

The atomik APIs fall into three broad categories:

  • Accessing data within an atomic action (atomik::var)
  • Executing an atomic action (atomik::do_atomically())
  • Retrying an atomic action when a predicate is not satisfied (atomik::check())

template<typename T> atomik::var<T>

Provides memory and concurrency management for a user-defined type T. T must either be trivial or implement the get_serialized_size(), serialize() and deserialize() methods (via either its definition or template specialization). atomik::var can only be accessed within an atomic action.

  • T atomik::var<T>::get()

    Reads the current version of this atomik::var, returning its contents by value.

  • const T* atomik::var<T>::get_ptr()

    Reads the current version of this atomik::var, returning a const pointer to its contents.

  • void atomik::var<T>::set(const T& value)

    Sets the contents of this atomik::var to value. Subsequent calls to atomik::var::get() from within this atomic action will return value.

  • template<typename F> T atomik::var<T>::update(F func)

    Evaluates func (typically a lambda, but can be any callable object) on the current value of this atomik::var, sets the contents of this atomik::var to the result, and returns the result by value. Subsequent calls to atomik::var::get() from within this atomic action will return the result.
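
A brief sketch showing these accessors together (the point type is an assumption for illustration; as a trivial type it needs no serialization methods, and the callable passed to update() is assumed to take the current value and return the new one):

#include "atomik.hpp"

struct point { int x; int y; };   // trivial type

void accessors_example(atomik::var<point> p)
{
    atomik::do_atomically(
        [=]() {
            point current = p.get();            // read the current value by value
            const point* view = p.get_ptr();    // read via a stable const pointer

            p.set({current.x + 1, current.y});  // overwrite the contents

            // apply a function to the current value, store and return the result
            point moved = p.update([](const point& old) {
                return point{old.x, old.y + 1};
            });

            (void)view;
            (void)moved;
        }
    );
}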

template<typename A> void atomik::do_atomically(A action)

Atomically executes the callable object action, automatically retrying on data conflicts. All accesses to atomik::vars within A are guaranteed to be atomic and isolated.

void atomik::check(bool pred)

If the boolean expression pred evaluates to false, undoes all changes made by this atomic action (including executing all registered on_abort handlers), waits for any atomik::var already read by this atomic action to change, and then retries the atomic action. If a user-specified timeout or retry limit is reached, throws a retry_timeout or retry_limit exception respectively.

It's important to note that, as with any form of conditional blocking synchronization, it is technically possible to create a deadlock using atomik::check(). Here's the pattern to avoid (assume the atomik::vars "X" and "Y" are both initially false):

atomik::do_atomically(
    []() {
        atomik::var x_var = atomik::get_var_by_key("X");
        atomik::var y_var = atomik::get_var_by_key("Y");
        atomik::check(x_var.get() == true);
        y_var.set(true);
    }
);

atomik::do_atomically(
    []() {
        atomik::var x_var = atomik::get_var_by_key("X");
        atomik::var y_var = atomik::get_var_by_key("Y");
        atomik::check(y_var.get() == true);
        x_var.set(true);
    }
);

Since neither atomic action can commit until the other commits, both would block indefinitely if atomik::check() were to block indefinitely. Fortunately, because every atomic action using atomik::check() is subject to a user-specified timeout or retry limit, a permanent deadlock is impossible: instead of both threads blocking forever, an informative exception is thrown on both of the blocked threads.

template<typename A> void atomik::on_commit(A action)

Defers execution of the callable object action until the atomic action commits. All commit handlers (including those registered in nested actions) are executed in LIFO order.

template<typename A> void atomik::on_abort(A action)

Defers execution of the callable object action until the atomic action aborts or retries. All abort handlers (including those registered in nested actions) are executed in LIFO order.
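
A sketch of the ordering guarantee (log_step() is a hypothetical logging helper):

#include "atomik.hpp"

void log_step(const char* msg);   // hypothetical logging helper

void handler_ordering_example()
{
    atomik::do_atomically(
        []() {
            atomik::on_commit([]() { log_step("outer handler"); });

            // a nested atomic action simply composes with the outer one;
            // its handlers run when the outermost action commits
            atomik::do_atomically(
                []() {
                    atomik::on_commit([]() { log_step("inner handler"); });
                }
            );

            // on commit, handlers run in LIFO order:
            // "inner handler" first, then "outer handler"
        }
    );
}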

Resources

The atomik platform is based on ideas from multiversioned in-memory databases and Software Transactional Memory. A good introduction to the former is [4]. A good introduction to the latter is [5]. More specific resources are detailed below.

[4] "An empirical evaluation of in-memory multi-version concurrency control"

[5] "Unlocking concurrency: multicore programming with transactional memory"

What Is Concurrent Programming?

"The Art of Multiprocessor Programming" is a comprehensive introduction to concurrent programming, covering both blocking (lock-based) and nonblocking (lock-free) techniques.

"Is Parallel Programming Hard, And, If So, What Can You Do About It?" covers hardware and systems programming aspects of concurrent programming in more depth.

Why Locks Are Hard

"The Trouble With Locks" is a good introduction to the subtleties of concurrent programming with locks, with many examples in C++.

"Unlocking concurrency: multicore programming with transactional memory" also describes several issues with locks and how they are solved by the transactional model.

Why Atomic Actions Should Be Composable

"Composable Memory Transactions", the original paper on the Haskell language's Software Transactional Memory implementation, introduces the retry API, which is the inspiration for atomik::check(). (You may also recognize TVar and atomically as the equivalents of atomik::var and atomik::do_atomically() respectively.) The paper explains why the somewhat surprising semantics of retry are essential for the composability of atomic actions.

"Industrial Experience with Transactional Memory at Wyatt Technology" describes a C++ STM with a very similar API to atomik (but a much less scalable implementation: it uses a single global reader-writer lock to coordinate all transactions). A companion talk is also available.

FAQs

Some common questions and answers about the atomik platform.

General

What problem is this solving?

The atomik platform solves the problem of making shared-state concurrent programming easier and safer.

Why is shared-state concurrent programming even necessary?

CPU clock speeds stopped increasing around 2005, limiting single-thread performance and forcing CPU designers to compensate by putting multiple cores on a chip and multiple chips in a single machine. The only way to get good performance out of all these cores is to give them many things to do at once, which is the essence of concurrent programming. There are many ways to do concurrent programming, but many programs need to change the same data from many threads at once, which requires shared-state concurrent programming. The only proven approaches to shared-state concurrent programming are locks and atomic actions.

Why aren't locks good enough for shared-state concurrent programming?

Locks can work well for simple programs, especially if the program can be structured rigidly enough to avoid deadlock. They are easy to reason about in simple cases and don't have hidden implementation overheads. But locks don't work well when a program is too complicated or dynamic for a static lock ordering protocol, or when you need to compose multiple thread-safe operations into a new thread-safe operation. Even worse, writing a complex system using locks forces you to choose between simplicity and safety on one hand (coarse-grained locks), and scalability on the other (fine-grained locks). Atomic actions give you simplicity, safety, and scalability, all at the same time.

Safety

What if I need to pop up a dialog or send an email in my atomic action?

In general, you should avoid irreversible side effects within an atomic action. But sometimes that's just not possible. In that case, you have the problem that the atomic action could retry many times due to data conflicts, and you probably don't want the side effect executed on every retry! The atomik platform gives you three options to deal with side effects in an atomic action:

  1. Defer execution of the side effect until the atomic action commits (atomik::on_commit()).
  2. Execute the side effect immediately, but execute a compensating action whenever the atomic action retries or throws an exception (atomik::on_abort()).
  3. Designate the atomic action as infallible (or read-only), so it's never retried (atomik::action_type::infallible or atomik::action_type::read_only).

How long will it take my atomic action to commit if it keeps encountering data conflicts?

Whenever an atomic action must retry due to a data conflict, the thread executing it increases its probability of scheduling the next retry for serial execution. Any thread executing an atomic action that causes a retry (i.e., the "winner" of a conflict) does the same thing. The result is that any pair of conflicting atomic actions quickly find themselves both scheduled for serial execution, where they cannot conflict. If an atomic action reaches a retry limit, it will become "infallible": it will wait for all active executions of atomic actions to complete and then force all new executions (including retries) to be serially scheduled. (Because read-only atomic actions cannot cause conflicts, they can still run concurrently.)

Can I access shared data both inside and outside an atomic action?

In general, no. The atomik platform does not restrict access to global variables inside an atomic action, but such variables should only be accessed if they are immutable. Otherwise, you risk data races, which is exactly the problem that atomik solves! Even if you use std::atomic types for global variables, accessing a non-atomik::var global variable inside an atomic action subverts the isolation guarantee: the effects of an atomic action must not be visible until it commits, and effects from outside an atomic action must not be visible within it. The only non-atomik::var variables that are safe to access from within an atomic action are local variables, thread-local variables, and immutable global variables. In the other direction, the atomik platform does ensure that you cannot access an atomik::var outside an atomic action: such access will throw an exception.

Can I lock or unlock a mutex, or wait on or signal a condition variable, within an atomic action?

No. Although atomik can easily accommodate side effects, mixing synchronization paradigms is a recipe for disaster. Even if a mutex within an atomic action is unlocked on abort by RAII, locking the mutex violates the isolation property. And, of course, signaling a condition variable is impossible to undo. The atomik platform (or any other synchronization library) cannot statically enforce this requirement, so the programmer must do so. (This doesn't prevent you from implementing heuristic approaches like linter rules, of course.)

Is it safe to reuse a pointer read from atomik::var::get_ptr()?

Yes! The internal copy-on-write implementation of atomik::var::set() guarantees that the address returned by atomik::var::get_ptr() is stable: the address and the memory it references will never change during the lifetime of this atomic action, even if another atomic action concurrently overwrites the value read by this atomic action, causing it to later abort.
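
For example, it is safe to hold and dereference the pointer for the remainder of the action (a sketch reusing the bank_account type from the earlier examples):

#include "atomik.hpp"

struct bank_account { int balance; };   // as in the earlier examples

void audit(atomik::var<bank_account> account, atomik::var<int> audit_total)
{
    atomik::do_atomically(
        [=]() {
            // the pointer and the memory it references remain valid and
            // unchanged for the rest of this execution of the atomic action
            const bank_account* snapshot = account.get_ptr();

            audit_total.set(audit_total.get() + snapshot->balance);

            // still safe to dereference, even if another atomic action has
            // concurrently overwritten the value this action read (which
            // would only cause this action to retry after this execution)
            int balance_again = snapshot->balance;
            (void)balance_again;
        }
    );
}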

Performance

How much will this slow down my program?

The latency overhead of atomic actions is very low: for a single thread it's about 200ns (and improving). Of course, in a critical path it may still be too high, but only benchmarking will tell. Scalability is on par with fine-grained locking (you should expect linear scalability up to at least 4-8 threads, and considerably more threads in the near future).

How much memory does this use?

The base memory overhead of the atomik runtime (not accounting for long-lived data) is ~2.5MB per active thread. The runtime can accommodate up to 256GB of total data (with some internal overhead). Garbage collection can be delayed indefinitely by a long-running atomic action, but this can be addressed by timeout or memory usage policies configured by the user.

Could the implementation internally deadlock or livelock?

No. The implementation is fully nonblocking: there are no locks used internally at all. All operations on the critical path are lock-free or wait-free.