Introduction
Last updated: 2023-10-22
atomik is an approach to shared-state concurrent programming based on atomic actions[1]. An atomic action guarantees atomicity (all-or-nothing execution, even on failure), isolation (no other code can observe its effects until it has committed), consistency (an action always takes a valid application state to another valid state), and composability (any sequence of atomic actions can be combined to form a new atomic action).
The most significant advantage of atomic actions over lock-based critical sections is that they can be executed concurrently: they do not require mutual exclusion. Atomic actions give you the simplicity of coarse-grained locking with the performance of fine-grained locking.
[1] In the database world, the equivalent of an atomic action is called a transaction. Transactions are often described using the acronym "ACID": Atomic, Consistent, Isolated, Durable. Note that one of the properties in our definition of atomic actions (composability) doesn't appear in "ACID", and one of the properties in "ACID" (durability) doesn't appear in our definition. "Composability" isn't part of "ACID" because its importance was poorly understood at the time that transactions were formalized (which is why SQL has poor composability). We don't include "durability" as a required property because atomic actions apply only to memory, not storage. However, the current state of the atomik store can be consistently written to storage (checkpointing) and later read from storage (recovery).
Atomicity
Atomicity, again, is the property that an action takes effect either entirely or not at all, regardless of failures. Lock-based critical sections are not atomic in the presence of side effects that can fail (unless you manually roll back changes after failure or depend on RAII for cleanup), although they can achieve atomicity in the absence of such side effects (if no side effect can fail, the critical section cannot fail, since it has exclusive access to the data it's modifying). In contrast, any single execution of an atomic action can fail due to data conflicts with concurrently executing atomic actions, but its modifications to data are always rolled back on failure, and the action is retried until it succeeds (if you're wondering about starvation, we'll address that later). One advantage of lock-based critical sections is that they don't have to be retried, so side effects never need to be undone. Atomic actions can accommodate side effects in a variety of ways: they can defer executing them until the atomic action succeeds (atomik::on_commit()); they can execute them immediately, then execute a compensating action if the atomic action must be retried (atomik::on_abort()); or they can designate themselves as infallible, which means they cannot fail, so they don't need to be retried. (Read-only atomic actions are automatically infallible.)
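Here's a minimal sketch of the first two approaches to side effects, assuming atomik::on_commit() and atomik::on_abort() each take a callable that the runtime invokes when the enclosing atomic action commits or is rolled back for a retry (the exact signatures, and the log_update() function itself, are illustrative, not part of the documented API):
#include "atomik.hpp"
#include <cstdio>
void log_update(atomik::var<int> counter)
{
    atomik::do_atomically(
        [=]() {
            counter.set(counter.get() + 1);
            // deferred side effect: runs at most once, after the action commits
            atomik::on_commit([]() { std::puts("counter updated"); });
            // immediate side effect...
            std::puts("tentatively updated");
            // ...with a compensating action that runs if this execution is rolled back
            atomik::on_abort([]() { std::puts("update rolled back"); });
        }
    );
}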
Isolation
Isolation, again, is the property that an atomic action cannot observe the effects of any concurrently executing atomic actions. The isolation of lock-based critical sections is limited, because any code that does not observe the locking protocol can observe their effects. But with atomic actions, there is no "back door": it is impossible to read or write any atomik::var outside an atomic action. This means that data races (unsynchronized concurrent access to a memory location, where at least one access is a write) on an atomik::var are literally impossible: the atomik runtime mediates all access to memory wrapped in an atomik::var, so you never need to run a data race detector. What the runtime doesn't prevent, however, are race conditions, which unlike data races are application-specific: they are violations of an application's consistency requirements.
Consistency
"Consistent" has meaning only in the context of a particular application: it means that, given the set of invariants defined by the application on its state, the current application state satisfies those invariants. Imagine an application that stores a person's location information along with their name and other identifying information. One reasonable invariant for this application to enforce would be that the city and ZIP code associated with a person's location must match. Thus, it would not be a consistency violation to modify a person's name independently of their location, but it would violate consistency if we allowed updates to their city and ZIP code to be non-atomic: the initial state ("New York", "10001") and the final state ("Seattle", "98101") are consistent, but the intermediate state ("Seattle", "10001") is inconsistent, so it must never be observed from outside the updating thread. The isolation property of atomic actions ensures that such inconsistent intermediate states are unobservable to other code. But because consistent states are relative to the application, only the programmer can define consistency. In lock-based concurrency, this is done by defining the boundaries of critical sections. In the atomik
concurrency model, this is done by defining the boundaries of atomic actions. Once those boundaries are defined, the atomik
runtime ensures that all application states are consistent, as defined by the application.
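For example, here's a sketch of the city/ZIP invariant enforced with an atomic action (the location type and move_to() function are illustrative; atomik::var and atomik::do_atomically are introduced in the examples below):
#include "atomik.hpp"
#include <string>
struct location {
    std::string city;
    std::string zip;
};
void move_to(atomik::var<location> loc, std::string city, std::string zip)
{
    // both fields change in one atomic action, so the inconsistent
    // intermediate state ("Seattle", "10001") can never be observed
    atomik::do_atomically(
        [=]() {
            loc.set({city, zip});
        }
    );
}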
Composability
Composability is the ability to make one new thing out of several existing things. In programming, it can mean making a new structure out of several smaller structures, or a new function out of several smaller functions. For atomik, it means making a new atomic action out of a set of existing atomic actions, by just writing a new atomic action that calls functions defining those atomic actions. (You can also combine atomic actions directly using atomik::and_then and atomik::or_else, but for now we'll just look at calling functions containing an atomic action.)
One of the reasons that locks make programming hard is that they don't compose. What does this mean? It means that if you have two methods on a data structure, each of which is thread-safe, and you call both of those methods from a new function, the new function is not automatically thread-safe. To make the new function thread-safe, you need to have access to the data structure's locking implementation. Here's a simple example: a bank account application that requires thread-safe access to individual accounts.
This is Your Code on Locks
Note that each account needs its own mutex, and the mutexes associated with each account need to be carefully managed to avoid deadlock and race conditions. The transfer() function cannot be naturally composed from the increment_balance() function. Instead we must have full knowledge of and access to the synchronization implementation of the bank_account type, and cannot reuse the increment_balance() function.
#include <mutex>
struct bank_account {
    int balance;
    std::mutex mutex; // each account needs its own mutex
};
void increment_balance(bank_account& account, int amount)
{
    std::lock_guard lock{account.mutex};
    account.balance += amount;
}
void racy_transfer(bank_account& from, bank_account& to, int amount)
{
    // we do not hold both locks at the same time,
    // so a partially completed transfer is observable to other threads,
    // breaking the invariant that the total change in balances is zero!
    increment_balance(from, -amount);
    increment_balance(to, amount);
}
void deadlocking_transfer(bank_account& from, bank_account& to, int amount)
{
    // if we concurrently call this function with `from` and `to` reversed,
    // it will deadlock, because we acquire the locks in the opposite order!
    std::lock_guard lock1{from.mutex};
    std::lock_guard lock2{to.mutex};
    from.balance -= amount;
    to.balance += amount;
}
void correct_transfer(bank_account& from, bank_account& to, int amount)
{
    // avoid deadlock (and locking the same mutex twice) in case of self-transfer
    if (&from == &to) return;
    // lock both mutexes without deadlock
    std::lock(from.mutex, to.mutex);
    // make sure both already-locked mutexes are unlocked at end of scope
    std::lock_guard lock1{from.mutex, std::adopt_lock};
    std::lock_guard lock2{to.mutex, std::adopt_lock};
    // finally, the business logic
    from.balance -= amount;
    to.balance += amount;
}
These toy examples of deadlock and race conditions have simple solutions, but real systems make composability much more difficult: you may not have access to the locking implementation at all, or there may be no way to acquire all locks in a single fixed order to avoid deadlock.
This is Your Code on atomik
Here's an example of the same bank account application using atomik instead of locks. Note that once we have the increment_balance() function, we can naturally compose it to create the transfer() function. Also, no explicit synchronization is required anywhere; the programmer is only responsible for defining the scope of atomic actions to preserve their application's invariants.
#include "atomik.hpp"
struct bank_account {
int balance;
};
void increment_balance(atomik::var<bank_account> account, int amount)
{
atomik::do_atomically(
[=]() {
account.set({account.get().balance + amount});
}
);
}
void transfer(atomik::var<bank_account> from, atomik::var<bank_account> to, int amount)
{
atomik::do_atomically(
[=]() {
increment_balance(from, -amount);
increment_balance(to, amount);
}
);
}
Failure
An atomic action can fail because of a data conflict, an unsatisfied predicate, or an unhandled exception. In the first two cases, the atomic action will automatically retry (up to a user-specified limit or timeout, after which the atomic action will abort and throw an informative exception). In the last case, the atomic action will undo its changes and rethrow the exception. Thus, exceptions thrown from inside an atomic action can be handled either inside the atomic action (so it doesn't need to undo its changes and retry), or outside it (so you can handle exceptions rethrown after the atomic action has undone its changes).
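Here's a sketch of both options, reusing the bank_account example from above (the insufficient_funds exception and checked_withdraw() function are illustrative):
#include "atomik.hpp"
#include <stdexcept>
struct insufficient_funds : std::runtime_error {
    insufficient_funds() : std::runtime_error{"insufficient funds"} {}
};
void checked_withdraw(atomik::var<bank_account> account, int amount)
{
    try {
        atomik::do_atomically(
            [=]() {
                // an unhandled exception undoes the action's changes and is rethrown
                if (account.get().balance < amount)
                    throw insufficient_funds{};
                account.set({account.get().balance - amount});
            }
        );
    } catch (const insufficient_funds&) {
        // handled outside the action: its changes have already been undone
    }
}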
When an atomic action fails due to a data conflict, it is automatically retried, but it doesn't necessarily restart immediately. Instead, it may wait for a short random period (increasing with the number of retries) to try to avoid another data conflict, or it may be added to a queue of atomic actions that are executed serially to avoid conflicts. Eventually, if it reaches a retry limit, then all atomic actions (except read-only actions) in the system may be forced to execute serially to prevent any further conflicts with the retrying action.
An atomic action can also fail because a predicate supplied to atomik::check() evaluated to false. This means that a necessary condition for progress is not satisfied, so the atomic action undoes its changes, waits until at least one atomik::var it read has changed, and then retries. atomik::check() has semantics roughly equivalent to a condition variable used with a mutex, but there is a crucial difference: the retried execution could follow a completely different codepath than the original execution, so the atomik::check() predicate that failed might not even be evaluated during the retried execution! This allows a programmer to specify many possible branches for an atomic action, each depending on a different atomik::check() predicate. The first predicate that is satisfied will determine the codepath taken for the current execution of the atomic action. It is possible, of course, for no atomik::check() predicate to ever be satisfied, so the programmer can specify a failure policy for this case. If no retried execution succeeds within a specified timeout, or if the number of retries reaches a specified limit, then the atomic action aborts and throws an informative exception.
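Here's a sketch of a withdrawal that waits until the balance is sufficient, again reusing the bank_account example (this assumes atomik::check() accepts the evaluated condition as a bool; whether it instead takes a callable predicate, and how the retry limit and timeout are configured, are details not shown here):
#include "atomik.hpp"
void blocking_withdraw(atomik::var<bank_account> account, int amount)
{
    atomik::do_atomically(
        [=]() {
            // if the predicate is false, the action undoes its changes,
            // waits until a var it read has changed, and then retries
            atomik::check(account.get().balance >= amount);
            account.set({account.get().balance - amount});
        }
    );
}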
Progress
Progress means that "something good will eventually happen." The "good thing" could be a thread acquiring a lock, an atomic action committing its changes, or all threads in a pool completing their tasks. Here we're interested in the second case: we want any atomic action to succeed within a given time or a given number of retries. Locks guarantee progress only under very restrictive conditions: deadlock, livelock, and starvation must be impossible. Avoiding deadlock requires either specifying a global order on all locks or lock acquirers, or monitoring the system for symptoms of deadlock. The former is often impossible in complex, dynamic systems (such as conventional database transactions), and the latter is expensive and can easily turn deadlock into livelock[2] if not carefully implemented. There are simple deadlock-, livelock-, and starvation-free locking protocols[3], but they require aborts and retries (just like atomic actions), which removes one of the major advantages of locks: that side effects never need to be deferred or undone.
The atomik platform guarantees progress for every atomic action by automatically retrying on data conflicts and unsatisfied predicates. The system ensures that every atomic action can eventually run without any data conflicts, and that unsatisfied predicates eventually cause an abort after a user-specified timeout or retry limit. Deadlock is impossible because no locks are used, and livelock and starvation are prevented by forcing conflicting atomic actions to wait until the retrying action has committed.
[2] A database lock manager typically monitors the "waits-for graph" of locks and their owning and waiting transactions for cycles in the graph, which indicate deadlock. One of the waiting transactions in a cycle must be aborted to resolve a deadlock, but if this choice is arbitrary it can lead to livelock: the same two deadlocking transactions can repeatedly abort each other. This is also true of the "no-wait" database locking algorithm that avoids waiting for locks entirely: when a lock acquirer encounters an owned lock, it immediately aborts and retries. Again, the same two transactions can repeatedly abort each other under this algorithm, leading to livelock.
[3] The "wound-wait" and "wait-die" database locking algorithms use transaction timestamps to prevent deadlock, livelock, and starvation. These algorithms can actually be used to implement atomic actions, but are difficult for a programmer to directly implement and integrate with an existing locking protocol.