# Overhauling SC atomics in C11 and OpenCL

John Wickerson. Mark Batty, and Alastair F. Donaldson

Imperial Concurrency Workshop July 2015

 The rules for sequentially-consistent atomic operations and fences ("SC atomics") in C11 and OpenCL are

- The rules for sequentially-consistent atomic operations and fences ("SC atomics") in C11 and OpenCL are
  - too **complex**,

- The rules for sequentially-consistent atomic operations and fences ("SC atomics") in C11 and **OpenCL** are
  - 😕 too **complex**,



too **weak**, and

- The rules for sequentially-consistent atomic operations and fences ("SC atomics") in C11 and **OpenCL** are

  - 😕 too **complex**,

  - too **weak**, and



- The rules for sequentially-consistent atomic operations and fences ("SC atomics") in C11 and OpenCL are
  - too **complex**,



too **weak**, and



• We suggest how to fix them  $\mathfrak{S}$ .

#### Outline

- Introduction to the C11 memory model
- Overhauling the rules for SC atomics in C11
- Introduction to the OpenCL memory model
- Overhauling the rules for SC atomics in OpenCL

• A collection of indivisible operations for lock-free programming, e.g.:

```
atomic_store_explicit(x, 1, memory_order_relaxed);
```

• A collection of indivisible operations for lock-free programming, e.g.:

```
atomic_store_explicit(x, 1, memory_order_release);
    memory_order_relaxed
```

 A collection of indivisible operations for lock-free programming, e.g.:

```
atomic_store_explicit(x, 1, memory_order_acquire);
    memory_order_release
    memory_order_relaxed
```

 A collection of indivisible operations for lock-free programming, e.g.:

```
atomic_store_explicit(x, 1, memory_order_rel_acq);
    memory_order_acquire
    memory_order_release
    memory_order_relaxed
```

 A collection of indivisible operations for lock-free programming, e.g.:

```
atomic_store_explicit(x, 1, memory_order_seq_cst);
    memory_order_rel_acq
    memory_order_acquire
    memory_order_release
    memory_order_relaxed
```

- A collection of indivisible operations for lock-free programming, e.g.:
- atomic\_store\_explicit(x, 1, memory\_order\_seq\_cst);
  memory\_order\_rel\_acq
  memory\_order\_acquire
  memory\_order\_acquire
  memory\_order\_release
  memory\_order\_release
  memory\_order\_relaxed

# C11 memory model

• Trace-based semantics ("executions").

# C11 memory model

- Trace-based semantics ("executions").
- First phase: generate an overapproximation, by considering each thread in isolation.

# C11 memory model

- Trace-based semantics ("executions").
- First phase: generate an overapproximation, by considering each thread in isolation.
- Second phase: remove executions that are inconsistent with the axioms of the memory model.

\*x = 42; atomic\_store\_explicit(y, 1, memory\_order\_release); if (atomic\_load\_explicit(y, memory\_order\_acquire)) print(\*x);

if (atomic\_load\_explicit(y, memory\_order\_acquire))
 print(\*x);



```
*x = 42;
atomic_store_explicit(y, 1,
memory_order_release); if (atomic_load_explicit(y,
memory_order_acquire))
print(*x);
```























































• Execution X is **consistent** iff

satisfies all the **consistency axioms**.

• Execution X is **consistent** iff

satisfies all the **consistency axioms**.

[[P]] = P's consistent executions\*

• Execution X is **consistent** iff

satisfies all the **consistency axioms**.

[[P]] = P's consistent executions\*

\*unless P also admits a **faulty** execution, then [P] = any execution

- Execution X is consistent iff there exists rf, mo and S such that (X,rf,mo,S) is well-formed and satisfies all the consistency axioms.
- [[P]] = P's consistent executions\*

\*unless P also admits a **faulty** execution, then [P] = any execution

#### Candidate executions

 $a: W_{na}(x, 0)$   $b: W_{na}(y, 0)$
#### Candidate executions







irreflexive(rf)



#### irreflexive(rf) irreflexive(mo; mo; rf<sup>-1</sup>)



irreflexive(rf) irreflexive(mo; mo; rf-1) irreflexive(mo; rf)

# All consistency axioms

$$\begin{array}{lll} \operatorname{irr}(hb) & (\operatorname{Hb}) \\ \operatorname{irr}(rf^{-1})^? ; mo ; rf^? ; hb) & (\operatorname{Coh}) \\ \operatorname{irr}(rf ; hb) & (\operatorname{Rf}) \\ \operatorname{empty}((rf ; [nal]) \setminus vis) & (\operatorname{NaRf}) \\ \operatorname{irr}(rf \cup (mo ; mo ; rf^{-1}) \cup (mo ; rf)) & (\operatorname{Rmw}) \\ \operatorname{irr}(S ; r_1) & \operatorname{where} r_1 = hb & (S1) \\ \operatorname{irr}(S ; r_2) & \operatorname{where} r_2 = Fsb^? ; mo ; sbF^? & (S2) \\ \operatorname{irr}(S ; r_3) & \operatorname{where} r_3 = rf^{-1} ; [\operatorname{SC}] ; mo & (\operatorname{S3}) \\ \operatorname{irr}(S ; r_5) & \operatorname{where} r_4 = rf^{-1} ; hbl ; [W] & (\operatorname{S4}) \\ \operatorname{irr}(S ; r_5) & \operatorname{where} r_5 = Fsb ; rb & (\operatorname{S5}) \\ \operatorname{irr}(S ; r_6) & \operatorname{where} r_6 = rb ; sbF & (\operatorname{S6}) \\ \operatorname{irr}(S ; r_7) & \operatorname{where} r_7 = Fsb ; rb ; sbF & (\operatorname{S7}) \end{array}$$

#### Derived relations

$$\begin{array}{rcl} acq & \stackrel{\text{def}}{=} & (\operatorname{ACQ} \cup \operatorname{AR} \cup \operatorname{SC}) \cap (R \cup F) \\ rel & \stackrel{\text{def}}{=} & (\operatorname{REL} \cup \operatorname{AR} \cup \operatorname{SC}) \cap (W \cup F) \\ rb & \stackrel{\text{def}}{=} & (rf^{-1}; mo) \setminus id \\ Fsb & \stackrel{\text{def}}{=} & [F]; sb \\ sbF & \stackrel{\text{def}}{=} & sb; [F] \\ rs' & \stackrel{\text{def}}{=} & thd \cup (E^2; [R \cap W]) \\ rs & \stackrel{\text{def}}{=} & mo \cap rs' \setminus ((mo \setminus rs'); mo) \\ sw & \stackrel{\text{def}}{=} & ([rel]; Fsb^?; [W \cap A]; rs^?; rf; \\ [R \cap A]; sbF^?; [acq]) \setminus thd \\ hb & \stackrel{\text{def}}{=} & (sb \cup (I \times \neg I) \cup sw)^+ \\ hbl & \stackrel{\text{def}}{=} & hb \cap =_{loc} \\ vis & \stackrel{\text{def}}{=} & (W \times R) \cap hbl \setminus (hbl; [W]; hb) \end{array}$$

#### Outline

- Introduction to the C11 memory model
- Overhauling the rules for SC atomics in C11
- Introduction to the OpenCL memory model
- Overhauling the rules for SC atomics in OpenCL

 $irr(S; r_1)$   $irr(S; r_2)$   $irr(S; r_3)$   $irr((S \setminus (mo; S)); r_4)$   $irr(S; r_5)$   $irr(S; r_6)$  $irr(S; r_7)$ 

where 
$$r_1 = hb$$
 (S1)  
where  $r_2 = Fsb^?$ ;  $mo$ ;  $sbF^?$  (S2)  
where  $r_3 = rf^{-1}$ ; [SC];  $mo$  (S3)  
where  $r_4 = rf^{-1}$ ;  $hbl$ ; [W] (S4)  
where  $r_5 = Fsb$ ;  $rb$  (S5)  
where  $r_6 = rb$ ;  $sbF$  (S6)  
where  $r_7 = Fsb$ ;  $rb$ ;  $sbF$  (S7)

 $irr(S; r_1)$   $irr(S; r_2)$   $irr(S; r_3)$   $irr(S; r_4)$   $irr(S; r_5)$   $irr(S; r_6)$   $irr(S; r_7)$ 

where  $r_1 = hb$  (S1) where  $r_2 = Fsb^?$ ; mo;  $sbF^?$  (S2) where  $r_3 = rf^{-1}$ ; [SC]; mo (S3) where  $r_4 = rf^{-1}$ ; hbl; [W] (S4) where  $r_5 = Fsb$ ; rb (S5) where  $r_6 = rb$ ; sbF (S6) where  $r_7 = Fsb$ ; rb; sbF (S7)

 $\operatorname{irr}(S;r_1)$  $\operatorname{irr}(S;r_2)$  $\operatorname{irr}(S;r_3)$  $\operatorname{irr}(S; r_4)$  $\operatorname{irr}(S; r_5)$  $\operatorname{irr}(S; r_6)$  $\operatorname{irr}(S; r_7)$ 

where  $r_1 = hb$ (S1)where  $r_2 = Fsb^?$ ; mo;  $sbF^?$ (S2) where  $r_3 = rf^{-1}$ ; [SC]; *mo* where  $r_4 = rf^{-1}$ Common Compiler Optimisations are Invalid in the C11 Memory Model and what we can do about it W wh Thibaut Balabonski wh INRIA Francesco Zappa Nardelli Viktor Vafeiadis MPI-SWS INRIA Robin Morisset INRIA thus to require that race-free code must exhibit only sequ consistent (that is, interleaving) behaviours, while racy c defined and has no semantics. This approach, usually r as DRF (data race freedom), is appealing to the progra cause under the hypothesis that the shared state is pro-We show that the weak memory model introduced by the 2011 tected by locks he has to reason only about interleaving C and C++ standards does not permit many common source-toaccesses. It is also appealing to the compiler because source program transformations (such as expression linearisation mise code freely provided that it respects synchronisati source program gansionnanous (such as expression meansation and "roach motel" reorderings) that modern compilers perform and by Ševčík [18] shows that it is indeed the case that in that are deemed to be correct. As such it cannot be used to de-DRF model common compiler optimisations are corre that are declined to be correct. As such it cannot be used to decline the semantics of intermediate languages of compilers, as, for clude elimination and reorderings of non-synchroni instance, LLVM aimed to. We consider a number of possible loaccesses, and the so-called "roach motel" reorderin real fixes, some strengthening and some weakening the model. We ing a memory access after a lock or before an unlo evaluate the proposed fixes by determining which program transfor-Intuitively, the latter amounts to enlarging a critical evaluate the proposed mass by determining which program transformations are valid with respect to each of the patched models. formal Coo proofs of their correctness or counterexamples Although the idealised DRF design is appealin should be obviously correct.

# Consistent executions

- Execution X is consistent iff there exists rf, mo and S such that (X,rf,mo,S) is well-formed and satisfies all the consistency axioms.
- [[P]] = P's consistent executions\*

\*unless P also admits a **faulty** execution, then [P] = any execution

#### $\operatorname{acy}(\operatorname{SC}^2 \cap (r_1 \cup r_2 \cup r_3 \cup r_4 \cup r_5 \cup r_6 \cup r_7) \setminus id) \quad (\mathsf{S}_{\mathsf{partial}})$

 $\operatorname{acy}(\operatorname{SC}^2 \cap (r_1 \cup r_2 \cup r_3 \cup r_4 \cup r_5 \cup r_6 \cup r_7) \setminus id) \quad (\mathsf{S}_{\mathsf{partial}})$ This axiom is faster to simulate!

 $\operatorname{acy}(\operatorname{SC}^2 \cap (r_1 \cup r_2 \cup r_3 \cup r_4 \cup r_5 \cup r_6 \cup r_7) \setminus id) \quad (\operatorname{S}_{\operatorname{partial}})$ This axiom is faster to simulate! Existing compilation schemes (x86 and Power) remain valid.

 $irr(S; r_1)$   $irr(S; r_2)$   $irr(S; r_3)$   $irr(S; r_4)$   $irr(S; r_5)$   $irr(S; r_6)$   $irr(S; r_7)$ 

where  $r_1 = hb$  (S1) where  $r_2 = Fsb^?$ ; mo;  $sbF^?$  (S2) where  $r_3 = rf^{-1}$ ; [SC]; mo (S3) where  $r_4 = rf^{-1}$ ; hbl; [W] (S4) where  $r_5 = Fsb$ ; rb (S5) where  $r_6 = rb$ ; sbF (S6) where  $r_7 = Fsb$ ; rb; sbF (S7)

 $irr(S; r_1)$   $irr(S; r_2)$   $irr(S; r_3)$   $irr(S; r_4)$   $irr(S; r_5)$   $irr(S; r_6)$   $irr(S; r_7)$ 

| / · · · · · · · · · · · · · · · · · · ·  |      |
|------------------------------------------|------|
| where $r_1 = hb$                         | (S1) |
| where $r_2 = Fsb^?$ ; $mo$ ; $sbF^?$     | (S2) |
| where $r_3 = rf^{-1}$ ; mo               | (S3) |
| where $r_4 = rf^{-1}$ ; <i>hbl</i> ; [W] | (S4) |
| where $r_5 = Fsb$ ; $rb$                 | (S5) |
| where $r_6 = rb$ ; $sbF$                 | (S6) |
| where $r_7 = Fsb$ ; $rb$ ; $sbF$         | (S7) |

#### $\operatorname{acy}(\mathsf{SC}^2 \cap (Fsb^?; (hb \cup rb \cup mo); sbF^?)). \tag{S_{simp}}$

 $\operatorname{acy}(\operatorname{SC}^2 \cap (Fsb^?; (hb \cup rb \cup mo); sbF^?)).$ 

$$(\mathsf{S}_{\mathsf{simp}})$$

This axiom is much simpler for programmers to understand and to use

Existing compilation schemes (x86 and Power) remain valid.

#### Candidate executions



# Changing the standard

- 6. There shall be a single total order S on all memory\_order\_seq\_cst operations, consistent with the "happens before" order and modification orders for all affected locations, such that each memory\_order\_seq\_cst operation B that loads a value from an atomic object M observes one of the following values:
  - the result of the last modification A of M that precedes B in S, if it exists, or
  - if A exists, the result of some modification of M in the visible sequence of side effects with respect to B that is not memory\_order\_seq\_cst and that does not happen before A, or
  - if A does not exist, the result of some modification of M in the visible sequence of side effects with respect to B that is not memory\_order\_seq\_cst.

#### [...]

- 9. For an atomic operation B that reads the value of an atomic object M, if there is a memory\_order\_seq\_cst fence X sequenced before B, then B observes either the last memory\_order\_seq\_cst modification of M preceding X in the total order S or a later modification of M in its modification order.
- 10. For atomic operations A and B on an atomic object M, where A modifies M and B takes its value, if there is a memory\_order\_seq\_cst fence X such that A is sequenced before X and B follows X in S, then B observes either the effects of A or a later modification of M in its modification order.
- 11. For atomic operations A and B on an atomic object M, where A modifies M and B takes its value, if there are memory\_order\_seq\_cst fences X and Y such that A is sequenced before X, Y is sequenced before B, and Xprecedes Y in S, then B observes either the effects of Aor a later modification of M in its modification order.

- 1. A value computation A of an object M reads before a side effect B on M if A and B are different operations and B follows, in the modification order of M, the side effect that A observes.
- 2. If X reads before Y, or happens before Y, or precedes Y in modification order, then X (as well as any fences sequenced before X) is *SC-before* Y (as well as any fences sequenced after Y).
- 3. If A is SC-before B, and A and B are both memory\_ order\_seq\_cst, then A is restricted-SC-before B.
- 4. There must be no cycles in restricted-SC-before.

[93 words; FK reading ease 73.1]

[276 words; FK reading ease 41.2]

#### Outline

- Introduction to the C11 memory model
- Overhauling the rules for SC atomics in C11
- Introduction to the OpenCL memory model
- Overhauling the rules for SC atomics in OpenCL

• Execution hierarchy:

- Execution hierarchy:
  - Many threads form a work-group

- Execution hierarchy:
  - Many threads form a **work-group**
  - Many work-groups execute on a **device**

- Execution hierarchy:
  - Many threads form a **work-group**
  - Many work-groups execute on a **device**
  - Several devices form a heterogeneous system

- Execution hierarchy:
  - Many threads form a **work-group**
  - Many work-groups execute on a **device**
  - Several devices form a heterogeneous system
- Memory hierarchy:

- Execution hierarchy:
  - Many threads form a **work-group**
  - Many work-groups execute on a **device**
  - Several devices form a heterogeneous system
- Memory hierarchy:
  - private (accessible to one thread)

- Execution hierarchy:
  - Many threads form a **work-group**
  - Many work-groups execute on a **device**
  - Several devices form a heterogeneous system
- Memory hierarchy:
  - private (accessible to one thread)
  - local (accessible to one work-group)

- Execution hierarchy:
  - Many threads form a **work-group**
  - Many work-groups execute on a **device**
  - Several devices form a heterogeneous system
- Memory hierarchy:
  - private (accessible to one thread)
  - local (accessible to one work-group)
  - global (accessible to all devices)

- Execution hierarchy:
  - Many threads form a **work-group**
  - Many work-groups execute on a **device**
  - Several devices form a heterogeneous system
- Memory hierarchy:
  - private (accessible to one thread)
  - local (accessible to one work-group)
  - global (accessible to all devices)
  - global\_fga (accessible to all devices, allows inter-device communication)
















T3



T3























## OpenCL memory scopes

• Memory consistency can be localised to one subtree of the execution hierarchy.

## OpenCL memory scopes

• Memory consistency can be localised to one subtree of the execution hierarchy.

```
atomic_store_explicit(x, 1,
    memory_order..., memory_scope_device );
    memory scope work group
```

## OpenCL memory scopes

• Memory consistency can be localised to one subtree of the execution hierarchy.





Threads in different work-groups, but same device if (atomic\_load\_explicit(y, \*x = 42;atomic\_store\_explicit(y, 1, memory order acquire, memory\_order\_release, memory\_scope\_work\_group)) memory\_scope\_work\_group); print(x);











# Scope inclusion

•  $(e1, e2) \in incl$  iff:

*e1*'s scope is wide enough to reach *e2*, and *e2*'s scope is wide enough to reach *e1*.

#### Outline

- Introduction to the C11 memory model
- Overhauling the rules for SC atomics in C11
- Introduction to the OpenCL memory model
- Overhauling the rules for SC atomics in OpenCL

• There is a total order *S* over [...]

- There is a total order *S* over [...]
- PROVIDING every SC operation
   has memory\_scope\_all\_svm\_devices and
   accesses global\_fga memory

- There is a total order *S* over [...]
- PROVIDING every SC operation
   has memory\_scope\_all\_svm\_devices and
   accesses global\_fga memory
- OR every SC operation
   has memory\_scope\_device and
   does not access global\_fga memory

Can't always tell whether a location is global or global\_fga!

Can't always tell whether a location is global or global\_fga!

S The default, which is memory\_scope\_device, is not always enough!

Can't always tell whether a location is global or global\_fga!

S The default, which is memory\_scope\_device, is not always enough!



Non-compositional!



The default, which is memory\_scope\_device, is not always enough!



Non-compositional!



Unnecessarily restrictive!



The default, which is memory\_scope\_device, is not always enough!



Non-compositional!





And too weak anyway!

# $\operatorname{acy}(\mathsf{SC}^2 \cap (Fsb^?; (ghb \cup lhb \cup rb \cup mo); sbF^?) \cap \frac{incl}{(\mathsf{O}-\mathsf{S}_{\mathsf{scoped}})})$

 $\begin{array}{l} \operatorname{acy}(\operatorname{SC}^2 \cap (Fsb^?;(ghb \cup lhb \cup rb \cup mo);sbF^?) \cap \underset{(\operatorname{O-S_{scoped}})}{incl} \\ & (\operatorname{O-S_{scoped}}) \end{array}$ 

#### Changing the standard

If one of the following two conditions holds:

- All memory\_order\_seq\_cst operations have the scope memory\_scope\_all\_svm\_devices and all affected memory locations are contained in system allocations or fine grain SVM buffers with atomics support
- All memory\_order\_seq\_cst operations have the scope memory\_scope\_device and all affected memory locations are not located in system allocated regions or fine-grain SVM buffers with atomics support

then there shall exist a single total order S for all memory\_order\_seq\_cst operations that is consistent with the modification orders for all affected locations, as well as the appropriate global-happens-before and local-happens-before orders for those locations, such that each memory\_order\_seq\_cst operation B that loads a value from an atomic object M in global or local memory observes one of the following values:

- the result of the last modification A of M that precedes B in S, if it exists, or
- if A exists, the result of some modification of M in the visible sequence of side effects with respect to B that is not memory\_order\_seq\_cst and that does not happen before A, or
- if A does not exist, the result of some modification of M in the visible sequence of side effects with respect to B that is not memory\_order\_seq\_cst.

[...]

If the total order S exists, the following rules hold:

- For an atomic operation B that reads the value of an atomic object M, if there is a memory\_order\_seq\_cst fence X sequenced-before B, then B observes either the last memory\_order\_seq\_cst modification of M preceding X in the total order S or a later modification of M in its modification order.
- For atomic operations A and B on an atomic object M, where A modifies M and B takes its value, if there is a memory\_order\_seq\_cst fence X such that A is sequencedbefore X and B follows X in S, then B observes either the effects of A or a later modification of M in its modification order.
- For atomic operations A and B on an atomic object M, where A modifies M and B takes its value, if there are memory\_order\_seq\_cst fences X and Y such that A is sequenced- before X, Y is sequenced-before B, and X precedes Y in S, then B observes either the effects of A or a later modification of M in its modification order.
- For atomic operations A and B on an atomic object M, if there are memory\_order\_seq\_cst fences X and Y such that A is sequenced-before X, Y is sequenced-before B, and X precedes Y in S, then B occurs later than A in the modification order of M.



- 1. A value computation A of an object M reads before a side effect B on M if A and B are different operations and B follows, in the modification order of M, the side effect that A observes.
- 2. If X reads before Y, or global happens before Y, or local happens before Y, or precedes Y in modification order, then X (as well as any fences sequenced before X) is SC-before Y (as well as any fences sequenced after Y).
- 3. If A is SC-before B, and A and B are both memory\_ order\_seq\_cst, and A and B have inclusive scopes, then A is restricted-SC-before B.
- 4. There must be no cycles in restricted-SC-before.

[106 words; FK reading ease 71.0]

[391 words; FK reading ease -22.0]

# TL;DR

- The rules for sequentially-consistent atomic operations and fences ("SC atomics") in C11 and OpenCL are
  - too **complex**,



too **weak**, and



• We suggest how to fix them  $\mathfrak{S}$ .