

John Wickerson Ma Imperial 😹 K

Mark Batty Kent **ﷺ**  Brad BeckmannAlly DonaldsonAMD ■Imperial ■



#### Pittsburgh, 30 October 2015



#### In brief

 Remote-scope promotion is a GPU programming extension from AMD for efficient work-stealing

Synchronization Using Remote-Scope Promotion Marc S. Orr<sup>†§</sup>, Shuai Che<sup>§</sup>, Ayse Yilmazer<sup>§</sup>, Bradford M. Beckmann<sup>§</sup>, Mark D. Hill<sup>†§</sup>, David A. Wood<sup>†§</sup> {Shuai.Che, Ayse.Yilmazer, Brad.Beckmann}@amd.com <sup>†</sup>University of Wisconsin-Madison Computer Sciences {morr, markhill, david}@cs.wisc.edu As processors evolve to support more threads, synchroniz-1. Introduction ing among those threads becomes increasingly expensive. This is particularly true for massively-threaded, throughtereseneous system architecture (HSA) and OpenCL™ region to facilitate low overhead Abstract

#### In brief

- Remote-scope promotion is a GPU programming extension from AMD for efficient work-stealing
- We formalised the design (at SW and HW level). This led to a corrected and improved implementation.

Synchronization Using Remote-Scope Promotion Marc S. Orr<sup>†§</sup>, Shuai Che<sup>§</sup>, Ayse Yilmazer<sup>§</sup>, Bradford M. Beckmann<sup>§</sup>, Mark D. Hill<sup>†§</sup>, David A. Wood<sup>†§</sup> {Shuai.Che, Ayse.Yilmazer, Brad.Beckmann}@amd.com <sup>†</sup>University of Wisconsin-Madison Computer Sciences {morr, markhill, david}@cs.wisc.edu As processors evolve to support more threads, synchroniz-1. Introduction ing among those threads becomes increasingly expensive. This is particularly true for massively-threaded, throughterroreneous system architecture (HSA) and OpenCL™ registerion to facilitate low overhead Abstract

#### In brief

- Remote-scope promotion is a GPU programming extension from AMD for efficient work-stealing
- We formalised the design (at SW and HW level). This led to a corrected and improved implementation.
- Formalise
   early in the design process!

Synchronization Using Remote-Scope Promotion Marc S. Orr<sup>†§</sup>, Shuai Che<sup>§</sup>, Ayse Yilmazer<sup>§</sup>, Bradford M. Beckmann<sup>§</sup>, Mark D. Hill<sup>†§</sup>, David A. Wood<sup>†§</sup> {Shuai.Che, Ayse.Yilmazer, Brad.Beckmann}@amd.com <sup>†</sup>University of Wisconsin-Madison Computer Sciences {morr, markhill, david}@cs.wisc.edu As processors evolve to support more threads, synchroniz-1. Introduction ing among those threads becomes increasingly expensive. This is particularly true for massively-threaded, througharmanaous system architecture (HSA) and OpenCL™ registerion to facilitate low overhead Abstract

#### This talk

- 1. Background: What is RSP?
- 2. Adding RSP to the OpenCL memory model
- 3. A formalised implementation of OpenCL+RSP

#### This talk



- 2. Adding RSP to the OpenCL memory model
- 3. A formalised implementation of OpenCL+RSP

#### C11: flat thread structure



#### OpenCL: thread groupings



#### GPUs: hierarchical memory











































## Work-stealing



#### This talk

- 1. Background: What is RSP?
- 2. Adding RSP to the OpenCL memory model
- 3. A formalised implementation of OpenCL+RSP

Operations A and B only synchronise if they have inclusive scopes.

- Operations A and B only synchronise if they have inclusive scopes.
- A and B have inclusive scopes iff
   A reaches B and B reaches A.

- Operations A and B only synchronise if they have inclusive scopes.
- A and B have inclusive scopes iff
   A reaches B and B reaches A.

• A reaches B iff

A has <u>workgroup</u> scope and B is in the same workgroup, or
A has <u>device</u> scope and B is in the same device, or
A has <u>all-devices</u> scope.

- Operations A and B only synchronise if they have inclusive scopes.
- A and B have inclusive scopes iff
   A reaches B and B reaches A, or
   A reaches B and B is remote, or
   A is remote and B reaches A.
- A reaches B iff

A has <u>workgroup</u> scope and B is in the same workgroup, or
A has <u>device</u> scope and B is in the same device, or
A has <u>all-devices</u> scope.

```
15
  (* Scope annotations *)
   let s_wi = memory_scope_work_item
16
   let s_wg = memory_scope_work_group
17
   let s_dev = memory_scope_device
18
   let s_all = memory_scope_all_svm_devices
19
20
   (* Inclusive scopes *)
21
22
   let incl1 = ([s_wi] ; wi)
23
             | ([s_wg] ; wg)
24
             [ ([s_dev] ; dev)
25
             ([s_all] ; unv)
26
27
   let incl = (incl1 & (incl1^-1))
28
             ([remote] ; incl1)
29
             ((incl1^-1) ; [remote])
30
31
    (**********************
32
   (* Synchronisation *)
33
   (**********************
34
35
   let acq = (mo_acq | mo_sc | mo_acq_rel) & (R | F | rmw)
36
   let rel = (mo_rel | mo_sc | mo_acq_rel) & (W | F | rmw)
37
38
   (* Release sequence *)
39
```

### Testing OpenCL+RSP programs

# Testing OpenCL+RSP programs

 We simulated the **12** litmus tests designed by the original developers to define their expectations of RSP.

herd - bash - 115×34

Babillion:herd jpw48\$

🚞 herd - bash - 115×34

herd – bash – 115×34 Babillion:herd jpw48\$ less testsuite/RSPTests/RSP\_Test1.litmus

```
OpenCL RSP_Test1
{
  [x]=0;
 [y]=0;
}
P0 (global atomic_int* x, global atomic_int* y) {
  atomic_store_explicit
    (x, 1, memory_order_release, memory_scope_work_group);
}
P1 (global atomic_int* x, global atomic_int* y) {
  atomic_store_explicit
    (y, 1, memory_order_release, memory_scope_work_group);
}
P2 (global atomic_int* x, global atomic_int* y) {
  int r0 = atomic_load_explicit
    (x, memory_order_acquire, memory_scope_work_group);
  int r1 = atomic_load_explicit
    (y, memory_order_acquire, memory_scope_work_group);
}
P3 (global atomic_int* x, global atomic_int* y) {
  int r2 = atomic_load_explicit_remote
    (y, memory_order_acquire, memory_scope_device);
  int r3 = atomic_load_explicit_remote
    (x, memory_order_acquire, memory_scope_device);
}
scopeTree (device (work_group P0 P1 P2) (work_group P3))
exists (2:r0=1 /\ 2:r1=0 /\ 3:r2=1 /\ 3:r3=0)
testsuite/RSPTests/RSP_Test1.litmus (END)
```

...

herd - bash - 115x34

Babillion:herd jpw48\$ less testsuite/RSPTests/RSP\_Test1.litmus Babillion:herd jpw48\$ ./herd -initwrites true -model opencl\_rem.cat testsuite/RSPTests/RSP\_Test1.litmus

| ● ● ● ■ ■ herd - bash - 115×34                                                                           |
|----------------------------------------------------------------------------------------------------------|
| Babillion:herd jpw48\$ less testsuite/RSPTests/RSP_Test1.litmus                                          |
| Babillion:herd jpw48\$ ./herd -initwrites true -model opencl_rem.cat testsuite/RSPTests/RSP_Test1.litmus |
| Test RSP_Test1 Allowed                                                                                   |
| States 16                                                                                                |
| 2:r0=0; 2:r1=0; 3:r2=0; 3:r3=0;                                                                          |
| 2:r0=0; 2:r1=0; 3:r2=0; 3:r3=1;                                                                          |
| 2:r0=0; 2:r1=0; 3:r2=1; 3:r3=0;                                                                          |
| 2:r0=0; 2:r1=0; 3:r2=1; 3:r3=1;                                                                          |
| 2:r0=0; 2:r1=1; 3:r2=0; 3:r3=0;                                                                          |
| 2:r0=0; 2:r1=1; 3:r2=0; 3:r3=1;                                                                          |
| 2:r0=0; 2:r1=1; 3:r2=1; 3:r3=0;                                                                          |
| 2:r0=0; 2:r1=1; 3:r2=1; 3:r3=1;                                                                          |
| 2:r0=1; 2:r1=0; 3:r2=0; 3:r3=0;                                                                          |
| 2:r0=1; 2:r1=0; 3:r2=0; 3:r3=1;                                                                          |
| 2:r0=1; 2:r1=0; 3:r2=1; 3:r3=0;                                                                          |
| 2:r0=1; 2:r1=0; 3:r2=1; 3:r3=1;                                                                          |
| 2:r0=1; 2:r1=1; 3:r2=0; 3:r3=0;                                                                          |
| 2:r0=1; 2:r1=1; 3:r2=0; 3:r3=1;                                                                          |
| 2:r0=1; 2:r1=1; 3:r2=1; 3:r3=0;                                                                          |
| 2:r0=1; 2:r1=1; 3:r2=1; 3:r3=1;                                                                          |
| 0k                                                                                                       |
| Witnesses                                                                                                |
| Positive: 1 Negative: 15                                                                                 |
| Bad executions (0 in total):                                                                             |
| Condition exists (2:r0=1 /\ 2:r1=0 /\ 3:r2=1 /\ 3:r3=0)                                                  |
| Observation RSP_Test1 Sometimes 1 15                                                                     |
| Hash=305e6af6b482d95960e572605703996c                                                                    |
|                                                                                                          |
| Babillion:herd jpw48\$                                                                                   |
|                                                                                                          |

# Testing OpenCL+RSP programs

- We simulated the **12** litmus tests designed by the original developers to define their expectations of RSP.
- We found **8** were good, but:
  - 2 had unintentional races,
  - 1 enforced broken behaviour, and
  - **1** forbade reasonable behaviour.

# Testing OpenCL+RSP programs

- We simulated the **12** litmus tests designed by the original developers to define their expectations of RSP.
- We found **8** were good, but:
  - 2 had unintentional races,
  - 1 enforced broken behaviour, and
  - **1** forbade reasonable behaviour.
- We also found (and fixed) bugs in their workstealing queue implementation

#### This talk

- 1. Background: What is RSP?
- 2. Adding RSP to the OpenCL memory model
- 3. A formalised implementation of OpenCL+RSP

- Model of GPU hardware
- Assembly-like language

- Model of GPU hardware
- Assembly-like language
- Compiler mapping from OpenCL+RSP operations to sequences of assembly instructions

- Model of GPU hardware
- Assembly-like language
- Compiler mapping from OpenCL+RSP operations to sequences of assembly instructions
- Can then prove that all behaviours of the compiled program are allowed by the OpenCL+RSP memory model.

```
x \in Addr
 r \in Reg
 v \in Val \stackrel{\text{def}}{=} \mathbb{Z}
           FifoEl \stackrel{\text{def}}{=} Addr \cup \{FLUSH_{dwt} \mid d, w, t \in \mathbb{N}\}
               Fifo \stackrel{\text{def}}{=} FifoEl queue
       Hygiene \stackrel{\text{def}}{=} {CLEAN, DIRTY}
     Freshness \stackrel{\text{def}}{=} {VALID, INV'D}
 CacheEntry \stackrel{\text{def}}{=} Val × (hy: Hygiene) × (fr: Freshness)
C \in Cache \stackrel{\text{def}}{=} (Addr \rightharpoonup CacheEntry) \times (\text{fifo: } Fifo)
              Lock \stackrel{\text{def}}{=} \{ \widehat{\blacksquare}, \blacksquare \}
        ThState \stackrel{\text{def}}{=} Reg \rightarrow Val
        WqState \stackrel{\text{def}}{=} ThState \text{ list} \times (L1: Cache) \times (rmw: Lock)
        DvState \stackrel{\text{def}}{=} WgState \text{ list} \times (\text{L2: } Cache) \times
                                (lockfile: Addr \rightarrow Lock)
           Global \stackrel{\text{def}}{=} Addr \rightarrow Val
\Sigma \in SyState \stackrel{\text{def}}{=} DvState \text{ list} \times (\text{gl: } Global)
```

SyState Global  $\frac{Addr}{Val}$ *DvState* L2: Cache Freshness WgState HygieneL1: Cache  $\frac{Addr}{Val}$ *ThState* Freshness Hygiene $\frac{Reg}{Val}$  $\frac{Addr}{Val}$ Fifo rmw: *Lock* Fifo AddrLock

| (SyState     | Global                                                                                             |
|--------------|----------------------------------------------------------------------------------------------------|
| DvState [L2: | $\begin{array}{c c} \hline Cache \\ \hline & S_{2} \\ \hline & S_{2} \\ \hline & & \\ \end{array}$ |
|              | ifo                                                                                                |

| SyState                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               | Contraction of the second s | Global             |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------|--------------------|
| DvState                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               | $\left[ L2: Cache \right]$                                                                                      | $\frac{Addr}{Val}$ |
| WgState $ThState$ $bar Variation In the second structurebar Variation In the second structureThStatebar Variation In the second structurebar Variation In the second structureTrow Th$ |                                                                                                                 |                    |

SyState Global  $\frac{Addr}{Val}$ *DvState* L2: Cache Freshness WgState HygieneL1: Cache  $\frac{Addr}{Val}$ *ThState* FreshnessHygiene $\frac{Reg}{Val}$  $\frac{Addr}{Val}$ Fifo rmw: *Lock* Fifo AddrLock

SyState Global  $\frac{Addr}{Val}$ *DvState* L2: Cache Freshness WgState HygieneL1: Cache  $\frac{Addr}{Val}$ *ThState* FreshnessAddr Val Hygiene  $\frac{Reg}{Val}$ Fifo rmw: *Lock* Fifo AddrLock

| SyState                                                 | Global        |
|---------------------------------------------------------|---------------|
| $ \begin{array}{ c c c c c c c c c c c c c c c c c c c$ | Global<br>Val |

| $\begin{array}{c c} DvState \\ \hline WgState \\ \hline ThState \\$ |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |

SyState Global  $\frac{Addr}{Val}$ *DvState* L2: Cache Freshness WgState HygieneL1: Cache  $\frac{Addr}{Val}$ *ThState* Freshness Hygiene $\frac{Reg}{Val}$  $\frac{Addr}{Val}$ Fifo rmw: *Lock* Fifo AddrLock

SyState Global  $\frac{Addr}{Val}$ *DvState* L2: Cache Freshness WgState HygieneL1: Cache  $\frac{Addr}{Val}$ *ThState* FreshnessHygiene $\frac{Reg}{Val}$  $\frac{Addr}{Val}$ Fifo rmw: *Lock* Fifo AddrLock

SyState Global  $\frac{Addr}{Val}$ *DvState* L2: Cache Freshness WgState HygieneL1: Cache  $\frac{Addr}{Val}$ *ThState* Freshness Hygiene $\frac{Reg}{Val}$  $\frac{Addr}{Val}$ Fifo rmw: *Lock* Fifo AddrLock

SyState Global  $\frac{Addr}{Val}$ *DvState* L2: Cache Freshness WgState HygieneL1: Cache  $\frac{Addr}{Val}$ *ThState* Freshness Hygiene $\frac{Reg}{Val}$  $\frac{Addr}{Val}$ Fifo rmw: *Lock* Fifo AddrLock

SyState Global  $\frac{Addr}{Val}$ *DvState* L2: Cache Freshness WgState HygieneL1: Cache  $\frac{Addr}{Val}$ *ThState* Freshness Hygiene $\frac{Reg}{Val}$  $\frac{Addr}{Val}$ Fifo rmw: *Lock* Fifo AddrLock

|                | na or WG              | DV (not remote)                                                                                            | DV (remote)                                                                                                                        |
|----------------|-----------------------|------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------|
| r=load(x)      | LDrx                  | INV <sub>L1</sub> WG<br>LD r x                                                                             | $ \left. \begin{array}{c} FLU_{L1} & DV \\ INV_{L1} & WG \\ LD & r & x \end{array} \right\} LK & x $                               |
| store(x,r)     | ST r x                | FLU <sub>L1</sub> WG<br>ST r x                                                                             | $ \left. \begin{array}{c} FLU_{L1} & WG \\ ST & r & x \\ INV_{L1} & DV \end{array} \right\} LK & x \\ \end{array} $                |
| r=fetch_inc(x) | INC <sub>L1</sub> r x | $	extsf{FLU}_{L1} \ 	extsf{WG} \ 	extsf{INV}_{L1} \ 	extsf{WG} \ 	extsf{INC}_{L2} \ 	extsf{r} \ 	extsf{x}$ | FLU <sub>L1</sub> DV<br>INV <sub>L1</sub> WG<br>INC <sub>L2</sub> r x<br>INV <sub>L1</sub> DV<br>LK x<br>LK x<br>LK <sub>rmw</sub> |

|                | na or WG              | DV (not remote)                                                                                            | DV (remote)                                                                                                                        |
|----------------|-----------------------|------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------|
| r=load(x)      | LDrx                  | INV <sub>L1</sub> WG<br>LD r x                                                                             | $ \left. \begin{array}{c} FLU_{L1} & DV \\ INV_{L1} & WG \\ LD & r & x \end{array} \right\} LK & x $                               |
| store(x,r)     | ST r x                | FLU <sub>L1</sub> WG<br>ST r x                                                                             | $ \left. \begin{array}{c} FLU_{L1} & WG \\ ST & r & x \\ INV_{L1} & DV \end{array} \right\} LK & x \\ \end{array} $                |
| r=fetch_inc(x) | INC <sub>L1</sub> r x | $	extsf{FLU}_{L1} \ 	extsf{WG} \ 	extsf{INV}_{L1} \ 	extsf{WG} \ 	extsf{INC}_{L2} \ 	extsf{r} \ 	extsf{x}$ | FLU <sub>L1</sub> DV<br>INV <sub>L1</sub> WG<br>INC <sub>L2</sub> r x<br>INV <sub>L1</sub> DV<br>LK x<br>LK x<br>LK <sub>rmw</sub> |

| message-passi  | ng fails              | DV (not remote)                                                       | DV (remote)                                                                                                         |
|----------------|-----------------------|-----------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------|
| r=load(x)      | LDrx                  | INV <sub>L1</sub> WG<br>LD r x                                        | $ \left. \begin{array}{c} FLU_{L1} & DV \\ INV_{L1} & WG \\ LD & r & x \end{array} \right\} LK & x \\ $             |
| store(x,r)     | ST r x                | FLU <sub>L1</sub> WG<br>ST r x                                        | $ \left. \begin{array}{c} FLU_{L1} & WG \\ ST & r & x \\ INV_{L1} & DV \end{array} \right\} LK & x \\ \end{array} $ |
| r=fetch_inc(x) | INC <sub>L1</sub> r x | FLU <sub>L1</sub> WG<br>INV <sub>L1</sub> WG<br>INC <sub>L2</sub> r x | FLU <sub>L1</sub> DV<br>INV <sub>L1</sub> WG<br>INC <sub>L2</sub> r x<br>INV <sub>L1</sub> DV<br>LK x<br>LK x       |



### Unnecessary cacheline stalling

| message-pass              | ing fails              | DV (not remote)                                                       | DV (remote)                                                                                                  |
|---------------------------|------------------------|-----------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------|
| r=load(x)                 | LDrx                   | INV <sub>L1</sub> WG<br>LD r x                                        | $ \begin{array}{c} FLU_{L1} & DV \\ INV_{L1} & WG \\ LD & r & x \end{array} \right\} LK & x \\ \end{array} $ |
| store(x,r)                | ST r x<br>micity fails | FLU <sub>L1</sub> WG<br>'r x                                          | $FLU_{L1} WG$ $ST r x$ $INV_{L1} DV$                                                                         |
| <pre>r=fetch_inc(x)</pre> | INC <sub>L1</sub> r x  | FLU <sub>L1</sub> WG<br>INV <sub>L1</sub> WG<br>INC <sub>L2</sub> r x | FLU <sub>L1</sub> DV<br>INV <sub>L1</sub> WG<br>INC <sub>L2</sub> r x<br>INV <sub>L1</sub> DV                |

#### Revised scheme

|                | na Or WG       | DV (not remote)                                                       | DV (remote)                                                                                                           |
|----------------|----------------|-----------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------|
| r=load(x)      | LD r x         | LD r x<br>INV <sub>L1</sub> WG                                        | LD r x<br>FLU <sub>L1</sub> DV<br>INV <sub>L1</sub> WG                                                                |
| store(x,r)     | ST r x         | FLU <sub>L1</sub> WG<br>ST r x                                        | $ \left. \begin{array}{c} FLU_{L1} & WG \\ INV_{L1} & DV \\ ST & r & x \end{array} \right\} LK_{rmw} $                |
| r=fetch_inc(x) | $INC_{L1}$ r x | FLU <sub>L1</sub> WG<br>INC <sub>L2</sub> r x<br>INV <sub>L1</sub> WG | FLU <sub>L1</sub> WG<br>INV <sub>L1</sub> DV<br>INC <sub>L2</sub> r x<br>FLU <sub>L1</sub> DV<br>INV <sub>L1</sub> WG |

|                | na or WG              | DV (not remote)                                                                                            | DV (remote)                                                                                                                        |
|----------------|-----------------------|------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------|
| r=load(x)      | LDrx                  | INV <sub>L1</sub> WG<br>LD r x                                                                             | $ \left. \begin{array}{c} FLU_{L1} & DV \\ INV_{L1} & WG \\ LD & r & x \end{array} \right\} LK & x $                               |
| store(x,r)     | ST r x                | FLU <sub>L1</sub> WG<br>ST r x                                                                             | $ \left. \begin{array}{c} FLU_{L1} & WG \\ ST & r & x \\ INV_{L1} & DV \end{array} \right\} LK & x \\ \end{array} $                |
| r=fetch_inc(x) | INC <sub>L1</sub> r x | $	extsf{FLU}_{L1} \ 	extsf{WG} \ 	extsf{INV}_{L1} \ 	extsf{WG} \ 	extsf{INC}_{L2} \ 	extsf{r} \ 	extsf{x}$ | FLU <sub>L1</sub> DV<br>INV <sub>L1</sub> WG<br>INC <sub>L2</sub> r x<br>INV <sub>L1</sub> DV<br>LK x<br>LK x<br>LK <sub>rmw</sub> |

#### Proof of correctness

• Theorem stated in Isabelle, proved by hand.

#### Summary

- Remote-scope promotion is a GPU programming extension from AMD for efficient work-stealing
- We formalised the design (at SW and HW level). This led to a corrected and improved implementation.
- Formalise **early** in the design process!