-
Notifications
You must be signed in to change notification settings - Fork 4
/
C++11MemoryModel.tex
406 lines (285 loc) · 16.2 KB
/
C++11MemoryModel.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
\newpage
% \section{Foundations of the C++ Concurrency Memory Model\cite{PPTTheC19:online,boehm2008foundations}}
% \subsection{Introduction\cite{TheC11Me27:online}}
% Momory model is the guarantee provided by the runtime environment to a
% multithreaded program, regaring the order of memory operations.
% Each level of the environment(e.g. CPU, virtual machine, compiler) might have a different memory model
% .The
% correctness of parallel algorithms depends on the memory model.
% \texttt{Threads} are now part of the C++ 11language.
% Their behavior must be fully defined so that the same code
% run the same on different environments regardless of the CPU
% (Intel, AMD, ARM, …) or the OS.
% Undefined behaviour matters a lot. See the example shown in \ref{lst:e1} .
% The compiler knows x is bound between 0 and 2, so instead of
% creating a series of if’s, it can simply jump to
% (switch base address + x * constant).
% But if x is shared and unprotected, it might be changed
% between the if and the switch, causing an unrestricted jump,
% which can cause unpredictable damages.
% Races are forbidden because the compiler can’t always recognize them (static analysis in not required by the standard),
% and to avoid them the programmer must help the compiler.
% In the example above, even if x were somehow protected, it could still be modified between the if and the switch.
% This would have caused the code to behave differently from the serial implementation.
% However, had the compiler known x might be modified by another thread,
% it would not have used the unsafe jump table optimization.
% \begin{lstlisting}[language=C,frame=single, caption=An simple example ,label = lst:e1]
% if(x >= 0 && x < 3){
% switch(x){
% case 0 : do0(); break;
% case 1 : do1(); break;
% case 2 : do2(); break;
% }
% }
% \end{lstlisting}
% Single-threaded programs are fairly easy to optimize, because there’s no “external viewer” (other than the output).
% So most of the time they execute “atomically”. Shared variables, however, might be viewed by different threads at
% (almost) any given moment. So, their modifications by one thread must seem to be ordered to other threads.
% The listed optimizations in Listing\ref{lst:e1} are just an example obviously there are many more.
% Notice that in c++11 some optimizations can be seen by other threads, which is fully defined by c++11’s memory model.
% A memory location does not depend on the size of the type, and can be part of an array or struct/class.
% Bit fields are different: several might occupy a single memory location.
% About the example in Listing\ref{lst:e2}: writing single bytes can be inefficient on some systems.
% It could be faster to read the whole array of chars, for instance, as a single 32-bit word,
% update one element, and write the whole word back. But that might introduce a race on the other elements!
% This is similar to having two variables on the same cache line. Updating one must not override an update
% of the other. This requires non-trivial synchronization, and a performance overhead.
% This wasn’t a problem on single threaded programs.
% In Java, update of a variable smaller or equal to a dword should not affect an adjacent variable (like C++11).
% However, longs and doubles are not guaranteed to be atomic (unless defined as volatile, in which
% case they are always atomic). This is due to the fact that it is a multi-step operation at the machine code level.
% Note: int i:3 means that the compiler will use 3 bits to store i.
% \begin{lstlisting}[language=C,frame=single, caption=An simple example ,label = lst:e2]
% struct s {
% char c[4];
% int i: 3, j: 4;
% struct in {
% double d;
% } id;
% };
% \end{lstlisting}
% The Sequence Before is the order imposed by the appearance of (most) statements in the code, for a single thread.
% On some occasions, the sequence of execution is not specified (such as function argument evaluation).
% This cases are therefore unordered with respect to each other. For instance, in f(h(), g())
% it is unspecified whether h() or g() will be executed first. Yet, both h() and g() are always sequenced
% before f().
% Synchronized with is the order imposed by an atomic read of a value that has been atomically written
% by some other thread. The said synchronization does not rely on any synchronization object.
% It means that when E1 synchronizes with E2, every write done before E1 should be visible
% to reads done after E2; and no write done after E1 should be visible to reads done before E2.
% This defines an order that crosses thread boundaries.
% Inter-thread happens-before is a combination of sequenced before and synchronized with.
% Here \ref{fig:p224}, the events in thread 1 inter-thread happen-before the events in thread 2.
% \begin{figure}[H]
% \centering
% \includegraphics[width=0.4\textwidth]{p224.png}
% \caption{}
% \label{fig:p224}
% \end{figure}
% The blue arrows \ref{fig:p225} are sequenced-before. The green arrow is synchronized-with.
% Together they form an inter-thread happens-before relation,
% which guarantees the write to y happens-before reading it.
% Note that y isn’t atomic, and in general not safe to use as shared variable.
% However, in this particular scenario, its accesses are synchronized by the use of the atomic x.
% The assert is guaranteed not to fail
% \begin{figure}[H]
% \centering
% \includegraphics[width=0.4\textwidth]{p225.png}
% \caption{}
% \label{fig:p225}
% \end{figure}
% Synchronized-with relation exists only between the releasing thread and the acquiring thread
% Other threads might see updates in a different order.
% All writes before a release are visible after an acquire.
% Even if they were relaxed or non-atomic.
% Similar to release consistency memory model.
% The writes are on different threads. Each acquiring thread is synchronized with
% the thread that has done the release. Thus, the releases are unordered.
% If sequential consistency had been used instead of acquire-release,
% the results would have been different.
% There would have been a total order of writes, so if one of the asserts had succeeded,
% the other must have failed.
% \begin{figure}[H]
% \centering
% \includegraphics[width=0.4\textwidth]{p226.png}
% \caption{}
% \label{fig:p226}
% \end{figure}
% In this model, two processors might see writes to two variables(done by some third processor)
% in a different order.
% This is where C++ atomic differs from Java volatile: Java never relaxes program order between
% volatile variables, while C++ allows reordering of (and across) atomic variables.
% \begin{figure}[H]
% \centering
% \includegraphics[width=0.4\textwidth]{p227.png}
% \caption{}
% \label{fig:p227}
% \end{figure}
% Both asserts might fail because the stores on thread 1 are not ordered,
% and different threads might see them in a different order.
% One way for this to happen is if thread 3 has a speculative load of x before
% the if. When this is the case, x might be read before threads 1 and 2
% started to execute.
% \begin{figure}[H]
% \centering
% \includegraphics[width=0.4\textwidth]{p228.png}
% \caption{}
% \label{fig:p228}
% \end{figure}
% \begin{figure}[H]
% \centering
% \includegraphics[width=0.4\textwidth]{p229.png}
% \caption{}
% \label{fig:p229}
% \end{figure}
% However Java (and C\#) volatile is very similar to C++ atomic
% volatile was actually the only option in pre-C++11 code, but other than avoiding the use of a register it has no multi-threading semantics, because it was introduced way before C++ supported multi-threading. It does usually do the trick, but not due to some standard guarantee.
% Some compilers (VS) add ordering semantics (release consistency) on certain CPUs (Itanium) as an extension – but we’re talking about standard C++ here.
% Lock-free code is hard to write
% Unless speed is crucial, better use mutexes
% Sequential consistency is the default
% It is also the memory model for function that does not have memory-model as an input parameter
% Use the default sequential consistency and avoid data races. Really.
% Acquire-release is hard
% Relaxed is much harder
% Mixing orders is insane (but allowed)
% 27 Summary and recommendations (2)
% The memory model can be passed as run-time parameter
% But it is better to use it as a compilation constant to allow optimizations
% We only covered main orders
% C++11’s basic features are fully supported by all common compilers.
% \subsection{}
% The memory model, or memory consistency model, specifies the
% values that a shared variable read in a multithreaded program is allowed to return.
% The memory model clearly affects programmability.
% It also affects performance and portability by constraining the
% transformations that any part of the system may perform.
% A memory model, a.k.a memory consistency model, is a
% specification of the allowed behavior of multithreaded
% programs executing with shared memory .
% The most basic model is sequential consistency (SC),
% where all insructions from all threads (appear to) form a
% total order that is consistent with the program order on
% each thread.
% \subsection{Concurrency needs sync }
% For performance gains, modern CPUs often execute
% instructions out of order to fully utilize resources.
% Since the hardware enforces instructions integrity,
% we can never notice this in a single thread execution.
% However, for multiple threads, this can lead to
% unpredictable bahaviors.
% In multi-threaded execution, uncontrolled scheduling leads
% to data race, where results depend on the timing execution
% of the code. With some bad luck (i.e., context switches
% that occur at untimely points in the execution), we get
% the wrong result.
% \subsubsection{mutual exclusion (atomic)}
% To achieve atomicity, we can ask hardware for a few useful instructions to build mutual exclusion, which guarantees that if one thread is executing within the critical section, the others will be prevented from doing so.
% \subsubsection{ waiting for another (conditional variable)}
% There are many cases where a thread continues its execution
% only when some condition is met. Thus, one thread must wait
% for another to complete some action before it continues.
% \subsection{Low-Level Atomics}
\section{Memory Model}
\subsection{Why Memory Model?}
\subsection{Memory Coherence}
Memory coherence: a memory system is coherent if any read of a data item returns the most recently written value of that data item.
\subsection{Coherent memory system}
A Coherent memory system should satisfy the following requirements:
\begin{itemize}
\item For a memory address wrote by a processor P, the next reads of P should get the written value.
\item For a memory address wrote by a processor P1, after enough time, another processor P2 can get the value written by P1.
\item The write operations for one memory address are serialized, so if there are 2 writes to a memory address by any processor, any processor cannot get two results in different order.
\end{itemize}
The coherence model does not define when a wrote by P1 can be read by P2. The memory consistency model is responsible for it.
\subsection{Memory Consistency}
A memory consistency model for a shared address space specifies
constraints on the order in which memory operations must appear
to be performed (i.e. to become visible to the processors) with
respect to one another.(when a written value will be returned/seen
by a read).
The memory consistency model defines the order of operation pairs in different addresses.
\subsection{C++ Memory Order}
\subsubsection{Introduction}
The C++ specification does not make reference to any particular compiler,
operating system, or CPU. It makes reference to an abstract machine that
is a generalization of actual systems. In the Language Lawyer world,
the job of the programmer is to write code for the abstract machine;
the job of the compiler is to actualize that code on a concrete
machine. By coding rigidly to the spec, you can be certain that your
code will compile and run without modification on any system with
a compliant C++ compiler, whether today or 50 years from now.
The abstract machine in the C++98/C++03 specification is fundamentally
single-threaded. So it is not possible to write multi-threaded C++
code that is "fully portable" with respect to the spec. The spec does
not even say anything about the atomicity of memory loads and stores
or the order in which loads and stores might happen, never mind things
like mutexes.
Of course, you can write multi-threaded code in practice for particular
concrete systems like pthreads or Windows. But there is no standard way
to write multi-threaded code for C++98/C++03.
The abstract machine in C++11 is multi-threaded by design. It also has
a well-defined memory model; that is, \textbf{it says what the compiler
may and may not do when it comes to accessing memory}.
Consider the following example \ref{C++:1}, where a pair of global variables
are accessed concurrently by two threads:
\begin{lstlisting}[label={C++:1},caption={An example about C++ memory model}]
Global
int x, y;
Thread 1 Thread 2
x = 17; cout << y << " ";
y = 37; cout << x << endl;
\end{lstlisting}
What might Thread 2 output?
Under C++98/C++03, this is not even Undefined Behavior; the question itself is
meaningless because the standard does not contemplate anything called a
"thread".
Under C++11, the result is Undefined Behavior, because loads and stores
need not be atomic in general. Which may not seem like much of an improvement.
And by itself, it's not.
But with C++11, you can write this \ref{C++:2}
\begin{lstlisting}[label={C++:2},caption={Using atomic in C++}]
Global
atomic<int> x, y;
Thread 1 Thread 2
x.store(17); cout << y.load() << " ";
y.store(37); cout << x.load() << endl;
\end{lstlisting}
Now things get much more interesting. First of all, the behavior here is defined.
Thread 2 could now print \textbf{0 0} (if it runs before Thread 1),
\textbf{37 17} (if it runs after Thread 1), or \textbf{0 17} (if it runs after Thread 1 assigns
to x but before it assigns to y).
What it cannot print is 37 0, because the default mode for atomic
loads/stores in C++11 is to enforce sequential consistency.
This just means all loads and stores must be "as if" they happened
in the order you wrote them within each thread, while operations
among threads can be interleaved however the system likes.
So the default behavior of atomics provides both atomicity and
ordering for loads and stores.
Now, on a modern CPU, ensuring sequential consistency can be expensive.
In particular, the compiler is likely to emit full-blown memory barriers
between every access here. But if your algorithm can tolerate out-of-order
loads and stores; i.e., if it requires atomicity but not ordering; i.e.,
if it can tolerate 37 0 as output from this program, then you can write this\ref{C++:3}:
\begin{lstlisting}[label={C++:3},caption={Using memory\_order\_relaxed in C++}]
Global
int x, y;
Thread 1 Thread 2
x.store(17,memory_order_relaxed); cout << y.load(memory_order_relaxed) << " ";
y.store(37,memory_order_relaxed); cout << x.load(memory_order_relaxed) << endl;
\end{lstlisting}
This takes us back to the ordered loads and stores
so 37 0 is no longer a possible output: but it does so with minimal overhead.
(In this trivial example, the result is the same as full-blown sequential
consistency; in a larger program, it would not be.)
Of course, if the only outputs you want to see are 0 0 or 37 17,
you can just wrap a mutex around the original code. But if you have read
this far, I bet you already know how that works, and this answer is
already longer than I intended :-).
Memory Model should define the following:
\begin{itemize}
\item Atomic Operations:
\item Local Sequential: A set of operations should be reordered.
\item Visibility: how shared variables seen to other threads.
\end{itemize}
See more from \cite{CppConcu2:online} also \url{https://zhuanlan.zhihu.com/p/412680378}