\documentclass[english]{article}
\usepackage[title={Advanced Computer Architectures}, date={2022/2023}]{notestemplate}
\begin{document}
\makecover
\section*{Introduction}
This document contains the notes for the \textit{Advanced Computer Architectures} course, for the \textit{2022/2023} class of \textit{Computer Science and Engineering} held at \textit{Politecnico di Milano}.
\bigskip
\begin{itemize}
\item Teacher: \textit{Davide Conficconi}
\item Textbook: \textit{Hennessey and Patterson, Computer Architecture: A Quantitative Approach}
\end{itemize}
\bigskip
By comparing these notes with the material \textit{(slides, video lessons)} provided during the lectures, you will find many discrepancies in the order in which the topics are presented.
This might look like a bizarre \textit{(if not completely stupid)} choice, but it is deliberate, as I found the lessons quite confusing in their ordering and in how each topic was introduced.
You will still find each of the topics explained during the lessons, \textit{(sometimes)} including something more.
\bigskip
If you find any errors and you are willing to contribute and fix them, feel free to send me a pull request on the GitHub repository found at \href{https://github.com/lorossi/advanced-computer-architectures-notes}{github.com/lorossi/advanced-computer-architectures-notes}.
\bigskip
A big thank you to everyone who helped me!
\clearpage
\section{Introduction to the Computer Architectures}
\subsection{Flynn Taxonomy}
\label{sec:flynn-taxonomy}
Proposed in \(1966\) and extended in \(1972\), it provides the first classification of computer architectures.
\begin{itemize}
\item \textit{SISD} - \textbf{single} instruction, \textbf{single} data
\begin{itemize}
\item \textbf{Sequential} programs
\item \textbf{Serial} (non parallel) computer
\item \textbf{Deterministic} execution
\item Only \textbf{one instruction stream} is being executed at a time
\end{itemize}
\item \textit{MISD} - \textbf{multiple} instructions, \textbf{single} data
\begin{itemize}
\item Multiple processors working in \textbf{parallel} on the same data
\item \textbf{Fail safe} due to high redundancy
\item The same algorithm is programmed and implemented in different ways, so if one fails the others are still able to compute the result
\item \textbf{No practical market configuration}
\end{itemize}
\item \textit{SIMD} - \textbf{single} instruction, \textbf{multiple} data
\begin{itemize}
\item Each processor receives \textbf{different data} and performs the \textbf{same operations} on it
\item Used in fields where a single operation must be performed on many different pieces of information \textit{(like in image processing)}
\item Each instruction is executed in a \textbf{synchronous} way by all processing units, each on its own data
\item Best suited for specialized problems characterized by a high degree of regularity, such as graphics or image processing
\item \textbf{Data level parallelism} \textit{(DLP)}
\end{itemize}
\item \textit{MIMD} - \textbf{multiple} instructions, \textbf{multiple} data
\begin{itemize}
\item \textbf{Array of processors} in parallel, each of them executing its instructions
\item Execution can be \textbf{asynchronous} or \textbf{synchronous}, \textbf{deterministic} or \textbf{non-deterministic}
\item The \textbf{most common} type of parallel computer
\end{itemize}
\end{itemize}
\bigskip
\textit{MIMD} and \textit{SIMD} architectures will be further explained in Sections~\ref{sec:mimd}~and~\ref{sec:simd}.
An illustration of the different architectures is displayed in Figure~\ref{fig:flynn-taxonomy}.
\begin{figure}[htbp]
\bigskip
\centering
\tikzfig[1]{image-1.tikz}
\caption{Flynn Taxonomy}
\label{fig:flynn-taxonomy}
\bigskip
\end{figure}
\clearpage
\subsection{Hardware parallelism}
There are different types of hardware parallelisms:
\begin{itemize}
\item \textbf{Instruction Level parallelism} - \textit{(ILP)}
\begin{itemize}
\item Exploits data level parallelism at modest levels through \textbf{compiler techniques} such as pipelining, and at medium levels using speculation
\end{itemize}
\item \textbf{Vector Architectures} and \textbf{Graphic Processor Units}
\begin{itemize}
\item Exploit data level parallelism by applying a single instruction to a \textbf{collection of data} in parallel
\end{itemize}
\item \textbf{Thread level parallelism} - \textit{(TLP)}
\begin{itemize}
\item Exploits either data level parallelism or task level parallelism in a coupled hardware model that allows \textbf{interaction among threads}
\end{itemize}
\item \textbf{Request level parallelism}
\begin{itemize}
\item Exploits parallelism among largely decoupled tasks specified by the programmer or the OS
\end{itemize}
\end{itemize}
Nowadays, heterogeneous systems \textit{(systems that utilize more than one type of parallelism)} are common across all commercial devices.
\clearpage
\section{Performance and cost}
There are multiple types \textit{(classes)} of computers, each with different needs:
the performance measurement is not the same for each of them.
\textit{Price, computing speed, power consumption} can be metrics to measure the performance of a computer.
Programming has become so complicated that it's not possible to balance all the constraints manually;
while computational power has grown larger than ever before, energy consumption is now a significant constraint.
The computer engineering methodology is therefore described as:
\begin{figure}[htbp]
\bigskip
\centering
\begin{minipage}[h]{0.495\textwidth}
\centering
\tikzfig[1]{image-2.tikz}
\end{minipage}
\begin{minipage}[h]{0.495\textwidth}
\begin{enumerate}
\item \label{enum:methodology-start} \textit{evaluate} system by bottlenecks
\item \textit{simulate} new designs and organizations
\item \textit{implement} next generation systems
\item \textit{repeat}
\end{enumerate}
\end{minipage}
\caption{Computer engineering methodology}
\label{fig:computer-engineering-methodology}
\bigskip
\end{figure}
There are more constraints not contained in this model, such as technology trends.
\bigskip
When one computer is faster than another, what quality is being addressed?
The answer is not always the same, and it depends on the application;
different solutions may consider different metrics, such as
speed of execution, power consumption, cost, etc.
The chosen qualities may change according to the use case or the user,
but \(2\) metrics are normally used:
\begin{enumerate}
\item \textbf{Elapsed time}
\begin{itemize}
\item \(\texttt{execution time} = \texttt{time end} - \texttt{time start}\)
\item More relevant for system user
\end{itemize}
\item \textbf{Completion rate}
\begin{itemize}
\item \(\texttt{completion rate} = \texttt{number of jobs} \div \texttt{elapsed time}\)
\item More relevant for the system designer
\end{itemize}
\end{enumerate}
\subsection{Response time vs throughput}
Generally, it can be stated that:
\[ \texttt{throughput} = {1} \div { \texttt{average response time}} \]
However, if the instructions are overlapping each other, the following expression is true:
\[ \texttt{throughput} > {1} \div {\texttt{average response time}} \]
This can be achieved via \textbf{pipelining}.
With pipelining, the \textbf{execution time} of a single instruction is usually slightly \textbf{increased}, while the \textbf{average throughput} is \textbf{increased}: pipelining does not reduce latency, it improves throughput.
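As an illustrative example \textit{(the numbers are made up)}: in a \(5\)-stage pipeline whose stages each take \(1\ \text{ns}\), the response time of a single instruction is \(5\ \text{ns}\), yet one instruction completes every nanosecond once the pipeline is full:
\[ \texttt{throughput} = 1\ \text{instruction/ns} > \dfrac{1}{5\ \text{ns}} = 0.2\ \text{instructions/ns} \]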
\subsection{Factors affecting performance}
A few of the factors affecting the performance are:
\begin{itemize}
\item Algorithm complexity and data sets
\item Compiler
\item Instructions set
\item Available operations
\item Operating systems
\item Clock rate
\item Memory system performance
\item I/O system performance and overhead
\begin{itemize}
\item this is the least optimizable factor, so the main focus is on optimizing all the others
\end{itemize}
\end{itemize}
\bigskip
The locution \textbf{\(X\) is \(n\) times faster than \(Y\)} can be expressed as:
\begin{gather*}
\dfrac{ExTime\ (Y)}{ExTime\ (X)} = \dfrac{Performance\ (X)}{Performance\ (Y)} = Speedup\ (X, Y) \\
Performance\ (X) = \dfrac{1}{ExTime\ (X)}
\end{gather*}
So, in order to optimize a system, it's necessary to rely on common sense (a \textit{sadly} undervalued quality).
While making a design trade-off one must favour the frequent case over the infrequent one.
\textit{For example}:
\begin{itemize}
\item Instructions fetch and decode unit is used more frequently than the multiplier
\begin{itemize}
\item it makes sense to optimize \textbf{instructions fetch unit} first
\end{itemize}
\item If a database server has \(50\) disks per processor, storage dependability is more important than processor dependability
\begin{itemize}
\item it makes sense to optimize \textbf{storage dependability} first
\end{itemize}
\end{itemize}
\subsubsection{Amdahl's law}
As seen before, the speedup due to the enhancement \(E\) is:
\[ Speedup\ (E) = \dfrac{ExTime\ w/o\ E}{ExTime\ w/\ E} = \dfrac{Performance\ w/\ E}{Performance\ w/o\ E} \]
Suppose that enhancement \(E\) accelerates a fraction \(F\) of the task by a factor \(S\) and the remainder of the task is unaffected.
Amdahl's law states that:
\begin{gather*}
ExTime_{new} = ExTime_{old} \times \left[\left(1 - F\right) + \dfrac{F}{S}\right] \\
Speedup = \dfrac{ExTime_{old}}{ExTime_{new}} = \dfrac{1}{(1 - F) + \sfrac{F}{S}} = \dfrac{S}{S - SF + F}
\end{gather*}
\paragraph{Corollary}
The best possible outcome of \textit{Amdahl's law} is:
\[ Speedup = \dfrac{1}{1 - F} \]
\indentquote{If an enhancement is only usable for a fraction of the task, we can't speed up the task by more than the reciprocal of \(1\) minus the fraction}
\textit{Amdahl's law} expresses the law of diminishing returns.
It serves as a guide to how much an enhancement will improve performance and how to distribute resources to improve the cost-over-performance ratio.
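As a numeric example \textit{(values chosen for illustration)}: if an enhancement speeds up a fraction \(F = 0.9\) of the task by a factor \(S = 10\):
\[ Speedup = \dfrac{1}{(1 - 0.9) + \sfrac{0.9}{10}} = \dfrac{1}{0.19} \approx 5.26 \]
far below \(S = 10\), and bounded in any case by the corollary limit \(\sfrac{1}{1 - F} = 10\).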
\subsubsection{\textit{CPU} time}
\textbf{CPU time} is determined by:
\begin{itemize}
\item \textbf{Instruction Count} - \textit{IC}:
\begin{itemize}
\item The number of executed instructions, not the size of static code
\item Determined by multiple factors, including \textit{algorithm, compiler, ISA}
\end{itemize}
\item \textbf{Cycles per instructions} - \textit{CPI}:
\begin{itemize}
\item Determined by \textit{ISA} and \textit{CPU} organization
\item Overlap among instructions reduces this term
\item The \textit{CPI} relative to a process \texttt{P} is calculated as:
\[ CPI(P) = \dfrac{\texttt{\# of clock cycles to execute P}}{\texttt{\# of instructions}} \]
\end{itemize}
\item \textbf{Time per cycle} - \textit{TC}:
\begin{itemize}
\item It's determined by technology, organization and circuit design
\end{itemize}
\end{itemize}
\bigskip
Then, \textit{CPU} time can be calculated as:
\[ CPU_\text{TIME} = T_\text{CLOCK} \cdot CPI \cdot \left(\texttt{\# of instructions}\right) = \dfrac{CPI \cdot \left(\texttt{\# of instructions}\right)}{f_\text{CLOCK}} \]
Note that the \textit{CPI} can vary among instructions because each step of the pipeline might take different amounts of time.
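For instance \textit{(with illustrative values)}: a program of \(10^9\) instructions with an average \(CPI = 1.5\) on a \(2\ \text{GHz}\) clock \textit{(so \(T_\text{CLOCK} = 0.5\ \text{ns}\))} takes:
\[ CPU_\text{TIME} = 0.5\ \text{ns} \cdot 1.5 \cdot 10^9 = 0.75\ \text{s} \]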
The factors that can influence the \textit{CPU} time are shown in Table~\ref{tab:relation-factor-CPU-time}.
\begin{table}[htbp]
\centering
\begin{tabular}{r|c|c|c}
& \textit{IC} & \textit{CPI} & \textit{TC} \\ \hline
\textit{Program} & \xmark & & \\
\textit{Compiler} & \xmark & (\xmark) & \\
\textit{Instruction set} & \xmark & \xmark & \\
\textit{Organization} & & \xmark & \xmark \\
\textit{Technology} & & & \xmark \\
\end{tabular}
\caption{Relation between factors and \textit{CPU} time}
\label{tab:relation-factor-CPU-time}
\end{table}
\paragraph{\textit{CPU} time and cache}
To improve the \textit{CPU} time, an \textbf{instruction cache} can be used.
Using a more realistic model, while calculating \textit{CPU} time, one must also account for:
\begin{itemize}
\item The \textbf{execution} \(\text{CPI}\) - \(\text{CPI}_\text{EXE}\)
\item The \textbf{miss penalty} - \(\text{MISS}_\text{P}\)
\item The \textbf{miss rate} - \(\text{MISS}_\text{R}\)
\item The memory \textbf{references} - \(\text{MEM}\)
\end{itemize}
Then the \(\text{CPI}_{\text{CACHE}}\) can be calculated as:
\[ \text{CPI}_{\text{CACHE}} = \text{CPI}_\text{EXE} + \text{MISS}_\text{P} \cdot \text{MISS}_\text{R} \cdot \text{MEM} \]
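As a hedged numeric example \textit{(values are illustrative)}: with \(\text{CPI}_\text{EXE} = 1.2\), a miss penalty of \(40\) cycles, a miss rate of \(2\%\) and \(1.3\) memory references per instruction:
\[ \text{CPI}_{\text{CACHE}} = 1.2 + 40 \cdot 0.02 \cdot 1.3 = 2.24 \]
Even a low miss rate can thus almost double the effective \textit{CPI}.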
\subsubsection{Other metrics}
There are other metrics to measure the performance of a \textit{CPU}:
\begin{itemize}
\item \textit{MIPS} - millions of instructions per second
\[ MIPS = \dfrac{\texttt{number of instructions}}{T_\text{EXE} \cdot 10^6} = \dfrac{\texttt{clock frequency}}{\texttt{CPI} \cdot 10^6} \]
\begin{itemize}
\item the higher the \textit{MIPS}, the faster the machine \textit{(a worked example follows this list)}
\end{itemize}
\item \textit{Execution time}
\[ T_\text{EXE} = \dfrac{\texttt{instruction count}}{\texttt{MIPS} \cdot 10^6} \]
\item \textit{MFLOPS} - floating point operations in program
\begin{itemize}
\item assumes that floating-point operations are independent of the compiler and the ISA
\item it's not always safe, because of:
\begin{itemize}
\item missing instructions \textit{(e.g. FP divide, square root, sin, cos, \dots)}
\item optimizing compilers
\end{itemize}
\end{itemize}
\end{itemize}
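As a quick numeric check of the formulas above \textit{(the values are illustrative)}: a \(500\ \text{MHz}\) machine with an average \textit{CPI} of \(2.5\) delivers
\[ MIPS = \dfrac{500 \cdot 10^6}{2.5 \cdot 10^6} = 200 \]
so a program of \(10^8\) instructions takes \(T_\text{EXE} = \sfrac{10^8}{\left(200 \cdot 10^6\right)} = 0.5\ \text{s}\).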
Furthermore, the execution time is compared against test programs that:
\begin{itemize}
\item Are chosen to measure performance defined by some groups
\item Are available to the community
\item Run on machines whose performance is well-known and documented
\item Can compare to reports on other machines
\item Are representative
\end{itemize}
\subsection{Averaging metrics}
The simplest approach to summarizing relative performance is to use the total execution time of the \(n\) programs;
however, this generalization does not account for the different durations of the benchmarking programs.
The \textbf{\(3\) different approaches} using means to average a metric are described in Table~\ref{tab:mean-approaches}.
\begin{table}[htbp]
\centering
\begin{tabular}{c|c}
\textit{metric} & \textit{type of mean} \\ \hline
times & arithmetic \\
rates & harmonic \\
normalized execution times \textit{(ratios)} & geometric \\
\end{tabular}
\caption{Mean approaches}
\label{tab:mean-approaches}
\end{table}
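A quick illustration of why the \textbf{geometric} mean suits ratios \textit{(numbers are made up)}: if a machine is twice as fast as the reference on one program \textit{(ratio \(0.5\))} and twice as slow on another \textit{(ratio \(2\))}, then
\[ \sqrt{0.5 \cdot 2} = 1 \]
i.e.\ the two machines are rated as equivalent, a verdict that does not change if the reference machine is swapped; the arithmetic mean of the same ratios would instead give \(1.25\).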
\clearpage
\section{Multithreading and Multiprocessors}
\subsection{Why multithreading?}
Why is multithreading needed?
\begin{itemize}
\item \textit{80's}: expansion of superscalar processors
\begin{itemize}
\item In the '80s, people were writing programs in high-level programming languages
\item Since compiler optimization was not good enough, it was necessary to improve the software translation by making \textit{CPU} instructions more similar to high-level instructions
\item In order to improve performances, the \textbf{pipeline} was introduced
\begin{itemize}
\item it overlaps the execution of \textbf{more than one instruction at a time}
\item more instructions are completed in the same amount of time
\item it's a kind of hardware-level implicit parallelism
\end{itemize}
\end{itemize}
\item \textit{90's}: decreasing returns on investments
\begin{itemize}
\item Since all the parallelism was implemented by the hardware \textit{(or, at most, the compiler)}, there was no effective way to manually handle the performance
\item There were many different issues:
\begin{itemize}
\item \(2\)- to \(6\)-way issue, out-of-order issue, branch prediction: all lowering the \textit{CPI} from \(1\) to \(0.5\)
\item performance below expectations
\item these performance issues led to delayed and cancelled projects
\end{itemize}
\item All the previous improvements were due to the shrinking size of transistors, a trend that was gradually slowing down
\begin{itemize}
\item the number of transistors followed Moore's law, doubling every \(18\) to \(24\) months
\item the frequency and the per-core performance did not, due to interference and energy problems
\end{itemize}
\end{itemize}
\item \textit{2000}: beginning of the multi-core era
\begin{itemize}
\item Since the \textit{CPU} frequency could not be increased anymore, the only solution left was to increase the number of threads in every processor
\item This implied the need to introduce a software way of handling parallelism, in harmony with enhanced hardware
\end{itemize}
\end{itemize}
\bigskip
Motivations for the paradigm change:
\begin{itemize}
\item Modern processors \textbf{fail to utilize execution resources} well enough
\begin{itemize}
\item There's no single culprit:
\begin{itemize}[label=\xmarkthin]
\item \textit{memory conflicts}
\item \textit{control hazards}
\item \textit{branch misprediction}
\item \textit{cache miss}
\item \ldots
\end{itemize}
\item All those problems are correlated and there's no way of solving one of them without affecting all the others
\item There's the need for a general latency-tolerance approach that can hide all sources of latency
\begin{itemize}[label=\cmarkthin]
\item a solution is found in \textbf{parallel programming}
\end{itemize}
\end{itemize}
\end{itemize}
\subsubsection{Parallel Programming}
Explicit parallelism implies structuring the applications into concurrent and communicating tasks.
Operating systems offer mechanisms to implement such features: \textbf{threads} and \textbf{processes}.
Multitasking is implemented differently based on the characteristics of the \textit{CPU}:
\begin{itemize}
\item \textbf{Single} core
\item \textbf{Single} core with \textbf{multithreading} support
\item \textbf{Multi} core
\end{itemize}
In multithreading, multiple threads share the functional units of one processor via overlapping.
The processor must duplicate the independent state of each thread:
\begin{itemize}
\item Separate copy of the \textbf{register files}
\item Separate \textbf{\texttt{PC}}
\item Separate \textbf{page table}
\end{itemize}
\begin{figure}[htbp]
\bigskip
\centering
\tikzfig{image-3.tikz}
\caption{Multiplicity of processes and threads}
\label{fig:multiplicity-of-processes-and-threads}
\bigskip
\end{figure}
The memory is shared via the virtual memory mechanisms, which already support multiple processes.
Finally, the hardware must be ready for a fast thread switch: it must be faster than a full process switch \textit{(which is in the order of hundreds to thousands of clock cycles)}.
There are \(2\) main solutions:
\begin{enumerate}
\item \textbf{Fine grained} multi-threading
\begin{itemize}
\item Switches from one thread to another at each instruction by \textbf{taking turns}
\begin{itemize}
\item threads that are stalled are skipped in the rotation
\end{itemize}
\item The execution of multiple threads is \textbf{interleaved}
\item The \textit{CPU} must be able to change thread at \textbf{every clock cycle}
\item For \(n\) processes, each gets \(\sfrac{1}{n}\) of \textit{CPU} time
\begin{itemize}
\item \(n\) times the original resources are needed
\end{itemize}
\end{itemize}
\item \textbf{Coarse grained} multithreading
\begin{itemize}
\item Switching from one thread to another occurs only when there are long \textbf{stalls} in the active process
\item The interested threads \textbf{share many resources}
\item The switching from one thread to the other requires \textbf{different clock cycles} to save the context
\end{itemize}
\end{enumerate}
\bigskip
\textbf{Disadvantages} of coarse-grained multithreading:
\begin{itemize}[label=\xmarkthin]
\item For short stalls it \textbf{does not reduce the throughput loss}
\item The \textit{CPU} starts the execution of instructions that belong to a \textbf{single thread}
\item When there is one stall it is necessary to \textbf{empty the pipeline} before starting the new thread
\end{itemize}
\textbf{Advantages} of coarse-grained multithreading:
\begin{itemize}[label=\cmarkthin]
\item In normal conditions \textbf{the single thread is not slowed down}
\end{itemize}
\begin{figure}[htbp]
\bigskip
\centering
\tikzfig[1]{image-4.tikz}
\caption{\centering Comparison between fine and coarse multithreading. Each column shows the evolution of \(2\) different processes spread on \(4\) different threads over a set amount of time.
A filled square illustrates a thread occupied by a process, while the empty ones represent an empty \texttt{(idle)} thread.}
\label{fig:multithreading-fine-coarse-comparison}
\bigskip
\end{figure}
\paragraph{Thread Level Parallelism - \textit{TLP}}
\begin{itemize}
\item \textbf{Thread level parallelism}, simultaneous multithreading
\begin{itemize}
\item Uses the resources of \textbf{one superscalar processor} to exploit simultaneously \textit{ILP} and \textit{TLP}
\item A \textit{CPU} today has \textbf{more functional resources} than a single thread can actually use
\item Simultaneously \textbf{schedule instructions} for execution from all threads
\end{itemize}
\item A \textit{CPU} that can handle these needs must be built
\begin{itemize}
\item A \textbf{large set of registers} is needed, allowing multiple processes to operate on different data through the same architectural registers
\item \textbf{Mapping table for registers} is needed in order to tell each process where to write data
\item Each processor can manage a \textbf{set amount of threads}
\end{itemize}
\item This is the most \textbf{flexible} way to manage multithreading but it requires more \textbf{complex hardware}
\end{itemize}
\bigskip
Comparison between many multithreading paradigms is shown in Figure~\ref{fig:multithreading-comparison}.
\begin{figure}[htbp]
\bigskip
\centering
\tikzfig[0.85]{image-5.tikz}
\caption{\centering Multithreading comparison. Each column shows the evolution of \(5\) different processes spread on \(4\) different threads over a set amount of time.
A filled square illustrates a thread occupied by a process, while the empty ones represent an empty \texttt{(idle)} thread.}
\label{fig:multithreading-comparison}
\bigskip
\end{figure}
\subsubsection{Further improvements}
It's difficult to increase the performance and clock frequency of the single core.
The longest pipeline stage can be split into multiple smaller stages, allowing higher throughput.
This concept is called \textbf{deep pipeline} and has a few drawbacks:
\begin{itemize}
\item \textbf{Heat dissipation} problems due to the increased number of components
\item More stages imply \textbf{more hazards}, since nearby instructions are likely dependent
\item \textbf{Transmission delays} in wires start to become relevant
\item \textbf{Harder design} and verification for the hardware developers
\end{itemize}
\subsection{Parallel Architectures}
A \textbf{parallel computer} is a collection of processing elements that cooperate and communicate to solve large problems rapidly.
The aim is to replicate processors to add performance, rather than designing a single faster processor.
The parallel architecture extends traditional computer architecture with a communication architecture.
This solution needs:
\begin{itemize}
\item \textbf{Abstractions} for HW/SW interface
\item \textbf{Different structures} to realize abstractions easily
\end{itemize}
Refer to Flynn Taxonomy \textit{(Section~\ref{sec:flynn-taxonomy})} for more details about these architectures.
\subsection{\textit{SIMD} architecture}
The characteristics of the \textit{SIMD} architectures \textit{(Single Instruction Multiple Data)} are:
\begin{itemize}
\item \textbf{Same instruction} executed by \textbf{multiple processors} using different data streams
\item Each processor has its own data memory
\item \textbf{Single instruction memory} and \textbf{single control processor} to fetch and dispatch instructions
\item Processors are typically \textbf{special purpose}
\item A \textbf{simple} programming model
\end{itemize}
The programming model features synchronized units with a single program counter, while each unit has its own addressing registers to use different data addresses.
\bigskip
Motivations for \textit{SIMD}:
\begin{itemize}
\item The \textbf{cost} of the control unit is \textbf{shared} between all execution units
\item Only \textbf{one copy of the code} in execution is necessary
\item All the computation is \textbf{fully synchronized}
\end{itemize}
In real life:
\begin{itemize}
\item \textit{SIMD} architectures are a \textbf{mix} of \textit{SISD} and \textit{SIMD}
\item A host computer executes \textbf{sequential operations}
\item \textit{SIMD} instructions are sent to all the execution units
\begin{itemize}
\item each of them has its own memory and registers
\item an interconnection network is needed to exchange data
\end{itemize}
\end{itemize}
\bigskip
\textit{SIMD} architectures will be explored in depth in Section~\ref{sec:simd}.
\begin{figure}[htbp]
\bigskip
\centering
\tikzfig[1]{image-6.tikz}
\caption{\textit{SIMD} architecture}
\label{fig:simd-architecture}
\bigskip
\end{figure}
\subsection{\textit{MIMD} architecture}
\textit{MIMD} architectures are flexible as they can function as:
\begin{itemize}
\item \textbf{Single user} machines for high performance on one application
\item Multi programmed multiprocessors running many tasks \textbf{simultaneously}
\item Some combinations of the two aforementioned functions
\end{itemize}
They can also be built starting from standard CPUs.
\bigskip
Each processor fetches its own instructions and operates on its own data;
processors are often off-the-shelf microprocessors, with the upside of being able to build a scalable architecture and having a high cost-performance ratio.
In order to fully exploit a \textit{MIMD} with \(n\) processors, there must be:
\begin{itemize}
\item At least \(n\) \textbf{threads} or \textbf{processes} to execute
\begin{itemize}
\item those independent threads are typically identified by the programmer or created by the compiler
\end{itemize}
\item \textbf{Thread level parallelism} - \textit{TLP}
\begin{itemize}
\item parallelism is identified by the software \textit{(and not by hardware like in superscalar CPUs)}
\end{itemize}
\end{itemize}
\textit{MIMD} machines can be divided into \(2\) classes, depending on the number of processors involved \textit{(and, consequently, on the memory organization)}:
\begin{itemize}
\item \textbf{Centralized} shared-memory architectures
\begin{itemize}
\item At most a few dozen processor chips \textit{(less than \(100\) cores)}
\item Large caches, single memory with multiple banks
\item This kind of structure is often a \textbf{symmetric multiprocessor (SMP)} and the style of architecture is called \textbf{Uniform Memory Access (UMA)}
\end{itemize}
\item \textbf{Distributed} memory architectures
\begin{itemize}
\item Supports \textbf{large processor count}
\item Requires \textbf{high bandwidth} interconnection
\item It has the disadvantage of a high volume of data communication between processors
\end{itemize}
\end{itemize}
\bigskip
\textit{MIMD} architectures will be explored in depth in Section~\ref{sec:mimd}.
\clearpage
\section{Pipeline recap}
The pipeline \textit{CPI} \textit{(cycles per instruction)} can be calculated by taking into account:
\begin{itemize}
\item \textbf{Ideal} pipeline \textit{CPI}
\begin{itemize}
\item measure of the maximum performance attainable by the implementation
\end{itemize}
\item \textbf{Structural stalls}
\begin{itemize}
\item due to the inability of the \textit{HW} to support this combination of instructions
\item can be solved with more \textit{HW} resources
\end{itemize}
\item \textbf{Data hazards}
\begin{itemize}
\item the current instruction depends on the result of a prior instruction still in the pipeline
\item can be solved with \textit{forwarding} or \textit{compiler scheduling}
\end{itemize}
\item \textbf{Control hazards}
\begin{itemize}
\item caused by the delay between the \texttt{IF} and the decision about changes in control flow \textit{(branches, jumps, exceptions)}
\item can be solved with \textit{early evaluation, delayed branch, predictors}
\end{itemize}
\end{itemize}
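These contributions are commonly summarized \textit{(following the textbook formulation)} as:
\[ \texttt{pipeline CPI} = \texttt{ideal CPI} + \texttt{structural stalls} + \texttt{data stalls} + \texttt{control stalls} \]
where each stall term is the average number of stall cycles per instruction caused by that class of hazard.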
\bigskip
The main features of the pipeline are:
\begin{itemize}
\item \textbf{Higher throughput} for the entire workload
\item Pipeline rate is limited by the \textbf{slowest} pipeline stage
\item Multiple tasks operate \textbf{simultaneously}
\item It \textbf{exploits parallelism} among instructions
\item Time needed to \textit{``fill''} and \textit{``empty''} the pipeline reduces speedup
\end{itemize}
\subsection{Stages in \textit{MIPS} pipeline}
The \(5\) stages in the \textit{MIPS} pipeline are:
\begin{enumerate}
\item \textbf{Fetch} - \texttt{IF}
\begin{itemize}[label=\(\rightarrow\)]
\item Instruction fetch from memory
\end{itemize}
\item \textbf{Decode} - \texttt{ID}
\begin{itemize}[label=\(\rightarrow\)]
\item Instruction decode and register read
\end{itemize}
\item \textbf{Execute} - \texttt{EX}
\begin{itemize}[label=\(\rightarrow\)]
\item Execute operation or calculate the address
\end{itemize}
\item \textbf{Memory access} - \texttt{ME}
\begin{itemize}[label=\(\rightarrow\)]
\item Access memory operand
\end{itemize}
\item \textbf{Write back} - \texttt{WB}
\begin{itemize}[label=\(\rightarrow\)]
\item Write the result back to register
\end{itemize}
\end{enumerate}
Each instruction enters the pipeline after the previous one has completed its first stage, and so on.
When the pipeline is filled, five different activities are running at once.
Instructions are passed from one unit to the next through a storage buffer.
As each instruction progresses through the pipeline, all the information needed by the stages downstream must be passed along.
The stages are usually represented in Figure~\ref{fig:mips-pipeline-stages}.
\begin{figure}[htbp]
\bigskip
\centering
\tikzfig[1]{image-7.tikz}
\caption{Stages in \textit{MIPS} pipelines}
\label{fig:mips-pipeline-stages}
\bigskip
\end{figure}
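The gain from overlapping can be quantified with a standard back-of-the-envelope estimate \textit{(assuming no stalls)}: with \(k = 5\) stages of equal duration \(T\), executing \(n\) instructions takes \(\left(k + n - 1\right) \cdot T\) instead of \(k \cdot n \cdot T\), hence:
\[ Speedup = \dfrac{k \cdot n}{k + n - 1} \rightarrow k = 5 \quad \text{for large } n \]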
\subsection{Pipeline hazards}
\label{sec:pipeline-hazards}
A \textbf{hazard} is a disruption in the smooth operation of a pipeline.
A hazard is created whenever there is a dependence between instructions that are close enough such that the overlap introduced by pipelining would change the order of access to the operands involved in said dependence.
It prevents the next instruction in the pipeline from executing during its designated clock cycle, reducing the performance from the ideal speedup.
There are \(3\) classes of hazards:
\begin{enumerate}
\item \textbf{Structural} hazards, due to different instructions attempting to use the same resource at the same time
\item \textbf{Data} hazards \textit{(stalls)}, due to attempting to use a result before it is ready
\begin{itemize}
\item \textit{Read after write} - \textbf{RAW}
\begin{itemize}[label=\(\rightarrow\)]
\item the instruction tries to read a source register before it is written by a previous instruction
\item it's also called \textit{dependence} by compilers
\item caused by an actual need for communication
\end{itemize}
\item \textit{Write after read} - \textbf{WAR}
\begin{itemize}[label=\(\rightarrow\)]
\item the instruction tries to write a destination register before it is read by a previous instruction
\item it's also called \textit{anti dependence} by compilers
\item caused by the reuse of the same register for different purposes
\end{itemize}
\item \textit{Write after write} - \textbf{WAW}
\begin{itemize}[label=\(\rightarrow\)]
\item the instruction tries to write a destination register before it is written by a previous instruction
\item it's also called \textit{output dependence} by compilers
\item caused by the reuse of the same register for different purposes
\end{itemize}
\end{itemize}
\item \textbf{Control} hazards \textit{(stalls)}, due to having to decide on the next instruction to execute before the branch condition itself is evaluated
\end{enumerate}
\bigskip
\textit{Data stalls may occur with instructions such as}:
\begin{itemize}
\item \textbf{RAW} stall:
\begin{verbatim}
r3 := (r1) op (r2)
r5 := (r3) op (r4) // r3 has not been written yet\end{verbatim}
\item \textbf{WAR} stall:
\begin{verbatim}
r3 := (r1) op (r2)
r1 := (r4) op (r5) // r1 has not been read yet \end{verbatim}
\item \textbf{WAW} stall:
\begin{verbatim}
r3 := (r1) op (r2)
r3 := (r6) op (r7) // the first write to r3 has not completed yet \end{verbatim}
\end{itemize}
\subsubsection{Solutions to data hazards}
There are many ways in which data hazards can be solved, such as:
\begin{itemize}
\item \textbf{Compilation} techniques
\begin{itemize}
\item \textbf{Insertion of \texttt{nop}} instructions
\item \textbf{Instruction scheduling} \textit{(see the sketch after this list)}
\begin{itemize}
\item the compiler tries to avoid placing dependent instructions too close to each other
\item it tries to insert independent instructions between dependent ones
\item when it can't, it inserts \texttt{nop} operations
\end{itemize}
\end{itemize}
\item \textbf{Hardware} techniques
\begin{itemize}
\item \textbf{Insertion} of \textit{bubbles} or \textit{stalls} in the pipeline
\item Data \textbf{forwarding} or bypassing
\end{itemize}
\end{itemize}
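A minimal sketch of instruction scheduling, reusing the pseudo-assembly notation above \textit{(the added instructions are purely illustrative)}:
\begin{verbatim}
r3 := (r1) op (r2)
r8 := (r6) op (r7) // independent instruction moved here
nop                // filler when nothing independent is available
r5 := (r3) op (r4) // the RAW on r3 is now far enough away \end{verbatim}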
Both the compilation and the hardware techniques will be analyzed in depth in Section~\ref{sec:strategies-to-support-ilp}.
\subsection{Complex in-order pipeline}
While using floating points operations, a few questions might arise:
\indentquote{What happens, architecture-wise, when mixing integer and floating point operations? How are different registers handled? How can \textit{GPRs (general purpose registers)} and \textit{FPRs (floating point registers)} be matched?}
\begin{figure}[htbp]
\bigskip
\centering
\tikzfig[1]{image-8.tikz}
\caption{Complex in-order pipeline}
\label{fig:complex-in-order-pipeline}
\bigskip
\end{figure}
In the complex in-order pipeline there isn't a single \textit{ALU} anymore: the execution of floating point operations is split among a number of \textit{Functional Units}.
The \textit{issue} stage detects conflicts while accessing any of them and is able to delay the execution of instructions \textit{(by using stalls)} in case of hazards.
The pipeline now consists of \(6\) stages, normally represented as in Figure~\ref{fig:mips-pipeline-stages-issue}.
\begin{figure}[htbp]
\bigskip
\centering
\tikzfig[1.5]{image-9.tikz}
\caption{Pipeline stages with issue}
\label{fig:mips-pipeline-stages-issue}
\bigskip
\end{figure}
Pipelining becomes complex when we want high performance in the presence of:
\begin{itemize}
\item \textbf{Long latency} or partially pipelined floating point units
\item \textbf{Multiple functions} and memory units
\item Memory systems with \textbf{variable access time}
\item Precise \textbf{exception}
\end{itemize}
Formally, the latencies of all the different execution paths must be balanced.
\begin{figure}[htbp]
\bigskip
\centering
\tikzfig[0.7]{image-10.tikz}
\caption{Complex pipelining}
\label{fig:complex-pipelining-2}
\bigskip
\end{figure}
\bigskip
The main issues are found in:
\begin{itemize}
\item \textbf{Structural conflicts at the execution stage} if some \textit{FPU} or memory unit is not pipelined and takes more than one cycle
\item \textbf{Structural conflicts at the write-back stage} due to variable latencies of different Functional Units (or \textit{FUs})
\item \textbf{Out of order write hazards} due to variable latencies of different \textit{FUs}
\item Hard to handle exceptions
\end{itemize}
A possible technique to handle write hazards without equalizing all pipeline depths and without bypassing is to delay all writebacks, so that all operations have the same latency to the \texttt{WB} stage.
In this approach, the following events will happen:
\begin{itemize}
\item Write ports are \textbf{never oversubscribed}
\begin{itemize}
\item one instruction \textit{in} and one instruction \textit{out} for every cycle
\end{itemize}
\item Instruction commit happens \textbf{in order}
\begin{itemize}
\item it simplifies the precise exception implementation
\end{itemize}
\end{itemize}
Two questions remain: how is it possible to prevent the increased write-back latency from slowing down single-cycle integer operations?
And is it possible to solve all write hazards without equalizing all pipeline depths and without bypassing them?
\subsection{Instructions issuing}
\label{sec:instructions-issuing}
To reach higher performance, more parallelism must be extracted from the program.
In other words, dependencies must be detected and solved, while instructions must be scheduled to achieve the highest parallelism of execution compatible with available resources.
A data structure keeping track of all the instructions in all the functional units is needed.
In order to work properly, the following checks must be made at the \texttt{issue} stage before dispatching an instruction:
\begin{enumerate}
\item Check if the \textbf{functional unit} is available
\item Check if the \textbf{input data} is available
\begin{itemize}
\item \textit{Failure in this step would cause a RAW}
\end{itemize}
\item Check if it's safe to write to the \textbf{destination}
\begin{itemize}
\item \textit{Failure in this step would cause a WAR or a WAW}
\end{itemize}
\item Check if there's a \textbf{structural conflict} at the \texttt{WB} stage
\end{enumerate}
A suitable data structure would look like the one in Table~\ref{tab:data-structure-keep-track-functional-units}.
\begin{table}
\bigskip
\centering
\begin{tabular}{r|c|cccc}
\hline
\textit{name} & \textit{busy} & \textit{op} & \textit{destination} & \textit{source 1} & \textit{source 2} \\ \hline
\texttt{int} & & & & & \\
\texttt{mem} & & & & & \\ \hline
\texttt{add 1} & & & & & \\
\texttt{add 2} & & & & & \\
\texttt{add 3} & & & & & \\ \hline
\texttt{mult 1} & & & & & \\
\texttt{mult 2} & & & & & \\ \hline
\texttt{div} & & & & & \\
\end{tabular}
\caption{Data structure to keep track of \textit{FUs}}
\label{tab:data-structure-keep-track-functional-units}
\bigskip
\end{table}
An instruction at the \texttt{issue} stage consults this table to check if:
\begin{itemize}
\item The \textit{FU} is \textbf{available} by looking at the \textit{busy} column
\item A \textit{RAW} can arise by looking at the \textit{destination} column for its \textbf{sources}
\item A \textit{WAR} can arise by looking at the \textit{source} columns for its \textbf{destinations}
\item A \textit{WAW} can arise by looking at the \textit{destination} columns for its \textbf{destinations}
\end{itemize}
When the checks are all completed:
\begin{itemize}
\item An entry is \textbf{added} to the table if no hazard is detected
\item An entry is \textbf{removed} from the table after \texttt{WB} stage
\end{itemize}
\bigskip
Later in the course \textit{(Section~\ref{par:CDC6600-scoreboard})}, this approach will be discussed more in-depth.
\subsection{Dependences}
Determining \textbf{dependences} among instructions is critical to defining the amount of parallelism existing in a program.
If two instructions are dependent, they cannot execute in parallel: they must be executed in order or only partially overlapped.
There exist \(3\) different types of dependences:
\begin{itemize}
\item \textbf{Name} Dependences
\item \textbf{Data} Dependences
\item \textbf{Control} Dependences
\end{itemize}
While hazards are a property of the pipeline, dependences are a property of the program.
As a consequence, the presence of dependences does not imply the existence of hazards.
\subsubsection{Name Dependences}
A \textbf{name dependence} occurs when \(2\) instructions use the same register or memory location \textit{(called name)}, but there is no flow of data between the instructions associated with that \textit{name}.
Two types of name dependences could exist between an instruction \texttt{i} that precedes an instruction \texttt{j}:
\begin{itemize}
\item \textbf{Antidependence}: when \texttt{j} writes a register or memory location that instruction \texttt{i} reads; the original instruction ordering must be preserved to ensure that \texttt{i} reads the correct value
\item \textbf{Output Dependence}: when \texttt{i} and \texttt{j} write the same register or memory location; the ordering of the original instructions must be preserved to ensure that the value finally written corresponds to \texttt{j}
\end{itemize}
Name dependences are not true data dependences, since there is no value \textit{(no data flow)} being transmitted between instructions.
If the name \textit{(either register or memory location)} used in the instructions could be changed, the instructions do not conflict.
Dependences through memory locations are more difficult to detect (this is called the \inlinequote{memory disambiguation} problem), since two apparently different addresses may refer to the same memory location.
As a consequence, it's easier to rename a \textbf{register} than rename a \textbf{memory location}.
It can be done either \textbf{statically} by the compiler or \textbf{dynamically} by the hardware.
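A minimal sketch of register renaming, in the same pseudo-assembly notation used above \textit{(the primed register stands for a freshly allocated name)}:
\begin{verbatim}
// before renaming        // after renaming
r3 := (r1) op (r2)        r3  := (r1) op (r2)
r5 := (r3) op (r4)        r5  := (r3) op (r4)
r3 := (r6) op (r7)        r3' := (r6) op (r7)
r8 := (r3) op (r9)        r8  := (r3') op (r9) \end{verbatim}
The antidependence between the second and third instructions and the output dependence between the first and third disappear once the second write targets the fresh name \texttt{r3'}, while the true data dependences are left untouched.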
\subsubsection{Data Dependences}
An instruction \texttt{j} is \textbf{data dependent} on an instruction \texttt{i} if \texttt{j} uses a result produced by \texttt{i}, or if there is a chain of such dependences between the two instructions.
A data or name dependence can potentially generate a data hazard (\textit{RAW} for the former, \textit{WAR} and \textit{WAW} for the latter), but the actual hazards and the number of stalls needed to eliminate them are properties of the pipeline.
\subsubsection{Control Dependences}
\textbf{Control dependences} determine the ordering of instructions.
They are preserved by two properties:
\begin{enumerate}
\item \textbf{Instructions execution in program order} to ensure that an instruction that occurs before a branch is executed at the right time \textit{(before the branch)}
\item \textbf{Detection of control hazards} to ensure that an instruction \textit{(that is control dependent on a branch)} is not executed until the branch direction is known
\end{enumerate}
Although preserving control dependence is a simple way to preserve program order, control dependence is not in itself the critical property that must be preserved: the properties actually critical to correctness are the \textbf{exception behavior} and the \textbf{data flow}.
\clearpage