-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy pathposter.tex
383 lines (320 loc) · 16.1 KB
/
poster.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
%\title{LaTeX Portrait Poster Template}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% a0poster Portrait Poster
% LaTeX Template
% Version 1.0 (22/06/13)
%
% The a0poster class was created by:
% Gerlinde Kettl and Matthias Weiser ([email protected])
%
% This template has been downloaded from:
% http://www.LaTeXTemplates.com
%
% License:
% CC BY-NC-SA 3.0 (http://creativecommons.org/licenses/by-nc-sa/3.0/)
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%----------------------------------------------------------------------------------------
% PACKAGES AND OTHER DOCUMENT CONFIGURATIONS
%----------------------------------------------------------------------------------------
\documentclass[a0,portrait]{a0poster}
\usepackage{multicol} % This is so we can have multiple columns of text side-by-side
\columnsep=100pt % This is the amount of white space between the columns in the poster
\columnseprule=3pt
\usepackage[svgnames]{xcolor}
\usepackage{times} % Use the times font
%\usepackage{palatino} % Uncomment to use the Palatino font
\usepackage{graphicx} % Required for including images
\graphicspath{{figures/}} % Location of the graphics files
\usepackage{booktabs} % Top and bottom rules for table
\usepackage[font=small,labelfont=bf]{caption}
\usepackage{amsfonts, amsmath, amsthm, amssymb}
\usepackage{wrapfig} % Allows wrapping text around tables and figures
\usepackage{hyperref}
\usepackage{listings}
\usepackage{floatrow}
\usepackage{subfig}
\usepackage{color}
\usepackage{multicol}
\usepackage{lipsum}
\usepackage{float}
\usepackage{floatrow}
\usepackage{tikz}
\usetikzlibrary{shapes,arrows,shadows}
\begin{document}
%----------------------------------------------------------------------------------------
% POSTER HEADER
%----------------------------------------------------------------------------------------
\begin{minipage}[b]{0.75\linewidth}
\veryHuge \textbf{Code Clone Detection in Clang Static Analyzer} \color{Black}\\ % Title
\Huge\textit{Detecting Copy-paste using Static Analysis}\\[2cm] % Subtitle
\huge \textbf{Kirill Bobyrev \& Vassil Vassilev}\\[0.5cm] % Author(s)
\huge MIPT \& CERN\\[0.4cm] % University/organization
\Large \texttt{[email protected]}\\
\Large \texttt{[email protected]}\\
\end{minipage}
%
\begin{minipage}[b]{0.25\linewidth}
\includegraphics[width=20cm]{logo.png}\\
\end{minipage}
\vspace{1cm} % A bit of extra whitespace between the header and poster content
%----------------------------------------------------------------------------------------
\begin{multicols}{2}
\color{Black} % Black color for the rest of the content
%----------------------------------------------------------------------------------------
% OBJECTIVES
%----------------------------------------------------------------------------------------
\section*{Motivation}
The copy-paste is a common programming practice. Recent studies \cite{Chanchal2007} show
that large 5-20\% of code in large codebases is either equal or similar to other code
pieces in this project. This causes troubles and makes development process way harder.
Having a code clone in a system almost always means that its design needs serious
improvements. Large projects are developed by numerous people and fixing a bug found in
one code clone instance may cause unexpected results, because this fix most likely won't
be applied to every other instance similar to the fixed one. Thus said, the practice of
copy-pasting a piece of code into multiple locations. Having a tool that detects these
clones and produces a human-readable report allows finding such pieces of code and
improving the architecture.
%----------------------------------------------------------------------------------------
% INFRASTRUCTURE CHOICE
%----------------------------------------------------------------------------------------
\section*{Infrastructure choice}
Even though such tools exist, most of them are not easy to use. Most of them are also
outdated and because of that they can't catch up with the latest language standards and
features. Building a Code Clone Detection Tool around Clang and Clang Static Analyzer
allows not worrying about these problems. LLVM and Clang provide a great infrastructure
for building such a tool and provide the stability. Therefore the tool is implemented
as additional checkers for the Clang Static Analyzer. Clang SA along with scan-build
are really easy to use and can generate informative reports containing all the warnings.
% Define block styles used later
\tikzstyle{sensor}=[draw, fill=gray!20, text width=5em, text centered,
minimum height=2.5em,drop shadow, rounded corners=1cm]
% Define distances for bordering
\def\blockdist{2.3}
\def\edgedist{2.5}
\begin{figure}[H]
\centering
\begin{tikzpicture}
\node (Analyzed_Sources) [sensor] {Analyzed sources};
\path (Analyzed_Sources.east)+(3.0,0.0) node (scan-build)[sensor]
{scan-build};
\path (scan-build.east)+(3.0, 0.0) node (fake-compiler)[sensor] {Fake compiler};
\path (fake-compiler.east)+(4.0, 2.0) node (compiler)[sensor]
{Compiler};
\path (fake-compiler.east)+(4.0, -2.0) node (static-analyzer)[sensor, fill=gray!20]
{Static Analyzer};
\path (compiler.east)+(4.0, 0.0) node (object-code)
[sensor] {Object code};
\path (static-analyzer.east)+(4.0, 0.0) node (report)
[sensor, fill=green!20] {SA Report};
\draw[thick,->] (Analyzed_Sources.east) -- (scan-build.west);
\draw[thick,->] (scan-build.east) -- (fake-compiler.west);
\draw[thick,->] (fake-compiler.east) -- (compiler.west);
\draw[thick,->] (fake-compiler.east) -- (static-analyzer.west);
\draw[thick,->] (compiler.east) -- (object-code.west);
\draw[thick,->] (static-analyzer.east) -- (report.west);
\end{tikzpicture}
\caption{Full process of analysis and compilation}
\end{figure}
%----------------------------------------------------------------------------------------
% CLONE TYPES
%----------------------------------------------------------------------------------------
\section*{Clone types}
Let's quickly review how clones are different from each other and which types are there
to get a better understanding what can the current implementation do and what can it not.
\begin{description}
\item[Type I] Identical code fragments except for variations in whitespace, layout and
comments.
\item[Type II] Syntactically identical fragments except for variations in identifiers,
literals, types, whitespace, layout and comments.
\item[Type III] Copied fragments with further modifications such as changed, added or
removed statements, in addition to variations in identifiers, literals, types,
whitespace, layout and comments.
\item[Type IV] Two or more code fragments that perform the same computation but are
implemented by different syntactic variants.
\end{description}
%----------------------------------------------------------------------------------------
% CLONE.BASICCLONECHECKER
%----------------------------------------------------------------------------------------
\section*{clone.BasicCloneChecker}
\textit{BasicCloneChecker} is designed to be scalable and therefore we're focusing on
high precision and speed rather than high recall. \textit{BasicCloneChecker} is assumed
not to cause huge compilation time losses and performance issues. This checker defines a
"clones" to be a certain pair of AST subtrees.
This checker processes AST generated by Clang and performs enhanced Profiling technique
used to match AST nodes and their subtrees. It uses hashing based on AST nodes'
Profiles to divide AST nodes into clone groups and reports each clone group separately
cutting-off few false-positives and ensuring that the AST tree accused of being a clone
contains more than THRESHOLD AST nodes.
\textit{BasicCloneChecker} is able to find code clones of types I and II with high
precision. However, it is very sensitive to the slight changes in analyzed AST.
Computational complexity of analyzing AST tree with $n$ nodes is $O(n^2)$. The heaviest
part of implementation is building of profiles for an AST node and its subtrees. It's
easy to understand that incremental profiling + caching may reduce the complexity to
$O(n)$, but checker relies on current recursive implementation of profiling.
%----------------------------------------------------------------------------------------
% CLONE.ADVANCEDCLONECHECKER
%----------------------------------------------------------------------------------------
\section*{clone.AdvancedCloneChecker}
\textit{AdvancedCloneChecker} combines the Tree- and PDG-based techniques and implements
the approach proposed by Jens Krinke in his paper \cite{FineGrainedPDG}. It detects
clones of type 1, 2 and 3. It relies on the implemented Fine-Grained PDG data structure,
which is basically a variation of PDG built from AST nodes. Then it runs the maximum
similar subgraph searching algorithm to detect similar FGPDG subgraphs and creates if
these subgraphs contain more than THRESHOLD nodes.
The general problem "Finding maximum similar subgraphs" is NP-complete, that is why the
implemented algorithm is just an approximate approach, which is the best thing one could
possibly do.
Granularity is very important part of this approach, because it only processes the
comparison units and the partitioning part is essential. If chosen badly, it can lead to
the state, in which the checker is right now: even if the algorithm extracts subgraphs
with fairly good accuracy not many clones can be detected, because the comparison units
are either to big or do not fit the needed conditions. The implemented algorithm
partitions into connected components, which are compared against each other afterwards.
This approach has many negative effects I encountered while processing large codebases.
AdvancedCloneChecker should be used for a deeper analysis; it processes an AST with $n$
nodes in $O(n^4)$ time and it can not be reduced. It's quite slow, but the approach
allows to detect even type 3 clones.
\tikzset{
treenode/.style = {shape=circle,
draw, align=center},
root/.style = {treenode},
env/.style = {treenode},
dummy/.style = {circle,draw},
rd/.style = {treenode, fill=red!40},
rere/.style={edge from parent/.style={red,draw}},
norm/.style={edge from parent/.style={black, draw}}
}
\begin{figure}[H]
\centering
\begin{tikzpicture}
[
grow = down,
sibling distance = 4em,
level distance = 4em,
edge from parent/.style = {draw, -latex},
every node/.append style = {font=\footnotesize},
sloped
]
\node [root] {A}
child { node [env] {C}
edge from parent node [above] {2} }
child { node [rd] {A}
child[rere] { node [rd] {B}
child { node [rd, black, circle, fill=red!40] {C}
edge from parent node [above] {1} }
child { node [rd, black, circle, fill=red!40] {B}
edge from parent node [above] {1}}
child { node [rd, black, circle, fill=red!40] {C}
edge from parent node [above] {3} }
edge from parent node [above] {1} }
child { node [env] {F}
edge from parent node [above]
{2}}
edge from parent node [above] {3} };
\end{tikzpicture}
\begin{tikzpicture}
[
grow = down,
sibling distance = 4em,
level distance = 4em,
edge from parent/.style = {draw, -latex},
every node/.append style = {font=\footnotesize},
sloped
]
\node [root, fill=red!40] {A}
child { node [env] {D} edge from parent node [above] {4} }
child { node [env] {C}
edge from parent node [above] {2} }
child[rere] { node [rd, black, circle, fill=red!40] {B}
child { node [rd, black, circle, fill=red!40] {B}
child[norm] { node[black, circle, draw] {A}
edge from parent node [above] {2} }
child[norm] { node[black, circle, draw] {C}
edge from parent node [above] {2}}
child[norm] { node[black, circle, draw] {F}
edge from parent node [above] {1} }
edge from parent node [above] {1} }
child[norm] { node [black, circle, draw] {F}
edge from parent node [above]
{2}}
child { node[env, black, circle, fill=red!40] {C}
edge from parent node [above] {1}}
child { node[env, black, circle, fill=red!40] {C}
edge from parent node [above] {3}}
edge from parent node [above] {1} };
\end{tikzpicture}
\caption{Similar subgraphs}
\end{figure}
%----------------------------------------------------------------------------------------
% COMMON PROBLEMS
%----------------------------------------------------------------------------------------
\section*{Common problems}
Both checkers are built in Clang Static Analyzer and therefore have certain restrictions.
\begin{itemize}
\item Clang SA design doesn't allow the analysis of the whole codebase, it only allows
analyzing each Translation Unit separately. Obviously, some code clones may occur in
different Translation Units.
\item The diagnostics generated by both checkers are quite specific as warnings of each
clone class objects should come consequently. Therefore these diagnostics aren't hooked
to Clang SA BugReporter and aren't displayed in scan-build report.
\item Both checkers use magic numbers (THRESHOLD) while deciding whether to report a
code piece or not. It would be logical to give a control of these constants via some
Clang SA options. However, this isn't done for any of the existing SA checker and I
doubt that it meets expectations of the original Clang SA developers.
\end{itemize}
%----------------------------------------------------------------------------------------
% GETTING REAL
%----------------------------------------------------------------------------------------
\section*{Getting real}
Even though both checkers are WIP, they detected code clones in few built open-source
project. Here's a performance report for each of built projects using
\textit{BasicCloneChecker} only. I only count clones with more than 50 AST nodes.
\begin{lstlisting}[frame=single]
scan-build --use-analyzer=${PATH_TO_BUILT_CLANG} \
-enable-checker clone.BasicCloneChecker \
-disable-checker core \
-disable-checker unix \
-disable-checker cplusplus \
-disable-checker deadcode \
-k make
\end{lstlisting}
\begin{center}\vspace{1cm}
\begin{tabular}{l l l l}
\toprule
\textbf{Project} & \textbf{without fakec invocation} &
\textbf{BasicCloneChecker} & \textbf{Clones found} \\
\midrule
LLVM + Clang & 18m10s & 17m53s & 0 \\
OpenSSL & 1m26s & 9m27s & 180 \\
Git & 0m26s & 2m46s & 34 \\
SDL & 0m26s & 1m59s & 170 \\
\bottomrule
\end{tabular}
\captionof{table}{Running time measurement}
\end{center}\vspace{1cm}
%----------------------------------------------------------------------------------------
% CONCLUSIONS & FUTURE PLANS
%----------------------------------------------------------------------------------------
\section*{Conclusions \& Future plans}
The results show that the \textit{BasicCloneChecker} stable and scalable enough, though
it needs few enhancements, which will make it even better and production-ready. However,
\textit{AdvancedCloneChecker} is totally experimental and needs serious improvements
even though it detects clones in large codebases.
%----------------------------------------------------------------------------------------
% REFERENCES
%----------------------------------------------------------------------------------------
\nocite{*}
\bibliographystyle{plain} % Plain referencing style
\bibliography{biblio} % Use the example bibliography file sample.bib
%----------------------------------------------------------------------------------------
% ACKNOWLEDGEMENTS
%----------------------------------------------------------------------------------------
\section*{Acknowledgements}
I would like to thank ISP RAS researchers who were so kind to share the results of their
work with me.
I should also thank Nick Lewycky, who did some initial work on detecting useless
conditions for Clang. My first checker borrows few ideas from that work.
%----------------------------------------------------------------------------------------
\end{multicols}
\end{document}