
Commit

some final fixes
mohlerm committed Aug 18, 2015
1 parent 3ea1ccc commit d3abd93
Showing 9 changed files with 68 additions and 70 deletions.
4 changes: 2 additions & 2 deletions report/conclusion.tex
@@ -1,12 +1,12 @@
Modern Java Virtual Machines (JVM) like HotSpot gather profiling information about executed methods to improve the quality of the compiled code.
This thesis presents several approaches to reuse profiling information, that has been dumped to disk in previous executions of the JVM.
This thesis presents several approaches to reuse profiling information that has been dumped to disk in previous executions of the JVM.
\\\\
The expected advantage is a faster warmup of the Java Virtual Machine, because the JVM does not need to spend time profiling the code and can use cached profiles directly.
Furthermore, since the cached profiles originate from previous compilations, where extensive profiling already happened, compilations using these profiles produce more optimized code, which decreases the number of deoptimizations.
\\\\
We show, using two benchmark suites, that cached profiles can indeed improve warmup performance, significantly lower the number of deoptimizations, and reduce the time spent in the JIT compilers.
Therefore, we believe that cached profiles are a valuable asset in scenarios where a fast JVM warmup is needed and performance fluctuations at runtime should be avoided.
\\\\
In addition, we evaluated the performance of our approach with individual benchmarks for the impact of cached profiles on the load of the compile queue and the amount and type of compilations. The results show, that neither of them gives one-to-one correspondence between the examined factor and performance. However, the results provide indications, where the performance increase or decrease could come from.
In addition, we used individual benchmarks to evaluate the impact of cached profiles on the load of the compile queue and on the number and type of compilations. The results show that neither factor exhibits a one-to-one correspondence with performance. However, the results provide indications of where the performance increase or decrease could come from.
\\\\
The functionality is implemented in the HotSpot JVM (OpenJDK 9). Several new HotSpot options have been added to allow fine-tuning of the system, including the possibility to selectively enable or disable profile caching.
36 changes: 18 additions & 18 deletions report/implementation.tex

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion report/improvements.tex
@@ -5,5 +5,5 @@
\item Additionally, all method compilations get dumped and stored. The system could be extended in a way that only the last compilation record of a method is kept in the file. At the cost of additional memory overhead, this can decrease the size of the cached profile file and result in a lower parsing time (a sketch of this idea follows after this list).
\item Currently, only the profiles of a single run are used. A possible improvement is to use multiple executions for gathering the cached profiles and come up with ways to merge the profiling information. More complete profiles can be achieved, which could further reduce the number of deoptimizations.
\item In addition to merging multiple profiles, we also thought about the possibility to modify the cached profiles manually. That would allow the JVM user to improve the profiling information with knowledge of the method execution that might not be available to the compiler.
\item There are several more interesting benchmarks that could be executed. For example, one could try optimizing the benchmarks by only selecting a subset of all methods to be cached. Or a more detailed investigation on different multi core systems, to get more insight which threads are limiting performance, could be executed.
\item There are several more interesting benchmarks that could be executed. For example, one could try optimizing the benchmarks by only selecting a subset of all methods to be cached. Or one could conduct a more detailed investigation on different multi-core systems to gain more insight into which threads limit performance.
\end{itemize}
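As a side note to the first item above, the following is a minimal sketch of how such a deduplication could look; the class name, the record format, and the assumption that a record starts with the method signature are purely illustrative and not part of the actual implementation.
\begin{lstlisting}[caption=Illustrative sketch: keeping only the last compilation record per method,language=Java]
import java.util.LinkedHashMap;
import java.util.Map;

public class ProfileDeduplicator {
    // Keeps only the last dumped record of each method. Assumes (purely for
    // illustration) that every record starts with the method signature.
    public static Map<String, String> keepLastRecordPerMethod(Iterable<String> records) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (String record : records) {
            String methodSignature = record.split(" ", 2)[0];
            latest.put(methodSignature, record); // later records overwrite earlier ones
        }
        return latest;
    }
}
\end{lstlisting}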
16 changes: 8 additions & 8 deletions report/motivation.tex
@@ -18,15 +18,15 @@
Figure \ref{f:baseline_vs_usage} gives a schematic visualization of the expected effect on the performance of a single method when using cached profiles compared to the current state without such a system and standard tiered compilation.
Each blue bar corresponds to an invocation of the method. Higher bars mean higher compilation levels and therefore higher performance. The x-axis represents time since the start of the JVM. The figure shows the ideal case and abstracts away many details and other possible cases. However, it provides a good visualization for the examples provided in this chapter. A more detailed performance analysis, also considering possible performance regressions, is done in Chapter \ref{c:performance}.
\\\\
We are using my implementation described in Chapter \ref{c:implementation} in CachedProfileMode 0 (see \ref{s:mode0}) built into OpenJDK 1.9.0.
All measurements in this chapter are done on a Dual-Core machine running at 2 GHz with 8GB of RAM. To measure the method invocation time we use hprof \cite{hprof} and the average of 10 runs. The evaluation process has been automated using a couple of python scripts. The error bars show the 95\% confidence interval.
We are using the implementation described in Chapter \ref{c:implementation} in CacheProfileMode 0 (see \ref{s:mode0}) built into OpenJDK 1.9.0.
All measurements in this chapter are done on a dual-core machine running at 2 GHz with 8 GB of RAM. To measure the method invocation time, we use hprof \cite{hprof} and report the average of 10 runs. The evaluation process has been automated using a couple of Python scripts. The error bars show the 95\% confidence interval.
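For completeness, the reported interval is the standard t-based interval over the 10 runs; the following sketch (a hypothetical helper, not one of the actual evaluation scripts) shows the computation with the two-sided 95\% t-quantile for 9 degrees of freedom.
\begin{lstlisting}[caption=Illustrative sketch: 95\% confidence interval over 10 runs,language=Java]
public class ConfidenceInterval {
    // Returns {lower, upper} bound of the 95% confidence interval of the mean.
    public static double[] ci95(double[] runs) {
        int n = runs.length; // here: 10 runs
        double mean = 0;
        for (double v : runs) mean += v;
        mean /= n;
        double variance = 0;
        for (double v : runs) variance += (v - mean) * (v - mean);
        variance /= (n - 1); // sample variance
        double t = 2.262;    // two-sided 95% t-quantile for n - 1 = 9 degrees of freedom
        double halfWidth = t * Math.sqrt(variance / n);
        return new double[] { mean - halfWidth, mean + halfWidth };
    }
}
\end{lstlisting}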
\section{Example 1: Benefit of early compilation}
\label{s:ex1}
For Example 1, on-stack replacement (OSR) has been disabled to keep the system simple and easy to understand.
We use a simple class that invokes a method one hundred times. The method consists of a long-running loop. The source code is shown in Listing \ref{l:nocompile}.
Since OSR is disabled and a compilation to level 3 is only triggered after 200 invocations, this method never leaves the interpreter. We call this run the \textit{baseline}.
To show the influence of cached profiles, we use a compiler flag to explicitly lower the compile threshold and, using the functionality written for this thesis, tell HotSpot to cache the profile.
In a next execution we use these profiles and achieve a significantly lower time spend executing the cached method as one can see in Figure \ref{f:nocompile}.
In a subsequent execution we use these profiles and achieve a significantly lower time spent executing the cached method, as one can see in Figure \ref{f:nocompile}.
This decrease comes mainly from the fact that having a cached profile available allows the JVM to compile highly optimized code for hot methods earlier (at a lower threshold) since there is no need to gather the profiling information first.
\\\\
Since the example is rather simple, neither the baseline nor the profile usage run triggers any deoptimizations. This makes sense because after the first invocation, all the code paths of the method have already been taken and are therefore known to the interpreter and saved in the profile.
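The actual Listing \ref{l:nocompile} is not reproduced in this diff; the following is a minimal sketch of the structure described above, where the class and method names as well as the loop bound are illustrative only. OSR can, for instance, be disabled with \texttt{-XX:-UseOnStackReplacement}.
\begin{lstlisting}[caption=Illustrative sketch of the Example 1 structure,language=Java]
public class NoCompileSketch {
    // Dominated by a long-running loop; with OSR disabled and a level 3
    // threshold of 200 invocations, 100 calls never trigger a compilation.
    static long longLoop() {
        long sum = 0;
        for (int i = 0; i < 100_000_000; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) {
            longLoop();
        }
    }
}
\end{lstlisting}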
@@ -62,7 +62,7 @@ \section{Example 2: Benefit of fewer deoptimizations}
\label{s:ex2}
OSR is one of the core features HotSpot uses to improve the startup performance of a JVM, and disabling it does not give us any practical results. We came up with a second, more complex example, sketched in Listing \ref{l:manydeopts}, that demonstrates the influence of cached profiles without disabling any HotSpot functionality.
\\\\
The idea is to create a method that takes a different, long running branch on each of its method invocations. Each branch has been constructed in a way that it will trigger an OSR compilation. When compiling this method during its first iteration only the first branch will be included in the compiled code. The same will happen for each of the 100 method invocations. As one can see in Figure \ref{f:manydeopts} the baseline indeed averages at around 134 deoptimizations and a time per method invocation of 186 ms.
The idea is to create a method that takes a different, long-running branch on each of its method invocations. Each branch has been constructed in a way that it will trigger an OSR compilation. When compiling this method during its first iteration, only the first branch will be included in the compiled code. The same happens for each of the 100 method invocations. As one can see in Figure \ref{f:manydeopts}, the baseline indeed averages around 134 deoptimizations and a time per method invocation of 186 ms.
\\\\
Now we use a regular execution to dump the profiles and then reuse them in a second run. Theoretically, the profiles dumped after a full execution should include knowledge of all branches, and therefore the method compiled using these profiles should not run into any deoptimizations. As one can see in Figure \ref{f:manydeopts}, this is indeed the case. When using the cached profiles, no more deoptimizations occur, and because less time is spent profiling and compiling the method, the per-method execution time is significantly faster, now averaging 169 ms.
\begin{lstlisting}[float,caption=Simple method that causes many deoptimizations,label=l:manydeopts,language=Java]
@@ -103,15 +103,15 @@ \section{Example 2: Benefit of fewer deoptimizations}
\section{Similar systems}
\label{s:similarsystems}
In commercially available JVMs the idea of caching profiles is not new.
The JVM developed and sold by Azul Systems\textregistered\ called Zing\textregistered\ \cite{zing} already offers a similar functionality.
Zing\textregistered\ includes a feature set they call ReadyNow!\texttrademark\ \cite{readynow} which aims to increase startup performance of Java applications.
Zing \cite{zing}, the JVM developed and sold by Azul Systems, already offers similar functionality.
Zing includes a feature set called ReadyNow! \cite{readynow}, which aims to increase the startup performance of Java applications.
Their system has been designed with financial markets in mind, to overcome the issues of slow performance at startup and performance drops during execution.
\\\\
Azul Systems' clients reported that their production code usually experiences a significant performance decrease as soon as the market goes live and the clients start trading.
The reasons are deoptimizations that occur, for example, because uncommon branch paths are taken or previously unused methods are invoked.
In the past, Azul Systems' clients used techniques to warm up the JVM, for example doing fake trades prior to market opening. However, this does not solve the problem sufficiently well, since the JVM optimizes for these fake trades and still runs into deoptimizations once actual trades happen, because the executed code includes methods or specific code snippets that differ between the fake and the real trades.
\\\\
ReadyNow!\texttrademark\ is a rich set of improvements how a JVM can overcome this issues. It includes attempts to reduce the number of deoptimizations in general and other not further specified optimizations.
As one of the core features Azul Systems\textregistered\ implemented the ability to log optimization statistics and decisions and reuse these logs in future runs. This is similar to the approach presented in this thesis. However they do not record the actual optimization but the learning and the reasons why certain optimizations happen. This gives them the ability to give feedback to the user of the JVM whether or not certain optimizations have been applied. They also provide APIs for developers to interact with the system and allow further fine-grained custom-designed optimizations.
ReadyNow! is a rich set of improvements for how a JVM can overcome these issues. It includes attempts to reduce the number of deoptimizations in general as well as other optimizations that are not specified further.
As one of the core features, Azul Systems implemented the ability to log optimization statistics and decisions and to reuse these logs in future runs. This is similar to the approach presented in this thesis. However, they do not record the actual optimizations but rather what was learned and the reasons why certain optimizations happen. This gives them the ability to report back to the JVM user whether or not certain optimizations have been applied. They also provide APIs for developers to interact with the system and allow further fine-grained, custom-designed optimizations.
\\\\
Unfortunately, Azul Systems does not provide any numbers showing how their JVM actually improves the performance of a software application, nor any detailed analysis of where the speedup originates.
12 changes: 6 additions & 6 deletions report/overview.tex
@@ -23,7 +23,7 @@ \section{Tiered compilation}
C1's goal is to provide a fast compilation with a low memory footprint.
The client compiler performs simple optimizations such as constant folding, null check elimination, and method inlining based on the information gathered during interpretation.
Most of the classes and methods have already been used in the interpreter, allowing C1 to inline them to avoid costly invocations.
More importantly, information about the program flow and state are gathered. This information contain for example which branches get taken or the final types of dynamically typed objects.
More importantly, information about the program flow and state is gathered. This information includes, for example, which branches are taken or the observed types of dynamically typed objects.
For example, if certain branches were not taken during execution, further compilations might abstain from compiling these branches and replace them with static code to provide a faster method execution time (see the example in Listing \ref{l:branchexample}). The uncommon branch includes an \textit{uncommon trap} which notifies the JVM that an assumption does not hold anymore. This then leads to so-called \textit{deoptimizations}, which are further explained in Section \ref{s:deoptimizations}.
\begin{lstlisting}[float,caption=Example that shows potential compilation based on profiling information,label=l:branchexample,language=Java]
public static void m(int i) {
@@ -51,9 +51,9 @@ \section{Tiered compilation}
More information about C1 can be found in \cite{client_compiler_talk} and \cite{client_compiler}.
\\\\
Eventually, when further compile thresholds are exceeded, the JVM compiles the method with C2, also known as the \textit{server} compiler.
The server compiler uses the profiles gathered in Tier 0 and Tier 3 and produces highly optimized code. C2 includes far more and more complex optimizations like loop unrolling, common subexpression elimination and elimination of range and null checks. It performs optimistic method inlining, for example by converting some virtual calls to static calls. It relies heavily on the profiling information and richer profiles allow the compiler to do more and better optimizations.
The server compiler uses the profiles gathered in Tier 0 and Tier 3 and produces highly optimized code. C2 includes far more, and more complex, optimizations such as loop unrolling, common subexpression elimination, and elimination of range and null checks. It performs optimistic method inlining, for example by converting some virtual calls to static calls. The C2 compiler relies heavily on the profiling information, and richer profiles allow the compiler to perform more and better optimizations.
While the code quality of C2 is a lot better than that of C1, this comes at the cost of compile time. A more detailed look at the server compiler can be found in \cite{server_compiler}.
Figure \ref{f:hs_tiers} gives a short overview as well as showing the standard transitions.
Figure \ref{f:hs_tiers} gives a short overview as well as showing the most common transitions.
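To make the optimistic inlining mentioned above concrete, the following sketch (our own illustration, not taken from the HotSpot sources) shows a virtual call that C2 may devirtualize as long as only one implementation of the interface has been loaded; loading a second implementation later invalidates this assumption.
\begin{lstlisting}[caption=Illustrative sketch of optimistic devirtualization,language=Java]
interface Shape {
    double area();
}

class Circle implements Shape {
    private final double r;
    Circle(double r) { this.r = r; }
    public double area() { return Math.PI * r * r; }
}

public class Devirtualization {
    // While Circle is the only loaded Shape implementation, C2 may turn this
    // virtual call into a direct, inlined call to Circle.area().
    static double totalArea(Shape[] shapes) {
        double sum = 0;
        for (Shape s : shapes) {
            sum += s.area();
        }
        return sum;
    }
}
\end{lstlisting}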
\\\\
The naming scheme \textit{client/server} is due to historical reasons when tiered compilation was not available and users had to choose the JIT compiler via a HotSpot command line flag. The \textit{client} compiler was meant to be used for interactive client programs with graphical user interfaces where response time is more important than peak performance. For long-running server applications, the highly optimizing but slower \textit{server} compiler was the suggested choice.
\\\\
@@ -62,17 +62,17 @@ \section{Tiered compilation}

\section{Deoptimizations}
\label{s:deoptimizations}
Ideally a method is compiled by making use of as much profiling information as possible.
Ideally, a method is compiled by making use of as much profiling information as possible.
For example, since the profiling information is usually gathered in Levels 0 and 3, it can happen that a method compiled by C2 wants to execute a branch it never used before (see again Listing \ref{l:branchexample}).
In this case the information about this branch is not available in the profile and therefore have not been compiled into the C2-compiled code.
In this case, the information about this branch is not available in the profile, and the branch has therefore not been compiled into the C2-compiled code.
This is done to allow further, even more optimistic optimizations and to keep the compiled code smaller. Instead, the compiler places an uncommon trap at unused branches or unloaded classes, which gets triggered in case they are actually used at a later time during execution.
\\\\
The JVM then stops execution of that method and returns control back to the interpreter. This process is called \textit{deoptimization} and is considered very expensive. The previous interpreter state has to be restored, and the method is executed using the slow interpreter. Eventually, the method might be recompiled with the newly gained information.
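Besides branches that were never taken, type profiles are another typical trigger. The following sketch (again our own illustration, not from the thesis) shows a call site whose profile has only ever seen one receiver type; C2 may specialize the compiled code for that type and place an uncommon trap for all others, so the first call with a different type deoptimizes the method.
\begin{lstlisting}[caption=Illustrative sketch of a type-profile based deoptimization,language=Java]
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class TypeProfileDeopt {
    static int process(List<Integer> list) {
        int sum = 0;
        for (int i = 0; i < list.size(); i++) {
            sum += list.get(i); // receiver type is recorded in the profile
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> warm = new ArrayList<>();
        for (int i = 0; i < 1000; i++) warm.add(i);
        for (int i = 0; i < 20000; i++) process(warm); // profile only sees ArrayList

        // The first LinkedList receiver may hit the uncommon trap and deoptimize.
        process(new LinkedList<>(warm));
    }
}
\end{lstlisting}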

\section{On-Stack replacement}
\label{s:onstackreplacement}
In case a method contains a long-running loop, counting the method invocations is not enough to determine the hotness of the method. The program still spends a significant amount of time in that method, but because the invocation counter does not increase, no compilation is scheduled.
Therefore, HotSpot also counts loop back branches and when a threshold (see also Section \ref{s:compilethresholds}) is reached a compilation is invoked. The JVM then replaces the method's code directly on the program stack. HotSpot sets up a new stack frame for the compiled method which replaces the interpreters stack frame and execution will continue using the native method.
Therefore, HotSpot also counts loop back branches, and when a threshold (see also Section \ref{s:compilethresholds}) is reached, a compilation is invoked. The JVM then replaces the method's code directly on the program stack. HotSpot sets up a new stack frame for the compiled method which replaces the interpreter's stack frame, and execution continues in the compiled code.
\\\\
This process is called \textit{on-stack replacement}, usually shortened to OSR. Figure \ref{f:osr}, presented in a talk by T. Rodriguez and K. Russel \cite{client_compiler_talk}, gives a graphical representation.
The benefits of OSR will become more obvious when looking at the first example in Chapter \ref{c:motivation}.
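A minimal sketch of the situation described above (our own illustration, not a listing from the thesis): the method below is entered only once, so only the back-branch counter can make it hot and trigger an OSR compilation.
\begin{lstlisting}[caption=Illustrative sketch of a method that relies on OSR,language=Java]
public class OsrSketch {
    public static void main(String[] args) {
        long sum = 0;
        // main() is invoked only once; the long loop drives the back-branch
        // counter past its threshold and triggers an OSR compilation mid-loop.
        for (long i = 0; i < 5_000_000_000L; i++) {
            sum += i;
        }
        System.out.println(sum);
    }
}
\end{lstlisting}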